This patch implements the new kmp_sch_static_balanced_chunked schedule kind that
the compiler will generate when it encounters schedule(simd: static). It just
adds the new constant and the new switch case __kmp_for_static_init.
Patch by Alex Duran.
Differential Revision: http://reviews.llvm.org/D20699
llvm-svn: 271320
When an asynchronous offload task is completed, COI calls the runtime to queue
a "destructor task". When the task deques are full, a dead-lock situation
arises where the OpenMP threads are inside but cannot progress because the COI
thread is stuck inside the runtime trying to find a slot in a deque.
This patch implements the solution where the task deques doubled in size when
a task is being queued from a COI thread.
Differential Revision: http://reviews.llvm.org/D20733
llvm-svn: 271319
The problem is the lack of dispatch buffers when thousands of loops with nowait,
about 10 iterations each, are executed by hundreds of threads. We only have
built-in 7 dispatch buffers, but there is a need in dozens or hundreds of
buffers.
The problem can be fixed by setting KMP_MAX_DISP_BUF to bigger value. In order
to give users same possibility I changed build-time control into run-time one,
adding API just in case.
This change adds an environment variable KMP_DISP_NUM_BUFFERS and a new API
function kmp_set_disp_num_buffers(int num_buffers).
The KMP_DISP_NUM_BUFFERS envirable works only before serial initialization,
because during the serial initialization we already allocate buffers for the hot
team, so it is too late to change the number of buffers later (or we need to
reallocate buffers for all teams which sounds too complicated). The
kmp_set_defaults() routine does not work for this envirable, because it calls
serial initialization before reading the parameter string. So a new routine,
kmp_set_disp_num_buffers(), is created so that it can set our internal global
variable before the library initialization. If both the envirable and API used
the envirable wins.
Differential Revision: http://reviews.llvm.org/D20697
llvm-svn: 271318
The OMP_PROC_BIND=spread strategy fails to assign the master thread the
correct place partition after the first parallel region. Other threads in the
hot team will remember their place_partition, but the master's place partition
is restored to what it was before entering the parallel region. So when the hot
team is used for subsequent parallel regions, the master has lost this info.
This fix calls __kmp_partition_places to update only the master thread's place
partition in the spread case when there are no other changes to the hot team.
Patch by Terry Wilmarth
Differential Revision: http://reviews.llvm.org/D20539
llvm-svn: 270890
On Blue Gene/Q, having LIBOMP_USE_ITT_NOTIFY support compiled into a
statically-linked binary causes a failure at runtime because dlopen fails.
This patch changes LIBOMP_USE_ITT_NOTIFY to a cacheable configuration setting
that can be disabled.
Patch by John Mellor-Crummey
Differential Revision: http://reviews.llvm.org/D20517
llvm-svn: 270884
Clang no longer restricts itself to generating microtasks with a small number
of arguments, and so an assembly implementation is required to prevent hitting
the parameter limit present in the C implementation. This adds an
implementation for ppc64[le].
llvm-svn: 270821
Most of this is modifications to check for differences before updating data
fields in team struct. There is also some rearrangement of the team struct.
Patch by Diego Caballero
Differential Revision: http://reviews.llvm.org/D20487
llvm-svn: 270468
These changes allow testing on Windows using clang.exe.
There are two main changes:
1. Only link to -lm when it actually exists on the system
2. Create basic versions of pthread_create() and pthread_join() for windows.
They are not POSIX compliant by any stretch but will allow any existing
and future tests to use pthread_create() and pthread_join() for testing
interactions of libomp with os threads.
Differential Revision: http://reviews.llvm.org/D20391
llvm-svn: 270464
KMP_USE_FUTEX preprocessor definition defined in kmp_lock.h is used
inconsequently throughout LLVM libomp code.
* some .c files that use this define do not include kmp_lock.h file,
in effect guarded part of code are never compiled
* some places in code use architecture-depending preprocessor
logic expressions which effectively disable use of Futex for
AArch64 architecture, all these places should use
'#if KMP_USE_FUTEX' instead to avoid any further confusions
* some places use KMP_HAS_FUTEX which is nowhere defined,
KMP_USE_FUTEX should be used instead
Differential Revision: http://reviews.llvm.org/D19629
llvm-svn: 269642
This patch solves 'Too many args to microtask' problem which occurs
while executing lulesh2.0.3 benchmark on AArch64.
To solve this I had to wrtite AArch64 assembly version of
__kmp_invoke_microtask() function, similar to x86 and x86_64
implementations.
Differential Revision: http://reviews.llvm.org/D19879
llvm-svn: 269399
This change adds a new entry point,
kmp_aligned_malloc(size_t size, size_t alignment), an entry point corresponding
to kmp_malloc() but with the capability to return aligned memory as well.
Other allocator routines have been adjusted so that kmp_free() can be used for
freeing memory blocks allocated by any kmp_*alloc() routine, including the new
kmp_aligned_malloc() routine.
Differential Revision: http://reviews.llvm.org/D19814
llvm-svn: 269365
After hot teams were enabled by default, the library started using levels kept
in the team structure. The levels are broken in case foreign thread exits and
puts its team into the pool which is then re-used by another foreign thread.
The broken behavior observed is when printing the levels for each new team, one
gets 1, 2, 1, 2, 1, 2, etc. This makes the library believe that every other
team is nested which is incorrect. What is wanted is for the levels to be
1, 1, 1, etc.
Differential Revision: http://reviews.llvm.org/D19980
llvm-svn: 269363
This reverts a presumaby-unintentional change in:
r268640 - [STATS] Use partitioned timer scheme
and fixes segfaults in an x86_64 debug build of the runtime library.
llvm-svn: 269259
This patch introduces following:
* TCI_* and TCD_* macros for incrementation and decrementation
* Fix for invalid use of TCR_8 in one expression
Differential Revision: http://reviews.llvm.org/D19880
llvm-svn: 268826
This change removes the current timers with ones that partition time properly.
The current timers are nested, so that if a new timer, B, starts when the
current timer, A, is already timing, A's time will include B's. To eliminate
this problem, the partitioned timers are designed to stop the current timer (A),
let the new timer run (B), and when the new timer is finished, restart the
previously running timer (A). With this partitioning of time, a threads' timers
all sum up to the OMP_worker_thread_life time and can now easily show the
percentage of time a thread is spending in different parts of the runtime or
user code.
There is also a new state variable associated with each thread which tells where
it is executing a task. This corresponds with the timers: OMP_task_*, e.g., if
time is spent in OMP_task_taskwait, then that thread executed tasks inside a
#pragma omp taskwait construct.
The changes are mostly changing the MACROs to use the new PARITIONED_* macros,
the new partitionedTimers class and its methods, and new state logic.
Differential Revision: http://reviews.llvm.org/D19229
llvm-svn: 268640
This debug sections's functionality can be replicated using the environment
variable KMP_TOPOLOGY_METHOD with different values and KMP_AFFINITY=verbose
llvm-svn: 267472
This change has the hwloc_bitmap_list_snprintf() function use the entire buffer
to print the mask. There is no need to shorten the buffer length by 7. It only
needs to be shortened by one byte.
llvm-svn: 267470
The trip count calculation was incorrect for loops with large bounds. For example,
for(int i=-2,000,000,000; i < 2,000,000,000; i+=50000000), the trip count
calculation had overflow (trying to calculate 2,000,000,000 + 2,000,000,000 with
signed integers) and wasn't giving the right value. This patch fixes this error
in the runtime by using unsigned integers instead. There is still a bug in the
clang compiler component because it warns that there is overflow in the
test case file when there isn't. This error isn't there for the Intel Compiler.
So for now, the test case is designated as XFAIL.
Differential Revision: http://reviews.llvm.org/D19078
llvm-svn: 266677
Introduced a counter of parts of an untied task submitted for execution. The
counter controls whether all parts of the task are already finished. The
compiler should generate re-submission of partially executed untied task by
itself before exiting of each task part except for the lexical last part.
Differential Revision: http://reviews.llvm.org/D19026
llvm-svn: 266675
Some codes that use TLS fail intermittently because one thread tries to write
TLS values after the TLS key has been destroyed by another thread. This happens
when one thread executes library shutdown (and destroys TLS keys), while another
thread starts to execute the TLS key destructor routine. Before this change, the
kmp_init_runtime flag was checked before calling pthread_* TLS functions, but
this flag is set to FALSE later than the destruction of the TLS keys, which
leads to failure. The fix is to check kmp_init_gtid instead, as this flag is
unset *before* the destruction of TLS keys.
Differential Revision: http://reviews.llvm.org/D19022
llvm-svn: 266674
ittnotify fix for barrier imbalance time in case tasks exist. In the current
implementation, task execution time is included into aggregated time on a
barrier. This fix calculates task execution time and corrects the arrive time
by subtracting the task execution time.
Since __kmp_invoke_task() can not only be called on a barrier, the field
th.th_bar_arrive_time is used to check if the function was called at the
barrier (th.th_bar_arrive_time != 0). So for this check, th_bar_arrive_time
is set to zero right after the value is used on the barrier.
Differential Revision: http://reviews.llvm.org/D19030
llvm-svn: 266332
This change adds back off logic in the test and set lock for better contended
lock performance. It uses a simple truncated binary exponential back off
function. The default back off parameters are tuned for x86.
The main back off logic has a two loop structure where each is controlled by a
user-level parameter:
max_backoff - limits the outer loop number of iterations.
This parameter should be a power of 2.
min_ticks - the inner spin wait loop number of "ticks" which is system
dependent and should be tuned for your system if you so choose.
The "ticks" on x86 correspond to the time stamp counter,
but on other architectures ticks is a timestamp derived
from gettimeofday().
The user can modify these via the environment variable:
KMP_SPIN_BACKOFF_PARAMS=max_backoff[,min_ticks]
Currently, since the default user lock is a queuing lock,
one would have to also specify KMP_LOCK_KIND=tas to use the test-and-set locks.
Differential Revision: http://reviews.llvm.org/D19020
llvm-svn: 266329
This change has OMP_WAIT_POLICY=active to mean that threads will busy-wait in
spin loops and virtually never go to sleep. OMP_WAIT_POLICY=passive now means
that threads will immediately go to sleep inside a spin loop. KMP_BLOCKTIME was
the previous mechanism to specify this behavior via KMP_BLOCKTIME=0 or
KMP_BLOCKTIME=infinite, but the standard OpenMP environment variable should
also be able to specify this behavior.
Differential Revision: http://reviews.llvm.org/D18577
llvm-svn: 265339
#endif was one line too low. If KMP_USE_ADAPTIVE_LOCKS is 0,
then queuing locks would incorrectly use drdpa lock mechanism.
This is a fix for https://llvm.org/bugs/show_bug.cgi?id=26649
llvm-svn: 264934
Removed reference to "ref ct" in a comment, as ref_ct no longer exists. Also
moved the comment to where the task_team is about to be tested if NULL.
llvm-svn: 264786
The problem is that the definition of kmp_cpuinfo_t contains:
char name [3*sizeof (kmp_cpuid_t)]; // CPUID(0x80000002,0x80000003,0x80000004)
and kmp_cpuid_t is only defined when compiling for x86.
Differential Revision: http://reviews.llvm.org/D18245
llvm-svn: 264535
For serialized parallel regions, wrong ids were reported. Now the same code is
used as in kmp_dispatch.cpp which emits the correct ids.
Differential Revision: http://reviews.llvm.org/D18348
llvm-svn: 264266
For non-serialized parallel regions the master thread issued two callbacks:
The first one in kmp_gsupport.c and the second in __kmp_join_call. Therefore
only trigger the callback in kmp_gsupport.c for serialized parallel regions.
Differential Revision: http://reviews.llvm.org/D16716
llvm-svn: 264264
Have Visual Studio use MemoryBarrier() instead of _mm_mfence() and remove
__declspec align attribute from function parameters in kmp_atomic.h
llvm-svn: 264166
Some basic checks next to the implementation should futher lower the
possibility to introduce regressions. (Note that this would have catched
the ordering issue fixed in rL258866 and pointed to rL263940.)
The tests are implementation dependent in one point because they assume that
thread ids are assigned in ascending order. This is not defined by the standard
but currently ensured in libomp. We have to think about another way of ordering
the threads should this ever be subject to change...
Note that this isn't aiming at replacing the implementation independent
test-suite at https://github.com/OpenMPToolsInterface/ompt-test-suite!
Differential Revision: http://reviews.llvm.org/D16715
llvm-svn: 264027
This change logically separates the stats_flags_e::noTotal bit flag from the
stats_flags_e::onlyInMaster and stats_flags_e::noUnits bit flags. If no
TOTAL_foo output is wanted for a particular statistic, the flag must be
explicitly included in that statistic's flags.
Differential Revision: http://reviews.llvm.org/D18198
llvm-svn: 263954
Building libomp using CMake versions < 3.3 caused a link time error. These
errors occurred because when assembling z_Windows_NT-586_asm.asm, the
definitions: OMPT_SUPPORT, _M_AMD64|_M_IA32 weren't defined on the command line.
To fix the problem, the COMPILE_FLAGS property for the assembly file is appended
to instead of the COMPILE_DEFINITIONS property being set. For whatever reason, the
COMPILE_DEFINITIONS property doesn't pick up the definitions for assembly files
for the older CMake versions.
llvm-svn: 263651
This change adds a header to the printout of the statistics which includes the
time, machine name, and processor info if available. This change also includes
some cosmetic changes like using enum casting for timer and counter iteration.
Differential Revision: http://reviews.llvm.org/D18153
llvm-svn: 263580
This change removes synthesized stats and instead has all timers print out a
total which is the aggregate statistics across threads. This is displayed as
"Total_foo" at the end of program. The stats_flags_e::synthesized flag is
removed and the printStats() function is split into two separate functions:
printTimerStats() which can display the aggregate total and printCounterStats().
Differential Revision: http://reviews.llvm.org/D17869
llvm-svn: 263290
From the standard: The taskloop construct specifies that the iterations of one
or more associated loops will be executed in parallel using OpenMP tasks. The
iterations are distributed across tasks created by the construct and scheduled
to be executed.
This initial implementation uses a simple linear tasks distribution algorithm.
Later we can add other algorithms to speedup generation of huge number of tasks
(i.e., tree-like tasks generation should be faster).
This needs to be put into the OpenMP runtime library in order for the
compiler team to develop the compiler side of the implementation.
Differential Revision: http://reviews.llvm.org/D17404
llvm-svn: 262535
From the standard: A doacross loop nest is a loop nest that has cross-iteration
dependence. An iteration is dependent on one or more lexicographically earlier
iterations. The ordered clause parameter on a loop directive identifies the
loop(s) associated with the doacross loop nest.
The init/fini routines allocate/free doacross buffer(s) for each loop for each
thread. The wait routine waits for a flag designated by the dependence vector.
The post routine sets the flag designated by current iteration vector. We use
a similar technique of shared buffer indices that covers up to 7 nowait loops
executed simultaneously by different threads (number 7 has no real meaning,
just heuristic value). Also, the size of structures are kept intact via
reducing dummy arrays.
This needs to be put into the OpenMP runtime library in order for the compiler
team to develop the compiler side of the implementation.
Differential Revision: http://reviews.llvm.org/D17399
llvm-svn: 262532
This change introduces the new OpenMP 4.5 affinity api surrounding
OpenMP Places. There are six new entry points:
Typically called in serial region:
* omp_get_num_places - returns the number of places available to the execution
environment in the place list.
* omp_get_place_num_procs - returns the number of processors available to the
execution environment in the specified place.
* omp_get_place_proc_ids - returns the numerical identifiers of the processors
available to the execution environment in the specified place.
Typically called inside parallel region:
* omp_get_place_num - returns the place number of the place to which the
encountering thread is bound.
* omp_get_partition_num_places - returns the number of places in the place
partition of the innermost implicit task.
* omp_get_partition_place_nums - returns the list of place numbers
corresponding to the places in the place-var ICV of the innermost
implicit task.
Differential Revision: http://reviews.llvm.org/D17417
llvm-svn: 261915
The maximum task priority value is read from envirable: OMP_MAX_TASK_PRIORITY.
But as of now, nothing is done with it. We just handle the environment variable
and add the new api: omp_get_max_task_priority() which returns that value or
zero if it is not set.
Differential Revision: http://reviews.llvm.org/D17411
llvm-svn: 261908
The monotonic/non-monotonic flags are sent to the runtime via the sched_type by
setting the 30th (non-monotonic) or 29th (monotonic) bit in the sched_type.
Macros are added to probe if monotonic or non-monotonic is specified
(SCHEDULE_HAS_[NON]MONOTONIC & SCHEDULE_HAS_NO_MODIFIERS)
and also to to get the base sched_type (SCHEDULE_WITHOUT_MODIFIERS)
Currently, nothing is done with the modifiers.
Also, this patch adds some comments on the use of the enumerations in at least
one place where it is subtle.
Differential Revision: http://reviews.llvm.org/D17406
llvm-svn: 261906
For pragma omp taskwait the runtime is called from the task context.
Therefore, the reentry frame information should be updated.
The information should be available for both taskwait event calls; therefore,
set before the first event and reset after the last event.
Patch by Joachim Protze
Differential Revision: http://reviews.llvm.org/D17145
llvm-svn: 260674
When a target task finishes and it tries to access the th_task_team from the
threads in the team where it was created, th_task_team can be NULL or point to
a different place when that thread started a nested region that is still
running. Finding the exact task_team that the threads were using is difficult
as it would require to unwind the task_state_memo_stack. So a new field was added
in the taskdata structure to point to the active task_team when the task was
created.
llvm-svn: 260615
The problem is that the master's thread state was not saved before entering a
parallel region so it does not remember tasks when it returns.
llvm-svn: 260306
The -install_name linker flag will use "@rpath/" when supported in CMake
which is the recommended usage for dynamic libraries on Mac OSX.
llvm-svn: 260300
(libgomp has bool as well)
This was causing a test failure in omp_test_if.c when building with GCC in
Debug mode. I have verified that GCC versions 4.9.2 and 5.3.0 now work and
compile-tested this change with clang 3.7.1 and Intel Compiler 16.0.
Differential Revision: http://reviews.llvm.org/D16921
llvm-svn: 260204
This will be used in a later patch to find additional LLVM tools for tests and
enables reusability for libomptarget that is currently under review.
Differential Revision: http://reviews.llvm.org/D16713
llvm-svn: 259876
When building executables for Cray supercomputers, statically-linked executables
are preferred. This patch makes it possible to build the OpenMP runtime as an
archive for building statically-linked executables. The patch adds the flag
LIBOMP_ENABLE_SHARED, which defaults to true. When true, a build of the OpenMP
runtime yields dynamic libraries. When false, a build of the OpenMP runtime
yields static libraries. There is no setting that allows both kinds of libraries
to be built.
Patch by John Mellor-Crummey
Differential Revision: http://reviews.llvm.org/D16525
llvm-svn: 259817
In: http://lists.llvm.org/pipermail/openmp-dev/2015-August/000858.html, a
performance issue was found with libomp's task dependencies. The task
dependencies hash table has an issue with collisions. The current table size is
a power of two. This combined with the current hash function causes a large
number of collisions to occurr. Also, the current size (64) is too small for
larger applications so the table size is increased.
This patch creates a two level hash table approach for task dependencies. The
implicit task is considered the "master" or "top-level" task which has a large
static sized hash table (997), and nested tasks will have smaller hash
tables (97). Prime numbers were chosen to help reduce collisions.
Differential Revision: http://reviews.llvm.org/D16640
llvm-svn: 259113
The attached patch adds support for ompt_event_task_dependences and
ompt_event_task_dependence_pair events from the OMPT specification [1]. These
events only apply to OpenMP 4.0 and 4.1 (aka 4.5) because task dependencies
were introduced in 4.0.
With respect to the changes:
ompt_event_task_dependences
According to the specification, this event is raised after the task has been
created, thefore this event needs to be raised after ompt_event_task_begin
(in __kmp_task_start). However, the dependencies are known at
__kmpc_omp_task_with_deps which occurs before __kmp_task_start. My modifications
extend the ompt_task_info_t struct in order to store the dependencies of the
task when _kmpc_omp_task_with_deps occurs and then they are emitted in
__kmp_task_start just after raising the ompt_event_task_begin. The deps field
is allocated and valid until the event is raised and it is freed and set
to null afterwards.
ompt_event_task_dependence_pair
The processing of the dependences (i.e. checking whenever a dependence is
already satisfied) is done within __kmp_process_deps. That function checks
every dependence and calls the __kmp_track_dependence routine which gives some
support for graphical output. I used that routine to emit the dependence pair
but I also needed to know the sink_task. Despite the fact that the code within
KMP_SUPPORT_GRAPH_OUTPUT refers to task_sink it may be null because
sink->dn.task (there's a comment regarding this) and in fact it does not point
to a proper pointer value because the value is set in node->dn.task = task;
after the __kmp_process_deps calls in __kmp_check_deps. I have extended the
__kmp_process_deps and __kmp_track_dependence parameter list to receive the
sink_task.
[1] https://github.com/OpenMPToolsInterface/OMPT-Technical-Report/blob/target/ompt-tr.pdf
Patch by Harald Servat
Differential Revision: http://reviews.llvm.org/D14746
llvm-svn: 259038
When the code behind the barrier is executed, the master thread may have
already resumed execution. That's why we cannot safely assume that *pteam
is not yet freed.
This has been introduced by r258866.
llvm-svn: 259037
Current clang trunk reports _OPENMP to be 201307 = OpenMP 4.0. It doesn't
recognize '#pragma omp declare target' though (patch still pending) and
therefore fails compilation.
Differential Revision: http://reviews.llvm.org/D16631
llvm-svn: 259026
For implcit barriers in simple parallel for loops, the order of the OMPT events
was wrong. The barrier_{begin,end} events came after the implcit_task_end
event for the implcit barrier at the end of the parallel region. This is wrong
because the implicit task executes the barrier before ending. This patch fixes
the order of the event: It will be triggerd now just before
__kmp_pop_current_task_from_thread() is called.
Patch by Tim Cramer
Differential Revision: http://reviews.llvm.org/D16347
llvm-svn: 258866
This change fixes the bug: https://llvm.org/bugs/show_bug.cgi?id=25975
by bypassing the perl module files which try to deduce system information.
These perl modules files don't offer useful information and are from the
original build system. They can be removed after this change.
llvm-svn: 258843
This change fixes one issue reported at https://llvm.org/bugs/show_bug.cgi?id=26184
There was missing cleanup code for the cached indirect lock pool. The change
will fix the reported case where it tries to initialize a lock after runtime
cleanup/reinitialization, but it is still possible that the user program runs
into another problem because most test programs have a call to __kmpc_set_lock
after cleanup/reinitialization without calling __kmpc_init_lock causing a crash/hang.
llvm-svn: 258528
The release builds are configured to be reproducible, so that the
binaries compare equal between bootstrap iterations. The OpenMP
run-time build was failing like this:
runtime/src/kmp_version.c:108:79: error: expansion of date or time macro is not reproducible [-Werror,-Wdate-time]
char const __kmp_version_build_time[] = KMP_VERSION_PREFIX "build time: " __DATE__ " " __TIME__;
Figuring as the build currently doesn't set LIBOMP_DATE, it's probably
OK to skip setting the build time here too.
llvm-svn: 257833