This patch introduces the logic implementing hierarchical scheduling.
First and foremost, hierarchical scheduling is off by default
To enable, use -DLIBOMP_USE_HIER_SCHED=On during CMake's configure stage.
This work is based off if the IWOMP paper:
"Workstealing and Nested Parallelism in SMP Systems"
Hierarchical scheduling is the layering of OpenMP schedules for different layers
of the memory hierarchy. One can have multiple layers between the threads and
the global iterations space. The threads will go up the hierarchy to grab
iterations, using possibly a different schedule & chunk for each layer.
[ Global iteration space (0-999) ]
(use static)
[ L1 | L1 | L1 | L1 ]
(use dynamic,1)
[ T0 T1 | T2 T3 | T4 T5 | T6 T7 ]
In the example shown above, there are 8 threads and 4 L1 caches begin targeted.
If the topology indicates that there are two threads per core, then two
consecutive threads will share the data of one L1 cache unit. This example
would have the iteration space (0-999) split statically across the four L1
caches (so the first L1 would get (0-249), the second would get (250-499), etc).
Then the threads will use a dynamic,1 schedule to grab iterations from the L1
cache units. There are currently four supported layers: L1, L2, L3, NUMA
OMP_SCHEDULE can now read a hierarchical schedule with this syntax:
OMP_SCHEDULE='EXPERIMENTAL LAYER,SCHED[,CHUNK][:LAYER,SCHED[,CHUNK]...]:SCHED,CHUNK
And OMP_SCHEDULE can still read the normal SCHED,CHUNK syntax from before
I've kept most of the hierarchical scheduling logic inside kmp_dispatch_hier.h
to try to keep it separate from the rest of the code.
Differential Revision: https://reviews.llvm.org/D47962
llvm-svn: 336571
This patch reorganizes the loop scheduling code in order to allow hierarchical
scheduling to use it more effectively. In particular, the goal of this patch
is to separate the algorithmic parts of the scheduling from the thread
logistics code.
Moves declarations & structures to kmp_dispatch.h for easier access in
other files. Extracts the algorithmic part of __kmp_dispatch_init() and
__kmp_dispatch_next() into __kmp_dispatch_init_algorithm() and
__kmp_dispatch_next_algorithm(). The thread bookkeeping logic is still kept in
__kmp_dispatch_init() and __kmp_dispatch_next(). This is done because the
hierarchical scheduler needs to access the scheduling logic without the
bookkeeping logic. To prepare for new pointer in dispatch_private_info_t, a
new flags variable is created which stores the ordered and nomerge flags instead
of them being in two separate variables. This will keep the
dispatch_private_info_t structure the same size.
Differential Revision: https://reviews.llvm.org/D47961
llvm-svn: 336568
These are preliminary changes that attempt to use C++11 Atomics in the runtime.
We are expecting better portability with this change across architectures/OSes.
Here is the summary of the changes.
Most variables that need synchronization operation were converted to generic
atomic variables (std::atomic<T>). Variables that are updated with combined CAS
are packed into a single atomic variable, and partial read/write is done
through unpacking/packing
Patch by Hansang Bae
Differential Revision: https://reviews.llvm.org/D47903
llvm-svn: 336563
The flag "--no-as-needed" is not recognized by the linker on macOS making the following tests fail:
ompt/loadtool/tool_available/tool_available.c
ompt/loadtool/tool_not_available/tool_not_available.c
This patch removes this flag for macOS and adds it only for Linux and Windows.
I tested it on Ubuntu 16.04 and macOS HighSierra, with Clang/LLVM 6.0.1 and OpenMP trunk.
This solution was also discussed in the OpenMP-dev mailing list.
Patch provided by Simone Atzeni
Differential Revision: https://reviews.llvm.org/D48888
llvm-svn: 336327
The testcase potentially fails when a thread is reused.
The added synchronization makes sure this does not happen.
Patch provided by Simon Convent
Differential Revision: https://reviews.llvm.org/D48932
llvm-svn: 336326
When compiling with icc, there is a problem with reenter frame addresses in
parallel_begin callbacks in the interoperability.c testcase. (The address is
not available. thus NULL)
Using alloca() forces availability of the frame pointer.
Patch provided by Simon Convent
Differential Revision: https://reviews.llvm.org/D48282
llvm-svn: 336088
Several runtime entry points have not been tested from non-OpenMP threads. This
adds tests to an existing testcase. While at it, the testcase was reformatted
Patch provided by Simon Convent
Differential Revision: https://reviews.llvm.org/D48124
llvm-svn: 336087
Especially the thread_end callback has not been tested before.
This adds a testcase for nested and non-nested threads.
Patch provided by Simon Convent
Differential Revision: https://reviews.llvm.org/D47824
llvm-svn: 336086
The current implementation always provides the thread-num for the current
parallel region. This patch fixes the behavior for ancestor levels >0.
Differential Revision: https://reviews.llvm.org/D46533
llvm-svn: 336085
Upcoming changes to FileCheck will modify CHECK-DAG to not match
overlapping regions of the input. This test was found to be affected
because it expects to find four threads to invoke events of type
ompt_event_implicit_task_begin. It turns out this is wrong because
OMP_THREAD_LIMIT is set to 2, so there are only two threads. The
rest of the test got it right so it went unnoticed until now.
(Rewrite test and apply clang-format to it as discussed in the past.)
Differential Revision: https://reviews.llvm.org/D47119
llvm-svn: 333361
Introduce OPENMP_INSTALL_LIBDIR and use in all install() commands.
This also fixes installation of libomptarget-nvptx that previously
didn't honor {OPENMP,LLVM}_LIBDIR_SUFFIX.
Differential Revision: https://reviews.llvm.org/D47130
llvm-svn: 333284
implicit_task_end callbacks in nested parallel regions did not always give the
correct thread_num, since the inner parallel region may have already been
finalized.
Now, the thread_num is stored at the beginning of the implicit task and
retrieved at the end, whenever necessary.
A testcase was added as well.
Differential Revision: https://reviews.llvm.org/D46260
llvm-svn: 331632
The api_calls_misc.c testcase tests the following api calls:
ompt_get_callback()
ompt_get_state()
ompt_enumerate_states()
ompt_enumerate_mutex_impls()
These have not been tested previously.
The api_calls.c testcase has been renamed to api_calls_places.c because it only tests api calls that are related to places.
Differential Revision: https://reviews.llvm.org/D42523
llvm-svn: 331631
Currently, the affinity API reports garbage for the initial place list and any
thread's place lists when using KMP_AFFINITY=none|compact|scatter.
This patch does two things:
for KMP_AFFINITY=none, Creates a one entry table for the places, this way, the
initial place list is just a single place with all the proc ids in it. We also
set the initial place of any thread to 0 instead of KMP_PLACE_ALL so that the
thread reports that single place (place 0) instead of garbage (-1) when using
the affinity API.
When non-OMP_PROC_BIND affinity is used
(including KMP_AFFINITY=compact|scatter), a thread's place list is populated
correctly. We assume that each thread is assigned to a single place. This is
implemented in two of the affinity API functions
Differential Revision: https://reviews.llvm.org/D45527
llvm-svn: 330283
This patch introduces GOMP_taskloop to our API. It adds GOMP_4.5 to our
version symbols. Being a wrapper around __kmpc_taskloop, the function
creates a task with the loop bounds properly nested in the shareds so that
the GOMP task thunk will work properly. Also, the firstprivate copy constructors
are properly handled using the __kmp_gomp_task_dup() auxiliary function.
Currently, only linear spawning of tasks is supported
for the GOMP_taskloop interface.
Differential Revision: https://reviews.llvm.org/D45327
llvm-svn: 330282
This change removes the unnecessary lock operation on __kmp_initz_lock inside
the __kmp_atfork_child() function for Linux; the lock variable is initialized
in the same function later.
Patch by Hansang Bae
Differential Revision: https://reviews.llvm.org/D44949
llvm-svn: 328900
The summarizeStats.py script processes raw data provided by the
instrumented (stats-gathering) OpenMP* runtime library. It provides:
1) A radar chart which plots counters as frequency (per GigaTick) of use within
the program. The frequencies are plotted as log10, however values less than
one are kept as it is and represented in red color. This was done to help
visualize the differences better.
2) Pie charts separating total time as compute and non-compute. The compute and
non-compute times have their own pie charts showing the constructs that
contributed to them. The percentages listed are with respect to the total
time.
3) '.csv' file with percentage of time spent within the different constructs.
The script can be used as:
$ python $PATH_TO_SCRIPT/summarizeStats.py instrumented1.csv instrumented2.csv
Patch by Taru Doodi
Differential Revision: https://reviews.llvm.org/D41838
llvm-svn: 328568
Added settings code to read OMP_TARGET_OFFLOAD environment variable. Added
target-offload-var ICV as __kmp_target_offload, set via OMP_TARGET_OFFLOAD,
if available, otherwise defaulting to DEFAULT. Valid values for the ICV are
specified as enum values {0,1,2} for disabled, default, and mandatory. An
internal API access function __kmpc_get_target_offload is provided.
Patch by Terry Wilmarth
Differential Revision: https://reviews.llvm.org/D44577
llvm-svn: 328046
We have to ensure that the runtime is initialized _before_ waiting
for the two started threads to guarantee that the master threads
post their ompt_event_thread_begin before the worker threads. This
is not guaranteed in the parallel region where one worker thread
could start before the other master thread has invoked the callback.
The problem did not happen with Clang becauses the generated code
calls __kmpc_global_thread_num() and cashes its result for functions
that contain OpenMP pragmas.
Differential Revision: https://reviews.llvm.org/D43882
llvm-svn: 326435
The thread_num parameter of ompt_get_task_info() was not being used previously,
but need to be set.
The print_task_type() function (form the task-types.c testcase) was merged into
the print_ids() function (in callback.h). Testing of ompt_get_task_info() was
added to the task-types.c testcase. It was not tested extensively previously.
Differential Revision: https://reviews.llvm.org/D42472
llvm-svn: 326338
The main change of this patch is to insert {{.*}} in current_address=[[RETURN_ADDRESS_END]].
This is needed to match any of the alternatively printed addresses.
Additionally, clang-format is applied to the two tests.
Differential Revision: https://reviews.llvm.org/D43115
llvm-svn: 326312
This is required to be NULL for implicit barriers at the end of a
parallel region. Noticed in review of D43191.
Differential Revision: https://reviews.llvm.org/D43308
llvm-svn: 325922
The compiler inlines the user code in the task. Check for that case at
runtime by comparing the frame addresses and print the expected exit
address.
Also showcase how I think the OMPT tests could be reformatted to match
LLVM's code style. In my opinion it would be great to that kind of change
to all tests that need to be touched for whatever reason...
Differential Revision: https://reviews.llvm.org/D43191
llvm-svn: 325921
Test whether OMPT-callbacks for two threads that initiate a parallel region are correct.
Differential Revision: https://reviews.llvm.org/D41942
llvm-svn: 325423
Only use ompt_ functions when testing OMPT in api_calls testcase.
Add size parameter to print_list.
Fix small bug in implementation of ompt_get_partition_place_nums(): return correct length.
Differential Revision: https://reviews.llvm.org/D42162
llvm-svn: 325422
This affects all outlined functions, not just tasks! Only show warning
when using Clang 5.0 or later.
Differential Revision: https://reviews.llvm.org/D43190
llvm-svn: 325131
Tests the search for tools as defined in the spec. The OMP_TOOL_LIBRARIES
environment variable contains paths to the following files(in that order)
-to a nonexisting file
-to a shared library that does not have a ompt_start_tool function
-to a shared library that has an ompt_start_tool implementation returning NULL
-to a shared library that has an ompt_start_tool implementation returning a
pointer to a valid instance of ompt_start_tool_result_t
The expected result is that the last tool gets active and can print in the
thread-begin callback.
Differential Revision: https://reviews.llvm.org/D42166
llvm-svn: 324588
Add a testcase that checks wheter the runtime can handle an ompt_start_tool
method that returns NULL indicating that no tool shall be loaded.
All tool_available testcases need a separate folder to avoid file conflicts for
the generated tools.
Differential Revision: https://reviews.llvm.org/D41904
llvm-svn: 324587
If tool initialization returns 0, OMPT should not be active. The current
implementation provided some callback invocations in this case.
Differential Revision: https://reviews.llvm.org/D42709
llvm-svn: 324320
Use fuzzy return addresses in lock testcases so that these
testcases can also be run using the Intel Compiler.
Patch by Simon Convent!
Differential Revision: https://reviews.llvm.org/D41896
llvm-svn: 323529
Add Workaround for Intel Compiler Bug with Case#: 03138964
A critical region within a nested task causes a segfault in icc 14-18:
int main()
{
#pragma omp parallel num_threads(2)
#pragma omp master
#pragma omp task
#pragma omp task
#pragma omp critical
printf("test\n");
}
When the critical region is in a separate function, the segault does not occur.
So we add noinline to make sure that the function call stays there.
Differential Revision: https://reviews.llvm.org/D41182
llvm-svn: 322622
The defintion is not part of the spec and thus should not have the prefix
"ompt_" but rather a prefix that indicates that this is implementation
specific.
Differential Revision: https://reviews.llvm.org/D41166
llvm-svn: 322621