// This file does not contain any code; it just contains additional text and formatting
// for doxygen.

//===----------------------------------------------------------------------===//
//
// The LLVM Compiler Infrastructure
//
// This file is dual licensed under the MIT and the University of Illinois Open
// Source Licenses. See LICENSE.txt for details.
//
//===----------------------------------------------------------------------===//

/*! @mainpage Intel&reg; OpenMP* Runtime Library Interface
@section sec_intro Introduction

This document describes the interface provided by the
Intel&reg; OpenMP\other runtime library to the compiler.
Routines that are directly called as simple functions by user code are
not currently described here, since their definition is in the OpenMP
specification available from http://openmp.org

The aim here is to explain the interface from the compiler to the runtime.

The overall design is described, and each function in the interface
has its own description. (At least, that's the ambition; we may not be there yet.)

@section sec_building Building the Runtime
For the impatient, we cover building the runtime as the first topic here.

A top-level Makefile is provided that attempts to derive a suitable
configuration for the most commonly used environments. To see the
default settings, type:
@code
% make info
@endcode

You can change the Makefile's behavior with the following options:

- <b>omp_root</b>: The path to the top-level directory containing the top-level
  Makefile. By default, this will take on the value of the
  current working directory.

- <b>omp_os</b>: Operating system. By default, the build will attempt to
  detect this. Currently supports "linux", "macos", and
  "windows".

- <b>arch</b>: Architecture. By default, the build will attempt to
  detect this if not specified by the user. Currently
  supported values are
  - "32" for IA-32 architecture
  - "32e" for Intel&reg; 64 architecture
  - "mic" for Intel&reg; Many Integrated Core Architecture (if
    "mic" is specified then "icc" will be used as the
    compiler, and appropriate k1om binutils will be used. The
    necessary packages must be installed on the build machine
    for this to be possible, but an Intel&reg; Xeon Phi&trade;
    coprocessor is not required to build the library).

- <b>compiler</b>: Which compiler to use for the build. Defaults to "icc"
  or "icl" depending on the value of omp_os. Also supports
  "gcc" when omp_os is "linux" for gcc\other versions
  4.6.2 and higher. For icc on OS X\other, OS X\other versions
  greater than 10.6 are not supported currently. Also, icc
  version 13.0 is not supported. The selected compiler should be
  installed and in the user's path. The corresponding
  Fortran compiler should also be in the path.

- <b>mode</b>: Library mode: default is "release". Also supports "debug".

To use any of the options above, simply add <option_name>=<value>. For
example, if you want to build with gcc instead of icc, type:
@code
% make compiler=gcc
@endcode

Underneath the hood of the top-level Makefile, the runtime is built by
a perl script that in turn drives a detailed runtime system make. The
script can be found at <tt>tools/build.pl</tt>, and will print
information about all its flags and controls if invoked as
@code
% tools/build.pl --help
@endcode

If invoked with no arguments, it will try to build a set of libraries
that are appropriate for the machine on which the build is happening.
There are also many options for building out of tree and for configuring
library features. Consult the <tt>--help</tt> output for details.

@section sec_supported Supported RTL Build Configurations

The architectures supported are IA-32 architecture, Intel&reg; 64, and
Intel&reg; Many Integrated Core Architecture. The build configurations
supported are shown in the table below.

<table border=1>
<tr><th> <th>icc/icl<th>gcc
<tr><td>Linux\other OS<td>Yes(1,5)<td>Yes(2,4)
<tr><td>OS X\other<td>Yes(1,3,4)<td>No
<tr><td>Windows\other OS<td>Yes(1,4)<td>No
</table>
(1) On IA-32 architecture and Intel&reg; 64, icc/icl versions 12.x
    are supported (12.1 is recommended).<br>
(2) gcc version 4.6.2 is supported.<br>
(3) For icc on OS X\other, OS X\other version 10.5.8 is supported.<br>
(4) Intel&reg; Many Integrated Core Architecture not supported.<br>
(5) On Intel&reg; Many Integrated Core Architecture, icc/icl versions 13.0 or later are required.

@section sec_frontend Front-end Compilers that work with this RTL

The following compilers are known to do compatible code generation for
this RTL: icc/icl, gcc. Code generation is discussed in more detail
later in this document.

@section sec_outlining Outlining

The runtime interface is based on the idea that the compiler
"outlines" sections of code that are to run in parallel into separate
functions that can then be invoked in multiple threads. For instance,
simple code like this

@code
void foo()
{
#pragma omp parallel
    {
        ... do something ...
    }
}
@endcode
is converted into something that looks conceptually like this (where
the names used are merely illustrative; the real library function
names will be used later after we've discussed some more issues...)

@code
static void outlinedFooBody()
{
    ... do something ...
}

void foo()
{
    __OMP_runtime_fork(outlinedFooBody, (void*)0); // Not the real function name!
}
@endcode

@subsection SEC_SHAREDVARS Addressing shared variables

In real uses of the OpenMP\other API there are normally references
from the outlined code to shared variables that are in scope in the containing function.
Therefore the outlined function must be able to address
these variables. The runtime supports two alternate ways of doing
this.

@subsubsection SEC_SEC_OT Current Technique
The technique currently supported by the runtime library is to receive
a separate pointer to each shared variable that can be accessed from
the outlined function. This is what is shown in the example below.

We hope soon to provide an alternative interface to support the
alternate implementation described in the next section. The
alternative implementation has performance advantages for small
parallel regions that have many shared variables.

@subsubsection SEC_SEC_PT Future Technique
The idea is to treat the outlined function as though it
were a lexically nested function, and pass it a single argument which
is the pointer to the parent's stack frame. Provided that the compiler
knows the layout of the parent frame when it is generating the outlined
function, it can then access the up-level variables at appropriate
offsets from the parent frame. This is a classical compiler technique
from the 1960s used to support languages like Algol (and its descendants)
that allow lexically nested functions.

The main benefit of this technique is that there is no code required
at the fork point to marshal the arguments to the outlined function.
Since the runtime knows statically how many arguments must be passed to the
outlined function, it can easily copy them to the thread's stack
frame. Therefore the performance of the fork code is independent of
the number of shared variables that are accessed by the outlined
function.

If it is hard to determine the stack layout of the parent while generating the
outlined code, it is still possible to use this approach by collecting all of
the variables in the parent that are accessed from outlined functions into
a single `struct` which is placed on the stack, and whose address is passed
to the outlined functions. In this way the offsets of the shared variables
are known (since they are inside the struct) without needing to know
the complete layout of the parent stack-frame. From the point of view
of the runtime either of these techniques is equivalent, since in either
case it only has to pass a single argument to the outlined function to allow
it to access shared variables.

A scheme like this is how gcc\other generates outlined functions.

@section SEC_INTERFACES Library Interfaces
The library functions used for specific parts of the OpenMP\other language implementation
are documented in different modules.

- @ref BASIC_TYPES fundamental types used by the runtime in many places
- @ref DEPRECATED functions that are in the library but are no longer required
- @ref STARTUP_SHUTDOWN functions for initializing and finalizing the runtime
- @ref PARALLEL functions for implementing `omp parallel`
- @ref THREAD_STATES functions for supporting thread state inquiries
- @ref WORK_SHARING functions for work sharing constructs such as `omp for`, `omp sections`
- @ref THREADPRIVATE functions to support thread private data, copyin etc
- @ref SYNCHRONIZATION functions to support `omp critical`, `omp barrier`, `omp master`, reductions etc
- @ref ATOMIC_OPS functions to support atomic operations
- Documentation on tasking has still to be written...

@section SEC_EXAMPLES Examples
@subsection SEC_WORKSHARING_EXAMPLE Work Sharing Example
This example shows the code generated for a parallel for with reduction and dynamic scheduling.

@code
extern float foo( void );

int main () {
    int i;
    float r = 0.0;
    #pragma omp parallel for schedule(dynamic) reduction(+:r)
    for ( i = 0; i < 10; i ++ ) {
        r += foo();
    }
}
@endcode

The transformed code looks like this.
@code
extern float foo( void );

int main () {
    static int zero = 0;
    auto int gtid;
    auto float r = 0.0;
    __kmpc_begin( & loc3, 0 );
    // The gtid is not actually required in this example so could be omitted;
    // we show its initialization here because it is often required for calls into
    // the runtime and should be locally cached like this.
    gtid = __kmpc_global_thread_num( & loc3 );
    __kmpc_fork_call( & loc7, 1, main_7_parallel_3, & r );
    __kmpc_end( & loc0 );
    return 0;
}

struct main_10_reduction_t_5 { float r_10_rpr; };

static kmp_critical_name lck = { 0 };
static ident_t loc10; // loc10.flags should contain KMP_IDENT_ATOMIC_REDUCE bit set
                      // if the compiler has generated an atomic reduction.

void main_7_parallel_3( int *gtid, int *btid, float *r_7_shp ) {
    auto int i_7_pr;
    auto int lower, upper, liter, incr;
    auto struct main_10_reduction_t_5 reduce;
    reduce.r_10_rpr = 0.F;
    liter = 0;
    __kmpc_dispatch_init_4( & loc7, *gtid, 35, 0, 9, 1, 1 );
    while ( __kmpc_dispatch_next_4( & loc7, *gtid, & liter, & lower, & upper, & incr ) ) {
        for ( i_7_pr = lower; upper >= i_7_pr; i_7_pr ++ )
            reduce.r_10_rpr += foo();
    }
    switch ( __kmpc_reduce_nowait( & loc10, *gtid, 1, 4, & reduce, main_10_reduce_5, & lck ) ) {
    case 1:
        *r_7_shp += reduce.r_10_rpr;
        __kmpc_end_reduce_nowait( & loc10, *gtid, & lck );
        break;
    case 2:
        __kmpc_atomic_float4_add( & loc10, *gtid, r_7_shp, reduce.r_10_rpr );
        break;
    default:;
    }
}

void main_10_reduce_5( struct main_10_reduction_t_5 *reduce_lhs,
                       struct main_10_reduction_t_5 *reduce_rhs )
{
    reduce_lhs->r_10_rpr += reduce_rhs->r_10_rpr;
}
@endcode

@defgroup BASIC_TYPES Basic Types
Types that are used throughout the runtime.

@defgroup DEPRECATED Deprecated Functions
Functions in this group are for backwards compatibility only, and
should not be used in new code.

@defgroup STARTUP_SHUTDOWN Startup and Shutdown
These functions are for library initialization and shutdown.

@defgroup PARALLEL Parallel (fork/join)
These functions are used for implementing <tt>\#pragma omp parallel</tt>.

@defgroup THREAD_STATES Thread Information
These functions return information about the currently executing thread.

@defgroup WORK_SHARING Work Sharing
These functions are used for implementing
<tt>\#pragma omp for</tt>, <tt>\#pragma omp sections</tt>, <tt>\#pragma omp single</tt> and
<tt>\#pragma omp master</tt> constructs.

When handling loops, there are different functions for each of the signed and unsigned 32 and 64 bit integer types
which have the name suffixes `_4`, `_4u`, `_8` and `_8u`. The semantics of each of the functions is the same,
so they are only described once.

Static loop scheduling is handled by @ref __kmpc_for_static_init_4 and friends. Only a single call is needed,
since the iterations to be executed by any given thread can be determined as soon as the loop parameters are known.

Dynamic scheduling is handled by the @ref __kmpc_dispatch_init_4 and @ref __kmpc_dispatch_next_4 functions.
The init function is called once in each thread outside the loop, while the next function is called each
time that the previous chunk of work has been exhausted.

@defgroup SYNCHRONIZATION Synchronization
These functions are used for implementing barriers.

@defgroup THREADPRIVATE Thread private data support
These functions support copyin/out and thread private data.

@defgroup TASKING Tasking support
These functions are used to implement tasking constructs.

*/