**Design document for OpenMP reductions on the GPU**

//Abstract: //In this document we summarize the new design for an OpenMP
implementation of reductions on NVIDIA GPUs. This document comprises

* a succinct background review,
* an introduction to the decoupling of the reduction algorithm and
  data-structure-specific processing routines,
* detailed illustrations of the reduction algorithms used, and
* a brief overview of steps we have made beyond the last implementation.

**Problem Review**

Consider a typical OpenMP program with a reduction pragma.

```
double foo, bar;
#pragma omp parallel for reduction(+:foo, bar)
for (int i = 0; i < N; i++) {
  foo += A[i]; bar += B[i];
}
```

where 'foo' and 'bar' are reduced across all threads in the parallel region.
Our primary goal is to aggregate the values of foo and bar in a manner that

* keeps the compiler logically concise, and
* reduces efficiently within warps, threads, blocks, and across the device.

**Introduction to Decoupling**

In this section we address the problem of making the compiler
//logically concise// by partitioning the task of reduction into two broad
categories: data-structure-specific routines and algorithmic routines.

The previous reduction implementation was tightly coupled to the specifics
of the reduction element data structures (e.g., sizes, data types) and the
operators of the reduction (e.g., addition, multiplication). In our
implementation we strive to decouple them. In our final implementation, we
could remove all template functions from our runtime system.

The (simplified) pseudo code generated by LLVM is as follows:

```
1. Create private copies of variables: foo_p, bar_p
2. Each thread reduces the chunk of A and B assigned to it and writes
   to foo_p and bar_p respectively.
3. ret = kmpc_nvptx_reduce_nowait(..., reduceData, shuffleReduceFn,
                                  interWarpCpyFn)
   where:
     struct ReduceData {
       double *foo;
       double *bar;
     } reduceData
     reduceData.foo = &foo_p
     reduceData.bar = &bar_p

   shuffleReduceFn and interWarpCpyFn are two auxiliary functions
   generated to aid the runtime in performing the algorithmic steps
   while being data-structure agnostic about ReduceData.

   In particular, shuffleReduceFn is a function that takes the following
   inputs:
   a. a local copy of ReduceData
   b. its lane_id
   c. the offset of the lane_id which hosts a remote ReduceData
      relative to the current one
   d. an algorithm version parameter determining which reduction
      algorithm to use.
   This shuffleReduceFn retrieves the remote ReduceData through shuffle
   intrinsics and reduces, using the algorithm specified by the 4th
   parameter, the local ReduceData with the remote ReduceData element
   wise, and places the resultant values into the local ReduceData.

   Different reduction algorithms are implemented with different runtime
   functions, but they all make calls to this same shuffleReduceFn to
   perform the essential reduction step. Therefore, based on the 4th
   parameter, this shuffleReduceFn will behave slightly differently to
   cooperate with the runtime function to ensure correctness under
   different circumstances.

   interWarpCpyFn, as the name suggests, is a function that copies data
   across warps. Its function is to tunnel all the thread-private
   ReduceData that is already reduced within a warp to a lane in the
   first warp with a minimal shared memory footprint. This is an
   essential step to prepare for the last step of a block reduction.

   (Warp, block, and device level reduction routines that utilize these
   auxiliary functions will be discussed in the next section.)

4. if ret == 1:
   The master thread stores the reduced result in the globals.
   foo += reduceData.foo; bar += reduceData.bar
```
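
For concreteness, the following is a minimal, hand-written sketch of what a
compiler-generated shuffleReduceFn for the foo/bar example could look like,
based on the per-element behaviors described in the next section. The
shuffle_down_double helper (one possible implementation is sketched in the
appendix) and the exact signature are assumptions made for illustration only;
the code Clang actually emits differs.

```
// Hypothetical illustration only; names and the signature are assumptions.
struct ReduceData { double *foo; double *bar; };

// Assumed helper that moves a double from the lane `offset` positions
// higher in the warp (see the appendix for one possible implementation).
__device__ double shuffle_down_double(double val, int offset);

__device__ void shuffleReduceFn(void *reduce_data, int lane_id,
                                int offset, int algo_version) {
  ReduceData *rd = (ReduceData *)reduce_data;
  // Fetch the remote lane's elements first, unconditionally.
  double remote_foo = shuffle_down_double(*rd->foo, offset);
  double remote_bar = shuffle_down_double(*rd->bar, offset);
  // Decide, per algorithm version, whether to reduce or to copy.
  bool do_reduce;
  if (algo_version == 0)        // full warp reduction: always combine
    do_reduce = true;
  else if (algo_version == 1)   // contiguous partial warp reduction
    do_reduce = (lane_id < offset);
  else                          // dispersed partial warp reduction
    do_reduce = (lane_id % 2 == 0 && offset > 0);
  if (do_reduce) {
    *rd->foo += remote_foo;
    *rd->bar += remote_bar;
  } else {
    *rd->foo = remote_foo;      // copy step that preserves the invariants
    *rd->bar = remote_bar;
  }
}
```

Because the runtime only ever sees the void pointer and this function
pointer, it stays agnostic about the layout of ReduceData.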

**Reduction Algorithms**

On the warp level, we have three versions of the algorithm:

1. Full Warp Reduction

```
gpu_regular_warp_reduce(void *reduce_data,
                        kmp_ShuffleReductFctPtr ShuffleReduceFn) {
  for (int offset = WARPSIZE/2; offset > 0; offset /= 2)
    ShuffleReduceFn(reduce_data, 0, offset, 0);
}
```

ShuffleReduceFn is called here with lane_id set to 0 because lane_id is not
used by this version; we therefore save instructions by not retrieving it
from the corresponding special registers. The 4th parameter, which represents
the version of the algorithm being used, is set to 0 to signify full warp
reduction.

In this version (=0), ShuffleReduceFn behaves, per element, as follows:

```
//reduce_elem refers to an element in the local ReduceData
//remote_elem is retrieved from a remote lane
remote_elem = shuffle_down(reduce_elem, offset, 32);
reduce_elem = reduce_elem @ remote_elem;
```

An illustration of this algorithm operating on a hypothetical 8-lane full
warp would be:
{F74}
The coloring invariant is that elements with the same color are combined and
reduced in the next reduction step. As can be observed, no overhead is
present; exactly log(2, N) steps are needed.

2. Contiguous Partial Warp Reduction

```
gpu_irregular_warp_reduce(void *reduce_data,
                          kmp_ShuffleReductFctPtr ShuffleReduceFn, int size,
                          int lane_id) {
  int curr_size;
  int offset;
  curr_size = size;
  offset = curr_size/2;
  while (offset > 0) {
    ShuffleReduceFn(reduce_data, lane_id, offset, 1);
    curr_size = (curr_size+1)/2;
    offset = curr_size/2;
  }
}
```

In this version (=1), ShuffleReduceFn behaves, per element, as follows:

```
//reduce_elem refers to an element in the local ReduceData
//remote_elem is retrieved from a remote lane
remote_elem = shuffle_down(reduce_elem, offset, 32);
if (lane_id < offset) {
  reduce_elem = reduce_elem @ remote_elem
} else {
  reduce_elem = remote_elem
}
```

An important invariant (also a restriction on the starting state of the
reduction) is that this algorithm assumes that the ReduceData still to be
reduced are located in a contiguous subset of lanes in a warp starting from
lane 0.

When the number of participating lanes is odd, the trailing active lane's
value will not be aggregated with any other lane. Therefore, in order to
preserve the invariant, such ReduceData is copied to the first lane whose
thread-local ReduceData has already been used in a previous reduction step
and would therefore be useless otherwise.

An illustration of this algorithm operating on a hypothetical 8-lane partial
warp would be:
{F75}

As illustrated, this version of the algorithm introduces overhead whenever we
have an odd number of participating lanes in any reduction step, in order to
copy data between lanes.

3. Dispersed Partial Warp Reduction

```
gpu_irregular_simt_reduce(void *reduce_data,
                          kmp_ShuffleReductFctPtr ShuffleReduceFn) {
  int size, remote_id;
  int logical_lane_id = find_number_of_dispersed_active_lanes_before_me() * 2;
  do {
    remote_id = find_the_next_active_lane_id_right_after_me();
    // the above function returns 0 if no active lane
    // is present right after the current thread.
    size = get_number_of_active_lanes_in_this_warp();
    logical_lane_id /= 2;
    ShuffleReduceFn(reduce_data, logical_lane_id, remote_id-1-threadIdx.x, 2);
  } while (logical_lane_id % 2 == 0 && size > 1);
}
```

There is no assumption made about the initial state of the reduction.
Any number of lanes (>=1) could be active at any position. The reduction
result is kept in the first active lane.

In this version (=2), ShuffleReduceFn behaves, per element, as follows:

```
//reduce_elem refers to an element in the local ReduceData
//remote_elem is retrieved from a remote lane
remote_elem = shuffle_down(reduce_elem, offset, 32);
if (lane_id % 2 == 0 && offset > 0) {
  reduce_elem = reduce_elem @ remote_elem
} else {
  reduce_elem = remote_elem
}
```

We now briefly explain some of the arguments passed in. In this section we
introduce the concept of a logical_lane_id; it is important to distinguish
it from the physical lane_id defined by NVIDIA.

1. //logical_lane_id//: as the name suggests, it refers to a calculated
   lane_id (instead of the physical one defined by NVIDIA) that makes our
   algorithm logically concise. A thread with logical_lane_id k means
   there are (k-1) threads before it.
2. //remote_id-1-threadIdx.x//: remote_id is the NVIDIA-defined lane id of
   the remote lane from which we will retrieve the ReduceData. We subtract
   (threadIdx.x+1) from it because we would like to maintain only one
   underlying shuffle intrinsic (which is used to communicate among lanes
   in a warp). The particular shuffle intrinsic we use accepts only offsets,
   not absolute lane_ids, so the subtraction converts the calculated
   absolute lane_id into the required offset. (A sketch of how the
   active-lane helper queries used above can be implemented is given after
   this list.)
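
The helper queries used in gpu_irregular_simt_reduce can be expressed with
the same warp vote intrinsics (__ballot, __popc) that appear later in this
document. The sketch below is an illustrative assumption rather than the
actual runtime implementation; in particular it assumes a one-dimensional
thread block so that threadIdx.x % 32 is the physical lane id.

```
// Hypothetical sketches of the active-lane queries; illustrative only.
__device__ int get_number_of_active_lanes_in_this_warp() {
  return __popc(__ballot(1));                 // count bits in the active mask
}

__device__ int find_number_of_dispersed_active_lanes_before_me() {
  unsigned lane = threadIdx.x % 32;
  unsigned lower_mask = (1u << lane) - 1;     // lanes with a smaller id
  return __popc(__ballot(1) & lower_mask);
}

__device__ int find_the_next_active_lane_id_right_after_me() {
  unsigned lane = threadIdx.x % 32;
  unsigned higher_mask = ~((2u << lane) - 1); // lanes with a larger id
  // __ffs is 1-based and returns 0 if no bit is set, which is consistent
  // with the "returns 0 if no active lane follows" convention above.
  return __ffs((int)(__ballot(1) & higher_mask));
}
```

The 1-based result of the last query is also consistent with the pseudo
code subtracting 1 from remote_id before converting it to a shuffle offset.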

This algorithm differs in two ways from the previous ones, and it is not,
conceptually, a generalization of the above algorithms.

1. It reduces elements close to each other. For instance, the value in the
   0th lane is combined with that of the 1st lane; the value in the 2nd lane
   is combined with that of the 3rd lane. We did not use the previous scheme,
   where the first half of the (partial) warp is reduced with the second half
   of the (partial) warp, because the mapping
     f(x): logical_lane_id -> physical_lane_id
   can be easily calculated whereas its inverse
     f^-1(x): physical_lane_id -> logical_lane_id
   cannot, and performing such a reduction requires the inverse to be known.
2. Because this algorithm is agnostic about the positions of the lanes that
   are active, we do not need to perform the copying step used in the second
   algorithm.

An illustrative run would look like:
{F76}
As observed, the overhead is high because in each and every reduction step
the logical_lane_id is recalculated, and so is the remote_id.

On the block level, we have implemented the following block reduce algorithm:

```
gpu_irregular_block_reduce(void *reduce_data,
                           kmp_ShuffleReductFctPtr shuflReduceFn,
                           kmp_InterWarpCopyFctPtr interWarpCpyFn,
                           int size) {

  int wid = threadIdx.x/WARPSIZE;
  int lane_id = threadIdx.x%WARPSIZE;

  int warp_needed = (size+WARPSIZE-1)/WARPSIZE; //ceiling of division

  unsigned tnum = __ballot(1);
  int thread_num = __popc(tnum);

  //full warp reduction
  if (thread_num == WARPSIZE) {
    gpu_regular_warp_reduce(reduce_data, shuflReduceFn);
  }
  //partial warp reduction
  if (thread_num < WARPSIZE) {
    gpu_irregular_warp_reduce(reduce_data, shuflReduceFn, thread_num,
                              lane_id);
  }
  //Gather all the reduced values from each warp
  //to the first warp.
  //named_barrier inside this function to ensure
  //correctness. It is effectively a sync_thread
  //that won't deadlock.
  interWarpCpyFn(reduce_data, warp_needed);

  //This is to reduce data gathered from each "warp master".
  if (wid == 0) {
    gpu_irregular_warp_reduce(reduce_data, shuflReduceFn, warp_needed,
                              lane_id);
  }

  return;
}
```

In this function, ShuffleReduceFn is not called directly; instead, the
function calls the various versions of the warp-reduction routines. It first
reduces ReduceData warp by warp; at that point we end up with a number of
ReduceData values equal to the number of warps present in this thread block.
We then proceed to gather all such ReduceData into the first warp.

As observed, in this algorithm we make use of the function interWarpCpyFn,
which copies data from each of the "warp masters" (the 0th lane of each warp,
where a warp-reduced ReduceData is held) to the 0th warp; a sketch of such a
copy function is given below. This step reduces (in a mathematical sense) the
problem of reduction across warp masters in a block to the problem of warp
reduction, which we already have solutions to.

We can thus completely avoid the use of atomics to reduce within a threadblock.
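
To make the role of interWarpCpyFn concrete, the following is a hand-written
sketch (not the generated code) of what such a copy function could look like
for the foo/bar ReduceData, under the simplifying assumption that a plain
__syncthreads() is a safe stand-in for the runtime's named barrier.

```
#define WARPSIZE 32

// Hypothetical illustration; the real generated function differs.
struct ReduceData { double *foo; double *bar; };

__device__ void interWarpCpyFn(void *reduce_data, int warp_num) {
  __shared__ double scratch[WARPSIZE];       // one slot per warp master
  ReduceData *rd = (ReduceData *)reduce_data;
  int wid = threadIdx.x / WARPSIZE;
  int lane_id = threadIdx.x % WARPSIZE;

  // Copy each reduction variable in turn through shared memory so the
  // footprint stays at one element per warp, as described above.
  for (int v = 0; v < 2; ++v) {
    double *elem = (v == 0) ? rd->foo : rd->bar;
    if (lane_id == 0)                        // warp master publishes its value
      scratch[wid] = *elem;
    __syncthreads();
    if (wid == 0 && lane_id < warp_num)      // first warp picks the values up
      *elem = scratch[lane_id];
    __syncthreads();
  }
}
```

The runtime uses a named barrier here precisely because __syncthreads() can
deadlock when not every thread of the block participates.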

**Efficient Cross Block Reduce**

The next challenge is to reduce values across threadblocks. We aim to do this
without atomics or critical sections.

Let a kernel be started with TB threadblocks.
Let the GPU have S SMs.
There can be at most N active threadblocks per SM at any time.

Consider a threadblock tb (tb < TB) running on SM s (s < S). 'tb' is one of
at most 'N' active threadblocks on SM s. Let each threadblock active on an SM
be given an instance identifier id (0 <= id < N). Therefore, the tuple (s, id)
uniquely identifies an active threadblock on the GPU.

To efficiently implement cross block reduce, we first allocate, for each value
to be reduced, an array of size S*N (the maximum number of threadblocks active
at any time on the device).

Each threadblock reduces its value into slot [s][id]. This can be done without
locking since no other threadblock can write to the same slot concurrently.
(A sketch of this slot write follows the pseudo code below.)

As a final stage, we reduce the values in the array as follows:

```
// Compiler generated wrapper function for each target region with a
// reduction clause.
target_function_wrapper(map_args, reduction_array)  <--- start with 1 team
                                                         and 1 thread.
  // Use dynamic parallelism to launch M teams, N threads as requested by
  // the user to execute the target region.
  target_function<<<M, N>>>(map_args)

  Reduce values in reduction_array
```
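
A minimal sketch of the per-threadblock slot write described above is given
below. Reading the SM id from the %smid special register is real PTX; the
get_instance_id() helper and the reduction_array layout are assumptions made
purely to illustrate the [s][id] indexing, not the actual runtime code.

```
// Hypothetical illustration of the slot write; not the actual runtime code.
__device__ unsigned get_sm_id() {
  unsigned sm;
  asm volatile("mov.u32 %0, %%smid;" : "=r"(sm));  // SM this block runs on
  return sm;
}

// Assumed to hand out an id in [0, N) that is unique among the blocks
// currently active on this SM (implementation not shown here).
__device__ unsigned get_instance_id(unsigned n_per_sm);

// reduction_array has S * N slots, one per possibly-active threadblock.
__device__ void write_block_result(double *reduction_array, double block_foo,
                                   unsigned n_per_sm) {
  if (threadIdx.x == 0) {                    // block master owns the slot
    unsigned slot = get_sm_id() * n_per_sm + get_instance_id(n_per_sm);
    reduction_array[slot] = block_foo;       // no lock: slot is exclusive
  }
}
```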

**Comparison with Last Version**

The (simplified) pseudo code generated by LLVM on the host is as follows:

```
1. Create private copies of variables: foo_p, bar_p
2. Each thread reduces the chunk of A and B assigned to it and writes
   to foo_p and bar_p respectively.
3. ret = kmpc_reduce_nowait(..., reduceData, reduceFn, lock)
   where:
     struct ReduceData {
       double *foo;
       double *bar;
     } reduceData
     reduceData.foo = &foo_p
     reduceData.bar = &bar_p

   reduceFn is a pointer to a function that takes in two inputs
   of type ReduceData, "reduces" them element wise, and places the
   result in the first input:
     reduceFn(ReduceData *a, ReduceData *b)
       a = a @ b

   Every thread in the parallel region calls kmpc_reduce_nowait with
   its private copy of reduceData. The runtime reduces across the
   threads (using tree reduction on the operator 'reduceFn') and stores
   the final result in the master thread if successful.
4. if ret == 1:
   The master thread stores the reduced result in the globals.
   foo += reduceData.foo; bar += reduceData.bar
5. else if ret == 2:
   In this case kmpc_reduce_nowait() could not use tree reduction,
   so use atomics instead:
   each thread atomically writes to foo
   each thread atomically writes to bar
```

On a GPU, a similar reduction may need to be performed across SIMT threads,
warps, and threadblocks. The challenge is to do so efficiently in a fashion
that is compatible with the LLVM OpenMP implementation.

In the previously released 0.1 version of the LLVM OpenMP compiler for GPUs,
the salient steps of the generated code are as follows:

```
1. Create private copies of variables: foo_p, bar_p
2. Each thread reduces the chunk of A and B assigned to it and writes
   to foo_p and bar_p respectively.
3. ret = kmpc_reduce_nowait(..., reduceData, reduceFn, lock)
   status = can_block_reduce()
   if status == 1:
     reduce efficiently to thread 0 using shuffles and shared memory.
     return 1
   else
     cannot use efficient block reduction, fall back to atomics
     return 2
4. if ret == 1:
   The master thread stores the reduced result in the globals.
   foo += reduceData.foo; bar += reduceData.bar
5. else if ret == 2:
   In this case kmpc_reduce_nowait() could not use tree reduction,
   so use atomics instead:
   each thread atomically writes to foo
   each thread atomically writes to bar
```

The function can_block_reduce() is defined as follows:

```
int32_t can_block_reduce() {
  int tid = GetThreadIdInTeam();
  int nt = GetNumberOfOmpThreads(tid);
  if (nt != blockDim.x)
    return 0;
  unsigned tnum = __ballot(1);
  if (tnum != (~0x0)) {
    return 0;
  }
  return 1;
}
```

This function permits the use of the efficient block reduction algorithm
using shuffles and shared memory (return 1) only if (a) all SIMT threads in
a warp are active (i.e., the number of threads in the parallel region is a
multiple of 32) and (b) the number of threads in the parallel region
(set by the num_threads clause) equals blockDim.x.

If either of these preconditions is not true, each thread in the threadblock
updates the global value using atomics.

Atomics and compare-and-swap operations are expensive on many-threaded
architectures such as GPUs, and we must avoid them completely.

**Appendix: Implementation Details**

```
// Compiler generated function.
reduceFn(ReduceData *a, ReduceData *b)
  a->foo = a->foo + b->foo
  a->bar = a->bar + b->bar

// Compiler generated function.
swapAndReduceFn(ReduceData *thread_private, int lane)
  ReduceData *remote = new ReduceData()
  remote->foo = shuffle_double(thread_private->foo, lane)
  remote->bar = shuffle_double(thread_private->bar, lane)
  reduceFn(thread_private, remote)

// OMP runtime function.
warpReduce_regular(ReduceData *thread_private, Fn *swapAndReduceFn):
  offset = 16
  while (offset > 0)
    swapAndReduceFn(thread_private, offset)
    offset /= 2

// OMP runtime function.
warpReduce_irregular():
  ...

// OMP runtime function.
kmpc_reduce_warp(reduceData, swapAndReduceFn)
  if all_lanes_active:
    warpReduce_regular(reduceData, swapAndReduceFn)
  else:
    warpReduce_irregular(reduceData, swapAndReduceFn)
  if in_simd_region:
    // all done, reduce to global in simd lane 0
    return 1
  else if in_parallel_region:
    // done reducing to one value per warp, now reduce across warps
    return 3

// OMP runtime function; one for each basic type.
kmpc_reduce_block_double(double *a)
  if lane == 0:
    shared[wid] = *a
  named_barrier(1, num_threads)
  if wid == 0:
    block_reduce(shared)
  if lane == 0:
    *a = shared[0]
  named_barrier(1, num_threads)
  if wid == 0 and lane == 0:
    return 1 // write back reduced result
  else:
    return 0 // don't do anything
```
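
The shuffle_double used by swapAndReduceFn above must move a 64-bit value
between lanes. A minimal sketch of one common way to do this with the 32-bit
pre-CUDA-9 __shfl_down intrinsic is shown below; whether the runtime uses
this exact splitting or a native 64-bit shuffle is an implementation detail,
so treat this as an illustration only.

```
// Illustrative only: shuffle a double by splitting it into two 32-bit halves.
__device__ double shuffle_down_double(double val, int offset) {
  int hi = __double2hiint(val);            // upper 32 bits of the double
  int lo = __double2loint(val);            // lower 32 bits of the double
  hi = __shfl_down(hi, offset, 32);        // move both halves down the warp
  lo = __shfl_down(lo, offset, 32);
  return __hiloint2double(hi, lo);         // reassemble on the receiving lane
}
```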

```
// Compiler generated code.
1. Create private copies of variables: foo_p, bar_p
2. Each thread reduces the chunk of A and B assigned to it and writes
   to foo_p and bar_p respectively.
3. ret = kmpc_reduce_warp(reduceData, swapAndReduceFn)
4. if ret == 1:
   The master thread stores the reduced result in the globals.
   foo += reduceData.foo; bar += reduceData.bar
5. else if ret == 3:
   ret = block_reduce_double(reduceData.foo)
   if ret == 1:
     foo += reduceData.foo
   ret = block_reduce_double(reduceData.bar)
   if ret == 1:
     bar += reduceData.bar
```

**Notes**

1. This scheme requires that the CUDA OMP runtime can call LLVM-generated
   functions. This functionality now works.
2. If the user inlines the CUDA OMP runtime bitcode, all of the machinery
   (including the calls through function pointers) is optimized away.
3. If we are reducing multiple variables in a parallel region, the reduce
   operations are all performed in warpReduce_[ir]regular(). This results in
   more instructions in the loop and should result in fewer stalls due to
   data dependencies. Unfortunately we cannot do the same in
   kmpc_reduce_block_double() without increasing shared memory usage.