.. _omp111:

Replaced globalized variable with X bytes of shared memory. [OMP111]
====================================================================

This optimization occurs when a globalized variable's data is shared between
multiple threads but requires only a constant amount of memory that can be
determined at compile time. This is the case when a single thread creates the
memory, which is then shared with every other thread. The memory can then be
placed in a static buffer of shared memory on the device. This optimization
allows users to declare shared memory on the device without using OpenMP's
custom allocators.

Globalization occurs when a pointer to a thread-local variable escapes the
current scope. If a single thread is known to be responsible for creating and
sharing the data, the variable can instead be mapped directly to the device's
shared memory. Checking whether only a single thread can execute an instruction
requires that the parent functions have internal linkage. Otherwise, an external
caller could invalidate this analysis by having multiple threads call that
function. The optimization pass makes internal copies of each function for this
reason, but it is still recommended to mark functions as internal using keywords
like ``static`` whenever possible.

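The pattern described above can be sketched as follows. This is a minimal
illustration, not code taken from the optimization pass or its tests: the
function names ``sum_block`` and ``run`` and the constant ``BLOCK`` are
hypothetical. The helper is declared ``static`` so that it has internal linkage,
and ``Buf`` is created by the single thread that enters the helper before being
shared with the threads of the parallel region. Whether the remark is actually
emitted for this code depends on the compiler version and optimization level.

.. code-block:: c++

  #include <cassert>

  constexpr int BLOCK = 128; // hypothetical size, known at compile time

  // Internal linkage: code outside this translation unit cannot call the
  // helper directly, so the caller below is the only entry point to analyze.
  static double sum_block(const double *X) {
    double Buf[BLOCK]; // created by one thread, then shared below

  #pragma omp parallel for shared(Buf)
    for (int i = 0; i < BLOCK; ++i)
      Buf[i] = 2.0 * X[i];

    // Back on the single creating thread: reduce the shared buffer.
    double Sum = 0.0;
    for (int i = 0; i < BLOCK; ++i)
      Sum += Buf[i];
    return Sum;
  }

  double run(const double *X) {
    double Result = 0.0;
  #pragma omp target map(to : X[0:BLOCK]) map(tofrom : Result)
    Result = sum_block(X);
    return Result;
  }

Because the array size is a compile-time constant and only one thread reaches
the declaration of ``Buf``, this is the kind of variable the remark refers to.
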
Example
-------

This optimization should apply to any variable declared in an OpenMP target
region that is then shared with every thread in a parallel region. This allows
the user to declare shared memory without using custom allocators. A simple
stencil calculation shows how this can be used.

.. code-block:: c++

  // MC, NC, and dX are assumed to be compile-time constants defined elsewhere.
  void stencil(int M, int N, double *X, double *Y) {
  #pragma omp target teams distribute collapse(2) \
    map(to : X [0:M * N]) map(tofrom : Y [0:M * N])
    for (int i0 = 0; i0 < M; i0 += MC) {
      for (int j0 = 0; j0 < N; j0 += NC) {
        // Tile buffer declared in the target region and shared by all threads.
        double sX[MC][NC];

  #pragma omp parallel for collapse(2) shared(sX) default(firstprivate)
        for (int i1 = 0; i1 < MC; ++i1)
          for (int j1 = 0; j1 < NC; ++j1)
            sX[i1][j1] = X[(i0 + i1) * N + (j0 + j1)];

  #pragma omp parallel for collapse(2) shared(sX) default(firstprivate)
        for (int i1 = 1; i1 < MC - 1; ++i1)
          for (int j1 = 1; j1 < NC - 1; ++j1)
            Y[(i0 + i1) * N + (j0 + j1)] = (sX[i1 + 1][j1] + sX[i1 - 1][j1] +
                                            sX[i1][j1 + 1] + sX[i1][j1 - 1] +
                                            -4.0 * sX[i1][j1]) / (dX * dX);
      }
    }
  }

.. code-block:: console

  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass=openmp-opt -fopenmp-version=51 omp111.cpp
  omp111.cpp:10:14: remark: Replaced globalized variable with 8192 bytes of shared memory. [OMP111]
      double sX[MC][NC];
             ^

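The byte count in the remark is the static size of the array. The values of
``MC`` and ``NC`` are not shown in the example; assuming, purely for
illustration, a 32 by 32 tile of 8-byte ``double`` values, the size works out to
the 8192 bytes reported. The helper ``tileBytes`` is a hypothetical name.

.. code-block:: c++

  #include <cassert>
  #include <cstddef>

  // Hypothetical tile dimensions; any pair with MC * NC == 1024 gives the
  // same size.
  constexpr int MC = 32;
  constexpr int NC = 32;

  constexpr std::size_t tileBytes() { return sizeof(double[MC][NC]); }

  // 32 * 32 elements * 8 bytes per double = 8192 bytes.
  static_assert(sizeof(double) == 8, "assumes an 8-byte double");
  static_assert(tileBytes() == 8192, "matches the size in the remark");
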
The default data-sharing attribute for variables captured in an OpenMP parallel
region is ``shared``. Sharing a variable means taking a pointer to it, which
ultimately causes it to be globalized and mapped to shared memory even when it
could have been kept in registers. To avoid this, make sure each variable that
can be copied into the region is marked ``firstprivate``, either explicitly or
with the OpenMP 5.1 feature ``default(firstprivate)``.

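As a minimal sketch of this advice (the function ``scale`` and its parameters
are hypothetical), the variables below are only read inside the parallel region,
so they are listed in an explicit ``firstprivate`` clause. Each thread then
receives its own copy instead of a pointer to a shared object.

.. code-block:: c++

  #include <cassert>

  void scale(int N, double *X) {
  #pragma omp target map(tofrom : X[0:N])
    {
      double Factor = 2.0; // declared inside the target region

      // Without the clause, N, X, and Factor would be shared by default in
      // the parallel region, which means they are captured by pointer.
  #pragma omp parallel for firstprivate(N, X, Factor)
      for (int i = 0; i < N; ++i)
        X[i] *= Factor;
    }
  }

Only the pointer ``X`` is copied per thread here; the array it points to is
still the single mapped buffer, so every thread writes to the same data.
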
Diagnostic Scope
----------------

OpenMP target offloading optimization remark.