forked from OSchip/llvm-project
324 lines
13 KiB
Markdown
324 lines
13 KiB
Markdown
<!--===- docs/DoConcurrent.md
|
|
|
|
Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
|
|
See https://llvm.org/LICENSE.txt for license information.
|
|
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
|
|
|
|
-->
|
|
|
|
# `DO CONCURRENT` isn't necessarily concurrent
|
|
|
|
```eval_rst
|
|
.. contents::
|
|
:local:
|
|
```
|
|
|
|
A variant form of Fortran's primary looping construct was
|
|
added to the Fortran 2008 language standard with the apparent
|
|
intent of enabling more effective automatic parallel execution of code
|
|
written in the standard language without the use of
|
|
non-standard directives.
|
|
Spelled `DO CONCURRENT`, the construct takes a rectilinear iteration
|
|
space specification like `FORALL` and allows us to write
|
|
a multidimensional loop nest construct with a single `DO CONCURRENT`
|
|
statement and a single terminating `END DO` statement.
|
|
|
|
Within the body of a `DO CONCURRENT` loop the program must respect
|
|
a long list of restrictions on its use of Fortran language features.
|
|
Actions that obviously can't be executed in parallel or that
|
|
don't allow all iterations to execute are prohibited.
|
|
These include:
|
|
* Control flow statements that would prevent the loop nest from
|
|
executing all its iterations: `RETURN`, `EXIT`, and any
|
|
`GOTO` or `CYCLE` that leaves the construct.
|
|
* Image control statements: `STOP`, `SYNC`, `LOCK`/`UNLOCK`, `EVENT`,
|
|
and `ALLOCATE`/`DEALLOCATE` of a coarray.
|
|
* Calling a procedure that is not `PURE`.
|
|
* Deallocation of any polymorphic entity, as that could cause
|
|
an impure FINAL subroutine to be called.
|
|
* Messing with the IEEE floating-point control and status flags.
|
|
* Accepting some restrictions on data flow between iterations
|
|
(i.e., none) and on liveness of modified objects after the loop.
|
|
(The details are spelled out later.)
|
|
|
|
In return for accepting these restrictions, a `DO CONCURRENT` might
|
|
compile into code that exploits the parallel features of the target
|
|
machine to run the iterations of the `DO CONCURRENT` construct.
|
|
One needn't necessarily require OpenACC or OpenMP directives.
|
|
|
|
But it turns out that these rules, though *necessary* for safe parallel
|
|
execution, are not *sufficient*.
|
|
One may write conforming `DO CONCURRENT` constructs that cannot
|
|
be safely parallelized by a compiler; worse, one may write conforming
|
|
`DO CONCURRENT` constructs whose parallelizability a compiler cannot
|
|
determine even in principle -- forcing a conforming compiler to
|
|
assume the worst and generate sequential code.
|
|
|
|
## Localization
|
|
|
|
The Fortran language standard does not actually define `DO CONCURRENT` as a
|
|
concurrent construct, or even as a construct that imposes sufficient
|
|
requirements on the programmer to allow for parallel execution.
|
|
`DO CONCURRENT` is instead defined as executing the iterations
|
|
of the loop in some arbitrary order (see subclause 11.1.7.4.3 paragraph 3).
|
|
|
|
A `DO CONCURRENT` construct cannot modify an object in one iteration
|
|
and expect to be able to read it in another, or read it in one before it gets
|
|
modified by another -- there's no way to synchronize inter-iteration
|
|
communication with critical sections or atomics.
|
|
|
|
But a conforming `DO CONCURRENT` construct *can* modify an object in
|
|
multiple iterations of the loop so long as its only reads from that
|
|
object *after* having modified it earler in the *same* iteration.
|
|
(See 11.1.7.5 paragraph 4 for the details.)
|
|
|
|
For example:
|
|
|
|
```
|
|
DO CONCURRENT (J=1:N)
|
|
TMP = A(J) + B(J)
|
|
C(J) = TMP
|
|
END DO
|
|
! And TMP is undefined afterwards
|
|
```
|
|
|
|
The scalar variable `TMP` is used in this loop in a way that conforms
|
|
to the standard, as every use of `TMP` follows a definition that appears
|
|
earlier in the same iteration.
|
|
|
|
The idea, of course, is that a parallelizing compiler isn't required to
|
|
use the same word of memory to hold the value of `TMP`;
|
|
for parallel execution, `TMP` can be _localized_.
|
|
This means that the loop can be internally rewritten as if it had been
|
|
```
|
|
DO CONCURRENT (J=1:N)
|
|
BLOCK
|
|
REAL :: TMP
|
|
TMP = A(J) + B(J)
|
|
C(J) = TMP
|
|
END BLOCK
|
|
END DO
|
|
```
|
|
and thus any risk of data flow between the iterations is removed.
|
|
|
|
## The identification problem
|
|
|
|
The automatic localization rules of `DO CONCURRENT` that allow
|
|
usage like `TMP` above are not limited to simple local scalar
|
|
variables.
|
|
They also apply to arbitrary variables, and thus may apply
|
|
in cases that a compiler cannot determine exactly due to
|
|
the presence of indexing, indirection, and interprocedural data flow.
|
|
|
|
Let's see why this turns out to be a problem.
|
|
|
|
Examples:
|
|
```
|
|
DO CONCURRENT (J=1:N)
|
|
T(IX(J)) = A(J) + B(J)
|
|
C(J) = T(IY(J))
|
|
END DO
|
|
```
|
|
This loop conforms to the standard language if,
|
|
whenever `IX(J)` equals `IY(J')` for any distinct pair of iterations
|
|
`J` and `J'`,
|
|
then the load must be reading a value stored earlier in the
|
|
same iteration -- so `IX(J')==IY(J')`, and hence `IX(J)==IX(J')` too,
|
|
in this example.
|
|
Otherwise, a load in one iteration might depend on a store
|
|
in another.
|
|
|
|
When all values of `IX(J)` are distinct, and the program conforms
|
|
to the restrictions of `DO CONCURRENT`, a compiler can parallelize
|
|
the construct easily without applying localization to `T(...)`.
|
|
And when some values of `IX(J)` are duplicates, a compiler can parallelize
|
|
the loop by forwarding the stored value to the load in those
|
|
iterations.
|
|
But at compilation time, there's _no way to distinguish_ these
|
|
cases in general, and a conservative implementation has to assume
|
|
the worst and run the loop's iterations serially.
|
|
(Or compare `IX(J)` with `IY(J)` at runtime and forward the
|
|
stored value conditionally, which adds overhead and becomes
|
|
quickly impractical in loops with multiple loads and stores.)
|
|
|
|
In
|
|
```
|
|
TYPE :: T
|
|
REAL, POINTER :: P
|
|
END TYPE
|
|
TYPE(T) :: T1(N), T2(N)
|
|
DO CONCURRENT (J=1:N)
|
|
T1(J)%P = A(J) + B(J)
|
|
C(J) = T2(J)%P
|
|
END DO
|
|
```
|
|
we have the same kind of ambiguity from the compiler's perspective.
|
|
Are the targets of the pointers used for the stores all distinct
|
|
from the targets of the pointers used for the loads?
|
|
The programmer may know that they are so, but a compiler
|
|
cannot; and there is no syntax by which one can stipulate
|
|
that they are so.
|
|
|
|
## The global variable localization problem
|
|
|
|
Here's another case:
|
|
```
|
|
MODULE M
|
|
REAL :: T
|
|
END MODULE
|
|
...
|
|
USE M
|
|
INTERFACE
|
|
PURE REAL FUNCTION F(X)
|
|
REAL, INTENT(IN) :: X
|
|
END FUNCTION
|
|
END INTERFACE
|
|
DO CONCURRENT (J=1:N)
|
|
T = A(J) + B(J)
|
|
D(J) = F(A(J)) + T
|
|
END DO
|
|
```
|
|
The variable `T` is obviously meant to be localized.
|
|
However, a compiler can't be sure that the pure function `F`
|
|
doesn't read from `T`; if it does, there wouldn't be a
|
|
practical way to convey the localized copy to it.
|
|
|
|
In summary, standard Fortran defines `DO CONCURRENT` as a serial
|
|
construct with a sheaf of constraints that we assume are intended
|
|
to enable straightforward parallelization without
|
|
all of the complexity of defining threading models or shared memory semantics,
|
|
with the addition of an automatic localization rule that provides
|
|
convenient temporaries objects without requiring the use of nested
|
|
`BLOCK` or `ASSOCIATE` constructs.
|
|
But the language allows ambiguous cases in which a compiler can neither
|
|
1. prove that automatic localization *is* required for a given
|
|
object in every iteration, nor
|
|
1. prove that automatic localization *isn't* required in any iteration.
|
|
|
|
## Locality specifiers
|
|
|
|
The Fortran 2018 standard added "locality specifiers" to the
|
|
`DO CONCURRENT` statement.
|
|
These allow one to define some variable names as being `LOCAL` or
|
|
`SHARED`, overriding the automatic localization rule so that it
|
|
applies only in the remaining cases of "unspecified" locality.
|
|
|
|
`LOCAL` variables are those that can be defined by more than one
|
|
iteration but are referenced only after having been defined
|
|
earlier in the same iteration.
|
|
`SHARED` variables are those that, if defined in
|
|
any iteration, are not defined or referenced in any other iteration.
|
|
|
|
(There is also a `LOCAL_INIT` specifier that is not relevant to the
|
|
problem at hand, and a `DEFAULT(NONE)` specifier that requires a
|
|
locality specifier be present for every variable mentioned in the
|
|
`DO CONCURRENT` construct.)
|
|
|
|
These locality specifiers can help resolve some otherwise ambiguous
|
|
cases of localization, but they're not a complete solution to the problems
|
|
described above.
|
|
|
|
First, the specifiers allow explicit localization of objects
|
|
(like the scalar `T` in `MODULE M` above) that are not local variables
|
|
of the subprogram.
|
|
`DO CONCURRENT` still allows a pure procedure called from the loop
|
|
to reference `T`, and so explicit localization just confirms the
|
|
worst-case assumptions about interprocedural data flow
|
|
within an iteration that a compiler must make anyway.
|
|
|
|
Second, the specifiers allow arbitary variables to be localized,
|
|
not just scalars.
|
|
One may localize a million-element array of derived type
|
|
with allocatable components to be created in each iteration,
|
|
for example.
|
|
(It is not clear whether localized objects are finalized;
|
|
probably not.)
|
|
|
|
Third, as Fortran uses context to distinguish references to
|
|
pointers from (de)references to their targets, it's not clear
|
|
whether `LOCAL(PTR)` localizes a pointer, its target, or both.
|
|
|
|
Fourth, the specifiers can be applied only to variable _names_,
|
|
not to any designator with subscripts or component references.
|
|
One may have defined a derived type to hold a representation
|
|
of a sparse matrix, using `ALLOCATABLE` components to store its
|
|
packed data and indexing structures, but a program cannot localize
|
|
some parts of it and share the rest.
|
|
(Perhaps one may wrap `ASSOCIATE` constructs around the
|
|
`DO CONCURRENT` construct;
|
|
the interaction between locality specifiers and construct entities is
|
|
not clearly defined in the language.)
|
|
|
|
In the example above that defines `T(IX(J))` and reads from `T(IY(J))`,
|
|
the locality specifiers can't be used to share those elements of `T()`
|
|
that are modified at most once and localize the cases where
|
|
`IX(J)` is a duplicate and `IY(J)==IX(J)`.
|
|
|
|
Last, when a loop both defines and references many shared objects,
|
|
including potential references to globally accessible object
|
|
in called procedures, one may need to name all of them in a `SHARED`
|
|
specifier.
|
|
|
|
## What to do now
|
|
|
|
These problems have been presented to the J3 Fortran language
|
|
standard committee.
|
|
Their responses in
|
|
recent [e-mail discussions](https://mailman.j3-fortran.org/pipermail/j3/2020-July/thread.html)
|
|
did not include an intent to address them in future standards or corrigenda.
|
|
The most effective-looking response -- which was essentially "just use
|
|
`DEFAULT(SHARED)` to disable all automatic localization" -- is not an
|
|
viable option, since the language does not include such a specifier!
|
|
|
|
Programmers writing `DO CONCURRENT` loops that are safely parallelizable
|
|
need an effective means to convey to compilers that those compilers
|
|
do not have to assume only the weaker stipulations required by
|
|
today's `DO CONCURRENT` without having to write verbose and
|
|
error-prone locality specifiers (when those would suffice).
|
|
Specifically, an easy means is required that stipulates that localization
|
|
should apply at most only to the obvious cases of local non-pointer
|
|
non-allocatable scalars.
|
|
|
|
In the LLVM Fortran compiler project (a/k/a "flang", "f18") we considered
|
|
several solutions to this problem.
|
|
1. Add syntax (e.g., `DO PARALLEL` or `DO CONCURRENT() DEFAULT(PARALLEL)`)
|
|
by which one can inform the compiler that it should localize only
|
|
the obvious cases of simple local scalars.
|
|
Such syntax seems unlikely to ever be standardized, so its usage
|
|
would be nonportable.
|
|
1. Add a command-line option &/or a source directive to stipulate
|
|
the stronger guarantees. Obvious non-parallelizable usage in the construct
|
|
would elicit a stern warning. The `DO CONCURRENT` loops in the source
|
|
would continue to be portable to other compilers.
|
|
1. Assume that these stronger conditions hold by default, and add a command-line
|
|
option &/or a source directive to "opt out" back to the weaker
|
|
requirements of the standard language
|
|
in the event that the program contains one of those inherently
|
|
non-parallelizable `DO CONCURRENT` loops that perhaps should never have
|
|
been possible to write in a conforming program in the first place.
|
|
Actual parallel `DO CONCURRENT` constructs would produce parallel
|
|
code for users who would otherwise be surprised to learn about these
|
|
problems in the language.
|
|
But this option could lead to non-standard behavior for codes that depend,
|
|
accidentally or not, on non-parallelizable implicit localization.
|
|
1. Accept the standard as it exists, do the best job of automatic
|
|
parallelization that can be done, and refer dissatisfied users to J3.
|
|
This would be avoiding the problem.
|
|
|
|
None of these options is without a fairly obvious disadvantage.
|
|
The best option seems to be the one that assumes that users who write
|
|
`DO CONCURRENT` constructs are doing so with the intent to write parallel code.
|
|
|
|
## Other precedents
|
|
|
|
As of August 2020, we observe that the GNU Fortran compiler (10.1) does not
|
|
yet implement the Fortran 2018 locality clauses, but will parallelize some
|
|
`DO CONCURRENT` constructs without ambiguous data dependences when the automatic
|
|
parallelization option is enabled.
|
|
|
|
The Intel Fortran compiler supports the new locality clauses and will parallelize
|
|
some `DO CONCURRENT` constructs when automatic parallelization option is enabled.
|
|
When OpenMP is enabled, ifort reports that all `DO CONCURRENT` constructs are
|
|
parallelized, but they seem to execute in a serial fashion when data flow
|
|
hazards are present.
|