================
The Architecture
================

Polly is a loop optimizer for LLVM. Starting from LLVM-IR, it detects and
extracts interesting loop kernels. For each kernel, a mathematical model is
derived which precisely describes the individual computations and memory
accesses in the kernel. Within Polly, a variety of analyses and code
transformations are performed on this mathematical model. After all
optimizations have been derived and applied, optimized LLVM-IR is regenerated
and inserted into the LLVM-IR module.

.. image:: images/architecture.png
    :align: center

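As a rough illustration of what a loop kernel looks like at the source level,
the following C sketch shows a small matrix multiplication. The function name
``matmul`` and the size ``N`` are chosen for this example only. Whether a given
Polly version actually detects and models this exact code depends on how it is
compiled, so treat it as a sketch rather than a guaranteed test case.

.. code-block:: c

   #include <stddef.h>

   #define N 4 /* illustrative problem size, chosen for this example */

   /* Computes C = C + A * B for N x N matrices.
    *
    * Every loop bound and every array subscript is an affine expression of
    * the loop counters i, j and k, so each execution of the innermost
    * statement can be identified by the triple (i, j, k), and the memory
    * cells it reads and writes follow directly from that triple. */
   void matmul(double C[N][N], double A[N][N], double B[N][N])
   {
       for (size_t i = 0; i < N; i++)
           for (size_t j = 0; j < N; j++)
               for (size_t k = 0; k < N; k++)
                   C[i][j] += A[i][k] * B[k][j];
   }

Calling ``matmul(C, A, B)`` with ``C`` zero-initialized and ``B`` set to the
identity matrix leaves a copy of ``A`` in ``C``, which is a quick way to check
the kernel by hand.
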

Polly in the LLVM pass pipeline
-------------------------------

The standard LLVM pass pipeline, as used in the -O1/-O2/-O3 modes of clang/opt,
consists of a sequence of passes that can be grouped into different conceptual
phases. The first phase, which we call **Canonicalization** here, is a scalar
canonicalization phase that contains passes like -mem2reg, -instcombine,
-simplifycfg, or early loop unrolling. Its goal is to reduce and simplify the
given IR as much as possible, focusing mostly on scalar optimizations.

The second phase consists of three conceptual groups that are executed in the
so-called **Inliner cycle**: again a set of **Scalar Simplification** passes, a
set of **Simple Loop Optimizations**, and the **Inliner** itself. Even though
these passes make up the majority of the LLVM pass pipeline, their primary goal
is still canonicalization without losing semantic information, as such a loss
would complicate later analysis. As part of the inliner cycle, the LLVM inliner
step-by-step tries to inline functions, runs canonicalization passes to exploit
newly exposed simplification opportunities, and then tries to inline the
further simplified functions. Some simple loop optimizations are executed as
part of the inliner cycle. Even though they perform some optimizations, their
primary goal is still the simplification of the program code. Loop invariant
code motion is one such optimization: besides being beneficial for program
performance, it also allows us to move computation out of loops and, in the
best case, enables us to eliminate certain loops completely.

Only after the inliner cycle has finished is a last **Target Specialization**
phase run, in which IR complexity is deliberately increased to take advantage
of target-specific features that maximize execution performance on the target
device. One of the principal optimizations in this phase is vectorization, but
it also includes target-specific loop unrolling and some loop transformations
(e.g., distribution) that expose more vectorization opportunities.

.. image:: images/LLVM-Passes-only.png
    :align: center

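The effect of loop invariant code motion described above can be sketched by
hand in C. The function names below are hypothetical and the "after" versions
are written manually; they only mimic what the optimization does on the IR and
are not output produced by LLVM.

.. code-block:: c

   #include <stddef.h>

   /* Before: scale * offset is recomputed in every iteration although
    * neither operand changes inside the loop. */
   void add_bias_naive(long *out, const long *in, size_t n,
                       long scale, long offset)
   {
       for (size_t i = 0; i < n; i++)
           out[i] = in[i] + scale * offset;
   }

   /* After: the invariant product is computed once, outside the loop. */
   void add_bias_hoisted(long *out, const long *in, size_t n,
                         long scale, long offset)
   {
       const long bias = scale * offset;
       for (size_t i = 0; i < n; i++)
           out[i] = in[i] + bias;
   }

   /* Before: the whole loop body is invariant. */
   long last_product_naive(size_t n, long a, long b)
   {
       long x = 0;
       for (size_t i = 0; i < n; i++)
           x = a * b;
       return x;
   }

   /* After: once the body has been moved out, the loop is empty and can be
    * removed; only the check that it ran at least once remains. */
   long last_product_simplified(size_t n, long a, long b)
   {
       return n > 0 ? a * b : 0;
   }

Each pair returns the same results for the same inputs. The second pair shows
the best case mentioned above, in which the loop disappears entirely.
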
Polly can conceptually be run at three different positions in the pass
pipeline: as an early optimizer before the standard LLVM pass pipeline, as a
later optimizer as part of the target specialization sequence, and
theoretically also with the loop optimizations in the inliner cycle. We only
discuss the first two options, as running Polly in the inliner cycle is likely
to disturb the inliner and is consequently not a good idea.

.. image:: images/LLVM-Passes-all.png
    :align: center

Running Polly early, before the standard pass pipeline, has the benefit that
the LLVM-IR processed by Polly is still very close to the original input code.
Hence, it is less likely that transformations applied by LLVM have changed the
IR in ways not easily understandable for the programmer. As a result, user
feedback is likely better, and it is less likely that kernels that seem a
perfect fit for Polly in C have been transformed such that Polly cannot handle
them any more. On the other hand, code that requires inlining to be optimized
won't benefit if Polly is scheduled at this position. The additional set of
canonicalization passes required will result in a small but general
compile-time increase and some unpredictable run-time performance changes due
to slightly different IR being passed through the optimizers. To force Polly to
run early in the pass pipeline, use the option *-polly-position=early* (default
today).

.. image:: images/LLVM-Passes-early.png
    :align: center

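To make the inlining limitation concrete, the following C sketch shows a loop
nest whose memory accesses are hidden behind a small helper function. The
helper ``element`` and both ``fill_*`` functions are hypothetical names
introduced for this example. The second function is the first one inlined by
hand and only approximates what the inliner would produce.

.. code-block:: c

   #include <stddef.h>

   #define ROWS 3 /* illustrative sizes, chosen for this example */
   #define COLS 4

   /* Hypothetical accessor: the address computation happens inside a call. */
   static double *element(double *data, size_t row, size_t col)
   {
       return &data[row * COLS + col];
   }

   /* Before inlining: the loop body consists of a call, so the accessed
    * memory location is not visible in the loop nest itself. */
   void fill_via_accessor(double *data)
   {
       for (size_t row = 0; row < ROWS; row++)
           for (size_t col = 0; col < COLS; col++)
               *element(data, row, col) = (double)(row + col);
   }

   /* After inlining by hand: the subscript row * COLS + col appears
    * directly in the loop nest. */
   void fill_inlined(double *data)
   {
       for (size_t row = 0; row < ROWS; row++)
           for (size_t col = 0; col < COLS; col++)
               data[row * COLS + col] = (double)(row + col);
   }

Both functions fill a buffer of ``ROWS * COLS`` doubles with identical values.
A loop optimizer that runs before the inliner sees the first form, while one
that runs after it is more likely to see something close to the second.
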
Running Polly right before the vectorizer has the benefit that the full inliner
cycle has been run and, as a result, even heavily templated C++ code could
theoretically benefit from Polly (more work is necessary to make Polly really
effective here). As the IR that is passed to Polly has already been
canonicalized, there is also no need to run additional canonicalization passes.
General compile time is almost unaffected by Polly, as detection of loop
kernels is generally very fast and the actual optimization and cleanup passes
are only run on functions which contain loop kernels that are worth optimizing.
However, due to the many optimizations that LLVM runs before Polly, the IR that
reaches Polly often has additional scalar dependences that make Polly a lot
less effective. To force Polly to run before the vectorizer in the pass
pipeline, use the option *-polly-position=before-vectorizer*. This position is
not yet the default for Polly, but work is under way to make Polly effective
even in the presence of scalar dependences. After this work has been completed,
Polly will likely use this position by default.

.. image:: images/LLVM-Passes-late.png
    :align: center

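The scalar dependences mentioned above can be illustrated with a hand-written
C sketch of a matrix-vector product. The function names and the size ``N`` are
hypothetical. The second version is written manually to resemble what earlier
scalar optimizations may produce and is not actual LLVM output.

.. code-block:: c

   #include <stddef.h>

   #define N 4 /* illustrative problem size, chosen for this example */

   /* Array form: the running sum lives in y[i], so all data flow between
    * iterations of the j loop goes through memory. */
   void matvec_array(double y[N], double A[N][N], double x[N])
   {
       for (size_t i = 0; i < N; i++)
           for (size_t j = 0; j < N; j++)
               y[i] += A[i][j] * x[j];
   }

   /* Scalar form: the running sum is kept in the scalar sum. Every
    * iteration of the j loop now depends on the previous one through that
    * scalar, and the load and store of y[i] sit outside the inner loop. */
   void matvec_scalar(double y[N], double A[N][N], double x[N])
   {
       for (size_t i = 0; i < N; i++) {
           double sum = y[i];
           for (size_t j = 0; j < N; j++)
               sum += A[i][j] * x[j];
           y[i] = sum;
       }
   }

Both versions compute the same result. The difference lies only in where the
intermediate value is kept, which is the kind of change that the IR reaching
Polly at this position may already contain.
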