forked from OSchip/llvm-project
476 lines
19 KiB
ReStructuredText
476 lines
19 KiB
ReStructuredText
==================================================
|
|
How to manually use the Individual pieces of Polly
|
|
==================================================
|
|
|
|
Execute the individual Polly passes manually
|
|
============================================
|
|
|
|
.. sectionauthor:: Singapuram Sanjay Srivallabh
|
|
|
|
This example presents the individual passes that are involved when optimizing
|
|
code with Polly. We show how to execute them individually and explain for
|
|
each which analysis is performed or what transformation is applied. In this
|
|
example the polyhedral transformation is user-provided to show how much
|
|
performance improvement can be expected by an optimal automatic optimizer.
|
|
|
|
1. **Create LLVM-IR from the C code**
|
|
-------------------------------------
|
|
Polly works on LLVM-IR. Hence it is necessary to translate the source
|
|
files into LLVM-IR. If more than one file should be optimized the
|
|
files can be combined into a single file with llvm-link.
|
|
|
|
.. code-block:: console
|
|
|
|
clang -S -emit-llvm matmul.c -o matmul.s
|
|
|
|
|
|
2. **Prepare the LLVM-IR for Polly**
|
|
------------------------------------
|
|
|
|
Polly is only able to work with code that matches a canonical form.
|
|
To translate the LLVM-IR into this form we use a set of
|
|
canonicalication passes. They are scheduled by using
|
|
'-polly-canonicalize'.
|
|
|
|
.. code-block:: console
|
|
|
|
opt -S -polly-canonicalize matmul.s > matmul.preopt.ll
|
|
|
|
3. **Show the SCoPs detected by Polly (optional)**
|
|
--------------------------------------------------
|
|
|
|
To understand if Polly was able to detect SCoPs, we print the structure
|
|
of the detected SCoPs. In our example two SCoPs are detected. One in
|
|
'init_array' the other in 'main'.
|
|
|
|
.. code-block:: console
|
|
|
|
$ opt -polly-ast -analyze -q matmul.preopt.ll -polly-process-unprofitable
|
|
|
|
.. code-block:: guess
|
|
|
|
:: isl ast :: init_array :: %for.cond1.preheader---%for.end19
|
|
|
|
if (1)
|
|
|
|
for (int c0 = 0; c0 <= 1535; c0 += 1)
|
|
for (int c1 = 0; c1 <= 1535; c1 += 1)
|
|
Stmt_for_body3(c0, c1);
|
|
|
|
else
|
|
{ /* original code */ }
|
|
|
|
:: isl ast :: main :: %for.cond1.preheader---%for.end30
|
|
|
|
if (1)
|
|
|
|
for (int c0 = 0; c0 <= 1535; c0 += 1)
|
|
for (int c1 = 0; c1 <= 1535; c1 += 1) {
|
|
Stmt_for_body3(c0, c1);
|
|
for (int c2 = 0; c2 <= 1535; c2 += 1)
|
|
Stmt_for_body8(c0, c1, c2);
|
|
}
|
|
|
|
else
|
|
{ /* original code */ }
|
|
|
|
4. **Highlight the detected SCoPs in the CFGs of the program (requires graphviz/dotty)**
|
|
----------------------------------------------------------------------------------------
|
|
|
|
Polly can use graphviz to graphically show a CFG in which the detected
|
|
SCoPs are highlighted. It can also create '.dot' files that can be
|
|
translated by the 'dot' utility into various graphic formats.
|
|
|
|
|
|
.. code-block:: console
|
|
|
|
$ opt -view-scops -disable-output matmul.preopt.ll
|
|
$ opt -view-scops-only -disable-output matmul.preopt.ll
|
|
|
|
The output for the different functions:
|
|
|
|
- view-scops : main_, init_array_, print_array_
|
|
- view-scops-only : main-scopsonly_, init_array-scopsonly_, print_array-scopsonly_
|
|
|
|
.. _main: http://polly.llvm.org/experiments/matmul/scops.main.dot.png
|
|
.. _init_array: http://polly.llvm.org/experiments/matmul/scops.init_array.dot.png
|
|
.. _print_array: http://polly.llvm.org/experiments/matmul/scops.print_array.dot.png
|
|
.. _main-scopsonly: http://polly.llvm.org/experiments/matmul/scopsonly.main.dot.png
|
|
.. _init_array-scopsonly: http://polly.llvm.org/experiments/matmul/scopsonly.init_array.dot.png
|
|
.. _print_array-scopsonly: http://polly.llvm.org/experiments/matmul/scopsonly.print_array.dot.png
|
|
|
|
5. **View the polyhedral representation of the SCoPs**
|
|
------------------------------------------------------
|
|
|
|
.. code-block:: console
|
|
|
|
$ opt -polly-scops -analyze matmul.preopt.ll -polly-process-unprofitable
|
|
|
|
.. code-block:: guess
|
|
|
|
[...]Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond1.preheader => for.end19' in function 'init_array':
|
|
Function: init_array
|
|
Region: %for.cond1.preheader---%for.end19
|
|
Max Loop Depth: 2
|
|
Invariant Accesses: {
|
|
}
|
|
Context:
|
|
{ : }
|
|
Assumed Context:
|
|
{ : }
|
|
Invalid Context:
|
|
{ : 1 = 0 }
|
|
Arrays {
|
|
float MemRef_A[*][1536]; // Element size 4
|
|
float MemRef_B[*][1536]; // Element size 4
|
|
}
|
|
Arrays (Bounds as pw_affs) {
|
|
float MemRef_A[*][ { [] -> [(1536)] } ]; // Element size 4
|
|
float MemRef_B[*][ { [] -> [(1536)] } ]; // Element size 4
|
|
}
|
|
Alias Groups (0):
|
|
n/a
|
|
Statements {
|
|
Stmt_for_body3
|
|
Domain :=
|
|
{ Stmt_for_body3[i0, i1] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 };
|
|
Schedule :=
|
|
{ Stmt_for_body3[i0, i1] -> [i0, i1] };
|
|
MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
|
|
{ Stmt_for_body3[i0, i1] -> MemRef_A[i0, i1] };
|
|
MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
|
|
{ Stmt_for_body3[i0, i1] -> MemRef_B[i0, i1] };
|
|
}
|
|
[...]Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond1.preheader => for.end30' in function 'main':
|
|
Function: main
|
|
Region: %for.cond1.preheader---%for.end30
|
|
Max Loop Depth: 3
|
|
Invariant Accesses: {
|
|
}
|
|
Context:
|
|
{ : }
|
|
Assumed Context:
|
|
{ : }
|
|
Invalid Context:
|
|
{ : 1 = 0 }
|
|
Arrays {
|
|
float MemRef_C[*][1536]; // Element size 4
|
|
float MemRef_A[*][1536]; // Element size 4
|
|
float MemRef_B[*][1536]; // Element size 4
|
|
}
|
|
Arrays (Bounds as pw_affs) {
|
|
float MemRef_C[*][ { [] -> [(1536)] } ]; // Element size 4
|
|
float MemRef_A[*][ { [] -> [(1536)] } ]; // Element size 4
|
|
float MemRef_B[*][ { [] -> [(1536)] } ]; // Element size 4
|
|
}
|
|
Alias Groups (0):
|
|
n/a
|
|
Statements {
|
|
Stmt_for_body3
|
|
Domain :=
|
|
{ Stmt_for_body3[i0, i1] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 };
|
|
Schedule :=
|
|
{ Stmt_for_body3[i0, i1] -> [i0, i1, 0, 0] };
|
|
MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
|
|
{ Stmt_for_body3[i0, i1] -> MemRef_C[i0, i1] };
|
|
Stmt_for_body8
|
|
Domain :=
|
|
{ Stmt_for_body8[i0, i1, i2] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 and 0 <= i2 <= 1535 };
|
|
Schedule :=
|
|
{ Stmt_for_body8[i0, i1, i2] -> [i0, i1, 1, i2] };
|
|
ReadAccess := [Reduction Type: NONE] [Scalar: 0]
|
|
{ Stmt_for_body8[i0, i1, i2] -> MemRef_C[i0, i1] };
|
|
ReadAccess := [Reduction Type: NONE] [Scalar: 0]
|
|
{ Stmt_for_body8[i0, i1, i2] -> MemRef_A[i0, i2] };
|
|
ReadAccess := [Reduction Type: NONE] [Scalar: 0]
|
|
{ Stmt_for_body8[i0, i1, i2] -> MemRef_B[i2, i1] };
|
|
MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
|
|
{ Stmt_for_body8[i0, i1, i2] -> MemRef_C[i0, i1] };
|
|
}
|
|
|
|
|
|
6. **Show the dependences for the SCoPs**
|
|
-----------------------------------------
|
|
|
|
.. code-block:: console
|
|
|
|
$ opt -polly-dependences -analyze matmul.preopt.ll -polly-process-unprofitable
|
|
|
|
.. code-block:: guess
|
|
|
|
[...]Printing analysis 'Polly - Calculate dependences' for region: 'for.cond1.preheader => for.end19' in function 'init_array':
|
|
RAW dependences:
|
|
{ }
|
|
WAR dependences:
|
|
{ }
|
|
WAW dependences:
|
|
{ }
|
|
Reduction dependences:
|
|
n/a
|
|
Transitive closure of reduction dependences:
|
|
{ }
|
|
[...]Printing analysis 'Polly - Calculate dependences' for region: 'for.cond1.preheader => for.end30' in function 'main':
|
|
RAW dependences:
|
|
{ Stmt_for_body3[i0, i1] -> Stmt_for_body8[i0, i1, 0] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535; Stmt_for_body8[i0, i1, i2] -> Stmt_for_body8[i0, i1, 1 + i2] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 and 0 <= i2 <= 1534 }
|
|
WAR dependences:
|
|
{ }
|
|
WAW dependences:
|
|
{ Stmt_for_body3[i0, i1] -> Stmt_for_body8[i0, i1, 0] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535; Stmt_for_body8[i0, i1, i2] -> Stmt_for_body8[i0, i1, 1 + i2] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 and 0 <= i2 <= 1534 }
|
|
Reduction dependences:
|
|
n/a
|
|
Transitive closure of reduction dependences:
|
|
{ }
|
|
|
|
7. **Export jscop files**
|
|
-------------------------
|
|
|
|
.. code-block:: console
|
|
|
|
$ opt -polly-export-jscop matmul.preopt.ll -polly-process-unprofitable
|
|
|
|
.. code-block:: guess
|
|
|
|
[...]Writing JScop '%for.cond1.preheader---%for.end19' in function 'init_array' to './init_array___%for.cond1.preheader---%for.end19.jscop'.
|
|
|
|
Writing JScop '%for.cond1.preheader---%for.end30' in function 'main' to './main___%for.cond1.preheader---%for.end30.jscop'.
|
|
|
|
|
|
|
|
8. **Import the changed jscop files and print the updated SCoP structure (optional)**
|
|
-------------------------------------------------------------------------------------
|
|
|
|
Polly can reimport jscop files, in which the schedules of the statements
|
|
are changed. These changed schedules are used to descripe
|
|
transformations. It is possible to import different jscop files by
|
|
providing the postfix of the jscop file that is imported.
|
|
|
|
We apply three different transformations on the SCoP in the main
|
|
function. The jscop files describing these transformations are
|
|
hand written (and available in docs/experiments/matmul).
|
|
|
|
**No Polly**
|
|
|
|
As a baseline we do not call any Polly code generation, but only apply the normal -O3 optimizations.
|
|
|
|
.. code-block:: console
|
|
|
|
$ opt matmul.preopt.ll -polly-import-jscop -polly-ast -analyze -polly-process-unprofitable
|
|
|
|
.. code-block:: c
|
|
|
|
[...]
|
|
:: isl ast :: main :: %for.cond1.preheader---%for.end30
|
|
|
|
if (1)
|
|
|
|
for (int c0 = 0; c0 <= 1535; c0 += 1)
|
|
for (int c1 = 0; c1 <= 1535; c1 += 1) {
|
|
Stmt_for_body3(c0, c1);
|
|
for (int c3 = 0; c3 <= 1535; c3 += 1)
|
|
Stmt_for_body8(c0, c1, c3);
|
|
}
|
|
|
|
else
|
|
{ /* original code */ }
|
|
[...]
|
|
|
|
**Loop Interchange (and Fission to allow the interchange)**
|
|
|
|
We split the loops and can now apply an interchange of the loop dimensions that enumerate Stmt_for_body8.
|
|
|
|
.. Although I feel (and have created a .jscop) we can avoid splitting the loops.
|
|
|
|
.. code-block:: console
|
|
|
|
$ opt matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged -polly-ast -analyze -polly-process-unprofitable
|
|
|
|
.. code-block:: c
|
|
|
|
[...]
|
|
:: isl ast :: main :: %for.cond1.preheader---%for.end30
|
|
|
|
if (1)
|
|
|
|
{
|
|
for (int c1 = 0; c1 <= 1535; c1 += 1)
|
|
for (int c2 = 0; c2 <= 1535; c2 += 1)
|
|
Stmt_for_body3(c1, c2);
|
|
for (int c1 = 0; c1 <= 1535; c1 += 1)
|
|
for (int c2 = 0; c2 <= 1535; c2 += 1)
|
|
for (int c3 = 0; c3 <= 1535; c3 += 1)
|
|
Stmt_for_body8(c1, c3, c2);
|
|
}
|
|
|
|
else
|
|
{ /* original code */ }
|
|
[...]
|
|
|
|
**Interchange + Tiling**
|
|
|
|
In addition to the interchange we now tile the second loop nest.
|
|
|
|
.. code-block:: console
|
|
|
|
$ opt matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled -polly-ast -analyze -polly-process-unprofitable
|
|
|
|
.. code-block:: c
|
|
|
|
[...]
|
|
:: isl ast :: main :: %for.cond1.preheader---%for.end30
|
|
|
|
if (1)
|
|
|
|
{
|
|
for (int c1 = 0; c1 <= 1535; c1 += 1)
|
|
for (int c2 = 0; c2 <= 1535; c2 += 1)
|
|
Stmt_for_body3(c1, c2);
|
|
for (int c1 = 0; c1 <= 1535; c1 += 64)
|
|
for (int c2 = 0; c2 <= 1535; c2 += 64)
|
|
for (int c3 = 0; c3 <= 1535; c3 += 64)
|
|
for (int c4 = c1; c4 <= c1 + 63; c4 += 1)
|
|
for (int c5 = c3; c5 <= c3 + 63; c5 += 1)
|
|
for (int c6 = c2; c6 <= c2 + 63; c6 += 1)
|
|
Stmt_for_body8(c4, c6, c5);
|
|
}
|
|
|
|
else
|
|
{ /* original code */ }
|
|
[...]
|
|
|
|
|
|
**Interchange + Tiling + Strip-mining to prepare vectorization**
|
|
|
|
To later allow vectorization we create a so called trivially
|
|
parallelizable loop. It is innermost, parallel and has only four
|
|
iterations. It can be replaced by 4-element SIMD instructions.
|
|
|
|
.. code-block:: console
|
|
|
|
$ opt matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled -polly-ast -analyze -polly-process-unprofitable
|
|
|
|
.. code-block:: c
|
|
|
|
[...]
|
|
:: isl ast :: main :: %for.cond1.preheader---%for.end30
|
|
|
|
if (1)
|
|
|
|
{
|
|
for (int c1 = 0; c1 <= 1535; c1 += 1)
|
|
for (int c2 = 0; c2 <= 1535; c2 += 1)
|
|
Stmt_for_body3(c1, c2);
|
|
for (int c1 = 0; c1 <= 1535; c1 += 64)
|
|
for (int c2 = 0; c2 <= 1535; c2 += 64)
|
|
for (int c3 = 0; c3 <= 1535; c3 += 64)
|
|
for (int c4 = c1; c4 <= c1 + 63; c4 += 1)
|
|
for (int c5 = c3; c5 <= c3 + 63; c5 += 1)
|
|
for (int c6 = c2; c6 <= c2 + 63; c6 += 4)
|
|
for (int c7 = c6; c7 <= c6 + 3; c7 += 1)
|
|
Stmt_for_body8(c4, c7, c5);
|
|
}
|
|
|
|
else
|
|
{ /* original code */ }
|
|
[...]
|
|
|
|
9. **Codegenerate the SCoPs**
|
|
-----------------------------
|
|
|
|
This generates new code for the SCoPs detected by polly. If
|
|
-polly-import-jscop is present, transformations specified in the
|
|
imported jscop files will be applied.
|
|
|
|
|
|
.. code-block:: console
|
|
|
|
$ opt matmul.preopt.ll | opt -O3 > matmul.normalopt.ll
|
|
|
|
.. code-block:: console
|
|
|
|
$ opt matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged -polly-codegen -polly-process-unprofitable | opt -O3 > matmul.polly.interchanged.ll
|
|
|
|
.. code-block:: guess
|
|
|
|
Reading JScop '%for.cond1.preheader---%for.end19' in function 'init_array' from './init_array___%for.cond1.preheader---%for.end19.jscop.interchanged'.
|
|
File could not be read: No such file or directory
|
|
Reading JScop '%for.cond1.preheader---%for.end30' in function 'main' from './main___%for.cond1.preheader---%for.end30.jscop.interchanged'.
|
|
|
|
.. code-block:: console
|
|
|
|
$ opt matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled -polly-codegen -polly-process-unprofitable | opt -O3 > matmul.polly.interchanged+tiled.ll
|
|
|
|
.. code-block:: guess
|
|
|
|
Reading JScop '%for.cond1.preheader---%for.end19' in function 'init_array' from './init_array___%for.cond1.preheader---%for.end19.jscop.interchanged+tiled'.
|
|
File could not be read: No such file or directory
|
|
Reading JScop '%for.cond1.preheader---%for.end30' in function 'main' from './main___%for.cond1.preheader---%for.end30.jscop.interchanged+tiled'.
|
|
|
|
.. code-block:: console
|
|
|
|
$ opt matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector -polly-codegen -polly-vectorizer=polly -polly-process-unprofitable | opt -O3 > matmul.polly.interchanged+tiled+vector.ll
|
|
|
|
.. code-block:: guess
|
|
|
|
Reading JScop '%for.cond1.preheader---%for.end19' in function 'init_array' from './init_array___%for.cond1.preheader---%for.end19.jscop.interchanged+tiled+vector'.
|
|
File could not be read: No such file or directory
|
|
Reading JScop '%for.cond1.preheader---%for.end30' in function 'main' from './main___%for.cond1.preheader---%for.end30.jscop.interchanged+tiled+vector'.
|
|
|
|
.. code-block:: console
|
|
|
|
$ opt matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector -polly-codegen -polly-vectorizer=polly -polly-parallel -polly-process-unprofitable | opt -O3 > matmul.polly.interchanged+tiled+openmp.ll
|
|
|
|
.. code-block:: guess
|
|
|
|
Reading JScop '%for.cond1.preheader---%for.end19' in function 'init_array' from './init_array___%for.cond1.preheader---%for.end19.jscop.interchanged+tiled+vector'.
|
|
File could not be read: No such file or directory
|
|
Reading JScop '%for.cond1.preheader---%for.end30' in function 'main' from './main___%for.cond1.preheader---%for.end30.jscop.interchanged+tiled+vector'.
|
|
|
|
|
|
10. **Create the executables**
|
|
------------------------------
|
|
|
|
.. code-block:: console
|
|
|
|
$ llc matmul.normalopt.ll -o matmul.normalopt.s && gcc matmul.normalopt.s -o matmul.normalopt.exe
|
|
$ llc matmul.polly.interchanged.ll -o matmul.polly.interchanged.s && gcc matmul.polly.interchanged.s -o matmul.polly.interchanged.exe
|
|
$ llc matmul.polly.interchanged+tiled.ll -o matmul.polly.interchanged+tiled.s && gcc matmul.polly.interchanged+tiled.s -o matmul.polly.interchanged+tiled.exe
|
|
$ llc matmul.polly.interchanged+tiled+vector.ll -o matmul.polly.interchanged+tiled+vector.s && gcc matmul.polly.interchanged+tiled+vector.s -o matmul.polly.interchanged+tiled+vector.exe
|
|
$ llc matmul.polly.interchanged+tiled+vector+openmp.ll -o matmul.polly.interchanged+tiled+vector+openmp.s && gcc -fopenmp matmul.polly.interchanged+tiled+vector+openmp.s -o matmul.polly.interchanged+tiled+vector+openmp.exe
|
|
|
|
11. **Compare the runtime of the executables**
|
|
----------------------------------------------
|
|
|
|
By comparing the runtimes of the different code snippets we see that a
|
|
simple loop interchange gives here the largest performance boost.
|
|
However in this case, adding vectorization and using OpenMP degrades
|
|
the performance.
|
|
|
|
.. code-block:: console
|
|
|
|
$ time ./matmul.normalopt.exe
|
|
|
|
real 0m11.295s
|
|
user 0m11.288s
|
|
sys 0m0.004s
|
|
$ time ./matmul.polly.interchanged.exe
|
|
|
|
real 0m0.988s
|
|
user 0m0.980s
|
|
sys 0m0.008s
|
|
$ time ./matmul.polly.interchanged+tiled.exe
|
|
|
|
real 0m0.830s
|
|
user 0m0.816s
|
|
sys 0m0.012s
|
|
$ time ./matmul.polly.interchanged+tiled+vector.exe
|
|
|
|
real 0m5.430s
|
|
user 0m5.424s
|
|
sys 0m0.004s
|
|
$ time ./matmul.polly.interchanged+tiled+vector+openmp.exe
|
|
|
|
real 0m3.184s
|
|
user 0m11.972s
|
|
sys 0m0.036s
|
|
|