forked from OSchip/llvm-project
455 lines
17 KiB
HTML
455 lines
17 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
|
|
"http://www.w3.org/TR/html4/strict.dtd">
|
|
<!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ -->
|
|
<html>
|
|
<head>
|
|
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
|
|
<title>Polly - Examples</title>
|
|
<link type="text/css" rel="stylesheet" href="menu.css">
|
|
<link type="text/css" rel="stylesheet" href="content.css">
|
|
</head>
|
|
<body>
|
|
<!--#include virtual="menu.html.incl"-->
|
|
<div id="content">
|
|
<!--=====================================================================-->
|
|
<h1>Polly: Execute the individual Polly passes manually</h1>
|
|
<!--=====================================================================-->
|
|
|
|
<p>
|
|
This example presents the individual passes that are involved when optimizing
|
|
code with Polly. We show how to execute them individually and explain for each
|
|
which analysis is performed or what transformation is applied. In this example
|
|
the polyhedral transformation is user-provided to show how much performance
|
|
improvement can be expected by an optimal automatic optimizer.</p>
|
|
|
|
The files used and created in this example are available in the Polly checkout
|
|
in the folder <em>www/experiments/matmul</em>. They can be created automatically
|
|
by running the <em>www/experiments/matmul/runall.sh</em> script.
|
|
|
|
<ol>
|
|
<li><h4>Create LLVM-IR from the C code</h4>
|
|
|
|
Polly works on LLVM-IR. Hence it is necessary to translate the source files into
|
|
LLVM-IR. If more than on file should be optimized the files can be combined into
|
|
a single file with llvm-link.
|
|
|
|
<pre class="code">clang -S -emit-llvm matmul.c -o matmul.s</pre>
|
|
</li>
|
|
|
|
|
|
<li><h4>Load Polly automatically when calling the 'opt' tool</h4>
|
|
|
|
Polly is not built into opt or bugpoint, but it is a shared library that needs
|
|
to be loaded into these tools explicitally. The Polly library is called
|
|
LVMPolly.so. For a cmake build it is available in the build/lib/ directory,
|
|
autoconf creates the same file in
|
|
build/tools/polly/{Release+Asserts|Asserts|Debug}/lib. For convenience we create
|
|
an alias that automatically loads Polly if 'opt' is called.
|
|
<pre class="code">
|
|
export PATH_TO_POLLY_LIB="~/polly/build/lib/"
|
|
alias opt="opt -load ${PATH_TO_POLLY_LIB}/LLVMPolly.so"</pre>
|
|
</li>
|
|
|
|
<li><h4>Prepare the LLVM-IR for Polly</h4>
|
|
|
|
Polly is only able to work with code that matches a canonical form. To translate
|
|
the LLVM-IR into this form we use a set of canonicalication passes. For this
|
|
example only three passes are necessary. To get good coverage on more
|
|
complicated input files often more canonicalization passes are needed. pollycc
|
|
contains a list of passes that have shown to be beneficial.
|
|
<pre class="code">opt -S -mem2reg -loop-simplify -polly-indvars matmul.s > matmul.preopt.ll</pre></li>
|
|
|
|
<li><h4>Show the SCoPs detected by Polly (optional)</h4>
|
|
|
|
To understand if Polly was able to detect SCoPs, we print the
|
|
structure of the detected SCoPs. In our example two SCoPs were detected. One in
|
|
'init_array' the other in 'main'.
|
|
|
|
<pre class="code">opt -basicaa -polly-cloog -analyze -q matmul.preopt.ll</pre>
|
|
|
|
<pre>
|
|
init_array():
|
|
for (c2=0;c2<=1023;c2++) {
|
|
for (c4=0;c4<=1023;c4++) {
|
|
Stmt_5(c2,c4);
|
|
}
|
|
}
|
|
|
|
main():
|
|
for (c2=0;c2<=1023;c2++) {
|
|
for (c4=0;c4<=1023;c4++) {
|
|
Stmt_4(c2,c4);
|
|
for (c6=0;c6<=1023;c6++) {
|
|
Stmt_6(c2,c4,c6);
|
|
}
|
|
}
|
|
}
|
|
</pre>
|
|
</li>
|
|
<li><h4>Highlight the detected SCoPs in the CFGs of the program (requires graphviz/dotty)</h4>
|
|
|
|
Polly can use graphviz to graphically show a CFG in which the detected SCoPs are
|
|
highlighted. It can also create '.dot' files that can be translated by
|
|
the 'dot' utility into various graphic formats.
|
|
|
|
<pre class="code">opt -basicaa -view-scops -disable-output matmul.preopt.ll
|
|
opt -basicaa -view-scops-only -disable-output matmul.preopt.ll</pre>
|
|
The output for the different functions<br />
|
|
view-scops:
|
|
<a href="experiments/matmul/scops.main.dot.png">main</a>,
|
|
<a href="experiments/matmul/scops.init_array.dot.png">init_array</a>,
|
|
<a href="experiments/matmul/scops.print_array.dot.png">print_array</a><br />
|
|
view-scops-only:
|
|
<a href="experiments/matmul/scopsonly.main.dot.png">main</a>,
|
|
<a href="experiments/matmul/scopsonly.init_array.dot.png">init_array</a>,
|
|
<a href="experiments/matmul/scopsonly.print_array.dot.png">print_array</a>
|
|
</li>
|
|
|
|
<li><h4>View the polyhedral representation of the SCoPs</h4>
|
|
<pre class="code">opt -basicaa -polly-scops -analyze matmul.preopt.ll</pre>
|
|
<pre>
|
|
[...]
|
|
Printing analysis 'Polly - Create polyhedral description of Scops' for region:
|
|
'%1 => %17' in function 'init_array':
|
|
Context:
|
|
{ [] }
|
|
Statements {
|
|
Stmt_5
|
|
Domain :=
|
|
{ Stmt_5[i0, i1] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 };
|
|
Scattering :=
|
|
{ Stmt_5[i0, i1] -> scattering[0, i0, 0, i1, 0] };
|
|
WriteAccess :=
|
|
{ Stmt_5[i0, i1] -> MemRef_A[1037i0 + i1] };
|
|
WriteAccess :=
|
|
{ Stmt_5[i0, i1] -> MemRef_B[1047i0 + i1] };
|
|
FinalRead
|
|
Domain :=
|
|
{ FinalRead[0] };
|
|
Scattering :=
|
|
{ FinalRead[i0] -> scattering[200000000, o1, o2, o3, o4] };
|
|
ReadAccess :=
|
|
{ FinalRead[i0] -> MemRef_A[o0] };
|
|
ReadAccess :=
|
|
{ FinalRead[i0] -> MemRef_B[o0] };
|
|
}
|
|
[...]
|
|
Printing analysis 'Polly - Create polyhedral description of Scops' for region:
|
|
'%1 => %17' in function 'main':
|
|
Context:
|
|
{ [] }
|
|
Statements {
|
|
Stmt_4
|
|
Domain :=
|
|
{ Stmt_4[i0, i1] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 };
|
|
Scattering :=
|
|
{ Stmt_4[i0, i1] -> scattering[0, i0, 0, i1, 0, 0, 0] };
|
|
WriteAccess :=
|
|
{ Stmt_4[i0, i1] -> MemRef_C[1067i0 + i1] };
|
|
Stmt_6
|
|
Domain :=
|
|
{ Stmt_6[i0, i1, i2] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1023 };
|
|
Scattering :=
|
|
{ Stmt_6[i0, i1, i2] -> scattering[0, i0, 0, i1, 1, i2, 0] };
|
|
ReadAccess :=
|
|
{ Stmt_6[i0, i1, i2] -> MemRef_C[1067i0 + i1] };
|
|
ReadAccess :=
|
|
{ Stmt_6[i0, i1, i2] -> MemRef_A[1037i0 + i2] };
|
|
ReadAccess :=
|
|
{ Stmt_6[i0, i1, i2] -> MemRef_B[i1 + 1047i2] };
|
|
WriteAccess :=
|
|
{ Stmt_6[i0, i1, i2] -> MemRef_C[1067i0 + i1] };
|
|
FinalRead
|
|
Domain :=
|
|
{ FinalRead[0] };
|
|
Scattering :=
|
|
{ FinalRead[i0] -> scattering[200000000, o1, o2, o3, o4, o5, o6] };
|
|
ReadAccess :=
|
|
{ FinalRead[i0] -> MemRef_C[o0] };
|
|
ReadAccess :=
|
|
{ FinalRead[i0] -> MemRef_A[o0] };
|
|
ReadAccess :=
|
|
{ FinalRead[i0] -> MemRef_B[o0] };
|
|
}
|
|
[...]
|
|
</pre>
|
|
</li>
|
|
|
|
<li><h4>Show the dependences for the SCoPs</h4>
|
|
<pre class="code">opt -basicaa -polly-dependences -analyze matmul.preopt.ll</pre>
|
|
<pre>Printing analysis 'Polly - Calculate dependences for SCoP' for region:
|
|
'for.cond => for.end28' in function 'init_array':
|
|
Must dependences:
|
|
{ }
|
|
May dependences:
|
|
{ }
|
|
Must no source:
|
|
{ }
|
|
May no source:
|
|
{ }
|
|
Printing analysis 'Polly - Calculate dependences for SCoP' for region:
|
|
'for.cond => for.end48' in function 'main':
|
|
Must dependences:
|
|
{ Stmt_4[i0, i1] -> Stmt_6[i0, i1, 0] :
|
|
i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023;
|
|
Stmt_6[i0, i1, i2] -> Stmt_6[i0, i1, 1 + i2] :
|
|
i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1022;
|
|
Stmt_6[i0, i1, 1023] -> FinalRead[0] :
|
|
i1 <= 1091540 - 1067i0 and i1 >= -1067i0 and i1 >= 0 and i1 <= 1023;
|
|
Stmt_6[1023, i1, 1023] -> FinalRead[0] :
|
|
i1 >= 0 and i1 <= 1023
|
|
}
|
|
May dependences:
|
|
{ }
|
|
Must no source:
|
|
{ Stmt_6[i0, i1, i2] -> MemRef_A[1037i0 + i2] :
|
|
i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1023;
|
|
Stmt_6[i0, i1, i2] -> MemRef_B[i1 + 1047i2] :
|
|
i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1023;
|
|
FinalRead[0] -> MemRef_A[o0];
|
|
FinalRead[0] -> MemRef_B[o0]
|
|
FinalRead[0] -> MemRef_C[o0] :
|
|
o0 >= 1092565 or (exists (e0 = [(o0)/1067]: o0 <= 1091540 and o0 >= 0
|
|
and 1067e0 <= -1024 + o0 and 1067e0 >= -1066 + o0)) or o0 <= -1;
|
|
}
|
|
May no source:
|
|
{ }
|
|
</pre></li>
|
|
|
|
<li><h4>Export jscop files</h4>
|
|
|
|
Polly can export the polyhedral representation in so called jscop files. Jscop
|
|
files contain the polyhedral representation stored in a JSON file.
|
|
<pre class="code">opt -basicaa -polly-export-jscop matmul.preopt.ll</pre>
|
|
<pre>Writing SCoP 'for.cond => for.end28' in function 'init_array' to './init_array___%for.cond---%for.end28.jscop'.
|
|
Writing SCoP 'for.cond => for.end48' in function 'main' to './main___%for.cond---%for.end48.jscop'.
|
|
</pre></li>
|
|
|
|
<li><h4>Import the changed jscop files and print the updated SCoP structure
|
|
(optional)</h4>
|
|
<p>Polly can reimport jscop files, in which the schedules of the statements are
|
|
changed. These changed schedules are used to descripe transformations.
|
|
It is possible to import different jscop files by providing the postfix
|
|
of the jscop file that is imported.</p>
|
|
<p> We apply three different transformations on the SCoP in the main function.
|
|
The jscop files describing these transformations are hand written (and available
|
|
in <em>www/experiments/matmul</em>).
|
|
|
|
<h5>No Polly</h5>
|
|
|
|
<p>As a baseline we do not call any Polly code generation, but only apply the
|
|
normal -O3 optimizations.</p>
|
|
|
|
<pre class="code">
|
|
opt matmul.preopt.ll -basicaa \
|
|
-polly-import-jscop \
|
|
-polly-cloog -analyze
|
|
</pre>
|
|
<pre>
|
|
[...]
|
|
main():
|
|
for (c2=0;c2<g;=1535;c2++) {
|
|
for (c4=0;c4<g;=1535;c4++) {
|
|
Stmt_4(c2,c4);
|
|
for (c6=0;c6<g;=1535;c6++) {
|
|
Stmt_6(c2,c4,c6);
|
|
}
|
|
}
|
|
}
|
|
[...]
|
|
</pre>
|
|
<h5>Interchange (and Fission to allow the interchange)</h5>
|
|
<p>We split the loops and can now apply an interchange of the loop dimensions that
|
|
enumerate Stmt_6.</p>
|
|
<pre class="code">
|
|
opt matmul.preopt.ll -basicaa \
|
|
-polly-import-jscop -polly-import-jscop-postfix=interchanged \
|
|
-polly-cloog -analyze
|
|
</pre>
|
|
<pre>
|
|
[...]
|
|
Reading JScop '%1 => %17' in function 'main' from './main___%1---%17.jscop.interchanged'.
|
|
[...]
|
|
main():
|
|
for (c2=0;c2<=1535;c2++) {
|
|
for (c4=0;c4<=1535;c4++) {
|
|
Stmt_4(c2,c4);
|
|
}
|
|
}
|
|
for (c2=0;c2<=1535;c2++) {
|
|
for (c4=0;c4<=1535;c4++) {
|
|
for (c6=0;c6<=1535;c6++) {
|
|
Stmt_6(c2,c6,c4);
|
|
}
|
|
}
|
|
}
|
|
[...]
|
|
</pre>
|
|
<h5>Interchange + Tiling</h5>
|
|
<p>In addition to the interchange we tile now the second loop nest.</p>
|
|
|
|
<pre class="code">
|
|
opt matmul.preopt.ll -basicaa \
|
|
-polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled \
|
|
-polly-cloog -analyze
|
|
</pre>
|
|
<pre>
|
|
[...]
|
|
Reading JScop '%1 => %17' in function 'main' from './main___%1---%17.jscop.interchanged+tiled'.
|
|
[...]
|
|
main():
|
|
for (c2=0;c2<=1535;c2++) {
|
|
for (c4=0;c4<=1535;c4++) {
|
|
Stmt_4(c2,c4);
|
|
}
|
|
}
|
|
for (c2=0;c2<=1535;c2+=64) {
|
|
for (c3=0;c3<=1535;c3+=64) {
|
|
for (c4=0;c4<=1535;c4+=64) {
|
|
for (c5=c2;c5<=c2+63;c5++) {
|
|
for (c6=c4;c6<=c4+63;c6++) {
|
|
for (c7=c3;c7<=c3+63;c7++) {
|
|
Stmt_6(c5,c7,c6);
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
[...]
|
|
</pre>
|
|
<h5>Interchange + Tiling + Strip-mining to prepare vectorization</h5>
|
|
To later allow vectorization we create a so called trivially parallelizable
|
|
loop. It is innermost, parallel and has only four iterations. It can be
|
|
replaced by 4-element SIMD instructions.
|
|
<pre class="code">
|
|
opt matmul.preopt.ll -basicaa \
|
|
-polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \
|
|
-polly-cloog -analyze </pre>
|
|
|
|
<pre>
|
|
[...]
|
|
Reading JScop '%1 => %17' in function 'main' from './main___%1---%17.jscop.interchanged+tiled+vector'.
|
|
[...]
|
|
main():
|
|
for (c2=0;c2<=1535;c2++) {
|
|
for (c4=0;c4<=1535;c4++) {
|
|
Stmt_4(c2,c4);
|
|
}
|
|
}
|
|
for (c2=0;c2<=1535;c2+=64) {
|
|
for (c3=0;c3<=1535;c3+=64) {
|
|
for (c4=0;c4<=1535;c4+=64) {
|
|
for (c5=c2;c5<=c2+63;c5++) {
|
|
for (c6=c4;c6<=c4+63;c6++) {
|
|
for (c7=c3;c7<=c3+63;c7+=4) {
|
|
for (c8=c7;c8<=c7+3;c8++) {
|
|
Stmt_6(c5,c8,c6);
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
[...]
|
|
</pre>
|
|
|
|
</li>
|
|
|
|
<li><h4>Codegenerate the SCoPs</h4>
|
|
<p>
|
|
This generates new code for the SCoPs detected by polly.
|
|
If -polly-import is present, transformations specified in the imported openscop
|
|
files will be applied.</p>
|
|
<pre class="code">opt matmul.preopt.ll | opt -O3 > matmul.normalopt.ll</pre>
|
|
<pre class="code">
|
|
opt -basicaa \
|
|
-polly-import-jscop -polly-import-jscop-postfix=interchanged \
|
|
-polly-codegen matmul.preopt.ll \
|
|
| opt -O3 > matmul.polly.interchanged.ll</pre>
|
|
<pre>
|
|
Reading JScop '%1 => %19' in function 'init_array' from
|
|
'./init_array___%1---%19.jscop.interchanged'.
|
|
File could not be read: No such file or directory
|
|
Reading JScop '%1 => %17' in function 'main' from
|
|
'./main___%1---%17.jscop.interchanged'.
|
|
</pre>
|
|
<pre class="code">
|
|
opt -basicaa \
|
|
-polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled \
|
|
-polly-codegen matmul.preopt.ll \
|
|
| opt -O3 > matmul.polly.interchanged+tiled.ll</pre>
|
|
<pre>
|
|
Reading JScop '%1 => %19' in function 'init_array' from
|
|
'./init_array___%1---%19.jscop.interchanged+tiled'.
|
|
File could not be read: No such file or directory
|
|
Reading JScop '%1 => %17' in function 'main' from
|
|
'./main___%1---%17.jscop.interchanged+tiled'.
|
|
</pre>
|
|
<pre class="code">
|
|
opt -basicaa \
|
|
-polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \
|
|
-polly-codegen -polly-vectorizer=polly matmul.preopt.ll \
|
|
| opt -O3 > matmul.polly.interchanged+tiled+vector.ll</pre>
|
|
<pre>
|
|
Reading JScop '%1 => %19' in function 'init_array' from
|
|
'./init_array___%1---%19.jscop.interchanged+tiled+vector'.
|
|
File could not be read: No such file or directory
|
|
Reading JScop '%1 => %17' in function 'main' from
|
|
'./main___%1---%17.jscop.interchanged+tiled+vector'.
|
|
</pre>
|
|
<pre class="code">
|
|
opt -basicaa \
|
|
-polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \
|
|
-polly-codegen -polly-vectorizer=polly -enable-polly-openmp matmul.preopt.ll \
|
|
| opt -O3 > matmul.polly.interchanged+tiled+openmp.ll</pre>
|
|
<pre>
|
|
Reading JScop '%1 => %19' in function 'init_array' from
|
|
'./init_array___%1---%19.jscop.interchanged+tiled+vector'.
|
|
File could not be read: No such file or directory
|
|
Reading JScop '%1 => %17' in function 'main' from
|
|
'./main___%1---%17.jscop.interchanged+tiled+vector'.
|
|
</pre>
|
|
|
|
<li><h4>Create the executables</h4>
|
|
|
|
Create one executable optimized with plain -O3 as well as a set of executables
|
|
optimized in different ways with Polly. One changes only the loop structure, the
|
|
other adds tiling, the next adds vectorization and finally we use OpenMP
|
|
parallelism.
|
|
<pre class="code">
|
|
llc matmul.normalopt.ll -o matmul.normalopt.s && \
|
|
gcc matmul.normalopt.s -o matmul.normalopt.exe
|
|
llc matmul.polly.interchanged.ll -o matmul.polly.interchanged.s && \
|
|
gcc matmul.polly.interchanged.s -o matmul.polly.interchanged.exe
|
|
llc matmul.polly.interchanged+tiled.ll -o matmul.polly.interchanged+tiled.s && \
|
|
gcc matmul.polly.interchanged+tiled.s -o matmul.polly.interchanged+tiled.exe
|
|
llc matmul.polly.interchanged+tiled+vector.ll -o matmul.polly.interchanged+tiled+vector.s && \
|
|
gcc matmul.polly.interchanged+tiled+vector.s -o matmul.polly.interchanged+tiled+vector.exe
|
|
llc matmul.polly.interchanged+tiled+vector+openmp.ll -o matmul.polly.interchanged+tiled+vector+openmp.s && \
|
|
gcc -lgomp matmul.polly.interchanged+tiled+vector+openmp.s -o matmul.polly.interchanged+tiled+vector+openmp.exe </pre>
|
|
|
|
<li><h4>Compare the runtime of the executables</h4>
|
|
|
|
By comparing the runtimes of the different code snippets we see that a simple
|
|
loop interchange gives here the largest performance boost. However by adding
|
|
vectorization and by using OpenMP we can further improve the performance
|
|
significantly.
|
|
<pre class="code">time ./matmul.normalopt.exe</pre>
|
|
<pre>42.68 real, 42.55 user, 0.00 sys</pre>
|
|
<pre class="code">time ./matmul.polly.interchanged.exe</pre>
|
|
<pre>04.33 real, 4.30 user, 0.01 sys</pre>
|
|
<pre class="code">time ./matmul.polly.interchanged+tiled.exe</pre>
|
|
<pre>04.11 real, 4.10 user, 0.00 sys</pre>
|
|
<pre class="code">time ./matmul.polly.interchanged+tiled+vector.exe</pre>
|
|
<pre>01.39 real, 1.36 user, 0.01 sys</pre>
|
|
<pre class="code">time ./matmul.polly.interchanged+tiled+vector+openmp.exe</pre>
|
|
<pre>00.66 real, 2.58 user, 0.02 sys</pre>
|
|
</li>
|
|
</ol>
|
|
|
|
</div>
|
|
</body>
|
|
</html>
|