llvm-project/polly/www/example_manual_matmul.html

455 lines
17 KiB
HTML

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ -->
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Polly - Examples</title>
<link type="text/css" rel="stylesheet" href="menu.css">
<link type="text/css" rel="stylesheet" href="content.css">
</head>
<body>
<div id="box">
<!--#include virtual="menu.html.incl"-->
<div id="content">
<!--=====================================================================-->
<h1>Execute the individual Polly passes manually</h1>
<!--=====================================================================-->
<p>
This example presents the individual passes that are involved when optimizing
code with Polly. We show how to execute them individually and explain for each
which analysis is performed or what transformation is applied. In this example
the polyhedral transformation is user-provided to show how much performance
improvement can be expected by an optimal automatic optimizer.</p>
The files used and created in this example are available in the Polly checkout
in the folder <em>www/experiments/matmul</em>. They can be created automatically
by running the <em>www/experiments/matmul/runall.sh</em> script.
<ol>
<li><h4>Create LLVM-IR from the C code</h4>
Polly works on LLVM-IR. Hence it is necessary to translate the source files into
LLVM-IR. If more than on file should be optimized the files can be combined into
a single file with llvm-link.
<pre class="code">clang -S -emit-llvm matmul.c -o matmul.s</pre>
</li>
<li><h4>Load Polly automatically when calling the 'opt' tool</h4>
Polly is not built into opt or bugpoint, but it is a shared library that needs
to be loaded into these tools explicitally. The Polly library is called
LVMPolly.so. For a cmake build it is available in the build/lib/ directory,
autoconf creates the same file in
build/tools/polly/{Release+Asserts|Asserts|Debug}/lib. For convenience we create
an alias that automatically loads Polly if 'opt' is called.
<pre class="code">
export PATH_TO_POLLY_LIB="~/polly/build/lib/"
alias opt="opt -load ${PATH_TO_POLLY_LIB}/LLVMPolly.so"</pre>
</li>
<li><h4>Prepare the LLVM-IR for Polly</h4>
Polly is only able to work with code that matches a canonical form. To translate
the LLVM-IR into this form we use a set of canonicalication passes. They are
scheduled by using '-polly-canonicalize'.
<pre class="code">opt -S -polly-canonicalize matmul.s &gt; matmul.preopt.ll</pre></li>
<li><h4>Show the SCoPs detected by Polly (optional)</h4>
To understand if Polly was able to detect SCoPs, we print the
structure of the detected SCoPs. In our example two SCoPs were detected. One in
'init_array' the other in 'main'.
<pre class="code">opt -basicaa -polly-ast -analyze -q matmul.preopt.ll</pre>
<pre>
init_array():
for (c2=0;c2&lt;=1023;c2++) {
for (c4=0;c4&lt;=1023;c4++) {
Stmt_5(c2,c4);
}
}
main():
for (c2=0;c2&lt;=1023;c2++) {
for (c4=0;c4&lt;=1023;c4++) {
Stmt_4(c2,c4);
for (c6=0;c6&lt;=1023;c6++) {
Stmt_6(c2,c4,c6);
}
}
}
</pre>
</li>
<li><h4>Highlight the detected SCoPs in the CFGs of the program (requires graphviz/dotty)</h4>
Polly can use graphviz to graphically show a CFG in which the detected SCoPs are
highlighted. It can also create '.dot' files that can be translated by
the 'dot' utility into various graphic formats.
<pre class="code">opt -basicaa -view-scops -disable-output matmul.preopt.ll
opt -basicaa -view-scops-only -disable-output matmul.preopt.ll</pre>
The output for the different functions<br />
view-scops:
<a href="experiments/matmul/scops.main.dot.png">main</a>,
<a href="experiments/matmul/scops.init_array.dot.png">init_array</a>,
<a href="experiments/matmul/scops.print_array.dot.png">print_array</a><br />
view-scops-only:
<a href="experiments/matmul/scopsonly.main.dot.png">main</a>,
<a href="experiments/matmul/scopsonly.init_array.dot.png">init_array</a>,
<a href="experiments/matmul/scopsonly.print_array.dot.png">print_array</a>
</li>
<li><h4>View the polyhedral representation of the SCoPs</h4>
<pre class="code">opt -basicaa -polly-scops -analyze matmul.preopt.ll</pre>
<pre>
[...]
Printing analysis 'Polly - Create polyhedral description of Scops' for region:
'for.cond =&gt; for.end19' in function 'init_array':
Context:
{ [] }
Statements {
Stmt_5
Domain&nbsp;:=
{ Stmt_5[i0, i1]&nbsp;: i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023 };
Scattering&nbsp;:=
{ Stmt_5[i0, i1] -&gt; scattering[0, i0, 0, i1, 0] };
WriteAccess&nbsp;:=
{ Stmt_5[i0, i1] -&gt; MemRef_A[1037i0 + i1] };
WriteAccess&nbsp;:=
{ Stmt_5[i0, i1] -&gt; MemRef_B[1047i0 + i1] };
FinalRead
Domain&nbsp;:=
{ FinalRead[0] };
Scattering&nbsp;:=
{ FinalRead[i0] -&gt; scattering[200000000, o1, o2, o3, o4] };
ReadAccess&nbsp;:=
{ FinalRead[i0] -&gt; MemRef_A[o0] };
ReadAccess&nbsp;:=
{ FinalRead[i0] -&gt; MemRef_B[o0] };
}
[...]
Printing analysis 'Polly - Create polyhedral description of Scops' for region:
'for.cond =&gt; for.end30' in function 'main':
Context:
{ [] }
Statements {
Stmt_4
Domain&nbsp;:=
{ Stmt_4[i0, i1]&nbsp;: i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023 };
Scattering&nbsp;:=
{ Stmt_4[i0, i1] -&gt; scattering[0, i0, 0, i1, 0, 0, 0] };
WriteAccess&nbsp;:=
{ Stmt_4[i0, i1] -&gt; MemRef_C[1067i0 + i1] };
Stmt_6
Domain&nbsp;:=
{ Stmt_6[i0, i1, i2]&nbsp;: i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023 and i2 &gt;= 0 and i2 &lt;= 1023 };
Scattering&nbsp;:=
{ Stmt_6[i0, i1, i2] -&gt; scattering[0, i0, 0, i1, 1, i2, 0] };
ReadAccess&nbsp;:=
{ Stmt_6[i0, i1, i2] -&gt; MemRef_C[1067i0 + i1] };
ReadAccess&nbsp;:=
{ Stmt_6[i0, i1, i2] -&gt; MemRef_A[1037i0 + i2] };
ReadAccess&nbsp;:=
{ Stmt_6[i0, i1, i2] -&gt; MemRef_B[i1 + 1047i2] };
WriteAccess&nbsp;:=
{ Stmt_6[i0, i1, i2] -&gt; MemRef_C[1067i0 + i1] };
FinalRead
Domain&nbsp;:=
{ FinalRead[0] };
Scattering&nbsp;:=
{ FinalRead[i0] -&gt; scattering[200000000, o1, o2, o3, o4, o5, o6] };
ReadAccess&nbsp;:=
{ FinalRead[i0] -&gt; MemRef_C[o0] };
ReadAccess&nbsp;:=
{ FinalRead[i0] -&gt; MemRef_A[o0] };
ReadAccess&nbsp;:=
{ FinalRead[i0] -&gt; MemRef_B[o0] };
}
[...]
</pre>
</li>
<li><h4>Show the dependences for the SCoPs</h4>
<pre class="code">opt -basicaa -polly-dependences -analyze matmul.preopt.ll</pre>
<pre>Printing analysis 'Polly - Calculate dependences for SCoP' for region:
'for.cond =&gt; for.end19' in function 'init_array':
Must dependences:
{ }
May dependences:
{ }
Must no source:
{ }
May no source:
{ }
Printing analysis 'Polly - Calculate dependences for SCoP' for region:
'for.cond =&gt; for.end30' in function 'main':
Must dependences:
{ Stmt_4[i0, i1] -&gt; Stmt_6[i0, i1, 0]&nbsp;:
i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023;
Stmt_6[i0, i1, i2] -&gt; Stmt_6[i0, i1, 1 + i2]&nbsp;:
i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023 and i2 &gt;= 0 and i2 &lt;= 1022;
Stmt_6[i0, i1, 1023] -&gt; FinalRead[0]&nbsp;:
i1 &lt;= 1091540 - 1067i0 and i1 &gt;= -1067i0 and i1 &gt;= 0 and i1 &lt;= 1023;
Stmt_6[1023, i1, 1023] -&gt; FinalRead[0]&nbsp;:
i1 &gt;= 0 and i1 &lt;= 1023
}
May dependences:
{ }
Must no source:
{ Stmt_6[i0, i1, i2] -&gt; MemRef_A[1037i0 + i2]&nbsp;:
i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023 and i2 &gt;= 0 and i2 &lt;= 1023;
Stmt_6[i0, i1, i2] -&gt; MemRef_B[i1 + 1047i2]&nbsp;:
i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023 and i2 &gt;= 0 and i2 &lt;= 1023;
FinalRead[0] -&gt; MemRef_A[o0];
FinalRead[0] -&gt; MemRef_B[o0]
FinalRead[0] -&gt; MemRef_C[o0]&nbsp;:
o0 &gt;= 1092565 or (exists (e0 = [(o0)/1067]: o0 &lt;= 1091540 and o0 &gt;= 0
and 1067e0 &lt;= -1024 + o0 and 1067e0 &gt;= -1066 + o0)) or o0 &lt;= -1;
}
May no source:
{ }
</pre></li>
<li><h4>Export jscop files</h4>
Polly can export the polyhedral representation in so called jscop files. Jscop
files contain the polyhedral representation stored in a JSON file.
<pre class="code">opt -basicaa -polly-export-jscop matmul.preopt.ll</pre>
<pre>Writing SCoP 'for.cond =&gt; for.end19' in function 'init_array' to './init_array___%for.cond---%for.end19.jscop'.
Writing SCoP 'for.cond =&gt; for.end30' in function 'main' to './main___%for.cond---%for.end30.jscop'.
</pre></li>
<li><h4>Import the changed jscop files and print the updated SCoP structure
(optional)</h4>
<p>Polly can reimport jscop files, in which the schedules of the statements are
changed. These changed schedules are used to descripe transformations.
It is possible to import different jscop files by providing the postfix
of the jscop file that is imported.</p>
<p> We apply three different transformations on the SCoP in the main function.
The jscop files describing these transformations are hand written (and available
in <em>www/experiments/matmul</em>).
<h5>No Polly</h5>
<p>As a baseline we do not call any Polly code generation, but only apply the
normal -O3 optimizations.</p>
<pre class="code">
opt matmul.preopt.ll -basicaa \
-polly-import-jscop \
-polly-ast -analyze
</pre>
<pre>
[...]
main():
for (c2=0;c2&ltg;=1535;c2++) {
for (c4=0;c4&ltg;=1535;c4++) {
Stmt_4(c2,c4);
for (c6=0;c6&ltg;=1535;c6++) {
Stmt_6(c2,c4,c6);
}
}
}
[...]
</pre>
<h5>Interchange (and Fission to allow the interchange)</h5>
<p>We split the loops and can now apply an interchange of the loop dimensions that
enumerate Stmt_6.</p>
<pre class="code">
opt matmul.preopt.ll -basicaa \
-polly-import-jscop -polly-import-jscop-postfix=interchanged \
-polly-ast -analyze
</pre>
<pre>
[...]
Reading JScop 'for.cond =&gt; for.end30' in function 'main' from './main___%for.cond---%for.end30.jscop.interchanged+tiled'.
[...]
main():
for (c2=0;c2&lt;=1535;c2++) {
for (c4=0;c4&lt;=1535;c4++) {
Stmt_4(c2,c4);
}
}
for (c2=0;c2&lt;=1535;c2++) {
for (c4=0;c4&lt;=1535;c4++) {
for (c6=0;c6&lt;=1535;c6++) {
Stmt_6(c2,c6,c4);
}
}
}
[...]
</pre>
<h5>Interchange + Tiling</h5>
<p>In addition to the interchange we tile now the second loop nest.</p>
<pre class="code">
opt matmul.preopt.ll -basicaa \
-polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled \
-polly-ast -analyze
</pre>
<pre>
[...]
Reading JScop 'for.cond =&gt; for.end30' in function 'main' from './main___%for.cond---%for.end30.jscop.interchanged+tiled'.
[...]
main():
for (c2=0;c2&lt;=1535;c2++) {
for (c4=0;c4&lt;=1535;c4++) {
Stmt_4(c2,c4);
}
}
for (c2=0;c2&lt;=1535;c2+=64) {
for (c3=0;c3&lt;=1535;c3+=64) {
for (c4=0;c4&lt;=1535;c4+=64) {
for (c5=c2;c5&lt;=c2+63;c5++) {
for (c6=c4;c6&lt;=c4+63;c6++) {
for (c7=c3;c7&lt;=c3+63;c7++) {
Stmt_6(c5,c7,c6);
}
}
}
}
}
}
[...]
</pre>
<h5>Interchange + Tiling + Strip-mining to prepare vectorization</h5>
To later allow vectorization we create a so called trivially parallelizable
loop. It is innermost, parallel and has only four iterations. It can be
replaced by 4-element SIMD instructions.
<pre class="code">
opt matmul.preopt.ll -basicaa \
-polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \
-polly-ast -analyze </pre>
<pre>
[...]
Reading JScop 'for.cond =&gt; for.end30' in function 'main' from './main___%for.cond---%for.end30.jscop.interchanged+tiled+vector'.
[...]
main():
for (c2=0;c2&lt;=1535;c2++) {
for (c4=0;c4&lt;=1535;c4++) {
Stmt_4(c2,c4);
}
}
for (c2=0;c2&lt;=1535;c2+=64) {
for (c3=0;c3&lt;=1535;c3+=64) {
for (c4=0;c4&lt;=1535;c4+=64) {
for (c5=c2;c5&lt;=c2+63;c5++) {
for (c6=c4;c6&lt;=c4+63;c6++) {
for (c7=c3;c7&lt;=c3+63;c7+=4) {
for (c8=c7;c8&lt;=c7+3;c8++) {
Stmt_6(c5,c8,c6);
}
}
}
}
}
}
}
[...]
</pre>
</li>
<li><h4>Codegenerate the SCoPs</h4>
<p>
This generates new code for the SCoPs detected by polly.
If -polly-import-jscop is present, transformations specified in the imported
jscop files will be applied.</p>
<pre class="code">opt matmul.preopt.ll | opt -O3 &gt; matmul.normalopt.ll</pre>
<pre class="code">
opt -basicaa \
-polly-import-jscop -polly-import-jscop-postfix=interchanged \
-polly-codegen matmul.preopt.ll \
| opt -O3 &gt; matmul.polly.interchanged.ll</pre>
<pre>
Reading JScop 'for.cond =&gt; for.end19' in function 'init_array' from
'./init_array___%for.cond---%for.end19.jscop.interchanged'.
File could not be read: No such file or directory
Reading JScop 'for.cond =&gt; for.end30' in function 'main' from
'./main___%for.cond---%for.end30.jscop.interchanged'.
</pre>
<pre class="code">
opt -basicaa \
-polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled \
-polly-codegen matmul.preopt.ll \
| opt -O3 &gt; matmul.polly.interchanged+tiled.ll</pre>
<pre>
Reading JScop 'for.cond =&gt; for.end19' in function 'init_array' from
'./init_array___%for.cond---%for.end19.jscop.interchanged+tiled'.
File could not be read: No such file or directory
Reading JScop 'for.cond =&gt; for.end30' in function 'main' from
'./main___%for.cond---%for.end30.jscop.interchanged+tiled'.
</pre>
<pre class="code">
opt -basicaa \
-polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \
-polly-codegen -polly-vectorizer=polly matmul.preopt.ll \
| opt -O3 &gt; matmul.polly.interchanged+tiled+vector.ll</pre>
<pre>
Reading JScop 'for.cond =&gt; for.end19' in function 'init_array' from
'./init_array___%for.cond---%for.end19.jscop.interchanged+tiled+vector'.
File could not be read: No such file or directory
Reading JScop 'for.cond =&gt; for.end30' in function 'main' from
'./main___%for.cond---%for.end30.jscop.interchanged+tiled+vector'.
</pre>
<pre class="code">
opt -basicaa \
-polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \
-polly-codegen -polly-vectorizer=polly -polly-parallel matmul.preopt.ll \
| opt -O3 &gt; matmul.polly.interchanged+tiled+openmp.ll</pre>
<pre>
Reading JScop 'for.cond =&gt; for.end19' in function 'init_array' from
'./init_array___%for.cond---%for.end19.jscop.interchanged+tiled+vector'.
File could not be read: No such file or directory
Reading JScop 'for.cond =&gt; for.end30' in function 'main' from
'./main___%for.cond---%for.end30.jscop.interchanged+tiled+vector'.
</pre>
<li><h4>Create the executables</h4>
Create one executable optimized with plain -O3 as well as a set of executables
optimized in different ways with Polly. One changes only the loop structure, the
other adds tiling, the next adds vectorization and finally we use OpenMP
parallelism.
<pre class="code">
llc matmul.normalopt.ll -o matmul.normalopt.s &amp;&amp; \
gcc matmul.normalopt.s -o matmul.normalopt.exe
llc matmul.polly.interchanged.ll -o matmul.polly.interchanged.s &amp;&amp; \
gcc matmul.polly.interchanged.s -o matmul.polly.interchanged.exe
llc matmul.polly.interchanged+tiled.ll -o matmul.polly.interchanged+tiled.s &amp;&amp; \
gcc matmul.polly.interchanged+tiled.s -o matmul.polly.interchanged+tiled.exe
llc matmul.polly.interchanged+tiled+vector.ll -o matmul.polly.interchanged+tiled+vector.s &amp;&amp; \
gcc matmul.polly.interchanged+tiled+vector.s -o matmul.polly.interchanged+tiled+vector.exe
llc matmul.polly.interchanged+tiled+vector+openmp.ll -o matmul.polly.interchanged+tiled+vector+openmp.s &amp;&amp; \
gcc -lgomp matmul.polly.interchanged+tiled+vector+openmp.s -o matmul.polly.interchanged+tiled+vector+openmp.exe </pre>
<li><h4>Compare the runtime of the executables</h4>
By comparing the runtimes of the different code snippets we see that a simple
loop interchange gives here the largest performance boost. However by adding
vectorization and by using OpenMP we can further improve the performance
significantly.
<pre class="code">time ./matmul.normalopt.exe</pre>
<pre>42.68 real, 42.55 user, 0.00 sys</pre>
<pre class="code">time ./matmul.polly.interchanged.exe</pre>
<pre>04.33 real, 4.30 user, 0.01 sys</pre>
<pre class="code">time ./matmul.polly.interchanged+tiled.exe</pre>
<pre>04.11 real, 4.10 user, 0.00 sys</pre>
<pre class="code">time ./matmul.polly.interchanged+tiled+vector.exe</pre>
<pre>01.39 real, 1.36 user, 0.01 sys</pre>
<pre class="code">time ./matmul.polly.interchanged+tiled+vector+openmp.exe</pre>
<pre>00.66 real, 2.58 user, 0.02 sys</pre>
</li>
</ol>
</div>
</div>
</body>
</html>