<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
          "http://www.w3.org/TR/html4/strict.dtd">
<!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ -->
<html>
<head> <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Polly - Performance</title>
<link type="text/css" rel="stylesheet" href="menu.css">
<link type="text/css" rel="stylesheet" href="content.css">
</head>
<body>
<!--#include virtual="menu.html.incl"-->
<div id="content">
<h1>Polly: Performance</h1>

<p>To evaluate the performance benefits Polly currently provides, we compiled the
<a href="http://www.cse.ohio-state.edu/~pouchet/software/polybench/">Polybench
2.0</a> benchmark suite. Each benchmark was run with double-precision
floating-point values on an Intel Xeon X5670 CPU @ 2.93GHz (12 cores, 24 threads)
system. We used <a href="http://pocc.sf.net">PoCC</a> and the included <a
href="http://pluto-compiler.sf.net">Pluto</a> transformations to optimize the
code. The source code of Polly and LLVM/clang was checked out on
25/03/2011.</p>

<p>The results shown were created fully automatically, without manual
interaction. We have not yet spent any time tuning the results. Hence,
further improvements may be achieved by tuning the code generated by Polly, by
tuning the heuristics used by Pluto, or by investigating whether more code could
be optimized. As Pluto was never used at such a low level, its heuristics are
probably far from perfect. Another area where we expect larger performance
improvements is SIMD vector code generation. At the moment, it rarely yields
performance improvements, as we have not yet included vectorization in our
heuristics. By changing this, we should be able to significantly increase the
number of test cases that show improvements.</p>

<p>The Polybench test suite contains computation kernels from linear algebra
routines, stencil computations, image processing and data mining. Polly
recognizes the majority of them and is able to show good speedup. However,
to show similar speedup on larger examples like the SPEC CPU benchmarks, Polly
still lacks support for integer casts, variable-sized multi-dimensional arrays
and probably several other constructs. This support is necessary because such
constructs appear in larger programs, but not in our limited test suite.</p>

<h2> Sequential runs</h2>

For the sequential runs we used Polly to create a program structure that is
optimized for data locality. One of the major optimizations performed is tiling.
The speedups shown do not use any multi-core parallelism. No
additional hardware is used, but the single available core is used more
efficiently.
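<p>As a rough illustration of what tiling does, the following C sketch compares a
plain matrix multiplication with a tiled variant. It is hand-written for this
page and is not output produced by Polly or Pluto; the function names, the flat
row-major layout and the <code>tile</code> parameter are assumptions made for
the example.</p>

```c
#include <assert.h>

/* Reference version: C += A * B for n x n row-major matrices. */
void matmul_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Tiled version: the same iterations, grouped into tile x tile x tile
 * blocks so that the data touched by one block is reused while it is
 * likely still in cache. For brevity this sketch assumes that tile
 * divides n evenly (a hypothetical restriction, checked below). */
void matmul_tiled(int n, int tile, const double *A, const double *B, double *C)
{
    assert(tile > 0 && n % tile == 0);
    for (int ii = 0; ii < n; ii += tile)
        for (int jj = 0; jj < n; jj += tile)
            for (int kk = 0; kk < n; kk += tile)
                /* Work inside one block. */
                for (int i = ii; i < ii + tile; i++)
                    for (int j = jj; j < jj + tile; j++)
                        for (int k = kk; k < kk + tile; k++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

<p>Both functions perform the same multiply-add operations and differ only in
the order in which they visit the iterations. A suitable tile size depends on
the cache sizes of the target machine.</p>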
<h3> Small data size</h3>
<img src="images/performance/sequential-small.png" /><br />
<h3> Large data size</h3>
<img src="images/performance/sequential-large.png" />
<h2> Parallel runs</h2>
For the parallel runs we used Polly to expose parallelism and to add calls to an
OpenMP runtime library. With OpenMP we can use all 12 hardware cores
instead of the single core that was used before. In several
cases we obtain super-linear speedup, that is, more than a 12x improvement on
12 cores. This additional speedup is due to
improved data locality.
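<p>The kind of loop-level parallelism described above can be sketched at source
level with an OpenMP pragma. This is only an analogy: the text says Polly adds
calls to an OpenMP runtime library itself, and the pragma here stands in for
those calls. The kernel (a matrix-vector product) and all function names are
hypothetical choices for the example.</p>

```c
#include <assert.h>

/* Sequential reference: y = A * x for an n x n row-major matrix. */
void matvec_seq(int n, const double *A, const double *x, double *y)
{
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
}

/* Parallel variant: each row of the result is independent of the others,
 * so the iterations of the outer loop can be distributed across threads. */
void matvec_parallel(int n, const double *A, const double *x, double *y)
{
    assert(n > 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double sum = 0.0; /* private to the thread handling row i */
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
}
```

<p>With GCC or clang the pragma takes effect when the code is compiled with
<code>-fopenmp</code>; without that flag it is ignored and the loop runs
sequentially with the same result.</p>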
<h3> Small data size</h3>
<img src="images/performance/parallel-small.png" /><br />
<h3> Large data size</h3>
<img src="images/performance/parallel-large.png" />
</div>
</body>
</html>