Typically processor architectures do not include an L3 cache, which means that
Nc, the parameter of the micro-kernel, is, for all practical purposes,
redundant ([1]). However, its small values can cause the redundant packing of
the same elements of the matrix A, the first operand of the matrix
multiplication. At the same time, big values of the parameter Nc can cause
segmentation faults in case the available stack is exceeded.
This patch adds an option to specify the parameter Nc as a multiple of
the parameter of the micro-kernel Nr.
In case of Intel Core i7-3820 SandyBridge and the following options,
clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME
-march=native -mllvm -polly -mllvm -polly-pattern-matching-based-opts=true
-DPOLYBENCH_USE_SCALAR_LB -mllvm -polly-target-cache-level-associativity=8,8
-mllvm -polly-target-cache-level-sizes=32768,262144 -mllvm
-polly-target-latency-vector-fma=8
it helps to improve the performance from 11.303 GFlops/sec (39,247% of
theoretical peak) to 17.896 GFlops/sec (62,14% of theoretical peak).
Refs.:
[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
Reviewed-by: Tobias Grosser <tobias@grosser.es>
Differential Revision: https://reviews.llvm.org/D28019
llvm-svn: 290256
Aligning data to cache lines boundaries helps to avoid overheads related to
an access to it ([1]). This patch aligns newly created arrays and adds an
option to specify the first level cache line size. By default we use 64 bytes,
which is a typical cache-line size ([2]).
In case of Intel Core i7-3820 SandyBridge and the following options,
clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME
-march=native -mllvm -polly -mllvm -polly-pattern-matching-based-opts=true
-DPOLYBENCH_USE_SCALAR_LB -mllvm -polly-target-cache-level-associativity=8,8
-mllvm -polly-target-cache-level-sizes=32768,262144 -mllvm
-polly-target-latency-vector-fma=8
it helps to improve the performance from 11.303 GFlops/sec (39,247% of
theoretical peak) to 12.63 GFlops/sec (43,8542% of theoretical peak).
Refs.:
[1] - http://www.alexonlinux.com/aligned-vs-unaligned-memory-access
[2] - http://igoro.com/archive/gallery-of-processor-cache-effects/
Differential Revision: https://reviews.llvm.org/D28020
Reviewed-by: Tobias Grosser <tobias@grosser.es>
llvm-svn: 290253
multiplication
Previously we had two-dimensional accesses to store packed operands of
the matrix multiplication for the sake of simplicity of the packed arrays.
However, addition of the third dimension helps to simplify the corresponding
memory access, reduce the execution time of isl operations applied to it, and
consequently reduce the compile-time of Polly. For example, in case of
Intel Core i7-3820 SandyBridge and the following options,
clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME
-march=native -mllvm -polly -mllvm -polly-pattern-matching-based-opts=true
-DPOLYBENCH_USE_SCALAR_LB -mllvm -polly-target-cache-level-associativity=8,8
-mllvm -polly-target-cache-level-sizes=32768,262144 -mllvm
-polly-target-latency-vector-fma=7
it helps to reduce the compile-time from about 361.456 seconds to about 0.816
seconds.
Reviewed-by: Michael Kruse <llvm@meinersbur.de>,
Tobias Grosser <tobias@grosser.es>
Differential Revision: https://reviews.llvm.org/D27878
llvm-svn: 290251
To prevent copy statements from accessing arrays out of bounds, ranges of their
extension maps are restricted, according to the constraints of domains.
Reviewed-by: Michael Kruse <llvm@meinersbur.de>
Differential Revision: https://reviews.llvm.org/D25655
llvm-svn: 289815
gemm ([1]). In particular, elements of the matrix B, the second operand of
matrix multiplication, are reused between iterations of the innermost loop.
To keep the reused data in cache, only elements of matrix A, the first operand
of matrix multiplication, should be evicted during an iteration of the
innermost loop. To provide such a cache replacement policy, elements of the
matrix A can, in particular, be loaded first and, consequently, be
least-recently-used.
In our case matrices are stored in row-major order instead of column-major
order used in the BLIS implementation ([1]). One of the ways to address it is
to accordingly change the order of the loops of the loop nest. However, it
makes elements of the matrix A to be reused in the innermost loop and,
consequently, requires to load elements of the matrix B first. Since the LLVM
vectorizer always generates loads from the matrix A before loads from the
matrix B and we can not provide it. Consequently, we only change the BLIS micro
kernel and the computation of its parameters instead. In particular, reused
elements of the matrix B are successively multiplied by specific elements of
the matrix A .
Refs.:
[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
Reviewed-by: Tobias Grosser <tobias@grosser.es>
Differential Revision: https://reviews.llvm.org/D25653
llvm-svn: 289806
The AssumptionCache was removed in r289756 after being replaced by the an
addtional operand list of affected values in r289755. The absence of that cache
means that we have now have to manually search for llvm.assume intrinsics as
now done by other passes (LazyValueInfo, CodeMetrics) do not take into
account an llvm::Instruction's user lists (ScalarEvolution).
llvm-svn: 289791
clang-format has been updated in r289531 to keep labels and values on
the same line. This change updates Polly to the new formatting style.
llvm-svn: 289533
Add and implement foreachElt for isl_map, isl_set and isl_union_set. These are
used by an out-of-tree patch which is in process of being upstreamed.
llvm-svn: 288924
Add traits for isl_id and isl_multi_aff, required by out-of-tree patches
currently in progress of upstreaming.
isl_union_pw_aff_dump has been added to ISL during one of the last ISL
updates, such that we can also enable its dump() trait.
llvm-svn: 288915
This version includes an update for imath (isl-0.17.1-49-g2f1c129). It fixes
the compilation under windows, which does not know ssize_t.
In addition, isl-0.17.1-288-g0500299 changed the way isl_test finds the source
directory. It now generates a file isl_srcdir.c at configure-time, containing
the source path, to not require setting the environment variable "srcdir" at
test-time. The cmake build system had to be modified to also generate that file.
llvm-svn: 288811
Unsigned operations are often useful to support but the heuristics are
not yet tuned. This options allows to disable them if necessary.
llvm-svn: 288521
Relational comparisons should not involve multiple potentially
aliasing pointers. Similarly this should hold for switch conditions
and the two conditions involved in equality comparisons (separately!).
This is a heuristic based on the C semantics that does only allow such
operations when the base pointers do point into the same object.
Since this makes aliasing likely we will bail out early instead of
producing a probably failing runtime check.
llvm-svn: 288516
It did happen that after the inliner finished we end up with promotable
allocas in a function. We now run mem2reg to make sure everything is
promoted if possible.
llvm-svn: 288514
This allows us to delinearize code such as the one below, where the array
sizes are A[][2 * n] as there are n times two elements in the innermost
dimension. Alternatively, we could try to generate another dimension for the
struct in the innermost dimension, but as the struct has constant size,
recovering this dimension is easy.
struct com {
double Real;
double Img;
};
void foo(long n, struct com A[][n]) {
for (long i = 0; i < 100; i++)
for (long j = 0; j < 1000; j++)
A[i][j].Real += A[i][j].Img;
}
int main() {
struct com A[100][1000];
foo(1000, A);
llvm-svn: 288489
After having built memory accesses we perform some additional transformations
on them to increase the chances that our delinearization guesses the right
shape. Only after these transformations, we take the assumptions that the
array shape we predict is such that no out-of-bounds memory accesses arise.
Before this change, the construction of the memory access, the access folding
that improves the represenation for certain parametric subscripts, and taking
the assumption was all done right after a memory access was created. In this
change we split this now into three separate iterations over all memory
accesses. This means only after all memory accesses have been built, we start
to canonicalize accesses, and to take assumptions. This split prepares for
future canonicalizations that must consider all memory accesses for deriving
additional beneficial transformations.
llvm-svn: 288479
Feasibility is checked late on its own but early it is hidden behind
the "PollyProcessUnprofitable" guard. This change will make sure we opt
out early if the runtime context is infeasible anyway.
llvm-svn: 288329
In '[DBG] Allow to emit the RTC value at runtime' the diagnostics were printed
without a newline at the end of each diagnostic. We add such a newline to
improve readability.
llvm-svn: 288323
Add an empty DeLICM pass, without any functional parts.
Extracting the boilerplate from the the functional part reduces the size of the
code to review (https://reviews.llvm.org/D24716)
Suggested-by: Tobias Grosser <tobias@grosser.es>
llvm-svn: 288160
We now collect:
Number of total loops
Number of loops in scops
Number of scops
Number of scops with maximal loop depth 1
Number of scops with maximal loop depth 2
Number of scops with maximal loop depth 3
Number of scops with maximal loop depth 4
Number of scops with maximal loop depth 5
Number of scops with maximal loop depth 6 and larger
Number of loops in scops (profitable scops only)
Number of scops (profitable scops only)
Number of scops with maximal loop depth 1 (profitable scops only)
Number of scops with maximal loop depth 2 (profitable scops only)
Number of scops with maximal loop depth 3 (profitable scops only)
Number of scops with maximal loop depth 4 (profitable scops only)
Number of scops with maximal loop depth 5 (profitable scops only)
Number of scops with maximal loop depth 6 and larger (profitable scops only)
These statistics are certainly completely accurate as we might drop scops
when building up their polyhedral representation, but they should give a good
indication of the number of scops we detect.
llvm-svn: 287973
Our original statistics were added before we introduced a more fine-grained
diagnostic system, but the granularity of our statistics has never been
increased accordingly. This change introduces now one statistic counter per
diagnostic to enable us to collect fine-grained statistics about who certain
scops are not detected. In case coarser grained statistics are needed, the
user is expected to combine counters manually.
llvm-svn: 287968
Reflect this correctly in the RejectReasonKind enum. The definition of
RejectReasonKind::IrreducibleRegion was introduced in r258497, when we started
to refuse regions containing irreducible loops.
llvm-svn: 287965
In r248118 some diagnostics for unstructured control flow have been removed,
but the corresponding RejectReasonKind was accidentally not removed. This
change removes it, as it is not needed any more.
llvm-svn: 287964
Introduce the new flag -polly-codegen-generate-expressions which forces Polly
to code generate AST expressions instead of using our SCEV based access
expression generation even for cases where the original memory access relation
was not changed and the SCEV based access expression could be code generated
without any issue.
This is an experimental option for better testing the isl ast expression
generation. The default behavior of Polly remains unchanged. We also exclude
a couple of cases for which the AST expression is not yet working.
llvm-svn: 287694
Drop instructions that do not influence the memory impact of a basic block.
They are not needed to reproduce the original bug (verified) and will cause
random test noise if we would decide to only model the instructions that
have visible side-effects.
llvm-svn: 287626
Add two store instructions at the end of basic blocks that are required to
reproduce the original bug to ensure we always process and model these basic
blocks. This makes this test case stable even in case we would decide to bail
out early of basic blocks which do not modify the global state. Also add
additional check lines to verify how we model the basic block.
llvm-svn: 287625
We add CHECK lines to this test case to make it easier to see the difference
between affine and non-affine memory accesses. We also change the test case to
use a parameteric index expression as otherwise our range analysis will
understand that the non-affine memory access can only access input[1],
which makes it difficult to see that the memory access is in-fact modeled as
non-affine access.
llvm-svn: 287623