This speeds up scop modeling for scops with many redundent existentially
quantified constraints. For the attached test case, this change reduces
scop modeling time from minutes (hours?) to 0.15 seconds.
This change resolves a compilation timeout on the AOSP build.
Thanks Eli for reporting _and_ reducing the test case!
Reported-by: Eli Friedman <efriedma@codeaurora.org>
llvm-svn: 303600
The SCEVs of loops surrounding the escape users of a merge blocks are
forgotten, so that loop trip counts based on old values can be revoked.
This fixes llvm.org//PR32536
Contributed-by: Baranidharan Mohan <mbdharan@gmail.com>
Differential Revision: https://reviews.llvm.org/D33195
llvm-svn: 303561
Allow the BlockGenerator to generate memory writes that are not defined
over the complete statement domain, but only over a subset of it. It
generates a condition that evaluates to 1 if executing the subdomain,
and only then execute the access.
Only write accesses are supported. Read accesses would require a PHINode
which has a value if the access is not executed.
Partial write makes DeLICM able to apply mappings that are not defined
over the entire domain (for instance, a branch that leaves a loop with
a PHINode in its header; a MemoryKind::PHI write when leaving is never
read by its PHI read).
Differential Revision: https://reviews.llvm.org/D33255
llvm-svn: 303517
A test case with a GPU runline was added without setting 'REQUIRES=pollyacc'. We
drop the GPU run line, as the basic functionality can already be tested with
the normal code generation.
llvm-svn: 303485
- We use the outermost dimension of arrays since we need this
information to generate GPU transfers.
- In general, if we do not know the outermost dimension of the array
(because the indexing expression is non-affine, for example) then we
simply cannot generate transfer code.
- However, for Fortran arrays, we can use the Fortran array
representation which stores the dimensions of all arrays.
- This patch uses the Fortran array representation to generate code that
computes the outermost dimension size.
Differential Revision: https://reviews.llvm.org/D32967
llvm-svn: 303429
Summary:
- Rename global / local naming convention that did not make much sense
to Visible / Invisible, where the visible refers to whether the ALLOCATE
call to the Fortran array is present in the current module or not.
- This match now works on both cross fortran module globals and on
parameters to functions since neither of them are necessarily allocated
at the point of their usage.
- Add testcase that matches against both a load and a store against
function parameters.
Differential Revision: https://reviews.llvm.org/D33190
llvm-svn: 303356
- This breaks the previous assumption that Fortran Arrays are `GlobalValue`.
- The names of functions were getting unwieldy. So, I renamed the
Fortran related functions.
Differential Revision: https://reviews.llvm.org/D33075
llvm-svn: 303040
At the time of code generation, an instruction with an llvm intrinsic is ignored
in copyBB. However, if the value of the instruction is used later in the
program, the value needs to be synthesized. However, this is causing some issues
with the instructions being generated in a hoisted basic block.
Removing llvm.expect from the list of ignored intrinsics fixes this bug.
This resolves http://llvm.org/PR32324.
Contributed-by: Annanay Agarwal <cs14btech11001@iith.ac.in>
Tags: #polly
Differential Revision: https://reviews.llvm.org/D32992
llvm-svn: 303006
Removal of overwritten writes currently encompasses all the cases
of the identical write removal.
There is an observable behavioral change in that the last, instead
of the first, MemoryAccess is kept. This should not affect the
generated code, however.
Differential Revision: https://reviews.llvm.org/D33143
llvm-svn: 302987
Remove memory writes that are overwritten by later writes. This works
for StoreInsts:
store double 21.0, double* %A
store double 42.0, double* %A
scalar writes at the end of a statement and mixes of these.
Multiple writes can be the result of DeLICM, which might map multiple
writes to the same location when it knows that these do no conflict
(for instance because they write the same value). Such writes
interfere with pattern-matched optimization such as gemm and may not
get removed by other LLVM passes after code generation.
Differential Revision: https://reviews.llvm.org/D33142
llvm-svn: 302986
- Move the testcases to ScopInfo/ since the processing takes place in
ScopBuilder.
- Cleanup testcases, run -polly-canonicalize on them, find minimal set
of opt parameters.
llvm-svn: 302886
Today Polly generates induction variable in this way:
polly.indvar = phi 0, polly.indvar.next
...
polly.indvar.next = polly.indvar + stide
polly.loop_cond = predicate polly.indvar, (UB - stride)
Instead of:
polly.indvar = phi 0, polly.indvar.next
...
polly.indvar.next = polly.indvar + stide
polly.loop_cond = predicate polly.indvar.next, UB
The way Polly generate induction variable cause some problem in the indvar simplify pass.
This patch make polly generate the later form, by assuming the induction variable never overflow
Differential Revision: https://reviews.llvm.org/D33089
llvm-svn: 302866
After DeLICM, it is possible to have two writes of the same value to
the same location in the same statement when it determined that those
writes do not conflict (write the same value).
Teach -polly-simplify to remove one of the writes. It interferes with
the pattern matching of matrix-multiplication kernels and also seem
to not be optimized away by LLVM.
The algorthm is simple, has O(n^2) behaviour (n = max number of
MemoryAccesses in a statement) and only matches the most obvious cases,
but seem to be enough to pattern-match Boost ublas gemm.
Not handled cases include:
- StoreInst instructions (a.k.a. explicit writes), since the value might
be loaded or overwritten between the two stores.
- PHINode, especially LCSSA, when the PHI value matches with on other's.
- Partial writes (in preparation)
llvm-svn: 302805
Some isl functions can simplify their __isl_keep arguments. The
argument object after the call uses different contraints to represent
the same set. Different contraints can result in different outputs
when printed to a string.
In assert builds additional isl functions are called (in assert() or
mentioned, these can change the internal representation of its read-only
arguments such that printed strings are different in debug and non-debug
builds.
What happened here is that a call to isl_set_is_equal inside an assert
in getScatterFor normalizes one of its arguments such that one redundant
constraint is removed. The redundant constraint therefore does not appear
in the string representing the domain, which FileCheck notices as a
regression test failure compared to a build with assertions disabled.
This fix removes the redundant contraints the domain from the start such
that the redundant contraint is removed in assert and non-assert builds.
Isl adds a flag to such sets such that the removal of redundancies is
not done multiple times (here: by isl_set_is_equal).
Thanks to Tobias Grosser for reporting and hinting to the cause.
llvm-svn: 302711
Add the ability to tag certain memory accesses as those belonging to
Fortran arrays. We do this by pattern matching against known patterns
of Dragonegg's LLVM IR output from Fortran code.
Fortran arrays have metadata stored with them in a struct. This struct
is called the "Fortran array descriptor", and a reference to this is
stored in each MemoryAccess.
Differential Revision: https://reviews.llvm.org/D32639
llvm-svn: 302653
Summary:
In case two arrays share base pointers in the same invariant load equivalence
class, we canonicalize all memory accesses to the first of these arrays
(according to their order in the equivalence class).
This enables us to optimize kernels such as boost::ublas by ensuring that
different references to the C array are interpreted as accesses to the same
array. Before this change the runtime alias check for ublas would fail, as it
would assume models of the C array with differing (but identically valued) base
pointers would reference distinct regions of memory whereas the referenced
memory regions were indeed identical.
As part of this change we remove most of the MemoryAccess::get*BaseAddr
interface. We removed already all references to get*BaseAddr in previous
commits to ensure that no code relies on matching base pointers between
memory accesses and scop arrays -- except for three remaining uses where we
need the original base pointer. We document for these situations that
MemoryAccess::getOriginalBaseAddr may return a base pointer that is distinct
to the base pointer of the scop array referenced by this memory access.
Reviewers: sebpop, Meinersbur, zinob, gareevroman, pollydev, huihuiz, efriedma, jdoerfert
Reviewed By: Meinersbur
Subscribers: etherzhhb
Tags: #polly
Differential Revision: https://reviews.llvm.org/D28518
llvm-svn: 302636
Summary: PPCGCodeGeneration now attaches the size of the kernel launch parameters at the end of the parameter list. For the existing CUDA Runtime, this gets ignored, but the OpenCL Runtime knows to check for kernel-argument size at the end of the parameter list. (The resulting parameters list is twice as long. This has been accounted for in the corresponding test cases).
Reviewers: grosser, Meinersbur, bollu
Reviewed By: bollu
Subscribers: nemanjai, yaxunl, Anastasia, pollydev, llvm-commits
Tags: #polly
Differential Revision: https://reviews.llvm.org/D32961
llvm-svn: 302515
Summary:
When compiling for GPU, one can now choose to compile for OpenCL or CUDA,
with the corresponding polly-gpu-runtime flag (libopencl / libcudart). The
GPURuntime library (GPUJIT) has been extended with the OpenCL Runtime library
for that purpose, correctly choosing the corresponding library calls to the
option chosen when compiling (via different initialization calls).
Additionally, a specific GPU Target architecture can now be chosen with -polly-gpu-arch (only nvptx64 implemented thus far).
Reviewers: grosser, bollu, Meinersbur, etherzhhb, singam-sanjay
Reviewed By: grosser, Meinersbur
Subscribers: singam-sanjay, llvm-commits, pollydev, nemanjai, mgorny, yaxunl, Anastasia
Tags: #polly
Differential Revision: https://reviews.llvm.org/D32431
llvm-svn: 302379
Extend the Knowledge class to store information about the contents
of array elements and which values are written. Two knowledges do
not conflict the known content is the same. The content information
if computed from writes to and loads from the array elements, and
represented by "ValInst": isl spaces that compare equal if the value
represented is the same.
Differential Revision: https://reviews.llvm.org/D31247
llvm-svn: 302339
SCoPs with unfeasible runtime context are thrown away and therefore
do not need their uses verified.
The added test case requires a complexity limit to exceed.
Normally, error statements are removed from the SCoP and for that
reason are skipped during the verification. If there is a unfeasible
runtime context (here: because of the complexity limit being reached),
the removal of error statements and other SCoP construction steps are
skipped to not waste time. Error statements are not modeled in SCoPs
and therefore have no requirements on whether the scalars used in
them are available.
llvm-svn: 302234
Since r294891, in MemoryAccess::computeBoundsOnAccessRelation(), we skip
manually bounding the access relation in case the parameter of the load
instruction is already a wrapped set. Later on we assume that the lower
bound on the set is always smaller or equal to the upper bound on the
set. Bug 32715 manages to construct a sign wrapped set, in which case
the assertion does not necessarily hold. Fix this by handling a sign
wrapped set similar to a normal wrapped set, that is skipping the
computation.
Contributed-by: Maximilian Falkenstein <falkensm@student.ethz.ch>
Reviewers: grosser
Subscribers: pollydev, llvm-commits
Tags: #Polly
Differential Revision: https://reviews.llvm.org/D32893
llvm-svn: 302231
This reverts commit 17a84e414adb51ee375d14836d4c2a817b191933.
Patches should have been submitted in the order of:
1. D32852
2. D32854
3. D32431
I mistakenly pushed D32431(3) first. Reverting to push in the correct
order.
llvm-svn: 302217
Summary:
When compiling for GPU, one can now choose to compile for OpenCL or CUDA,
with the corresponding polly-gpu-runtime flag (libopencl / libcudart). The
GPURuntime library (GPUJIT) has been extended with the OpenCL Runtime library
for that purpose, correctly choosing the corresponding library calls to the
option chosen when compiling (via different initialization calls).
Additionally, a specific GPU Target architecture can now be chosen with -polly-gpu-arch (only nvptx64 implemented thus far).
Reviewers: grosser, bollu, Meinersbur, etherzhhb, singam-sanjay
Reviewed By: grosser, Meinersbur
Subscribers: singam-sanjay, llvm-commits, pollydev, nemanjai, mgorny, yaxunl, Anastasia
Tags: #polly
Differential Revision: https://reviews.llvm.org/D32431
llvm-svn: 302215
The test subdirectory POLLY_TEST_DIRECTORIES was heavily outdated and
only used in out-of-LLVM-tree builds
(to generate polly-test-${subdir} targets).
llvm-svn: 302142
This makes sure we still test the case that a PHI-NODE cannot be analyzed by
scalar evolution and consequently must be code generated explicitly. As
Michael's optimization triggers only on a very specific "add %iv, %step"
pattern, just changing 'add' to 'mul' adds back test coverage.
llvm-svn: 302132
LLVM-IR names are commonly available in debug builds, but often not in release
builds. Hence, using LLVM-IR names to identify statements or memory reference
results makes the behavior of Polly depend on the compile mode. This is
undesirable. Hence, we now just number the statements instead of using LLVM-IR
names to identify them (this issue has previously been brought up by Zino
Benaissa).
However, as LLVM-IR names help in making test cases more readable, we add an
option '-polly-use-llvm-names' to still use LLVM-IR names. This flag is by
default set in the polly tests to make test cases more readable.
This change reduces the time in ScopInfo from 32 seconds to 2 seconds for the
following test case provided by Eli Friedman <efriedma@codeaurora.org> (already
used in one of the previous commits):
struct X { int x; };
void a();
#define SIG (int x, X **y, X **z)
typedef void (*fn)SIG;
#define FN { for (int i = 0; i < x; ++i) { (*y)[i].x += (*z)[i].x; } a(); }
#define FN5 FN FN FN FN FN
#define FN25 FN5 FN5 FN5 FN5
#define FN125 FN25 FN25 FN25 FN25 FN25
#define FN250 FN125 FN125
#define FN1250 FN250 FN250 FN250 FN250 FN250
void x SIG { FN1250 }
For a larger benchmark I have on-hand (10000 loops), this reduces the time for
running -polly-scops from 5 minutes to 4 minutes, a reduction by 20%.
The reason for this large speedup is that our previous use of printAsOperand
had a quadratic cost, as for each printed and unnamed operand the full function
was scanned to find the instruction number that identifies the operand.
We do not need to adjust the way memory reference ids are constructured, as
they do not use LLVM values.
Reviewed by: efriedma
Tags: #polly
Differential Revision: https://reviews.llvm.org/D32789
llvm-svn: 302072
- Fixes breakage from commit 5536f.
- Interference with commit 764f3 caused testcase to fail. Reverting
764f3 allows commit 5536f to succeed.
- Generated kernel code was slightly different due to 764f3, which
caused testcase to fail.
llvm-svn: 302021
Before this change a memory reference identifier had the form:
<STMT>_<ACCESSTYPE><ID>_<MEMREF>, e.g., Stmt_bb9_Write0_MemRef_tmp11
After this change, we use the format:
<STMT>_<ACCESSTYPE><ID>, e.g., Stmt_bb9_Write0
The name of the array that is accessed through a memory reference is not
necessary to uniquely identify a memory reference, but was only added to
provide additional information for debugging. We drop this information now
for the following two reasons:
1) This shortens the names and consequently improves readability
2) This removes a second location where we decide on the name of a scop array,
leaving us only with the location where the actual scop array is created.
Having after 2) only a single location to name scop arrays will allow us to
change the naming convention of scop arrays more easily, which we will do
in a future commit to reduce compilation time.
llvm-svn: 302004
This makes it easier to read and possibly even modify the test cases, as there
is no need to keep the variable increment in steps of one. More importantly, by
using explicit variable names we do not need to rely on the implicit numbering
of statements when dumping the scop information.
This makes it easier to read and possibly even modify the test cases.
Furthermore, by using explicit variables we do not need to rely on the implicit
numbering of statements when dumping the scop information. In a future commit,
this implicit numbering will likely not be used any more to refer to LLVM-IR
values as it is very expensive to construct.
llvm-svn: 301689
generation.
This needs changes to GPURuntime to expose synchronization between host
and device.
1. Needs better function naming, I want a better name than
"getOrCreateManagedDeviceArray"
2. DeviceAllocations is used by both the managed memory and the
non-managed memory path. This exploits the fact that the two code paths
are never run together. I'm not sure if this is the best design decision
Reviewed by: PhilippSchaad
Tags: #polly
Differential Revision: https://reviews.llvm.org/D32215
llvm-svn: 301640
When we introduced in r297375 support for hoisting loads that are known
to be dereferencable without any conditional guard, we forgot to keep the check
to verify that no other write into the very same location exists. This
change ensures now that dereferencable loads are allowed to access everything,
but can only be hoisted in case no conflicting write exists.
This resolves llvm.org/PR32778
Reported-by: Huihui Zhang <huihuiz@codeaurora.org>
llvm-svn: 301582
Added a small change to the way pointer arguments are set in the kernel
code generation. The way the pointer is retrieved now, specifically requests
global address space to be annotated. This is necessary, if the IR should be
run through NVPTX to generate OpenCL compatible PTX.
The changes do not affect the PTX Strings generated for the CUDA target
(nvptx64-nvidia-cuda), but are necessary for OpenCL (nvptx64-nvidia-nvcl).
Additionally, the data layout has been updated to what the NVPTX Backend requests/recommends.
Contributed-by: Philipp Schaad
Reviewers: Meinersbur, grosser, bollu
Reviewed By: grosser, bollu
Subscribers: jlebar, pollydev, llvm-commits, nemanjai, yaxunl, Anastasia
Tags: #polly
Differential Revision: https://reviews.llvm.org/D32215
llvm-svn: 301299
Earlier, the call to buildFlow was:
WAR = buildFlow(Write, Read, MustWrite, Schedule).
This meant that Read could block another Read, since must-sources can
block each other.
Fixed the call to buildFlow to correctly compute Read. The resulting
code needs to do some ISL juggling to get the output we want.
Bug report: https://bugs.llvm.org/show_bug.cgi?id=32623
Reviewers: Meinersbur
Tags: #polly
Differential Revision: https://reviews.llvm.org/D32011
llvm-svn: 301266
The isl unittest modified its PATH variable to point to the LLVM bin dir.
When building out-of-LLVM-tree, it does not contain the
polly-isl-test executable, hence the test fails.
Ensure that the polly-isl-test is written to a bin directory in the
build root, just like it would happen in an inside-LLVM build.
Then, change PATH to include that dir such that the executable in it
is prioritized before any other location.
llvm-svn: 301096
Dimensions of band nodes can be implicitly permuted by the algorithm applied
during the schedule generation.
For example, in case of the following matrix-matrix multiplication,
for (i = 0; i < 1024; i++)
for (k = 0; k < 1024; k++)
for (j = 0; j < 1024; j++)
C[i][j] += A[i][k] * B[k][j];
it can produce the following schedule tree
domain: "{ Stmt_for_body6[i0, i1, i2] : 0 <= i0 <= 1023 and 0 <= i1 <= 1023 and
0 <= i2 <= 1023 }"
child:
schedule: "[{ Stmt_for_body6[i0, i1, i2] -> [(i0)] },
{ Stmt_for_body6[i0, i1, i2] -> [(i1)] },
{ Stmt_for_body6[i0, i1, i2] -> [(i2)] }]"
permutable: 1
coincident: [ 1, 1, 0 ]
The current implementation of the pattern matching optimizations relies on the
initial ordering of dimensions. Otherwise, it can produce the miscompilation
(e.g., [1]).
This patch helps to restore the initial ordering of dimensions by recreating
the band node when the corresponding conditions are satisfied.
Refs.:
[1] - https://bugs.llvm.org/show_bug.cgi?id=32500
Reviewed-by: Michael Kruse <llvm@meinersbur.de>
Differential Revision: https://reviews.llvm.org/D31741
llvm-svn: 299662
Because Polly exposes parameters that directly influence tile size
calculations, one can setup situations like divide-by-zero.
Check against a possible divide-by-zero in getMacroKernelParams
and return early.
Also assert at the end of getMacroKernelParams that the block sizes
computed for matrices are positive (>= 1).
Tags: #polly
Differential Revision: https://reviews.llvm.org/D31708
llvm-svn: 299633
The current StackColoring algorithm does not correctly handle the
situation when some, but not all paths from a BB to the entry node
cross a llvm.lifetime.start. According to an interpretation of the
language reference at
http://llvm.org/docs/LangRef.html#llvm-lifetime-start-intrinsic
this might be correct, but it would cost too much effort to handle
in StackColoring.
To be on the safe side, remove all lifetime markers even in the original
code version (they have never been copied to the optimized version)
to ensure that no path to the entry block will cross a
llvm.lifetime.start.
The same principle applies to paths the a function return and the
llvm.lifetime.end marker, so we remove them as well.
This fixes llvm.org/PR32251.
Also see the discussion at
http://lists.llvm.org/pipermail/llvm-dev/2017-March/111551.html
llvm-svn: 299585
= Change of WAR, WAW generation: =
- `buildFlow(Sink, MustSource, MaySource, Sink)` treates any flow of the form
`sink <- may source <- must source` as a *may* dependence.
- we used to call:
```lang=cpp, name=old-flow-call.cpp
Flow = buildFlow(MustWrite, MustWrite, Read, Schedule);
WAW = isl_union_flow_get_must_dependence(Flow);
WAR = isl_union_flow_get_may_dependence(Flow);
```
- This caused some WAW dependences to be treated as WAR dependences.
- Incorrect semantics.
- Now, we call WAR and WAW correctly.
== Correct WAW: ==
```lang=cpp, name=new-waw-call.cpp
Flow = buildFlow(Write, MustWrite, MayWrite, Schedule);
WAW = isl_union_flow_get_may_dependence(Flow);
isl_union_flow_free(Flow);
```
== Correct WAR: ==
```lang=cpp, name=new-war-call.cpp
Flow = buildFlow(Write, Read, MustaWrite, Schedule);
WAR = isl_union_flow_get_must_dependence(Flow);
isl_union_flow_free(Flow);
```
- We want the "shortest" WAR possible (exact dependences).
- We mark all the *must-writes* as may-source, reads as must-souce.
- Then, we ask for *must* dependence.
- This removes all the reads that flow through a *must-write*
before reaching a sink.
- Note that we only block ealier writes with *must-writes*. This is
intuitively correct, as we do not want may-writes to block
must-writes.
- Leaves us with direct (R -> W).
- This affects reduction generation since RED is built using WAW and WAR.
= New StrictWAW for Reductions: =
- We used to call:
```lang=cpp,name=old-waw-war-call.cpp
Flow = buildFlow(MustWrite, MustWrite, Read, Schedule);
WAW = isl_union_flow_get_must_dependence(Flow);
WAR = isl_union_flow_get_may_dependence(Flow);
```
- This *is* the right model of WAW we need for reductions, just not in general.
- Reductions need to track only *strict* WAW, without any interfering reductions.
= Explanation: Why the new WAR dependences in tests are correct: =
- We no longer set WAR = WAR - WAW
- Hence, we will have WAR dependences that were originally removed.
- These may look incorrect, but in fact make sense.
== Code: ==
```lang=llvm, name=new-war-dependence.ll
; void manyreductions(long *A) {
; for (long i = 0; i < 1024; i++)
; for (long j = 0; j < 1024; j++)
; S0: *A += 42;
;
; for (long i = 0; i < 1024; i++)
; for (long j = 0; j < 1024; j++)
; S1: *A += 42;
;
```
=== WAR dependence: ===
{ S0[1023, 1023] -> S1[0, 0] }
- Between `S0[1023, 1023]` and `S1[0, 0]`, we will have the dependences:
```lang=cpp, name=dependence-incorrect, counterexample
S0[1023, 1023]:
*-- tmp = *A (load0)--*
WAR 2 add = tmp + 42 |
*-> *A = add (store0) |
WAR 1
S1[0, 0]: |
tmp = *A (load1) |
add = tmp + 42 |
A = add (store1)<-*
```
- One may assume that WAR2 *hides* WAR1 (since store0 happens before
store1). However, within a statement, Polly has no idea about the
ordering of loads and stores.
- Hence, according to Polly, the code may have looked like this:
```lang=cpp, name=dependence-correct
S0[1023, 1023]:
A = add (store0)
tmp = A (load0) ---*
add = A + 42 |
WAR 1
S1[0, 0]: |
tmp = A (load1) |
add = A + 42 |
A = add (store1) <-*
```
- So, Polly generates (correct) WAR dependences. It does not make sense
to remove these dependences, since they are correct with respect to
Polly's model.
Reviewers: grosser, Meinersbur
tags: #polly
Differential revision: https://reviews.llvm.org/D31386
llvm-svn: 299429