Summary:
Introduce a "hybrid" `-polly-target` option to optimise code for either the GPU or CPU.
When this target is selected, PPCGCodeGeneration will attempt first to optimise a Scop. If the Scop isn't modified, it is then sent to the passes that form the CPU pipeline, i.e. IslScheduleOptimizerPass, IslAstInfoWrapperPass and CodeGeneration.
In case the Scop is modified, it is marked to be skipped by the subsequent CPU optimisation passes.
Reviewers: grosser, Meinersbur, bollu
Reviewed By: grosser
Subscribers: kbarton, nemanjai, pollydev
Tags: #polly
Differential Revision: https://reviews.llvm.org/D34054
llvm-svn: 306863
This patch aims to implement the option of allocating new arrays created
by polly on heap instead of stack. To enable this option, a key named
'allocation' must be written in the imported json file with the value
'heap'.
We need such a feature because in a next iteration, we will implement a
mechanism of maximal static expansion which will need a way to allocate
arrays on heap. Indeed, the expansion is very costly in terms of memory
and doing the allocation on stack is not worth considering.
The malloc and the free are added respectively at polly.start and
polly.exiting such that there is no use-after-free (for instance in case
of Scop in a loop) and such that all memory cells allocated with a
malloc are free'd when we don't need them anymore.
We also add :
- In the class ScopArrayInfo, we add a boolean as member called IsOnHeap
which represents the fact that the array in allocated on heap or not.
- A new branch in the method allocateNewArrays in the ISLNodeBuilder for
the case of heap allocation. allocateNewArrays now takes a BBPair
containing polly.start and polly.exiting. allocateNewArrays takes this
two blocks and add the malloc and free calls respectively to
polly.start and polly.exiting.
- As IntPtrTy for the malloc call, we use the DataLayout one.
To do that, we have modified :
- createScopArrayInfo and getOrCreateScopArrayInfo such that it returns
a non-const SAI, in order to be able to call setIsOnHeap in the
JSONImporter.
- executeScopConditionnaly such that it return both start block and end
block of the scop, because we need this two blocs to be able to add
the malloc and the free calls at the right position.
Differential Revision: https://reviews.llvm.org/D33688
llvm-svn: 306540
Before we would 'guess' the correct location for the MergeBlock
that got introduced when executing a Scop conditionally. This
implicitly depends on the situation that at this point during
CodeGen there will be nothing between polly.start and polly.exiting.
With this commit we explicitly state that we want the block that
directly follows polly.exiting.
llvm-svn: 306398
- In D33414, if any function call was found within a kernel, we would bail out.
- This is an over-approximation. This patch changes this by allowing the
`llvm.sqrt.*` family of intrinsics.
- This introduces an additional step when creating a separate llvm::Module
for a kernel (GPUModule). We now copy function declarations from the
original module to new module.
- We also populate IslNodeBuilder::ValueMap so it replaces the function
references to the old module to the ones in the new module
(GPUModule).
Differential Revision: https://reviews.llvm.org/D34145
llvm-svn: 306284
This commit returns both the start and the exit block that are created
by executeScopConditionally.
In a future commit we will make use of the exit block. Before we would
have to use the implicit property that there won't be any code generated
between polly.start and polly.exiting at the time of use to find the
correct block ('polly.exiting').
All usage location are semantically unchanged.
llvm-svn: 306283
The condition that disallowed code generation in PPCGCodeGeneration with
invariant loads is not required. I haven't been able to construct a
counterexample where this generates invalid code.
Differential Revision: https://reviews.llvm.org/D34604
llvm-svn: 306245
Ensure that all array base pointers are assigned before generating
aliasing metadata by allocating new arrays beforehand.
Before this patch, getBasePtr() returned nullptr for new arrays because
the arrays were created at a later point. Nullptr did not match to any
array after the created array base pointers have been assigned and when
the loads/stores are generated.
llvm-svn: 305675
In `PPCGCodeGeneration`, we try to take the references of every `Value`
that is used within a Scop to offload to the kernel. This occurs in
`GPUNodeBuilder::createLaunchParameters`.
This breaks if one of the values is a function pointer, since one of
these cases will trigger:
1. We try to to take the references of an intrinsic function, and this
breaks at `verifyModule`, since it is illegal to take the reference of
an intrinsic.
2. We manage to take the reference to a function, but this fails at
`verifyModule` since the function will not be present in the module that
is created in the kernel.
3. Even if `verifyModule` succeeds (which should not occur), we would
then try to call a *host function* from the *device*, which is
illegal runtime behaviour.
So, we disable this entire range of possibilities by simply not allowing
function references within a `Scop` which corresponds to a kernel.
However, note that this is too conservative. We *can* allow intrinsics
within kernels if the backend can lower the intrinsic correctly. For
example, an intrinsic like `llvm.powi.*` can actually be lowered by the `NVPTX`
backend.
We will now gradually whitelist intrinsics which are known to be safe.
Differential Revision: https://reviews.llvm.org/D33414
llvm-svn: 305185
Summary:
The RegionGenerator traditionally kept a BlockMap that mapped from original
basic blocks to newly generated basic blocks. With the introduction of partial
writes such a 1:1 mapping is not possible any more, as a single basic block
can be code generated into multiple basic blocks. Hence, depending on the use
case we need to either use the first basic block or the last basic block.
This is intended to address the last four cases of incorrect code generation
in our AOSP buildbot and hopefully should turn it green.
Reviewers: Meinersbur, bollu, gareevroman, efriedma, huihuiz, sebpop, simbuerg
Reviewed By: Meinersbur
Subscribers: pollydev, llvm-commits
Tags: #polly
Differential Revision: https://reviews.llvm.org/D33767
llvm-svn: 304808
- Add a counter that is incremented once on exit from a scop.
- Test cases got split into two: one to test the cycles, and another one
to test trip counts.
- Sample output:
```name=sample-output.txt
scop function, entry block name, exit block name, total time, trip count
warmup, %entry.split, %polly.merge_new_and_old, 5180, 1
f, %entry.split, %polly.merge_new_and_old, 409944, 500
g, %entry.split, %polly.merge_new_and_old, 1226, 1
```
Differential Revision: https://reviews.llvm.org/D33822
llvm-svn: 304543
We should bail out if performance monitoring is not supported, since
we would have no information to print per-scop, and `FinalStartBB`,
`ReturnFromFinal` would be `nullptr`.
Assert that these are not `nullptr` if performance monitoring is supported.
llvm-svn: 304529
Previously, we would generate one performance counter for all scops.
Now, we generate both the old information, as well as a per-scop
performance counter to generate finer grained information.
This patch needed a way to generate a unique name for a `Scop`.
The start region, end region, and function name combined provides a
unique `Scop` name. So, `Scop` has a new public API to provide its start
and end region names.
Differential Revision: https://reviews.llvm.org/D33723
llvm-svn: 304528
For when statements do not contain all instructions of a BasicBlock
anymore, the block generator needs to go through the explicit list of
instructions it contains.
Contributed-by: Nandini Singhal <cs15mtech01004@iith.ac.in>
Differential Revision: https://reviews.llvm.org/D33653
llvm-svn: 304502
A partial write is a write where the domain of the values written is a subset of
the execution domain of the parent statement containing the write. Originally,
we directly checked this subset relation whereas it is indeed only important
that the subset relation holds for the parameter values that are known to be
valid in the execution context of the scop. We update our check to avoid the
unnecessary introduction of partial writes in situations where the write appears
to be partial without context information, but where context information allows
us to understand that a full write can be generated.
This change fixes (hides) a recent regression introduced in r303517, which broke
our AOSP builds. The part that is correctly fixed in this change is that we do
not any more unnecessarily generate a partial write. This is good performance
wise and, as we currently do not yet explicitly introduce partial writes in the
default configuration, this also hides possible bugs in the partial writes
implementation. The crashes that we have originally seen were caused by such
a bug, where partial writes were incorrectly generated in region statements. An
additional patch in a subsequent commit is needed to address this problem.
Reported-by: Reported-by: Eli Friedman <efriedma@codeaurora.org>
Differential Revision: https://reviews.llvm.org/D33759
llvm-svn: 304398
Summary: To move CG to the new PM I outlined the various helper that were previously members of the CG class into free static functions. The CG class itself I moved into a header, which is required because we need to include it in `RegisterPasses` eventually.
Reviewers: grosser, Meinersbur
Reviewed By: grosser
Subscribers: pollydev, llvm-commits, sanjoy
Tags: #polly
Differential Revision: https://reviews.llvm.org/D33423
llvm-svn: 303624
Summary: This patch ports IslAst to the new PM. The change is mostly straightforward. The only major modification required is making IslAst move-only, to correctly manage the isl resources it owns.
Reviewers: grosser, Meinersbur
Reviewed By: grosser
Subscribers: nemanjai, pollydev, llvm-commits
Tags: #polly
Differential Revision: https://reviews.llvm.org/D33422
llvm-svn: 303622
The SCEVs of loops surrounding the escape users of a merge blocks are
forgotten, so that loop trip counts based on old values can be revoked.
This fixes llvm.org//PR32536
Contributed-by: Baranidharan Mohan <mbdharan@gmail.com>
Differential Revision: https://reviews.llvm.org/D33195
llvm-svn: 303561
Allow the BlockGenerator to generate memory writes that are not defined
over the complete statement domain, but only over a subset of it. It
generates a condition that evaluates to 1 if executing the subdomain,
and only then execute the access.
Only write accesses are supported. Read accesses would require a PHINode
which has a value if the access is not executed.
Partial write makes DeLICM able to apply mappings that are not defined
over the entire domain (for instance, a branch that leaves a loop with
a PHINode in its header; a MemoryKind::PHI write when leaving is never
read by its PHI read).
Differential Revision: https://reviews.llvm.org/D33255
llvm-svn: 303517
- We use the outermost dimension of arrays since we need this
information to generate GPU transfers.
- In general, if we do not know the outermost dimension of the array
(because the indexing expression is non-affine, for example) then we
simply cannot generate transfer code.
- However, for Fortran arrays, we can use the Fortran array
representation which stores the dimensions of all arrays.
- This patch uses the Fortran array representation to generate code that
computes the outermost dimension size.
Differential Revision: https://reviews.llvm.org/D32967
llvm-svn: 303429
Summary:
Implements PR889
Removing the virtual table pointer from Value saves 1% of RSS when doing
LTO of llc on Linux. The impact on time was positive, but too noisy to
conclusively say that performance improved. Here is a link to the
spreadsheet with the original data:
https://docs.google.com/spreadsheets/d/1F4FHir0qYnV0MEp2sYYp_BuvnJgWlWPhWOwZ6LbW7W4/edit?usp=sharing
This change makes it invalid to directly delete a Value, User, or
Instruction pointer. Instead, such code can be rewritten to a null check
and a call Value::deleteValue(). Value objects tend to have their
lifetimes managed through iplist, so for the most part, this isn't a big
deal. However, there are some places where LLVM deletes values, and
those places had to be migrated to deleteValue. I have also created
llvm::unique_value, which has a custom deleter, so it can be used in
place of std::unique_ptr<Value>.
I had to add the "DerivedUser" Deleter escape hatch for MemorySSA, which
derives from User outside of lib/IR. Code in IR cannot include MemorySSA
headers or call the MemoryAccess object destructors without introducing
a circular dependency, so we need some level of indirection.
Unfortunately, no class derived from User may have any virtual methods,
because adding a virtual method would break User::getHungOffOperands(),
which assumes that it can find the use list immediately prior to the
User object. I've added a static_assert to the appropriate OperandTraits
templates to help people avoid this trap.
Reviewers: chandlerc, mehdi_amini, pete, dberlin, george.burgess.iv
Reviewed By: chandlerc
Subscribers: krytarowski, eraman, george.burgess.iv, mzolotukhin, Prazek, nlewycky, hans, inglorion, pcc, tejohnson, dberlin, llvm-commits
Differential Revision: https://reviews.llvm.org/D31261
llvm-svn: 303362
Summary: This is a proof of concept of how to port polly-passes to the new PassManager architecture. This approach works ootb for Function-Passes, but might not be directly applicable to Scop/Region-Passes. While we could just run the Analyses/Transforms over functions instead, we'd surrender the nice pipelining behaviour we have now.
Reviewers: Meinersbur, grosser
Reviewed By: grosser
Subscribers: pollydev, sanjoy, nemanjai, llvm-commits
Tags: #polly
Differential Revision: https://reviews.llvm.org/D31459
llvm-svn: 302902
Today Polly generates induction variable in this way:
polly.indvar = phi 0, polly.indvar.next
...
polly.indvar.next = polly.indvar + stide
polly.loop_cond = predicate polly.indvar, (UB - stride)
Instead of:
polly.indvar = phi 0, polly.indvar.next
...
polly.indvar.next = polly.indvar + stide
polly.loop_cond = predicate polly.indvar.next, UB
The way Polly generate induction variable cause some problem in the indvar simplify pass.
This patch make polly generate the later form, by assuming the induction variable never overflow
Differential Revision: https://reviews.llvm.org/D33089
llvm-svn: 302866
Summary:
In case two arrays share base pointers in the same invariant load equivalence
class, we canonicalize all memory accesses to the first of these arrays
(according to their order in the equivalence class).
This enables us to optimize kernels such as boost::ublas by ensuring that
different references to the C array are interpreted as accesses to the same
array. Before this change the runtime alias check for ublas would fail, as it
would assume models of the C array with differing (but identically valued) base
pointers would reference distinct regions of memory whereas the referenced
memory regions were indeed identical.
As part of this change we remove most of the MemoryAccess::get*BaseAddr
interface. We removed already all references to get*BaseAddr in previous
commits to ensure that no code relies on matching base pointers between
memory accesses and scop arrays -- except for three remaining uses where we
need the original base pointer. We document for these situations that
MemoryAccess::getOriginalBaseAddr may return a base pointer that is distinct
to the base pointer of the scop array referenced by this memory access.
Reviewers: sebpop, Meinersbur, zinob, gareevroman, pollydev, huihuiz, efriedma, jdoerfert
Reviewed By: Meinersbur
Subscribers: etherzhhb
Tags: #polly
Differential Revision: https://reviews.llvm.org/D28518
llvm-svn: 302636
Summary: PPCGCodeGeneration now attaches the size of the kernel launch parameters at the end of the parameter list. For the existing CUDA Runtime, this gets ignored, but the OpenCL Runtime knows to check for kernel-argument size at the end of the parameter list. (The resulting parameters list is twice as long. This has been accounted for in the corresponding test cases).
Reviewers: grosser, Meinersbur, bollu
Reviewed By: bollu
Subscribers: nemanjai, yaxunl, Anastasia, pollydev, llvm-commits
Tags: #polly
Differential Revision: https://reviews.llvm.org/D32961
llvm-svn: 302515
Summary:
When compiling for GPU, one can now choose to compile for OpenCL or CUDA,
with the corresponding polly-gpu-runtime flag (libopencl / libcudart). The
GPURuntime library (GPUJIT) has been extended with the OpenCL Runtime library
for that purpose, correctly choosing the corresponding library calls to the
option chosen when compiling (via different initialization calls).
Additionally, a specific GPU Target architecture can now be chosen with -polly-gpu-arch (only nvptx64 implemented thus far).
Reviewers: grosser, bollu, Meinersbur, etherzhhb, singam-sanjay
Reviewed By: grosser, Meinersbur
Subscribers: singam-sanjay, llvm-commits, pollydev, nemanjai, mgorny, yaxunl, Anastasia
Tags: #polly
Differential Revision: https://reviews.llvm.org/D32431
llvm-svn: 302379
This reverts commit 17a84e414adb51ee375d14836d4c2a817b191933.
Patches should have been submitted in the order of:
1. D32852
2. D32854
3. D32431
I mistakenly pushed D32431(3) first. Reverting to push in the correct
order.
llvm-svn: 302217
Summary:
When compiling for GPU, one can now choose to compile for OpenCL or CUDA,
with the corresponding polly-gpu-runtime flag (libopencl / libcudart). The
GPURuntime library (GPUJIT) has been extended with the OpenCL Runtime library
for that purpose, correctly choosing the corresponding library calls to the
option chosen when compiling (via different initialization calls).
Additionally, a specific GPU Target architecture can now be chosen with -polly-gpu-arch (only nvptx64 implemented thus far).
Reviewers: grosser, bollu, Meinersbur, etherzhhb, singam-sanjay
Reviewed By: grosser, Meinersbur
Subscribers: singam-sanjay, llvm-commits, pollydev, nemanjai, mgorny, yaxunl, Anastasia
Tags: #polly
Differential Revision: https://reviews.llvm.org/D32431
llvm-svn: 302215
If a ScopStmt references a (scalar) value, there are multiple
possibilities where this value can come. The decision about what kind of
use it is must be handled consistently at different places, which can be
error-prone. VirtualUse is meant to centralize the handling of the
different types of value uses.
This patch makes ScopBuilder and CodeGeneration use VirtualUse. This
already helps to show inconsistencies with the value handling. In order
to keep this patch NFC, exceptions to the general rules are added.
These might be fixed later if they turn to problems. Overall, this
should result in fewer post-codegen IR-verification errors, but instead
assertion failures in `getNewValue` that are closer to the actual error.
Differential Revision: https://reviews.llvm.org/D32667
llvm-svn: 302157
As has been reported in the previous commit, codegen verification can result in
quadratic compile time increases for large functions with many scops. This is
certainly not something we would like to have in the Polly default
configuration. Hence, we disable codegen verification by default -- also to see
if this resolves some of the compilation timeouts we currently see on the AOSP
buildbots. We still leave this feature in Polly as it has shown _very_ useful
for debugging. In fact, we may want to have a discussion if we can bring this
feature back in a way that does not impact compilation time so much.
Thanks to Eli Friedman <efriedma@codeaurora.org> for reporting this issue and
for providing the test case in the previous commit (where I forgot to
acknowledge him).
llvm-svn: 301670
Before this change, we always tried to verify the function and printed
verification errors, but just did not abort in case -polly-codegen-verify=false
was set and verification failed. As verification can become very cosly -- for
large functions with many scops we may verify the very same function very often
-- this can affect compile time very negatively. Hence, we respect the
-polly-codegen-verify flag with this check, ensuring that no verification is run
if -polly-codegen-verify=false.
This reduces code generation time from 26 seconds to 4 seconds on the test
case below with -polly-codegen-verify=false:
struct X { int x; };
void a();
#define SIG (int x, X **y, X **z)
typedef void (*fn)SIG;
#define FN { for (int i = 0; i < x; ++i) { (*y)[i].x += (*z)[i].x; } a(); }
#define FN5 FN FN FN FN FN
#define FN25 FN5 FN5 FN5 FN5
#define FN125 FN25 FN25 FN25 FN25 FN25
#define FN250 FN125 FN125
#define FN1250 FN250 FN250 FN250 FN250 FN250
void x SIG { FN1250 }
llvm-svn: 301669
generation.
This needs changes to GPURuntime to expose synchronization between host
and device.
1. Needs better function naming, I want a better name than
"getOrCreateManagedDeviceArray"
2. DeviceAllocations is used by both the managed memory and the
non-managed memory path. This exploits the fact that the two code paths
are never run together. I'm not sure if this is the best design decision
Reviewed by: PhilippSchaad
Tags: #polly
Differential Revision: https://reviews.llvm.org/D32215
llvm-svn: 301640
Added a small change to the way pointer arguments are set in the kernel
code generation. The way the pointer is retrieved now, specifically requests
global address space to be annotated. This is necessary, if the IR should be
run through NVPTX to generate OpenCL compatible PTX.
The changes do not affect the PTX Strings generated for the CUDA target
(nvptx64-nvidia-cuda), but are necessary for OpenCL (nvptx64-nvidia-nvcl).
Additionally, the data layout has been updated to what the NVPTX Backend requests/recommends.
Contributed-by: Philipp Schaad
Reviewers: Meinersbur, grosser, bollu
Reviewed By: grosser, bollu
Subscribers: jlebar, pollydev, llvm-commits, nemanjai, yaxunl, Anastasia
Tags: #polly
Differential Revision: https://reviews.llvm.org/D32215
llvm-svn: 301299
The current StackColoring algorithm does not correctly handle the
situation when some, but not all paths from a BB to the entry node
cross a llvm.lifetime.start. According to an interpretation of the
language reference at
http://llvm.org/docs/LangRef.html#llvm-lifetime-start-intrinsic
this might be correct, but it would cost too much effort to handle
in StackColoring.
To be on the safe side, remove all lifetime markers even in the original
code version (they have never been copied to the optimized version)
to ensure that no path to the entry block will cross a
llvm.lifetime.start.
The same principle applies to paths the a function return and the
llvm.lifetime.end marker, so we remove them as well.
This fixes llvm.org/PR32251.
Also see the discussion at
http://lists.llvm.org/pipermail/llvm-dev/2017-March/111551.html
llvm-svn: 299585
Summary:
A couple of the utilities used to analyze or build IR make explicit use of the legacy PM on their interface, to access analysis results. This patch removes the legacy PM from the interface, and just passes the required results directly.
This shouldn't introduce any function changes, although the API technically allowed to obtain two different analysis results before, one passed by reference and one through the PM. I don't believe that was ever intended, however.
Reviewers: grosser, Meinersbur
Reviewed By: grosser
Subscribers: nemanjai, pollydev, llvm-commits
Tags: #polly
Differential Revision: https://reviews.llvm.org/D31653
llvm-svn: 299423
Instead of creating the declaration ourselves, we obtain it directly from the
LLVM intrinsic definitions. This addresses a post-review comment for r299359.
Suggested-by: Hongzing Zheng <etherzhhb@gmail.com>
llvm-svn: 299360
Add support for -polly-codegen-perf-monitoring. When performance monitoring
is enabled, we emit performance monitoring code during code generation that
prints after program exit statistics about the total number of cycles executed
as well as the number of cycles spent in scops. This gives an estimate on how
useful polyhedral optimizations might be for a given program.
Example output:
Polly runtime information
-------------------------
Total: 783110081637
Scops: 663718949365
In the future, we might also add functionality to measure how much time is spent
in optimized scops and how many cycles are spent in the fallback code.
Reviewers: bollu,sebpop
Tags: #polly
Differential Revision: https://reviews.llvm.org/D31599
llvm-svn: 299359
No-alias metadata grows quadratic in the size of arrays involved, which can
become very costly for large programs. This commit bounds the number of arrays
for which we construct no-alias information to ten. This is conservatively
correct, as we just provide less information to LLVM and speeds up the compile
time of one of my internal test cases from 'does-not-terminate' to
'finishes-in-less-than-a-minute'. In the future we might try to be more clever
here, but this change should provide a good baseline.
llvm-svn: 299352
Provide an common way for testing if a statement contains something
for region and block statements. First user is
RegionGenerator::addOperandToPHI.
Suggested-by: Tobias Grosser <tobias@grosser.es>
llvm-svn: 298617
Introduce another level of alias metadata to distinguish the individual
non-aliasing accesses that have inter iteration alias-free base pointers
marked with "Inter iteration alias-free" mark nodes. It can be used to,
for example, distinguish different stores (loads) produced by unrolling of
the innermost loops and, subsequently, sink (hoist) them by LICM.
Reviewed-by: Tobias Grosser <tobias@grosser.es>
Differential Revision: https://reviews.llvm.org/D30606
llvm-svn: 298510
Map the new load to the base pointer of the invariant load hoisted load
to be able to find the alias information for it.
Reviewed-by: Tobias Grosser <tobias@grosser.es>
Differential Revision: https://reviews.llvm.org/D30605
llvm-svn: 298507
When not adding constraints on parameters using -polly-ignore-parameter-bounds,
the context may not necessarily list all parameter dimensions. To support code
generation in this situation, we now always iterate over the actual parameter
list, rather than relying on the context to list all parameter dimensions.
llvm-svn: 298197
After this change, enabling -polly-codegen-add-debug-printing in combination
with -polly-codegen-generate-expressions allows us to instrument the compiled
binaries to not only print the values stored and loaded to a given memory
access, but also to print the accessed location with array name and
per-dimension offset:
MemRef_A[3][2]
Store to 6299784: 5.000000
MemRef_A[3][3]
Load from 6299788: 0.000000
MemRef_A[3][3]
Store to 6299788: 6.000000
This can be very helpful for debugging.
llvm-svn: 298194
In commit r219005 lifetime markers have been introduced to mark the lifetime of
the OpenMP context data structure. However, their use seems incorrect and
recently caused a miscompile in ASC_Sequoia/CrystalMk after r298053 which was
not at all related to r298053. r298053 only caused a change in the loop order,
as this change resulted in a different isl internal representation which caused
the scheduler to derive a different schedule. This change then caused the IR to
change, which apparently created a pattern in which LLVM exploites the lifetime
markers. It seems we are using the OpenMP context outside of the lifetime
markers. Even though CrystalMk could probably be fixed by expanding the scope of
the lifetime markers, it is not clear what happens in case the OpenMP function
call is in a loop which will cause a sequence of starting and ending lifetimes.
As it is unlikely that the lifetime markers give any performance benefit, we
just drop them to remove complexity.
llvm-svn: 298192
If a SCoP is most probably sequential, then it's better to run it on a CPU.
Hence, there's no point in running it on a GPU.
Reviewers: grosser
Subscribers: nemanjai
Tags: #polly
Contributed-by: Singapuram Sanjay <singapuram.sanjay@gmail.com>
Differential Revision: https://reviews.llvm.org/D30864
llvm-svn: 297578
This new pass removes unnecessary accesses and writes. It currently
supports 2 simplifications, but more are planned.
It removes write accesses that write a loaded value back to the location
it was loaded from. It is a typical artifact from DeLICM. Removing it
will get rid of bogus dependencies later in dependency analysis.
It also removes statements without side-effects. ScopInfo already
removes these, but the removal of unnecessary writes can result in
more side-effect free statements.
Differential Revision: https://reviews.llvm.org/D30820
llvm-svn: 297473
We can not perform the dependence analysis and, consequently, the parallel
code generation in case the schedule tree contains extension nodes.
Reviewed-by: Tobias Grosser <tobias@grosser.es>
Differential Revision: https://reviews.llvm.org/D30394
llvm-svn: 296325
Marking a pass as preserved is necessary if any Polly pass uses it, even
if it is not preserved within the generated code. Not marking it would
cause the the Polly pass chain to be interrupted. It is not used by any
Polly pass anymore, hence we can remove all references to it.
llvm-svn: 295983
When generating code in the BlockGenerator we copy all (interesting)
instructions and keep track of the new values in a basic block map. To obtain
the original llvm::Value that belongs to a load memory access, we use
getAccessValue() instead of getOriginalBaseAddr(). The former always references
the instruction we use to load values from. The latter, on the other hand,
is obtaine from the corresponding ScopArrayInfo and would not be unique in
case ScopArrayInfo objects at some point allow memory accesses with different
base addresses.
This change is an update on r294566, which only clarified that we need the
original memory access, but where we still remained dependent to have one
base pointer per scop.
This change removes unnecessary uses of MemoryAddress::getOriginalBaseAddr() in
preparation for https://reviews.llvm.org/D28518.
llvm-svn: 294669
Instead of iterating over statements and their memory accesses to extract the
set of available base pointers, just directly iterate over all ScopArray
objects. This reflects more the actual intend of the code: collect all arrays
(and their base pointers) to emit alias information that specifies that accesses
to different arrays cannot alias.
This change removes unnecessary uses of MemoryAddress::getBaseAddr() in
preparation for https://reviews.llvm.org/D28518.
llvm-svn: 294574
Before this change we used the name of the base pointer to mark reductions. This
is imprecise as the canonical reference is the ScopArray itself and not the
basepointer of a reduction. Using the base pointer of reductions is problematic
in cases where a single ScopArray is referenced through two different base
pointers.
This change removes unnecessary uses of MemoryAddress::getBaseAddr() in
preparation for https://reviews.llvm.org/D28518.
llvm-svn: 294568
When regenerating code in the BlockGenerator we copy instructions that may
references scalar values, for which the new value of a given scalar is looked up
in BBMap using the original scalar llvm::Value as index. It is consequently
necessary that (re)loaded scalar values are made available in BBMap using the
original llvm::Value as key independently if the llvm::Value was (re)loaded from
the original scalar or a new access function has been specified that caused the
value to be reloaded from an array with a differnet base address. We make this
clear by using MemoryAccess::getOriginalBaseAddr() instead of
MemoryAccess::getBaseAddr() as index to BBMap.
This change removes unnecessary uses of MemoryAddress::getBaseAddr() in
preparation for https://reviews.llvm.org/D28518.
llvm-svn: 294566
Instead of keeping two separate maps from Value to Allocas, one for
MemoryType::Value and the other for MemoryType::PHI, we introduce a single map
from ScopArrayInfo to the corresponding Alloca. This change is intended, both as
a general simplification and cleanup, but also to reduce our use of
MemoryAccess::getBaseAddr(). Moving away from using getBaseAddr() makes sure
we have only a single place where the array (and its base pointer) for which we
generate code for is specified, which means we can more easily introduce new
access functions that use a different ScopArrayInfo as base. We already today
experiment with modifiable access functions, so this change does not address
a specific bug, but it just reduces the scope one needs to reason about.
Another motivation for this patch is https://reviews.llvm.org/D28518, where
memory accesses with different base pointers could possibly be mapped to a
single ScopArrayInfo object. Such a mapping is currently not possible, as we
currently generate alloca instructions according to the base addresses of the
memory accesses, not according to the ScopArrayInfo object they belong to. By
making allocas ScopArrayInfo specific, a mapping to a single ScopArrayInfo
object will automatically mean that the same stack slot is used for these
arrays. For D28518 this is not a problem, as only MemoryType::Array objects are
mapping, but resolving this inconsistency will hopefully avoid confusion.
llvm-svn: 293374
Before this change we created an additional reload in the copy of the incoming
block of a PHI node to reload the incoming value, even though the necessary
value has already been made available by the normally generated scalar loads.
In this change, we drop the code that generates this redundant reload and
instead just reuse the scalar value already available.
Besides making the generated code slightly cleaner, this change also makes sure
that scalar loads go through the normal logic, which means they can be remapped
(e.g. to array slots) and corresponding code is generated to load from the
remapped location. Without this change, the original scalar load at the
beginning of the non-affine region would have been remapped, but the redundant
scalar load would continue to load from the old PHI slot location.
It might be possible to further simplify the code in addOperandToPHI,
but this would not only mean to pull out getNewValue, but to also change the
insertion point update logic. As this did not work when trying it the first
time, this change is likely not trivial. To not introduce bugs last minute, we
postpone further simplications to a subsequent commit.
We also document the current behavior a little bit better.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D28892
llvm-svn: 292486
Making certain values 'const' to just cast it away a little later mainly
obfuscates the code. Hence, we just drop the 'const' parts.
Suggested-by: Michael Kruse <llvm@meinersbur.de>
llvm-svn: 292480
Summary:
Instead of forbidding such access functions completely, we verify that their
base pointer has been hoisted and only assert in case the base pointer was
not hoisted.
I was trying for a little while to get a test case that ensures the assert is
correctly fired in case of invariant load hoisting being disabled, but I could
not find a good way to do so, as llvm-lit immediately aborts if a command
yields a non-zero return value. As we do not generally test our asserts,
not having a test case here seems OK.
This resolves http://llvm.org/PR31494
Suggested-by: Michael Kruse <llvm@meinersbur.de>
Reviewers: efriedma, jdoerfert, Meinersbur, gareevroman, sebpop, zinob, huihuiz, pollydev
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D28798
llvm-svn: 292213
To benefit of the type safety guarantees of C++11 typed enums, which would have
caught the type mismatch fixed in r291960, we make MemoryKind a typed enum.
This change also allows us to drop the 'MK_' prefix and to instead use the more
descriptive full name of the enum as prefix. To reduce the amount of typing
needed, we use this opportunity to move MemoryKind from ScopArrayInfo to a
global scope, which means the ScopArrayInfo:: prefix is not needed. This move
also makes historically sense. In the beginning of Polly we had different
MemoryKind enums in both MemoryAccess and ScopArrayInfo, which were later
canonicalized to one. During this canonicalization we just choose the enum in
ScopArrayInfo, but did not consider to move this shared enum to global scope.
Reviewed-by: Michael Kruse <llvm@meinersbur.de>
Differential Revision: https://reviews.llvm.org/D28090
llvm-svn: 292030
Aligning data to cache lines boundaries helps to avoid overheads related to
an access to it ([1]). This patch aligns newly created arrays and adds an
option to specify the first level cache line size. By default we use 64 bytes,
which is a typical cache-line size ([2]).
In case of Intel Core i7-3820 SandyBridge and the following options,
clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME
-march=native -mllvm -polly -mllvm -polly-pattern-matching-based-opts=true
-DPOLYBENCH_USE_SCALAR_LB -mllvm -polly-target-cache-level-associativity=8,8
-mllvm -polly-target-cache-level-sizes=32768,262144 -mllvm
-polly-target-latency-vector-fma=8
it helps to improve the performance from 11.303 GFlops/sec (39,247% of
theoretical peak) to 12.63 GFlops/sec (43,8542% of theoretical peak).
Refs.:
[1] - http://www.alexonlinux.com/aligned-vs-unaligned-memory-access
[2] - http://igoro.com/archive/gallery-of-processor-cache-effects/
Differential Revision: https://reviews.llvm.org/D28020
Reviewed-by: Tobias Grosser <tobias@grosser.es>
llvm-svn: 290253
In '[DBG] Allow to emit the RTC value at runtime' the diagnostics were printed
without a newline at the end of each diagnostic. We add such a newline to
improve readability.
llvm-svn: 288323
Introduce the new flag -polly-codegen-generate-expressions which forces Polly
to code generate AST expressions instead of using our SCEV based access
expression generation even for cases where the original memory access relation
was not changed and the SCEV based access expression could be code generated
without any issue.
This is an experimental option for better testing the isl ast expression
generation. The default behavior of Polly remains unchanged. We also exclude
a couple of cases for which the AST expression is not yet working.
llvm-svn: 287694
The new command line flag "polly-codegen-emit-rtc-print" can be used to
place a "printf" in the generated code that will print the RTC value and
the overflow state.
llvm-svn: 287265
Providing the context to the ast generator allows for additional simplifcations
and -- more importantly -- allows to generate loops with only partially bounded
domains, assuming the domains are bounded for all parameter configurations
that are valid as defined by the context.
This change fixes the crash reported in http://llvm.org/PR30956
The original reason why we did not include the context when generating an
AST was that CLooG and later isl used to sometimes transfer some of the
constraints that bound the size of parameters from the context into the
generated AST. This resulted in operations with very large constants, which
sometimes introduced problematic integer overflows. The latest versions of
the isl AST generator are careful to not introduce such constants.
Reported-by: Eli Friedman <efriedma@codeaurora.org>
llvm-svn: 286442
This makes polly generate a CFG which is closer to what we want
in LLVM IR, with a loop preheader for the original loop. This is
just a cleanup, but it exposes some fragile assumptions.
I'm not completely happy with the changes related to expandCodeFor;
RTCBB->getTerminator() is basically a random insertion point which
happens to work due to the way we generate runtime checks. I'm not
sure what the right answer looks like, though.
Differential Revision: https://reviews.llvm.org/D26053
llvm-svn: 285864
Summary: Iterating over SeenBlocks which is a SmallPtrSet results in non-determinism in codegen
Reviewers: jdoerfert, zinob, grosser
Tags: #polly
Differential Revision: https://reviews.llvm.org/D25778
llvm-svn: 284622
Under some conditions MK_Value read accessed where converted to MK_ExitPHI read
accessed. This is unexpected because MK_ExitPHI read accesses are implicit after
the scop execution. This behaviour was introduced in r265261, which fixed a
failed assertion/crash in CodeGen.
Instead, we fix this failure in CodeGen itself. createExitPHINodeMerges(),
despite its name, also handles accesses of kind MK_Value, only to skip them
because they access values that are usually not PHI nodes in the SCoP region's
exit block. Except in the situation observed in r265261.
Do not convert value accessed to ExitPHI accesses and do not handle
value accesses like ExitPHI accessed in CodeGen anymore.
llvm-svn: 284023
The core of the change is supposed to be NFC, however it also fixes
what I believe was an undefined behavior when calling:
va_start(ValueArgs, Desc);
with Desc being a StringRef.
Differential Revision: https://reviews.llvm.org/D25342
llvm-svn: 283671
Currently Polly cannot generate code for index expressions if the base pointer
is computed within the scop. The base pointer must be generated as well, but
there is no code that triggers that.
Add an assertion to detect when this would occur and miscompile. The IR verifier
should catch it as well.
llvm-svn: 282893
generateScalarLoad() and generateScalarStore() are used for explicit (MK_Array)
memory accesses, therefore the method names were misleading. The names also
were similar to generateScalarLoads() and generateScalarStores() (plural forms)
which indeed handle scalar accesses. Presumbly, they were originally named to
contrast VectorBlockGenerator::generateLoad().
Rename the two methods to generateArrayLoad(),
respectively generateArrayStore().
llvm-svn: 282861
The code generator always adds unconditional LoadInst and StoreInst, hence the
MemoryAccess must be defined over all statement instances.
llvm-svn: 282853
In case sequential kernels are found deeper in the loop tree than any parallel
kernel, the overall scop is probably mostly sequential. Hence, run it on the
CPU.
llvm-svn: 281849
Offloading to a GPU is only beneficial if there is a sufficient amount of
compute that can be accelerated. Many kernels just have a very small number
of dynamic compute, which means GPU acceleration is not beneficial. We
compute at run-time an approximation of how many dynamic instructions will be
executed and fall back to CPU code in case this number is not sufficiently
large. To keep the run-time checking code simple, we over-approximate the
number of instructions executed in each statement by computing the volume of
the rectangular hull of its iteration space.
llvm-svn: 281848
We may generate GPU kernels that store into scalars in case we run some
sequential code on the GPU because the remaining data is expected to already be
on the GPU. For these kernels it is important to not keep the scalar values
in thread-local registers, but to store them back to the corresponding device
memory objects that backs them up.
We currently only store scalars back at the end of a kernel. This is only
correct if precisely one thread is executed. In case more than one thread may
be run, we currently invalidate the scop. To support such cases correctly,
we would need to always load and store back from a corresponding global
memory slot instead of a thread-local alloca slot.
llvm-svn: 281838
Our alias checks precisely check that the minimal and maximal accessed elements
do not overlap in a kernel. Hence, we must ensure that our host <-> device
transfers do not touch additional memory locations that are not covered in
the alias check. To ensure this, we make sure that the data we copy for a
given array is only the data from the smallest element accessed to the largest
element accessed.
We also adjust the size of the array according to the offset at which the array
is actually accessed.
An interesting result of this is: In case array are accessed with negative
subscripts ,e.g., A[-100], we automatically allocate and transfer _more_ data to
cover the full array. This is important as such code indeed exists in the wild.
llvm-svn: 281611
This is the fourth patch to apply the BLIS matmul optimization pattern on matmul
kernels (http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf).
BLIS implements gemm as three nested loops around a macro-kernel, plus two
packing routines. The macro-kernel is implemented in terms of two additional
loops around a micro-kernel. The micro-kernel is a loop around a rank-1
(i.e., outer product) update. In this change we perform copying to created
arrays, which is the last step to implement the packing transformation.
Reviewed-by: Tobias Grosser <tobias@grosser.es>
Differential Revision: https://reviews.llvm.org/D23260
llvm-svn: 281441
We do not need the size of the outermost dimension in most cases, but if we
allocate memory for newly created arrays, that size is needed.
Reviewed-by: Michael Kruse <llvm@meinersbur.de>
Differential Revision: https://reviews.llvm.org/D23991
llvm-svn: 281234
Instead of aborting, we now bail out gracefully in case the kernel IR we
generate is invalid. This can currently happen in case the SCoP stores
pointer values, which we model as arrays, as data values into other arrays. In
this case, the original pointer value is not available on the device and can
consequently not be stored. As detecting this ahead of time is not so easy, we
detect these situations after the invalid IR has been generated and bail out.
llvm-svn: 281193
If these arrays have never been accessed we failed to derive an upper bound
of the accesses and consequently a size for the outermost dimension. We
now explicitly check for empty access sets and then just use zero as size
for the outermost dimension.
llvm-svn: 281165
LLVM's coding guideline suggests to not use @brief for one-sentence doxygen
comments to improve readability. Switch this once and for all to ensure people
do not copy @brief comments from other parts of Polly, when writing new code.
llvm-svn: 280468
Change the code around setNewAccessRelation to allow to use a an existing array
element for memory instead of an ad-hoc alloca. This facility will be used for
DeLICM/DeGVN to convert scalar dependencies into regular ones.
The changes necessary include:
- Make the code generator use the implicit locations instead of the alloca ones.
- A test case
- Make the JScop importer accept changes of scalar accesses for that test case.
- Adapt the MemoryAccess interface to the fact that the MemoryKind can change.
They are named (get|is)OriginalXXX() to get the status of the memory access
before any change by setNewAccessRelation() (some properties such as
getIncoming() do not change even if the kind is changed and are still
required). To get the modified properties, there is (get|is)LatestXXX(). The
old accessors without Original|Latest become synonyms of the
(get|is)OriginalXXX() to not make functional changes in unrelated code.
Differential Revision: https://reviews.llvm.org/D23962
llvm-svn: 280408
We already invalidated a couple of critical values earlier on, but we now
invalidate all instructions contained in a scop after the scop has been code
generated. This is necessary as later scops may otherwise obtain SCEV
expressions that reference values in the earlier scop that before dominated
the later scop, but which had been moved into the conditional branch and
consequently do not dominate the later scop any more. If these very values are
then used during code generation of the later scop, we generate used that are
dominated by the values they use.
This fixes: http://llvm.org/PR28984
llvm-svn: 279047
To do so we change the way array exents are computed. Instead of the precise
set of memory locations accessed, we now compute the extent as the range between
minimal and maximal address in the first dimension and the full extent defined
by the sizes of the inner array dimensions.
We also move the computation of the may_persist region after the construction
of the arrays, as it relies on array information. Without arrays being
constructed no useful information is computed at all.
llvm-svn: 278212
Ensure the right scalar allocations are used as the host location of data
transfers. For the device code, we clear the allocation cache before device
code generation to be able to generate new device-specific allocation and
we need to make sure to add back the old host allocations as soon as the
device code generation is finished.
llvm-svn: 278126
This increases the readability of the IR and also clarifies that the GPU
inititialization is executed _after_ the scalar initialization which needs
to before the code of the transformed scop is executed.
Besides increased readability, the IR should not change. Specifically, I
do not expect any changes in program semantics due to this patch.
llvm-svn: 278125
In case some code -- not guarded by control flow -- would be emitted directly in
the start block, it may happen that this code would use uninitalized scalar
values if the scalar initialization is only emitted at the end of the start
block. This is not a problem today in normal Polly, as all statements are
emitted in their own basic blocks, but Polly-ACC emits host-to-device copy
statements into the start block.
Additional Polly-ACC test coverage will be added in subsequent changes that
improve the handling of PHI nodes in Polly-ACC.
llvm-svn: 278124
After having generated the code for a ScopStmt, we run a simple dead-code
elimination that drops all instructions that are known to be and remain unused.
Until this change, we only considered instructions for dead-code elimination, if
they have a corresponding instruction in the original BB that belongs to
ScopStmt. However, when generating code we do not only copy code from the BB
belonging to a ScopStmt, but also generate code for operands referenced from BB.
After this change, we now also considers code for dead code elimination, which
does not have a corresponding instruction in BB.
This fixes a bug in Polly-ACC where such dead-code referenced CPU code from
within a GPU kernel, which is possible as we do not guarantee that all variables
that are used in known-dead-code are moved to the GPU.
llvm-svn: 278103
When adding code that avoids to pass values used in isl expressions and
LLVM instructions twice, we forgot to make single variable passed to the
kernel available in the ValueMap that makes it usable for instructions that
are not replaced with isl ast expressions. This change adds the variable
that is passed to the kernel to the ValueMap to ensure it is available
for such use cases as well.
llvm-svn: 278039
There is no need to reset the position of the builder, as we can just continue
to insert code at the current position of the IRBuilder, which happens to
be precisely the location we reset the builder to.
llvm-svn: 278014
... instead of adding instructions at the end of the basic block the builder
is currently at. This makes it easier to reason about where IR is generated,
as with the IRBuilder there is just a single location that specificies where
IR is generated.
llvm-svn: 278013
The map is iterated over when generating the values escaping the SCoP. The
indeterministic iteration order of DenseMap causes the output IR to change at
every compilation, adding noise to comparisons.
Replace DenseMap by a MapVector to ensure the same iteration order at every
compilation.
llvm-svn: 277832
Before this commit we generated the array type in reverse order and we also
added the outermost dimension size to the new array declaration, which is
incorrect as Polly additionally assumed an additional unsized outermost
dimension, such that we had an off-by-one error in the linearization of access
expressions.
llvm-svn: 277802
These annotations ensure that the NVIDIA PTX assembler limits the number of
registers used such that we can be certain the resulting kernel can be executed
for the number of threads in a thread block that we are planning to use.
llvm-svn: 277799
Pass the content of scalar array references to the alloca on the kernel side
and do not pass them additional as normal LLVM scalar value.
llvm-svn: 277699
Otherwise, we would try to re-optimize them with Polly-ACC and possibly even
generate kernels that try to offload themselves, which does not work as the
GPURuntime is not available on the accelerator and also does not make any
sense.
llvm-svn: 277589
Extend the jscop interface to allow the user to export arrays. It is required
that already existing arrays of the list of arrays correspond to arrays
of the SCoP. Each array that is appended to the list will be newly created.
Furthermore, we allow the user to modify access expressions to reference
any array in case it has the same element type.
Reviewed-by: Tobias Grosser <tobias@grosser.es>
Differential Revision: https://reviews.llvm.org/D22828
llvm-svn: 277263
Before this change we used the array index, which would result in us accessing
the parameter array out-of-bounds. This bug was visible for test cases where not
all arrays in a scop are passed to a given kernel.
llvm-svn: 276961
Also factor out getArraySize() to avoid code dupliciation and reorder some
function arguments to indicate the direction into which data is transferred.
llvm-svn: 276636
At the beginning of each SCoP, we allocate device arrays for all arrays
used on the GPU and we free such arrays after the SCoP has been executed.
llvm-svn: 276635
There is no need to expose the selected device at the moment. We also pass back
pointers as return values, as this simplifies the interface.
llvm-svn: 276623
This allows the finalization routine of the IslNodeBuilder to be overwritten
by derived classes. Being here, we also drop the unnecessary 'Scop' postfix
and the unnecessary 'Scop' parameter.
llvm-svn: 276622
We optimize the kernel _after_ dumping the IR we generate to make the IR we
dump easier readable and independent of possible changes in the general
purpose LLVM optimizers.
llvm-svn: 276551