Sadly, this duplicates (twice) the logic from InstSimplify. There
might be some way to at least share the DAG versions of the code,
but copying the folds seems to be the standard method to ensure
that we don't miss these folds.
Unlike in IR, we don't run DAGCombiner to fixpoint, so there's no
way to ensure that we do these kinds of simplifications unless the
code is repeated at node creation time and during combines.
There were other tests that would become worthless with this
improvement that I changed as pre-commits:
rL347161
rL347164
rL347165
rL347166
rL347167
I'm not sure how to salvage the remaining tests (diffs in this patch).
So the x86 tests verify that the new code is working as intended.
The AMDGPU test is actually similar to my motivating case: we have
some undef value that has survived to machine IR in an x86 test, and
then it gets folded in some weird way, or we crash if we don't transfer
the undef flag. But we would have been better off never getting to that
point by doing these simplifications.
This will lead back to PR32023 someday...
https://bugs.llvm.org/show_bug.cgi?id=32023
llvm-svn: 347170
If a block had one of the _term instructions used for gluing
exec modifying instructions to the end of the block,
analyzeBranch would fail, preventing the verifier from catching
a broken successor list.
llvm-svn: 347027
Add a pass to fixup various vector ISel issues.
Currently we handle converting GLOBAL_{LOAD|STORE}_*
and GLOBAL_Atomic_* instructions into their _SADDR variants.
This involves feeding the sreg into the saddr field of the new instruction.
llvm-svn: 347008
The machine scheduler currently biases register copies to/from
physical registers to be closer to their point of use / def to
minimize their live ranges. This change extends this to also physical
register assignments from immediate values.
This causes a reduction in reduction in overall register pressure and
minor reduction in spills and indirectly fixes an out-of-registers
assertion (PR39391).
Most test changes are from minor instruction reorderings and register
name selection changes and direct consequences of that.
Reviewers: MatzeB, qcolombet, myatsina, pcc
Subscribers: nemanjai, jvesely, nhaehnle, eraman, hiraditya,
javed.absar, arphaman, jfb, jsji, llvm-commits
Differential Revision: https://reviews.llvm.org/D54218
llvm-svn: 346894
An extractelement with non-constant index will be lowered either to
scratch or movrel loop in most cases. This patch converts such
instruction into a set of selects if vector size is not too big.
Differential Revision: https://reviews.llvm.org/D54351
llvm-svn: 346800
Legalizer used to request an ext load from i8 to i1 when promoting
vector element type to i8. Fixed.
Differential Revision: https://reviews.llvm.org/D54440
llvm-svn: 346795
Sometimes after basic block placement we end up with a code like:
sreg = s_mov_b64 -1
vcc = s_and_b64 exec, sreg
s_cbranch_vccz
This happens as a join of a block assigning -1 to a saved mask and
another block which consumes that saved mask with s_and_b64 and a
branch.
This is essentially a single s_cbranch_execz instruction when moved
into a single new basic block.
Differential Revision: https://reviews.llvm.org/D54164
llvm-svn: 346690
The function only checks that instruction reads a super-register
containing requested physical register. In case if a sub-register
if being read that is also a use of a super-reg, so added the check.
In particular MI->readsRegister() is broken because of the missing
check. The resulting check is essentially regsOverlap().
Differential Revision: https://reviews.llvm.org/D54128
llvm-svn: 346686
Promote alloca can vectorize a small array by bitcasting it to a
vector type. Extend vectorization for the case when alloca is
already a vector type. We still want to replace GEPs with an
insert/extract element instructions in this case.
Differential Revision: https://reviews.llvm.org/D54219
llvm-svn: 346376
This allows testing AMDGPU alias analysis like any
other alias analysis pass. This fixes the existing
test pointlessly running opt -O3 when it really
just wants to run the one analysis.
Before there was no way to test this using -aa-eval
with opt, since the default constructed pass
is run. The wrapper subclass allows the
default constructor to pass the necessary callback.
llvm-svn: 346353
Set `LiveReg::PhysReg` to zero when freeing a register instead of
removing it from the entry from `LiveRegMap`. This way no iterators get
invalidated and we can avoid passing around and updating iterators all
over the place.
This does not change any allocator decisions. It is not completely NFC
because the arbitrary iteration order through `LiveRegMap` in
`spillAll()` changes so we may get a different order in those spill
sequences (the amount of spills does not change).
This is in preparation of https://reviews.llvm.org/D52010.
llvm-svn: 346298
Add this option for debugging and providing workaround.
By default it is off so no behavior change in backend.
Differential Revision: https://reviews.llvm.org/D54158
llvm-svn: 346267
This patch makes LICM use `ICFLoopSafetyInfo` that is a smarter version
of LoopSafetyInfo that leverages power of Implicit Control Flow Tracking
to keep track of throwing instructions and give less pessimistic answers
to queries related to throws.
The ICFLoopSafetyInfo itself has been introduced in rL344601. This patch
enables it in LICM only.
Differential Revision: https://reviews.llvm.org/D50377
Reviewed By: apilipenko
llvm-svn: 346201
The new atomic optimizer I previously added in D51969 did not work
correctly when a pixel shader was using derivatives, and had helper
lanes active.
To fix this we add an llvm.amdgcn.ps.live call that guards a branch
around the entire atomic operation - ensuring that all helper lanes are
inactive within the wavefront when we compute our atomic results.
I've added a test case that can cause derivatives, and exposes the
problem.
Differential Revision: https://reviews.llvm.org/D53930
llvm-svn: 346128
Summary: Different variants of idot8 codegen dag patterns are not generated by llvm-tablegen due to a huge
increase in the compile time. Support the pattern that clang FE generates after reordering the
additions in integer-dot8 source language pattern.
Author: FarhanaAleen
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D53937
llvm-svn: 345902
SimplifySetCC could shrink a load without checking for
profitability or legality of such shink with a target.
Added checks to prevent shrinking of aligned scalar loads
in AMDGPU below dword as scalar engine does not support it.
Differential Revision: https://reviews.llvm.org/D53846
llvm-svn: 345778
lowerRangeToAssertZExt currently relies on something like EarlyCSE having
eliminated the constant range [0,1). At -O0 this leads to an assert.
Differential Revision: https://reviews.llvm.org/D53888
llvm-svn: 345770
This feature is only relevant to shaders, and is no longer used. When disabled,
lowering of reserved registers for shaders causes a compiler crash.
Remove the feature and add a test for compilation of shaders at OptNone.
Differential Revision: https://reviews.llvm.org/D53829
llvm-svn: 345763
Summary:
Instead of writing boolean values temporarily into 32-bit VGPRs
if they are involved in PHIs or are observed from outside a loop,
we use bitwise masking operations to combine lane masks in a way
that is consistent with wave control flow.
Move SIFixSGPRCopies to before this pass, since that pass
incorrectly attempts to move SGPR phis to VGPRs.
This should recover most of the code quality that was lost with
the bug fix in "AMDGPU: Remove PHI loop condition optimization".
There are still some relevant cases where code quality could be
improved, in particular:
- We often introduce redundant masks with EXEC. Ideally, we'd
have a generic computeKnownBits-like analysis to determine
whether masks are already masked by EXEC, so we can avoid this
masking both here and when lowering uniform control flow.
- The criterion we use to determine whether a def is observed
from outside a loop is conservative: it doesn't check whether
(loop) branch conditions are uniform.
Change-Id: Ibabdb373a7510e426b90deef00f5e16c5d56e64b
Reviewers: arsenm, rampitec, tpr
Subscribers: kzhuravl, jvesely, wdng, mgorny, yaxunl, dstuttard, t-tye, eraman, llvm-commits
Differential Revision: https://reviews.llvm.org/D53496
llvm-svn: 345719
Summary:
The optimization to early break out of loops if all threads are dead was
never fully implemented.
But the PHI node analyzing is actually causing a number of problems, so
remove all the extra code for it.
(This does actually regress code quality in a few places because it
ends up relying more heavily on phi's of i1, which we don't do a
great job with. However, since it fixes real bugs in the wild, we
should take this change. I have some prototype changes to improve
i1 lowering in general -- not just for control flow -- which should
help recover the code quality, I just need to make those changes
fit for general consumption. -- Nicolai)
Change-Id: I6fc6c6c8961857ac6009fcfb9f7e5e48dc23fbb1
Patch-by: Christian König <christian.koenig@amd.com>
Reviewers: arsenm, rampitec, tpr
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D53359
llvm-svn: 345718
Our a16 support was only enabled for sample/gather and buffer
load/store, but not for image load/store operations (which take an i16
as the pixel index rather than a half).
Fix our isel lowering and add test cases to prove it out.
Differential Revision: https://reviews.llvm.org/D53750
llvm-svn: 345710
The debug-use flag must be set exactly for uses on DBG_VALUEs. This is
so obvious that it can be trivially inferred while parsing. This will
reduce noise when printing while omitting an information that has little
value to the user.
The parser will keep recognizing the flag for compatibility with old
`.mir` files.
Differential Revision: https://reviews.llvm.org/D53903
llvm-svn: 345671
The SchedModel allows the addition of ReadAdvances to express that certain
operands of the instructions are needed at a later point than the others.
RegAlloc may add pseudo operands that are not part of the instruction
descriptor, and therefore cannot have any read advance entries. This meant
that in some cases the desired read advance was nullified by such a pseudo
operand, which still had the original latency.
This patch fixes this by making sure that such pseudo operands get a zero
latency during DAG construction.
Review: Matthias Braun, Ulrich Weigand.
https://reviews.llvm.org/D49671
llvm-svn: 345606
Similar to FoldCONCAT_VECTORS, this patch adds FoldBUILD_VECTOR to simplify cases that can avoid the creation of the BUILD_VECTOR - if all the operands are UNDEF or if the BUILD_VECTOR simplifies to a copy.
This exposed an assumption in some AMDGPU code that getBuildVector was guaranteed to be a BUILD_VECTOR node that I've tried to handle.
Differential Revision: https://reviews.llvm.org/D53760
llvm-svn: 345578
- Relex hard coded registers and stack frame sizes
- Some test cleanups
- Change phi-dbg.ll to match on mir output after phi elimination instead
of going through the whole codegen pipeline.
This is in preparation for https://reviews.llvm.org/D52010
I'm committing all the test changes upfront that work before and after
independently.
llvm-svn: 345532
AMDGPU currently only supports direct calls, but at lower optimisation levels it
fails to lower statically direct calls which appear indirect due to a bitcast.
Add a pass to visit all CallSites and use CallPromotionUtils to "devirtualize"
calls.
Differential Revision: https://reviews.llvm.org/D52741
llvm-svn: 345382
Introduce new versions that follow the IEEE semantics
to help with legalization that may need quieted inputs.
There are some regressions from inserting unnecessary
canonicalizes when these are matched from fast math
fcmp + select which should be fixed in a future commit.
llvm-svn: 344914
Summary:
Add selection patterns to support one bit Sub.
Reviewers:
rampitec, arsenm
Differential Revision:
https://reviews.llvm.org/D52946
llvm-svn: 344815
Summary:
To workaround a hardware issue in the (base + offset) calculation
when base is negative. The impact on code quality should be limited
since SILoadStoreOptimizer still runs afterwards and is able to
combine loads/stores based on known sign information.
This fixes visible corruption in Hitman on SI (easily reproducible
by running benchmark mode).
Change-Id: Ia178d207a5e2ac38ae7cd98b532ea2ae74704e5f
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99923
Reviewers: arsenm, mareko
Subscribers: jholewinski, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D53160
llvm-svn: 344698
Summary:
Moving SMRD to VMEM in SIFixSGPRCopies is rather bad for performance if
the load is really uniform. So select the scalar load intrinsics directly
to either VMEM or SMRD buffer loads based on divergence analysis.
If an offset happens to end up in a VGPR -- either because a floating
point calculation was involved, or due to other remaining deficiencies
in SIFixSGPRCopies -- we use v_readfirstlane.
There is some unrelated churn in tests since we now select MUBUF offsets
in a unified way with non-scalar buffer loads.
Change-Id: I170e6816323beb1348677b358c9d380865cd1a19
Reviewers: arsenm, alex-t, rampitec, tpr
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D53283
llvm-svn: 344696