DBG_VALUES placed between memory instructions would change
codegen. Skip over these and re-insert them after the bundle instead
of giving up on bundling.
This would assert with amdgpu-spill-sgpr-to-vgpr disabled when trying to
spill the FP.
Fixes: SWDEV-262704
Reviewed By: RamNalamothu
Differential Revision: https://reviews.llvm.org/D95768
With a context instruction, this would produce a context
error. However, it would continue on and do an out of bounds access of
the empty allocation order array.
SCC was not correctly preserved when entering WWM.
Current lit test was unable to detect this as entry block is
handled differently.
Additionally fix an issue where SCC was unnecessarily preserved
when exiting from WWM to Exact mode.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D95500
V_SET_INACTIVE is implemented with S_NOT which clobbers SCC.
Mark sure it is marked appropriately.
Reviewed By: piotr
Differential Revision: https://reviews.llvm.org/D95509
These are widened to a wider UADDE/USUBE, with the overflow value
unused, and with the same synthesis of a new overflow value as for the
O operations.
Reviewed By: paquette
Differential Revision: https://reviews.llvm.org/D95326
Look throught G_PTRTOINT and G_PTR_ADD nodes when looking for constant
offset for buffer stores. This also helps with merging of these instructions
later on.
Differential Revision: https://reviews.llvm.org/D95242
Before the patch it was possible to trigger a constant bus
violation when folding immediates into a shrunk instruction.
The patch adds a check to enforce the legality of the new operand.
Differential Revision: https://reviews.llvm.org/D95527
We cannot call LRM::unassign() if LRM::assign() was never called
before, these are symmetrical calls. There are two ways of
assigning a physical register to virtual, via LRM::assign() and
via VRM::assignVirt2Phys(). LRM::assign() will call the VRM to
assign the register and then update LiveIntervalUnion. Inline
spiller calls VRM directly and thus LiveIntervalUnion never gets
updated. A call to LRM::unassign() then asserts about inconsistent
liveness.
We have to note that not all callers of the InlineSpiller even
have LRM to pass, RegAllocPBQP does not have it, so we cannot
always pass LRM into the spiller.
The only way to get into that spiller LRE_DidCloneVirtReg() call
is from LiveRangeEdit::eliminateDeadDefs if we split an LI.
This patch refuses to reassign a LiveInterval created by a split
to workaround the problem. In fact we cannot reassign a spill
anyway as all registers of the needed class are occupied and we
are spilling.
Fixes: SWDEV-267996
Differential Revision: https://reviews.llvm.org/D95489
Just use the existing `Known.sextInReg` implementation.
- Update KnownBitsTest.cpp.
- Update combine-redundant-and.mir for a more concrete example.
Differential Revision: https://reviews.llvm.org/D95484
Support for XNACK and SRAMECC is not static on some GPUs. We must be able
to differentiate between different scenarios for these dynamic subtarget
features.
The possible settings are:
- Unsupported: The GPU has no support for XNACK/SRAMECC.
- Any: Preference is unspecified. Use conservative settings that can run anywhere.
- Off: Request support for XNACK/SRAMECC Off
- On: Request support for XNACK/SRAMECC On
GCNSubtarget will track the four options based on the following criteria. If
the subtarget does not support XNACK/SRAMECC we say the setting is
"Unsupported". If no subtarget features for XNACK/SRAMECC are requested we
must support "Any" mode. If the subtarget features XNACK/SRAMECC exist in the
feature string when initializing the subtarget, the settings are "On/Off".
The defaults are updated to be conservatively correct, meaning if no setting
for XNACK or SRAMECC is explicitly requested, defaults will be used which
generate code that can be run anywhere. This corresponds to the "Any" setting.
Differential Revision: https://reviews.llvm.org/D85882
If a function has stack objects, and a call, we require an FP. If we
did not initially have any stack objects, and only introduced them
during PrologEpilogInserter for CSR VGPR spills, SILowerSGPRSpills
would end up spilling the FP register as if it were a normal
register. This would result in an assert in a debug build, or
redundant handling of the FP register in a release build.
Try to predict that we will have an FP later, although this is ugly.
Frame-base materialization may insert vector instructions before EXEC is initialised.
Fix this by moving lowering of llvm.amdgcn.init.exec later in backend.
Also remove SI_INIT_EXEC_LO pseudo as this is not necessary.
Reviewed By: ruiling
Differential Revision: https://reviews.llvm.org/D94645
In RISC-V there is a single addressing mode of the form imm(reg) where
imm is a signed integer of 12-bit with a range of [-2048..2047] bytes
from reg.
The test MultiSource/UnitTests/C++11/frame_layout of the LLVM test-suite
exercises several scenarios with the stack, including function calls
where the stack will need to be realigned to to a local variable having
a large alignment of 4096 bytes.
In situations of large stacks, the RISC-V backend (in
RISCVFrameLowering) reserves an extra emergency spill slot which can be
used (if no free register is found) by the register scavenger after the
frame indexes have been eliminated. PrologEpilogInserter already takes
care of keeping the emergency spill slots as close as possible to the
stack pointer or frame pointer (depending on what the function will
use). However there is a final alignment step to honour the maximum
alignment of the stack that, when using the stack pointer to access the
emergency spill slots, has the side effect of setting them farther from
the stack pointer.
In the case of the frame_layout testcase, the net result is that we do
have an emergency spill slot but it is so far from the stack pointer
(more than 2048 bytes due to the extra alignment of a variable to 4096
bytes) that it becomes unreachable via any immediate offset.
During elimination of the frame index, many (regular) offsets of the
stack may be immediately unreachable already. Their address needs to be
computed using a register. A virtual register is created and later
RegisterScavenger should be able to find an unused (physical) register.
However if no register is available, RegisterScavenger will pick a
physical register and spill it onto an emergency stack slot, while we
compute the offset (restoring the chosen register after all this). This
assumes that the emergency stack slot is easily reachable (this is,
without requiring another register!).
This is the assumption we seem to break when we perform the extra
alignment in PrologEpilogInserter.
We can "float" the emergency spill slots by increasing (in absolute
value) their offsets from the incoming stack pointer. This way the
emergency spill slots will remain close to the stack pointer (once the
function has allocated storage for the stack, including the needed
realignment). The new size computed in PrologEpilogInserter is padding
so it should be OK to move the emergency spill slots there. Also because
we're increasing the alignment, the new location should stay aligned for
the purpose of the emergency spill slots.
Note that this change also impacts other backends as shown by the tests.
Changes are minor adjustments to the emergency stack slot offset.
Differential Revision: https://reviews.llvm.org/D89239
The widenScalar implementation for signed and unsigned overflowing
operations were very similar: both are checked by truncating the result
and then re-sign/zero-extending it and checking that it matches the
computed operation.
Using a truncate + zero-extend for the unsigned case instead of manually
producing the AND instruction like before leads to an extra copy
instruction during legalization, but this should be harmless.
Differential Revision: https://reviews.llvm.org/D95035
Allow parsing generated mir with custom pseudo source value tokens.
Also rename pseudo source values to have more meaningful names.
Relands ba7dcd8542, which had memory leaks.
Differential Revision: https://reviews.llvm.org/D95215
During instruction selection, there is an inconsistency in choosing
the initial soffset value. With certain early passes, this value is
getting modified and that brought additional fixup during
eliminateFrameIndex to work for all cases. This whole transformation
looks trivial and can be handled better.
This patch clearly defines the initial value for soffset and keeps it
unchanged before eliminateFrameIndex. The initial value must be zero
for MUBUF with a frame index. The non-frame index MUBUF forms that
use a raw offset from SP will have the stack register for soffset.
During frame elimination, the soffset remains zero for entry functions
with zero dynamic allocas and no callsites, or else is updated to the
appropriate frame/stack register.
Also, did some code clean up and made all asserts around soffset
stricter to match.
Reviewed By: scott.linder
Differential Revision: https://reviews.llvm.org/D95071
Having a custom inliner doesn't really fit in with the new PM's
pipeline. It's also extra technical debt.
amdgpu-inline only does a couple of custom things compared to the normal
inliner:
1) It disables inlining if the number of BBs in a function would exceed
some limit
2) It increases the threshold if there are pointers to private arrays(?)
These can all be handled as TTI inliner hooks.
There already exists a hook for backends to multiply the inlining
threshold.
This way we can remove the custom amdgpu-inline pass.
This caused inline-hint.ll to fail, and after some investigation, it
looks like getInliningThresholdMultiplier() was previously getting
applied twice in amdgpu-inline (https://reviews.llvm.org/D62707 fixed it
not applying at all, so some later inliner change must have fixed
something), so I had to change the threshold in the test.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D94153
This test case demonstrates that the Call Frame Information generation is
totally biased towards whether exceptions are enabled or not. Currently
LLVM does not generate CFI i.e. a .debug_frame for debug purpose even
if --force-dwarf-frame-section is enabled unless exceptions are enabled.
Reviewed By: scott.linder
Differential Revision: https://reviews.llvm.org/D94801
If a function doesn't contain loops and does not call non-willreturn
functions, then it is willreturn. Loops are detected by checking
for backedges in the function. We don't attempt to handle finite
loops at this point.
Differential Revision: https://reviews.llvm.org/D94633
This pass is required to get correct codegen for image instructions with
the tfe or lwe bits set.
Differential Revision: https://reviews.llvm.org/D95132
Allow parsing generated mir with custom pseudo source value tokens.
Also rename pseudo source values to have more meaningful names.
Differential Revision: https://reviews.llvm.org/D94768
Add DemandedElts support inside the TRUNCATE analysis.
REAPPLIED - this was reverted by @hans at rGa51226057fc3 due to an issue with vector shift amount types, which was fixed in rG935bacd3a724 and an additional test case added at rG0ca81b90d19d
Differential Revision: https://reviews.llvm.org/D56387
In case of indirect calls or address taken functions,
skip propagating any attributes to them. We just
propagate features to such functions.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D94585
It caused "Vector shift amounts must be in the same as their first arg"
asserts in Chromium builds. See the code review for repro instructions.
> Add DemandedElts support inside the TRUNCATE analysis.
>
> Differential Revision: https://reviews.llvm.org/D56387
This reverts commit cad4275d69.
If constants are hidden behind G_ANYEXT we can treat them same way as G_SEXT.
For that purpose we extend getConstantVRegValWithLookThrough with option
to handle G_ANYEXT same way as G_SEXT.
Differential Revision: https://reviews.llvm.org/D92219
Use the KnownBits icmp comparisons to determine when a ISD::UMIN/UMAX op is unnecessary should either op be known to be ULT/ULE or UGT/UGE than the other.
Differential Revision: https://reviews.llvm.org/D94532
Add pseudo instruction to allow early termination of pixel shader
anywhere based on the value of SCC. The intention is to use this
when a mask of live lanes is updated, e.g. live lanes in WQM pass.
This facilitates early termination of shaders even when EXEC is
incomplete, e.g. in non-uniform control flow.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D88777
Previously, instructions which could be
expressed as VOP3 in addition to another
encoding had a _e64 suffix on the tablegen
record name, while those
only available as VOP3 did not. With this
patch, all VOP3s will have the _e64 suffix.
The assembly does not change, only the mir.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D94341
Change-Id: Ia8ec8890d47f8f94bbbdac43745b4e9dd2b03423
This seems to only have overridden cold handling, which we probably
shouldn't do. As far as I can tell the wrapper library functions are
still inlined as appropriate.
If SETO/SETUO aren't legal, they'll be expanded and we'll end up
with 3 comparisons.
SETONE is equivalent to (SETOGT || SETOLT)
so if one of those operations is supported use that expansion. We
don't need both since we can commute the operands to make the other.
SETUEQ can be implemented with !(SETOGT || SETOLT) or (SETULE && SETUGE).
I've only implemented the first because it didn't look like most of the
affected targets had legal SETULE/SETUGE.
Reviewed By: frasercrmck, tlively, nemanjai
Differential Revision: https://reviews.llvm.org/D94450
In ST mode, flat scratch instructions have neither an sgpr nor a vgpr
for the address. This lead to an assertion when inserting hard clauses.
Differential Revision: https://reviews.llvm.org/D94406
Memory operands store a base alignment that does not factor in
the effect of the offset on the alignment.
Previously the printing code only printed the base alignment if
it was different than the size. If there is an offset, the reader
would need to figure out the effective alignment themselves. This
has confused me before and someone else was recently confused on
IRC.
This patch prints the possibly offset adjusted alignment if it is
different than the size. And prints the base alignment if it is
different than the alignment. The MIR parser has been updated to
read basealign in addition to align.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D94344
This patch finishes addressing unused prefixes under CodeGen: 2
remaining tests fixed, and then undo-ing the lit.local.cfg changes under
various subdirs and moving the policy under CodeGen.
Differential Revision: https://reviews.llvm.org/D94430
We are checking the unsafe-fp-math for sqrt but not for fpow, which behaves inconsistent.
As the direction is to remove this global option, we need to remove the unsafe-fp-math
check for sqrt and update the test with afn fast-math flags.
Reviewed By: Spatel
Differential Revision: https://reviews.llvm.org/D93891
Treat a non-atomic volatile load and store as a relaxed atomic at
system scope for the address spaces accessed. This will ensure all
relevant caches will be bypassed.
A volatile atomic is not changed and still only bypasses caches upto
the level specified by the SyncScope operand.
Differential Revision: https://reviews.llvm.org/D94214
There are various hacks working around limitations in
handleAssignments, and the logical split between different parts isn't
correct. Start separating the type legalization to satisfy going
through the DAG infrastructure from the code required to split into
register types. The type splitting should be moved to generic code.
The fdiv lowering is currently split between an IR pass and codegen,
so make sure this works end to end. We also currently differ from the
DAG on some edge cases, which this will show in a future change.
Summary:
This is to avoid unnecessary analysis since amdgpu.noclobber is only used for globals.
Reviewers:
arsenm
Fixes:
SWDEV-239161
Differential Revision:
https://reviews.llvm.org/D94107
Convert it to v_fma_legacy_f32 if it is profitable to do so, just like
other mac instructions that are converted to their mad equivalents.
Differential Revision: https://reviews.llvm.org/D94010
This patch disables the FSUB(-0,X)->FNEG(X) DAG combine when we're flushing subnormals. It requires updating the existing AMDGPU tests to use the fneg IR instruction, in place of the old fsub(-0,X) canonical form, since AMDGPU is the only backend currently checking the DenormalMode flags.
Note that this will require follow-up optimizations to make sure the FSUB(-0,X) form is handled appropriately
Differential Revision: https://reviews.llvm.org/D93243
An AMDGPUAA class already existed that was supposed to work with the new
PM, but it wasn't tested and was a bit broken.
Fix up the existing classes to have the right keys/parameters.
Wire up AMDGPUAA inside AMDGPUTargetMachine.
Add it to the list of alias analyses for the "default" AAManager since
in adjustPassManager() amdgpu-aa is added into the pipeline at the
beginning.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D93914
The legacy PM doesn't run EP_ModuleOptimizerEarly on -O0, so skip
running it here when given O0.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D93886
This is a (last big?) part of the patch series to make SimplifyCFG
preserve DomTree. Currently, it still does not actually preserve it,
even thought it is pretty much fully updated to preserve it.
Once the default is flipped, a valid DomTree must be passed into
simplifyCFG, which means that whatever pass calls simplifyCFG,
should also be smart about DomTree's.
As far as i can see from `check-llvm` with default flipped,
this is the last LLVM test batch (other than bugpoint tests)
that needed fixes to not break with default flipped.
The changes here are boringly identical to the ones i did
over 42+ times/commits recently already,
so while AMDGPU is outside of my normal ecosystem,
i'm going to go for post-commit review here,
like in all the other 42+ changes.
Note that while the pass is taught to preserve {,Post}DomTree,
it still doesn't do that by default, because simplifycfg
still doesn't do that by default, and flipping default
in this pass will implicitly flip the default for simplifycfg.
That will happen, but not right now.
As mentioned in D93793, there are quite a few places where unary `IRBuilder::CreateShuffleVector(X, Mask)` can be used
instead of `IRBuilder::CreateShuffleVector(X, Undef, Mask)`.
Let's update them.
Actually, it would have been more natural if the patches were made in this order:
(1) let them use unary CreateShuffleVector first
(2) update IRBuilder::CreateShuffleVector to use poison as a placeholder value (D93793)
The order is swapped, but in terms of correctness it is still fine.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D93923
And add it to the AMDGPU opt pipeline.
This is a function pass instead of a module pass (like the legacy pass)
because it's getting added to a CGSCCPassManager, and you can't put a
module pass in a CGSCCPassManager.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D93885
And add to AMDGPU opt pipeline.
Don't pin an opt run to the legacy PM when -enable-new-pm=1 if these
passes (or passes introduced in https://reviews.llvm.org/D93863) are in
the list of passes.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D93875
And add them to the pipeline via
AMDGPUTargetMachine::registerPassBuilderCallbacks(), which mirrors
AMDGPUTargetMachine::adjustPassManager().
These passes can't be unconditionally added to PassRegistry.def since
they are only present when the AMDGPU backend is enabled. And there are
no target-specific headers in llvm/include, so parsing these pass names
must occur somewhere in the AMDGPU directory. I decided the best place
was inside the TargetMachine, since the PassBuilder invokes
TargetMachine::registerPassBuilderCallbacks() anyway. If we come up with
a cleaner solution for target-specific passes in the future that's fine,
but there aren't too many target-specific IR passes living in
target-specific directories so it shouldn't be too bad to change in the
future.
Reviewed By: ychen, arsenm
Differential Revision: https://reviews.llvm.org/D93863
Basic block containing "if" not necessarily dominates block that is the "false" target for the if.
That "false" target block may have another predecessor besides the "if" block. IR value corresponding to the Exec mask is generated by the
si_if intrinsic and then used by the end_cf intrinsic. In this case IR verifier complains that 'Def does not dominate all uses'.
This change split the edge between the "if" block and "false" target block to make it dominated by the "if" block.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D91435
Currently undef is used as a don’t-care vector when constructing a vector using a series of insertelement.
However, this is problematic because undef isn’t undefined enough.
Especially, a sequence of insertelement can be optimized to shufflevector, but using undef as its placeholder makes shufflevector a poison-blocking instruction because undef cannot be optimized to poison.
This makes a few straightforward optimizations incorrect, such as:
```
; https://bugs.llvm.org/show_bug.cgi?id=44185
define <4 x float> @insert_not_undef_shuffle_translate_commute(float %x, <4 x float> %y, <4 x float> %q) {
%xv = insertelement <4 x float> %q, float %x, i32 2
%r = shufflevector <4 x float> %y, <4 x float> %xv, <4 x i32> { 0, 6, 2, undef }
ret <4 x float> %r ; %r[3] is undef
}
=>
define <4 x float> @insert_not_undef_shuffle_translate_commute(float %x, <4 x float> %y, <4 x float> %q) {
%r = insertelement <4 x float> %y, float %x, i32 1
ret <4 x float> %r ; %r[3] = %y[3], incorrect if %y[3] = poison
}
Transformation doesn't verify!
ERROR: Target is more poisonous than source
```
I’d like to suggest
1. Using poison as insertelement’s placeholder value (IRBuilder::CreateVectorSplat should be patched too)
2. Updating shufflevector’s semantics to return poison element if mask is undef
Note that poison is currently lowered into UNDEF in SelDag, so codegen part is okay.
m_Undef() matches PoisonValue as well, so existing optimizations will still fire.
The only concern is hidden miscompilations that will go incorrect when poison constant is given.
A conservative way is copying all tests having `insertelement undef` & replacing it with `insertelement poison` & run Alive2 on it, but it will create many tests and people won’t like it. :(
Instead, I’ll simply locally maintain the tests and run Alive2.
If there is any bug found, I’ll report it.
Relevant links: https://bugs.llvm.org/show_bug.cgi?id=43958 , http://lists.llvm.org/pipermail/llvm-dev/2019-November/137242.html
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D93586
Currently, the compiler crashes in instruction selection of global
load/stores in gfx600 due to the lack of FLAT instructions. This patch
fix the crash by selecting MUBUF instructions for global load/stores
in gfx600.
Authored-by: Praveen Velliengiri <Praveen.Velliengiri@amd.com>
Reviewed by: t-tye
Differential revision: https://reviews.llvm.org/D92483
Current approach doesn't work well in cases when multiple paths are predicted to be "cold". By "cold" paths I mean those containing "unreachable" instruction, call marked with 'cold' attribute and 'unwind' handler of 'invoke' instruction. The issue is that heuristics are applied one by one until the first match and essentially ignores relative hotness/coldness
of other paths.
New approach unifies processing of "cold" paths by assigning predefined absolute weight to each block estimated to be "cold". Then we propagate these weights up/down IR similarly to existing approach. And finally set up edge probabilities based on estimated block weights.
One important difference is how we propagate weight up. Existing approach propagates the same weight to all blocks that are post-dominated by a block with some "known" weight. This is useless at least because it always gives 50\50 distribution which is assumed by default anyway. Worse, it causes the algorithm to skip further heuristics and can miss setting more accurate probability. New algorithm propagates the weight up only to the blocks that dominates and post-dominated by a block with some "known" weight. In other words, those blocks that are either always executed or not executed together.
In addition new approach processes loops in an uniform way as well. Essentially loop exit edges are estimated as "cold" paths relative to back edges and should be considered uniformly with other coldness/hotness markers.
Reviewed By: yrouban
Differential Revision: https://reviews.llvm.org/D79485
It does not seem to fold offsets but this is not specific
to the flat scratch as getPtrBaseWithConstantOffset() does
not return the split for these tests unlike its SDag
counterpart.
Differential Revision: https://reviews.llvm.org/D93670
Adjust SITargetLowering::allowsMisalignedMemoryAccessesImpl for
unaligned flat scratch support. Mostly needed for global isel.
Differential Revision: https://reviews.llvm.org/D93669
Errors from MCAssembler, MCObjectStreamer and *ObjectWriter typically cause a crash:
```
% cat c.c
int bar;
extern int foo __attribute__((alias("bar")));
% clang -c -fcommon c.c
fatal error: error in backend: Common symbol 'bar' cannot be used in assignment expr
PLEASE submit a bug report to ...
Stack dump:
...
```
`LLVMTargetMachine::addPassesToEmitFile` constructs `MachineModuleInfoWrapperPass`
which creates a MCContext without SourceMgr. `MCContext::reportError` calls
`report_fatal_error` which gets captured by Clang `LLVMErrorHandler` and gets translated
to the output above.
Since `MCContext::reportError` errors indicate user errors, such a crashing style error
is inappropriate. So this patch changes `report_fatal_error` to `SourceMgr().PrintMessage`.
```
% clang -c -fcommon c.c
<unknown>:0: error: Common symbol 'bar' cannot be used in assignment expr
```
Ideally we should at least recover the original filename (the line information
is generally lost). That requires general improvement to MC diagnostics,
because currently in many cases SMLoc information is lost.
Fast register allocator skips bundled MIs, as the main assignment
loop uses MachineBasicBlock::iterator (= MachineInstrBundleIterator)
This was causing SIInsertWaitcnts to crash which expects all
instructions to have registers assigned.
This patch makes sure to set everything inside bundle to the same
assignments done on BUNDLE header.
Reviewed By: qcolombet
Differential Revision: https://reviews.llvm.org/D90369
This PR implements the function splitBasicBlockBefore to address an
issue
that occurred during SplitEdge(BB, Succ, ...), inside splitBlockBefore.
The issue occurs in SplitEdge when the Succ has a single predecessor
and the edge between the BB and Succ is not critical. This produces
the result ‘BB->Succ->New’. The new function splitBasicBlockBefore
was added to splitBlockBefore to handle the issue and now produces
the correct result ‘BB->New->Succ’.
Below is an example of splitting the block bb1 at its first instruction.
/// Original IR
bb0:
br bb1
bb1:
%0 = mul i32 1, 2
br bb2
bb2:
/// IR after splitEdge(bb0, bb1) using splitBasicBlock
bb0:
br bb1
bb1:
br bb1.split
bb1.split:
%0 = mul i32 1, 2
br bb2
bb2:
/// IR after splitEdge(bb0, bb1) using splitBasicBlockBefore
bb0:
br bb1.split
bb1.split
br bb1
bb1:
%0 = mul i32 1, 2
br bb2
bb2:
Differential Revision: https://reviews.llvm.org/D92200
This PR implements the function splitBasicBlockBefore to address an
issue
that occurred during SplitEdge(BB, Succ, ...), inside splitBlockBefore.
The issue occurs in SplitEdge when the Succ has a single predecessor
and the edge between the BB and Succ is not critical. This produces
the result ‘BB->Succ->New’. The new function splitBasicBlockBefore
was added to splitBlockBefore to handle the issue and now produces
the correct result ‘BB->New->Succ’.
Below is an example of splitting the block bb1 at its first instruction.
/// Original IR
bb0:
br bb1
bb1:
%0 = mul i32 1, 2
br bb2
bb2:
/// IR after splitEdge(bb0, bb1) using splitBasicBlock
bb0:
br bb1
bb1:
br bb1.split
bb1.split:
%0 = mul i32 1, 2
br bb2
bb2:
/// IR after splitEdge(bb0, bb1) using splitBasicBlockBefore
bb0:
br bb1.split
bb1.split
br bb1
bb1:
%0 = mul i32 1, 2
br bb2
bb2:
Differential Revision: https://reviews.llvm.org/D92200
This PR implements the function splitBasicBlockBefore to address an
issue
that occurred during SplitEdge(BB, Succ, ...), inside splitBlockBefore.
The issue occurs in SplitEdge when the Succ has a single predecessor
and the edge between the BB and Succ is not critical. This produces
the result ‘BB->Succ->New’. The new function splitBasicBlockBefore
was added to splitBlockBefore to handle the issue and now produces
the correct result ‘BB->New->Succ’.
Below is an example of splitting the block bb1 at its first instruction.
/// Original IR
bb0:
br bb1
bb1:
%0 = mul i32 1, 2
br bb2
bb2:
/// IR after splitEdge(bb0, bb1) using splitBasicBlock
bb0:
br bb1
bb1:
br bb1.split
bb1.split:
%0 = mul i32 1, 2
br bb2
bb2:
/// IR after splitEdge(bb0, bb1) using splitBasicBlockBefore
bb0:
br bb1.split
bb1.split
br bb1
bb1:
%0 = mul i32 1, 2
br bb2
bb2:
Differential Revision: https://reviews.llvm.org/D92200
Undef subranges are not present in the live range values, except when
they cross block boundaries. In this situation, a identity copy is
inside a loop, and one of the lanes is undefined. It only appears
alive inside the loop due to the copy. Once the copy was erased, it
would leave behind a segment inside the loop body with no
corresponding def anywhere in the program.
When RenameIndependentSubregs processed this dummy interval, it would
introduce a "Multiple connected components in live interval" verifier
error when IMPLICIT_DEFs were added to the other two blocks. I believe
there is a missing verifier check for this type of dummy interval.
I have found additional cases from the same fundamental problem in
other areas I haven't managed to fix yet (e.g. the commented out
prune_subrange_phi_value_* cases).
Summary:
If a store defines (must alias) a load, it clobbers the load.
Fixes: SWDEV-258915
Reviewers:
arsenm
Differential Revision:
https://reviews.llvm.org/D92951
- Once an instruction is simplified, foldable candidates from it should
be invalidated or skipped as the operand index is no longer valid.
Differential Revision: https://reviews.llvm.org/D93174
Both ds_read_b128 and ds_read2_b64 are valid for 128bit 16-byte aligned
loads but the one that will be selected is determined either by the order in
tablegen or by the AddedComplexity attribute. Currently ds_read_b128 has
priority.
While ds_read2_b64 has lower alignment requirements, we cannot always
restrict ds_read_b128 to 16-byte alignment because of unaligned-access-mode
option. This was causing ds_read_b128 to be selected for 8-byte aligned
loads regardles of chosen access mode.
To resolve this we use two patterns for selecting ds_read_b128. One
requires alignment of 16-byte and the other requires
unaligned-access-mode option.
Same goes for ds_write2_b64 and ds_write_b128.
Differential Revision: https://reviews.llvm.org/D92767
It is possible for copies or spills to be inserted in the middle of indirect
addressing sequences which use VGPR indexing. Spills to accvgprs could be
effected by the indexing mode.
Add new pseudo instructions that are expanded after register allocation to avoid
the problematic spill or copy placement.
Differential Revision: https://reviews.llvm.org/D91048
getMaxWavesPerEU and getVGPRAllocGranule both changed in GFX10.3 and
they both affect the occupancy calculation.
Differential Revision: https://reviews.llvm.org/D92839
* Rename some tests to try to make a convention (where all components
are optional) of:
<addrspace>_<syncscope>_<memory-orders>_<operation>
* Split up at a level of granularity appropriate for the different RUN
lines (i.e. split on addrspace so GFX6 can avoid FLAT) and that makes
running a specific test reasonable in terms of wall time taken. This
also means when run as part of the test suite the testing is not one
serial bottleneck.
* Auto-generate check lines with `update_llc_test_checks.py` to make
future maintenance more tractable.
Reviewed By: rampitec, t-tye
Differential Revision: https://reviews.llvm.org/D91545
Currently, we have some confusion in the codebase regarding the
meaning of LocationSize::unknown(): Some parts (including most of
BasicAA) assume that LocationSize::unknown() only allows accesses
after the base pointer. Some parts (various callers of AA) assume
that LocationSize::unknown() allows accesses both before and after
the base pointer (but within the underlying object).
This patch splits up LocationSize::unknown() into
LocationSize::afterPointer() and LocationSize::beforeOrAfterPointer()
to make this completely unambiguous. I tried my best to determine
which one is appropriate for all the existing uses.
The test changes in cs-cs.ll in particular illustrate a previously
clearly incorrect AA result: We were effectively assuming that
argmemonly functions were only allowed to access their arguments
after the passed pointer, but not before it. I'm pretty sure that
this was not intentional, and it's certainly not specified by
LangRef that way.
Differential Revision: https://reviews.llvm.org/D91649
Currently, `-indvars` runs first, and then immediately after `-loop-idiom` does.
I'm not really sure if `-loop-idiom` requires `-indvars` to run beforehand,
but i'm *very* sure that `-indvars` requires `-loop-idiom` to run afterwards,
as it can be seen in the phase-ordering test.
LoopIdiom runs on two types of loops: countable ones, and uncountable ones.
For uncountable ones, IndVars obviously didn't make any change to them,
since they are uncountable, so for them the order should be irrelevant.
For countable ones, well, they should have been countable before IndVars
for IndVars to make any change to them, and since SCEV is used on them,
it shouldn't matter if IndVars have already canonicalized them.
So i don't really see why we'd want the current ordering.
Should this cause issues, it will give us a reproducer test case
that shows flaws in this logic, and we then could adjust accordingly.
While this is quite likely beneficial in-the-wild already,
it's a required part for the full motivational pattern
behind `left-shift-until-bittest` loop idiom (D91038).
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D91800
Add .shader_functions to pal metadata, which contains the stack frame
size for all non-entry-point functions.
Differential Revision: https://reviews.llvm.org/D90036
Also use DataLayout to get type size. Relying on the IR type size is
also pretty broken here, since this won't perfectly capture how types
are legalized.
Extract the scratch offset from the scratch buffer descriptor that is
stored in the global table.
Differential Revision: https://reviews.llvm.org/D91701
These tests implicitly depend on the target supporting generic pointers,
so to prepare for testing them on GFX6 (which lacks FLAT) remove the
dependency where possible.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D91666
We have workarounds for two different cases where vccz can get out of
sync with the value in vcc. This fixes them in two ways:
1. Fix the case where the def of vcc was in a previous basic block, by
pessimistically assuming that vccz might be incorrect at a basic block
boundary.
2. Fix the handling of pre-existing waitcnt instructions by calling
generateWaitcntInstBefore before examining ScoreBrackets to determine
whether there's an outstanding smem read operation.
Differential Revision: https://reviews.llvm.org/D91636
- In certain cases, a generic pointer could be assumed as a pointer to
the global memory space or other spaces. With a dedicated target hook
to query that address space from a given value, infer-address-space
pass could infer and propagate that to all its users.
Differential Revision: https://reviews.llvm.org/D91121
This patch adds a new pass to add !annotation metadata for entries in
@llvm.global.anotations, which is generated using
__attribute__((annotate("_name"))) on functions in Clang.
This has been discussed on llvm-dev as part of
RFC: Combining Annotation Metadata and Remarks
http://lists.llvm.org/pipermail/llvm-dev/2020-November/146393.html
Reviewed By: thegameg
Differential Revision: https://reviews.llvm.org/D91195
This patch adds a new !annotation metadata kind which can be used to
attach annotation strings to instructions.
It also adds a new pass that emits summary remarks per function with the
counts for each annotation kind.
The intended uses cases for this new metadata is annotating
'interesting' instructions and the remarks should provide additional
insight into transformations applied to a program.
To motivate this, consider these specific questions we would like to get answered:
* How many stores added for automatic variable initialization remain after optimizations? Where are they?
* How many runtime checks inserted by a frontend could be eliminated? Where are the ones that did not get eliminated?
Discussed on llvm-dev as part of 'RFC: Combining Annotation Metadata and Remarks'
(http://lists.llvm.org/pipermail/llvm-dev/2020-November/146393.html)
Reviewed By: thegameg, jdoerfert
Differential Revision: https://reviews.llvm.org/D91188
Also fix a similar issue in SIInsertWaitcnts, but I don't think that fix
has any effect in practice.
Differential Revision: https://reviews.llvm.org/D91290
We can use KnownBitsAnalysis to cover cases when mask is not trivial. It can
also help with cases when mask is not constant but can still be folded into
one. Since 'and' is comutative we should treat both operands as possible
replacements.
Differential Revision: https://reviews.llvm.org/D90674
This sequence of instructions can be simplified if they are single use and
some operands are constants. Additional combines may be applied afterwards.
Differential Revision: https://reviews.llvm.org/D90223
Sequence of same shift instructions with constant operands can be combined into
a single shift instruction.
Differential Revision: https://reviews.llvm.org/D90217
If the source of S_MOV_{B32,B64}_term is an immediate then it
cannot be lowered to a COPY.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D90451
Add a calling convention called amdgpu_gfx for real function calls
within graphics shaders. For the moment, this uses the same calling
convention as other calls in amdgpu, with registers excluded for return
address, stack pointer and stack buffer descriptor.
Differential Revision: https://reviews.llvm.org/D88540
Fix a crash when SCC is defined until end of block and mode change
must be inserted in SCC live region.
Reviewed By: mceier
Differential Revision: https://reviews.llvm.org/D90997
Results of convergent operations are implicitly affected by the
enclosing control flows and should not be hoisted out of arbitrary
loops.
Patch by Xiaoqing Wu <xiaoqing_wu@apple.com>
Differential Revision: https://reviews.llvm.org/D90361
Removed "implicit def VCC" from declarations of AMDGPU VOPC instructions since they do not implicitly write to VCC in SDWA mode.
Differential Revision: https://reviews.llvm.org/D89168
This change adds a real glc operand to the return atomic
instead of just string " glc" in the middle of the asm
string.
Improves asm parser diagnostics.
Differential Revision: https://reviews.llvm.org/D90730
Convert GISelKnownBits.computeKnownBitsImpl shift handling to use the common KnownBits implementations, which makes use of the known leading/trailing bits for shifted values in cases where we don't know the shift amount value, as detailed in https://blog.regehr.org/archives/1709
Differential Revision: https://reviews.llvm.org/D90527
Summary:
For vector element types which are not byte-sized, we would generate
incorrect scalar offsets and produce incorrect codegen.
This optimization could potentially be supported in the future, e.g. by
loading in bytes, then shifting and masking out the remaining bits of
the vector element. However, without an upstream target to test against
it's best to avoid the bad codegen in the simplest possible way.
Related to this bug:
https://bugs.llvm.org/show_bug.cgi?id=27600
Reviewed by: foad
Differential Revision: https://reviews.llvm.org/D78568
Pseudo-registers allow different register encodings
between gpu generations. Make sure we resolve the
pseudo regs to real regs whenever we get their
hardware encoding.
Using the correct encodings revealed a register
bank conflict and an unnecessary write dependency.
Tests have been updated to match.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D90721
Change-Id: I73c154cd24aecc820993b50bebaf4df97a5710ca
The insertion of waterfall loops splits the current basic block into
three blocks. So the basic block that we iterate over must be updated.
This failed assert(!NodePtr->isKnownSentinel()) in ilist_iterator for
divergent calls in branches before.
Differential Revision: https://reviews.llvm.org/D90596
Previously, the default value for ieee mode was
- on for compute kernels and compute shaders,
- off for all shaders except compute shaders.
This commit changes the default to be
- on for compute kernels,
- off for shaders.
This aligns the default value with the settings that are actually in
use. To my knowledge, all users of shader calling conventions (mesa and
llpc) disable the ieee mode by default.
Differential Revision: https://reviews.llvm.org/D89388
As discussed on D90527, we should be be trying to move shift handling functionality into KnownBits to avoid code duplication in SelectionDAG/GlobalISel/ValueTracking.
The refactor to use the KnownBits fixed/min/max constant helpers allows us to hit a couple of cases that we were missing before.
We still need the getValidMinimumShiftAmountConstant case as KnownBits doesn't handle per-element vector cases.
This differentiates the Ryzen 4000/4300/4500/4700 series APUs that were
previously included in gfx909.
Differential Revision: https://reviews.llvm.org/D90419
Change-Id: Ia901a7157eb2f73ccd9f25dbacec38427312377d
These instructions use a scaled offset. We were wrongly selecting them
even when the required offset was not a multiple of the scale factor.
Differential Revision: https://reviews.llvm.org/D90607
It should be enabled only when the load alignment is at least 8-byte.
Fixes: SWDEV-256824
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D90404
* Factor out common elements of the input YAML document and use sed to
macro replace the run line specific elements.
* Add checks for the common elements which depend on the ELF class.
* Use non-numeric suffix for temporary files to avoid merge conflicts.
* Sort tests by GFX# ascending.
* Group ELF and YAML tests by GFX#.
Reviewed By: t-tye
Differential Revision: https://reviews.llvm.org/D90245
This reverts r227987 "R600/SI: Determine target-specific encoding of READLANE and WRITELANE early v2".
All the codegen changes are caused by the post-RA scheduler no longer
treating readlane/writelane as scheduling barriers due to having
unmodelled side effects. (The pseudos are hasSideEffects = 0, but the
real instructions are hasSideEffects = ? which TableGen conservatively
treats as 1.)
Differential Revision: https://reviews.llvm.org/D90401
Reset the tracked emitted instructions when starting scheduling on a new
region.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D90347
V_DIV_SCALE_F32/F64 are VOP3B encoded so they can't use the ABS src
modifier, but they can still use NEG and the usual output modifiers.
This partially reverts 3b99f12a4e "AMDGPU: Remove modifiers from v_div_scale_*".
Differential Revision: https://reviews.llvm.org/D90296
https://reviews.llvm.org/D88060
This adds the following combines
1) build_vector formation from insert_vec_elts
2) insert_vec_elts (build_vector) -> build_vector
SIPreAllocateWWMRegs was being inserted after RegisterCoalescer
but this pass does not exist during FastAlloc so pre-allocation
pass was never being run.
Insert pre-allocation after TwoAddressInstructionPass instead.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D90236
- Add an internal option `-amdgpu-use-aa-in-codegen` to enable or
disable this feature. By Default, it's enabled.
Differential Revision: https://reviews.llvm.org/D89320
Exec mask manipulation inserted by SIWholeQuadMode barriers to
instruction scheduling. Move the entire pass after the machine
instruction scheduler and make changes so pass is correct for
non-SSA operation. These changes should leave the pass still
usable pre-scheduler, although tests have be updated to reflect
post-scheduler results.
Reviewed By: nhaehnle
Differential Revision: https://reviews.llvm.org/D88081
The support is disabled by default. So far there is instruction
selection, spilling, and frame elimination. It also changes SP
from unswizzled to swizzled as used by flat scratch instructions,
so it cannot be mixed with MUBUF stack access.
At the very least missing:
- GlobalISel;
- Some optimizations in frame elimination in between vector
and scalar ALU;
- It shall finally allow to always materialize frame index
as an SGPR, but that is not implemented and frame elimination
cannot handle it yet;
- Unaligned and/or multidword flat scratch shall work, but it
is legalized now for MUBUF;
- Operand folding cannot optimize FI like with MUBUF yet;
- It will need scaling the value of the SP/FP in the DWARF
expression to recover the unswizzled scratch address;
Differential Revision: https://reviews.llvm.org/D89170
If no pal metadata is given, default to the msgpack format instead of
the legacy metadata. This makes tests better readable.
Differential Revision: https://reviews.llvm.org/D90035
We use an absolute address for stack objects and
it would be necessary to have a constant 0 for soffset field.
Fixes: SWDEV-228562
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D89234
This commit marks i16 MULH as expand in AMDGPU backend,
which is necessary after the refactoring in D80485.
Differential Revision: https://reviews.llvm.org/D89965
If the end instruction of the scheduling region was a DBG_VALUE, the
uses of the debug instruction were tracked as if they were real
uses. This would then hit the deadDefHasNoUse assertion in
addVRegDefDeps if the only use was the debug instruction.
1. Fixed liveness issue with implicit kills.
2. Fixed potential problem with an indirect mov.
Fixes: SWDEV-256848
Differential Revision: https://reviews.llvm.org/D89599
Fixes being overly conservative with the register counts in called
functions. This should try to do a conservative range merge, but for
now just clone.
Also fix not being able to functionally run the pass standalone.
Passes that are run after the post-RA scheduler may insert instructions like
waitcnt which eliminate the need for certain noops. After this patch the
scheduler is still aware of possible latency from hazards but noops will
not be inserted until the dedicated hazard recognizer pass is run.
Depends on D89753.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D89754
If a target can encode multiple wait-states into a noop allow emitting such
instructions directly.
Reviewed By: rampitec, dmgreen
Differential Revision: https://reviews.llvm.org/D89753
Change waitcnt insertion to check the memory operand tokens to see if
flat memory operations access VMEM in the same way it does to check if
accessing LDS. This avoids adding waitcnt for counters for address
spaces that are not accessed.
In addition, only generate the pessimistic waitcnt 0 if a flat memory
operation appears to access both VMEM and LDS.
This benefits flat memory operations that explicitly specify the
address space as GLOBAL or LOCAL.
Differential Revision: https://reviews.llvm.org/D89618
- In general, a generic point may alias to pointers in all other address
spaces. However, for certain cases enforced by the programming model,
we may found a generic point won't alias to pointers to local objects.
* When a generic pointer is loaded from the constant address space, it
could only be a pointer to the GLOBAL or CONSTANT address space.
Thus, it won't alias to pointers to the PRIVATE or LOCAL address
space.
* When a generic pointer is passed as a kernel argument, it also could
only be a pointer to the GLOBAL or CONSTANT address space. Thus, it
also won't alias to pointers to the PRIVATE or LOCAL address space.
Differential Revision: https://reviews.llvm.org/D89525
Remove immediate operand from SI_ELSE which indicates if EXEC has
been modified. Instead always emit code that handles EXEC and
remove unnecessary instructions during pre-RA optimisation.
This facilitates passes (i.e. SIWholeQuadMode) adding exec mask
manipulation post control flow lowering, and pre control flow
lower passes do not need to be aware of SI_ELSE handling.
Reviewed By: nhaehnle
Differential Revision: https://reviews.llvm.org/D89644
D70365 allows us to make attributes default. This is a follow up to
actually make nosync, nofree and willreturn default. The approach we
chose, for now, is to opt-in to default attributes to avoid introducing
problems to target specific intrinsics. Intrinsics with default
attributes can be created using `DefaultAttrsIntrinsic` class.
S_CMP_LG_U64 was added in gfx8 and is guarded by hasScalarCompareEq64().
Rewrite S_CMP_LG_U64 to S_OR_B32 + S_CMP_LG_U32 for targets that
do not support 64-bit scalar compare.
Differential Revision: https://reviews.llvm.org/D89536
This broke Chromium's PGO build, it seems because hot-cold-splitting got turned
on unintentionally. See comment on the code review for repro etc.
> This patch adds -f[no-]split-cold-code CC1 options to clang. This allows
> the splitting pass to be toggled on/off. The current method of passing
> `-mllvm -hot-cold-split=true` to clang isn't ideal as it may not compose
> correctly (say, with `-O0` or `-Oz`).
>
> To implement the -fsplit-cold-code option, an attribute is applied to
> functions to indicate that they may be considered for splitting. This
> removes some complexity from the old/new PM pipeline builders, and
> behaves as expected when LTO is enabled.
>
> Co-authored by: Saleem Abdulrasool <compnerd@compnerd.org>
> Differential Revision: https://reviews.llvm.org/D57265
> Reviewed By: Aditya Kumar, Vedant Kumar
> Reviewers: Teresa Johnson, Aditya Kumar, Fedor Sergeev, Philip Pfaffe, Vedant Kumar
This reverts commit 273c299d5d.
If instructions were removed in peephole passes after the hazard recognizer was
run it is possible that new hazards could be introduced.
Fixes: SWDEV-253090
Reviewed By: rampitec, arsenm
Differential Revision: https://reviews.llvm.org/D89077
This would end up killing part of the result super-register, resulting
in a verifier error on a later use of the overlapping registers. We
could add kills of any non-aliasing registers, but we should be moving
away from relying on kill flags.
After investigation by @asbirlea, the issue that caused the
revert appears to be an issue in the original source, rather
than a problem with the compiler.
This patch enables MemorySSA DSE again.
This reverts commit 915310bf14.
This patch adds -f[no-]split-cold-code CC1 options to clang. This allows
the splitting pass to be toggled on/off. The current method of passing
`-mllvm -hot-cold-split=true` to clang isn't ideal as it may not compose
correctly (say, with `-O0` or `-Oz`).
To implement the -fsplit-cold-code option, an attribute is applied to
functions to indicate that they may be considered for splitting. This
removes some complexity from the old/new PM pipeline builders, and
behaves as expected when LTO is enabled.
Co-authored by: Saleem Abdulrasool <compnerd@compnerd.org>
Differential Revision: https://reviews.llvm.org/D57265
Reviewed By: Aditya Kumar, Vedant Kumar
Reviewers: Teresa Johnson, Aditya Kumar, Fedor Sergeev, Philip Pfaffe, Vedant Kumar
removeMBBifRedundant normally tries to keep predecessors fallthrough when removing redundant MBB.
It has to change MBBs layout to keep the new successor to immediately follow the predecessor of removed MBB.
It only may be allowed in case the new successor itself has no successors to which it fall through.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D89397
This does unfortunately end up with extra waitcnts getting inserted
that were avoided before. Ideally we would avoid the spills of these
undef components in the first place.
Generate the minimal set of s_mov instructions required when
expanding a SGPR copy operation in copyPhysReg.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D89187
Implement computeKnownBitsForTargetInstr for G_AMDGPU_BUFFER_LOAD_UBYTE
and G_AMDGPU_BUFFER_LOAD_USHORT. This allows generic combines to remove
some unnecessary G_ANDs.
Differential Revision: https://reviews.llvm.org/D89316
When the first operand is a null pointer we can avoid making a G_PTR_ADD and
make a G_INTTOPTR with the offset operand.
This helps us avoid making add with 0 later on for targets such as AMDGPU.
Differential Revision: https://reviews.llvm.org/D87140
This can fix an asan failure like below.
==15856==ERROR: AddressSanitizer: use-after-poison on address ...
READ of size 8 at 0x6210001a3cb0 thread T0
#0 llvm::MachineInstr::getParent()
#1 llvm::LiveVariables::VarInfo::findKill()
#2 TwoAddressInstructionPass::rescheduleMIBelowKill()
#3 TwoAddressInstructionPass::tryInstructionTransform()
#4 TwoAddressInstructionPass::runOnMachineFunction()
We need to update the Kills if we replace instructions. The Kills
may be later accessed within TwoAddressInstruction pass.
Differential Revision: https://reviews.llvm.org/D89092
Extend loadSRsrcFromVGPR to allow moving a range of instructions into
the loop. The call instruction is surrounded by copies into physical
registers which should be part of the waterfall loop.
Differential Revision: https://reviews.llvm.org/D88291
At AMD, in an internal audit of our code, we found some corner cases
where we were not quite differentiating targets enough for some old
hardware. This commit is part of fixing that by adding three new
targets:
* The "Oland" and "Hainan" variants of gfx601 are now split out into
gfx602. LLPC (in the GPUOpen driver) and other front-ends could use
that to avoid using the shaderZExport workaround on gfx602.
* One variant of gfx703 is now split out into gfx705. LLPC and other
front-ends could use that to avoid using the
shaderSpiCsRegAllocFragmentation workaround on gfx705.
* The "TongaPro" variant of gfx802 is now split out into gfx805.
TongaPro has a faster 64-bit shift than its former friends in gfx802,
and a subtarget feature could be set up for that to take advantage of
it. This commit does not make that change; it just adds the target.
V2: Add clang changes. Put TargetParser list in order.
V3: AMDGCNGPUs table in TargetParser.cpp needs to be in GPUKind order,
so fix the GPUKind order.
Differential Revision: https://reviews.llvm.org/D88916
Change-Id: Ia901a7157eb2f73ccd9f25dbacec38427312377d
Following on from D88890, this makes the newly added patterns
conditional on NoFP32Denormals. mad/mac f32 instructions always flush
denormals regardless of the MODE register setting, and I believe the
legacy variants do the same.
Differential Revision: https://reviews.llvm.org/D89123
Note that all subtargets up to GFX10.1 have v_mad_legacy_f32, but GFX8/9
lack v_mac_legacy_f32. GFX10.3 has no mad/mac f32 instructions at all.
Differential Revision: https://reviews.llvm.org/D88890
ExpandUnalignedLoad/Store can sometimes produce unnecessary copies to
temporary stack slot. We should prefer splitting vectors if possible.
Differential Revision: https://reviews.llvm.org/D88882
Summary:
This implements a workaround for a hardware bug in gfx8 and gfx9,
where register usage is not estimated correctly for image_store and
image_gather4 instructions when D16 is used.
Change-Id: I4e30744da6796acac53a9b5ad37ac1c2035c8899
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D81172
When unbundling COPY bundles in VirtRegRewriter the start of the
bundle is not correctly referenced in the unbundling loop.
The effect of this is that unbundled instructions are sometimes
inserted out-of-order, particular in cases where multiple
reordering have been applied to avoid clobbering dependencies.
The resulting instruction sequence clobbers dependencies.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D88821
If a CSEMIRBuilder query hits the instruction at the current insert point,
move insert point ahead one so that subsequent uses of the builder don't end up with
uses before defs.
This fix also shows that AMDGPU was also affected by this bug often, but got away
with it because it was using a G_IMPLICIT_DEF before the use.
Differential Revision: https://reviews.llvm.org/D88605
This tends to increase code size but more importantly it reduces vgpr
usage, and could avoid costly readfirstlanes if the result needs to be
in an sgpr.
Differential Revision: https://reviews.llvm.org/D88580