The default output format of the update_mir_test_checks.py script has
changed since some of these tests were generated.
Also, an upcoming commit will introduce differences between GFX9 and
GFX10 in the legalization of G_MUL.
MUBUF and FLAT LDS DMA operations need a wait on vmcnt before LDS written
can be accessed. A load from LDS to VMEM does not need a wait.
Differential Revision: https://reviews.llvm.org/D124626
As older waves execute long sequences of VALU instructions, this may
prevent younger waves from address calculation and then issuing their
VMEM loads, which in turn leads the VALU unit to idle. This patch tries
to prevent this by temporarily raising the wave's priority.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D124246
The vector register spills have no dependency with
SILowerSGPRSpills pass anymore. The entire handling
has been moved to PrologEpilogInserter with D55301.
Last chance recoloring didn't try recoloring a done register with the
same class since it believed there was no point. This doesn't
necessarily apply if the members in that class overlap. Allow the
recoloring to proceed if the assigned interfering physical register
overlaps with the candidate register.
This avoids an allocation failure with overlapping tuples. This
testcase could be handled better, and I don't believe should reach
last chance recoloring. The failure only manifests with the mutually
unsatisfiable register hints to overlapping tuples. The earlier
assignment decisions probably should have figured out that using these
hints was a bad idea.
Currently metadata is inserted in a late pass which is lowered
to an AssertZext. The metadata would be more useful if it was
inserted earlier after inlining, but before codegen.
Probably shouldn't change anything now. Just replacing the
late metadata annotation needs more work, since we lose
out on optimizations after these are lowered to CopyFromReg.
Seems to be slightly better than relying on the AssertZext from the
metadata. The test change in cvt_f32_ubyte.ll is a quirk from it using
-start-before=amdgpu-isel instead of running the usual codegen
pipeline.
The most common situation where G_ASSERT_ZEXT appears for AMDGPU is a
copy from a physical register, which happens to use set the actual
register class on the virtual register. After copy coalescing, the
assert's source operand had a vreg with a set class. The verifier was
strictly rejecting cases where the set class/bank weren't an exact
match. Additionally, RegBankSelect was also expecting a register bank
to be set on the register, not a class.
This is much stricter than regular copies so relax this behavior. This
now allows these 2 cases:
1. Source register has either class or bank, and the result does not
2. Source register has a register class, and the result is a register
with a matching bank.
This should avoid needing some kind of special handling to avoid
violating this constraint when folding copies.
This is to avoid relying on the post-isel hook.
This change also enable the saddr pattern selection for atomic
intrinsics in GlobalISel.
Differential Revision: https://reviews.llvm.org/D123583
This wasn't accounting for the canonicalize of the input, or checking
the output fneg isn't folded as intended. Avoids test failure in
unrelated patch which happens to change register numberings.
Here is the performance data:
```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-
ds_write_b64 aligned by 8: 3.2 sec
ds_write2_b32 aligned by 8: 3.2 sec
ds_write_b16 * 4 aligned by 8: 7.0 sec
ds_write_b8 * 8 aligned by 8: 13.2 sec
ds_write_b64 aligned by 1: 7.3 sec
ds_write2_b32 aligned by 1: 7.5 sec
ds_write_b16 * 4 aligned by 1: 14.0 sec
ds_write_b8 * 8 aligned by 1: 13.2 sec
ds_write_b64 aligned by 2: 7.3 sec
ds_write2_b32 aligned by 2: 7.5 sec
ds_write_b16 * 4 aligned by 2: 7.1 sec
ds_write_b8 * 8 aligned by 2: 13.3 sec
ds_write_b64 aligned by 4: 4.6 sec
ds_write2_b32 aligned by 4: 3.2 sec
ds_write_b16 * 4 aligned by 4: 7.1 sec
ds_write_b8 * 8 aligned by 4: 13.3 sec
ds_read_b64 aligned by 8: 2.3 sec
ds_read2_b32 aligned by 8: 2.2 sec
ds_read_u16 * 4 aligned by 8: 4.8 sec
ds_read_u8 * 8 aligned by 8: 8.6 sec
ds_read_b64 aligned by 1: 4.4 sec
ds_read2_b32 aligned by 1: 7.3 sec
ds_read_u16 * 4 aligned by 1: 14.0 sec
ds_read_u8 * 8 aligned by 1: 8.7 sec
ds_read_b64 aligned by 2: 4.4 sec
ds_read2_b32 aligned by 2: 7.3 sec
ds_read_u16 * 4 aligned by 2: 4.8 sec
ds_read_u8 * 8 aligned by 2: 8.7 sec
ds_read_b64 aligned by 4: 4.4 sec
ds_read2_b32 aligned by 4: 2.3 sec
ds_read_u16 * 4 aligned by 4: 4.8 sec
ds_read_u8 * 8 aligned by 4: 8.7 sec
Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030
ds_write_b64 aligned by 8: 4.4 sec
ds_write2_b32 aligned by 8: 4.3 sec
ds_write_b16 * 4 aligned by 8: 7.9 sec
ds_write_b8 * 8 aligned by 8: 13.0 sec
ds_write_b64 aligned by 1: 23.2 sec
ds_write2_b32 aligned by 1: 23.1 sec
ds_write_b16 * 4 aligned by 1: 44.0 sec
ds_write_b8 * 8 aligned by 1: 13.0 sec
ds_write_b64 aligned by 2: 23.2 sec
ds_write2_b32 aligned by 2: 23.1 sec
ds_write_b16 * 4 aligned by 2: 7.9 sec
ds_write_b8 * 8 aligned by 2: 13.1 sec
ds_write_b64 aligned by 4: 13.5 sec
ds_write2_b32 aligned by 4: 4.3 sec
ds_write_b16 * 4 aligned by 4: 7.9 sec
ds_write_b8 * 8 aligned by 4: 13.1 sec
ds_read_b64 aligned by 8: 3.5 sec
ds_read2_b32 aligned by 8: 3.4 sec
ds_read_u16 * 4 aligned by 8: 5.3 sec
ds_read_u8 * 8 aligned by 8: 8.5 sec
ds_read_b64 aligned by 1: 13.1 sec
ds_read2_b32 aligned by 1: 22.7 sec
ds_read_u16 * 4 aligned by 1: 43.9 sec
ds_read_u8 * 8 aligned by 1: 7.9 sec
ds_read_b64 aligned by 2: 13.1 sec
ds_read2_b32 aligned by 2: 22.7 sec
ds_read_u16 * 4 aligned by 2: 5.6 sec
ds_read_u8 * 8 aligned by 2: 7.9 sec
ds_read_b64 aligned by 4: 13.1 sec
ds_read2_b32 aligned by 4: 3.4 sec
ds_read_u16 * 4 aligned by 4: 5.6 sec
ds_read_u8 * 8 aligned by 4: 7.9 sec
```
GFX10 exposes a different pattern for sub-DWORD load/store performance
than GFX9. On GFX9 it is faster to issue a single unaligned load or
store than a fully split b8 access, where on GFX10 even a full split
is better. However, this is a theoretical only gain because splitting
an access to a sub-dword level will require more registers and packing/
unpacking logic, so ignoring this option it is better to use a single
64 bit instruction on a misaligned data with the exception of 4 byte
aligned data where ds_read2_b32/ds_write2_b32 is better.
Differential Revision: https://reviews.llvm.org/D123956
Fix isVCC for register that was assigned register class during
inst-selection. This happens when register has multiple uses.
For wave32, uniform i1 to vcc copy was selected like vcc to vcc
copy when uniform i1 had assigned register class.
Uniform i1 register with assigned register class will have s1 LLT,
be defined using G_TRUNC and class will be SReg_32RegClass.
Vcc i1 register with assigned register class will have s1 LLT,
class will be SReg_32RegClass for wave32 and SReg_64RegClass for
wave64 and register will not be defined by G_TRUNC.
Differential Revision: https://reviews.llvm.org/D124163
This fixes the assertion failure "Loop in the Block Graph!".
SIMachineScheduler groups instructions into blocks (also referred to
as coloring or groups) and then performs a two-level scheduling:
inter-block scheduling, and intra-block scheduling.
This approach requires that the dependency graph on the blocks which
is obtained by contracting the blocks in the original dependency graph
is acyclic. In other words: Whenever A and B end up in the same block,
all vertices on a path from A to B must be in the same block.
When compiling an example consisting of an export followed by
a buffer store, we see a dependency between these two. This dependency
may be false, but that is a different issue.
This dependency was not correctly accounted for by SiMachineScheduler.
A new test case si-scheduler-exports.ll demonstrating this is
also added in this commit.
The problematic part of SiMachineScheduler was a post-optimization of
the block assignment that tried to group all export instructions into
a separate export block for better execution performance. This routine
correctly checked that any paths from exports to exports did not
contain any non-exports, but not vice-versa: In case of an export with
a non-export successor dependency, that single export was moved
to a separate block, which could then be both a successor and a
predecessor block of a non-export block.
As fix, we now skip export grouping if there are exports with direct
non-export successor dependencies. This fixes the issue at hand,
but is slightly pessimistic:
We *could* group all exports into a separate block that have neither
direct nor indirect export successor dependencies.
We will review the potential performance impact and potentially
revisit with a more sophisticated implementation.
Note that just grouping all exports without direct non-export successor
dependencies could still lead to illegal blocks, since non-export A
could depend on export B that depends on export C. In that case,
export C has no non-export successor, but still may not be grouped
into an export block.
Based on available register budget, reserve highest available VGPR for
AGPR copy before RA. After RA, shift it to lowest unused VGPR if the one
exist.
Fixes SWDEV-330006.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D123525
[1]. Move the test which reject the usage of agpr before gfx908 into a separate file - reject-agpr-usage-before-gfx908.ll.
[2]. Move those tests which are applicable to both gfx900 and gfx908 into a separate file - spill-vgpr.ll.
[3]. Keep those tests which are specific to only gfx908 in the file spill-vgpr-to-agpr.ll.
Above split is required to properly update the tests in D123525.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D123973
When folding a COPY of exec into another COPY, the call to
TII->isOperandLegal would crash because COPYs don't have defined
register classes for their operands.
Differential Revision: https://reviews.llvm.org/D122737
Note that GlobalISel does not actually use the SVS addressing mode
for these cases yet because it chooses the VGPR bank for
G_FRAME_INDEX; see the TODO comment in
AMDGPURegisterBankInfo::getInstrMapping.
These don't seem to be very well used or tested, but try to make the
behavior a bit more consistent with LDS globals.
I'm not sure what the definition for amdgpu-gds-size is supposed to
mean. For now I assumed it's allocating a static size at the beginning
of the allocation, and any known globals are allocated after it.
There's no reason to create these immediately. They can be created in
the prolog/epilog code like CSR spills. There's probably a cleaner way
to do this by utilizing the CSR spill code.
This makes the frame index used transient state for
PrologEpilogInserter, and thus makes serialization easier. Really this
doesn't need to be saved here but there isn't really a better place
for it.
The sramecc feature changes the behaviour of d16 loads so they do not
preserve the unused 16 bits of the result register, but it has no impact
on d16 stores, so we should make use of them even when the feature is
enabled.
Differential Revision: https://reviews.llvm.org/D104912
Certain applications crashed for us with the AMDGPU backend. While this
is not a proper fix it allows us to compile the code for now. I left a
TODO for someone that understands DWARF.
Differential Revision: https://reviews.llvm.org/D123717
Legacy PM for optimization pipeline was deprecated in 13.0.0 and Clang dropped
legacy PM support in D123609. This change removes legacy PM passes for PGO so
that downstream projects won't be able to use it. It seems appropriate to start
removing such "add-on" features like instrumentations, before we remove more
stuff after 15.x is branched.
I have checked many LLVM users and only ldc[1] uses the legacy PGO pass.
[1]: https://github.com/ldc-developers/ldc/issues/3961
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D123834
The NewDefault was used to simplify the updating of PHI nodes, but it
causes some inefficiency for target that will run structurizer later. For
example, for a simple two-case switch, the extra NewDefault is causing
unstructured CFG like:
O
/ \
O O
/ \ / \
C1 ND C2
\ | /
\ | /
D
The change is to avoid the ND(NewDefault) block, that is we will get a
structured CFG for above example like:
O
/ \
/ \
O O
/ \ / \
C1 \ / C2
\-> D <-/
The IR change introduced by this patch should be trivial to other targets,
so I am doing this unconditionally.
Fall-through among the cases will also cause unstructured CFG, but it need
more work and will be addressed in a separate change.
Reviewed by: arsenm
Differential Revision: https://reviews.llvm.org/D123607
This is a preliminary change to update the test so that it does not
depend on how switch-case will be lowered. The following change will
lower switch-case more optimally that will make the test no longer
valid.
Reviewed by: arsenm
Differential Revision: https://reviews.llvm.org/D123606
This testcase fails register allocation, but at the failure point
there were also new split virtual registers. Previously this was
assigning the failing register and not enqueueing the newly created
split virtual registers. These would then never be allocated and
assert in VirtRegRewriter.
This code was pattern matching the ID computation expression as it
appears in the library. This was a compare and select, but now that
umin is canonical, we were no longer matching. Update to match the
intrinsic instead.
The condition in canEvictInterferenceBasedOnCost is slightly different
from the assertion in evictInteference.
canEvictInterferenceBasedOnCost uses a <= check for the cascade number
for legality, but the assert was checking for <. For equal cascade
numbers for an urgent eviction, canEvictInterferenceBasedOnCost could
return success. The actual eviction would then hit this assert. Avoid
ever returning true for equivalent cascade numbers.
The resulting failed allocation seems a bit off to me. e.g. in
illegal-eviction-assert.mir, I wuold assume %0 gets allocated starting
at $vgpr0. That was its initial allocation choice, but was later
evicted. In this example no evictions can help improve anything.