Commit Graph

5423 Commits

Author SHA1 Message Date
Matt Arsenault 574ca03ef3 AMDGPU: Remove some invalid kill flags in tests
These killed registers need to be live out of the block but the
verifier wasn't catching it.
2022-05-04 00:05:15 +01:00
Matt Arsenault 5dfe4b7cf2 AMDGPU: Regenerate test checks 2022-05-04 00:05:15 +01:00
Nicolai Hähnle cdc5b64ed6 AMDGPU/GISel: Update some MIR tests to reduce future churn
The default output format of the update_mir_test_checks.py script has
changed since some of these tests were generated.

Also, an upcoming commit will introduce differences between GFX9 and
GFX10 in the legalization of G_MUL.
2022-05-03 07:08:56 -05:00
hsmahesha 589b9df4e1 [AMDGPU] Fix scalar_to_vector for v8i16/v8f16
so that the stack access is avoided.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D124734
2022-05-03 07:28:15 +05:30
Stanislav Mekhanoshin 51e02409f0 [AMDGPU] Produce waitcounts for LDS DMA
MUBUF and FLAT LDS DMA operations need a wait on vmcnt before LDS written
can be accessed. A load from LDS to VMEM does not need a wait.

Differential Revision: https://reviews.llvm.org/D124626
2022-04-29 11:14:11 -07:00
Jay Foad 5fa169335f [AMDGPU] Simplify the test case for D124450 2022-04-29 12:12:40 +01:00
Ivan Kosarev 6ddf2a824d [AMDGPU] Adjust wave priority based on VMEM instructions to avoid duty-cycling.
As older waves execute long sequences of VALU instructions, this may
prevent younger waves from address calculation and then issuing their
VMEM loads, which in turn leads the VALU unit to idle. This patch tries
to prevent this by temporarily raising the wave's priority.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D124246
2022-04-27 14:37:18 +01:00
Jay Foad 6753bb2c41 [AMDGPU] Precommit a test case for D124450 2022-04-26 18:44:50 +01:00
Christudasan Devadasan 8f9dd5e608 [AMDGPU] Vector register spill test cleanup (NFC)
The vector register spills have no dependency with
SILowerSGPRSpills pass anymore. The entire handling
has been moved to PrologEpilogInserter with D55301.
2022-04-26 13:17:16 +05:30
Matt Arsenault 7714e03175 RegAllocGreedy: Allow last chance recolor to retry overlapping tuples
Last chance recoloring didn't try recoloring a done register with the
same class since it believed there was no point. This doesn't
necessarily apply if the members in that class overlap. Allow the
recoloring to proceed if the assigned interfering physical register
overlaps with the candidate register.

This avoids an allocation failure with overlapping tuples. This
testcase could be handled better, and I don't believe should reach
last chance recoloring. The failure only manifests with the mutually
unsatisfiable register hints to overlapping tuples. The earlier
assignment decisions probably should have figured out that using these
hints was a bad idea.
2022-04-25 17:07:17 -04:00
Arthur Eubanks 6f73bd7813 [test] Remove legacy PM pipeline test
The legacy PM for the optimization pipeline is deprecated and in the process of being cleaned up.
2022-04-25 10:02:38 -07:00
Christudasan Devadasan 16d87efc2a [AMDGPU] Lit test pre-commit changes (NFC)
Run line change needed for an upcoming patch.
2022-04-25 21:22:33 +05:30
Christudasan Devadasan 9f631cf7c6 [AMDGPU] Regenerate lit test pattern (NFC). 2022-04-25 21:12:33 +05:30
Matt Arsenault 794a0bb547 AMDGPU: Directly implement computeKnownBits for workitem intrinsics
Currently metadata is inserted in a late pass which is lowered
to an AssertZext. The metadata would be more useful if it was
inserted earlier after inlining, but before codegen.

Probably shouldn't change anything now. Just replacing the
late metadata annotation needs more work, since we lose
out on optimizations after these are lowered to CopyFromReg.

Seems to be slightly better than relying on the AssertZext from the
metadata. The test change in cvt_f32_ubyte.ll is a quirk from it using
-start-before=amdgpu-isel instead of running the usual codegen
pipeline.
2022-04-22 10:49:50 -04:00
Matt Arsenault 40bc9112c0 GlobalISel: Relax handling of G_ASSERT_* with source register classes
The most common situation where G_ASSERT_ZEXT appears for AMDGPU is a
copy from a physical register, which happens to use set the actual
register class on the virtual register. After copy coalescing, the
assert's source operand had a vreg with a set class. The verifier was
strictly rejecting cases where the set class/bank weren't an exact
match. Additionally, RegBankSelect was also expecting a register bank
to be set on the register, not a class.

This is much stricter than regular copies so relax this behavior. This
now allows these 2 cases:

1. Source register has either class or bank, and the result does not
2. Source register has a register class, and the result is a register
with a matching bank.

This should avoid needing some kind of special handling to avoid
violating this constraint when folding copies.
2022-04-22 10:49:50 -04:00
Abinav Puthan Purayil 45ca94334e [AMDGPU] Select no-return atomic intrinsics in tblgen
This is to avoid relying on the post-isel hook.

This change also enable the saddr pattern selection for atomic
intrinsics in GlobalISel.

Differential Revision: https://reviews.llvm.org/D123583
2022-04-22 09:37:40 +05:30
Matt Arsenault 667899a154 AMDGPU: Fix fneg combine test not checking full result
This wasn't accounting for the canonicalize of the input, or checking
the output fneg isn't folded as intended. Avoids test failure in
unrelated patch which happens to change register numberings.
2022-04-21 20:58:27 -04:00
Stanislav Mekhanoshin ac94073daa [AMDGPU] Refine 64 bit misaligned LDS ops selection
Here is the performance data:
```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b64                       aligned by  8:  3.2 sec
ds_write2_b32                      aligned by  8:  3.2 sec
ds_write_b16 * 4                   aligned by  8:  7.0 sec
ds_write_b8 * 8                    aligned by  8: 13.2 sec
ds_write_b64                       aligned by  1:  7.3 sec
ds_write2_b32                      aligned by  1:  7.5 sec
ds_write_b16 * 4                   aligned by  1: 14.0 sec
ds_write_b8 * 8                    aligned by  1: 13.2 sec
ds_write_b64                       aligned by  2:  7.3 sec
ds_write2_b32                      aligned by  2:  7.5 sec
ds_write_b16 * 4                   aligned by  2:  7.1 sec
ds_write_b8 * 8                    aligned by  2: 13.3 sec
ds_write_b64                       aligned by  4:  4.6 sec
ds_write2_b32                      aligned by  4:  3.2 sec
ds_write_b16 * 4                   aligned by  4:  7.1 sec
ds_write_b8 * 8                    aligned by  4: 13.3 sec
ds_read_b64                        aligned by  8:  2.3 sec
ds_read2_b32                       aligned by  8:  2.2 sec
ds_read_u16 * 4                    aligned by  8:  4.8 sec
ds_read_u8 * 8                     aligned by  8:  8.6 sec
ds_read_b64                        aligned by  1:  4.4 sec
ds_read2_b32                       aligned by  1:  7.3 sec
ds_read_u16 * 4                    aligned by  1: 14.0 sec
ds_read_u8 * 8                     aligned by  1:  8.7 sec
ds_read_b64                        aligned by  2:  4.4 sec
ds_read2_b32                       aligned by  2:  7.3 sec
ds_read_u16 * 4                    aligned by  2:  4.8 sec
ds_read_u8 * 8                     aligned by  2:  8.7 sec
ds_read_b64                        aligned by  4:  4.4 sec
ds_read2_b32                       aligned by  4:  2.3 sec
ds_read_u16 * 4                    aligned by  4:  4.8 sec
ds_read_u8 * 8                     aligned by  4:  8.7 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b64                       aligned by  8:  4.4 sec
ds_write2_b32                      aligned by  8:  4.3 sec
ds_write_b16 * 4                   aligned by  8:  7.9 sec
ds_write_b8 * 8                    aligned by  8: 13.0 sec
ds_write_b64                       aligned by  1: 23.2 sec
ds_write2_b32                      aligned by  1: 23.1 sec
ds_write_b16 * 4                   aligned by  1: 44.0 sec
ds_write_b8 * 8                    aligned by  1: 13.0 sec
ds_write_b64                       aligned by  2: 23.2 sec
ds_write2_b32                      aligned by  2: 23.1 sec
ds_write_b16 * 4                   aligned by  2:  7.9 sec
ds_write_b8 * 8                    aligned by  2: 13.1 sec
ds_write_b64                       aligned by  4: 13.5 sec
ds_write2_b32                      aligned by  4:  4.3 sec
ds_write_b16 * 4                   aligned by  4:  7.9 sec
ds_write_b8 * 8                    aligned by  4: 13.1 sec
ds_read_b64                        aligned by  8:  3.5 sec
ds_read2_b32                       aligned by  8:  3.4 sec
ds_read_u16 * 4                    aligned by  8:  5.3 sec
ds_read_u8 * 8                     aligned by  8:  8.5 sec
ds_read_b64                        aligned by  1: 13.1 sec
ds_read2_b32                       aligned by  1: 22.7 sec
ds_read_u16 * 4                    aligned by  1: 43.9 sec
ds_read_u8 * 8                     aligned by  1:  7.9 sec
ds_read_b64                        aligned by  2: 13.1 sec
ds_read2_b32                       aligned by  2: 22.7 sec
ds_read_u16 * 4                    aligned by  2:  5.6 sec
ds_read_u8 * 8                     aligned by  2:  7.9 sec
ds_read_b64                        aligned by  4: 13.1 sec
ds_read2_b32                       aligned by  4:  3.4 sec
ds_read_u16 * 4                    aligned by  4:  5.6 sec
ds_read_u8 * 8                     aligned by  4:  7.9 sec
```

GFX10 exposes a different pattern for sub-DWORD load/store performance
than GFX9. On GFX9 it is faster to issue a single unaligned load or
store than a fully split b8 access, where on GFX10 even a full split
is better. However, this is a theoretical only gain because splitting
an access to a sub-dword level will require more registers and packing/
unpacking logic, so ignoring this option it is better to use a single
64 bit instruction on a misaligned data with the exception of 4 byte
aligned data where ds_read2_b32/ds_write2_b32 is better.

Differential Revision: https://reviews.llvm.org/D123956
2022-04-21 09:37:16 -07:00
Petar Avramovic e06290e53f AMDGPU/GlobalISel: Fix isVCC for uniform s1 with reg class on wave32
Fix isVCC for register that was assigned register class during
inst-selection. This happens when register has multiple uses.
For wave32, uniform i1 to vcc copy was selected like vcc to vcc
copy when uniform i1 had assigned register class.
Uniform i1 register with assigned register class will have s1 LLT,
be defined using G_TRUNC and class will be SReg_32RegClass.
Vcc i1 register with assigned register class will have s1 LLT,
class will be SReg_32RegClass for wave32 and SReg_64RegClass for
wave64 and register will not be defined by G_TRUNC.

Differential Revision: https://reviews.llvm.org/D124163
2022-04-21 16:12:04 +02:00
Petar Avramovic 4e0dacb2cf AMDGPU/GlobalISel: Precommit test for D124163 2022-04-21 16:12:03 +02:00
Jannik Silvanus 607f8ced39 [AMDGPU]: Fix failing assertion in SIMachineScheduler
This fixes the assertion failure "Loop in the Block Graph!".

SIMachineScheduler groups instructions into blocks (also referred to
as coloring or groups) and then performs a two-level scheduling:
inter-block scheduling, and intra-block scheduling.

This approach requires that the dependency graph on the blocks which
is obtained by contracting the blocks in the original dependency graph
is acyclic. In other words: Whenever A and B end up in the same block,
all vertices on a path from A to B must be in the same block.

When compiling an example consisting of an export followed by
a buffer store, we see a dependency between these two. This dependency
may be false, but that is a different issue.
This dependency was not correctly accounted for by SiMachineScheduler.

A new test case si-scheduler-exports.ll demonstrating this is
also added in this commit.

The problematic part of SiMachineScheduler was a post-optimization of
the block assignment that tried to group all export instructions into
a separate export block for better execution performance. This routine
correctly checked that any paths from exports to exports did not
contain any non-exports, but not vice-versa: In case of an export with
a non-export successor dependency, that single export was moved
to a separate block, which could then be both a successor and a
predecessor block of a non-export block.

As fix, we now skip export grouping if there are exports with direct
non-export successor dependencies. This fixes the issue at hand,
but is slightly pessimistic:
We *could* group all exports into a separate block that have neither
direct nor indirect export successor dependencies.
We will review the potential performance impact and potentially
revisit with a more sophisticated implementation.

Note that just grouping all exports without direct non-export successor
dependencies could still lead to illegal blocks, since non-export A
could depend on export B that depends on export C. In that case,
export C has no non-export successor, but still may not be grouped
into an export block.
2022-04-21 14:52:29 +01:00
hsmahesha 5bd87350a5 [AMDGPU] On gfx908, reserve VGPR for AGPR copy based on register budget.
Based on available register budget, reserve highest available VGPR for
AGPR copy before RA. After RA, shift it to lowest unused VGPR if the one
exist.

Fixes SWDEV-330006.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D123525
2022-04-21 07:57:26 +05:30
hsmahesha 7895c87367 [AMDGPU] Split the lit test spill-vgpr-to-agpr.ll to different tests
[1]. Move the test which reject the usage of agpr before gfx908 into a separate file - reject-agpr-usage-before-gfx908.ll.
[2]. Move those tests which are applicable to both gfx900 and gfx908 into a separate file - spill-vgpr.ll.
[3]. Keep those tests which are specific to only gfx908 in the file spill-vgpr-to-agpr.ll.

Above split is required to properly update the tests in D123525.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D123973
2022-04-21 07:16:58 +05:30
Christudasan Devadasan 0d4a49b0f1 [AMDGPU] Regenerate lit test pattern (NFC). 2022-04-20 23:06:35 +05:30
Jay Foad 879ac41089 [AMDGPU] Fix crash in SIOptimizeExecMaskingPreRA
When folding a COPY of exec into another COPY, the call to
TII->isOperandLegal would crash because COPYs don't have defined
register classes for their operands.

Differential Revision: https://reviews.llvm.org/D122737
2022-04-20 14:42:48 +01:00
Jay Foad e13d2efed6 [AMDGPU] Add GlobalISel checks for flat scratch SVS addressing
Note that GlobalISel does not actually use the SVS addressing mode
for these cases yet because it chooses the VGPR bank for
G_FRAME_INDEX; see the TODO comment in
AMDGPURegisterBankInfo::getInstrMapping.
2022-04-20 12:06:39 +01:00
Matt Arsenault b5ec131267 AMDGPU: Fix allocating GDS globals to LDS offsets
These don't seem to be very well used or tested, but try to make the
behavior a bit more consistent with LDS globals.

I'm not sure what the definition for amdgpu-gds-size is supposed to
mean. For now I assumed it's allocating a static size at the beginning
of the allocation, and any known globals are allocated after it.
2022-04-19 22:14:48 -04:00
Matt Arsenault e0d585d75a AMDGPU: Defer creation of WWM VGPR spill slots
There's no reason to create these immediately. They can be created in
the prolog/epilog code like CSR spills. There's probably a cleaner way
to do this by utilizing the CSR spill code.

This makes the frame index used transient state for
PrologEpilogInserter, and thus makes serialization easier. Really this
doesn't need to be saved here but there isn't really a better place
for it.
2022-04-19 21:07:13 -04:00
Nicolai Hähnle b39d34de5e AMDGPU: More mad_64_32 test cases for multiple uses
Also use gfx90a for the gfx9 test, whose code gen should be affected by
faster multiply-add instructions.
2022-04-19 18:00:05 -05:00
Jay Foad f707e1255e [AMDGPU] Select d16 stores even when sramecc is enabled
The sramecc feature changes the behaviour of d16 loads so they do not
preserve the unused 16 bits of the result register, but it has no impact
on d16 stores, so we should make use of them even when the feature is
enabled.

Differential Revision: https://reviews.llvm.org/D104912
2022-04-19 09:34:32 +01:00
Austin Kerbow 7f97ac94f7 Revert "[AMDGPU] Omit unnecessary waitcnt before barriers"
This reverts commit 8d0c34fd4f.
2022-04-18 21:24:08 -07:00
hsmahesha 66c1fc19d6 [AMDGPU] Pre-checkin updated lit tests for D123525.
Fix indentation within the lit test - agpr-copy-no-free-registers.ll.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D123809
2022-04-17 10:45:47 +05:30
Johannes Doerfert 3f7a6ce0de [DWARF][FIX] Handle the use of multiple registers gracefully
Certain applications crashed for us with the AMDGPU backend. While this
is not a proper fix it allows us to compile the code for now. I left a
TODO for someone that understands DWARF.

Differential Revision: https://reviews.llvm.org/D123717
2022-04-15 13:43:50 -05:00
Fangrui Song 04e094a336 [PGO] Remove legacy PM passes
Legacy PM for optimization pipeline was deprecated in 13.0.0 and Clang dropped
legacy PM support in D123609. This change removes legacy PM passes for PGO so
that downstream projects won't be able to use it. It seems appropriate to start
removing such "add-on" features like instrumentations, before we remove more
stuff after 15.x is branched.

I have checked many LLVM users and only ldc[1] uses the legacy PGO pass.

[1]: https://github.com/ldc-developers/ldc/issues/3961

Reviewed By: davidxl

Differential Revision: https://reviews.llvm.org/D123834
2022-04-15 10:26:43 -07:00
Nicolai Hähnle f097088b05 AMDGPU: Add more mad_64_32 test cases
Test the behavior when a MUL is used multiple times, as well as when it
is uniform.

Run the tests for gfx9 as well, which added S_MUL_HI_[IU]32.
2022-04-15 00:38:37 -05:00
Nicolai Hähnle 90a17ef6cc AMDGPU: Add mixed sign/zero-extend multiply-add test
There's a missed opportunity here that a later patch will exploit.
2022-04-14 23:34:45 -05:00
Matt Arsenault df29ec2f54 AMDGPU: Select i8/i16 global and flat atomic load/store
As far as I know these should be atomic anyway, as long as the address
is aligned. Unaligned atomics hit an ugly error in AtomicExpand.
2022-04-14 20:52:05 -04:00
Matt Arsenault c528fbf882 AMDGPU: Fix assert if v_mov_b32_dpp is last instruction in the block
This can happen if the use instruction is a phi.

Fixes issue 49961
2022-04-14 20:21:22 -04:00
Matt Arsenault 9196f5dab7 MachineCSE: Report this requires SSA 2022-04-14 20:21:21 -04:00
Jay Foad defce20cbb [AMDGPU] Add a test for flat scratch SVS addressing 2022-04-14 09:39:16 +01:00
Carl Ritson 35ea326047 [AMDGPU] Try to avoid inserting duplicate s_inst_prefetch
Check for existing s_inst_prefetch instructions when
configuring prefetches during loop alignment.

Reviewed By: rampitec, foad

Differential Revision: https://reviews.llvm.org/D123569
2022-04-14 16:06:24 +09:00
Ruiling Song 1e01f95057 LowerSwitch: Avoid inserting NewDefault block
The NewDefault was used to simplify the updating of PHI nodes, but it
causes some inefficiency for target that will run structurizer later. For
example, for a simple two-case switch, the extra NewDefault is causing
unstructured CFG like:

        O
       / \
      O   O
     / \ / \
    C1  ND C2
     \  |  /
      \ | /
        D

The change is to avoid the ND(NewDefault) block, that is we will get a
structured CFG for above example like:

        O
       / \
      /   \
     O     O
    / \   / \
   C1  \ /  C2
    \-> D <-/

The IR change introduced by this patch should be trivial to other targets,
so I am doing this unconditionally.

Fall-through among the cases will also cause unstructured CFG, but it need
more work and will be addressed in a separate change.

Reviewed by: arsenm

Differential Revision: https://reviews.llvm.org/D123607
2022-04-14 13:30:56 +08:00
Ruiling Song 7c87d75d74 test: Don't depend on behavior of switch lower in one test. NFC
This is a preliminary change to update the test so that it does not
depend on how switch-case will be lowered. The following change will
lower switch-case more optimally that will make the test no longer
valid.

Reviewed by: arsenm

Differential Revision: https://reviews.llvm.org/D123606
2022-04-14 13:30:48 +08:00
Stanislav Mekhanoshin d951d937a0 [AMDGPU] Increate hazard for store dwordx3/4 to 2 waitstates on gfx940
Fixes: SWDEV-327053

Differential Revision: https://reviews.llvm.org/D123687
2022-04-13 14:21:45 -07:00
Matt Arsenault 1732242bee RegAlloc: Fix remaining virtual registers after allocation failure
This testcase fails register allocation, but at the failure point
there were also new split virtual registers. Previously this was
assigning the failing register and not enqueueing the newly created
split virtual registers. These would then never be allocated and
assert in VirtRegRewriter.
2022-04-13 16:25:30 -04:00
Matt Arsenault e78f70bccb AMDGPU: Relax test check on tablegen debug output
Try to match tN and pointer for asserts and non-assert builds.
2022-04-13 15:00:18 -04:00
Dmitry Preobrazhensky ab18e1a533 [AMDGPU][GFX10] Enabled op_sel for v_add_nc_u16 and v_sub_nc_u16
Differential Revision: https://reviews.llvm.org/D123594
2022-04-13 13:48:42 +03:00
Matt Arsenault c986d476cd AMDGPU: Update reqd-work-group-size optimization for umin intrinsic
This code was pattern matching the ID computation expression as it
appears in the library. This was a compare and select, but now that
umin is canonical, we were no longer matching. Update to match the
intrinsic instead.
2022-04-12 20:03:02 -04:00
Matt Arsenault d4b1be20f6 RegAllocGreedy: Fix illegal eviction assert for urgent evictions
The condition in canEvictInterferenceBasedOnCost is slightly different
from the assertion in evictInteference.
canEvictInterferenceBasedOnCost uses a <= check for the cascade number
for legality, but the assert was checking for <. For equal cascade
numbers for an urgent eviction, canEvictInterferenceBasedOnCost could
return success. The actual eviction would then hit this assert. Avoid
ever returning true for equivalent cascade numbers.

The resulting failed allocation seems a bit off to me. e.g. in
illegal-eviction-assert.mir, I wuold assume %0 gets allocated starting
at $vgpr0. That was its initial allocation choice, but was later
evicted. In this example no evictions can help improve anything.
2022-04-12 19:16:56 -04:00
Stanislav Mekhanoshin f6462a26f0 [AMDGPU] Split unaligned 4 DWORD DS operations
Similarly to 3 DWORD operations it is better for performance
to split unlaligned operations as long a these are at least
DWORD alignmened. Performance data:

```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b128                      aligned by 16:  4.9 sec
ds_write2_b64                      aligned by 16:  5.1 sec
ds_write2_b32 * 2                  aligned by 16:  5.5 sec
ds_write_b128                      aligned by  1:  8.1 sec
ds_write2_b64                      aligned by  1:  8.7 sec
ds_write2_b32 * 2                  aligned by  1: 14.0 sec
ds_write_b128                      aligned by  2:  8.1 sec
ds_write2_b64                      aligned by  2:  8.7 sec
ds_write2_b32 * 2                  aligned by  2: 14.0 sec
ds_write_b128                      aligned by  4:  5.6 sec
ds_write2_b64                      aligned by  4:  8.7 sec
ds_write2_b32 * 2                  aligned by  4:  5.6 sec
ds_write_b128                      aligned by  8:  5.6 sec
ds_write2_b64                      aligned by  8:  5.1 sec
ds_write2_b32 * 2                  aligned by  8:  5.6 sec
ds_read_b128                       aligned by 16:  3.8 sec
ds_read2_b64                       aligned by 16:  3.8 sec
ds_read2_b32 * 2                   aligned by 16:  4.0 sec
ds_read_b128                       aligned by  1:  4.6 sec
ds_read2_b64                       aligned by  1:  8.1 sec
ds_read2_b32 * 2                   aligned by  1: 14.0 sec
ds_read_b128                       aligned by  2:  4.6 sec
ds_read2_b64                       aligned by  2:  8.1 sec
ds_read2_b32 * 2                   aligned by  2: 14.0 sec
ds_read_b128                       aligned by  4:  4.6 sec
ds_read2_b64                       aligned by  4:  8.1 sec
ds_read2_b32 * 2                   aligned by  4:  4.0 sec
ds_read_b128                       aligned by  8:  4.6 sec
ds_read2_b64                       aligned by  8:  3.8 sec
ds_read2_b32 * 2                   aligned by  8:  4.0 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b128                      aligned by 16:  6.2 sec
ds_write2_b64                      aligned by 16:  7.1 sec
ds_write2_b32 * 2                  aligned by 16:  7.6 sec
ds_write_b128                      aligned by  1: 24.1 sec
ds_write2_b64                      aligned by  1: 25.2 sec
ds_write2_b32 * 2                  aligned by  1: 43.7 sec
ds_write_b128                      aligned by  2: 24.1 sec
ds_write2_b64                      aligned by  2: 25.1 sec
ds_write2_b32 * 2                  aligned by  2: 43.7 sec
ds_write_b128                      aligned by  4: 14.4 sec
ds_write2_b64                      aligned by  4: 25.1 sec
ds_write2_b32 * 2                  aligned by  4:  7.6 sec
ds_write_b128                      aligned by  8: 14.4 sec
ds_write2_b64                      aligned by  8:  7.1 sec
ds_write2_b32 * 2                  aligned by  8:  7.6 sec
ds_read_b128                       aligned by 16:  6.2 sec
ds_read2_b64                       aligned by 16:  6.3 sec
ds_read2_b32 * 2                   aligned by 16:  7.5 sec
ds_read_b128                       aligned by  1: 12.5 sec
ds_read2_b64                       aligned by  1: 24.0 sec
ds_read2_b32 * 2                   aligned by  1: 43.6 sec
ds_read_b128                       aligned by  2: 12.5 sec
ds_read2_b64                       aligned by  2: 24.0 sec
ds_read2_b32 * 2                   aligned by  2: 43.6 sec
ds_read_b128                       aligned by  4: 12.5 sec
ds_read2_b64                       aligned by  4: 24.0 sec
ds_read2_b32 * 2                   aligned by  4:  7.5 sec
ds_read_b128                       aligned by  8: 12.5 sec
ds_read2_b64                       aligned by  8:  6.3 sec
ds_read2_b32 * 2                   aligned by  8:  7.5 sec
```

Differential Revision: https://reviews.llvm.org/D123634
2022-04-12 16:07:13 -07:00