Commit Graph

4066 Commits

Author SHA1 Message Date
Jay Foad d3f13f3edf [AMDGPU] Remove a comment. NFC.
This was obsoleted by f78687df9b which added gfx9 aligned/unaligned
tests.
2020-11-02 13:56:46 +00:00
Christudasan Devadasan 9bb2b4f0aa [AMDGPU] Add alignment check for v3 to v4 load type promotion
It should be enabled only when the load alignment is at least 8 bytes.

Fixes: SWDEV-256824

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D90404
2020-11-01 12:05:34 +05:30
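
A minimal C++ sketch of the guard described in the commit above; the helper name and its placement are assumptions, only the 8-byte alignment requirement comes from the commit message.

```cpp
#include "llvm/IR/Instructions.h"

// Sketch only: gate v3 -> v4 load type promotion on the load being at least
// 8-byte aligned. Helper name and call site are assumed.
static bool mayPromoteV3ToV4(const llvm::LoadInst &LI, unsigned NumElts) {
  return NumElts == 3 && LI.getAlign().value() >= 8;
}
```
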
Scott Linder 13a56ca5a9 [AMDGPU] Refactor and extend elf-header-flags-mach tests
* Factor out common elements of the input YAML document and use sed to
  macro replace the run line specific elements.
* Add checks for the common elements which depend on the ELF class.
* Use non-numeric suffix for temporary files to avoid merge conflicts.
* Sort tests by GFX# ascending.
* Group ELF and YAML tests by GFX#.

Reviewed By: t-tye

Differential Revision: https://reviews.llvm.org/D90245
2020-10-30 18:57:04 +00:00
Matt Arsenault 790f5771fd AMDGPU: Fix missing writelane cases to skip with exec=0 2020-10-30 11:15:11 -04:00
Jay Foad 58de4b2053 [AMDGPU] Use pseudo instructions for readlane/writelane
This reverts r227987 "R600/SI: Determine target-specific encoding of READLANE and WRITELANE early v2".

All the codegen changes are caused by the post-RA scheduler no longer
treating readlane/writelane as scheduling barriers due to having
unmodelled side effects. (The pseudos are hasSideEffects = 0, but the
real instructions are hasSideEffects = ? which TableGen conservatively
treats as 1.)

Differential Revision: https://reviews.llvm.org/D90401
2020-10-29 16:00:53 +00:00
Jay Foad 7a79921edd [AMDGPU] Remove gds operand from ds_gws_* MachineInstrs
The operand value was always 1 (except in some bad MIR tests) so it was
redundant.

Differential Revision: https://reviews.llvm.org/D90378
2020-10-29 15:04:23 +00:00
Austin Kerbow de51867343 [AMDGPU] Add Reset function to GCNHazardRecognizer
Reset the tracked emitted instructions when starting scheduling on a new
region.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D90347
2020-10-28 16:32:32 -07:00
Jay Foad 5b91a6a88b [AMDGPU] Allow some modifiers on VOP3B instructions
V_DIV_SCALE_F32/F64 are VOP3B encoded so they can't use the ABS src
modifier, but they can still use NEG and the usual output modifiers.

This partially reverts 3b99f12a4e "AMDGPU: Remove modifiers from v_div_scale_*".

Differential Revision: https://reviews.llvm.org/D90296
2020-10-28 21:54:14 +00:00
Austin Kerbow 8b127a8661 [AMDGPU] Fix inserting combined s_nop in bundles
Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D90334
2020-10-28 14:34:04 -07:00
Aditya Nandakumar bed8394047 [GISel]: Few InsertVecElt combines
https://reviews.llvm.org/D88060

This adds the following combines
1) build_vector formation from insert_vec_elts
2) insert_vec_elts (build_vector) -> build_vector
2020-10-28 12:27:07 -07:00
Matt Arsenault b9c21d43bb RegAlloc: Clear isSSA
The MIR parser may infer SSA, so -run-pass=regallocgreedy would hit a
verifier error after multiple vreg defs are added.
2020-10-28 12:02:16 -04:00
Sebastian Neubauer 09c7345683 [AMDGPU] Precommit tests for D89388 and D89399, NFC 2020-10-28 16:58:55 +01:00
Carl Ritson 057934a6d7 [AMDGPU] Fix insert of SIPreAllocateWWMRegs in FastRegAlloc
SIPreAllocateWWMRegs was being inserted after RegisterCoalescer,
but that pass does not exist during FastAlloc, so the pre-allocation
pass was never run.
Insert the pre-allocation pass after TwoAddressInstructionPass instead.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D90236
2020-10-28 12:15:15 +09:00
Michael Liao 46c3d5cb05 [amdgpu] Add the late codegen preparation pass.
Summary:
- Teach that pass to widen naturally aligned but not DWORD aligned
  sub-DWORD loads.

Reviewers: rampitec, arsenm

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D80364
2020-10-27 14:07:59 -04:00
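
A hedged sketch of the widening condition the commit above describes; the helper name is hypothetical and the real pass does more than evaluate this predicate.

```cpp
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instructions.h"

// Sketch: a sub-DWORD load that is naturally aligned (alignment >= its size)
// but not DWORD (4-byte) aligned is a candidate for widening to a 32-bit
// load followed by a truncation.
static bool isWidenableSubDwordLoad(const llvm::LoadInst &LI,
                                    const llvm::DataLayout &DL) {
  uint64_t Size = DL.getTypeStoreSize(LI.getType()); // in bytes
  uint64_t Alignment = LI.getAlign().value();
  return Size < 4 && Alignment >= Size && Alignment < 4;
}
```
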
Jay Foad d028d2b376 [AMDGPU] Add llvm.amdgcn.div.scale with fneg tests 2020-10-27 16:05:51 +00:00
Michael Liao 0d092303b4 [amdgpu] Enable use of AA during codegen.
- Add an internal option `-amdgpu-use-aa-in-codegen` to enable or
  disable this feature. By default, it's enabled.

Differential Revision: https://reviews.llvm.org/D89320
2020-10-27 09:46:23 -04:00
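
The flag name comes from the commit message; the variable name and description string below are assumptions about how such an internal option is typically declared.

```cpp
#include "llvm/Support/CommandLine.h"

// Sketch of declaring the internal toggle; enabled by default.
static llvm::cl::opt<bool> UseAAInCodegen(
    "amdgpu-use-aa-in-codegen",
    llvm::cl::desc("Enable the use of AA during codegen (assumed wording)"),
    llvm::cl::init(true), llvm::cl::Hidden);
```
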
Carl Ritson 7a880ab388 [AMDGPU] Move WQM Pass after MI Scheduler
Exec mask manipulation inserted by SIWholeQuadMode acts as a barrier to
instruction scheduling.  Move the entire pass after the machine
instruction scheduler and make changes so the pass is correct for
non-SSA operation.  These changes should leave the pass still
usable pre-scheduler, although tests have been updated to reflect
post-scheduler results.

Reviewed By: nhaehnle

Differential Revision: https://reviews.llvm.org/D88081
2020-10-27 10:25:53 +09:00
Stanislav Mekhanoshin 038d884a50 [AMDGPU] Use flat scratch instructions where available
The support is disabled by default. So far there is instruction
selection, spilling, and frame elimination. It also changes SP
from unswizzled to swizzled as used by flat scratch instructions,
so it cannot be mixed with MUBUF stack access.

At the very least missing:

- GlobalISel;
- Some optimizations in frame elimination in between vector
  and scalar ALU;
- It should eventually allow a frame index to always be materialized
  as an SGPR, but that is not implemented and frame elimination
  cannot handle it yet;
- Unaligned and/or multi-dword flat scratch should work, but it
  is currently legalized for MUBUF;
- Operand folding cannot optimize FI like with MUBUF yet;
- The value of the SP/FP will need to be scaled in the DWARF
  expression to recover the unswizzled scratch address;

Differential Revision: https://reviews.llvm.org/D89170
2020-10-26 14:40:42 -07:00
Fraser Cormack ffa6d2afa4 [DAGCombine] Add test case showing incorrect DAGCombine optimization
This optimization produces incorrect results when the vector element type
is not byte-sized. Related to D78568.
2020-10-26 12:37:31 +00:00
Sebastian Neubauer a094b4fa4b [AMDGPU] Emit new pal metadata by default
If no pal metadata is given, default to the msgpack format instead of
the legacy metadata. This makes tests more readable.

Differential Revision: https://reviews.llvm.org/D90035
2020-10-26 10:16:17 +01:00
Christudasan Devadasan 5a061041ec [AMDGPU] Avoid offset register in MUBUF for direct stack object accesses
We use an absolute address for stack objects, so a constant 0
is required in the soffset field.

Fixes: SWDEV-228562

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D89234
2020-10-26 11:08:37 +05:30
Matt Arsenault d61996473d AMDGPU: Increase branch size estimate with offset bug
This will be relaxed to insert a nop if the offset hits the bad value,
so overestimate branch instruction sizes.
2020-10-23 10:34:24 -04:00
Matt Arsenault 549f326d32 AMDGPU: Cleanup MIR test
Remove registers section and compact block/register numbers
2020-10-22 12:54:35 -04:00
Piotr Sobczak 7ae0033ca8 [AMDGPU] Fix expansion of i16 MULH
This commit marks i16 MULH as expand in the AMDGPU backend,
which is necessary after the refactoring in D80485.

Differential Revision: https://reviews.llvm.org/D89965
2020-10-22 17:05:06 +02:00
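
A sketch of how a target requests this expansion; the exact constructor these calls live in, and whether both the signed and unsigned variants are covered, are assumptions — only "i16 MULH is expanded" comes from the commit.

```cpp
// Inside the AMDGPU TargetLowering constructor (placement assumed): ask
// legalization to expand 16-bit high-multiply instead of selecting it.
setOperationAction(ISD::MULHS, MVT::i16, Expand);
setOperationAction(ISD::MULHU, MVT::i16, Expand);
```
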
Matt Arsenault d3bcfe2a36 AMDGPU: Implement getNoPreservedMask
We don't support funclets for exception handling and I hit this when
manually reducing MIR.
2020-10-22 10:17:31 -04:00
Matt Arsenault 188df17420 ScheduleDAGInstrs: Skip debug instructions at end of scheduling region
If the end instruction of the scheduling region was a DBG_VALUE, the
uses of the debug instruction were tracked as if they were real
uses. This would then hit the deadDefHasNoUse assertion in
addVRegDefDeps if the only use was the debug instruction.
2020-10-22 10:16:45 -04:00
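
An illustrative fragment of the idea, not the actual patch (it reuses the pass's region begin/end iterator convention): step the region end back over trailing debug instructions so a DBG_VALUE's operands are never recorded as real uses.

```cpp
// Sketch: ignore trailing DBG_VALUEs when closing off a scheduling region.
while (RegionEnd != RegionBegin && std::prev(RegionEnd)->isDebugInstr())
  --RegionEnd;
```
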
Stanislav Mekhanoshin 611959f004 [AMDGPU] Fixed v_swap_b32 match
1. Fixed liveness issue with implicit kills.
2. Fixed potential problem with an indirect mov.

Fixes: SWDEV-256848

Differential Revision: https://reviews.llvm.org/D89599
2020-10-21 10:14:24 -07:00
Matt Arsenault 1ed4caff1d AMDGPU: Lower the threshold reported for maximum stack size exceeded
Check the actual maximum supported stack size for a kernel.
2020-10-21 12:06:27 -04:00
Matt Arsenault 53c43431bc AMDGPU: Propagate amdgpu-flat-work-group-size attributes
Fixes being overly conservative with the register counts in called
functions. This should try to do a conservative range merge, but for
now just clone.

Also fix not being able to functionally run the pass standalone.
2020-10-21 12:06:24 -04:00
Florian Hahn 88241ffb56 [Passes] Move ADCE before DSE & LICM.
The adjustment seems to have very little impact on optimizations.
The only binary change with -O3 MultiSource/SPEC2000/SPEC2006 on X86 is
in consumer-typeset, where the size actually decreases by 0.1%, with
no significant changes in the stats.

On its own, it is mildly positive in terms of compile-time, most likely
due to LICM & DSE having to process slightly fewer instructions. DSE and
LICM should also be unlikely to make much new code dead.

http://llvm-compile-time-tracker.com/compare.php?from=df63eedef64d715ce1f31843f7de9c11fe1e597f&to=e3bdfcf94a9eeae6e006d010464f0c1b3550577d&stat=instructions

With DSE & MemorySSA, it gives some nice compile-time improvements, due
to the fact that DSE can re-use the PDT from ADCE, if it does not make
any changes:

http://llvm-compile-time-tracker.com/compare.php?from=15fdd6cd7c24c745df1bb419e72ff66fd138aa7e&to=481f494515fc89cb7caea8d862e40f2c910dc994&stat=instructions

Reviewed By: xbolva00

Differential Revision: https://reviews.llvm.org/D87322
2020-10-21 10:30:56 +01:00
Austin Kerbow ebdcef20ce [AMDGPU] Avoid inserting noops during scheduling
Passes that are run after the post-RA scheduler may insert instructions like
waitcnt which eliminate the need for certain noops. After this patch the
scheduler is still aware of possible latency from hazards but noops will
not be inserted until the dedicated hazard recognizer pass is run.

Depends on D89753.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D89754
2020-10-20 17:11:36 -07:00
Austin Kerbow 37d907899f [HazardRec] Allow inserting multiple wait-states simultaneously
If a target can encode multiple wait-states into a noop, allow emitting such
instructions directly.

Reviewed By: rampitec, dmgreen

Differential Revision: https://reviews.llvm.org/D89753
2020-10-20 17:03:47 -07:00
Tony 1bc7bfffdb [AMDGPU] Optimize waitcnt insertion for flat memory operations
Change waitcnt insertion to check the memory operand tokens to see if
flat memory operations access VMEM, in the same way it already checks
whether they access LDS. This avoids adding waitcnts for counters of
address spaces that are not accessed.

In addition, only generate the pessimistic waitcnt 0 if a flat memory
operation appears to access both VMEM and LDS.

This benefits flat memory operations that explicitly specify the
address space as GLOBAL or LOCAL.

Differential Revision: https://reviews.llvm.org/D89618
2020-10-20 22:55:12 +00:00
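
A hedged C++ sketch of the classification described above; the helper name and the literal address-space numbers (standing in for the AMDGPUAS enumerators) are assumptions.

```cpp
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineMemOperand.h"

// Sketch: treat a FLAT memory operation as possibly touching LDS only if one
// of its memory operands is in the LOCAL address space, or in the generic
// FLAT address space where the target is unknown.
static bool flatMayAccessLDS(const llvm::MachineInstr &MI) {
  for (const llvm::MachineMemOperand *MMO : MI.memoperands()) {
    unsigned AS = MMO->getAddrSpace();
    if (AS == 3 /* LOCAL */ || AS == 0 /* FLAT, unknown */)
      return true;
  }
  return false;
}
```
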
Michael Liao 2a0e4d1c01 [amdgpu] Enhance AMDGPU AA.
- In general, a generic pointer may alias pointers in all other address
  spaces. However, for certain cases enforced by the programming model,
  we may find that a generic pointer won't alias pointers to local objects.
  * When a generic pointer is loaded from the constant address space, it
    could only be a pointer to the GLOBAL or CONSTANT address space.
    Thus, it won't alias pointers to the PRIVATE or LOCAL address
    space.
  * When a generic pointer is passed as a kernel argument, it also could
    only be a pointer to the GLOBAL or CONSTANT address space. Thus, it
    also won't alias pointers to the PRIVATE or LOCAL address space.

Differential Revision: https://reviews.llvm.org/D89525
2020-10-20 09:54:12 -04:00
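
A sketch of the reasoning in the commit above; the helper name and the literal address-space numbers are assumptions, and the in-tree change presumably lives inside the AMDGPU alias-analysis result class rather than a free function.

```cpp
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/Argument.h"
#include "llvm/IR/CallingConv.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"

// Sketch: a generic pointer that is a kernel argument, or that was loaded
// from CONSTANT memory, can only point at GLOBAL or CONSTANT memory, so it
// cannot alias a pointer to PRIVATE or LOCAL memory.
static bool genericMayPointToLocalOrPrivate(const llvm::Value *V) {
  const llvm::Value *Obj = llvm::getUnderlyingObject(V);
  if (const auto *Arg = llvm::dyn_cast<llvm::Argument>(Obj))
    if (Arg->getParent()->getCallingConv() ==
        llvm::CallingConv::AMDGPU_KERNEL)
      return false;
  if (const auto *LI = llvm::dyn_cast<llvm::LoadInst>(Obj))
    if (LI->getPointerAddressSpace() == 4 /* CONSTANT */)
      return false;
  return true; // conservatively assume it may
}
```
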
Carl Ritson be2afbd019 [AMDGPU] Remove fix up operand from SI_ELSE
Remove the immediate operand from SI_ELSE which indicates whether EXEC
has been modified.  Instead always emit code that handles EXEC and
remove unnecessary instructions during pre-RA optimisation.

This facilitates passes (e.g. SIWholeQuadMode) adding exec mask
manipulation after control flow lowering, and pre-control-flow-lowering
passes do not need to be aware of SI_ELSE handling.

Reviewed By: nhaehnle

Differential Revision: https://reviews.llvm.org/D89644
2020-10-20 19:15:21 +09:00
sstefan1 fbfb1c7909 [IR] Make nosync, nofree and willreturn default for intrinsics.
D70365 allows us to make attributes default. This is a follow-up to
actually make nosync, nofree and willreturn default. The approach we
chose, for now, is to opt in to default attributes to avoid introducing
problems to target-specific intrinsics. Intrinsics with default
attributes can be created using the `DefaultAttrsIntrinsic` class.
2020-10-20 11:57:19 +02:00
Piotr Sobczak c872faf6e0 [AMDGPU] Do not generate S_CMP_LG_U64 on gfx7
S_CMP_LG_U64 was added in gfx8 and is guarded by hasScalarCompareEq64().

Rewrite S_CMP_LG_U64 to S_OR_B32 + S_CMP_LG_U32 for targets that
do not support 64-bit scalar compare.

Differential Revision: https://reviews.llvm.org/D89536
2020-10-19 14:44:31 +02:00
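
A sketch of the rewrite described above, under the assumption that it is done as a standard BuildMI-style expansion; the function, register, and iterator names are made up, SCC handling is elided, and the AMDGPU opcode/register-class enumerators are assumed to come from the backend's generated headers.

```cpp
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
// plus the AMDGPU backend headers providing llvm::AMDGPU::S_OR_B32 etc.

// Sketch: compare-64-with-zero on targets without S_CMP_LG_U64 becomes an OR
// of the two 32-bit halves followed by a 32-bit compare against zero.
static void expandScalarCmpLgU64(llvm::MachineBasicBlock &MBB,
                                 llvm::MachineBasicBlock::iterator I,
                                 const llvm::DebugLoc &DL,
                                 const llvm::TargetInstrInfo &TII,
                                 llvm::MachineRegisterInfo &MRI,
                                 llvm::Register SrcLo, llvm::Register SrcHi) {
  llvm::Register Tmp =
      MRI.createVirtualRegister(&llvm::AMDGPU::SReg_32RegClass);
  llvm::BuildMI(MBB, I, DL, TII.get(llvm::AMDGPU::S_OR_B32), Tmp)
      .addReg(SrcLo)
      .addReg(SrcHi);
  llvm::BuildMI(MBB, I, DL, TII.get(llvm::AMDGPU::S_CMP_LG_U32))
      .addReg(Tmp)
      .addImm(0);
}
```
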
Hans Wennborg 0628bea513 Revert "[PM/CC1] Add -f[no-]split-cold-code CC1 option to toggle splitting"
This broke Chromium's PGO build, seemingly because hot-cold-splitting got
turned on unintentionally. See the comment on the code review for a repro etc.

> This patch adds -f[no-]split-cold-code CC1 options to clang. This allows
> the splitting pass to be toggled on/off. The current method of passing
> `-mllvm -hot-cold-split=true` to clang isn't ideal as it may not compose
> correctly (say, with `-O0` or `-Oz`).
>
> To implement the -fsplit-cold-code option, an attribute is applied to
> functions to indicate that they may be considered for splitting. This
> removes some complexity from the old/new PM pipeline builders, and
> behaves as expected when LTO is enabled.
>
> Co-authored by: Saleem Abdulrasool <compnerd@compnerd.org>
> Differential Revision: https://reviews.llvm.org/D57265
> Reviewed By: Aditya Kumar, Vedant Kumar
> Reviewers: Teresa Johnson, Aditya Kumar, Fedor Sergeev, Philip Pfaffe, Vedant Kumar

This reverts commit 273c299d5d.
2020-10-19 12:31:14 +02:00
Austin Kerbow 978fbd8268 [AMDGPU] Run hazard recognizer pass later
If instructions were removed in peephole passes after the hazard recognizer
was run, it is possible that new hazards could be introduced.

Fixes: SWDEV-253090

Reviewed By: rampitec, arsenm

Differential Revision: https://reviews.llvm.org/D89077
2020-10-16 12:15:51 -07:00
Jay Foad 1417abe54c [AMDGPU] Add new llvm.amdgcn.fma.legacy intrinsic
Differential Revision: https://reviews.llvm.org/D89558
2020-10-16 17:10:21 +01:00
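
A possible way to emit the new intrinsic from C++ via IRBuilder; the wrapper name is made up, and the three-float signature is an assumption based on the existing legacy FMUL operation.

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/IntrinsicsAMDGPU.h"

// Sketch: build a call to llvm.amdgcn.fma.legacy (assumed f32 x 3 -> f32).
static llvm::Value *emitFmaLegacy(llvm::IRBuilder<> &B, llvm::Value *A,
                                  llvm::Value *X, llvm::Value *C) {
  return B.CreateIntrinsic(llvm::Intrinsic::amdgcn_fma_legacy, {}, {A, X, C});
}
```
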
Matt Arsenault ce16b6835b AMDGPU: Don't kill super-register with overlapping copy
This would end up killing part of the result super-register, resulting
in a verifier error on a later use of the overlapping registers.  We
could add kills of any non-aliasing registers, but we should be moving
away from relying on kill flags.
2020-10-16 09:34:35 -04:00
Florian Hahn 51ff04567b Recommit "[DSE] Switch to MemorySSA-backed DSE by default."
After investigation by @asbirlea, the issue that caused the
revert appears to be an issue in the original source, rather
than a problem with the compiler.

This patch enables MemorySSA DSE again.

This reverts commit 915310bf14.
2020-10-16 09:02:53 +01:00
Vedant Kumar 273c299d5d [PM/CC1] Add -f[no-]split-cold-code CC1 option to toggle splitting
This patch adds -f[no-]split-cold-code CC1 options to clang. This allows
the splitting pass to be toggled on/off. The current method of passing
`-mllvm -hot-cold-split=true` to clang isn't ideal as it may not compose
correctly (say, with `-O0` or `-Oz`).

To implement the -fsplit-cold-code option, an attribute is applied to
functions to indicate that they may be considered for splitting. This
removes some complexity from the old/new PM pipeline builders, and
behaves as expected when LTO is enabled.

Co-authored by: Saleem Abdulrasool <compnerd@compnerd.org>
Differential Revision: https://reviews.llvm.org/D57265
Reviewed By: Aditya Kumar, Vedant Kumar
Reviewers: Teresa Johnson, Aditya Kumar, Fedor Sergeev, Philip Pfaffe, Vedant Kumar
2020-10-15 23:13:33 +00:00
alex-t 42ed388120 [AMDGPU] SILowerControlFlow::removeMBBifRedundant should not try to change MBB layout if it can fallthrough
removeMBBifRedundant normally tries to preserve predecessor fallthrough when removing a redundant MBB.
It has to change the MBB layout so that the new successor immediately follows the predecessor of the removed MBB.
This is only allowed if the new successor itself has no successor into which it falls through.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D89397
2020-10-15 23:20:54 +03:00
Stanislav Mekhanoshin d1beb95d12 [AMDGPU] gfx1032 target
Differential Revision: https://reviews.llvm.org/D89487
2020-10-15 12:41:18 -07:00
Matt Arsenault 663f16684d AMDGPU: Fix verifier error on killed spill of partially undef register
This does unfortunately end up with extra waitcnts getting inserted
that were avoided before. Ideally we would avoid the spills of these
undef components in the first place.
2020-10-15 09:45:44 -04:00
Carl Ritson b70cb50204 [AMDGPU] Minimize number of s_mov generated by copyPhysReg
Generate the minimal set of s_mov instructions required when
expanding a SGPR copy operation in copyPhysReg.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D89187
2020-10-15 22:35:02 +09:00
Carl Ritson 75357ebc50 [AMDGPU] Pre-commit test for D89187 2020-10-15 15:29:07 +09:00
Konstantin Zhuravlyov 3fdf3b1539 AMDGPU: Update AMDHSA code object version handling
Differential Revision: https://reviews.llvm.org/D89076
2020-10-14 13:04:27 -04:00
Jay Foad b59d8d7c72 [AMDGPU][GlobalISel] Compute known bits for zero-extending loads
Implement computeKnownBitsForTargetInstr for G_AMDGPU_BUFFER_LOAD_UBYTE
and G_AMDGPU_BUFFER_LOAD_USHORT. This allows generic combines to remove
some unnecessary G_ANDs.

Differential Revision: https://reviews.llvm.org/D89316
2020-10-13 16:22:00 +01:00
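
The gist of the hook, greatly simplified; the real computeKnownBitsForTargetInstr override takes the GISel known-bits analysis, the register, and the MRI, so the helper below is only a hedged illustration.

```cpp
#include "llvm/Support/KnownBits.h"

// Sketch: a zero-extending byte/ushort buffer load leaves every bit above the
// memory width known zero, which lets generic combines delete redundant G_ANDs.
static void markHighBitsKnownZero(llvm::KnownBits &Known, unsigned MemBits) {
  Known.Zero.setBitsFrom(MemBits); // bits [MemBits, BitWidth) are zero
}
```
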