Commit Graph

20 Commits

Author SHA1 Message Date
Stanislav Mekhanoshin ca4bf58e4e [AMDGPU] Support unaligned flat scratch in TLI
Adjust SITargetLowering::allowsMisalignedMemoryAccessesImpl for
unaligned flat scratch support. This is mostly needed for GlobalISel.

Differential Revision: https://reviews.llvm.org/D93669
2020-12-22 16:12:31 -08:00
Matt Arsenault d2e52eec51 AMDGPU: Select global saddr mode from SGPR pointer
Use the 64-bit SGPR base with a zero offset, since materializing the zero
takes one fewer instruction than materializing the 64-bit copy.
2020-11-16 11:51:06 -05:00
Jay Foad 830ed64ccd Revert "Revert "[AMDGPU] Reorganize GCN subtarget features for unaligned access""
This reverts commit 8b08fa0103.

The underlying problems were fixed by D90607.
2020-11-11 14:40:14 +00:00
Stanislav Mekhanoshin d5a465866e [AMDGPU] Omit buffer resource with flat scratch.
Differential Revision: https://reviews.llvm.org/D90979
2020-11-09 08:05:20 -08:00
Stanislav Mekhanoshin 038d884a50 [AMDGPU] Use flat scratch instructions where available
The support is disabled by default. So far it covers instruction
selection, spilling, and frame elimination. It also changes the SP
from unswizzled to swizzled, as used by flat scratch instructions,
so it cannot be mixed with MUBUF stack access.

At the very least, the following is still missing:

- GlobalISel support;
- Some frame elimination optimizations in between the vector
  and scalar ALU;
- Flat scratch should eventually allow the frame index to always be
  materialized as an SGPR, but that is not implemented yet and
  frame elimination cannot handle it;
- Unaligned and/or multi-dword flat scratch accesses should work,
  but they are currently legalized as for MUBUF;
- Operand folding cannot yet optimize frame index (FI) operands the
  way it can with MUBUF;
- The value of the SP/FP in the DWARF expression will need to be
  scaled to recover the unswizzled scratch address;

Differential Revision: https://reviews.llvm.org/D89170
2020-10-26 14:40:42 -07:00
Mirko Brkusanin 8b08fa0103 Revert "[AMDGPU] Reorganize GCN subtarget features for unaligned access"
This reverts commit f5cd7ec9f3.

Certain rocPRIM/rocThrust/hipCUB tests were failing because of this change.
2020-09-29 15:33:34 +02:00
Sebastian Neubauer a343b9b032 Revert "[AMDGPU] Insert waitcnt after returning from call"
This reverts commit ca907bfb57.

According to michel.daenzer,
> This completely broke the Mesa radeonsi driver on Navi 14. Xorg +
> xterm come up with major corruption & psychedelic colours.
2020-09-23 17:16:39 +02:00
Sebastian Neubauer ca907bfb57 [AMDGPU] Insert waitcnt after returning from call
When memory operations are still outstanding at function calls, either the
caller or the callee can insert a waitcnt to ensure that all reads are
finished.
Calls need some time to execute, so if the callee inserts the
waitcnt, filling the instruction buffer and waiting for memory will be
interleaved, hiding some latency. This comes at the cost of having a
waitcnt inside functions where it may not be needed because no memory
operations are outstanding.

For function calls, this is already implemented. The same principle
applies to returns: if the caller inserts a waitcnt after the call, the
callee does not have to wait, and the return and the memory operation can
run in parallel.

This commit implements waiting in the caller after returning from a
function call.

Differential Revision: https://reviews.llvm.org/D87674
2020-09-23 12:17:59 +02:00
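
A deliberately simplified, standalone sketch of the caller-side policy described in the commit above (this is not the actual SIInsertWaitcnts code; the Inst type and the s_waitcnt 0 placeholder are illustrative assumptions): after each call, the caller conservatively waits for whatever memory operations the callee may have left outstanding, so the callee's return does not have to.

```cpp
#include <string>
#include <vector>

struct Inst {
  std::string Text;
  bool IsCall = false;
};

// Caller-side policy sketch: after every call, assume the callee may have
// left memory operations outstanding and wait for them here, instead of
// requiring the callee to wait before its return.
std::vector<Inst> insertWaitAfterCalls(const std::vector<Inst> &Body) {
  std::vector<Inst> Out;
  for (const Inst &I : Body) {
    Out.push_back(I);
    if (I.IsCall)
      Out.push_back({"s_waitcnt 0", false}); // conservative: wait for everything
  }
  return Out;
}
```
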
Mirko Brkusanin f5cd7ec9f3 [AMDGPU] Reorganize GCN subtarget features for unaligned access
Features UnalignedBufferAccess and UnalignedDSAccess are now used to determine
whether hardware supports such access.
UnalignedAccessMode should be used to enable them.
hasUnalignedBufferAccessEnabled() and hasUnalignedDSAccessEnabled() can
now be used to quickly check both.

Differential Revision: https://reviews.llvm.org/D84522
2020-08-21 12:26:31 +02:00
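
A minimal standalone model of the feature split described in the commit above (not the real GCNSubtarget; the member and helper names follow the commit message, the rest is an assumption): the capability features record what the hardware can do, the mode feature records whether unaligned access is actually enabled, and the combined helpers check both at once.

```cpp
// Minimal model of the relationship between the subtarget features named in
// the commit message above; everything else about the real class is omitted.
struct SubtargetModel {
  // Hardware capability features.
  bool UnalignedBufferAccess = false;
  bool UnalignedDSAccess = false;
  // Mode feature that actually enables unaligned access.
  bool UnalignedAccessMode = false;

  bool hasUnalignedBufferAccessEnabled() const {
    return UnalignedBufferAccess && UnalignedAccessMode;
  }
  bool hasUnalignedDSAccessEnabled() const {
    return UnalignedDSAccess && UnalignedAccessMode;
  }
};
```
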
Jay Foad 62fd7f767c [MachineScheduler] Fix the TopDepth/BotHeightReduce latency heuristics
tryLatency compares two sched candidates. For the top zone it prefers
the one with lesser depth, but only if that depth is greater than the
total latency of the instructions we've already scheduled -- otherwise
its latency would be hidden and there would be no stall.

Unfortunately it only tests the depth of one of the candidates. This can
lead to situations where the TopDepthReduce heuristic does not kick in,
but a lower priority heuristic chooses the other candidate, whose depth
*is* greater than the already scheduled latency, which causes a stall.

The fix is to apply the heuristic if the depth of *either* candidate is
greater than the already scheduled latency.

All this also applies to the BotHeightReduce heuristic in the bottom
zone.

Differential Revision: https://reviews.llvm.org/D72392
2020-07-17 11:02:13 +01:00
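
The change is easiest to see as a standalone sketch of the comparison (simplified from the real tryLatency; the struct and names here are illustrative, and CurrZoneLatency stands for the latency already covered by the scheduled instructions):

```cpp
// Illustrative stand-in for a scheduling candidate.
struct CandModel {
  unsigned Depth; // critical-path depth of the candidate instruction
};

// Before the fix only TryCand.Depth was compared against the already
// scheduled latency. After the fix the TopDepthReduce heuristic applies if
// *either* candidate's depth exceeds it, so the potential stall is not missed.
bool topDepthReduceApplies(const CandModel &TryCand, const CandModel &Cand,
                           unsigned CurrZoneLatency) {
  return TryCand.Depth > CurrZoneLatency || Cand.Depth > CurrZoneLatency;
}
```

The BotHeightReduce case in the bottom zone is the mirror image, comparing heights instead of depths.
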
Scott Linder 60b1967c39 [AMDGPU] Add Scratch Wave Offset to Scratch Buffer Descriptor in entry functions
Add the scratch wave offset to the scratch buffer descriptor (SRSrc) in
the entry function prologue. This allows us to remove the scratch wave
offset register from the calling convention ABI.

As part of this change, allow the use of an inline constant zero for the
SOffset of MUBUF instructions accessing the stack in entry functions
when a frame pointer is not requested/required. Entry functions with
calls still need to set up the calling convention ABI stack pointer
register, and reference it in order to address arguments of called
functions. The ABI stack pointer register remains unswizzled, but is now
wave-relative instead of queue-relative.

Non-entry functions also use an inline constant zero SOffset for
wave-relative scratch access, but continue to use the stack and frame
pointers as before. When the stack or frame pointer is converted to a
swizzled offset it is now scaled directly, as the scratch wave offset no
longer needs to be subtracted first.

Update llvm/docs/AMDGPUUsage.rst to reflect these changes to the calling
convention.

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75138
2020-03-19 15:35:16 -04:00
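
As a hedged arithmetic sketch of the "scaled directly" point above (it assumes the conversion factor between an unswizzled byte offset and a swizzled per-lane offset is the wavefront size, which is how the backend scales scratch offsets elsewhere; the names are illustrative):

```cpp
// Before this change the queue-relative SP/FP first had to be made
// wave-relative by subtracting the scratch wave offset:
//   swizzled = (sp - scratchWaveOffset) / wavefrontSize
// With a wave-relative SP/FP the scaling can be applied directly:
//   swizzled = sp / wavefrontSize
unsigned toSwizzledOffset(unsigned WaveRelativeByteOffset,
                          unsigned WavefrontSize) {
  return WaveRelativeByteOffset / WavefrontSize;
}
```
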
Matt Arsenault 86f9117d47 AMDGPU: Don't report 2-byte alignment as fast
This is apparently worse than 1-byte alignment. This does not attempt
to decompose 2-byte aligned wide stores, but will stop trying to
produce them.

Also fix a bug in LoadStoreVectorizer which was decreasing the alignment
and vectorizing stack accesses. It was assuming a stack object was an
alloca whose base alignment could be changed, which is not true
if the pointer is derived from a function argument.
2020-02-11 18:35:00 -05:00
Jay Foad b777e551f0 [MachineScheduler] Reduce reordering due to mem op clustering
Summary:
Mem op clustering adds a weak edge in the DAG between two loads or
stores that should be clustered, but the direction of this edge is
pretty arbitrary (it depends on the sort order of MemOpInfo, which
represents the operands of a load or store). This often means that two
loads or stores will get reordered even if they would naturally have
been scheduled together anyway, which leads to test case churn and goes
against the scheduler's "do no harm" philosophy.

The fix makes sure that the direction of the edge always matches the
original code order of the instructions.

Reviewers: atrick, MatzeB, arsenm, rampitec, t.p.northover

Subscribers: jvesely, wdng, nhaehnle, kristof.beyls, hiraditya, javed.absar, arphaman, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D72706
2020-01-14 19:19:02 +00:00
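
A standalone sketch of the fix described in the commit above (the real code lives in the MachineScheduler's mem op clustering mutation; SUnitModel and NodeNum here are simplified stand-ins): orient the weak cluster edge so that the instruction that comes first in the original code order is also the predecessor in the DAG.

```cpp
#include <utility>

// Simplified stand-in for a DAG node; NodeNum follows original code order.
struct SUnitModel {
  unsigned NodeNum;
};

// Make the weak cluster edge run from the earlier instruction to the later
// one, so clustering by itself does not reorder the two memory operations.
void orientClusterEdge(SUnitModel *&Pred, SUnitModel *&Succ) {
  if (Pred->NodeNum > Succ->NodeNum)
    std::swap(Pred, Succ);
  // ... then add the weak edge: Succ weakly depends on Pred ...
}
```
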
Amaury Sechet 05f56a1ddd [AMDGPU] Automatically generate various tests. NFC
llvm-svn: 369787
2019-08-23 17:58:49 +00:00
Christudasan Devadasan b2d24bd540 [AMDGPU] Created a sub-register class for the return address operand in the return instruction.
Function return instruction lowering currently uses the fixed register pair s[30:31] for holding
the return address, but it can be any SGPR pair other than the CSRs. Created an SGPR pair sub-register class
that excludes the CSRs, and used this regclass while lowering the return instruction.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D63924

llvm-svn: 365512
2019-07-09 16:48:42 +00:00
Matt Arsenault d88db6d7fc AMDGPU: Always use s33 for global scratch wave offset
Every called function could possibly need this to calculate the
absolute address of stack objects, and this avoids inserting a copy
around every call site in the kernel. It's also somewhat cleaner to
keep this in a callee-saved SGPR.

llvm-svn: 363990
2019-06-20 21:58:24 +00:00
Matt Arsenault e24b34e9c9 AMDGPU: Undo sub x, c canonicalization for v2i16
This should avoid a regression from D62341.

llvm-svn: 363899
2019-06-19 23:37:43 +00:00
Matt Arsenault e8c03a2511 AMDGPU: Move d16 load matching to preprocess step
When matching half of the build_vector to a load, there could still be
a hidden dependency on the other half of the build_vector that the pattern
wouldn't detect. If there was an additional chain dependency on the
other value, a cycle could be introduced.

I don't think a tablegen pattern is capable of matching the necessary
conditions, so move this into PreprocessISelDAG. Check isPredecessorOf
for the other value to avoid a cycle. isPredecessorOf carries a warning
that it is expensive, so this should probably eventually be moved into an
MI pass that will have more freedom to reorder instructions to help match
this. That is currently complicated by the lack of a computeKnownBits-style
mechanism for the selected function.

llvm-svn: 355731
2019-03-08 20:58:11 +00:00
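
A self-contained sketch of the kind of cycle check described above (a generic operand-graph reachability walk; the real code uses SDNode::isPredecessorOf on the SelectionDAG, and the node type here is an assumption): fold the load into one half of the build_vector only if the value feeding the other half does not transitively depend on the load.

```cpp
#include <unordered_set>
#include <vector>

// Generic stand-in for a DAG node: operand edges point at the values (and
// chain) that an operation reads.
struct NodeModel {
  std::vector<const NodeModel *> Operands;
};

// Is Target reachable from From by walking operand edges?
bool reachableThroughOperands(const NodeModel *From, const NodeModel *Target) {
  std::unordered_set<const NodeModel *> Visited;
  std::vector<const NodeModel *> Worklist{From};
  while (!Worklist.empty()) {
    const NodeModel *Cur = Worklist.back();
    Worklist.pop_back();
    if (Cur == Target)
      return true;
    if (!Visited.insert(Cur).second)
      continue;
    for (const NodeModel *Op : Cur->Operands)
      Worklist.push_back(Op);
  }
  return false;
}

// Folding is only safe if the other half of the build_vector does not
// (transitively) depend on the load; otherwise the fold would form a cycle.
bool safeToFoldD16Load(const NodeModel *Load, const NodeModel *OtherHalf) {
  return !reachableThroughOperands(OtherHalf, Load);
}
```
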
Matt Arsenault 26e76ef0e2 DAG: Don't try to cluster loads with tied inputs
This avoids breaking possible value dependencies when sorting loads by
offset.

AMDGPU has some load instructions that write into the high or low bits
of the destination register and have a tied input for the remaining
bits. These can easily have the same base pointer but form a swizzle, so
the load of the higher address needs to come first. The clustering was
inserting glue that forced the opposite ordering, producing a cycle the
InstrEmitter would assert on. Looking for the dependency between the
loads could be expensive, so just skip any case where this could
happen.

Fixes bug 40936 by reverting r351379, which added a hacky attempt to
fix this by adding chains in this case; I think that was just working
around broken glue before the InstrEmitter. The core of this patch
re-implements the fix for that problem.

llvm-svn: 355728
2019-03-08 20:46:15 +00:00
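
A small standalone sketch of the conservative guard described in the commit above (not the actual SelectionDAG clustering code; the candidate struct is an illustrative stand-in): if either load writes only part of its result and reads the rest through a tied input, do not try to cluster the pair at all.

```cpp
// Illustrative stand-in for a load being considered for clustering.
struct LoadCandModel {
  bool HasTiedInput; // writes hi/lo bits, reads the remaining bits via a tie
  // ... base pointer, offset, etc. ...
};

// Searching for a real value dependency between the two loads could be
// expensive, so conservatively refuse to cluster whenever a tied input is
// present and such a dependency might exist.
bool mayClusterLoads(const LoadCandModel &A, const LoadCandModel &B) {
  return !A.HasTiedInput && !B.HasTiedInput;
}
```
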
Changpeng Fang fe9269f804 AMDGPU: Adjust the chain for loads writing to the HI part of a register.
Summary:
  Loads that write to the HI part of a register should be chained to the op that writes to the LO
  part of the register, to maintain the appropriate order.

Reviewers: rampitec, arsenm

Differential Revision: https://reviews.llvm.org/D56454

llvm-svn: 351379
2019-01-16 21:32:53 +00:00