llvm-project

Commit Graph

Author	SHA1	Message	Date
Matt Arsenault	b63629a58d	GlobalISel: Fix mask computation in lowerInsert This is supposed to be the high bit index, not the width. Use the wrapping form of getBitsSet and avoid the bitflip.	2020-01-29 08:25:36 -08:00
Jay Foad	0d7bd34312	[MachineScheduler] Ignore artificial edges when forming store chains Summary: BaseMemOpClusterMutation::apply forms store chains by looking for control (i.e. non-data) dependencies from one mem op to another. In the test case, clusterNeighboringMemOps successfully clusters the loads, and then adds artificial edges to the loads' successors as described in the comment: // Copy successor edges from SUa to SUb. Interleaving computation // dependent on SUa can prevent load combining due to register reuse. The effect of this is that data dependencies from one load to a store are copied as artificial dependencies from a different load to the same store. Then when BaseMemOpClusterMutation::apply looks at the stores, it finds that some of them have a control dependency on a previous load, which breaks the chains and means that the stores are not all considered part of the same chain and won't all be clustered. The fix is to only consider non-artificial control dependencies when forming chains. Subscribers: MatzeB, jvesely, nhaehnle, hiraditya, javed.absar, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D71717	2020-01-29 16:23:01 +00:00
Matt Arsenault	96352e0a1b	AMDGPU/GlobalISel: Handle LDS with relocations case	2020-01-29 08:18:55 -08:00
Connor Abbott	87d98c1495	AMDGPU: Fix handling of infinite loops in fragment shaders Summary: Due to the fact that kill is just a normal intrinsic, even though it's supposed to terminate the thread, we can end up with provably infinite loops that are actually supposed to end successfully. The AMDGPUUnifyDivergentExitNodes pass breaks up these loops, but because there's no obvious place to make the loop branch to, it just makes it return immediately, which skips the exports that are supposed to happen at the end and hangs the GPU if all the threads end up being killed. While it would be nice if the fact that kill terminates the thread were modeled in the IR, I think that the structurizer as-is would make a mess if we did that when the kill is inside control flow. For now, we just add a null export at the end to make sure that it always exports something, which fixes the immediate problem without penalizing the more common case. This means that we sometimes do two "done" exports when only some of the threads enter the discard loop, but from tests the hardware seems ok with that. This fixes dEQP-VK.graphicsfuzz.while-inside-switch with radv. Reviewers: arsenm, nhaehnle Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D70781	2020-01-29 17:13:25 +01:00
Matt Arsenault	94e8ef4d4c	AMDGPU/GlobalISel: Look through copies for source modifiers When all VOP instructions are legalized to VGPRs, any SGPR source modifiers will have a copy in the way.	2020-01-29 08:08:13 -08:00
Matt Arsenault	752e2e245a	AMDGPU/GlobalISel: Rewrite fadd select tests Convert to the style most others use with one test instruction per function, and use an implicit use to ensure the result register class is constrained. Change-Id: I6109148b0e3c80aa5535796a37abca583c19a936	2020-01-29 07:49:38 -08:00
Connor Abbott	08b205bb48	Revert "AMDGPU: Fix handling of infinite loops in fragment shaders" This reverts commit `0994c485e6`.	2020-01-29 16:14:52 +01:00
Connor Abbott	13ab22ab22	Revert "AMDGPU: Fix AMDGPUUnifyDivergentExitNodes with no normal returns" This reverts commit `323bfde20c`.	2020-01-29 16:14:49 +01:00
Matt Arsenault	02adfb5155	AMDGPU/GlobalISel: Manually select scalar f64 G_FNEG This should be no problem to support with a pattern, but it turns out there are just too many yaks to shave. The main problem is in the DAG emitter, which I have no desire to sink effort into fixing. If we had a bit to disable patterns in the DAG importer, fixing the GlobalISelEmitter is more manageable.	2020-01-29 06:49:16 -08:00
Matt Arsenault	c5c1bb3374	GlobalISel: Lower G_WRITE_REGISTER	2020-01-29 06:48:24 -08:00
Connor Abbott	323bfde20c	AMDGPU: Fix AMDGPUUnifyDivergentExitNodes with no normal returns Summary: The code was assuming in a few places that if there was only one exit from the function that it was a normal return, which is invalid. It could be an infinite loop, in which case we still need to insert the usual fake edge so that the null export happens. This fixes shaders that end with an infinite loop that discards. Reviewers: arsenm, nhaehnle, critson Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D71192	2020-01-29 15:08:46 +01:00
Connor Abbott	0994c485e6	AMDGPU: Fix handling of infinite loops in fragment shaders Summary: Due to the fact that kill is just a normal intrinsic, even though it's supposed to terminate the thread, we can end up with provably infinite loops that are actually supposed to end successfully. The AMDGPUUnifyDivergentExitNodes pass breaks up these loops, but because there's no obvious place to make the loop branch to, it just makes it return immediately, which skips the exports that are supposed to happen at the end and hangs the GPU if all the threads end up being killed. While it would be nice if the fact that kill terminates the thread were modeled in the IR, I think that the structurizer as-is would make a mess if we did that when the kill is inside control flow. For now, we just add a null export at the end to make sure that it always exports something, which fixes the immediate problem without penalizing the more common case. This means that we sometimes do two "done" exports when only some of the threads enter the discard loop, but from tests the hardware seems ok with that. This fixes dEQP-VK.graphicsfuzz.while-inside-switch with radv. Reviewers: arsenm, nhaehnle Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D70781	2020-01-29 15:08:46 +01:00
Jay Foad	4a331beadc	[AMDGPU] Fix vccz after v_readlane/v_readfirstlane to vcc_lo/hi Summary: Up to gfx9, writes to vcc_lo and vcc_hi by instructions like v_readlane and v_readfirstlane do not update vccz to reflect the new value of vcc. Fix it by reusing part of the existing vccz bug handling code, which inserts an "s_mov_b64 vcc, vcc" instruction to restore vccz just before an instruction that needs the correct value. Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D69661	2020-01-28 10:52:17 +00:00
Guillaume Chatelet	d9bff3be99	Update tests for @llvm.memcpy.inline intrinsics	2020-01-28 10:32:43 +01:00
Matt Arsenault	d2a9739274	AMDGPU/GlobalISel: Eliminate SelectVOP3Mods_f32 Trivial type predicates should be moved into the tablegen pattern itself, and not checked inside complex patterns. This eliminates a redundant complex pattern, and fixes select source modifiers for GlobalISel. I have further patches which fully handle select in tablegen and remove all of the C++ selection, although it requires the ugliness to support the entire range of legal register types.	2020-01-27 17:53:54 -05:00
Matt Arsenault	c3075e6171	AMDGPU/GlobalISel: Select buffer atomics The cmpswap handling is incomplete and fails to select.	2020-01-27 15:16:44 -05:00
Matt Arsenault	a69c26a927	AMDGPU/GlobalISel: Select llvm.amdgcn.struct.buffer.store[.format]	2020-01-27 15:00:21 -05:00
Matt Arsenault	533d650e94	AMDGPU/GlobalISel: Move llvm.amdgcn.raw.buffer.store handling Treat this the same way as loads. There's less value to the intermediate nodes, but it's good to be consistent.	2020-01-27 14:59:30 -05:00
Matt Arsenault	75d66f8434	AMDGPU/GlobalISel: Select llvm.amdgcn.struct.tbuffer.load	2020-01-27 14:42:04 -05:00
Matt Arsenault	09ed0e44d9	AMDGPU/GlobalISel: Select llvm.amdgcn.raw.tbuffer.load	2020-01-27 13:40:37 -05:00
Stanislav Mekhanoshin	53eb0f8c07	[AMDGPU] Attempt to reschedule without clustering We want to have more load/store clustering, but we also want to maintain low register pressure, and these are opposing goals. Allow the scheduler to reschedule regions without mutations applied if we hit a register limit. Differential Revision: https://reviews.llvm.org/D73386	2020-01-27 10:27:16 -08:00
Matt Arsenault	97711228fd	AMDGPU/GlobalISel: Select llvm.amdgcn.struct.buffer.load.format	2020-01-27 13:23:35 -05:00
Matt Arsenault	ce7ca2caf2	AMDGPU/GlobalISel: Select llvm.amdgcn.struct.buffer.load	2020-01-27 13:05:55 -05:00
Matt Arsenault	198624c39d	AMDGPU/GlobalISel: Select llvm.amdgcn.raw.buffer.load.format	2020-01-27 13:02:19 -05:00
Matt Arsenault	fc90222a91	AMDGPU/GlobalISel: Select llvm.amdgcn.raw.buffer.load Use intermediate instructions, unlike with buffer stores. This is necessary because of the need to have an internal way to distinguish between signed and unsigned extloads. This introduces some duplication and near duplication with the buffer store selection path. The store handling should maybe be moved into legalization to match and eliminate the duplication.	2020-01-27 12:49:23 -05:00
Matt Arsenault	e60d658260	AMDGPU/GlobalISel: Handle VOP3NoMods	2020-01-27 09:03:44 -08:00
Matt Arsenault	d309b4ebe4	AMDGPU/GlobalISel: Add baseline tests for fma/fmad selection	2020-01-27 09:02:13 -08:00
Matt Arsenault	bef27175c7	AMDGPU: Fix not using f16 fsin/fcos I noticed this because this accidentally started working for GlobalISel.	2020-01-27 08:59:59 -08:00
Jay Foad	e37997cc0d	[AMDGPU] Simplify test and extend to gfx9 and gfx10 Summary: This is in preparation for adding more test cases for D69661 and other bug fixes in the same area. Reviewers: tpr, dstuttard, critson, nhaehnle, arsenm Subscribers: kzhuravl, jvesely, wdng, yaxunl, t-tye, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D70708	2020-01-27 16:56:40 +00:00
Matt Arsenault	a1d33ce73a	AMDGPU/GlobalISel: Custom legalize v2s16 G_SHUFFLE_VECTOR Try to keep simple v2s16 cases as-is. This will more naturally map to how the VOP3P op_sel modifiers work compared to the expansion involving bitcasts and bitshifts. This could maybe try harder with wider source vector types, although that could be handled with a pre-legalize combine.	2020-01-27 08:28:05 -08:00
Matt Arsenault	4e69df091d	Revert "AMDGPU: Temporary drop s_mul_hi_i/u32 patterns" This reverts commit `fe23ed2c68`. It was never really clear this was responsible for the performance regressions that caused this to be reverted. It's been a long time, and we need to have scalar patterns for this to get GlobalISel working.	2020-01-27 08:07:21 -08:00
Matt Arsenault	bc3d900fa5	AMDGPU/GlobalISel: Fix not using global atomics on gfx9+ For some reason the flat/global atomics end up in the generated matcher table in a different order from SelectionDAG. Use AddedComplexity to prefer checking for global atomics first.	2020-01-27 07:42:42 -08:00
Matt Arsenault	ac0b9b4ccf	AMDGPU/GlobalISel: Select more MUBUF global addressing modes The handling of the high bits of the resource descriptor seems weird to me, where the 3rd dword changes based on the instruction.	2020-01-27 07:28:36 -08:00
Matt Arsenault	fdaad485e6	AMDGPU/GlobalISel: Initial selection of MUBUF addr64 load/store Fixes the main reason for compile failures on SI, but doesn't really try to use the addressing modes yet.	2020-01-27 07:13:56 -08:00
Matt Arsenault	2214bc81d0	AMDGPU: Allow i16 shader arguments Not allowing this just creates unnecessary complications when writing simple tests.	2020-01-27 06:55:32 -08:00
Matt Arsenault	2a160ba5b0	GlobalISel: Reimplement widenScalar for G_UNMERGE_VALUES results Only use shifts if the requested type exactly matches the source type, and create sub-unmerges otherwise.	2020-01-27 06:18:26 -08:00
Matt Arsenault	06d9230fef	GlobalISel: Translate vector GEPs	2020-01-27 05:35:05 -08:00
Simon Pilgrim	c8de7c8f50	[TargetLowering] SimplifyDemandedBits - Remove ashr if all our demandedbits already match the sign bit Differential Revision: https://reviews.llvm.org/D73412	2020-01-25 17:36:46 +00:00
Matt Arsenault	fe9765762c	AMDGPU: Generate test checks	2020-01-24 23:25:57 -05:00
Tom Stellard	86c944d790	AMDGPU/SILoadStoreOptimizer: Improve merging of out of order offsets Summary: This improves merging of sequences like: store a, ptr + 4 store b, ptr + 8 store c, ptr + 12 store d, ptr + 16 store e, ptr + 20 store f, ptr Prior to this patch the basic block was scanned in order to find instructions to merge and the above sequence would be transformed to: store4 <a, b, c, d>, ptr + 4 store e, ptr + 20 store f, ptr With this change, we now sort all the candidate merge instructions by their offset, so instructions are visited in offset order rather than in the order they appear in the basic block. We now transform this sequence into: store4 <f, a, b, c>, ptr store2 <d, e>, ptr + 16 Another benefit of this change is that since we have sorted the mergeable lists by offset, we can easily check if an instruction is mergeable by checking the offset of the instruction that comes before or after it in the sorted list. Once we determine an instruction is not mergeable we can remove it from the list and avoid having to do the more expensive mergeability checks. Reviewers: arsenm, pendingchaos, rampitec, nhaehnle, vpykhtin Reviewed By: arsenm, nhaehnle Subscribers: kerbowa, merge_guards_bot, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D65966	2020-01-24 19:45:56 -08:00
Matt Arsenault	3b93945587	AMDGPU/GlobalISel: Select wqm, softwqm and wwm intrinsics	2020-01-24 13:06:44 -08:00
Matt Arsenault	87c46a3129	AMDGPU: Don't error on ds.ordered intrinsic in function These should be assumed to be called from a compute context. Also don't use a 2 entry switch over constants.	2020-01-24 13:06:44 -08:00
Matt Arsenault	4fdae24733	AMDGPU/GlobalISel: Add selection tests for G_ATOMICRMW_ADD	2020-01-24 12:15:09 -08:00
Stanislav Mekhanoshin	555d8f4ef5	[AMDGPU] Bundle loads before post-RA scheduler We are relying on artificial DAG edges inserted by the MemOpClusterMutation to keep loads and stores together in the post-RA scheduler. This does not work all the time since it allows a completely independent instruction to be scheduled in the middle of the cluster. Removed the DAG mutation and added a pass to bundle already clustered instructions. These bundles are unpacked before the memory legalizer because it does not work with bundles but also because it allows waitcounts to be inserted in the middle of a store cluster. Removing artificial edges also allows a more relaxed scheduling. Differential Revision: https://reviews.llvm.org/D72737	2020-01-24 11:33:38 -08:00
Stanislav Mekhanoshin	44b865fa7f	[AMDGPU] Allow narrowing multi-dword loads Currently the BE allows only a little load narrowing because of the fear it will produce sub-dword ext loads. However, we can always allow narrowing if we are shrinking one multi-dword load to another multi-dword load. In particular we were unable to reduce s_load_dwordx8 into s_load_dwordx4 if an identity shuffle was used to extract the low 4 dwords. Differential Revision: https://reviews.llvm.org/D73133	2020-01-24 11:03:41 -08:00
Stanislav Mekhanoshin	7a94d4f4ee	Allow combining of extract_subvector to extract element Differential Revision: https://reviews.llvm.org/D73132	2020-01-24 10:50:26 -08:00
Changpeng Fang	2531535984	AMDGPU: Implement FDIV optimizations in AMDGPUCodeGenPrepare Summary: RCP has an accuracy limit. If the FDIV's fpmath requires high accuracy, rcp may not meet the requirement. However, in DAG lowering, fpmath information gets lost, and thus we may generate either inaccurate rcp related computation or slow code for fdiv. This patch implements fdiv optimizations in AMDGPUCodeGenPrepare, which knows the exact !fpmath. FastUnsafeRcpLegal: We determine whether it is legal to use rcp based on unsafe-fp-math, fast math flags, denormals and fpmath accuracy request. RCP Optimizations: 1/x -> rcp(x) when fast unsafe rcp is legal or fpmath >= 2.5ULP with denormals flushed. a/b -> a*rcp(b) when fast unsafe rcp is legal. Use fdiv.fast: a/b -> fdiv.fast(a, b) when RCP optimization is not performed and fpmath >= 2.5ULP with denormals flushed. 1/x -> fdiv.fast(1,x) when RCP optimization is not performed and fpmath >= 2.5ULP with denormals. Reviewers: arsenm Differential Revision: https://reviews.llvm.org/D71293	2020-01-23 16:57:43 -08:00
Matt Arsenault	86e5b56a7c	AMDGPU/GlobalISel: Fix RegBankSelect for llvm.amdgcn.exp.compr This wasn't updated for the immarg handling change. We really need a verifier for this.	2020-01-23 13:30:46 -08:00
Matt Arsenault	618fa77ae4	AMDGPU/GlobalISel: Select V_ADD3_U32/V_XOR3_B32 The other 3-op patterns should also be theoretically handled, but currently there's a bug in the inferred pattern complexity. I'm not sure what the error handling strategy should be for potential constant bus violations. I think the correct strategy is to never produce mixed SGPR and VGPR operands in a typical VOP instruction, which will trivially avoid them. However, it's possible to still have hand written MIR (or erroneously transformed code) with these operands. When these fold, the restriction will be violated. We currently don't have any verifiers for reg bank legality. For now, just ignore the restriction. It might be worth triggering a DAG fallback on verifier error.	2020-01-23 12:04:20 -05:00
Matt Arsenault	dfec702290	AMDGPU: Check for other uses when looking through casted select Fixes mesa regression on ext_transform_feedback-max-varyings	2020-01-23 11:31:24 -05:00
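
The offset-sorting idea described in commit 86c944d790 can be illustrated with a small model. This is only a sketch in Python, not the SILoadStoreOptimizer code: it assumes dword-sized stores, a maximum merged width of four dwords, and a simple greedy grouping of neighbouring entries. The names `merge_contiguous`, `DWORD_BYTES` and `MAX_WIDTH` are made up for the illustration.

```python
from typing import List, Tuple

DWORD_BYTES = 4  # assumed size of each store in bytes
MAX_WIDTH = 4    # assumed widest merged store (store4)

Store = Tuple[str, int]        # (value name, byte offset from ptr)
Group = Tuple[int, List[str]]  # (starting offset, merged values)


def merge_contiguous(stores: List[Store]) -> List[Group]:
    """Greedily merge each store into the previous group when its offset
    directly follows that group and the group is not yet full."""
    groups: List[Group] = []
    for value, offset in stores:
        if groups:
            start, values = groups[-1]
            follows = offset == start + DWORD_BYTES * len(values)
            if len(values) < MAX_WIDTH and follows:
                values.append(value)
                continue
        groups.append((offset, [value]))
    return groups


# The sequence from the commit message, in basic-block order.
program_order = [("a", 4), ("b", 8), ("c", 12), ("d", 16), ("e", 20), ("f", 0)]

# Visiting in block order leaves e and f unmerged.
in_order = merge_contiguous(program_order)

# Visiting in offset order lets f join the first group.
by_offset = merge_contiguous(sorted(program_order, key=lambda s: s[1]))
```

In this model, block order yields one four-wide group plus two single stores, while offset order yields a four-wide group starting at `ptr` and a two-wide group at `ptr + 16`, matching the before and after sequences quoted in the commit message.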


3075 Commits
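
The FDIV rules listed in commit 2531535984 read naturally as a decision table. The sketch below restates them in Python purely for illustration; it is not the AMDGPUCodeGenPrepare implementation. The function name, its parameters and the returned strings are hypothetical, `fast_unsafe_rcp_legal` is taken as an input rather than derived from flags, and the last rule in the message ("with denormals") is assumed to mean that denormals are not flushed.

```python
def choose_fdiv_lowering(numerator_is_one: bool,
                         fast_unsafe_rcp_legal: bool,
                         fpmath_ulp: float,
                         denormals_flushed: bool) -> str:
    """Return a label for the lowering the commit message describes.

    fpmath_ulp is the accuracy allowed by !fpmath metadata; pass 0.0
    when no relaxed accuracy is available (an assumption of this sketch).
    """
    relaxed = fpmath_ulp >= 2.5

    if numerator_is_one:
        # 1/x -> rcp(x) when fast unsafe rcp is legal, or when
        # fpmath >= 2.5ULP with denormals flushed.
        if fast_unsafe_rcp_legal or (relaxed and denormals_flushed):
            return "rcp(x)"
        # 1/x -> fdiv.fast(1, x) when the RCP optimization was not
        # performed and fpmath >= 2.5ULP (assumed: denormals enabled).
        if relaxed:
            return "fdiv.fast(1, x)"
        return "fdiv(1, x)"

    # a/b -> a * rcp(b) when fast unsafe rcp is legal.
    if fast_unsafe_rcp_legal:
        return "a * rcp(b)"
    # a/b -> fdiv.fast(a, b) when the RCP optimization was not performed
    # and fpmath >= 2.5ULP with denormals flushed.
    if relaxed and denormals_flushed:
        return "fdiv.fast(a, b)"
    return "fdiv(a, b)"
```

For example, `choose_fdiv_lowering(False, False, 2.5, True)` selects the `fdiv.fast(a, b)` case, while the same call with `fpmath_ulp=0.0` falls through to a plain division.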