llvm-project

Commit Graph

Author	SHA1	Message	Date
Eric Christopher	dbd53a1f0c	Temporarily Revert "RegAllocFast: Rewrite and improve" as it's breaking a few tests in the lldb test suite. Bot: http://lab.llvm.org:8011/builders/lldb-arm-ubuntu/builds/4226/steps/test/logs/stdio This reverts commit `c8757ff3aa`.	2020-09-18 18:11:21 -07:00
Matt Arsenault	c8757ff3aa	RegAllocFast: Rewrite and improve This rewrites big parts of the fast register allocator. The basic strategy of doing block-local allocation hasn't changed but I tweaked several details: Track register state on register units instead of physical registers. This simplifies and speeds up handling of register aliases. Process basic blocks in reverse order: Definitions are known to end register livetimes when walking backwards (contrary when walking forward then uses may or may not be a kill so we need heuristics). Check register mask operands (calls) instead of conservatively assuming everything is clobbered. Enhance heuristics to detect killing uses: In case of a small number of defs/uses check if they are all in the same basic block and if so the last one is a killing use. Enhance heuristic for copy-coalescing through hinting: We check the first k defs of a register for COPYs rather than relying on there just being a single definition. When testing this on the full llvm test-suite including SPEC externals I measured: average 5.1% reduction in code size for X86, 4.9% reduction in code on aarch64. (ranging between 0% and 20% depending on the test) 0.5% faster compiletime (some analysis suggests the pass is slightly slower than before, but we more than make up for it because later passes are faster with the reduced instruction count) Also adds a few testcases that were broken without this patch, in particular bug 47278. Patch mostly by Matthias Braun	2020-09-18 14:05:18 -04:00
Matt Arsenault	870fd53e4f	Reapply "RegAllocFast: Record internal state based on register units" The regressions this caused should be fixed when https://reviews.llvm.org/D52010 is applied. This reverts commit `a21387c654`.	2020-09-18 14:05:18 -04:00
Matt Arsenault	0576f436e5	AMDGPU: Don't sometimes allow instructions before lowered si_end_cf Since `6524a7a2b9`, this would sometimes not emit the or to exec at the beginning of the block, where it really has to be. If there is an instruction that defines one of the source operands, split the block and turn the si_end_cf into a terminator. This avoids regressions when regalloc fast is switched to inserting reloads at the beginning of the block, instead of spills at the end of the block. In a future change, this should always split the block.	2020-09-18 13:43:01 -04:00
Matt Arsenault	27df165270	Revert "[amdgpu] Lower SGPR-to-VGPR copy in the final phase of ISel." This reverts commit `c3492a1aa1`. I think this is the wrong strategy and wrong place to do this transform anyway. Also reverts follow up commit `7d593d0d69`.	2020-09-18 09:48:33 -04:00
Mirko Brkusanin	ae36c02ad0	[AMDGPU] Set DS alignment requirements to be more strict Alignment requirements for ds_read/write_b96/b128 for gfx9 and onward are now the same as for other GCN subtargets. This way we can avoid any unintentional use of these instructions on systems that do not support dword alignment and instead require natural alignment. This also makes 'SH_MEM_CONFIG.alignment_mode == STRICT' the default. Differential Revision: https://reviews.llvm.org/D87821	2020-09-18 15:26:24 +02:00
Florian Hahn	9d172c8e9c	Recommit "[DSE] Switch to MemorySSA-backed DSE by default." This switches to using DSE + MemorySSA by default again, after fixing the issues reported after the first commit. Notable fixes `fc82006331`, `a0017c2bc2`. This reverts commit `3a59628f3c`.	2020-09-18 11:05:00 +01:00
Michael Liao	c3492a1aa1	[amdgpu] Lower SGPR-to-VGPR copy in the final phase of ISel. - Need to lower COPY from SGPR to VGPR to a real instruction as the standard COPY is used where the source and destination are from the same register bank so that we potentially coalesc them together and save one COPY. Considering that, backend optimizations, such as CSE, won't handle them. However, the copy from SGPR to VGPR always needs materializing to a native instruction, it should be lowered into a real one before other backend optimizations. Differential Revision: https://reviews.llvm.org/D87556	2020-09-17 11:04:17 -04:00
alex-t	0efbb70b71	[AMDGPU] should expand ROTL i16 to shifts. Instruction combining pass turns library rotl implementation to llvm.fshl.i16. In the selection dag the intrinsic is turned to ISD::ROTL node that cannot be selected. Need to expand it to shifts again. Reviewed By: rampitec, arsenm Differential Revision: https://reviews.llvm.org/D87618	2020-09-17 17:34:33 +03:00
Jay Foad	6f6d389da5	[SplitKit] Only copy live lanes When splitting a live interval with subranges, only insert copies for the lanes that are live at the point of the split. This avoids some unnecessary copies and fixes a problem where copying dead lanes was generating MIR that failed verification. The test case for this is test/CodeGen/AMDGPU/splitkit-copy-live-lanes.mir. Without this fix, some earlier live range splitting would create %430: %430 [256r,848r:0)[848r,2584r:1) 0@256r 1@848r L0000000000000003 [848r,2584r:0) 0@848r L0000000000000030 [256r,2584r:0) 0@256r weight:1.480938e-03 ... 256B undef %430.sub2:vreg_128 = V_LSHRREV_B32_e32 16, %20.sub1:vreg_128, implicit $exec ... 848B %430.sub0:vreg_128 = V_AND_B32_e32 %92:sreg_32, %20.sub1:vreg_128, implicit $exec ... 2584B %431:vreg_128 = COPY %430:vreg_128 Then RAGreedy::tryLocalSplit would split %430 into %432 and %433 just before 848B giving: %432 [256r,844r:0) 0@256r L0000000000000030 [256r,844r:0) 0@256r weight:3.066802e-03 %433 [844r,848r:0)[848r,2584r:1) 0@844r 1@848r L0000000000000030 [844r,2584r:0) 0@844r L0000000000000003 [844r,844d:0)[848r,2584r:1) 0@844r 1@848r weight:2.831776e-03 ... 256B undef %432.sub2:vreg_128 = V_LSHRREV_B32_e32 16, %20.sub1:vreg_128, implicit $exec ... 844B undef %433.sub0:vreg_128 = COPY %432.sub0:vreg_128 { internal %433.sub2:vreg_128 = COPY %432.sub2:vreg_128 848B } %433.sub0:vreg_128 = V_AND_B32_e32 %92:sreg_32, %20.sub1:vreg_128, implicit $exec ... 2584B %431:vreg_128 = COPY %433:vreg_128 Note that the copy from %432 to %433 at 844B is a curious bundle-without-a-BUNDLE-instruction that SplitKit creates deliberately, and it includes a copy of .sub0 which is not live at this point, and that causes it to fail verification: * Bad machine code: No live subrange at use * - function: zextload_global_v64i16_to_v64i64 - basic block: %bb.0 (0x7faed48) [0B;2848B) - instruction: 844B undef %433.sub0:vreg_128 = COPY %432.sub0:vreg_128 - operand 1: %432.sub0:vreg_128 - interval: %432 [256r,844r:0) 0@256r L0000000000000030 [256r,844r:0) 0@256r weight:3.066802e-03 - at: 844B Using real bundles with a BUNDLE instruction might also fix this problem, but the current fix is less invasive and also avoids some unnecessary copies. https://bugs.llvm.org/show_bug.cgi?id=47492 Differential Revision: https://reviews.llvm.org/D87757	2020-09-17 09:26:11 +01:00
Jay Foad	d49707cf4b	[AMDGPU] Generate test checks for splitkit-copy-bundle.mir This is a pre-commit for D87757 "[SplitKit] Only copy live lanes".	2020-09-17 09:26:09 +01:00
Stanislav Mekhanoshin	91f503c3af	[AMDGPU] gfx1030 RT support Differential Revision: https://reviews.llvm.org/D87782	2020-09-16 11:40:58 -07:00
Matt Arsenault	88bdcbbf1a	GlobalISel: Lift store value widening restriction This doesn't change the memory size and doesn't need to worry about non-power-of-2 sizes.	2020-09-16 14:25:07 -04:00
Matt Arsenault	738c73a454	RegAllocFast: Make self loop live-out heuristic more aggressive This currently has no impact on code, but prevents sizeable code size regressions after D52010. This prevents spilling and reloading all values inside blocks that loop back. Add a baseline test which would regress without this patch.	2020-09-16 13:12:38 -04:00
Matt Arsenault	367248956e	AMDGPU: Clear offset register when using local stack area eliminateFrameIndex won't fix up the offset register when the direct frame index reference is moved to a separate move instruction. Switch the offset to a base 0 (which it probably should be to begin with).	2020-09-16 12:56:40 -04:00
Matt Arsenault	deae5e567d	AMDGPU: Add baseline test for incorrect SP access	2020-09-16 12:56:40 -04:00
Dmitry Preobrazhensky	06d058afec	[AMDGPU] Corrected directive to use for ELF weak refs WeakRefDirective should specify a directive to declare "a global as being a weak undefined symbol". The directive used by AMDGPU was incorrect - ".weakref" was intended for other purposes. The correct directive is ".weak" and it is already defined as default for ELF. So the redefinition was removed. Reviewers: arsenm, rampitec Differential Revision: https://reviews.llvm.org/D87762	2020-09-16 18:51:26 +03:00
Matt Arsenault	71131db689	AMDGPU: Improve <2 x i24> arguments and return value handling This was asserting for GlobalISel. For SelectionDAG, this was passing this on the stack. Instead, scalarize this as if it were a 32-bit vector.	2020-09-16 11:21:56 -04:00
Sebastian Neubauer	833b3b0d3a	[AMDGPU] Add v3f16/v3i16 support to SDag Fix lowering and instruction selection for v3x16 types and enable InstCombine to emit them. This patch only implements it for the selection dag. GlobalISel tests in GlobalISel/llvm.amdgcn.image.load.1d.d16.ll and GlobalISel/llvm.amdgcn.image.store.2d.d16.ll still don't work. Differential Revision: https://reviews.llvm.org/D84420	2020-09-16 17:20:27 +02:00
Jay Foad	90777e2924	[AMDGPU] Enable scheduling around FP MODE-setting instructions Pre-gfx10 all MODE-setting instructions were S_SETREG_B32 which is marked as having unmodeled side effects, which makes the machine scheduler treat it as a barrier. Now that we have proper implicit $mode operands we can use a no-side-effects S_SETREG_B32_mode pseudo instead for setregs that only touch the FP MODE bits, to give the scheduler more freedom. Differential Revision: https://reviews.llvm.org/D87446	2020-09-16 16:10:47 +01:00
Jay Foad	54bb9e8649	[AMDGPU] Add -show-mc-encoding to setreg tests This is a pre-commit for D87446 "[AMDGPU] Enable scheduling around FP MODE-setting instructions"	2020-09-16 16:09:47 +01:00
Alina Sbirlea	3b3ca5c989	Fix test after D86156.	2020-09-15 19:13:39 -07:00
Volkan Keles	a4e35cc2ec	GlobalISel: Add combines for G_TRUNC https://reviews.llvm.org/D87050	2020-09-15 15:50:34 -07:00
Stanislav Mekhanoshin	277de43d88	[AMDGPU] Unify intrinsic ret/nortn interface We have a single noret intrinsic an a lot of special handling around it. Declare it just as any other but do not define rtn instructions itself instead. Differential Revision: https://reviews.llvm.org/D87719	2020-09-15 15:26:42 -07:00
Florian Hahn	3a59628f3c	Revert "[DSE] Switch to MemorySSA-backed DSE by default." This reverts commit `fb109c42d9`. Temporarily revert due to a mis-compile pointed out at D87163.	2020-09-15 18:07:56 +01:00
Hans Wennborg	a21387c654	Revert "RegAllocFast: Record internal state based on register units" This seems to have caused incorrect register allocation in some cases, breaking tests in the Zig standard library (PR47278). As discussed on the bug, revert back to green for now. > Record internal state based on register units. This is often more > efficient as there are typically fewer register units to update > compared to iterating over all the aliases of a register. > > Original patch by Matthias Braun, but I've been rebasing and fixing it > for almost 2 years and fixed a few bugs causing intermediate failures > to make this patch independent of the changes in > https://reviews.llvm.org/D52010. This reverts commit `66251f7e1d`, and follow-ups `931a68f26b` and `0671a4c508`. It also adjust some test expectations.	2020-09-15 13:25:41 +02:00
Petar Avramovic	9b4fa85434	GlobalISel/IRTranslator resetTargetOptions based on function attributes Update TargetMachine.Options with function attributes before we start to generate MIR instructions. This allows access to correct function attributes via TargetMachine.Options (it used to access attributes of the function that was translated first). This affects some existing tests with "no-nans-fp-math" attribute. Follow-up on D87456. Differential Revision: https://reviews.llvm.org/D87511	2020-09-15 10:26:09 +02:00
Quentin Colombet	b3afad0463	[GlobalISel] Add a `X, Y = G_UNMERGE(G_ZEXT Z)` -> X = G_ZEXT Z; Y = 0 combine Add a combiner helper to transform unmerge of zext into one zext and a constant 0 Differential Revision: https://reviews.llvm.org/D87427	2020-09-14 17:27:23 -07:00
Quentin Colombet	d2321129bd	[GlobalISel] Add `X,Y<dead> = G_UNMERGE Z` -> X = G_TRUNC Z Add a combiner helper that replaces G_UNMERGE where all the destination lanes are dead except the first one with a G_TRUNC. Differential Revision: https://reviews.llvm.org/D87174	2020-09-14 17:27:23 -07:00
Quentin Colombet	a36278c2f8	[GlobalISel] Add G_UNMERGE(Cst) -> Cst1, Cst2, ... combine Add a combiner helper that replaces G_UNMERGE of big constants into direct use of smaller constants. Differential Revision: https://reviews.llvm.org/D87166	2020-09-14 16:30:18 -07:00
Craig Topper	c193a689b4	[SelectionDAG] Use Align/MaybeAlign in calls to getLoad/getStore/getExtLoad/getTruncStore. The versions that take 'unsigned' will be removed in the future. I tried to use getOriginalAlign instead of getAlign in some places. getAlign factors in the minimum alignment implied by the offset in the pointer info. Since we're also passing the pointer info we can use the original alignment. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D87592	2020-09-14 13:54:50 -07:00
Austin Kerbow	f859c30ecb	[AMDGPU] Add XDL resource to scheduling model Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D87621	2020-09-14 13:48:54 -07:00
Jay Foad	c799f873cb	[AMDGPU] Don't cluster stores Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets. The disadvantage is that it tends to increase register pressure and restricts scheduling freedom. Differential Revision: https://reviews.llvm.org/D85530	2020-09-14 13:40:17 +01:00
Georgii Rymar	e9c314611b	[llvm-readelf/obj] - Refine and generalize the code that is used to dump notes. There is some code that can be shared between GNU/LLVM styles. Also, this fixes 2 inconsistencies related to dumping unknown note types: 1) For GNU style we printed "Unknown note type: (0x00000003)" in some cases, and "Unknown note type (0x00000003)" (no colon) in other cases. GNU readelf always prints `:`. This patch removes the related code duplication and does the same. 2) For LLVM style in some cases we printed "Unknown note type (0x00000003)", but sometimes just "Unknown (0x00000003)". The latter is the right form, which is consistent with other unknowns that are printed in LLVM style. Rebased on top of D87453. Differential revision: https://reviews.llvm.org/D87454	2020-09-14 14:31:50 +03:00
Petar Avramovic	6e2a86ed5a	AMDGPU/GlobalISel Check for NoNaNsFPMath in isKnownNeverSNaN Check for NoNaNsFPMath function attribute in isKnownNeverSNaN. Function attributes are in held in 'TargetMachine.Options'. Among other things, this allows selection of some patterns imported in D87351 since G_FCANONICALIZE is not generated when isKnownNeverSNaN returns true in lowerFMinNumMaxNum. However we notice some incorrect results since function attributes are not correctly written in TargetMachine.Options when next function is processed. Take a look at @v_test_no_global_nnans_med3_f32_pat0_srcmod0, it has "no-nans-fp-math"="false" but TargetMachine.Options still has it set to true since first function in test file had this attribute set to true. This will be fixed in D87511. Differential Revision: https://reviews.llvm.org/D87456	2020-09-14 12:11:00 +02:00
Petar Avramovic	416346d1ca	AMDGPU/GlobalISel/Emitter Recognize additional 'same operand checks' The "name" of a non-leaf complex pattern (MY_PAT $op1, $op2) is "MY_PAT:op1:op2" and the ones with same "name" represent same operand. Add 'same operand check' for this case. Differential Revision: https://reviews.llvm.org/D87351	2020-09-14 12:10:59 +02:00
Petar Avramovic	0c8f4cd657	AMDGPU/GlobalISel Add test for non-leaf complex patterns GlobalIsel emitter does not import patterns where complex sub-operand of a non-leaf complex pattern is referenced more then once. Multiple references of complex patterns with same name and same sub-operands represent the same operand. Document this with a test.	2020-09-14 12:10:59 +02:00
Petar Avramovic	09b8871f8d	AMDGPU/GlobalISel/Emitter Support for predicate code that uses operands Predicates with 'let PredicateCodeUsesOperands = 1' want to examine matched operands. When we encounter predicate code that uses operands, analyze its named operand arguments and create a map between argument index and name. Later, when leaf node with name is encountered, emit GIM_RecordNamedOperand that will store that operand at its argument index in operand list. This operand list will be an argument to c++ code of the predicate. Differential Revision: https://reviews.llvm.org/D87285	2020-09-14 10:39:56 +02:00
Matt Arsenault	e21bb31eb6	CodeGen: Require SSA to run PeepholeOptimizer	2020-09-11 18:03:04 -04:00
Jay Foad	06e356c81e	[AMDGPU] Make movreld-bug test case more robust Without this, future optimizer improvements can optimize the entire function to "return 0".	2020-09-11 10:25:29 +01:00
Michael Liao	f787fe15d8	[EarlyCSE] Remove unnecessary operand swap. - As min/max are commutative operators, there is no need to swap operands. That breaks the convention calculating the hash value.	2020-09-11 02:14:04 -04:00
Florian Hahn	fb109c42d9	[DSE] Switch to MemorySSA-backed DSE by default. The tests have been updated and I plan to move them from the MSSA directory up. Some end-to-end tests needed small adjustments. One difference to the legacy DSE is that legacy DSE also deletes trivially dead instructions that are unrelated to memory operations. Because MemorySSA-backed DSE just walks the MemorySSA, we only visit/check memory instructions. But removing unrelated dead instructions is not really DSE's job and other passes will clean up. One noteworthy change is in llvm/test/Transforms/Coroutines/ArgAddr.ll, but I think this comes down to legacy DSE not handling instructions that may throw correctly in that case. To cover this with MemorySSA-backed DSE, we need an update to llvm.coro.begin to treat it's return value to belong to the same underlying object as the passed pointer. There are some minor cases MemorySSA-backed DSE currently misses, e.g. related to atomic operations, but I think those can be implemented after the switch. This has been discussed on llvm-dev: http://lists.llvm.org/pipermail/llvm-dev/2020-August/144417.html For the MultiSource/SPEC2000/SPEC2006 the number of eliminated stores goes from ~17500 (legayc DSE) to ~26300 (MemorySSA-backed). More numbers and details in the thread on llvm-dev. Impact on CTMark: ``` Legacy Pass Manager exec instrs size-text O3 + 0.60% - 0.27% ReleaseThinLTO + 1.00% - 0.42% ReleaseLTO-g. + 0.77% - 0.33% RelThinLTO (link only) + 0.87% - 0.42% RelLO-g (link only) + 0.78% - 0.33% ``` http://llvm-compile-time-tracker.com/compare.php?from=3f22e96d95c71ded906c67067d75278efb0a2525&to=ae8be4642533ff03803967ee9d7017c0d73b0ee0&stat=instructions ``` New Pass Manager exec instrs. size-text O3 + 0.95% - 0.25% ReleaseThinLTO + 1.34% - 0.41% ReleaseLTO-g. + 1.71% - 0.35% RelThinLTO (link only) + 0.96% - 0.41% RelLO-g (link only) + 2.21% - 0.35% ``` http://195.201.131.214:8000/compare.php?from=3f22e96d95c71ded906c67067d75278efb0a2525&to=ae8be4642533ff03803967ee9d7017c0d73b0ee0&stat=instructions Reviewed By: asbirlea, xbolva00, nikic Differential Revision: https://reviews.llvm.org/D87163	2020-09-10 22:24:32 +01:00
Matt Arsenault	85490874b2	AMDGPU: Skip all meta instructions in hazard recognizer This was not adding a necessary nop due to thinking the kill counted.	2020-09-09 19:45:40 -04:00
Matt Arsenault	82cbc9330a	AMDGPU: Fix inserting waitcnts before kill uses	2020-09-09 19:45:40 -04:00
dfukalov	c259d3a061	[AMDGPU] Fix for folding v2.16 literals. It was found some packed immediate operands (e.g. `<half 1.0, half 2.0>`) are incorrectly processed so one of two packed values were lost. Introduced new function to check immediate 32-bit operand can be folded. Converted condition about current op_sel flags value to fall-through. Fixes: SWDEV-247595 Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D87158	2020-09-10 01:39:25 +03:00
Amara Emerson	e5784ef8f6	[GlobalISel] Enable usage of BranchProbabilityInfo in IRTranslator. We weren't using this before, so none of the MachineFunction CFG edges had the branch probability information added. As a result, block placement later in the pipeline was flying blind. This is enabled only with optimizations enabled like SelectionDAG. Differential Revision: https://reviews.llvm.org/D86824	2020-09-09 14:31:12 -07:00
Amara Emerson	cc76da7ada	[GlobalISel] Rewrite the elide-br-by-swapping-icmp-ops combine to do less. This combine previously tried to take sequences like: %cond = G_ICMP pred, a, b G_BRCOND %cond, %truebb G_BR %falsebb %truebb: ... %falsebb: ... and by inverting the compare predicate and swapping branch targets, delete the G_BR and instead have a single conditional branch to the falsebb. Since in an earlier patch we have a combine to fold not(icmp) into just an inverted icmp, we don't need this combine to do as much. This patch instead generalizes the combine by just looking for: G_BRCOND %cond, %truebb G_BR %falsebb %truebb: ... %falsebb: ... and then inverting the condition using a not (xor). The xor can be folded away in a separate combine. This change also lets us avoid some optimization code in the IRTranslator. I also think that deleting G_BRs in the combiner is unnecessary. That's something that targets can decide to do at selection time and could simplify generic code in future. Differential Revision: https://reviews.llvm.org/D86664	2020-09-09 13:08:16 -07:00
Ronak Chauhan	f078577f31	Revert "[AMDGPU] Support disassembly for AMDGPU kernel descriptors" This reverts commit `487a805310`. Tests fail on big endian machines.	2020-09-09 18:01:28 +05:30
Mirko Brkusanin	43af2a6faa	[AMDGPU] Workaround for LDS Misalignment bug on GFX10 Add subtarget feature check to avoid using ds_read/write_b96/128 with too low alignment if a bug is present on that specific hardware. Add this "feature" to GFX 10.1.1 as it is also affected. Add global-isel test.	2020-09-09 11:46:09 +02:00
Ronak Chauhan	487a805310	[AMDGPU] Support disassembly for AMDGPU kernel descriptors Decode AMDGPU Kernel descriptors as assembler directives. Reviewed By: scott.linder, jhenderson, kzhuravl Differential Revision: https://reviews.llvm.org/D80713	2020-09-08 21:26:11 +05:30
alex-t	2480a31e5d	[AMDGPU] SILowerControlFlow::optimizeEndCF should remove empty basic block optimizeEndCF removes EXEC restoring instruction case this instruction is the only one except the branch to the single successor and that successor contains EXEC mask restoring instruction that was lowered from END_CF belonging to IF_ELSE. As a result of such optimization we get the basic block with the only one instruction that is a branch to the single successor. In case the control flow can reach such an empty block from S_CBRANCH_EXEZ/EXECNZ it might happen that spill/reload instructions that were inserted later by register allocator are placed under exec == 0 condition and never execute. Removing empty block solves the problem. This change require further work to re-implement LIS updates. Recently, LIS is always nullptr in this pass. To enable it we need another patch to fix many places across the codegen. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D86634	2020-09-07 19:37:27 +03:00
vnalamot	aff94ec0f4	[AMDGPU] Remove the dead spill slots while spilling FP/BP to memory During the PEI pass, the dead TargetStackID::SGPRSpill spill slots are not being removed while spilling the FP/BP to memory. Fixes: SWDEV-250393 Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D87032	2020-09-06 07:04:25 +05:30
Matt Arsenault	3c2a7bd286	AMDGPU: Remove code to handle tied si_else operands This has not used tied operands for a long time.	2020-09-03 19:46:05 -04:00
David Green	245f846c4e	[MemCpyOptimizer] Change required analysis order for BasicAA/PhiValuesAnalysis This is a followup to `1ccfb52a61`, which made a number of changes including the apparently innocuous reordering of required passes in MemCpyOptimizer. This however altered the creation order of BasicAA vs Phi Values analysis, meaning BasicAA did not pick up PhiValues as a cached result. Instead if we require MemoryDependence first it will require PhiValuesAnalysis allowing BasicAA to use it for better results. I don't claim this is an excellent design, but it fixes a nasty little regressions where a query later in JumpThreading was getting worse results. Differential Revision: https://reviews.llvm.org/D87027	2020-09-03 12:01:51 +01:00
Jay Foad	4bdab2e86a	[AMDGPU] Fix offset for REL32_HI relocs The addend in a REL32 reloc needs to be adjusted to account for the offset from the PC value returned by the s_getpc instruction to the point where the reloc is applied. This was being done correctly for (GOTPC)REL32_LO but not for (GOTPC)REL32_HI. This will only make a difference if the target symbol happens to get loaded almost exactly a multiple of 4G away from the relocated instructions. Differential Revision: https://reviews.llvm.org/D86938	2020-09-02 10:55:55 +01:00
Alina Sbirlea	1ccfb52a61	[MemCpyOptimizer] Preserve analyses and replace use of lambdas to get them. Summary: Analyses are preserved in MemCpyOptimizer. Get analyses before running the pass and store the pointers, instead of using lambdas and getting them every time on demand. Reviewers: lenary, deadalnix, mehdi_amini, nikic, efriedma Subscribers: hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D74494	2020-09-01 17:35:40 -07:00
Amara Emerson	520ab710fb	Revert "Revert "[GlobalISel] Fold xor(cmp(pred, _, _), 1) -> cmp(inverse(pred), _, _)" (and dependent patch "Optimize away a Not feeding a brcond by using tbz instead of tbnz.")" This reverts commit `8693ddc743`. Re-committing with the test requiring asserts.	2020-09-01 14:29:04 -07:00
Jordan Rupprecht	8693ddc743	Revert "[GlobalISel] Fold xor(cmp(pred, _, _), 1) -> cmp(inverse(pred), _, _)" (and dependent patch "Optimize away a Not feeding a brcond by using tbz instead of tbnz.") This reverts commit `8ad8f484b6`. It causes crashes when running `ninja check-llvm-codegen-aarch64-globalisel`, e.g. http://lab.llvm.org:8011/builders/clang-with-thin-lto-ubuntu/builds/24132/steps/test-stage1-compiler/logs/stdio. Note that the crash does not seem to reproduce in debug builds. `5ded444252` depends on this, so revert that too.	2020-09-01 13:31:57 -07:00
Michael Liao	1f4e7463b5	[amdgpu] Run SROA after loop unrolling. Summary: - There are promotable `alloca`s after loop unrolling. Reviewers: rampitec, arsenm Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, nikic, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D84252	2020-09-01 16:09:56 -04:00
Amara Emerson	8ad8f484b6	[GlobalISel] Fold xor(cmp(pred, _, _), 1) -> cmp(inverse(pred), _, _) This is needed for an upcoming change to how we translate conditional branches which might generate these. Differential Revision: https://reviews.llvm.org/D86383	2020-09-01 10:57:17 -07:00
Matt Arsenault	32a8a10b42	GlobalISel: Implement computeNumSignBits for G_SELECT	2020-09-01 12:50:19 -04:00
Matt Arsenault	35c94d3f7e	GlobalISel: Port smarter known bits for umin/umax from DAG	2020-09-01 12:50:15 -04:00
Volkan Keles	061182b7ba	GlobalISel: Add combines for extend operations https://reviews.llvm.org/D86516	2020-09-01 08:50:06 -07:00
Matt Arsenault	9e7e1b2d4b	GlobalISel: Implement computeNumSignBits for G_SEXTLOAD/G_ZEXTLOAD	2020-09-01 11:20:02 -04:00
Matt Arsenault	92090e8bd8	GlobalISel: Implement computeKnownBits for G_UNMERGE_VALUES	2020-09-01 11:19:27 -04:00
Matt Arsenault	18bbd9f15e	GlobalISel: Artifact combine unmerge of unmerge Unmerges have the same fundamental problem as G_TRUNC, and G_TRUNC could be implemented in terms of G_UNMERGE_VALUES. Reducing the number of elements in unmerge results ends up producing the original unmerge type profile, so the artifact combiner needs to eliminate the intermediate illegal registers. This avoids infinite looping in the legalizer in a future change. Assuming an unmerge has each result unmerged the same way, this ends up producing a new unmerge of the source for every definition. I'm not sure if the artifact combiner should either insert temporary merges here and erase the original merge, or if the combiner should look at uses from defs rather than defs from uses for unmerges. In a few cases this regresses from using 16-bit shifts for 8-bit values to using 32-bit shifts, but I think these can be legalized later (the other legalization rules don't try very hard to use 16-bit shifts either).	2020-09-01 11:01:33 -04:00
Matt Arsenault	7f5b4eaae4	AMDGPU: Check some offsets in test This will make updating the checks easier in a future change.	2020-09-01 11:01:02 -04:00
Matt Arsenault	4a9a4885ae	AMDGPU: Convert test to MIR Currently the dbg_value ends up in the relaxed branch block. A future commit will push the dbg_value out of this block, and I'm not sure how to coax the IR into producing the same MIR at the relevant point.	2020-09-01 11:01:02 -04:00
Fangrui Song	f2284e3405	[Sink] Optimize/simplify sink candidate finding with nearest common dominator For an instruction in the basic block BB, SinkingPass enumerates basic blocks dominated by BB and BB's successors. For each enumerated basic block, SinkingPass uses `AllUsesDominatedByBlock` to check whether the basic block dominates all of the instruction's users. This is inefficient. Use the nearest common dominator of all users to avoid enumerating the candidate. The nearest common dominator may be in a parent loop which is not beneficial. In that case, find the ancestors in the dominator tree. In the case that the instruction has no user, with this change we will not perform unnecessary move. This causes some amdgpu test changes. A stage-2 x86-64 clang is a byte identical with this change.	2020-08-30 22:51:00 -07:00
Matt Arsenault	1b201914b5	GlobalISel: Combine out redundant sext_inreg The scalar tests don't work yet, since computeNumSignBits apparently doesn't handle sextload yet, and sext folds into the load first.	2020-08-28 17:57:31 -04:00
Matt Arsenault	9145d75226	AMDGPU: Fix incorrectly deleting copies after spilling SGPR tuples The implicit def of the super register would appear to kill any live uses of components before the spill, and would be deleted by MachineCopyPropagation. We need to add implicit uses of the super register, similarly to what copyPhysReg does. VGPR tuples appear to be correctly handled already. I need to double check the SGPR->memory path.	2020-08-28 17:50:37 -04:00
Matt Arsenault	af1c1e20f4	AMDGPU/GlobalISel: Implement computeKnownBits for groupstaticsize	2020-08-27 19:39:44 -04:00
Matt Arsenault	f08bbde83f	Correctly revert "GlobalISel: Use & operator on KnownBits" I mis-resolved the revert through moving the code to another function.	2020-08-27 19:08:31 -04:00
Matt Arsenault	abc99ab572	GlobalISel: Implement known bits for min/max	2020-08-27 16:56:17 -04:00
Matt Arsenault	201f770f16	GlobalISel: Add and_trivial_mask to all_combines Also make up a new category of combines.	2020-08-27 16:42:09 -04:00
Matt Arsenault	9607ccf626	GlobalISel: Remove leftover lit.local.cfg The global-isel feature has been required for a long time and was removed in `c9455d3c57`, so this was causing all tests to be skipped.	2020-08-27 13:49:06 -04:00
Aditya Nandakumar	db464a3dbf	[GISel] Add new GISel combiners for G_SELECT https://reviews.llvm.org/D83833 Patch adds two new GICombinerRules for G_SELECT. The rules include: combining selects with undef comparisons into their first selectee value, and to combine away selects with constant comparisons. Patch additionally adds a new combiner test for the AArch64 target to test these new G_SELECT combiner rules and the existing select_same_val combiner rule. Patch by mkitzan	2020-08-27 09:40:15 -07:00
Drew Wock	0ec098e22b	[FPEnv] Allow fneg + strict_fadd -> strict_fsub in DAGCombiner This is the first of a set of DAGCombiner changes enabling strictfp optimizations. I want to test to waters with this to make sure changes like these are acceptable for the strictfp case- this particular change should preserve exception ordering and result precision perfectly, and many other possible changes appear to be able to as well. Copied from regular fadd combines but modified to preserve ordering via the chain, this change allows strict_fadd x, (fneg y) to become struct_fsub x, y and strict_fadd (fneg x), y to become strict_fsub y, x. Differential Revision: https://reviews.llvm.org/D85548	2020-08-27 08:17:01 -04:00
Piotr Sobczak	4e9d207117	[AMDGPU] Preserve vcc_lo when shrinking V_CNDMASK There is no justification for changing vcc_lo to vcc when shrinking V_CNDMASK, and such a change could later confuse live variable analysis. Make sure the original register is preserved. Differential Revision: https://reviews.llvm.org/D86541	2020-08-27 10:22:50 +02:00
Matt Arsenault	5207545a86	GlobalISel: IRTranslate minimum of pointer sizes on memcpy I forgot to squash this with `0b7f6cc71a`	2020-08-26 20:10:00 -04:00
Matt Arsenault	f78687df9b	AMDGPU: Don't assert on misaligned DS read2/write2 offsets This would assert with unaligned DS access enabled. The offset may not be aligned. Theoretically the pattern predicate should check the memory alignment, although it is possible to have the memory be aligned but not the immediate offset. In this case I would expect it to use ds_{read\|write}_b64 with unaligned access, but am not clear if there's a reason it doesn't.	2020-08-26 14:08:05 -04:00
Matt Arsenault	21ccedc24f	AMDGPU/GlobalISel: Tolerate negated control flow intrinsic outputs If the condition output is negated, swap the branch targets. This is similar to what SelectionDAG does for when SelectionDAGBuilder decides to invert the condition and swap the branches. This is leaving behind a dead constant def for some reason.	2020-08-26 08:58:54 -04:00
Matt Arsenault	eb074088c9	GlobalISel: Combine G_ADD of G_PTRTOINT to G_PTR_ADD This produces less work for addressing mode matching. I think this is safe since I don't think machine IR is supposed to give the same aliasing properties as getelementptr in the IR.	2020-08-26 08:57:15 -04:00
Jay Foad	831457c6d5	[AMDGPU][GlobalISel] Eliminate barrier if workgroup size is not greater than wavefront size If a workgroup size is known to be not greater than wavefront size the s_barrier instruction is not needed since all threads are guaranteed to come to the same point at the same time. This is the same optimization that was implemented for SelectionDAG in D31731. Differential Revision: https://reviews.llvm.org/D86609	2020-08-26 13:47:51 +01:00
QingShan Zhang	ebf3b188c6	[Scheduling] Implement a new way to cluster loads/stores Before calling target hook to determine if two loads/stores are clusterable, we put them into different groups to avoid fake cluster due to dependency. For now, we are putting the loads/stores into the same group if they have the same predecessor. We assume that, if two loads/stores have the same predecessor, it is likely that, they didn't have dependency for each other. However, one SUnit might have several predecessors and for now, we just pick up the first predecessor that has non-data/non-artificial dependency, which is too arbitrary. And we are struggling to fix it. So, I am proposing some better implementation. 1. Collect all the loads/stores that has memory info first to reduce the complexity. 2. Sort these loads/stores so that we can stop the seeking as early as possible. 3. For each load/store, seeking for the first non-dependency instruction with the sorted order, and check if they can cluster or not. Reviewed By: Jay Foad Differential Revision: https://reviews.llvm.org/D85517	2020-08-26 12:33:59 +00:00
Jay Foad	8a1926c67a	AMDGPU/GlobalISel: re-auto-generate some test checks	2020-08-25 15:54:22 +01:00
Matt Arsenault	0d2fe90063	AMDGPU/GlobalISel: Use more accurate legality rules for merge/unmerge Most notably, we were incorrectly reporting <3 x s16> as a legal type for these. Make sure these aren't legal to help make progress on fixing the artifact combiner and vector legalizer rules. Unfortunately, this means spreading the -global-isel-abort=0 hack, although this doesn't change the legalizer result in any situation.	2020-08-25 09:40:20 -04:00
Matt Arsenault	984a499f9d	AMDGPU/GlobalISel: Fix using unlegalizable values in tests Implicit uses of non-register value types places impossible to satisfy constraints on the legalizer / artifact combiner. These prevent writing sensible legalize rules for the artifacts without triggering infinite loops in the legalizer. The verifier really needs to enforce this, but I'm not sure what the exact conditions would look like yet.	2020-08-25 09:39:32 -04:00
Matt Arsenault	ef8f3b5a78	AMDGPU/GlobalISel: Apply bitcast load/store hack to pointer vectors The selection patterns will currently fail on these.	2020-08-25 09:37:41 -04:00
Matt Arsenault	77e5a195f8	AMDGPU/GlobalISel: Handle AGPRs used for SGPR operands. We would still need to waterfall if the value were somehow an AGPR, and also need to explicitly copy to a VGPR.	2020-08-24 17:54:34 -04:00
Matt Arsenault	05a3c8848a	AMDGPU: Have a few selection failure tests check both paths SelectionDAG and GlobalISel take different failure paths for these and end up producing different failure errors. Check both so the test passes when the default is switched.	2020-08-24 17:46:31 -04:00
Matt Arsenault	75e6f0b3d4	AMDGPU: Add flag to disable promotion of uniform i16 ops This interferes with GlobalISel's much better handling of the situation. This should really be disable for GlobalISel. However, the fallback only re-runs the selection passes, and doesn't go back and rerun any codegen IR passes. I haven't come up with a good solution to this problem.	2020-08-24 14:39:27 -04:00
Matt Arsenault	116affb18d	TableGen/GlobalISel: Allow inst matcher to check multiple opcodes This is to initially handleg immAllOnesV, which should match G_BUILD_VECTOR or G_BUILD_VECTOR_TRUNC. In the future, it could be used for other patterns cases that map to multiple G_* instructions, such as G_ADD and G_PTR_ADD.	2020-08-24 13:48:51 -04:00
Jay Foad	a522067692	[SDAG] Convert FSHL <--> FSHR if the target only supports one of them D77152 tried to do this but got it wrong in the shift-by-zero case. D86430 reverted the wrong code. Reimplement the optimization with different code depending on whether the shift amount is known to be non-zero (modulo bitwidth). This improves code quality for fshl tests on AMDGPU, which only has an fshr instruction. Differential Revision: https://reviews.llvm.org/D86438	2020-08-24 17:47:10 +01:00
Matt Arsenault	bdb25b3ce5	AMDGPU/GlobalISel: Use different technique for sample v3s16 values Avoid relying on implicit_def values, and odd sized G_INSERT/G_EXTRACT	2020-08-24 10:07:30 -04:00
Matt Arsenault	9b3222d560	AMDGPU/GlobalISel: Add baseline, failing unmerge tests	2020-08-24 10:07:30 -04:00
Matt Arsenault	70cd9f5b77	AMDGPU/GlobalISel: Start implementing computeKnownBitsForTargetInstr Handle workitem intrinsics. There isn't really away to adequately test this right now, since none of the known bits users are fine grained enough to test the edge conditions. This triggers a number of instances of the new 64-bit to 32-bit shift combine in the existing tests.	2020-08-24 09:53:27 -04:00
Matt Arsenault	e1644a3779	GlobalISel: Reduce G_SHL width if source is extension shl ([sza]ext x, y) => zext (shl x, y). Turns expensive 64 bit shifts into 32 bit if it does not overflow the source type: This is a port of an AMDGPU DAG combine added in `5fa289f0d8`. InstCombine does this already, but we need to do it again here to apply it to shifts introduced for lowered getelementptrs. This will help matching addressing modes that use 32-bit offsets in a future patch. TableGen annoyingly assumes only a single match data operand, so introduce a reusable struct. However, this still requires defining a separate GIMatchData for every combine which is still annoying. Adds a morally equivalent function to the existing getShiftAmountTy. Without this, we would have to do try to repeatedly query the legalizer info and guess at what type to use for the shift.	2020-08-24 09:42:40 -04:00
Bjorn Pettersson	7a4e26adc8	[SelectionDAG] Fix miscompile bug in expandFunnelShift This is a fixup of commit `0819a6416f` (D77152) which could result in miscompiles. The miscompile could only happen for targets where isOperationLegalOrCustom could return different values for FSHL and FSHR. The commit mentioned above added logic in expandFunnelShift to convert between FSHL and FSHR by swapping direction of the funnel shift. However, that transform is only legal if we know that the shift count (modulo bitwidth) isn't zero. Basically, since fshr(-1,0,0)==0 and fshl(-1,0,0)==-1 then doing a rewrite such as fshr(X,Y,Z) => fshl(X,Y,0-Z) would be incorrect if Z modulo bitwidth, could be zero. ``` $ ./alive-tv /tmp/test.ll ---------------------------------------- define i32 @src(i32 %x, i32 %y, i32 %z) { %0: %t0 = fshl i32 %x, i32 %y, i32 %z ret i32 %t0 } => define i32 @tgt(i32 %x, i32 %y, i32 %z) { %0: %t0 = sub i32 32, %z %t1 = fshr i32 %x, i32 %y, i32 %t0 ret i32 %t1 } Transformation doesn't verify! ERROR: Value mismatch Example: i32 %x = #x00000000 (0) i32 %y = #x00000400 (1024) i32 %z = #x00000000 (0) Source: i32 %t0 = #x00000000 (0) Target: i32 %t0 = #x00000020 (32) i32 %t1 = #x00000400 (1024) Source value: #x00000000 (0) Target value: #x00000400 (1024) ``` It could be possible to add back the transform, given that logic is added to check that (Z % BW) can't be zero. Since there were no test cases proving that such a transform actually would be useful I decided to simply remove the faulty code in this patch. Reviewed By: foad, lebedev.ri Differential Revision: https://reviews.llvm.org/D86430	2020-08-24 09:52:11 +02:00
Matt Arsenault	901e3317fe	GlobalISel: Merge FewerElements for G_BUILD_VECTOR/G_CONCAT_VECTORS This switches from using G_EXTRACT in odd cases to widen with undef and unmerge.	2020-08-22 10:25:53 -04:00
Stanislav Mekhanoshin	9a9a092e61	[AMDGPU] Avoid sorting stalls in regbank-reassign This is the slowest operation in the already slow pass. Instead of sorting just put a stall list into an ordered map. Differential Revision: https://reviews.llvm.org/D86253	2020-08-21 11:49:41 -07:00
Mirko Brkusanin	0654ff703d	[AMDGPU] Use ds_read/write_b96/b128 when possible for SDag Do not break down local loads and stores so ds_read/write_b96/b128 in ISelLowering can be selected on subtargets that support them and if align requirements allow them. Differential Revision: https://reviews.llvm.org/D84403	2020-08-21 12:26:31 +02:00
Mirko Brkusanin	d17ea67b92	[AMDGPU][GlobalISel] Fix 96 and 128 local loads and stores Fix local ds_read/write_b96/b128 so they can be selected if the alignment allows. Otherwise, either pick appropriate ds_read2/write2 instructions or break them down. Differential Revision: https://reviews.llvm.org/D81638	2020-08-21 12:26:31 +02:00
Mirko Brkusanin	f5cd7ec9f3	[AMDGPU] Reorganize GCN subtarget features for unaligned access Features UnalignedBufferAccess and UnalignedDSAccess are now used to determine whether hardware supports such access. UnalignedAccessMode should be used to enable them. hasUnalignedBufferAccessEnabled() and hasUnalignedDSAccessEnabled() can be now used to quickly check both. Differential Revision: https://reviews.llvm.org/D84522	2020-08-21 12:26:31 +02:00
Mirko Brkusanin	5bd1febe21	[AMDGPU] Fix alignment requirements for 96bit and 128bit local loads and stores Adjust alignment requirements for ds_read/write_b96/b128. GFX9 and onwards allow misaligned access for reads and writes but only if SH_MEM_CONFIG.alignment_mode allows it. UnalignedDSAccess is set on GCN subtargets from GFX9 onward to let us know if we can relax alignment requirements. UnalignedAccessMode acts similary to UnalignedBufferAccess for DS instructions but only from GFX9 onward and is supposed to match alignment_mode. By default alignment of 4 is required. Differential Revision: https://reviews.llvm.org/D82788	2020-08-21 12:26:31 +02:00
Jay Foad	0819a6416f	[SelectionDAG] Better legalization for FSHL and FSHR In SelectionDAGBuilder always translate the fshl and fshr intrinsics to FSHL and FSHR (or ROTL and ROTR) instead of lowering them to shifts and ORs. Improve the legalization of FSHL and FSHR to avoid code quality regressions. Differential Revision: https://reviews.llvm.org/D77152	2020-08-21 10:32:49 +01:00
Michael Liao	5257a60ee0	[amdgpu] Add codegen support for HIP dynamic shared memory. Summary: - HIP uses an unsized extern array `extern __shared__ T s[]` to declare the dynamic shared memory, which size is not known at the compile time. Reviewers: arsenm, yaxunl, kpyzhov, b-sumner Subscribers: kzhuravl, jvesely, wdng, nhaehnle, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D82496	2020-08-20 21:29:18 -04:00
Matt Arsenault	79ce9bb380	CodeGen: Don't drop AA metadata when splitting MachineMemOperands Assuming this is used to split a memory access into smaller pieces, the new access should still have the same aliasing properties as the original memory access. As far as I can tell, this wasn't intentionally dropped. It may be necessary to drop this if you are moving the operand outside of the bounds of the original object in such a way that it may alias another IR object, but I don't think any of the existing users are doing this. Some of the uses widen into unused alignment padding, which I think is OK.	2020-08-20 16:17:30 -04:00
Matt Arsenault	18b218007d	AMDGPU/GlobalISel: Legalize odd sized loads with widening Custom lower and widen odd sized loads up to the alignment. The default set of legalization actions doesn't have a way to represent this. This fixes naturally aligned <3 x s8> and <3 x s16> loads. This also starts moving towards eliminating the buggy and overcomplicated legalization rules for narrowing. All the memory size changes should be done in the lower or custom action, not NarrowScalar / FewerElements. These currently have redundant and ambiguous code with the lower action.	2020-08-20 16:15:53 -04:00
Matt Arsenault	31adc28d24	GlobalISel: Implement fewerElementsVector for G_CONCAT_VECTORS sources This fixes <6 x s16> = G_CONCAT_VECTORS from <3 x s16> handling.	2020-08-19 18:53:24 -04:00
Matt Arsenault	ff5758fec8	AMDGPU/GlobalISel: Add some bitcast tests	2020-08-19 10:38:39 -04:00
Matt Arsenault	386a5ea2b7	AMDGPU/GlobalISel: Add selection tests for pointer constants	2020-08-19 10:23:56 -04:00
Ronak Chauhan	fdf71d486c	Revert "[AMDGPU] Support disassembly for AMDGPU kernel descriptors" This reverts commit `cacfb02d28`. Reverting due to buildbot failures.	2020-08-19 13:12:29 +05:30
Ronak Chauhan	cacfb02d28	[AMDGPU] Support disassembly for AMDGPU kernel descriptors Decode AMDGPU Kernel descriptors as assembler directives. Reviewed By: scott.linder Differential Revision: https://reviews.llvm.org/D80713	2020-08-19 08:49:07 +05:30
Changpeng Fang	e7081d117a	AMDGPU: Implement waterfall loop for MIMG instructions with 256-bit SRsrc Summary: When the resource descriptor is of vgpr, we need a waterfall loop to read into a sgpr. In this patchm we generalized the implementation to work for any regster class sizes, and extend the work to MIMG instructions. Fixes: SWDEV-223405 Reviewers: arsenm, nhaehnle Differential Revision: https://reviews.llvm.org/D82603	2020-08-18 16:27:36 -07:00
Matt Arsenault	5a15f6628e	GlobalISel: Implement fewerElementsVector for G_INSERT_VECTOR_ELT Add unit tests since AMDGPU will only trigger this for gigantic vectors, and won't use the annoying odd sized breakdown case.	2020-08-18 13:51:19 -04:00
Amara Emerson	40e269ea6d	[GlobalISel] Add a combine for ashr(shl x, c), c --> sext_inreg x, c' By detecting this sign extend pattern early, we can uncover opportunities for more optimizations. Differential Revision: https://reviews.llvm.org/D85965	2020-08-18 10:42:15 -07:00
Jessica Paquette	224a8c639e	[GlobalISel][CallLowering] Look through call parameters for flags We weren't looking through the parameters on calls at all. E.g., say you had ``` declare i32 @zext(i32 zeroext %x) ... %y = call i32 @zext(i32 %something) ... ``` At the point of the call, we wouldn't know that the %something should have the zeroext attribute. This sets flags in about the same way as TargetLoweringBase::ArgListEntry::setAttributes. Differential Revision: https://reviews.llvm.org/D86125	2020-08-18 08:48:56 -07:00
Matt Arsenault	2f5f5febf3	AMDGPU/GlobalISel: Select llvm.amdgcn.groupstaticsize Previously, it would successfully select and assert if not HSA or PAL when expanding the pseudoinstruction. We don't need the pseudoinstruction anymore since we know the total size after legalization.	2020-08-18 09:28:01 -04:00
Matt Arsenault	3ba7777b94	AMDGPU/GlobalISel: Fix selection of s1/s16 G_[F]CONSTANT The code to determine the value size was overcomplicated and only correct in the case where the result register already had a register class assigned. We can always take the size directly from the register's type.	2020-08-18 09:28:01 -04:00
Matt Arsenault	a9ee0589a8	AMDGPU/GlobalISel: Match global saddr addressing mode	2020-08-17 15:48:06 -04:00
Matt Arsenault	e1a2f4713c	AMDGPU: Match global saddr addressing mode The previous implementation was incorrect, and based off incorrect instruction definitions. Unfortunately we can't match natural addressing in a lot of cases due to the shift/scale applied in getelementptrs. This relies on reducing the 64-bit shift to 32-bits.	2020-08-17 15:28:14 -04:00
Matt Arsenault	087dcbe9bc	AMDGPU: Add baseline tests for global saddr matching	2020-08-17 15:23:13 -04:00
Stanislav Mekhanoshin	24182f14b6	[AMDGPU] Define spill opcodes for all AGPR sizes Since we have defined all these sizes I believe we shall be able to spill these as well. Differential Revision: https://reviews.llvm.org/D86098	2020-08-17 12:17:23 -07:00
Matt Arsenault	fe171908e9	GlobalISel: Revisit users of other merge opcodes in artifact combiner The artifact combiner searches for the uses of G_MERGE_VALUES for unmerge/trunc that need further combining. This also needs to handle the vector merge opcodes the same way. This fixes leaving behind some pairs I expected to be removed, that were if the legalizer is run a second time.	2020-08-17 13:56:53 -04:00
Matt Arsenault	c8a9872259	AMDGPU/GlobalISel: Look through copies in getPtrBaseWithConstantOffset We may have an SGPR->VGPR copy if a totally uniform pointer calculation is used for a VGPR pointer operand. Also hack around a bug in MUBUF matching which would incorrectly use MUBUF for global when flat was requested. This should really be a predicate on the parent pattern, but the DAG always checked this manually inside the complex pattern.	2020-08-17 12:31:38 -04:00
Matt Arsenault	c7b9cd31bf	AMDGPU/GlobalISel: Fix missing 256-bit AGPR mapping	2020-08-17 09:53:26 -04:00
Matt Arsenault	af162ac785	AMDGPU/GlobalISel: Fix using readfirstlane with ballot intrinsics This should use the default mapping and insert a copy to the vcc bank, and not try to insert a readfirstlane.	2020-08-17 09:53:25 -04:00
Matt Arsenault	e0375dbcb3	AMDGPU: Fix using wrong offsets for global atomic fadd intrinsics Global instructions have the signed offsets.	2020-08-17 09:19:15 -04:00
Matt Arsenault	a7455652c0	AMDGPU: Fix global atomic saddr operand class	2020-08-15 12:12:28 -04:00
Matt Arsenault	e5077b5c2a	AMDGPU: Fix matching wrong offsets for global atomic loads These used signed offsets with a different size.	2020-08-15 12:12:17 -04:00
Matt Arsenault	47af1ac69a	AMDGPU: Correct definitions for global saddr instructions The VGPR component is a 32-bit offset, not 64-bits. I'm not sure what the correct syntax is for this. This maintains the vaddr position and leaves saddr in the end "off" position. This is particularly terrible for stores, since the operand order is now <vgpr offset>, <data>, <sgpr base>, splitting the pointer operands. I suppose this is a logical consequence from the mistake of not putting the data operand first. I'm not sure what sp3 does.	2020-08-15 12:11:57 -04:00
Matt Arsenault	79298a5067	AMDGPU: Remove SIFixupVectorISel pass This was only used for matching the saddr addressing mode of global instructions, but this was not implemented correctly. The instruction definitions aren't even correct, and are defined as using a 64-bit VGPR component. Eliminate this pass to enable correcting the instruction definitions. A new matching implementation can work in GlobalISel or relying on DAG divergence information for the base address.	2020-08-15 12:11:51 -04:00
Stanislav Mekhanoshin	43a38dc251	[AMDGPU] Fix MAI ld/st hazard handling It did not process hazard for ds_permute because it does not load or store even though it is DS. Differential Revision: https://reviews.llvm.org/D86003	2020-08-14 17:07:37 -07:00
Matt Arsenault	5c5e6d951e	TableGen/GlobalISel: Partially handle immAllOnesV/immAllZerosV These should really match either G_BUILD_VECTOR or G_BUILD_VECTOR_TRUNC, but there doesn't seem to be an existing mechanism for matching alternative opcodes. There is GIM_SwitchOpcode, but it seems to assume it's oly only used for matcher optimization. I could also omit any opcode check and rely on the matcher directly checking the opcode, but the table optimizer currently assumes there has to be an opcode check. Also doesn't try to handle undef elements like the DAG version.	2020-08-14 13:55:30 -04:00
Matt Arsenault	40a142fa57	AMDGPU/GlobalISel: Match andn2/orn2 for more types Unfortunately this ends up not working as expected on targets with 16-bit operations due to AMDGPUCodeGenPrepare's promotion of uniform 16-bit ops to i32. The vector case annoyingly requires switching the checked opcode, since constants for vectors aren't directly handled. I also need to think more carefully about whether this is valid for i1.	2020-08-14 13:18:03 -04:00
Sebastian Neubauer	9aa0ff77bd	[AMDGPU] Enable .rodata for amdpal os PAL recently got support for multiple ELF sections and relocations, therefore we can now use .rodata sections instead of forcing constants into .text. Differential Revision: https://reviews.llvm.org/D85895	2020-08-14 09:05:48 +02:00
Austin Kerbow	7d1cb187fb	[AMDGPU] Fix FP/BP spills when MUBUF constant offset exceeded If we need a scratch register for the spill don't use the same scratch register that is being used for the MBUF offset. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D85772	2020-08-13 14:12:00 -07:00
Stanislav Mekhanoshin	0462aef5f3	[AMDGPU] Inhibit SDWA if target instruction has FI Differential Revision: https://reviews.llvm.org/D85918	2020-08-13 11:34:28 -07:00
Matt Arsenault	c7191e3185	DAG: Don't pass 0 alignment value to allowsMisalignedMemoryAccesses I think not unconditionally passing getDstAlign is broken, but leave that for another change.	2020-08-13 09:33:17 -04:00
Carl Ritson	d538c5837a	[AMDGPU] Fix missed SI_RETURN_TO_EPILOG in pre-emit peephole SIPreEmitPeephole does not process all terminators, which means it can fail to handle SI_RETURN_TO_EPILOG if immediately preceeded by a branch to the early exit block. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D85872	2020-08-13 21:52:41 +09:00
Carl Ritson	2781f3003b	[AMDGPU] Pre-commit test for D85872	2020-08-13 13:07:27 +09:00
Ruiling Song	18b1e67523	[AMDGPU] Fix crash when dag-combining bitcast From the code after the 'break', they are processing 64bit scalar and vector bitcast. So I think the break-condition should be (cond1 \|\| cond2) This means we only execute following code if (64bit and dest-is-vector). Also remove a previous fix which is not needed with this new fix. (introduced in: `1349a04ef5`) Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D85804	2020-08-13 10:23:13 +08:00
Matt Arsenault	e14474a39a	AMDGPU/GlobalISel: Select llvm.amdgcn.global.atomic.fadd Remove the intermediate transform in the DAG path. I believe this is the last non-deprecated intrinsic that needs handling.	2020-08-12 10:04:53 -04:00
Matt Arsenault	701228c411	AMDGPU: Handle intrinsics in performMemSDNodeCombine This avoids a possible regression in a future patch	2020-08-12 10:04:53 -04:00
Jessica Paquette	cf9588a24a	Update AMDGPU testcases after `bebe6a6449` I didn't build AMDGPU locally so I didn't see this. ``` (logic_op (op x...), (op y...)) -> (op (logic_op x, y)) ``` kicks in here. Differential Revision: https://reviews.llvm.org/D85761	2020-08-11 11:32:36 -07:00
Matt Arsenault	0dc4c36d3a	AMDGPU/GlobalISel: Manually select llvm.amdgcn.writelane Fixup the special case constant bus handling pre-gfx10.	2020-08-11 11:56:16 -04:00
Matt Arsenault	6cac661637	AMDGPU/GlobalISel: Fix test bugs and add a few more cases The wrong alignment or addrspace was used. Also add various cases to stress a future patch.	2020-08-11 11:09:05 -04:00
Matt Arsenault	e2f1b48f86	GlobalISel: Implement bitcast action for G_INSERT_VECTOR_ELT This mirrors the support for the equivalent extracts. This also creates a huge mess that would be greatly improved if we had any bit operation combines.	2020-08-11 10:39:14 -04:00
Matt Arsenault	53f21e0fb7	TableGen/GlobalISel: Hack the operand order for atomic_store ISD::ATOMIC_STORE arbitrarily has the operands in the opposite order from regular ISD::STORE, which always introduced an annoying duplication of patterns to handle both cases. Since in GlobalISel there's just the one G_STORE, we need to swap the operands to correctly emit the type check for the pointer operand. Some work started in `20aafa3156` to migrate SelectionDAG to use ISD::STORE for atomics, but that work seems to have stalled. Since this is the pretty much the last operation which matters which isn't supported for AMDGPU, use this compatibility hack to unblock declaring it functionally complete. Not sure what's going on with the pending_phis AArch64 test. It seems it didn't always use atomics, and I'm not sure what it was originally testing matters anymore.	2020-08-11 10:22:44 -04:00
Stanislav Mekhanoshin	08803f0e62	Unbundle KILL bundles in VirtRegRewriter SplitKit forms invalid COPY subreg bundles without a leading BUNDLE instruction. That manifests itself in post-RA scheduler counting instruction and asserting on "Instruction count mismatch". The bundle shall be undone by VirtRegRewriter::expandCopyBundle(), but it does not because VirtRegRewriter::handleIdentityCopy() can turn COPY bundle into a KILL bundle. Process KILLs as well. Differential Revision: https://reviews.llvm.org/D85484	2020-08-10 11:58:37 -07:00
Matt Arsenault	6fe6b29c29	AMDGPU: Fix assertion in performSHLPtrCombine for 64-bit pointers	2020-08-10 13:46:52 -04:00
Matt Arsenault	68fab44acf	AMDGPU: Fix visiting physreg dest users when folding immediate copies This can fold the immediate into the physical destination, but this should not look for further users of the register. Fixes regression introduced by `766cb615a3`.	2020-08-10 13:46:51 -04:00
Petar Avramovic	0d58d9e8fb	AMDGPU/GlobalISel: Lower G_FREM Add custom lower for G_FREM. Differential Revision: https://reviews.llvm.org/D84324	2020-08-10 10:10:46 +02:00
Piotr Sobczak	62d8b8a225	Fix 64-bit copy to SCC Fix 64-bit copy to SCC by restricting the pattern resulting in such a copy to subtargets supporting 64-bit scalar compare, and mapping the copy to S_CMP_LG_U64. Before introducing the S_CSELECT pattern with explicit SCC (`0045786f14`), there was no need for handling 64-bit copy to SCC ($scc = COPY sreg_64). The proposed handling to read only the low bits was however based on a false premise that it is only one bit that matters, while in fact the copy source might be a vector of booleans and all bits need to be considered. The practical problem of mapping the 64-bit copy to SCC is that the natural instruction to use (S_CMP_LG_U64) is not available on old hardware. Fix it by restricting the problematic pattern to subtargets supporting the instruction (hasScalarCompareEq64). Differential Revision: https://reviews.llvm.org/D85207	2020-08-09 20:50:30 +02:00
Matt Arsenault	5a0b1472c0	GlobalISel: Handle zext(sext x) in artifact combiner This eliminates the illegal intermediate s8 value in the added test.	2020-08-07 16:37:46 -04:00
Vang Thao	04bd5b5286	[AMDGPU] Fix not rescheduling without clustering Regions are sometimes skipped which should be rescheduled without memory op clustering. RegionIdx is not incremented when iterating over regions that are flagged to be skipped, causing the index to be incorrect. Thanks to Vang Thao for discovering this bug! Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D85498	2020-08-07 11:15:58 -07:00
Bevin Hansson	5de6c56f7e	[Intrinsic] Add sshl.sat/ushl.sat, saturated shift intrinsics. Summary: This patch adds two intrinsics, llvm.sshl.sat and llvm.ushl.sat, which perform signed and unsigned saturating left shift, respectively. These are useful for implementing the Embedded-C fixed point support in Clang, originally discussed in http://lists.llvm.org/pipermail/llvm-dev/2018-August/125433.html and http://lists.llvm.org/pipermail/cfe-dev/2018-May/058019.html Reviewers: leonardchan, craig.topper, bjope, jdoerfert Subscribers: hiraditya, jdoerfert, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D83216	2020-08-07 15:09:24 +02:00
QingShan Zhang	1ffb468369	[NFC][Test] Format the test with script update_llc_test_checks.py	2020-08-07 10:22:38 +00:00
QingShan Zhang	3359ea62ed	[Scheduling] Create the missing dependency edges for store cluster If it is load cluster, we don't need to create the dependency edges(SUb->reg) from SUb to SUa as they both depend on the base register "reg" +-------+ +----> reg \| \| +---+---+ \| ^ \| \| \| \| \| \| \| +---+---+ \| \| SUa \| Load 0(reg) \| +---+---+ \| ^ \| \| \| \| \| +---+---+ +----+ SUb \| Load 4(reg) +-------+ But if it is store cluster, we need to create it as follow shows to avoid the instruction store depend on scheduled in-between SUb and SUa. +-------+ +----> reg \| \| +---+---+ \| ^ \| \| Missing +-------+ \| \| +-------------------->+ y \| \| \| \| +---+---+ \| +---+-+-+ ^ \| \| SUa \| Store x 0(reg) \| \| +---+---+ \| \| ^ \| \| \| +------------------------+ \| \| \| \| +---+--++ +----+ SUb \| Store y 4(reg) +-------+ Reviewed By: evandro, arsenm, rampitec, foad, fhahn Differential Revision: https://reviews.llvm.org/D72031	2020-08-07 04:58:03 +00:00
Matt Arsenault	1ad051dd8c	GlobalISel: Implement lower for G_INSERT_VECTOR_ELT	2020-08-06 19:29:17 -04:00
Matt Arsenault	87b2af8140	AMDGPU/GlobalISel: Enable s_{and\|or}n2_{b32\|b64} patterns	2020-08-06 18:00:38 -04:00
Matt Arsenault	e00201539f	GlobalISel: Implement fewerElementsVector for G_EXTRACT_VECTOR_ELT Use the same basic strategy as LegalizeVectorTypes. Try to index into smaller pieces if there's a constant index, and otherwise fall back to a stack temporary.	2020-08-06 14:33:16 -04:00
Matt Arsenault	1a0c0944c6	AMDGPU: Define raw/struct variants of buffer atomic fadd Somehow the new FP atomic buffer intrinsics ended up using the legacy style for buffer intrinsics.	2020-08-06 13:36:19 -04:00
Matt Arsenault	90eb7d5283	AMDGPU: Fix spilling of 96-bit AGPRs	2020-08-06 12:42:07 -04:00
Matt Arsenault	56270d1d42	AMDGPU/GlobalISel: Start trying to handle AGPR bank Try to use AGPR banks for the various merge/unmerge type operations. Previously these would introduce copies to VGPR.	2020-08-06 12:39:50 -04:00
Matt Arsenault	63cdc9a49f	AMDGPU/GlobalISel: Handle llvm.amdgcn.ds.{fadd\|fmin\|fmax} These intrinsics are missing mangling for both the pointer and data type.	2020-08-06 11:09:08 -04:00
Matt Arsenault	63c4be53cf	AMDGPU/GlobalISel: Try to promote to use packed saturating add/sub This produces worse results right now for i8 vectors, but that should be addressed when we actually try to optimize packed vectors.	2020-08-06 11:08:45 -04:00
Matt Arsenault	5a503521e7	AMDGPU/GlobalISel: Implement expansion for rsq.clamp Not sure why we handle this removed instruction on newer subtargets for this one and no others, but maintain compatibility with the DAG.	2020-08-06 10:23:25 -04:00
Matt Arsenault	c015cbc68b	AMDGPU/GlobalISel: Fix trying to widen <3 x s1> boolean ops	2020-08-06 10:07:22 -04:00
Matt Arsenault	28124a0a63	AMDGPU/GlobalISel: Stop using G_EXTRACT in argument lowering We really need to put this undef padding stuff into a helper somewhere, but leave that for when this is moved to generic code.	2020-08-06 09:55:35 -04:00
Matt Arsenault	37894ba661	AMDGPU/GlobalISel: Make s16 phi legal If we were to have an operation with an s16 def that needs to be executed in a waterfall loop, not having s16 legal would place an avoidable burden on RegBankSelect to widen it.	2020-08-06 09:41:14 -04:00
Matt Arsenault	5316256709	AMDGPU/GlobalISel: Fix assert on copy to vcc This was trying to constrain a physical register. By the verifier's understanding, it's impossible to have a 1-bit copy to vcc/vcc_lo so don't try to handle physregs.	2020-08-06 09:41:14 -04:00
Ruiling Song	5ddc8b49ba	[AMDGPU] add buffer_atomic_swap for float The functionality is used when calling imageAtomicExhange() on float type imageBuffer in Graphics shaders. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D85187	2020-08-06 09:45:48 +08:00
Stanislav Mekhanoshin	0bcda1a261	[AMDGPU] Scavenge temp reg for AGPR spill Differential Revision: https://reviews.llvm.org/D85234	2020-08-05 13:29:19 -07:00
Matt Arsenault	ec8c172d01	AMDGPU: Correct prolog SP initialization logic Having callees that will read SP is not the only reason we need to reference the stack pointer.	2020-08-05 15:47:53 -04:00
Stanislav Mekhanoshin	ea7d0e2996	[AMDGPU] gfx1031 target Differential Revision: https://reviews.llvm.org/D85337	2020-08-05 12:36:26 -07:00
Matt Arsenault	b1dac0cfcd	AMDGPU: Remove leftover test	2020-08-05 14:43:21 -04:00
Matt Arsenault	3e52667433	AMDGPU: Fix verifier error with undef source producing s_bitset* This needs to preserve the undef flag.	2020-08-05 14:42:20 -04:00
Roman Lebedev	f5df5cd558	Recommit "[InstCombine] Negator: -(X << C) --> X * (-1 << C)" This reverts commit `ac70b37a00` which reverted commit `8aeb2fe13a` because codegen tests got broken and i needed time to investigate. This shows some regressions in tests, but they are all around GEP's, so i'm not really sure how important those are. https://rise4fun.com/Alive/1Gn	2020-08-05 15:59:13 +03:00
Jay Foad	8cbf4a17ac	[AMDGPU] Propagate fast math flags in frem lowering Differential Revision: https://reviews.llvm.org/D84518	2020-08-05 09:09:38 +01:00
Jay Foad	1bb07e1b91	[AMDGPU] Precommit tests for D84518 Propagate fast math flags in frem lowering	2020-08-05 09:09:02 +01:00
Jay Foad	04cf4a5a65	[AMDGPU] Lower frem f16 Without this it would fail to select on subtargets that have 16-bit instructions. Differential Revision: https://reviews.llvm.org/D84517	2020-08-05 09:08:40 +01:00
Matt Arsenault	89011fc3c9	AMDGPU/GlobalISel: Select llvm.returnaddress	2020-08-04 17:14:38 -04:00
Matt Arsenault	f8fb7835d6	GlobalISel: Add utilty for getting function argument live ins Get the argument register and ensure there's a copy to the virtual register. AMDGPU and AArch64 have similarish code to get the livein value, and I also want to use this in multiple places. This is a bit more aggressive about setting the register class than the original function, but that's probably OK. I think we're missing a few verifier checks for function live ins. I noticed AArch64's calling convention code is not actually adding liveins to functions, only the entry block (which apparently might not matter that much?). There should probably be a verifier check that entry block live ins are also live into the function. We also might need a verifier check that the copy to the livein virtual register is in the entry block.	2020-08-04 16:55:55 -04:00
Matt Arsenault	14ed5cf5c4	AMDGPU/GlobalISel: Add baseline tests for andn2/orn2 matching	2020-08-04 15:13:49 -04:00
Matt Arsenault	0de547ed4a	AMDGPU/GlobalISel: Ensure subreg is valid when selecting G_UNMERGE_VALUES Fixes verifier error with SGPR unmerges with 96-bit result types.	2020-08-04 12:27:34 -04:00
Matt Arsenault	444401c31f	GlobalISel: Hack a test to avoid a bug introducing a verifier error There seems to be an unrelated CSEMIRBuilder bug that was causing expensive checks failures in this case. Hack the test to avoid this problem for now until that's fixed.	2020-08-04 11:57:04 -04:00
Jay Foad	8ec8ad868d	[AMDGPU] Use fma for lowering frem This gives shorter f64 code and perhaps better accuracy. Differential Revision: https://reviews.llvm.org/D84516	2020-08-04 16:18:23 +01:00
Jay Foad	ee75cf36bb	[AMDGPU] Generate frem test checks Differential Revision: https://reviews.llvm.org/D84515	2020-08-04 16:18:23 +01:00
Carl Ritson	57899934ea	[AMDGPU] Make GCNRegBankReassign assign based on subreg banks When scavenging consider the sub-register of the source operand to determine the bank of a candidate register (not just sub0). Without this it is possible to introduce an infinite loop, e.g. $sgpr15_sgpr16_sgpr17 can be assigned for a conflict between $sgpr0 and SGPR_96:sub1. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D84910	2020-08-04 12:54:44 +09:00
Cameron McInally	31c7a2fd5c	[FPEnv] Don't transform FSUB(-0,X)->FNEG(X) in SelectionDAGBuilder. This patch stops unconditionally transforming FSUB(-0,X) into an FNEG(X) while building the DAG. There is also one small change to handle the new FSUB(-0,X) similarly to FNEG(X) in the AMDGPU backend. Differential Revision: https://reviews.llvm.org/D84056	2020-08-03 10:22:25 -05:00
Matt Arsenault	42a9f6c554	GlobalISel: Handle arbitrary FewerElementsVector for G_IMPLICIT_DEF	2020-08-03 09:14:08 -04:00
Matt Arsenault	2414bab5d7	AMDGPU/GlobalISel: Remove old hacks for boolean selection There were various hacks used to try to avoid making s1 SGPR vs. s1 VCC ambiguous after constraining the register before we had a strategy to deal with this. This also attempted to handle undef operands, which are now illegal gMIR.	2020-08-03 09:04:14 -04:00
Matt Arsenault	1782fbbc69	GlobalISel: Reimplement moreElementsVectorDst Use pad with undef and unmerge with unused results. This is annoyingly similar to several other places in LegalizerHelper, but they're all slightly different.	2020-08-03 09:03:48 -04:00
Matt Arsenault	fd63e46941	AMDGPU/GlobalISel: Apply load bitcast to s.buffer.load intrinsic Should also apply this to the non-scalar buffer loads.	2020-08-03 08:54:29 -04:00
Matt Arsenault	d8ef1d1251	AMDGPU/GlobalISel: Fix selecting broken copies for s32->s64 anyext These should probably not be legal in the first place, but that might also be a pain.	2020-08-03 08:36:41 -04:00
Simon Pilgrim	e7a8ee00e6	[AMDGPU] Regenerate tests to fix whitespace indentations Noticed while updating D77804	2020-08-02 18:11:18 +01:00
Matt Arsenault	212570abcf	GlobalISel: Implement bitcast action for G_EXTRACT_VECTOR_ELEMENT For AMDGPU, vectors with elements < 32 bits should be indexed in 32-bit elements and the desired bits extracted from there. For elements > 64-bits, these should be reduce to 64/32 elements to enable the normal dynamic indexing paths. In the dynamic index cases, this produces shorter code most of the time. This does immediately regress the constant index cases, but this should be fixed once we have the most basic of shift combines. The element size > 64 case is pretty much ported from the exisiting DAG implementation for extract element promote. The increasing element size case is new.	2020-08-02 10:42:07 -04:00
Matt Arsenault	57bd64ff84	Support addrspacecast initializers with isNoopAddrSpaceCast Moves isNoopAddrSpaceCast to the TargetMachine. It logically belongs with the DataLayout.	2020-07-31 10:42:43 -04:00
Vitaly Buka	89051ebace	[NFC] GetUnderlyingObject -> getUnderlyingObject I am going to touch them in the next patch anyway	2020-07-30 21:08:24 -07:00
dfukalov	aa77232a63	[NFC][AMDGPU] Improve fused fmul+fadd tests. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D84903	2020-07-31 04:00:09 +03:00
Matt Arsenault	e56e9022bc	AMDGPU: Fix liveness errors when copying AGPR tuples Avoid recursively calling copyPhysReg for AGPR handling. This was dropping the necessary super register implicit defs to avoid liveness verifier errors.	2020-07-30 18:13:04 -04:00
Stanislav Mekhanoshin	5b32518f96	[AMDGPU] Do not use undef on indirect source We are using undef on the indirect move source subreg and then using implicit super-reg. This creates a problem in RA when Greedy decides to split the register. It reassigns the implicit super-reg but does not bother to change undef source because it is really does not matter. The fix is to stop lying to RA and drop undef flag. This has also hit a problem in SIFoldOperands as it can fold immediate into an indirect move since there is no undef flag anymore. That results in multiple test failures, so added the check for this case. Differential Revision: https://reviews.llvm.org/D84899	2020-07-30 10:41:59 -07:00
hsmahesha	33fd4a18e7	[AMDGPU/MemOpsCluster] Clean-up fixme's around mem ops clustering logic Get rid of all fixmes and base heuristic on `num-clustered-dwords`. The main intuition behind this is as follows. The existing heuristic roughly summarizes as below: * Assume, all the mem ops instructions participating in the clustering process, loads/stores same num bytes * If num bytes loaded by each mem op is 4 bytes, then cluster at max 5 mem ops, that is at max 20 bytes * If num bytes loaded by each mem op is 8 bytes, then cluster at max 3 mem ops, that is at max 24 bytes * If num bytes loaded by each mem op is 16 bytes, then cluster at max 2 mem ops, that is at max 32 bytes So, we need to make sure that the new heuristic do not completey deviate away from the above one, and it properly handles both the sub-word loads and the wide loads. Reviewed By: arsenm, rampitec Differential Revision: https://reviews.llvm.org/D84354	2020-07-30 21:41:13 +05:30
Matt Arsenault	b8c8d1b309	AMDGPU: Convert some tests to use new buffer intrinsics The legacy not struct or raw buffer intrinsics should now all be consolidated into the tests specifically for those intrinsics.	2020-07-30 10:30:43 -04:00
Matt Arsenault	0da582d9b6	GlobalISel: Handle llvm.roundeven I still think it's highly questionable that we have two intrinsics with identical behavior and only vary by the name of the libcall used if it happens to be lowered that way, but try to reduce the feature delta between SDAG and GlobalISel for recently added intrinsics. I'm not sure which opcode should be considered the canonical one, but lower roundeven back to round.	2020-07-29 20:01:12 -04:00
Stanislav Mekhanoshin	13b63be472	[AMDGPU] prefer non-mfma in post-RA schedule MFMA instructions shall not be scheduled back to back to avoid MAI SIMD stall. Tell post-RA schedule we would prefer some other instruction instead. Differential Revision: https://reviews.llvm.org/D84883	2020-07-29 12:17:50 -07:00
Matt Arsenault	59fac51ff2	AMDGPU/GlobalISel: Handle llvm.amdgcn.reloc.constant	2020-07-29 14:24:21 -04:00
Matt Arsenault	0b7de7966f	GlobalISel: Implement lower for G_EXTRACT_VECTOR_ELT Use the basic store to stack and reload.	2020-07-29 14:16:28 -04:00
Matt Arsenault	766cb615a3	AMDGPU: Relax restriction on folding immediates into physregs I never completed the work on the patches referenced by `f8bf7d7f42`, but this was intended to avoid folding immediate writes into m0 which the coalescer doesn't understand very well. Relax this to allow simple SGPR immediates to fold directly into VGPR copies. This pattern shows up routinely in current GlobalISel code since nothing is smart enough to emit VGPR constants yet.	2020-07-29 14:01:53 -04:00
Simon Pilgrim	fdc902774e	[DAG][AMDGPU][X86] Add SimplifyMultipleUseDemandedBits handling for SIGN/ZERO_EXTEND + SIGN/ZERO_EXTEND_VECTOR_INREG Peek through multiple use ops like we already do for ANY_EXTEND/ANY_EXTEND_VECTOR_INREG Differential Revision: https://reviews.llvm.org/D84863	2020-07-29 18:10:59 +01:00
Matt Arsenault	d42c7b2211	AMDGPU: Account for the size of LDS globals used through constant expressions. Also "fix" the longstanding bug where the computed size depends on the order of the visitation. We could try to predict the allocation order used by legalization, but it would never be 100% perfect. Until we start fixing the addresses somehow (or have a more reliable allocation scheme later), just try to compute the size based on the worst case padding.	2020-07-29 11:40:42 -04:00
Matt Arsenault	c230965ccf	AMDGPU: Make saturating add/sub legal for DAG path	2020-07-29 08:27:31 -04:00
Matt Arsenault	cdd45d5f9c	AMDGPU/GlobalISel: Select llvm.amdgcn.global.atomic.csub Remove the custom node boilerplate. Not sure why this tried to handle the LDS atomic stuff.	2020-07-29 08:27:31 -04:00
Matt Arsenault	44211f20a8	AMDGPU: Optimize copies to exec with other insts after exec def It's possible to have terminator instructions after a write to exec, so skip over them to find it.	2020-07-28 21:34:50 -04:00
Matt Arsenault	b6ebc77326	AMDGPU/GlobalISel: Fix selecting llvm.amdgcn.s.getreg This introduces the same bug llvm.amdgcn.s.setreg has where if the user specified an immediate outside of the valid 16-bit range, it will select into a verifier error.	2020-07-28 21:34:50 -04:00
Matt Arsenault	068808d102	AMDGPU: Don't assume call targets are registers GlobalISel let through a call to null, which would then fold into the source operand like any other inline immediate. The SelectionDAG lowering deletes calls to null and undef as a workaround from before calls were supported. We should probably drop the special handling case in the DAG lowering now, since the middle end optimizers delete null calls anyway.	2020-07-28 20:46:06 -04:00
Matt Arsenault	8860daf0ed	AMDGPU: Handle a few missing cases in getAddrModeArguments	2020-07-28 20:22:38 -04:00
Matt Arsenault	b3e63aa8a4	AMDGPU: Don't assume there is only one terminator copy This would stop on the first in reverse order, failing the verifier if there were more earlier in the block.	2020-07-28 20:22:38 -04:00
Matt Arsenault	592f2e8d1c	AMDGPU: Fix verifier error on spilling partially defined SGPRs This needs an implicit def of the super-register in case one of the lanes isn't defined, similar to copyPhysReg (or the not-VGPR spill case below). This showed up in GlobalISel testing since it currently doesn't fold out many undef instructions.	2020-07-28 20:01:57 -04:00
Matt Arsenault	ee713a2d28	AMDGPU/GlobalISel: Add some missing tests for extract selection	2020-07-28 16:49:55 -04:00
Matt Arsenault	e9b236f411	AMDGPU: Check for other defs when folding conditions into s_andn2_b64 We can't fold the masked compare value through the select if the select condition is re-defed after the and instruction. Fixes a verifier error and trying to use the outgoing value defined in the block. I'm not sure why this pass is bothering to handle physregs. It's making this more complex and forces extra liveness computation.	2020-07-28 16:36:23 -04:00
Austin Kerbow	adeeac9d5a	[AMDGPU] Spill CSR VGPR which is reserved for SGPR spills Update logic for reserving VGPR for SGPR spills. A CSR VGPR being reserved for SGPR spills could be clobbered if there were no free lower VGPR's available. Create a stack object so that it will be spilled in the prologue. Also adds more tests. Differential Revision: https://reviews.llvm.org/D83730	2020-07-28 11:53:02 -07:00
Matt Arsenault	a4edc04693	AMDGPU/GlobalISel: Use clamp modifier for [us]addsat/[us]subsat We also have never handled this for SelectionDAG, which needs additional work.	2020-07-28 11:18:05 -04:00
Matt Arsenault	5f802be4e5	GlobalISel: Don't fail translate on intrinsics with metadata	2020-07-27 19:00:25 -04:00
Matt Arsenault	ee3feef5aa	TableGen/GlobalISel: Allow output instructions with multiple defs The DAG behavior allows matchching input patterns with a single result to the first result of an output instruction that defines multiple results. The remaining defs are implicitly dead. This starts to fix using manual selection for AMDGPU add/sub (although it's still needed, mostly because it's also still needed for G_PTR_ADD).	2020-07-27 18:31:13 -04:00
Piotr Sobczak	590dd73c6e	[AMDGPU] Make generating cache invalidating instructions optional Summary: D78800 skipped generating cache invalidating instrucions altogether on AMDPAL. However, this is sometimes too restrictive - we want a more flexible option to be able to toggle this behaviour on and off while we work towards developing a correct implementation of the alternative memory model. Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, dexonsmith, kerbowa, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D84448	2020-07-27 09:24:11 +02:00
Matt Arsenault	e97aa5609f	AMDGPU/GlobalISel: Don't assert in LegalizerInfo constructor We don't really need these asserts. The LegalizerInfo is also overly-aggressivly constructed, even when not in use. It needs to not assert on dummy targets that have manually specified, unrelated features.	2020-07-26 23:01:28 -04:00
Matt Arsenault	d35e2c101d	AMDGPU/GlobalISel: Fix not constraining ds_append/consume operands	2020-07-26 10:17:36 -04:00
Matt Arsenault	f6176f8a5f	GlobalISel: Handle G_PTR_ADD in narrowScalar	2020-07-26 10:08:17 -04:00
Matt Arsenault	3e8bb7a000	GlobalISel: Handle fewerElementsVector for G_PTR_ADD	2020-07-26 10:08:09 -04:00
Matt Arsenault	bcf5184a68	AMDGPU/GlobalISel: Make sure <2 x s1> phis are scalarized	2020-07-26 10:04:47 -04:00
Matt Arsenault	6f961a1e7e	AMDGPU/GlobalISel: Legalize GDS atomics I noticed these don't use the _gfx9, non-m0 reading variants but not sure if that's a bug or not. It's the same in the DAG.	2020-07-26 10:03:34 -04:00
Matt Arsenault	5819159995	AMDGPU/GlobalISel: Pack constant G_BUILD_VECTOR_TRUNCs when selecting	2020-07-26 09:55:34 -04:00
Matt Arsenault	61ced4b87a	GlobalISel: Handle 'n' inline asm constraint	2020-07-26 09:30:41 -04:00
Matt Arsenault	4033aa1467	AMDGPU/GlobalISel: Sign extend integer constants This matches the DAG behavior and fixes immediate folding	2020-07-26 09:30:14 -04:00
Matt Arsenault	4f6502ab33	AMDGPU/GlobalISel: Replace selection tests for G_CONSTANT/G_FCONSTANT Split into separate tests and make more consistent with the others.	2020-07-26 09:30:09 -04:00
Changpeng Fang	9162b70e51	DADCombiner: Don't simplify the token factor if the node's number of operands already exceeds TokenFactorInlineLimit Summary: In parallelizeChainedStores, a TokenFactor was created with the size greater than 3000. We found that DAGCombiner::visitTokenFactor will consume a huge amount of time on such nodes. Since the number of operands already exceeds TokenFactorInlineLimit, we propose to give up simplification with the consideration of compile time. Reviewers: @spatel, @arsenm Differential Revision: https://reviews.llvm.org/D84204	2020-07-25 21:20:59 -07:00
Matt Arsenault	392b969c32	AMDGPU/GlobalISel: Don't assert on G_INSERT > 128-bits Just fallback for now. Really tablegen needs to generate all of the subregister index handling we need.	2020-07-25 10:05:44 -04:00
Matt Arsenault	2bd72abef0	AMDGPU: Skip other terminators before inserting s_cbranch_exec[n]z PHIElimination/createPHISourceCopy inserts non-branch terminators after the control flow pseudo if a successor phi reads a register defined by the control flow pseudo. If this happens, we need to split the expansion of the control flow pseudo to ensure all the branches are after all of the other mask management instructions. GlobalISel hit this in testscases that happened to be tail duplicated. The original testcase still does not work, since the same problem appears to be present in a later pass.	2020-07-24 16:51:59 -04:00
Dmitry Preobrazhensky	6b8948922c	[AMDGPU][MC] Added support of SP3 syntax for MTBUF format modifier Currently supported LLVM MTBUF syntax is shown below. It is not compatible with SP3. op dst, addr, rsrc, FORMAT, soffset This change adds support for SP3 syntax: op dst, addr, rsrc, soffset SP3FORMAT In addition to being compatible with SP3, this syntax allows using symbolic names for data, numeric and unified formats. Below is a list of added syntax variants. format:<expression> format:[<numeric-format-name>,<data-format-name>] format:[<data-format-name>,<numeric-format-name>] format:[<data-format-name>] format:[<numeric-format-name>] format:[<unified-format-name>] The last syntax variant is supported for GFX10 only. See llvm bug 37738 Reviewers: arsenm, rampitec, vpykhtin Differential Revision: https://reviews.llvm.org/D84026	2020-07-24 16:41:03 +03:00
Petar Avramovic	47bd41d099	AMDGPU/GlobalISel: Select set.inactive intrinsic Differential Revision: https://reviews.llvm.org/D84407	2020-07-24 10:14:14 +02:00
Matt Arsenault	891759db73	GlobalISel: Add scalarSameSizeAs LegalizeRule Widen or narrow a type to a type with the same scalar size as another. This can be used to force G_PTR_ADD/G_PTRMASK's scalar operand to match the bitwidth of the pointer type. Use this to disallow narrower types for G_PTRMASK.	2020-07-23 21:17:31 -04:00
Matt Arsenault	b9c644ec61	AMDGPU: Fix failures from overflowing uint8_t number of operands If the operand index exceeded the limit of unsigned char, it wrapped and would point to the wrong operand. Increase the size of the operand index field to avoid this, and also don't bother trying to fold into implicit operands.	2020-07-23 15:39:33 -04:00
Nikita Popov	deb4bb2b3a	[IR] Add min/max/abs intrinsics This adds the llvm.abs(), llvm.umin(), llvm.umax(), llvm.smin(), and llvm.smax() intrinsics specified in D81829. For SelectionDAG, the ISD opcodes and all the legalization and lowering already exist, so this just wires them up to the intrinsic in the SDAG builder and adds rudimentary tests. For GlobalISel only the min/max intrinsics are wired up, as llvm.abs() will require the addition of a G_ABS op, and corresponding legalization support. Differential Revision: https://reviews.llvm.org/D84125	2020-07-23 20:56:19 +02:00
Matt Arsenault	b2ee1cd2d9	AMDGPU/GlobalISel: Add some tests for stack passed pointers	2020-07-23 14:38:31 -04:00
Matt Arsenault	d2b8fcff34	AMDGPU/GlobalISel: Handle call return values The only case that I know doesn't work is the implicit sret case when the return type doesn't fit in the return registers.	2020-07-23 14:29:35 -04:00
Jay Foad	b35833b84e	[GlobalISel][AMDGPU] Legalize saturating add/subtract Add support in LegalizerHelper for lowering G_SADDSAT etc. either using add/subtract-with-overflow or using max/min instructions. Enable this lowering for AMDGPU so it can be tested. The legalization rules are still approximate and skips out on using the clamp bit to treat these as legal, which has never been used before. This also doesn't yet try to deal with expanding SALU cases.	2020-07-23 09:06:42 -04:00
Konstantin Schwarz	931488779f	[GlobalISel][InlineAsm] Add register class ID to the flags of register input operands Summary: We do this already for output operands, but missed it for (non-tied) input operands. Reviewers: arsenm, Petar.Avramovic Reviewed By: arsenm Subscribers: jvesely, wdng, nhaehnle, rovka, hiraditya, llvm-commits, kerbowa Tags: #llvm Differential Revision: https://reviews.llvm.org/D83763	2020-07-23 13:35:01 +02:00

... 3 4 5 6 7 ...

4172 Commits