llvm-project

Commit Graph

Author	SHA1	Message	Date
Carl Ritson	4c4db81630	[AMDGPU] Extend SILoadStoreOptimizer to s_load instructions Apply merging to s_load as is done for s_buffer_load. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D130742	2022-07-30 11:38:39 +09:00
Craig Topper	e637feee80	[RISCV] Add isel pattern for (setne/eq GPR, -2048) For constants in the range [-2047, 2048] we use addi. If the constant is -2048 we can use xori. If we don't match this explicitly, we'll emit an LI for the -2048 followed by an XOR.	2022-07-29 14:07:38 -07:00
Austin Kerbow	2c82a126d7	[AMDGPU] Omit unnecessary waitcnt before barriers It is not necessary to wait for all outstanding memory operations before barriers on hardware that can back off of the barrier in the event of an exception when traps are enabled. Add a new subtarget feature which tracks which HW has this ability. Reviewed By: #amdgpu, rampitec Differential Revision: https://reviews.llvm.org/D130722	2022-07-29 11:12:36 -07:00
Matt Devereau	a8b726ac65	[AArch64][SVE] Change DupLane128Combine Index comparison to 0 IdxInsert == IdxDupLane is incorrect. IdxInsert is the starting element number, whereas IdxIndex is the index of a quadword	2022-07-29 14:31:00 +00:00
Simon Pilgrim	bc2c4f6c85	[X86] combineAndnp - constant fold ANDNP(C,X) -> AND(~C,X) (REAPPLIED) If the LHS op has a single use then using the more general AND op is likely to allow commutation, load folding, generic folds etc. Updated version - original version rG057db2002bb3 didn't correctly account for multiple uses of the mask that might be folding "OR(AND(X,C),AND(Y,~C)) -> OR(AND(X,C),ANDNP(C,Y))" in canonicalizeBitSelect	2022-07-29 15:12:26 +01:00
Mirko Brkusanin	6a1aa627fa	[AMDGPU] Enable image_gather4h instruction for gfx10 and gfx11 Differential Revision: https://reviews.llvm.org/D130764	2022-07-29 15:42:06 +02:00
Jay Foad	3cfa9b1431	[AMDGPU] user-sgpr-init16-bug does not apply to gfx1103 Differential Revision: https://reviews.llvm.org/D130347	2022-07-29 14:21:13 +01:00
Matt Arsenault	ef906f287e	AMDGPU: Fix assertion when printing unreachable functions Since `814a0abcce`, this would break if we had a function in the module that becomes dead in any codegen IR pass. The function wasn't deleted since it was initially used in dead code, but is detached from the call graph and doesn't appear in the PO traversal. Do a second walk over the module to populate the resources of any functions which weren't already processed.	2022-07-29 08:57:43 -04:00
Alexander Timofeev	d7ae1a9097	Revert "[AMDGPU] avoid blind converting to VALU REG_SEQUENCE and PHIs" This reverts commit `76d9ae924c`. because it causes several VK CTS tests to fail	2022-07-29 14:19:07 +02:00
wanglei	56ab2f4ccd	[LoongArch] Offset folding for frameindex This patch is for frameindex calculations. Differential Revision: https://reviews.llvm.org/D130248	2022-07-29 17:27:34 +08:00
wanglei	fd6545322c	[LoongArch] Refactor insertDivByZeroTrap Ensure non-terminators don't follow terminators. This patch fixes the `sdiv-udiv-srem-urem.ll` test failure with expensive check. Differential Revision: https://reviews.llvm.org/D130247	2022-07-29 17:06:49 +08:00
David Sherwood	487fa6f8c3	[AArch64][DAGCombine] Add performBuildVectorCombine 'extract_elt ~> anyext' A build vector of two extracted elements is equivalent to an extract subvector where the inner vector is any-extended to the extract_vector_elt VT, because extract_vector_elt has the effect of an any-extend. (build_vector (extract_elt_i16_to_i32 vec Idx+0) (extract_elt_i16_to_i32 vec Idx+1)) => (extract_subvector (anyext_i16_to_i32 vec) Idx) Depends on D130697 Differential Revision: https://reviews.llvm.org/D130698	2022-07-29 09:51:09 +01:00
Changpeng Fang	2b731b30a7	AMDGPU: Take care of "tied" operand when removeOperand Summary: Flat scratch load of D16 type by default has tied vdst_in operand (with vdst). This should be taken care of at the time of "removeOperand" in eliminateFrameIndex. Otherwise we will hit an assert saying "Cannot move tied operands". This patch unties vdst_in before the move, and retie it with vdst afterwards. Reviewers: arsenm, foad Differential Revision: https://reviews.llvm.org/D130537	2022-07-28 17:30:49 -07:00
Anshil Gandhi	5c38056431	[AMDGPU][Scheduler] Avoid initializing Register pressure tracker when tracking is disabled When register pressure tracking is disabled, the scheduler attempts to load pressures at SReg_32 and VGPR_32. This causes an index out of bounds error. This patch fixes this issue by disabling the initialization of RPTracker when not needed. NFC Reviewed By: rampitec, kerbowa, arsenm Differential Revision: https://reviews.llvm.org/D129322	2022-07-28 15:39:28 -06:00
Austin Kerbow	0f93a45b11	[AMDGPU] Add isMeta flag to SCHED_GROUP_BARRIER	2022-07-28 11:04:33 -07:00
Austin Kerbow	f5b21680d1	[AMDGPU] Add amdgcn_sched_group_barrier builtin This builtin allows the creation of custom scheduling pipelines on a per-region basis. Like the sched_barrier builtin this is intended to be used either for testing, in situations where the default scheduler heuristics cannot be improved, or in critical kernels where users are trying to get performance that is close to handwritten assembly. Obviously using these builtins will require extra work from the kernel writer to maintain the desired behavior. The builtin can be used to create groups of instructions called "scheduling groups" where ordering between the groups is enforced by the scheduler. __builtin_amdgcn_sched_group_barrier takes three parameters. The first parameter is a mask that determines the types of instructions that you would like to synchronize around and add to a scheduling group. These instructions will be selected from the bottom up starting from the sched_group_barrier's location during instruction scheduling. The second parameter is the number of matching instructions that will be associated with this sched_group_barrier. The third parameter is an identifier which is used to describe what other sched_group_barriers should be synchronized with. Note that multiple sched_group_barriers must be added in order for them to be useful since they only synchronize with other sched_group_barriers. Only "scheduling groups" with a matching third parameter will have any enforced ordering between them. As an example, the code below tries to create a pipeline of 1 VMEM_READ instruction followed by 1 VALU instruction followed by 5 MFMA instructions... // 1 VMEM_READ __builtin_amdgcn_sched_group_barrier(32, 1, 0) // 1 VALU __builtin_amdgcn_sched_group_barrier(2, 1, 0) // 5 MFMA __builtin_amdgcn_sched_group_barrier(8, 5, 0) // 1 VMEM_READ __builtin_amdgcn_sched_group_barrier(32, 1, 0) // 3 VALU __builtin_amdgcn_sched_group_barrier(2, 3, 0) // 2 VMEM_WRITE __builtin_amdgcn_sched_group_barrier(64, 2, 0) Reviewed By: jrbyrnes Differential Revision: https://reviews.llvm.org/D128158	2022-07-28 10:43:14 -07:00
Craig Topper	2750873dfe	[RISCV] Update lowerFROUND to use masked instructions. This avoids a vmerge at the end and avoids spurious fflags updates. This isn't used for constrained intrinsic so we technically don't have to worry about fflags, but it doesn't cost much to support it. To support I've extend our FCOPYSIGN_VL node to support a passthru operand. Similar to what was done for VRGATHER*_VL nodes. I plan to do a similar update for trunc, floor, and ceil. Reviewed By: reames, frasercrmck Differential Revision: https://reviews.llvm.org/D130659	2022-07-28 10:05:19 -07:00
Craig Topper	89173dee71	[RISCV] Remove duplicate code. NFC The same operations are part of `FloatingPointVecReduceOps` a little bit earlier.	2022-07-28 10:05:19 -07:00
Florian Hahn	f912bab111	Revert "[X86][DAGISel] Don't widen shuffle element with AVX512" This reverts commit `5fb4134210`. This patch is causing crashes when building llvm-test-suite when optimizing for CPUs with AVX512. Reproducer crashing with llc: target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-apple-macosx" define i32 @test(<32 x i32> %0) #0 { entry: %1 = mul <32 x i32> %0, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1> %2 = tail call i32 @llvm.vector.reduce.add.v32i32(<32 x i32> %1) ret i32 %2 } ; Function Attrs: nocallback nofree nosync nounwind readnone willreturn declare i32 @llvm.vector.reduce.add.v32i32(<32 x i32>) #1 attributes #0 = { "min-legal-vector-width"="0" "target-cpu"="skylake-avx512" } attributes #1 = { nocallback nofree nosync nounwind readnone willreturn }	2022-07-28 15:26:42 +01:00
Alexander Timofeev	76d9ae924c	[AMDGPU] avoid blind converting to VALU REG_SEQUENCE and PHIs In the `2e29b0138c` we introduce a specific solving algorithm that analyzes the VGPR to SGPR copies use chains and either lowers the copy to v_readfirstlane_b32 or converts the whole chain to VALU forms. Same time we still have the code that blindly converts to VALU REG_SEQUENCE and PHIs in case they produce SGPR but have VGPRs input operands. In case the REG_SEQUENCE and PHIs are in the VGPR to SGPR copy use chain, and this chain was considered long enough to convert copy to v_readfistlane_b32, further lowering them to VALU leads to several kinds of issues. At first, we have v_readfistlane_b32 which is completely useless because most parts of its use chain were moved to VALU forms. Second, we may encounter subtle bugs related to the EXEC-dependent CF because of the weird mixing of SALU and VALU instructions. This change removes the code that moves REG_SEQUENCE and PHIs to VALU. Instead, we use the fact that both REG_SEQUENCE and PHIs have copy semantics. That is, if they define SGPR but have VGPR inputs, we insert VGPR to SGPR copies to make them pure SGPR. Then, the new copies are processed by the common VGPR to SGPR lowering algorithm. This is Part 2 in the series of commits aiming at the massive refactoring of the SIFixSGPRCopies pass. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D130367	2022-07-28 14:30:29 +02:00
Dmitry Preobrazhensky	2b230d69ad	[AMDGPU][MC][GFX90A] Correct MIMG dst size validation Correct validator to enable MIMG dst size checks. Differential Revision: https://reviews.llvm.org/D130512	2022-07-28 14:30:08 +03:00
Dmitry Preobrazhensky	fa7fd8ec31	[AMDGPU][MC][GFX11] Disable SGPRs for src1 of v_fma_mix*_dpp opcodes Differential Revision: https://reviews.llvm.org/D130634	2022-07-28 14:20:05 +03:00
chendewen	7eeb468ae5	[Aarch64] Add cost for missing extensions. This patch adds a cost estimate for some missing sign extensions. ref: https://reviews.llvm.org/D14730 Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D130565	2022-07-28 17:34:00 +08:00
Phoebe Wang	726d9f8e8c	[X86][MC] Avoid emitting incorrect warning for complex FMUL We will insert a new operand which is identical to the Dest for complex FMUL with a mask. https://godbolt.org/z/eTEdnYv3q Complex FMA and FMUL with maskz don't have this problem. Reviewed By: LuoYuanke, skan Differential Revision: https://reviews.llvm.org/D130638	2022-07-28 13:58:34 +08:00
Austin Kerbow	ba0d079c7a	[AMDGPU] Aggressively schedule to reduce RP in occupancy limited regions By not clustering loads and adjusting heuristics to more aggressively reduce register pressure we may be able to increase occupancy for the function if it was dropped in a first pass scheduling. Similarly, try to reduce spilling if register usage exceeds lower bound occupancy. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D130329	2022-07-27 22:34:37 -07:00
Amara Emerson	93e3aeb9a8	[AArch64][GlobalISel] Fix custom legalization of rotates using sext for shift vs zext. Rotates are defined according to DAG documentation as having unsigned shifts, so we need to zero-extend instead of sign-extend here. Fixes issue 56664	2022-07-27 22:10:42 -07:00
Carl Ritson	dbda30e294	[AMDGPU][SIFoldOperands] Clear kills when folding COPY Clear all kill flags on source register when folding a COPY. This is necessary because the kills may now be out of order with the uses. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D130622	2022-07-28 11:57:55 +09:00
Craig Topper	a304d70ee9	[RISCV] Reorder (and/or/xor (shl X, C1), C2) if we can form ANDI/ORI/XORI. InstCombine and DAGCombine prefer to keep shl before binops. This patch teaches isel to convert to (shl (and/or/xor X, C1 >> C2), C2) if (C1 >> C2) is a simm12. The idea was taken from X86's isel code. There's a special case implemented for a sext_inreg between the shift and the binop. Differential Revision: https://reviews.llvm.org/D130610	2022-07-27 17:35:26 -07:00
Craig Topper	1d1d8d6025	[RISCV] Reorder code in lowerFROUND to make the diff in D130659 cleaner. NFC	2022-07-27 17:13:04 -07:00
Craig Topper	98647330bf	[RISCV] Add merge operand to RISCVISD::FCOPYSIGN_VL. Similar to what was done for VRGATHER*_VL recently. This will be used in D130659.	2022-07-27 15:25:34 -07:00
Paul Kirth	6e9bab71b6	Revert "[llvm][NFC] Refactor code to use ProfDataUtils" This reverts commit `300c9a7881`. We will reland once these issues are ironed out.	2022-07-27 21:38:11 +00:00
Paul Kirth	300c9a7881	[llvm][NFC] Refactor code to use ProfDataUtils In this patch we replace common code patterns with the use of utility functions for dealing with profiling metadata. There should be no change in functionality, as the existing checks should be preserved in all cases. Reviewed By: bogner, davidxl Differential Revision: https://reviews.llvm.org/D128860	2022-07-27 21:13:54 +00:00
Philip Reames	15c645f7ee	[RISCV] Enable (scalable) vectorization by default This change enables vectorization (using scalable vectorization only, fixed vectors are not yet enabled) for RISCV when vector instructions are available for the target configuration. At this point, the resulting configuration should be both stable (e.g. no crashes), and profitable (i.e. few cases where scalar loops beat vector ones), but is not going to be particularly well tuned (i.e. we emit the best possible vector loop). The goal of this change is to align testing across organizations and ensure the default configuration matches what downstreams are using as closely as possible. This exposes a large amount of code which hasn't otherwise been on by default, and thus may not have been fully exercised. Given that, having issues fall out is not unexpected. If you find issues, please make sure to include as much information as you can when reverting this change. Differential Revision: https://reviews.llvm.org/D129013	2022-07-27 12:36:04 -07:00
Stanislav Mekhanoshin	68901fdbeb	[AMDGPU] Consider S_SETPRIO a scheduling boundary The instruction is used to modify wave priority with the intent to affect VALU execution and currently we can reschedule VALU around it since that VALU does not have side effects. Differential Revision: https://reviews.llvm.org/D130654	2022-07-27 11:50:23 -07:00
Amara Emerson	65246d3eb4	Use hasNItemsOrLess() in MRI::hasAtMostUserInstrs().	2022-07-27 11:42:14 -07:00
Mingming Liu	34348814e1	[AArch64] Explicitly use v1i64 type for llvm.aarch64.neon.pmull64 Without this, the intrinsic will be expanded to an integer; thereby an explicit copy (from GPR to SIMD register) will be codegen'd. This matches the general convention of using "v1" types to represent scalar integer operations in vector registers. The similar approach is observed in D56616, and the pattern likely applies on other intrinsic that accepts integer scalars (e.g., int_aarch64_neon_sqdmulls_scalar) Differential Revision: https://reviews.llvm.org/D130548	2022-07-27 11:11:16 -07:00
Amara Emerson	19cdd1908b	[AArch64][GlobalISel] Add heuristics for localizing G_CONSTANT. This adds similar heuristics to G_GLOBAL_VALUE, querying the cost of materializing a specific constant in code size. Doing so prevents us from sinking constants which require multiple instructions to generate into use blocks. Code size savings on CTMark -Os: Program size.__text before after diff ClamAV/clamscan 381940.00 382052.00 0.0% lencod/lencod 428408.00 428428.00 0.0% SPASS/SPASS 411868.00 411876.00 0.0% kimwitu++/kc 449944.00 449944.00 0.0% Bullet/bullet 463588.00 463556.00 -0.0% sqlite3/sqlite3 284696.00 284668.00 -0.0% consumer-typeset/consumer-typeset 414492.00 414424.00 -0.0% 7zip/7zip-benchmark 595244.00 594972.00 -0.0% mafft/pairlocalalign 247512.00 247368.00 -0.1% tramp3d-v4/tramp3d-v4 372884.00 372044.00 -0.2% Geomean difference -0.0% Differential Revision: https://reviews.llvm.org/D130554	2022-07-27 10:51:16 -07:00
Eli Friedman	1a6d82b93f	Fix misc uses of "long" variables to use "int64_t". I don't have any evidence these particular uses are actually causing any issues, but we should avoid accidentally truncating immediate values depending on the host.	2022-07-27 09:47:19 -07:00
Craig Topper	32622d6de4	[RISCV] Add isel pattern for (mul (and X, 0xffffffff), 3<<C) with Zba. We can use slli.uw by C followed by sh1add. Similar can be done for multiples of 5 and 9. We need to make sure that C is less than 32 to stay in bounds of the 5-bit immediate for slli.uw. We have existing patterns for (mul X, 3<<C) that use sh1add followed by slli. That order doesn't allow the and to be folded. Reviewed By: reames Differential Revision: https://reviews.llvm.org/D130146	2022-07-27 09:41:59 -07:00
Craig Topper	9b27d13204	[RISCV] Disable constant hoisting for multiply by negated power of 2. A mul by a negated power of 2 is a slli followed by neg. This doesn't require any constant materialization and may be lower latency than mul. The neg may also be foldable into other arithmetic. Reviewed By: reames Differential Revision: https://reviews.llvm.org/D130047	2022-07-27 09:37:59 -07:00
Umesh Kalappa	f38ea84a9f	[PowerPC] Change long to int64_t (which is always 64 bit or 8 bytes ) We can't guarantee the long always 64 bits like WINDOWS or LLP64 data model (rare but we should consider). So use int64_t from inttypes.h and safe in this case. Fixes https://github.com/llvm/llvm-project/issues/55911 .	2022-07-27 09:34:45 -07:00
Dmitri Gribenko	b435da027d	[amdgpu][nfc] Fix build with a certan Clang version It errors out in the Bazel CI: AMDGPULowerModuleLDSPass.cpp:384:12: error: chosen constructor is explicit in copy-initialization return {SGV, std::move(Map)}; Reviewed By: rupprecht Differential Revision: https://reviews.llvm.org/D130623	2022-07-27 17:29:36 +02:00
LiaoChunyu	bf4f9a468a	[RISCV]Enable isIntDivCheap when attribute is minsize Don't expand divisions by constants when attribute is minsize. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D130543	2022-07-27 18:22:51 +08:00
Zi Xuan Wu (Zeson)	70b8b738c5	[CSKY] Fix the btsti16 instruction missing in generic processor Normally, generic processor does not have any SubtargetFeature. And it can just generate most basic instructions which have no Predicates to guard. But it needs to enbale predicate for the btsti16 instruction as one of the most basic instructions. Or the generic processor can't finish codegen process. So Add FeatureBTST16 SubtargetFeature to generic ProcessorModel.	2022-07-27 17:39:15 +08:00
David Green	39f8384964	[ARM] Correct features on pacbti instructions. Given a patch like D129506, using instructions not valid for the current feature set becomes an error. This updates the Arm hint-space instructions for pac/bti to require thumbv7m as opposed to 8.1-m.main, to make them valid when compiling for thumbv7m with -mbranch-protection. Differential Revision: https://reviews.llvm.org/D129692	2022-07-27 09:15:14 +01:00
Nikita Popov	b1b1086973	[ARM] Add target feature to force 32-bit atomics This adds a +atomic-32 target feature, which instructs LLVM to assume that lock-free 32-bit atomics are available for this target, even if they usually wouldn't be. If only atomic loads/stores are used, then this won't emit libcalls. If atomic CAS is used, then the user is responsible for providing any necessary __sync implementations (e.g. by masking interrupts for single-core privileged use cases). See https://reviews.llvm.org/D120026#3674333 for context on this change. The tl;dr is that the thumbv6m target in Rust has historically made atomic load/store only available, which is incompatible with the change from D120026, which switched these to use libatomic. Differential Revision: https://reviews.llvm.org/D130480	2022-07-27 10:00:31 +02:00
Amara Emerson	9cc1dd209d	[AArch64][GlobalISel] Lower vector G_CTTZ. Fixes issue 56398	2022-07-27 00:14:30 -07:00
Jon Chesterfield	3ccd88f209	[amdgpu][nfc] Separate processUsedLDS into independent pieces, rename it	2022-07-27 01:55:43 +01:00
Jon Chesterfield	9981afdd42	[amdgpu][nfc] Extract kernel annotation from processUsedLDS	2022-07-27 01:38:41 +01:00
Jon Chesterfield	923b90bddb	[amdgpu][nfc] Separate LDS struct creation from RAUW	2022-07-26 20:59:17 +01:00

1 2 3 4 5 ...

68267 Commits