llvm-project

Commit Graph

Author	SHA1	Message	Date
Simon Pilgrim	fd485d8cda	[X86][AVX] Prefer VINSERTF128 over VPERM2F128 for 128->256 subvector concatenations The VINSERTF128 instruction is often much quicker, and never slower, than the more general VPERM2F128 instruction, so we should try to use that in more circumstances. This requires a fallback to a commuted VPERM2F128 for the case where we need to fold the 256-bit vector source instead of the 128-bit subvector source. There is one interesting side effect - DAGCombine's narrowExtractedVectorLoad combine gets called in a number of locations, this often creates an extracted subvector load without regard to other uses of the original wider load. I'm expecting AVX cpus to be capable of merging such aliased loads, but I do wonder whether narrowExtractedVectorLoad's call to X86TargetLowering::shouldReduceLoadWidth needs to be altered to check for more partial uses? Noticed while investigating the quality of interleaved load/store codegen. Differential Revision: https://reviews.llvm.org/D111960	2021-11-01 10:45:50 +00:00
Kazu Hirata	476e1ee3da	[AArch64] Remove unused declaration hasSwiftExtendedFrame (NFC)	2021-10-31 22:58:56 -07:00
Chen Zheng	eeed1545b2	[PowerPC] turn off chain commoning by default.	2021-11-01 04:11:10 +00:00
Zi Xuan Wu	cf78715cae	[CSKY] First patch to construct codegen infra and generate first add instruction Ooops. It constructs codegen infra and provide only basic code to generate first add instruction successfully. Differential Revision: https://reviews.llvm.org/D112206	2021-11-01 10:06:56 +08:00
Craig Topper	ada5458521	[RISCV] Expand scalable vector bswap. Fix crash for bitreverse. Fix LegalizeVectorOps to not try shuffle or unrolling expansions for scalable vectors. Differential Revision: https://reviews.llvm.org/D112236	2021-10-31 10:01:27 -07:00
Kazu Hirata	72710af233	[CodeGen, Target] Use MachineBasicBlock::terminators (NFC)	2021-10-31 07:57:34 -07:00
Kazu Hirata	5970249439	[Hexagon] Remove chksetELFHeaderEFlags (NFC) The function was introduced without any use on Nov 9, 2015 in commit `7cd0892729`.	2021-10-30 08:43:43 -07:00
Kazu Hirata	c3d63a0697	[Hexagon] Remove ValidArch (NFC) This function seems to be unused for at least one year.	2021-10-30 08:43:41 -07:00
Kazu Hirata	c5cd371cc9	[Hexagon] Remove unused struct InstTy (NFC)	2021-10-30 08:43:39 -07:00
Christudasan Devadasan	aa2d3b59ce	GlobalISel/Utils: Use incoming regbank while constraining the superclasses Register operands with superclasses can possibly have multiple regBanks if they have different register types. The regBank ambiguity resolved during regbankselect should be used to constrain the operand regclass instead of obtaining one from the MCInstrDesc. This is a prerequisite patch for D109300 that introduces allocatable AV_* Superclasses for AMDGPU by combining both VGPRs and AGPRs and we want to restrain the regclass to either A or V based on the incoming regbank. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D112323	2021-10-30 07:20:45 -04:00
Stanislav Mekhanoshin	e5340ed30c	[AMDGPU] Fix global isel for kernels using agprs on gfx90a With Global ISel getReservedRegs() is called before function is regbank selected for the first time. Defer caching of usesAGPRs() in this case. Differential Revision: https://reviews.llvm.org/D112644	2021-10-29 14:23:14 -07:00
Sam Clegg	3b039c68f2	Revert "[WebAssembly] Fix debug locations for ExplicitLocals pass" This reverts commit `a66451ebbe`. This caused a failure when integrated with emscripten: https://ci.chromium.org/ui/p/emscripten-releases/builders/try/linux/b8832019855439718609/overview	2021-10-29 13:34:18 -07:00
Nick Desaulniers	39e5dd113f	[SparcISelLowering] avoid emitting libcalls to __muloti4 and __mulodi4 These compiler-rt-only symbols aren't available in libgcc. Similar to D108842, D108844, and D108926. Fixes: pr/52043 Reviewed By: craig.topper, rengolin Differential Revision: https://reviews.llvm.org/D112750	2021-10-29 13:14:09 -07:00
Sanjay Patel	285b8abce4	[x86] limit vector increment fold to allow load folding The tests are based on the example from: https://llvm.org/PR52032 I suspect that it looks worse than it actually is. :) That is, llvm-mca says there's no uop/timing difference with the load folding and pcmpeq vs. broadcast on Haswell (and probably other targets). The load-folding definitely makes the code smaller, so it's good for that at least. So this requires carving a narrow hole in the transform to get just this case without changing others that look good as-is (in other words, the transform still seems good for most examples). Differential Revision: https://reviews.llvm.org/D112464	2021-10-29 15:48:35 -04:00
Sanjay Patel	837518d6a0	[x86] make mayFold* helpers visible to more files; NFC The first function is needed for D112464, but we might as well keep these together in case the others can be used someday.	2021-10-29 15:48:35 -04:00
Amara Emerson	5dd9e019dd	[AArch64][GlobalISel] Fix an crash in RBS due to a new regclass being added. rdar://84674985	2021-10-29 11:47:00 -07:00
Matt Morehouse	33cc0cfd46	[X86] Don't affect jump tables under +tagged-globals. `classifyLocalReference(nullptr)` is called to get the appropriate relocation type for jump tables. We should not use @GOTPCREL for this case. The new test cases trigger assertions without this patch. Reviewed By: eugenis Differential Revision: https://reviews.llvm.org/D112832	2021-10-29 10:37:43 -07:00
Craig Topper	aefcd59895	[RISCV] Teach RISCVInsertVSETVLI::needVSETVLI to handle mask register instructions better. If the VL operand of a mask register instruction comes from an explicit vsetvli with a different VTYPE, we can still avoid needing a vsetvli as long as the SEW/LMUL ratio is the same and policy bits match. Differential Revision: https://reviews.llvm.org/D112762	2021-10-29 09:49:36 -07:00
Simon Pilgrim	6102e5d56b	[CostModel][X86] Remove old TODO comment BMI (TZCNT) scalar handling was added at rGa2db388dce77c2f23f2009d7363a0b63bb54523c	2021-10-29 17:28:45 +01:00
Bradley Smith	86972f1114	[AArch64][SVE] Use TargetFrameIndex in more SVE load/store addressing modes Add support for generating TargetFrameIndex in complex patterns for indexed addressing modes in SVE. Additionally, add missing load/stores to getMemOpInfo and getLoadStoreImmIdx. Differential Revision: https://reviews.llvm.org/D112617	2021-10-29 14:44:16 +00:00
Jay Foad	1b758925ad	[IR] Merge createReplacementInstr into ConstantExpr::getAsInstruction createReplacementInstr was a trivial wrapper around ConstantExpr::getAsInstruction, which also inserted the new instruction into a basic block. Implement this directly in getAsInstruction by adding an InsertBefore parameter and change all callers to use it. NFC. A follow-up patch will remove createReplacementInstr. Differential Revision: https://reviews.llvm.org/D112791	2021-10-29 15:02:58 +01:00
Jay Foad	21a1d4cf71	[AMDGPU] Change numBitsSigned for simplicity and document it. NFC. Change numBitsSigned to return the minimum size of a signed integer that can hold the value. This is different by one from the previous result but is more consistent with numBitsUnsigned. Update all callers. All callers are now more consistent between the signed and unsigned cases, and some callers get simpler, especially the ones that deal with quantities like numBitsSigned(LHS) + numBitsSigned(RHS). Differential Revision: https://reviews.llvm.org/D112813	2021-10-29 14:22:06 +01:00
Chen Zheng	7591d21032	[PowerPC] fix a miscompile for Solaris build	2021-10-29 12:06:25 +00:00
Bradley Smith	bf72a469ba	[AArch64][SVE] Fix build failure introduced in `13faa5f440`	2021-10-29 11:57:02 +00:00
Simon Pilgrim	154c036ebb	[X86] combineX86GatherScatter - only fold scale if the index isn't extended As mentioned on D108539, when the gather indices are smaller than the pointer size, they are sign-extended BEFORE scale is applied, making the general fold unsafe. If the index have sufficient sign-bits then folding the scale could be safe - I'll investigate this.	2021-10-29 11:48:05 +01:00
Bradley Smith	13faa5f440	[AArch64][SVE] Generate SVE >1 element structured load/stores from fixed types This adds support for SVE structured loads/stores to the relevant target hooks, such that we can support these instructions in the InterleavedAccess pass. Depends on D112078 Differential Revision: https://reviews.llvm.org/D112303	2021-10-29 09:35:57 +00:00
Cullen Rhodes	8686626244	[Sparc] NFC: Remove unused tblgen template args Identified in D109359. Reviewed By: sdesmalen Differential Revision: https://reviews.llvm.org/D109712	2021-10-29 09:16:15 +00:00
Vang Thao	52b43d1549	[AMDGPU] Fix cvt_f32_ubyte combine with shl Shift node is still needed to check if the shift is shr or shl to increment/decrement offset. Do not override the node. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D112733	2021-10-28 21:43:06 -07:00
Kazu Hirata	01b4789b62	[AMDGPU] Remove hasDefinedInitializer (NFC) The last use was removed on Sep 16, 2021 in commit `7a62a5b56d`.	2021-10-28 20:33:34 -07:00
Kazu Hirata	dd5d46b009	[AMDGPU] Remove unused BBSelectRegister in AMDGPUMachineCFGStructurizer (NFC) This field seems to be unused for at least one year.	2021-10-28 20:33:32 -07:00
Kazu Hirata	309357c01a	[AMDGPU] Remove unused declaration eliminateDeadBranchOperands (NFC)	2021-10-28 20:33:30 -07:00
Thomas Lively	fb67f3d969	[WebAssembly] Add prototype relaxed float to int trunc instructions Add i32x4.relaxed_trunc_f32x4_s, i32x4.relaxed_trunc_f32x4_u, i32x4.relaxed_trunc_f64x2_s_zero, i32x4.relaxed_trunc_f64x2_u_zero. These are only exposed as builtins, and require user opt-in. Differential Revision: https://reviews.llvm.org/D112186	2021-10-28 14:01:53 -07:00
Daniel Kiss	d8075e8781	Reland "[ARM] __cxa_end_cleanup should be called instead of _UnwindResume." This is relanding commit `da1d1a0869` . This patch additionally addresses failures found in buildbots & post review comments. ARM EHABI[1] specifies the __cxa_end_cleanup to be called after cleanup. It will call the UnwindResume. __cxa_begin_cleanup will be called from libcxxabi while __cxa_end_cleanup is never called. This will trigger a termination when a foreign exception is processed while UnwindResume is called because the global state will be wrong due to the missing __cxa_end_cleanup call. Additional test here: D109856 [1] https://github.com/ARM-software/abi-aa/blob/main/ehabi32/ehabi32.rst#941compiler-helper-functions Reviewed By: logan Differential Revision: https://reviews.llvm.org/D111703	2021-10-28 21:45:09 +02:00
Wouter van Oortmerssen	a66451ebbe	[WebAssembly] Fix debug locations for ExplicitLocals pass Differential Revision: https://reviews.llvm.org/D112487	2021-10-28 12:35:46 -07:00
Ahmed Bougacha	bef777206e	[AArch64] Rename some timm predicates for consistency. NFC. timm isn't the common case, and TImmLeafs should make it clear what they are. We're adding a plain ImmLeaf for 0_65535, so rename i64_imm0_65535 to timm64_0_65535, and imm32_0_7 to timm32_0_7.	2021-10-28 11:41:29 -07:00
Matthias Braun	e2c7ee0743	X86InstrInfo: Support immediates that are +1/-1 different in optimizeCompareInstr This extends `optimizeCompareInstr` to re-use previous comparison results if the previous comparison was with an immediate that was 1 bigger or smaller. Example: CMP x, 13 ... CMP x, 12 ; can be removed if we change the SETg SETg ... ; x > 12 changed to `SETge` (x >= 13) removing CMP Motivation: This often happens because SelectionDAG canonicalization tends to add/subtract 1 often when optimizing for fallthrough blocks. Example for `x > C` the fallthrough optimization switches true/false blocks with `!(x > C)` --> `x <= C` and canonicalization turns this into `x < C + 1`. Differential Revision: https://reviews.llvm.org/D110867	2021-10-28 10:33:56 -07:00
Matthias Braun	97a1570d8c	X86InstrInfo: Optimize more combinations of SUB+CMP `X86InstrInfo::optimizeCompareInstr` would only optimize a `SUB` followed by a `CMP` in `isRedundantFlagInstr`. This extends the code to also look for other combinations like `CMP`+`CMP`, `TEST`+`TEST`, `SUB x,0`+`TEST`. - Change `isRedundantFlagInstr` to run `analyzeCompareInstr` on the candidate instruction and compare the results. This normalizes things and gives consistent results for various comparisons (`CMP x, y`, `SUB x, y`) and immediate cases (`TEST x, x`, `SUB x, 0`, `CMP x, 0`...). - Turn `isRedundantFlagInstr` into a member function so it can call `analyzeCompare`. - We now also check `isRedundantFlagInstr` even if `IsCmpZero` is true, since we now have cases like `TEST`+`TEST`. Differential Revision: https://reviews.llvm.org/D110865	2021-10-28 10:33:56 -07:00
Daniel Kiss	66e03db814	Revert "Reland "[ARM] __cxa_end_cleanup should be called instead of _UnwindResume."" This reverts commit `b6420e575f`.	2021-10-28 17:24:53 +02:00
Daniel Kiss	b6420e575f	Reland "[ARM] __cxa_end_cleanup should be called instead of _UnwindResume." This is relanding commit `da1d1a0869` . This patch additionally addresses failures found in buildbots & post review comments. ARM EHABI[1] specifies the __cxa_end_cleanup to be called after cleanup. It will call the UnwindResume. __cxa_begin_cleanup will be called from libcxxabi while __cxa_end_cleanup is never called. This will trigger a termination when a foreign exception is processed while UnwindResume is called because the global state will be wrong due to the missing __cxa_end_cleanup call. Additional test here: D109856 [1] https://github.com/ARM-software/abi-aa/blob/main/ehabi32/ehabi32.rst#941compiler-helper-functions Reviewed By: logan Differential Revision: https://reviews.llvm.org/D111703	2021-10-28 16:49:19 +02:00
Simon Pilgrim	d29ccbecd0	[X86][AVX] Attempt to fold a scaled index into a gather/scatter scale immediate (PR13310) If the index operand for a gather/scatter intrinsic is being scaled (self-addition or a shl-by-immediate) then we may be able to fold that scaling into the intrinsic scale immediate value instead. Fixes PR13310. Differential Revision: https://reviews.llvm.org/D108539	2021-10-28 14:07:17 +01:00
Abinav Puthan Purayil	2da6ef3664	[AMDGPU] Add 24-bit mulhi intrinsics in INTRINSIC_WO_CHAIN combine. mul24 intrinsic's operands are simplified by AMDGPUTargetLowering::performIntrinsicWOChainCombine(). This change adds the mul24hi intrinsics in the combine since its operands can be simplified like that of the mul24 intrinsics. Differential Revision: https://reviews.llvm.org/D112702	2021-10-28 16:57:48 +05:30
Sebastian Neubauer	fd1cfc9094	[AMDGPU][GlobalISel] Fix waterfall loops - Move the `s_and exec` to its correct position before the content of the waterfall loop - Use the SI_WATERFALL pseudo instruction, like for sdag, to benefit from optimizations - Add support for indirect function calls To support indirect calls, add a G_SI_CALL instruction without register class restrictions and insert a waterfall loop when applying register banks. Differential Revision: https://reviews.llvm.org/D109052	2021-10-28 10:30:55 +02:00
Caroline Concatto	2186b011e9	[Driver][AArch64]Add driver support for neoverse-512tvb target The support for neoverse-512tvb mirrors the same option available in GCC[1]. There is no functional effect for this option yet. This patch ensures the driver accepts "-mcpu=neoverse-512tvb", and enough plumbing is in place to allow the new option to be used in the future. [1]https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html Differential Revision: https://reviews.llvm.org/D112406	2021-10-28 09:08:40 +01:00
Hsiangkai Wang	7051f73d69	[RISCV] Sync Zvlsseg register order as the same as vector registers. Sync the order of Zvlsseg registers with vector registers to avoid unnecessary register copies between vector instructions and zvlsseg instructions. Differential Revision: https://reviews.llvm.org/D110250	2021-10-28 13:34:53 +08:00
Kazu Hirata	cee3419d65	[AMDGPU] Remove unused declaration findNumUsedRegistersSI (NFC)	2021-10-27 21:24:02 -07:00
Phoebe Wang	2bc28c6f82	[X86] Add a dependency breaking xor before any gathers with an undef passthru value. In the instruction encoding, the passthru register is always tied to the destination register. The CPU scheduler has to wait for the last writer of this register to finish executing before the gather can start. This is true even if the initial mask is all ones so that the passthru will never be used. By explicitly zeroing the register we can break the false dependency. The zero idiom is executed completing by the register renamer and so is immedately considered ready. Authored by Craig. Reviewed By: lebedev.ri Differential Revision: https://reviews.llvm.org/D112505	2021-10-28 11:44:52 +08:00
Hsiangkai Wang	0a9b82960c	[RISCV] Use vmv.v.[v\|i] if we know COPY is under the same vl and vtype. If we know the source operand of COPY is defined by a vector instruction with tail agnostic and the same LMUL and there is no vsetvli between COPY and the define instruction to change the vl and vtype, we could use vmv.v.v or vmv.v.i to copy vector registers to get better performance than the whole vector register move instructions. If the source of COPY is from vmv.v.i, we could use vmv.v.i for the COPY. This patch only considers all these instructions within one basic block. Case 1: ``` bb.0: ... VSETVLI # The first VSETVLI before COPY and VOP. ... # Use this VSETVLI to check LMUL and tail agnostic. ... vy = VOP va, vb # Define vy. ... # There is no vsetvli between VOP and COPY. vx = COPY vy ``` Case 2: ``` bb.0: ... VSETVLI # The first VSETVLI before VOP. ... # Use this VSETVLI to check LMUL and tail agnostic. ... vy = VOP va, vb # Define vy. ... # There is no vsetvli to change vl between VOP and COPY. ... VSETVLI # The first VSETVLI before COPY. ... # This VSETVLI does not change vl and vtype. ... vx = COPY vy ``` Co-Authored-by: Zakk Chen <zakk.chen@sifive.com> Co-Authored-by: Kito Cheng <kito.cheng@sifive.com> Differential Revision: https://reviews.llvm.org/D103510	2021-10-28 11:39:04 +08:00
Craig Topper	1387483e72	[RISCV] Replace most uses of RISCVSubtarget::hasStdExtV. NFCI Add new hasVInstructions() which is currently equivalent. Replace vector uses of hasStdExtZfh/F/D with new vector specific versions. The vector spec no longer requires that the vectors implement the same types as scalar. It only requires that the scalar type is the maximum size the vectors can support. This is currently implemented using the scalar rule we were using before. Add new hasVInstructionsI64() begin using to qualify code that requires i64 vector elements. This is all NFC for now, but we can start using this to better implement D112408 which introduces the Zve extensions. Reviewed By: frasercrmck, eopXD Differential Revision: https://reviews.llvm.org/D112496	2021-10-27 19:33:48 -07:00
Ard Biesheuvel	d7e089f2d6	[ARM] Use hardware TLS register in Thumb2 mode when -mtp=cp15 is passed In ARM mode, passing -mtp=cp15 forces the use of an inline MRC system register read to move the thread pointer value into a register. Currently, in Thumb2 mode, -mtp=cp15 is ignored, and a call to the __aeabi_read_tp helper is emitted instead. This is inconsistent, and breaks the Linux/ARM build for Thumb2 targets, as the Linux kernel does not provide an implementation of __aeabi_read_tp,. Reviewed By: nickdesaulniers, peter.smith Differential Revision: https://reviews.llvm.org/D112600	2021-10-27 16:42:11 -07:00
Michael Liao	e6a4ba3aa6	[amdgpu] Handle the case where there is no scavenged register. - When an unconditional branch is expanded into an indirect branch, if there is no scavenged register, an SGPR pair needs spilling to enable the destination PC calculation. In addition, before jumping into the destination, that clobbered SGPR pair need restoring. - As SGPR cannot be spilled to or restored from memory directly, the spilling/restoring of that SGPR pair reuses the regular SGPR spilling support but without spilling it into memory. As that spilling and restoring points are fully controlled, we only need to spill that SGPR into the temporary VGPR, which needs spilling into its emergency slot. - The target-specific hook is revised to take additional restore block, where the restoring code is filled. After that, the relaxation will place that restore block directly before the destination block and insert an unconditional branch in any fall-through block into the destination block. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D106449	2021-10-27 18:37:27 -04:00
Kazu Hirata	593451bd3c	[X86] Remove getSETOpc (NFC) This function seems to be unused for at least one year.	2021-10-27 09:22:31 -07:00
Kazu Hirata	e6b6190ead	[X86] Remove NeedsRetpoline in X86AsmPrinter (NFC) This field seems to be unused for at least one year.	2021-10-27 09:22:29 -07:00
Kazu Hirata	cc73310a81	[X86] Remove CallOperand in X86Operand (NFC) This field seems to be unused for at least one year.	2021-10-27 09:22:27 -07:00
Daniel Kiss	894ddba1c9	Revert "[ARM] __cxa_end_cleanup should be called instead of _UnwindResume." This reverts commit `da1d1a0869`.	2021-10-27 14:29:35 +02:00
Sanjay Patel	6c0a2c2804	[x86] enhance mayFoldLoad to check alignment As noted in D112464, a pre-AVX target may not be able to fold an under-aligned vector load into another op, so we shouldn't report that as a load folding candidate. I only found one caller where this would make a difference -- combineCommutableSHUFP() -- so that's where I added a test to show the (minor) regression. Differential Revision: https://reviews.llvm.org/D112545	2021-10-27 07:54:25 -04:00
Matt	fc28a2f8ce	[AArch64][SVE] Combine predicated FMUL/FADD into FMA Combine FADD and FMUL intrinsics into FMA when the result of the FMUL is an FADD operand with one only use and both use the same predicate. Differential Revision: https://reviews.llvm.org/D111638	2021-10-27 11:41:23 +00:00
Alexandros Lamprineas	8689f5e6e7	[AArch64] Add support for the 'R' architecture profile. This change introduces subtarget features to predicate certain instructions and system registers that are available only on 'A' profile targets. Those features are not present when targeting a generic CPU, which is the default processor. In other words the generic CPU now means the intersection of 'A' and 'R' profiles. To maintain backwards compatibility we enable the features that correspond to -march=armv8-a when the architecture is not explicitly specified on the command line. References: https://developer.arm.com/documentation/ddi0600/latest Differential Revision: https://reviews.llvm.org/D110065	2021-10-27 12:32:30 +01:00
Daniel Kiss	da1d1a0869	[ARM] __cxa_end_cleanup should be called instead of _UnwindResume. ARM EHABI[1] specifies the __cxa_end_cleanup to be called after cleanup. It will call the UnwindResume. __cxa_begin_cleanup will be called from libcxxabi while __cxa_end_cleanup is never called. This will trigger a termination when a foreign exception is processed while UnwindResume is called because the global state will be wrong due to the missing __cxa_end_cleanup call. Additional test here: D109856 [1] https://github.com/ARM-software/abi-aa/blob/main/ehabi32/ehabi32.rst#941compiler-helper-functions Reviewed By: logan Differential Revision: https://reviews.llvm.org/D111703	2021-10-27 10:40:00 +02:00
Kazu Hirata	6af3e87d2d	[Hexagon] Remove set-but-unused variables (NFC)	2021-10-26 23:38:15 -07:00
Phoebe Wang	eb55c1f153	[X86][NFC] Add the missed `break;` for `79f9dfef0d`	2021-10-27 13:58:31 +08:00
Craig Topper	2783a5cfaf	[RISCV] Add ICmp and FCmp to shouldSinkOperands.	2021-10-26 22:23:54 -07:00
Ben Shi	97e52e1c35	[RISCV] Optimize immediate materialisation with SLLI.UW in the Zba extension Simplify "LUI+SLLI+ADDI+SLLI" and "LUI+ADDIW+SLLI+ADDI+SLLI" to "LUI+ADDIW+SLLIUW" to reduce total instruction amount. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D111933	2021-10-27 02:48:38 +00:00
Austin Kerbow	02e60f2e77	[AMDGPU] Use max waves for scheduler's initial occupancy target The scheduler should set critical/excess register usage thresholds that are guided by the maximum possible occupancy for the function. This change is focused on setting proper lower bounds on register usage which we would typically only see when a specific number of maximum waves is requested with the "waves-per-eu" attribute, or by setting "amdgpu-num-vgpr\|sgpr" directly. This was broken previously. I have a follow-on patch that will address issues with the scheduler not targeting correct upper bounds on register usage which is typical with launch bounds and min "waves-per-eu". Changes by this patch: Set the initial critical register usage thresholds to minimum values that are determined by the maximum possible occupancy for the function, or the number of allocatable registers, whichever is lower. Avoid unisgned overflow if register limits are lower than the register tracking "ErrorMargin", I.e. when using stress-regalloc=2. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D112373	2021-10-26 15:30:26 -07:00
Fangrui Song	226465efe3	[ARC] Fix `undefined symbol: llvm::MachineFunction::dump() const`	2021-10-26 11:44:18 -07:00
Kazu Hirata	c3e698e2f5	[CodeGen, Hexagon] Use MachineBasicBlock::phis (NFC)	2021-10-26 09:01:29 -07:00
Jonas Paulsson	bb506938be	[SystemZ] Improvement of emitMemMemWrapper() It was discovered that an extra register COPY remained when expanding a (variable length) memory operation with a loop and there was another use of the involved address register(s) afterwards. A simple fix for this is to COPY the address registers before the loop and use that new vreg instead. Review: Ulrich Weigand Differential Revision: https://reviews.llvm.org/D112065	2021-10-26 17:03:01 +02:00
Neubauer, Sebastian	eb16570ab0	[AMDGPU] Remove unused CSR defs CSR_AMDGPU_VGPRs_24_255 and CSR_AMDGPU_VGPRs_32_255 are not used anywhere, so remove them. Differential Revision: https://reviews.llvm.org/D112535	2021-10-26 16:01:49 +02:00
Abinav Puthan Purayil	61e3b9fefe	[AMDGPU] Add constrained shift pattern matches. The motivation for this is due to clang's conformance to https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html#operators-shift which makes clang emit (<shift> a, (and b, <width> - 1)) for `a <shift> b` in OpenCL where a is an int of bit width <width>. Differential revision: https://reviews.llvm.org/D110231	2021-10-26 19:07:19 +05:30
Chen Zheng	631f44f338	[PowerPC] use right extend type for SCEV Fix an issue caused by D108750 Reviewed By: nemanjai Differential Revision: https://reviews.llvm.org/D112502	2021-10-26 13:32:03 +00:00
Abinav Puthan Purayil	781dd39b7b	[AMDGPU] Enable 48-bit mul in AMDGPUCodeGenPrepare. We were bailing out of creating 24-bit muls for results wider than 32 bits in AMDGPUCodeGenPrepare. With the 24-bit mulhi intrinsic, this change teaches AMDGPUCodeGenPrepare to generate the 48-bit mul correctly. Differential Revision: https://reviews.llvm.org/D112395	2021-10-26 18:53:07 +05:30
Abinav Puthan Purayil	9bd5cfeb1f	[AMDGPU] Implement llvm.amdgcn.mulhi.[i,u]24 intrinsics. These intrinsics maps to the 24-bit v_mul_hi instructions. This change also fixes an incorrect assumption on the associativity of 24-bit mulhi in its SDNode record in tblgen. Differential Revision: https://reviews.llvm.org/D112394	2021-10-26 18:53:07 +05:30
Sanjay Patel	2ab0148c14	[x86] use cast instead of dyn_cast for unchecked usage; NFC This was noted as an independent clean-up in D112464.	2021-10-26 08:20:19 -04:00
Neubauer, Sebastian	487f15603e	[AMDGPU] Fix setcc combine for i128 The combine asserted if constants could not be represented as uint64_t. Use APInts to fix this. Differential Revision: https://reviews.llvm.org/D112416	2021-10-26 13:39:50 +02:00
Jay Foad	c8e5aef1a0	[AMDGPU] Use standard MachineBasicBlock::getFallThrough method. NFCI. Differential Revision: https://reviews.llvm.org/D101825	2021-10-26 12:07:54 +01:00
Jonas Paulsson	9f8872779a	[SystemZ] Provide size values for PATCHPOINT, STACKMAP and FENTRY_CALL. All instructions must have a correct size value close to emission when SystemZLongBranch runs, or a necessary branch relaxation may be missed. This patch also adds an assert for instruction sizes in SystemZLongBranch. Review: Ulrich Weigand	2021-10-26 12:07:22 +02:00
Phoebe Wang	79f9dfef0d	[X86] Move splat addends from the gather/scatter index operand to the base address This can avoid a vector add and a constant pool load. Or an explicit broadcast in case of non-constant. Also reverse the transform any time we encounter a constant index addend that can't be moved to base. In that case pull the constant from base into the index. This reduces code size needed for the displacement since we needed the index add anyway. Limit this to scale of 1 to avoid divisibility and wrap issues. Authored by Craig. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D111595	2021-10-26 12:35:57 +08:00
Zarko Todorovski	e9163660b1	[PPC][LLVM] Inclusive terms: remove references to sanity check in lib/Target/PowerPC Removed references to `sanity check` in `PPCBranchCoalescing.cpp` code comments. No word substitution made in this case, as the comments and code following illustrated are sufficient IMO. Reviewed By: quinnp Differential Revision: https://reviews.llvm.org/D112452	2021-10-25 18:13:54 -04:00
Wouter van Oortmerssen	5694dbccc3	[WebAssembly] support Memory64 in target_features section Differential Revision: https://reviews.llvm.org/D112266	2021-10-25 09:31:45 -07:00
Craig Topper	e2b7aabb57	[RISCV] Reduce the number of RISCV vector builtins by an order of magnitude. All but 2 of the vector builtins are only used by clang_builtin_alias. When using clang_builtin_alias, the type string of the builtin is never checked. Only the types in the function definition used for the alias are checked. This patch takes advantage of this to share a single builtin for many different types. We already used type overloads on the IR intrinsic so the codegen for the builtins that are being merge were already the same. This extends the type overloading to the builtins. I had to make a few tweaks to make this work. -Floating point vector-vector vmerge now uses the vmerge intrinsic instead of the vfmerge intrinsic. New isel patterns and tests are added to support this. -The SemaChecking for the immediate of vset_v/vget_v has been removed. Determining the valid range is harder now. I've added masking to ManualCodegen to ensure valid IR for invalid input. This reduces the number of builtins from ~25000 to ~1100. Reviewed By: HsiangKai Differential Revision: https://reviews.llvm.org/D112102	2021-10-25 09:03:59 -07:00
Craig Topper	210b586a85	[RISCV] Add vcsr CSR name for V extension. Reviewed By: frasercrmck, kito-cheng Differential Revision: https://reviews.llvm.org/D112342	2021-10-25 08:56:25 -07:00
Danila Malyutin	2d9ee590b6	[AArch64] Handle ST1iN instructions in isAArch64FrameOffsetLegal Before the code would crash with "unhandled opcode in isAArch64FrameOffsetLegal" when there was a spill from extractelement. Fixes pr52249 Differential Revision: https://reviews.llvm.org/D112311	2021-10-25 17:05:12 +03:00
Kerry McLaughlin	1f49b71fe5	[SVE][CodeGen] Enable reciprocal estimates for scalable fdiv/fsqrt This patch enables the use of reciprocal estimates for SVE when both the -Ofast and -mrecip flags are used. Reviewed By: david-arm, paulwalker-arm Differential Revision: https://reviews.llvm.org/D111657	2021-10-25 11:30:44 +01:00
Jingu Kang	a502436259	[AArch64] Remove redundant ORRWrs which is generated by zero-extend %3:gpr32 = ORRWrs $wzr, %2, 0 %4:gpr64 = SUBREG_TO_REG 0, %3, %subreg.sub_32 If AArch64's 32-bit form of instruction defines the source operand of ORRWrs, we can remove the ORRWrs because the upper 32 bits of the source operand are set to zero. Differential Revision: https://reviews.llvm.org/D110841	2021-10-25 09:47:07 +01:00
Chen Zheng	80e6aff6bb	[PowerPC] common chains to reuse offsets to reduce register pressure. Add a new preparation pattern in PPCLoopInstFormPrep pass to reduce register pressure. Reviewed By: jsji Differential Revision: https://reviews.llvm.org/D108750	2021-10-25 03:27:16 +00:00
Kazu Hirata	9800731367	[Target, Transforms] Use predecessors instead of pred_begin and pred_end (NFC)	2021-10-24 17:35:35 -07:00
Kazu Hirata	4bd46501c3	Use llvm::any_of and llvm::none_of (NFC)	2021-10-24 17:35:33 -07:00
Matthias Braun	4b75d674f8	X86InstrInfo: Look across basic blocks in optimizeCompareInstr This extends `optimizeCompareInstr` to continue the backwards search when it reached the beginning of a basic block. If there is a single predecessor block then we can just continue the search in that block and mark the EFLAGS register as live-in. Differential Revision: https://reviews.llvm.org/D110862	2021-10-24 16:22:45 -07:00
Matthias Braun	683994c863	X86InstrInfo: Refactor and cleanup optimizeCompareInstr This changes the first part of `optimizeCompareInstr` being split into a loop with a forward scan for cases that re-use zero flags from a producer in case of compare with zero and a backward scan for finding an instruction equivalent to a compare. The code now uses a single backward scan searching for the next instructions that reads or writes EFLAGS. Also: - Add comments giving examples for the 3 cases handled. - Check `MI` which contains the result of the zero-compare cases, instead of re-checking `IsCmpZero`. - Tweak coding style in some loops. - Add new MIR based tests that test the optimization in isolation. This also removes a check for flag readers in situations like this: ``` = SUB32rr %0, %1, implicit-def $eflags ... we no longer stop when there are $eflag users here CMP32rr %0, %1 ; will be removed ... ``` Differential Revision: https://reviews.llvm.org/D110857	2021-10-24 16:22:45 -07:00
Fangrui Song	54405a49d8	[ARC] Fix -Wunused-variable. NFC	2021-10-24 10:31:44 -07:00
Simon Pilgrim	b09f2ee57c	[X86] findEltLoadSrc - fix shift amount variable name. NFCI. Fix the copy + paste, renaming shift amt from Idx to Amt	2021-10-23 21:24:37 +01:00
Kazu Hirata	d8e4170b0a	Ensure newlines at the end of files (NFC)	2021-10-23 08:45:29 -07:00
Jessica Clarke	2d8c18fbbd	[X86] Don't add implicit REP prefix to VIA PadLock xstore Commit `8fa3e8fa14` added an implicit REP prefix to all VIA PadLock instructions, but GNU as doesn't add one to xstore, only all the others. This resulted in a kernel panic regression in FreeBSD upon updating to LLVM 11 (https://bugs.freebsd.org/259218) which includes the commit in question. This partially reverts that commit. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D112355	2021-10-23 01:57:17 +01:00
Matt Arsenault	ec57b37551	AMDGPU: Use attributor to propagate amdgpu-flat-work-group-size This can merge the acceptable ranges based on the call graph, rather than the simple application of the attribute. Remove the handling from the old pass.	2021-10-22 16:23:50 -04:00
Matt Arsenault	8d4b74ac3f	AMDGPU: Don't consider whether amdgpu-flat-work-group-size was set It should be semantically identical if it was set to the same value as the default. Also improve the documentation.	2021-10-22 16:23:50 -04:00
Craig Topper	cd824f9e39	[X86] Fix bad formatting. NFC	2021-10-22 13:16:35 -07:00
Jay Foad	58e7ec471c	[AMDGPU] Run SIShrinkInstructions before post-RA scheduling Run post-RA SIShrinkInstructions just before post-RA scheduling, instead of afterwards. After the fixes in D112305 and D112317 this seems to make no difference, but it paves the way for scheduler tweaks that are sensitive to the e32 vs e64 encoding of VALU instructions. Differential Revision: https://reviews.llvm.org/D112341	2021-10-22 20:24:03 +01:00
Jay Foad	3f34f75a68	[AMDGPU] Fix latency for implicit vcc_lo operands on GFX10 wave32 As described in the comment, the way we change vcc to vcc_lo in these operands confuses addPhysRegDataDeps into treating them as implicit pseudo operands. Fix this by setting the correct latency from the SchedModel after addPhysRegDataDeps wrongly set it to 0. Differential Revision: https://reviews.llvm.org/D112317	2021-10-22 20:03:29 +01:00
Craig Topper	04c184bba7	[TargetLowering] Simplify the interface of expandABS. NFC Instead of returning a bool to indicate success and a separate SDValue, return the SDValue and have the callers check if it is null. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D112331	2021-10-22 10:22:23 -07:00
Kazu Hirata	6fe949c4ed	[Target, Transforms] Use StringRef::contains (NFC)	2021-10-22 08:52:33 -07:00
Jonas Paulsson	12b44bf5ee	[SystemZ] Give the EXRL_Pseudo a size value of 6 bytes. This pseudo is expanded very late (AsmPrinter) and therefore has to have a correct size value, or the branch relaxation pass may make a wrong decision. Review: Ulrich Weigand	2021-10-22 17:38:51 +02:00
Bradley Smith	cfe22cd4ef	[AArch64][SVE] Add new ld<n> intrinsics that return a struct of vscale types This will allow us to reuse existing interleaved load logic in lowerInterleavedLoad that exists for neon types, but for SVE fixed types. The goal eventually will be to replace the existing ld<n> intriniscs with these, once a migration path has been sorted out. Differential Revision: https://reviews.llvm.org/D112078	2021-10-22 14:13:17 +00:00
Roman Lebedev	8fac9e95ad	[X86] `X86TTIImpl::getInterleavedMemoryOpCost()`: scale interleaving cost by the fraction of live members By definition, interleaving load of stride N means: load NVF elements, and shuffle them into N VF-sized vectors, with 0'th vector containing elements `[0, VF)stride + 0`, and 1'th vector containing elements `[0, VF)stride + 1`. Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6) Now, not fully interleaved load, is when not all of these vectors is demanded. So at worst, we could just pretend that everything is demanded, and discard the non-demanded vectors. What this means is that the cost for not-fully-interleaved group should be not greater than the cost for the same fully-interleaved group, but perhaps somewhat less. Examples: https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4) https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2) https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1) Right now, for such not-fully-interleaved loads we just use the costs for fully-interleaved loads. But at least in general, that is obviously overly pessimistic, because in general*, not all the shuffles needed to perform the full interleaving will end up being live. So what this does, is naively scales the interleaving cost by the fraction of the live members. I believe this should still result in the right ballpark cost estimate, although it may be over/under -estimate. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D112307	2021-10-22 16:33:58 +03:00
Jay Foad	74cd4dee20	[AMDGPU] Preserve deadness of vcc when shrinking instructions This doesn't have any effect on codegen now, but it might do in the future if we shrink instructions before post-RA scheduling, which is sensitive to live vs dead defs. Differential Revision: https://reviews.llvm.org/D112305	2021-10-22 14:22:24 +01:00
Simon Pilgrim	a750332d77	AMDGPULibCalls - constify some FuncInfo& arguments. NFCI.	2021-10-22 12:10:58 +01:00
Simon Pilgrim	99a64cc9da	AMDGPULibCalls::parseFunctionName - use reference instead of pointer. NFCI. parseFunctionName allowed a default null pointer, despite it being dereferenced immediately to be used as a reference and that all callers were taking the address of an existing reference. Fixes static analyzer warning about potential dereferenced nulls	2021-10-22 11:45:25 +01:00
Fraser Cormack	74c6895b39	[RISCV] Fix missing cross-block VSETVLI insertion This patch fixes a codegen bug, the test for which was introduced in D112223. When merging VSETVLIInfo across blocks, if the 'exit' VSETVLIInfo produced by a block is found to be compatible with the VSETVLIInfo computed as the intersection of the 'exit' VSETVLIInfo produced by the block's predecessors, that blocks' 'exit' info is discarded and the intersected value is taken in its place. However, we have one authority on what constitutes VSETVLIInfo compatibility and we are using it in two different contexts. Compatibility is used in one context to elide VSETVLIs between straight-line vector instructions. But compatibility when evaluated between two blocks' exit infos ignores any info produced inside each respective block before the exit points. As such it does not guarantee that a block will not produce a VSETVLI which is incompatible with the 'previous' block. As such, we must ensure that any merging of VSETVLIInfo is performed using some notion of "strict" compatibility. I've defined this as a full vtype match, but this is perhaps too pessimistic. Given that test coverage in this regard is lacking -- the only change is in the failing test -- I think this is a good starting point. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D112228	2021-10-22 10:45:10 +01:00
Chen Zheng	86a5c32616	[PowerPC] iterate on the SmallSet directly; NFC	2021-10-22 06:18:07 +00:00
Chen Zheng	13755436bb	[PowerPC] return early if there is no preparing candidate in the loop; NFC This is to improve compiling time. Differential Revision: https://reviews.llvm.org/D112196 Reviewed By: jsji	2021-10-22 05:39:51 +00:00
Stanislav Mekhanoshin	ca0c92d6a1	[AMDGPU] Allow to use a whole register file on gfx90a for VGPRs In a kernel which does not have calls or AGPR usage we can allocate the whole vector register budget for VGPRs and have no AGPRs as long as VGPRs stay addressable (i.e. below 256). Differential Revision: https://reviews.llvm.org/D111764	2021-10-21 18:24:34 -07:00
Jessica Paquette	5dc339d982	[AArch64][GlobalISel] Fold 64-bit cmps with 64-bit adds G_ICMP is selected to an arithmetic overflow op (ADDS/SUBS/etc) with a dead destination + a CSINC instruction. We have a fold which allows us to combine 32-bit adds with G_ICMP. The problem with G_ICMP is that we model it as always having a 32-bit destination even though it can be a 64-bit operation. So, we were missing some opportunities for 64-bit folds. This patch teaches the fold to recognize 64-bit G_ICMPs + refactors some of the code surrounding CSINC accordingly. (Later down the line, I think we should probably change the way we handle G_ICMP in general.) Differential Revision: https://reviews.llvm.org/D111088	2021-10-21 13:51:44 -07:00
Yonghong Song	0472e83ffc	BPF: emit BTF_KIND_DECL_TAG for typedef types If a typedef type has __attribute__((btf_decl_tag("str"))) with bpf target, emit BTF_KIND_DECL_TAG for that type in the BTF. Differential Revision: https://reviews.llvm.org/D112259	2021-10-21 12:09:42 -07:00
Craig Topper	d55be79d75	[RISCV] Expand scalable vector CTTZ/CTLZ/CTPOP. Differential Revision: https://reviews.llvm.org/D112233	2021-10-21 10:50:04 -07:00
Anirudh Prasad	aa3519f178	[SystemZ][z/OS] Initial implementation for lowerCall on z/OS - This patch provides the initial implementation for lowering a call on z/OS according to the XPLINK64 calling convention - A series of changes have been made to SystemZCallingConv.td to account for these additional XPLINK64 changes including adding a new helper function to shadow the stack along with allocation of a register wherever appropriate - For the cases of copying a f64 to a gr64 and a f128 / 128-bit vector type to a gr64, a `CCBitConvertToType` has been added and has been bitcasted appropriately in the lowering phase - Support for the ADA register (R5) will be provided in a later patch. Reviewed By: uweigand Differential Revision: https://reviews.llvm.org/D111662	2021-10-21 09:48:59 -04:00
David Sherwood	9448cdc900	[SVE][Analysis] Tune the cost model according to the tune-cpu attribute This patch introduces a new function: AArch64Subtarget::getVScaleForTuning that returns a value for vscale that can be used for tuning the cost model when using scalable vectors. The VScaleForTuning option in AArch64Subtarget is initialised according to the following rules: 1. If the user has specified the CPU to tune for we use that, else 2. If the target CPU was specified we use that, else 3. The tuning is set to "generic". For CPUs of type "generic" I have assumed that vscale=2. New tests added here: Analysis/CostModel/AArch64/sve-gather.ll Analysis/CostModel/AArch64/sve-scatter.ll Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll Differential Revision: https://reviews.llvm.org/D110259	2021-10-21 09:33:50 +01:00
Sanjay Patel	40163f1df8	[x86] add special-case lowering for usubsat for AVX512 This is a small extension of D112095 to avoid another regression seen with D112085. In this case, we allow the same conversion from usubsat to ALU ops if the target supports vpternlog. That pattern will get converted later in X86DAGToDAGISel::tryVPTERNLOG(). This seems better than putting a magic immediate constant directly in this code to create the exact vpternlog that we need. It's possible that there are other special-cases along these lines, so we should try to keep all of the vpternlog magic in one place. Differential Revision: https://reviews.llvm.org/D112138	2021-10-20 16:41:13 -04:00
Stanislav Mekhanoshin	6185835656	[AMDGPU] Allow rematerialization of SOP with virtual registers D106408 was doing this for all targets although it was reverted due to couple performance regressions on some targets. The difference for AMDGPU is the ability to rematerialize SOP instructions with virtual register uses like we already do for VOP. Differential Revision: https://reviews.llvm.org/D110743	2021-10-20 11:46:50 -07:00
Zhi An Ng	e1fb13401e	[WebAssembly] Add prototype relaxed float min max instructions Add relaxed. f32x4.min, f32x4.max, f64x2.min, f64x2.max. These are only exposed as builtins, and require user opt-in. Differential Revision: https://reviews.llvm.org/D112146	2021-10-20 09:41:51 -07:00
Craig Topper	fe1f0de003	[RISCV][WebAssembly][TargetLowering] Allow expandCTLZ/expandCTTZ to rely on CTPOP expansion for vectors. Our fallback expansion for CTLZ/CTTZ relies on CTPOP. If CTPOP isn't legal or custom for a vector type we would scalarize the CTLZ/CTTZ. This is different than CTPOP itself which would use a vector expansion. This patch teaches expandCTLZ/CTTZ to rely on the vector CTPOP expansion instead of scalarizing. To do this I had to add additional checks to make sure the operations used by CTPOP expansions are all supported. Some of the operations were already needed for the CTLZ/CTTZ expansion. This is a huge improvement to the RISCV which doesn't have a scalar ctlz or cttz in the base ISA. For WebAssembly, I've added Custom lowering to keep the scalarizing behavior. I've also extended the scalarizing to CTPOP. Differential Revision: https://reviews.llvm.org/D111919	2021-10-20 07:46:41 -07:00
Sanjay Patel	3efd2a0bec	[x86] make helper for useVPTERNLOG; NFC See D112085 for another use case.	2021-10-20 10:26:53 -04:00
Simon Pilgrim	5b395bd633	[CostModel][X86] Add costs for multiply-by-pow2 constants These are folded to left shifts in the backend. We should be able to extend this for multiply-by-negpow2 after D111968 has landed to resolve PR51436	2021-10-20 13:11:21 +01:00
Simon Pilgrim	9fc523d114	[X86] Remove X86ProcFamilyEnum::IntelSLM Replace X86ProcFamilyEnum::IntelSLM enum with a TuningUseSLMArithCosts flag instead, matching what we already do for Goldmont. This just leaves X86ProcFamilyEnum::IntelAtom to replace with general Tuning/Feature flags and we can finally get rid of the old X86ProcFamilyEnum enum. Differential Revision: https://reviews.llvm.org/D112079	2021-10-20 11:58:39 +01:00
Daniel Kiss	f903c85055	[AArch64] Emit .cfi_negate_ra_state for PAC-auth instructions. autiasp, autibsp instructions are the counterpart of paciasp/pacibsp instructions therefore let's emit .cfi_negate_ra_state for these too. In case of Armv8.3 instruction set the retaa/retbb will do the return and authentication in one step here we can't emit the . cfi_negate_ra_state because that would be point after the ret* instruction. Reviewed By: nickdesaulniers, MaskRay Differential Revision: https://reviews.llvm.org/D111780	2021-10-20 11:03:52 +02:00
Joerg Sonnenberger	ec428f7b78	[SPARC] Recognize the prefetch instruction Reviewed By: LemonBoy Differential Revision: https://reviews.llvm.org/D96311	2021-10-20 10:59:01 +02:00
Paulo Matos	6d0c7bc17d	[WebAssembly] Implementation of table.get/set for reftypes in LLVM IR This change implements new DAG nodes TABLE_GET/TABLE_SET, and lowering methods for load and stores of reference types from IR arrays. These global LLVM IR arrays represent tables at the Wasm level. Differential Revision: https://reviews.llvm.org/D111154	2021-10-20 10:31:31 +02:00
Zi Xuan Wu	de10a02fc0	[CSKY] Complete to add basic integer instruction set Complete the basic integer instruction set and add related predictor in CSKY.td. And it includes the instruction definition and asm parser support. Differential Revision: https://reviews.llvm.org/D111701	2021-10-20 15:50:44 +08:00
Zhi An Ng	2542bfa43a	[WebAssembly] Add prototype relaxed swizzle instructions Add i8x16 relaxed_swizzle instructions. These are only exposed as builtins, and require user opt-in. Differential Revision: https://reviews.llvm.org/D112022	2021-10-19 17:53:04 -07:00
Yonghong Song	cd40b5a712	BPF: set .BTF and .BTF.ext section alignment to 4 Currently, .BTF and .BTF.ext has default alignment of 1. For example, $ cat t.c int foo() { return 0; } $ clang -target bpf -O2 -c -g t.c $ llvm-readelf -S t.o ... Section Headers: [Nr] Name Type Address Off Size ES Flg Lk Inf Al ... [ 7] .BTF PROGBITS 0000000000000000 000167 00008b 00 0 0 1 [ 8] .BTF.ext PROGBITS 0000000000000000 0001f2 000050 00 0 0 1 But to have no misaligned data access, .BTF and .BTF.ext actually requires alignment of 4. Misalignment is not an issue for architecture like x64/arm64 as it can handle it well. But some architectures like mips may incur a trap if .BTF/.BTF.ext is not properly aligned. This patch explicitly forced .BTF and .BTF.ext alignment to be 4. For the above example, we will have [ 7] .BTF PROGBITS 0000000000000000 000168 00008b 00 0 0 4 [ 8] .BTF.ext PROGBITS 0000000000000000 0001f4 000050 00 0 0 4 Differential Revision: https://reviews.llvm.org/D112106	2021-10-19 16:26:01 -07:00
Artem Belevich	b6b7fe60a4	[NVPTX] Add a late SROA pass which allows optimizing away more allocas. Fixes performance regression https://bugs.llvm.org/show_bug.cgi?id=52037 Differential Revision: https://reviews.llvm.org/D111471	2021-10-19 16:18:28 -07:00
Sanjay Patel	92a0389b04	[x86] add special-case lowering for usubsat for pre-SSE4 usubsat X, SMIN --> (X ^ SMIN) & (X s>> BW-1) This would be a regression with D112085 where we combine to usubsat more aggressively, so avoid that by matching the special-case where we are subtracting SMIN (signmask): https://alive2.llvm.org/ce/z/4_3gBD Differential Revision: https://reviews.llvm.org/D112095	2021-10-19 17:13:16 -04:00
David Sherwood	5ea35791e6	[AArch64] Split out processor/tuning features Following on from an earlier patch that introduced support for -mtune for AArch64 backends, this patch splits out the tuning features from the processor features. This gives us the ability to enable architectural feature set A for a given processor with "-mcpu=A" and define the set of tuning features B with "-mtune=B". It's quite difficult to write a test that proves we select the right features according to the tuning attribute because most of these relate to scheduling. I have created a test here: CodeGen/AArch64/misched-fusion-addr-tune.ll that demonstrates the different scheduling choices based upon the tuning. Differential Revision: https://reviews.llvm.org/D111551	2021-10-19 15:18:55 +01:00
David Sherwood	607fb1bb8c	[AArch64] Always add -tune-cpu argument to -cc1 driver This patch ensures that we always tune for a given CPU on AArch64 targets when the user specifies the "-mtune=xyz" flag. In the AArch64Subtarget if the tune flag is unset we use the CPU value instead. I've updated the release notes here: llvm/docs/ReleaseNotes.rst and added tests here: clang/test/Driver/aarch64-mtune.c Differential Revision: https://reviews.llvm.org/D110258	2021-10-19 14:57:51 +01:00
Simon Pilgrim	71e39e3f18	[ADT] Add APInt::isNegatedPowerOf2() helper Inspired by D111968, provide a isNegatedPowerOf2() wrapper instead of obfuscating code with (-Value).isPowerOf2() patterns, which I'm sure are likely avenues for typos..... Differential Revision: https://reviews.llvm.org/D111998	2021-10-19 14:38:21 +01:00
Hsiangkai Wang	facff468b6	[RISCV] Reorder the vector register allocation order. GPR uses argument registers as the first group of registers to allocate. This patch uses vector argument registers, v8 to v23, as the first group to allocate. Differential Revision: https://reviews.llvm.org/D111304	2021-10-19 09:30:13 +08:00
Anshil Gandhi	0567f03331	[HIP] [AlwaysInliner] Disable AlwaysInliner to eliminate undefined symbols By default clang emits complete contructors as alias of base constructors if they are the same. The backend is supposed to emit symbols for the alias, otherwise it causes undefined symbols. @yaxunl observed that this issue is related to the llvm options `-amdgpu-early-inline-all=true` and `-amdgpu-function-calls=false`. This issue is resolved by only inlining global values with internal linkage. The `getCalleeFunction()` in AMDGPUResourceUsageAnalysis also had to be extended to support aliases to functions. inline-calls.ll was corrected appropriately. Reviewed By: yaxunl, #amdgpu Differential Revision: https://reviews.llvm.org/D109707	2021-10-18 16:53:15 -06:00
Simon Pilgrim	a83384498b	[X86] combineMulToPMADDWD - replace ASHR(X,16) -> LSHR(X,16) If we're using an ashr to sign-extend the entire upper 16 bits of the i32 element, then we can replace with a lshr. The sign bit will be correctly shifted for PMADDWD's implicit sign-extension and the upper 16 bits are zero so the upper i16 sext-multiply is guaranteed to be zero. The lshr also has a better chance of folding with shuffles etc.	2021-10-18 22:12:56 +01:00
Matt Morehouse	431a5d8411	[x86] Implement a tagged-globals backend feature. The feature tells the backend to allow tags in the upper bits of global variable addresses. These tags will be ignored by upcoming CPUs with the Intel LAM feature but may be used in instrumentation passes (e.g., HWASan). This patch implements the feature by using @GOTPCREL relocations instead of direct references to the locally defined global. Thus the full tagged address can be loaded by a single instruction: movq global@GOTPCREL(%rip), %rax Reviewed By: eugenis Differential Revision: https://reviews.llvm.org/D111343	2021-10-18 13:31:10 -07:00
Alexandros Lamprineas	04dc68710a	[DebugInfo][ARM] Fix incorrect debug information for RWPI accessed globals When compiling for the RWPI relocation model the debug information is wrong: * the debug location is described as { DW_OP_addr Var } instead of { DW_OP_constNu Var DW_OP_bregX 0 DW_OP_plus } * the relocation type is R_ARM_ABS32 instead of R_ARM_SBREL32 Differential Revision: https://reviews.llvm.org/D111404	2021-10-18 21:29:46 +01:00
Yonghong Song	f4a8526cc4	[NFC][BPF] fix comments and rename functions related to BTF_KIND_DECL_TAG There are no functionality change. Fix some comments and rename processAnnotations() to processDeclAnnotations() to avoid confusion when later BTF_KIND_TYPE_TAG is introduced (https://reviews.llvm.org/D111199).	2021-10-18 10:43:45 -07:00
Yonghong Song	e9e4fc0fd3	BPF: fix a bug in IRPeephole pass Commit `009f3a89d8` ("BPF: remove intrindics @llvm.stacksave() and @llvm.stackrestore()") implemented IRPeephole pass to remove llvm.stacksave()/stackrestore() instrinsics. Buildbot reported a failure: UNREACHABLE executed at ../lib/IR/LegacyPassManager.cpp:1445! which is: llvm_unreachable("Pass modifies its input and doesn't report it"); The code has changed but the implementation didn't return true for changing. This patch fixed this problem.	2021-10-18 10:18:24 -07:00
Craig Topper	84d9bc51a3	[RISCV] Rewrite forwardCopyWillClobberTuple to not assume that there are exactly 32 registers. NFC This function was copied from ARM where register pairs/triples/quads can wrap around the 32 encoding space. So register 31 can pair with register 0. This is not true for RISCV vectors. The spec specifically mentions the possibility of a future encoding that has more than 32 registers. This patch removes the modulo from the code and directly checks that destination register is in the source register range and not the beginning of the range. Though I don't expect an identity copy will occur. Reviewed By: frasercrmck Differential Revision: https://reviews.llvm.org/D111467	2021-10-18 09:57:38 -07:00
Yonghong Song	009f3a89d8	BPF: remove intrindics @llvm.stacksave() and @llvm.stackrestore() Paul Chaignon reported a bpf verifier failure ([1]) due to using non-ABI register R11. For the test case, llvm11 is okay while llvm12 and later generates verifier unfriendly code. The failure is related to variable length array size. The following mimics the variable length array definition in the test case: struct t { char a[20]; }; void foo(void *); int test() { const int a = 8; char tmp[AA + sizeof(struct t) + a]; foo(tmp); ... } Paul helped bisect that the following llvm commit is responsible: `552c6c2328` ("PR44406: Follow behavior of array bound constant folding in more recent versions of GCC.") Basically, before the above commit, clang frontend did constant folding for array size "AA + sizeof(struct t) + a" to be 68, so used alloca for stack allocation. After the above commit, clang frontend didn't do constant folding for array size any more, which results in a VLA and llvm.stacksave/llvm.stackrestore is generated. BPF architecture API does not support stack pointer (sp) register. The LLVM internally used R11 to indicate sp register but it should not be in the final code. Otherwise, kernel verifier will reject it. The early patch ([2]) tried to fix the issue in clang frontend. But the upstream discussion considered frontend fix is really a hack and the backend should properly undo llvm.stacksave/llvm.stackrestore. This patch implemented a bpf IR phase to remove these intrinsics unconditionally. If eventually the alloca can be resolved with constant size, r11 will not be generated. If alloca cannot be resolved with constant size, SelectionDag will complain, the same as without this patch. [1] https://lore.kernel.org/bpf/20210809151202.GB1012999@Mem/ [2] https://reviews.llvm.org/D107882 Differential Revision: https://reviews.llvm.org/D111897	2021-10-18 09:51:19 -07:00
Kazu Hirata	8568ca789e	Use llvm::erase_if (NFC)	2021-10-18 09:33:42 -07:00
Jessica Clarke	f5755c0849	[Mips] Add glue between CopyFromReg, CopyToReg and RDHWR nodes for TLS The MIPS ABI requires the thread pointer be accessed via rdhwr $3, $r29. This is currently represented by (CopyToReg $3, (RDHWR $29)) followed by a (CopyFromReg $3). However, there is no glue between these, meaning scheduling can break those apart. In particular, PR51691 is a report where PseudoSELECT_I was moved to between the CopyToReg and CopyFromReg, and since its expansion uses branches, it split the def and use of the physical register between two basic blocks, resulting in the def being eliminated and the use having no def. It also seems possible that a similar situation could arise splitting up the CopyToReg from the RDHWR, causing the RDHWR to use a destination register other than $3, violating the ABI requirement. Thus, add glue between all three nodes to ensure they aren't split up during instruction selection. No regression test is added since any test would be implictly relying on specific scheduling behaviour, so whilst it might be testing that glue is preventing reordering today, changes to scheduling behaviour could result in the test no longer being able to catch a regression here, as the reordering might no longer happen for other unrelated reasons. Fixes PR51691. Reviewed By: atanasyan, dim Differential Revision: https://reviews.llvm.org/D111967	2021-10-18 15:10:20 +01:00
Kerry McLaughlin	ac4e01ea0e	[SVE][CodeGen] Fix predicate for add/sub + element count patterns The patterns added in D111441 should use the HasSVEorStreamingSVE predicate. This changes one incorrect use of HasSVE with the new patterns.	2021-10-18 14:42:29 +01:00
Andrew Wei	f5056c8c16	[AArch64] Improve shuffle vector by using wider types Try to widen element type to get a new mask value for a better permutation sequence, so that we can use NEON shuffle instructions, such as zip1/2, UZP1/2, TRN1/2, REV, INS, etc. For example: shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 6, i32 7, i32 2, i32 3> is equivalent to: shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 3, i32 1> Finally, we can get: mov v0.d[0], v1.d[1] Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D111619	2021-10-18 21:24:45 +08:00
Simon Pilgrim	f041338153	[X86][Costmodel] Add SSE2 sub-128bit vXi32/f32 stride 2 interleaved store costs Differential Revision: https://reviews.llvm.org/D111941	2021-10-18 13:46:10 +01:00
Simon Pilgrim	c850d5c5c8	[X86][Costmodel] Add SSE2 sub-128bit vXi8/16 stride 2 interleaved store costs Differential Revision: https://reviews.llvm.org/D111941	2021-10-18 13:15:14 +01:00
Jay Foad	d55db4b033	[AMDGPU] Remove unused VirtRegMap analysis. NFC.	2021-10-18 11:55:40 +01:00
Jay Foad	a129932b0d	[AMDGPU] Add link to bug	2021-10-18 10:33:42 +01:00
Jay Foad	012248b0bc	Remove the verifyAfter mechanism that was replaced by D111397 Differential Revision: https://reviews.llvm.org/D111872	2021-10-18 10:26:46 +01:00
Jay Foad	36deb9a670	Add new MachineFunction property FailsVerification TargetPassConfig::addPass takes a "bool verifyAfter" argument which lets you skip machine verification after a particular pass. Unfortunately this is used in generic code in TargetPassConfig itself to skip verification after a generic pass, only because some previous target- specific pass damaged the MIR on that specific target. This is bad because problems in one target cause lack of verification for all targets. This patch replaces that mechanism with a new MachineFunction property called "FailsVerification" which can be set by (usually target-specific) passes that are known to introduce problems. Later passes can reset it again if they are known to clean up the previous problems. Differential Revision: https://reviews.llvm.org/D111397	2021-10-18 10:26:46 +01:00
Piotr Sobczak	d869921004	[AMDGPU] Add patterns for i8/i16 local atomic load/store Add patterns for i8/i16 local atomic load/store. Added tests for new patterns. Copied atomic_[store/load]_local.ll to GlobalISel directory. Differential Revision: https://reviews.llvm.org/D111869	2021-10-18 11:23:10 +02:00
Luo, Yuanke	942536ac08	[X86] Prefer VEX encoding in X86 assembler. This patch is to order the AVX instructions ahead of AVX512 instructions in the matching table so that the AVX instructions can be matched first. Thanks Craig and Shengchen for the idea. Differential Revision: https://reviews.llvm.org/D111538	2021-10-18 16:54:11 +08:00
Stanislav Mekhanoshin	7cdb1df8c7	[AMDGPU] Divergence driven selection for fused bitlogic The change adds divergence predicates for fused logical operations. The problem with selecting a scalar fused op such as S_NOR_B32 is that it does not have a VALU counterpart and will be split in moveToVALU. At the same time it prevents selection of a better opcode on the VALU side (such as V_OR3_B32) which does not have a counterpart on SALU side. XNOR opcodes are left as is and selected as scalar to get advantage of the SIInstrInfo::lowerScalarXnor() code which can commute operations to keep one of two opcodes on SALU if possible. See xnor.ll test for this. Differential Revision: https://reviews.llvm.org/D111907	2021-10-18 01:44:25 -07:00
Jingu Kang	3f0b178de2	[AArch64] Fixed a bug on AArch64MIPeepholeOpt Create new virtual register for the definition of new AND instruction and replace old register by the new one to keep SSA form. Differential Revision: https://reviews.llvm.org/D109963	2021-10-18 08:55:42 +01:00
Qiu Chaofan	67c64d8337	[PowerPC] Implement scheduling model for Power10 Reviewed By: jsji Differential Revision: https://reviews.llvm.org/D110855	2021-10-18 15:27:49 +08:00
Simon Pilgrim	0bb32b1b21	[X86][SLM] Fix BitTest+Set uops + port usage Both ports are required for BitTest ops. Update the uops counts + port usage based off the most recent llvm-exegesis captures and what Intel AoM / Agner reports as well.	2021-10-17 18:13:15 +01:00
Simon Pilgrim	5ed5df4802	[X86][SLM] Fix uops for PCMPISTR/PCMPISTR instructions Based off a recent llvm-exegesis capture and what Intel AoM / Agner reports as well.	2021-10-17 18:13:14 +01:00
Simon Pilgrim	680afaaa5d	[X86][SLM] Fix uops for PCLMULQDQ Based off a recent llvm-exegesis capture and what Intel AoM / Agner reports as well.	2021-10-17 18:13:14 +01:00
Simon Pilgrim	498c7236bc	[X86][SLM] +1uop for PSHUFBrm xmm Extra 1uop for folded pshufb ops, based off a recent llvm-exegesis capture and what Intel AoM / Agner reports as well.	2021-10-17 18:13:14 +01:00
Roman Lebedev	91373bf12e	[X86][Costmodel] Load/store i64 Stride=4 VF=16 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/9bnKrefcG - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0` So could pick cost of `40` For store we have: https://godbolt.org/z/5s3s14dEY - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0` So we could pick cost of `40`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111945	2021-10-17 17:28:10 +03:00
Roman Lebedev	3274ce3a28	[X86][Costmodel] Load/store i64 Stride=2 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/MTaKboejM - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=16.0` So could pick cost of `32` For store we have: https://godbolt.org/z/v7xPj3Wd4 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=32.0` So we could pick cost of `32`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111944	2021-10-17 17:28:10 +03:00
Roman Lebedev	3a6a9f74d3	[X86][Costmodel] Load/store i32 Stride=4 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/11rcvdreP - for intels `Block RThroughput: <=68.0`; for ryzens, `Block RThroughput: <=48.0` So could pick cost of `68` For store we have: https://godbolt.org/z/6aM11fWcP - for intels `Block RThroughput: <=64.0`; for ryzens, `Block RThroughput: <=32.0` So we could pick cost of `64`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111943	2021-10-17 17:28:09 +03:00
Roman Lebedev	4b76a74b42	[X86][Costmodel] Load/store i32 Stride=3 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/s5b6E6jsP - for intels `Block RThroughput: <=32.0`; for ryzens, `Block RThroughput: <=24.0` So could pick cost of `32` For store we have: https://godbolt.org/z/efh99d93b - for intels `Block RThroughput: <=48.0`; for ryzens, `Block RThroughput: <=32.0` So we could pick cost of `48`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111942	2021-10-17 17:28:09 +03:00
Roman Lebedev	887acf6842	[X86][Costmodel] Load/store i16 Stride=6 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/YTeT9M7fW - for intels `Block RThroughput: <=212.0`; for ryzens, `Block RThroughput: <=64.0` So could pick cost of `212` For store we have: https://godbolt.org/z/vc954KEGP - for intels `Block RThroughput: <=90.0`; for ryzens, `Block RThroughput: <=24.0` So we could pick cost of `90`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111940	2021-10-17 17:28:09 +03:00
David Truby	2e0fb007d6	[llvm][AArch64][SVE] Fold literals into math instructions SVE has predicated literal forms of some instructions for specific literals, which currently are generated correctly when using ACLE but not when those instructions are generated directly. This adds the patterns to generate those instructions when generating from standard LLVM IR instructions. Differential Revision: https://reviews.llvm.org/D99074	2021-10-17 10:57:04 +00:00
Kito Cheng	ff13189c5d	[RISCV] Unify the arch string parsing logic to to RISCVISAInfo. How many place you need to modify when implementing a new extension for RISC-V? At least 7 places as I know: - Add new SubtargetFeature at RISCV.td - -march parser in RISCV.cpp - RISCVTargetInfo::initFeatureMap@RISCV.cpp for handling feature vector. - RISCVTargetInfo::getTargetDefines@RISCV.cpp for pre-define marco. - Arch string parser for ELF attribute in RISCVAsmParser.cpp - ELF attribute emittion in RISCVAsmParser.cpp, and make sure it's in canonical order... - ELF attribute emittion in RISCVTargetStreamer.cpp, and again, must in canonical order... And now, this patch provide an unified infrastructure for handling (almost) everything of RISC-V arch string. After this patch, you only need to update 2 places for implement an extension for RISC-V: - Add new SubtargetFeature at RISCV.td, hmmm, it's hard to avoid. - Add new entry to RISCVSupportedExtension@RISCVISAInfo.cpp or SupportedExperimentalExtensions@RISCVISAInfo.cpp . Most codes are come from existing -march parser, but with few new feature/bug fixes: - Accept version for -march, e.g. -march=rv32i2p0. - Reject version info with `p` but without minor version number like `rv32i2p`. Differential Revision: https://reviews.llvm.org/D105168	2021-10-17 16:25:23 +08:00
Ben Shi	d0dbc991c0	Revert "[AArch64] Optimize add/sub with immediate" This reverts commit `9bf6bef995`.	2021-10-16 22:17:18 +00:00
Craig Topper	beb7862db5	[X86] Add DAG combine for negation of CMOV absolute value pattern. This patch detects the absolute value pattern on the RHS of a subtract. If we find it we swap the CMOV true/false values and replace the subtract with an ADD. There may be a more generic way to do this, but I'm not sure. Targets that don't have legal or custom ISD::ABS use a generic expand in DAG combiner already when it sees (neg (abs(x))). I haven't checked what happens if the neg is a more general subtract. Fixes PR50991 for X86. Reviewed By: RKSimon, spatel Differential Revision: https://reviews.llvm.org/D111858	2021-10-16 13:31:43 -07:00
Simon Pilgrim	85b87179f4	[TTI][X86] Add v8i16 -> 2 x v4i16 stride 2 interleaved load costs Split SSE2 and SSSE3 costs to correctly handle PSHUFB lowering - as was noted on D111938	2021-10-16 17:28:07 +01:00
Simon Pilgrim	6ec644e215	[TTI][X86] Add SSE2 sub-128bit vXi16/32 and v2i64 stride 2 interleaved load costs These cases use the same codegen as AVX2 (pshuflw/pshufd) for the sub-128bit vector deinterleaving, and unpcklqdq for v2i64. It's going to take a while to add full interleaved cost coverage, but since these are the same for SSE2 -> AVX2 it should be an easy win. Fixes PR47437 Differential Revision: https://reviews.llvm.org/D111938	2021-10-16 16:21:45 +01:00
Roman Lebedev	d137f1288e	[X86][LV] X86 does not prefer vectorized addressing And another attempt to start untangling this ball of threads around gather. There's `TTI::prefersVectorizedAddressing()`hoop, which confusingly defaults to `true`, which tells LV to try to vectorize the addresses that lead to loads, but X86 generally can not deal with vectors of addresses, the only instructions that support that are GATHER/SCATTER, but even those aren't available until AVX2, and aren't really usable until AVX512. This specializes the hook for X86, to return true only if we have AVX512 or AVX2 w/ fast gather. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111546	2021-10-16 12:32:18 +03:00
Ben Shi	9bf6bef995	[AArch64] Optimize add/sub with immediate Optimize ([add\|sub] r, imm) -> ([ADD\|SUB] ([ADD\|SUB] r, #imm0, lsl #12), #imm1), if imm == (imm0<<12)+imm1. and both imm0 and imm1 are non-zero 12-bit unsigned integers. Optimize ([add\|sub] r, imm) -> ([SUB\|ADD] ([SUB\|ADD] r, #imm0, lsl #12), #imm1), if imm == -(imm0<<12)-imm1, and both imm0 and imm1 are non-zero 12-bit unsigned integers. Reviewed By: jaykang10, dmgreen Differential Revision: https://reviews.llvm.org/D111034	2021-10-16 08:50:39 +00:00
Zhi An Ng	da07942834	[WebAssembly] Add prototype relaxed laneselect instructions Add i8x16, i16x8, i32x4, i64x2 laneselect instructions. These are only exposed as builtins, and require user opt-in.	2021-10-15 17:45:09 -07:00
Anshil Gandhi	1830ec94ac	Revert "[HIP] [AlwaysInliner] Disable AlwaysInliner to eliminate undefined symbols" This reverts commit `03375a3fb3`.	2021-10-15 16:16:18 -06:00
Anshil Gandhi	03375a3fb3	[HIP] [AlwaysInliner] Disable AlwaysInliner to eliminate undefined symbols By default clang emits complete contructors as alias of base constructors if they are the same. The backend is supposed to emit symbols for the alias, otherwise it causes undefined symbols. @yaxunl observed that this issue is related to the llvm options `-amdgpu-early-inline-all=true` and `-amdgpu-function-calls=false`. This issue is resolved by only inlining global values with internal linkage. The `getCalleeFunction()` in AMDGPUResourceUsageAnalysis also had to be extended to support aliases to functions. inline-calls.ll was corrected appropriately. Reviewed By: yaxunl, #amdgpu Differential Revision: https://reviews.llvm.org/D109707	2021-10-15 11:39:15 -06:00
Michael Liao	bacddf47a8	[amdgpu] Fix a crash case when preserving MDT in SILowerControlFlow - When a redundant MBB is being erased from MDT, check whether its single successor is dominiated by it. If yes, update that successor's idom before erasing MBB; otherwise, it implies MBB is a leaf node and could be erased directly. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D111831	2021-10-15 13:21:53 -04:00
Jonas Paulsson	ccbfcfda1e	[SystemZ] Handle huge immediates in SystemZInstrInfo::loadImmediate(). This is needed during isel pseudo expansion in order not to crash on huge immediates. Review: Ulrich Weigand	2021-10-15 19:08:45 +02:00
Jessica Paquette	59b94c4a60	NFC: Remove wayward MIR tests from lib/Target These were put in lib/Target instead of tests. Thankfully dupes of them already existed in the tests directory. So, just delete them.	2021-10-15 09:59:00 -07:00
Abinav Puthan Purayil	de3038400b	[AMDGPU] Avoid redundant calls to numBits in AMDGPUCodeGenPrepare::replaceMulWithMul24(). The isU24() and isI24() calls numBits to make its decision. This change replaces them with the internal numBits call so that we can use its result for the > 32 bit width cases. Differential Revision: https://reviews.llvm.org/D111864	2021-10-15 19:49:44 +05:30
Dávid Bolvanský	6678db00e6	[X86] Enable promotion of i16 popcnt (PR52056) Solves https://bugs.llvm.org/show_bug.cgi?id=52056 Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111507	2021-10-15 15:41:37 +02:00
Mubashar Ahmad	97809c828f	[AArch64]Enabling Cortex-A510 Support This patch enables support for Cortex-A510 CPUs. Reviewed By: MarkMurrayARM, dmgreen Differential Revision: https://reviews.llvm.org/D109825	2021-10-15 14:31:18 +01:00
Abinav Puthan Purayil	0379263f23	[AMDGPU] Fix width check for signed mul24 generation. This changes fixes a case in which the highest set bit of the original result is at bit 31 and sign-extending the mul24 for it would make the result negative. Differential Revision: https://reviews.llvm.org/D111823	2021-10-15 18:53:41 +05:30
David Green	fa1a68285e	[AArch64] Improve fptosi.sat vector lowering Similar to D111236, this improves the lowering of vector fptosi.sat and fptoui.sat, using legal converts and further saturating from there with min/max. f64 are excluded for the moment due to producing worse code in places compared to the unrolling. Differential Revision: https://reviews.llvm.org/D111787	2021-10-15 11:37:53 +01:00
David Green	a4f42a33be	[AArch64] Improve fptosi.sat lowering Improve the lowering of scalar fptosi.sat and fptoui.sat for saturating widths smaller than legal types by using the fact that the legal type will saturate under aarch64, and saturating the result further using min/max. Differential Revision: https://reviews.llvm.org/D111236	2021-10-15 11:12:23 +01:00
John Brawn	082fa56819	[ARM] Fix MOVCC peephole to not use an incorrect register class The MOVCC peephole eliminates a MOVCC by making one of its inputs a conditional instruction, but when doing this it should be using both inputs of the MOVCC to decide on the register class to use as otherwise we can get an error when using -verify-machineinstrs. Differential Revision: https://reviews.llvm.org/D111714	2021-10-15 10:54:26 +01:00
guopeilin	b092dc0bb9	[AArch64ISelLowering] Avoid duplane in some cases when sve enabled Reviewed By: david-arm, sdesmalen Differential Revision: https://reviews.llvm.org/D110524	2021-10-15 15:33:24 +08:00
Jim Lin	2ccdc7315e	[RISCV] Add invalid match case for uimm2, uimm3 and uimm7 Reviewed By: craig.topper, jrtc27 Differential Revision: https://reviews.llvm.org/D110308	2021-10-15 14:54:48 +08:00
Ben Shi	4fe5ab4b00	[RISCV] Optimize immediate materialisation with SHADD Use SH1ADD/SH2ADD/SH3ADD along with LUI+ADDI to compose int323, int325 and int329. Reviewed By: craig.topper, luismarques Differential Revision: https://reviews.llvm.org/D111484	2021-10-15 06:46:41 +00:00
Kazu Hirata	81e9c90686	[llvm] Use llvm::is_contained (NFC)	2021-10-14 22:44:09 -07:00
Qiu Chaofan	9e9b0f4621	[PowerPC] Support ppc-asm-full-reg-names for AIX Reviewed By: jsji Differential Revision: https://reviews.llvm.org/D94282	2021-10-15 12:22:44 +08:00
Roman Lebedev	3d7bf6625a	[X86][Costmodel] Improve cost modelling for not-fully-interleaved load While i've modelled most of the relevant tuples for AVX2, that only covered fully-interleaved groups. By definition, interleaving load of stride N means: load NVF elements, and shuffle them into N VF-sized vectors, with 0'th vector containing elements `[0, VF)stride + 0`, and 1'th vector containing elements `[0, VF)*stride + 1`. Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6) Now, not fully interleaved load, is when not all of these vectors is demanded. So at worst, we could just pretend that everything is demanded, and discard the non-demanded vectors. What this means is that the cost for not-fully-interleaved group should be not greater than the cost for the same fully-interleaved group, but perhaps somewhat less. Examples: https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4) https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2) https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1) As we have established over the course of last ~70 patches, (wow) `BaseT::getInterleavedMemoryOpCos()` is absolutely bogus, it is usually almost an order of magnitude overestimation, so i would claim that we should at least use the hardcoded costs of fully interleaved load groups. We could go further and adjust them e.g. by the number of demanded indices, but then i'm somewhat fearful of underestimating the cost. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111174	2021-10-14 23:14:36 +03:00
Craig Topper	79ae9562cc	[RISCV] Remove unused member variable. NFC	2021-10-14 12:56:47 -07:00
Craig Topper	3ff9cc01f2	[X86] Use CMOVNS for abs instead of CMOVGE. CMOVGE reads SF and OF. CMOVNS only reads SF. This matches with other recent changes to use a single flag where possible. It also matches gcc codegen. I believe this technically changes whether the conditioanl move happens on INT_MIN, but for INT_MIN both registers are the same so it doesn't matter. Differential Revision: https://reviews.llvm.org/D111826	2021-10-14 12:28:28 -07:00
Simon Pilgrim	871f773986	[TTI][X86] Merge getInterleavedMemoryOpCostAVX2 into getInterleavedMemoryOpCost. NFC This a NFC refactor patch to merge the AVX2 interleaved cost handling back into the getInterleavedMemoryOpCost base method - while getInterleavedMemoryOpCostAVX512 uses instruction and patterns very specific to AVX512+, much of the costs analysis for AVX2 can be reused for all SSE targets. This is the first step towards improving SSE and AVX1 costs that will reuse the relevant AVX2 costs by splitting some of the tables - for instance AVX1 has very similar costs for most vXi64/vXf64 interleave patterns and many sub-128bit vector costs are the same all the way down to SSE2 (or at least SSSE3). Differential Revision: https://reviews.llvm.org/D111822	2021-10-14 18:46:25 +01:00
Simon Pilgrim	fcbec7e668	[TTI][X86] Swap getInterleavedMemoryOpCostAVX2/getInterleavedMemoryOpCostAVX512 implementations. NFC. I have some upcoming refactoring for SSE/AVX1 interleaving cost support, and the diff is a lot nicer if the (unaltered) AVX512 implementation isn't stuck between getInterleavedMemoryOpCost and getInterleavedMemoryOpCostAVX2	2021-10-14 18:10:03 +01:00
Craig Topper	f7ba572483	[RISCV] Update Zba, Zbb, Zbc, and Zbs version from 0.93 to 1.0. I've removed the Zbs W instructions that are not part of the frozen spec. References to B as an extension name have been removed. Tests are updated or split accordingly. Reviewed By: luismarques Differential Revision: https://reviews.llvm.org/D110669	2021-10-14 09:25:03 -07:00
Brian Cain	743e263e08	[hexagon] Add system register, transfer support This commit adds the system reg/regpair definitions and the corresponding register transfer instructions.	2021-10-14 06:37:04 -07:00
Andrew Savonichev	51eefa8164	[NVPTX] Add VRFrame and VRFrameLocal to integer register classes These registers are used as operands for instructions that expect an integer register, so they should be added to Int32Regs or Int64Regs register classes. Otherwise the machine verifier emits an error for the following LIT tests when LLVM_ENABLE_MACHINE_VERIFIER=1 environment variable is set: * Bad machine code: Illegal physical register for instruction * - function: kernel_func - basic block: %bb.0 entry (0x55c8903d5438) - instruction: %3:int64regs = LEA_ADDRi64 $vrframelocal, 0 - operand 1: $vrframelocal $vrframelocal is not a Int64Regs register. CodeGen/NVPTX/call-with-alloca-buffer.ll CodeGen/NVPTX/disable-opt.ll CodeGen/NVPTX/lower-alloca.ll CodeGen/NVPTX/lower-args.ll CodeGen/NVPTX/param-align.ll CodeGen/NVPTX/reg-types.ll DebugInfo/NVPTX/dbg-declare-alloca.ll DebugInfo/NVPTX/dbg-value-const-byref.ll Differential Revision: https://reviews.llvm.org/D110164	2021-10-14 16:19:03 +03:00
Jonas Paulsson	c0d88613f2	[SystemZ] Remove some now unused ISD XXX_LOOP opcodes.	2021-10-14 14:55:44 +02:00
Andrew Savonichev	dc8a41de34	[ARM] Simplify address calculation for NEON load/store The patch attempts to optimize a sequence of SIMD loads from the same base pointer: %0 = gep float, float base, i32 4 %1 = bitcast float* %0 to <4 x float>* %2 = load <4 x float>, <4 x float>* %1 ... %n1 = gep float, float base, i32 N %n2 = bitcast float* %n1 to <4 x float>* %n3 = load <4 x float>, <4 x float>* %n2 For AArch64 the compiler generates a sequence of LDR Qt, [Xn, #16]. However, 32-bit NEON VLD1/VST1 lack the [Wn, #imm] addressing mode, so the address is computed before every ld/st instruction: add r2, r0, #32 add r0, r0, #16 vld1.32 {d18, d19}, [r2] vld1.32 {d22, d23}, [r0] This can be improved by computing address for the first load, and then using a post-indexed form of VLD1/VST1 to load the rest: add r0, r0, #16 vld1.32 {d18, d19}, [r0]! vld1.32 {d22, d23}, [r0] In order to do that, the patch adds more patterns to DAGCombine: - (load (add ptr inc1)) and (add ptr inc2) are now folded if inc1 and inc2 are constants. - (or ptr inc) is now recognized as a pointer increment if ptr is sufficiently aligned. In addition to that, we now search for all possible base updates and then pick the best one. Differential Revision: https://reviews.llvm.org/D108988	2021-10-14 15:23:10 +03:00
Simon Pilgrim	77dcdc2f50	[CostModel][X86] Pre-SSE41 targets can use PMADDWD for sext sub-i16 -> i32 Without SSE41 sext/zext instructions the extensions will be split, meaning that the MUL->PMADDWD fold will split the sext_i32(x) into zext_i32(sext_i16(x))	2021-10-14 12:17:40 +01:00
Jonas Paulsson	a33e4c8ae9	[SystemZ] Reapply memcmp and memcpy patches. This reverts `3562076` and includes some refactoring as well. Review: Ulrich Weigand Differential Revision: https://reviews.llvm.org/D111733	2021-10-14 10:37:33 +02:00
Jonas Paulsson	00baad35b2	[SystemZ] Bugfix and refactorization of mem-mem operations This patch fixes the bug that consisted of treating variable / immediate length mem operations (such as memcpy, memset, ...) differently. The variable length case needs to have the length minus 1 passed due to the use of EXRL target instructions. However, the DAGCombiner can convert a register length argument into a constant one, and whenever that happened one byte too little would end up being performed. This is also a refactorization by reducing the number of opcodes and variants involved. For any opcode (variable or constant length), only the length minus one is passed on to the ISD node. The rest of the logic is now instead handled during isel pseudo expansion. Review: Ulrich Weigand Differential Revision: https://reviews.llvm.org/D111729	2021-10-14 10:37:33 +02:00
Ben Shi	7e81526126	[RISCV] Optimize immediate materialisation with BSETI/BCLRI Opitimize immediate materialisation in the following way if profitable: 1. Use BCLRI for upper 32 bits if the lower 32 bits are negative int32. 2. Use BSETI for upper 32 bits if the lower 32 bits are positive int32. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D111508	2021-10-14 04:56:47 +00:00
Abinav Puthan Purayil	b3c9d84e5a	[AMDGPU] Fix 24-bit mul intrinsic generation for > 32-bit result. The 24-bit mul intrinsics yields the low-order 32 bits. We should only do the transformation if the operands are known to be not wider than 24 bits and the result is known to be not wider than 32 bits. Differential Revision: https://reviews.llvm.org/D111523	2021-10-14 09:00:19 +05:30
Ben Shi	481db13fec	[RISCV] Optimize immediate materialisation with SLLI.UW Use LUI+SLLI.UW to compose the upper bits instead of LUI+SLLI. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D111705	2021-10-14 02:24:50 +00:00
Roman Lebedev	18eef13dad	[X86][Costmodel] Fix `X86TTIImpl::getGSScalarCost()` `X86TTIImpl::getGSScalarCost()` has (at least) two issues: * it naively computes the cost of sequence of `insertelement`/`extractelement`. If we are operating not on the XMM (but YMM/ZMM), this widely overestimates the cost of subvector insertions/extractions. * Gather/scatter takes a vector of pointers, and scalarization results in us performing scalar memory operation for each of these pointers, but we never account for the cost of extracting these pointers out of the vector of pointers. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111222	2021-10-13 22:35:39 +03:00
Joe Nash	b44eac1b85	[AMDGPU] Remove unneeded emit literal check NFC. This check does not verify any functional property since size 8 was added. Remove it for simplicity. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D111737 Change-Id: Ifd7cbd324a137f939d8dc04acb8fbd54c9527a42	2021-10-13 12:46:22 -04:00
Kai Nacke	0a950a2e94	[SystemZ/z/OS] Implement save of non-volatile registers on z/OS XPLINK This PR implements the save of the XPLINK callee-saved registers on z/OS. Reviewed By: uweigand, Kai Differential Revision: https://reviews.llvm.org/D111653	2021-10-13 12:57:57 -04:00
Jay Foad	c885857e9d	[AMDGPU] Enable load clustering in the post-RA scheduler This has a couple of benefits: 1. It can sometimes fix clusters that got broken apart when the register allocator inserted a copy. 2. Post-RA scheduling does not have to worry about increasing register pressure, which in some cases gives it more freedom to reorder instructions. Testing on a collection of 10,000 graphics shaders compiled for gfx1010 showed: - The average length of each run of one or more load instructions increased by about 1%. - The number of runs of two or more load instructions increased by about 4%. Differential Revision: https://reviews.llvm.org/D111646	2021-10-13 17:12:26 +01:00
Kerry McLaughlin	1a2e90199f	[SVE][CodeGen] Add patterns for ADD/SUB + element count This patch adds patterns to match the following with INC/DEC: - @llvm.aarch64.sve.cnt[b\|h\|w\|d] intrinsics + ADD/SUB - vscale + ADD/SUB For some implementations of SVE, INC/DEC VL is not as cheap as ADD/SUB and so this behaviour is guarded by the "use-scalar-inc-vl" feature flag, which for SVE is off by default. There are no known issues with SVE2, so this feature is enabled by default when targeting SVE2. Reviewed By: david-arm Differential Revision: https://reviews.llvm.org/D111441	2021-10-13 11:36:15 +01:00
Simon Pilgrim	fb2539b9d8	[X86][SSE] Add X86ISD::AVG to isCommutativeBinOp to support folding shuffles through the binop	2021-10-13 10:54:37 +01:00
Heejin Ahn	9261ee32dc	[WebAssembly] Make EH work with dynamic linking This makes Wasm EH work with dynamic linking. So far we were only able to handle destructors, which do not use any tags or LSDA info. 1. This uses `TargetExternalSymbol` for `GCC_except_tableN` symbols, which points to the address of per-function LSDA info. It is more convenient to use than `MCSymbol` because it can take additional target flags. 2. When lowering `wasm_lsda` intrinsic, if PIC is enabled, make the symbol relative to `__memory_base` and generate the `add` node. If PIC is disabled, continue to use the absolute address. 3. Make tag symbols (`__cpp_exception` and `__c_longjmp`) undefined in the backend, because it is hard to make it work with dynamic linking's loading order. Instead, we make all tag symbols undefined in the LLVM backend and import it from JS. 4. Add support for undefined tags to the linker. Companion patches: - https://github.com/WebAssembly/binaryen/pull/4223 - https://github.com/emscripten-core/emscripten/pull/15266 Reviewed By: sbc100 Differential Revision: https://reviews.llvm.org/D111388	2021-10-12 23:28:27 -07:00
Ben Shi	787eeb8597	[RISCV] Optimize immediate materialisation with BCLRI Do the following optimization for immediate materialisation: 1. For values in range 0xffffffff 7fffffff ~ 0xffffffff 00000000, first generate the lower 32-bit with Val\|0x80000000 (which is expected be an int32), then emit (BCLRI r, 31). 2. For values in range 0x80000000 ~ 0xffffffff, first generate the lower 32-bit with Val&~0x80000000 (which is expected to be an int32), then emit (BSETI r, 31). Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D111532	2021-10-13 00:59:23 +00:00
Fangrui Song	c2d4fe51bb	[X86] Remove little support we had for MPX GCC 9.1 removed Intel MPX support. Linux kernel removed MPX in 2019. glibc 2.35 will remove MPX. Our support is limited: we support assembling of bndmov but not bnd. Just remove it. Reviewed By: pengfei, skan Differential Revision: https://reviews.llvm.org/D111517	2021-10-12 16:18:51 -07:00
Albion Fung	b4b9f9b4b3	[PowerPC] Emit dcbt and dcbtst in place of their extended mnemonics on AIX On AIX, the system assembler does not support the extended mnemonics dcbtt and dcbtstt. This patch stops them from being emitted on AIX and emits the base mnemonics instead, dcbt X, X, 16 and dcbtstt X, X, 16 respectively. Differential revision: https://reviews.llvm.org/D111258	2021-10-12 15:47:57 -05:00
Roman Lebedev	958da6598f	[X86] `detectAVGPattern()`: don't require zext in the with-constant case	2021-10-12 20:24:17 +03:00
Roman Lebedev	a1d57f75d1	[NFC][X86] `detectAVGPattern()`: rely on `AVGSplitter()` to perform truncation	2021-10-12 20:12:09 +03:00
Stanislav Mekhanoshin	9cf995be6b	[AMDGPU] Promote generic pointer kernel arguments into global The new pass walks kernel's pointer arguments, then loads from them. If a loaded value is a pointer and loaded pointer is unmodified in the kernel before the load, then promote loaded pointer to global. Then recursively continue. Differential Revision: https://reviews.llvm.org/D111464	2021-10-12 10:07:33 -07:00
Roman Lebedev	5f4f5da634	[X86] `detectAVGPattern()`: support basic case of PAVG chaining (PR52131) As noted in https://github.com/halide/Halide/pull/6302, we hilariously fail to match PAVG if we even as much as look at it the wrong way. In this particular case, the problem stems from the fact that `PAVG` root (def) is a `trunc`, and leafs (uses) are `zext`'s, and InstCombine really loves to get rid of both of these, for example replace them with a bit mask. So we may not have said `zext`. Instead of checking for that + type match, i think we should rely on the actual active type, as per the knownbits. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111571	2021-10-12 19:51:43 +03:00
Roman Lebedev	7964c3ed82	[X86] `detectAVGPattern()`: small preparatory NFC refactor	2021-10-12 19:51:43 +03:00
Jay Foad	66ce1015af	Revert "[AMDGPU] Enable load clustering in the post-RA scheduler" This reverts commit `66e13c7f43`. It was committed by accident.	2021-10-12 16:19:35 +01:00
Jay Foad	66e13c7f43	[AMDGPU] Enable load clustering in the post-RA scheduler This has a couple of benefits: 1. It can sometimes fix clusters that got broken apart when the register allocator inserted a copy. 2. Post-RA scheduling does not have to worry about increasing register pressure, which in some cases gives it more freedom to reorder instructions. Testing on a collection of 10,000 graphics shaders compiled for gfx1010 showed: - The average length of each run of one or more load instructions increased by about 1%. - The number of runs of two or more load instructions increased by about 4%.	2021-10-12 16:09:04 +01:00
Bradley Smith	2eb42e3d2a	[AArch64][SVE] Add fixed type lowering for EXTRACT_SUBVECTOR Depends on D111135 Differential Revision: https://reviews.llvm.org/D111165	2021-10-12 14:56:15 +00:00
Simon Pilgrim	61d124f7a2	[X86] Fix implicit MathsExtras.h header dependency	2021-10-12 13:37:16 +01:00
jacquesguan	0608bbd4e8	[RISCV] Rename assembler mnemonic of unordered floating-point reductions for v1.0-rc change Rename vfredsum and vfwredsum to vfredusum and vfwredusum. Add aliases for vfredsum and vfwredsum. Reviewed By: luismarques, HsiangKai, khchen, frasercrmck, kito-cheng, craig.topper Differential Revision: https://reviews.llvm.org/D105690	2021-10-12 06:46:46 +00:00
Yonghong Song	1321e47298	BPF: rename BTF_KIND_TAG to BTF_KIND_DECL_TAG Per discussion in https://reviews.llvm.org/D111199, the existing btf_tag attribute will be renamed to btf_decl_tag. This patch updated BTF backend to use btf_decl_tag attribute name and also renamed BTF_KIND_TAG to BTF_KIND_DECL_TAG. Differential Revision: https://reviews.llvm.org/D111592	2021-10-11 21:33:39 -07:00
hsmahesha	52cb3af08c	[AMDGPU] Remove dead frame indices after sgpr spill. All those frame indices which are dead after sgpr spill should be removed from the function frame. Othewise, there is a side effect such as re-mapping of free frame index ids by the later pass(es) like "stack slot coloring" which in turn could mess-up with the book keeping of "frame index to VGPR lane". Reviewed By: cdevadas Differential Revision: https://reviews.llvm.org/D111150	2021-10-12 09:58:49 +05:30
Freddy Ye	d57a87ea89	[X86][ISel] Lowering llvm.thread.pointer Reviewed By: pengfei Differential Revision: https://reviews.llvm.org/D110681	2021-10-12 11:01:18 +08:00
David Green	860b4479dc	[ARM] Be more explicit about disabling CombineBaseUpdate for MVE. This shouldn't be called for non-neon targets at the moment in either case, but it is good to be expliit about the CombineBaseUpdate being a NEON function, not expecting to be run under MVE.	2021-10-11 21:51:45 +01:00
Roman Lebedev	684cbae89a	[KnownBits] Introduce `countMaxActiveBits()` and use it in a few places	2021-10-11 23:36:06 +03:00
Jay Foad	2e1ad93201	[AMDGPU] Fix copying a machine operand Without this I get: * Bad machine code: Instruction has operand with wrong parent set * - function: available_externally_test - basic block: %bb.0 (0x7dad598) - instruction: %0:r600_treg32_x = MOV 1, 0, 0, 0, $alu_literal_x, 0, 0, 0, -1, 1, $pred_sel_off, @available_externally, 0 Differential Revision: https://reviews.llvm.org/D111549	2021-10-11 20:22:47 +01:00
Joe Nash	b4b7e605a6	[AMDGPU] Support shared literals in FMAMK/FMAAK These instructions should allow src0 to be a literal with the same value as the mandatory other literal. Enable it by introducing an operand that defers adding its value to the MI when decoding till the mandatory literal is parsed. Reviewed By: dp, foad Differential Revision: https://reviews.llvm.org/D111067 Change-Id: I22b0ae0d35bad17b6f976808e48bffe9a6af70b7	2021-10-11 13:09:54 -04:00
Victor Campos	3550e242fa	[Clang][ARM][AArch64] Add support for Armv9-A, Armv9.1-A and Armv9.2-A armv9-a, armv9.1-a and armv9.2-a can be targeted using the -march option both in ARM and AArch64. - Armv9-A maps to Armv8.5-A. - Armv9.1-A maps to Armv8.6-A. - Armv9.2-A maps to Armv8.7-A. - The SVE2 extension is enabled by default on these architectures. - The cryptographic extensions are disabled by default on these architectures. The Armv9-A architecture is described in the Arm® Architecture Reference Manual Supplement Armv9, for Armv9-A architecture profile (https://developer.arm.com/documentation/ddi0608/latest). Reviewed By: SjoerdMeijer Differential Revision: https://reviews.llvm.org/D109517	2021-10-11 17:44:09 +01:00
Simon Pilgrim	ad16c6e52f	[X86][AVX] Ensure we retain zero elements in select(pshufb,pshufb) -> or(pshufb,pshufb) fold (PR52122) The select(pshufb,pshufb) -> or(pshufb,pshufb) fold uses getConstVector to create the refreshed pshufb masks, which treats all negative indices as undef. PR52122 shows that if we were selecting an element that the PSHUFB has set to zero we must set it back to 0x80 when we recreate the PSHUFB mask and not just leave it as SM_SentinelZero	2021-10-11 14:50:24 +01:00
Bradley Smith	03065ecd85	[AArch64][SVE] Ensure LowerEXTRACT_SUBVECTOR is not called for illegal types The lowering for EXTRACT_SUBVECTOR should not be called during type legalization, only as part of lowering, hence return SDValue() when called on illegal types. This also adds missing tests for extracting fixed types from illegal scalable types. Differential Revision: https://reviews.llvm.org/D111412	2021-10-11 11:20:50 +00:00
Andrew Savonichev	7ae8f392a1	[AArch64] Emit AssertZExt for i1 arguments AAPCS requires i1 argument to be zero-extended to 8-bits by the caller. Emit a new AArch64ISD::ASSERT_ZEXT_BOOL hint (or AssertZExt for GlobalISel) to enable some optimization opportunities. In particular, when the argument is forwarded to the callee, we can avoid zero-extension and use it as-is. Differential Revision: https://reviews.llvm.org/D107160	2021-10-11 11:55:11 +03:00
David Sherwood	26b7d9d622	[LoopVectorize] Permit vectorisation of more select(cmp(), X, Y) reduction patterns This patch adds further support for vectorisation of loops that involve selecting an integer value based on a previous comparison. Consider the following C++ loop: int r = a; for (int i = 0; i < n; i++) { if (src[i] > 3) { r = b; } src[i] += 2; } We should be able to vectorise this loop because all we are doing is selecting between two states - 'a' and 'b' - both of which are loop invariant. This just involves building a vector of values that contain either 'a' or 'b', where the final reduced value will be 'b' if any lane contains 'b'. The IR generated by clang typically looks like this: %phi = phi i32 [ %a, %entry ], [ %phi.update, %for.body ] ... %pred = icmp ugt i32 %val, i32 3 %phi.update = select i1 %pred, i32 %b, i32 %phi We already detect min/max patterns, which also involve a select + cmp. However, with the min/max patterns we are selecting loaded values (and hence loop variant) in the loop. In addition we only support certain cmp predicates. This patch adds a new pattern matching function (isSelectCmpPattern) and new RecurKind enums - SelectICmp & SelectFCmp. We only support selecting values that are integer and loop invariant, however we can support any kind of compare - integer or float. Tests have been added here: Transforms/LoopVectorize/AArch64/sve-select-cmp.ll Transforms/LoopVectorize/select-cmp-predicated.ll Transforms/LoopVectorize/select-cmp.ll Differential Revision: https://reviews.llvm.org/D108136	2021-10-11 09:41:38 +01:00
Amara Emerson	f1e9ecea44	[AArch64][GlobalISel] Legalize G_VECREDUCE_XOR. Treated same as other bitwise reductions.	2021-10-10 17:01:21 -07:00
David Green	adec922361	[AArch64] Make -mcpu=generic schedule for an in-order core We would like to start pushing -mcpu=generic towards enabling the set of features that improves performance for some CPUs, without hurting any others. A blend of the performance options hopefully beneficial to all CPUs. The largest part of that is enabling in-order scheduling using the Cortex-A55 schedule model. This is similar to the Arm backend change from `eecb353d0e` which made -mcpu=generic perform in-order scheduling using the cortex-a8 schedule model. The idea is that in-order cpu's require the most help in instruction scheduling, whereas out-of-order cpus can for the most part out-of-order schedule around different codegen. Our benchmarking suggests that hypothesis holds. When running on an in-order core this improved performance by 3.8% geomean on a set of DSP workloads, 2% geomean on some other embedded benchmark and between 1% and 1.8% on a set of singlecore and multicore workloads, all running on a Cortex-A55 cluster. On an out-of-order cpu the results are a lot more noisy but show flat performance or an improvement. On the set of DSP and embedded benchmarks, run on a Cortex-A78 there was a very noisy 1% speed improvement. Using the most detailed results I could find, SPEC2006 runs on a Neoverse N1 show a small increase in instruction count (+0.127%), but a decrease in cycle counts (-0.155%, on average). The instruction count is very low noise, the cycle count is more noisy with a 0.15% decrease not being significant. SPEC2k17 shows a small decrease (-0.2%) in instruction count leading to a -0.296% decrease in cycle count. These results are within noise margins but tend to show a small improvement in general. When specifying an Apple target, clang will set "-target-cpu apple-a7" on the command line, so should not be affected by this change when running from clang. This also doesn't enable more runtime unrolling like -mcpu=cortex-a55 does, only changing the schedule used. A lot of existing tests have updated. This is a summary of the important differences: - Most changes are the same instructions in a different order. - Sometimes this leads to very minor inefficiencies, such as requiring an extra mov to move variables into r0/v0 for the return value of a test function. - misched-fusion.ll was no longer fusing the pairs of instructions it should, as per D110561. I've changed the schedule used in the test for now. - neon-mla-mls.ll now uses "mul; sub" as opposed to "neg; mla" due to the different latencies. This seems fine to me. - Some SVE tests do not always remove movprfx where they did before due to different register allocation giving different destructive forms. - The tests argument-blocks-array-of-struct.ll and arm64-windows-calls.ll produce two LDR where they previously produced an LDP due to store-pair-suppress kicking in. - arm64-ldp.ll and arm64-neon-copy.ll are missing pre/postinc on LPD. - Some tests such as arm64-neon-mul-div.ll and ragreedy-local-interval-cost.ll have more, less or just different spilling. - In aarch64_generated_funcs.ll.generated.expected one part of the function is no longer outlined. Interestingly if I switch this to use any other scheduled even less is outlined. Some of these are expected to happen, such as differences in outlining or register spilling. There will be places where these result in worse codegen, places where they are better, with the SPEC instruction counts suggesting it is not a decrease overall, on average. Differential Revision: https://reviews.llvm.org/D110830	2021-10-09 15:58:31 +01:00
Arthur Eubanks	a0a4935182	Make more places that use alignment use uint64_t Followup to D110451.	2021-10-08 16:35:19 -07:00
Reid Kleckner	b3a6d096d7	Fix shlib builds for all lib/Target/*/TargetInfo libs They all must depend on MC now that the target registry is in MC. Also fix llvm-cxxdump	2021-10-08 15:21:13 -07:00
Reid Kleckner	2827b1b89d	Fix shared library build after TargetRegistry move	2021-10-08 15:06:03 -07:00
Reid Kleckner	89b57061f7	Move TargetRegistry.(h\|cpp) from Support to MC This moves the registry higher in the LLVM library dependency stack. Every client of the target registry needs to link against MC anyway to actually use the target, so we might as well move this out of Support. This allows us to ensure that Support doesn't have includes from MC/*. Differential Revision: https://reviews.llvm.org/D111454	2021-10-08 14:51:48 -07:00
Leonard Chan	4dc462b589	[AArch64] Emit CFI instruction for updating x18 when using ShadowCallStack with exception unwinding PR45875 notes an instance where exception handling crashes on aarch64-fuchsia where SCS is enabled by default. The underlying issue seems to be that within libunwind, various _Unwind_* functions, the x18 register is not updated if a function is marked with nounwind. This removes the check for nounwind and emits the CFI instruction that updates x18. Differential Revision: https://reviews.llvm.org/D79822	2021-10-08 14:20:26 -07:00
David Stuttard	69f7d81d0a	[AMDGPU] Set number vgprs used in PS shaders based on input registers actually used For PS shaders we can use the input SPI_PS_INPUT_ENA and SPI_PS_INPUT_ADDR registers Calculate the number of VGPR registers used as input VGPRs based on these registers rather than the arguments passed in (this conservatively always allocates the maximum). Differential Revision: https://reviews.llvm.org/D101633 Change-Id: Idf7c060cbbd5f7e3300102c55ecee3c07f209de6	2021-10-08 14:24:35 +01:00
Jingu Kang	30caca39f4	Third Recommit "[AArch64] Split bitmask immediate of bitwise AND operation" This reverts the revert commit `fc36fb4d23` with bug fixes. Differential Revision: https://reviews.llvm.org/D109963	2021-10-08 11:28:49 +01:00
Craig Topper	f2ad8c9dc6	[RISCV] Remove experimental-b extension that includes all Zb* extensions At this point it looks like a B extension will never exist. Instead Zba, Zbb, Zbc, and Zbs are individual extensions being ratified together as a package. Unknown at this time when or if the other Zb* extensions will be ratified. This patch removes references to the B extension. I've updated and split tests accordingly. This has been split from D110669 to make review a little easier. Differential Revision: https://reviews.llvm.org/D111338	2021-10-07 20:47:17 -07:00
Wang, Pengfei	c236883b6b	[X86] Optimize fdiv with reciprocal instructions for half type Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D110557	2021-10-08 09:41:13 +08:00
Jay Foad	e996cf7dce	[AMDGPU] Preserve MachineDominatorTree in SILowerControlFlow Updating the MachineDominatorTree is easy since SILowerControlFlow only splits and removes basic blocks. This should save a bit of compile time because previously we would recompute the dominator tree from scratch after this pass. Another reason for doing this is that SILowerControlFlow preserves LiveIntervals which transitively requires MachineDominatorTree. I think that means that SILowerControlFlow is obliged to preserve MachineDominatorTree too as explained here: https://lists.llvm.org/pipermail/llvm-dev/2020-November/146923.html although it does not seem to have caused any problems in practice yet. Differential Revision: https://reviews.llvm.org/D111313	2021-10-07 21:30:26 +01:00
Mark Schimmel	417f8ea4ba	[ARC] ARCRegisterInfo cleanup prior to adding core register pairs (ARC32) and 64-bit core registers (ARC64) Differential Revision: https://reviews.llvm.org/D11108	2021-10-07 13:01:49 -07:00
Jay Foad	5b8befdd02	[X86] Special-case ADD of two identical registers in convertToThreeAddress X86InstrInfo::convertToThreeAddress would convert this: %1:gr32 = ADD32rr killed %0:gr32(tied-def 0), %0:gr32, implicit-def dead $eflags to this: undef %2.sub_32bit:gr64 = COPY killed %0:gr32 undef %3.sub_32bit:gr64_nosp = COPY %0:gr32 %1:gr32 = LEA64_32r killed %2:gr64, 1, killed %3:gr64_nosp, 0, $noreg Note that in the ADD32rr, %0 was used twice and the first use had a kill flag, which is what MachineInstr::addRegisterKilled does. In the converted code, each use of %0 is copied to a new reg, and the first COPY inherits the kill flag from the ADD32rr. This causes machine verification to fail (if you force it to run after TwoAddressInstructionPass) because the second COPY uses %0 after it is killed. Note that machine verification is currently disabled after TwoAddressInstructionPass but this is a step towards being able to enable it. Fix this by not inserting more than one COPY from the same source register. Differential Revision: https://reviews.llvm.org/D110829	2021-10-07 19:50:27 +01:00
Craig Topper	c4803bd416	[RISCV] Handle vector of pointer in getTgtMemIntrinsic for strided load/store. getScalarSizeInBits() doesn't work if the scalar type is a pointer. For that we need to go through DataLayout.	2021-10-07 10:11:56 -07:00
Mark Schimmel	20c074ee96	C] Add option to ARCOptAddrMode to disable the pass and diagnose errors Fixed formatting issues reported by clang-format Differential Revision: https://reviews.llvm.org/D111255	2021-10-07 09:07:07 -07:00
Bradley Smith	5be266db7a	[AArch64][SVE] Improve VECTOR_SPLICE codegen for VL > 128-bit Differential Revision: https://reviews.llvm.org/D111135	2021-10-07 15:28:55 +00:00
Jack Andersen	bd4dad87f4	[MachineInstr] Move MIParser's DBG_VALUE RegState::Debug invariant into MachineInstr::addOperand Based on the reasoning of D53903, register operands of DBG_VALUE are invariably treated as RegState::Debug operands. This change enforces this invariant as part of MachineInstr::addOperand so that all passes emit this flag consistently. RegState::Debug is inconsistently set on DBG_VALUE registers throughout LLVM. This runs the risk of a filtering iterator like MachineRegisterInfo::reg_nodbg_iterator to process these operands erroneously when not parsed from MIR sources. This issue was observed in the development of the llvm-mos fork which adds a backend that relies on physical register operands much more than existing targets. Physical RegUnit 0 has the same numeric encoding as $noreg (indicating an undef for DBG_VALUE). Allowing debug operands into the machine scheduler correlates $noreg with RegUnit 0 (i.e. a collision of register numbers with different zero semantics). Eventually, this causes an assert where DBG_VALUE instructions are prohibited from participating in live register ranges. Reviewed By: MatzeB, StephenTozer Differential Revision: https://reviews.llvm.org/D110105	2021-10-07 16:08:52 +01:00
Michael Forster	53801a59eb	Fix two unused-variable warnings.	2021-10-07 15:17:58 +02:00
Chen Zheng	1bf05fbc98	[PowerPC] refactor rewriteLoadStores for reusing; nfc This is split from https://reviews.llvm.org/D108750. Refactor rewriteLoadStores() so that we can reuse the outlined functions. Reviewed By: jsji Differential Revision: https://reviews.llvm.org/D110314	2021-10-07 12:59:20 +00:00
David Green	73346f5848	[ARM] Introduce a MQPRCopy Currently when creating tail predicated loops, we need to validate that all the live-outs of a loop will be equivalent with and without tail predication, and if they are not we cannot legally create a tail-predicated loop, leaving expensive vctp and vpst instructions in the loop. These notably can include register-allocation instructions like stack loads and stores, and copys lowered from COPYs to MVE_VORRs. Instead of trying to prove this is valid late in the pipeline, this patch introduces a MQPRCopy pseudo instruction that COPY is lowered to. This can then either be converted to a MVE_VORR where possible, or to a couple of VMOVD instructions if not. This way they do not behave differently within and outside of tail-predications regions, and we can know by construction that they are always valid. The idea is that we can do the same with stack load and stores, converting them to VLDR/VSTR or VLDM/VSTM where required to prove tail predication is always valid. This does unfortunately mean inserting multiple VMOVD instructions, instead of a single MVE_VORR, but my experiments show it to be an improvement in general. Differential Revision: https://reviews.llvm.org/D111048	2021-10-07 12:52:12 +01:00
Cullen Rhodes	14cb138b15	[AArch64][SME] Update DUP (predicate) instruction Changes in architecture revision 00eac1: * Renamed to PSEL. * Copies whole source register. * Element type suffix removed from destination. * Element index no longer optional and '#' prefix has been removed. The reference can be found here: https://developer.arm.com/documentation/ddi0602/2021-09 Depends on D111212. Reviewed By: kmclaughlin Differential Revision: https://reviews.llvm.org/D111213	2021-10-07 08:55:11 +00:00
Cullen Rhodes	42ba79b7b0	[AArch64][SME] Update tile slice index offset Changes in architecture revision 00eac1: * Tile slice index offset no longer prefixed with '#'. * The syntax for 128-bit (.Q) ZA tile slice accesses must now include an explicit zero index. The reference can be found here: https://developer.arm.com/documentation/ddi0602/2021-09 Reviewed By: david-arm Differential Revision: https://reviews.llvm.org/D111212	2021-10-07 08:55:10 +00:00
Itay Bookstein	40ec1c0f16	[IR][NFC] Rename getBaseObject to getAliaseeObject To better reflect the meaning of the now-disambiguated {GlobalValue, GlobalAlias}::getBaseObject after breaking off GlobalIFunc::getResolverFunction (D109792), the function is renamed to getAliaseeObject.	2021-10-06 19:33:10 -07:00
Craig Topper	58b68e70eb	[X86] Don't use popcnt for parity if only bits 7:0 of the input can be non-zero. Without popcnt we had a special case for using the parity flag from a single test i8 test instruction if only bits 7:0 could be non-zero. That special case is still useful when we have popcnt. To reach this special case, we enable custom lowering of parity for i16/i32/i64 even when popcnt is enabled. The check for POPCNT being enabled is now after the special case in LowerPARITY. Fixes PR52093 Differential Revision: https://reviews.llvm.org/D111249	2021-10-06 14:39:57 -07:00
Wolfgang Pieb	f53d05135e	[UBSAN][PS4] For the PS4 target, emit the ud2 ocpode for ubsan traps. Reviewed By: probinson Differential Revision: https://reviews.llvm.org/D111172	2021-10-06 11:49:36 -07:00
Stefan Pintilie	740086596c	[PowerPC] Fix issue with lowering byval parameters. Lowering of byval parameters with sizes that are not represented by a single store require multiple stores to properly address the correct size of the parameter. Sizes that cannot be done with a single store are 3 bytes, 5 bytes, 6 bytes, 7 bytes. It is not correct to simply perform an 8 byte store and for these elements because then the store would be larger than the element and alias analysis would assume that this is undefined behaivour and return NoAlias for them. This patch adds the correct stores so that the size of the store is not larger than the size of the element. Reviewed By: nemanjai, #powerpc Differential Revision: https://reviews.llvm.org/D108795	2021-10-06 13:19:15 -05:00
Shivam Gupta	7a189333ed	[NFC] Add doxygen comment for hasFp in RISCVFrameLowering.cpp	2021-10-06 22:35:28 +05:30
Pengxuan Zheng	b0045f5595	[ARM] Fix a bug in finding a pair of extracts to create VMOVRRD D100244 missed a check on the ResNo of the extract's operand 0 when finding a pair of extracts to combine into a VMOVRRD (extract(x, n); extract(x, n+1) -> VMOVRRD(extract x, n/2)). As a result, it can incorrectly pair an extract(x, n) with another extract(x:3, n+1) for example. This patch fixes the bug by adding the proper check on ResNo. Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D111188	2021-10-06 10:03:32 -07:00
Simon Pilgrim	21661607ca	[llvm] Replace report_fatal_error(std::string) uses with report_fatal_error(Twine) As described on D111049, we're trying to remove the <string> dependency from error handling and replace uses of report_fatal_error(const std::string&) with the Twine() variant which can be forward declared.	2021-10-06 12:04:30 +01:00
Nathan Sidwell	c11e7b59d2	[X86][NFC] structure-return simplificiation The X86 backend only needs to know whether structure return is via an sret pointer. This removes the categorization enumeration and adjusts, templatizes and renames the related functions. Differential Revision: https://reviews.llvm.org/D109966	2021-10-06 03:12:48 -07:00
Simon Pilgrim	0776924a17	[CostModel][X86] getCmpSelInstrCost - treat BAD_PREDICATEs the same as the worst case cost predicates for ICMP/FCMP instructions As suggested on D111024, we should treat getCmpSelInstrCost calls without a specific predicate as matching the worst case predicate cost. These regressions will be addressed with a mixture of D111024 and fixing other specific getCmpSelInstrCost calls to have realistic predicates.	2021-10-06 10:14:56 +01:00
Jonas Paulsson	3562076dfc	[SystemZ] Temporarily revert memcmp and memcpy patches Seem to cause test failures in compiler-rt. Revert "[SystemZ] Implement memcmp of variable length with CLC." This reverts commit `7a4e9a0c73`. Revert "[SystemZ] Implement memcpy of variable length with MVC." This reverts commit `c6c13c58ee`.	2021-10-06 11:05:18 +02:00
David Spickett	fc36fb4d23	Revert "Second Recommit "[AArch64] Split bitmask immediate of bitwise AND operation"" This reverts commit `13f3c39f36`. Due to test failures in stage 2 clang tests on AArch64 bots.	2021-10-06 08:39:48 +00:00
Paulo Matos	0c7495848a	[WebAssembly] Fix call_indirect on funcrefs The currently implementation of funcrefs is broken since it is putting the funcref itself on the stack before the call_indirect. Instead what should be on the stack is the constant 0, which is the index at which we store the funcref in __funcref_call_table. Reviewed By: tlively Differential Revision: https://reviews.llvm.org/D111152	2021-10-06 10:11:53 +02:00
Paulo Matos	91fe069c35	[WebAssembly] De-duplicate WasmAddressSpace and refactor reftype predicates This is a non-functional change to remove the duplicate WasmAddressSpace enum and refactor reftype predicates by moving them to the Utilities source file. Reviewed By: tlively Differential Revision: https://reviews.llvm.org/D111144	2021-10-06 09:56:23 +02:00
Carl Ritson	adf7043a9f	[AMDGPU] Only remove branches in SIInstrInfo::removeBranch Without this change _term instructions can be removed during critical edge splitting. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D111126	2021-10-06 10:34:26 +09:00
Heejin Ahn	3ec1760d91	[WebAssembly] Remove WasmTagType This removes `WasmTagType`. `WasmTagType` contained an attribute and a signature index: ``` struct WasmTagType { uint8_t Attribute; uint32_t SigIndex; }; ``` Currently the attribute field is not used and reserved for future use, and always 0. And that this class contains `SigIndex` as its property is a little weird in the place, because the tag type's signature index is not an inherent property of a tag but rather a reference to another section that changes after linking. This makes tag handling in the linker also weird that tag-related methods are taking both `WasmTagType` and `WasmSignature` even though `WasmTagType` contains a signature index. This is because the signature index changes in linking so it doesn't have any info at this point. This instead moves `SigIndex` to `struct WasmTag` itself, as we did for `struct WasmFunction` in D111104. In this CL, in lib/MC and lib/Object, this now treats tag types in the same way as function types. Also in YAML, this removes `struct Tag`, because now it only contains the tag index. Also tags set `SigIndex` in `WasmImport` union, as functions do. I think this makes things simpler and makes tag handling more in line with function handling. These two shares similar properties in that both of them have signatures, but they are kind of nominal so having the same signature doesn't mean they are the same element. Also a drive-by fix: the reserved 'attirubute' part's encoding changed from uleb32 to uint8 a while ago. This was fixed in lib/MC and lib/Object but not in YAML. This doesn't change object files because the field's value is always 0 and its encoding is the same for the both encoding. This is effectively NFC; I didn't mark it as such just because it changed YAML test results. Reviewed By: sbc100, tlively Differential Revision: https://reviews.llvm.org/D111086	2021-10-05 17:11:22 -07:00
Simon Pilgrim	2e5daac217	[llvm] Update report_fatal_error calls from raw_string_ostream to use Twine(OS.str()) As described on D111049, we're trying to remove the <string> dependency from error handling and replace uses of report_fatal_error(const std::string&) with the Twine() variant which can be forward declared. We can use the raw_string_ostream::str() method to perform the implicit flush() and return a reference to the std::string container that we can then wrap inside Twine().	2021-10-05 18:42:12 +01:00
Jonas Paulsson	7a4e9a0c73	[SystemZ] Implement memcmp of variable length with CLC. Following the same pattern of memset/memcpy, this patch implements a variable length memcmp with a CLC loop followed by an EXRL instruction. Review: Ulrich Weigand Differential Revision: https://reviews.llvm.org/D107380	2021-10-05 18:20:36 +02:00
Jonas Paulsson	c6c13c58ee	[SystemZ] Implement memcpy of variable length with MVC. Instead of making a memcpy libcall, emit an MVC loop and an EXRL instruction the same way as is already done for memset 0. Review: Ulrich Weigand Differential Revision: https://reviews.llvm.org/D106874	2021-10-05 17:14:41 +02:00
Peter Waller	be26e6ff73	[AArch64][SVE] Remove redundant PTEST following PNEXT/PFIRST PNEXT and PFIRST set the NZCV flags, so the subsequent PTEST can be optimized away in AArch64InstrInfo::optimizePTestInstr. See-also: https://reviews.llvm.org/D93292 Differential Revision: https://reviews.llvm.org/D110177	2021-10-05 15:10:48 +00:00
Matthew Devereau	2ac1999937	[AArch64][SVE] Propagate math flags from intrinsics to instructions Retain floating-point math flags inside instCombineSVEVectorBinOp	2021-10-05 15:39:13 +01:00
Roman Lebedev	3f9b235482	[X86][Costmodel] Load/store i64/f64 Stride=6 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/1jfGddcre - for intels `Block RThroughput: =36.0`; for ryzens, `Block RThroughput: =12.0` So could pick cost of `36` For store we have: https://godbolt.org/z/ao9srMT8r - for intels `Block RThroughput: =30.0`; for ryzens, `Block RThroughput: =12.0` So we could pick cost of `30`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111094	2021-10-05 16:58:58 +03:00
Roman Lebedev	e2784c5d8c	[X86][Costmodel] Load/store i64/f64 Stride=6 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/rc8jYxW6M - for intels `Block RThroughput: =18.0`; for ryzens, `Block RThroughput: =6.0` So could pick cost of `18`. For store we have: https://godbolt.org/z/9PhPEr65G - for intels `Block RThroughput: =15.0`; for ryzens, `Block RThroughput: =6.0` So we could pick cost of `15`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111093	2021-10-05 16:58:58 +03:00
Roman Lebedev	3960693048	[X86][Costmodel] Load/store i64/f64 Stride=6 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/onese7rec - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =3.0` So could pick cost of `6`. For store we have: https://godbolt.org/z/bMd7dddnT - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=6.0` So we could pick cost of `8`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111092	2021-10-05 16:58:58 +03:00
Roman Lebedev	79d6d12d95	[X86][Costmodel] Load/store i32/f32 Stride=6 VF=16 interleaving costs This one required quite a bit of an assembly surgery, but i think it's in the right ballpark.. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/na97Kb96o - for intels `Block RThroughput: <=64.0`; for ryzens, `Block RThroughput: <=32.0` So could pick cost of `64`. For store we have: https://godbolt.org/z/GG1WeoKar - for intels `Block RThroughput: =66.0`; for ryzens, `Block RThroughput: <=27.5` So we could pick cost of `66`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111091	2021-10-05 16:58:58 +03:00
Roman Lebedev	2996a2b50f	[X86][Costmodel] Load/store i32/f32 Stride=6 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/jK85GWKaK - for intels `Block RThroughput: =31.0`; for ryzens, `Block RThroughput: <=17.0` So could pick cost of `31`. For store we have: https://godbolt.org/z/hPWWhEEf9 - for intels `Block RThroughput: =33.0`; for ryzens, `Block RThroughput: <=13.8` So we could pick cost of `33`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111089	2021-10-05 16:58:57 +03:00
Roman Lebedev	d51532d8aa	[X86][Costmodel] Load/store i32/f32 Stride=6 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/szEj1ceee - for intels `Block RThroughput: =15.0`; for ryzens, `Block RThroughput: <=8.8` So could pick cost of `15`. For store we have: https://godbolt.org/z/81bq4fTo1 - for intels `Block RThroughput: =12.0`; for ryzens, `Block RThroughput: <=10.0` So we could pick cost of `12`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111087	2021-10-05 16:58:57 +03:00
Roman Lebedev	764fd5f463	[X86][Costmodel] Load/store i32/f32 Stride=6 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/aec96Thee - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=3.3` So could pick cost of `6`. For store we have: https://godbolt.org/z/aec96Thee - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=3.0` So we could pick cost of `9`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111083	2021-10-05 16:58:57 +03:00
Roman Lebedev	c800119c46	[X86][Costmodel] Load/store i64/f64 Stride=4 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/3M3hbq7n8 - for intels `Block RThroughput: =20.0`; for ryzens, `Block RThroughput: =8.0` So could pick cost of `20`. For store we have: https://godbolt.org/z/zvnPYWTx7 - for intels `Block RThroughput: =20.0`; for ryzens, `Block RThroughput: =8.0` So we could pick cost of `20`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111076	2021-10-05 16:58:57 +03:00
Roman Lebedev	000ce0bfd5	[X86][Costmodel] Load/store i64/f64 Stride=4 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/MTKdzjvnr - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0` So could pick cost of `8`. For store we have: https://godbolt.org/z/cMYEvqoah - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0` So we could pick cost of `8`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111075	2021-10-05 16:58:57 +03:00
Roman Lebedev	dcc2b0d933	[X86][Costmodel] Load/store i64/f64 Stride=4 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/z197317d1 - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =2.0` So could pick cost of `6`. For store we have: https://godbolt.org/z/8dzszjf9q - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=4.0` So we could pick cost of `6`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111073	2021-10-05 16:58:57 +03:00
Roman Lebedev	7d91037fd2	[X86][Costmodel] Load/store i32/f32 Stride=4 VF=16 interleaving costs This one required quite a bit of assembly surgery, but the trend continues, so i think this is right. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/EKWdj8cKT - for intels `Block RThroughput: <=32.0`; for ryzens, `Block RThroughput: <=24.0` So could pick cost of `32`. For store we have: https://godbolt.org/z/zj4bb9P75 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=16.0` So we could pick cost of `32`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111064	2021-10-05 16:58:57 +03:00
Roman Lebedev	4aee1e5b93	[X86][Costmodel] Load/store i32/f32 Stride=4 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/a6rxMG6ec - for intels `Block RThroughput: =16.0`; for ryzens, `Block RThroughput: <=12.0` So could pick cost of `16`. For store we have: https://godbolt.org/z/ced1bdqc9 - for intels `Block RThroughput: =16.0`; for ryzens, `Block RThroughput: <=8.0` So we could pick cost of `16`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111063	2021-10-05 16:58:57 +03:00
Roman Lebedev	3c2e22b795	[X86][Costmodel] Load/store i32/f32 Stride=4 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/avq1oz98W - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: =4.0` So could pick cost of `8`. For store we have: https://godbolt.org/z/89PGMc1qs - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=6.0` So we could pick cost of `6`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111061	2021-10-05 16:58:57 +03:00
Roman Lebedev	b6234c1edf	[X86][Costmodel] Load/store i32/f32 Stride=4 VF=2 interleaving costs Finally, we are getting to the heavy-hitter stuff! The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/7crGWoar6 - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0` So could pick cost of `4`. For store we have: https://godbolt.org/z/T8aq3MszM - for intels `Block RThroughput: =5.0`; for ryzens, `Block RThroughput: <=2.0` So we could pick cost of `5`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111060	2021-10-05 16:58:56 +03:00
kpyzhov	095c48fdf3	[AMDGPU] Use "hostcall" module flag instead of searching for ockl_hostcall_internal() declaration. The current way to detect hostcalls by looking for "ockl_hostcall_internal()" function in the module seems to be not reliable enough. The LTO may rename the "ockl_hostcall_internal()" function when an application is compiled with "-fgpu-rdc", and MetadataStreamer pass to fail to detect hostcalls, therefore it does not set the "hidden_hostcall_buffer" kernel argument. This change adds a new module flag: hostcall that can be used to detect whether GPU functions use host calls for printf. Differential revision: https://reviews.llvm.org/D110337	2021-10-05 09:56:04 -04:00
Hsiangkai Wang	80a6456306	[RISCV] Update to vlm.v and vsm.v according to v1.0-rc1. vle1.v -> vlm.v vse1.v -> vsm.v Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D106044	2021-10-05 21:49:54 +08:00
Jay Foad	9ce4f37206	[AMDGPU][GlobalISel] Fix legalization of G_UMULH Scalarize before narrowing because the narrowing implementation does not work on vectors. This matches what we do for regular G_MUL. Differential Revision: https://reviews.llvm.org/D111129	2021-10-05 10:56:02 +01:00
Tim Northover	5f65ee260d	AArch64+GISel: legalize vector remainder operations.	2021-10-05 10:20:10 +01:00

... 4 5 6 7 8 ...

64968 Commits