Commit Graph

152419 Commits

Author SHA1 Message Date
Philip Reames 37ead201e6 [runtime-unroll] Use incrementing IVs instead of decrementing ones
This is one of those wonderful "in theory X doesn't matter, but in practice it does" changes. In this particular case, we shift the IVs inserted by the runtime unroller to clamp the iteration count of the loops* from decrementing to incrementing.

Why does this matter?  A couple of reasons:
* SCEV doesn't have a native subtract node.  Instead, every subtract (A - B) is represented as A + -1 * B, and any flags invalidated by that rewrite are dropped.  As a result, SCEV is slightly worse at reasoning about edge cases involving decrementing addrecs than incrementing ones.  (You can see this in the inferred flags in some of the test cases.)
* Other parts of the optimizer produce incrementing IVs, and they're common in idiomatic source code.  We do have support for reversing IVs, but in general if we produce one of each, the pair will persist surprisingly far through the optimizer before being coalesced.  (You can see this by looking at nearby phis in the test cases.)

Note that if the hardware prefers decrementing (i.e. zero tested) loops, LSR should convert back immediately before codegen.

* Mostly irrelevant detail: The main loop of the prolog case is handled independently and will simply use the original IV with a changed start value.  We could in theory use this scheme for all iteration clamping, but that's a larger and more invasive change.
2021-11-12 15:44:58 -08:00
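A hedged C sketch of the loop shape in question (illustrative only; the function and variable names are made up, and this is source-level pseudocode rather than the unroller's actual IR output). The remainder loop after the unrolled body is the one whose clamped trip-count IV now increments rather than decrements:

    /* Conceptual shape of a runtime-unrolled loop with unroll factor 4. */
    void saxpy(float *x, const float *y, float a, unsigned n) {
      unsigned i = 0;
      /* Unrolled main loop: runs while at least 4 iterations remain. */
      for (; n - i >= 4; i += 4) {
        x[i]     += a * y[i];
        x[i + 1] += a * y[i + 1];
        x[i + 2] += a * y[i + 2];
        x[i + 3] += a * y[i + 3];
      }
      /* Remainder loop, clamped to at most 3 iterations.  Previously its IV
         counted the remaining trip count down to zero; after this change it
         counts up, matching the IVs produced elsewhere in the optimizer. */
      for (; i < n; ++i)
        x[i] += a * y[i];
    }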
Craig Topper 02bed66cd5 [RISCV] Improve codegen for i32 udiv/urem by constant on RV64.
The division by constant optimization often produces constants that
are uimm32, but not simm32. These constants require 3 or 4 instructions
to materialize without Zba.

Since these constants are often used by a multiply with a LHS
that needs to be zero extended with an AND, we can switch the MUL
to a MULHU by shifting both inputs left by 32. Once we shift the
constant left, the upper 32 bits no longer need to be 0 so constant
materialization is free to use LUI+ADDIW. This reduces the constant
materialization from 4 instructions to 3 in some cases while also
reducing the zero extend of the LHS from 2 shifts to 1.

Differential Revision: https://reviews.llvm.org/D113805
2021-11-12 14:49:10 -08:00
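For context, a hedged scalar sketch of the division-by-constant rewrite this feeds into (the udiv5 helper and the magic constant 0xCCCCCCCD for a divide by 5 are illustrative, not taken from the patch): the unsigned divide becomes a widening multiply by a magic constant plus a shift, and such constants are typically uimm32 but not simm32, which is what makes them expensive to materialize on RV64 without Zba:

    #include <assert.h>
    #include <stdint.h>

    /* x / 5 for 32-bit unsigned x via multiply-high: the magic value
       0xCCCCCCCD fits in an unsigned 32-bit immediate but not in a
       sign-extended one. */
    static uint32_t udiv5(uint32_t x) {
      return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 34);
    }

    int main(void) {
      for (uint64_t x = 0; x <= 0xFFFFFFFFu; x += 12345)
        assert(udiv5((uint32_t)x) == (uint32_t)x / 5);
      return 0;
    }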
Philip Reames de2fed6152 [unroll] Keep unrolled iterations with initial iteration
The unrolling code was previously inserting new cloned blocks at the end of the function.  The result of this with typical loop structures is that the new iterations are placed far from the initial iteration.

With unrolling, the general assumption is that a) the loop is reasonably hot, and b) the first Count-1 copies of the loop are rarely (if ever) loop exiting.  As such, placing Count-1 copies out of line is a fairly poor code placement choice.  We'd much rather fall through into the hot (non-exiting) path.  For code with branch profiles, later layout would fix this, but this may have a positive impact on non-PGO compiled code.

However, the real motivation for this change isn't performance.  It's readability and human understanding.  Having to jump around long distances in an IR file to trace an unrolled loop structure is error prone and tedious.
2021-11-12 11:40:50 -08:00
Lang Hames 9d5e647428 [JITLink] Fix think-o in handwritten CWrapperFunctionResult -> Error converter.
We need to skip the length field when generating error strings.

No test case: This hand-hacked deserializer should be removed in the near future
once JITLink can use generic ORC APIs (including SPS and WrapperFunction).
2021-11-12 10:36:17 -08:00
Florian Hahn 03cfea68c6
[SCEV] Update SCEVLoopGuardRewriter to take SCEV -> SCEV map (NFC).
Split off refactoring from D113577 to reduce the diff. NFC as the new
interface will only be used in D113577.
2021-11-12 18:16:03 +00:00
Simon Pilgrim 6bb71738e2 [X86] convertShiftLeftToScale - improve vXi8 constant handling
Add support for v32i8/v64i8 converting shift-by-constant to multiply-by-constant. This helps us avoid the generic vXi8 shift lowering, and a lot of VPBLENDVB ops which can be particularly slow.

We also needed to reorder a few shift lowering patterns to prevent regressions, particularly for XOP+AVX2 (Excavator) targets (which can split to fast v16i8 shifts) and AVX512-BWI targets (which prefer to extend to fast v32i16 shifts).
2021-11-12 16:48:10 +00:00
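The per-lane identity behind the conversion, shown as a scalar C check (illustrative only; the commit applies it element-wise to v32i8/v64i8 vectors): a left shift by a constant equals a multiply by the corresponding power of two, which lets the lowering use a multiply-by-constant instead of the slow generic vXi8 shift path:

    #include <assert.h>
    #include <stdint.h>

    int main(void) {
      /* For every 8-bit lane value x and shift amount c,
         x << c and x * (1 << c) agree modulo 256. */
      for (int x = 0; x < 256; ++x)
        for (int c = 0; c < 8; ++c)
          assert((uint8_t)(x << c) == (uint8_t)(x * (1 << c)));
      return 0;
    }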
Joel E. Denny c9dfe322ee [OpenMP] Fix main thread barrier for Pascal and amdgpu
Fixes what's left of https://bugs.llvm.org/show_bug.cgi?id=51781.

Reviewed By: jdoerfert, JonChesterfield, tianshilei1992

Differential Revision: https://reviews.llvm.org/D113602
2021-11-12 11:18:45 -05:00
Jay Foad a70bbb5f7a [AMDGPU] Simplify 64-bit division/remainder expansion
The old expansion open-coded a 64-bit addition in a strange way, by
adding the high parts *without* carry-in from the low part, and then
adding the carry back in later on. Fixing this saves a couple of
instructions and makes the code much easier to understand.

Differential Revision: https://reviews.llvm.org/D113679
2021-11-12 15:48:41 +00:00
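For reference, a hedged C sketch of the arithmetic being cleaned up (names and test values are made up): a 64-bit add assembled from 32-bit halves must feed the carry from the low half into the high half; the old expansion added the high parts without that carry and patched it back in later:

    #include <assert.h>
    #include <stdint.h>

    /* 64-bit addition from 32-bit halves, with the low-half carry applied
       directly to the high half. */
    static uint64_t add64(uint32_t alo, uint32_t ahi, uint32_t blo, uint32_t bhi) {
      uint32_t lo = alo + blo;
      uint32_t carry = lo < alo;        /* unsigned overflow of the low half */
      uint32_t hi = ahi + bhi + carry;
      return ((uint64_t)hi << 32) | lo;
    }

    int main(void) {
      assert(add64(0xFFFFFFFFu, 0, 1, 0) == 0x100000000ull);
      assert(add64(5, 1, 7, 2) == 0x30000000Cull);
      return 0;
    }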
Kazu Hirata 99d5cbbd7e [CodeGen] Use SDNode::uses (NFC) 2021-11-12 07:33:29 -08:00
Kerry McLaughlin 7647822156 [AArch64][SVE] Remove i1 type from isElementTypeLegalForScalableVector
`collectElementTypesForWidening` collects the types of load, store and
reduction Phis in a loop. These types are later checked using
`isElementTypeLegalForScalableVector` to prevent vectorisation of
loops with instruction types that are unsupported.

This patch removes i1 from the list of types supported for scalable
vectors. This fixes an assert ("Cannot yet scalarize uniform stores") in
`setCostBasedWideningDecision` when we have a loop containing a uniform
i1 store and a scalable VF, which we cannot create a scatter for.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D113680
2021-11-12 14:24:38 +00:00
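A hedged C sketch of the kind of loop that could hit the assert (hypothetical function and variable names, not the reproducer from the patch): the store to 'ran' is a uniform i1 store, i.e. an invariant boolean value written to an invariant address on every iteration, which cannot be turned into a scatter at a scalable VF:

    #include <stdbool.h>

    void sum_and_flag(const int *a, long n, long *sum, bool *ran) {
      long s = 0;
      for (long i = 0; i < n; ++i) {
        *ran = true;   /* uniform i1 store: same value and address each iteration */
        s += a[i];
      }
      *sum = s;
    }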
Alexey Bataev 352c46e707 [SLP]Improve vectorization of split loads.
Fix the cost estimation for split loads: since we already look at the
subregs, there is no need to permute them; we just need to estimate the
subregister insert, if it is smaller than the real register. Also, with
split loads it might already be profitable to vectorize smaller trees
with gathering of the loads.

Differential Revision: https://reviews.llvm.org/D107188
2021-11-12 06:13:22 -08:00
Simon Pilgrim 59087dce3b [X86] combineX86ShufflesConstants - constant fold from target shuffles unless optsize = true
Currently we only constant fold target shuffles if any of the sources has one use, or it would remove a variable shuffle mask - the aim being to avoid constant pool bloat.

This patch proposes we should constant fold by default and only limit this if optsize is enabled - I've added a basic test for this in vector-mul.ll (the pmuludq case is by far the most common), I can add other specific test cases if people need them.

This should permit further constant folding, break some instruction dependencies and help reduce shuffle port pressure.

Differential Revision: https://reviews.llvm.org/D113748
2021-11-12 14:02:43 +00:00
Sanjay Patel bf5748a1af [x86] fold vector (X > -1) & Y to shift+andn
and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y

This avoids the -1 constant vector in favor of an arithmetic shift
instruction if it exists (the ISA is still not complete after all
these years...).

We catch this pattern late in combining by matching PCMPGT, so it
should not interfere with more general folds.

Differential Revision: https://reviews.llvm.org/D113603
2021-11-12 08:17:46 -05:00
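The scalar identity behind the fold, sketched as a C check (illustrative values; the commit applies the vector form late in DAG combining, and the sketch assumes arithmetic right shift of signed integers, as on mainstream compilers): a "greater than -1" compare yields an all-ones mask exactly when the sign bit is clear, so the mask is the bitwise-not of an arithmetic shift that smears the sign bit:

    #include <assert.h>
    #include <stdint.h>

    int main(void) {
      int32_t xs[] = {0, 1, -1, 123456, -123456, INT32_MAX, INT32_MIN};
      int32_t y = 0x5A5A5A5A;
      for (unsigned i = 0; i < sizeof xs / sizeof xs[0]; ++i) {
        int32_t x = xs[i];
        int32_t cmp_mask = (x > -1) ? -1 : 0;  /* pcmpgt X, -1 */
        int32_t shr_mask = ~(x >> 31);         /* ~(vsrai X, 31) */
        assert((cmp_mask & y) == (shr_mask & y));
      }
      return 0;
    }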
Florian Hahn 819bca9b90
[SCEV] Use APIntOps::umin to select best max BC count (NFC).
Suggested in D102267, but I missed this in the committed version.
2021-11-12 12:20:01 +00:00
Neubauer, Sebastian d1f45ed58f [AMDGPU][NFC] Fix typos
Differential Revision: https://reviews.llvm.org/D113672
2021-11-12 11:37:21 +01:00
Simon Moll 751aa6c280 [VE][NFCi] Remove unused tablegen parameters
TableGen has started warning about unused template parameters in the isel patterns.  Remove those.

Reviewed By: kaz7

Differential Revision: https://reviews.llvm.org/D113675
2021-11-12 08:19:50 +01:00
Markus Lavin 4e94e25c90 Fix minor deficiency in machine-sink.
Register uses that are MRI->isConstantPhysReg() should not inhibit
the sinking transformation.

Reviewed By: StephenTozer

Differential Revision: https://reviews.llvm.org/D111531
2021-11-12 08:01:13 +01:00
Kazu Hirata 2ca45adf24 [CodeGen, Target] Use MachineRegisterInfo::use_operands (NFC) 2021-11-11 22:28:55 -08:00
Serge Pavlov 3057e850b8 [X86] Preserve FPSW when popping x87 stack
When the compiler converts x87 operations to the stack model, it may
insert instructions that pop the top stack element. To do so, the
compiler inserts an FSTP instruction right after the instruction that
computes the value on the stack. This can break code that uses the
FPSW set by that last instruction. For example, an FXAM instruction is
usually followed by FNSTSW, but FSTP would be inserted after FXAM. As
FSTP leaves the condition code in FPSW undefined, the compiler
produces incorrect code.

With this change, FSTP is inserted after the FPSW consumer if the last
instruction sets FPSW.

Differential Revision: https://reviews.llvm.org/D113335
2021-11-12 12:00:09 +07:00
Luís Ferreira 665b4138d9 [DebugInfo] run clang-format on some unformatted files
This trivial patch runs clang-format on some unformatted files before
doing logic changes, and prevents hard-to-review diffs.

Differential Revision: https://reviews.llvm.org/D113572
2021-11-11 18:59:41 -08:00
Phoebe Wang 74b979abcd [X86][FP16] Avoid generating VZEXT_MOVL with i16
This fixes a crash due to missing VZEXT_MOVL support for i16.

Reviewed By: LuoYuanke, RKSimon

Differential Revision: https://reviews.llvm.org/D113661
2021-11-12 09:32:29 +08:00
Nikita Popov 986416251b [InstCombine] Drop redundant fold for and/or of icmp eq/ne (NFCI)
This handles a special case of foldAndOrOfICmpsUsingRanges()
with two equality predicates.
2021-11-11 20:25:40 +01:00
Min-Yih Hsu 99152a4164 [M68k][NFC] Rename 'GlSel' -> 'GISel'
AArch64 as well as other targets use the abbreviation "GISel", so we'd
better be consistent with them. NFC.
2021-11-11 11:01:09 -08:00
Simon Pilgrim 94a901a50a [X86] Move LowerFunnelShift below LowerShift. NFC.
Makes it easier to reuse the various vector shift helpers defined above LowerShift
2021-11-11 18:45:51 +00:00
Simon Pilgrim 010b09b0c5 [DAG] reassociateOpsCommutative - test getNode result directly. NFC
Matches the clean code style we use directly above
2021-11-11 18:45:50 +00:00
Mircea Trofin f64eee1625 [NFC][InlineAdvisor] Inform advisor when the module is invalidated
This avoids unnecessary re-calculation of module-wide features in the
MLInlineAdvisor. In cases where function passes don't invalidate
functions (and, thus, don't invalidate the module), but we re-process a
CGSCC, we currently refresh module features unnecessarily. The
overhead of fetching cached results (albeit not themselves
invalidated) was noticeable in certain modules' compilations.

We don't want to just invalidate the advisor object, though, via the
analysis manager, because we'd then need to re-create expensive state
(like the model evaluator in the ML 'development' mode).

Reviewed By: phosek

Differential Revision: https://reviews.llvm.org/D113644
2021-11-11 10:23:49 -08:00
Nikita Popov 84e273cced [InstCombine] Handle undefs in and of icmp eq zero fold
For the scalar/splat case, this fold is subsumed by
foldLogOpOfMaskedICmps(). However, the conjugated fold for "or"
also supports splats with undef. Make both code paths consistent
by using m_ZeroInt() for the "and" implementation as well.

https://alive2.llvm.org/ce/z/tN63cu
https://alive2.llvm.org/ce/z/ufB_Ue
2021-11-11 19:07:07 +01:00
Jordan Rupprecht da4822f6c8 [PowerPC][NFC] Ignore unused var in release builds.
Note we can't inline this call into the assert because `isIntS16Immediate` has a side effect. But we only look at the return value in asserts builds.
2021-11-11 08:57:40 -08:00
Craig Topper ee7a006ce4 [RISCV] Promote f16 ceil/floor/round/roundeven/nearbyint/rint/trunc intrinsics to f32 libcalls.
Previously these would crash. I don't think these can be generated
directly from C. Not sure if any optimizations can introduce them.

Reviewed By: asb

Differential Revision: https://reviews.llvm.org/D113527
2021-11-11 08:28:41 -08:00
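A hedged sketch of what the promotion amounts to, written as plain C (this is conceptual, not the actual libcall plumbing, and it assumes a compiler and target with _Float16 support): the f16 operation is performed in f32 via the existing libcall and the result is converted back:

    #include <math.h>

    static _Float16 ceil_f16(_Float16 x) {
      /* Promote to f32, use the f32 libcall, truncate back to f16. */
      return (_Float16)ceilf((float)x);
    }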
Victor Huang 18fe0a0d9e [PowerPC] PPC backend optimization to lower int_ppc_tdw/int_ppc_tw intrinsics to TDI/TWI machine instructions
This patch adds a backend optimization to match XL behavior for the two
builtins __tdw and __tw: when the second input argument is an immediate,
emit tdi/twi instructions instead of td/tw.

Reviewed By: nemanjai, amyk, PowerPC

Differential revision: https://reviews.llvm.org/D112285
2021-11-11 09:52:00 -06:00
Sanjay Patel 11522cfcad [DAGCombiner] add fold for vselect based on mask of signbit, part 3
(Cond0 s> -1) ? N1 : 0 --> ~(Cond0 s>> BW-1) & N1

https://alive2.llvm.org/ce/z/mGCBrd

This was suggested as a potential enhancement in D113212 (also 7e30404c3b).
There's an improvement for AArch64 that could be generalized (X > -1 --> X >= 0).
For x86, we have a counter-acting fold for most cases that turns the shift+not
back into a setcc, so that needs a work-around to get more cases to use "pandn":
D113603

Note that this pattern (and a previous one) are not currently canonical forms
in IR:
https://alive2.llvm.org/ce/z/e4o96b

Adding swapped variants is left as a TODO item here, but is planned as
a near-term follow-up patch.

Differential Revision: https://reviews.llvm.org/D113426
2021-11-11 10:27:37 -05:00
Jay Foad 417add4d4e [CodeGen] Tweak whitespace in LiveInterval printing
When printing a LiveInterval, tweak the use of single and double spaces
to try to make it clearer that the valnos are associated with the
preceding range or subrange, not the following subrange.

Compare the output before and after this patch:
%1 [32r,144r:0)  0@32r L000000000000000C [32r,144r:0)  0@32r L00000000000000F3 [32r,32d:0)  0@32r weight:0.000000e+00
%1 [32r,144r:0) 0@32r  L000000000000000C [32r,144r:0) 0@32r  L00000000000000F3 [32r,32d:0) 0@32r  weight:0.000000e+00

Differential Revision: https://reviews.llvm.org/D113671
2021-11-11 15:19:32 +00:00
Kazu Hirata ce227ce3b3 [CodeGen] Use MachineInstr::operands (NFC) 2021-11-11 07:10:30 -08:00
Jay Foad 491beae71d [TwoAddressInstruction] Update LiveIntervals after rewriting INSERT_SUBREG to COPY
Also add subranges to an existing live interval when introducing a new
subreg def.

Differential Revision: https://reviews.llvm.org/D113044
2021-11-11 12:24:59 +00:00
Jay Foad 6abbc3a420 [LiveIntervals] Update subranges in processTiedPairs
In TwoAddressInstructionPass::processTiedPairs when updating live
intervals after moving the last use of RegB back to the newly inserted
copy, update any affected subranges as well as the main range.

Differential Revision: https://reviews.llvm.org/D110411
2021-11-11 12:24:59 +00:00
Simon Pilgrim 82b74363a9 [DAG] reassociateOpsCommutative - peek through bitcasts to find constants
Now that FoldConstantArithmetic can fold bitcasted constants, we should peek through bitcasts of binop operands to try and find foldable constants
2021-11-11 12:00:22 +00:00
Simon Pilgrim 098ea29641 [DAG] FoldConstantArithmetic - fold intop(bitcast(buildvector(c1)),bitcast(buildvector(c2))) -> bitcast(intop(buildvector(c1'),buildvector(c2')))
Enable FoldConstantArithmetic to constant fold bitcasted constant build vectors. These have typically been bitcasted for type legalization purposes.

By extracting the raw constant bit data, performing the constant fold, and then casting the constant bit data back to the (legalized) type, we can perform constant folding on integer types after legalization.

This in particular helps 32-bit targets which need to handle vXi64 build vectors - during legalization the (unsupported) i64 elements are split to create a bitcasted v2Xi32 build vector.

Addresses some regressions in D113192.

Differential Revision: https://reviews.llvm.org/D113564
2021-11-11 11:35:18 +00:00
Florian Hahn c2ed9fd054
[AArch64] Use custom lowering for {U,S}INT_TO_FP with i8.
With fullfp16, it is cheaper to cast the {U,S}INT_TO_FP operand to i16
first, rather than promoting it to i32. The custom lowering for
{U,S}INT_TO_FP already supports that; it just needs to be used.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D113601
2021-11-11 08:47:15 +00:00
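A hypothetical source pattern of the kind this lowering affects (names invented for illustration; assumes _Float16 support in the compiler and target): an i8 value converted to half precision, where with +fullfp16 the operand only needs widening to i16 rather than i32 before the integer-to-FP convert:

    void i8_to_f16(const signed char *src, _Float16 *dst, int n) {
      for (int i = 0; i < n; ++i)
        dst[i] = (_Float16)src[i];
    }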
duanbo.db 53dc525828 [LoopInfo] Fix function getInductionVariable
The function gets the induction variable by checking whether StepInst
or IndVar in the phi statement is one of the operands of the latch
compare (CMP). But if LatchCmpOp0/LatchCmpOp1 is a constant, the
subsequent comparison may result in null == null, which is meaningless.
This patch fixes the issue.

Reviewed By: Whitney

Differential Revision: https://reviews.llvm.org/D112980
2021-11-11 16:22:42 +08:00
David Green 703ded8dda [AArch64] Allow FP16 vector fixed point converts
This extends performFpToIntCombine to work on FP16 vectors as well as
the f32 and f64 vectors it already supported.

Differential Revision: https://reviews.llvm.org/D113297
2021-11-11 07:32:52 +00:00
Bin Cheng bf76e64854 [BPI] Push exit block rather than exiting ones in getSccExitBlocks
The function BranchProbabilityInfo::SccInfo::getSccExitBlocks is
supposed to collect all exit blocks for the SCC rather than all exiting
blocks. This patch fixes the typo.

Reviewed By: ebrevnov

Differential Revision: https://reviews.llvm.org/D113344
2021-11-11 14:22:19 +08:00
Craig Topper 0963291991 [TypePromotion] Fix a hardcoded use of 32 as the size being promoted to.
At least I think that's what the 32 here is. Use RegisterBitWidth
instead.

While there, replace zext with zextOrSelf to simplify the code.

Reviewed By: samparker, dmgreen

Differential Revision: https://reviews.llvm.org/D113495
2021-11-10 22:12:39 -08:00
Kazu Hirata 642a361b7e [llvm] Use make_early_inc_range (NFC) 2021-11-10 19:56:35 -08:00
kpyzhov c9690092c8 [AMDGPU] Small correction in SITargetLowering::performOrCombine().
Differential Revision: https://reviews.llvm.org/D113203
2021-11-10 21:07:27 -05:00
Craig Topper 4183522e80 [RISCV] Promote f16 frem with Zfh.
Add riscv64 coverage for f32 and f64 frem.

Reviewed By: luismarques

Differential Revision: https://reviews.llvm.org/D113531
2021-11-10 17:35:07 -08:00
Stanislav Mekhanoshin 476ab0f809 [AMDGPU] Fixed stack pointer init with architected flat scratch
Even if the wave offset is not present, we still need to do the rest
of the initialization. The mov into s32 was missing in the
kernels.

Fixes: SWDEV-310935

Differential Revision: https://reviews.llvm.org/D113628
2021-11-10 17:18:38 -08:00
Matt Arsenault c7a0c2d0f7 AMDGPU: Report large stack usage for recursive calls
We were previously setting an ignored bit in the kernel headers. The
current behavior is to add the large amount on top of the statically
known size of a single stack frame. I'm not sure if we should just use
the large size as the entire reported size instead.
2021-11-10 20:02:01 -05:00
Nikita Popov 0242a6adf7 [InstCombine] Support splat vectors in some or of icmp folds
Replace m_ConstantInt() with m_APInt() in order to support splat
constants in addition to scalar integers.
2021-11-10 22:59:09 +01:00
Nikita Popov 861adaf2ad [InstCombine] Support splat vectors in some and of icmp folds
Replace m_ConstantInt() with m_APInt() to support splat vectors
in addition to scalar integers.
2021-11-10 22:37:54 +01:00
Nikita Popov 58ebc79a64 [InstCombine] Strip offset when folding and/or of icmps
When folding and/or of icmps, look through add of a constant and
adjust the icmp range instead. Effectively, this decomposes
X + C1 < C2 style range checks back into a normal range. This allows
us to fold comparisons involving two range checks or one range check
and some other condition. We had a fold for a really specific case
of this (an or of a range check and an eq, and only on one side!), while
this handles it in full generality.

Differential Revision: https://reviews.llvm.org/D113510
2021-11-10 22:01:52 +01:00
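A small C illustration of the kind of condition this now folds (constants hand-picked for the example, not taken from the patch): an add-plus-unsigned-compare range check or'd with an equality on one side collapses to a single compare:

    #include <assert.h>

    /* (x - 1u) < 3u encodes the range check 1 <= x <= 3.  Or'ing it with the
       adjacent point x == 0 gives 0 <= x <= 3, i.e. a single unsigned
       compare x < 4. */
    static int before(unsigned x) { return (x - 1u < 3u) || (x == 0u); }
    static int after(unsigned x)  { return x < 4u; }

    int main(void) {
      for (unsigned x = 0; x < 100; ++x)
        assert(before(x) == after(x));
      return 0;
    }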