We have these 2 "isDesirable" promotion hooks (I'm not sure why we need both of them, but that's
independent of this patch), and we can adjust them to promote "mul i8 X, C" to i32. Then, all of
our existing LEA and other multiply expansion magic happens as it would for i32 ops.
Some of the test diffs show that we could end up with an actual 32-bit mul instruction here
because we choose not to expand to simpler ops. That instruction could be slower depending on the
subtarget. On the plus side, this means we don't need a separate instruction to load the constant
operand and possibly an extra instruction to move the result. If we need to tune mul i32 further,
we could add a later transform that tries to shrink it back to i8 based on subtarget timing.
I did not bother to duplicate all of the 32-bit test file RUNs and target settings that exist to
test whether LEA expansion is cheap or not. The diffs here assume a default target, so that means
LEA is generally cheap.
Differential Revision: https://reviews.llvm.org/D54803
llvm-svn: 347557
When splitting the v16f32/v8f64 result type, type legalization will try to promote the integer result type before a concat and an explicit truncate. But for the fptoui test case this is particularly bad since fptoui isn't supported on X86 until AVX512. We could use an fptosi since the result range would fit in a signed 32-bit value, but the generic type legalization doesn't do that transformation when splitting. It does do this when promoting.
llvm-svn: 347533
This should likely be adjusted to limit this transform
further, but these diffs should be clear wins.
If we have blendv/conditional move, then we should assume
those are cheap ops. The loads become independent of the
compare, so those can be speculated before we need to use
the values in the blend/mov.
llvm-svn: 347526
There are many options here depending on subtarget,
but we are uniformly relying on a transform that was
driven by performance for a 32-bit SSE2 target in 2009.
Note: The same motivation was apparently used to do this
transform for *all* targets, so non-x86 may want to look
at this too.
llvm-svn: 347525
...and use them to avoid creating obviously undef values as
discussed in the post-commit thread for r347478.
The diffs in vector div/rem show that we were missing real
optimizations by creating bogus shift nodes.
llvm-svn: 347502
I'm not sure if this actually preserves the original intent
of this test, but if we leave it as-is, the -1 (oversized)
shift should be folded to undef and allow deleting half
of the output.
llvm-svn: 347501
This code takes a truncate, fp_to_int, or int_to_fp with a legal result type and an input type that needs to be split and enlarges the elements in the result type before doing the split. Then inserts a follow up truncate or fp_round after concatenating the two halves back together.
But if the input type of the original op is being split on its way to ultimately being scalarized we're just going to end up building a vector from scalars and then truncating or rounding it in the vector register. Seems kind of silly to enlarge the result element type of the operation only to end up with scalar code and then building a vector with large elements only to make the elements smaller again in the vector register. Seems better to just try to get away producing smaller result types in the scalarized code.
The X86 test case that changes is a pretty contrived test case that exists because of a bug we used to have in our AVG matching code. I think the code is better now, but its not realistic anyway.
llvm-svn: 347482
SplitVecOp_TruncateHelper tries to introduce a multilevel truncate to avoid scalarization. But if splitting the result type would still be a legal type we don't need to do that.
The comment block at the top of the function implied that this was already implemented. I looked back through the history and it doesn't look to have ever been checked.
llvm-svn: 347479
We fail to canonicalize IR this way (prefer 'not' ops to arbitrary 'xor'),
but that would not matter without this patch because DAGCombiner was
reversing that transform. I think we need this transform in the backend
regardless of what happens in IR to catch cases where the shift-xor
is formed late from GEP or other ops.
https://rise4fun.com/Alive/NC1
Name: shl
Pre: (-1 << C2) == C1
%shl = shl i8 %x, C2
%r = xor i8 %shl, C1
=>
%not = xor i8 %x, -1
%r = shl i8 %not, C2
Name: shr
Pre: (-1 u>> C2) == C1
%sh = lshr i8 %x, C2
%r = xor i8 %sh, C1
=>
%not = xor i8 %x, -1
%r = lshr i8 %not, C2
https://bugs.llvm.org/show_bug.cgi?id=39657
llvm-svn: 347478
GCC does it this way, and we have to be consistent. This includes
stdcall and fastcall functions with suffixes. I confirmed that a
fastcall function named "foo" ends up in ".text$foo", not
".text$@foo@8".
Based on a patch by Andrew Yohn!
Fixes PR39218.
Differential Revision: https://reviews.llvm.org/D54762
llvm-svn: 347431
These are AVX2 instructions, but have been incorrectly marked in tablegen for a while. This wasn't a problem until r346784 switched the patterns to use target independent ISD opcodes. This made the patterns visible to fast isel.
Fixes PR39733
llvm-svn: 347375
We can't guarantee that demanded bits passing through the vector shuffle won't cause the AND in front of this to be removed. This would prevent the PACKUS from being matched during shuffle lowering.
Unfortunately, this adds a packuswb to one of the vector-reduce-mul.ll tests since we were removing the shuffle via SimplifyDemandedVectorElts. We appear to have similar issues with vpmovwb on the same test case on other targets.
llvm-svn: 347361
This is another step in vector narrowing - a follow-up to D53784
(and hoping to eventually squash potential regressions seen in
D51553).
The x86 test diffs are wins, but the AArch64 diff is probably not.
That problem already exists independent of this patch (see PR39722), but it
went unnoticed in the previous patch because there were no regression tests
that showed the possibility.
The x86 diff in i64-mem-copy.ll is close. Given the frequency throttling
concerns with using wider vector ops, an extra extract to reduce vector
width is the right trade-off at this level of codegen.
Differential Revision: https://reviews.llvm.org/D54392
llvm-svn: 347356
Previously we emitted to separate shuffles, one for unpcklbw and one for unpcklwd. Instead emit a single shuffle equivalent to both of the original shuffles. Shuffle lowering seems able to handle it. This avoids a bitcast between the two shuffles which seems helpful to DAG combine.
Remove the custom type legalization for v8i8->v8i32. I had put that in to avoid some almost duplicate punpcklbw instructions I was seeing, but this lowering change seems to fix that. It also fixes some duplicate shuffles seen in vector-sext.ll
llvm-svn: 347348
This uncovered an off-by-one typo in SimplifyDemandedVectorElts's INSERT_SUBVECTOR handling as its bounds check was bailing on safe indices.
llvm-svn: 347313
Pull out getPackDemandedElts demanded elts remapping helper from computeKnownBitsForTargetNode and use in computeKnownBits/ComputeNumSignBits.
llvm-svn: 347303
For bitcast nodes from larger element types, add the ability for SimplifyDemandedVectorElts to call SimplifyDemandedBits by merging the elts mask to a bits mask.
I've raised https://bugs.llvm.org/show_bug.cgi?id=39689 to deal with the few places where SimplifyDemandedBits's lack of vector handling is a problem.
Differential Revision: https://reviews.llvm.org/D54679
llvm-svn: 347301
Previously if V2 was unused we ended up using V1 for both inputs as part of the code that follows the new code. By using lowerVectorShuffleWithUNPCK we keep the undef nature of V2 in the output.
As near as I can tell this makes v16i8 behavior consistent with every other VT now.
This does mean that we give the register allocator freedom to fill in random registers now and create false dependencies. But like I said we're already doing that for other types.
llvm-svn: 347296
getZeroVector produces a specifically canonicalized zero vector, but we can just let DAG legalization take care of it.
The test changes are because MULH lowering happens later than it should and this change gave us the opportunity to constant fold away a multiply during a DAG combine before the build_vector got legalized with a bitcast.
llvm-svn: 347290
Summary:
We already support this for scalars, but it was explicitly disabled for vectors. In the updated test cases this allows us to see the upper bits are zero to use less multiply instructions to emulate a 64 bit multiply.
This should help with this ispc issue that a coworker pointed me to https://github.com/ispc/ispc/issues/1362
Reviewers: spatel, efriedma, RKSimon, arsenm
Reviewed By: spatel
Subscribers: wdng, llvm-commits
Differential Revision: https://reviews.llvm.org/D54725
llvm-svn: 347287
This can occur when one of the inputs to the multiply is loop invariant. Though my test cases just use two basic blocks with an unconditional jump which we won't merge until after isel in the codegen pipeline.
For scalars, I believe SelectionDAGBuilder can add an AssertZExt to pass knowledge across basic blocks but its explicitly disabled for vectors.
llvm-svn: 347266
SSE PSHUFB vector ctlz lowering works at the i4 nibble level. As detailed in PR39703, we were masking the lower nibble off but we only actually use it in the case where the upper nibble is known to be zero, making it safe to remove the mask and save an instruction.
Differential Revision: https://reviews.llvm.org/D54707
llvm-svn: 347242
Previously we split the vectors in half to allow the two halves to be any extended then concatenated the results back together.
This patch instead instead extends the v16i8 sse algorithm to extend half of each 128-bit lane using punpcklbw/punpckhbw. Multiplies all the low half lanes and high half lanes together in separate operations. Then merges the half lane results back together using packuswb.
Unfortunately, some of the cases in vector-reduce-mul.ll regress because we aren't narrowing the vector width of the multiplies as we reduce. The splitting was somewhat making up for that before by causing halves to be discarded after the split.
Differential Revision: https://reviews.llvm.org/D54668
llvm-svn: 347240
The shift requires a copy to avoid clobbering a register. Comparing with 0 uses an xor to produce 0 that will be overwritten with the compare results. So still requires 2 instructions, but should be one byte shorter since it doesn't need to encode an immediate.
llvm-svn: 347185
Previously we used an arithmetic shift right by 31, but that requires a copy to preserve the input. So we might as well materialize a zero and compare to it since the comparison will overwrite the register that contains the zeros. This should be one byte shorter.
llvm-svn: 347181
Leave just the v4i8->v4i64 and v8i8->v8i64, but only enable them on pre-sse4.1 targets when 64-bit mode is enabled. In those cases we end up creating sext loads that get scalarized to code that looks better than what we get from loading into a vector register and doing a multiple step sign extend using unpacks and shifts.
llvm-svn: 347180
Pre-SSE4.1 sext_invec for v2i64 is complicated because we don't have a v2i64 sra instruction. So instead we sign extend to i32 using unpack and sra, then copy the elements and do a v4i32 sra to fill with sign bits, then interleave the i32 sign extend and the sign bits. So really we're doing to two sign extends but only using half of the v4i32 intermediate result.
When the result is more than 128 bits, default type legalization would prefer to split the destination type all the way down to v2i64 with shuffles followed by v16i8/v8i16->v2i64 sext_inreg operations. This results in more instructions than necessary because we are only utilizing the lower 2 elements of the v4i32 intermediate result. Instead we can custom split a v4i8/v4i16->v4i64 sign_extend. Then we can sign extend v4i8/v4i16->v4i32 invec producing a full v4i32 result. Create the sign bit vector as a v4i32 then split and interleave with the sign bits using an punpackldq and punpackhdq.
llvm-svn: 347176
Some of these sequeces look pretty bad since we have to copy the sign bit from a 32 bit register to a 64 bit register to finish a sign extend.
llvm-svn: 347175
If we widen illegal types instead of promoting, we should be able to rely on the type legalizer to create the vector_inreg operations for us with some caveats.
This patch disables combineToExtendVectorInReg when we are using widening.
I've enabled custom legalization for v8i8->v8i64 extends under avx512f since the type legalizer would want to create a vector_inreg with a v64i8 input type which isn't legal without avx512bw. So we go to v16i8 with custom code using the relaxation of rules we get from D54346.
I've also enable custom legalization of v8i64 and v16i32 operations with with AVX. When the input type is 128 bits, the default splitting legalization would extend first 128->256, then do the a split to two 128 pieces. Extend each half to 256 and then concat the result. The custom legalization I've added instead uses a 128->256 bit vector_inreg extend that only reads the lower 64-bits for the low half of the split. Then shuffles the high 64-bits to the low 64-bits and does another vector_inreg extend.
llvm-svn: 347172
Summary: This is an improvement over the two pshufbs and punpcklqdq we'd get otherwise.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D54671
llvm-svn: 347171
Sadly, this duplicates (twice) the logic from InstSimplify. There
might be some way to at least share the DAG versions of the code,
but copying the folds seems to be the standard method to ensure
that we don't miss these folds.
Unlike in IR, we don't run DAGCombiner to fixpoint, so there's no
way to ensure that we do these kinds of simplifications unless the
code is repeated at node creation time and during combines.
There were other tests that would become worthless with this
improvement that I changed as pre-commits:
rL347161
rL347164
rL347165
rL347166
rL347167
I'm not sure how to salvage the remaining tests (diffs in this patch).
So the x86 tests verify that the new code is working as intended.
The AMDGPU test is actually similar to my motivating case: we have
some undef value that has survived to machine IR in an x86 test, and
then it gets folded in some weird way, or we crash if we don't transfer
the undef flag. But we would have been better off never getting to that
point by doing these simplifications.
This will lead back to PR32023 someday...
https://bugs.llvm.org/show_bug.cgi?id=32023
llvm-svn: 347170
We were using the 'normalized' shuffle mask from resolveTargetShuffleInputs, which replaces zero/undef inputs with sentinel values. For SimplifyDemandedVectorElts we need the raw mask so we can correctly demand those 'zero' inputs that got normalized away, this requires an extra bit of logic to locally normalize undef inputs.
llvm-svn: 347158
The zero extend will require two stages of unpacks to implement. So its better to shrink the multiply using pmullw and then extend that result back to v4i32 using a single unpack.
llvm-svn: 347149
This tries to force the result type to vXi32 followed by a truncate. This can help avoid scalarization that would otherwise occur.
There's some annoying examples of an avx512 truncate instruction followed by a packus where we should really be able to just use one truncate. But overall this is still a net improvement.
llvm-svn: 347105
Summary:
As discussed in previous review, and noted in the FIXME, if `X` is actually an `lshr Y, Z` (logical!),
we can fold the `Z` into 'control`, and let the `BEXTR` do this too.
We could just insert those 8 bits of shift amount into control,
but it is better to instead zero-extend them, and 'or' them in place.
We can only do this for `lshr`, not `ashr`, because we do not know that the mask cover only the bits of `Y`,
and not any of the sign-extended bits.
The obvious question is, is this actually legal to do?
I believe it is. Relevant quotes, from `Intel® 64 and IA-32 Architectures Software Developer’s Manual`, `BEXTR — Bit Field Extract`:
* `Bit 7:0 of the second source operand specifies the starting bit position of bit extraction.`
* `A START value exceeding the operand size will not extract any bits from the second source operand.`
* `Only bit positions up to (OperandSize -1) of the first source operand are extracted.`
* `All higher order bits in the destination operand (starting at bit position LENGTH) are zeroed.`
* `The destination register is cleared if no bits are extracted.`
FIXME: if we can do this, i wonder if we should prefer `BEXTR` over `BZHI` in such cases.
Reviewers: RKSimon, craig.topper, spatel, andreadb
Reviewed By: RKSimon, craig.topper, andreadb
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D54095
llvm-svn: 347048
By early promoting the multiply to use an i16 element type we can avoid op legalization emit a second multiply for the 8 upper elements of the v16i8 type we would otherwise get.
llvm-svn: 347032
We aren't going to use the upper bits of the multiply result that the extend would effect. So we don't need a specific type of extend.
This makes some reduction test cases shorter because we were previously trying to sign_extend a truncate which we can't eliminate.
llvm-svn: 347011
In reduceVMULWidth, we no longer need to worry about extending the vector to 128 bits first. Regular widening of extends, muls and shuffles will take care of that for us.
In combineMulToPMADDWD, we can handle v2i32 multiplies and allow the VPMADDWD to be widened to v4i32 during type legalization by adding custom widening like we do have for AVG/ADDUS/SUBUS. I had to modify that code a little to allow different and output VTs.
Differential Revision: https://reviews.llvm.org/D54512
llvm-svn: 346980
This fixes -filetype=null support when compiling for a Win32 target and the module has a CodeView flag.
The only places changed are the uses of getTargetStreamer function - this patch guards both of them with null checks.
Committed on behalf of @eush (Eugene Sharygin)
Differential Revision: https://reviews.llvm.org/D54008
llvm-svn: 346962
This avoids some nasty shuffles when we have avx512. It will also prevent using zmm truncate instructions when a ymm instruction that zeroes part of an xmm register will do. Also avoid using avx512 truncate instructions when the input is 128 bits or less. These instructions are 2 uops on skx so we can probably find a better single uop shuffle like pshufb.
llvm-svn: 346936
The narrow types end up requesting widening, but generic legalization will end up scalaring and using a build_vector to do the widening.
llvm-svn: 346916
On 64-bit targets the type legalizer will use i64 to legalize these. But when i64 isn't legal, the type legalizer won't try an FP type. So do it manually instead.
There are a few regressions in here due to some v2i32 operations like mul and div now being reassembled into a full vector just to store instead of storing the pieces. But this was already occuring in 64-bit mode so its not a new issue.
llvm-svn: 346908
The machine scheduler currently biases register copies to/from
physical registers to be closer to their point of use / def to
minimize their live ranges. This change extends this to also physical
register assignments from immediate values.
This causes a reduction in reduction in overall register pressure and
minor reduction in spills and indirectly fixes an out-of-registers
assertion (PR39391).
Most test changes are from minor instruction reorderings and register
name selection changes and direct consequences of that.
Reviewers: MatzeB, qcolombet, myatsina, pcc
Subscribers: nemanjai, jvesely, nhaehnle, eraman, hiraditya,
javed.absar, arphaman, jfb, jsji, llvm-commits
Differential Revision: https://reviews.llvm.org/D54218
llvm-svn: 346894
Narrower vectors will be widened to 128 bits without changing the element size. And generic type legalization can already handle widening mulhu/mulhs.
Differential Revision: https://reviews.llvm.org/D54513
llvm-svn: 346879
This patch removes the last use of the constant pool shuffle decode helper and consistently uses the 'getTargetShuffleMaskIndices' versions instead. The constant pool versions are now purely used for assembly comments.
The avx512vbmi intrinsic upgrades had to be altered as they were being decoded as broadcasts, similar to what I fixed in rL346032. I don't think the change is critical - although its annoying that we lose the {k}{z} instruction test coverage as they are tricky to generate....
Differential Revision: https://reviews.llvm.org/D54083
llvm-svn: 346850
I've only added sse2 and sse4.1 variants as I'm only interested in the two v4i16 tests and I don't expect that to different with AVX other than a v prefix.
llvm-svn: 346834
The IEEE-754 Standard makes it clear that fneg(x) and
fsub(-0.0, x) are two different operations. The former is a bitwise
operation, while the latter is an arithmetic operation. This patch
creates a dedicated FNeg IR Instruction to model that behavior.
Differential Revision: https://reviews.llvm.org/D53877
llvm-svn: 346774
I'm looking into whether we can make this the default legalization strategy. Adding these tests to help cover the changes that will be necessary.
This patch adds copies of some tests with the command line switch enabled. By making copies its easier to compare the two legalization strategies.
I've also removed RUN lines from some of these tests that already had -x86-experimental-vector-widening-legalization
llvm-svn: 346745
This patch adds the ability to use a PALIGNR to rotate a pair of inputs to select a range containing all the referenced elements, followed by a single input permute to put them in the right location.
Differential Revision: https://reviews.llvm.org/D54267
llvm-svn: 346706
Truncate and shuffle lowering are already capable of matching to PACKUS using known bits analysis.
This features one test change where we now prefer to extend v16i16->v16i32 then trunc v16i32->v16i8 over extract_subvector+packus when avx512f is available, but avx512bw is not.
llvm-svn: 346697
This is a long-awaited follow-up suggested in D33578. Since then, we've picked up even more
opportunities for vector narrowing from changes like D53784, so there are a lot of test diffs.
Apart from 2-3 strange cases, these are all wins.
I've structured this to be no-functional-change-intended for any target except for x86
because I couldn't tell if AArch64, ARM, and AMDGPU would improve or not. All of those
targets have existing regression tests (4, 4, 10 files respectively) that would be
affected. Also, Hexagon overrides the shouldReduceLoadWidth() hook, but doesn't show
any regression test diffs. The trade-off is deciding if an extra vector load is better
than a single wide load + extract_subvector.
For x86, this is almost always better (on paper at least) because we often can fold
loads into subsequent ops and not increase the official instruction count. There's also
some unknown -- but potentially large -- benefit from using narrower vector ops if wide
ops are implemented with multiple uops and/or frequency throttling is avoided.
Differential Revision: https://reviews.llvm.org/D54073
llvm-svn: 346595
There are two AGU units, and per 1cy, there can be either two loads,
or a load and a store; but not two stores, or two loads and a store.
Additionally, loads shouldn't affect the store scheduler and vice versa.
(but *should* affect the PdEX scheduler.)
Required rL346545.
Fixes https://bugs.llvm.org/show_bug.cgi?id=39465
llvm-svn: 346587
The sdivrem will emit its own MOVSX to move %ah to the low byte of a register. By using a MOVSX for an any_extend this allows a post-isel peephole to merge them.
llvm-svn: 346581
After the division %ah is being sign extended to move it to lower byte of a register while avoiding a partial register read. We then zero extend the low byte to the full 32 bit register. But we don't use any of the zero extended bits. In the DAG the zero extend was really an any_extend so the sign extend should have been enough.
llvm-svn: 346580
This gives shuffle lowering the freedom to use zero_extend_vector_inreg for the unpckl shuffle. Shuffle combining usually makes this swap later, but not when AVX512 is enabled it seems.
While there also use DAG.getConstant to create a 0 vector instead of using the helper the forces a specific BUILD_VECTOR. I don't think that helper is usually needed. We're basically free to create a constant build_vector anytime and it will be legalized on its own.
llvm-svn: 346574
With avx512f but not avx512bw we need to extend to v16i32 then truncate that to to v16i8. Previously we emitted both nodes during lowering, but I'm trying to switch to using target independent nodes and with that switched the extend+truncate wou
This patch changes the implementation to what will be necessary with that patch which helps minimize test diffs.
llvm-svn: 346552
This makes X86ISD::VSEXT more similar to ISD::SIGN_EXTEND and ISD::ZERO_EXTEND.
I'm hoping to replace X86ISD::VSEXT/VZEXT with target independent nodes. Making the target specific nodes similar to the target independent nodes helps minimize test diffs in that patch.
llvm-svn: 346539
It's possible for vector op legalization to generate a shuffle. If that happens we should give a chance for DAG combine to combine that with a build_vector input.
I also fixed a bug in combineShuffleOfScalars that was considering the number of uses on a undef input to a shuffle. We don't care how many times undef is used.
Differential Revision: https://reviews.llvm.org/D54283
llvm-svn: 346530
I noticed that we weren't generating broadcasts as much I thought we would with
D54271, and this is part of the problem.
Widening the shuffle elements means adding bitcasts and hiding the relationship
between a splatted scalar and the vector. If we can form a broadcast, do that
before going through the rest of the shuffle lowering because broadcasts should
be cheap and can often be load-folded.
Differential Revision: https://reviews.llvm.org/D54280
llvm-svn: 346498
In SimplifyCFG when given a conditional branch that goes to BB1 and BB2, the hoisted common terminator instruction in the two blocks, caused debug line records associated with subsequent select instructions to become ambiguous. It causes the debugger to display unreachable source lines.
Differential Revision: https://reviews.llvm.org/D53390
llvm-svn: 346481
As discussed in D54073, we have a potential regression from more aggressive vector narrowing here, so let's try to avoid that by changing build-vector lowering slightly.
Insert-vector-element lowering always does this since there's no "pinsr" for ymm/zmm:
// If the vector is wider than 128 bits, extract the 128-bit subvector, insert
// into that, and then insert the subvector back into the result.
...but we can sometimes do better for insert-into-constant-vector by using shuffle lowering.
Differential Revision: https://reviews.llvm.org/D54271
llvm-svn: 346433
FindBetterNeighborChains simulateanously improves the chain
dependencies of a chain of related stores avoiding the generation of
extra token factors. For chains longer than the GatherAllAliasDepths,
stores further down in the chain will necessarily fail, a potentially
significant waste and preventing otherwise trivial parallelization.
This patch directly parallelize the chains of stores before improving
each store. This generally improves DAG-level parallelism.
Reviewers: courbet, spatel, RKSimon, bogner, efriedma, craig.topper, rnk
Subscribers: sdardis, javed.absar, hiraditya, jrtc27, atanasyan, llvm-commits
Differential Revision: https://reviews.llvm.org/D53552
llvm-svn: 346432
Summary:
The conditional branch created to support -fsplit-stack for X86 is
left unbiased/unhinted, resulting in less than ideal block placement:
the __morestack call block is kept on the main hot path. Bias the
branch to insure that the stack allocation block is treated as a
"cold" block during machine basic block placement.
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D54123
llvm-svn: 346336
This adds the llvm-side support for post-inlining evaluation of the
__builtin_constant_p GCC intrinsic.
Also fixed SCCPSolver::visitCallSite to not blow up when seeing a call
to a function where canConstantFoldTo returns true, and one of the
arguments is a struct.
Updated from patch initially by Janusz Sobczak.
Differential Revision: https://reviews.llvm.org/D4276
llvm-svn: 346322
Set `LiveReg::PhysReg` to zero when freeing a register instead of
removing it from the entry from `LiveRegMap`. This way no iterators get
invalidated and we can avoid passing around and updating iterators all
over the place.
This does not change any allocator decisions. It is not completely NFC
because the arbitrary iteration order through `LiveRegMap` in
`spillAll()` changes so we may get a different order in those spill
sequences (the amount of spills does not change).
This is in preparation of https://reviews.llvm.org/D52010.
llvm-svn: 346298
I'm preparing a patch to avoid creating critical edges in cmov expansion. Updating these tests to make the changes by the next patch easier to see.
llvm-svn: 346161
The constrained intrinsic tests have grown in number. Split off
the FMA tests into their own file to reduce double coverage.
Differential Revision: https://reviews.llvm.org/D53932
llvm-svn: 346137
v2i8/v2i16/v2i32 are promoted to v2i64. pmuludq takes a v2i64 input and produces a v2i64 output. Since we don't about the upper bits of the type legalized multiply we can use the pmuludq to produce the multiply result for the bits we do care about.
llvm-svn: 346115
Summary: This also enables some constant folding from KnownBits propagation. This helps on some cases vXi64 case in 32-bit mode where constant vectors appear as vXi32 and a bitcast. This can prevent getNode from constant folding sra/shl/srl.
Reviewers: RKSimon, spatel
Reviewed By: spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D54069
llvm-svn: 346102
The majority of the changes are because the rest of shuffle lowering/combining prefers to replace the undef input with the other operand. Using UNPCKL directly seemed to avoid this and just grabbed a randomish register for the undef which can create false dependencies.
llvm-svn: 346050
We already have custom lowering for the AVX case in LegalizeVectorOps. So its better to keep the regular extend op around as long as possible.
I had to qualify one place in DAG combine that created illegal vector extending load operations. This change by itself had no effect on any tests which is why its included here.
I've made a few cleanups to the custom lowering. The sign extend code no longer creates an identity shuffle with undef elements. The zero extend code now emits a zero_extend_vector_inreg instead of an unpckl with a zero vector.
For the high half of the custom lowering of zero_extend/any_extend, we're now using an unpckh with a zero vector or undef. Previously we used used a pshufd to move the upper 64-bits to the lower 64-bits and then used a zero_extend_vector_inreg. I think the zero vector should require less execution resources and be smaller code size.
Differential Revision: https://reviews.llvm.org/D54024
llvm-svn: 346043
This is necessary as I'm wanting to remove the 'Constant Pool' shuffle decoding from getTargetShuffleMask - but using getTargetShuffleMaskIndices allows the shuffle combiner to realize that these calls are really broadcasts.....
As with a lot of the X86ISD::VPERMV3 code this causes some vperm2i/vperm2t shuffles to flip depending on optimal commutation.
llvm-svn: 346032
As reported in PR38952, postra-machine-sink relies on DBG_VALUE insns being
adjacent to the def of the register that they reference. This is not always
true, leading to register copies being sunk but not the associated DBG_VALUEs,
which gives the debugger a bad variable location.
This patch collects DBG_VALUEs as we walk through a BB looking for copies to
sink, then passes them down to performSink. Compile-time impact should be
negligable.
Differential Revision: https://reviews.llvm.org/D53992
llvm-svn: 345996
reduceBuildVecConvertToConvertBuildVec vectorizes int2float in the DAGCombiner, which means that even if the LV/SLP has decided to keep scalar code using the cost models, this will override this.
While there are cases where vectorization is necessary in the DAG (mainly due to legalization artefacts), I don't think this is the case here, we should assume that the vectorizers know what they are doing.
Differential Revision: https://reviews.llvm.org/D53712
llvm-svn: 345964
I'm having trouble creating a test case for the ISD::TRUNCATE part of this that shows any codegen differences. But I was able to test the setcc path which is what the test changes here cover.
llvm-svn: 345908
The test causes a crash because we were trying to extract v4f32 to v3f32, and the
narrowing factor was then 4/3 = 1 producing a bogus narrow type.
This should fix:
https://bugs.llvm.org/show_bug.cgi?id=39511
llvm-svn: 345842
Reapplying an updated version of rL345395 (reverted in rL345451), now the issues noticed in PR39483 have been fixed.
This patch allows resolveTargetShuffleInputs to remove UNDEF inputs from cases where we have more than 2 inputs.
llvm-svn: 345824
The debug-use flag must be set exactly for uses on DBG_VALUEs. This is
so obvious that it can be trivially inferred while parsing. This will
reduce noise when printing while omitting an information that has little
value to the user.
The parser will keep recognizing the flag for compatibility with old
`.mir` files.
Differential Revision: https://reviews.llvm.org/D53903
llvm-svn: 345671
The CONCAT_VECTORS case was using the original mask element count to determine how to adjust the broadcast index. But if we looked through a bitcast the original mask size doesn't tell us anything about the concat_vectors.
This patch switchs to using the concat_vectors input element count directly instead.
Differential Revision: https://reviews.llvm.org/D53823
llvm-svn: 345626
The SchedModel allows the addition of ReadAdvances to express that certain
operands of the instructions are needed at a later point than the others.
RegAlloc may add pseudo operands that are not part of the instruction
descriptor, and therefore cannot have any read advance entries. This meant
that in some cases the desired read advance was nullified by such a pseudo
operand, which still had the original latency.
This patch fixes this by making sure that such pseudo operands get a zero
latency during DAG construction.
Review: Matthias Braun, Ulrich Weigand.
https://reviews.llvm.org/D49671
llvm-svn: 345606
Narrowing vector binops came up in the demanded bits discussion in D52912.
I don't think we're going to be able to do this transform in IR as a canonicalization
because of the risk of creating unsupported widths for vector ops, but we already have
a DAG TLI hook to allow what I was hoping for: isExtractSubvectorCheap(). This is
currently enabled for x86, ARM, and AArch64 (although only x86 has existing regression
test diffs).
This is artificially limited to not look through bitcasts because there are so many
test diffs already, but that's marked with a TODO and is a small follow-up.
Differential Revision: https://reviews.llvm.org/D53784
llvm-svn: 345602
Summary:
Because of the D48768, that pattern is always unfolded into pattern d,
thus we had no test coverage.
Reviewers: RKSimon, craig.topper
Reviewed By: craig.topper
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D53574
llvm-svn: 345583
Similar to FoldCONCAT_VECTORS, this patch adds FoldBUILD_VECTOR to simplify cases that can avoid the creation of the BUILD_VECTOR - if all the operands are UNDEF or if the BUILD_VECTOR simplifies to a copy.
This exposed an assumption in some AMDGPU code that getBuildVector was guaranteed to be a BUILD_VECTOR node that I've tried to handle.
Differential Revision: https://reviews.llvm.org/D53760
llvm-svn: 345578
- Relex hard coded registers and stack frame sizes
- Some test cleanups
- Change phi-dbg.ll to match on mir output after phi elimination instead
of going through the whole codegen pipeline.
This is in preparation for https://reviews.llvm.org/D52010
I'm committing all the test changes upfront that work before and after
independently.
llvm-svn: 345532
The machine verifier was disabled for x86 by default. There are now only
9 tests failing, compared to what previously was between 20 and 30.
This is a good opportunity to file bugs for all the remaining issues,
then explicitly disable the failing tests and enabling the machine
verifier by default.
This allows us to avoid adding new tests that break the verifier.
PR27481
llvm-svn: 345513
Add an intrinsic that takes 2 integers and perform saturation subtraction on
them.
This is a part of implementing fixed point arithmetic in clang where some of
the more complex operations will be implemented as intrinsics.
Differential Revision: https://reviews.llvm.org/D53783
llvm-svn: 345512
This test breaks the X86 MachineVerifier. It looks like the MIR part is
completely useless.
The original author suggests that it can be removed.
Differential Revision: https://reviews.llvm.org/D53767
llvm-svn: 345501
When the floating point constants are whole numbers they have no decimal point so look like integers, but mean something very different in something like an 'and' instruction.
Ideally we would just print a decimal point and a 0, but I couldn't see how to make APFloat::toString do that.
llvm-svn: 345488
Add vector support to TargetLowering::expandFP_TO_UINT.
This exposes an issue in X86TargetLowering::LowerVSELECT which was assuming that the select mask was the same width as the LHS/RHS ops - as long as the result is a sign splat we can easily sext/trunk this.
llvm-svn: 345473
Adding the baseline tests in a preparatory NFC commit,
so that the actual commit shows the *diff*.
Yes, i'm aware that a few of these codegen-based sched tests
are testing wrong instructions, i will fix that afterwards.
For https://reviews.llvm.org/D52779
llvm-svn: 345462
On GreenDragon, CodeGen/X86/cpus-no-x86_64.ll was still timing out even
after breaking up the original test. I further split off the intel and
AMD cpus which hopefully resolves this.
http://green.lab.llvm.org/green/job/clang-stage2-cmake-RgSan/
llvm-svn: 345438
Summary:
The main challenge here is that X86InstrInfo::AnalyzeBranch doesn't
understand the way we're using a CALL instruction as a branch, so we
can't list the CallTarget MBB as a successor of the entry block. If we
don't list it as a successor, then the AsmPrinter doesn't print a label
for the MBB.
Fix the issue by inserting our own label at the beginning of the call
target block. We can rely on the AsmPrinter to always emit it, even
though the block appears to be unreachable, but address-taken.
Fixes PR38391.
Reviewers: thegameg, chandlerc, echristo
Subscribers: hiraditya, llvm-commits
Differential Revision: https://reviews.llvm.org/D53653
llvm-svn: 345426
These promotions add additional bitcasts to the SelectionDAG that can pessimize computeKnownBits/computeNumSignBits. It also seems to interfere with broadcast formation.
This patch removes the promotion and adds isel patterns instead.
The increased table size is more than I would like, but hopefully we can find some canonicalizations or other tricks to start pruning out patterns going forward.
Differential Revision: https://reviews.llvm.org/D53268
llvm-svn: 345408
This is a narrow fix for 1 of the problems mentioned in PR27780:
https://bugs.llvm.org/show_bug.cgi?id=27780
I looked at more general solutions, but it's a mess. We canonicalize shuffle masks
based on the number of elements accessed from each operand, and that's not optional.
If you remove that, we'll crash because we fail to match isel patterns. So I'm
waiting until we're sure that we have blendvb with constant condition and then
commuting based on the load potential. Other cases like blend-with-immediate are
already handled elsewhere, so this is probably not a common problem anyway.
I didn't use "MayFoldLoad" because that checks for one-use and in these cases, we've
screwed that up by creating a temporary PSHUFB using these operands that we're counting
on to be killed later. Undoing that didn't look like a simple task because it's
intertwined with determining if we actually use both operands of the shuffle or not.a
Differential Revision: https://reviews.llvm.org/D53737
llvm-svn: 345390
.debug_loclists is the DWARF 5 version of the .debug_loc.
With that patch, it will be emitted when DWARF 5 is used.
Differential revision: https://reviews.llvm.org/D53365
llvm-svn: 345377
The required-vector-width attribute was only used for backend testing and has never been generated by clang.
I believe clang is now generating min-legal-vector-width for vector uses in user code.
With this I believe passing -mprefer-vector-width=256 to clang should prevent use of zmm registers in the generated assembly unless the user used a 512-bit intrinsic in their source code.
llvm-svn: 345317
FENTRY_CALL is actually not taking any input / output operands. The
machine verifier complains now because the target description says that:
* It needs 1 unknown output
* It needs 1 or more variable inputs
llvm-svn: 345316
KNL is based on a modified Silvermont core so I don't think these features apply. I think the LEA flag is probably also wrong, but I'm less sure as I barely understand the 3 LEA flags we have currently.
Differential Revision: https://reviews.llvm.org/D53671
llvm-svn: 345285
The current state of the llc invocation is:
* Running all the passes from dwarfehprepare to stack coloring
(included)
* It runs it from the LLVM IR included in the file
* It *ADDS* the generated MI from ISel to the MI in the MIR file
* The machine verifier doesn't like it.
Differential Revision: https://reviews.llvm.org/D53698
llvm-svn: 345266
As suggested on D52965, this patch moves the i64 to f64 UINT_TO_FP expansion code from LegalizeDAG into TargetLowering and makes it available to LegalizeVectorOps as well.
Not only does this help perform X86 lowering as a true vectorization instead of (partially vectorized) scalar conversions, it avoids the HADDPD op from the scalar code which can be slow on most targets.
The AVX512F does have the vcvtusi2sdq scalar operation but we don't unroll to use it as it seems to only help for the v2f64 case - otherwise the unrolling cost will certainly be too high. My feeling is that we should leave it to the vectorizers - and if it generates the vector UINT_TO_FP we should use it.
Differential Revision: https://reviews.llvm.org/D53649
llvm-svn: 345256
When SimplifyCFG changes the PHI node into a select instruction, the debug line records becomes ambiguous. It causes the debugger to display unreachable source lines.
Differential Revision: https://reviews.llvm.org/D53287
llvm-svn: 345250
Instead of using the MOVGOT64r pseudo, use the existing
MO_PIC_BASE_OFFSET support on symbol operands. Now I don't have to
create a "scratch register operand" for the pseudo to use, and the
register allocator can make better decisions.
Fixes some X86 verifier errors tracked in PR27481.
llvm-svn: 345219
It's possible to do a tail call to a stack argument. LLVM already
calculates the right stack offset to call through.
Fixes the sibcall* and musttail* verifier failures tracked at PR27481.
llvm-svn: 345197
Add X86 SimplifyDemandedBitsForTargetNode and use it to simplify PMULDQ/PMULUDQ target nodes.
This enables us to repeatedly simplify the node's arguments after the previous approach had to be reverted due to PR39398.
Differential Revision: https://reviews.llvm.org/D53643
llvm-svn: 345182
This patch brings back the MOV64r0 pseudo instruction for zeroing a 64-bit register. This replaces the SUBREG_TO_REG MOV32r0 sequence we use today. Post register allocation we will rewrite the MOV64r0 to a 32-bit xor with an implicit def of the 64-bit register similar to what we do for the various XMM/YMM/ZMM zeroing pseudos.
My main motivation is to enable the spill optimization in foldMemoryOperandImpl. As we were seeing some code that repeatedly did "xor eax, eax; store eax;" to spill several registers with a new xor for each store. With this optimization enabled we get a store of a 0 immediate instead of an xor. Though I admit the ideal solution would be one xor where there are multiple spills. I don't believe we have a test case that shows this optimization in here. I'll see if I can try to reduce one from the code were looking at.
There's definitely some other machine CSE(and maybe other passes) behavior changes exposed by this patch. So it seems like there might be some other deficiencies in SUBREG_TO_REG handling.
Differential Revision: https://reviews.llvm.org/D52757
llvm-svn: 345165
A lifetime end intrinsic between a tail call and the return should not
prevent the call from being tail call optimized.
Differential Revision: https://reviews.llvm.org/D53519
llvm-svn: 345163
When implementing memset's today we often see this pattern:
$x0 = MOV 0xXYXYXYXYXYXYXYXY
store $x0, ...
$w1 = MOV 0xXYXYXYXY
store $w1, ...
We first create a 64bit constant in a 64bit register with all bytes the
same and then create a 32bit constant with all bytes the same in a 32bit
register. In many targets we could just access the lower byte of the
64bit register instead.
- Ideally this would be handled by the ConstantHoist pass but it runs
too early when memset isn't expanded yet.
- The memset expansion code already had this optimization implemented,
however SelectionDAG constantfolding would constantfold the
"trunc(bigconstnat)" pattern to "smallconstant".
- This patch makes the memset expansion mark the constant as Opaque and
stop DAGCombiner from constant folding in this situation. (Similar to
how ConstantHoisting marks things as Opaque to avoid folding
ADD/SUB/etc.)
Differential Revision: https://reviews.llvm.org/D53181
llvm-svn: 345102
We can't add the MULDQ node back to the worklist after the demanded bits change has been committed in case the node has been removed entirely. This will have to wait until we have SimplifyDemandedBitsForTargetNode.
llvm-svn: 345070
As suggested on D53258, this patch shares common CTLZ expansion code between VectorLegalizer and SelectionDAGLegalize by putting it in TargetLowering.
Extension to D53474
llvm-svn: 345060
This initially landed in rL345014, but was reverted in rL345017
due to sanitizer-x86_64-linux-fast buildbot failure in
check-lld (ELF/relocatable-versioned.s) test.
While i'm not yet quite sure what is the problem, one obvious
thing here is that extra truncation roundtrip.
Maybe that's it? If not, will re-revert.
Differential Revision: https://reviews.llvm.org/D53521
llvm-svn: 345027
Summary:
Continuation of D52348.
We also get the `c) x & (-1 >> (32 - y))` pattern here, because of the D48768.
I will add extra-uses into those tests and follow-up with a patch to handle those patterns too.
Reviewers: RKSimon, craig.topper
Reviewed By: craig.topper
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D53521
llvm-svn: 345014
Add an intrinsic that takes 2 integers and perform unsigned saturation
addition on them.
This is a part of implementing fixed point arithmetic in clang where some of
the more complex operations will be implemented as intrinsics.
Differential Revision: https://reviews.llvm.org/D53340
llvm-svn: 344971
analyzeBranch()/insertBranch() etc. do not properly deal with an undef
flag on the eflags input and used to produce invalid MIR. I don't see
this ever affecting real world inputs (I don't think it is possible to
produce undef flags with llvm IR), so I simply changed the code to bail
out in this case.
rdar://42122367
llvm-svn: 344970
I've included a fix to DAGCombiner::ForwardStoreValueToDirectLoad that I believe will prevent the previous miscompile.
Original commit message:
Theoretically this was done to simplify the amount of isel patterns that were needed. But it also meant a substantial number of our isel patterns have to match an explicit bitcast. By making the vXi32/vXi16/vXi8 types legal for loads, DAG combiner should be able to change the load type to rem
I had to add some additional plain load instruction patterns and a few other special cases, but overall the isel table has reduced in size by ~12000 bytes. So it looks like this promotion was hurting us more than helping.
I still have one crash in vector-trunc.ll that I'm hoping @RKSimon can help with. It seems to relate to using getTargetConstantFromNode on a load that was shrunk due to an extract_subvector combine after the constant pool entry was created. So we end up decoding more mask elements than the lo
I'm hoping this patch will simplify the number of patterns needed to remove the and/or/xor promotion.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits, RKSimon
Differential Revision: https://reviews.llvm.org/D53306
llvm-svn: 344965
Summary:
Trivial continuation of D52304.
While this pattern is not canonical, we do select it in the BZHI case,
so this should not be any different.
Reviewers: RKSimon, craig.topper, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D52348
llvm-svn: 344902
This makes fast isel treat all legal vector types the same way. Previously only vXi64 was in the fast-isel tables.
This unfortunately prevents matching of andn by fast-isel for these types since the requires SelectionDAG. But we already had this issue for vXi64. So at least we're consistent now.
Interestinly it looks like fast-isel can't handle instructions with constant vector arguments so the the not part of the andn patterns is selected with SelectionDAG. This explains why VPTERNLOG shows up in some of the tests.
This is a subset of D53268. As I make progress on that, I will try to reduce the number of lines in the tablegen files.
llvm-svn: 344884
Summary:
Theoretically this was done to simplify the amount of isel patterns that were needed. But it also meant a substantial number of our isel patterns have to match an explicit bitcast. By making the vXi32/vXi16/vXi8 types legal for loads, DAG combiner should be able to change the load type to remove the bitcast.
I had to add some additional plain load instruction patterns and a few other special cases, but overall the isel table has reduced in size by ~12000 bytes. So it looks like this promotion was hurting us more than helping.
I still have one crash in vector-trunc.ll that I'm hoping @RKSimon can help with. It seems to relate to using getTargetConstantFromNode on a load that was shrunk due to an extract_subvector combine after the constant pool entry was created. So we end up decoding more mask elements than the load size.
I'm hoping this patch will simplify the number of patterns needed to remove the and/or/xor promotion.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits, RKSimon
Differential Revision: https://reviews.llvm.org/D53306
llvm-svn: 344877
This is a late backend subset of the IR transform added with:
D52439
We can confirm that the conversion to a 'trunc' is correct by running:
$ opt -instcombine -data-layout="e"
(assuming the IR transforms are correct; change "e" to "E" for big-endian)
As discussed in PR39016:
https://bugs.llvm.org/show_bug.cgi?id=39016
...the pattern may emerge during legalization, so that's we are waiting for an
insertelement to become a scalar_to_vector in the pattern matching here.
The DAG allows for fun variations that are not possible in IR. Result types for
extracts and scalar_to_vector don't necessarily match input types, so that means
we have to be a bit more careful in the transform (see code comments).
The tests show that we don't handle cases that require a shift (as we did in the
IR version). I've left that as a potential follow-up because I'm not sure if
that's a real concern at this late stage.
Differential Revision: https://reviews.llvm.org/D53201
llvm-svn: 344872
Allows to disable direct TLS segment access (%fs or %gs). GCC supports
a similar flag, it can be useful in some circumstances, e.g. when a thread
context block needs to be updated directly from user space. More info
and specific use cases: https://bugs.llvm.org/show_bug.cgi?id=16145
There is another revision for clang as well.
Related: D53102
All X86 CodeGen tests appear to pass:
```
[46/47] Running lit suite /SourceCache/llvm-trunk-8.0/test/CodeGen
Testing Time: 23.17s
Expected Passes : 3801
Expected Failures : 15
Unsupported Tests : 8021
```
Reviewed by: Craig Topper.
Patch by nruslan (Ruslan Nikolaev).
Differential Revision: https://reviews.llvm.org/D53103
llvm-svn: 344723
Without this we match the CMP+AND to a TEST and then match the SHR separately. I'm trusting analyzeCompare to remove the TEST during the peephole pass. Otherwise we need to check the flag users to see if they only use the Z flag.
This recovers a case lost by r344270.
Differential Revision: https://reviews.llvm.org/D53310
llvm-svn: 344649
Add an intrinsic that takes 2 integers and perform saturation addition on them.
This is a part of implementing fixed point arithmetic in clang where some of
the more complex operations will be implemented as intrinsics.
Differential Revision: https://reviews.llvm.org/D53053
llvm-svn: 344629
This is intended to make the backend on par with functionality that was
added to the IR version of SimplifyDemandedVectorElts in:
rL343727
...and the original motivation is that we need to improve demanded-vector-elements
in several ways to avoid problems that would be exposed in D51553.
Differential Revision: https://reviews.llvm.org/D52912
llvm-svn: 344541
Summary:
I've noticed that the bitcasts we introduce for these make computeKnownBits and computeNumSignBits not work well in LegalizeVectorOps. LegalizeVectorOps legalizes bottom up while LegalizeDAG legalizes top down. The bottom up strategy for LegalizeVectorOps means operands are legalized before their uses. So we promote and/or/xor before we legalize the operands that use them making computeKnownBits/computeNumSignBits in places like LowerTruncate suboptimal. I looked at changing LegalizeVectorOps to be top down as well, but that was more disruptive and caused some regressions. I also looked at just moving promotion of binops to LegalizeDAG, but that had a few issues one around matching AND,ANDN,OR into VSELECT because I had to create ANDN as vXi64, but the other nodes hadn't legalized yet, I didn't look too hard at fixing that.
This patch seems to produce better results overall than my other attempts. We now form broadcasts of constants better in some cases. For at least some of them the AND was being introduced in LegalizeDAG, promoted to vXi64, and the BUILD_VECTOR was also legalized there. I think we got bad ordering of that. Now the promotion is out of the legalizer so we handle this better.
In the longer term I think we really should evaluate whether we should be doing this promotion at all. It's really there to reduce isel pattern count, but I'm wondering if we'd be better served just eating the pattern cost or doing C++ based isel for vector and/or/xor in X86ISelDAGToDAG. The masked and/or/xor will definitely be difficult in patterns if a bitcast gets between the vselect and the and/or/xor node. That becomes a lot of permutations to cover.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D53107
llvm-svn: 344487
Summary: This is similar to what D52528 did for loads. It should match what generic type legalization does in 64-bit mode where it uses a v2i64 cast and an i64 store.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D53173
llvm-svn: 344470
Summary:
getShiftAmountTy for X86 returns MVT::i8. If a BSWAP or BITREVERSE is created that requires promotion and the difference between the original VT and the promoted VT is more than 255 then we won't able to create the constant.
This patch adds a check to replace the result from getShiftAmountTy to MVT::i32 if the difference won't fit. This should get legalized later when the shift is ultimately expanded since its clearly an illegal type that we're only promoting to make it a power of 2 bit width. Alternatively we could base the decision completely on the largest shift amount the promoted VT could use.
Vectors should be immune here because getShiftAmountTy always returns the incoming VT for vectors. Only the scalar shift amount can be changed by the targets.
Reviewers: eli.friedman, RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D53232
llvm-svn: 344460
Use isConstantSplat instead of ISD::isConstantSplatVector to let us us peek through to illegal types (in this case for i686 targets to recognise i64 constants)
llvm-svn: 344452
If we have better CTLZ support than CTPOP, then use cttz(x) = width - ctlz(~x & (x - 1)) - and remove the CTTZ_ZERO_UNDEF handling as it no longer gives better codegen.
Similar to rL344447, this is also closer to LegalizeDAG's approach
llvm-svn: 344448
This patch changes the vector CTTZ lowering from:
cttz(x) = ctpop((x & -x) - 1)
to:
cttz(x) = ctpop(~x & (x - 1))
Not only does this make better use of the PANDN instruction, but it also matches the LegalizeDAG method which should allow us to remove the x86 specific code at some point in the future (we need to fix some issues with the bitcasted logic ops and CTPOP lowering first).
Differential Revision: https://reviews.llvm.org/D53214
llvm-svn: 344447
Add shuffle lowering for the case where we can shuffle the lanes into place followed by an in-lane permute.
This is mainly for cases where we can have non-repeating permutes in each lane, but for now I've just enabled it for v4f64 unary shuffles to fix PR39161 - there is no test coverage for other shuffles that might benefit yet.
We now have several cross-lane shuffle lowering methods that all do something similar - I've looked at merging some of these (notably by making the repeated mask mechanism in lowerVectorShuffleByMerging128BitLanes optional), but there is a lot of assertions/assumptions in the way that makes this tricky - I ended up going for adding yet another relatively simple method instead.
Differential Revision: https://reviews.llvm.org/D53148
llvm-svn: 344446
This saves a conversion to extracts and build_vector. We already do this when both the result and the input need to be widened to the same type.
This changed the sse-intrinsics-fast-isel test because we don't lower (insert_vector_elt (scalar_to_vector X), Y, 1) well. We turn it into (vector_shuffle (scalar_to_vector X), (scalar_to_vector Y), <0, 4, 2, 3>) losing track of the fact that the upper elts could be undef.
We should probably find a way to prevent the scalarization of the <2 x f32> load on these tests.
llvm-svn: 344404
This is the planned follow-up to D52997. Here we are reducing horizontal vector math codegen
by default. AMD Jaguar (btver2) should have no difference with this patch because it has
fast-hops. (If we want to set that bit for other CPUs, let me know.)
The code changes are small, but there are many test diffs. For files that are specifically
testing for hops, I added RUNs to distinguish fast/slow, so we can see the consequences
side-by-side. For files that are primarily concerned with codegen other than hops, I just
updated the CHECK lines to reflect the new default codegen.
To recap the recent horizontal op story:
1. Before rL343727, we were producing hops for all subtargets for a variety of patterns.
Hops were likely not optimal for all targets though.
2. The IR improvement in r343727 exposed a hole in the backend hop pattern matching, so
we reduced hop codegen for all subtargets. That was bad for Jaguar (PR39195).
3. We restored the hop codegen for all targets with rL344141. Good for Jaguar, but
probably bad for other CPUs.
4. This patch allows us to distinguish when we want to produce hops, so everyone can be
happy. I'm not sure if we have the best predicate here, but the intent is to undo the
extra hop-iness that was enabled by r344141.
Differential Revision: https://reviews.llvm.org/D53095
llvm-svn: 344361
Summary:
Reland of
- r344197 "[MC][ELF] compute entity size for explicit sections"
- r344206 "[MC][ELF] Fix section_mergeable_size.ll"
after being reverted in r344278 due to build breakages from not
specifying a target triple.
Move test from test/CodeGen/Generic/ to test/MC/ELF/.
Add explicit target triple so we don't try to run
this test on non ELF targets.
Reported: https://reviews.llvm.org/D53056#1261707
Reviewers: fhahn, rnk, espindola, NoQ
Reviewed By: fhahn, rnk
Subscribers: NoQ, MaskRay, rengolin, emaste, arichardson, llvm-commits, pirama, srhines
Differential Revision: https://reviews.llvm.org/D53146
llvm-svn: 344360
Generalize SelectionDAGLegalize's CTLZ expansion to handle vectors - lets VectorLegalizer::ExpandCTLZ to just pass the expansion on instead of repeating the same codegen.
llvm-svn: 344349
Pull out repeated byte sum stage for popcount of vector elements > 8bits.
This allows us to simplify the LUT/BITMATH popcnt code to always assume vXi8 vectors, and also improves avx512bitalg codegen which only has access to vpopcntb/vpopcntw.
llvm-svn: 344348
Fixes PR32160 by reducing the size of PSHUFB if we only use one of the lanes.
This approach can probably be generalized to handle any target shuffle (and any subvector index) but we have no test coverage at the moment.
llvm-svn: 344336
I want to add another pattern here that includes scalar_to_vector,
so this makes that patch smaller. I was hoping to remove the
hasOneUse() check because it shouldn't be necessary for common
codegen, but an AMDGPU test has a comment suggesting that the
extra check makes things better on one of those targets.
llvm-svn: 344320
On 64-bit targets the generic legalize will use an i64 load and a scalar_to_vector for us. But on 32-bit targets i64 isn't legal and the generic legalizer will end up emitting two 32-bit loads. We have DAG combines that try to put those two loads back together with pretty good success.
This patch instead uses f64 to avoid the splitting entirely. I've made it do the same for 64-bit mode for consistency and to keep the load in the fp domain.
There are a few things in here that look like regressions in 32-bit mode, but I believe they bring us closer to the 64-bit mode codegen. And that the 64-bit mode code could be better. I think those issues should be looked at separately.
Differential Revision: https://reviews.llvm.org/D52528
llvm-svn: 344291
This is an alternative to D53080 since I think using a BEXTR for a shifted mask is definitely an improvement when the shl can be absorbed into addressing mode. The other cases I'm less sure about.
We already have several tricks for handling an and of a shift in address matching. This adds a new case for BEXTR.
I've moved the BEXTR matching code back to X86ISelDAGToDAG to allow it to match. I suppose alternatively we could directly emit a X86ISD::BEXTR node that isel could pattern match. But I'm trying to view BEXTR matching as an isel concern so DAG combine can see 'and' and 'shift' operations that are well understood. We did lose a couple cases from tbm_patterns.ll, but I think there are ways to recover that.
I've also put back the manual load folding code in matchBEXTRFromAnd that I removed a few months ago in r324939. This gives us some more freedom to make decisions based on the ability to fold a load. I haven't done anything with that yet.
Differential Revision: https://reviews.llvm.org/D53126
llvm-svn: 344270
Summary:
As discussed in D48491, we can't really do this in the TableGen,
since we need to produce *two* instructions. This only implements
one single pattern. The other 3 patterns will be in follow-ups.
I'm not sure yet if we want to also fuse shift into here
(i.e `(x >> start) & ...`)
Reviewers: RKSimon, craig.topper, spatel
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D52304
llvm-svn: 344224
Summary:
As discussed in [[ https://bugs.llvm.org/show_bug.cgi?id=38938 | PR38938 ]],
we fail to emit `BEXTR` if the mask is shifted.
We can't deal with that in `X86DAGToDAGISel` `before the address mode for the inc is selected`,
and we can't really do it in the normal DAGCombine, because we don't have generic `ISD::BitFieldExtract` node,
and if we simply turn the shifted mask into a normal mask + shift-left, it will be folded back.
So it would seem X86ISelLowering is the place to handle this.
This patch only moves the matchBEXTRFromAnd()
from X86DAGToDAGISel to X86ISelLowering.
It does not add support for the 'shifted mask' pattern.
Reviewers: RKSimon, craig.topper, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D52426
llvm-svn: 344179
Summary:
GlobalISel generates incorrect code because the legalizer artifact
combiner assumes `G_[SZ]EXT (G_IMPLICIT_DEF)` is equivalent to
`G_IMPLICIT_DEF `.
Replace `G_[SZ]EXT (G_IMPLICIT_DEF)` with 0 because the top bits
will be 0 for G_ZEXT and 0/1 for the G_SEXT.
Reviewers: aditya_nandakumar, dsanders, aemerson, javed.absar
Reviewed By: aditya_nandakumar
Subscribers: rovka, kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D52996
llvm-svn: 344163
Summary:
Extend analysis forwarding loads from preceeding stores to work with
extended loads and truncated stores to the same address so long as the
load is fully subsumed by the store.
Hexagon's swp-epilog-phis.ll and swp-memrefs-epilog1.ll test are
deleted as they've no longer seem to be relevant.
Reviewers: RKSimon, rnk, kparzysz, javed.absar
Subscribers: sdardis, nemanjai, hiraditya, atanasyan, llvm-commits
Differential Revision: https://reviews.llvm.org/D49200
llvm-svn: 344142
This is intended to restore horizontal codegen to what it looked like before IR demanded elements improved in:
rL343727
As noted in PR39195:
https://bugs.llvm.org/show_bug.cgi?id=39195
...horizontal ops can be worse for performance than a shuffle+regular binop, so I've added a TODO. Ideally, we'd
solve that in a machine instruction pass, but a quicker solution will be adding a 'HasFastHorizontalOp' feature
bit to deal with it here in the DAG.
Differential Revision: https://reviews.llvm.org/D52997
llvm-svn: 344141
When SimplifyCFG changes the PHI node into a select instruction, the debug line records becomes ambiguous. It causes the debugger to display unreachable source lines.
Differential Revision: https://reviews.llvm.org/D52887
llvm-svn: 344120
We already do the following combines:
(bitcast int (and (bitcast fp X to int), 0x7fff...) to fp) -> fabs X
(bitcast int (xor (bitcast fp X to int), 0x8000...) to fp) -> fneg X
When the target has "bit preserving fp logic". This patch just extends it
to also combine:
(bitcast int (or (bitcast fp X to int), 0x8000...) to fp) -> fneg (fabs X)
As some targets have fnabs and even those that don't can efficiently lower
both the fabs and the fneg.
Differential revision: https://reviews.llvm.org/D44548
llvm-svn: 344093
This may give slightly better opportunities for DAG combine to simplify with the operations before the setcc. It also matches the type the xors will eventually be promoted to anyway so it saves a legalization step.
Almost all of the test changes are because our constant pool entry is now v2i64 instead of v4i32 on 64-bit targets. On 32-bit targets getConstant should be emitting a v4i32 build_vector and a v4i32->v2i64 bitcast.
There are a couple test cases where it appears we now combine a bitwise not with one of these xors which caused a new constant vector to be generated. This prevented a constant pool entry from being shared. But if that's an issue we're concerned about, it seems we need to address it another way that just relying a bitcast to hide it.
This came about from experiments I've been trying with pushing the promotion of and/or/xor to vXi64 later than LegalizeVectorOps where it is today. We run LegalizeVectorOps in a bottom up order. So the and/or/xor are promoted before their users are legalized. The bitcasts added for the promotion act as a barrier to computeKnownBits if we try to use it during vector legalization of a later operation. So by moving the promotion out we can hopefully get better results from computeKnownBits/computeNumSignBits like in LowerTruncate on AVX512. I've also looked at running LegalizeVectorOps in a top down order like LegalizeDAG, but thats showing some other issues.
llvm-svn: 344071
As noted in D52747, if we prefer IR to use trunc for bool vectors rather
than and+icmp, we can expose codegen shortcomings as seen here with masked store.
Replace a hard-coded PCMPGT simplification with the more general demanded bits call
to improve things.
Differential Revision: https://reviews.llvm.org/D52964
llvm-svn: 344048
As discussed on D52964, this adds 256-bit *_EXTEND_VECTOR_INREG lowering support for AVX1 targets to help improve SimplifyDemandedBits handling.
Differential Revision: https://reviews.llvm.org/D52980
llvm-svn: 344019
One case left around nonsensical operands for the KILL instruction
which the machine verifier checks for nowadays. While this should not
hurt in release builds we should fix the machine verifier errors anyway.
llvm-svn: 344008
More tests related to PR39195:
https://bugs.llvm.org/show_bug.cgi?id=39195
If we limit the horizontal codegen, it may require different
constraints for FP and integer.
llvm-svn: 343994
This patch implements a pass that optimizes condition branches on x86 by
taking advantage of the three-way conditional code generated by compare
instructions.
Currently, it tries to hoisting EQ and NE conditional branch to a dominant
conditional branch condition where the same EQ/NE conditional code is
computed. An example:
bb_0:
cmp %0, 19
jg bb_1
jmp bb_2
bb_1:
cmp %0, 40
jg bb_3
jmp bb_4
bb_4:
cmp %0, 20
je bb_5
jmp bb_6
Here we could combine the two compares in bb_0 and bb_4 and have the
following code:
bb_0:
cmp %0, 20
jg bb_1
jl bb_2
jmp bb_5
bb_1:
cmp %0, 40
jg bb_3
jmp bb_6
For the case of %0 == 20 (bb_5), we eliminate two jumps, and the control height
for bb_6 is also reduced. bb_4 is gone after the optimization.
This optimization is motivated by the branch pattern generated by the switch
lowering: we always have pivot-1 compare for the inner nodes and we do a pivot
compare again the leaf (like above pattern).
This pass currently is enabled on Intel's Sandybridge and later arches. Some
reviewers pointed out that on some arches (like AMD Jaguar), this pass may
increase branch density to the point where it hurts the performance of the
branch predictor.
Differential Revision: https://reviews.llvm.org/D46662
llvm-svn: 343993
Some necessary yak shaving before lowering *_EXTEND_VECTOR_INREG 256-bit vectors on AVX1 targets as suggested by D52964.
Differential Revision: https://reviews.llvm.org/D52970
llvm-svn: 343991
Support G_UDIV/G_UREM/G_SREM. The instruction selection
code is taken from FastISel with only minor tweaks to adapt
for GlobalISel.
Differential Revision: https://reviews.llvm.org/D49781
llvm-svn: 343966
This change is proposed as a part of D44548, but we
need this independently to avoid regressions from improved
undef propagation in SimplifyDemandedVectorElts().
llvm-svn: 343940
rL343913 was using SimplifyDemandedBits's original demanded mask instead of the adjusted 'NewMask' that accounts for multiple uses of the op (those variable names really need improving....).
Annoyingly many of the test changes (back to pre-rL343913 state) are actually safe - but only because their multiple uses are all by PMULDQ/PMULUDQ.
Thanks to Jan Vesely (@jvesely) for bisecting the bug.
llvm-svn: 343935
Prevents missing other simplifications that may occur deep in the operand chain where CommitTargetLoweringOpt won't add the PMULDQ back to the worklist itself
llvm-svn: 343922
Attempt to simplify PSHUFB masks (even non-constant ones) - we should probably be able to simplify other variable shuffles as well as the need arises.
llvm-svn: 343919
This patch enables SimplifyDemandedBits to call SimplifyDemandedVectorElts in cases where the demanded bits mask covers entire elements of a bitcasted source vector.
There are a couple of cases here where simplification at a deeper level (such as through bitcasts) prevents further simplification - CommitTargetLoweringOpt only adds immediate uses/users back to the worklist when we might want to combine the original caller again to see what else it can simplify.
As well as that I had to disable handling of bool vector until SimplifyDemandedVectorElts better supports some of their opcodes (SETCC, shifts etc.).
Fixes PR39178
Differential Revision: https://reviews.llvm.org/D52935
llvm-svn: 343913
The comments in this code say we were trying to avoid 16-bit immediates, but if the immediate fits in 8-bits this isn't an issue. This avoids creating a zero extend that probably won't go away.
The movmskb related changes are interesting. The movmskb instruction writes a 32-bit result, but fills the upper bits with 0. So the zero_extend we were previously emitting was free, but we turned a -1 immediate that would fit in 8-bits into a 32-bit immediate so it was still bad.
llvm-svn: 343871
And use that to transform fsub with zero constant operands.
The integer part isn't used yet, but it is proposed for use in
D44548, so adding both enhancements here makes that
patch simpler.
llvm-svn: 343865