Extend the existing lowering of vXi8 multiplies to support v64i8 on avx512bw targets.
I added the Lower512IntArith helper function to help with this - not sure how often this could be used in the future, but it seemed better than putting all that logic inside LowerMUL.
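For reference, a scalar model of what a v64i8 multiply computes per lane (illustrative only - the real lowering does this in vector registers, roughly by widening to i16 and truncating back):
```
#include <cstdint>

// Per-lane semantics of a vXi8 multiply: widen to 16 bits, multiply,
// truncate back to 8 bits. The vector lowering performs the same steps
// on whole registers.
void mul_v64i8(const uint8_t *a, const uint8_t *b, uint8_t *out) {
  for (int i = 0; i < 64; ++i)
    out[i] = static_cast<uint8_t>(uint16_t(a[i]) * uint16_t(b[i]));
}
```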
Differential Revision: http://reviews.llvm.org/D18937
llvm-svn: 265902
Summary:
After we make the adjustment, we can assume that for local allocas, but
not for stack parameters, the return address, or any other fixed stack
object (which has a negative offset and therefore lies prior to the
adjusted SP).
Fixes PR26662.
Reviewers: hfinkel, qcolombet, rnk
Subscribers: rnk, llvm-commits
Differential Revision: http://reviews.llvm.org/D18471
llvm-svn: 265886
This patch adds support for decoding XOP VPPERM instruction when it represents a basic shuffle.
The mask decoding required the existing MCInstrLowering code to be updated to support binary shuffles - the implementation now matches what is done in X86InstrComments.cpp.
Differential Revision: http://reviews.llvm.org/D18441
llvm-svn: 265874
It caused PR27275: "ARM: Bad machine code: Using an undefined physical register"
Also reverting the following commits that were landed on top:
r265610 "Fix the compare-clang diff error introduced by r265547."
r265639 "Fix the sanitizer bootstrap error in r265547."
r265657 "InlineSpiller.cpp: Escap \@ in r265547. [-Wdocumentation]"
llvm-svn: 265790
It seems that llc cannot be used in assembler tests, so a test that
checks the asm for a particular target needs to be moved to codegen.
llvm-svn: 265770
Summary:
EHPad BBs are not entered the classic way and therefore do not need to be placed after their predecessors. This patch makes sure EHPad BBs are not chosen amongst successors to form chains, and are selected as a last resort when selecting the best candidate.
EHPads are scheduled in reverse probability order so that they flow into each other naturally.
Reviewers: chandlerc, majnemer, rafael, MatzeB, escha, silvas
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D17625
llvm-svn: 265726
Re-apply r265450 which caused PR27245 and was reverted in r265559
because of a wrong generalization: the fetch_and_add->add_and_fetch
combine only works in specific, but pretty common, cases:
(icmp slt x, 0) -> (icmp sle (add x, 1), 0)
(icmp sge x, 0) -> (icmp sgt (add x, 1), 0)
(icmp sle x, 0) -> (icmp slt (sub x, 1), 0)
(icmp sgt x, 0) -> (icmp sge (sub x, 1), 0)
Original Message:
We only generate LOCKed versions of add/sub when the result is unused.
It often happens that the result is used, but only by a comparison. We
can optimize those out by reusing EFLAGS, which lets us use the proper
instructions, instead of having to fall back to LXADD.
Instead of doing this as an MI peephole (as we do for the other
non-LOCKed (really, non-MR) forms), do it in ISel. It becomes quite
tricky later.
This also makes it eventually possible to stop expanding and/or/xor
if the only user is an icmp (also see D18141).
This uses the LOCK ISD opcodes added by r262244.
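A minimal source-level example of the pattern this targets (hypothetical function name; the comment sketches the intended codegen):
```
#include <atomic>

// The old value's sign test can reuse the EFLAGS produced by a plain
// `lock add`, so no LXADD + CMP pair is needed.
bool old_value_was_negative(std::atomic<int> &counter) {
  return counter.fetch_add(1, std::memory_order_seq_cst) < 0;
}
```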
Differential Revision: http://reviews.llvm.org/D17633
llvm-svn: 265636
Third time's the charm? The previous attempt (r265345) caused ASan test
failures on X86, as broken CFI caused stack traces to not work.
This version of the patch makes sure not to merge with stack adjustments
that have CFI, and to not add merged instructions' offsets to the CFI
about to be generated.
This is already covered by the lit tests; I just got the expectations
wrong previously.
llvm-svn: 265623
when DenseMap grew and moved memory. I verified it fixed the bootstrap
problem on x86_64-linux-gnu but I cannot verify whether it fixes
the bootstrap error on clang-ppc64be-linux. I will watch the build-bot
result closely.
Replace analyzeSiblingValues with new algorithm to fix its compile
time issue. The patch is to solve PR17409 and its duplicates.
analyzeSiblingValues is an N x N complexity algorithm where N is
the number of siblings generated by reg splitting. Although it
causes significant compile time issues when N is large, it is also
important for performance since it removes redundant spills and
enables rematerialization.
To solve the compile time issue, the patch removes analyzeSiblingValues
and replaces it with lower cost alternatives containing two parts. The
first part creates a new spill hoisting method in postOptimization of
register allocation. It does spill hoisting at once after all the spills
are generated instead of inside every instance of selectOrSplit. The
second part queries the defining expression of the original register for
rematerialization and keeps it always available during register allocation
even if it is already dead. It deletes those dead instructions only in
postOptimization. With the two parts in the patch, it can remove
analyzeSiblingValues without sacrificing performance.
Differential Revision: http://reviews.llvm.org/D15302
llvm-svn: 265547
While preserving the return value for @llvm.experimental.deoptimize at
the IR level is useful during mid-level optimization, doing so at the
machine instruction level requires generating some extra code and a
return that is non-ideal. This change has LLVM lower
```
%val = call @llvm.experimental.deoptimize
ret %val
```
to effectively
```
call @__llvm_deoptimize()
unreachable
```
instead.
llvm-svn: 265502
Bionic has a defined thread-local location for the stack protector
cookie. Emit a direct load instead of going through __stack_chk_guard.
llvm-svn: 265481
We only generate LOCKed versions of add/sub when the result is unused.
It often happens that the result is used, but only by a comparison. We
can optimize those out by reusing EFLAGS, which lets us use the proper
instructions, instead of having to fall back to LXADD.
Instead of doing this as an MI peephole (as we do for the other
non-LOCKed (really, non-MR) forms), do it in ISel. It becomes quite
tricky later.
This also makes it eventually possible to stop expanding and/or/xor
if the only user is an icmp (also see D18141).
This uses the LOCK ISD opcodes added by r262244.
Differential Revision: http://reviews.llvm.org/D17633
llvm-svn: 265450
utils/update_test_checks.py was improved with:
http://reviews.llvm.org/rL265414
to include the first line of the function (expected to be
a comment line). This ensures that nothing bad has happened
before the first actual line of checked asm. It also matches
the existing behavior of the old script.
llvm-svn: 265416
Presently, CodeGenPrepare deletes all nearly empty (only phi and branch)
basic blocks. This pass can delete loop preheaders which frequently creates
critical edges. A preheader can be a convenient place to spill registers to
the stack. If the entrance to a loop body is a critical edge, then spills
may occur in the loop body rather than immediately before it. This patch
protects loop preheaders from deletion in CodeGenPrepare even if they are
nearly empty.
Since the patch alters the CFG, it affects a large number of test cases.
In most cases, the changes are merely cosmetic (basic blocks have different
names or instruction orders change slightly). I am somewhat concerned about
the test/CodeGen/Mips/brdelayslot.ll test case. If the loop preheader is not
deleted, then the MIPS backend does not take advantage of a branch delay
slot. Consequently, I would like some close review by a MIPS expert.
The patch also partially subsumes D16893 from George Burgess IV. George
correctly notes that CodeGenPrepare does not actually preserve the dominator
tree. I think the dominator tree was usually not valid when CodeGenPrepare
ran, but I am using LoopInfo to mark preheaders, so the dominator tree is
now always valid before CodeGenPrepare.
Author: Tom Jablin (tjablin)
Reviewers: hfinkel george.burgess.iv vkalintiris dsanders kbarton cycheng
http://reviews.llvm.org/D16984
llvm-svn: 265397
Summary:
I encountered this issue when constant folding during inlining tried to
fold away a bitcast of a double to an x86_mmx, which is not an integral
type. The test case exposes the same issue with a smaller code snippet
during early CSE.
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D18528
llvm-svn: 265367
The original commit miscompiled things on 32-bit Windows, e.g. a Clang
bootstrap. It turns out that mergeSPUpdates() was a bit too generous in
what it interpreted as a stack adjustment, causing the following code:
addl $12, %esp
leal -4(%ebp), %esp
To be "optimized" into simply:
addl $8, %esp
This commit tightens up mergeSPUpdates() and includes a new test
(test14 in movtopush.ll) for this situation.
llvm-svn: 265345
Adds some dllexport tests to verify that:
- Variables in bss are exported appropriately
- Non-dllexport symbols aliased to dllexport symbols are not exported
- Symbols declared as dllexport but are not defined are not exported
We plan to enable dllimport/dllexport support for the PS4, and these
additional tests are for points we noticed in our internal testing.
Patch by Warren Ristow!
Differential Revision: http://reviews.llvm.org/D18682
llvm-svn: 265333
Replace analyzeSiblingValues with new algorithm to fix its compile
time issue. The patch is to solve PR17409 and its duplicates.
analyzeSiblingValues is an N x N complexity algorithm where N is
the number of siblings generated by reg splitting. Although it
causes significant compile time issues when N is large, it is also
important for performance since it removes redundant spills and
enables rematerialization.
To solve the compile time issue, the patch removes analyzeSiblingValues
and replaces it with lower cost alternatives containing two parts. The
first part creates a new spill hoisting method in postOptimization of
register allocation. It does spill hoisting at once after all the spills
are generated instead of inside every instance of selectOrSplit. The
second part queries the defining expression of the original register for
rematerialization and keeps it always available during register allocation
even if it is already dead. It deletes those dead instructions only in
postOptimization. With the two parts in the patch, it can remove
analyzeSiblingValues without sacrificing performance.
Differential Revision: http://reviews.llvm.org/D15302
llvm-svn: 265309
Implemented truncstore for KNL and skylake-avx512.
Covered vectors from v2i1 to v64i1. We save the value in bits (not in bytes) - v32i1 is saved in 4 bytes.
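A scalar model of the bit-packing (illustrative only):
```
#include <cstdint>

// 32 one-bit lanes pack into 4 bytes - one bit per lane, not one byte.
uint32_t pack_v32i1(const bool lanes[32]) {
  uint32_t bits = 0;
  for (int i = 0; i < 32; ++i)
    bits |= uint32_t(lanes[i]) << i;
  return bits;  // stored with a 4-byte store
}
```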
Differential Revision: http://reviews.llvm.org/D18740
llvm-svn: 265283
Was there really no other way to splat a byte in SSE2?
punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw {{.*#+}} xmm0 = xmm0[0,0,0,0,4,5,6,7]
pshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
llvm-svn: 265172
Note however that this is identical to the existing SSE2 run.
What we really want is yet another run for an SSE2 machine that
also has fast unaligned 16-byte accesses.
llvm-svn: 265167
Follow-up to http://reviews.llvm.org/D18566 and http://reviews.llvm.org/D18676 -
where we noticed that an intermediate splat was being generated for memsets of
non-zero chars.
That was because we told getMemsetStores() to use a 32-bit vector element type,
and it happily obliged by producing that constant using an integer multiply.
The 16-byte test that was added in D18566 is now equivalent for AVX1 and AVX2
(no splats, just a vector load), but we have PR27141 to track that splat difference.
Note that the SSE1 path is not changed in this patch. That can be a follow-up.
This patch should resolve PR27100.
llvm-svn: 265161
Follow-up to D18566 - where we noticed that an intermediate splat was being
generated for memsets of non-zero chars.
That was because we told getMemsetStores() to use a 32-bit vector element type,
and it happily obliged by producing that constant using an integer multiply.
The tests that were added in the last patch are now equivalent for AVX1 and AVX2
(no splats, just a vector load), but we have PR27141 to track that splat difference.
In the new tests, the splat via shuffling looks ok to me, but there might be some
room for improvement depending on uarch there.
Note that the SSE1/2 paths are not changed in this patch. That can be a follow-up.
This patch should resolve PR27100.
Differential Revision: http://reviews.llvm.org/D18676
llvm-svn: 265148
This changes some dllexport tests, to verify that some symbols that
should not be exported are not, in a way that improves the robustness
of CHECK-SAME interaction with CHECK-NOT.
We plan to enable dllimport/dllexport support for the PS4, and these
changes are for points we noticed in our internal testing.
Patch by Warren Ristow!
llvm-svn: 265106
Re-enable an assertion enabled by Justin Lebar in rL265092. rL265092
was breaking test/CodeGen/X86/deopt-intrinsic.ll because webkit_jscc
does not like non-i64 return types. Change the test case to not do
that.
llvm-svn: 265099
This mostly cosmetic patch moves the DebugEmissionKind enum from DIBuilder
into DICompileUnit. DIBuilder is not the right place for this enum to live
in — a metadata consumer should not have to include DIBuilder.h.
I also added a Verifier check that checks that the emission kind of a
DICompileUnit is actually legal.
http://reviews.llvm.org/D18612
<rdar://problem/25427165>
llvm-svn: 265077
We don't really support non-constant shuffle masks, but these tests are for cases where BUILD_VECTOR is made up from vector extracts (as well as undef/zero scalars).
llvm-svn: 265045
The test case was defining and using a function 'notExported()', but
the FileCheck checks were checking for the name 'not_exported'. This
changes the test to use 'notExported' across the board. Also, the test
defined a function 'not_defined()', but doesn't have any checks related
to it. For consistency, this name is changed to 'notDefined'. A later
commit will add checks for 'notDefined'.
Patch by Warren Ristow!
llvm-svn: 264984
The size savings are significant, and from what I can tell, both ICC and GCC do this.
Differential Revision: http://reviews.llvm.org/D18573
llvm-svn: 264966
Fix for an issue introduced by D17297, where we were breaking early from the loop detecting consecutive loads, which could leave us thinking a consecutive load with zeros was possible.
llvm-svn: 264922
XOP's VPPERM has some great 'permute operations' that it can do as well as part of shuffling the bytes of a 128-bit vector - in this case we use it to perform BITREVERSE in a single instruction.
llvm-svn: 264870
We are currently doing a REALLY bad job of packing results of vector comparisons into the legalized <X x i1> result equivalents - a mixture of PACKSS/PMOVMSKB would be much better here.
llvm-svn: 264867
operations.
Specifically, we had code that tried to badly approximate reconstructing
all of the possible variations on addressing modes in two x86
instructions based on those in one pseudo instruction. This is not the
first bug uncovered with doing this, so stop doing it altogether.
Instead generically and pedantically copy every operand from the address
over to both new instructions, and strip kill flags from any register
operands.
This fixes a subtle bug seen in the wild where we would mysteriously
drop parts of the addressing mode, causing for example the index
argument in the added test case to just be completely ignored.
Hypothetically, this was an extremely bad miscompile because it actually
caused a predictable and leveragable write of a 64bit quantity to an
unintended offset (the first element of the array instead of whatever
other element was intended). As a consequence, in theory this could even
have introduced security vulnerabilities.
However, this was only something that could happen with an atomic
floating point add. No other operation could trigger this bug, so it
seems extremely unlikely to have occurred widely in the wild.
But it did in fact occur, and frequently in scientific applications
which were using relaxed atomic updates of a floating point value after
adding a delta. Those would end up being quite badly miscompiled by
LLVM, which is how we found this. Of course, this often looks like
a race condition in the code, but it was actually a miscompile.
I suspect that this whole RELEASE_FADD thing was a complete mistake.
There is no such operation, and I worry that anything other than add
will get remarkably worse code generation. But that's not for this
change....
llvm-svn: 264845
1. Removed the run line for mingw32 and made the Darwin triples unknown.
This is a test of 32-bit vs. 64-bit platform and the underlying hardware.
We have other tests for checking behavioral differences of the OS platform.
2. Changed the CPU specifiers to the attributes they were meant to represent.
Any CPU that doesn't have SSE4.2 is assumed to have slow unaligned 16-byte accesses,
so it won't use those here.
3. Although the stores really could all be CHECK-DAG, I left them as CHECK-NEXT to
show the strange behavior of the instruction scheduler in the SLOW_32 case.
4. The odd-looking instructions are due to the use of a null pointer in the IR, so
we have integer immediate store addresses. Cute.
llvm-svn: 264796
Add function soft attribute to the generation of Jump Tables in CodeGen
as initial step towards clang support of gcc's no-jump-table support
Reviewers: hans, echristo
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D18321
llvm-svn: 264756
Fixed fp_to_uint instruction selection on KNL.
One pattern was missing for <4 x double> to <4 x i32>
Differential Revision: http://reviews.llvm.org/D18512
llvm-svn: 264701
Minimum density for both optsize and non optsize are now options
-sparse-jump-table-density (default 10) for non optsize functions
-dense-jump-table-density (default 40) for optsize functions, which
matches the current default. This improves several benchmarks at google
at the cost of a small codesize increase. For code compiled with -Os,
the old behavior continues
llvm-svn: 264689
If all a BUILD_VECTOR's source elements are the same bit (AND/XOR/OR) operation type and each has one constant operand, lower to a pair of BUILD_VECTOR and just apply the bit operation to the vectors.
The constant operands will form a constant vector meaning that we still only have a single BUILD_VECTOR to lower and we will have replaced all the scalarized operations with a single SSE equivalent.
It's not in our interest to start making a general-purpose vectorizer from this, but I'm seeing enough of these scalar bit operations from the later legalization/scalarization stages to support them at least; see the sketch below.
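A scalar sketch of the pattern the combine targets (hypothetical values):
```
#include <cstdint>

// Every lane performs the same bit op (AND) with a constant operand,
// so the four scalar ANDs can become a single vector AND against the
// constant vector {0xFF, 0xFF00, 0xF0F0, 0x0F0F}.
void and_lanes(const uint32_t *x, uint32_t *out) {
  out[0] = x[0] & 0xFF;
  out[1] = x[1] & 0xFF00;
  out[2] = x[2] & 0xF0F0;
  out[3] = x[3] & 0x0F0F;
}
```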
Differential Revision: http://reviews.llvm.org/D18492
llvm-svn: 264666
MachineFunctionProperties represents a set of properties that a MachineFunction
can have at particular points in time. Existing examples of this idea are
MachineRegisterInfo::isSSA() and MachineRegisterInfo::tracksLiveness() which
will eventually be switched to use this mechanism.
This change introduces the AllVRegsAllocated property; i.e. the property that
all virtual registers have been allocated and there are no VReg operands
left.
With this mechanism, passes can declare that they require a particular property
to be set, or that they set or clear properties by implementing e.g.
MachineFunctionPass::getRequiredProperties(). The MachineFunctionPass base class
verifies that the requirements are met, and handles the setting and clearing
based on the declarations. Passes can also directly query and update the current
properties of the MF if they want to have conditional behavior.
This change annotates the target-independent post-regalloc passes; future
changes will also annotate target-specific ones.
Reviewers: qcolombet, hfinkel
Differential Revision: http://reviews.llvm.org/D18421
llvm-svn: 264593
ICMP instruction selection fails on SKX and KNL for i1 operands.
I use XOR to resolve:
(A == B) is equivalent to (A xor B) == 0
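The equivalence is easy to check exhaustively for one-bit values:
```
#include <cassert>

int main() {
  // A == B holds exactly when (A xor B) == 0, for all i1 values.
  for (int a = 0; a <= 1; ++a)
    for (int b = 0; b <= 1; ++b)
      assert((a == b) == ((a ^ b) == 0));
  return 0;
}
```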
Differential Revision: http://reviews.llvm.org/D18511
llvm-svn: 264566
Currently this is to mainly to prevent scalarization of integer division by constants.
Differential Revision: http://reviews.llvm.org/D18307
llvm-svn: 264511
64-bit, 32-bit and 16-bit move-immediate instructions are 7, 6, and 5 bytes,
respectively, whereas and/or with 8-bit immediate is only three bytes.
Since these instructions imply an additional memory read (which the CPU could
elide, but we don't think it does), restrict these patterns to minsize functions.
Differential Revision: http://reviews.llvm.org/D18374
llvm-svn: 264440
This is the same as r255936, with added logic for avoiding clobbering of the
red zone (PR26023).
Differential Revision: http://reviews.llvm.org/D18246
llvm-svn: 264375
It is incorrect to get the corresponding MBB for a ReturnInst before
SelectAllBasicBlocks since SelectAllBasicBlocks can change the
correspondence between a ReturnInst and the MBB it is in.
PR27062
llvm-svn: 264358
Earlier we were ignoring varargs in LowerCallSiteWithDeoptBundle because
populateCallLoweringInfo does not set CallLoweringInfo::IsVarArg.
llvm-svn: 264354
Summary:
Only adds support for "naked" calls to llvm.experimental.deoptimize.
Support for round-tripping through RewriteStatepointsForGC will come
as a separate patch (should be simpler than this one).
Reviewers: reames
Subscribers: sanjoy, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D18429
llvm-svn: 264329
Given that StatepointLowering now uniques derived pointers before
putting them in the per-statepoint spill map, we may end up with missing
entries for derived pointers when we visit a gc.relocate on a pointer
that was de-duplicated away.
Fix this by keeping two maps, one mapping gc pointers to their
de-duplicated values, and one mapping a de-duplicated value to the slot
it is spilled in.
llvm-svn: 264320
KTEST instruction may be used instead of TEST in this case:
%int_sel3 = bitcast <8 x i1> %sel3 to i8
%res = icmp eq i8 %int_sel3, zeroinitializer
br i1 %res, label %L2, label %L1
Differential Revision: http://reviews.llvm.org/D18444
llvm-svn: 264298
This patch begins adding support for lowering to the XOP VPPERM instruction - adding the X86ISD::VPPERM opcode.
Differential Revision: http://reviews.llvm.org/D18189
llvm-svn: 264260
We need the "return address" of a noreturn call to be within the
bounds of the calling function; TrapUnreachable turns 'unreachable'
into a 'ud2' instruction, which has that desired effect.
Differential Revision: http://reviews.llvm.org/D18414
llvm-svn: 264224
Currently, AnalyzeBranch() fails for non-equality comparisons between floating
point values on X86 (see https://llvm.org/bugs/show_bug.cgi?id=23875). This is because this
function can modify the branch by reversing the conditional jump and removing
unconditional jump if there is a proper fall-through. However, in the case of
non-equality comparison between floating points, this can turn the branch
"unanalyzable". Consider the following case:
jne .BB1
jp .BB1
jmp .BB2
.BB1:
...
.BB2:
...
AnalyzeBranch() will reverse "jp .BB1" to "jnp .BB2" and then "jmp .BB2" will be
removed:
jne .BB1
jnp .BB2
.BB1:
...
.BB2:
...
However, AnalyzeBranch() cannot analyze this branch anymore as there are two
conditional jumps with different targets. This may disable some optimizations
like block-placement: in this case the fall-through behavior is enforced even if
the fall-through block is very cold, which is suboptimal.
Actually this optimization is also done in block-placement pass, which means we
can remove this optimization from AnalyzeBranch(). However, currently
X86::COND_NE_OR_P and X86::COND_NP_OR_E are not reversible: there is no defined
negation conditions for them.
In order to reverse them, this patch defines two new CondCode X86::COND_E_AND_NP
and X86::COND_P_AND_NE. It also defines how to synthesize instructions for them.
Here only the second conditional jump is reversed. This is valid as we only need
them to do this "unconditional jump removal" optimization.
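For context, the two-jump pattern above comes from ordinary floating point code like this (illustrative only; the comment shows the assumed lowering):
```
// On x86, `a != b` must also be true for unordered (NaN) operands, so
// ucomiss is followed by both a jne and a jp to the same target.
bool differs(float a, float b) {
  return a != b;
}
```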
Differential Revision: http://reviews.llvm.org/D11393
llvm-svn: 264199
We were just completely ignoring the types when determining whether we could
safely emit a libcall as a tail call. This is clearly wrong.
Theoretically, we could dig deeper looking for incidental matches (much like
the generic code in Analysis.cpp does), but it's probably not worth it for the
few libcalls that exist.
llvm-svn: 264084
Improve vector extension of vectors on hardware without dedicated VSEXT/VZEXT instructions.
We already convert these to SIGN_EXTEND_VECTOR_INREG/ZERO_EXTEND_VECTOR_INREG but can further improve this by using the legalizer instead of prematurely splitting into legal vectors in the combine as this only properly helps for lowering to VSEXT/VZEXT.
Removes a lot of unnecessary any_extend + mask pattern - (Fix for PR25718).
Reapplied with a fix for PR26953 (missing vector widening legalization).
Differential Revision: http://reviews.llvm.org/D17932
llvm-svn: 264062
Summary:
After this change, deopt operand bundles can be lowered directly by
SelectionDAG into STATEPOINT instructions (which are then lowered to a
call or a sequence of nops, with an associated __llvm_stackmaps entry).
This obviates the need to round-trip deoptimization state through
gc.statepoint via RewriteStatepointsForGC.
Reviewers: reames, atrick, majnemer, JosephTremoulet, pgavlin
Subscribers: sanjoy, mcrosier, majnemer, llvm-commits
Differential Revision: http://reviews.llvm.org/D18257
llvm-svn: 264015
Improve computeZeroableShuffleElements to be able to peek through bitcasts to extract zero/undef values from BUILD_VECTOR nodes of different element sizes to the shuffle mask.
Differential Revision: http://reviews.llvm.org/D14261
llvm-svn: 263906
We were being too aggressive in trying to combine a shuffle into a blend-with-zero pattern, often resulting in an endless loop of contrasting combines.
This patch stops the combine if we already have a blend in place (means we miss some domain corrections)
llvm-svn: 263717
We can currently only match zeroable vector elements of the same size as the shuffle type - these tests demonstrate the problem and a solution will be shortly added in an updated D14261
llvm-svn: 263606
This patch adds support for the MachO .alt_entry assembly directive, and uses
it for global aliases with non-zero GEP offsets. The alt_entry flag indicates
that a symbol should be laid out immediately after the preceding symbol.
Conceptually it introduces an alternate entry point for a function or data
structure. E.g.:
safe_foo:
// check preconditions for foo
.alt_entry fast_foo
fast_foo:
// body of foo, can assume preconditions.
The .alt_entry flag is also implicitly set on assembly aliases of the form:
a = b + C
where C is a non-zero constant, since these have the same effect as an
alt_entry symbol: they introduce a label that cannot be moved relative to the
preceding one. Setting the alt_entry flag on aliases of this form fixes
http://llvm.org/PR25381.
llvm-svn: 263521
Converting masked vector loads to regular vector loads for x86 AVX should always be a win.
I raised the legality issue of reading the extra memory bytes on llvm-dev. I did not see any
objections.
1. x86 already does this kind of optimization for multiple scalar loads -> vector load.
2. If other targets have the same flexibility, we could move this transform up to CGP or DAGCombiner.
Differential Revision: http://reviews.llvm.org/D18094
llvm-svn: 263446
For cases where we are truncating an integer vector arithmetic result, it may be better to pre-truncate the input operands - no code to support this yet (scalar is done with SimplifyDemandedBits but adding vector support could be a lot of work) but these tests represent the current codegen status.
Example bugs: PR14666, PR22703
llvm-svn: 263384
The SSE41 v8i16 shift lowering using (v)pblendvb is great for non-constant shift amounts, but if it is constant then we can efficiently reduce the VSELECT to shuffles with the pre-SSE41 lowering.
llvm-svn: 263383
cmpxchg[8|16]b uses RBX as one of its arguments.
In other words, using this instruction clobbers RBX, as it is defined to hold one
of the inputs. When the backend uses a dynamically allocated stack, RBX is used as a
reserved register for the base pointer.
Reserved registers have special semantics that only the target understands and
enforces. Because of that, the register allocator doesn't use them, but also
doesn't try to make sure they are used properly (remember, it does not know how
they are supposed to be used).
Therefore, when RBX is used as a reserved register but defined by something that
is not compatible with that use, the register allocator will not fix the
surrounding code to make sure it gets saved and restored properly around the
broken code. This is the responsibility of the target to do the right thing with
its reserved register.
To fix that, when the base pointer needs to be preserved, we use a different
pseudo instruction for cmpxchg that saves RBX.
That pseudo takes two more arguments than the regular instruction:
- One is the value to be copied into RBX to set the proper value for the
comparison.
- The other is the virtual register holding the save of the value of RBX as the
base pointer. This saving is done as part of isel (i.e., we emit a copy from
rbx).
cmpxchg_save_rbx <regular cmpxchg args>, input_for_rbx_reg, save_of_rbx_as_bp
This gets expanded into:
rbx = copy input_for_rbx_reg
cmpxchg <regular cmpxchg args>
rbx = save_of_rbx_as_bp
Note: The actual modeling of the pseudo is a bit more complicated to make sure
the interferences that appear after the pseudo gets expanded are properly modeled
before that expansion.
This fixes PR26883.
llvm-svn: 263325
Improve vector extension of vectors on hardware without dedicated VSEXT/VZEXT instructions.
We already convert these to SIGN_EXTEND_VECTOR_INREG/ZERO_EXTEND_VECTOR_INREG but can further improve this by using the legalizer instead of prematurely splitting into legal vectors in the combine as this only properly helps for lowering to VSEXT/VZEXT.
Removes a lot of unnecessary any_extend + mask pattern - (Fix for PR25718).
Differential Revision: http://reviews.llvm.org/D17932
llvm-svn: 263303
It's not enough that we test for SSSE3 - that's only OK for 128-bit vectors - we also need to test for AVX2 / AVX512BW for the 256/512-bit vector cases.
llvm-svn: 263239
Generalise the existing SIGN_EXTEND to SIGN_EXTEND_VECTOR_INREG combine to support zero extension as well and get rid of a lot of unnecessary ANY_EXTEND + mask patterns.
Reapplied with a fix for PR26870 (avoid premature use of TargetConstant in ZERO_EXTEND_VECTOR_INREG expansion).
Differential Revision: http://reviews.llvm.org/D17691
llvm-svn: 263159
This patch fixes the problem which occurs when loop-vectorize tries to use @llvm.masked.load/store intrinsic for a non-default addrspace pointer. It fails with "Calling a function with a bad signature!" assertion in CallInst constructor because it tries to pass a non-default addrspace pointer to the pointer argument which has default addrspace.
The fix is to add pointer type as another overloaded type to @llvm.masked.load/store intrinsics.
Reviewed By: reames
Differential Revision: http://reviews.llvm.org/D17270
llvm-svn: 263158
When trying to replace an add to esp with pops, we need to choose dead
registers to pop into. Registers clobbered by the call and not imp-def'd
by it should be safe. Except that it's not enough to check the register
itself isn't defined, we also need to make sure no overlapping registers
are defined either.
This fixes PR26711.
Differential Revision: http://reviews.llvm.org/D18029
llvm-svn: 263139
This patch reorders the combining of target shuffle masks so that when a unary shuffle takes a binary shuffle as its input but only references one of its inputs it can correctly combine into a unary shuffle mask.
This is starting to encroach on the purpose of resolveTargetShuffleInputs, but I don't want to remove it until we definitely know we won't need it for full binary shuffle combining.
There is a lot more work before we can properly support binary target shuffle masks but this was an easy case to add support for.
Differential Revision: http://reviews.llvm.org/D17858
llvm-svn: 263102
Operation SCALAR_TO_VECTOR for v64i8 and v32i16 should be lowered if BW feature is "on".
Differential Revision: http://reviews.llvm.org/D17994
llvm-svn: 263097
Instead of a variable-blend instruction, form a blend with immediate because those are always cheaper.
Differential Revision: http://reviews.llvm.org/D17899
llvm-svn: 263067
The fix consisting in using the library call for atomic compare and swap when
the instruction is not safe to use may be incorrect. Indeed, the library call may
not exist on all platforms. In other words, we need a better fix!
llvm-svn: 262943
I noticed this test as part of:
http://reviews.llvm.org/D11393
...which is confusing enough as-is.
Let's show the exact codegen, so the changes will be more obvious.
llvm-svn: 262874
Patch to add support for target shuffle combining of X86ISD::VPERMV3 nodes, including support for detecting unary shuffles.
This uncovered several issues with the X86ISD::VPERMV3 shuffle mask decoding of non-64 bit shuffle mask elements - the bit masking wasn't being correctly computed.
Removed non-constant pool mask decode path as we have no way of testing it right now.
Differential Revision: http://reviews.llvm.org/D17916
llvm-svn: 262809
Added support for decoding VPERMILPS variable shuffle masks that aren't in the constant pool.
Added target shuffle mask decoding for SCALAR_TO_VECTOR+VZEXT_MOVL cases - these can happen for v2i64 constant re-materialization
Followup to D17681
llvm-svn: 262784
When the lowering of the setjmp intrinsic requires
a global base pointer to be set, make sure such pointer
gets defined by the CGBR pass.
This fixes PR26742.
llvm-svn: 262762
cmpxchgXXb uses RBX as one of its implicit arguments. I.e., when
we use that instruction we need to clobber RBX. This is generally
fine, except when RBX is a reserved register because in that case,
the register allocator will not track its value and will not
save and restore it when interferences occur.
rdar://problem/24851412
llvm-svn: 262759
The x86 ret instruction has a 16 bit immediate indicating how many bytes
to pop off of the stack beyond the return address.
There is a problem when extremely large structs are passed by value: we
might not be able to fit the number of bytes to pop into the return
instruction.
To fix this, expand RET_FLAG a little later and use a special sequence
to clean the stack:
pop %ecx ; return address is now in %ecx
add $n, %esp ; clean the stack
push %ecx ; bring the return address back on the stack
ret ; pop the return address and jmp to its value
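A sketch of a trigger (hypothetical names, MSVC-style __stdcall on 32-bit x86): any callee-pops convention with a by-value argument larger than 64 KiB cannot encode the pop amount in the 16-bit immediate.
```
// `ret $imm16` caps at 65535 bytes, so popping ~70000 bytes of
// argument space needs the pop/add/push/ret expansion shown above.
struct Huge { char bytes[70000]; };
void __stdcall consume(Huge h);  // callee must pop the argument bytes
```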
llvm-svn: 262755
The divrem combine assumed the type of the div/rem is simple, which isn't
necessarily true. This probably worked fine until r250825, since it only
saw legal types, but now breaks when it runs as a pre-type-legalization
combine.
This fixes PR26835.
Differential Revision: http://reviews.llvm.org/D17878
llvm-svn: 262746
The variable mask form of VPERMILPD/VPERMILPS were only partially implemented, with much of it still performed as an intrinsic.
This patch properly defines the instructions in terms of X86ISD::VPERMILPV, permitting the opcode to be easily combined as a target shuffle.
Differential Revision: http://reviews.llvm.org/D17681
llvm-svn: 262635
That's not the case for VPERMV/VPERMV3, which cover all possible
combinations (the C intrinsics use a different order; the AVX vs
AVX512 intrinsics are different still).
Since:
r246981 AVX-512: Lowering for 512-bit vector shuffles.
VPERMV is recognized in getTargetShuffleMask.
This breaks assumptions in most callers, as they expect
the non-mask operands to start at index 0.
VPERMV has the mask as operand #0; VPERMV3 has it in the middle.
Instead of the faulty assumption, have getTargetShuffleMask return
its operands as well.
One alternative we considered was to change the operand order of
VPERMV, but we agreed to stick to the instruction order, as there
is more AVX512 weirdness to cover (vpermt2/vpermi2 in particular).
Differential Revision: http://reviews.llvm.org/D17041
llvm-svn: 262627
This test failed in some ARM bots after a divmod change because it was
running on a native llc, instead of targeted one. This makes sure the test
is target-specific (as intended), and also copies to ARM and AArch64
directories. If it is also supposed to work on other architectures, I'll
leave as an exercise to the respective maintainers.
llvm-svn: 262620
Generalise the existing SIGN_EXTEND to SIGN_EXTEND_VECTOR_INREG combine to support zero extension as well and get rid of a lot of unnecessary ANY_EXTEND + mask patterns.
Differential Revision: http://reviews.llvm.org/D17691
llvm-svn: 262599
The code was previously not able to track a boolean argument
at a call site back to the formal argument of the caller.
Differential Revision: http://reviews.llvm.org/D17786
llvm-svn: 262575
If we have a loop with a rarely taken path, we will prune that from the blocks which get added as part of the loop chain. The problem is that we weren't then recognizing the loop chain as schedulable when considering the preheader when forming the function chain. We'd then fall to various non-predecessors before finally scheduling the loop chain (as if the CFG were unnatural). The net result was that there could be lots of garbage between a loop preheader and the loop, even though we could have directly fallen into the loop. It also meant we separated hot code with regions of colder code.
The particular reason for the rejection of the loop chain was that we were scanning the predecessors of the header, seeing the backedge, believing that was a globally more important predecessor (true), but forgetting to account for the fact that the backedge predecessor was already part of the existing loop chain (oops!).
Differential Revision: http://reviews.llvm.org/D17830
llvm-svn: 262547
Catch objects with a displacement of zero do not initialize a catch
object. The displacement is relative to %rsp at the end of the
function's prologue for x86_64 targets.
If we place an object at the top-of-stack, we will end up with a
displacement of zero resulting in our catch object remaining
uninitialized.
Address this by creating our catch objects as fixed objects. We will
ensure that the UnwindHelp object is created after the catch objects so
that no catch object will have a displacement of zero.
Differential Revision: http://reviews.llvm.org/D17823
llvm-svn: 262546
This reverts commit r262370.
It turns out there is code out there that does sequences of allocas
greater than 4K: http://crbug.com/591404
The goal of this change was to improve the code size of inalloca call
sequences, but we got tangled up in the mess of dynamic allocas.
Instead, we should come back later with a separate MI pass that uses
dominance to optimize the full sequence. This should also be able to
remove the often unneeded stacksave/stackrestore pairs around the call.
llvm-svn: 262505
We have a number of useful lowering strategies for VBROADCAST instructions (both from memory and register element 0) which the 128-bit form of the MOVDDUP instruction can make use of.
This patch tweaks lowerVectorShuffleAsBroadcast to enable it to broadcast 2f64 args using MOVDDUP as well.
It does require a slight tweak to the lowerVectorShuffleAsBroadcast mechanism as the existing MOVDDUP lowering uses isShuffleEquivalent which can match binary shuffles that can lower to (unary) broadcasts.
Differential Revision: http://reviews.llvm.org/D17680
llvm-svn: 262478
We modeled the RDFLAGS{32,64} operations as "using" {E,R}FLAGS.
While technically correct, this is not desirable for folks who want
to examine aspects of the FLAGS register which are not related to
computation like whether or not CPUID is a valid instruction.
Differential Revision: http://reviews.llvm.org/D17782
llvm-svn: 262465
The _chkstk function is called by the compiler to probe the stack in an
order consistent with Windows' expectations. However, it is possible to
elide the call to _chkstk and manually adjust the stack pointer if we
can prove that the allocation is fixed size and smaller than the probe
size.
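A sketch of the eligible case (hypothetical `use` helper; the probe size is assumed to be one 4 KiB page):
```
extern void use(char *buf);

// The frame is fixed-size and well under a page, so the stack pointer
// can be adjusted directly with no _chkstk call.
void small_frame() {
  char buf[512];
  use(buf);
}
```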
This shrinks chrome.dll, chrome_child.dll and chrome.exe by a
cumulative ~133 KB.
Differential Revision: http://reviews.llvm.org/D17679
llvm-svn: 262370
In the code below on 32-bit targets, x would previously get forwarded to g()
without sign-extension to 32 bits as required by the parameter attribute.
void g(signed short);
void f(unsigned short x) {
g(x);
}
llvm-svn: 262352
The CatchObjOffset is relative to the end of the EH registration node
for 32-bit x86 WinEH targets. A special sentinel value, 0, is used to
indicate that no catch object should be initialized.
This means that a catch object allocated immediately before the
registration node would be assigned a CatchObjOffset of 0, leading the
runtime to believe that a catch object should not be initialized.
To handle this, allocate the registration node prior to any other frame
object. This will ensure that catch objects will not be allocated
before the registration node.
This fixes PR26757.
Differential Revision: http://reviews.llvm.org/D17689
llvm-svn: 262294
In the case where op = add, y = base_ptr, and x = offset, this
transform:
(op y, (op x, c1)) -> (op (op x, y), c1)
breaks the canonical form of add by putting the base pointer in the
second operand and the offset in the first.
This fix is important for the R600 target, because for some address
spaces the base pointer and the offset are stored in separate register
classes. The old pattern caused the ISel code for matching addressing
modes to put the base pointer and offset in the wrong register classes,
which required non-trivial code transformations to fix.
llvm-svn: 262148
This is one of the cases shown in:
https://llvm.org/bugs/show_bug.cgi?id=26701
Shift and negate is what InstCombine appears to prefer, so I've started with that pattern.
Note that the 'pcmpeq' instructions are always generating the negative one for the actual
'pcmpgt' comparison in each case (side note: why isn't there an alias mnemonic for that?).
Differential Revision: http://reviews.llvm.org/D17630
llvm-svn: 262036
MBB slot index intervals are half open, not closed. getMBBEndIndex()
returns the slot index of the start of the next block in layout order.
Placing a register mask there is incorrect if the successor of the
funclet return is not laid out after the return. Clang generates IR for
catch bodies before generating the following normal code, so we never
noticed this issue until the D frontend authors filed a bug about it.
Instead, we can put the clobber mask on the last instruction of the
funclet return block. We still aren't using a register mask operand on
the CATCHRET instruction because it would cause PEI to spill all CSRs,
including XMM regs, in the prologue.
Fixes PR26679.
llvm-svn: 262035
This also simplifies the code by removing the overly conservative
NoInterveningSideEffect() function. This function checked:
- That the two copies belong to the same block: We only process one
block at a time and clear our maps in between, so it is impossible to find a
copy from a different block.
- There is no terminator between the two copy instructions: This is not
allowed anyway (the MachineVerifier would complain)
- Does not have instructions with hasUnmodeledSideEffects() or isCall()
set: Even for those instructions we must have all clobbers/defs of
registers explicit as operands. If the register is explicitly
clobbered we would never come to the point of checking for
NoInterveningSideEffect() anyway.
(I also checked this with a temporary build of the test-suite with all
potentially failing conditions in NoInterveningSideEffect() turned into
asserts)
Differential Revision: http://reviews.llvm.org/D17474
llvm-svn: 261965
Summary:
Both the hardware and LLVM have changed since 2012.
Now, the load-based heuristic doesn't show big differences any more on OoO cores.
There are no notable regressions or improvements on spec2000/2006 (Cortex-A57, Core i5).
Reviewers: spatel, zansari
Differential Revision: http://reviews.llvm.org/D16836
llvm-svn: 261809
This fixes bugs in copy elimination code in llvm. It slightly changes the
semantics of clearRegisterKills(). This is appropriate because:
- Users in lib/CodeGen/MachineCopyPropagation.cpp and
lib/Target/AArch64RedundantCopyElimination.cpp and
lib/Target/SystemZ/SystemZElimCompare.cpp are incorrect without it
(see included testcase).
- All other users in llvm are unaffected (they pass TRI==nullptr)
- (Kill flags are optional anyway so removing too many shouldn't hurt.)
Differential Revision: http://reviews.llvm.org/D17554
llvm-svn: 261763
Part 2 of 2
This patch add support for combining target shuffles into blends-with-zero.
Differential Revision: http://reviews.llvm.org/D17483
llvm-svn: 261745
Part 1 of 2
This patch attempts to replace the insertion of zero scalars with a vector blend with zero, avoiding the use of the integer insertion instructions (which are particularly slow on many targets).
(Part 2 will add support for combining multiple blends-with-zero).
Differential Revision: http://reviews.llvm.org/D17483
llvm-svn: 261743
Summary: If a function is hot, put it in text.hot section.
Reviewers: davidxl
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D17532
llvm-svn: 261607
Summary: If a function is hot, put it in text.hot section.
Reviewers: davidxl
Subscribers: eraman, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D17460
llvm-svn: 261582
Add support for the case where we have a consecutive load (which must include the first + last elements) with a mixture of undef/zero elements. We load the vector and then apply a shuffle to clear the zero'd elements.
Differential Revision: http://reviews.llvm.org/D17297
llvm-svn: 261490
Summary:
- Rename `"skylake"` == SkylakeServerProc to `"skylake-avx512"`
- Change `"skylake"` to denote SkylakeClientProc
- Fix the detection of cpu family 6 and model 94 to be
SkylakeClientProc instead of SkylakeServerProc
- Remove the `"cnl"` for CannonLake
Reviewers: craig.topper, delena
Subscribers: zansari, echristo, qcolombet, RKSimon, spatel, DavidKreitzer, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D17090
llvm-svn: 261482
COFF doesn't have sections with mergeable contents. Instead, each
constant pool entry ends up in a COMDAT section. The linker, when
choosing between COMDAT sections, doesn't choose the max alignment of
the two sections. You just get whatever alignment was on the section.
If one constant needs a higher alignment in one object file than in
another, then we will get into trouble if the linker chooses the
lower-alignment one.
Instead, let's promote the alignment of the constant pool entry to make
sure we don't use an under-aligned constant with an instruction which
assumed otherwise.
This fixes PR26680.
llvm-svn: 261462
As discussed on PR24580, this patch adds some (more to come) initial fast-isel codegen tests to match the IR generated in clang/test/CodeGen/sse41-builtins.c
llvm-svn: 261438
Fixed a bug introduced by D16683: when a binary shuffle is simplified to a unary shuffle (with undef/zero sentinel mask indices), if this resulted in only the second input being used, combineX86ShuffleChain failed to take this into account and still referenced the first input.
llvm-svn: 261434
TLSADDR nodes are lowered into actual calls inside MC. In order to prevent
shrink-wrapping from pushing the prologue/epilogue past them (which would result
in TLS variables being accessed before the stack frame is set up), we
put markers, so that the stack gets adjusted properly.
Thanks to Quentin Colombet for guidance/help on how to fix this problem!
llvm-svn: 261387
Summary:
When optimizing for size, sqrt calls can be incorrectly selected as
AVX512 VSQRT instructions. This is because X86InstrAVX512.td has a
`Requires<[OptForSize]>` in its `avx512_sqrt_scalar` multiclass
definition. Even if the target does not support AVX512, the class can
apparently still be chosen, leading to an incorrect selection of
`vsqrtss`.
In PR26625, this led to an assertion: Reg >= X86::FP0 && Reg <=
X86::FP6 && "Expected FP register!", because the `vsqrtss` instruction
requires an XMM register, which is not available on i686 CPUs.
Reviewers: grosbach, resistor, joker.eph
Subscribers: spatel, emaste, llvm-commits
Differential Revision: http://reviews.llvm.org/D17414
llvm-svn: 261360
allocateStackSlot did not consider the size of the value to be spilled
before deciding to re-use a spill slot. This was originally okay (since
originally we'd only ever spill pointers), but it became not okay when
we changed our scheme to directly spill vectors of pointers.
While this change fixes the bug pointed out, it has two performance
caveats:
- It matches spill slot and spillee size exactly, while in theory we
can spill, e.g., an 8 byte pointer into a 16 byte slot. This is
slightly complicated to fix since in the stackmaps section, we report
the size of the spill slot as the size of the "indirect value"; and
if they're no longer equivalent, we'll have to keep track of the
(indirect) value size separately from the stack slot size.
- It will "spuriously run out" of reusable slots, since we now have an
second check in the search loop in addition to the availablity
check (e.g. you had two free scalar slots, and you first ask for a
vector slot followed by a scalar slot). I'll fix this in a later
commit.
llvm-svn: 261336
As discussed on PR24580, this patch adds some (more to come) initial fast-isel codegen tests to match the IR generated in clang/test/CodeGen/avx-builtins.c
llvm-svn: 261329
Summary:
Without this, this command
$ llvm-run llc -stop-after machine-cp -o - <( echo '' )
outputs an error, because we close stdout twice -- once when closing the
file opened for "-o", and again when closing outs().
Also clarify in the outs() definition that you can't ever call it if you
want to open your own raw_fd_ostream on stdout.
Reviewers: jroelofs, tstellarAMD
Subscribers: jholewinski, qcolombet, dsanders, llvm-commits
Differential Revision: http://reviews.llvm.org/D17422
llvm-svn: 261286
If we know that all of our successors want to be in the exact same
state, it makes sense to hoist the state transition into their common
predecessor.
Differential Revision: http://reviews.llvm.org/D17391
llvm-svn: 261262
In r260133, LLVM was changed to no longer extend i8/i16 return values,
as it's not required by the ABI. However, code was found in the wild
that relies on the old behaviour on Darwin, so this commit reverts
back to that old behaviour for Darwin.
On other platforms, it's less likely that code would be depending on
the old behaviour, as GCC and MSVC haven't been extending such return
values.
llvm-svn: 261235
In cases where the PSHUFB shuffle mask is shared it might not be bitcasted to a vXi8 byte vector. This patch adds support for decoding these wider shuffle masks from the ConstantPool.
The test case in question makes use of this to recognise that the shuffle mask is a unary UNPCKL pattern and simplifies accordingly.
llvm-svn: 261201
32-bit x86 Windows targets use a linked-list of nodes allocated on the
stack, referenced to via thread-local storage. The personality routine
interprets one of the fields in the node as a 'state number' which
indicates where the personality routine should transfer control.
State transitions are possible only before call-sites which may throw
exceptions. Our previous scheme had us update the state number before
all call-sites which may throw.
Instead, we can try to minimize the number of times we need to store by
reasoning about the nearest store which dominates the current call-site.
If the last store agrees with the current call-site, then we know that
the state-update is redundant and can be elided.
This is largely straightforward: an RPO walk of the blocks allows us to
correctly forward propagate the information when the function is a DAG.
Currently, loops are not handled optimally and may trigger superfluous
state stores.
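A sketch of the elided store (hypothetical callees that may throw and share an EH state):
```
extern void may_throw_a();
extern void may_throw_b();

void f() {
  try {
    may_throw_a();  // store the state number, then call
    may_throw_b();  // dominating store already set this state: elided
  } catch (...) {
  }
}
```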
Differential Revision: http://reviews.llvm.org/D16763
llvm-svn: 261122
We are getting better at combining constant pshufb masks - use a real input instead of undef.
Add test for decoding multi-use bitcasted masks as well (actual support will come soon).
llvm-svn: 261101
Bug description:
The bug was discovered when a test was compiled with -O0.
When the scatter result is the DAG root, VectorLegalizer failed (asserted) because LowerMSCATTER() returned the kmask as its result.
Change LowerMSCATTER() to return the chain, as the original node does.
Differential Revision: http://reviews.llvm.org/D17331
llvm-svn: 261090
AVX1 doesn't support the shuffling of 256-bit integer vectors. For 32/64-bit elements we get around this by shuffling as float/double but for 8/16-bit elements (assuming they can't widen) we currently just split, shuffle as 128-bit vectors and concatenate the results back.
This patch adds the ability to lower using the bit-blend patterns before defaulting to the splitting behaviour.
Part 2 of 2
Differential Revision: http://reviews.llvm.org/D17292
llvm-svn: 261082
AVX1 doesn't support the shuffling of 256-bit integer vectors. For 32/64-bit elements we get around this by shuffling as float/double but for 8/16-bit elements (assuming they can't widen) we currently just split, shuffle as 128-bit vectors and concatenate the results back.
This patch adds the ability to lower using the bit-mask patterns before defaulting to the splitting behaviour. In some cases this ends up matching what AVX2 would do anyhow or what AVX1 does on the split vectors.
Part 1 of 2
Differential Revision: http://reviews.llvm.org/D17292
llvm-svn: 261081
__chkstk clobbers EAX. If EAX is live across the prologue, then we have
to take extra steps to save it. We already had code to do this if EAX
was a register parameter. This change adapts it to work when shrink
wrapping is used.
llvm-svn: 261039
Currently, we sometimes miscompile this vector pattern:
(c ? -v : v)
We lower it to (because "c" is <4 x i1>, lowered as a vector mask):
(~c & v) | (c & -v)
When we have SSSE3, we incorrectly lower that to PSIGN, which does:
(c < 0 ? -v : c > 0 ? v : 0)
in other words, when c is either all-ones or all-zero:
(c ? -v : 0)
While this is an old bug, it rarely triggers because the PSIGN combine
is too sensitive to operand order. This will be improved separately.
Note that the PSIGN tests are also incorrect. Consider:
%b.lobit = ashr <4 x i32> %b, <i32 31, i32 31, i32 31, i32 31>
%sub = sub nsw <4 x i32> zeroinitializer, %a
%0 = xor <4 x i32> %b.lobit, <i32 -1, i32 -1, i32 -1, i32 -1>
%1 = and <4 x i32> %a, %0
%2 = and <4 x i32> %b.lobit, %sub
%cond = or <4 x i32> %1, %2
ret <4 x i32> %cond
if %b is zero:
%b.lobit = <4 x i32> zeroinitializer
%sub = sub nsw <4 x i32> zeroinitializer, %a
%0 = <4 x i32> <i32 -1, i32 -1, i32 -1, i32 -1>
%1 = <4 x i32> %a
%2 = <4 x i32> zeroinitializer
%cond = or <4 x i32> %a, zeroinitializer
ret <4 x i32> %a
whereas we currently generate:
psignd %xmm1, %xmm0
retq
which returns 0, as %xmm1 is 0.
Instead, use a pure logic sequence, as described in:
https://graphics.stanford.edu/~seander/bithacks.html#ConditionalNegate
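One common scalar form of that trick, for reference (the vector version applies it lane-wise, with the <4 x i1> mask materialized as 0 / -1):
```
// c is all-zeros or all-ones: (v ^ c) - c yields -v when c == -1
// (~v + 1) and leaves v unchanged when c == 0.
int conditional_negate(int v, int c) {
  return (v ^ c) - c;
}
```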
Fixes PR26110.
Differential Revision: http://reviews.llvm.org/D17181
llvm-svn: 261023
We're going to stop generating PSIGN, so calling a test "psign"
isn't ideal. Instead, call these tests what they really are:
variable blends using logic.
Also add a test to exhibit a case we're currently missing in
the PSIGN combine.
llvm-svn: 261022
If KMOVB is not supported (it requires AVX512DQ), only KMOVW can be used, so the store size should be 2 bytes.
Differential Revision: http://reviews.llvm.org/D17138
llvm-svn: 260878
This patch attempts to represent a shuffle as a repeating shuffle (recognisable by is128BitLaneRepeatedShuffleMask) with the source input(s) in their original lanes, followed by a single permutation of the 128-bit lanes to their final destinations.
On AVX2 we can additionally attempt to match using 64-bit sub-lane permutation. AVX2 can also now match a similar 'broadcasted' repeating shuffle.
This patch has several benefits:
* Avoids prematurely matching with lowerVectorShuffleByMerging128BitLanes which can require both inputs to have their input lanes permuted before shuffling.
* Can replace PERMPS/PERMD instructions - although these are useful for cross-lane unary shuffling, they require their shuffle mask to be pre-loaded (and increase register pressure).
* Matching the repeating shuffle makes use of a lot of existing shuffle lowering.
There is an outstanding minor AVX1 regression (combine_unneeded_subvector1 in vector-shuffle-combining.ll) of a previously 128-bit shuffle + subvector splat being converted to a subvector splat + (2 instruction) 256-bit shuffle, I intend to fix this in a followup patch for review.
Differential Revision: http://reviews.llvm.org/D16537
llvm-svn: 260834
As shown in:
https://llvm.org/bugs/show_bug.cgi?id=23203
...we currently die because lowering believes that mfence is allowed without SSE2 on x86-64,
but the instruction def doesn't know that.
I don't know if allowing mfence without SSE is right, but if not, at least now it's consistently wrong. :)
Differential Revision: http://reviews.llvm.org/D17219
llvm-svn: 260828
Summary:
This patch skips DAG combine of fp_round (fp_round x) if it results in
an fp_round from f80 to f16.
fp_round from f80 to f16 always generates an expensive (and as yet,
unimplemented) libcall to __truncxfhf2. This prevents selection of
native f16 conversion instructions from f32 or f64. Moreover, the first
(value-preserving) fp_round from f80 to either f32 or f64 may become a
NOP in platforms like x86.
Reviewers: ab
Subscribers: srhines, llvm-commits
Differential Revision: http://reviews.llvm.org/D17221
llvm-svn: 260769
The code change is simple enough: instead of attaching an anonymous SDLoc to splatted
vector constants, use the scalar constant's existing SDLoc since that is what is passed
into getConstant() as a param. But this changes instruction scheduling, so I'll explain
why that happens.
The motivation for this patch starts near:
http://reviews.llvm.org/rL258833
...x86's getZeroVector() could be similarly cleaned up and I thought it would be 'NFC'.
But when I made that change locally, several x86 codegen tests wiggled.
It turns out that the lack of SDLoc consistency in getConstant() changes the way
ScheduleDAGRRList behaves. This is because the SDLoc contains 'IROrder' and some DAG
scheduler algorithms use IROrder for tie-breaking.
Differential Revision: http://reviews.llvm.org/D16972
llvm-svn: 260582
On AVX2 target we are poorly legalizing SIGN_EXTEND ops for which the input's legalized type doesn't have the same number of elements as the destination, resulting in an ANY_EXTEND followed by a SIGN_EXTEND_INREG.
This patch uses the existing SIGN_EXTEND -> SIGN_EXTEND_VECTOR_INREG combine to extend the input to the size of the result and using SIGN_EXTEND_VECTOR_INREG instead.
Differential Revision: http://reviews.llvm.org/D16994
llvm-svn: 260210
As discussed on PR26491, this patch adds support for lowering v4f32 shuffles to the MOVLHPS/MOVHLPS instructions. It also adds support for memory folding with their MOVLPS/MOVHPS load equivalents.
This first patch only really helps SSE1 targets as SSE2+ targets will widen the shuffle mask and use v2f64 equivalents (although they still combine to MOVLHPS/MOVHLPS for v2f64 splats). This will have to be addressed in a future patch, most likely when we add support for binary target shuffle combines.
Differential Revision: http://reviews.llvm.org/D16956
llvm-svn: 260168
Another opportunity to reduce masked stores: in D16691, we decided not to attempt the 'one mask element is set'
transform in InstCombine, but this should be a win for any AVX machine.
Code comments note that this transform could be extended for other targets / cases.
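An elementwise model of the 'one mask element is set' case (illustrative only):
```
#include <cstdint>

// With mask = <0,0,1,0,0,0,0,0>, the masked store touches a single
// lane, so it collapses to one scalar store.
void masked_store_one_lane(const int32_t v[8], int32_t *p) {
  p[2] = v[2];
}
```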
Differential Revision: http://reviews.llvm.org/D16828
llvm-svn: 260145
This matches GCC and MSVC's behaviour, and saves on code size.
We were already not extending i1 return values on x86_64 after r127766. This
takes that patch further by applying it to x86 target as well, and also for i8
and i16.
The ABI docs have been unclear about the required behaviour here. The new i386
psABI [1] clearly states (Table 2.4, page 14) that i1, i8, and i16 return
values do not need to be extended beyond 8 bits. The x86_64 ABI doc is being
updated to say the same [2].
Differential Revision: http://reviews.llvm.org/D16907
[1]. https://01.org/sites/default/files/file_attach/intel386-psabi-1.0.pdf
[2]. https://groups.google.com/d/msg/x86-64-abi/E8O33onbnGQ/_RFWw_ixDQAJ
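Sketch of what the new rule means for callers (hypothetical names; the comment shows the assumed caller-side codegen):
```
extern signed char pick(int x);

// The caller can no longer assume the upper bits of %al are defined,
// so it must sign-extend the returned i8 itself before widening use.
int use_pick(int x) {
  return pick(x) + 1;
}
```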
llvm-svn: 260133
The combineX86ShufflesRecursively only supports unary shuffles, but was missing the opportunity to combine binary shuffles with a zero / undef second input.
This patch resolves target shuffle inputs, converting the shuffle mask elements to SM_SentinelUndef/SM_SentinelZero where possible. It then resolves the updated mask to check if we have created a faux unary shuffle.
Additionally, we now attempt to recursively call combineX86ShufflesRecursively for all input operands (we used to just recurse for unary integer shuffles and unary unpacks) - it safely returns early if it's not a target shuffle.
Differential Revision: http://reviews.llvm.org/D16683
llvm-svn: 260063