Summary:
In mergeSPUpdates, debug values need to be ignored when getting the
previous element, otherwise debug data could have an impact on codegen.
In eliminateCallFramePseudoInstr, debug values after the erased element
could have an impact on codegen and should be skipped.
Closes PR31319 (https://llvm.org/bugs/show_bug.cgi?id=31319)
Reviewers: mkuper, MatzeB, aprantl
Subscribers: gbedwell, llvm-commits
Differential Revision: https://reviews.llvm.org/D27688
llvm-svn: 290423
Summary:
This change rewrites a core component in the ImplicitNullChecks pass for
greater simplicity since the original design was over-complicated for no
good reason. Please review this as essentially a new pass. The change
is almost NFC and I've added a test case for a scenario that this new
code handles that wasn't handled earlier.
The implicit null check pass, at its core, is a code hoisting transform.
It differs from "normal" code transforms in that it speculates
potentially faulting instructions (by design), but a lot of the usual
hazard detection logic (register read-after-write etc.) still applies.
We previously detected hazards by keeping track of registers defined and
used by machine instructions over an instruction range, but that was
unwieldy and did not actually confer any performance benefits. The
intent was to have linear time complexity over the number of machine
instructions considered, but it ended up being N^2 is practice.
This new version is more obviously O(N^2) (with N capped to 8 by
default) in hazard detection. It does not attempt to be clever in
tracking register uses or defs (the previous cleverness here was a
source of bugs).
Once this is checked in, I'll extract out the `IsSuitableMemoryOp` and
`CanHoistLoadInst` lambda into member functions (they're too complicated
to be inline lambdas) and do some other related NFC cleanups.
Reviewers: reames, anna, atrick
Subscribers: mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D27592
llvm-svn: 290394
This is a succeeding patch of https://reviews.llvm.org/D22840 to address the
issue when a value to be merged into an int64 pair is in a different BB. Redoing
the store splitting in CodeGenPrepare so we can match the pattern across multiple
BBs and move some instructions into the same BB. We still keep the code in dag
combine so that we can catch cases that show up after DAG combining runs.
Differential Revision: https://reviews.llvm.org/D25914
llvm-svn: 290365
This patch renumbers the metadata nodes in debug info testcases after
https://reviews.llvm.org/D26769. This is a separate patch because it
causes so much churn. This was implemented with a python script that
pipes the testcases through llvm-as - | llvm-dis - and then goes
through the original and new output side-by side to insert all
comments at a close-enough location.
Differential Revision: https://reviews.llvm.org/D27765
llvm-svn: 290292
I added API for creation a target specific memory node in DAG. Today, all memory nodes are common for all targets and their constructors are located in SelectionDAG.cpp.
There are some cases in X86 where we need to create a special node - truncation-with-saturation store, float-to-half-store.
In the current patch I added truncation-with-saturation nodes and I'm using them for intrinsics. In the future I plan to implement DAG lowering for truncation-with-saturation pattern.
Differential Revision: https://reviews.llvm.org/D27899
llvm-svn: 290250
The vectorcall calling convention specifies that arguments to functions are to be passed in registers, when possible.
vectorcall uses more registers for arguments than fastcall or the default x64 calling convention use.
The vectorcall calling convention is only supported in native code on x86 and x64 processors that include Streaming SIMD Extensions 2 (SSE2) and above.
The current implementation does not handle Homogeneous Vector Aggregates (HVAs) correctly and this review attempts to fix it.
This aubmit also includes additional lit tests to cover better HVAs corner cases.
Differential Revision: https://reviews.llvm.org/D27392
llvm-svn: 290240
This patch implements PR31013 by introducing a
DIGlobalVariableExpression that holds a pair of DIGlobalVariable and
DIExpression.
Currently, DIGlobalVariables holds a DIExpression. This is not the
best way to model this:
(1) The DIGlobalVariable should describe the source level variable,
not how to get to its location.
(2) It makes it unsafe/hard to update the expressions when we call
replaceExpression on the DIGLobalVariable.
(3) It makes it impossible to represent a global variable that is in
more than one location (e.g., a variable with multiple
DW_OP_LLVM_fragment-s). We also moved away from attaching the
DIExpression to DILocalVariable for the same reasons.
This reapplies r289902 with additional testcase upgrades and a change
to the Bitcode record for DIGlobalVariable, that makes upgrading the
old format unambiguous also for variables without DIExpressions.
<rdar://problem/29250149>
https://llvm.org/bugs/show_bug.cgi?id=31013
Differential Revision: https://reviews.llvm.org/D26769
llvm-svn: 290153
The original version of the code in XRayInstrumentation.cpp assumed that
functions may not have empty machine basic blocks (or that the first one
couldn't be). This change addresses that by special-casing that specific
situation.
We provide two .mir test-cases to make sure we're handling this
appropriately.
Fixes llvm.org/PR31424.
Reviewers: chandlerc
Subscribers: varno, llvm-commits
Differential Revision: https://reviews.llvm.org/D27913
llvm-svn: 290091
Not sure whether it causes and ASAN false positive or whether it
actually leads to incorrect code or whether it even exposes bad code.
Hans, I'll get you instructions to reproduce this.
llvm-svn: 290066
These nodes are only emitted for lowering FABS/FNEG/FNABS/FCOPYSIGN. Ideally we just wouldn't create these nodes if SSE2 or higher is available, but it was simple to just convert them in DAG combine.
For SSE2, AVX, and AVX512 with DQI this is no functional change as the execution domain fixing pass ensures the right domain is selected regardless of the ISD opcode.
For AVX-512 without DQI we end up using integer instructions since the floating point versions aren't available. But we were already doing that for any logical operations in code that didn't come from FABS/FNEG/FNABS/FCOPYSIGN so this seems no worse. And we get the benefit of being able to fold broadcasts now.
llvm-svn: 290060
This is recommit of r287553 after fixing the invalid loop info after eliminating an empty block and unit test failures in AVR and WebAssembly :
Summary: Merging an empty case block into the header block of switch could cause ISel to add COPY instructions in the header of switch, instead of the case block, if the case block is used as an incoming block of a PHI. This could potentially increase dynamic instructions, especially when the switch is in a loop. I added a test case which was reduced from the benchmark I was targetting.
Reviewers: t.p.northover, mcrosier, manmanren, wmi, joerg, davidxl
Subscribers: joerg, qcolombet, danielcdh, hfinkel, mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D22696
llvm-svn: 289988
This reverts commit 289920 (again).
I forgot to implement a Bitcode upgrade for the case where a DIGlobalVariable
has not DIExpression. Unfortunately it is not possible to safely upgrade
these variables without adding a flag to the bitcode record indicating which
version they are.
My plan of record is to roll the planned follow-up patch that adds a
unit: field to DIGlobalVariable into this patch before recomitting.
This way we only need one Bitcode upgrade for both changes (with a
version flag in the bitcode record to safely distinguish the record
formats).
Sorry for the churn!
llvm-svn: 289982
atomic_load_add returns the value before addition, but sets EFLAGS based on the
result of the addition. That means it's setting the flags based on effectively
subtracting C from the value at x, which is also what the outer cmp does.
This targets a pattern that occurs frequently with reference counting pointers:
void decrement(long volatile *ptr) {
if (_InterlockedDecrement(ptr) == 0)
release();
}
Clang would previously compile it (for 32-bit at -Os) as:
00000000 <?decrement@@YAXPCJ@Z>:
0: 8b 44 24 04 mov 0x4(%esp),%eax
4: 31 c9 xor %ecx,%ecx
6: 49 dec %ecx
7: f0 0f c1 08 lock xadd %ecx,(%eax)
b: 83 f9 01 cmp $0x1,%ecx
e: 0f 84 00 00 00 00 je 14 <?decrement@@YAXPCJ@Z+0x14>
14: c3 ret
and with this patch it becomes:
00000000 <?decrement@@YAXPCJ@Z>:
0: 8b 44 24 04 mov 0x4(%esp),%eax
4: f0 ff 08 lock decl (%eax)
7: 0f 84 00 00 00 00 je d <?decrement@@YAXPCJ@Z+0xd>
d: c3 ret
(Equivalent variants with _InterlockedExchangeAdd, std::atomic<>'s fetch_add
or pre-decrement operator generate the same code.)
Differential Revision: https://reviews.llvm.org/D27781
llvm-svn: 289955
This is recommit of r287553 after fixing the invalid loop info after eliminating an empty block:
Summary: Merging an empty case block into the header block of switch could cause ISel to add COPY instructions in the header of switch, instead of the case block, if the case block is used as an incoming block of a PHI. This could potentially increase dynamic instructions, especially when the switch is in a loop. I added a test case which was reduced from the benchmark I was targetting.
Reviewers: t.p.northover, mcrosier, manmanren, wmi, joerg, davidxl
Subscribers: joerg, qcolombet, danielcdh, hfinkel, mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D22696
llvm-svn: 289951
This patch implements PR31013 by introducing a
DIGlobalVariableExpression that holds a pair of DIGlobalVariable and
DIExpression.
Currently, DIGlobalVariables holds a DIExpression. This is not the
best way to model this:
(1) The DIGlobalVariable should describe the source level variable,
not how to get to its location.
(2) It makes it unsafe/hard to update the expressions when we call
replaceExpression on the DIGLobalVariable.
(3) It makes it impossible to represent a global variable that is in
more than one location (e.g., a variable with multiple
DW_OP_LLVM_fragment-s). We also moved away from attaching the
DIExpression to DILocalVariable for the same reasons.
This reapplies r289902 with additional testcase upgrades.
<rdar://problem/29250149>
https://llvm.org/bugs/show_bug.cgi?id=31013
Differential Revision: https://reviews.llvm.org/D26769
llvm-svn: 289920
idiom.
r289538: Match load by bytes idiom and fold it into a single load
r289540: Fix a buildbot failure introduced by r289538
r289545: Use more detailed assertion messages in the code ...
r289646: Add a couple of assertions to the load combine code ...
This DAG combine has a bad crash in it that is quite hard to trigger
sadly -- it relies on sneaking code with UB through the SDAG build and
into this particular combine. I've responded to the original commit with
a test case that reproduces it.
However, the code also has other problems that will require substantial
changes to address and so I'm going ahead and reverting it for now. This
should unblock us and perhaps others that are hitting the crash in the
wild and will let a fresh patch with updated approach come in cleanly
afterward.
Sorry for any trouble or disruption!
llvm-svn: 289916
This patch implements PR31013 by introducing a
DIGlobalVariableExpression that holds a pair of DIGlobalVariable and
DIExpression.
Currently, DIGlobalVariables holds a DIExpression. This is not the
best way to model this:
(1) The DIGlobalVariable should describe the source level variable,
not how to get to its location.
(2) It makes it unsafe/hard to update the expressions when we call
replaceExpression on the DIGLobalVariable.
(3) It makes it impossible to represent a global variable that is in
more than one location (e.g., a variable with multiple
DW_OP_LLVM_fragment-s). We also moved away from attaching the
DIExpression to DILocalVariable for the same reasons.
<rdar://problem/29250149>
https://llvm.org/bugs/show_bug.cgi?id=31013
Differential Revision: https://reviews.llvm.org/D26769
llvm-svn: 289902
The IRTranslator uses an additional block before the LLVM-IR entry block
to perform all the ABI lowering and the constant hoisting. Thus, this
block is the actual entry block and it falls through the LLVM-IR entry
block. However, with such representation, we end up with two basic
blocks that are not maximal.
Therefore, this patch adds a bit of canonicalization by merging both the
LLVM-IR entry block and the ABI lowering/constants hoisting into one
block, making the resulting block more likely to be maximal (indeed the
LLVM-IR entry block might not have been maximal).
llvm-svn: 289891
This is a tiny patch with a big pile of test changes.
This partially fixes PR27885:
https://llvm.org/bugs/show_bug.cgi?id=27885
My motivating case looks like this:
- vpshufd {{.*#+}} xmm1 = xmm1[0,1,0,2]
- vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
- vpblendw {{.*#+}} xmm0 = xmm0[0,1,2,3],xmm1[4,5,6,7]
+ vshufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,2]
And this happens several times in the diffs. For chips with domain-crossing penalties,
the instruction count and size reduction should usually overcome any potential
domain-crossing penalty due to using an FP op in a sequence of int ops. For chips such
as recent Intel big cores and Atom, there is no domain-crossing penalty for shufps, so
using shufps is a pure win.
So the test case diffs all appear to be improvements except one test in
vector-shuffle-combining.ll where we miss an opportunity to use a shift to generate
zero elements and one test in combine-sra.ll where multiple uses prevent the expected
shuffle combining.
Differential Revision: https://reviews.llvm.org/D27692
llvm-svn: 289837
Add the missing domain equivalences for movss, movsd, movd and movq zero extending loading instructions.
Differential Revision: https://reviews.llvm.org/D27684
llvm-svn: 289825
Summary:
This fixes an issue with MachineBlockPlacement due to a badly timed call
to `analyzeBranch` with `AllowModify` set to true. The timeline is as
follows:
1. `MachineBlockPlacement::maybeTailDuplicateBlock` calls
`TailDup.shouldTailDuplicate` on its argument, which in turn calls
`analyzeBranch` with `AllowModify` set to true.
2. This `analyzeBranch` call edits the terminator sequence of the block
based on the physical layout of the machine function, turning an
unanalyzable non-fallthrough block to a unanalyzable fallthrough
block. Normally MBP bails out of rearranging such blocks, but this
block was unanalyzable non-fallthrough (and thus rearrangeable) the
first time MBP looked at it, and so it goes ahead and decides where
it should be placed in the function.
3. When placing this block MBP fails to analyze and thus update the
block in keeping with the new physical layout.
Concretely, before (1) we have something like:
```
LBL0:
< unknown terminator op that may branch to LBL1 >
jmp LBL1
LBL1:
... A
LBL2:
... B
```
In (2), analyze branch simplifies this to
```
LBL0:
< unknown terminator op that may branch to LBL2 >
;; jmp LBL1 <- redundant jump removed
LBL1:
... A
LBL2:
... B
```
In (3), MachineBlockPlacement goes ahead with its plan of putting LBL2
after the first block since that is profitable.
```
LBL0:
< unknown terminator op that may branch to LBL2 >
;; jmp LBL1 <- redundant jump
LBL2:
... B
LBL1:
... A
```
and the program now has incorrect behavior (we no longer fall-through
from `LBL0` to `LBL1`) because MBP can no longer edit LBL0.
There are several possible solutions, but I went with removing the teeth
off of the `analyzeBranch` calls in TailDuplicator. That makes thinking
about the result of these calls easier, and breaks nothing in the lit
test suite.
I've also added some bookkeeping to the MachineBlockPlacement pass and
used that to write an assert that would have caught this.
Reviewers: chandlerc, gberry, MatzeB, iteratee
Subscribers: mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D27783
llvm-svn: 289764
Retrying after fixing after removing load-store factoring through
token factors in favor of improved token factor operand pruning
Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
simplified store merging search which only checks for parallel stores
through the chain subgraph. This is cleaner as the separation of
non-interfering loads/stores from the store-merging logic.
Whem merging stores, search up the chain through a single load, and
finds all possible stores by looking down from through a load and a
TokenFactor to all stores visited. This improves the quality of the
output SelectionDAG and generally the output CodeGen (with some
exceptions).
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the the chain aggregation in the merged stores across
code paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seemed sufficient to not cause regressions in
tests.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable. Some tests relying on the order were changed to use
volatile memory operations
Noteworthy tests:
CodeGen/AArch64/argument-blocks.ll -
It's not entirely clear what the test_varargs_stackalign test is
supposed to be asserting, but the new code looks right.
CodeGen/AArch64/arm64-memset-inline.lli -
CodeGen/AArch64/arm64-stur.ll -
CodeGen/ARM/memset-inline.ll -
The backend now generates *worse* code due to store merging
succeeding, as we do do a 16-byte constant-zero store efficiently.
CodeGen/AArch64/merge-store.ll -
Improved, but there still seems to be an extraneous vector insert
from an element to itself?
CodeGen/PowerPC/ppc64-align-long-double.ll -
Worse code emitted in this case, due to the improved store->load
forwarding.
CodeGen/X86/dag-merge-fast-accesses.ll -
CodeGen/X86/MergeConsecutiveStores.ll -
CodeGen/X86/stores-merging.ll -
CodeGen/Mips/load-store-left-right.ll -
Restored correct merging of non-aligned stores
CodeGen/AMDGPU/promote-alloca-stored-pointer-value.ll -
Improved. Correctly merges buffer_store_dword calls
CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll -
Improved. Sidesteps loading a stored value and
merges two stores
CodeGen/X86/pr18023.ll -
This test has been removed, as it was asserting incorrect
behavior. Non-volatile stores *CAN* be moved past volatile loads,
and now are.
CodeGen/X86/vector-idiv.ll -
CodeGen/X86/vector-lzcnt-128.ll -
It's basically impossible to tell what these tests are actually
testing. But, looks like the code got better due to the memory
operations being recognized as non-aliasing.
CodeGen/X86/win32-eh.ll -
Both loads of the securitycookie are now merged.
Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
Subscribers: wdng, nhaehnle, nemanjai, arsenm, weimingz, niravd, RKSimon, aemerson, qcolombet, dsanders, resistor, tstellarAMD, t.p.northover, spatel
Differential Revision: https://reviews.llvm.org/D14834
llvm-svn: 289659
Generalize sdiv/udiv/srem/urem combines using APInt::isPowerOf2, which only works for const/splat-const values, to call SelectionDAG::isKnownToBeAPowerOfTwo instead which recognises many more cases.
Added a DAGCombiner::BuildLogBase2 helper since PowerOf2 combines often involve taking the log2 of such a value.
Differential Revision: https://reviews.llvm.org/D27714
llvm-svn: 289654
adding new optimization opportunity by adding new X86ISelLowering pattern. The test case was shown in https://llvm.org/bugs/show_bug.cgi?id=30945.
Test explanation:
Select gets three arguments mask, op and op2. In this case, the Mask is a result of ICMP. The ICMP instruction compares (with equal operand) the zero initializer vector and the result of the first ICMP.
In general, The result of "cmp eq, op1, zero initializers" is "not(op1)" where op1 is a mask. By rearranging of the two arguments inside the Select instruction, we can get the same result. Without the necessary of the middle phase ("cmp eq, op1, zero initializers").
Missed optimization opportunity:
vpcmpled %zmm0, %zmm1, %k0
knotw %k0, %k1
can be combine to
vpcmpgtd %zmm0, %zmm2, %k1
Reviewers:
1. delena
2. igorb
Commited after check all
Differential Revision: https://reviews.llvm.org/D27160
llvm-svn: 289653
Follow-up to r289256, address a FIXME to avoid resetting the column
number. This reduced .debug_line by 2.6% in a RelWithDebInfo
self-build of clang.
llvm-svn: 289620
Match a pattern where a wide type scalar value is loaded by several narrow loads and combined by shifts and ors. Fold it into a single load or a load and a bswap if the targets supports it.
Assuming little endian target:
i8 *a = ...
i32 val = a[0] | (a[1] << 8) | (a[2] << 16) | (a[3] << 24)
=>
i32 val = *((i32)a)
i8 *a = ...
i32 val = (a[0] << 24) | (a[1] << 16) | (a[2] << 8) | a[3]
=>
i32 val = BSWAP(*((i32)a))
This optimization was discussed on llvm-dev some time ago in "Load combine pass" thread. We came to the conclusion that we want to do this transformation late in the pipeline because in presence of atomic loads load widening is irreversible transformation and it might hinder other optimizations.
Eventually we'd like to support folding patterns like this where the offset has a variable and a constant part:
i32 val = a[i] | (a[i + 1] << 8) | (a[i + 2] << 16) | (a[i + 3] << 24)
Matching the pattern above is easier at SelectionDAG level since address reassociation has already happened and the fact that the loads are adjacent is clear. Understanding that these loads are adjacent at IR level would have involved looking through geps/zexts/adds while looking at the addresses.
The general scheme is to match OR expressions by recursively calculating the origin of individual bits which constitute the resulting OR value. If all the OR bits come from memory verify that they are adjacent and match with little or big endian encoding of a wider value. If so and the load of the wider type (and bswap if needed) is allowed by the target generate a load and a bswap if needed.
Reviewed By: hfinkel, RKSimon, filcab
Differential Revision: https://reviews.llvm.org/D26149
llvm-svn: 289538
The general idea here is to get enough of the existing restrictions out of the way that the already existing folding logic in foldMemoryOperand can kick in for STATEPOINTs and fold references to immutable stack slots. The key changes are:
Support for folding multiple operands at once which reference the same load
Support for folding multiple loads into a single instruction
Walk all the operands of the instruction for varidic instructions (this is a bug fix!)
Once this lands, I'll post another patch which refactors the TII interface here. There's nothing actually x86 specific about the x86 code used here.
Differential Revision: https://reviews.llvm.org/D24103
llvm-svn: 289510
The stack slot reuse code had a really amusing bug. We ended up only reusing a stack slot exact once (initial use + reuse) within a basic block. If we had a third statepoint to process, we ended up allocating a new set of stack slots. If we crossed a basic block boundary, the set got cleared. As a result, code which is invoke heavy doesn't see the problem, but multiple calls within a basic block does. Net result: as we optimize invokes into calls, lowering gets worse.
The root error here is that the bitmap uses by the custom allocator wasn't kept in sync. The result was that we ended up resizing the bitmap on the next statepoint (to handle the cross block case), reset the bit once, but then never reset it again.
Differential Revision: https://reviews.llvm.org/D25243
llvm-svn: 289509
DWARF specifies that "line 0" really means "no appropriate source
location" in the line table. By default, use this for branch targets
and some other cases that have no specified source location, to
prevent inheriting unfortunate line numbers from physically preceding
instructions (which might be from completely unrelated source).
Updated patch allows enabling or suppressing this behavior for all
unspecified source locations.
Differential Revision: http://reviews.llvm.org/D24180
llvm-svn: 289468
PMULDQ returns the 64-bit result of the signed multiplication of the lower 32-bits of vXi64 vector inputs, we can lower with this if the sign bits stretch that far.
Differential Revision: https://reviews.llvm.org/D27657
llvm-svn: 289426
Summary:
These intrinsic instructions are all selected from intrinsics that have well defined behavior for where the upper bits come from. It's not the same place as the lower bits.
As you can see we were suppressing load folding for these instructions in some cases. In none of the cases was the separate load helping avoid a partial dependency on the destination register. So we should just go ahead and allow the load to be folded.
Only foldMemoryOperand was suppressing folding for these. They all have patterns for folding sse_load_f32/f64 that aren't gated with OptForSize, but sse_load_f32/f64 doesn't allow 128-bit vector loads. It only allows scalar_to_vector and vzmovl of scalar loads to match. There's no reason we can't allow a 128-bit vector load to be narrowed so I would like to fix sse_load_f32/f64 to allow that. And if I do that it changes some of these same test cases to fold the load too.
Reviewers: spatel, zvi, RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D27611
llvm-svn: 289419
When the load node which the broadcast instruction broadcasts has multiple uses, it cannot be folded.
A fallback pattern is added to catch these cases and provide another solution.
Differential Revision: https://reviews.llvm.org/D27661
llvm-svn: 289404
Regcall calling convention passes mask types arguments in x86 GPR registers.
The review includes the changes required in order to support v32i1, v16i1 and v8i1.
Differential Revision: https://reviews.llvm.org/D27148
llvm-svn: 289383
Reapplied with fix for PR31323 - X86 SSE2 vXi16 multiplies for illegal types were creating CONCAT_VECTORS nodes with vector inputs that might not total the number of elements in the result type.
llvm-svn: 289232
Retrying after fixing overly aggressive load-store forwarding optimization.
Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
simplified store merging search which only checks for parallel stores
through the chain subgraph. This is cleaner as the separation of
non-interfering loads/stores from the store-merging logic.
Whem merging stores, search up the chain through a single load, and
finds all possible stores by looking down from through a load and a
TokenFactor to all stores visited. This improves the quality of the
output SelectionDAG and generally the output CodeGen (with some
exceptions).
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the the chain aggregation in the merged stores across
code paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seemed sufficient to not cause regressions in
tests.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable. Some tests relying on the order were changed to use
volatile memory operations
Noteworthy tests:
CodeGen/AArch64/argument-blocks.ll -
It's not entirely clear what the test_varargs_stackalign test is
supposed to be asserting, but the new code looks right.
CodeGen/AArch64/arm64-memset-inline.lli -
CodeGen/AArch64/arm64-stur.ll -
CodeGen/ARM/memset-inline.ll -
The backend now generates *worse* code due to store merging
succeeding, as we do do a 16-byte constant-zero store efficiently.
CodeGen/AArch64/merge-store.ll -
Improved, but there still seems to be an extraneous vector insert
from an element to itself?
CodeGen/PowerPC/ppc64-align-long-double.ll -
Worse code emitted in this case, due to the improved store->load
forwarding.
CodeGen/X86/dag-merge-fast-accesses.ll -
CodeGen/X86/MergeConsecutiveStores.ll -
CodeGen/X86/stores-merging.ll -
CodeGen/Mips/load-store-left-right.ll -
Restored correct merging of non-aligned stores
CodeGen/AMDGPU/promote-alloca-stored-pointer-value.ll -
Improved. Correctly merges buffer_store_dword calls
CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll -
Improved. Sidesteps loading a stored value and
merges two stores
CodeGen/X86/pr18023.ll -
This test has been removed, as it was asserting incorrect
behavior. Non-volatile stores *CAN* be moved past volatile loads,
and now are.
CodeGen/X86/vector-idiv.ll -
CodeGen/X86/vector-lzcnt-128.ll -
It's basically impossible to tell what these tests are actually
testing. But, looks like the code got better due to the memory
operations being recognized as non-aliasing.
CodeGen/X86/win32-eh.ll -
Both loads of the securitycookie are now merged.
Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
Subscribers: wdng, nhaehnle, nemanjai, arsenm, weimingz, niravd, RKSimon, aemerson, qcolombet, dsanders, resistor, tstellarAMD, t.p.northover, spatel
Differential Revision: https://reviews.llvm.org/D14834
llvm-svn: 289221
Adds support for bitcasting a little endian 'small element' vector to 'large element' scalar/vector (e.g. v16i8 to v4i32 or v2i32 to i64), which is required for PR30845. We extract the knownbits for each 'small element' part and concatenate the results together.
We can add support for big endian and 'large element' scalar/vector to 'small element' vector bitcasting once we have test cases for them.
Differential Revision: https://reviews.llvm.org/D27129
llvm-svn: 289200
This reverts commit r288916 as it is currently causing a crasher in
Halide. Reproducer on llvm.org/PR31323. While it might be that halide is
generating invalid IR, llc shouldn't crash.
llvm-svn: 289194
Summary:
Scalar intrinsics have specific semantics about the which input's upper bits are passed through to the output. The same input is also supposed to be the input we use for the lower element when the mask bit is 0 in a masked operation. We aren't currently keeping these semantics with instruction selection.
This patch corrects this by introducing new scalar FMA ISD nodes that indicate whether operand 1(one of the multiply inputs) or operand 3(the additon/subtraction input) should pass thru its upper bits.
We use this information to select 213/132 form for the operand 1 version and the 231 form for the operand 3 version.
We also use this information to suppress combining FNEG operations on the passthru input since semantically the passthru bits aren't negated. This is stronger than the earlier check added for a user being SELECTS so we can remove that.
This fixes PR30913.
Reviewers: delena, zvi, v_klochkov
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D27144
llvm-svn: 289190
We were falsely claiming that we had an LSDA for the relevant EH
personality before this change, which could lead to the EH machinery
interpreting random adjacent data as an LSDA.
Fixes PR31317
This change is safe because cleanups can't contain exception handlers
today. We do these things to maintain that invariant:
- C++ destructors are naturally out-of-line
- __finally blocks are outlined in clang
- LLVM's inliner will not inline EH constructs into cleanups
llvm-svn: 289101
Summary:
Attaching !absolute_symbol to a global variable does two things:
1) Marks it as an absolute symbol reference.
2) Specifies the value range of that symbol's address.
Teach the X86 backend to allow absolute symbols to appear in place of
immediates by extending the relocImm and mov64imm32 matchers. Start using
relocImm in more places where it is legal.
As previously proposed on llvm-dev:
http://lists.llvm.org/pipermail/llvm-dev/2016-October/105800.html
Differential Revision: https://reviews.llvm.org/D25878
llvm-svn: 289087
This re-adds checks for the patterns that were disabled with r288506.
Reviewers: spatel, delena, craig.topper
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D27346
llvm-svn: 289049
MachineIRBuilder had weird before/after and beginning/end flags for the insert
point. Unfortunately the non-default means that instructions will be inserted
in reverse order which is almost never what anyone wants.
Really, I think we just want (like IRBuilder has) the ability to insert at any
C++ iterator-style point (i.e. before any instruction or before MBB.end()). So
this fixes MIRBuilders to behave like IRBuilders in this respect.
llvm-svn: 288980
If we don't skip over DEBUG_VALUEs, we get differences between -g and non-g
code.
This fixes PR31242.
Differential Revision: https://reviews.llvm.org/D27485
llvm-svn: 288965
The second operand of an "ri" instruction may be an immediate, but it may
also be a globalvariable, so we should make any assumptions.
This fixes PR31271.
Differential Revision: https://reviews.llvm.org/D27481
llvm-svn: 288964
We are being inconsistent with these instructions (and all their variants.....) with a random mix of them using the default float domain.
Differential Revision: https://reviews.llvm.org/D27419
llvm-svn: 288902
The non-constant pool version of DecodeVPERMIL2PMask was not offsetting correctly for the second input. I've updated the code to match the implementation in the constant-pool version.
Annoyingly this bug was hidden for so long as it's tricky to combine to useful variable shuffle masks that don't become constant-pool entries.
llvm-svn: 288898
Summary:
Prefer expansions such as: pmullw,pmulhw,unpacklwd,unpackhwd over pmulld.
On Silvermont [source: Optimization Reference Manual]:
PMULLD has a throughput of 1/11 [instruction/cycles].
PMULHUW/PMULHW/PMULLW have a throughput of 1/2 [instruction/cycles].
Fixes pr31202.
Analysis of this issue was done by Fahana Aleen.
Reviewers: wmi, delena, mkuper
Subscribers: RKSimon, llvm-commits
Differential Revision: https://reviews.llvm.org/D27203
llvm-svn: 288844
Handle the case where a sign extension has ended up being split into separate stages (typically to get around vector legal ops) and a zext + sext_in_reg gets inserted.
Differential Revision: https://reviews.llvm.org/D27461
llvm-svn: 288842
Check if a build_vector node includes a repeated constant pattern and replace it with a broadcast of that pattern.
For example:
"build_vector <0, 1, 2, 3, 0, 1, 2, 3>" would be replaced by "broadcast <0, 1, 2, 3>"
Differential Revision: https://reviews.llvm.org/D26802
llvm-svn: 288804
Summary: This patch makes sure FirstCSPop and MBBI never point to DBG_VALUE instructions, which affected the code generated.
Reviewers: mkuper, aprantl, MatzeB
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D27343
llvm-svn: 288794
This pattern turned a vector sqrt/rcp/rsqrt operation of sse_load_f32/f64 into the the scalar instruction for the operation and put undef into the upper bits. For correctness, the resulting code should still perform the sqrt/rcp/rsqrt on the upper bits after the load is extended since that's what the operation asked for. Particularly in the case where the upper bits are 0, in that case we need calculate the sqrt/rcp/rsqrt of the zeroes and keep the result in the upper-bits. This implies we should be using the packed instruction still.
The only test case for this pattern is one I just added so there was no coverage of this.
llvm-svn: 288784
This occurs due to a pattern that uses sse_load_f32/f64 with vector sqrt/rcp/rsqrt operations and turns them into scalar instructions. Perhaps for the case were the upper bits come from undef this is ok. I believe a (vzmovl load64) would do the same thing but those seems to become vzload instead and selectScalarSSELoad doesn't handle that today. In that case we should be performing the vector operation on the zeros in the upper bits which is not equivalent to using a scalar instruction.
I will remove this pattern in a follow up patch. There appears to be no other test content for it.
llvm-svn: 288783
The intrinsics are supposed to pass the upper bits straight through to their output register. This means we need to make sure we still perform the 128-bit load to get those upper bits to pass to give to the instruction since the memory form of the instruction only reads 32 or 64 bits.
llvm-svn: 288781
The sqrtsd instruction only loads 64-bits and writes bits 63:0 with the sqrt result. Bits 127:64 are preserved in the destination register. The semantics of the intrinsic indicate bits 127:64 should come from the intrinsic argument which in this case is a 128-bit load. So the generated code should have a 128-bit load and use a register form of sqrtsd.
llvm-svn: 288780
The intrinsic takes one argument, the lower bits are affected by the operation and the upper bits should be passed through. The instruction itself takes two operands, the high bits of the first operand are passed through and the low bits of the second operand are modified by the operation. To match this to the intrinsic we should pass the single intrinsic input to both operands.
I had to remove the stack folding test for these instructions since they depended on the incorrect behavior. The same register is now used for both inputs so the load can't be folded.
llvm-svn: 288779
Summary:
This patch removes the scalar logical operation alias instructions. We can just use reg class copies and use the normal packed instructions instead. This removes the need for putting these instructions in the execution domain fixing tables as was done recently.
I removed the loadf64_128 and loadf32_128 patterns as DAG combine creates a narrower load for (extractelt (loadv4f32)) before we ever get to isel.
I plan to add similar patterns for AVX512DQ in a future commit to allow use of the larger register class when available.
Reviewers: spatel, delena, zvi, RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D27401
llvm-svn: 288771
This changes the scalar non-intrinsic non-avx roundss/sd instruction
definitions not to read their destination register - allowing partial dependency
breaking.
This fixes PR31143.
Differential Revision: https://reviews.llvm.org/D27323
llvm-svn: 288703
so we can stop using DW_OP_bit_piece with the wrong semantics.
The entire back story can be found here:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20161114/405934.html
The gist is that in LLVM we've been misinterpreting DW_OP_bit_piece's
offset field to mean the offset into the source variable rather than
the offset into the location at the top the DWARF expression stack. In
order to be able to fix this in a subsequent patch, this patch
introduces a dedicated DW_OP_LLVM_fragment operation with the
semantics that we used to apply to DW_OP_bit_piece, which is what we
actually need while inside of LLVM. This patch is complete with a
bitcode upgrade for expressions using the old format. It does not yet
fix the DWARF backend to use DW_OP_bit_piece correctly.
Implementation note: We discussed several options for implementing
this, including reserving a dedicated field in DIExpression for the
fragment size and offset, but using an custom operator at the end of
the expression works just fine and is more efficient because we then
only pay for it when we need it.
Differential Revision: https://reviews.llvm.org/D27361
rdar://problem/29335809
llvm-svn: 288683
We treat bitwise 'not' as a special operation and try not to reduce its all-ones mask.
Presumably, this is because a 'not' may be cheaper than a generic 'xor' or it may get
folded into another logic op if the target has those. However, if we can remove a logic
instruction by changing the xor's constant mask value, that should always be a win.
Note that the IR version of SimplifyDemandedBits() does not treat 'not' as a special-case
currently (although that's marked with a FIXME). So if you run this IR through -instcombine,
you should get the same end result. I'm hoping to add a different backend transform that
will expose this problem though, so I need to solve this first.
Differential Revision: https://reviews.llvm.org/D27356
llvm-svn: 288676
Currently the fast isel code emits an avx1 instruction sequence even with avx512. This is different than normal isel. A follow up commit will fix this.
llvm-svn: 288635
Summary:
When X = 0 and Y = inf, the original code produces inf, but the transformed
code produces nan. So this transform (and its relatives) should only be
used when the no-infs-fp-math flag is explicitly enabled.
Also disable the transform using fmad (intermediate rounding) when unsafe-math
is not enabled, since it can reduce the precision of the result; consider this
example with binary floating point numbers with two bits of mantissa:
x = 1.01
y = 111
x * (y + 1) = 1.01 * 1000 = 1010 (this is the exact result; no rounding occurs at any step)
x * y + x = 1000.11 + 1.01 =r 1000 + 1.01 = 1001.01 =r 1000 (with rounding towards zero)
The example relies on rounding towards zero at least in the second step.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98578
Reviewers: RKSimon, tstellarAMD, spatel, arsenm
Subscribers: wdng, llvm-commits
Differential Revision: https://reviews.llvm.org/D26602
llvm-svn: 288506
This change fixes a regression in r279537 and
makes getRawSubclassData behave like r279536.
Without this change, the fp128-g.ll test case will have an
infinite loop involving SoftenFloatRes_LOAD.
Differential Revision: http://reviews.llvm.org/D26942
llvm-svn: 288420
Choosing a "cfi" name makes the intend a bit clearer in an assembly dump
and more importantly the assembly dumps are slightly more stable as the
numbers don't move around anymore when unrelated code calls
createTempSymbol() more or less often.
As they are temp labels the name doesn't influence the generated object
code.
Differential Revision: https://reviews.llvm.org/D27244
llvm-svn: 288290
The LLDB tests are now ready for this patch.
DWARF specifies that "line 0" really means "no appropriate source
location" in the line table. Use this for branch targets and some
other cases that have no specified source location, to prevent
inheriting unfortunate line numbers from physically preceding
instructions (which might be from completely unrelated source).
Differential Revision: http://reviews.llvm.org/D24180
llvm-svn: 288283
Initial support for target shuffle constant folding in cases where all shuffle inputs are constant. We may be able to relax this and merge shuffles with only some constant inputs in the future.
I've added the helper function getTargetConstantBitsFromNode (based off a similar function in X86ShuffleDecodeConstantPool.cpp) that could be reused for other cases requiring constant vector extraction.
Differential Revision: https://reviews.llvm.org/D27220
llvm-svn: 288250
DWARF specifies that "line 0" really means "no appropriate source
location" in the line table. Use this for branch targets and some
other cases that have no specified source location, to prevent
inheriting unfortunate line numbers from physically preceding
instructions (which might be from completely unrelated source).
Differential Revision: http://reviews.llvm.org/D24180
llvm-svn: 288212
It looks like this logic was duplicated long ago and the GCC side of
things has grown additional functionality. We need ${:uid} at least to
generate unique MS inline asm labels (PR23715), so expose these.
llvm-svn: 288092
Bit-shifts by a whole number of bytes can be represented as a shuffle mask suitable for combining.
Added a 'getFauxShuffleMask' function to allow us to create shuffle masks from other suitable operations.
llvm-svn: 288040
Summary: When selectScalarSSELoad is looking for a scalar_to_vector of a scalar load, it makes sure the load is only used by the scalar_to_vector. But it doesn't make sure the scalar_to_vector is only used once. This can cause the same load to be folded multiple times. This can be bad for performance. This also causes the chain output to be duplicated, but not connected to anything so chain dependencies will not be satisfied.
Reviewers: RKSimon, zvi, delena, spatel
Subscribers: andreadb, llvm-commits
Differential Revision: https://reviews.llvm.org/D26790
llvm-svn: 287983
Summary:
Shuffle lowering may have widened the element size of a i32 shuffle to i64 before selecting X86ISD::SHUF128. If this shuffle was used by a vselect this can prevent us from selecting masked operations.
This patch detects this and changes the element size to match the vselect.
I don't handle changing integer to floating point or vice versa as its not clear if its better to push such a bitcast to the inputs of the shuffle or to the user of the vselect. So I'm ignoring that case for now.
Reviewers: delena, zvi, RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D27087
llvm-svn: 287939
Vectorize UINT_TO_FP v2i32 -> v2f64 instead of scalarization (albeit still on the SIMD unit).
The codegen matches that generated by legalization (and is in fact used by AVX for UINT_TO_FP v4i32 -> v4f64), but has to be done in the x86 backend to account for legalization via 4i32.
Differential Revision: https://reviews.llvm.org/D26938
llvm-svn: 287886
The bug arises during register allocation on i686 for
CMPXCHG8B instruction when base pointer is needed. CMPXCHG8B
needs 4 implicit registers (EAX, EBX, ECX, EDX) and a memory address,
plus ESI is reserved as the base pointer. With such constraints the only
way register allocator would do its job successfully is when the addressing
mode of the instruction requires only one register. If that is not the case
- we are emitting additional LEA instruction to compute the address.
It fixes PR28755.
Patch by Alexander Ivchenko <alexander.ivchenko@intel.com>
Differential Revision: https://reviews.llvm.org/D25088
llvm-svn: 287875
We did not support subregs in InlineSpiller:foldMemoryOperand() because targets
may not deal with them correctly.
This adds a target hook to let the spiller know that a target can handle
subregs, and actually enables it for x86 for the case of stack slot reloads.
This fixes PR30832.
Differential Revision: https://reviews.llvm.org/D26521
llvm-svn: 287792
We have the following DAGCombiner transformations:
(mul (shl X, c1), c2) -> (mul X, c2 << c1)
(mul (shl X, C), Y) -> (shl (mul X, Y), C)
(shl (mul x, c1), c2) -> (mul x, c1 << c2)
Usually the constant shift is optimised by SelectionDAG::getNode when it is
constructed, by SelectionDAG::FoldConstantArithmetic, but when we're dealing
with vectors and one of those vector constants contains an undef element
FoldConstantArithmetic does not fold and we enter an infinite loop.
Fix this by making FoldConstantArithmetic use getNode to decide how to fold each
vector element, the same as FoldConstantVectorArithmetic does, and rather than
adding the constant shift to the work list instead only apply the transformation
if it's already been folded into a constant, as if it's not we're going to loop
endlessly. Additionally add missing NoOpaques to one of those transformations,
which I noticed when writing the tests for this.
Differential Revision: https://reviews.llvm.org/D26605
llvm-svn: 287766
Implemented widening (v2f32) and splitting (v16f64).
On splitting, I use "popcnt" to calculate memory increment.
More type legalization work will come in the next patches.
llvm-svn: 287761
This occurs during UINT_TO_FP v2f64 lowering.
We can easily generalize this to other horizontal ops (FHSUB, PACKSS, PACKUS) as required - we are doing something similar with PACKUS in lowerV2I64VectorShuffle
llvm-svn: 287676
Add basic ComputeNumSignBits support for TRUNCATE ops for cases where the source's number of sign bits overlaps with the truncated size.
Improves X86 SIGN_EXTEND_IN_REG vector cases which were needlessly sign extending boolean vector results.
Differential Revision: https://reviews.llvm.org/D26851
llvm-svn: 287635
Summary:
The index and one of the table operands can be swapped by changing the opcode to the other version. Neither of these operands are the one that can load from memory so this can't be used to increase memory folding opportunities.
We need to handle the unmasked forms and the kz forms. Since the load operand isn't being commuted we can commute the load and broadcast instructions too.
Reviewers: igorb, delena, Ayal, Farhana, RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D25652
llvm-svn: 287621
Summary:
Shuffle lowering widens the element size of a shuffle if elements are contiguous. This is sometimes help because wider element types have more shuffle options. If the shuffle is one of the arguments to a vselect this shuffle widening can introduce a bitcast between the vselect and the shuffle. This will prevent isel from selecting a masked operation. If the shuffle can be written equally efficiently with a different element size to match the vselect type we should change the shuffle type to allow masking.
This patch does this conversion for all VALIGND/VALIGNQ sizes. It also supports turning 128-bit PALIGNR into VALIGND/VALIGNQ. This fixes the case shown in PR31018.
I plan to add support for more operations in future patches.
Reviewers: RKSimon, zvi, delena
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D26902
llvm-svn: 287612
Summary: Merging an empty case block into the header block of switch could cause
ISel to add COPY instructions in the header of switch, instead of the case
block, if the case block is used as an incoming block of a PHI. This could
potentially increase dynamic instructions, especially when the switch is in a
loop. I added a test case which was reduced from the benchmark I was targetting.
Reviewers: t.p.northover, mcrosier, manmanren, wmi, davidxl
Subscribers: qcolombet, danielcdh, hfinkel, mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D22696
llvm-svn: 287553
At the moment we only use truncateVectorCompareWithPACKSS with direct vector comparison results (just one example of a known all/none signbits input).
This change relaxes the direct matching of a SETCC opcode by moving the logic up into SelectionDAG::ComputeNumSignBits and accepting any input with a known splatted signbit.
llvm-svn: 287535
Many of these problems are because shuffle lowering widens element size and reduces element count when possible. This causes the shuffle to become separated from the select by a bitcast. Future patches will work to improve these cases by rewriting the shuffle back to a narrow element type if we think it can result in folding the mask.
llvm-svn: 287503
The change is part of RegCall calling convention support for LLVM.
Long double (f80) requires special treatment as the first f80 parameter is saved in FP0 (floating point stack).
This review present the change and the corresponding tests.
Differential Revision: https://reviews.llvm.org/D26151
llvm-svn: 287485
The same thing was done to 32-bit and 64-bit element sizes previously.
This will allow us to support these shuffls in InstCombineCalls along with the other variable shift intrinsics.
llvm-svn: 287312
Summary:
This extends FCOPYSIGN support to 512-bit vectors.
I've also added tests to show what the 128-bit and 256-bit cases look like with broadcast loads.
Reviewers: delena, zvi, RKSimon, spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D26791
llvm-svn: 287298
vXi64 multiplication is lowered into 3 calls of vpmuludq with the upper/lower 32-bit halves.
If any of these halves are zero then we can remove individual calls. Although there was isBuildVectorAllZeros code to do this I don't think it ever worked (maybe just for constant folded cases that don't seem to be tested for any longer).
This requires additional X86ISD support for computeKnownBitsForTargetNode, so far I've just added support for X86ISD::VZEXT (VPMOVZX* - helping the AVX2+ cases).
Partial fix for PR30845
Differential Revision: https://reviews.llvm.org/D26590
llvm-svn: 287223
Register Calling Convention defines a new behavior for v64i1 types.
This type should be saved in GPR.
However for 32 bit machine we need to split the value into 2 GPRs (because each is 32 bit).
Differential Revision: https://reviews.llvm.org/D26181
llvm-svn: 287217
ImplicitNullCheck keeps track of one instruction that the memory
operation depends on that it also hoists with the memory operation.
When hoisting this dependency, it would sometimes clobber a live-in
value to the basic block we were hoisting the two things out of. Fix
this by explicitly looking for such dependencies.
I also noticed two redundant checks on `MO.isDef()` in IsMIOperandSafe.
They're redundant since register MachineOperands are either Defs or Uses
-- there is no third kind. I'll change the checks to asserts in a later
commit.
llvm-svn: 287213
We save an inter-register file move this way. If there's any CPU where
the FP logic is slower, we could transform this back to int-logic in
MachineCombiner.
This helps, but doesn't solve, PR6137:
https://llvm.org/bugs/show_bug.cgi?id=6137
The 'andn' test shows that we're missing a pattern match to
recognize the xor with -1 constant as a 'not' op.
llvm-svn: 287171
We don't track callee clobbered registers correctly, so avoid hoisting
across calls.
Note: for this bug to trigger we need a `readonly` call target, since we
already have logic to not hoist across potentially storing instructions
either.
llvm-svn: 287159
We can replace "scalar" FP-bitwise-logic with other forms of bitwise-logic instructions.
Scalar SSE/AVX FP-logic instructions only exist in your imagination and/or the bowels of
compilers, but logically equivalent int, float, and double variants of bitwise-logic
instructions are reality in x86, and the float variant may be a shorter instruction
depending on which flavor (SSE or AVX) of vector ISA you have...so just prefer float all
the time.
This is a preliminary step towards solving PR6137:
https://llvm.org/bugs/show_bug.cgi?id=6137
Differential Revision:
https://reviews.llvm.org/D26712
llvm-svn: 287122
Both the (V)CVTDQ2PD (i32 to f64) and (V)CVTUDQ2PD (u32 to f64) conversion instructions are lossless and can be safely represented as generic SINT_TO_FP/UINT_TO_FP calls instead of x86 intrinsics without affecting final codegen.
LLVM counterpart to D26686
Differential Revision: https://reviews.llvm.org/D26736
llvm-svn: 287108
Summary: These intrinsics have been unused for clang for a while. This patch removes them. We auto upgrade them to extractelements, a scalar operation and then an insertelement. This matches the sequence used by clangs intrinsic file.
Reviewers: zvi, delena, RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D26660
llvm-svn: 287083
In https://reviews.llvm.org/D25347, Geoff noticed that we still have
useless copy that we can eliminate after register allocation. At the
time the allocation is chosen for those copies, they are not useless
but, because of changes in the surrounding code, later on they might
become useless.
The Greedy allocator already has a mechanism to deal with such cases
with a late recoloring. However, we missed to record the some of the
missed hints.
This commit fixes that.
llvm-svn: 287070