This patch expands lowerInterleavedStore support to the v32i8, stride-4 case.
LLVM currently generates suboptimal shuffle code for this pattern on AVX2. Overall, this patch is a specific fix for the (Stride=4, VF=32) pattern, and we plan to cover more patterns in the future. To work toward that goal of "more patterns", we include two mask creators: the first creates shuffle masks equivalent to the unpacklo/unpackhi instructions, and the second creates masks equivalent to a concatenation of the two halves (high/low) of a vector.
The goal of the patch is to optimize the following sequence:
At the end of the computation, we have ymm2, ymm0, ymm12 and ymm3, each holding
32 chars:
c0, c1, ..., c31
m0, m1, ..., m31
y0, y1, ..., y31
k0, k1, ..., k31
And these need to be transposed/interleaved and stored like so:
c0 m0 y0 k0 c1 m1 y1 k1 c2 m2 y2 k2 c3 m3 y3 k3 ....
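For illustration only, here is a minimal standalone C++ sketch (not the actual LLVM helper code) that computes the index vector for this stride-4 interleave, assuming the four source vectors have been concatenated into one 128-element value:

// Minimal sketch of building the index vector for a stride-4 interleave of
// four 32-element vectors concatenated into one 128-element value:
//   result[4*i+0] = c[i], result[4*i+1] = m[i],
//   result[4*i+2] = y[i], result[4*i+3] = k[i]
#include <cstdio>
#include <vector>

int main() {
  const int VF = 32;     // elements per source vector
  const int Stride = 4;  // number of interleaved source vectors
  std::vector<int> Mask(VF * Stride);
  for (int i = 0; i < VF; ++i)
    for (int j = 0; j < Stride; ++j)
      Mask[i * Stride + j] = j * VF + i; // element i of source vector j
  for (int Idx : Mask)
    std::printf("%d ", Idx);
  std::printf("\n");
}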
Reviewers: dorit, Farhana, RKSimon, guyblank, DavidKreitzer
Differential Revision: https://reviews.llvm.org/D34601
llvm-svn: 309086
Summary: The aligned load predicates don't suppress themselves if the load is non-temporal the way the unaligned predicates do. For the most part this isn't a problem because the aligned predicates are mostly used for instructions that only load, and the non-temporal loads have priority over those. The exception is masked loads.
Reviewers: RKSimon, zvi
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D35712
llvm-svn: 309079
D35067/rL308322 attempted to support up to 4 load pairs for memcmp inlining which resulted in regressions for some optimized libc memcmp implementations (PR33914).
Until we can match these more optimal cases, this patch reduces the memcmp expansion to a maximum of 2 load pairs (which matches what we do for -Os).
This patch should be considered for the 5.0.0 release branch as well.
Differential Revision: https://reviews.llvm.org/D35830
llvm-svn: 308986
These patterns had been omitted only to favor using the legacy instructions when the shift was a constant. With careful adjustment of the pattern complexity we can make sure the immediate instructions still have priority over these patterns.
llvm-svn: 308834
Currently we only support (i32 bitcast(v32i1)) using the AVX2 VPMOVMSKB ymm instruction.
This patch adds support for splitting pre-AVX2 targets into 2 x (V)PMOVMSKB xmm instructions and merging the integer results.
In future we could probably generalize this to handle more cases.
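As a rough illustration of the shape of the new lowering (a sketch in SSE2 intrinsics, not the backend code itself), the 32-bit mask can be assembled from two 16-byte halves like so:

// Illustrative only: take the sign-bit mask of each 16-byte half with
// (V)PMOVMSKB and merge the two 16-bit results into one i32.
#include <emmintrin.h>
#include <cstdint>

uint32_t movemask32(__m128i Lo, __m128i Hi) {
  uint32_t LoMask = static_cast<uint32_t>(_mm_movemask_epi8(Lo)); // bits 0..15
  uint32_t HiMask = static_cast<uint32_t>(_mm_movemask_epi8(Hi)); // bits 16..31
  return (HiMask << 16) | LoMask;
}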
Differential Revision: https://reviews.llvm.org/D35303
llvm-svn: 308723
On AMDGPU SGPR spills are really spilled to another register.
The spiller creates the spills to new frame index objects,
which are used as placeholders.
These will eventually be replaced with references to positions
in a VGPR to write to, and the frame indexes deleted. They are
most likely not real stack locations that can be shared
with another stack object.
This is a problem when StackSlotColoring decides it should
combine a frame index used for a normal VGPR spill with
a real stack location and a frame index used for an SGPR.
Add an ID field so that StackSlotColoring has a way
of knowing the different frame index types are
incompatible.
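A rough sketch of the idea, with hypothetical names rather than the actual MachineFrameInfo/StackSlotColoring interfaces:

// Sketch only: each frame object carries an ID describing what kind of slot
// it is, and slots with different IDs are never merged.
#include <cstdint>

enum class StackSlotKind : uint8_t { Default = 0, SGPRSpill = 1 };

struct FrameObject {
  int64_t Size;
  StackSlotKind Kind;
};

bool canShareSlot(const FrameObject &A, const FrameObject &B) {
  // An SGPR "spill" is only a placeholder for a lane in a VGPR, not a real
  // stack location, so it must not be coloured together with a real slot.
  return A.Kind == B.Kind;
}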
llvm-svn: 308673
We allow wider than 5 bits in the 16-bit and 32-bit store forms, and we allow wider than 6 bits on the 64-bit register form.
I'm assuming this was a mistake made back in r148024.
llvm-svn: 308656
Summary:
When pushing an extension of a constant bitwise operator on a load
into the load, change other uses of the load value if they exist to
prevent the old load from persisting.
Reviewers: spatel, RKSimon, efriedma
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D35030
llvm-svn: 308618
Test constant folding both on node creation (which already works) and once the input nodes have been folded themselves (not working yet).
llvm-svn: 308611
Most combines currently recognise scalar and splat-vector constants, but not non-uniform vector constants.
This patch introduces a matching mechanism that uses predicates to check against BUILD_VECTOR of ConstantSDNode, as well as scalar ConstantSDNode cases.
I've changed a couple of predicates to demonstrate - the combine-shl changes add currently unsupported cases, while the MatchRotate replaces an existing mechanism.
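As an illustrative-only sketch of the matching idea (the real mechanism operates on ConstantSDNode and BUILD_VECTOR nodes, not plain integers), a predicate is applied either to a scalar constant or to every element of a possibly non-uniform vector constant:

// Sketch: a predicate checked against every element of a (possibly
// non-uniform) vector constant; a scalar is just the single-element case.
#include <cstdint>
#include <functional>
#include <vector>

using ConstPred = std::function<bool(int64_t)>;

bool matchUnaryPredicate(const std::vector<int64_t> &Elts, const ConstPred &P) {
  if (Elts.empty())
    return false;
  for (int64_t C : Elts) // non-uniform vectors: every element must match
    if (!P(C))
      return false;
  return true;
}

// Example predicate: all shift amounts are in range for a 32-bit shift.
bool allValidShiftAmounts(const std::vector<int64_t> &ShiftAmts) {
  return matchUnaryPredicate(ShiftAmts,
                             [](int64_t C) { return C >= 0 && C < 32; });
}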
Differential Revision: https://reviews.llvm.org/D35492
llvm-svn: 308598
Add optimization remarks support to the PrologueEpilogueInserter. For
now, emit the stack size as an analysis remark; more remarks related to
shrink-wrapping may be added later.
https://reviews.llvm.org/D35645
llvm-svn: 308556
Summary:
This patch adds the following
1. Adds a skeleton scheduler model for AMD Znver1.
2. Introduces the znver1 execution units and pipes.
3. Covers the instructions based on the generic scheduler classes.
4. Further additions to the scheduler model with instruction itineraries will be carried out incrementally based on
a. Instructions types
b. Registers used
5. Since itineraries are not added based on instructions, the throughput information is bound to change when incremental changes are added.
6. Scheduler testcases are modified accordingly to suit the new model.
Patch by Ganesh Gopalasubramanian. With minor formatting tweaks from me.
Reviewers: craig.topper, RKSimon
Subscribers: javed.absar, shivaram, ddibyend, vprasad
Differential Revision: https://reviews.llvm.org/D35293
llvm-svn: 308411
Re-recommitting after landing the DAG extension-crash fix.
Recommitting after adding a check to avoid miscomputing alias information
on addresses of the same base but different subindices.
Memory accesses offset from frame indices may alias, e.g., we
may merge writes from function arguments passed on the stack when they
are contiguous. As a result, when checking aliasing, we consider the
underlying frame index's offset from the stack pointer.
Static allocas are realized as stack objects in SelectionDAG, but their
offsets are not set until post-DAG, causing DAGCombiner's alias check to
consider accesses to static allocas to frequently alias. Modify isAlias
to consider accesses between static allocas and accesses from other frame
objects to be aliasing.
Many test changes are included here. Most are fixes for tests which
indirectly relied on our aliasing ability and needed to be modified to
preserve their original intent.
The remaining tests have minor improvements due to relaxed
ordering. The exception is CodeGen/X86/2011-10-19-widen_vselect.ll
which has a minor degradation even though the pre-legalized DAG is
improved.
Reviewers: rnk, mkuper, jonpa, hfinkel, uweigand
Reviewed By: rnk
Subscribers: sdardis, nemanjai, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D33345
llvm-svn: 308350
When replacing a node and its operand, replacing the operand node may
cause the deletion of the original node, leading to an assertion
failure. Add checks around these replacements to avoid this without relying
on inspecting the DELETED_NODE opcode in the various extend
dagcombiner cases.
Fixes PR32515.
Reviewers: dbabokin, RKSimon, davide, chandlerc
Subscribers: chandlerc, llvm-commits
Differential Revision: https://reviews.llvm.org/D34095
llvm-svn: 308330
It should be a win to avoid going out to the system lib for all small memcmp() calls using scalar ops. For x86 32-bit, this means most everything up to 16 bytes. For 64-bit, that doubles because we can do 8-byte loads.
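As a sketch of what the equality-only expansion boils down to on a 64-bit target (illustrative C++, assuming memcmp(x, y, 16) == 0 and unaligned-load-friendly hardware):

// Sketch: two 8-byte load pairs, XOR, and an OR-reduced compare against zero.
#include <cstdint>
#include <cstring>

bool memeq16(const void *X, const void *Y) {
  uint64_t X0, X1, Y0, Y1;
  std::memcpy(&X0, X, 8);                                // first 8-byte load pair
  std::memcpy(&Y0, Y, 8);
  std::memcpy(&X1, static_cast<const char *>(X) + 8, 8); // second 8-byte load pair
  std::memcpy(&Y1, static_cast<const char *>(Y) + 8, 8);
  return ((X0 ^ Y0) | (X1 ^ Y1)) == 0;
}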
Notes:
Reduced from 4 to 2 loads for -Os behavior, which might not be optimal in all cases. It's effectively a question of how much we trust the system implementation. Linux and macOS (and Windows I assume, but did not test) have optimized memcmp() code for x86, so it's probably not bad either way? PPC is using 8/4 for defaults on these. We do not expand at all for -Oz.
There are still potential improvements to make for the CGP expansion IR and/or lowering such as avoiding select-of-constants (D34904) and not doing zexts to the max load type before doing a compare.
We have special-case SSE/AVX codegen for (memcmp(x, y, 16/32) == 0) that will no longer be produced after this patch. I've shown the experimental justification for that change in PR33329:
https://bugs.llvm.org/show_bug.cgi?id=33329#c12
TLDR: While the vector code is a likely winner, we can't guarantee that it's a winner in all cases on all CPUs, so I'm willing to sacrifice it for the greater good of expanding all small memcmp() calls. If we want to resurrect that codegen, it can be done by adjusting the CGP params or poking a hole to let those fall through the CGP expansion.
Committed on behalf of Sanjay Patel
Differential Revision: https://reviews.llvm.org/D35067
llvm-svn: 308322
As discussed by @spatel on D35067:
"I added the cmov attribute to the 32-bit codegen test because it removes some noise for that file. I think the intent for the SSE vs no-SSE runs is to show the potential difference for the 16 and 32 byte cases rather than the lack of cmov (which has been available for all CPUs since SSE1, so that's why it shows up automatically with -mattr=sse2)."
llvm-svn: 308309
This isn't legal code, but we shouldn't crash on it. Now we just don't convert the gather intrinsic if the scale isn't constant and let it go through to isel where we'll report an isel failure.
Fixes PR33772.
llvm-svn: 308267
Rename the enum value from X86_64_Win64 to plain Win64.
The symbol exposed in the textual IR is changed from 'x86_64_win64cc'
to 'win64cc', but the numeric value is kept, keeping support for
old bitcode.
Differential Revision: https://reviews.llvm.org/D34474
llvm-svn: 308208
The LLVM compiler recognizes opportunities to transform a branch into IR select instruction(s) - later these will be lowered into an X86::CMOV instruction, assuming no other optimization eliminated the SelectInst.
However, it is not always profitable to emit an X86::CMOV instruction. For example, a branch is preferable over an X86::CMOV instruction when:
1. The branch is well predicted
2. The condition operand is expensive compared to the True-value and False-value operands (see the sketch after this description)
In the CodeGenPrepare pass there is a shallow optimization that tries to convert a SelectInst into a branch, but it is not enough.
This commit implements a machine optimization pass that converts X86::CMOV instruction(s) into branches, based on a conservative heuristic.
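A hypothetical C++ illustration of case 2 above (not code from the patch): the condition depends on a long-latency load, so a CMOV stalls on the data dependency while a well-predicted branch lets the CPU speculate past it.

// Illustrative only: the condition operand comes from a long-latency load.
int clampToLimit(const int *LimitPtr, int Value) {
  int Limit = *LimitPtr; // long-latency condition operand
  if (Value > Limit)     // predicted branch: no data-dependency stall
    return Limit;
  return Value;          // as a cmov this would wait for the load and compare
}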
Differential Revision: https://reviews.llvm.org/D34769
llvm-svn: 308142
Recommitting after adding a check to avoid miscomputing alias information
on addresses of the same base but different subindices.
Memory accesses offset from frame indices may alias, e.g., we
may merge writes from function arguments passed on the stack when they
are contiguous. As a result, when checking aliasing, we consider the
underlying frame index's offset from the stack pointer.
Static allocas are realized as stack objects in SelectionDAG, but their
offsets are not set until post-DAG, causing DAGCombiner's alias check to
consider accesses to static allocas to frequently alias. Modify isAlias
to consider accesses between static allocas and accesses from other frame
objects to be aliasing.
Many test changes are included here. Most are fixes for tests which
indirectly relied on our aliasing ability and needed to be modified to
preserve their original intent.
The remaining tests have minor improvements due to relaxed
ordering. The exception is CodeGen/X86/2011-10-19-widen_vselect.ll
which has a minor degradation even though the pre-legalized DAG is
improved.
Reviewers: rnk, mkuper, jonpa, hfinkel, uweigand
Reviewed By: rnk
Subscribers: sdardis, nemanjai, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D33345
llvm-svn: 308025
Summary: Continuing the work from https://reviews.llvm.org/D33240, this change introduces an element unordered-atomic memset intrinsic. This intrinsic is essentially memset with the implementation requirement that all stores used for the assignment are done with unordered-atomic stores of a given element size.
Reviewers: eli.friedman, reames, mkazantsev, skatkov
Reviewed By: reames
Subscribers: jfb, dschuff, sbc100, jgravelle-google, aheejin, efriedma, llvm-commits
Differential Revision: https://reviews.llvm.org/D34885
llvm-svn: 307854
FastIsel can't handle them, so we would end up crashing during
register class selection.
Fixes PR26522.
Differential Revision: https://reviews.llvm.org/D35272
llvm-svn: 307797
Summary: Continuing the work from https://reviews.llvm.org/D33240, this change introduces an element unordered-atomic memmove intrinsic. This intrinsic is essentially memmove with the implementation requirement that all loads/stores used for the copy are done with unordered-atomic loads/stores of a given element size.
Reviewers: eli.friedman, reames, mkazantsev, skatkov
Reviewed By: reames
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D34884
llvm-svn: 307796
Summary:
The NetBSD shell sh(1) does not support the ">& /dev/null" construct.
This is a bashism. The portable and POSIX solution is to use:
"> /dev/null 2>&1".
This change fixes 22 Unexpected Failures on NetBSD/amd64
for the "check-llvm" target.
Sponsored by <The NetBSD Foundation>
Reviewers: joerg, dim, rnk
Reviewed By: joerg, rnk
Subscribers: rnk, davide, llvm-commits
Differential Revision: https://reviews.llvm.org/D35277
llvm-svn: 307789
This is a second attempt to land this patch.
The first one resulted in a crash of the clang sanitizer buildbot.
The fix is included here and a regression test has been added.
This is the last fix for the corner case of PR32214. Actually, this is not really a corner case in general.
We should not do a loop rotation if we create an additional branch due to it.
Consider the case where we have a loop chain H, M, B, C, where
H is the header with a viable fallthrough from the pre-header and an exit from the loop,
M - some middle block,
B - backedge to the header, but also with an exit from the loop,
C - some cold block of the loop.
Suppose H is determined to be the best exit. If we do a loop rotation to M, B, C, H we can introduce an extra branch.
Let's compute the change in number of branches:
+1 branch from pre-header to header
-1 branch from header to exit
+1 branch from header to middle block if there is one
-1 branch from cold block to header if there is one
So if C is not a predecessor of H then we introduce an extra branch.
This change prohibits rotation of the loop if both of the following are true:
The best exit has the next element in the chain as a successor.
The last element in the chain is not a predecessor of the first element of the chain.
Reviewers: iteratee, xur, sammccall, chandlerc
Reviewed By: iteratee
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D34745
llvm-svn: 307631
CodeGenPrepare::optimizeMemoryInst contains a check that we do nothing
if all instructions forming the address for a memory instruction are in the same
block as the memory instruction itself.
However, if any of these instructions are placed after the memory instruction, then
the address calculation will not be folded into the memory instruction.
The added test case shows an example.
Reviewers: loladiro, spatel, efriedma
Reviewed By: efriedma
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D34862
llvm-svn: 307628
Reverting as it breaks tramp3d-v4 in the llvm test-suite. I added some
comments to https://reviews.llvm.org/D33345 about it.
This reverts commit r307546.
llvm-svn: 307589
Memory accesses offset from frame indices may alias, e.g., we
may merge writes from function arguments passed on the stack when they
are contiguous. As a result, when checking aliasing, we consider the
underlying frame index's offset from the stack pointer.
Static allocas are realized as stack objects in SelectionDAG, but their
offsets are not set until post-DAG, causing DAGCombiner's alias check to
consider accesses to static allocas to frequently alias. Modify isAlias
to consider accesses between static allocas and accesses from other frame
objects to be aliasing.
Many test changes are included here. Most are fixes for tests which
indirectly relied on our aliasing ability and needed to be modified to
preserve their original intent.
The remaining tests have minor improvements due to relaxed
ordering. The exception is CodeGen/X86/2011-10-19-widen_vselect.ll
which has a minor degradation even though the pre-legalized DAG is
improved.
Reviewers: rnk, mkuper, jonpa, hfinkel, uweigand
Reviewed By: rnk
Subscribers: sdardis, nemanjai, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D33345
llvm-svn: 307546
The SandyBridge architects have provided us with more accurate information about each instruction's latency, number of uOPs and used ports, and I used it to replace the existing estimated SNB instruction scheduling and to add missing scheduling information.
Please note that the patch extensively affects the X86 MC instr scheduling for SNB.
Also note that this patch will be followed by additional patches for the remaining target architectures HSW, IVB, BDW, SKL and SKX.
The updated and extended information about each instruction includes the following details:
•static latency of the instruction
•number of uOps the instruction consists of
•all ports used by the instruction's uOPs
For example, the following code dictates that instructions, ADC64mr, ADC8mr, SBB64mr, SBB8mr have a static latency of 9 cycles. Each of these instructions is decoded into 6 micro operations which use ports 4, ports 2 or 3 and port 0 and ports 0 or 1 or 5:
def SBWriteResGroup94 : SchedWriteRes<[SBPort4,SBPort23,SBPort0,SBPort015]> {
  let Latency = 9;
  let NumMicroOps = 6;
  let ResourceCycles = [1,2,2,1];
}
def: InstRW<[SBWriteResGroup94], (instregex "ADC64mr")>;
def: InstRW<[SBWriteResGroup94], (instregex "ADC8mr")>;
def: InstRW<[SBWriteResGroup94], (instregex "SBB64mr")>;
def: InstRW<[SBWriteResGroup94], (instregex "SBB8mr")>;
Note that apart from the header, most of the X86SchedSandyBridge.td file was generated by a script.
Reviewers: zvi, chandlerc, RKSimon, m_zuckerman, craig.topper, igorb
Differential Revision: https://reviews.llvm.org/D35019#inline-304691
llvm-svn: 307529
Summary:
Mark G_ZEXT/G_SEXT i1 to i8/i16, and i8 to i16, as legal.
Support G_ZEXT i1 to i8/i16 instruction selection (C++ code).
This patch is required to support G_LOAD/G_STORE of i1.
Reviewers: zvi, guyblank
Reviewed By: guyblank
Subscribers: rovka, llvm-commits, kristof.beyls
Differential Revision: https://reviews.llvm.org/D35177
llvm-svn: 307526
WidenVSELECTAndMask can fold (and it folds in this case), so we
get a BUILD_VECTOR of constants as the mask. convertMask() seems to
work fine when the input is a vector of constants, and we still
need to call it to extend/add elements at the end, but the current
code just asserts on anything but a SETCC or AND/OR/XOR of 2xSETCC.
This change was discussed briefly with Simon Pilgrim, who also
suggests we might consider dropping this assertion in the future.
Fixes PR33715.
llvm-svn: 307508
x86 scalar select-of-constants (Cond ? C1 : C2) combining/lowering is a mess
with missing optimizations. We handle some patterns, but miss logical variants.
To clean that up, we should convert all select-of-constants to logic/math and
enhance the combining for the expected patterns from that. Selecting 0 or -1
needs extra attention to produce the optimal code as shown here.
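For illustration, here is a small self-contained sketch (not the lowering code itself) of the math identities such a lowering targets, assuming Cond is 0 or 1:

// Sketch of select-of-constants expressed as math; Cond is 0 or 1.
#include <cassert>
#include <cstdint>

int32_t selAllOnesOrZero(int32_t Cond) { return -Cond; }     // Cond ? -1 : 0
int32_t selZeroOrAllOnes(int32_t Cond) { return Cond - 1; }  // Cond ? 0 : -1
int32_t selConsts(int32_t Cond, int32_t C1, int32_t C2) {
  return C2 + Cond * (C1 - C2);                              // Cond ? C1 : C2
}

int main() {
  for (int32_t Cond = 0; Cond <= 1; ++Cond) {
    assert(selAllOnesOrZero(Cond) == (Cond ? -1 : 0));
    assert(selZeroOrAllOnes(Cond) == (Cond ? 0 : -1));
    assert(selConsts(Cond, 7, 42) == (Cond ? 7 : 42));
  }
}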
Attempt to verify that all of these IR forms are logically equivalent:
http://rise4fun.com/Alive/plxs
Earlier steps in this series:
rL306040
rL306072
rL307404 (D34652)
As acknowledged in the earlier review, there's a possibility that some Intel
uarch would prefer to produce an xor to clear the fake register operand with
sbb %eax, %eax. This will likely need to be addressed in a separate pass.
llvm-svn: 307471
x86 scalar select-of-constants (Cond ? C1 : C2) combining/lowering is a mess
with missing optimizations. We handle some patterns, but miss logical variants.
To clean that up, we should convert all select-of-constants to logic/math and
enhance the combining for the expected patterns from that. DAGCombiner already
has the foundation to allow the transforms, so we just need to fill in the holes
for x86 math op lowering. Selecting 0 or -1 needs extra attention to produce the
optimal code as shown here.
Attempt to verify that all of these IR forms are logically equivalent:
http://rise4fun.com/Alive/plxs
Earlier steps in this series:
rL306040
rL306072
Differential Revision: https://reviews.llvm.org/D34652
llvm-svn: 307404
The patch adjusts the strategy of frequency-based consthoisting:
previously, when the candidate block had the same frequency as the existing
blocks containing a const, the const would not be hoisted to the candidate block.
For that case, we now change the strategy to hoist the const only if the existing
blocks have more than one block member. This is helpful for reducing code size.
Differential Revision: https://reviews.llvm.org/D35084
llvm-svn: 307328
When the formulae search space is huge, LSR uses a series of heuristics to keep
pruning the search space until the number of possible solutions is within a
certain limit.
The big hammer of the series of heuristics is NarrowSearchSpaceByPickingWinnerRegs,
which picks the register used by the most LSRUses and deletes the other
formulae which don't use that register. This is an effective way to prune the search
space, but quite often not a good way to keep the best solution. We have seen cases
before where the heuristic pruned the best formula candidate out of the search space.
To relieve the problem, we introduce a new heuristic called
NarrowSearchSpaceByFilterFormulaWithSameScaledReg. The basic idea is in order to
reduce the search space while keeping the best formula, we want to keep as many
formulae with different Scale and ScaledReg as possible. That is because the central
idea of LSR is to choose a group of loop induction variables and use those induction
variables to represent LSRUses. An induction variable candidate is often represented
by the Scale and ScaledReg in a formula. If we have more formulae with different
ScaledReg and Scale to choose, we have better opportunity to find the best solution.
That is why we believe pruning the search space by only keeping the best formula with
the same Scale and ScaledReg should be more effective than PickingWinnerReg. We use
two criteria to choose the best formula with the same Scale and ScaledReg: the first
is to select the formula using fewer non-shared registers, and the second is to select
the formula with the lower cost computed by RateFormula. The patch implements the
heuristic before NarrowSearchSpaceByPickingWinnerRegs, which remains the last resort.
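A rough illustrative sketch of the filtering idea (the names and data layout below are invented for the example, not LSR's actual classes):

// Sketch: for each (ScaledReg, Scale) pair keep only the best formula,
// preferring fewer non-shared registers and then the lower cost.
#include <map>
#include <utility>
#include <vector>

struct Formula {
  int ScaledReg;          // stand-in for the scaled register
  int Scale;
  unsigned NonSharedRegs; // registers not shared with other formulae
  unsigned Cost;          // stand-in for the RateFormula result
};

std::vector<Formula> filterBySameScaledReg(const std::vector<Formula> &Formulae) {
  std::map<std::pair<int, int>, Formula> Best;
  for (const Formula &F : Formulae) {
    auto Key = std::make_pair(F.ScaledReg, F.Scale);
    auto It = Best.find(Key);
    bool Better = It == Best.end() ||
                  F.NonSharedRegs < It->second.NonSharedRegs ||
                  (F.NonSharedRegs == It->second.NonSharedRegs &&
                   F.Cost < It->second.Cost);
    if (Better)
      Best[Key] = F;
  }
  std::vector<Formula> Kept;
  for (const auto &KV : Best)
    Kept.push_back(KV.second);
  return Kept;
}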
Testing shows we get 1.8% and 2% on two internal benchmarks on x86. llvm nightly
testsuite performance is neutral. We also tried lsr-exp-narrow and it didn't help
on the two improved internal cases we saw.
Differential Revision: https://reviews.llvm.org/D34583
llvm-svn: 307269
For two ROTR operations with shifts C1 and C2, the combined shift operand will be (C1 + C2) % bitsize.
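A small self-contained check of the identity (illustrative only):

// rotr(rotr(x, C1), C2) == rotr(x, (C1 + C2) % 32)
#include <cassert>
#include <cstdint>

uint32_t rotr32(uint32_t X, unsigned C) {
  C %= 32;
  return C == 0 ? X : (X >> C) | (X << (32 - C));
}

int main() {
  uint32_t X = 0x12345678u;
  unsigned C1 = 13, C2 = 27;
  assert(rotr32(rotr32(X, C1), C2) == rotr32(X, (C1 + C2) % 32));
}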
Differential revision: https://reviews.llvm.org/D12833
llvm-svn: 307179
Relanding after rewriting the undef.ll test to avoid host-dependent
endianness.
As discussed in D34087, rewrite areNonVolatileConsecutiveLoads using
generic checks. Also, propagate missing local handling from there to
BaseIndexOffset checks.
Tests of note:
* test/CodeGen/X86/build-vector* - Improved.
* test/CodeGen/BPF/undef.ll - Improved store alignment allows an
additional store merge
* test/CodeGen/X86/clear_upper_vector_element_bits.ll - This is a
case we already do not handle well. Here, the DAG is improved, but
scheduling causes a code size degradation.
Reviewers: RKSimon, craig.topper, spatel, andreadb, filcab
Subscribers: nemanjai, llvm-commits
Differential Revision: https://reviews.llvm.org/D34472
llvm-svn: 307114
Made some updates to the half.ll test under CodeGen to make it friendly to the update_llc_test_checks.py tool, as follows:
1. Removing the llc flag -asm-verbose=false
2. Grouping the multiple check-prefix directives
3. Applying the update_llc_test_checks.py tool to the test
This change is needed to easily update scheduling changes in an upcoming patch.
Reviewers: zvi, RKSimon, craig.topper
Differential Revision: https://reviews.llvm.org/D34934
llvm-svn: 307108
Move from the generic directory to the X86 directory, since gc intrinsics are only
supported on x86-64.
Add target triple as well.
Fixes build failure in i686-linux-RA caused by rL307084.
llvm-svn: 307086
Converting the Codegen test "extractelement-legalization-store-ordering.ll" to be "update_llc_test_checks" friendly.
The changes to the test are needed for an upcoming scheduling patch.
Reviewers: zvi, RKSimon
Differential Revision: https://reviews.llvm.org/D34935
llvm-svn: 307066
Summary:
When broadcasting from the constant pool it's useful to print out the final vector, similar to what we do for normal moves from the constant pool.
I changed only a couple of tests that were broadcast focused. One of them had previously been hand tweaked after running the script so that it could check the constant pool declaration, but I think this patch makes that unnecessary now since we can check the comment instead.
Reviewers: spatel, RKSimon, zvi
Reviewed By: spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D34923
llvm-svn: 307062
The patch makes the SoftenFloatResult/Operand logic the same as all other legalization routines: SoftenFloatResult() now fills the SoftenFloats map and SoftenFloatOperand() performs all needed replacements. This prevents the softening machinery from leaving stale entries in the SoftenFloats map (which resulted in errors during legalize type checking) and clarifies softening. The patch replaces https://reviews.llvm.org/D29265.
Differential Revision: https://reviews.llvm.org/D31946
llvm-svn: 307053
Summary:
Add a combine for creating a truncate to replace a build_vector composed of extracts with
indices that form a stride-2^N series.
Example:
v8i32 V = ...
v4i32 build_vector((extract_elt V, 0), (extract_elt V, 2), (extract_elt V, 4), (extract_elt V, 6))
-->
v4i32 truncate (bitcast V to v4i64)
Related discussion in llvm-dev about canonicalizing shuffles to
truncates in LLVM IR:
http://lists.llvm.org/pipermail/llvm-dev/2017-January/108936.html.
Reviewers: spatel, RKSimon, efriedma, igorb, craig.topper, wolfgangp, delena
Reviewed By: delena
Subscribers: guyblank, delena, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D34077
llvm-svn: 307036
These all used 'CHECK-NOT' which isn't necessary if we have complete checks.
There were also over-specifications in the RUN params such as CPU model.
llvm-svn: 307033
These all used 'CHECK-NOT' which isn't necessary if we have complete checks.
There were also several over-specifications in the RUN params such as CPU model or OS requirement
llvm-svn: 307028
We are combining shuffles to bit shifts before unary permutes, which means we can't fold loads and the destination register is destructive
llvm-svn: 306978
We are combining shuffles to bit shifts before unary permutes, which means we can't fold loads and the destination register is destructive
The 32-bit shuffles are a bit tricky and will be dealt with in a later patch
llvm-svn: 306977
We are combining shuffles to bit shifts before unary permutes, which means we can't fold loads and the destination register is destructive
llvm-svn: 306976
This is NFC after rerunning the "update_llc_test_checks.py" tool on the CodeGen X86 tests in order to submit a patch.
Minor differences due to added "End of Function" lines.
Reviewers: zvi
Differential Revision: https://reviews.llvm.org/D34933
llvm-svn: 306973
Summary: Support the G_GLOBAL_VALUE operation. For now most of the PIC configurations are not implemented yet.
Reviewers: zvi, guyblank
Reviewed By: guyblank
Subscribers: rovka, kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D34738
Conflicts:
test/CodeGen/X86/GlobalISel/regbankselect-X86_64.mir
llvm-svn: 306972
Summary:
Support vector type G_UNMERGE_VALUES selection.
For now G_UNMERGE_VALUES is marked as legal for any type, so there is nothing to do in the legalizer.
Reviewers: t.p.northover, qcolombet, zvi, guyblank
Reviewed By: guyblank
Subscribers: rovka, kristof.beyls, guyblank, llvm-commits
Differential Revision: https://reviews.llvm.org/D33665
llvm-svn: 306971
As discussed in D34087, rewrite areNonVolatileConsecutiveLoads using
generic checks. Also, propagate missing local handling from there to
BaseIndexOffset checks.
Tests of note:
* test/CodeGen/X86/build-vector* - Improved.
* test/CodeGen/BPF/undef.ll - Improved store alignment allows an
additional store merge
* test/CodeGen/X86/clear_upper_vector_element_bits.ll - This is a
case we already do not handle well. Here, the DAG is improved, but
scheduling causes a code size degradation.
Reviewers: RKSimon, craig.topper, spatel, andreadb, filcab
Subscribers: nemanjai, llvm-commits
Differential Revision: https://reviews.llvm.org/D34472
llvm-svn: 306819
Summary:
Support vector type G_MERGE_VALUES selection. For now G_MERGE_VALUES is marked as legal for any type, so there is nothing to do in the legalizer.
Split from https://reviews.llvm.org/D33665
Reviewers: qcolombet, t.p.northover, zvi, guyblank
Reviewed By: guyblank
Subscribers: rovka, kristof.beyls, guyblank, llvm-commits
Differential Revision: https://reviews.llvm.org/D33958
llvm-svn: 306665
CFI instructions that set appropriate cfa offset and cfa register are now
inserted in emitEpilogue() in X86FrameLowering.
Majority of the changes in this patch:
1. Ensure that CFI instructions do not affect code generation.
2. Enable maintaining correct information about cfa offset and cfa register
in a function when basic blocks are reordered, merged, split, duplicated.
These changes are target independent and described below.
Changed CFI instructions so that they:
1. are duplicable
2. are not counted as instructions when tail duplicating or tail merging
3. can be compared as equal
Add information to each MachineBasicBlock about cfa offset and cfa register
that are valid at its entry and exit (incoming and outgoing CFI info). Add
support for updating this information when basic blocks are merged, split,
duplicated, created. Add a verification pass (CFIInfoVerifier) that checks
that outgoing cfa offset and register of predecessor blocks match incoming
values of their successors.
Incoming and outgoing CFI information is used by a late pass
(CFIInstrInserter) that corrects CFA calculation rule for a basic block if
needed. That means that additional CFI instructions get inserted at basic
block beginning to correct the rule for calculating CFA. Having CFI
instructions in function epilogue can cause incorrect CFA calculation rule
for some basic blocks. This can happen if, due to basic block reordering,
or the existence of multiple epilogue blocks, some of the blocks have wrong
cfa offset and register values set by the epilogue block above them.
Patch by Violeta Vukobrat.
Differential Revision: https://reviews.llvm.org/D18046
llvm-svn: 306529
As noted in D34071, there are some IR optimization opportunities that could be
handled by normal IR passes if this expansion wasn't happening so late in CGP.
Regardless of that, it seems wasteful to knowingly produce suboptimal IR here,
so I'm proposing this change:
%s = sub i32 %x, %y
%r = icmp ne %s, 0
=>
%r = icmp ne %x, %y
Changing the predicate to 'eq' mimics what InstCombine would do, so that's just
an efficiency improvement if we decide this expansion should happen sooner.
The fact that the PowerPC backend doesn't eliminate the 'subf.' might be
something for PPC folks to investigate separately.
Differential Revision: https://reviews.llvm.org/D34416
llvm-svn: 306471
•static latency
•number of uOps the instruction consists of
•all ports used by the instruction
Reviewers: RKSimon, zvi, aymanmus, m_zuckerman
Differential Revision: https://reviews.llvm.org/D33897
llvm-svn: 306414
[X86][AVX512] Improve lowering of AVX512 compare intrinsics (remove redundant shift left+right instructions).
AVX512 compare instructions return v*i1 types.
In cases where the number of elements in the returned value is less than 8, clang adds zeroes to get a mask of v8i1 type.
Later on it's replaced with CONCAT_VECTORS, which then is lowered to many DAG nodes including insert/extract element and shift right/left nodes.
The fact that AVX512 compare instructions put the result in a k register and zeroes all its upper bits allows us to remove the extra nodes simply by copying the result to the required register class.
When lowering, identify these cases and transform them into an INSERT_SUBVECTOR node (marked legal), then catch this pattern in the instruction selection phase and transform it into one avx512 cmp instruction.
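An illustrative intrinsics example of the source pattern (assumes AVX512F+AVX512VL; not code from the patch): a compare of four 64-bit elements produces only 4 mask bits, which already live zero-extended in a k register, so no shift-left/shift-right pair is needed to clear the upper bits.

// Illustrative only; compile with AVX512F+AVX512VL enabled.
#include <immintrin.h>
#include <cstdint>

uint8_t cmpMask4(__m256i A, __m256i B) {
  __mmask8 M = _mm256_cmpeq_epi64_mask(A, B); // only bits 0..3 can be set
  return static_cast<uint8_t>(M);             // upper bits are already zero
}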
Differential Revision: https://reviews.llvm.org/D33188
llvm-svn: 306402