We are combining shuffles to bit shifts before unary permutes, which means we can't fold loads and the destination register is destructive
llvm-svn: 306978
We are combining shuffles to bit shifts before unary permutes, which means we can't fold loads and the destination register is destructive
The 32-bit shuffles are a bit tricky and will be dealt with in a later patch
llvm-svn: 306977
We are combining shuffles to bit shifts before unary permutes, which means we can't fold loads and the destination register is destructive
llvm-svn: 306976
This is NFC after rerunning the "update_llc_test_checks.py" tool on the CodeGen X86 tests in order to submit a patch.
Minor differences due to added "End of Function" lines.
Reviewers: zvi
Differential Revision: https://reviews.llvm.org/D34933
llvm-svn: 306973
Summary: Support G_GLOBAL_VALUE operation. For now, most of the PIC configurations are not implemented yet.
Reviewers: zvi, guyblank
Reviewed By: guyblank
Subscribers: rovka, kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D34738
llvm-svn: 306972
Summary:
Support vector type G_UNMERGE_VALUES selection.
For now, G_UNMERGE_VALUES is marked as legal for any type, so there is nothing to do in the legalizer.
Reviewers: t.p.northover, qcolombet, zvi, guyblank
Reviewed By: guyblank
Subscribers: rovka, kristof.beyls, guyblank, llvm-commits
Differential Revision: https://reviews.llvm.org/D33665
llvm-svn: 306971
As discussed in D34087, rewrite areNonVolatileConsecutiveLoads using
generic checks. Also, propagate missing local handling from there to
BaseIndexOffset checks.
Tests of note:
* test/CodeGen/X86/build-vector* - Improved.
* test/CodeGen/BPF/undef.ll - Improved store alignment allows an
additional store merge
* test/CodeGen/X86/clear_upper_vector_element_bits.ll - This is a
case we already do not handle well. Here, the DAG is improved, but
scheduling causes a code size degradation.
Reviewers: RKSimon, craig.topper, spatel, andreadb, filcab
Subscribers: nemanjai, llvm-commits
Differential Revision: https://reviews.llvm.org/D34472
llvm-svn: 306819
Summary:
Support vector type G_MERGE_VALUES selection. For now, G_MERGE_VALUES is marked as legal for any type, so there is nothing to do in the legalizer.
Split from https://reviews.llvm.org/D33665
Reviewers: qcolombet, t.p.northover, zvi, guyblank
Reviewed By: guyblank
Subscribers: rovka, kristof.beyls, guyblank, llvm-commits
Differential Revision: https://reviews.llvm.org/D33958
llvm-svn: 306665
CFI instructions that set appropriate cfa offset and cfa register are now
inserted in emitEpilogue() in X86FrameLowering.
Majority of the changes in this patch:
1. Ensure that CFI instructions do not affect code generation.
2. Enable maintaining correct information about cfa offset and cfa register
in a function when basic blocks are reordered, merged, split, duplicated.
These changes are target independent and described below.
Changed CFI instructions so that they:
1. are duplicable
2. are not counted as instructions when tail duplicating or tail merging
3. can be compared as equal
Add information to each MachineBasicBlock about cfa offset and cfa register
that are valid at its entry and exit (incoming and outgoing CFI info). Add
support for updating this information when basic blocks are merged, split,
duplicated, created. Add a verification pass (CFIInfoVerifier) that checks
that outgoing cfa offset and register of predecessor blocks match incoming
values of their successors.
Incoming and outgoing CFI information is used by a late pass
(CFIInstrInserter) that corrects CFA calculation rule for a basic block if
needed. That means that additional CFI instructions get inserted at the
beginning of a basic block to correct the rule for calculating the CFA. Having CFI
instructions in function epilogue can cause incorrect CFA calculation rule
for some basic blocks. This can happen if, due to basic block reordering,
or the existence of multiple epilogue blocks, some of the blocks have wrong
cfa offset and register values set by the epilogue block above them.
Patch by Violeta Vukobrat.
Differential Revision: https://reviews.llvm.org/D18046
llvm-svn: 306529
As noted in D34071, there are some IR optimization opportunities that could be
handled by normal IR passes if this expansion wasn't happening so late in CGP.
Regardless of that, it seems wasteful to knowingly produce suboptimal IR here,
so I'm proposing this change:
%s = sub i32 %x, %y
%r = icmp ne %s, 0
=>
%r = icmp ne %x, %y
Changing the predicate to 'eq' mimics what InstCombine would do, so that's just
an efficiency improvement if we decide this expansion should happen sooner.
The fact that the PowerPC backend doesn't eliminate the 'subf.' might be
something for PPC folks to investigate separately.
Differential Revision: https://reviews.llvm.org/D34416
llvm-svn: 306471
• static latency
• number of uOps of which the instruction consists
• all ports used by the instruction
Reviewers: RKSimon, zvi, aymanmus, m_zuckerman
Differential Revision: https://reviews.llvm.org/D33897
llvm-svn: 306414
[X86][AVX512] Improve lowering of AVX512 compare intrinsics (remove redundant shift left+right instructions).
AVX512 compare instructions return v*i1 types.
In cases where the number of elements in the returned value is less than 8, clang adds zeroes to get a mask of v8i1 type.
Later on it's replaced with CONCAT_VECTORS, which then is lowered to many DAG nodes including insert/extract element and shift right/left nodes.
The fact that AVX512 compare instructions put the result in a k register and zeroes all its upper bits allows us to remove the extra nodes simply by copying the result to the required register class.
When lowering, identify these cases and transform them into an INSERT_SUBVECTOR node (marked legal), then catch this pattern in the instruction selection phase and transform it into one avx512 cmp instruction.
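A minimal IR sketch of the source pattern (assumed for illustration, not taken from the original patch): a narrow compare whose mask is widened with zeroes before being bitcast to a scalar:
define i8 @cmp_mask_v4i32(<4 x i32> %a, <4 x i32> %b) {
  %cmp = icmp sgt <4 x i32> %a, %b
  %wide = shufflevector <4 x i1> %cmp, <4 x i1> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %mask = bitcast <8 x i1> %wide to i8
  ret i8 %mask
}
With this change, the zero-extension of the compare result falls out of a plain copy of the k register instead of a chain of shifts.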
Differential Revision: https://reviews.llvm.org/D33188
llvm-svn: 306402
The non-AVX-512 behavior was changed in r248266 to match N1778
(C bindings for IEEE-754 (2008)), which defined the four functions
to not raise the inexact exception ("rint" is still defined as raising
it).
Update the AVX-512 lowering of these functions to match that: it should
not be different.
llvm-svn: 306299
Convert vector increment or decrement to sub/add with an all-ones constant:
add X, <1, 1...> --> sub X, <-1, -1...>
sub X, <1, 1...> --> add X, <-1, -1...>
The all-ones vector constant can be materialized using a pcmpeq instruction that is
commonly recognized as an idiom (has no register dependency), so that's better than
loading a splat 1 constant.
AVX512 uses 'vpternlogd' for 512-bit vectors because there is apparently no better
way to produce 512 one-bits.
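A minimal IR sketch (assumed, not from the original commit) that exercises this lowering -- a v4i32 increment whose all-ones constant can now be materialized with pcmpeq instead of a constant-pool load:
define <4 x i32> @inc(<4 x i32> %x) {
  %r = add <4 x i32> %x, <i32 1, i32 1, i32 1, i32 1>
  ret <4 x i32> %r
}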
The general advantages of this lowering are:
1. pcmpeq has lower latency than a memop on every uarch I looked at in Agner's tables,
so in theory, this could be better for perf, but...
2. That seems unlikely to affect any OOO implementation, and I can't measure any real
perf difference from this transform on Haswell or Jaguar, but...
3. It doesn't look like it from the diffs, but this is an overall size win because we
eliminate 16 - 64 constant bytes in the case of a vector load. If we're broadcasting
a scalar load (which might itself be a bug), then we're replacing a scalar constant
load + broadcast with a single cheap op, so that should always be smaller/better too.
4. This makes the DAG/isel output more consistent - we use pcmpeq already for padd x, -1
and psub x, -1, so we should use that form for +1 too because we can. If there's some
reason to favor a constant load on some CPU, let's make the reverse transform for all
of these cases (either here in the DAG or in a later machine pass).
This should fix:
https://bugs.llvm.org/show_bug.cgi?id=33483
Differential Revision: https://reviews.llvm.org/D34336
llvm-svn: 306289
This is a last fix for the corner case of PR32214. Actually, this is not really a corner case in general.
We should not do a loop rotation if it creates an additional branch.
Consider the case where we have a loop chain H, M, B, C, where
H is the header, with a viable fallthrough from the pre-header and an exit from the loop,
M is some middle block,
B is the backedge to the header, but also with an exit from the loop,
C is some cold block of the loop.
Suppose H is determined to be the best exit. If we do a loop rotation to M, B, C, H we can introduce an extra branch.
Let's compute the change in the number of branches:
+1 branch from the pre-header to the header
-1 branch from the header to the exit
+1 branch from the header to the middle block if there is such
-1 branch from the cold block to the header if there is one
So if C is not a predecessor of H, then we introduce an extra branch.
This change prohibits rotation of the loop if both of the following are true:
1) The best exit has the next element in the chain as a successor.
2) The last element in the chain is not a predecessor of the first element of the chain.
Reviewers: iteratee, xur
Reviewed By: iteratee
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D34271
llvm-svn: 306272
The compiler fails with an assertion during legalization of SETCC for <3 x i8> operands.
The result is extended to <4 x i8> and then truncated to <4 x i1>. This does not happen on AVX2, because the final result of SETCC is <4 x i32>.
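A minimal reproducer sketch (assumed; the actual test case may differ):
define <3 x i1> @setcc_v3i8(<3 x i8> %a, <3 x i8> %b) {
  %cmp = icmp slt <3 x i8> %a, %b
  ret <3 x i1> %cmp
}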
Differential Revision: https://reviews.llvm.org/D34503
llvm-svn: 306242
Summary:
Support vector type G_EXTRACT selection. For now, G_EXTRACT is marked as legal for any type, so there is nothing to do in the legalizer.
Split from https://reviews.llvm.org/D33665
Reviewers: qcolombet, t.p.northover, zvi, guyblank
Reviewed By: guyblank
Subscribers: guyblank, rovka, llvm-commits, kristof.beyls
Differential Revision: https://reviews.llvm.org/D33957
llvm-svn: 306240
The command-line params override the target setting in the file itself, so delete that.
Also, remove the cpu and arch because those don't matter and neither does the OS specification in the triple.
llvm-svn: 306109
This is very similar to the transform in:
https://reviews.llvm.org/rL306040
...but in this case, we use cmp X, 1 to set the carry bit as needed.
Again, we can show that all of these are logically equivalent (although
InstCombine currently canonicalizes to a form not seen here), and if
we believe IACA, then this is the smallest/fastest code. Eg, with SNB:
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
---------------------------------------------------------------------
| 1 | 1.0 | | | | | | | cmp edi, 0x1
| 2 | | 1.0 | | | | 1.0 | CP | sbb eax, eax
The larger motivation is to clean up all select-of-constants combining/lowering
because we're missing some common cases.
llvm-svn: 306072
Summary:
These intrinsics aren't used by clang and haven't been for a while.
There's some really terrible codegen in the 32-bit target for avx512bw due to i64 not being legal. But, as I said, these intrinsics weren't used by clang even before this patch, so this codegen reflects our clang behavior today.
Reviewers: spatel, RKSimon, zvi, igorb
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D34389
llvm-svn: 306047
Our handling of select-of-constants is lumpy in IR (https://reviews.llvm.org/D24480),
lumpy in DAGCombiner, and lumpy in X86ISelLowering. That's why we only had the 'sbb'
codegen in 1 out of the 4 tests. This is a step towards smoothing that out.
First, show that all of these IR forms are equivalent:
http://rise4fun.com/Alive/mx
Second, show that the 'sbb' version is faster/smaller. IACA output for SandyBridge
(later Intel and AMD chips are similar based on Agner's tables):
This is the "obvious" x86 codegen (what gcc appears to produce currently):
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
---------------------------------------------------------------------
| 1* | | | | | | | | xor eax, eax
| 1 | 1.0 | | | | | | CP | test edi, edi
| 1 | | | | | | 1.0 | CP | setnz al
| 1 | | 1.0 | | | | | CP | neg eax
This is the adc version:
| 1* | | | | | | | | xor eax, eax
| 1 | 1.0 | | | | | | CP | cmp edi, 0x1
| 2 | | 1.0 | | | | 1.0 | CP | adc eax, 0xffffffff
And this is sbb:
| 1 | 1.0 | | | | | | | neg edi
| 2 | | 1.0 | | | | 1.0 | CP | sbb eax, eax
If IACA is trustworthy, then sbb became a single uop in Broadwell, so this will be
clearly better than the alternatives going forward.
llvm-svn: 306040
Masked gather for vector length 2 is lowered incorrectly for element type i32.
The type <2 x i32> was automatically extended to <2 x i64> and we generated VPGATHERQQ instead of VPGATHERQD.
The type <2 x float> is extended to <4 x float>, so there is no bug for this type, but the sequence may be more optimal.
In this patch I'm fixing the <2 x i32> bug and optimizing the <2 x float> sequence for GATHERs only. The same fix should be done for Scatters as well.
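A sketch of the kind of gather involved (assumed; written with the current two-type intrinsic mangling, which may differ from the mangling at the time):
declare <2 x i32> @llvm.masked.gather.v2i32.v2p0i32(<2 x i32*>, i32, <2 x i1>, <2 x i32>)

define <2 x i32> @gather_v2i32(<2 x i32*> %ptrs, <2 x i1> %mask, <2 x i32> %passthru) {
  %g = call <2 x i32> @llvm.masked.gather.v2i32.v2p0i32(<2 x i32*> %ptrs, i32 4, <2 x i1> %mask, <2 x i32> %passthru)
  ret <2 x i32> %g
}
Before the fix this would select VPGATHERQQ; afterwards it selects VPGATHERQD.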
Differential revision: https://reviews.llvm.org/D34343
llvm-svn: 305987
Add support for combining a build vector to a shuffle.
This applies when the build vector consists of elements extracted from 2 vectors (vec1, vec2), where vec2 is half the size of vec1.
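A hedged IR sketch of the case described (names and widths assumed): a <4 x float> built from elements of a <4 x float> and a <2 x float>:
define <4 x float> @buildvec(<4 x float> %vec1, <2 x float> %vec2) {
  %a0 = extractelement <4 x float> %vec1, i32 0
  %a1 = extractelement <4 x float> %vec1, i32 1
  %b0 = extractelement <2 x float> %vec2, i32 0
  %b1 = extractelement <2 x float> %vec2, i32 1
  %v0 = insertelement <4 x float> undef, float %a0, i32 0
  %v1 = insertelement <4 x float> %v0, float %a1, i32 1
  %v2 = insertelement <4 x float> %v1, float %b0, i32 2
  %v3 = insertelement <4 x float> %v2, float %b1, i32 3
  ret <4 x float> %v3
}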
llvm-svn: 305883
Summary:
When we're building with XRay instrumentation, we use a trick that
preserves references from the function to a function sled index. This
index table lives in a separate section, and without this trick the
linker is free to garbage-collect this section and all the segments it
refers to. Until we're able to tell the linkers to preserve these
sections, we use this reference trick to keep around both the index and
the entries in the instrumentation map.
Before this change we emitted both a synthetic reference to the label in
the instrumentation map, and to the entry in the function map index.
This change removes the first synthetic reference and only emits one
synthetic reference to the index -- the index entry has the references
to the labels in the instrumentation map, so the linker will still
preserve those if the function itself is preserved.
This reduces the amount of synthetic references we emit from 16 bytes to
just 8 bytes in x86_64, and similarly to other platforms.
Reviewers: dblaikie
Subscribers: javed.absar, kpw, pelikan, llvm-commits
Differential Revision: https://reviews.llvm.org/D34340
llvm-svn: 305880
Right now areMemoryOpsAliased has an assertion justified as:
MMO1 should have a value due it comes from operation we'd like to use
as implicit null check.
assert(MMO1->getValue() && "MMO1 should have a Value!");
However, it is possible for that invariant to not be upheld in the
following situation (conceptually):
Null check %RAX
NotNullSucc:
%RAX = LEA %RSP, 16 // I0
%RDX = MOV64rm %RAX // I1
With the current code, we will have an early exit from
ImplicitNullChecks::isSuitableMemoryOp on I0 with SR_Unsuitable.
However, I1 will look plausible (since it loads from %RAX) and
will go ahead and call areMemoryOpsAliased(I1, I0). This will cause
us to fail the assert mentioned above since I1 does not load from an
IR level value and thus is allowed to have a non-Value base address.
The fix is to bail out earlier whenever we see an unsuitable
instruction overwrite PointerReg. This would guarantee that when we
call areMemoryOpsAliased, we're guaranteed to be looking at an
instruction that loads from or stores to an IR level value.
Original Patch Author: sanjoy
Reviewers: sanjoy, mkazantsev, reames
Reviewed By: sanjoy
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D34385
llvm-svn: 305879
There are a couple of potential improvements as seen in the IR and asm:
1. We're unnecessarily extending to a larger type to compare values.
2. The codegen for (select cond, 1, -1) could avoid a cmov.
(or we could change the order of the compares, so we have a select with 0 operand)
llvm-svn: 305802
Summary:
In some cases legalization ends up with non-symmetric merge/unmerge nodes.
Transform them into merge/unmerge nodes.
Reviewers: t.p.northover, qcolombet, zvi
Reviewed By: t.p.northover
Subscribers: rovka, kristof.beyls, guyblank, llvm-commits
Differential Revision: https://reviews.llvm.org/D33626
llvm-svn: 305783
The AND64rm instruction used in the test case was incorrect. Since
the first input register to AND64rm is tied to the output register, they
must be the same.
Thanks to Jesper Antonsson for pointing this out!
llvm-svn: 305756
This seems to be interacting badly with ASan somehow, causing false reports of
heap-buffer overflows: PR33514.
> Summary:
> The patch makes instruction count the highest priority for
> LSR solution for X86 (previously registers had highest priority).
>
> Reviewers: qcolombet
>
> Differential Revision: http://reviews.llvm.org/D30562
>
> From: Evgeny Stupachenko <evstupac@gmail.com>
llvm-svn: 305720
Summary: Implement some of the simplest addressing modes. It should help to test the ABI.
Reviewers: zvi, guyblank
Reviewed By: guyblank
Subscribers: rovka, llvm-commits, kristof.beyls
Differential Revision: https://reviews.llvm.org/D33888
llvm-svn: 305691
Re-apply r276044/r279124/r305516. Fixed a problem where we would refuse
to place spills as the very first instruction of a basic block and thus
artificially increase pressure (test in
test/CodeGen/PowerPC/scavenging.mir:spill_at_begin)
This is a variant of scavengeRegister() that works for
enterBasicBlockEnd()/backward(). The benefit of the backward mode is
that it is not affected by incomplete kill flags.
This patch also changes
PrologEpilogInserter::doScavengeFrameVirtualRegs() to use the register
scavenger in backwards mode.
Differential Revision: http://reviews.llvm.org/D21885
llvm-svn: 305625
Revert because of reports of some PPC input starting to spill when it
was predicted that it wouldn't and no spillslot was reserved.
This reverts commit r305516.
llvm-svn: 305566
Summary:
Background: http://lists.llvm.org/pipermail/llvm-dev/2017-May/112779.html
This change is to alter the prototype for the atomic memcpy intrinsic. The prototype itself is being changed to more closely resemble the semantics and parameters of the llvm.memcpy intrinsic -- to ease later combination of the llvm.memcpy and atomic memcpy intrinsics. Furthermore, the name of the atomic memcpy intrinsic is being changed to make it clear that it is not a generic atomic memcpy, but specifically a memcpy that is unordered atomic.
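A sketch of the renamed intrinsic as described (the exact prototype here is an assumption; see the review for the authoritative form):
declare void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* nocapture writeonly, i8* nocapture readonly, i32, i32)

define void @copy16(i8* %dst, i8* %src) {
  ; last argument is the element size; alignment is carried by the align attributes
  call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %dst, i8* align 4 %src, i32 16, i32 4)
  ret void
}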
Reviewers: reames, sanjoy, efriedma
Reviewed By: reames
Subscribers: mzolotukhin, anna, llvm-commits, skatkov
Differential Revision: https://reviews.llvm.org/D33240
llvm-svn: 305558
Re-apply r276044/r279124. Trying to reproduce or disprove the ppc64
problems reported in the stage2 build last time, which I cannot
reproduce right now.
This is a variant of scavengeRegister() that works for
enterBasicBlockEnd()/backward(). The benefit of the backward mode is
that it is not affected by incomplete kill flags.
This patch also changes
PrologEpilogInserter::doScavengeFrameVirtualRegs() to use the register
scavenger in backwards mode.
Differential Revision: http://reviews.llvm.org/D21885
llvm-svn: 305516
The code assumed that we process instructions in basic block order. FastISel
processes instructions in reverse basic block order. We need to pre-assign
virtual registers before selecting, otherwise we get def-use relationships wrong.
This only affects code with swifterror registers.
rdar://32659327
llvm-svn: 305484
AVX512 compare instructions return v*i1 types.
In cases where the number of elements in the returned value is less than 8, clang adds zeroes to get a mask of v8i1 type.
Later on it's replaced with CONCAT_VECTORS, which then is lowered to many DAG nodes including insert/extract element and shift right/left nodes.
The fact that AVX512 compare instructions put the result in a k register and zeroes all its upper bits allows us to remove the extra nodes simply by copying the result to the required register class.
When lowering, identify these cases and transform them into an INSERT_SUBVECTOR node (marked legal), then catch this pattern in the instruction selection phase and transform it into one avx512 cmp instruction.
Differential Revision: https://reviews.llvm.org/D33188
llvm-svn: 305465
Summary:
This patch is part of 3 patches that together form a single patch, but must be introduced in stages in order not to break things.
The way that LLVM interprets DW_OP_plus in DIExpression nodes is basically that of the DW_OP_plus_uconst operator since LLVM expects an unsigned constant operand. This unnecessarily restricts the DW_OP_plus operator, preventing it from being used to describe the evaluation of runtime values on the expression stack. These patches try to align the semantics of DW_OP_plus and DW_OP_minus with that of the DWARF definition, which pops two elements off the expression stack, performs the operation and pushes the result back on the stack.
This is done in three stages:
• The first patch (LLVM) adds support for DW_OP_plus_uconst.
• The second patch (Clang) changes all its uses from DW_OP_plus to DW_OP_plus_uconst.
• The third patch (LLVM) changes the semantics of DW_OP_plus and DW_OP_minus to be in line with their DWARF meaning. This patch includes the bitcode upgrade from legacy DIExpressions.
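For illustration, a sketch of the operator usage (assumed from the description above):
!1 = !DIExpression(DW_OP_plus, 8)          ; legacy meaning: add the constant 8
!2 = !DIExpression(DW_OP_plus_uconst, 8)   ; stage 1: explicit constant-add operator
; after stage 3, DW_OP_plus pops two values off the expression stack, per DWARF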
Patch by Sander de Smalen.
Reviewers: echristo, pcc, aprantl
Reviewed By: aprantl
Subscribers: fhahn, javed.absar, aprantl, llvm-commits
Differential Revision: https://reviews.llvm.org/D33894
llvm-svn: 305386
The dream of a unified check-line auto-generator for all phases of compilation is dead.
The llc script has already diverged to be better at its goal, so having 2 scripts that
do almost the same thing is just causing confusion.
We can rip out the llc ability in update_test_checks.py next and rename it, so it will
be clear that we have one script for llc check auto-generation and another for opt.
llvm-svn: 305206
Summary:
This change enables the sin(x) cos(x) -> sincos(x) optimization on GNU
target triples. This optimization was being inhibited when -ffast-math
wasn't set because sincos in GLibC does not set errno, while sin and cos
do. However, this optimization will only run if the attributes on the
sin/cos calls include readnone, which is how clang represents the fact
that it doesn't care about the errno values set by these functions (via
the -fno-math-errno flag).
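A minimal IR sketch of the pattern that can now be combined (assumed; actual clang output will differ in attributes and naming):
declare double @sin(double)
declare double @cos(double)

define double @sum_sin_cos(double %x) {
  %s = call double @sin(double %x) readnone
  %c = call double @cos(double %x) readnone
  %r = fadd double %s, %c
  ret double %r
}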
Reviewers: hfinkel, bogner
Subscribers: mcrosier, javed.absar, llvm-commits, paul.redmond
Differential Revision: https://reviews.llvm.org/D32921
llvm-svn: 305204
The dream of a unified check-line auto-generator for all phases of compilation is dead.
The llc script has already diverged to be better at its goal, so having 2 scripts that
do almost the same thing is just causing confusion for newcomers. I plan to fix up more
x86 tests in a follow-up commit. We can rip out the llc ability in update_test_checks.py after
that.
llvm-svn: 305202
Summary:
The old check for slot overlap treated 2 slots `S` and `T` as
overlapping if there existed a CFG node in which both of the slots could
possibly be active. That is overly conservative and caused stack blowups
in Rust programs. Instead, check whether there is a single CFG node in
which both of the slots are possibly active *together*.
Fixes PR32488.
Patch by Ariel Ben-Yehuda <ariel.byd@gmail.com>
Reviewers: thanm, nagisa, llvm-commits, efriedma, rnk
Reviewed By: thanm
Subscribers: dotdash
Differential Revision: https://reviews.llvm.org/D31583
llvm-svn: 305193
I was looking closer at the x86 test diffs in D33866, and the first change seems like it
shouldn't happen in the first place. So this patch will resolve that.
Using Agner's tables and AMD docs, vperm2f128 and vinsertf128 have identical timing for
any given CPU model, so we should be able to interchange those without affecting perf.
But as we can see in some of the diffs here, using vperm2f128 allows load folding, so
we should take that opportunity to reduce code size and register pressure.
A secondary advantage is making AVX1 and AVX2 codegen more similar. Given that vperm2f128
was introduced with AVX1, we should be selecting it in all of the same situations that we
would with AVX2. If there's some reason that an AVX1 CPU would not want to use this
instruction, that should be fixed up in a later pass.
Differential Revision: https://reviews.llvm.org/D33938
llvm-svn: 305171
Summary: UADDO has 2 results, and one must check the result number before doing any kind of combine. Without it, the transform is invalid.
Reviewers: joerg
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D34088
llvm-svn: 305162
This prevents assertion errors like PR32659, which occur when a
replacement deletes a node after it has been added to the list argument
of RemoveDeadNodes. The specific failure from PR32659 does not
currently happen, but it is still potentially possible. The underlying
cause is that the callers of the changed function build up a list of
nodes to delete after having moved their uses, and it is possible that
a move of a later node will cause a node already on that list to be
deleted.
Reviewers: bkramer, spatel, davide
Reviewed By: spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D33731
llvm-svn: 305070
If we know that both operands of an unsigned integer vector comparison are non-negative,
then it's safe to directly use a signed-compare-greater-than instruction (the only non-equality
integer vector compare predicate provided by SSE/AVX).
We're intentionally not changing the condition code to signed in order to preserve the
existing transforms that use min/max/psubus below here.
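A hedged IR sketch (assumed) where both operands are provably non-negative, so the unsigned compare can be lowered directly to pcmpgt:
define <4 x i32> @ugt_nonneg(<4 x i32> %x, <4 x i32> %y) {
  %xs = lshr <4 x i32> %x, <i32 1, i32 1, i32 1, i32 1>
  %ys = lshr <4 x i32> %y, <i32 1, i32 1, i32 1, i32 1>
  %cmp = icmp ugt <4 x i32> %xs, %ys
  %r = sext <4 x i1> %cmp to <4 x i32>
  ret <4 x i32> %r
}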
This should solve PR33276:
https://bugs.llvm.org/show_bug.cgi?id=33276
Differential Revision: https://reviews.llvm.org/D33862
llvm-svn: 304909
Summary:
The patch makes instruction count the highest priority for
LSR solution for X86 (previously registers had highest priority).
Reviewers: qcolombet
Differential Revision: http://reviews.llvm.org/D30562
From: Evgeny Stupachenko <evstupac@gmail.com>
llvm-svn: 304824
If the -simplify-mir option is passed, then MIRPrinter will not print such fields.
This change also required some lit test cases in CodeGen directory to be changed.
Reviewed By: MatzeB
Differential Revision: https://reviews.llvm.org/D32304
llvm-svn: 304779
This is a negative test, as pextrw doesn't write to all 32 bits of the
spilled GPR. This fold ended up happening when D32684 was landed and
covers the regression that motivated reverting it in r304762.
llvm-svn: 304763
In testing, we've found yet another miscompile caused by the new tables.
And for this one it is even less clear how to fix it (we could teach it to fold
a 16-bit load instead of the 32-bit load it wants, or block folding
entirely).
Also, the approach to excluding instructions seems increasingly to not
scale well.
I have left a more detailed analysis on the review log for the original
patch (https://reviews.llvm.org/D32684) along with suggested path
forward. I will land an additional test case that I wrote which covers
the code that was miscompiling (folding into the output of `pextrw`) in
a subsequent commit to keep this a pure revert.
For each commit reverted here, I've restricted the revert to the
non-test code touching the x86 fold table emission until the last commit
where I did revert the test updates. This means the *new* test cases
added for `insertps` and `xchg` remain untouched (and continue to pass).
Reverted commits:
r304540: [X86] Don't fold into memory operands into insertps in the ...
r304347: [TableGen] Adapt more places to getValueAsString now ...
r304163: [X86] Don't fold away the memory operand of an xchg.
r304123: Don't capture a temporary std::string in a StringRef.
r304122: Resubmit "[X86] Adding new LLVM TableGen backend that ..."
Original commit was in r304088, and after a string of fixes was reverted
previously in r304121 to fix build bots, and then re-landed in r304122.
llvm-svn: 304762
- Move ISel (and pre-isel) pass construction into TargetPassConfig
- Extract AsmPrinter construction into a helper function
Putting the ISel code into TargetPassConfig seems a lot more natural and
both changes together make it easier to build custom pipelines
involving .mir in an upcoming commit. This moves MachineModuleInfo to an
earlier place in the pass pipeline which shouldn't have any effect.
llvm-svn: 304754
There's nothing darwin-specific in these tests, and using that
setting causes extra phantom diffs when the auto-generated check
lines are regenerated today.
llvm-svn: 304753
Running `llc -verify-dom-info` on the attached testcase results in a
crash in the verifier, due to a stale dominator tree.
i.e.
DominatorTree is not up to date!
Computed:
=============================--------------------------------
Inorder Dominator Tree:
[1] %safe_mod_func_uint8_t_u_u.exit.i.i.i {0,7}
[2] %lor.lhs.false.i61.i.i.i {1,2}
[2] %safe_mod_func_int8_t_s_s.exit.i.i.i {3,6}
[3] %safe_div_func_int64_t_s_s.exit66.i.i.i {4,5}
Actual:
=============================--------------------------------
Inorder Dominator Tree:
[1] %safe_mod_func_uint8_t_u_u.exit.i.i.i {0,9}
[2] %lor.lhs.false.i61.i.i.i {1,2}
[2] %safe_mod_func_int8_t_s_s.exit.i.i.i {3,8}
[3] %safe_div_func_int64_t_s_s.exit66.i.i.i {4,5}
[3] %safe_mod_func_int8_t_s_s.exit.i.i.i.lor.lhs.false.i61.i.i.i_crit_edge {6,7}
This is because in `SelectionDAGIsel` we split critical edges without
updating the corresponding dominator for the function (and we claim
in `MachineFunctionPass::getAnalysisUsage()` that the domtree is preserved).
We could either stop preserving the domtree in `getAnalysisUsage`
or tell `splitCriticalEdge()` to update it.
As the second option is easy to implement, that's the one I chose.
Differential Revision: https://reviews.llvm.org/D33800
llvm-svn: 304742
We currently generate BUILD_VECTOR as a tree of UNPCKL shuffles of the same type:
e.g. for v4f32:
Step 1: unpcklps 0, 2 ==> X: <?, ?, 2, 0>
: unpcklps 1, 3 ==> Y: <?, ?, 3, 1>
Step 2: unpcklps X, Y ==> <3, 2, 1, 0>
The issue is that, because we are not placing sequential vector elements together early enough, we fail to recognise many combinable patterns: consecutive scalar loads, extractions, etc.
Instead, this patch unpacks progressively larger sequential vector elements together:
e.g. for v4f32:
Step 1: unpcklps 0, 2 ==> X: <?, ?, 1, 0>
: unpcklps 1, 3 ==> Y: <?, ?, 3, 2>
Step 2: unpcklpd X, Y ==> <3, 2, 1, 0>
This does mean that we are creating UNPCKL shuffle of different value types, but the relevant combines that benefit from this are quite capable of handling the additional BITCASTs that are now included in the shuffle tree.
Differential Revision: https://reviews.llvm.org/D33864
llvm-svn: 304688
There's nothing darwin-specific in these tests, and using
that setting causes extra phantom diffs when the auto-generated
check lines are regenerated today.
llvm-svn: 304614
This pass allows to run the register scavenging independently of
PrologEpilogInserter to allow targeted testing.
Also adds some basic register scavenging tests.
llvm-svn: 304606
Since r288804, we try to lower build_vectors on AVX using broadcasts of
float/double. However, when we broadcast integer values that happen to
have a NaN float bitpattern, we lose the NaN payload, thereby changing
the integer value being broadcast.
This is caused by ConstantFP::get, to which we pass the splat i32 as
a float (by bitcasting it using bitsToFloat). ConstantFP::get takes
a double parameter, so we end up lossily converting a single-precision
NaN to double-precision.
Instead, avoid any kind of conversion by directly building an APFloat
from the splatted APInt.
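For instance (an assumed constant, not taken from the original report), a splat of the i32 value 2139095041 (0x7F800001) has a signaling-NaN float bit pattern; round-tripping it through double can change those bits:
define <4 x i32> @splat_nan_bits() {
  ret <4 x i32> <i32 2139095041, i32 2139095041, i32 2139095041, i32 2139095041>
}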
Note that this also fixes another piece of code (broadcast of
subvectors), that currently isn't susceptible to the same problem.
Also note that we could really just use APInt and ConstantInt
throughout: the constant pool type doesn't matter much. Still, for
consistency, use the appropriate type.
llvm-svn: 304590
This initial patch doesn't actually do much useful work. It's just to show where the new code goes. Once this is in, I'll extend the verification logic to check more useful properties.
For those curious, the more complicated version of this patch already found one very suspicious thing.
Differential Revision: https://reviews.llvm.org/D33819
llvm-svn: 304564
insertps behaves differently, the register form selects from an input
register based on the immediate operand while the memory form just loads
the given address. We have custom code to change the immediate in cases
where that's legal, so completely remove insertps from the generated
tables.
llvm-svn: 304540
Summary:
Add an early combine to match patterns such as:
(i16 bitcast (v16i1 x))
->
(i16 movmsk (v16i8 sext (v16i1 x)))
This combine needs to happen early enough before
type-legalization scalarizes the result of the setcc.
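In IR terms, a sketch of the source pattern (assumed, for illustration):
define i16 @mask_to_scalar(<16 x i8> %a, <16 x i8> %b) {
  %cmp = icmp sgt <16 x i8> %a, %b
  %r = bitcast <16 x i1> %cmp to i16
  ret i16 %r
}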
Reviewers: igorb, craig.topper, RKSimon
Subscribers: delena, llvm-commits
Differential Revision: https://reviews.llvm.org/D33311
llvm-svn: 304406
Summary: This pattern is not very useful per se, but it exposes optimizations for other patterns that wouldn't kick in otherwise. It's very common and worth optimizing for.
Reviewers: jyknight, nemanjai, mkuper, spatel, RKSimon, zvi, bkramer
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D32756
llvm-svn: 304402
Summary: The LiveRangeShrink pass moves instructions right after their definition within the same BB if the instruction and its operands all have more than one use. This pass is inexpensive and guarantees an optimal live-range within a BB.
Reviewers: davidxl, wmi, hfinkel, MatzeB, andreadb
Reviewed By: MatzeB, andreadb
Subscribers: hiraditya, jyknight, sanjoy, skatkov, gberry, jholewinski, qcolombet, javed.absar, krytarowski, atrick, spatel, RKSimon, andreadb, MatzeB, mehdi_amini, mgorny, efriedma, davide, dberlin, llvm-commits
Differential Revision: https://reviews.llvm.org/D32563
llvm-svn: 304371
We should have a single call site entry with no landing pad. This
indicates that no EH action should be taken and the unwinder should
unwind to the next frame.
We currently don't recognize __gxx_personality_seh0 as a known
personality, so we forcibly emit a table, and that table was wrong. This
was filed as PR33220. Now we emit a correct table for that personality.
The next step is to recognize that we can completely skip the table for
this personality.
llvm-svn: 304363
Summary:
If we attempt to unfold an SUnit in ScheduleDAG that results in
finding an already scheduled load, we should abort the
unfold, as it will not improve scheduling.
This fixes PR32610.
Reviewers: jmolloy, sunfish, bogner, spatel
Subscribers: llvm-commits, MatzeB
Differential Revision: https://reviews.llvm.org/D32911
llvm-svn: 304321
xchg with a mem operand has different locking semantics. If we unfold it
into an xchg r,r we will lose the implicit lock. Likewise, we never want
to fold a register xchg into a memory one, as it would be a lot slower.
This triggers during LLVM selfhost.
llvm-svn: 304163
The extending load possibility was missed in:
https://reviews.llvm.org/rL304072
We might want to handle this cases as a follow-up, but bailing out for now
to avoid miscompiling.
llvm-svn: 304153
This was reverted due to buildbot breakages and I was not familiar
enough with this code to investigate it. But while trying to get a
useful backtrace for the author, it turns out the fix was very
obvious. Resubmitting this patch as is, and will submit the
fix in a followup so that the fix is not hidden in the larger
CL.
llvm-svn: 304122
This reverts commit 28cb1003507f287726f43c771024a1dc102c45fe as well
as all subsequent followups. llvm-tblgen currently segfaults with
this change, and it seems it has been broken on the bots all
day with no fixes in preparation. See, for example:
http://lab.llvm.org:8011/builders/clang-x86-windows-msvc2015/
llvm-svn: 304121
The X86 backend holds huge tables in order to map between the register and memory forms of each instruction.
This TableGen backend automatically generates all these tables with the appropriate flags for each entry.
Differential Revision: https://reviews.llvm.org/D32684
llvm-svn: 304088
If we have (extract_subvector(load wide vector)) with no other users,
that can just be (load narrow vector). This is intentionally conservative.
Follow-ups may loosen the one-use constraint to account for the extract cost
or just remove the one-use check.
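A hedged IR sketch of the pattern (assumed): a single-use low-half extract of a wide load that can become a narrow load:
define <2 x double> @narrow_extract(<4 x double>* %p) {
  %wide = load <4 x double>, <4 x double>* %p, align 32
  %low = shufflevector <4 x double> %wide, <4 x double> undef, <2 x i32> <i32 0, i32 1>
  ret <2 x double> %low
}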
The memop chain updating is based on code that already exists multiple times
in x86 lowering, so that should be pulled into a helper function as a follow-up.
Background: this is a potential improvement noticed via regressions caused by
making x86's peekThroughBitcasts() not loop on consecutive bitcasts (see
comments in D33137).
Differential Revision: https://reviews.llvm.org/D33578
llvm-svn: 304072
Rewrite fixupKills() to use the LivePhysRegs class. Simplifies the code
and fixes a bug where the CSR registers in return blocks were missed,
leading to invalid kill flags. Also remove the unnecessary rule that we
wouldn't set kill flags on tied operands.
No tests as I have an upcoming commit improving MachineVerifier checks
to catch these cases in multiple existing lit tests.
llvm-svn: 304055
In the best case:
extract (binop (concat X1, X2), (concat Y1, Y2)), N --> binop XN, YN
...we kill all of the extract/concat and just have narrow binops remaining.
If only one of the binop operands is amenable, this transform is still
worthwhile because we kill some of the extract/concat.
Optional bitcasting makes the code more complicated, but there doesn't
seem to be a way to avoid that.
The TODO about extending to more than bitwise logic is there because we really
will regress several x86 tests including madd, psad, and even a plain
integer-multiply-by-2 or shift-left-by-1. I don't think there's anything
fundamentally wrong with this patch that would cause those regressions; those
folds are just missing or brittle.
If we extend to more binops, I found that this patch will fire on at least one
non-x86 regression test. There's an ARM NEON test in
test/CodeGen/ARM/coalesce-subregs.ll with a pattern like:
t5: v2f32 = vector_shuffle<0,3> t2, t4
t6: v1i64 = bitcast t5
t8: v1i64 = BUILD_VECTOR Constant:i64<0>
t9: v2i64 = concat_vectors t6, t8
t10: v4f32 = bitcast t9
t12: v4f32 = fmul t11, t10
t13: v2i64 = bitcast t12
t16: v1i64 = extract_subvector t13, Constant:i32<0>
There was no functional change in the codegen from this transform from what I
could see though.
For the x86 test changes:
1. PR32790() is the closest call. We don't reduce the AVX1 instruction count in that case,
but we improve throughput. Also, on a core like Jaguar that double-pumps 256-bit ops,
there's an unseen win because two 128-bit ops have the same cost as the wider 256-bit op.
SSE/AVX2/AXV512 are not affected which is expected because only AVX1 has the extract/concat
ops to match the pattern.
2. do_not_use_256bit_op() is the best case. Everyone wins by avoiding the concat/extract.
Related bug for IR filed as: https://bugs.llvm.org/show_bug.cgi?id=33026
3. The SSE diffs in vector-trunc-math.ll are just scheduling/RA, so nothing real AFAICT.
4. The AVX1 diffs in vector-tzcnt-256.ll are all the same pattern: we reduced the instruction
count by one in each case by eliminating two insert/extract while adding one narrower logic op.
https://bugs.llvm.org/show_bug.cgi?id=32790
Differential Revision: https://reviews.llvm.org/D33137
llvm-svn: 303997
Rename the DEBUG_TYPE to match the names of corresponding passes where
it makes sense. Also establish the pattern of simply referencing
DEBUG_TYPE instead of repeating the passname where possible.
llvm-svn: 303921
AVX512_VPOPCNTDQ is a new feature set that was published by Intel.
The patch represents the LLVM side of the addition of two new intrinsic based instructions (vpopcntd and vpopcntq).
Differential Revision: https://reviews.llvm.org/D33169
llvm-svn: 303858