While various DAG combines try to guarantee that a vector SETCC
operation will have the same output size as input, there's nothing
intrinsic to either creation or LegalizeTypes that actually guarantees
it, so the function needs to be ready to handle a mismatch.
Fortunately this is easy enough, just extend or truncate the naturally
compared result.
I couldn't reproduce the failure in other backends that I know have
SIMD, so it's probably only an issue for these two due to shared
heritage.
Should fix PR21645.
llvm-svn: 228518
COFF section flags are not idempotent:
'rd' will make a read-write section because 'd' implies write
'dr' will make a read-only section because 'r' disables write
llvm-svn: 228490
If a loop predecessor has an invoke as its terminator, and the return value
from that invoke is used to determine the loop iteration space, then we can't
insert a computation based on that value in the loop predecessor prior to the
terminator (oops). If there's such an invoke, or just no predecessor for that
matter, insert a new loop preheader.
llvm-svn: 228488
Unfortunately, even with the workaround of disabling the linker TLS
optimizations in Clang restored (which has already been done), this still
breaks self-hosting on my P7 machine (-O3 -DNDEBUG -mcpu=native).
Bill is currently working on an alternate implementation to address the TLS
issue in a way that also fully elides the linker bug (which, unfortunately,
this approach did not fully), so I'm reverting this now.
llvm-svn: 228460
Doesn't seem necessary anymore. I think this was mostly compensating for
not enabling WQM for texture sampling instructions.
v2: Add test coverage
Reviewed-by: Tom Stellard <tom@stellard.net>
llvm-svn: 228373
If whole quad mode isn't enabled for these, the level of detail is
calculated incorrectly for pixels along diagonal triangle edges, causing
artifacts.
v2: Use a TSFlag instead of lots of switch cases
v3: Add test coverage
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=88642
Reviewed-by: Tom Stellard <tom@stellard.net>
llvm-svn: 228372
Avoid the creation of select instructions which can result in different
scheduling of the selects.
I also added a bunch of additional store volatiles. Those avoid A
CodeGen problem (bug?) where normalizes and denomarlizing the control
moves all shift instructions into the first block where ISel can't match
them together with the cmps.
llvm-svn: 228362
PowerPC supports pre-increment load/store instructions (except for Altivec/VSX
vector load/stores). Using these on embedded cores can be very important, but
most loops are not naturally set up to use them. We can often change that,
however, by placing loops into a non-canonical form. Generically, this means
transforming loops like this:
for (int i = 0; i < n; ++i)
array[i] = c;
to look like this:
T *p = array[-1];
for (int i = 0; i < n; ++i)
*++p = c;
the key point is that addresses accessed are pulled into dedicated PHIs and
"pre-decremented" in the loop preheader. This allows the use of pre-increment
load/store instructions without loop peeling.
A target-specific late IR-level pass (running post-LSR), PPCLoopPreIncPrep, is
introduced to perform this transformation. I've used this code out-of-tree for
generating code for the PPC A2 for over a year. Somewhat to my surprise,
running the test suite + externals on a P7 with this transformation enabled
showed no performance regressions, and one speedup:
External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk
-2.32514% +/- 1.03736%
So I'm going to enable it on everything for now. I was surprised by this
because, on the POWER cores, these pre-increment load/store instructions are
cracked (and, thus, harder to schedule effectively). But seeing no regressions,
and feeling that it is generally easier to split instructions apart late than
it is to combine them late, this might be the better approach regardless.
In the future, we might want to integrate this functionality into LSR (but
currently LSR does not create new PHI nodes, so (for that and other reasons)
significant work would need to be done).
llvm-svn: 228328
PowerPC supports pre-increment floating-point load/store instructions, both r+r
and r+i, and we had patterns for them, but they were not marked as legal. Mark
them as legal (and add a test case).
llvm-svn: 228327
The combine that forms extloads used to be disabled on vector types,
because "None of the supported targets knows how to perform load and
sign extend on vectors in one instruction."
That's not entirely true, since at least SSE4.1 X86 knows how to do
those sextloads/zextloads (with PMOVS/ZX).
But there are several aspects to getting this right.
First, vector extloads are controlled by a profitability callback.
For instance, on ARM, several instructions have folded extload forms,
so it's not always beneficial to create an extload node (and trying to
match extloads is a whole 'nother can of worms).
The interesting optimization enables folding of s/zextloads to illegal
(splittable) vector types, expanding them into smaller legal extloads.
It's not ideal (it introduces some legalization-like behavior in the
combine) but it's better than the obvious alternative: form illegal
extloads, and later try to split them up. If you do that, you might
generate extloads that can't be split up, but have a valid ext+load
expansion. At vector-op legalization time, it's too late to generate
this kind of code, so you end up forced to scalarize. It's better to
just avoid creating egregiously illegal nodes.
This optimization is enabled unconditionally on X86.
Note that the splitting combine is happy with "custom" extloads. As
is, this bypasses the actual custom lowering, and just unrolls the
extload. But from what I've seen, this is still much better than the
current custom lowering, which does some kind of unrolling at the end
anyway (see for instance load_sext_4i8_to_4i64 on SSE2, and the added
FIXME).
Also note that the existing combine that forms extloads is now also
enabled on legal vectors. This doesn't have a big effect on X86
(because sext+load is usually combined to sext_inreg+aextload).
On ARM it fires on some rare occasions; that's for a separate commit.
Differential Revision: http://reviews.llvm.org/D6904
llvm-svn: 228325
The return value's address must be returned in %rax.
i.e. the callee needs to copy the sret argument (%rdi)
into the return value (%rax).
This probably won't manifest as a bug when the caller is LLVM-compiled
code. But it is an ABI guarantee and tools expect it.
llvm-svn: 228321
We should be setting UnrollingPreferences::MaxCount to MAX_UINT instead
of UnrollingPreferences::Count.
Count is a 'forced unrolling factor', while MaxCount sets an upper
limit to the unrolling factor.
Setting Count to MAX_UINT was causing the loop in the testcase to be
unrolled 15 times, when it only had a maximum of 4 iterations.
llvm-svn: 228303
The llvm.SI.end.cf intrinsic is used to mark the end of if-then blocks,
if-then-else blocks, and loops. It is responsible for updating the
exec mask to re-enable threads that had been masked during the preceding
control flow block. For example:
s_mov_b64 exec, 0x3 ; Initial exec mask
s_mov_b64 s[0:1], exec ; Saved exec mask
v_cmpx_gt_u32 exec, s[2:3], v0, 0 ; llvm.SI.if
do_stuff()
s_or_b64 exec, exec, s[0:1] ; llvm.SI.end.cf
The bug fixed by this patch was one where the llvm.SI.end.cf intrinsic
was being inserted into the header of loops. This would happen when
an if block terminated in a loop header and we would end up with
code like this:
s_mov_b64 exec, 0x3 ; Initial exec mask
s_mov_b64 s[0:1], exec ; Saved exec mask
v_cmpx_gt_u32 exec, s[2:3], v0, 0 ; llvm.SI.if
do_stuff()
LOOP: ; Start of loop header
s_or_b64 exec, exec, s[0:1] ; llvm.SI.end.cf <-BUG: The exec mask has the
same value at the beginning of each loop
iteration.
do_stuff();
s_cbranch_execnz LOOP
The fix is to create a new basic block before the loop and insert the
llvm.SI.end.cf there. This way the exec mask is restored before the
start of the loop instead of at the beginning of each iteration.
llvm-svn: 228302
Patch by Kit Barton.
Add the vector count leading zeros instruction for byte, halfword,
word, and doubleword sizes. This is a fairly straightforward addition
after the changes made for vpopcnt:
1. Add the correct definitions for the various instructions in
PPCInstrAltivec.td
2. Make the CTLZ operation legal on vector types when using P8Altivec
in PPCISelLowering.cpp
Test Plan
Created new test case in test/CodeGen/PowerPC/vec_clz.ll to check the
instructions are being generated when the CTLZ operation is used in
LLVM.
Check the encoding and decoding in test/MC/PowerPC/ppc_encoding_vmx.s
and test/Disassembler/PowerPC/ppc_encoding_vmx.txt respectively.
llvm-svn: 228301
Implement a BITCAST dag combine to transform i32->mmx conversion patterns
into a X86 specific node (MMX_MOVW2D) and guarantee that moves between
i32 and x86mmx are better handled, i.e., don't use store-load to do the
conversion..
llvm-svn: 228293
Avoid regression in previously supported MMX code by adding different
combinations of tests which exercise MMX bitcasts. Small improvements
to these patterns should come next.
llvm-svn: 228292
Parts of llvm were not expecting it and we wouldn't print
the entity size of the section.
Given what comdats are used for, having SHF_MERGE sections would be
just a small improvement, so just disable it for now.
Fixes pr22463.
llvm-svn: 228196
v2i32, i32, trunc i32 to i16, and truc i32 to i8 stores are legal for
all address spaces. We had marked them as custom in order to lower
them for the private address space, but this is no longer necessary.
This enables lowering of misaligned stores of these types in the
DAGLegalizer.
llvm-svn: 228189
In case CSE reuses a previoulsy unused register the dead-def flag has to
be cleared on the def operand, as exposed by the arm64-cse.ll test.
This fixes PR22439 and the corresponding rdar://19694987
Differential Revision: http://reviews.llvm.org/D7395
llvm-svn: 228178
Currently, Cortex-A72 is modelled as an Cortex-A57 except the fp
load balancing pass isn't enabled for Cortex-A72 as it's not
profitable to have it enabled for this core.
Patch by Ranjeet Singh.
llvm-svn: 228140
This associates movss and movsd with the packed single and packed double
execution domains (resp.). While this is largely cosmetic, as we now
don't have weird ping-pong-ing between single and double precision, it
is also useful because it avoids the domain fixing algorithm from seeing
domain breaks that don't actually exist. It will also be much more
important if we have an execution domain default other than packed
single, as that would cause us to mix movss and movsd with integer
vector code on a regular basis, a very bad mixture.
llvm-svn: 228135
version of the script.
Changes include:
- Using the VEX prefix
- Skipping more detail when we have useful shuffle comments to match
- Matching more shuffle comments that have been added to the printer
(yay!)
- Matching the destination registers of some AVX instructions
- Stripping trailing whitespace that crept in
- Fixing indentation issues
Nothing interesting going on here. I'm just trying really hard to ensure
these changes don't show up in the diffs with actual changes to the
backend.
llvm-svn: 228132
This reverts patches 223862, 224198, 224203, and 224754, which were all
related to the vector load/store combining and were reverted/reaplied
a few times due to the same alignment problems we're seeing now.
Further tests, mainly self-hosting Clang, will be needed to reapply this
patch in the future.
llvm-svn: 228129
zero for v8i16 as well.
These exhibit the same domain badness, but also exhibit other weaknesses
in our blend lowering. More fixes to come.
llvm-svn: 228126
This is the simplest form of bit-math based blending which only fires
when we are blending with zero and is relatively profitable. I've only
enabled this path on very specific lowering strategies. I'm planning to
widen its applicability in subsequent patches, but so far you'll notice
that even though we get fewer shufps instructions, we *still* do the bit
math in the FP execution port. I'm looking into why this is still
happening.
llvm-svn: 228124
update_llc_test_checks.py.
The exact format of the checks has changed over time. This includes
different indenting rules, new shuffle comments that have been added,
and more operand hiding behind regular expressions.
No functional change to the tests are expected here, but this will make
subsequent patches have a clean diff as they change shuffle lowering.
llvm-svn: 228097
update_llc_test_checks.py script uses, and refresh the checks in this
test.
No functionality changed here, just bringing this test up to work with
automated updates using the python script.
llvm-svn: 228096
This will make it easy to update as I change some parts of the X86
backend, makes it more clear what instruction differences are
introduced, and I find it makes it a bit easier to read as well.
llvm-svn: 228095
Patch to match cases where shuffle masks can be reduced to bit shifts. Similar to byte shift shuffle matching from D5699.
Differential Revision: http://reviews.llvm.org/D6649
llvm-svn: 228047
with 'stress' to indicate that the specific output isn't interesting and
relax them to only check the last instruction (a ret).
I've updated the one test case that really uses this to name the one
'stress_test' which was actually producing output we can directly check.
With this, the script doesn't introduce noise when run over the v16 test
file.
llvm-svn: 228033
This patch adds general shuffle pattern matching for the MOVQ zero-extend instruction (copy lower 64bits, zero upper) for all 128-bit integer vectors, it is added as a fallback test in lowerVectorShuffleAsZeroOrAnyExtend.
llvm-svn: 228022
This patch detects consecutive vector loads using the existing
EltsFromConsecutiveLoads() logic. This fixes:
http://llvm.org/bugs/show_bug.cgi?id=22329
This patch effectively reverts the tablegen additions of D6492 /
http://reviews.llvm.org/rL224344 ...which in hindsight were a horrible hack.
The test cases that were added with that patch are simply modified to load
from varying offsets of a base pointer. These loads did not match the existing
tablegen patterns.
A happy side effect of doing this optimization earlier is that we can now fold
the load into a math op where possible; this is shown in some of the updated
checks in the test file.
Differential Revision: http://reviews.llvm.org/D7303
llvm-svn: 228006
This can happen when a REV instruction is commuted.
The trick is not to define the _vi versions of instructions, which has these
consequences:
- code generation will always fail if a pseudo cannot be lowered
(very useful to catch bugs where an unsupported instruction somehow makes
it to the printer)
- ability to query if a pseudo can be lowered, which is done in commuteOpcode
to prevent REV from commuting to non-REV on VI
Tested-by: Michel Dänzer <michel.daenzer@amd.com>
llvm-svn: 227990
This fixes a hang when using an empty geometry shader.
v2: - don't add s_nop when followed by s_waitcnt
- comestic changes
Tested-by: Michel Dänzer <michel.daenzer@amd.com>
llvm-svn: 227986
r224330 introduced a bug by misinterpreting the "FeatureVectorUAMem" bit.
The commit log says that change did not affect anything, but that's not correct.
That change allowed SSE instructions to have unaligned mem operands folded into
math ops, and that's not allowed in the default specification for any SSE variant.
The bug is exposed when compiling for an AVX-capable CPU that had this feature
flag but without enabling AVX codegen. Another mistake in r224330 was not adding
the feature flag to all AVX CPUs; the AMD chips were excluded.
This is part of the fix for PR22371 ( http://llvm.org/bugs/show_bug.cgi?id=22371 ).
This feature bit is SSE-specific, so I've renamed it to "FeatureSSEUnalignedMem".
Changed the existing test case for the feature bit to reflect the new name and
renamed the test file itself to better reflect the feature.
Added runs to fold-vex.ll to check for the failing codegen.
Note that the feature bit is not set by default on any CPU because it may require a
configuration register setting to enable the enhanced unaligned behavior.
llvm-svn: 227983
This patch is a third attempt to properly handle the local-dynamic and
global-dynamic TLS models.
In my original implementation, calls to __tls_get_addr were hidden
from view until the asm-printer phase, at which point the underlying
branch-and-link instruction was created with proper relocations. This
mostly worked well, but I used some repellent techniques to ensure
that the TLS_GET_ADDR nodes at the SD and MI levels correctly received
input from GPR3 and produced output into GPR3. This proved to work
badly in the presence of multiple TLS variable accesses, with the
copies to and from GPR3 being scheduled incorrectly and generally
creating havoc.
In r221703, I addressed that problem by representing the calls to
__tls_get_addr as true calls during instruction lowering. This had
the advantage of removing all of the bad hacks and relying on the
existing call machinery to properly glue the copies in place. It
looked like this was going to be the right way to go.
However, as a side effect of the recent discovery of problems with
linker optimizations for TLS, we discovered cases of suboptimal code
generation with this strategy. The problem comes when tls_get_addr is
called for the same address, and there is a resulting CSE
opportunity. It turns out that in such cases MachineCSE will common
the addis/addi instructions that set up the input value to
tls_get_addr, but will not common the calls themselves. MachineCSE
does not have any machinery to common idempotent calls. This is
perfectly sensible, since presumably this would be done at the IR
level, and introducing calls in the back end isn't commonplace. In
any case, we end up with two calls to __tls_get_addr when one would
suffice, and that isn't good.
I presumed that the original design would have allowed commoning of
the machine-specific nodes that hid the __tls_get_addr calls, so as
suggested by Ulrich Weigand, I went back to that design and cleaned it
up so that the copies were properly held together by glue
nodes. However, it turned out that this didn't work either...the
presence of copies to physical registers kept the machine-specific
nodes from being commoned also.
All of which leads to the design presented here. This is a return to
the original design, except that no attempt is made to introduce
copies to and from GPR3 during instruction lowering. Virtual registers
are used until prior to register allocation. At that point, a special
pass is run that identifies the machine-specific nodes that hide the
tls_get_addr calls and introduces the copies to and from GPR3 around
them. The register allocator then coalesces these copies away. With
this design, MachineCSE succeeds in commoning tls_get_addr calls where
possible, and we get nice optimal code generation (better than GCC at
the moment, which does not common these calls).
One additional problem must be dealt with: After introducing the
mentions of the physical register GPR3, the aggressive anti-dependence
breaker sees opportunities to improve scheduling by selecting a
different register instead. Flags must be used on the instruction
descriptions to tell the anti-dependence breaker to keep its hands in
its pockets.
One thing missing from the original design was recording a definition
of the link register on the GET_TLS_ADDR nodes. Doing this was found
to be insufficient to force a stack frame to be created, which led to
looping behavior because two different LR values were stored at the
same address. This appears to have been an oversight in
PPCFrameLowering::determineFrameLayout(), which is repaired here.
Because MustSaveLR() returns true for calls to builtin_return_address,
this changed the expected behavior of
test/CodeGen/PowerPC/retaddr2.ll, which now stacks a frame but
formerly did not. I've fixed the test case to reflect this.
There are existing TLS tests to catch regressions; the checks in
test/CodeGen/PowerPC/tls-store2.ll proved to be too restrictive in the
face of instruction scheduling with these changes, so I fixed that
up.
I've added a new test case based on the PrettyStackTrace module that
demonstrated the original problem. This checks that we get correct
code generation and that CSE of the calls to __get_tls_addr has taken
place.
llvm-svn: 227976
This test was checking for lack of a "movaps" (an aligned load)
rather than a "movups" (an unaligned load). It also included
a store which complicated the checking.
Add specific CPU runs to prevent subtarget feature flag overrides
from inhibiting this optimization.
llvm-svn: 227972
Improve EXTRACT_VECTOR_ELT DAG combine to catch conversion patterns
between x86mmx and i32 with more layers of indirection.
Before:
movq2dq %mm0, %xmm0
movd %xmm0, %eax
After:
movd %mm0, %eax
llvm-svn: 227969
LLVM ToT produces poor MMX code compared to 3.5. However, part of the previous
functionality can be achieved by using -x86-experimental-vector-widening-legalization.
Add tests to be sure we don't regress again.
llvm-svn: 227869
This is true for SI only. CI+ supports unaligned memory accesses,
but this requires driver support, so for now we disallow unaligned
accesses for all GCN targets.
llvm-svn: 227822
This avoids a partial false dependency on the previous content of
the upper lanes of the destination vector register.
Differential Revision: http://reviews.llvm.org/D7307
llvm-svn: 227820
Summary:
Previously it only avoided optimizing signed comparisons to 0.
Sometimes the DAGCombiner will optimize the unsigned comparisons
to 0 before it gets to the peephole pass, but sometimes it doesn't.
Fix for PR22373.
Test Plan: test/CodeGen/ARM/sub-cmp-peephole.ll
Reviewers: jfb, manmanren
Subscribers: aemerson, llvm-commits
Differential Revision: http://reviews.llvm.org/D7274
llvm-svn: 227809
The VSX store instructions were also picking up an implicit "may read" from the
default pattern, which was an intrinsic (and we don't currently have a way of
specifying write-only intrinsics).
This was causing MI verification to fail for VSX spill restores.
llvm-svn: 227759
isel is actually a cracked instruction on the P7/P8, and must start a dispatch
group. The scheduling model should reflect this so that we don't bunch too many
of them together when possible.
Thanks to Bill Schmidt and Pat Haugen for helping to sort this out.
llvm-svn: 227758
This moves the transformation introduced in r223757 into a separate MI pass.
This allows it to cover many more cases (not only cases where there must be a
reserved call frame), and perform rudimentary call folding. It still doesn't
have a heuristic, so it is enabled only for optsize/minsize, with stack
alignment <= 8, where it ought to be a fairly clear win.
(Re-commit of r227728)
Differential Revision: http://reviews.llvm.org/D6789
llvm-svn: 227752
The TOC base pointer is passed in r2, and we normally reserve this register so
that we can depend on it being there. However, for leaf functions, and
specifically those leaf functions that don't do any TOC access of their own
(which is generally due to accessing the constant pool, using TLS, etc.),
we can treat r2 as an ordinary callee-saved register (it must be callee-saved
because, for local direct calls, the linker will not insert any save/restore
code).
The allocation order has been changed slightly for PPC64/ELF systems to put r2
at the end of the list (while leaving it near the beginning for Darwin systems
to prevent unnecessary output changes). While r2 is allocatable, using it still
requires spill/restore traffic, and thus comes at the end of the list.
llvm-svn: 227745
This moves the transformation introduced in r223757 into a separate MI pass.
This allows it to cover many more cases (not only cases where there must be a
reserved call frame), and perform rudimentary call folding. It still doesn't
have a heuristic, so it is enabled only for optsize/minsize, with stack
alignment <= 8, where it ought to be a fairly clear win.
Differential Revision: http://reviews.llvm.org/D6789
llvm-svn: 227728
Summary:
CUDA driver can unroll loops when jit-compiling PTX. To prevent CUDA
driver from unrolling a loop marked with llvm.loop.unroll.disable is not
unrolled by CUDA driver, we need to emit .pragma "nounroll" at the
header of that loop.
This patch also extracts getting unroll metadata from loop ID metadata
into a shared helper function.
Test Plan: test/CodeGen/NVPTX/nounroll.ll
Reviewers: eliben, meheff, jholewinski
Reviewed By: jholewinski
Subscribers: jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D7041
llvm-svn: 227703
This patch adds shuffle mask decodes for integer zero extends (pmovzx** and movq xmm,xmm) and scalar float/double loads/moves (movss/movsd).
Also adds shuffle mask decodes for integer loads (movd/movq).
Differential Revision: http://reviews.llvm.org/D7228
llvm-svn: 227688
Now that -mstack-probe-size is piped through to the backend via the function
attribute as on Windows x86, honour the value to permit handling of non-default
values for stack probes. This is needed /Gs with the clang-cl driver or
-mstack-probe-size with the clang driver when targeting Windows on ARM.
llvm-svn: 227667
Some of those didn't even have run lines: they were removed
inadvertently during the Great Merge of 2014.
They used to check for DUPs, but now we go through W-regs?
Filed PR22418 for that potential regression.
For now, just make the tests explicit, so we now where we stand.
llvm-svn: 227635
MSDN's x64 software conventions page says that this is one of the fixed
list of legal epilogues:
https://msdn.microsoft.com/en-us/library/tawsa7cb.aspx
Presumably this is how the unwinder distinguishes epilogue jumps from
in-function control flow.
Also normalize the way we place "## TAILCALL" comments on such jumps.
llvm-svn: 227611
In the large code model, we now put __chkstk in %r11 before calling it.
Refactor the code so that we only do this once. Simplify things by using
__chkstk_ms instead of __chkstk on cygming. We already use that symbol
in the prolog emission, and it simplifies our logic.
Second half of PR18582.
llvm-svn: 227519
win64: Call __chkstk through a register with the large code model
Fixes half of PR18582. True dynamic allocas will still have a
CALL64pcrel32 which will fail.
Reviewers: majnemer
Differential Revision: http://reviews.llvm.org/D7267
llvm-svn: 227503
Add tests for the various combines. This should
always be at least cycle neutral on all subtargets for f64,
and faster on some. For f32 we should prefer selecting
v_mad_f32 over v_fma_f32.
llvm-svn: 227484
For large stack offsets the compiler generates multiple immediate mode
sub/add instructions in the prologue/epilogue. This patch makes the
compiler place the final amount to be added/subtracted into a register,
which is then added/substracted with a single operation.
Differential Revision: http://reviews.llvm.org/D7226
llvm-svn: 227458
Patch by Nemanja Ivanovic.
As was uncovered by the failing test case (when run on non-PPC
platforms), the feature set when compiling with -march=ppc64le was not
being picked up. This change ensures that if the -mcpu option is not
specified, the correct feature set is picked up regardless of whether
we are on PPC or not.
llvm-svn: 227455
ELF has support for sections that can be split into fixed size or
null terminated entities.
Since these sections can be split by the linker, it is not necessary
to split them in codegen.
This reduces the combined .o size in a llvm+clang build from
202,394,570 to 173,819,098 bytes.
The time for linking clang with gold (on a VM, on a laptop) goes
from 2.250089985 to 1.383001792 seconds.
The flip side is the size of rodata in clang goes from 10,926,785
to 10,929,345 bytes.
The increase seems to be because of http://sourceware.org/bugzilla/show_bug.cgi?id=17902.
llvm-svn: 227431
If the personality is not a recognized MSVC personality function, this
pass delegates to the dwarf EH preparation pass. This chaining supports
people on *-windows-itanium or *-windows-gnu targets.
Currently this recognizes some personalities used by MSVC and turns
resume instructions into traps to avoid link errors. Even if cleanups
are not used in the source program, LLVM requires the frontend to emit a
code path that resumes unwinding after an exception. Clang does this,
and we get unreachable resume instructions. PR20300 covers cleaning up
these unreachable calls to resume.
Reviewers: majnemer
Differential Revision: http://reviews.llvm.org/D7216
llvm-svn: 227405
Reduce integer multiplication by a constant of the form k*2^c, where k is in {3,5,9} into a lea + shl. Previously it was only done for imulq on 64-bit platforms, but it makes sense for imull and 32-bit as well.
Differential Revision: http://reviews.llvm.org/D7196
llvm-svn: 227308
This includes two things:
1) Fix TCRETURNdi and TCRETURN64di patterns to check the right thing (LP64 as opposed to target bitness).
2) Allow LEA64_32 in MatchingStackOffset.
llvm-svn: 227307
By Asaf Badouh and Elena Demikhovsky
Added special nodes for rounding: FMADD_RND, FMSUB_RND..
It will prevent merge between nodes with rounding and other standard nodes.
llvm-svn: 227303
This commit creates infinite loop in DAG combine for in the LLVM test-suite
for aarch64 with mcpu=cylcone (just having neon may be enough to expose this).
llvm-svn: 227272
This patch resolves part of PR21711 ( http://llvm.org/bugs/show_bug.cgi?id=21711 ).
The 'f3' test case in that report presents a situation where we have two 128-bit
stores extracted from a 256-bit source vector.
Instead of producing this:
vmovaps %xmm0, (%rdi)
vextractf128 $1, %ymm0, 16(%rdi)
This patch merges the 128-bit stores into a single 256-bit store:
vmovups %ymm0, (%rdi)
Differential Revision: http://reviews.llvm.org/D7208
llvm-svn: 227242
No other test I know shows how struct names are mangled in overloaded
intrinsic functions.
Differential Revision: http://reviews.llvm.org/D7037
llvm-svn: 227229
For ordered, unordered, equal and not-equal tests, packed float and double comparison instructions can be safely commuted without affecting the results. This patch checks the comparison mode of the (v)cmpps + (v)cmppd instructions and commutes the result if it can.
Differential Revision: http://reviews.llvm.org/D7178
llvm-svn: 227145
Patch to allow (v)pclmulqdq to be commuted - swaps the src registers and inverts the immediate (low/high) src mask.
Differential Revision: http://reviews.llvm.org/D7180
llvm-svn: 227141
- Rename mmx-builtins to mmx-intrinsics to match other intrinsic test naming.
- Remove tests that duplicate functionality from mmx-intrinsics.ll.
- Move arith related tests to mmx-arith.ll.
- MMX related shuffle goes to vector-shuffle-mmx.ll.
llvm-svn: 227130
Instead of creating a pattern like "(p && a) || ((!p) && b)",
just expand the i8 operands to i32 and perform the selp on them.
Fixes PR22246
llvm-svn: 227123
This patch fixes the following miscompile:
define void @sqrtsd(<2 x double> %a) nounwind uwtable ssp {
%0 = tail call <2 x double> @llvm.x86.sse2.sqrt.sd(<2 x double> %a) nounwind
%a0 = extractelement <2 x double> %0, i32 0
%conv = fptrunc double %a0 to float
%a1 = extractelement <2 x double> %0, i32 1
%conv3 = fptrunc double %a1 to float
tail call void @callee2(float %conv, float %conv3) nounwind
ret void
}
Current codegen:
sqrtsd %xmm0, %xmm1 ## high element of %xmm1 is undef here
xorps %xmm0, %xmm0
cvtsd2ss %xmm1, %xmm0
shufpd $1, %xmm1, %xmm1
cvtsd2ss %xmm1, %xmm1 ## operating on undef value
jmp _callee
This is a continuation of http://llvm.org/viewvc/llvm-project?view=revision&revision=224624 ( http://reviews.llvm.org/D6330 )
which was itself a continuation of r167064 ( http://llvm.org/viewvc/llvm-project?view=revision&revision=167064 ).
All of these patches are partial fixes for PR14221 ( http://llvm.org/bugs/show_bug.cgi?id=14221 );
this should be the final patch needed to resolve that bug.
Differential Revision: http://reviews.llvm.org/D6885
llvm-svn: 227111
than on MipsSubtargetInfo.
This required a bit of massaging in the MC level to handle this since
MC is a) largely a collection of disparate classes with no hierarchy,
and b) there's no overarching equivalent to the TargetMachine, instead
only the subtarget via MCSubtargetInfo (which is the base class of
TargetSubtargetInfo).
We're now storing the ABI in both the TargetMachine level and in the
MC level because the AsmParser and the TargetStreamer both need to
know what ABI we have to parse assembly and emit objects. The target
streamer has a pointer to the one in the asm parser and is updated
when the asm parser is created. This is fragile as the FIXME comment
notes, but shouldn't be a problem in practice since we always
create an asm parser before attempting to emit object code via the
assembler. The TargetMachine now contains the ABI so that the DataLayout
can be constructed dependent upon ABI.
All testcases have been updated to use the -target-abi command line
flag so that we can set the ABI without using a subtarget feature.
Should be no change visible externally here.
llvm-svn: 227102
Summary:
This patch adds support for some operations that were missing from
128-bit integer types (add/sub/mul/sdiv/udiv... etc.). With these
changes we can support the __int128_t and __uint128_t data types
from C/C++.
Depends on D7125
Reviewers: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D7143
llvm-svn: 227089
This reverts commit r227003. Support for addition/subtraction and
various other operations for the i128 data type will be added in a
future commit based on the review D7143.
llvm-svn: 227082
It appears we have different behavior with and without -mcpu=pwr8 even
with ppc64le defaulting to POWER8. The failure appears as follows:
/home/bb/cmake-llvm-x86_64-linux/llvm-project/llvm/test/CodeGen/PowerPC/ppc64le-aggregates.ll:268:14: error: expected string not found in input
; CHECK-DAG: lfs 1, 0([[REG]])
^
<stdin>:497:11: note: scanning from here
ld 3, .LC1@toc@l(3)
^
<stdin>:497:11: note: with variable "REG" equal to "3"
ld 3, .LC1@toc@l(3)
^
<stdin>:514:2: note: possible intended match here
lfs 1, 0(4)
^
Reverting this particular test case change. Nemanja, please have a look
at the reason for the failure.
llvm-svn: 227055
Test by Nemanja Ivanovic.
Since ppc64le implies POWER8 as a minimum, it makes sense that the
same features are included. Since the pwr8 processor model will likely
be getting new features until the implementation is complete, I
created a new list to add these updates to. This will include them in
both pwr8 and ppc64le.
Furthermore, it seems that it would make sense to compose the feature
lists for other processor models (pwr3 and up). Per discussion in the
review, I will make this change in a subsequent patch.
In order to test the changes, I've added an additional run step to
test cases that specify -march=ppc64le -mcpu=pwr8 to omit the -mcpu
option. Since the feature lists are the same, the behaviour should be
unchanged.
llvm-svn: 227053
- Added KSHIFTB/D/Q for skx
- Added KORTESTB/D/Q for skx
- Fixed store operation for v8i1 type for KNL
- Store size of v8i1, v4i1 and v2i1 are changed to 8 bits
llvm-svn: 227043
Summary:
V8->V9:
- cleanup tests
V7->V8:
- addressed feedback from David:
- switched to range-based 'for' loops
- fixed formatting of tests
V6->V7:
- rebased and adjusted AsmPrinter args
- CamelCased .td, fixed formatting, cleaned up names, removed unused patterns
- diffstat: 3 files changed, 203 insertions(+), 227 deletions(-)
V5->V6:
- addressed feedback from Chandler:
- reinstated full verbose standard banner in all files
- fixed variables that were not in CamelCase
- fixed names of #ifdef in header files
- removed redundant braces in if/else chains with single statements
- fixed comments
- removed trailing empty line
- dropped debug annotations from tests
- diffstat of these changes:
46 files changed, 456 insertions(+), 469 deletions(-)
V4->V5:
- fix setLoadExtAction() interface
- clang-formated all where it made sense
V3->V4:
- added CODE_OWNERS entry for BPF backend
V2->V3:
- fix metadata in tests
V1->V2:
- addressed feedback from Tom and Matt
- removed top level change to configure (now everything via 'experimental-backend')
- reworked error reporting via DiagnosticInfo (similar to R600)
- added few more tests
- added cmake build
- added Triple::bpf
- tested on linux and darwin
V1 cover letter:
---------------------
recently linux gained "universal in-kernel virtual machine" which is called
eBPF or extended BPF. The name comes from "Berkeley Packet Filter", since
new instruction set is based on it.
This patch adds a new backend that emits extended BPF instruction set.
The concept and development are covered by the following articles:
http://lwn.net/Articles/599755/http://lwn.net/Articles/575531/http://lwn.net/Articles/603983/http://lwn.net/Articles/606089/http://lwn.net/Articles/612878/
One of use cases: dtrace/systemtap alternative.
bpf syscall manpage:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b4fc1a460f3017e958e6a8ea560ea0afd91bf6fe
instruction set description and differences vs classic BPF:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/filter.txt
Short summary of instruction set:
- 64-bit registers
R0 - return value from in-kernel function, and exit value for BPF program
R1 - R5 - arguments from BPF program to in-kernel function
R6 - R9 - callee saved registers that in-kernel function will preserve
R10 - read-only frame pointer to access stack
- two-operand instructions like +, -, *, mov, load/store
- implicit prologue/epilogue (invisible stack pointer)
- no floating point, no simd
Short history of extended BPF in kernel:
interpreter in 3.15, x64 JIT in 3.16, arm64 JIT, verifier, bpf syscall in 3.18, more to come in the future.
It's a very small and simple backend.
There is no support for global variables, arbitrary function calls, floating point, varargs,
exceptions, indirect jumps, arbitrary pointer arithmetic, alloca, etc.
From C front-end point of view it's very restricted. It's done on purpose, since kernel
rejects all programs that it cannot prove safe. It rejects programs with loops
and with memory accesses via arbitrary pointers. When kernel accepts the program it is
guaranteed that program will terminate and will not crash the kernel.
This patch implements all 'must have' bits. There are several things on TODO list,
so this is not the end of development.
Most of the code is a boiler plate code, copy-pasted from other backends.
Only odd things are lack or < and <= instructions, specialized load_byte intrinsics
and 'compare and goto' as single instruction.
Current instruction set is fixed, but more instructions can be added in the future.
Signed-off-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Subscribers: majnemer, chandlerc, echristo, joerg, pete, rengolin, kristof.beyls, arsenm, t.p.northover, tstellarAMD, aemerson, llvm-commits
Differential Revision: http://reviews.llvm.org/D6494
llvm-svn: 227008
Summary:
In addition to the included tests, this fixes
test/CodeGen/Generic/i128-addsub.ll on a mips64 host.
Reviewers: atanasyan, sagar, vmedic
Reviewed By: vmedic
Subscribers: sdkie, llvm-commits
Differential Revision: http://reviews.llvm.org/D6610
llvm-svn: 227003
This fixes a regression introduced by r226816.
When replacing a splat shuffle node with a constant build_vector,
make sure that the new build_vector has a valid number of elements.
Thanks to Patrik Hagglund for reporting this problem and providing a
small reproducible.
llvm-svn: 227002
This patch adds the missing LD[U]RSW variants to the load store optimizer, so
that we generate LDPSW when possible.
<rdar://problem/19583480>
llvm-svn: 226978
Handle the poor codegen for i64/x86xmm->v2i64 (%mm -> %xmm) moves. Instead of
using stack store/load pair to do the job, use scalar_to_vector directly, which
in the MMX case can use movq2dq. This was the current behavior prior to
improvements for vector legalization of extloads in r213897.
This commit fixes the regression and as a side-effect also remove some
unnecessary shuffles.
In the new attached testcase, we go from:
pshufw $-18, (%rdi), %mm0
movq %mm0, -8(%rsp)
movq -8(%rsp), %xmm0
pshufd $-44, %xmm0, %xmm0
movd %xmm0, %eax
...
To:
pshufw $-18, (%rdi), %mm0
movq2dq %mm0, %xmm0
movd %xmm0, %eax
...
Differential Revision: http://reviews.llvm.org/D7126
rdar://problem/19413324
llvm-svn: 226953
We used to do this promotion during DAG legalization, but this
caused an infinite loop in ExpandUnalignedLoad() because it assumed
that i64 loads were legal if i64 was a legal type.
It also seems better to report i64 loads as legal, since they actually
are and we were just promoting them to simplify our tablegen files.
llvm-svn: 226945
This mostly reverts commit r222062 and replaces it with a new enum. At
some point this enum will grow at least for other MSVC EH personalities.
Also beefs up the way we were sniffing the personality function.
Previously we would emit the Itanium LSDA despite using
__C_specific_handler.
Reviewers: majnemer
Differential Revision: http://reviews.llvm.org/D6987
llvm-svn: 226920
v2: add and enable tests for SI
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
Reviewed-by: Matt Arsenault <Matthew.Arsenault@amd.com>
llvm-svn: 226881
Minor tweak now that D7042 is complete, we can enable stack folding for (V)MOVDDUP and do proper testing.
Added missing AVX ymm folding patterns and fixed alignment for AVX VMOVSLDUP / VMOVSHDUP.
llvm-svn: 226873
Removed loops from PSUBUS tests - ensures folding is tested. Also renamed SSE2 tests SSSE3 to match cpu.
This is a follow up commit agreed in http://reviews.llvm.org/D7094
llvm-svn: 226871
Specifically, gc.result benefits from this greatly. Instead of:
gc.result.int.*
gc.result.float.*
gc.result.ptr.*
...
We now have a gc.result.* that can specialize to literally any type.
Differential Revision: http://reviews.llvm.org/D7020
llvm-svn: 226857
This is a 2nd try at the same optimization as http://reviews.llvm.org/D6698.
That patch was checked in at r224611, but reverted at r225031 because it
caused a failure outside of the regression tests.
The cause of the crash was not recognizing consecutive stores that have mixed
source values (loads and vector element extracts), so this patch adds a check
to bail out if any store value is not coming from a vector element extract.
This patch also refactors the shared logic of the constant source and vector
extracted elements source cases into a helper function.
Differential Revision: http://reviews.llvm.org/D6850
llvm-svn: 226845
This solves PR22276.
Splats of constants would sometimes produce redundant shuffles, sometimes ridiculously so (see the PR for details). Fold these shuffles into BUILD_VECTORs early on instead.
Differential Revision: http://reviews.llvm.org/D7093
Fixed recommit of r226811.
llvm-svn: 226816
This solves PR22276.
Splats of constants would sometimes produce redundant shuffles, sometimes ridiculously so (see the PR for details). Fold these shuffles into BUILD_VECTORs early on instead.
Differential Revision: http://reviews.llvm.org/D7093
llvm-svn: 226811
The problem occurs when after vectorization we have type
<2 x i32>. This type is promoted to <2 x i64> and then requires
additional efforts for expanding loads and truncating stores.
I added EXPAND / TRUNCATE attributes to the masked load/store
SDNodes. The code now contains additional shuffles.
I've prepared changes in the cost estimation for masked memory
operations, it will be submitted separately.
llvm-svn: 226808
Type MVT::i1 became legal in KNL, but store operation can't be narrowed to this type,
since the size of VT (1 bit) is not equal to its actual store size(8 bits).
Added a test provided by David (dag@cray.com)
llvm-svn: 226805
Added most of the missing integer vector folding patterns for SSE (to SSE42) and AVX1.
The most useful of these are probably the i32/i64 extraction, i8/i16/i32/i64 insertions, zero/sign extension, unsigned saturation subtractions, i64 subtractions and the variable mask blends (pblendvb) - others include CLMUL, SSE42 string comparisons and bit tests.
Differential Revision: http://reviews.llvm.org/D7094
llvm-svn: 226745
This patch adds shuffle matching for the SSE3 MOVDDUP, MOVSLDUP and MOVSHDUP instructions. The big use of these being that they avoid many single source shuffles from needing to use (pre-AVX) dual source instructions such as SHUFPD/SHUFPS: causing extra moves and preventing load folds.
Adding these instructions uncovered an issue in XFormVExtractWithShuffleIntoLoad which crashed on single operand shuffle instructions (now fixed). It also involved fixing getTargetShuffleMask to correctly identify theses instructions as unary shuffles.
Also adds a missing tablegen pattern for MOVDDUP.
Differential Revision: http://reviews.llvm.org/D7042
llvm-svn: 226716
Thumbv4t does not have lo->lo copies other than MOVS,
and that can't be predicated. So emit MOVS when needed
and bail if there's a predicate.
http://reviews.llvm.org/D6592
llvm-svn: 226711
This fixes it for SI. It also removes the pattern
used previously for Evergreen for f32. I'm not sure
if the the new R600 output is better or not, but it uses
1 fewer instructions if BFI is available.
llvm-svn: 226682
Now that we can fully specify extload legality, we can declare them
legal for the PMOVSX/PMOVZX instructions. This for instance enables
a DAGCombine to fire on code such as
(and (<zextload-equivalent> ...), <redundant mask>)
to turn it into:
(zextload ...)
as seen in the testcase changes.
There is one regression, in widen_load-2.ll: we're no longer able
to do store-to-load forwarding with illegal extload memory types.
This will be addressed separately.
Differential Revision: http://reviews.llvm.org/D6533
llvm-svn: 226676
AAPCS64 says that it's up to the platform to specify whether x18 is
reserved, and a first step on that way is to add a flag controlling
it.
From: Andrew Turner <andrew@fubar.geek.nz>
llvm-svn: 226664
Changed the AVX1 tests register spill tail call to return a xmm like the SSE42 version - makes doing diffs between them a lot easier without affecting the spills themselves.
llvm-svn: 226623
The SSE42 version of the AVX1 float stack folding tests will be added shortly, this renames the AVX1 file so that the files will be near each other in a directory listing to help ensure they are kept in sync.
llvm-svn: 226620
This addresses part of llvm.org/PR22262. Specifically, it prevents
considering the densities of sub-ranges that have fewer than
TLI.getMinimumJumpTableEntries() elements. Those densities won't help
jump tables.
This is not a complete solution but works around the most pressing
issue.
Review: http://reviews.llvm.org/D7070
llvm-svn: 226600
With the appropriate Verifier changes, exactracting the result out of a
statepoint wrapping a vararg function crashes. However, a void vararg
function works fine: commit this first step.
Differential Revision: http://reviews.llvm.org/D7071
llvm-svn: 226599
We don't have a good way of legalizing this if the frame index offset
is more than the 12-bits, which is size of MUBUF's offset field, so
now we store the frame index in the vaddr field.
llvm-svn: 226584
This commits adds the octeon branch instructions bbit0/bbit032/bbit1/bbit132.
It also includes patterns for instruction selection and test cases.
Reviewed by D. Sanders
llvm-svn: 226573
Now that we can create much more exhaustive X86 memory folding tests, this patch adds the missing AVX1/F16C floating point instruction stack foldings we can easily test for including the scalar intrinsics (add, div, max, min, mul, sub), conversions float/int to double, half precision conversions, rounding, dot product and bit test. The patch also adds a couple of obviously missing SSE instructions (more to follow once we have full SSE testing).
Now that scalar folding is working it broke a very old test (2006-10-07-ScalarSSEMiscompile.ll) - this test appears to make no sense as its trying to ensure that a scalar subtraction isn't folded as it 'would zero the top elts of the loaded vector' - this test just appears to be wrong to me.
Differential Revision: http://reviews.llvm.org/D7055
llvm-svn: 226513
Original patch by Luke Iannini. Minor improvements and test added by
Erik de Castro Lopo.
Differential Revision: http://reviews.llvm.org/D6877
From: Erik de Castro Lopo <erikd@mega-nerd.com>
llvm-svn: 226473
No change in this commit, but clang was changed to also produce trivial comdats when
needed.
Original message:
Don't create new comdats in CodeGen.
This patch stops the implicit creation of comdats during codegen.
Clang now sets the comdat explicitly when it is required. With this patch clang and gcc
now produce the same result in pr19848.
llvm-svn: 226467
Our PPC64 ELF V2 call lowering logic added r2 as an operand to all direct call
instructions in order to represent the dependency on the TOC base pointer
value. Restricting this to ELF V2, however, does not seem to make sense: calls
under ELF V1 have the same dependence, and indirect calls have an r2 dependence
just as direct ones. Make sure the dependence is noted for all calls under both
ELF V1 and ELF V2.
llvm-svn: 226432
Begun adding more exhaustive tests - all floating point instructions should now be either tested or have placeholders. We do seem to have a number of missing instructions, I will add a patch for review once the remaining working instructions are added.
I'll then move on to SSE tests and then the integer instructions.
llvm-svn: 226400
The default calling convention specified by the PPC64 ELF (V1 and V2) ABI is
designed to work with both prototyped and non-prototyped/varargs functions. As
a result, GPRs and stack space are allocated for every argument, even those
that are passed in floating-point or vector registers.
GlobalOpt::OptimizeFunctions will transform local non-varargs functions (that
do not have their address taken) to use the 'fast' calling convention.
When functions are using the 'fast' calling convention, don't allocate GPRs for
arguments passed in other types of registers, and don't allocate stack space for
arguments passed in registers. Other changes for the fast calling convention
may be added in the future.
llvm-svn: 226399
R11's status is the same under both the PPC64 ELF V1 and V2 ABIs: it is
reserved for use as an "environment pointer" for compilation models that
require such a thing. We don't, we also don't need a second scratch register,
and because we support only "local" patchpoint call targets, we might as well
let R11 be used for anyregcc patchpoints.
llvm-svn: 226369
Loading 2 2x32-bit float vectors into the bottom half of a 256-bit vector
produced suboptimal code in AVX2 mode with certain IR combinations.
In particular, the IR optimizer folded 2f32 + 2f32 -> 4f32, 4f32 + 4f32
(undef) -> 8f32 into a 2f32 + 2f32 -> 8f32, which seems more canonical,
but then mysteriously generated rather bad code; the movq/movhpd combination
didn't match.
The problem lay in the BUILD_VECTOR optimization path. The 2f32 inputs
would get promoted to 4f32 by the type legalizer, eventually resulting
in a BUILD_VECTOR on two 4f32 into an 8f32. The BUILD_VECTOR then, recognizing
these were both half the output size, concatted them and then produced
a shuffle. However, the resulting concat + shuffle was more complex than
it should be; in the case where the upper half of the output is undef, we
probably want to generate shuffle + concat instead.
This enhancement causes the vector_shuffle combine step to recognize this
suboptimal pattern and correct it. I included it there instead of in BUILD_VECTOR
in case the same suboptimal pattern occurs for other reasons.
This results in the optimizer correctly producing the optimal movq + movhpd
sequence for all three variations on this IR, even with AVX2.
I've included a test case.
Radar link: rdar://problem/19287012
Fix for PR 21943.
From: Fiona Glaser <fglaser@apple.com>
llvm-svn: 226360
This patch disables target specific combine on X86ISD::INSERTPS dag nodes
if optlevel is CodeGenOpt::None.
The backend currently implements a target specific combine rule that converts
a vector load used by an INSERTPS dag node into a scalar load plus a
scalar_to_vector. This allows ISel to select a single INSERTPSrm instead of
two instructions (i.e. a vector load plus INSERTPSrr).
However, the existing target combine rule on INSERTPS nodes only works under
the assumption that ISel will always be able to match an INSERTPSrm. This is
not true in general at -O0, since the backend only allows folding a load into
the memory operand of an instruction if the optimization level is not
CodeGenOpt::None.
In the example below:
//
__m128 test(__m128 a, __m128 *b) {
__m128 c = _mm_insert_ps(a, *b, 1 << 6);
return c;
}
//
Before this patch, at -O0, the backend would have canonicalized the load to 'b'
into a scalar load plus scalar_to_vector. Later on, ISel would have selected an
INSERTPSrr leaving the insertps mask in an inconsistent state:
movss 4(%rdi), %xmm1
insertps $64, %xmm1, %xmm0 # xmm0 = xmm1[1],xmm0[1,2,3].
With this patch, the backend avoids folding the vector load into the operand of
the INSERTPS. The new codegen at -O0 is:
movaps (%rdi), %xmm1
insertps $64, %xmm1, %xmm0 # %xmm1[1],xmm0[1,2,3].
llvm-svn: 226277
The current 'big vectors' stack folded reload testing pattern is very bulky and makes it difficult to test all instructions as big vectors will tend to use only the ymm instruction implementations.
This patch changes the tests to use a nop call that lists explicit xmm registers as sideeffects, with this we can force a partial register spill of the relevant registers and then check that the reload is correctly folded. The asm generated only adds the forced spill, a nop instruction and a couple of extra labels (a fraction of the current approach).
More exhaustive tests will follow shortly, I've added some extra tests (the xmm versions of some of the existing folding tests) as a starting point.
Differential Revision: http://reviews.llvm.org/D6932
llvm-svn: 226264
Bill Schmidt pointed out that some adjustments would be needed to properly
support powerpc64le (using the ELF V2 ABI). For one thing, R11 is not available
as a scratch register, so we need to use R12. R12 is also available under ELF
V1, so to maintain consistency, I flipped the order to make R12 the first
scratch register in the array under both ABIs.
llvm-svn: 226247
This reverts commit r226173, adding r226038 back.
No change in this commit, but clang was changed to also produce trivial comdats for
costructors, destructors and vtables when needed.
Original message:
Don't create new comdats in CodeGen.
This patch stops the implicit creation of comdats during codegen.
Clang now sets the comdat explicitly when it is required. With this patch clang and gcc
now produce the same result in pr19848.
llvm-svn: 226242
Function pointers under PPC64 ELFv1 (which is used on PPC64/Linux on the
POWER7, A2 and earlier cores) are really pointers to a function descriptor, a
structure with three pointers: the actual pointer to the code to which to jump,
the pointer to the TOC needed by the callee, and an environment pointer. We
used to chain these loads, and make them opaque to the rest of the optimizer,
so that they'd always occur directly before the call. This is not necessary,
and in fact, highly suboptimal on embedded cores. Once the function pointer is
known, the loads can be performed ahead of time; in fact, they can be hoisted
out of loops.
Now these function descriptors are almost always generated by the linker, and
thus the contents of the descriptors are invariant. As a result, by default,
we'll mark the associated loads as invariant (allowing them to be hoisted out
of loops). I've added a target feature to turn this off, however, just in case
someone needs that option (constructing an on-stack descriptor, casting it to a
function pointer, and then calling it cannot be well-defined C/C++ code, but I
can imagine some JIT-compilation system doing so).
Consider this simple test:
$ cat call.c
typedef void (*fp)();
void bar(fp x) {
for (int i = 0; i < 1600000000; ++i)
x();
}
$ cat main.c
typedef void (*fp)();
void bar(fp x);
void foo() {}
int main() {
bar(foo);
}
On the PPC A2 (the BG/Q supercomputer), marking the function-descriptor loads
as invariant brings the execution time down to ~8 seconds from ~32 seconds with
the loads in the loop.
The difference on the POWER7 is smaller. Compiling with:
gcc -std=c99 -O3 -mcpu=native call.c main.c : ~6 seconds [this is 4.8.2]
clang -O3 -mcpu=native call.c main.c : ~5.3 seconds
clang -O3 -mcpu=native call.c main.c -mno-invariant-function-descriptors : ~4 seconds
(looks like we'd benefit from additional loop unrolling here, as a first
guess, because this is faster with the extra loads)
The -mno-invariant-function-descriptors will be added to Clang shortly.
llvm-svn: 226207
Reapply r226071 with fixes. Two fixes:
1. We need to manually remove the old and create the new 'deaf defs'
associated with physical register definitions when we move the definition of
the physical register from the copy point to the point of the original vreg def.
This problem was picked up by the machinstr verifier, and could trigger a
verification failure on test/CodeGen/X86/2009-02-12-DebugInfoVLA.ll, so I've
turned on the verifier in the tests.
2. When moving the def point of the phys reg up, we need to make sure that it
is neither defined nor read in between the two instructions. We don't, however,
extend the live ranges of phys reg defs to cover uses, so just checking for
live-range overlap between the pair interval and the phys reg aliases won't
pick up reads. As a result, we manually iterate over the range and check for
reads.
A test soon to be committed to the PowerPC backend will test this change.
Original commit message:
[RegisterCoalescer] Remove copies to reserved registers
This allows the RegisterCoalescer to join "non-flipped" range pairs with a
physical destination register -- which allows the RegisterCoalescer to remove
copies like this:
<vreg> = something (maybe a load, for example)
... (things that don't use PHYSREG)
PHYSREG = COPY <vreg>
(with all of the restrictions normally applied by the RegisterCoalescer: having
compatible register classes, etc. )
Previously, the RegisterCoalescer handled only the opposite case (copying
*from* a physical register). I don't handle the problem fully here, but try to
get the common case where there is only one use of <vreg> (the COPY).
An upcoming commit to the PowerPC backend will make this pattern much more
common on PPC64/ELF systems.
llvm-svn: 226200