makes a mess of the lit output when they ultimately fail.
The 2012-10-02-DAGCycle test is really frustrating because the *only*
explanation for what it is testing is an rdar link. I would really rather
that rdar links (which are not public or part of the open source
project) were not committed to the source code. Regardless, the actual
problem *must* be described, as the rdar link is completely opaque. The
fact that this test didn't check for any particular output further
exacerbates the inability of any other developer to debug failures.
The mem-promote-integers test has nice comments and *seems* to be
a great test for our lowering... except that we don't actually check
that any of the generated code is correct or matches some pattern. We
just avoid crashing. It would be great to go back and populate this test
with the actual expectations.
llvm-svn: 214605
Stop using ST registers for function returns and inline-asm instructions and use
FP registers instead. This allows removing a large amount of code in the
stackifier pass that was needed to track register liveness and handle copies
between ST and FP registers and function calls returning floating point values.
It also fixes a bug which manifested when an ST register defined by an
inline-asm instruction was live across another inline-asm instruction, as shown
in the following sequence of machine instructions:
1. INLINEASM <es:frndint> $0:[regdef], %ST0<imp-def,tied5>
2. INLINEASM <es:fldcw $0>
3. %FP0<def> = COPY %ST0
<rdar://problem/16952634>
llvm-svn: 214580
This is consistent with how we parse them in a standalone .s file, and
inline assembly shouldn't differ.
This fixes errors about requiring more registers than available in
cases like this:
void f();
void __declspec(naked) g() {
__asm pusha
__asm call f
__asm popa
__asm ret
}
There are no registers available to pass the address of 'f' into the asm
blob. The asm should now directly call 'f'.
Tests will land in Clang shortly.
llvm-svn: 214550
Add branch weights to branch instructions, so that the following passes can
optimize based on it (i.e. basic block ordering).
Fixes <rdar://problem/17887137>.
llvm-svn: 214537
This change adds code to explicitly mark a function which requires runtime stack realignment as not having a fixed frame size in the StackMap section. As it happens, this is not actually a functional change. The size that would be reported without the check is also "-1", but as far as I can tell, that's an accident. The code change makes this explicit.
Note: There's a separate bug in handling of stackmaps and patchpoints in functions which need dynamic frame realignment. The current code assumes that offsets can be calculated from RBP, but realigned frames must use RSP. (There's a variable gap between RBP and the spill slots.) This change set does not address that issue.
Reviewers: atrick, ributzka
Differential Revision: http://reviews.llvm.org/D4572
llvm-svn: 214534
This is a followup patch for r214366, which added the same behavior to the
AArch64 and X86 FastISel code. This fix reproduces the already existing
behavior of SelectionDAG in FastISel.
llvm-svn: 214531
The tbz/tbnz instructions check the sign bit, allowing us to convert
op w1, w1, w10
cmp w1, #0
b.lt .LBB0_0
to
op w1, w1, w10
tbnz w1, #31, .LBB0_0
Differential Revision: http://reviews.llvm.org/D4440
llvm-svn: 214518
Summary:
Big-endian mode was not correctly adjusting the offset for types smaller
than an ABI slot.
Fixes PR19612
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: sstankovic, llvm-commits
Differential Revision: http://reviews.llvm.org/D4556
llvm-svn: 214493
ADDS and SUBS cannot encode negative immediates or immediates larger than 12 bits.
This fix checks whether the immediate version can be used under these constraints
and whether we can convert ADDS to SUBS or vice versa to support negative immediates.
Also update the test cases to test the immediate versions.
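For illustration, a sketch of IR that exercises the flag-setting add with an
unencodable negative immediate (the constant is made up):
define { i32, i1 } @add_neg_imm(i32 %a) {
  ; adds w0, w0, #-8 cannot be encoded, so selection can flip it to
  ; subs w0, w0, #8, which computes the same result and NZCV flags.
  %r = call { i32, i1 } @llvm.sadd.with.overflow.i32(i32 %a, i32 -8)
  ret { i32, i1 } %r
}
declare { i32, i1 } @llvm.sadd.with.overflow.i32(i32, i32)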
llvm-svn: 214470
When generating unaligned vector loads, we need to search for other loads or
stores nearby offset by one vector width. If we find one, then we know that we
can safely generate another aligned load at that address. Otherwise, we must
generate the next load using an offset of the vector width minus one byte (so
we don't read off the end of the allocation if the base unaligned address
happened to be aligned at runtime). We had previously done this using only
other vector loads and stores, but did not consider the PowerPC-specific vector
load/store intrinsics. Now we'll also consider vector intrinsics. By itself,
this change is a feature enhancement, but is a necessary step toward fixing the
underlying problem behind PR19991.
llvm-svn: 214469
Abs/neg folding has moved out of foldOperands and into the instruction
selection phase using complex patterns. As a consequence of this
change, we now prefer to select the 64-bit encoding for most
instructions and the modifier operands have been dropped from
integer VOP3 instructions.
llvm-svn: 214467
This is a follow-up to the activity in the bug at
http://llvm.org/bugs/show_bug.cgi?id=18663 . The underlying issue has
to do with how the KILL pseudo-instruction is handled. I defer to
Hal/Jakob/Uli for additional details and background.
This will disable the (bad?) assert, add an associated fixme comment,
and add a pair of tests.
The code change and the pr18663-2.ll test are copied from the referenced
bug. That test does not immediately fail in my environment, but I have
added the pr18663.ll test which does.
(Comment from Hal)
To provide everyone else with some context, this assert was not bad when
it was written. At that time, we only generated KILL pseudo instructions
around subregister copies. This logic, unfortunately, had its own problems.
In r199797, the relevant logic in MachineCopyPropagation was replaced to
generate KILLs for other kinds of copies too. This change in semantics broke
this now-problematic assumption in AggressiveAntiDepBreaker. The
AggressiveAntiDepBreaker really needs a proper cleanup to deal with the
change, but removing the assert (which just allows the function to return
false) is a safe conservative behavior, and should do for the time being.
llvm-svn: 214429
It seems that when I fixed this, almost exactly a year ago, I did not quite do
it correctly. When we have duplicate block predecessors, we can indeed not have
different incoming values for the same block, but we *must* have duplicate
entries. So, instead of skipping the duplicates, we explicitly add the
duplicate incoming values.
Fixes PR20442.
llvm-svn: 214423
This commit updates the existing SelectionDAG tests for the stackmap and patchpoint
intrinsics and enables FastISel testing. It also splits up the tests into separate
files, due to different codegen between SelectionDAG and FastISel.
llvm-svn: 214382
Currently the large code model for MachO uses the GOT to make function calls.
Emit the required adrp and ldr instructions to load the address from the GOT.
Related to <rdar://problem/17733076>.
llvm-svn: 214381
This fixes a mistake where I accidentally dropped the upper 32 bits of a
64-bit pointer during FastISel lowering of the patchpoint intrinsic.
llvm-svn: 214367
UNDEF arguments are not meant to be touched - especially for the webkit_js
calling convention. This fix reproduces the already existing behavior of
SelectionDAG in FastISel.
llvm-svn: 214366
Before this patch we had
@a = weak global ...
but
@b = alias weak ...
The patch changes aliases to look more like global variables.
Looking at some really old code suggests that the reason was that the old
bison based parser had a reduction for alias linkages and another one for
global variable linkages. Putting the alias first avoided the reduce/reduce
conflict.
The days of the old .ll parser are long gone. The new one parses just "linkage"
and a later check is responsible for deciding if a linkage is valid in a
given context.
llvm-svn: 214355
This improves the code generation for the XALU intrinsics when the
condition is feeding a select instruction.
This also updates and enables the XALU unit tests for FastISel.
This fixes <rdar://problem/17831117>.
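A minimal sketch of the pattern this targets (the overflow bit of an
arithmetic-with-overflow intrinsic feeding a select; the saturation value is
made up):
define i32 @sat_add(i32 %a, i32 %b) {
  %res = call { i32, i1 } @llvm.sadd.with.overflow.i32(i32 %a, i32 %b)
  %val = extractvalue { i32, i1 } %res, 0
  %ovf = extractvalue { i32, i1 } %res, 1
  ; The select condition is the overflow bit, so the improved lowering can
  ; reuse the flags set by the add instead of materializing a boolean first.
  %ret = select i1 %ovf, i32 -1, i32 %val
  ret i32 %ret
}
declare { i32, i1 } @llvm.sadd.with.overflow.i32(i32, i32)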
llvm-svn: 214350
Currently the shift-immediate versions are not supported by tblgen and
hopefully this can be later removed, once the required support has been
added to tblgen.
llvm-svn: 214345
This patch adds an explicit triple to the test case introduced by r214322. This
should fix build failures that are occurring on bots that are cross building.
llvm-svn: 214330
DAGCombine may choose to rewrite graphs where two loads feed a select into
graphs where a select of two addresses feeds a load. While it sanity checks the
loads to make sure they are broadly equivalent, it currently just uses the
alignment restriction of the left node. In cases where the right node has
a stronger alignment requirement, this may lead to bad codegen, such as generating
an aligned load where an unaligned load is required. This patch makes the
combine generate a load with an alignment that is the same as whichever is more
restrictive of the two alignments.
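A sketch of the affected shape of IR (types and alignments are illustrative):
define <4 x i32> @pick(i1 %c, <4 x i32>* %p, <4 x i32>* %q) {
  %a = load <4 x i32>, <4 x i32>* %p, align 16
  %b = load <4 x i32>, <4 x i32>* %q, align 4
  ; If this becomes a load of a selected address, the new load must use
  ; align 4, the weaker of the two guarantees, since %q may only be
  ; 4-byte aligned.
  %r = select i1 %c, <4 x i32> %a, <4 x i32> %b
  ret <4 x i32> %r
}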
Tests included.
rdar://17762530
llvm-svn: 214322
Fix the missing case in ScalarizeVectorResult() that was exposed with
libclcore.bc in Android.
Differential Revision: http://reviews.llvm.org/D4645
llvm-svn: 214266
We need to make sure we use the softened version of all appropriate operands in
the libcall, or things go horribly wrong. This may entail actually executing a
1-stage softening.
llvm-svn: 214175
DITypeArray is an array of DITypeRef; at its creation, we will create
DITypeRef (i.e. use the identifier if the type node has an identifier).
This is the last patch to unique the type array of a subroutine type.
rdar://17628609
llvm-svn: 214132
Rename to allowsMisalignedMemoryAccess.
On R600, 8 and 16 byte accesses are mostly OK with 4-byte alignment,
and don't need to be split into multiple accesses. Vector loads with
an alignment of the element type are not uncommon in OpenCL code.
llvm-svn: 214055
'J' represents a negative number suitable for an add/sub alias
instruction, but while preparing it to become an int64_t we were
mangling the sign extension. So "i32 -1" became 0xffffffffLL, for
example.
Should fix one half of PR20456.
llvm-svn: 214052
instructions in the legalized DAG, and leverage it to combine long
sequences of instructions to PSHUFB.
Eventually, the other x86-instruction-specific shuffle combines will
probably all be driven out of this routine. But the real motivation is
to detect after we have fully legalized and optimized a shuffle to the
minimal number of x86 instructions whether it is profitable to replace
the chain with a fully generic PSHUFB instruction even though doing so
requires either a load from a constant pool or tying up a register with
the mask.
While the Intel manuals claim it should be used when it replaces 5 or
more instructions (!!!!), my experience is that it is actually very fast
on modern chips, and so I've gone with a much more aggressive model of
replacing any sequence of 3 or more instructions.
I've also taught it to do some basic canonicalization to special-purpose
instructions which have smaller encodings than their generic
counterparts.
There are still quite a few FIXMEs here, and I've not yet implemented
support for lowering blends with PSHUFB (where its power really shines
due to being able to zero out lanes), but this starts implementing real
PSHUFB support even when using the new, fancy shuffle lowering. =]
llvm-svn: 214042
over each node in the worklist prior to combining.
This allows the combiner to produce new nodes which need to go back
through legalization. This is particularly useful when generating
operands to target specific nodes in a post-legalize DAG combine where
the operands are significantly easier to express as pre-legalized
operations. My immediate use case will be PSHUFB formation where we need
to build a constant shuffle mask with a build_vector node.
This also refactors the relevant functionality in the legalizer to
support this, and updates relevant tests. I've spoken to the R600 folks
and these changes look like improvements to them. The avx512 change
needs to be investigated; I suspect there is a disagreement between the
legalizer and the DAG combiner there, but it seems a minor issue, so I'm
leaving it to be re-evaluated after this patch.
Differential Revision: http://reviews.llvm.org/D4564
llvm-svn: 214020
The tale starts with r212808 which attempted to fix inversion of the low
and high bits when lowering MUL_LOHI. Sadly, that commit did not include
any positive test cases, and just removed some operations from a test
case where the actual logic being changed isn't fully visible from the
test.
What that commit did was two things. First, it reversed the low and high
results in the formation of the MERGE_VALUES node for the multiple
results. This is entirely correct.
Second it changed the shuffles for extracting the low and high
components from the i64 results of the multiplies to extract them
assuming a big-endian-style encoding of the multiply results. This
second change is wrong. There is no big-endian encoding in x86, the
results of the multiplies are normal v2i64s: when cast to v4i32, the low
i32s are at offsets 0 and 2, and the high i32s are at offsets 1 and 3.
However, the first change wasn't enough to actually fix the bug, which
is (I assume) why the second change was also made. There was another bug
in the MERGE_VALUES formation: we weren't using a VTList, and so were
getting a single result node! When grabbing the *second* result from the
node, we got... well, it could be anything. I think this *appeared* to
invert things, but had to be causing other problems as well.
Fortunately, I fixed the MERGE_VALUES issue in r213931, so we should
have been fine, right? NOOOPE! Because the core bug was never addressed,
the test in vector-idiv failed when I fixed the MERGE_VALUES node.
Because there are essentially no docs for this node, I had to guess at
how to fix it and tried swapping the operands, restoring the order of
the original code before r212808. While this "fixed" the test case (in
that we produced the right instructions) we were still extracting the
wrong elements of the i64s, and thus PR20355 was still broken.
This commit essentially reverts the big-endian-style extraction part of
r212808 and goes back to the original masks which were correct. Now that
the MERGE_VALUES node formation is also correct, everything works. I've
also included a more detailed test from PR20355 to make sure this stays
fixed.
llvm-svn: 214011
The clever way to implement signed multiplication with unsigned *is
already implemented* and tested and working correctly. The bug is
somewhere else. Re-investigating.
This will teach me to not scroll far enough to read the code that did
what I thought needed to be done.
llvm-svn: 214009
signed multiplication is requested. While there is not a difference in
the *low* half of the result, the *high* half (used specifically to
implement the signed division by these constants) certainly is used. The
test case I've nuked was actively asserting wrong code.
There is a delightful solution to doing signed multiplication even when
we don't have it that Richard Smith has crafted, but I'll add the
machinery back and implement that in a follow-up patch. This at least
restores correctness.
llvm-svn: 214007
This is the first commit in a series that add an @llvm.assume intrinsic which
can be used to provide the optimizer with a condition it may assume to be true
(when the control flow would hit the intrinsic call). Some basic properties are added here:
- llvm.invariant(true) is dead.
- llvm.invariant(false) is unreachable (this directly corresponds to the
documented behavior of MSVC's __assume(0)), as is llvm.invariant(undef).
The intrinsic is tagged as writing arbitrarily, in order to maintain control
dependencies. BasicAA has been updated, however, to return NoModRef for any
particular location-based query so that we don't unnecessarily block code
motion.
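A minimal sketch of the intrinsic in use (the simplification noted in the
comment is just one possible consumer of the assumption):
define i32 @f(i32 %x) {
  %nonneg = icmp sge i32 %x, 0
  ; The optimizer may now assume %x is non-negative at this point.
  call void @llvm.assume(i1 %nonneg)
  ; e.g. later passes could turn this signed divide into a plain shift.
  %d = sdiv i32 %x, 4
  ret i32 %d
}
declare void @llvm.assume(i1)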
llvm-svn: 213973
address of the stack guard was being spilled to the stack.
Previously the address of the stack guard would get spilled to the stack if it
was impossible to keep it in a register. This patch introduces a new target
independent node and pseudo instruction which gets expanded post-RA to a
sequence of instructions that load the stack guard value. The register allocator
can now just remat the value when it can't keep it in a register.
<rdar://problem/12475629>
llvm-svn: 213967
This commit implements the frameaddress intrinsic for the AArch64 architecture
in FastISel.
There were two test cases that pretty much tested the same thing, so I
combined them into a single test case.
Fixes <rdar://problem/17811834>
llvm-svn: 213959
Reverted by Eric Christopher (Thanks!) in r212203 after Bob Wilson
reported LTO issues. Duncan Exon Smith and Aditya Nandakumar helped
provide a reduced reproduction, though the failure wasn't too hard to
guess, and even easier with the example to confirm.
The assertion that the subprogram metadata associated with an
llvm::Function matches the scope data referenced by the DbgLocs on the
instructions in that function is not valid under LTO. In LTO, a C++
inline function might exist in multiple CUs and the subprogram metadata
nodes will refer to the same llvm::Function. In this case, depending on
the order of the CUs, the first instance of the subprogram metadata may
not be the one referenced by the instructions in that function and the
assertion will fail.
A test case (test/DebugInfo/cross-cu-linkonce-distinct.ll) is added, the
assertion removed and a comment added to explain this situation.
This was then reverted again in r213581 as it caused PR20367. The root
cause of this was the early exit in LiveDebugVariables meant that
spurious DBG_VALUE intrinsics that referenced dead variables were not
removed, causing an assertion/crash later on. The fix is to have
LiveDebugVariables strip all DBG_VALUE intrinsics in functions without
debug info as they're not needed anyway. Test case added to cover this
situation (that occurs when a debug-having function is inlined into a
nodebug function) in test/DebugInfo/X86/nodebug_with_debug_loc.ll
Original commit message:
If a function isn't actually in a CU's subprogram list in the debug info
metadata, ignore all the DebugLocs and don't try to build scopes, track
variables, etc.
While this is possibly a minor optimization, it's also a correctness fix
for an incoming patch that will add assertions to LexicalScopes and the
debug info verifier to ensure that all scope chains lead to debug info
for the current function.
Fix up a few test cases that had broken/incomplete debug info that could
violate this constraint.
Add a test case where this occurs by design (inlining a
debug-info-having function in an attribute nodebug function - we want
this to work because /if/ the nodebug function is then inlined into a
debug-info-having function, it should be fine (and will work fine - we
just stitch the scopes up as usual), but should the inlining not happen
we need to not assert fail either).
llvm-svn: 213952
* Add CUs to the named CU node
* Add missing DW_TAG_subprogram nodes
* Add llvm::Functions to the DW_TAG_subprogram nodes
This cleans up the tests so that they don't break under a
soon-to-be-made change that is more strict about such things.
llvm-svn: 213951
assembly instructions.
This is necessary to ensure the ARM assembler switches to Thumb mode before it
starts assembling the file level inline assembly instructions at the beginning
of a .s file.
<rdar://problem/17757232>
llvm-svn: 213924
Because the PowerPC vmrgh* and vmrgl* instructions have a built-in
big-endian bias, it is necessary to swap their inputs in little-endian
mode when using them to implement a vector shuffle. This was
previously missed in the vector LE implementation.
There was already logic to distinguish between unary and "normal"
vmrg* vector shuffles, so this patch extends that logic to use a third
option: "swapped" vmrg* vector shuffles that are used for little
endian in place of the "normal" ones.
I've updated the vec-shuffle-le.ll test to check for the expected
register ordering on the generated instructions.
This bug was discovered when testing the LE and ELFv2 patches for
safety if they were backported to 3.4. A different vectorization
decision was made in 3.4 than on mainline trunk, and that exposed the
problem. I've verified this fix takes care of that issue.
llvm-svn: 213915
which have successfully round-tripped through the combine phase, and use
this to ensure all operands to DAG nodes are visited by the combiner,
even if they are only added during the combine phase.
This is critical to have the combiner reach nodes that are *introduced*
during combining. Previously these would sometimes be visited and
sometimes not be visited based on whether they happened to end up on the
worklist or not. Now we always run them through the combiner.
This fixes quite a few bad codegen test cases lurking in the suite while
also being more principled. Among these, the TLS code generation is
particularly exciting for programs that have this in the critical path
like TSan-instrumented binaries (although I think they engineer to use
a different TLS that is faster anyways).
I've tried to check for compile-time regressions here by running llc
over a merged (but not LTO-ed) clang bitcode file and observed at most
a 3% slowdown in llc. Given that this is essentially a worst case (none
of opt or clang are running at this phase) I think this is tolerable.
The actual LTO case should be even less costly, and the cost in normal
compilation should be negligible.
With this combining logic, it is possible to re-legalize as we combine
which is necessary to implement PSHUFB formation on x86 as
a post-legalize DAG combine (my ultimate goal).
Differential Revision: http://reviews.llvm.org/D4638
llvm-svn: 213898
vector operation legalization with support for custom target lowering
and fallback to expand when it fails, and use this to implement sext and
anyext load lowering for x86 in a more principled way.
Previously, the x86 backend relied on a target DAG combine to "combine
away" sextload and extload nodes prior to legalization, or would expand
them during legalization with terrible code. This is particularly
problematic because the DAG combine relies on running over non-canonical
DAG nodes at just the right time to match several common and important
patterns. It used a combine rather than lowering because we didn't have
good lowering support, and to expose some tricks being employed to more
combine phases.
With this change it becomes a proper lowering operation, the backend
marks that it can lower these nodes, and I've added support for handling
the canonical forms that don't have direct legal representations such as
sextload of a v4i8 -> v4i64 on AVX1. With this change, our test cases
for this behavior continue to pass even after the DAG combiner begins
running more systematically over every node.
There is some noise caused by this in the test suite where we actually
use vector extends instead of subregister extraction. This doesn't
really seem like the right thing to do, but is unlikely to be a critical
regression. We do regress in one case where by lowering to the
target-specific patterns early we were able to combine away extraneous
legal math nodes. However, this regression is completely addressed by
switching to a widening based legalization which is what I'm working
toward anyways, so I've just switched the test to that mode.
Differential Revision: http://reviews.llvm.org/D4654
llvm-svn: 213897
This patch minimizes the number of nops that must be emitted on X86 to satisfy
stackmap shadow constraints.
To minimize the number of nops inserted, the X86AsmPrinter now records the
size of the most recent stackmap's shadow in the StackMapShadowTracker class,
and tracks the number of instruction bytes emitted since that stackmap
instruction was encountered. Padding is emitted (if it is required at all)
immediately before the next stackmap/patchpoint instruction, or at the end of
the basic block.
This optimization should reduce code-size and improve performance for people
using the llvm stackmap intrinsic on X86.
<rdar://problem/14959522>
llvm-svn: 213892
Use ComputeNumSignBits instead of checking for i8 / i16, which only
worked when AMDIL was lying about having legal i8 / i16.
If an integer is known to fit in 24 bits, we can
do division faster with float ops.
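A rough sketch of IR where ComputeNumSignBits proves the operands fit in
24 bits (the shift amounts are made up):
define i32 @sdiv24(i32 %x, i32 %y) {
  ; Each ashr guarantees at least 9 sign bits, so both values fit in 24 bits
  ; and the division can be done with the faster float-based sequence.
  %a = ashr i32 %x, 8
  %b = ashr i32 %y, 8
  %d = sdiv i32 %a, %b
  ret i32 %d
}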
llvm-svn: 213843
This bug was introduced by r211144. The element type of the operand may be
smaller than the element type of the result, but the previous commit could
only handle the opposite case. This commit handles that scenario and
generates optimized code using ZIP1.
llvm-svn: 213830
When we had a vector_shuffle where we had an input from each vector, we
could miscompile it because we were assuming the input from V2 wouldn't
be moved from where it was on the vector.
Added a test case.
llvm-svn: 213826
It isn't reasonable to test storing things using undef pointers --
storing through those is at best "good luck" and really should be
transformed to "unreachable". Random changes in the combiner can
randomly break these tests for no good reason. I'm following up on the
original commit regarding the right long-term strategy here.
llvm-svn: 213810
The transform to constant fold unary operations with an AND across a
vector comparison also applies when the constant is not a splat of
a scalar.
llvm-svn: 213800
The folding of unary operations through a vector compare and mask operation
is only safe if the unary operation result is of the same size as its input.
For example, it's not safe for [su]itofp from v4i32 to v4f64.
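For reference, a sketch of the unsafe case, where the unary result is wider
than its input and the fold must not fire:
define <4 x double> @widen(<4 x float> %val, <4 x float> %test) {
  %cmp = fcmp oeq <4 x float> %val, %test
  %ext = zext <4 x i1> %cmp to <4 x i32>
  ; sitofp goes from v4i32 to v4f64, so folding it through the v4i32 mask
  ; would produce a differently-sized (and wrong) constant.
  %res = sitofp <4 x i32> %ext to <4 x double>
  ret <4 x double> %res
}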
llvm-svn: 213799
Constant fold the lanes of the input constant build_vector individually
so we correctly handle when the vector elements are not all the same
constant value.
PR20394
llvm-svn: 213798
I used the wrong method to obtain the return type inside FinishCall. This fix
simply uses the return type from FastLowerCall, which we already determined to
be a valid type.
Reduced test case from Chad. Thanks.
llvm-svn: 213788
With optimizations disabled, we disable the isel patterns for mul.wide; but we
were still generating MULWIDE ISD nodes. Now, we only try to generate MULWIDE
ISD nodes in DAGCombine if the optimization level is not zero.
llvm-svn: 213773
The target-independent DAGCombiner will generate:
asr w1, X, #31     w1 = splat sign bit.
add X, X, w1, lsr #28     X = X + 0 or pow2-1
asr w0, X, #4     w0 = X/pow2
However, the add + shifts sequence is expensive, so generate instead:
add w0, X, 15     w0 = X + pow2-1
cmp X, wzr     X - 0
csel X, w0, X, lt     X = (X < 0) ? X + pow2-1 : X;
asr w0, X, #4     w0 = X/pow2
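For reference, IR of this form (a signed divide by a power of two, here 16)
exercises the new sequence:
define i32 @div16(i32 %x) {
  %d = sdiv i32 %x, 16
  ret i32 %d
}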
llvm-svn: 213758
We were assuming all SBFX-like operations would have the shl/asr form, but
often when the field being extracted is an i8 or i16, we end up with a
SIGN_EXTEND_INREG acting on a shift instead. Simple enough to check for though.
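A rough sketch of the second form (an i8 field sign-extracted starting at
bit 3; the exact bitfield is made up):
define i32 @sbfx_i8(i32 %x) {
  %shr = lshr i32 %x, 3
  %trunc = trunc i32 %shr to i8
  ; The DAG sees sign_extend_inreg(srl %x, 3) rather than shl+asr, but this
  ; should still select a single sbfx.
  %ext = sext i8 %trunc to i32
  ret i32 %ext
}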
llvm-svn: 213754
Although the final shifter operand is a rotate, this actually only matters for
the half-word extends when the amount == 24. Otherwise folding a shift in is
just as good.
llvm-svn: 213753
There really is no arm64_be: it was a useful fiction to test big-endian support
while both backends existed in parallel, but now the only platform that uses
the name (iOS) doesn't have a big-endian variant, let alone one called
"arm64_be".
llvm-svn: 213748
This change fully reverts r211771.
That revision added a canonicalization rule which has the potential to cause a
combine-cycle in the target-independent canonicalizing DAG combine.
The plan is to move the logic that forms target specific addsub nodes as part of
the lowering of shuffles.
llvm-svn: 213736
instruction sequences with CHECK-NEXT for these test cases.
This notably exposes how absolutely horrible the generated code is for
several of these test cases, and will make any future updates to the test
easy to spot as our vector instruction selection gets better.
llvm-svn: 213732
The post-indexed instructions were missing the constraint, causing unpredictable STRH instructions to be emitted.
The earlyclobber constraint on the pre-indexed STR instructions is not strictly necessary, as the instruction selection for pre-indexed STR instructions goes through an additional layer of pseudo instructions which have the constraint defined. However, it doesn't hurt to specify the constraint directly on the pre-indexed instructions as well, since at some point someone might create instances of them programmatically, and then the constraint is definitely needed.
llvm-svn: 213729
insertions.
The old behavior could cause arbitrarily bad memory usage in the DAG
combiner if there was heavy traffic of adding nodes already on the
worklist to it. This commit switches the DAG combine worklist to work
the same way as the instcombine worklist where we null-out removed
entries and only add new entries to the worklist. My measurements of
codegen time shows slight improvement. The memory utilization is
unsurprisingly dominated by other factors (the IR and DAG itself
I suspect).
This change results in subtle, frustrating churn in the particular order
in which DAG combines are applied which causes a number of minor
regressions where we fail to match a pattern previously matched by
accident. AFAICT, all of these should be using AddToWorklist directly
or should be written in a less brittle way. None of the changes seem
drastically bad, and a few of the changes seem distinctly better.
A major change required to make this work is to significantly harden the
way in which the DAG combiner handles nodes which become dead
(zero-uses). Previously, we relied on the ability to "priority-bump"
them on the combine worklist to achieve recursive deletion of these
nodes and ensure that the frontier of remaining live nodes all were
added to the worklist. Instead, I've introduced a routine to just
implement that precise logic with no indirection. It is a significantly
simpler operation than that of the combiner worklist proper. I suspect
this will also fix some other problems with the combiner.
I think the x86 changes are really minor and uninteresting, but the
avx512 change at least is hiding a "regression" (despite the test case
being just noise, not testing some performance invariant) that might be
looked into. Not sure if any of the others impact specific "important"
code paths, but they didn't look terribly interesting to me, or the
changes were really minor. The consensus in review is to fix any
regressions that show up after the fact here.
Thanks to the other reviewers for checking the output on other
architectures. There is a specific regression on ARM that Tim already
has a fix prepped to commit.
Differential Revision: http://reviews.llvm.org/D4616
llvm-svn: 213727
This commit modifies the existing call lowering functions to be used as the
FastLowerCall and FastLowerIntrinsicCall target-hooks instead.
This enables patchpoint intrinsic lowering for AArch64.
This fixes <rdar://problem/17733076>
llvm-svn: 213704
We should update the uses of all of the results;
otherwise, we might get an assertion failure or a SEGV during
the type legalization of ATOMIC_CMP_SWAP_WITH_SUCCESS
with two or more illegal types.
For example, in the following sequence, both i8 and i1
might be illegal on some targets, e.g. armv5, mipsel, mips64el:
%0 = cmpxchg i8* %ptr, i8 %desire, i8 %new monotonic monotonic
%1 = extractvalue { i8, i1 } %0, 1
Since both i8 and i1 should be legalized, the corresponding
ATOMIC_CMP_SWAP_WITH_SUCCESS dag will be checked/replaced/updated
twice.
If we don't update the uses of *ALL* of the results in the
first round, the DAG for the extractvalue might be processed earlier.
GetPromotedInteger() will then hit an assertion failure,
because its operand (i.e. the success bit of the cmpxchg) is not
promoted beforehand.
llvm-svn: 213569
There are a few more cleanups to do, but I ran into some problems
with ext loads and trunc stores, when I tried to change some of the
vector loads and stores from custom to legal, so I wasn't able to
get rid of everything.
llvm-svn: 213552
We now emit this value when we need to contradict the default value. This
restores support for binutils 2.24.
When a suitable binutils has been released we can resume unconditionally
emitting .module directives. This is preferable to omitting the .module
directives since the .module directives protect against, for example,
accidentally assembling FP32 code with -mfp64 and producing an unusable object.
llvm-svn: 213548
This implements a solution for constant initializers suggested
by Vadim Girlin, where we store the data after the shader code
and then use the S_GETPC instruction to compute its address.
This saves us the trouble of creating a new buffer for constant data
and then having to pass the pointer to the kernel via user SGPRs or the
input buffer.
llvm-svn: 213530
We now emit this directive when we need to contradict the default value (e.g.
-mno-odd-spreg is given) or an option changed the default value (e.g. -mfpxx
is given).
This restores support for the currently available head of binutils. However,
at this point binutils 2.24 is still not sufficient since it does not support
'.module fp=...'.
llvm-svn: 213511
This makes the first stage DAG for @llvm.convert.to.fp16 an fptrunc,
and correspondingly @llvm.convert.from.fp16 an fpext. The legalisation
path is now uniform, regardless of the input IR:
fptrunc -> FP_TO_FP16 (if f16 illegal) -> libcall
fpext -> FP16_TO_FP (if f16 illegal) -> libcall
Each target should be able to select the version that best matches its
operations and not be required to duplicate patterns for both fptrunc
and FP_TO_FP16 (for example).
As a result we can remove some redundant AArch64 patterns.
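A minimal sketch of the now-uniform path, assuming the values arrive as the
IR half type:
define half @narrow_f16(float %f) {
  ; becomes FP_TO_FP16 and then a libcall if f16 is illegal
  %h = fptrunc float %f to half
  ret half %h
}
define float @widen_f16(half %h) {
  ; becomes FP16_TO_FP and then a libcall if f16 is illegal
  %f = fpext half %h to float
  ret float %f
}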
llvm-svn: 213507
Canonicalize shuffles according to rules:
* shuffle(A, shuffle(A, B)) -> shuffle(shuffle(A,B), A)
* shuffle(B, shuffle(A, B)) -> shuffle(shuffle(A,B), B)
* shuffle(B, shuffle(A, Undef)) -> shuffle(shuffle(A, Undef), B)
This patch helps identifying more shuffle pairs that could be combined reusing
the already existing rules in the DAGCombiner.
Added new test 'combine-vec-shuffle-5.ll' to verify that the canonicalized
shuffles are now folded into a single shuffle node by the DAGCombiner.
Added more test cases to 'combine-vec-shuffle-4.ll'.
llvm-svn: 213504
This patch adds infrastructure support for passing array types
directly. These can be used by the front-end to pass aggregate
types (coerced to an appropriate array type). The details of the
array type being used inform the back-end about ABI-relevant
properties. Specifically, the array element type encodes:
- whether the parameter should be passed in FPRs, VRs, or just
GPRs/stack slots (for float / vector / integer element types,
respectively)
- what the alignment requirements of the parameter are when passed in
GPRs/stack slots (8 for float / 16 for vector / the element type
size for integer element types) -- this corresponds to the
"byval align" field
Using the infrastructure provided by this patch, a companion patch
to clang will enable two features:
- In the ELFv2 ABI, pass (and return) "homogeneous" floating-point
or vector aggregates in FPRs and VRs (this is similar to the ARM
homogeneous aggregate ABI)
- As an optimization for both ELFv1 and ELFv2 ABIs, pass aggregates
that fit fully in registers without using the "byval" mechanism
The patch uses the functionArgumentNeedsConsecutiveRegisters callback
to encode that special treatment is required for all directly-passed
array types. The isInConsecutiveRegs / isInConsecutiveRegsLast bits set
as a result are then used to implement the required size and alignment
rules in CalculateStackSlotSize / CalculateStackSlotAlignment etc.
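As a rough illustration of what a front end might emit on top of this
infrastructure (the struct-to-array coercion is the companion clang change;
the names are made up):
; struct HFA { float a, b, c, d; } coerced to a float array parameter,
; letting the ABI code place it in FPRs rather than GPRs/stack slots.
define float @first_member([4 x float] %hfa) {
  %f = extractvalue [4 x float] %hfa, 0
  ret float %f
}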
As a related change, the ABI routines have to be modified to support
passing floating-point types in GPRs. This is necessary because with
homogeneous aggregates of 4-byte float type we can now run out of FPRs
*before* we run out of the 64-byte argument save area that is shadowed
by GPRs. Any extra floating-point arguments that no longer fit in FPRs
must now be passed in GPRs until we run out of those too.
Note that there was already code to pass floating-point arguments in
GPRs used with vararg parameters, which was done by writing the argument
out to the argument save area first and then reloading into GPRs. The
patch re-implements this, however, in favor of code packing float arguments
directly via extension/truncation, BITCAST, and BUILD_PAIR operations.
This is required to support the ELFv2 ABI, since we cannot unconditionally
write to the argument save area (which the caller might not have allocated).
The change does, however, affect ELFv1 varargs routines too; but even here
the overall effect should be advantageous: Instead of loading the argument
into the FPR, then storing the argument to the stack slot, and finally
reloading the argument from the stack slot into a GPR, the new code now
just loads the argument into the FPR, and subsequently loads the argument
into the GPR (via BITCAST). That BITCAST might imply a save/reload from
a stack temporary (in which case we're no worse than before); but it
might be implemented more efficiently in some cases.
The final part of the patch enables up to 8 FPRs and VRs for argument
return in PPCCallingConv.td; this is required to support returning
ELFv2 homogeneous aggregates. (Note that this doesn't affect other ABIs
since LLVM will only look for which register to use if the parameter is
marked as "direct" return anyway.)
Reviewed by Hal Finkel.
llvm-svn: 213493
This is a minor improvement in the ELFv2 ABI. In ELFv1, DWARF CFI
would represent a saved CR word (holding CR fields CR2, CR3, and CR4)
using just a single CFI record referring to CR2. In ELFv2 instead,
each of the CR fields is represented by its own CFI record. The
advantage is that the compiler can now choose to save just a single
(or two) CR fields instead of all of them, if those are the only ones
that actually need saving. That can lead to more efficient code using
mf(o)crf instead of the (slow) mfcr instruction.
Note that this patch does not (yet) implement this more efficient
code generation, but it does implement the part that is required to
be ABI compliant: creating multiple CFI records if multiple CR fields
are saved.
Reviewed by Hal Finkel.
llvm-svn: 213492
The ELFv2 ABI reduces the amount of stack required to implement an
ABI-compliant function call in two ways:
* the "linkage area" is reduced from 48 bytes to 32 bytes by
eliminating two unused doublewords
* the 64-byte "parameter save area" is now optional and need not be
present in certain cases (it remains mandatory in functions with
variable arguments, and functions that have any parameter that is
passed on the stack)
The following patch implements these required changes:
- reducing the linkage area, and associated relocation of the TOC save
slot, in getLinkageSize / getTOCSaveOffset (this requires updating all
callers of these routines to pass in the isELFv2ABI flag).
- (partially) handling the case where the parameter save area is optional
This latter part requires some extra explanation: Currently, we still
always allocate the parameter save area when *calling* a function.
That is certainly always compliant with the ABI, but may cause code to
allocate stack unnecessarily. This can be addressed by a follow-on
optimization patch.
On the *callee* side, in LowerFormalArguments, we *must* track
correctly whether the ABI guarantees that the caller has allocated
the parameter save area for our use, and the patch does so. However,
there is one complication: the code that handles incoming "byval"
arguments will currently *always* write to the parameter save area,
because it has to force incoming register arguments to the stack since
it must return an *address* to implement the byval semantics.
To fix this, the patch changes the LowerFormalArguments code to write
arguments to a freshly allocated stack slot on the function's own stack
frame instead of the argument save area in those cases where that area
is not present.
Reviewed by Hal Finkel.
llvm-svn: 213490
This patch builds upon the two preceding MC changes to implement the
basic ELFv2 function call convention. In the ELFv1 ABI, a "function
descriptor" was associated with every function, pointing to both the
entry address and the related TOC base (and a static chain pointer
for nested functions). Function pointers would actually refer to that
descriptor, and the indirect call sequence needed to load up both entry
address and TOC base.
In the ELFv2 ABI, there are no more function descriptors, and function
pointers simply refer to the (global) entry point of the function code.
Indirect function calls simply branch to that address, after loading it
up into r12 (as required by the ABI rules for a global entry point).
Direct function calls continue to just do a "bl" to the target symbol;
this will be resolved by the linker to the local entry point of the
target function if it is local, and to a PLT stub if it is global.
That PLT stub would then load the (global) entry point address of the
final target into r12 and branch to it. Note that when performing a
local function call, r2 must be set up to point to the current TOC
base: if the target ends up local, the ABI requires that its local
entry point is called with r2 set up; if the target ends up global,
the PLT stub requires that r2 is set up.
This patch implements all LLVM changes to implement that scheme:
- No longer create a function descriptor when emitting a function
definition (in EmitFunctionEntryLabel)
- Emit two entry points *if* the function needs the TOC base (r2)
anywhere (this is done in EmitFunctionBodyStart; note that this cannot
be done in EmitFunctionEntryLabel because the global entry point
prologue code must be *part* of the function as covered by debug info).
- In order to make use tracking of r2 (as needed above) work correctly,
mark direct function calls as implicitly using r2.
- Implement the ELFv2 indirect function call sequence (no function
descriptors; load target address into r12).
- When creating an ELFv2 object file, emit the .abiversion 2 directive
to tell the linker to create the appropriate version of PLT stubs.
Reviewed by Hal Finkel.
llvm-svn: 213489
The PPCTargetLowering::SelectAddressRegImm routine needs to handle
FrameIndex nodes in a special manner, by translating them into a
TargetFrameIndex node. This was done in most cases, but seems to
have been neglected in one path: when the input tree has an OR of
the FrameIndex with an immediate. This can happen if the FrameIndex
can be proven to be sufficiently aligned that an OR of that immediate
is equivalent to an ADD.
The missing handling of FrameIndex in that case caused the SelectionDAG
instruction selection to miss opportunities to merge the OR back into
the FrameIndex node, leading to superfluous addi/ori instructions in
the final assembler output.
llvm-svn: 213482
These instructions can only take a limited input range, and return
the constant value 1 outside that range. We should do range reduction to
be able to process arbitrary values. Use a FRACT instruction after
normalization to achieve this. Also add a test for constant folding
with the lowered code with unsafe-fp-math enabled.
v2: use DAG lowering instead of intrinsic, adapt test
v3: calculate constant, fold pattern into instruction definition
v4: misc style fixes, add sin-fold testcase, cosmetics
Patch by Grigori Goronzy
llvm-svn: 213458
Function @test3c should check that the DAGCombiner is able to fold a pair of
shuffles into a new shuffle with a permute mask of <6,7,2,3>. However, one of
the shuffles in @test3c had a wrong permute mask; this prevented the DAGCombiner
from folding the shuffles into the expected result.
Now that the shuffle mask is fixed, the backend correctly folds the two shuffles
in function @test3c into a single movhlps instruction.
llvm-svn: 213451
When performing a dynamic stack adjustment without optimisations, we would mark
SP as def and R4 as kill. This occurred as part of the expansion of a
WIN__CHKSTK SDNode which indicated the proper handling of SP and R4. The result
would be that we would double define SP as part of an operation, which is
obviously incorrect.
Furthermore, the VTList for the chain had an incorrect parameter type of i32
instead of Other.
Correct these to permit proper lowering of __builtin_alloca at -O0.
llvm-svn: 213442
This adds initial support for PPC32 ELF PIC (Position Independent Code; the
-fPIC variety), thus rectifying a long-standing deficiency in the PowerPC
backend.
Patch by Justin Hibbits!
llvm-svn: 213427
Because i16 is illegal, there's no native DAG method to
represent a bitcast to or from an f16 type. This meant LLVM was
inserting a stack store/load pair which is really not ideal.
llvm-svn: 213378
Actual support for softening f16 operations is still limited, and can be added
when it's needed. But Soften is much closer to being a useful thing to try
than keeping it Legal when no registers can actually hold such values.
Longer term, we probably want something between Soften and Promote semantics
for most targets, it'll be more efficient to promote the 4 basic operations to
f32 than libcall them.
llvm-svn: 213372
The post-indexed instructions were missing the constraint, causing unpredictable STR instructions to be emitted.
The earlyclobber constraint on the pre-indexed STR instructions is not strictly necessary, as the instruction selection for pre-indexed STR instructions goes through an additional layer of pseudo instructions which have the constraint defined. However, it doesn't hurt to specify the constraint directly on the pre-indexed instructions as well, since at some point someone might create instances of them programmatically, and then the constraint is definitely needed.
This fixes PR20323.
llvm-svn: 213369
Unfortunately, we don't seem to have a direct truncation, but the
extension can be legally split into two operations so we should
support that.
llvm-svn: 213357
Clang may well start emitting these soon, and while it may not be
directly relevant for OpenCL or GLSL, the instructions were just
sitting there waiting to be used.
llvm-svn: 213356
Since the result of a SETCC for X86 is 0 or -1 in each lane, we can
move unary operations, in this case [su]int_to_fp through the mask
operation and constant fold the operation away. Generally speaking:
UNARYOP(AND(VECTOR_CMP(x,y), constant))
--> AND(VECTOR_CMP(x,y), constant2)
where constant2 is UNARYOP(constant).
This implements the transform where UNARYOP is [su]int_to_fp.
For example, consider the simple function:
define <4 x float> @foo(<4 x float> %val, <4 x float> %test) nounwind {
%cmp = fcmp oeq <4 x float> %val, %test
%ext = zext <4 x i1> %cmp to <4 x i32>
%result = sitofp <4 x i32> %ext to <4 x float>
ret <4 x float> %result
}
Before this change, the SSE code is generated as:
LCPI0_0:
.long 1 ## 0x1
.long 1 ## 0x1
.long 1 ## 0x1
.long 1 ## 0x1
.section __TEXT,__text,regular,pure_instructions
.globl _foo
.align 4, 0x90
_foo: ## @foo
cmpeqps %xmm1, %xmm0
andps LCPI0_0(%rip), %xmm0
cvtdq2ps %xmm0, %xmm0
retq
After, the code is improved to:
LCPI0_0:
.long 1065353216 ## float 1.000000e+00
.long 1065353216 ## float 1.000000e+00
.long 1065353216 ## float 1.000000e+00
.long 1065353216 ## float 1.000000e+00
.section __TEXT,__text,regular,pure_instructions
.globl _foo
.align 4, 0x90
_foo: ## @foo
cmpeqps %xmm1, %xmm0
andps LCPI0_0(%rip), %xmm0
retq
The cvtdq2ps has been constant folded away and the floating point 1.0f
vector lanes are materialized directly via the ModRM operand of andps.
llvm-svn: 213342
Since the result of a SETCC for AArch64 is 0 or -1 in each lane, we can
move unary operations, in this case [su]int_to_fp through the mask
operation and constant fold the operation away. Generally speaking:
UNARYOP(AND(VECTOR_CMP(x,y), constant))
--> AND(VECTOR_CMP(x,y), constant2)
where constant2 is UNARYOP(constant).
This implements the transform where UNARYOP is [su]int_to_fp.
For example, consider the simple function:
define <4 x float> @foo(<4 x float> %val, <4 x float> %test) nounwind {
%cmp = fcmp oeq <4 x float> %val, %test
%ext = zext <4 x i1> %cmp to <4 x i32>
%result = sitofp <4 x i32> %ext to <4 x float>
ret <4 x float> %result
}
Before this change, the code is generated as:
fcmeq.4s v0, v0, v1
movi.4s v1, #0x1 // Integer splat value.
and.16b v0, v0, v1 // Mask lanes based on the comparison.
scvtf.4s v0, v0 // Convert each lane to f32.
ret
After, the code is improved to:
fcmeq.4s v0, v0, v1
fmov.4s v1, #1.00000000 // f32 splat value.
and.16b v0, v0, v1 // Mask lanes based on the comparison.
ret
The scvtf.4s has been constant folded away and the floating point 1.0f
vector lanes are materialized directly via fmov.4s.
Rather than do the folding manually in the target code, teach getNode()
in the generic SelectionDAG to handle folding constant operands of
vector [su]int_to_fp nodes. It is reasonable (as noted in a FIXME) to do
additional constant folding there as well, but I don't have test cases
for those operations, so leaving them for another time when it becomes
appropriate.
rdar://17693791
llvm-svn: 213341
We now consider the FPOpFusion flag when determining whether
to fuse ops. We also explicitly emit add.rn when fusion is
disabled to prevent ptxas from fusing the operations on its
own.
llvm-svn: 213287
This also uses TSFlags to mark machine instructions that are surface/texture
accesses, as well as the vector width for surface operations. This is used
to simplify some of the switch statements that need to detect surface/texture
instructions.
llvm-svn: 213256
Previously we asserted on this code. Currently compiler-rt doesn't
actually implement any of these new libcalls, but external help is
pretty much the only viable option for LLVM.
I've followed the much more generic "__truncST2" naming, as opposed to
the odd name for f32 -> f16 truncation. This can obviously be changed
later, or overridden by any targets that need to.
llvm-svn: 213252
x86 has no native ability to extend an f16 to f64, but the same result
is obtained if we expand it into two separate extensions: f16 -> f32
-> f64.
Unfortunately the same is not true for truncate, so that still results
in a compilation failure.
llvm-svn: 213251
This makes the two intrinsics @llvm.convert.from.f16 and
@llvm.convert.to.f16 accept types other than simple "float". This is
only strictly needed for the truncate operation, since otherwise
double rounding occurs and there's no way to represent the strict IEEE
conversion. However, for symmetry we allow larger types in the extend
too.
During legalization, we can expand an "fp16_to_double" operation into
two extends for convenience, but abort when the truncate isn't legal. A new
libcall is probably needed here.
Even after this commit, various target tweaks are needed to actually use the
extended intrinsics. I've put these into separate commits for clarity, so there
are no actual tests of f64 conversion here.
llvm-svn: 213248
Memory barrier __builtin_arm_[dmb, dsb, isb] intrinsics are required to
implement their corresponding ACLE and MSVC intrinsics.
This patch ports ARM dmb, dsb, isb intrinsic to AArch64.
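A minimal sketch of the AArch64 side, assuming the new intrinsics mirror the
ARM ones as described:
declare void @llvm.aarch64.dmb(i32)
declare void @llvm.aarch64.dsb(i32)
declare void @llvm.aarch64.isb(i32)
define void @full_barrier() {
  ; 15 encodes the SY (full system) option, as with the ARM intrinsics.
  call void @llvm.aarch64.dmb(i32 15)
  ret void
}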
Differential Revision: http://reviews.llvm.org/D4520
llvm-svn: 213247
This fixes an issue where a local value is defined before and used after an
inline asm call with side effects.
This fix simply flushes the local value map, which updates the insertion point
for the inline asm call to be above any previously defined local values.
This fixes <rdar://problem/17694203>
llvm-svn: 213203
We were not considering the stated alignment on vector loads/stores,
leading us to generate vector instructions even when we do not have
sufficient alignment.
Now, for IR like:
%1 = load <4 x float>, <4 x float>* %ptr, align 4
we will generate correct, conservative PTX like:
ld.f32 ... [%ptr]
ld.f32 ... [%ptr+4]
ld.f32 ... [%ptr+8]
ld.f32 ... [%ptr+12]
Or if we have an alignment of 8 (for example), we can
generate code like:
ld.v2.f32 ... [%ptr]
ld.v2.f32 ... [%ptr+8]
llvm-svn: 213186
Before this change, method 'isShuffleMaskLegal' didn't know that shuffles
implementing a 'movhlps' operation were perfectly legal for SSE targets.
This patch adds the missing check for 'isMOVHLPSMask' inside method
'isShuffleMaskLegal' to fix the problem.
The reason why it is important to do this is because the DAGCombiner
conservatively avoids combining a pair of shuffles if the resulting shuffle
node has an illegal mask. Before this patch, shuffles with a MOVHLPS mask were
wrongly considered not to be legal. This was the root cause of some poor code
generation bugs.
llvm-svn: 213137
The coalescer is very aggressive at propagating constraints on the register classes, and the register allocator doesn’t know how to split sub-registers later to recover. This patch provides an escape valve for targets that encounter this problem to limit coalescing.
This patch also implements such for ARM to lower register pressure when using lots of large register classes. This works around PR18825.
llvm-svn: 213078
v2: use ffbh/l if available
v3: Rebase on top of Matt's SI patches
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
Reviewed-by: Tom Stellard <tom@stellard.net>
llvm-svn: 213072
This patch adds two new rules to the DAGCombiner:
1. shuffle (shuffle A, Undef, M0), B, M1 -> shuffle A, B, M2
2. shuffle (shuffle A, Undef, M0), A, M1 -> shuffle A, Undef, M2
We only do this if the combined shuffle is legal for the target.
Example:
;;
define <4 x float> @test(<4 x float> %a, <4 x float> %b) {
%1 = shufflevector <4 x float> %a, <4 x float> undef, <4 x i32> <i32 6, i32 0, i32 1, i32 7>
%2 = shufflevector <4 x float> %1, <4 x float> %b, <4 x i32> <i32 1, i32 2, i32 4, i32 5>
ret <4 x float> %2
}
;;
(using llc -mcpu=corei7 -march=x86-64)
Before, the x86 backend generated:
pshufd $120, %xmm0, %xmm0
shufps $-108, %xmm0, %xmm1
movaps %xmm1, %xmm0
Now the x86 backend generates:
movsd %xmm1, %xmm0
llvm-svn: 213069
Instead of specifying 32-bit x86, specify 32-bit x86 linux.
This test is testing a very specific behavior which changed with
WinCOFF's constant pools.
llvm-svn: 213041
The constant pool entry code for WinCOFF assumed that vector constants
would be formed using ConstantDataVector; it did not expect to see a
ConstantVector. Furthermore, it did not expect undef as one of the
elements of the vector.
ConstantVectors should be handled like ConstantDataVectors, treating Undef
as zero.
llvm-svn: 213038
This helps avoid redundant instructions to unpack, and repack
the vectors. Ideally we could recognize that pattern and eliminate
it. Currently v4i8 and other small element type vectors are scalarized,
so this has the added bonus of avoiding that.
llvm-svn: 213031
Now functions 'test4', 'test9', 'test14' and 'test19' correctly perform
a move of two packed values from the high quadword of vector %b to the low
quadword of vector %a (movhlps idiom).
No functional change intended.
llvm-svn: 213029
This patch fixes a crasher in method 'DAGCombiner::visitOR' due to an invalid
call to method 'isShuffleMaskLegal'. On x86, method 'isShuffleMaskLegal'
always expects a legal vector value type in input.
With this patch, we immediately check if the input OR dag node has a legal
vector type; we only try to fold an OR dag node into a single shufflevector
if we know that the resulting shuffle will have a legal type.
This is to avoid calling method 'isShuffleMaskLegal' on a potentially
illegal vector value type.
Added a new test-case to file 'CodeGen/X86/combine-or.ll' to verify that
DAGCombiner doesn't crash in the attempt to check/combine an OR between shuffles
with illegal types.
llvm-svn: 213020
COFF lacks a feature that other object file formats support: mergeable
sections.
To work around this, MSVC sticks constant pool entries in special COMDAT
sections so that each constant is in its own section. This permits
unused constants to be dropped and it also allows duplicate constants in
different translation units to get merged together.
This fixes PR20262.
Differential Revision: http://reviews.llvm.org/D4482
llvm-svn: 213006
This patch teaches the DAGCombiner how to fold a pair of shuffles
according to rules:
1. shuffle(shuffle A, B, M0), B, M1) -> shuffle(A, B, M2)
2. shuffle(shuffle A, B, M0), A, M1) -> shuffle(A, B, M3)
The new rules would only trigger if the resulting shuffle has legal type and
legal mask.
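A sketch of rule 1 in action (the mask values are made up):
define <4 x float> @fold_pair(<4 x float> %a, <4 x float> %b) {
  %s0 = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
  ; shuffle(shuffle(A, B, M0), B, M1): the pair can fold to a single shuffle
  ; of %a and %b with mask <0, 5, 6, 7>, provided that mask is legal.
  %s1 = shufflevector <4 x float> %s0, <4 x float> %b, <4 x i32> <i32 0, i32 1, i32 6, i32 3>
  ret <4 x float> %s1
}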
Added test 'combine-vec-shuffle-3.ll' to verify that DAGCombiner correctly
folds shuffles on x86 when the resulting mask is legal. Also added some negative
cases to verify that we avoid introducing illegal shuffles.
llvm-svn: 213001
We would emit a libcall for a 64-bit atomic on x86 after SVN r212119. This was
due to the misuse of hasCmpxchg16 to indicate if cmpxchg8b was supported on a
32-bit target. They were added at different times and would result in the
border condition being mishandled.
This fixes the border case to emit the cmpxchg8b instruction for 64-bit atomic
operations on x86 at the cost of restoring a long-standing bug in the codegen.
We emit a cmpxchg8b on all x86 targets even where the CPU does not support this
instruction (pre-Pentium CPUs). Although this bug should be fixed, this was
present prior to SVN r212119 and this change, so this is not really introducing
a regression.
llvm-svn: 212956
We construct a temporary "atomicrmw xchg" instruction when lowering atomic
stores for widths that aren't supported natively. This isn't on the top-level
worklist though, so it won't be removed automatically and we have to do it
ourselves once that itself has been lowered.
Thanks Saleem for pointing this out!
llvm-svn: 212948
Summary:
This is because, under FP64A, the hardware will redirect 32-bit reads/writes
from/to odd-numbered registers to the upper 32 bits of the corresponding
even register, in effect simulating FR=0 mode when FR=0 mode is not
available.
Unfortunately, we have to make the decision to avoid mfc1/mtc1 before
register allocation so we currently do this for even registers too.
FPXX has a similar requirement on 32-bit architectures that lack
mfhc1/mthc1 so this patch also handles the affected moves from the FPU for
FPXX too. Moves to the FPU were supported by an earlier commit.
Differential Revision: http://reviews.llvm.org/D4484
llvm-svn: 212938
Summary:
This is similar to r210771 which did the same thing for MTHC1.
Also corrected MTHC1_D32 and MTHC1_D64 which used AFGR64 and FGR64 on the
wrong definitions.
Differential Revision: http://reviews.llvm.org/D4483
llvm-svn: 212936
enabled and mthc1 and dmtc1 are not available (e.g. on MIPS32r1)
This prevents the upper 32-bits of a double precision value from being moved to
the FPU with mtc1 to an odd-numbered FPU register. This is necessary to ensure
that the code generated executes correctly regardless of the current FPU mode.
MIPS32r2 and above continues to use mtc1/mthc1, while MIPS-IV and above continue
to use dmtc1.
Differential Revision: http://reviews.llvm.org/D4465
llvm-svn: 212930
This crash was pretty common while compiling Rust for iOS (armv7). Reason: the
SjLj preparation step was lowering aggregate arguments as ExtractValue +
InsertValue. ExtractValue has an assertion which checks that there is some data
in the value, which is not true in the case of empty (no fields) structures.
Rust uses them quite extensively, so this patch uses a 'select true, %val, undef'
instruction to lower the argument.
Patch by Valerii Hiora.
llvm-svn: 212922
Verify that DAGCombiner does not crash when trying to fold a pair of shuffles
according to rule (added at r212539):
(shuffle (shuffle A, Undef, M0), Undef, M1) -> (shuffle A, Undef, M2)
The DAGCombiner avoids folding shuffles if the resulting shuffle dag node
is not legal for the target. That means, the resulting shuffle must have
legal type and legal mask.
Before, the DAGCombiner only called method
'TargetLowering::isShuffleMaskLegal' to check if it was "safe" to fold according
to the above-mentioned rule. However, this caused a crash in the x86 backend
since method 'isShuffleMaskLegal' always expects to be called on a
legal vector type.
llvm-svn: 212915
This adds a llvm.aarch64.hint intrinsic to mirror the llvm.arm.hint in order to
support the various hint intrinsic functions in the ACLE.
Add an optional pattern field that permits the subclass to specify the pattern
that matches the selection. The intrinsic pattern is set as mayLoad, mayStore,
so overload the value for the definition of the hint instruction.
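A minimal sketch of the intrinsic in use (hint #1 corresponds to YIELD on
AArch64; the mapping of other immediates follows the architecture manual):
declare void @llvm.aarch64.hint(i32)
define void @spin_pause() {
  ; emits "hint #1" (yield), mirroring the existing llvm.arm.hint behaviour
  call void @llvm.aarch64.hint(i32 1)
  ret void
}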
llvm-svn: 212883