This change follows up defaults for GCC and Clang, so LLVM does not differ
from them. While number of the test files are touched with this change, they
all keep the old (expected) behaviour with the explicit option:
"-relocation-model=pic"
The tests that have not been touched are insensitive to relocation model.
Differential Revision: http://reviews.llvm.org/D17995
llvm-svn: 265949
This adds a conditional variant of CallBR instruction, CallBCR. Also,
it can be fused with integer comparisons, resulting in one of the new
C*BCall instructions.
In addition to CallBRCL limitations, this has another one: it won't
trigger if the function to call isn't already in %r1 - see f22 in the
test for an example (it's also why the loads in tests are volatile).
Author: koriakin
Differential Revision: http://reviews.llvm.org/D18928
llvm-svn: 265933
Extend the existing lowering of vXi8 multiplies to support v64i8 on avx512bw targets.
I added the Lower512IntArith helper function to help with this - not sure how often this could be used in the future, but it seemed better than putting all that logic inside LowerMUL.
Differential Revision: http://reviews.llvm.org/D18937
llvm-svn: 265902
Summary:
After we make the adjustment, we can assume that for local allocas, but
not for stack parameters, the return address, or any other fixed stack
object (which has a negative offset and therefore lies prior to the
adjusted SP).
Fixes PR26662.
Reviewers: hfinkel, qcolombet, rnk
Subscribers: rnk, llvm-commits
Differential Revision: http://reviews.llvm.org/D18471
llvm-svn: 265886
This patch adds support for decoding XOP VPPERM instruction when it represents a basic shuffle.
The mask decoding required the existing MCInstrLowering code to be updated to support binary shuffles - the implementation now matches what is done in X86InstrComments.cpp.
Differential Revision: http://reviews.llvm.org/D18441
llvm-svn: 265874
In Memcpy lowering we had missed a dependence from the load of the
operation to successor operations. This causes us to potentially
construct an in initial DAG with a memory dependence not fully
represented in the chain sub-DAG but rather require looking at the
entire DAG breaking alias analysis by allowing incorrect repositioning
of memory operations.
To work around this, r200033 changed DAGCombiner::GatherAllAliases to be
conservative if any possible issues to happen. Unfortunately this check
forbade many non-problematic situations as well. For example, it's
common for incoming argument lowering to add a non-aliasing load hanging
off of EntryNode. Then, if GatherAllAliases visited EntryNode, it would
find that other (unvisited) use of the EntryNode chain, and just give up
entirely. Furthermore, the check was incomplete: it would not actually
detect all such potentially problematic DAG constructions, because
GatherAllAliases did not guarantee to visit all chain nodes going up to
the root EntryNode. This is in general fine -- giving up early will just
miss a potential optimization, not generate incorrect results. But, for
this non-chain dependency detection code, it's possible that you could
have a load attached to a higher-up chain node than any which were
visited. If that load aliases your store, but the only dependency is
through the value operand of a non-aliasing store, it would've been
missed by this code, and potentially reordered.
With the dependence added, this check can be removed and Alias Analysis
can be much more aggressive. This fixes code quality regression in the
Consecutive Store Merge cleanup (D14834).
Test Change:
ppc64-align-long-double.ll now may see multiple serializations
of its stores
Differential Revision: http://reviews.llvm.org/D18062
llvm-svn: 265836
This adds a conditional variant of CallJG instruction, CallBRCL.
It can be used for conditional sibling calls. Unfortunately, due
to IfCvt limitations, it only really works well for functions without
arguments.
Author: koriakin
Differential Revision: http://reviews.llvm.org/D18864
llvm-svn: 265814
Added ISelDAGToDAG functions to enable selection of the smlawb, smlawt,
smulwb and smulwt instructions for the ARM backend. Also updated the smul
CodeGen test and removed the smulw one.
Differential Revision: http://reviews.llvm.org/D18892
llvm-svn: 265793
It caused PR27275: "ARM: Bad machine code: Using an undefined physical register"
Also reverting the following commits that were landed on top:
r265610 "Fix the compare-clang diff error introduced by r265547."
r265639 "Fix the sanitizer bootstrap error in r265547."
r265657 "InlineSpiller.cpp: Escap \@ in r265547. [-Wdocumentation]"
llvm-svn: 265790
This is the same change on PPC64 as r255821 on AArch64. I have even borrowed
his commit message.
The access function has a short entry and a short exit, the initialization
block is only run the first time. To improve the performance, we want to
have a short frame at the entry and exit.
We explicitly handle most of the CSRs via copies. Only the CSRs that are not
handled via copies will be in CSR_SaveList.
Frame lowering and prologue/epilogue insertion will generate a short frame
in the entry and exit according to CSR_SaveList. The majority of the CSRs will
be handled by register allcoator. Register allocator will try to spill and
reload them in the initialization block.
We add CSRsViaCopy, it will be explicitly handled during lowering.
1> we first set FunctionLoweringInfo->SplitCSR if conditions are met (the target
supports it for the given machine function and the function has only return
exits). We also call TLI->initializeSplitCSR to perform initialization.
2> we call TLI->insertCopiesSplitCSR to insert copies from CSRsViaCopy to
virtual registers at beginning of the entry block and copies from virtual
registers to CSRsViaCopy at beginning of the exit blocks.
3> we also need to make sure the explicit copies will not be eliminated.
Author: Tom Jablin (tjablin)
Reviewers: hfinkel kbarton cycheng
http://reviews.llvm.org/D17533
llvm-svn: 265781
It seems that llc cannot be called used in assembler tests so test that
checks asm for particular target needs to be moved to codegen.
llvm-svn: 265770
Summary:
EHPad BB are not entered the classic way and therefor do not need to be placed after their predecessors. This patch make sure EHPad BB are not chosen amongst successors to form chains, and are selected as last resort when selecting the best candidate.
EHPad are scheduled in reverse probability order in order to have them flow into each others naturally.
Reviewers: chandlerc, majnemer, rafael, MatzeB, escha, silvas
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D17625
llvm-svn: 265726
Standard load/store instructions with GLC bit set.
Reviewers: tstellardAMD, arsenm
Differential Revision: http://reviews.llvm.org/D18760
llvm-svn: 265709
Return is now considered a predicable instruction, and is converted
to a newly-added CondReturn (which maps to BCR to %r14) instruction by
the if conversion pass.
Also, fused compare-and-branch transform knows about conditional
returns, emitting the proper fused instructions for them.
This transform triggers on a *lot* of tests, hence the huge diffstat.
The changes are mostly jX to br %r14 -> bXr %r14.
Author: koriakin
Differential Revision: http://reviews.llvm.org/D17339
llvm-svn: 265689
http://reviews.llvm.org/D18562
A large number of testcases has been modified so they pass after this test.
One testcase is deleted, because I realized even after undoing the original
change that was committed with this testcase, the testcase still passes. So
I removed it. The change to one other testcase (test/CodeGen/PowerPC/pr25802.ll)
is an arbitrary change to keep it passing. Given the original intention of the
testcase, and the fact that fixing it will require some time to change the testcase,
we concluded that this quick change will be enough.
llvm-svn: 265683
Re-apply r265450 which caused PR27245 and was reverted in r265559
because of a wrong generalization: the fetch_and_add->add_and_fetch
combine only works in specific, but pretty common, cases:
(icmp slt x, 0) -> (icmp sle (add x, 1), 0)
(icmp sge x, 0) -> (icmp sgt (add x, 1), 0)
(icmp sle x, 0) -> (icmp slt (sub x, 1), 0)
(icmp sgt x, 0) -> (icmp sge (sub x, 1), 0)
Original Message:
We only generate LOCKed versions of add/sub when the result is unused.
It often happens that the result is used, but only by a comparison. We
can optimize those out by reusing EFLAGS, which lets us use the proper
instructions, instead of having to fallback to LXADD.
Instead of doing this as an MI peephole (as we do for the other
non-LOCKed (really, non-MR) forms), do it in ISel. It becomes quite
tricky later.
This also makes it eventually possible to stop expanding and/or/xor
if the only user is an icmp (also see D18141).
This uses the LOCK ISD opcodes added by r262244.
Differential Revision: http://reviews.llvm.org/D17633
llvm-svn: 265636
Third time's the charm? The previous attempt (r265345) caused ASan test
failures on X86, as broken CFI caused stack traces to not work.
This version of the patch makes sure not to merge with stack adjustments
that have CFI, and to not add merged instructions' offests to the CFI
about to be generated.
This is already covered by the lit tests; I just got the expectations
wrong previously.
llvm-svn: 265623
http://reviews.llvm.org/D18405
When the integer value loaded is never used directly as integer we should use VSX
or Floating Point Facility integer loads and avoid extra direct move
llvm-svn: 265593
This makes it possible to distinguish between mesa shaders
and other kernels even in the presence of compute shaders.
Patch By: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Differential Revision: http://reviews.llvm.org/D18559
llvm-svn: 265589
when DenseMap growed and moved memory. I verified it fixed the bootstrap
problem on x86_64-linux-gnu but I cannot verify whether it fixes
the bootstrap error on clang-ppc64be-linux. I will watch the build-bot
result closely.
Replace analyzeSiblingValues with new algorithm to fix its compile
time issue. The patch is to solve PR17409 and its duplicates.
analyzeSiblingValues is a N x N complexity algorithm where N is
the number of siblings generated by reg splitting. Although it
causes siginificant compile time issue when N is large, it is also
important for performance since it removes redundent spills and
enables rematerialization.
To solve the compile time issue, the patch removes analyzeSiblingValues
and replaces it with lower cost alternatives containing two parts. The
first part creates a new spill hoisting method in postOptimization of
register allocation. It does spill hoisting at once after all the spills
are generated instead of inside every instance of selectOrSplit. The
second part queries the define expr of the original register for
rematerializaiton and keep it always available during register allocation
even if it is already dead. It deletes those dead instructions only in
postOptimization. With the two parts in the patch, it can remove
analyzeSiblingValues without sacrificing performance.
Differential Revision: http://reviews.llvm.org/D15302
llvm-svn: 265547
This patch enable sibling call optimization on ppc64 ELFv1/ELFv2 abi, and
add a couple of test cases. This patch also passed llvm/clang bootstrap
test, and spec2006 build/run/result validation.
Original issue: https://llvm.org/bugs/show_bug.cgi?id=25617
Great thanks to Tom's (tjablin) help, he contributed a lot to this patch.
Thanks Hal and Kit's invaluable opinions!
Reviewers: hfinkel kbarton
http://reviews.llvm.org/D16315
llvm-svn: 265506
While preserving the return value for @llvm.experimental.deoptimize at
the IR level is useful during mid-level optimization, doing so at the
machine instruction level requires generating some extra code and a
return that is non-ideal. This change has LLVM lower
```
%val = call @llvm.experimental.deoptimize
ret %val
```
to effectively
```
call @__llvm_deoptimize()
unreachable
```
instead.
llvm-svn: 265502
Bionic has a defined thread-local location for the stack protector
cookie. Emit a direct load instead of going through __stack_chk_guard.
llvm-svn: 265481
We only generate LOCKed versions of add/sub when the result is unused.
It often happens that the result is used, but only by a comparison. We
can optimize those out by reusing EFLAGS, which lets us use the proper
instructions, instead of having to fallback to LXADD.
Instead of doing this as an MI peephole (as we do for the other
non-LOCKed (really, non-MR) forms), do it in ISel. It becomes quite
tricky later.
This also makes it eventually possible to stop expanding and/or/xor
if the only user is an icmp (also see D18141).
This uses the LOCK ISD opcodes added by r262244.
Differential Revision: http://reviews.llvm.org/D17633
llvm-svn: 265450
utils/update_test_checks.py was improved with:
http://reviews.llvm.org/rL265414
to include the first line of the function (expected to be
a comment line). This ensures that nothing bad has happened
before the first actual line of checked asm. It also matches
the existing behavior of the old script.
llvm-svn: 265416
Summary: LanaiSetflagAluCombiner could previously combine instructions across basic building blocks even when not legal. Make the LanaiSetflagAluCombiner more conservative to avoid this.
Reviewers: eliben
Subscribers: joker.eph, llvm-commits
Differential Revision: http://reviews.llvm.org/D18746
llvm-svn: 265411
Presently, CodeGenPrepare deletes all nearly empty (only phi and branch)
basic blocks. This pass can delete loop preheaders which frequently creates
critical edges. A preheader can be a convenient place to spill registers to
the stack. If the entrance to a loop body is a critical edge, then spills
may occur in the loop body rather than immediately before it. This patch
protects loop preheaders from deletion in CodeGenPrepare even if they are
nearly empty.
Since the patch alters the CFG, it affects a large number of test cases.
In most cases, the changes are merely cosmetic (basic blocks have different
names or instruction orders change slightly). I am somewhat concerned about
the test/CodeGen/Mips/brdelayslot.ll test case. If the loop preheader is not
deleted, then the MIPS backend does not take advantage of a branch delay
slot. Consequently, I would like some close review by a MIPS expert.
The patch also partially subsumes D16893 from George Burgess IV. George
correctly notes that CodeGenPrepare does not actually preserve the dominator
tree. I think the dominator tree was usually not valid when CodeGenPrepare
ran, but I am using LoopInfo to mark preheaders, so the dominator tree is
now always valid before CodeGenPrepare.
Author: Tom Jablin (tjablin)
Reviewers: hfinkel george.burgess.iv vkalintiris dsanders kbarton cycheng
http://reviews.llvm.org/D16984
llvm-svn: 265397
This patch adds support for compact jumps similiar to the previous compact
branch support for MIPSR6. Unlike compact branches, compact jumps do not
have a forbidden slot.
As MipsInstrInfo::getEquivalentCompactForm can determine the correct
expansion for jumps and branches for both microMIPS and MIPSR6, remove the
unnecessary distinction in the delay slot filler.
Reviewers: vkalintiris
Subscribers: llvm-commits, dsanders
llvm-svn: 265390
Summary:
I encountered this issue when constant folding during inlining tried to
fold away a bitcast of a double to an x86_mmx, which is not an integral
type. The test case exposes the same issue with a smaller code snippet
during early CSE.
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D18528
llvm-svn: 265367
We missed a handful of .mir tests that existed outside the
test/CodeGen/MIR directory.
Also fix the three powerpc .mir tests that nobody noticed were broken.
llvm-svn: 265350
The original commit miscompiled things on 32-bit Windows, e.g. a Clang
boostrap. It turns out that mergeSPUpdates() was a bit too generous in
what it interpreted as a stack adjustment, causing the following code:
addl $12, %esp
leal -4(%ebp), %esp
To be "optimized" into simply:
addl $8, %esp
This commit tightens up mergeSPUpdates() and includes a new test
(test14 in movtopush.ll) for this situation.
llvm-svn: 265345
Adds some dllexport tests to verify that:
- Variables in bss are exported appropriately
- Non-dllexport symbols aliased to dllexport symbols are not exported
- Symbols declared as dllexport but are not defined are not exported
We plan to enable dllimport/dllexport support for the PS4, and these
additional tests are for points we noticed in our internal testing.
Patch by Warren Ristow!
Differential Revision: http://reviews.llvm.org/D18682
llvm-svn: 265333
We can only perform a tail call to a callee that preserves all the
registers that the caller needs to preserve.
This situation happens with calling conventions like preserver_mostcc or
cxx_fast_tls. It was explicitely handled for fast_tls and failing for
preserve_most. This patch generalizes the check to any calling
convention.
Related to rdar://24207743
Differential Revision: http://reviews.llvm.org/D18680
llvm-svn: 265329
Summary:
This adds the same checks that were added in r264593 to all
target-specific passes that run after register allocation.
Reviewers: qcolombet
Subscribers: jyknight, dsanders, llvm-commits
Differential Revision: http://reviews.llvm.org/D18525
llvm-svn: 265313
time issue. The patch is to solve PR17409 and its duplicates.
analyzeSiblingValues is a N x N complexity algorithm where N is
the number of siblings generated by reg splitting. Although it
causes siginificant compile time issue when N is large, it is also
important for performance since it removes redundent spills and
enables rematerialization.
To solve the compile time issue, the patch removes analyzeSiblingValues
and replaces it with lower cost alternatives containing two parts. The
first part creates a new spill hoisting method in postOptimization of
register allocation. It does spill hoisting at once after all the spills
are generated instead of inside every instance of selectOrSplit. The
second part queries the define expr of the original register for
rematerializaiton and keep it always available during register allocation
even if it is already dead. It deletes those dead instructions only in
postOptimization. With the two parts in the patch, it can remove
analyzeSiblingValues without sacrificing performance.
Differential Revision: http://reviews.llvm.org/D15302
llvm-svn: 265309
A cross-thread sequentially consistent fence should be lowered into
z/Architecture's BCR serialization instruction, instead of causing a
fatal error in the back-end.
Author: bryanpkc
Differential Revision: http://reviews.llvm.org/D18644
llvm-svn: 265292
Enable the SystemZ back-end to lower FRAMEADDR and RETURNADDR, which
previously would cause the back-end to crash. Currently, only a
frame count of zero is supported.
Author: bryanpkc
Differential Revision: http://reviews.llvm.org/D18514
llvm-svn: 265291
Implemented truncstore for KNL and skylake-avx512.
Covered vectors from v2i1 to v64i1. We save the value in bits (not in bytes) - v32i1 is saved in 4 bytes.
Differential Revision: http://reviews.llvm.org/D18740
llvm-svn: 265283
Was there really no other way to splat a byte in SSE2?
punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw {{.*#+}} xmm0 = xmm0[0,0,0,0,4,5,6,7]
pshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
llvm-svn: 265172
Summary:
Implement BUFFER_ATOMIC_CMPSWAP{,_X2} instructions on all GCN targets, and FLAT_ATOMIC_CMPSWAP{,_X2} on CI+.
32-bit instruction variants tested manually on Kabini and Bonaire. Tests and parts of code provided by Jan Veselý.
Patch by: Vedran Miletić
Reviewers: arsenm, tstellarAMD, nhaehnle
Subscribers: jvesely, scchan, kanarayan, arsenm
Differential Revision: http://reviews.llvm.org/D17280
llvm-svn: 265170
Note however that this is identical to the existing SSE2 run.
What we really want is yet another run for an SSE2 machine that
also has fast unaligned 16-byte accesses.
llvm-svn: 265167
Follow-up to http://reviews.llvm.org/D18566 and http://reviews.llvm.org/D18676 -
where we noticed that an intermediate splat was being generated for memsets of
non-zero chars.
That was because we told getMemsetStores() to use a 32-bit vector element type,
and it happily obliged by producing that constant using an integer multiply.
The 16-byte test that was added in D18566 is now equivalent for AVX1 and AVX2
(no splats, just a vector load), but we have PR27141 to track that splat difference.
Note that the SSE1 path is not changed in this patch. That can be a follow-up.
This patch should resolve PR27100.
llvm-svn: 265161
Follow-up to D18566 - where we noticed that an intermediate splat was being
generated for memsets of non-zero chars.
That was because we told getMemsetStores() to use a 32-bit vector element type,
and it happily obliged by producing that constant using an integer multiply.
The tests that were added in the last patch are now equivalent for AVX1 and AVX2
(no splats, just a vector load), but we have PR27141 to track that splat difference.
In the new tests, the splat via shuffling looks ok to me, but there might be some
room for improvement depending on uarch there.
Note that the SSE1/2 paths are not changed in this patch. That can be a follow-up.
This patch should resolve PR27100.
Differential Revision: http://reviews.llvm.org/D18676
llvm-svn: 265148
This changes some dllexport tests, to verify that some symbols that
should not be exported are not, in a way that improves the robustness
of CHECK-SAME interaction with CHECK-NOT.
We plan to enable dllimport/dllexport support for the PS4, and these
changes are for points we noticed in our internal testing.
Patch by Warren Ristow!
llvm-svn: 265106
Re-enable an assertion enabled by Justin Lebar in rL265092. rL265092
was breaking test/CodeGen/X86/deopt-intrinsic.ll because webkit_jscc
does not like non-i64 return types. Change the test case to not do
that.
llvm-svn: 265099
Previously, HandleLastUse would delete RegRef information for sub-registers
if they were dead even if their corresponding super-register were still live.
If the super-register were later renamed, then the definitions of the
sub-register would not be updated appropriately. This patch alters the
behavior so that RegInfo information for sub-registers is only deleted when
the sub-register and super-register are both dead.
This resolves PR26775. This is the mirror image of Hal's r227311 commit.
Author: Tom Jablin (tjablin)
Reviewers: kbarton uweigand nemanjai hfinkel
http://reviews.llvm.org/D18448
llvm-svn: 265097
Summary:
Previously the NVVMReflect pass would read its configuration from
command-line flags or a static configuration given to the pass at
instantiation time.
This doesn't quite work for clang's use-case. It needs to pass a value
for __CUDA_FTZ down on a per-module basis. We use a module flag for
this, so the NVVMReflect pass needs to be updated to read said flag.
Reviewers: tra, rnk
Subscribers: cfe-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D18672
llvm-svn: 265090
This mostly cosmetic patch moves the DebugEmissionKind enum from DIBuilder
into DICompileUnit. DIBuilder is not the right place for this enum to live
in — a metadata consumer should not have to include DIBuilder.h.
I also added a Verifier check that checks that the emission kind of a
DICompileUnit is actually legal.
http://reviews.llvm.org/D18612
<rdar://problem/25427165>
llvm-svn: 265077
Print aliases in topological order, that is, for any alias a = b,
b must be printed before a. This is because on some targets (e.g. PowerPC)
linker expects aliases in such an order to generate correct TOC information.
GCC also prints aliases in topological order.
llvm-svn: 265064
Summary:
This change will allow loads with imp-def to be clustered in machine-scheduler pass.
areMemAccessesTriviallyDisjoint() can also handle loads with imp-def.
Reviewers: mcrosier, jmolloy, t.p.northover
Subscribers: aemerson, rengolin, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D18665
llvm-svn: 265051
Chapter 3 of the QPX manual states that, "Scalar floating-point load
instructions, defined in the Power ISA, cause a replication of the source data
across all elements of the target register." Thus, if we have a load followed
by a QPX splat (from the first lane), the splat is redundant. This adds a late
MI-level pass to remove the redundant splats in some of these cases
(specifically when both occur in the same basic block).
This optimization is scheduled just prior to post-RA scheduling. It can't happen
before anything that might replace the load with some already-computed quantity
(i.e. store-to-load forwarding).
llvm-svn: 265047
We don't really support non-constant shuffle masks, but these tests are for cases where BUILD_VECTOR is made up from vector extracts (as well as undef/zero scalars).
llvm-svn: 265045
The test case added in r265023 is failing on ninja-x64-msvc-RA-centos6.
Update the test to make less specific assumptions on code generation.
llvm-svn: 265026
PPCSimplifyAddress contains this code:
IntegerType *OffsetTy = ((VT == MVT::i32) ? Type::getInt32Ty(*Context)
: Type::getInt64Ty(*Context));
to determine the type to be used for an index register, if one needs
to be created. However, the "VT" here is the type of the data being
loaded or stored, *not* the type of an address. This means that if
a data element of type i32 is accessed using an index that does not
not fit into 32 bits, a wrong address is computed here.
Note that PPCFastISel is only ever used on 64-bit currently, so the type
of an address is actually *always* MVT::i64. Other parts of the code,
even in this same PPCSimplifyAddress routine, already rely on that fact.
Thus, this patch changes the code to simply unconditionally use
Type::getInt64Ty(*Context) as OffsetTy.
llvm-svn: 265023
Summary:
This change will handle missing store pair opportunity where the first store
instruction stores zero followed by the non-zero store. For example, this change
will convert :
str wzr, [x8]
str w1, [x8, #4]
into:
stp wzr, w1, [x8]
Reviewers: jmolloy, t.p.northover, mcrosier
Subscribers: flyingforyou, aemerson, rengolin, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D18570
llvm-svn: 265021
The fast isel pass currently emits a COPY_TO_REGCLASS node to convert
from a F4RC to a F8RC register class during conversion of a
floating-point number to integer. There is actually no support in the
common code instruction printers to emit COPY_TO_REGCLASS nodes, so the
PowerPC back-end has special code there to simply ignore
COPY_TO_REGCLASS.
This is correct *if and only if* the source and destination registers of
COPY_TO_REGCLASS are the same (except for the different register class).
But nothing guarantees this to be the case, and if the register
allocator does end up allocating source and destination to different
registers after all, the back-end simply generates incorrect code. I've
included a test case that shows such incorrect code generation.
However, it seems that COPY_TO_REGCLASS is actually not intended to be
used at the MI layer at all. It is used during SelectionDAG, but always
lowered to a plain COPY before emitting MI. Other back-end's fast isel
passes never emit COPY_TO_REGCLASS at all. I suspect it is simply wrong
for the PowerPC back-end to emit it here.
This patch changes the PowerPC back-end to directly emit COPY instead of
COPY_TO_REGCLASS and removes the special handling in the instruction
printers.
Differential Revision: http://reviews.llvm.org/D18605
llvm-svn: 265020
When dealing with complex<float>, and similar structures with two
single-precision floating-point numbers, especially when such things are being
passed around by value, we'll sometimes end up loading both float values by
extracting them from one 64-bit integer load. It looks like this:
t13: i64,ch = load<LD8[%ref.tmp]> t0, t6, undef:i64
t16: i64 = srl t13, Constant:i32<32>
t17: i32 = truncate t16
t18: f32 = bitcast t17
t19: i32 = truncate t13
t20: f32 = bitcast t19
The problem, especially before the P8 where those bitcasts aren't legal (and
get expanded via the stack), is that it would have been better to use two
floating-point loads directly. Here we add a target-specific DAGCombine to do
just that. In short, we turn:
ld 3, 0(5)
stw 3, -8(1)
rldicl 3, 3, 32, 32
stw 3, -4(1)
lfs 3, -4(1)
lfs 0, -8(1)
into:
lfs 3, 4(5)
lfs 0, 0(5)
llvm-svn: 264988
The test case was defining and using a function 'notExported()', but
the FileCheck checks were checking for the name 'not_exported'. This
changes the test to use 'notExported' across the board. Also, the test
defined a function 'not_defined()', but doesn't have any checks related
to it. For consistency, this name is changed to 'notDefined'. A later
commit will add checks for 'notDefined'.
Patch by Warren Ristow!
llvm-svn: 264984
The size savings are significant, and from what I can tell, both ICC and GCC do this.
Differential Revision: http://reviews.llvm.org/D18573
llvm-svn: 264966
Fix for issue introduced D17297, where we were breaking early from the loop detecting consecutive loads which could leave us thinking a consecutive load with zeros was possible.
llvm-svn: 264922
Summary:
This results in higher register usage, but should make it easier for
the compiler to hide latency.
This pass is a prerequisite for some more scheduler improvements, and I
think the increase register usage with this patch is acceptable, because
when combined with the scheduler improvements, the total register usage
will decrease.
shader-db stats:
2382 shaders in 478 tests
Totals:
SGPRS: 48672 -> 49088 (0.85 %)
VGPRS: 34148 -> 34847 (2.05 %)
Code Size: 1285816 -> 1289128 (0.26 %) bytes
LDS: 28 -> 28 (0.00 %) blocks
Scratch: 492544 -> 573440 (16.42 %) bytes per wave
Max Waves: 6856 -> 6846 (-0.15 %)
Wait states: 0 -> 0 (0.00 %)
Depends on D18451
Reviewers: nhaehnle, arsenm
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D18452
llvm-svn: 264876
XOP's VPPERM has some great 'permute operations' that it can do as well as part of shuffling the bytes of a 128-bit vector - in this case we use it to perform BITREVERSE in a single instruction.
llvm-svn: 264870
We are currently doing a REALLY bad job of packing results of vector comparisons into the legalized <X x i1> result equivalents - a mixture of PACKSS/PMOVMSKB would be much better here.
llvm-svn: 264867
operations.
Specifically, we had code that tried to badly approximate reconstructing
all of the possible variations on addressing modes in two x86
instructions based on those in one pseudo instruction. This is not the
first bug uncovered with doing this, so stop doing it altogether.
Instead generically and pedantically copy every operand from the address
over to both new instructions, and strip kill flags from any register
operands.
This fixes a subtle bug seen in the wild where we would mysteriously
drop parts of the addressing mode, causing for example the index
argument in the added test case to just be completely ignored.
Hypothetically, this was an extremely bad miscompile because it actually
caused a predictable and leveragable write of a 64bit quantity to an
unintended offset (the first element of the array intead of whatever
other element was intended). As a consequence, in theory this could even
have introduced security vulnerabilities.
However, this was only something that could happen with an atomic
floating point add. No other operation could trigger this bug, so it
seems extremely unlikely to have occured widely in the wild.
But it did in fact occur, and frequently in scientific applications
which were using relaxed atomic updates of a floating point value after
adding a delta. Those would end up being quite badly miscompiled by
LLVM, which is how we found this. Of course, this often looks like
a race condition in the code, but it was actually a miscompile.
I suspect that this whole RELEASE_FADD thing was a complete mistake.
There is no such operation, and I worry that anything other than add
will get remarkably worse codegeneration. But that's not for this
change....
llvm-svn: 264845
1. Removed the run line for mingw32 and made the Darwin triples unknown.
This is a test of 32-bit vs. 64-bit platform and the underlying hardware.
We have other tests for checking behavioral differences of the OS platform.
2. Changed the CPU specifiers to the attributes they were meant to represent.
Any CPU that doesn't have SSE4.2 is assumed to have slow unaligned 16-byte accesses,
so it won't use those here.
3. Although the stores really could all be CHECK-DAG, I left them as CHECK-NEXT to
show the strange behavior of the instruction scheduler in the SLOW_32 case.
4. The odd-looking instructions are due to the use of a null pointer in the IR, so
we have integer immediate store addresses. Cute.
llvm-svn: 264796
Add function soft attribute to the generation of Jump Tables in CodeGen
as initial step towards clang support of gcc's no-jump-table support
Reviewers: hans, echristo
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D18321
llvm-svn: 264756
Fixed fp_to_uint instruction selection on KNL.
One pattern was missing for <4 x double> to <4 x i32>
Differential Revision: http://reviews.llvm.org/D18512
llvm-svn: 264701
Instead of using two feature bits, one to indicate the availability of the
popcnt[dw] instructions, and another to indicate whether or not they're fast,
use a single enum. This allows more consistent control via target attribute
strings, and via Clang's command line.
llvm-svn: 264690
Minimum density for both optsize and non optsize are now options
-sparse-jump-table-density (default 10) for non optsize functions
-dense-jump-table-density (default 40) for optsize functions, which
matches the current default. This improves several benchmarks at google
at the cost of a small codesize increase. For code compiled with -Os,
the old behavior continues
llvm-svn: 264689