This commit was already commited as revision rL208689 and discussd in
phabricator revision D3704.
But the test file was crashing on OS X and windows.
I fixed the test file in the same way as in rL208340.
llvm-svn: 208711
compared to 'AddrMode.BaseReg'. In the case that 'AddrMode.BaseReg' is
nullptr, 'Result' will also be nullptr, so the cast causes an assertion. We
should use dyn_cast_or_null here to check 'Result' is not null and it is an
instruction.
Bug found by Mats Petersson, and I reduced his IR to get a test case.
llvm-svn: 208705
Normally, patterns like (add x, (setcc cc ...)) will be folded into
(csel x, x+1, not cc). However, if there is a ZEXT after SETCC, they
won't be folded. This patch recognizes the ZEXT and allows the
generation of CSINC.
This patch fixes bug 19680.
llvm-svn: 208660
Right now the load may not get DCE'd because of the side-effect of updating
the base pointer.
This can happen if we lower a read-modify-write of an illegal larger type
(e.g. i48) such that the modification only affects one of the subparts (the
lower i32 part but not the higher i16 part). See the testcase.
In order to spot the dead load we need to revisit it when SimplifyDemandedBits
decided that the value of the load is masked off. This is the
CommitTargetLoweringOpt piece.
I checked compile time with ARM64 by sending SPEC bitcode files through llc.
No measurable change.
Fixes <rdar://problem/16031651>
llvm-svn: 208640
r208453 added support for having sret on the second parameter. In that
change, the code for copying sret into a virtual register was hoisted
into the loop that lowers formal parameters. This caused a "Wrong
topological sorting" assertion failure during scheduling when a
parameter is passed in memory. This change undoes that by creating a
second loop that deals with sret.
I'm worried that this fix is incomplete. I don't fully understand the
dependence issues. However, with this change we produce the same DAGs
we used to produce, so if they are broken, they are just as broken as
they have always been.
llvm-svn: 208637
Tested by comparing make check VERBOSE=1 before and after to make sure
no tests are missed. (VERBOSE=1 prints the list of tests.)
Only one test :( remains where .cpp is required:
tools/llvm-cov/range_based_for.cpp:// RUN: llvm-cov range_based_for.cpp | FileCheck %s --check-prefix=STDOUT
The topic was discussed in this thread:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140428/214905.html
llvm-svn: 208621
The current patterns for REV16 misses mostn __builtin_bswap16() due to
legalization promoting the operands to from load/stores toi32s and then
truncing/extending them. This patch adds new patterns that catch the resultant
DAGs and codegens them to rev16 instructions. Tests included.
rdar://15353652
llvm-svn: 208620
1) Changed gather and scatter intrinsics. Now they are aligned with GCC built-ins. There is no more non-masked form. Masked intrinsic receives -1 if all lanes are executed.
2) I changed the function that works with intrinsics inside X86ISelLowering.cpp. I put all intrinsics in one table. I did it for INTRINSICS_W_CHAIN and plan to put all intrinsics from WO_CHAIN set to the same table in order to avoid the long-long "switch". (I wanted to use static map initialization that allowed by C++11 but I wasn't able to compile it on VS2012).
3) I added gather/scatter prefetch intrinsics.
4) I fixed MRMm encoding for masked instructions.
llvm-svn: 208522
Support for the intrinsics that read from and write to global named registers
is added for r1, r2 and r13 (depending on the subtarget).
llvm-svn: 208509
The counter-loops formation pass needs to know what operations might be
function calls (because they can't appear in counter-based loops). On PPC32,
128-bit shifts might be runtime calls (even though you can't use __int128 on
PPC32, it seems that SROA might form them).
Fixes PR19709.
llvm-svn: 208501
When lowering build_vector to an insertps, we would still lower it, even
if the source vectors weren't v4x32. This would break on avx if the source
was a v8x32. We now check the type of the source vectors.
llvm-svn: 208487
We were swapping the true & false results while testing for FMAX/FMIN,
but not putting them back to the original state if the later checks
failed.
Should fix PR19700.
llvm-svn: 208469
This reverts commit r200561.
This calling convention was an attempt to match the MSVC C++ ABI for
methods that return structures by value. This solution didn't scale,
because it would have required splitting every CC available on Windows
into two: one for methods and one for free functions.
Now that we can put sret on the second arg (r208453), and Clang does
that (r208458), revert this hack.
llvm-svn: 208459
MSVC always places the implicit sret parameter after the implicit this
parameter of instance methods. We used to handle this for
x86_thiscallcc by allocating the sret parameter on the stack and leaving
the this pointer in ecx, but that doesn't handle alternative calling
conventions like cdecl, stdcall, fastcall, or the win64 convention.
Instead, change the verifier to allow sret on the second parameter.
This also requires changing the Mips and X86 backends to return the
argument with the sret parameter, instead of assuming that the sret
parameter comes first.
The Sparc backend also returns sret parameters in a register, but I
wasn't able to update it to handle secondary sret parameters. It
currently calls report_fatal_error if you feed it an sret in the second
parameter.
Reviewers: rafael.espindola, majnemer
Differential Revision: http://reviews.llvm.org/D3617
llvm-svn: 208453
This patch adds support to ARM for custom lowering of the
llvm.{u|s}add.with.overflow.i32 intrinsics for i32/i64. This is particularly useful
for handling idiomatic saturating math functions as generated by
InstCombineCompare.
Test cases included.
rdar://14853450
llvm-svn: 208435
When using the ARM AAPCS, HFAs (Homogeneous Floating-point Aggregates) must
be passed in a block of consecutive floating-point registers, or on the stack.
This means that unused floating-point registers cannot be back-filled with
part of an HFA, however this can currently happen. This patch, along with the
corresponding clang patch (http://reviews.llvm.org/D3083) prevents this.
llvm-svn: 208413
Summary:
Adds MIPS32r6/MIPS64r6 and checks the compatibility requirements for these
processors.
I've also included comments to describe removed and re-encoded instructions,
along with placeholder def's for the new instructions but there are no
functional changes to codegen at this point.
Reviewers: jkolek, vmedic
Reviewed By: vmedic
Differential Revision: http://reviews.llvm.org/D3622
llvm-svn: 208399
Handle lowering of global addresses for PIC mode compilation on Windows. Always
use the movw/movt load to load the address as Windows on ARM requires ARMv7+ and
is a pure Thumb environment.
llvm-svn: 208385
Summary:
Also ran clang-format on the function. The code added is the last else
if block.
Reviewers: nadav, craig.topper, delena
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D3518
llvm-svn: 208372
This patch teaches the backend how to combine packed SSE2/AVX2 arithmetic shift
intrinsics.
The rules are:
- Always fold a packed arithmetic shift by zero to its first operand;
- Convert a packed arithmetic shift intrinsic dag node into a ISD::SRA only if
the shift count is known to be smaller than the vector element size.
This patch also teaches to function 'getTargetVShiftByConstNode' how fold
target specific vector shifts by zero.
Added two new tests to verify that the DAGCombiner is able to fold
sequences of SSE2/AVX2 packed arithmetic shift calls.
llvm-svn: 208342
When building on Windows, the default target is Windows. Windows on ARM does
not support ARM mode compilation, resulting in test failures. Simply specify a
triple to ensure that we are testing the correct behaviour.
llvm-svn: 208340
Summary:
Vectors built with zeros and elements in the same order as another
(source) vector are optimized to be built using a single insertps
instruction.
Also optimize when we move one element in a vector to a different place
in that vector while zeroing out some of the other elements.
Further optimizations are possible, described in TODO comments.
I will be implementing at least some of them in the near future.
Added some tests for different cases where this optimization triggers.
Reviewers: nadav, delena, craig.topper
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D3521
llvm-svn: 208271
Prior to r208252, the FMA 231 family was marked as isCommutable. However the
memory variants of this family are not commutable. Therefore, we did not
implemented the findCommutedOpIndices for those variants and missed that
the default implementation (more or less: commute indices 1 and 2) was
firing behind our back.
As a result, as demonstrated in the test case before the fix, we were
transforming a = b * c + a into a = a * c + b.
I.e., before r208252 we were generating for this test case:
vmovaps %xmm0, %xmm1
vmoss (%rsi), %xmm0
vfmadd231ss (%rdi), %xmm1, %xmm0
Instead of:
vmoss (%rsi), %xmm1
vfmadd231ss (%rdi), %xmm1, %xmm0
<rdar://problem/16800495>
llvm-svn: 208260
this patch disables the dead register elimination pass and the load/store pair
optimization pass at -O0. The ILP optimizations don't require the optimization
level to be checked because the call to addILPOpts is predicated with the
necessary check. The AdvSIMDScalar pass is disabled by default at all
optimization levels. This patch leaves that pass disabled by default.
Also, move command-line options into ARM64TargetMachine.cpp and add a few
additional flags to aid in debugging. This fixes an issue with the
-debug-pass=Structure flag where passes were printed, but not actually run
(i.e., AdvSIMDScalar pass).
llvm-svn: 208223
When performing a scalar comparison that feeds into a vector select,
it's actually better to do the comparison on the vector side: the
scalar route would be "CMP -> CSEL -> DUP", the vector is "CM -> DUP"
since the vector comparisons are all mask based.
llvm-svn: 208210
The AAPCS states that values passed in registers must have a value as though
they had been loaded with "LDR". LDR is equivalent to "LD1.64 vX.1D" - that is,
loading scalars to vector registers and loading 1-element vectors is equivalent.
The logic implemented here is to ensure that at all call boundaries and during
formal argument lowering all vectors are treated as their bitwidth-based floating
point scalar counterpart, which is always one of f64 or f128 (v2i32 -> f64,
v4i32 -> f128 etc). A BITCAST is inserted so that the appropriate REV will be
generated during code generation.
llvm-svn: 208198
Because we've canonicalised on using LD1/ST1, every time we do a bitcast
between vector types we must do an equivalent lane reversal.
Consider a simple memory load followed by a bitconvert then a store.
v0 = load v2i32
v1 = BITCAST v2i32 v0 to v4i16
store v4i16 v2
In big endian mode every memory access has an implicit byte swap. LDR and
STR do a 64-bit byte swap, whereas LD1/ST1 do a byte swap per lane - that
is, they treat the vector as a sequence of elements to be byte-swapped.
The two pairs of instructions are fundamentally incompatible. We've decided
to use LD1/ST1 only to simplify compiler implementation.
LD1/ST1 perform the equivalent of a sequence of LDR/STR + REV. This makes
the original code sequence: v0 = load v2i32
v1 = REV v2i32 (implicit)
v2 = BITCAST v2i32 v1 to v4i16
v3 = REV v4i16 v2 (implicit)
store v4i16 v3
But this is now broken - the value stored is different to the value loaded
due to lane reordering. To fix this, on every BITCAST we must perform two
other REVs:
v0 = load v2i32
v1 = REV v2i32 (implicit)
v2 = REV v2i32
v3 = BITCAST v2i32 v2 to v4i16
v4 = REV v4i16
v5 = REV v4i16 v4 (implicit)
store v4i16 v5
This means an extra two instructions, but actually in most cases the two REV
instructions can be combined into one. For example:
(REV64_2s (REV64_4h X)) === (REV32_4h X)
There is also no 128-bit REV instruction. This must be synthesized with an
EXT instruction.
Most bitconverts require some sort of conversion. The only exceptions are:
a) Identity conversions - vNfX <-> vNiX
b) Single-lane-to-scalar - v1fX <-> fX or v1iX <-> iX
Even though there are hundreds of changed lines, I have a fairly high confidence
that they are somewhat correct. The changes to add two REV instructions per
bitcast were pretty mechanical, and once I'd done that I threw the resulting
.td at a script I wrote which combined the two REVs together (and added
an EXT instruction, for f128) based on an instruction description I gave it.
This was much less prone to error than doing it all manually, plus my brain
would not just have melted but would have vapourised.
llvm-svn: 208194
This completes the port of r204814 (cpirker "AArch64_BE function argument
passing for ARM ABI") from AArch64 to ARM64, and fixes a bunch of issues
found during later development along the way. The biggest of these was
that the alignment fixup logic wasn't replicated into all the places it
should have been.
llvm-svn: 208192
The ARM::BLX instruction is an ARM mode instruction. The Windows on ARM target
is limited to Thumb instructions. Correctly use the thumb mode tBLXr
instruction. This would manifest as an errant write into the object file as the
instruction is 4-bytes in length rather than 2. The result would be a corrupted
object file that would eventually result in an executable that would crash at
runtime.
llvm-svn: 208152
remove it from the list of unspilled registers. Otherwise the following
attempt to keep the stack aligned by picking an extra GPR register to
spill will not work as it picks up r11.
llvm-svn: 208129
Before this patch, the backend always emitted a store+load sequence to
bitconvert from f64 to i64 the input operand of a ISD::BITCAST dag node that
performed a bitconvert from type MVT::f64 to type MVT::v2i32. The resulting
i64 node was then used to build a v2i32 vector.
With this patch, the backend now produces a cheaper SCALAR_TO_VECTOR from
MVT::f64 to MVT::v2f64. That SCALAR_TO_VECTOR is then followed by a "free"
bitcast to type MVT::v4i32. The elements of the resulting
v4i32 are then extracted to build a v2i32 vector (which is illegal and
therefore promoted to MVT::v2i64).
This is in general cheaper than emitting a stack store+load sequence
to bitconvert the operand from type f64 to type i64.
llvm-svn: 208107
This patch implements the infrastructure to use named register constructs in
programs that need access to specific registers (bare metal, kernels, etc).
So far, only the stack pointer is supported as a technology preview, but as it
is, the intrinsic can already support all non-allocatable registers from any
architecture.
llvm-svn: 208104
An alias has the address of what it points to, so it also has the same
alignment.
This allows a few optimizations to see past aliases for free.
llvm-svn: 208103
The Win64 docs are very clear that anything larger than 8 bytes is
passed by reference, and GCC MinGW64 honors that for __modti3 and
friends.
Patch by Jameson Nash!
llvm-svn: 208029
Summary:
Also ran clang-format on the function. The code added is the last else
if block.
Reviewers: nadav, craig.topper
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D3518
llvm-svn: 207992
Windows on ARM does not conform to AEABI. However, memset would be emitted
using the AEABI signature, resulting in inverted parameters. Handle this
special case appropriately.
llvm-svn: 207943
Both MinGW and cygwin (i686) construct export directives without the global
leader prefix. This is mostly due to the fact that they use GNU ld which does
not correctly handle the export directive. This apparently has been been broken
for a while. However, this was recently reported as being broken by
mingwandroid and diorcety of the msys2 project.
Remove the global leader prefix if targeting MinGW or cygwin, otherwise, retain
the global leader prefix. Add an explicit test for cygwin's behaviour of export
directives.
llvm-svn: 207926
While post-indexed LD1/ST1 instructions do exist for vector loads,
this patch makes use of the more flexible addressing-modes in LDR/STR
instructions.
llvm-svn: 207838
Given the following C code llvm currently generates suboptimal code for
x86-64:
__m128 bss4( const __m128 *ptr, size_t i, size_t j )
{
float f = ptr[i][j];
return (__m128) { f, f, f, f };
}
=================================================
define <4 x float> @_Z4bss4PKDv4_fmm(<4 x float>* nocapture readonly %ptr, i64 %i, i64 %j) #0 {
%a1 = getelementptr inbounds <4 x float>* %ptr, i64 %i
%a2 = load <4 x float>* %a1, align 16, !tbaa !1
%a3 = trunc i64 %j to i32
%a4 = extractelement <4 x float> %a2, i32 %a3
%a5 = insertelement <4 x float> undef, float %a4, i32 0
%a6 = insertelement <4 x float> %a5, float %a4, i32 1
%a7 = insertelement <4 x float> %a6, float %a4, i32 2
%a8 = insertelement <4 x float> %a7, float %a4, i32 3
ret <4 x float> %a8
}
=================================================
shlq $4, %rsi
addq %rdi, %rsi
movslq %edx, %rax
vbroadcastss (%rsi,%rax,4), %xmm0
retq
=================================================
The movslq is uneeded, but is present because of the trunc to i32 and then
sext back to i64 that the backend adds for vbroadcastss.
We can't remove it because it changes the meaning. The IR that clang
generates is already suboptimal. What clang really should emit is:
%a4 = extractelement <4 x float> %a2, i64 %j
This patch makes that legal. A separate patch will teach clang to do it.
Differential Revision: http://reviews.llvm.org/D3519
llvm-svn: 207801
This creates a lot of core infrastructure in which to add, with little
effort, quite a bit more to mips fast-isel
Test Plan: simplestore.ll
Reviewers: dsanders
Reviewed By: dsanders
Differential Revision: http://reviews.llvm.org/D3527
llvm-svn: 207790
The canonical form of the BFM instruction is always one of the more explicit
extract or insert operations, which makes reading output much easier.
llvm-svn: 207752
For pattern like ((x >> C1) & Mask) << C2, DAG combiner may convert it
into (x >> (C1-C2)) & (Mask << C2), which makes pattern matching of ubfx
more difficult.
For example:
Given
%shr = lshr i64 %x, 4
%and = and i64 %shr, 15
%arrayidx = getelementptr inbounds [8 x [64 x i64]]* @arr, i64 0, %i64 2, i64 %and
%0 = load i64* %arrayidx
With current shift folding, it takes 3 instrs to compute base address:
lsr x8, x0, #1
and x8, x8, #0x78
add x8, x9, x8
If using ubfx, it only needs 2 instrs:
ubfx x8, x0, #4, #4
add x8, x9, x8, lsl #3
This fixes bug 19589
llvm-svn: 207702
There is no need to check if we want to hoist the immediate value of an
shift instruction. Simply return TCC_Free right away.
This change is like r206101, but for X86.
rdar://problem/16190769
llvm-svn: 207692
target cannot be determined accurately. This is the case for NaCl where the
sandboxing instructions are added in MC layer, after the MipsLongBranch pass.
It is also the case when the code has inline assembly. Instead of calculating
offset in the MipsLongBranch pass, use %hi(sym1 - sym2) and %lo(sym1 - sym2)
expressions that are resolved during the fixup.
This patch also deletes microMIPS test file test/CodeGen/Mips/micromips-long-branch.ll
and implements microMIPS CHECKs in a much simpler way in a file
test/CodeGen/Mips/longbranch.ll, together with MIPS32 and MIPS64.
llvm-svn: 207656
The canonical syntax for shifts by a variable amount does not end with 'v', but
that syntax should be supported as an alias (presumably for legacy reasons).
llvm-svn: 207649
On instructions using the NZCV register, a couple of conditions have dual
representations: HS/CS and LO/CC (meaning unsigned-higher-or-same/carry-set and
unsigned-lower/carry-clear). The first of these is more descriptive in most
circumstances, so we should print it.
llvm-svn: 207644
Summary:
This isn't supported directly so we rotate the vector by the desired number of
elements, insert to element zero, then rotate back.
The i64 case generates rather poor code on MIPS32. There is an obvious
optimisation to be made in future (do both insert.w's inside a shared
rotate/unrotate sequence) but for now it's sufficient to select valid code
instead of aborting.
Depends on D3536
Reviewers: matheusalmeida
Reviewed By: matheusalmeida
Differential Revision: http://reviews.llvm.org/D3537
llvm-svn: 207640
Since these are mostly used in "lsl #16", "lsl #32", "lsl #48" combinations to
piece together an immediate in 16-bit chunks, hex is probably the most
appropriate format.
llvm-svn: 207635
This is mostly aimed at the NEON logical operations and MOVI/MVNI (since they
accept weird shifts which are more naturally understandable in hex notation).
Also changes BRK/HINT etc, which is probably a neutral change, but easier than
the alternative.
llvm-svn: 207634
Since these instructions only accept a 12-bit immediate, possibly shifted left
by 12, the canonical syntax used by the architecture reference manual is "#N {,
lsl #12 }". We should accept an immediate that has already been shifted, (e.g.
Also, print a comment giving the full addend since it can be helpful.
llvm-svn: 207633
This introduces the stack lowering emission of the stack probe function for
Windows on ARM. The stack on Windows on ARM is a dynamically paged stack where
any page allocation which crosses a page boundary of the following guard page
will cause a page fault. This page fault must be handled by the kernel to
ensure that the page is faulted in. If this does not occur and a write access
any memory beyond that, the page fault will go unserviced, resulting in an
abnormal program termination.
The watermark for the stack probe appears to be at 4080 bytes (for
accommodating the stack guard canaries and stack alignment) when SSP is
enabled. Otherwise, the stack probe is emitted on the page size boundary of
4096 bytes.
llvm-svn: 207615
IMAGE_REL_ARM_MOV32T relocations require that the movw/movt pair-wise
relocation is not split up and reordered. When expanding the mov32imm
pseudo-instruction, create a bundle if the machine operand is referencing an
address. This helps ensure that the relocatable address load is not reordered
by subsequent passes.
Unfortunately, this only partially handles the case as the Constant Island Pass
occurs after the instructions are unbundled and does not properly handle
bundles. That is a more fundamental issue with the pass itself and beyond the
scope of this change.
llvm-svn: 207608
Currently, musttail codegen is relying on sibcall optimization, and
reporting a fatal error if fails. Sibcall optimization fails when stack
arguments need to be modified, which is insufficient for musttail.
The logic for moving arguments in memory safely is already implemented
for GuaranteedTailCallOpt. This change merely arranges for musttail
calls to use it.
No functional change for GuaranteedTailCallOpt.
Reviewers: espindola
Differential Revision: http://reviews.llvm.org/D3493
llvm-svn: 207598
SI_IF and SI_ELSE are terminators which also produce a value. For
these instructions ISel always inserts a COPY to move their value
to another basic block. This COPY ends up between SI_(IF|ELSE)
and the S_BRANCH* instruction at the end of the block.
This breaks MachineBasicBlock::getFirstTerminator() and also the
machine verifier which assumes that terminators are grouped together at
the end of blocks.
To solve this we coalesce the copy away right after ISel to make sure
there are no instructions in between terminators at the end of blocks.
llvm-svn: 207591
SALU instructions ignore control flow, so it is not always safe to use
them within branches. This is a partial solution to this problem
until we can come up with something better.
llvm-svn: 207590
This is a squash of several optimization commits:
- calculate DIV_Lo and DIV_Hi separately
- use BFE_U32 if we are operating on 32bit values
- use precomputed constants instead of shifting in UDVIREM
- skip the first 32 iterations of udivrem
v2: Check whether BFE is supported before using it
Patch by: Jan Vesely
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 207589
Summary:
This isn't supported directly so we splat the vector element and extract
the most convenient copy.
Reviewers: matheusalmeida
Reviewed By: matheusalmeida
Differential Revision: http://reviews.llvm.org/D3530
llvm-svn: 207524
Otherwise the legalizer would just scalarize everything. Support for
mulhi in the targets isn't that great yet so on most targets we get
exactly the same scalarized output. Add a test for x86 vector udiv.
I had to disable the mulhi nodes on ARM because there aren't any patterns
for it. As far as I know ARM has instructions for getting the high part of
a multiply so this should be fixed.
llvm-svn: 207315
The included test case would return the incorrect results, because the expansion
of an shift with a constant shift amount of 0 would generate undefined behavior.
This is because ExpandShiftByConstant assumes that all shifts by constants with
a value of 0 have already been optimized away. This doesn't happen for opaque
constants and usually this isn't a problem, because opaque constants won't take
this code path - they are not supposed to. In the case that the opaque constant
has to be expanded by the legalizer, the legalizer would drop the opaque flag.
In this case we hit the limitations of ExpandShiftByConstant and create incorrect
code.
This commit fixes the legalizer by not dropping the opaque flag when expanding
opaque constants and adding an assertion to ExpandShiftByConstant to catch this
not supported case in the future.
This fixes <rdar://problem/16718472>
llvm-svn: 207304
Scaling factors are not free on X86 because every "complex" addressing mode
breaks the related instruction into 2 allocations instead of 1.
<rdar://problem/16730541>
llvm-svn: 207301
Summary:
If we're doing a v4f32/v4i32 shuffle on x86 with SSE4.1, we can lower
certain shufflevectors to an insertps instruction:
When most of the shufflevector result's elements come from one vector (and
keep their index), and one element comes from another vector or a memory
operand.
Added tests for insertps optimizations on shufflevector.
Added support and tests for v4i32 vector optimization.
Reviewers: nadav
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D3475
llvm-svn: 207291
This intrinsic is no longer needed with the new @llvm.arm.hint(i32) intrinsic
which provides a generic, extensible manner for adding hint instructions. This
functionality can now be represented as @llvm.arm.hint(i32 5).
llvm-svn: 207246
Introduce the llvm.arm.hint(i32) intrinsic that can be used to inject hints into
the instruction stream. This is particularly useful for generating IR from a
compiler where the user may inject an intrinsic (e.g. __yield). These are then
pattern substituted into the correct instruction which already existed.
llvm-svn: 207242
There's no need for local symbols to go through the GOT, in fact it seems GNU ld is not even emitting GOT entries for local symbols and will error out when trying to resolve a GOT relocation for a local symbol.
This bug triggers when bootstrapping clang on AArch64 Linux with -fPIC and the ARM64 backend. The AArch64 backend is not affected.
With this commit it's now possible to bootstrap clang on AArch64 Linux with the ARM64 backend (-fPIC, -O3).
llvm-svn: 207226
This patch is a supplement of implementing predicate of FP, enabling aarch64 backend
no-fp tests on arm64 target for verification. During this, one bug is exposed and
fixed by this patch.
llvm-svn: 207215
This is similar to the 'tail' marker, except that it guarantees that
tail call optimization will occur. It also comes with convervative IR
verification rules that ensure that tail call optimization is possible.
Reviewers: nicholas
Differential Revision: http://llvm-reviews.chandlerc.com/D3240
llvm-svn: 207143
This patch:
- Adds two new X86 builtin intrinsics ('int_x86_rdtsc' and
'int_x86_rdtscp') as GCCBuiltin intrinsics;
- Teaches the backend how to lower the two new builtins;
- Introduces a common function to lower READCYCLECOUNTER dag nodes
and the two new rdtsc/rdtscp intrinsics;
- Improves (and extends) the existing x86 test 'rdtsc.ll'; now test 'rdtsc.ll'
correctly verifies that both READCYCLECOUNTER and the two new intrinsics
work fine for both 64bit and 32bit Subtargets.
llvm-svn: 207127
This matches ARM64 behaviour, which I think is clearer. It also puts all the
churn from that difference into one easily ignored commit.
llvm-svn: 207116
ARM64 was not producing pure BFI instructions for bitfield insertion
operations, unlike AArch64. The approach had to be a little different (in
ISelDAGToDAG rather than ISelLowering), and the outcomes aren't identical but
hopefully this gives it similar power.
This should address PR19424.
llvm-svn: 207102
This allows us to compile
return (mask & 0x8 ? a : b);
into
testb $8, %dil
cmovnel %edx, %esi
instead of
andl $8, %edi
shrl $3, %edi
cmovnel %edx, %esi
which we formed previously because dag combiner canonicalizes setcc of and into shift.
llvm-svn: 207088
ANDS does not use the same encoding scheme as other xxxS instructions (e.g.,
ADDS). Take that into account to avoid wrong peephole optimization.
<rdar://problem/16693089>
llvm-svn: 207020
AArch64 has feature predicates for NEON, FP and CRYPTO instructions.
This allows the compiler to generate code without using FP, NEON
or CRYPTO instructions.
llvm-svn: 206949
The point of these calls is to allow Thumb-1 code to make use of the VFP unit
to perform its operations. This is not desirable with -msoft-float, since most
of the reasons you'd want that apply equally to the runtime library.
rdar://problem/13766161
llvm-svn: 206874
while checking candidate for bit field extract.
Otherwise the value may not fit in uint64_t and this will trigger an
assertion.
This fixes PR19503.
llvm-svn: 206834
This reverts commit r206707, reapplying r206704. The preceding commit
to CalcSpillWeights should have sorted out the failing buildbots.
<rdar://problem/14292693>
llvm-svn: 206766
Generating BZHI in the variable mask case, i.e. (and X, (sub (shl 1, N), 1)),
was already supported, but we were missing the constant-mask case. This patch
fixes that.
<rdar://problem/15480077>
llvm-svn: 206738
This reverts commit r206677, reapplying my BlockFrequencyInfo rewrite.
I've done a careful audit, added some asserts, and fixed a couple of
bugs (unfortunately, they were in unlikely code paths). There's a small
chance that this will appease the failing bots [1][2]. (If so, great!)
If not, I have a follow-up commit ready that will temporarily add
-debug-only=block-freq to the two failing tests, allowing me to compare
the code path between what the failing bots and what my machines (and
the rest of the bots) are doing. Once I've triggered those builds, I'll
revert both commits so the bots go green again.
[1]: http://bb.pgr.jp/builders/ninja-x64-msvc-RA-centos6/builds/1816
[2]: http://llvm-amd64.freebsd.your.org/b/builders/clang-i386-freebsd/builds/18445
<rdar://problem/14292693>
llvm-svn: 206704
Win64 stack unwinder gets confused when execution flow "falls through" after
a call to 'noreturn' function. This fixes the "missing epilogue" problem by
emitting a trap instruction for IR 'unreachable' on x86_x64-pc-windows.
A secondary use for it would be for anyone wanting to make double-sure that
'noreturn' functions, indeed, do not return.
llvm-svn: 206684
This reverts commit r206666, as planned.
Still stumped on why the bots are failing. Sanitizer bots haven't
turned anything up. If anyone can help me debug either of the failures
(referenced in r206666) I'll owe them a beer. (In the meantime, I'll be
auditing my patch for undefined behaviour.)
llvm-svn: 206677
This reverts commit r206628, reapplying r206622 (and r206626).
Two tests are failing only on buildbots [1][2]: i.e., I can't reproduce
on Darwin, and Chandler can't reproduce on Linux. Asan and valgrind
don't tell us anything, but we're hoping the msan bot will catch it.
So, I'm applying this again to get more feedback from the bots. I'll
leave it in long enough to trigger builds in at least the sanitizer
buildbots (it was failing for reasons unrelated to my commit last time
it was in), and hopefully a few others.... and then I expect to revert a
third time.
[1]: http://bb.pgr.jp/builders/ninja-x64-msvc-RA-centos6/builds/1816
[2]: http://llvm-amd64.freebsd.your.org/b/builders/clang-i386-freebsd/builds/18445
llvm-svn: 206666
Summary:
This port includes the rudimentary latencies that were provided for
the Cortex-A53 Machine Model in the AArch64 backend. It also changes
the SchedAlias for COPY in the Cyclone model to an explicit
WriteRes mapping to avoid conflicts in other subtargets.
Differential Revision: http://reviews.llvm.org/D3427
Patch by Dave Estes <cestes@codeaurora.org>!
llvm-svn: 206652
For a 256-bit BUILD_VECTOR consisting mostly of shuffles of 256-bit vectors,
both the BUILD_VECTOR and its operands may need to be legalized in multiple
steps. Consider:
(v8f32 (BUILD_VECTOR (extract_vector_elt (v8f32 %vreg0,) Constant<1>),
(extract_vector_elt %vreg0, Constant<2>),
(extract_vector_elt %vreg0, Constant<3>),
(extract_vector_elt %vreg0, Constant<4>),
(extract_vector_elt %vreg0, Constant<5>),
(extract_vector_elt %vreg0, Constant<6>),
(extract_vector_elt %vreg0, Constant<7>),
%vreg1))
a. We can't build a 256-bit vector efficiently so, we need to split it into
two 128-bit vecs and combine them with VINSERTX128.
b. Operands like (extract_vector_elt (v8f32 %vreg0), Constant<7>) needs to be
split into a VEXTRACTX128 and a further extract_vector_elt from the
resulting 128-bit vector.
c. The extract_vector_elt from b. is lowered into a shuffle to the first
element and a movss.
Depending on the order in which we legalize the BUILD_VECTOR and its
operands[1], buildFromShuffleMostly may be faced with:
(v4f32 (BUILD_VECTOR (extract_vector_elt
(vector_shuffle<1,u,u,u> (extract_subvector %vreg0, Constant<4>), undef),
Constant<0>),
(extract_vector_elt
(vector_shuffle<2,u,u,u> (extract_subvector %vreg0, Constant<4>), undef),
Constant<0>),
(extract_vector_elt
(vector_shuffle<3,u,u,u> (extract_subvector %vreg0, Constant<4>), undef),
Constant<0>),
%vreg1))
In order to figure out the underlying vector and their identity we need to see
through the shuffles.
[1] Note that the order in which operations and their operands are legalized is
only guaranteed in the first iteration of LegalizeDAG.
Fixes <rdar://problem/16296956>
llvm-svn: 206634
This reverts commit r206622 and the MSVC fixup in r206626.
Apparently the remotely failing tests are still failing, despite my
attempt to fix the nondeterminism in r206621.
llvm-svn: 206628
This reverts commit r206556, effectively reapplying commit r206548 and
its fixups in r206549 and r206550.
In an intervening commit I've added target triples to the tests that
were failing remotely [1] (but passing locally). I'm hoping the mystery
is solved? I'll revert this again if the tests are still failing
remotely.
[1]: http://bb.pgr.jp/builders/ninja-x64-msvc-RA-centos6/builds/1816
llvm-svn: 206622
These tests were failing on some buildbots after r206548 (reverted in
r206556), but passing locally.
They were missing target triples, so maybe that's the problem?
llvm-svn: 206621
Code mostly copied from AArch64, just tidied up a trifle and plumbed
into the ARM64 way of doing things.
This also enables the AArch64 tests which inspired the previous
untested commits.
llvm-svn: 206574
A vector extract followed by a dup can become a single instruction even if the
types don't match. AArch64 handled this in ISelLowering, but a few reasonably
simple patterns can take care of it in TableGen, so that's where I've put it.
llvm-svn: 206573
ARM64 was scalarizing some vector comparisons which don't quite map to
AArch64's compare and mask instructions. AArch64's approach of sacrificing a
little efficiency to emulate them with the limited set available was better, so
I ported it across.
More "inspired by" than copy/paste since the backend's internal expectations
were a bit different, but the tests were invaluable.
llvm-svn: 206570
I enhanced it a little in the process. The decision shouldn't really be beased
on whether a BUILD_VECTOR is a splat: any set of constants will do the job
provided they're related in the correct way.
Also, the BUILD_VECTOR could be any operand of the incoming AND nodes, so it's
best to check for all 4 possibilities rather than assuming it'll be the RHS.
llvm-svn: 206569
It's not actually used to handle C or C++ ABI rules on ARM64, but could well be
emitted by other language front-ends, so it's as well to have a sensible
implementation.
llvm-svn: 206568
Use scalar BFE with constant shift and offset when possible.
This is complicated by the fact that the scalar version packs
the two operands of the vector version into one.
llvm-svn: 206558
Rewrite the shared implementation of BlockFrequencyInfo and
MachineBlockFrequencyInfo entirely.
The old implementation had a fundamental flaw: precision losses from
nested loops (or very wide branches) compounded past loop exits (and
convergence points).
The @nested_loops testcase at the end of
test/Analysis/BlockFrequencyAnalysis/basic.ll is motivating. This
function has three nested loops, with branch weights in the loop headers
of 1:4000 (exit:continue). The old analysis gives non-sensical results:
Printing analysis 'Block Frequency Analysis' for function 'nested_loops':
---- Block Freqs ----
entry = 1.0
for.cond1.preheader = 1.00103
for.cond4.preheader = 5.5222
for.body6 = 18095.19995
for.inc8 = 4.52264
for.inc11 = 0.00109
for.end13 = 0.0
The new analysis gives correct results:
Printing analysis 'Block Frequency Analysis' for function 'nested_loops':
block-frequency-info: nested_loops
- entry: float = 1.0, int = 8
- for.cond1.preheader: float = 4001.0, int = 32007
- for.cond4.preheader: float = 16008001.0, int = 128064007
- for.body6: float = 64048012001.0, int = 512384096007
- for.inc8: float = 16008001.0, int = 128064007
- for.inc11: float = 4001.0, int = 32007
- for.end13: float = 1.0, int = 8
Most importantly, the frequency leaving each loop matches the frequency
entering it.
The new algorithm leverages BlockMass and PositiveFloat to maintain
precision, separates "probability mass distribution" from "loop
scaling", and uses dithering to eliminate probability mass loss. I have
unit tests for these types out of tree, but it was decided in the review
to make the classes private to BlockFrequencyInfoImpl, and try to shrink
them (or remove them entirely) in follow-up commits.
The new algorithm should generally have a complexity advantage over the
old. The previous algorithm was quadratic in the worst case. The new
algorithm is still worst-case quadratic in the presence of irreducible
control flow, but it's linear without it.
The key difference between the old algorithm and the new is that control
flow within a loop is evaluated separately from control flow outside,
limiting propagation of precision problems and allowing loop scale to be
calculated independently of mass distribution. Loops are visited
bottom-up, their loop scales are calculated, and they are replaced by
pseudo-nodes. Mass is then distributed through the function, which is
now a DAG. Finally, loops are revisited top-down to multiply through
the loop scales and the masses distributed to pseudo nodes.
There are some remaining flaws.
- Irreducible control flow isn't modelled correctly. LoopInfo and
MachineLoopInfo ignore irreducible edges, so this algorithm will
fail to scale accordingly. There's a note in the class
documentation about how to get closer. See also the comments in
test/Analysis/BlockFrequencyInfo/irreducible.ll.
- Loop scale is limited to 4096 per loop (2^12) to avoid exhausting
the 64-bit integer precision used downstream.
- The "bias" calculation proposed on llvmdev is *not* incorporated
here. This will be added in a follow-up commit, once comments from
this review have been handled.
llvm-svn: 206548
Change the command line vector-insertion.ll to explicitly set the neon syntax
to apple so that buildbots that default to other syntaxes won't fail.
llvm-svn: 206502
Having i128 as a legal type complicates the legalization phase. v4i32
is already a legal type, so we will use that instead.
This fixes several piglit tests.
llvm-svn: 206500
This patch improves the performance of vector creation in caseiswhere where
several of the lanes in the vector are a constant floating point value. It
also includes new patterns to fold together some of the instructions when the
value is 0.0f. Test cases included.
rdar://16349427
llvm-svn: 206496
Previously, SSPBufferSize was assigned the value of the "stack-protector-buffer-size"
attribute after all uses of SSPBufferSize. The effect was that the default
SSPBufferSize was always used during analysis. I moved the check for the
attribute before the analysis; now --param ssp-buffer-size= works correctly again.
Differential Revision: http://reviews.llvm.org/D3349
llvm-svn: 206486
The commit of r205855:
Author: Arnold Schwaighofer <aschwaighofer@apple.com>
Date: Wed Apr 9 14:20:47 2014 +0000
SLPVectorizer: Only vectorize intrinsics whose operands are widened equally
The vectorizer only knows how to vectorize intrinics by widening all operands by
the same factor.
Patch by Tyler Nowicki!
exposed a backend bug causing a regression (Cannot select ctpop).
The commit msg is a bit confusing because the patch actually changes the
behavior for the loop-vectorizer as well. As things got refactored into a
helper ctpop got snuck in to the trivially-vectorizable helper which is now
used by both vectorizers. In other words, we started seeing vector-ctpops in
the backend.
This change makes ctpop LegalizeAction::Expand for the types not supported by
the byte-only CNT instruction. We may be able to custom-lower these later to
a single CNT but this is to fix the compiler crash first.
Fixes <rdar://problem/16578951>
llvm-svn: 206433
This is so that EF_MIPS_NAN2008 is set if we are using IEEE 754-2008
NaN encoding (-mnan=2008). This patch also adds support for parsing
'.nan legacy' and '.nan 2008' assembly directives. The handling of
these directives should match GAS' behaviour i.e., the last directive
in use sets the ELF header bit (EF_MIPS_NAN2008).
Differential Revision: http://reviews.llvm.org/D3346
llvm-svn: 206396
These ones used completely different sets of intrinsics, so the only way to do
it is create a separate ARM64 copy and change them all.
Other than that, CodeGen was straightforward, no deficiencies detected here.
llvm-svn: 206392
Summary: This was a case of incorrect usage of hasMips64() vs isABI_N64()
Reviewers: matheusalmeida, dsanders
Reviewed By: dsanders
Differential Revision: http://reviews.llvm.org/D3398
llvm-svn: 206388
This should fix the ninja-x64-msvc-RA-centos6 builder.
I suspect the check in MipsSubtarget.cpp is incorrect and is really trying to
check for a bare-metal target rather and anything other than linux. I'll
investigate this.
llvm-svn: 206385
The most important part here is that we should actuall emit the stubs we refer
to in the exception table, but as a side issue this uses more sensible & GCC
compatible representations for some of the bits of information.
llvm-svn: 206380
If we know that a particular 64-bit constant has all high bits zero, then we
can rely on the fact that 32-bit ARM64 instructions automatically zero out the
high bits of an x-register. This gives the expansion logic less constraints to
satisfy and so sometimes allows it to pick better sequences.
Came up while porting test/CodeGen/AArch64/movw-consts.ll: this will allow a
32-bit MOVN to be used in @test8 soon.
llvm-svn: 206379
if not in micromips mode.
The test (elf_st_other.ll) was renamed as the name and description didn't
make sense as the test wasn't checking any symbol table entry.
Differential Revision: http://reviews.llvm.org/D3346
llvm-svn: 206377
Summary:
I had difficulty finding tests for the N32 and N64 ABI so I've added a
collection of calling convention tests based on the document MIPS ABIs
Described (MD00305), the MIPSpro N32 Handbook, and the SYSV ABI. Where the
documents/implementations disagree, I've used GCC to resolve the conflict.
A few interesting details:
* For N32, LLVM uses 64-bit pointers when saving $ra despite pointers being
32-bit. I've yet to find a supporting statement in the ABI documentation but
the current behaviour matches GCC.
* For O32, the non-variable portion of a varargs argument list is also subject
to the rule that floating-point is passed via GPR's (on N32/N64 only the
variable portion is subject to this rule). This agrees with GCC's behaviour
and the SYSV ABI but contradicts part of the MIPSpro N32 Handbook which talks about O32's behaviour.
* The N32 implementation has the wrong callee-saved register list.
(I already have a fix for this but will commit it as a follow-up).
I've left RUN-TODO lines in for O32 on MIPS64. I don't plan to support this case
for now but we should revisit it.
Reviewers: matheusalmeida, vmedic
Reviewed By: matheusalmeida
Differential Revision: http://reviews.llvm.org/D3339
llvm-svn: 206370
This particular DAG combine is designed to kick in when both ConstantFPs will
end up being loaded via a litpool, however those nodes have a semi-legal
status, dictated by isFPImmLegal so in some cases there wouldn't have been a
litpool in the first place. Don't try to be clever in those circumstances.
Picked up while merging some AArch64 tests.
llvm-svn: 206365
Print in decimal for inline immediates, and hex otherwise. Use hex
always for offsets in addressing offsets.
This approximately matches what the shader compiler does.
llvm-svn: 206335
handles Intrinsic::trap if TargetOptions::TrapFuncName is set.
This fixes a bug in which the trap function was not taken into consideration
when a program was compiled without optimization (at -O0).
<rdar://problem/16291933>
llvm-svn: 206323
This patch teaches the backend how to efficiently lower logical and
arithmetic packed shifts on both SSE and AVX/AVX2 machines.
When possible, instead of scalarizing a vector shift, the backend should try
to expand the shift into a sequence of two packed shifts by immedate count
followed by a MOVSS/MOVSD.
Example
(v4i32 (srl A, (build_vector < X, Y, Y, Y>)))
Can be rewritten as:
(v4i32 (MOVSS (srl A, <Y,Y,Y,Y>), (srl A, <X,X,X,X>)))
[with X and Y ConstantInt]
The advantage is that the two new shifts from the example would be lowered into
X86ISD::VSRLI nodes. This is always cheaper than scalarizing the vector into
four scalar shifts plus four pairs of vector insert/extract.
llvm-svn: 206316
Sometimes we need emit the bits that would actually be a MOVN when producing a
relocated MOVZ instruction (don't ask). But not always, a check which ARM64 got
wrong until now.
llvm-svn: 206289
Code is mostly copied directly across, with a slight extension of the
ISelDAGToDAG function so that it can cope with the floating-point constants
being behind a litpool.
llvm-svn: 206285
In rare cases the dead definition elimination pass code can cause illegal cmn
instructions when it replaces dead registers on instructions that use
unmaterialized frame indexes. This patch disables the dead definition
optimization for instructions which include frame index operands.
rdar://16438284
llvm-svn: 206208
Previously, BranchProbabilityInfo::calcLoopBranchHeuristics would determine the weights of basic blocks inside loops even when it didn't have enough information to estimate the branch probabilities correctly. This patch fixes the function to exit early if it doesn't see any exit edges or back edges and let the later heuristics determine the weights.
This fixes PR18705 and <rdar://problem/15991090>.
Differential Revision: http://reviews.llvm.org/D3363
llvm-svn: 206194
Summary:
This was another incorrect use of hasMips64() vs isGP64bit().
Depends on D3344
Reviewers: matheusalmeida, vmedic
Reviewed By: vmedic
Differential Revision: http://reviews.llvm.org/D3347
llvm-svn: 206187
Summary:
Two exceptions to this:
test/CodeGen/Mips/octeon.ll
test/CodeGen/Mips/octeon_popcnt.ll
these test extensions to MIPS64
One test is altered for MIPS-IV:
test/CodeGen/Mips/mips64countleading.ll
Tests dclo/dclz which were added in MIPS64. The MIPS-IV version tests
that dclo/dclz are not emitted.
Four tests fail and are not in this patch:
test/CodeGen/Mips/abicalls.ll
test/CodeGen/Mips/fcopysign-f32-f64.ll
test/CodeGen/Mips/fcopysign.ll
test/CodeGen/Mips/stack-alignment.ll
Depends on D3343
Reviewers: matheusalmeida, vmedic
Reviewed By: vmedic
Differential Revision: http://reviews.llvm.org/D3344
llvm-svn: 206185
Summary:
- Conditional moves acting on 64-bit GPR's should require MIPS-IV rather than MIPS64
- ISD::MUL, and ISD::MULH[US] should be lowered on all 64-bit ISA's
Patch by David Chisnall
His work was sponsored by: DARPA, AFRL
I've added additional testcases to cover as much of the codegen changes
affecting MIPS-IV as I can. Where I've been unable to find an existing
MIPS64 testcase that can be re-used for MIPS-IV (mainly tests covering
ISD::GlobalAddress and similar), I at least agree that MIPS-IV should
behave like MIPS64. Further testcases that are fixed by this patch will follow
in my next commit. The testcases from that commit that fail for MIPS-IV without
this patch are:
LLVM :: CodeGen/Mips/2010-07-20-Switch.ll
LLVM :: CodeGen/Mips/cmov.ll
LLVM :: CodeGen/Mips/eh-dwarf-cfa.ll
LLVM :: CodeGen/Mips/largeimmprinting.ll
LLVM :: CodeGen/Mips/longbranch.ll
LLVM :: CodeGen/Mips/mips64-f128.ll
LLVM :: CodeGen/Mips/mips64directive.ll
LLVM :: CodeGen/Mips/mips64ext.ll
LLVM :: CodeGen/Mips/mips64fpldst.ll
LLVM :: CodeGen/Mips/mips64intldst.ll
LLVM :: CodeGen/Mips/mips64load-store-left-right.ll
LLVM :: CodeGen/Mips/sint-fp-store_pattern.ll
Reviewers: dsanders
Reviewed By: dsanders
CC: matheusalmeida
Differential Revision: http://reviews.llvm.org/D3343
llvm-svn: 206183
There was one definite issue in ARM64 (the off-by-1 check for whether
a shift could be folded in) and one difference that is probably
correct: ARM64 didn't fold nodes with multiple uses into the
arithmetic operations unless optimising for code size.
llvm-svn: 206168
Summary:
Previously loadImmediate() would produce MKMSK instructions with invalid
immediate values such as mkmsk r0, 9. Fix this by checking the mask size
is valid.
Reviewers: robertlytton
Reviewed By: robertlytton
CC: llvm-commits
Differential Revision: http://reviews.llvm.org/D3289
llvm-svn: 206163
We had been using the known-zero values of the operand of the or to construct
the mask for an rlwimi; this is not quite correct, but fine when the mask is
constant. When the mask is constant, then the known zeros of the operand must
be a superset of the zeros in the mask. However, when the mask is not a
constant, then there might be bits in the operand that are not known to be zero
that, at runtime, might be zero in the mask. Therefore, we check that any bits
not known to be zero *are* known to be one in the mask. Otherwise, we can't
fold the mask with the or and shift.
This was revealed as a miscompile of
MultiSource/Benchmarks/BitBench/drop3/drop3 when I started experimenting with
constant hoisting.
llvm-svn: 206136
We had disabled use of TBAA during CodeGen (even when otherwise using AA)
because the ptrtoint/inttoptr used by CGP for address sinking caused BasicAA to
miss basic type punning that it should catch (and, thus, we'd fail to override
TBAA when we should).
However, when AA is in use during CodeGen, CGP now uses normal GEPs and
bitcasts, instead of ptrtoint/inttoptr, when doing address sinking. As a
result, BasicAA should be able to make us do the right thing in the face of
type-punning, and it seems safe to enable use of TBAA again. self-hosting seems
fine on PPC64/Linux on the P7, with TBAA enabled and -misched=shuffle.
Note: We still don't update TBAA when merging stack slots, although because
BasicAA should now catch all such cases, this is no longer a blocking issue.
Nevertheless, I plan to commit code to deal with this properly in the near
future.
llvm-svn: 206093
The current memory-instruction optimization logic in CGP, which sinks parts of
the address computation that can be adsorbed by the addressing mode, does this
by explicitly converting the relevant part of the address computation into
IR-level integer operations (making use of ptrtoint and inttoptr). For most
targets this is currently not a problem, but for targets wishing to make use of
IR-level aliasing analysis during CodeGen, the use of ptrtoint/inttoptr is a
problem for two reasons:
1. BasicAA becomes less powerful in the face of the ptrtoint/inttoptr
2. In cases where type-punning was used, and BasicAA was used
to override TBAA, BasicAA may no longer do so. (this had forced us to disable
all use of TBAA in CodeGen; something which we can now enable again)
This (use of GEPs instead of ptrtoint/inttoptr) is not currently enabled by
default (except for those targets that use AA during CodeGen), and so aside
from some PowerPC subtargets and SystemZ, there should be no change in
behavior. We may be able to switch completely away from the ptrtoint/inttoptr
sinking on all targets, but further testing is required.
I've doubled-up on a number of existing tests that are sensitive to the
address sinking behavior (including some store-merging tests that are
sensitive to the order of the resulting ADD operations at the SDAG level).
llvm-svn: 206092
-fexhaustive-register-search option to allow an exhaustive search during last
chance recoloring.
This is related to PR18747
Patch by MAYUR PANDEY <mayur.p@samsung.com>.
llvm-svn: 206072
The TargetLowering::expandMUL() helper contains lowering code extracted
from the DAGTypeLegalizer and allows the SelectionDAGLegalizer to expand more
ISD::MUL patterns without having to use a library call.
llvm-svn: 206037
This removes the -segmented-stacks command line flag in favor of a
per-function "split-stack" attribute.
Patch by Luqman Aden and Alex Crichton!
llvm-svn: 205997
Refactored stack-protector.ll to use new-style function attributes everywhere
and eliminated unnecessary attributes.
This cleanup is in preparation for an upcoming test change.
llvm-svn: 205996
AVX supports logical operations using an operand from memory. Unfortunately
because integer operations were not added until AVX2 the AVX1 logical
operation's types were preventing the isel from folding the loads. In a limited
number of cases the peephole optimizer would fold the loads, but most were
missed. This patch adds explicit patterns with appropriate casts in order for
these loads to be folded.
The included test cases run on reduced examples and disable the peephole
optimizer to ensure the folds are being pattern matched.
Patch by Louis Gerbarg <lgg@apple.com>
rdar://16355124
llvm-svn: 205938
FoldConstantArithmetic() only knows how to deal with a few target independent
ISD opcodes. Bail early if it sees a target-specific ISD node. These node do
funny things with operand types which may break the assumptions of the code
that follows, and there's no actual folding that can be done anyway. For example,
non-constant 256 bit vector shifts on X86 have a shift-amount operand that's a
128-bit v4i32 vector regardless of what the first operand type is and that breaks
the assumption that the operand types must match.
rdar://16530923
llvm-svn: 205937
In AArch64 i64 to i32 truncate operation is a subregister access.
This allows more opportunities for LSR optmization to eliminate
variables of different types (i32 and i64).
llvm-svn: 205925
sign/zero/any extensions. However a few places were not checking properly the
property of the load and were turning an indexed load into a regular extended
load. Therefore the indexed value was lost during the process and this was
triggering an assertion.
<rdar://problem/16389332>
llvm-svn: 205923
This commit adds intrinsics and codegen support for the surface read/write and texture read instructions that take an explicit sampler parameter. Codegen operates on image handles at the PTX level, but falls back to direct replacement of handles with kernel arguments if image handles are not enabled. Note that image handles are explicitly disabled for all target architectures in this change (to be enabled later).
llvm-svn: 205907
Summary:
They behave in accordance with the Has2008 and ABS2008 configuration bits of the processor which are used to select between the 1985 and 2008 versions of IEEE 754. In 1985 mode, these instructions are arithmetic (i.e. they raise invalid operation exceptions when given NaN), in 2008 mode they are non-arithmetic (i.e. they are copies).
nmadd.[ds], and nmsub.[ds] are still subject to -enable-no-nans-fp-math because the ISA spec does not explicitly state that they obey Has2008 and ABS2008.
Fixed the issue with the previous version of this patch (r205628). A pre-existing 'let Predicate =' statement was removing some predicates that were necessary for FP64 to behave correctly.
Reviewers: matheusalmeida
Reviewed By: matheusalmeida
Differential Revision: http://llvm-reviews.chandlerc.com/D3274
llvm-svn: 205844
This implements the target-hooks for ARM64 to enable constant hoisting.
This fixes <rdar://problem/14774662> and <rdar://problem/16381500>.
llvm-svn: 205791
Confusingly, the NEON fmla instructions put the accumulator first but the
scalar versions put it at the end (like the fma lib function & LLVM's
intrinsic).
This should fix PR19345, assuming there's only one issue.
llvm-svn: 205758