get the literal string “Hello world” printed as a comment on the instruction
that loads the pointer to it. For now this is just for x86_64. So for object
files with relocation entries it produces things like:
leaq L_.str(%rip), %rax ## literal pool for: "Hello world\n"
and similar for fully linked images like executables:
leaq 0x4f(%rip), %rax ## literal pool for: "Hello world\n"
Also to allow testing against darwin’s otool(1), I hooked up the existing
-no-show-raw-insn option to the Mach-O parser code, added the new Mach-O
only -full-leading-addr option to match otool(1)'s printing of addresses and
also added the new -print-imm-hex option.
llvm-svn: 218423
For biendian targets like ARM and AArch64, it is useful to have the
output of the llvm-dwarfdump and llvm-objdump report the endianness
used when the object files were generated.
Patch by Charlie Turner.
llvm-svn: 218408
This change fixes the ARM and AArch64 relocation visitors in
RelocVisitor. They were unconditionally assuming the object data are
little-endian. Tests have been added to ensure that the
llvm-dwarfdump utility does not crash when processing big-endian
object files.
Patch by Charlie Turner.
llvm-svn: 218407
This change replaces the brittle if/else chain of string comparisons
with a switch statement on the detected target triple, removing the
need for testing arbitrary architecture names returned from
getFileFormatName, whose primary purpose seems to be for display
(user-interface) purposes. The visitor now takes a reference to the
object file, rather than its arbitrary file format name to figure out
whether the file is a 32 or 64-bit object file and what the detected
target triple is.
A set of tests have been added to help show that the refactoring processes
relocations for the same targets as the original code.
Patch by Charlie Turner.
llvm-svn: 218406
Use the same environment when invoking llvm-config from lit.cfg as
will be used when running tests, so that ASAN_OPTIONS, INCLUDE, etc.
are present.
llvm-svn: 218403
This reverts commit faac033f7364bb4226e22c8079c221c96af10d02.
The test depends on all targets to be enabled in llc in order to pass,
and needs to be rewritten/refactored to not have that dependency.
llvm-svn: 218393
For biendian targets like ARM and AArch64, it is useful to have the
output of the llvm-dwarfdump and llvm-objdump report the endianness
used when the object files were generated.
Patch by Charlie Turner.
llvm-svn: 218391
This change fixes the ARM and AArch64 relocation visitors in
RelocVisitor. They were unconditionally assuming the object data are
little-endian. Tests have been added to ensure that the
llvm-dwarfdump utility does not crash when processing big-endian
object files.
Patch by Charlie Turner.
llvm-svn: 218389
This change replaces the brittle if/else chain of string comparisons
with a switch statement on the detected target triple, removing the
need for testing arbitrary architecture names returned from
getFileFormatName, whose primary purpose seems to be for display
(user-interface) purposes. The visitor now takes a reference to the
object file, rather than its arbitrary file format name to figure out
whether the file is a 32 or 64-bit object file and what the detected
target triple is.
A set of tests have been added to help show that the refactoring processes
relocations for the same targets as the original code.
Patch by Charlie Turner.
llvm-svn: 218388
The doFinalization method checks that the LoopToAliasSetMap is
empty. LICM populates that map as it runs through the loop nest,
deleting the entries for child loops as it goes. However, if a child
loop is deleted by another pass (e.g. unrolling) then the loop will
never be deleted from the map because LICM walks the loop nest to
find entries it can delete.
The fix is to delete the loop from the map and free the alias set
when the loop is deleted from the loop nest.
Differential Revision: http://reviews.llvm.org/D5305
llvm-svn: 218387
If it's safe to clobber the condition flags, we can do a few extra things:
it's then possible to reset the base register writeback using a SUBS, so
we can try to merge even if the base register isn't dead after the merged
instruction.
This is effectively a (heavily bug-fixed) rewrite of r208992.
llvm-svn: 218386
v7M only allows the 16-bit encoding of the 'cps' (Change Processor
State) instruction, and does not have the 32-bit encoding which is
valid from v6T2 onwards.
llvm-svn: 218382
pool data being loaded into a vector register.
The comments take the form of:
# ymm0 = [a,b,c,d,...]
# xmm1 = <x,y,z...>
The []s are used for generic sequential data and the <>s are used for
specifically ConstantVector loads. Undef elements are printed as the
letter 'u', integers in decimal, and floating point values as floating
point values. Suggestions on improving the formatting or other aspects
of the display are very welcome.
My primary use case for this is to be able to FileCheck test masks
passed to vector shuffle instructions in-register. It isn't fantastic
for that (no decoding special zeroing semantics or other tricks), but it
at least puts the mask onto an instruction line that could reasonably be
checked. I've updated many of the new vector shuffle lowering tests to
leverage this in their test cases so that we're actually checking the
shuffle masks remain as expected.
Before implementing this, I tried a *bunch* of different approaches.
I looked into teaching the MCInstLower code to scan up the basic block
and find a definition of a register used in a shuffle instruction and
then decode that, but this seems incredibly brittle and complex.
I talked to Hal a lot about the "right" way to do this: attach the raw
shuffle mask to the instruction itself in some form of unencoded
operands, and then use that to emit the comments. I still think that's
the optimal solution here, but it proved to be beyond what I'm up for
here. In particular, it seems likely best done by completing the
plumbing of metadata through these layers and attaching the shuffle mask
in metadata which could have fully automatic dropping when encoding an
actual instruction.
llvm-svn: 218377
the native AVX2 instructions.
Note that the test case is really frustrating here because VPERMD
requires the mask to be in the register input and we don't produce
a comment looking through that to the constant pool. I'm going to
attempt to improve this in a subsequent commit, but not sure if I will
succeed.
llvm-svn: 218347
detection. It was incorrectly handling undef lanes by actually treating
an undef lane in the first 128-bit lane as a *numeric* shuffle value.
Fortunately, this almost always DTRT and disabled detecting repeated
patterns. But not always. =/ This patch introduces a much more
principled approach and fixes the miscompiles I spotted by inspection
previously.
llvm-svn: 218346
This testcase was not testing what it meant: because there were only two checks for
dmb {{ish}} in the second function, it could have missed a bug where one of the three
required dmb {{ish}} became dmb {{ishst}}. As I was fixing it, I also added
CHECK-LABELs to make it a bit less brittle.
llvm-svn: 218341
shuffles using the AVX2 instructions. This is the first step of cutting
in real AVX2 support.
Note that I have spotted at least one bug in the test cases already, but
I suspect it was already present and just is getting surfaced. Will
investigate next.
llvm-svn: 218338
Rather than slurping in and splatting out the whole ctor list, preserve
the existing array entries without trying to understand them. Only
remove the entries that we know we can optimize away. This way we don't
need to wire through priority and comdats or anything else we might add.
Fixes a linker issue where the .init_array or .ctors entry would point
to discarded initialization code if the comdat group from the TU with
the faulty global_ctors entry was dropped.
llvm-svn: 218337
e.g., add w1, w2, w3, lsl #(2 - 1)
This sort of thing comes up in pre-processed assembly playing macro games.
Still validate that it's an assembly time constant. The early exit error check
was just a bit overzealous and disallowed a left paren.
rdar://18430542
llvm-svn: 218336
add VPBLENDD to the InstPrinter's comment generation so we get nice
comments everywhere.
Now that we have the nice comments, I can see the bug introduced by
a silly typo in the commit that enabled VPBLENDD, and have fixed it. Yay
tests that are easy to inspect.
llvm-svn: 218335
Summary:
AtomicExpand already had logic for expanding wide loads and stores on LL/SC
architectures, and for expanding wide stores on CmpXchg architectures, but
not for wide loads on CmpXchg architectures. This patch fills this hole,
and makes use of this new feature in the X86 backend.
Only one functionnal change: we now lose the SynchScope attribute.
It is regrettable, but I have another patch that I will submit soon that will
solve this for all of AtomicExpand (it seemed better to split it apart as it
is a different concern).
Test Plan: make check-all (lots of tests for this functionality already exist)
Reviewers: jfb
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D5404
llvm-svn: 218332
Summary:
This patch makes use of AtomicExpandPass in Power for inserting fences around
atomic as part of an effort to remove fence insertion from SelectionDAGBuilder.
As a big bonus, it lets us use sync 1 (lightweight sync, often used by the mnemonic
lwsync) instead of sync 0 (heavyweight sync) in many cases.
I also added a test, as there was no test for the barriers emitted by the Power
backend for atomic loads and stores.
Test Plan: new test + make check-all
Reviewers: jfb
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D5180
llvm-svn: 218331
VPBLENDD where appropriate even on 128-bit vectors.
According to Agner's tables, this instruction is significantly higher
throughput (can execute on any port) on Haswell chips so we should
aggressively try to form it when available.
Sadly, this loses our delightful shuffle comments. I'll add those back
for VPBLENDD next.
llvm-svn: 218322
undef in the shuffle mask. This shows up when we're printing comments
during lowering and we still have an IR-level constant hanging around
that models undef.
A nice consequence of this is *much* prettier test cases where the undef
lanes actually show up as undef rather than as a particular set of
values. This also allows us to print shuffle comments in cases that use
undef such as the recently added variable VPERMILPS lowering. Now those
test cases have nice shuffle comments attached with their details.
The shuffle lowering for PSHUFB has been augmented to use undef, and the
shuffle combining has been augmented to comprehend it.
llvm-svn: 218301
trick that I missed.
VPERMILPS has a non-immediate memory operand mode that allows it to do
asymetric shuffles in the two 128-bit lanes. Use this rather than two
shuffles and a blend.
However, it turns out the variable shuffle path to VPERMILPS (and
VPERMILPD, although that one offers no functional differenc from the
immediate operand other than variability) wasn't even plumbed through
codegen. Do such plumbing so that we can reasonably emit
a variable-masked VPERMILP instruction. Also plumb basic comment parsing
and printing through so that the tests are reasonable.
There are still a few tests which don't show the shuffle pattern. These
are tests with undef lanes. I'll teach the shuffle decoding and printing
to handle undef mask entries in a follow-up. I've looked at the masks
and they seem reasonable.
llvm-svn: 218300
This includes constants, attributes, and some additional instructions not covered by previous tests.
Work was done by lama.saba@intel.com.
llvm-svn: 218297
We manage to generate all of the matching instructions (and a lot more) via
the reciprocal optimization function - even if we completely remove the square
root optimization. With CHECK_NEXT, we assure that we're executing the
expected square root optimization paths and not generating extra insts.
llvm-svn: 218284
Shift-left immediate with sign-/zero-extensions also works for boolean values.
Update the assert and the test cases to reflect that fact.
This should fix a bug found by Chad.
llvm-svn: 218275
These are just test cases, no actual code yet. This establishes the
baseline fallback strategy we're starting from on AVX2 and the expected
lowering we use on AVX1.
Also, these test cases are very much generated. I've manually crafted
the specific pattern set that I'm hoping will be useful at exercising
the lowering code, but I've not (and could not) manually verify *all* of
these. I've spot checked and they seem legit to me.
As with the rest of vector shuffling, at a certain point the only really
useful way to check the correctness of this stuff is through fuzz
testing.
llvm-svn: 218267
We generate broadcast instructions on CPUs with AVX2 to load some constant splat vectors.
This patch should preserve all existing behavior with regular optimization levels,
but also use splats whenever possible when optimizing for *size* on any CPU with AVX or AVX2.
The tradeoff is up to 5 extra instruction bytes for the broadcast instruction to save
at least 8 bytes (up to 31 bytes) of constant pool data.
Differential Revision: http://reviews.llvm.org/D5347
llvm-svn: 218263
This reverts commit r218254.
The global_atomics.ll test fails with asserts disabled. For some reason,
the compiler fails to produce the atomic no return variants.
llvm-svn: 218257
Summary:
Update segmented-stacks*.ll tests with x32 target case and make
corresponding changes to make them pass.
Test Plan: tests updated with x32 target
Reviewers: nadav, rafael, dschuff
Subscribers: llvm-commits, zinovy.nis
Differential Revision: http://reviews.llvm.org/D5245
llvm-svn: 218247
Summary: getSubroutineName is currently only used by llvm-symbolizer, thus add a binary test containing a cross-cu inlining example.
Reviewers: samsonov, dblaikie
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D5394
llvm-svn: 218245
The PSHUFB mask decode routine used to assert if the mask index was out of
range (<0 or greater than the size of the vector). The problem is, we can
legitimately have a PSHUFB with a large index using intrinsics. The
instruction only uses the least significant 4 bits. This change removes the
assert and masks the index to match the instruction behaviour.
llvm-svn: 218242
We currently emit an error when trying to assemble a file with more
than one section using DWARF2 debug info. This should be a warning
instead, as the resulting file will still be usable, but with a
degraded debug illusion.
llvm-svn: 218241
a more sane approach to AVX2 support.
Fundamentally, there is no useful way to lower integer vectors in AVX.
None. We always end up with a VINSERTF128 in the end, so we might as
well eagerly switch to the floating point domain and do everything
there. This cleans up lots of weird and unlikely to be correct
differences between integer and floating point shuffles when we only
have AVX1.
The other nice consequence is that by doing things this way we will make
it much easier to write the integer lowering routines as we won't need
to duplicate the logic to check for AVX vs. AVX2 in each one -- if we
actually try to lower a 256-bit vector as an integer vector, we have
AVX2 and can rely on it. I think this will make the code much simpler
and more comprehensible.
Currently, I've disabled *all* support for AVX2 so that we always fall
back to AVX. This keeps everything working rather than asserting. That
will go away with the subsequent series of patches that provide
a baseline AVX2 implementation.
Please note, I'm going to implement AVX2 *without access to hardware*.
That means I cannot correctness test this path. I will be relying on
those with access to AVX2 hardware to do correctness testing and fix
bugs here, but as a courtesy I'm trying to sketch out the framework for
the new-style vector shuffle lowering in the context of the AVX2 ISA.
llvm-svn: 218228
input v8f32 shuffles which are not 128-bit lane crossing but have
different shuffle patterns in the low and high lanes. This removes most
of the extract/insert traffic that was unnecessary and is particularly
good at lowering cases where only one of the two lanes is shuffled at
all.
I've also added a collection of test cases with undef lanes because this
lowering is somewhat more sensitive to undef lanes than others.
llvm-svn: 218226
in the high and low 128-bit lanes of a v8f32 vector.
No functionality change yet, but wanted to set up the baseline for my
next patch which will make these quite a bit better. =]
llvm-svn: 218224
lowering when it can use a symmetric SHUFPS across both 128-bit lanes.
This required making the SHUFPS lowering tolerant of other vector types,
and adjusting our canonicalization to canonicalize harder.
This is the last of the clever uses of symmetry I've thought of for
v8f32. The rest of the tricks I'm aware of here are to work around
assymetry in the mask.
llvm-svn: 218216
of a single element into a zero vector for v4f64 and v4i64 in AVX.
Ironically, there is less to see here because xor+blend is so crazy fast
that we can't really beat that to zero the high 128-bit lane.
llvm-svn: 218214
UNPCKHPS with AVX vectors by recognizing those patterns when they are
repeated for both 128-bit lanes.
With this, we now generate the exact same (really nice) code for
Quentin's avx_test_case.ll which was the most significant regression
reported for the new shuffle lowering. In fact, I'm out of specific test
cases for AVX lowering, the rest were AVX2 I think. However, there are
a bunch of pretty obvious remaining things to improve with AVX...
llvm-svn: 218213
important bits of cleverness: to detect and lower repeated shuffle
patterns between the two 128-bit lanes with a single instruction.
This patch just teaches it how to lower single-input shuffles that fit
this model using VPERMILPS. =] There is more that needs to happen here.
llvm-svn: 218211
generating the test cases to format things more consistently and
actually catch all the operand sequences that should be elided in favor
of the asm comments. No actual changes here.
llvm-svn: 218210
VBLENDPD over using VSHUFPD. While the 256-bit variant of VBLENDPD slows
down to the same speed as VSHUFPD on Sandy Bridge CPUs, it has twice the
reciprocal throughput on Ivy Bridge CPUs much like it does everywhere
for 128-bits. There isn't a downside, so just eagerly use this
instruction when it suffices.
llvm-svn: 218208
This expands the integer cases to cover the fact that AVX2 moves their
lane-crossing shuffles into the integer domain. It also adds proper
support for AVX2 run lines and the "ALL" group when it doesn't matter.
llvm-svn: 218206
actual support for complex AVX shuffling tricks. We can do independent
blends of the low and high 128-bit lanes of an avx vector, so shuffle
the inputs into place and then do the blend at 256 bits. This will in
many cases remove one blend instruction.
The next step is to permute the low and high halves in-place rather than
extracting them and re-inserting them.
llvm-svn: 218202
link.exe:
Fuzz testing has shown that COMMON symbols with size > 32 will always
have an alignment of at least 32 and all symbols with size < 32 will
have an alignment of at least the largest power of 2 less than the size
of the symbol.
binutils:
The BFD linker essentially work like the link.exe behavior but with
alignment 4 instead of 32. The BFD linker also supports an extension to
COFF which adds an -aligncomm argument to the .drectve section which
permits specifying a precise alignment for a variable but MC currently
doesn't support editing .drectve in this way.
With all of this in mind, we decide to play a little trick: we can
ensure that the alignment will be respected by bumping the size of the
global to it's alignment.
llvm-svn: 218201
under AVX.
This really just documents the current state of the world. I'm going to
try to flesh it out to cover any test cases I plan to improve prior to
improving them so that the delta made by changes is actually visible to
code reviewers.
This is made easier by the fact that I now have a script to automate the
process of producing test cases including the check lines. =]
llvm-svn: 218199
single-input shuffles with doubles. This allows them to fold memory
operands into the shuffle, etc. This is just the analog to the v4f32
case in my prior commit.
llvm-svn: 218193
instruction for single-vector floating point shuffles. This in turn
allows the shuffles to fold a load into the instruction which is one of
the common regressions hit with the new shuffle lowering.
llvm-svn: 218190
We had a few bugs:
- We were considering the GVKind instead of just looking at the section
characteristics
- We would never print out 'y' when a section was meant to be unreadable
- We would never print out 's' when a section was meant to be shared
- We translated IMAGE_SCN_MEM_DISCARDABLE to 'n' when it should've meant
IMAGE_SCN_LNK_REMOVE
llvm-svn: 218189
duplication of check lines. The idea is to have broad sets of
compilation modes that will frequently diverge without having to always
and immediately explode to the precise ISA feature set.
While this already helps due to VEX encoded differences, it will help
much more as I teach the new shuffle lowering about more of the new VEX
encoded instructions which can still be used to implement 128-bit
shuffles.
llvm-svn: 218188
A problem with our old behavior becomes observable under x86-64 COFF
when we need a read-only GV which has an initializer which is referenced
using a relocation: we would mark the section as writable. Marking the
section as writable interferes with section merging.
This fixes PR21009.
llvm-svn: 218179
tricky case of single-element insertion into the zero lane of a zero
vector.
We can't just use the same pattern here as we do in every other vector
type because the general insertion logic can handle insertion into the
non-zero lane of the vector. However, in SSE4.1 with v4f32 vectors we
have INSERTPS that is a much better choice than the generic one for such
lowerings. But INSERTPS can do lots of other lowerings as well so
factoring its logic into the general insertion logic doesn't work very
well. We also can't just extract the core common part of the general
insertion logic that is faster (forming VZEXT_MOVL synthetic nodes that
lower to MOVSS when they can) because VZEXT_MOVL is often *faster* than
a blend while INSERTPS is slower! So instead we do a restrictive
condition on attempting to use the generic insertion logic to narrow it
to those cases where VZEXT_MOVL won't need a shuffle afterward and thus
will do better than INSERTPS. Then we try blending. Then we go back to
INSERTPS.
This still doesn't generate perfect code for some silly reasons that can
be fixed by tweaking the td files for lowering VZEXT_MOVL to use
XORPS+BLENDPS when available rather than XORPS+MOVSS when the input ends
up in a register rather than a load from memory -- BLENDPSrr has twice
the reciprocal throughput of MOVSSrr. Don't you love this ISA?
llvm-svn: 218177
floating point types and use it for both v2f64 and v2i64 single-element
insertion lowering.
This fixes the last non-AVX performance regression test case I've gotten
of for the new vector shuffle lowering. There is obvious analogous
lowering for v4f32 that I'll add in a follow-up patch (because with
INSERTPS, v4f32 requires special treatment). After that, its AVX stuff.
llvm-svn: 218175
When looking through sign/zero-extensions the code would always assume there is
such an extension instruction and use the wrong operand for the address.
There was also a minor issue in the handling of 'AND' instructions. I
accidentially used a 'cast' instead of a 'dyn_cast'.
llvm-svn: 218161
lowering to support both anyext and zext and to custom lower for many
different microarchitectures.
Using this allows us to get *exactly* the right code for zext and anyext
shuffles in all the vector sizes. For v16i8, the improvement is *huge*.
The new SSE2 test case added I refused to add before this because it was
sooooo muny instructions.
llvm-svn: 218143
To reduce the size of -gmlt data, skip the subprograms without any
inlined subroutines. Since we've now got the ability to make these
determinations in the backend (funnily enough - we added the flag so we
wouldn't produce ranges under -gmlt, but with this change we use the
flag, but go back to producing ranges under -gmlt).
Instead, just produce CU ranges to inform the consumer which parts of
the code are described by this CU's line table. Tools could inspect the
line table directly to compute the range, but the CU ranges only seem to
be about 0.5% of object/executable size, so I'm not too worried about
teaching llvm-symbolizer that trick just yet - it's certainly a possible
piece of future work.
Update an llvm-symbolizer test just to demonstrate that this schema is
acceptable there (if it wasn't, the compiler-rt tests would catch this,
but good to have an in-llvm-tree test for llvm-symbolizer's behavior
here)
Building the clang binary with -gmlt with this patch reduces the total
size of object files by 5.1% (5.56% without ranges) without compression
and the executable by 4.37% (4.75% without ranges).
llvm-svn: 218129
The heuristic used by DAGCombine to form FMAs checks that the FMUL has only one
use, but this is overly-conservative on some systems. Specifically, if the FMA
and the FADD have the same latency (and the FMA does not compete for resources
with the FMUL any more than the FADD does), there is no need for the
restriction, and furthermore, forming the FMA leaving the FMUL can still allow
for higher overall throughput and decreased critical-path length.
Here we add a new TLI callback, enableAggressiveFMAFusion, false by default, to
elide the hasOneUse check. This is enabled for PowerPC by default, as most
PowerPC systems will benefit.
Patch by Olivier Sallenave, thanks!
llvm-svn: 218120
to undef lanes as well as defined widenable lanes. This dramatically
improves the lowering we use for undef-shuffles in a zext-ish pattern
for SSE2.
llvm-svn: 218115
Not sure why I only did SSSE3 here. Also, I've left out some of the SSE2
ones because the shuffles are so absurd it's not worth transcribing
them. Will try to fix them to be sane and then check them.
llvm-svn: 218114
shuffles that are zext-ing.
Not a lot to see here; the undef lane variant is better handled with
pshufd, but this improves the actual zext pattern.
llvm-svn: 218112
to the new vector shuffle lowering code.
This allows us to emit PMOVZX variants consistently for patterns where
it is a viable lowering. This instruction is both fast and allows us to
fold loads into it. This only hooks the new lowering up for i16 and i8
element widths, mostly so I could manage the change to the tests. I'll
add the i32 one next, although it is significantly less interesting.
One thing to note is that we already had some tests for these patterns
but those tests had far less horrible instructions. The problem is that
those tests weren't checking the strict start and end of the instruction
sequence. =[ As a consequence something changed in the lowering making
us generate *TERRIBLE* code for these patterns in SSE2 through SSSE3.
I've consolidated all of the tests and spelled out the madness that we
currently emit for these shuffles. I'm going to try to figure out what
has gone wrong here.
llvm-svn: 218102
With this optimization, we will not always insert zext for values crossing
basic blocks, but insert sext if the users of a value crossing basic block
has preference of sign predicate.
llvm-svn: 218101
This omission will be done in a fancier manner once we're dealing with
"put gmlt in the skeleton CUs under fission" - it'll have to be
conditional on the kind of CU we're emitting into (skeleton or gmlt).
llvm-svn: 218098
This format is simply a regular object file with the bitcode stored in a
section named ".llvmbc", plus any number of other (non-allocated) sections.
One immediate use case for this is to accommodate compilation processes
which expect the object file to contain metadata in non-allocated sections,
such as the ".go_export" section used by some Go compilers [1], although I
imagine that in the future we could consider compiling parts of the module
(such as large non-inlinable functions) directly into the object file to
improve LTO efficiency.
[1] http://golang.org/doc/install/gccgo#Imports
Differential Revision: http://reviews.llvm.org/D4371
llvm-svn: 218078
The fix is slightly different then x86 (see r216117) because the number of values
attached to a return can vary even for a single returned value (e.g., f64 yields
two returned values).
<rdar://problem/18352998>
llvm-svn: 218076
Summary:
This patch was originally in D5304 (I could not find a way to reopen that revision).
It was accepted, commited and broke the build bots because the overloading of
the constructor of ArrayRef for braced initializer lists is not supported by all
toolchains. I then reverted it, and propose this fixed version that uses a plain
C array instead in makeDMB (that array is then converted implicitly to an
ArrayRef, but that is not behind an ifdef). Could someone confirm me whether
initialization lists for plain C arrays are supported by every toolchain used
to build llvm ? Otherwise I can just initialize the array in the old way:
args[0] = ...; .. ; args[5] = ...;
Below is the description of the original patch:
```
I had only tested this code for ARMv7 and ARMv8. This patch adds several
fallback paths if the processor does not support dmb ish:
- dmb sy if a cortex-M with support for dmb
- mcr p15, #0, r0, c7, c10, #5 for ARMv6 (special instruction equivalent to a DMB)
These fallback paths were chosen based on the code for fence seq_cst.
Thanks to luqmana for having noticed this bug.
```
Test Plan: Added more cases to atomic-load-store.ll + make check-all
Reviewers: jfb, t.p.northover, luqmana
Subscribers: llvm-commits, aemerson
Differential Revision: http://reviews.llvm.org/D5386
llvm-svn: 218066
There is no purpose in using it for single-input shuffles as
pshufd is just as fast and doesn't tie the two operands. This removes
a substantial amount of wrong-domain blend operations in SSSE3 mode. It
also completes the usage of PALIGNR for integer shuffles and addresses
one of the test cases Quentin hit with the new vector shuffle lowering.
There is still the question of whether and when to use this for floating
point shuffles. It is faster than shufps or shufpd but in the integer
domain. I don't yet really have a good heuristic here for when to use
this instruction for floating point vectors.
llvm-svn: 218038
When folding the intrinsic flag into the branch or select we also have to
consider the fact if the intrinsic got simplified, because it changes the
flag we have to check for.
llvm-svn: 218034
Small optimization in 'simplifyAddress'. When the offset cannot be encoded in
the load/store instruction, then we need to materialize the address manually.
The add instruction can encode a wider range of immediates than the load/store
instructions. This change tries to fold the offset into the add instruction
first before materializing the offset in a register.
llvm-svn: 218031
The 'AND' instruction could be used to mask out the lower 32 bits of a register.
If this is done inside an address computation we might be able to fold the
instruction into the memory instruction itself.
and x1, x1, #0xffffffff ---> ldrb x0, [x0, w1, uxtw]
ldrb x0, [x0, x1]
llvm-svn: 218030
Certain directives are unsupported on Windows (some of which could/should be
supported). We would not diagnose the use but rather crash during the emission
as we try to access the Target Streamer. Add an assertion to prevent creating a
NULL reference (which is not permitted under C++) as well as a test to ensure
that we can diagnose the disabled directives.
llvm-svn: 218014
PALIGNR. This just adds it to the v8i16 and v16i8 lowering steps where
it is completely unmatched. It also introduces the logic for detecting
rotation shuffle masks even in the presence of single input or blend
masks and arbitrarily undef lanes.
I've added fairly comprehensive tests for the matching logic in v8i16
because the tests at that size are much easier to write and manage.
I've not checked the SSE2 code generated for these tests because the
code is *horrible*. It is absolute madness. Testing it will just make
the test brittle without giving any interesting improvements in the
correctness confidence.
llvm-svn: 218013
Rather than relying on support for a specific directive to determine if we are
targeting MachO, explicitly check the output format.
As an additional bonus, cleanup the caret diagnostic for the non-MachO case and
avoid the spurious error caused by not discarding the statement.
llvm-svn: 218012
For PPC targets, FastISel does not take the sign extension information into account when selecting return instructions whose operands are constants. A consequence of this is that the return of boolean values is not correct. This patch fixes the problem by evaluating the sign extension information also for constants, forwarding this information to PPCMaterializeInt which takes this information to drive the sign extension during the materialization.
llvm-svn: 217993
It is breaking the build on the buildbots but works fine on my machine, I revert
while trying to understand what happens (it appears to depend on the compiler used
to build, I probably used a C++11 feature that is not perfectly supported by some
of the buildbots).
This reverts commit feb3176c4d006f99af8b40373abd56215a90e7cc.
llvm-svn: 217973
This takes advanatage of the CBZ and CBNZ instruction to further optimize the
common null check pattern into a single instruction.
This is related to rdar://problem/18358882.
llvm-svn: 217972
This adds the last two missing floating-point condition codes (FCMP_UEQ and
FCMP_ONE) also to the branch selection. In these two cases an additonal branch
instruction is required.
This also adds unit tests to checks all the different condition codes.
This is related o rdar://problem/18358882.
llvm-svn: 217966
Summary:
I had only tested this code for ARMv7 and ARMv8. This patch adds several
fallback paths if the processor does not support dmb ish:
- dmb sy if a cortex-M with support for dmb
- mcr p15, #0, r0, c7, c10, #5 for ARMv6 (special instruction equivalent to a DMB)
These fallback paths were chosen based on the code for fence seq_cst.
Thanks to luqmana for having noticed this bug.
Test Plan: Added more cases to atomic-load-store.ll + make check-all
Reviewers: jfb, t.p.northover, luqmana
Subscribers: aemerson, llvm-commits
Differential Revision: http://reviews.llvm.org/D5304
llvm-svn: 217965
Only 1 decimal place should be printed for inline immediates.
Other constants should be hex constants.
Does not include f64 tests because folding those inline
immediates currently does not work.
llvm-svn: 217964
This improves other optimizations such as LSR. A sext may be added to the
compare's other operand, but this can often be hoisted outside of the loop.
llvm-svn: 217953
Example:
define i1 @foo(i32 %a) {
%shr = ashr i32 -9, %a
%cmp = icmp ne i32 %shr, -5
ret i1 %cmp
}
Before this fix, the instruction combiner wrongly thought that %shr
could have never been equal to -5. Therefore, %cmp was always folded to 'true'.
However, when %a is equal to 1, then %cmp evaluates to 'false'. Therefore,
in this example, it is not valid to fold %cmp to 'true'.
The problem was only affecting the case where the comparison was between
negative quantities where one of the quantities was obtained from arithmetic
shift of a negative constant.
This patch fixes the problem with the wrong folding (fixes PR20945).
With this patch, the 'icmp' from the example is now simplified to a
comparison between %a and 1. This still allows us to get rid of the arithmetic
shift (%shr).
llvm-svn: 217950
Summary: This directive is used to tell the assembler to reject DSP-specific instructions.
Reviewers: dsanders
Reviewed By: dsanders
Differential Revision: http://reviews.llvm.org/D5142
llvm-svn: 217946
First step done in this commit is to get flush out enough of the
SymbolizerGetOpInfo() routine to symbolic an X86_64 hello world .o and
its loading of the literal string and call to printf. Also the code to
symbolicate the X86_64_RELOC_SUBTRACTOR relocation and a test is also
added to show a slightly more complicated case.
Next will be to flush out enough of SymbolizerSymbolLookUp() to get the
literal string “Hello world” printed as a comment on the instruction that load
the pointer to it.
llvm-svn: 217893
By class-instance values I mean 'Class<Arg>' in 'Class<Arg>.Field' or in
'Other<Class<Arg>>' (syntactically s SimpleValue). This is to differentiate
from unnamed/anonymous record definitions (syntactically an ObjectBody) which
are not affected by this change.
Consider the testcase:
class Struct<int i> {
int I = !shl(i, 1);
int J = !shl(I, 1);
}
class Class<Struct s> {
int Class_J = s.J;
}
multiclass MultiClass<int i> {
def Def : Class<Struct<i>>;
}
defm Defm : MultiClass<2>;
Before this fix, DefmDef.Class_J yields !shl(I, 1) instead of 8.
This is the sequence of events. We start with this:
multiclass MultiClass<int i> {
def Def : Class<Struct<i>>;
}
During ParseDef the anonymous object for the class-instance value is created:
multiclass Multiclass<int i> {
def anonymous_0 : Struct<i>;
def Def : Class<NAME#anonymous_0>;
}
Then class Struct<i> is added to anonymous_0. Also Class<NAME#anonymous_0> is
added to Def:
multiclass Multiclass<int i> {
def anonymous_0 {
int I = !shl(i, 1);
int J = !shl(I, 1);
}
def Def {
int Class_J = NAME#anonymous_0.J;
}
}
So far so good but then we move on to instantiating this in the defm
by substituting the template arg 'i'.
This is how the anonymous prototype looks after fully instantiating.
defm Defm = {
def Defmanonymous_0 {
int I = 4;
int J = !shl(I, 1);
}
Note that we only resolved the reference to the template arg. The
non-template-arg reference in 'J' has not been resolved yet.
Then we go on to instantiating the Def prototype:
def DefmDef {
int Class_J = NAME#anonymous_0.J;
}
Which is resolved to Defmanonymous_0.J and then to !shl(I, 1).
When we fully resolve each record in a defm, Defmanonymous_0.J does get set
to 8 but that's too late for its use.
The patch adds a new attribute to the Record class that indicates that this
def is actually a class-instance value that may be *used* by other defs in a
multiclass. (This is unlike regular defs which don't reference each other and
thus can be resolved indepedently.) They are then fully resolved before the
other defs while the multiclass is instantiated.
I added vg_leak to the new test. I am not sure if this is necessary but I
don't think I have a way to test it. I can also check in without the XFAIL
and let the bots test this part.
Also tested that X86.td.expanded and AAarch64.td.expanded were unchange before
and after this change. (This issue triggering this problem is a WIP patch.)
Part of <rdar://problem/17688758>
llvm-svn: 217886
Summary: Changed error messages to be more informative and to resemble other clang/llvm error messages (first letter is lower case, no ending punctuation) and updated corresponding tests.
Reviewers: dsanders
Reviewed By: dsanders
Differential Revision: http://reviews.llvm.org/D5065
llvm-svn: 217873
The default implementation of getCmpSelInstrCost, which provides the cost of
icmp/fcmp/select instructions, did not deal sensibly with illegal vector types
that were scalarized. We'd ask for the legalization cost of the vector type,
which would return something like (4, f64) given an input of <4 x double>, and
we'd then check the TLI status of the ISD opcode on that scalar type. This would
result in querying (ISD::VSELECT, f64), for example. Amusingly enough,
ISD::VSELECT on scalar types is marked as Legal by default (as with most other
operations), and most backends never change this because VSELECT is never
generated on scalars. However, seeing the resulting operation as Legal, we'd
neglect to add the scalarization cost before returning. The result is that we'd
grossly under-estimate the cost of cmps/selects on illegal vector types.
Now, if type legalization clearly results in scalarization, we skip the early
return and add the scalarization cost.
llvm-svn: 217859
Teach yaml2obj how to make a bigobj COFF file. Like the rest of LLVM,
we automatically decide whether or not to use regular COFF or bigobj
COFF on the fly depending on how many sections the resulting object
would have.
This ends the task of adding bigobj support to LLVM.
N.B. This was tested by forcing yaml2obj to be used in bigobj mode
regardless of the number of sections. While a dedicated test was
written, the smallest I could make it was 36 MB (!) of yaml and it still
took a significant amount of time to execute on a powerful machine.
llvm-svn: 217858
This finishes the ability of llvm-objdump to print out all information from
the LC_DYLD_INFO load command.
The -bind option prints out symbolic references that dyld must resolve
immediately.
The -lazy-bind option prints out symbolc reference that are lazily resolved on
first use.
The -weak-bind option prints out information about symbols which dyld must
try to coalesce across images.
llvm-svn: 217853
that we don't use VSELECT and directly emit an addsub synthetic node.
Also remove a stale comment referencing VSELECT.
The test case is updated to use 'core2' which only has SSE3, not SSE4.1,
and it still passes. Previously it would not because we lacked
sufficient blend support to legalize the VSELECT.
llvm-svn: 217849
ADDSUBPD nodes out of blends of adds and subs.
This allows us to actually form these instructions with SSE3 rather than
only forming them when we had both SSE3 for the ADDSUB instructions and
SSE4.1 for the blend instructions. ;] Kind-of important.
I've adjusted the CPU requirements on one of the tests to demonstrate
this kicking in nicely for an SSE3 cpu configuration.
llvm-svn: 217848
This adds the missing test case for the previous commit:
Allow handling of vectors during return lowering for little endian machines.
Sorry for the noise.
llvm-svn: 217847
This changes the debug output of the llvm-cov tool to consistently
write to stderr, and moves the highlighting output closer to where
it's relevant.
llvm-svn: 217838
In r217746, though it was supposed to be NFC, I broke llvm-cov's
handling of showing regions without showing counts. This should've
shown up in the existing tests, except they were checking debug output
that was displayed regardless of what was actually output. I've moved
the relevant debug output to a more appropriate place so that the
tests catch this kind of thing.
llvm-svn: 217835
This lowers frem to a runtime libcall inside fast-isel.
The test case also checks the CallLoweringInfo bug that was exposed by this
change.
This fixes rdar://problem/18342783.
llvm-svn: 217833
Summary:
Expand list of supported targets for Mips to include mips32 r1.
Previously it only include r2. More patches are coming where there is
a difference but in the current patches as pushed upstream, r1 and r2
are equivalent.
Test Plan:
simplestorefp1.ll
add new build bots at mips to test this flavor at both -O0 and -O2
Reviewers: dsanders
Reviewed By: dsanders
Differential Revision: http://reviews.llvm.org/D5306
llvm-svn: 217821
On MachO, and MachO only, we cannot have a truly empty function since that
breaks the linker logic for atomizing the section.
When we are emitting a frame pointer, the presence of an unreachable will
create a cfi instruction pointing past the last instruction. This is perfectly
fine. The FDE information encodes the pc range it applies to. If some tool
cannot handle this, we should explicitly say which bug we are working around
and only work around it when it is actually relevant (not for ELF for example).
Given the unreachable we could omit the .cfi_def_cfa_register, but then
again, we could also omit the entire function prologue if we wanted to.
llvm-svn: 217801
introduced in r217629.
We were returning the old sext instead of the new zext as the promoted instruction!
Thanks Joerg Sonnenberger for the test case.
llvm-svn: 217800
Peephole optimization was folding MOVSDrm, which is a zero-extending double
precision floating point load, into ADDPDrr, which is a SIMD add of two packed
double precision floating point values.
(before)
%vreg21<def> = MOVSDrm <fi#0>, 1, %noreg, 0, %noreg; mem:LD8[%7](align=16)(tbaa=<badref>) VR128:%vreg21
%vreg23<def,tied1> = ADDPDrr %vreg20<tied0>, %vreg21; VR128:%vreg23,%vreg20,%vreg21
(after)
%vreg23<def,tied1> = ADDPDrm %vreg20<tied0>, <fi#0>, 1, %noreg, 0, %noreg; mem:LD8[%7](align=16)(tbaa=<badref>) VR128:%vreg23,%vreg20
X86InstrInfo::foldMemoryOperandImpl already had the logic that prevented this
from happening. However the check wasn't being conducted for loads from stack
objects. This commit factors out the logic into a new function and uses it for
checking loads from stack slots are not zero-extending loads.
rdar://problem/18236850
llvm-svn: 217799
Add some more tests to make sure better operand
choices are still made. Leave some cases that seem
to have no reason to ever be e64 alone.
llvm-svn: 217789
I noticed some odd looking cases where addr64 wasn't set
when storing to a pointer in an SGPR. This seems to be intentional,
and partially tested already.
The documentation seems to describe addr64 in terms of which registers
addressing modifiers come from, but I would expect to always need
addr64 when using 64-bit pointers. If no offset is applied,
it makes sense to not need to worry about doing a 64-bit add
for the final address. A small immediate offset can be applied,
so is it OK to not have addr64 set if a carry is necessary when adding
the base pointer in the resource to the offset?
llvm-svn: 217785
when SSE4.1 is available.
This removes a ton of domain crossing from blend code paths that were
ending up in the floating point code path.
This is just the tip of the iceberg though. The real switch is for
integer blend lowering to more actively rely on this instruction being
available so we don't hit shufps at all any longer. =] That will come in
a follow-up patch.
Another place where we need better support is for using PBLENDVB when
doing so avoids the need to have two complementary PSHUFB masks.
llvm-svn: 217767
missing specific checks.
While there is a lot of redundancy here where all-but-one mode use the
same code generation, I'd rather have each variant spelled out and
checked so that readers aren't misled by an omission in the test suite.
llvm-svn: 217765
instructions from the relevant shuffle patterns.
This is the last tweak I'm aware of to generate essentially perfect
v4f32 and v2f64 shuffles with the new vector shuffle lowering up through
SSE4.1. I'm sure I've missed some and it'd be nice to check since v4f32
is amenable to exhaustive exploration, but this is all of the tricks I'm
aware of.
With AVX there is a new trick to use the VPERMILPS instruction, that's
coming up in a subsequent patch.
llvm-svn: 217761
instructions when it finds an appropriate pattern.
These are lovely instructions, and its a shame to not use them. =] They
are fast, and can hand loads folded into their operands, etc.
I've also plumbed the comment shuffle decoding through the various
layers so that the test cases are printed nicely.
llvm-svn: 217758
AVX is available, and generally tidy up things surrounding UNPCK
formation.
Originally, I was thinking that the only advantage of PSHUFD over UNPCK
instruction variants was its free copy, and otherwise we should use the
shorter encoding UNPCK instructions. This isn't right though, there is
a larger advantage of being able to fold a load into the operand of
a PSHUFD. For UNPCK, the operand *must* be in a register so it can be
the second input.
This removes the UNPCK formation in the target-specific DAG combine for
v4i32 shuffles. It also lifts the v8 and v16 cases out of the
AVX-specific check as they are potentially replacing multiple
instructions with a single instruction and so should always be valuable.
The floating point checks are simplified accordingly.
This also adjusts the formation of PSHUFD instructions to attempt to
match the shuffle mask to one which would fit an UNPCK instruction
variant. This was originally motivated to allow it to match the UNPCK
instructions in the combiner, but clearly won't now.
Eventually, we should add a MachineCombiner pass that can form UNPCK
instructions post-RA when the operand is known to be in a register and
thus there is no loss.
llvm-svn: 217755
'punpckhwd' instructions when suitable rather than falling back to the
generic algorithm.
While we could canonicalize to these patterns late in the process, that
wouldn't help when the freedom to use them is only visible during
initial lowering when undef lanes are well understood. This, it turns
out, is very important for matching the shuffle patterns that are used
to lower sign extension. Fixes a small but relevant regression in
gcc-loops with the new lowering.
When I changed this I noticed that several 'pshufd' lowerings became
unpck variants. This is bad because it removes the ability to freely
copy in the same instruction. I've adjusted the widening test to handle
undef lanes correctly and now those will correctly continue to use
'pshufd' to lower. However, this caused a bunch of churn in the test
cases. No functional change, just churn.
Both of these changes are part of addressing a general weakness in the
new lowering -- it doesn't sufficiently leverage undef lanes. I've at
least a couple of patches that will help there at least in an academic
sense.
llvm-svn: 217752
Some ICmpInsts when anded/ored with another ICmpInst trivially reduces
to true or false depending on whether or not all integers or no integers
satisfy the intersected/unioned range.
This sort of trivial looking code can come about when InstCombine
performs a range reduction-type operation on sdiv and the like.
This fixes PR20916.
llvm-svn: 217750
These are super simple. They even take precedence over crazy
instructions like INSERTPS because they have very high throughput on
modern x86 chips.
I still have to teach the integer shuffle variants about this to avoid
so many domain crossings. However, due to the particular instructions
available, that's a touch more complex and so a separate patch.
Also, the backend doesn't seem to realize it can commute blend
instructions by negating the mask. That would help remove a number of
copies here. Suggestions on how to do this welcome, it's an area I'm
less familiar with.
llvm-svn: 217744
support transforming the forms from the new vector shuffle lowering to
use 'movddup' when appropriate.
A bunch of the cases where we actually form 'movddup' don't actually
show up in the test results because something even later than DAG
legalization maps them back to 'unpcklpd'. If this shows back up as
a performance problem, I'll probably chase it down, but it is at least
an encoded size loss. =/
To make this work, also always do this canonicalizing step for floating
point vectors where the baseline shuffle instructions don't provide any
free copies of their inputs. This also causes us to canonicalize
unpck[hl]pd into mov{hl,lh}ps (resp.) which is a nice encoding space
win.
There is one test which is "regressed" by this: extractelement-load.
There, the test case where the optimization it is testing *fails*, the
exact instruction pattern which results is slightly different. This
should probably be fixed by having the appropriate extract formed
earlier in the DAG, but that would defeat the purpose of the test.... If
this test case is critically important for anyone, please let me know
and I'll try to work on it. The prior behavior was actually contrary to
the comment in the test case and seems likely to have been an accident.
llvm-svn: 217738
Check that the post RA scheduler is being skipped, regardless of
whether it's the top-down list latency scheduler or the post-RA
MI scheduler.
llvm-svn: 217725
Similar to my previous -exports-trie option, the -rebase option dumps info from
the LC_DYLD_INFO load command. The rebasing info is a list of the the locations
that dyld needs to adjust if a mach-o image is not loaded at its preferred
address. Since ASLR is now the default, images almost never load at their
preferred address, and thus need to be rebased by dyld.
llvm-svn: 217709
The raw profiles that are generated in compiler-rt always add padding
so that each profile is aligned, so we can simply treat files that
don't have this property as malformed.
Caught by Alexey's new ubsan bot. Thanks!
llvm-svn: 217708
As far as I can tell UTF-8 has been supported since the beginning of Python's
codec support, and it's the de facto standard for text these days, at least
for primarily-English text. This allows us to put Unicode into lit RUN lines.
rdar://problem/18311663
llvm-svn: 217688
Cross-class copies being expensive is actually a trait of the microarchitecture, but as I haven't yet seen an example of a microarchitecture where they're cheap it seems best to just enable this by default, covering the non-mcpu build case.
llvm-svn: 217674
This fixes a call to sys::fs::equivalent that should've been to
CodeCoverageTool::equivalentFiles, which lets us restore the test of
r217476 that was removed in r217478.
This reverts r217478, but the test works this time.
llvm-svn: 217646
Inline asm may specify 'U' and 'X' constraints to print a 'u' for an
update-form memory reference, or an 'x' for an indexed-form memory
reference. However, these are really only useful in GCC internal code
generation. In inline asm the operand of the memory constraint is
typically just a register containing the address, so 'U' and 'X' make
no sense.
This patch quietly accepts 'U' and 'X' in inline asm patterns, but
otherwise does nothing. If we ever unexpectedly see a non-register,
we'll assert and sort it out afterwards.
I've added a new test for these constraints; the test case should be
used for other asm-constraints changes down the road.
llvm-svn: 217622
Do
(shl (add x, c1), c2) -> (add (shl x, c2), c1 << c2)
This is already done for multiplies, but since multiplies
by powers of two are turned into shifts, we also need
to handle it here.
This might want checks for isLegalAddImmediate to avoid
transforming an add of a legal immediate with one that isn't.
llvm-svn: 217610
r189189 implemented AVX512 unpack by essentially performing a 256-bit unpack
between the low and the high 256 bits of src1 into the low part of the
destination and another unpack of the low and high 256 bits of src2 into the
high part of the destination.
I don't think that's how unpack works. AVX512 unpack simply has more 128-bit
lanes but other than it works the same way as AVX. So in each 128-bit lane,
we're always interleaving certain parts of both operands rather different
parts of one of the operands.
E.g. for this:
__v16sf a = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
__v16sf b = { 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 };
__v16sf c = __builtin_shufflevector(a, b, 0, 8, 1, 9, 4, 12, 5, 13, 16,
24, 17, 25, 20, 28, 21, 29);
we generated punpcklps (notice how the elements of a and b are not interleaved
in the shuffle). In turn, c was set to this:
0 16 1 17 4 20 5 21 8 24 9 25 12 28 13 29
Obviously this should have just returned the mask vector of the shuffle
vector.
I mostly reverted this change and made sure the original AVX code worked
for 512-bit vectors as well.
Also updated the tests because they matched the logic from the code.
llvm-svn: 217602
This is an extension of the change made with r215820:
http://llvm.org/viewvc/llvm-project?view=revision&revision=215820
That patch allowed combining of splatted vector FP constants that are multiplied.
This patch allows combining non-uniform vector FP constants too by relaxing the
check on the type of vector. Also, canonicalize a vector fmul in the
same way that we already do for scalars - if only one operand of the fmul is a
constant, make it operand 1. Otherwise, we miss potential folds.
This fold is also done by -instcombine, but it's possible that extra
fmuls may have been generated during lowering.
Differential Revision: http://reviews.llvm.org/D5254
llvm-svn: 217599
Now that the operations are all implemented, we can test this sub-arch here.
Signed-off-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Matt Arsenault <matthew.arsenault@amd.com>
llvm-svn: 217595
David Blaikie's commits r217563 & r217564, which added shared_ptr to the
CostPool have fixed some memory leak issues exposed by the PBQP with
coalescing constraints.
The sanitizer bot was failing because of those leaks. Now that the leaks
are gone, we can reenable the aarch64/pbqp test.
llvm-svn: 217580
We used to crash processing any relevant @llvm.assume on a 32-bit target
(because we'd ask SE to subtract expressions of differing types). I've copied
our 'simple.ll' test, but with the data layout from arm-linux-gnueabihf to get
some meaningful test coverage here.
llvm-svn: 217574
The routine that determines an alignment given some SCEV returns zero if the
answer is unknown. In a case where we could determine the increment of an
AddRec but not the starting alignment, we would compute the integer modulus by
zero (which is illegal and traps). Prevent this by returning early if either
the start or increment alignment is unknown (zero).
llvm-svn: 217544
"Unroll" is not the appropriate name for this variable. Clang already uses
the term "interleave" in pragmas and metadata for this.
Differential Revision: http://reviews.llvm.org/D5066
llvm-svn: 217528
This adds target specific support for using the PBQP register allocator on the
AArch64, for the A57 cpu.
By default, the PBQP allocator is not used, unless explicitely required
on the command line with "-aarch64-pbqp".
llvm-svn: 217504
using static relocation model and small code model.
Summary: currently we generate GOT based relocations for weak symbol
references regardless of the underlying relocation model. This should
be change so that in static relocation model we use a constant pool
load instead.
Patch from: Keith Walker
Reviewers: Renato Golin, Tim Northover
llvm-svn: 217503
The only Thumb-1 multi-store capable of using LR is the PUSH instruction, which
translates to STMDB, so we shouldn't convert STMIAs.
Patch by Sergey Dmitrouk.
llvm-svn: 217498
This adds support for reading the "bigobj" variant of COFF produced by
cl's /bigobj and mingw's -mbig-obj.
The most significant difference that bigobj brings is more than 2**16
sections to COFF.
bigobj brings a few interesting differences with it:
- It doesn't have a Characteristics field in the file header.
- It doesn't have a SizeOfOptionalHeader field in the file header (it's
only used in executable files).
- Auxiliary symbol records have the same width as a symbol table entry.
Since symbol table entries are bigger, so are auxiliary symbol
records.
Write support will come soon.
Differential Revision: http://reviews.llvm.org/D5259
llvm-svn: 217496
It appears that the -filename-equivalence option for testing llvm-cov
doesn't work correctly with -show-expansions. I'm reverting this test
to get the bots green while I look into fixing that.
This partially reverts r217476
llvm-svn: 217478
This commit adds aliases for the sync instruction (synciobdma,
syncs, syncw, syncws) which are used by the Octeon CPU.
Reviewed by D. Sanders
llvm-svn: 217477
This is the plugin version of pr20882.
This handles the case of every common symbol being in the IR. We will need some
support from gold to handle the case where some symbols are in ELF and some in
the IR.
llvm-svn: 217458
Summary:
This directive is used to reset the assembler options to their initial values.
Assembly programmers use it in conjunction with the ".set mipsX" directives.
This patch depends on the .set push/pop directive (http://reviews.llvm.org/D4821).
Contains work done by Matheus Almeida.
Reviewers: dsanders
Reviewed By: dsanders
Differential Revision: http://reviews.llvm.org/D4957
llvm-svn: 217438
Summary:
In AT&T annotation for both x86_64 and x32 calls should be printed as
callq in assembly. It's only a matter of correct mnemonic, object output
is ok.
Test Plan: trivial test added
Reviewers: nadav, dschuff, craig.topper
Subscribers: llvm-commits, zinovy.nis
Differential Revision: http://reviews.llvm.org/D5213
llvm-svn: 217435
Summary:
These directives are used to save the current assembler options (in the case of ".set push") and restore the previously saved options (in the case of ".set pop").
Contains work done by Matheus Almeida.
Reviewers: dsanders
Reviewed By: dsanders
Differential Revision: http://reviews.llvm.org/D4821
llvm-svn: 217432
When compiling without SSE2, isTruncStoreLegal(F64, F32) would return Legal, whereas with SSE2 it would return Expand. And since the Target doesn't seem to actually handle a truncstore for double -> float, it would just output a store of a full double in the space for a float hence overwriting other bits on the stack.
Patch by Luqman Aden!
llvm-svn: 217410
Previously, fast-isel would not clean up after failing to select a call
instruction, because it would have called flushLocalValueMap() which moves
the insertion point, making SavedInsertPt in selectInstruction() invalid.
Fixing this by making SavedInsertPt a member variable, and having
flushLocalValueMap() update it.
This removes some redundant code at -O0, and more importantly fixes PR20863.
Differential Revision: http://reviews.llvm.org/D5249
llvm-svn: 217401
This adds a basic (but important) use of @llvm.assume calls in ScalarEvolution.
When SE is attempting to validate a condition guarding a loop (such as whether
or not the loop count can be zero), this check should also include dominating
assumptions.
llvm-svn: 217348
From a combination of @llvm.assume calls (and perhaps through other means, such
as range metadata), it is possible that all bits of a return value might be
known. Previously, InstCombine did not check for this (which is understandable
given assumptions of constant propagation), but means that we'd miss simple
cases where assumptions are involved.
llvm-svn: 217346
This change teaches LazyValueInfo to use the @llvm.assume intrinsic. Like with
the known-bits change (r217342), this requires feeding a "context" instruction
pointer through many functions. Aside from a little refactoring to reuse the
logic that turns predicates into constant ranges in LVI, the only new code is
that which can 'merge' the range from an assumption into that otherwise
computed. There is also a small addition to JumpThreading so that it can have
LVI use assumptions in the same block as the comparison feeding a conditional
branch.
With this patch, we can now simplify this as expected:
int foo(int a) {
__builtin_assume(a > 5);
if (a > 3) {
bar();
return 1;
}
return 0;
}
llvm-svn: 217345
This adds a ScalarEvolution-powered transformation that updates load, store and
memory intrinsic pointer alignments based on invariant((a+q) & b == 0)
expressions. Many of the simple cases we can get with ValueTracking, but we
still need something like this for the more complicated cases (such as those
with an offset) that require some algebra. Note that gcc's
__builtin_assume_aligned's optional third argument provides exactly for this
kind of 'misalignment' offset for which this kind of logic is necessary.
The primary motivation is to fixup alignments for vector loads/stores after
vectorization (and unrolling). This pass is added to the optimization pipeline
just after the SLP vectorizer runs (which, admittedly, does not preserve SE,
although I imagine it could). Regardless, I actually don't think that the
preservation matters too much in this case: SE computes lazily, and this pass
won't issue any SE queries unless there are any assume intrinsics, so there
should be no real additional cost in the common case (SLP does preserve DT and
LoopInfo).
llvm-svn: 217344
This builds on r217342, which added the infrastructure to compute known bits
using assumptions (@llvm.assume calls). That original commit added only a few
patterns (to catch common cases related to determining pointer alignment); this
change adds several other patterns for simple cases.
r217342 contained that, for assume(v & b = a), bits in the mask
that are known to be one, we can propagate known bits from the a to v. It also
had a known-bits transfer for assume(a = b). This patch adds:
assume(~(v & b) = a) : For those bits in the mask that are known to be one, we
can propagate inverted known bits from the a to v.
assume(v | b = a) : For those bits in b that are known to be zero, we can
propagate known bits from the a to v.
assume(~(v | b) = a): For those bits in b that are known to be zero, we can
propagate inverted known bits from the a to v.
assume(v ^ b = a) : For those bits in b that are known to be zero, we can
propagate known bits from the a to v. For those bits in
b that are known to be one, we can propagate inverted
known bits from the a to v.
assume(~(v ^ b) = a) : For those bits in b that are known to be zero, we can
propagate inverted known bits from the a to v. For those
bits in b that are known to be one, we can propagate
known bits from the a to v.
assume(v << c = a) : For those bits in a that are known, we can propagate them
to known bits in v shifted to the right by c.
assume(~(v << c) = a) : For those bits in a that are known, we can propagate
them inverted to known bits in v shifted to the right by c.
assume(v >> c = a) : For those bits in a that are known, we can propagate them
to known bits in v shifted to the right by c.
assume(~(v >> c) = a) : For those bits in a that are known, we can propagate
them inverted to known bits in v shifted to the right by c.
assume(v >=_s c) where c is non-negative: The sign bit of v is zero
assume(v >_s c) where c is at least -1: The sign bit of v is zero
assume(v <=_s c) where c is negative: The sign bit of v is one
assume(v <_s c) where c is non-positive: The sign bit of v is one
assume(v <=_u c): Transfer the known high zero bits
assume(v <_u c): Transfer the known high zero bits (if c is know to be a power
of 2, transfer one more)
A small addition to InstCombine was necessary for some of the test cases. The
problem is that when InstCombine was simplifying and, or, etc. it would fail to
check the 'do I know all of the bits' condition before checking less specific
conditions and would not fully constant-fold the result. I'm not sure how to
trigger this aside from using assumptions, so I've just included the change
here.
llvm-svn: 217343
This change, which allows @llvm.assume to be used from within computeKnownBits
(and other associated functions in ValueTracking), adds some (optional)
parameters to computeKnownBits and friends. These functions now (optionally)
take a "context" instruction pointer, an AssumptionTracker pointer, and also a
DomTree pointer, and most of the changes are just to pass this new information
when it is easily available from InstSimplify, InstCombine, etc.
As explained below, the significant conceptual change is that known properties
of a value might depend on the control-flow location of the use (because we
care that the @llvm.assume dominates the use because assumptions have
control-flow dependencies). This means that, when we ask if bits are known in a
value, we might get different answers for different uses.
The significant changes are all in ValueTracking. Two main changes: First, as
with the rest of the code, new parameters need to be passed around. To make
this easier, I grouped them into a structure, and I made internal static
versions of the relevant functions that take this structure as a parameter. The
new code does as you might expect, it looks for @llvm.assume calls that make
use of the value we're trying to learn something about (often indirectly),
attempts to pattern match that expression, and uses the result if successful.
By making use of the AssumptionTracker, the process of finding @llvm.assume
calls is not expensive.
Part of the structure being passed around inside ValueTracking is a set of
already-considered @llvm.assume calls. This is to prevent a query using, for
example, the assume(a == b), to recurse on itself. The context and DT params
are used to find applicable assumptions. An assumption needs to dominate the
context instruction, or come after it deterministically. In this latter case we
only handle the specific case where both the assumption and the context
instruction are in the same block, and we need to exclude assumptions from
being used to simplify their own ephemeral values (those which contribute only
to the assumption) because otherwise the assumption would prove its feeding
comparison trivial and would be removed.
This commit adds the plumbing and the logic for a simple masked-bit propagation
(just enough to write a regression test). Future commits add more patterns
(and, correspondingly, more regression tests).
llvm-svn: 217342
It's probably not a huge deal to not do this - if we could, maybe the
address could be reused by a subprogram low_pc and avoid an extra
relocation, but it's just one per CU at best.
llvm-svn: 217338
This adds a set of utility functions for collecting 'ephemeral' values. These
are LLVM IR values that are used only by @llvm.assume intrinsics (directly or
indirectly), and thus will be removed prior to code generation, implying that
they should be considered free for certain purposes (like inlining). The
inliner's cost analysis, and a few other passes, have been updated to account
for ephemeral values using the provided functionality.
This functionality is important for the usability of @llvm.assume, because it
limits the "non-local" side-effects of adding llvm.assume on inlining, loop
unrolling, etc. (these are hints, and do not generate code, so they should not
directly contribute to estimates of execution cost).
llvm-svn: 217335
support for MOVDDUP which is really important for matrix multiply style
operations that do lots of non-vector-aligned load and splats.
The original motivation was to add support for MOVDDUP as the lack of it
regresses matmul_f64_4x4 by 5% or so. However, all of the rules here
were somewhat suspicious.
First, we should always be using the floating point domain shuffles,
regardless of how many copies we have to make as a movapd is *crazy*
faster than the domain switching cost on some chips. (Mostly because
movapd is crazy cheap.) Because SHUFPD can't do the copy-for-free trick
of the PSHUF instructions, there is no need to avoid canonicalizing on
UNPCK variants, so do that canonicalizing. This also ensures we have the
chance to form MOVDDUP. =]
Second, we assume SSE2 support when doing any vector lowering, and given
that we should just use UNPCKLPD and UNPCKHPD as they can operate on
registers or memory. If vectors get spilled or come from memory at all
this is going to allow the load to be folded into the operation. If we
want to optimize for encoding size (the only difference, and only
a 2 byte difference) it should be done *much* later, likely after RA.
llvm-svn: 217332
DWARF address ranges contain a reference to the debug_info section. This offset
is an absolute relocation except on non-PE/COFF targets where it is section
relative. We would emit this incorrectly, and trying to map the debug info from
the address would fail.
llvm-svn: 217317
parsing (and latent bug in the instruction definitions).
This is effectively a revert of r136287 which tried to address
a specific and narrow case of immediate operands failing to be accepted
by x86 instructions with a pretty heavy hammer: it introduced a new kind
of operand that behaved differently. All of that is removed with this
commit, but the test cases are both preserved and enhanced.
The core problem that r136287 and this commit are trying to handle is
that gas accepts both of the following instructions:
insertps $192, %xmm0, %xmm1
insertps $-64, %xmm0, %xmm1
These will encode to the same byte sequence, with the immediate
occupying an 8-bit entry. The first form was fixed by r136287 but that
broke the prior handling of the second form! =[ Ironically, we would
still emit the second form in some cases and then be unable to
re-assemble the output.
The reason why the first instruction failed to be handled is because
prior to r136287 the operands ere marked 'i32i8imm' which forces them to
be sign-extenable. Clearly, that won't work for 192 in a single byte.
However, making thim zero-extended or "unsigned" doesn't really address
the core issue either because it breaks negative immediates. The correct
fix is to make these operands 'i8imm' reflecting that they can be either
signed or unsigned but must be 8-bit immediates. This patch backs out
r136287 and then changes those places as well as some others to use
'i8imm' rather than one of the extended variants.
Naturally, this broke something else. The custom DAG nodes had to be
updated to have a much more accurate type constraint of an i8 node, and
a bunch of Pat immediates needed to be specified as i8 values.
The fallout didn't end there though. We also then ceased to be able to
match the instruction-specific intrinsics to the instructions so
modified. Digging, this is because they too used i32 rather than i8 in
their signature. So I've also switched those intrinsics to i8 arguments
in line with the instructions.
In order to make the intrinsic adjustments of course, I also had to add
auto upgrading for the intrinsics.
I suspect that the intrinsic argument types may have led everything down
this rabbit hole. Pretty happy with the result.
llvm-svn: 217310
computation was totally wrong, but somehow it didn't really show up with
llc.
I've added an assert that triggers on multiple existing test cases and
updated one of them to show the correct value.
There appear to still be more bugs lurking around insertps's mask. =/
However, note that this only really impacts the new vector shuffle
lowering.
llvm-svn: 217289
follows '~' in a clobber constraint string.
Previously llc would hit an llvm_unreachable when compiling an inline-asm
instruction with malformed constraint string "~x{21}". This commit enables
LLParser to catch the error earlier and print a more helpful diagnostic.
rdar://problem/14206559
llvm-svn: 217288
This problem is bigger than just fsub, but this is the minimum fix to solve
fneg for PR20556 ( http://llvm.org/bugs/show_bug.cgi?id=20556 ), and we solve
zero subtraction with the same change.
llvm-svn: 217286
When linking llvm.global_ctors with the optional third element we have to handle
it specially and only copy the elements whose keys were also copied.
llvm-svn: 217281
Summary:
Until r216870 LLVMCreateObjectFile returned nullptr in case of an error,
so callers could check if the call was successful. Now, it always
returns an OwningBinary wrapped as an LLVMObjectFileRef, so callers
can't check if the call was successul.
This results in a segfault running e.g.
llvm-c-test --object-list-sections < /dev/null
So the old behaviour should be restored.
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D5143
llvm-svn: 217279
This reverts commit r217211.
Both the bfd ld and gold outputs were valid. They were using a Rela relocation,
so the value present in the relocated location was not used, which caused me
to misread the output.
llvm-svn: 217264
shuffle lowering for integer vectors and share it from v4i32, v8i16, and
v16i8 code paths.
Ironically, the SSE2 v16i8 code for this is now better than the SSSE3!
=] Will have to fix the SSSE3 code next to just using a single pshufb.
llvm-svn: 217240
The special case did not work when run under -reassociate and can easily
be expressed by a further generalization of an existing pattern.
llvm-svn: 217227
The header contains an offset to the DWARF line table for the CU. The offset
must be section relative for COFF and absolute for others. The non-assembly
code path for the DWARF header generation already has the correct emission for
the headers. This corrects the assembly input path.
This was identified by BFD objecting to the LLVM generated DWARF information.
llvm-svn: 217222
Patched by Sergey Dmitrouk.
This pass tries to make consecutive compares of values use same operands to
allow CSE pass to remove duplicated instructions. For this it analyzes
branches and adjusts comparisons with immediate values by converting:
GE -> GT
GT -> GE
LT -> LE
LE -> LT
and adjusting immediate values appropriately. It basically corrects two
immediate values towards each other to make them equal.
llvm-svn: 217220