This is important because of different addressing modes
depending on the address space for GPU targets.
This only adds the argument, and does not update
any of the uses to provide the correct address space.
llvm-svn: 238723
This adds new SDNodes for signed/unsigned min/max. These nodes are built from
select/icmp pairs matched at SDAGBuilder stage.
This patch adds the nodes, as well as legalization support and sets them to
be "expand" for all targets.
NFC for now; this will be tested when I switch AArch64 to using these new
nodes.
llvm-svn: 237423
Other targets probably should as well. Since r237161, compiler-rt has
both, but I don't see why anything other than gnueabi would use a
gnueabi naming scheme.
llvm-svn: 237324
Summary:
This patch adds legalization support to operate on FP16 as a load/store type
and do operations on it as floats.
Tests for ARM are added to test/CodeGen/ARM/fp16-promote.ll
Reviewers: srhines, t.p.northover
Differential Revision: http://reviews.llvm.org/D8755
llvm-svn: 235215
The original f32->f64 promotion logic was refactored into roughly the
currently shape in r37781. However, starting with r132263, the
legalizer has been split into different kinds, and the previous
"Promote" (which did the right thing) was search-and-replace'd into
"PromoteInteger". The divide gradually deepened, with type legalization
("PromoteInteger") being separated from ops legalization
("Promote", which still works for floating point ops).
Fast-forward to today: there's no in-tree target with legal f64 but
illegal f32 (rather: no tests were harmed in the making of this patch).
With such a target, i.e., if you trick the legalizer into going through
the PromoteInteger path for FP, you get the expected brokenness.
For instance, there's no PromoteIntRes_FADD (the name itself sounds
wrong), so we'll just hit some assert in the PromoteInteger path.
Don't pretend we can promote f32 to f64. Instead, always soften.
llvm-svn: 233464
We used to mark a bunch of libm nodes as Expand for f16. There are no
libcalls we can use for those, so we eventually just hit an unhelpful
llvm_unreachable in ExpandFPLibCall.
Instead, just ignore them altogether. If nothing else changes, we'll
then get the more descriptive and pleasant "Cannot select" fatal error.
There's an argument to be made for consistency, but f16 is already
special in all the good ways, and as long as there's no f16 support in
the ops expander (this patch), as well as the Soften/Expand float
legalizers (which, when hit, will currently segfault), I think there's
no point in even pretending we can legalize any of this.
This shouldn't affect anything that's not already broken.
llvm-svn: 233328
Summary:
Now that the DataLayout is a mandatory part of the module, let's start
cleaning the codebase. This patch is a first attempt at doing that.
This patch is not exactly NFC as for instance some places were passing
a nullptr instead of the DataLayout, possibly just because there was a
default value on the DataLayout argument to many functions in the API.
Even though it is not purely NFC, there is no change in the
validation.
I turned as many pointer to DataLayout to references, this helped
figuring out all the places where a nullptr could come up.
I had initially a local version of this patch broken into over 30
independant, commits but some later commit were cleaning the API and
touching part of the code modified in the previous commits, so it
seemed cleaner without the intermediate state.
Test Plan:
Reviewers: echristo
Subscribers: llvm-commits
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 231740
This required plumbing a TargetRegisterInfo through computeRegisterProperties
and into findRepresentativeClass which uses it for register class
iteration. This required passing a subtarget into a few target specific
initializations of TargetLowering.
llvm-svn: 230583
the .h file. It's used in only one place (other than recursively)
and there's no need to include it everywhere.
Saves almost 900k from total llvm object file size.
llvm-svn: 230561
This allows sharing of FMA forming combines to work
with instructions that have the same semantics as a separate
multiply and add.
This is expand by default, and only formed post legalization
so it shouldn't have much impact on targets that do not want it.
llvm-svn: 230070
derived classes.
Since global data alignment, layout, and mangling is often based on the
DataLayout, move it to the TargetMachine. This ensures that global
data is going to be layed out and mangled consistently if the subtarget
changes on a per function basis. Prior to this all targets(*) have
had subtarget dependent code moved out and onto the TargetMachine.
*One target hasn't been migrated as part of this change: R600. The
R600 port has, as a subtarget feature, the size of pointers and
this affects global data layout. I've currently hacked in a FIXME
to enable progress, but the port needs to be updated to either pass
the 64-bitness to the TargetMachine, or fix the DataLayout to
avoid subtarget dependent features.
llvm-svn: 227113
This requires a new hook to prevent expanding sqrt in terms
of rsqrt and reciprocal. v_rcp_f32, v_rsq_f32, and v_sqrt_f32 are
all the same rate, so this expansion would just double the number
of instructions and cycles.
llvm-svn: 225828
The type promotion helper does not support vector type, so when make
such it does not kick in in such cases.
Original commit message:
[CodeGenPrepare] Move sign/zero extensions near loads using type promotion.
This patch extends the optimization in CodeGenPrepare that moves a sign/zero
extension near a load when the target can combine them. The optimization may
promote any operations between the extension and the load to make that possible.
Although this optimization may be beneficial for all targets, in particular
AArch64, this is enabled for X86 only as I have not benchmarked it for other
targets yet.
** Context **
Most targets feature extended loads, i.e., loads that perform a zero or sign
extension for free. In that context it is interesting to expose such pattern in
CodeGenPrepare so that the instruction selection pass can form such loads.
Sometimes, this pattern is blocked because of instructions between the load and
the extension. When those instructions are promotable to the extended type, we
can expose this pattern.
** Motivating Example **
Let us consider an example:
define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) {
%ld = load i8* %addr1
%zextld = zext i8 %ld to i32
%ld2 = load i32* %addr2
%add = add nsw i32 %ld2, %zextld
%sextadd = sext i32 %add to i64
%zexta = zext i8 %a to i32
%addza = add nsw i32 %zexta, %zextld
%sextaddza = sext i32 %addza to i64
%addb = add nsw i32 %b, %zextld
%sextaddb = sext i32 %addb to i64
call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb)
ret void
}
As it is, this IR generates the following assembly on x86_64:
[...]
movzbl (%rdi), %eax # zero-extended load
movl (%rsi), %es # plain load
addl %eax, %esi # 32-bit add
movslq %esi, %rdi # sign extend the result of add
movzbl %dl, %edx # zero extend the first argument
addl %eax, %edx # 32-bit add
movslq %edx, %rsi # sign extend the result of add
addl %eax, %ecx # 32-bit add
movslq %ecx, %rdx # sign extend the result of add
[...]
The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA.
Now, by promoting the additions to form more extended loads we would generate:
[...]
movzbl (%rdi), %eax # zero-extended load
movslq (%rsi), %rdi # sign-extended load
addq %rax, %rdi # 64-bit add
movzbl %dl, %esi # zero extend the first argument
addq %rax, %rsi # 64-bit add
movslq %ecx, %rdx # sign extend the second argument
addq %rax, %rdx # 64-bit add
[...]
The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA.
This kind of sequences happen a lot on code using 32-bit indexes on 64-bit
architectures.
Note: The throughput numbers are similar on Sandy Bridge and Haswell.
** Proposed Solution **
To avoid the penalty of all these sign/zero extensions, we merge them in the
loads at the beginning of the chain of computation by promoting all the chain of
computation on the extended type. The promotion is done if and only if we do not
introduce new extensions, i.e., if we do not degrade the code quality.
To achieve this, we extend the existing “move ext to load” optimization with the
promotion mechanism introduced to match larger patterns for addressing mode
(r200947).
The idea of this extension is to perform the following transformation:
ext(promotableInst1(...(promotableInstN(load))))
=>
promotedInst1(...(promotedInstN(ext(load))))
The promotion mechanism in that optimization is enabled by a new TargetLowering
switch, which is off by default. In other words, by default, the optimization
performs the “move ext to load” optimization as it was before this patch.
** Performance **
Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10.
Tested Optimization Levels: O3/Os
Tests: llvm-testsuite + externals.
Results:
- No regression beside noise.
- Improvements:
CINT2006/473.astar: ~2%
Benchmarks/PAQ8p: ~2%
Misc/perlin: ~3%
The results are consistent for both O3 and Os.
<rdar://problem/18310086>
llvm-svn: 224402
This patch extends the optimization in CodeGenPrepare that moves a sign/zero
extension near a load when the target can combine them. The optimization may
promote any operations between the extension and the load to make that possible.
Although this optimization may be beneficial for all targets, in particular
AArch64, this is enabled for X86 only as I have not benchmarked it for other
targets yet.
** Context **
Most targets feature extended loads, i.e., loads that perform a zero or sign
extension for free. In that context it is interesting to expose such pattern in
CodeGenPrepare so that the instruction selection pass can form such loads.
Sometimes, this pattern is blocked because of instructions between the load and
the extension. When those instructions are promotable to the extended type, we
can expose this pattern.
** Motivating Example **
Let us consider an example:
define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) {
%ld = load i8* %addr1
%zextld = zext i8 %ld to i32
%ld2 = load i32* %addr2
%add = add nsw i32 %ld2, %zextld
%sextadd = sext i32 %add to i64
%zexta = zext i8 %a to i32
%addza = add nsw i32 %zexta, %zextld
%sextaddza = sext i32 %addza to i64
%addb = add nsw i32 %b, %zextld
%sextaddb = sext i32 %addb to i64
call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb)
ret void
}
As it is, this IR generates the following assembly on x86_64:
[...]
movzbl (%rdi), %eax # zero-extended load
movl (%rsi), %es # plain load
addl %eax, %esi # 32-bit add
movslq %esi, %rdi # sign extend the result of add
movzbl %dl, %edx # zero extend the first argument
addl %eax, %edx # 32-bit add
movslq %edx, %rsi # sign extend the result of add
addl %eax, %ecx # 32-bit add
movslq %ecx, %rdx # sign extend the result of add
[...]
The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA.
Now, by promoting the additions to form more extended loads we would generate:
[...]
movzbl (%rdi), %eax # zero-extended load
movslq (%rsi), %rdi # sign-extended load
addq %rax, %rdi # 64-bit add
movzbl %dl, %esi # zero extend the first argument
addq %rax, %rsi # 64-bit add
movslq %ecx, %rdx # sign extend the second argument
addq %rax, %rdx # 64-bit add
[...]
The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA.
This kind of sequences happen a lot on code using 32-bit indexes on 64-bit
architectures.
Note: The throughput numbers are similar on Sandy Bridge and Haswell.
** Proposed Solution **
To avoid the penalty of all these sign/zero extensions, we merge them in the
loads at the beginning of the chain of computation by promoting all the chain of
computation on the extended type. The promotion is done if and only if we do not
introduce new extensions, i.e., if we do not degrade the code quality.
To achieve this, we extend the existing “move ext to load” optimization with the
promotion mechanism introduced to match larger patterns for addressing mode
(r200947).
The idea of this extension is to perform the following transformation:
ext(promotableInst1(...(promotableInstN(load))))
=>
promotedInst1(...(promotedInstN(ext(load))))
The promotion mechanism in that optimization is enabled by a new TargetLowering
switch, which is off by default. In other words, by default, the optimization
performs the “move ext to load” optimization as it was before this patch.
** Performance **
Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10.
Tested Optimization Levels: O3/Os
Tests: llvm-testsuite + externals.
Results:
- No regression beside noise.
- Improvements:
CINT2006/473.astar: ~2%
Benchmarks/PAQ8p: ~2%
Misc/perlin: ~3%
The results are consistent for both O3 and Os.
<rdar://problem/18310086>
llvm-svn: 224351
This is the second patch in a small series. This patch contains the MachineInstruction and x86-64 backend pieces required to lower Statepoints. It does not include the code to actually generate the STATEPOINT machine instruction and as a result, the entire patch is currently dead code. I will be submitting the SelectionDAG parts within the next 24-48 hours. Since those pieces are by far the most complicated, I wanted to minimize the size of that patch. That patch will include the tests which exercise the functionality in this patch. The entire series can be seen as one combined whole in http://reviews.llvm.org/D5683.
The STATEPOINT psuedo node is generated after all gc values are explicitly spilled to stack slots. The purpose of this node is to wrap an actual call instruction while recording the spill locations of the meta arguments used for garbage collection and other purposes. The STATEPOINT is modeled as modifing all of those locations to prevent backend optimizations from forwarding the value from before the STATEPOINT to after the STATEPOINT. (Doing so would break relocation semantics for collectors which wish to relocate roots.)
The implementation of STATEPOINT is closely modeled on PATCHPOINT. Eventually, much of the code in this patch will be removed. The long term plan is to merge the functionality provided by statepoints and patchpoints. Merging their implementations in the backend is likely to be a good starting point.
Reviewed by: atrick, ributzka
llvm-svn: 223085
isPow2DivCheap
That name doesn't specify signed or unsigned.
Lazy as I am, I eventually read the function and variable comments. It turns out that this is strictly about signed div. But I discovered that the comments are wrong:
srl/add/sra
is not the general sequence for signed integer division by power-of-2. We need one more 'sra':
sra/srl/add/sra
That's the sequence produced in DAGCombiner. The first 'sra' may be removed when dividing by exactly '2', but that's a special case.
This patch corrects the comments, changes the name of the flag bit, and changes the name of the accessor methods.
No functional change intended.
Differential Revision: http://reviews.llvm.org/D5010
llvm-svn: 216237
floating point exceptions, added use of flag to fold potentially exception
raising floating point math in selection DAG. No functionality change, as
targets have to explicitly ask for this behavior and none does today.
llvm-svn: 215222
be deleted. This will be reapplied as soon as possible and before
the 3.6 branch date at any rate.
Approved by Jim Grosbach, Lang Hames, Rafael Espindola.
This reverts commits r215111, 215115, 215116, 215117, 215136.
llvm-svn: 215154
I am sure we will be finding bits and pieces of dead code for years to
come, but this is a good start.
Thanks to Lang Hames for making MCJIT a good replacement!
llvm-svn: 215111
This makes the first stage DAG for @llvm.convert.to.fp16 an fptrunc,
and correspondingly @llvm.convert.from.fp16 an fpext. The legalisation
path is now uniform, regardless of the input IR:
fptrunc -> FP_TO_FP16 (if f16 illegal) -> libcall
fpext -> FP16_TO_FP (if f16 illegal) -> libcall
Each target should be able to select the version that best matches its
operations and not be required to duplicate patterns for both fptrunc
and FP_TO_FP16 (for example).
As a result we can remove some redundant AArch64 patterns.
llvm-svn: 213507
Actual support for softening f16 operations is still limited, and can be added
when it's needed. But Soften is much closer to being a useful thing to try
than keeping it Legal when no registers can actually hold such values.
Longer term, we probably want something between Soften and Promote semantics
for most targets, it'll be more efficient to promote the 4 basic operations to
f32 than libcall them.
llvm-svn: 213372
Previously we asserted on this code. Currently compiler-rt doesn't
actually implement any of these new libcalls, but external help is
pretty much the only viable option for LLVM.
I've followed the much more generic "__truncST2" naming, as opposed to
the odd name for f32 -> f16 truncation. This can obviously be changed
later, or overridden by any targets that need to.
llvm-svn: 213252
to the zero-extend-vector-inreg node introduced previously for the same
purpose: manage the type legalization of widened extend operations,
especially to support the experimental widening mode for x86.
I'm adding both because sign-extend is expanded in terms of any-extend
with shifts to propagate the sign bit. This removes the last
fundamental scalarization from vec_cast2.ll (a test case that hit many
really bad edge cases for widening legalization), although the trunc
tests in that file still appear scalarized because the the shuffle
legalization is scalarizing. Funny thing, I've been working on that.
Some initial experiments with this and SSE2 scenarios is showing
moderately good behavior already for sign extension. Still some work to
do on the shuffle combining on X86 before we're generating optimal
sequences, but avoiding scalarization is a huge step forward.
llvm-svn: 212714
Summary:
On MIPS32r6/MIPS64r6, floating point comparisons return 0 or -1 but integer
comparisons return 0 or 1.
Updated the various uses of getBooleanContents. Two simplifications had to be
disabled when float and int boolean contents differ:
- ScalarizeVecRes_VSELECT except when the kind of boolean contents is trivially
discoverable (i.e. when the condition of the VSELECT is a SETCC node).
- visitVSELECT (select C, 0, 1) -> (xor C, 1).
Come to think of it, this one could test for the common case of 'C'
being a SETCC too.
Preserved existing behaviour for all other targets and updated the affected
MIPS32r6/MIPS64r6 tests. This also fixes the pi benchmark where the 'low'
variable was counting in the wrong direction because it thought it could simply
add the result of the comparison.
Reviewers: hfinkel
Reviewed By: hfinkel
Subscribers: hfinkel, jholewinski, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D4389
llvm-svn: 212697
don't need to set it manually.
This is based on feedback from Tom who pointed out that if every target
needs to handle this we need to reach out to those maintainers. In fact,
it doesn't make sense to duplicate everything when anything other than
expand seems unlikely at this stage.
llvm-svn: 212661
vector type legalization strategies in a more fine grained manner, and
change the legalization of several v1iN types and v1f32 to be widening
rather than scalarization on AArch64.
This fixes an assertion failure caused by scalarizing nodes like "v1i32
trunc v1i64". As v1i64 is legal it will fail to scalarize v1i32.
This also provides a foundation for other targets to have more granular
control over how vector types are legalized.
Patch by Hao Liu, reviewed by Tim Northover. I'm committing it to allow
some work to start taking place on top of this patch as it adds some
really important hooks to the backend that I'd like to immediately start
using. =]
http://reviews.llvm.org/D4322
llvm-svn: 212242
This commit adds a weak variant of the cmpxchg operation, as described
in C++11. A cmpxchg instruction with this modifier is permitted to
fail to store, even if the comparison indicated it should.
As a result, cmpxchg instructions must return a flag indicating
success in addition to their original iN value loaded. Thus, for
uniformity *all* cmpxchg instructions now return "{ iN, i1 }". The
second flag is 1 when the store succeeded.
At the DAG level, a new ATOMIC_CMP_SWAP_WITH_SUCCESS node has been
added as the natural representation for the new cmpxchg instructions.
It is a strong cmpxchg.
By default this gets Expanded to the existing ATOMIC_CMP_SWAP during
Legalization, so existing backends should see no change in behaviour.
If they wish to deal with the enhanced node instead, they can call
setOperationAction on it. Beware: as a node with 2 results, it cannot
be selected from TableGen.
Currently, no use is made of the extra information provided in this
patch. Test updates are almost entirely adapting the input IR to the
new scheme.
Summary for out of tree users:
------------------------------
+ Legacy Bitcode files are upgraded during read.
+ Legacy assembly IR files will be invalid.
+ Front-ends must adapt to different type for "cmpxchg".
+ Backends should be unaffected by default.
llvm-svn: 210903