GCC does this for non-zero discriminators and, since GCC doesn't produce
column info, that was the only place it came up there. For LLVM, since
we can emit discriminators and/or column info, it makes more sense to
invert the condition and just test for changes in line number.
This should resolve at least some of the GDB 7.5 test suite failures
created by recent Clang changes that increase location fidelity (since
Clang includes column info by default on Linux, those changes produced
a bunch of cases that confused GDB).
In theory we could do this better/differently by grouping actual source
statements together in a similar manner to the way lexical scopes are
handled, but given that GDB isn't really in a position to consume that
(& users are probably somewhat used to different lines being different
'statements'), this seems the safest and cheapest change. (I'm concerned
that doing this 'right' would bloat the debugloc data even further -
something Duncan's working hard to address.)
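A minimal sketch of the inverted check (hedged; the names and placement
are illustrative, not the exact DwarfDebug code):

#include "llvm/MC/MCDwarf.h" // for DWARF2_FLAG_IS_STMT

// Mark a line-table row as the start of a statement only when the line
// number changed; column/discriminator-only changes no longer qualify.
unsigned computeRowFlags(unsigned Line, unsigned LastLine) {
  unsigned Flags = 0;
  if (Line != LastLine)
    Flags |= DWARF2_FLAG_IS_STMT;
  return Flags;
}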
llvm-svn: 225011
Under the large code model, we cannot assume that __morestack lives within
2^31 bytes of the call site, so we cannot use pc-relative addressing. We
cannot perform the call via a temporary register, as the rax register may
be used to store the static chain, and all other suitable registers may be
either callee-save or used for parameter passing. We cannot use the stack
at this point either because __morestack manipulates the stack directly.
To avoid these issues, perform an indirect call via a read-only memory
location containing the address.
This solution is not perfect, as it assumes that the .rodata section
is laid out within 2^31 bytes of each function body, but this seems to
be sufficient for JIT.
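As a hedged C++ analogy of the mechanism (not the actual backend code):
the callee's absolute address sits in a const pointer in .rodata, and
the call is made through that slot, so only the slot, not the callee
itself, must be reachable pc-relatively.

extern "C" void __morestack(); // may live anywhere in the address space

namespace {
// A const function pointer; the compiler places it in read-only data.
void (*const MorestackSlot)() = __morestack;
}

void callThroughRodata() {
  // Loads the slot via a pc-relative address, then calls indirectly.
  MorestackSlot();
}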
Differential Revision: http://reviews.llvm.org/D6787
llvm-svn: 225003
If a linker directive is already quoted, don't try to quote it again; otherwise it creates a mess.
This pops up in places like:
#pragma comment(linker,"\"/foo bar'\"")
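A minimal sketch of the check (helper name hypothetical):

#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/Twine.h"
#include <string>

// Quote a directive only when it needs it and is not already quoted.
std::string quoteIfNeeded(llvm::StringRef Directive) {
  if (Directive.startswith("\"") && Directive.endswith("\""))
    return Directive.str(); // already quoted; don't quote again
  if (Directive.find(' ') == llvm::StringRef::npos)
    return Directive.str(); // no spaces; no quoting needed
  return ("\"" + Directive + "\"").str();
}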
Differential Revision: http://reviews.llvm.org/D6792
llvm-svn: 224998
If the control flow is modelling an if-statement where the only instruction in
the 'then' basic block (excluding the terminator) is a call to cttz/ctlz,
CodeGenPrepare can try to speculate the cttz/ctlz call and simplify the control
flow graph.
Example:
\code
entry:
%cmp = icmp eq i64 %val, 0
br i1 %cmp, label %end.bb, label %then.bb
then.bb:
%c = tail call i64 @llvm.cttz.i64(i64 %val, i1 true)
br label %end.bb
end.bb:
%cond = phi i64 [ %c, %then.bb ], [ 64, %entry]
\endcode
In this example, basic block %then.bb is taken if value %val is not zero.
Also, the phi node in %end.bb would propagate the size in bits of %val
(64) only if %val is equal to zero.
With this patch, CodeGenPrepare will try to hoist the call to cttz from %then.bb
into basic block %entry only if cttz is cheap to speculate for the target.
Added two new hooks in TargetLowering.h to let targets customize the behavior
(i.e. decide whether it is cheap or not to speculate calls to cttz/ctlz). The
two new methods are 'isCheapToSpeculateCtlz' and 'isCheapToSpeculateCttz'.
By default, both methods return 'false'.
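A sketch of the defaults (hedged; the exact signatures in
TargetLowering.h may differ):

class TargetLowering { // excerpt sketch, not the full class
public:
  // Conservative defaults; targets override these when the instruction
  // is cheap enough to execute speculatively.
  virtual bool isCheapToSpeculateCttz() const { return false; }
  virtual bool isCheapToSpeculateCtlz() const { return false; }
};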
On X86, method 'isCheapToSpeculateCtlz' returns true only if the target has
LZCNT. Method 'isCheapToSpeculateCttz' only returns true if the target has BMI.
Differential Revision: http://reviews.llvm.org/D6728
llvm-svn: 224899
Masked vector intrinsics are a part of common LLVM IR, but they are natively supported only on AVX2 and AVX-512 targets. I added code that translates the masked intrinsics for all other targets: each masked vector intrinsic is converted to a chain of scalar operations inside conditional basic blocks.
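A hedged scalar analogy of the expansion (plain C++ rather than the
actual IR-level code): each lane is loaded only under its mask bit,
mirroring the conditional basic blocks.

#include <cstddef>

// One conditional scalar load per lane; lanes with a false mask bit
// keep the pass-through value and never touch memory.
void expandMaskedLoad(const int *Ptr, const bool *Mask,
                      const int *PassThru, int *Out, std::size_t N) {
  for (std::size_t I = 0; I != N; ++I)
    Out[I] = Mask[I] ? Ptr[I] : PassThru[I];
}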
http://reviews.llvm.org/D6436
llvm-svn: 224897
This function constructs the main liverange by merging all subranges if
subregister liveness tracking is available. This should be slightly
faster to compute than performing the liveness calculation again for
the main range. More importantly, it avoids cases where the main
liverange would cover positions where no subrange was live. These cases
happened for partial definitions where the actually defined part was
dead and only the undefined parts were used later.
Register coalescing requires that every part covered by the main
live range has at least one subrange live.
I also expect this function to become useful later for places where the
subranges are modified in ways that make it hard to correctly fix up the
main liverange in the machine scheduler; we can then simply reconstruct
it from the subranges.
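A hedged sketch of the construction (simplified; value numbers and the
real LiveRange API are omitted):

#include <algorithm>
#include <utility>
#include <vector>

using Segment = std::pair<unsigned, unsigned>; // [start, end) slots

// The main range is the union of all subrange segments, so it never
// covers a position where no subrange is live.
std::vector<Segment> buildMainRange(
    const std::vector<std::vector<Segment>> &SubRanges) {
  std::vector<Segment> All;
  for (const auto &Sub : SubRanges)
    All.insert(All.end(), Sub.begin(), Sub.end());
  std::sort(All.begin(), All.end());
  std::vector<Segment> Merged; // coalesce overlapping/adjacent segments
  for (const Segment &S : All) {
    if (!Merged.empty() && S.first <= Merged.back().second)
      Merged.back().second = std::max(Merged.back().second, S.second);
    else
      Merged.push_back(S);
  }
  return Merged;
}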
llvm-svn: 224806
Without a reference, the code did not remember the advanced positions
when moving the iterators of the subrange/register-unit ranges forward,
and instead would scan from the beginning again at the next position.
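An illustration of the bug class (hedged, names hypothetical):

#include <map>
#include <vector>

using Range = std::vector<int>; // sorted positions
using Iter = Range::const_iterator;

// The map caches a scan position per range. Binding the cached iterator
// by reference keeps the advance across queries; the bug was the
// equivalent of copying it by value, which rescanned from the beginning
// every time.
bool liveAt(const Range &R, std::map<const Range *, Iter> &Cache, int Pos) {
  Iter &I = Cache.emplace(&R, R.begin()).first->second;
  while (I != R.end() && *I < Pos)
    ++I; // progress persists in the cache
  return I != R.end() && *I == Pos;
}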
llvm-svn: 224803
Right now DAGCombine checks the validity of the returned type only when
-debug is given on the command line. However, the test cases used for
validation usually do not use -debug.
An assert build should always check this.
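A hedged sketch of the change (expression and message are illustrative;
N is the node being combined, RV the replacement value):

// Before: the type check ran only inside DEBUG(...), i.e. with -debug.
// After: every assert-enabled build verifies that a combine preserves
// the type of the value it replaces.
assert(N->getValueType(0) == RV.getValueType() &&
       "DAG combine replaced a node with a value of a different type");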
llvm-svn: 224779
Previously I tried to plug musttail into the existing vararg lowering
code. That turned out to be a mistake, because non-vararg calls use
significantly different register lowering, even on x86. For example, AVX
vectors are usually passed in registers to normal functions and memory
to vararg functions. Now musttail uses a completely separate lowering.
Hopefully this can be used as the basis for non-x86 perfect forwarding.
Reviewers: majnemer
Differential Revision: http://reviews.llvm.org/D6156
llvm-svn: 224745
Extend the existing code that handles this for zext. This makes the
transform more useful for targets with ZeroOrNegativeOne BooleanContent
and obsoletes a custom combine that SI uses for i1 setcc (sext(i1), 0,
setne), since the constant will now be shrunk to i1.
llvm-svn: 224691
We must not add kill flags when reading a vreg with some undefined
subregisters, if subreg liveness tracking is enabled. This is because
the register allocator may reuse these undefined subregisters for other
values which are not killed.
llvm-svn: 224664
It is intended to be used for a family of personality functions that
have similar IR preparation requirements. Typically when interoperating
with MSVC personality functions, bits of functionality need to be
outlined from the main function into helper functions. There is also
usually more than one landing pad per invoke, which does not match the
LLVM IR landingpad representation.
None of this is implemented yet. This change just adds a new enum that
is active for *-windows-msvc and delegates to the EH removal preparation
pass. No functionality change for other targets.
llvm-svn: 224625
This also fixes problems with undef copies of subregisters. I can't
attach a testcase for that as none of the targets in trunk has
subregister liveness tracking enabled.
llvm-svn: 224560
- This also fixes a bug introduced in r223880 where values were not
correctly marked as Dead anymore.
- Clean up computeDeadValues(): split out the SubRange code variant and
simplify the arguments.
llvm-svn: 224538
Add a new string member to the TargetOptions struct for the name
of the ABI we should be using. For targets that don't use the
option there's no change; otherwise this allows external users
to set the ABI via string and avoid some of the -backend-option
pain in clang.
Use this option to move the ABI for the ARM port from the
Subtarget to the TargetMachine and update the testcases
accordingly since it's no longer valid to set via -mattr.
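A hedged sketch of the new member (name and exact location are
assumptions):

#include <string>

class TargetOptions { // excerpt sketch
public:
  // Name of the ABI to use; an empty string keeps the target's default
  // behavior, so targets that ignore the option are unaffected.
  std::string ABIName;
};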
llvm-svn: 224492
This fixes a problem where stripCopies() would switch to values in the
main liverange when it crossed a copy instruction. However when joining
subranges we need to stay in the respective subregister ranges.
llvm-svn: 224461
The ExecutionDepsFix pass previously mapped each register to zero or one
registers of the register class it was called with (and is therefore
simulating liveness for). This was problematic for cases involving wider
registers like Q0 on ARM, where ExecutionDepsFix gets invoked for the Dxx
registers. In these cases the wide register would get mapped to the last
matching D register, while it should have been mapped to all matching D
registers.
This commit changes the AliasMap to use a SmallVector to map registers
to potentially multiple destination regclass registers. This is required
to avoid regressions with subregister liveness tracking enabled.
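A hedged before/after sketch of the map's shape:

#include "llvm/ADT/SmallVector.h"
#include <vector>

// Before (sketch): at most one regclass register per register, so a
// wide register kept only the last match.
//   std::vector<int> AliasMap; // -1 or a single mapped register
// After: all matching regclass registers, e.g. Q0 -> {D0, D1} on ARM.
std::vector<llvm::SmallVector<int, 1>> AliasMap;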
llvm-svn: 224447
This handles the case of a BUILD_VECTOR being constructed out of elements extracted from a vector twice the size of the result vector. Previously this was always scalarized. Now, we try to construct a shuffle node that feeds on extract_subvectors.
This fixes PR15872 and provides a partial fix for PR21711.
Differential Revision: http://reviews.llvm.org/D6678
llvm-svn: 224429
Summary:
When generating MIPS assembly, LLVM always overrides the default assembler options by emitting the '.set noreorder', '.set nomacro' and '.set noat' directives,
while GCC uses the default options if an assembly-level function contains inline assembly code.
This becomes a problem when the code generated by LLVM is interleaved with inline assembly which assumes GCC-like assembler options (from Linux, for example).
This patch fixes these conflicts by setting the appropriate assembler options at the beginning of an inline asm block and popping them at the end.
Reviewers: dsanders
Reviewed By: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D6637
llvm-svn: 224425
The type promotion helper does not support vector types, so the
transformation does not kick in in such cases.
Original commit message:
[CodeGenPrepare] Move sign/zero extensions near loads using type promotion.
This patch extends the optimization in CodeGenPrepare that moves a sign/zero
extension near a load when the target can combine them. The optimization may
promote any operations between the extension and the load to make that possible.
Although this optimization may be beneficial for all targets, in particular
AArch64, this is enabled for X86 only as I have not benchmarked it for other
targets yet.
** Context **
Most targets feature extended loads, i.e., loads that perform a zero or sign
extension for free. In that context it is interesting to expose such pattern in
CodeGenPrepare so that the instruction selection pass can form such loads.
Sometimes, this pattern is blocked because of instructions between the load and
the extension. When those instructions are promotable to the extended type, we
can expose this pattern.
** Motivating Example **
Let us consider an example:
define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) {
%ld = load i8* %addr1
%zextld = zext i8 %ld to i32
%ld2 = load i32* %addr2
%add = add nsw i32 %ld2, %zextld
%sextadd = sext i32 %add to i64
%zexta = zext i8 %a to i32
%addza = add nsw i32 %zexta, %zextld
%sextaddza = sext i32 %addza to i64
%addb = add nsw i32 %b, %zextld
%sextaddb = sext i32 %addb to i64
call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb)
ret void
}
As it is, this IR generates the following assembly on x86_64:
[...]
movzbl (%rdi), %eax # zero-extended load
movl (%rsi), %esi # plain load
addl %eax, %esi # 32-bit add
movslq %esi, %rdi # sign extend the result of add
movzbl %dl, %edx # zero extend the first argument
addl %eax, %edx # 32-bit add
movslq %edx, %rsi # sign extend the result of add
addl %eax, %ecx # 32-bit add
movslq %ecx, %rdx # sign extend the result of add
[...]
The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA.
Now, by promoting the additions to form more extended loads we would generate:
[...]
movzbl (%rdi), %eax # zero-extended load
movslq (%rsi), %rdi # sign-extended load
addq %rax, %rdi # 64-bit add
movzbl %dl, %esi # zero extend the first argument
addq %rax, %rsi # 64-bit add
movslq %ecx, %rdx # sign extend the second argument
addq %rax, %rdx # 64-bit add
[...]
The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA.
This kind of sequence happens a lot in code using 32-bit indexes on
64-bit architectures.
Note: The throughput numbers are similar on Sandy Bridge and Haswell.
** Proposed Solution **
To avoid the penalty of all these sign/zero extensions, we merge them in the
loads at the beginning of the chain of computation by promoting all the chain of
computation on the extended type. The promotion is done if and only if we do not
introduce new extensions, i.e., if we do not degrade the code quality.
To achieve this, we extend the existing “move ext to load” optimization with the
promotion mechanism introduced to match larger patterns for addressing mode
(r200947).
The idea of this extension is to perform the following transformation:
ext(promotableInst1(...(promotableInstN(load))))
=>
promotedInst1(...(promotedInstN(ext(load))))
The promotion mechanism in that optimization is enabled by a new TargetLowering
switch, which is off by default. In other words, by default, the optimization
performs the “move ext to load” optimization as it was before this patch.
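A hedged sketch of the switch (the member name is an assumption based
on the description):

class TargetLoweringBase { // excerpt sketch
protected:
  // Off by default: without opting in, only the plain "move ext to
  // load" rewrite runs, exactly as before this patch.
  bool EnableExtLdPromotion = false;
public:
  bool enableExtLdPromotion() const { return EnableExtLdPromotion; }
};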
** Performance **
Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10.
Tested Optimization Levels: O3/Os
Tests: llvm-testsuite + externals.
Results:
- No regressions besides noise.
- Improvements:
CINT2006/473.astar: ~2%
Benchmarks/PAQ8p: ~2%
Misc/perlin: ~3%
The results are consistent for both O3 and Os.
<rdar://problem/18310086>
llvm-svn: 224402
SwitchInst::getNumCases() returns unsigned, so using uint64_t to count
cases seems unnecessary.
Also fix a missing CHECK in the test case.
llvm-svn: 224393
SelectionDAG::isConsecutiveLoad() was not detecting consecutive loads
when the first load was offset from a base address.
This patch recognizes that pattern and subtracts the offset before comparing
the second load to see if it is consecutive.
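A hedged arithmetic sketch of the fixed check (names hypothetical):

#include <cstdint>

// Two loads off the same base are consecutive when the second starts
// exactly where the first ends; subtracting the first load's own offset
// (previously ignored) makes this hold when LD1 is not at the base
// address itself.
bool isConsecutive(int64_t Off1, int64_t Off2, int64_t WidthBytes) {
  return Off2 - Off1 == WidthBytes;
}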
The codegen change in the new test case improves from:
vmovsd 32(%rdi), %xmm0
vmovsd 48(%rdi), %xmm1
vmovhpd 56(%rdi), %xmm1, %xmm1
vmovhpd 40(%rdi), %xmm0, %xmm0
vinsertf128 $1, %xmm1, %ymm0, %ymm0
To:
vmovups 32(%rdi), %ymm0
An existing test case is also improved from:
vmovsd (%rdi), %xmm0
vmovsd 16(%rdi), %xmm1
vmovsd 24(%rdi), %xmm2
vunpcklpd %xmm2, %xmm0, %xmm0 ## xmm0 = xmm0[0],xmm2[0]
vmovhpd 8(%rdi), %xmm1, %xmm3
To:
vmovsd (%rdi), %xmm0
vmovsd 16(%rdi), %xmm1
vmovhpd 24(%rdi), %xmm0, %xmm0
vmovhpd 8(%rdi), %xmm1, %xmm1
This patch fixes PR21771 ( http://llvm.org/bugs/show_bug.cgi?id=21771 ).
Differential Revision: http://reviews.llvm.org/D6642
llvm-svn: 224379
This patch extends the optimization in CodeGenPrepare that moves a sign/zero
extension near a load when the target can combine them. The optimization may
promote any operations between the extension and the load to make that possible.
Although this optimization may be beneficial for all targets, in particular
AArch64, this is enabled for X86 only as I have not benchmarked it for other
targets yet.
** Context **
Most targets feature extended loads, i.e., loads that perform a zero or sign
extension for free. In that context it is interesting to expose such pattern in
CodeGenPrepare so that the instruction selection pass can form such loads.
Sometimes, this pattern is blocked because of instructions between the load and
the extension. When those instructions are promotable to the extended type, we
can expose this pattern.
** Motivating Example **
Let us consider an example:
define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) {
%ld = load i8* %addr1
%zextld = zext i8 %ld to i32
%ld2 = load i32* %addr2
%add = add nsw i32 %ld2, %zextld
%sextadd = sext i32 %add to i64
%zexta = zext i8 %a to i32
%addza = add nsw i32 %zexta, %zextld
%sextaddza = sext i32 %addza to i64
%addb = add nsw i32 %b, %zextld
%sextaddb = sext i32 %addb to i64
call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb)
ret void
}
As it is, this IR generates the following assembly on x86_64:
[...]
movzbl (%rdi), %eax # zero-extended load
movl (%rsi), %esi # plain load
addl %eax, %esi # 32-bit add
movslq %esi, %rdi # sign extend the result of add
movzbl %dl, %edx # zero extend the first argument
addl %eax, %edx # 32-bit add
movslq %edx, %rsi # sign extend the result of add
addl %eax, %ecx # 32-bit add
movslq %ecx, %rdx # sign extend the result of add
[...]
The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA.
Now, by promoting the additions to form more extended loads we would generate:
[...]
movzbl (%rdi), %eax # zero-extended load
movslq (%rsi), %rdi # sign-extended load
addq %rax, %rdi # 64-bit add
movzbl %dl, %esi # zero extend the first argument
addq %rax, %rsi # 64-bit add
movslq %ecx, %rdx # sign extend the second argument
addq %rax, %rdx # 64-bit add
[...]
The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA.
This kind of sequence happens a lot in code using 32-bit indexes on
64-bit architectures.
Note: The throughput numbers are similar on Sandy Bridge and Haswell.
** Proposed Solution **
To avoid the penalty of all these sign/zero extensions, we merge them in the
loads at the beginning of the chain of computation by promoting all the chain of
computation on the extended type. The promotion is done if and only if we do not
introduce new extensions, i.e., if we do not degrade the code quality.
To achieve this, we extend the existing “move ext to load” optimization with the
promotion mechanism introduced to match larger patterns for addressing mode
(r200947).
The idea of this extension is to perform the following transformation:
ext(promotableInst1(...(promotableInstN(load))))
=>
promotedInst1(...(promotedInstN(ext(load))))
The promotion mechanism in that optimization is enabled by a new TargetLowering
switch, which is off by default. In other words, by default, the optimization
performs the “move ext to load” optimization as it was before this patch.
** Performance **
Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10.
Tested Optimization Levels: O3/Os
Tests: llvm-testsuite + externals.
Results:
- No regressions besides noise.
- Improvements:
CINT2006/473.astar: ~2%
Benchmarks/PAQ8p: ~2%
Misc/perlin: ~3%
The results are consistent for both O3 and Os.
<rdar://problem/18310086>
llvm-svn: 224351
This changes subrange calculation to calculate subranges sequentially
instead of in parallel. The code is easier to understand that way, and
it addresses the code review issues raised about LiveOutData being hard
to understand/needing more comments by removing LiveOutData entirely :)
llvm-svn: 224313