VSX is an ISA extension supported on the POWER7 and later cores that enhances
floating-point vector and scalar capabilities. Among other things, this adds
<2 x double> support and generally helps to reduce register pressure.
The interesting part of this ISA feature is the register configuration: there
are 64 new 128-bit vector registers, the 32 of which are super-registers of the
existing 32 scalar floating-point registers, and the second 32 of which overlap
with the 32 Altivec vector registers. This makes things like vector insertion
and extraction tricky: this can be free but only if we force a restriction to
the right register subclass when needed. A new "minipass" PPCVSXCopy takes care
of this (although it could do a more-optimal job of it; see the comment about
unnecessary copies below).
Please note that, currently, VSX is not enabled by default when targeting
anything because it is not yet ready for that. The assembler and disassembler
are fully implemented and tested. However:
- CodeGen support causes miscompiles; test-suite runtime failures:
MultiSource/Benchmarks/FreeBench/distray/distray
MultiSource/Benchmarks/McCat/08-main/main
MultiSource/Benchmarks/Olden/voronoi/voronoi
MultiSource/Benchmarks/mafft/pairlocalalign
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4
SingleSource/Benchmarks/CoyoteBench/almabench
SingleSource/Benchmarks/Misc/matmul_f64_4x4
- The lowering currently falls back to using Altivec instructions far more
than it should. Worse, there are some things that are scalarized through the
stack that shouldn't be.
- A lot of unnecessary copies make it past the optimizers, and this needs to
be fixed.
- Many more regression tests are needed.
Normally, I'd fix these things prior to committing, but there are some
students and other contributors who would like to work this, and so it makes
sense to move this development process upstream where it can be subject to the
regular code-review procedures.
llvm-svn: 203768
Support to the IAS was added to actually parse and handle the complex SO
expressions. However, the object file lowering was not updated to compensate
for the fact that the shift operand may be an absolute expression.
When trying to assemble to an object file, the lowering would fail while
succeeding when emitting purely assembly. Add an appropriate test.
The test case is inspired by the test case provided by Jiangning Liu who also
brought the issue to light.
llvm-svn: 203762
Chandler voiced some concern with checking this in without some
discussion first. Reverting for now.
This reverts r203703, r203704, r203708, and 203709.
llvm-svn: 203723
Extend what's currently done for shift because the HW performs this masking
implicitly:
(rotl:i32 x, (and y, 31)) -> (rotl:i32 x, y)
I use the newly factored out multiclass that was only supporting shifts so
far.
For testing I extended my testcase for the new rotation idiom.
<rdar://problem/15295856>
llvm-svn: 203718
On ELF and COFF an alias is just another name for a position in the file.
There is no way to refer to a position in another file, so an alias to
undefined is meaningless.
MachO currently doesn't support aliases. The spec has a N_INDR, which when
implemented will have a different set of restrictions. Adding support for
it shouldn't be harder than any other IR extension.
For now, having the IR represent what is actually possible with current
tools makes it easier to fix the design of GlobalAlias.
llvm-svn: 203705
This replaces the llvm-profdata tool with a version that uses the
recently introduced Profile library. The new tool has the ability to
generate and summarize profdata files as well as merging them.
llvm-svn: 203704
This allows us to generate table lookups for code such as:
unsigned test(unsigned x) {
switch (x) {
case 100: return 0;
case 101: return 1;
case 103: return 2;
case 105: return 3;
case 107: return 4;
case 109: return 5;
case 110: return 6;
default: return f(x);
}
}
Since cases 102, 104, etc. are not constants, the lookup table has holes
in those positions. We therefore guard the table lookup with a bitmask check.
Patch by Jasper Neumann!
llvm-svn: 203694
Summary:
Correct the match patterns and the lowerings that made the CodeGen tests pass despite the mistakes.
The original testcase that discovered the problem was SingleSource/UnitTests/SignlessType/factor.c in test-suite.
During review, we also found that some of the existing CodeGen tests were incorrect and fixed them:
* bitwise.ll: In bsel_v16i8 the IfSet/IfClear were reversed because bsel and bmnz have different operand orders and the test didn't correctly account for this. bmnz goes 'IfClear, IfSet, CondMask', while bsel goes 'CondMask, IfClear, IfSet'.
* vec.ll: In the cases where a bsel is emitted as a bmnz (they are the same operation with a different input tied to the result) the operands were in the wrong order.
* compare.ll and compare_float.ll: The bsel operand order was correct for a greater-than comparison, but a greater-than comparison instruction doesn't exist. Lowering this operation inverts the condition so the IfSet/IfClear need to be swapped to match.
The differences between BSEL, BMNZ, and BMZ and how they map to/from vselect are rather confusing. I've therefore added a note to MSA.txt to explain this in a single place in addition to the comments that explain each case.
Reviewers: matheusalmeida, jacksprat
Reviewed By: matheusalmeida
Differential Revision: http://llvm-reviews.chandlerc.com/D3028
llvm-svn: 203657
When the list of VFP registers to be saved was non-contiguous (so multiple
vpush/vpop instructions were needed) these were being ordered oddly, as in:
vpush {d8, d9}
vpush {d11}
This led to the layout in memory being [d11, d8, d9] which is ugly and doesn't
match the CFI_INSTRUCTIONs we're generating either (so Dwarf info would be
broken).
This switches the order of vpush/vpop (in both prologue and epilogue,
obviously) so that the Dwarf locations are correct again.
rdar://problem/16264856
llvm-svn: 203655
It seems gas can't handle CFI directives with VFP register names ("d12", etc.).
This broke us trying to build Chromium for Android after 201423.
A gas bug has been filed: https://sourceware.org/bugzilla/show_bug.cgi?id=16694
compnerd suggested making this conditional on whether we're using the integrated
assembler or not. I'll look into that in a follow-up patch.
Differential Revision: http://llvm-reviews.chandlerc.com/D3049
llvm-svn: 203635
I could fold the callers into their one call site, but the indirection
(given how verbose choosing the section is) seemed helpful.
The use of a member function pointer's a bit "tricky", but seems limited
enough, the call sites are simple/clean/clear, and there's only one use.
llvm-svn: 203619
* Add masking instructions before indirect calls (in MC layer).
* Align call + branch delay to the bundle end (in MC layer).
Differential Revision: http://llvm-reviews.chandlerc.com/D3032
llvm-svn: 203606
GuardMalloc can print info to stderr, causing these tests to fail.
Since FileCheck errors on empty inputs, just add a bit of dummy
data to make it happy.
llvm-svn: 203595
This fixes the bug where we would bitcast the 64-bit floating point result
of cmpneqsd to a 64-bit integer even on 32-bit targets.
Differential Revision: http://llvm-reviews.chandlerc.com/D3009
llvm-svn: 203581
Use the options in the ARMISelLowering to control whether tail calls are
optimised or not. Previously, this option was entirely ignored on the ARM
target and only honoured on x86.
This option is mostly useful in profiling scenarios. The default remains that
tail call optimisations will be applied.
llvm-svn: 203577
This option is from 2010, designed to work around a linker issue on Darwin for
ARM. According to grosbach this is no longer an issue and this option can
safely be removed.
llvm-svn: 203576
Tail call optimisation was previously disabled on all targets other than
iOS5.0+. This enables the tail call optimisation on all Thumb 2 capable
platforms.
The test adjustments are to remove the IR hint "tail" to function invocation.
The tests were designed assuming that tail call optimisations would not kick in
which no longer holds true.
llvm-svn: 203575
After r203553 overflow intrinsics and their non-intrinsic (normal)
instruction get hashed to the same value. This patch prevents PRE from
moving an instruction into a predecessor block, and trying to add a phi
node that gets two different types (the intrinsic result and the
non-intrinsic result), resulting in a failing assert.
llvm-svn: 203574
The syntax for "cmpxchg" should now look something like:
cmpxchg i32* %addr, i32 42, i32 3 acquire monotonic
where the second ordering argument gives the required semantics in the case
that no exchange takes place. It should be no stronger than the first ordering
constraint and cannot be either "release" or "acq_rel" (since no store will
have taken place).
rdar://problem/15996804
llvm-svn: 203559
When an overflow intrinsic is followed by a non-overflow instruction,
replace the latter with an extract. For example:
%sadd = tail call { i32, i1 } @llvm.sadd.with.overflow.i32(i32 %a, i32 %b)
%sadd3 = add i32 %a, %b
Here the add statement will be replaced by an extract.
When an overflow intrinsic follows a non-overflow instruction, a clone
of the intrinsic is inserted before the normal instruction, which makes
it the same as the previous case. Subsequent runs of GVN can then clean
up the duplicate instructions and insert the extract.
This fixes PR8817.
llvm-svn: 203553
When the MOVBE instructions are available, use them for 16-bit endian
swapping as well as for 32 and 64 bit.
The patterns were already present on the instructions, but weren't being
matched because the operation was unconditionally marked to 'Expand.'
Change that to be conditional on whether the MOVBE instructions are
available. Use 'rolw' to implement the in-register version (32 and 64
bit have the dedicated 'bswap' instruction for that).
Patch by Louis Gerbarg <lgg@apple.com>.
rdar://15479984
llvm-svn: 203524
During LTO, user-supplied definitions of C library functions often
exist. -instcombine uses Module::getOrInsertFunction() to get a handle
on library functions (e.g., @puts, when optimizing @printf).
Previously, Module::getOrInsertFunction() would rename any matching
functions with local linkage, and create a new declaration. In LTO,
this is the opposite of desired behaviour, as it skips by the
user-supplied version of the library function and creates a new
undefined reference which the linker often cannot resolve.
After some discussing with Rafael on the list, it looks like it's
undesired behaviour. If a consumer actually *needs* this behaviour, we
should add new API with a more explicit name.
I added two testcases: one specifically for the -instcombine behaviour
and one for the LTO flow.
<rdar://problem/16165191>
llvm-svn: 203513
the legalization cost must be included to get an accurate
estimation of the total cost of the scalarized vector.
The inaccurate cost triggered unprofitable SLP vectorization on
32-bit X86.
Summary:
Include legalization overhead when computing scalarization cost
Reviewers: hfinkel, nadav
CC: chandlerc, rnk, llvm-commits
Differential Revision: http://llvm-reviews.chandlerc.com/D2992
llvm-svn: 203509
Summary:
When the sample profiles include discriminator information,
use the discriminator values to distinguish instruction weights
in different basic blocks.
This modifies the BodySamples mapping to map <line, discriminator> pairs
to weights. Instructions on the same line but different blocks, will
use different discriminator values. This, in turn, means that the blocks
may have different weights.
Other changes in this patch:
- Add tests for positive values of line offset, discriminator and samples.
- Change data types from uint32_t to unsigned and int and do additional
validation.
Reviewers: chandlerc
CC: llvm-commits
Differential Revision: http://llvm-reviews.chandlerc.com/D2857
llvm-svn: 203508
optimize a call to a llvm intrinsic to something that invovles a call to a C
library call, make sure it sets the right calling convention on the call.
e.g.
extern double pow(double, double);
double t(double x) {
return pow(10, x);
}
Compiles to something like this for AAPCS-VFP:
define arm_aapcs_vfpcc double @t(double %x) #0 {
entry:
%0 = call double @llvm.pow.f64(double 1.000000e+01, double %x)
ret double %0
}
declare double @llvm.pow.f64(double, double) #1
Simplify libcall (part of instcombine) will turn the above into:
define arm_aapcs_vfpcc double @t(double %x) #0 {
entry:
%__exp10 = call double @__exp10(double %x) #1
ret double %__exp10
}
declare double @__exp10(double)
The pre-instcombine code works because calls to LLVM builtins are special.
Instruction selection will chose the right calling convention for the call.
However, the code after instcombine is wrong. The call to __exp10 will use
the C calling convention.
I can think of 3 options to fix this.
1. Make "C" calling convention just work since the target should know what CC
is being used.
This doesn't work because each function can use different CC with the "pcs"
attribute.
2. Have Clang add the right CC keyword on the calls to LLVM builtin.
This will work but it doesn't match the LLVM IR specification which states
these are "Standard C Library Intrinsics".
3. Fix simplify libcall so the resulting calls to the C routines will have the
proper CC keyword. e.g.
%__exp10 = call arm_aapcs_vfpcc double @__exp10(double %x) #1
This works and is the solution I implemented here.
Both solutions #2 and #3 would work. After carefully considering the pros and
cons, I decided to implement #3 for the following reasons.
1. It doesn't change the "spec" of the intrinsics.
2. It's a self-contained fix.
There are a couple of potential downsides.
1. There could be other places in the optimizer that is broken in the same way
that's not addressed by this.
2. There could be other calling conventions that need to be propagated by
simplify-libcall that's not handled.
But for now, this is the fix that I'm most comfortable with.
llvm-svn: 203488
* Add masking instructions before loads and stores (in MC layer).
* Add masking instructions after SP changes (in MC layer).
* Forbid loads, stores and SP changes in delay slots (in MI layer).
Differential Revision: http://llvm-reviews.chandlerc.com/D2904
llvm-svn: 203484
The function was making too many assumptions about its input:
1. The NEON_VDUP optimisation was far too aggressive, assuming (I
think) that the input would always be BUILD_VECTOR.
2. We were treating most unknown concats as legal (by returning Op
rather than SDValue()). I think only concats of pairs of vectors are
actually legal.
http://llvm.org/PR19094
llvm-svn: 203450
The grammar for LLVM IR is not well specified in any document but seems
to obey the following rules:
- Attributes which have parenthesized arguments are never preceded by
commas. This form of attribute is the only one which ever has
optional arguments. However, not all of these attributes support
optional arguments: 'thread_local' supports an optional argument but
'addrspace' does not. Interestingly, 'addrspace' is documented as
being a "qualifier". What constitutes a qualifier? I cannot find a
definition.
- Some attributes use a space between the keyword and the value.
Examples of this form are 'align' and 'section'. These are always
preceded by a comma.
- Otherwise, the attribute has no argument. These attributes do not
have a preceding comma.
Sometimes an attribute goes before the instruction, between the
instruction and it's type, or after it's type. 'atomicrmw' has
'volatile' between the instruction and the type while 'call' has 'tail'
preceding the instruction.
With all this in mind, it seems most consistent for 'inalloca' on an
'inalloca' instruction to occur before between the instruction and the
type. Unlike the current formulation, there would be no preceding
comma. The combination 'alloca inalloca' doesn't look particularly
appetizing, perhaps a better spelling of 'inalloca' is down the road.
llvm-svn: 203376
This is the new idiom:
x<<(y&31) | x>>((0-y)&31)
which is recognized as:
x ROTL (y&31)
The change refines matchRotateSub. In
Neg & (OpSize - 1) == (OpSize - Pos) & (OpSize - 1), if Pos is
Pos' & (OpSize - 1) we can just use Pos' instead of Pos.
llvm-svn: 203315
be split and the result type widened.
When the condition of a vselect has to be split it makes no sense widening the
vselect and thereby widening the condition. We end up in an endless loop of
widening (vselect result type) and splitting (condition mask type) doing this.
Instead, split both the condition and the vselect and widen the result.
I ran this over the test suite with i686 and mattr=+sse and saw no regressions.
Fixes PR18036.
llvm-svn: 203311
These are sometimes created by the shrink to boolean optimization in the
globalopt pass.
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
llvm-svn: 203280
This helps the instruction selector to lower an i64 * i64 -> i128
multiplication into a single instruction on targets which support it.
Patch by Manuel Jacob.
llvm-svn: 203230
Sequences of insertelement/extractelements are sometimes used to build
vectorsr; this code tries to put them back together into shuffles, but
could only produce a completely uniform shuffle types (<N x T> from two
<N x T> sources).
This should allow shuffles with different numbers of elements on the
input and output sides as well.
llvm-svn: 203229
The old system was fairly convoluted:
* A temporary label was created.
* A single PROLOG_LABEL was created with it.
* A few MCCFIInstructions were created with the same label.
The semantics were that the cfi instructions were mapped to the PROLOG_LABEL
via the temporary label. The output position was that of the PROLOG_LABEL.
The temporary label itself was used only for doing the mapping.
The new CFI_INSTRUCTION has a 1:1 mapping to MCCFIInstructions and points to
one by holding an index into the CFI instructions of this function.
I did consider removing MMI.getFrameInstructions completelly and having
CFI_INSTRUCTION own a MCCFIInstruction, but MCCFIInstructions have non
trivial constructors and destructors and are somewhat big, so the this setup
is probably better.
The net result is that we don't create temporary labels that are never used.
llvm-svn: 203204
This patch teaches the DAGCombiner how to fold a binary OR between two
shufflevector into a single shuffle vector when possible.
The rules are:
1. fold (or (shuf A, V_0, MA), (shuf B, V_0, MB)) -> (shuf A, B, Mask1)
2. fold (or (shuf A, V_0, MA), (shuf B, V_0, MB)) -> (shuf B, A, Mask2)
The DAGCombiner can take advantage of the fact that OR is commutative and
compute two possible shuffle masks (Mask1 and Mask2) for the resulting
shuffle node.
Before folding a dag according to either rule 1 or 2, DAGCombiner verifies
that the resulting shuffle mask is legal for the target.
DAGCombiner would firstly try to fold according to 1.; If not possible
then it will try to fold according to 2.
If both Mask1 and Mask2 are illegal then we conservatively don't fold
the OR instruction.
llvm-svn: 203156
Summary:
This provides support for CP and DP relative global accesses in inline
asm.
Reviewers: robertlytton
Reviewed By: robertlytton
Differential Revision: http://llvm-reviews.chandlerc.com/D2943
llvm-svn: 203129
for the Cortex-A53 subtarget in the AArch64 backend.
This patch lays the ground work to annotate each AArch64 instruction
(no NEON yet) with a list of SchedReadWrite types. The patch also
provides the Cortex-A53 processor resources, maps those the the default
SchedReadWrites, and provides basic latency. NEON support will be added
in a subsequent patch with proper forwarding logic.
Verification was done by setting the pre-RA scheduler to linearize to
better gauge the effect of the MIScheduler. Even without modeling the
forward logic, the results show a modest improvement for Cortex-A53.
Reviewers: apazos, mcrosier, atrick
Patch by Dave Estes <cestes@codeaurora.org>!
llvm-svn: 203125
This is consistent with GDB ToT and reduces the number of relocations in
(type and compile) units, substantially reducing relocations and debug
size in fission + type units builds.
llvm-svn: 203082
When copying an i1 value into a GPR for a vaarg call, we need to explicitly
zero-extend the i1 value (otherwise an invalid CRBIT -> GPR copy will be
generated).
llvm-svn: 203041
are operations that do not access memory but may be sensitive
to floating-point environment changes. LLVM does not attempt
to model FP environment changes, so this was unnecessarily conservative
and was getting on the way of some optimizations, in particular
SLP vectorization.
llvm-svn: 203037
On cores without fpcvt support, we cannot promote int_to_fp i1 operations,
because there is nothing to promote them to. The most straightforward
implementation of this uses a select to choose between the two possible
resulting floating-point values (and that's what is done here).
llvm-svn: 203015
Before llvm-mc would print it, but llc was assuming that it would produce
another section changing directive before one was needed. That assumption is
false with inline asm.
Fixes PR19049.
Another option would be to always create the section, but in the asm printer
avoid printing sections changes during initialization. That would work, but
* We do use the fact that llvm-mc prints it in testing. The tests can be changed
if needed.
* A quick poll on IRC suggest that most developers prefer the implicit .text to
be printed.
llvm-svn: 203001
Patchpoints already did this. Doing it for stackmaps is a convenience
for the runtime in the event that it needs to scratch register to
patch or perform a runtime call thunk.
Unlike patchpoints, we just assume the AnyRegCC calling
convention. This is the only language and target independent calling
convention specific to stackmaps so makes sense. Although the calling
convention is not currently used to select the scratch registers.
llvm-svn: 202943
selection dag (PR19012)
In X86SelectionDagInfo::EmitTargetCodeForMemcpy we check with MachineFrameInfo
to make sure that ESI isn't used as a base pointer register before we choose to
emit rep movs (which clobbers esi).
The problem is that MachineFrameInfo wouldn't know about dynamic allocas or
inline asm that clobbers the stack pointer until SelectionDAGBuilder has
encountered them.
This patch fixes the problem by checking for such things when building the
FunctionLoweringInfo.
Differential Revision: http://llvm-reviews.chandlerc.com/D2954
llvm-svn: 202930
Unwind info contents were indented at the same level as function table
contents. That's a bit confusing because the unwind info is pointed by
function table. In other places we usually increment indentation depth
by one when dereferncing a pointer.
This patch also removes extraneous newlines between function tables.
llvm-svn: 202879
Previously for:
tail call void inttoptr (i64 65536 to void ()*)() nounwind
We would emit:
bl 65536
The immediate operand of the bl instruction is a relative offset so it is
wrong to use the absolute address here.
llvm-svn: 202860
Summary:
Previously, attempting to extract lanes 2 and 3 would actually extract lane 1.
The MSA CodeGen tests only covered lanes 0 and 1.
Differential Revision: http://llvm-reviews.chandlerc.com/D2935
llvm-svn: 202848
The original code does not work correctly on executable files because the
code is written in such a way that only object files are assumed to be given
to llvm-objdump.
Contents of RuntimeFunction are different between executables and objects. In
executables, fields in RuntimeFunction have actual addresses to unwind info
structures. On the other hand, in object files, the fields have zero value,
but instead there are relocations pointing to the fields, so that Linker will
fill them at link-time.
So, when we are reading an object file, we need to use relocation info to
find the location of unwind info. When executable, we should just look at the
values in RuntimeFunction.
llvm-svn: 202785
for the Cortex-A53 subtarget in the AArch64 backend.
This patch lays the ground work to annotate each AArch64 instruction
(no NEON yet) with a list of SchedReadWrite types. The patch also
provides the Cortex-A53 processor resources, maps those the the default
SchedReadWrites, and provides basic latency. NEON support will be added
in a subsequent patch with proper forwarding logic.
Verification was done by setting the pre-RA scheduler to linearize to
better gauge the effect of the MIScheduler. Even without modeling the
forward logic, the results show a modest improvement for Cortex-A53.
Reviewers: apazos, mcrosier, atrick
Patch by Dave Estes <cestes@codeaurora.org>!
llvm-svn: 202767
DWARF discriminators are used to distinguish multiple control flow paths
on the same source location. When this happens, instructions across
basic block boundaries will share the same debug location.
This pass detects this situation and creates a new lexical scope to one
of the two instructions. This lexical scope is a child scope of the
original and contains a new discriminator value. This discriminator is
then picked up from MCObjectStreamer::EmitDwarfLocDirective to be
written on the object file.
This fixes http://llvm.org/bugs/show_bug.cgi?id=18270.
llvm-svn: 202752
Summary:
Parts of the compiler still believed MSA load/stores have a 16-bit offset when
it is actually 10-bit. Corrected this, and fixed a closely related issue this
uncovered where load/stores with 10-bit and 12-bit offsets (MSA and microMIPS
respectively) could not load/store using offsets from the stack/frame pointer.
They accepted frameindex+offset, but not frameindex by itself.
Reviewers: jacksprat, matheusalmeida
Reviewed By: jacksprat
Differential Revision: http://llvm-reviews.chandlerc.com/D2888
llvm-svn: 202717
This fixes invalid lengths in .debug_aranges on big-endian mips64
(lengths appear to be left-shifted by 32 bits) and in .debug_loc.
Differential Revision: http://llvm-reviews.chandlerc.com/D2517
llvm-svn: 202716
Now that the PowerPC backend can track individual CR bits as first-class
registers, we should also have a way of allocating them for inline asm
statements. Because these registers are only one bit, if an output variable is
implicitly cast to a larger integer size, we'll get an any_extend to that
larger type (this is part of the existing target-independent logic). As a
result, regardless of the size of the output type, only the first bit is
meaningful.
The constraint identifier "wc" has been chosen for this purpose. Although gcc
does not currently support allocating individual CR bits, this identifier
choice has been coordinated with the gcc PowerPC team, and will be marked as
reserved for this purpose in the gcc constraints.md file.
llvm-svn: 202657
This includes instructions that relate to memory access (load/store/GEP), comparison instructions and calls.
Work was done by lama.saba@intel.com.
llvm-svn: 202647
This generalizes the code to eliminate extra truncs/exts around i1 bit
operations to also do the same on PPC64 for i32 bit operations. This eliminates
a fairly prevalent code wart:
int foo(int a) {
return a == 5 ? 7 : 8;
}
On PPC64, because of the extension implied by the ABI, this would generate:
cmplwi 0, 3, 5
li 12, 8
li 4, 7
isel 3, 4, 12, 2
rldicl 3, 3, 0, 32
blr
where the 'rldicl 3, 3, 0, 32', the extension, is completely unnecessary. At
least for the single-BB case (which is all that the DAG combine mechanism can
handle), this unnecessary extension is no longer generated.
llvm-svn: 202600
Inside iterate, we scan backwards then scan forwards in a loop. When iteration
is not zero, the last node was just updated so we can skip it. But when
iteration is zero, we can't skip the last node.
For the testing case, fixing this will save a spill and move register copies
from hot path to cold path.
llvm-svn: 202557
Tools that use the CommandLine library currently exit with an error
when invoked with -version or -help. This is unusual and non-standard,
so we'll fix them to exit successfully instead.
I don't expect that anyone relies on the current behaviour, so this
should be a fairly safe change.
llvm-svn: 202530
* Align targets of indirect jumps to instruction bundle boundaries (in MI layer).
* Add masking instructions before indirect jumps (in MC layer).
Differential Revision: http://llvm-reviews.chandlerc.com/D2847
llvm-svn: 202479
The PPC isel instruction can fold 0 into the first operand (thus eliminating
the need to materialize a zero-containing register when the 'true' result of
the isel is 0). When the isel is fed by a bit register operation that we can
invert, do so as part of the bit-register-operation peephole routine.
llvm-svn: 202469
This change enables tracking i1 values in the PowerPC backend using the
condition register bits. These bits can be treated on PowerPC as separate
registers; individual bit operations (and, or, xor, etc.) are supported.
Tracking booleans in CR bits has several advantages:
- Reduction in register pressure (because we no longer need GPRs to store
boolean values).
- Logical operations on booleans can be handled more efficiently; we used to
have to move all results from comparisons into GPRs, perform promoted
logical operations in GPRs, and then move the result back into condition
register bits to be used by conditional branches. This can be very
inefficient, because the throughput of these CR <-> GPR moves have high
latency and low throughput (especially when other associated instructions
are accounted for).
- On the POWER7 and similar cores, we can increase total throughput by using
the CR bits. CR bit operations have a dedicated functional unit.
Most of this is more-or-less mechanical: Adjustments were needed in the
calling-convention code, support was added for spilling/restoring individual
condition-register bits, and conditional branch instruction definitions taking
specific CR bits were added (plus patterns and code for generating bit-level
operations).
This is enabled by default when running at -O2 and higher. For -O0 and -O1,
where the ability to debug is more important, this feature is disabled by
default. Individual CR bits do not have assigned DWARF register numbers,
and storing values in CR bits makes them invisible to the debugger.
It is critical, however, that we don't move i1 values that have been promoted
to larger values (such as those passed as function arguments) into bit
registers only to quickly turn around and move the values back into GPRs (such
as happens when values are returned by functions). A pair of target-specific
DAG combines are added to remove the trunc/extends in:
trunc(binary-ops(binary-ops(zext(x), zext(y)), ...)
and:
zext(binary-ops(binary-ops(trunc(x), trunc(y)), ...)
In short, we only want to use CR bits where some of the i1 values come from
comparisons or are used by conditional branches or selects. To put it another
way, if we can do the entire i1 computation in GPRs, then we probably should
(on the POWER7, the GPR-operation throughput is higher, and for all cores, the
CR <-> GPR moves are expensive).
POWER7 test-suite performance results (from 10 runs in each configuration):
SingleSource/Benchmarks/Misc/mandel-2: 35% speedup
MultiSource/Benchmarks/Prolangs-C++/city/city: 21% speedup
MultiSource/Benchmarks/MiBench/automotive-susan: 23% speedup
SingleSource/Benchmarks/CoyoteBench/huffbench: 13% speedup
SingleSource/Benchmarks/Misc-C++/Large/sphereflake: 13% speedup
SingleSource/Benchmarks/Misc-C++/mandel-text: 10% speedup
SingleSource/Benchmarks/Misc-C++-EH/spirit: 10% slowdown
MultiSource/Applications/lemon/lemon: 8% slowdown
llvm-svn: 202451
scan the register file for sub- and super-registers.
No functionality change intended.
(Tests are updated because the comments in the assembler output are
different.)
llvm-svn: 202416
If a function returns a large struct by value return the first 4 words
in registers and the rest on the stack in a location reserved by the
caller. This is needed to support the xC language which supports
functions returning an arbitrary number of return values. This is
r202397 reapplied with a fix to avoid an uninitialized read of a member.
llvm-svn: 202414
Summary:
If a function returns a large struct by value return the first 4 words
in registers and the rest on the stack in a location reserved by the
caller. This is needed to support the xC language which supports
functions returning an arbitrary number of return values.
Reviewers: robertlytton
Reviewed By: robertlytton
CC: llvm-commits
Differential Revision: http://llvm-reviews.chandlerc.com/D2889
llvm-svn: 202397
Summary:
If the src, dst and size of a memcpy are known to be 4 byte aligned we
can call __memcpy_4() instead of memcpy().
Reviewers: robertlytton
Reviewed By: robertlytton
CC: llvm-commits
Differential Revision: http://llvm-reviews.chandlerc.com/D2871
llvm-svn: 202395
Summary:
Fixes an issue where a test attempts to use -mcpu=x86-64 on non-X86-64 targets.
This triggers an assertion in the MIPS backend since it doesn't know what ABI to
use by default for unrecognized processors.
CC: llvm-commits, rafael
Differential Revision: http://llvm-reviews.chandlerc.com/D2877
llvm-svn: 202369
any ranges - this includes CU ranges where we were previously emitting an
end list marker even if we didn't have a list.
Testcase includes a test for line table only code emission as the problem
was noticed while writing this test.
llvm-svn: 202357
If the SI_KILL operand is constant, we can either clear the exec mask if
the operand is negative, or do nothing otherwise.
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
llvm-svn: 202337
any ranges to the list of ranges for the CU as we don't want to emit
them anyway. This ensures that we will still emit ranges if we have
a compile unit compiled with only line tables and one compiled with
full debug info requested (we'll emit for the one with full debug info).
Update testcase metadata accordingly to continue emitting ranges.
llvm-svn: 202333
and update everything accordingly. This can be used to conditionalize
the amount of output in the backend based on the amount of debug
requested/metadata emission scheme by a front end (e.g. clang).
Paired with a commit to clang.
llvm-svn: 202332
This handles pathological cases in which we see 2x increase in spill
code for large blocks (~50k instructions). I don't have a unit test
for this behavior.
Fixes rdar://16072279.
llvm-svn: 202304
The current approach to lower a vsetult is to flip the sign bit of the
operands, swap the operands and then use a (signed) pcmpgt. psubus (unsigned
saturating subtract) can be used to emulate a vsetult more efficiently:
+ case ISD::SETULT: {
+ // If the comparison is against a constant we can turn this into a
+ // setule. With psubus, setule does not require a swap. This is
+ // beneficial because the constant in the register is no longer
+ // destructed as the destination so it can be hoisted out of a loop.
I also enable lowering via psubus in a few other cases where it's clearly
beneficial: setule and setuge if minu/maxu cannot be used.
rdar://problem/14338765
Patch by Adam Nemet <anemet@apple.com>.
llvm-svn: 202301
We should apply fastcc whenever profitable. We can expand this list,
but there are lots of conventions with performance implications that we
don't want to change.
Differential Revision: http://llvm-reviews.chandlerc.com/D2705
llvm-svn: 202293
COFF object files with 0 as string table size are currently rejected. This
prevents us from reading object files written by tools like cvtres that
violate the PECOFF spec and write 0 instead of 4 for the size of an empty
string table.
llvm-svn: 202292