Summary:
Nodes that have no uses are eventually pruned when they are selected
from the worklist. Record nodes newly added to the worklist or DAG and
perform pruning after every combine attempt.
Reviewers: efriedma, RKSimon, craig.topper, spatel, jyknight
Reviewed By: jyknight
Subscribers: jdoerfert, jyknight, nemanjai, jvesely, nhaehnle, javed.absar, hiraditya, jsji, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D58070
llvm-svn: 357283
G_SELECT uses a 1-bit scalar for the condition, and is currently
implemented with a plain CMPri against 0. This means that values such as
0x1110 are interpreted as true, when instead the higher bits should be
treated as undefined and therefore ignored. Replace the CMPri with a
TSTri against 0x1, which performs an implicit AND, yielding the expected
result.
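A minimal standalone illustration of the difference (hypothetical values,
not the in-tree code): only bit 0 of the condition is defined, so comparing
the whole register against zero gives the wrong answer for a value like
0x1110, while testing bit 0 does not.

#include <cassert>
#include <cstdint>

int main() {
  uint32_t Cond = 0x1110;            // i1 condition; bit 0 is 0, high bits undefined
  bool ViaCmp = (Cond != 0);         // old lowering: CMPri against 0 -> wrongly true
  bool ViaTst = (Cond & 0x1) != 0;   // new lowering: TSTri against 0x1 -> correctly false
  assert(ViaCmp && !ViaTst);
  return 0;
}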
llvm-svn: 357153
ARMBaseInstrInfo::getNumLDMAddresses is making bad assumptions about the
memory operands of load and store-multiple operations. This doesn't
really fix the problem properly, but it's enough to prevent crashing,
at least.
Fixes https://bugs.llvm.org/show_bug.cgi?id=41231 .
Differential Revision: https://reviews.llvm.org/D59834
llvm-svn: 357109
Various SelectionDAG non-combine operations (e.g. the getNode smart
constructor and legalization) may leave dangling nodes by applying
optimizations or not fully pruning unused result values. This can
result in nodes that are never added to the worklist and therefore can
not be pruned.
Add a node inserter as the current node deleter to make sure such
nodes have the chance of being pruned.
Many minor changes, mostly positive.
llvm-svn: 356996
We currently use only VLDR/VSTR for all 64-bit loads/stores, so the
memory operands must be word-aligned. Mark aligned operations as legal
and narrow non-aligned ones to 32 bits.
While we're here, also mark non-power-of-2 loads/stores as unsupported.
llvm-svn: 356872
In r322972/r323136, the iteration here was changed to catch cases at the
beginning of a basic block... but we accidentally deleted an important
safety check. Restore that check to the way it was.
Fixes https://bugs.llvm.org/show_bug.cgi?id=41116
Differential Revision: https://reviews.llvm.org/D59680
llvm-svn: 356809
This takes sequences like "mov r4, sp; str r0, [r4]", and optimizes them
to something like "str r0, [sp]".
For regular stack variables, this optimization was already implemented:
we lower loads and stores using frame indexes, which are expanded later.
However, when constructing a call frame for a call with more than four
arguments, the existing optimization doesn't apply. We need to use
stores which are actually relative to the current value of sp, and don't
have an associated frame index.
This patch adds a special case to handle that construct. At the DAG
level, this is an ISD::STORE where the address is a CopyFromReg from SP
(plus a small constant offset).
This applies only to Thumb1: in Thumb2 or ARM mode, a regular store
instruction can access SP directly, so the COPY gets eliminated by
existing code.
The change to ARMDAGToDAGISel::SelectThumbAddrModeSP is a related
cleanup: we shouldn't pretend that it can select anything other than
frame indexes.
Differential Revision: https://reviews.llvm.org/D59568
llvm-svn: 356601
The 2nd loop calculates spill costs but reports free registers as cost
0 anyway, so there is little benefit from having a separate early
loop.
Surprisingly this is not NFC, as many registers are marked regDisabled
so the first loop often picks up later registers unnecessarily instead
of the first one available in the allocation order...
Patch by Matthias Braun
llvm-svn: 356499
tMOVr and tPUSH/tPOP/tPOP_RET have register constraints which can't be
expressed in TableGen, so check them explicitly. I've unfortunately run
into issues with both of these recently; hopefully this saves some time
for someone else in the future.
Differential Revision: https://reviews.llvm.org/D59383
llvm-svn: 356303
Bail early when we don't have a preheader and also if the target is
big endian because it's written with only little endian in mind!
Differential Revision: https://reviews.llvm.org/D59368
llvm-svn: 356243
I found these by asserting in clang for any GCCBuiltin that doesn't
require mangling and requires a constant for the builtin. This means
that intrinsics are missing which don't use GCCBuiltin, don't have
builtins defined in clang, or were missing the constant annotation in
the builtin definition.
llvm-svn: 356144
When choosing whether a pair of loads can be combined into a single
wide load, we check that the load only has a sext user and that sext
also only has one user. But this can prevent the transformation in
the cases when parallel macs use the same loaded data multiple times.
To enable this, we need to fix up any other uses after creating the
wide load: generating a trunc and a shift + trunc pair to recreate
the narrow values. We also need to keep a record of which loads have
already been widened.
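For illustration, a hedged C++ sketch of the fix-up described above
(little-endian layout assumed; the pass of course operates on IR values,
not C integers):

#include <cstdint>

// After two adjacent i16 loads are replaced by one wide i32 load, the
// original narrow values are recreated from the wide value.
void recreateNarrowValues(uint32_t Wide, uint16_t &Lo, uint16_t &Hi) {
  Lo = static_cast<uint16_t>(Wide);        // trunc
  Hi = static_cast<uint16_t>(Wide >> 16);  // shift + trunc
}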
Differential Revision: https://reviews.llvm.org/D59215
llvm-svn: 356132
Summary:
A number of optimizations are inhibited by single-use TokenFactors not
being merged into the TokenFactor using it. This makes us consider
whether we can do the merge immediately.
Most test changes here are due to the change in visitation causing
minor reorderings and associated reassociation of paired memory
operations.
CodeGen tests with non-reordering changes:
X86/aligned-variadic.ll -- memory-based add folded into stored leaq
value.
X86/constant-combiners.ll -- Optimizes out overlap between stores.
X86/pr40631_deadstore_elision -- folds constant byte store into
preceding quad word constant store.
Reviewers: RKSimon, craig.topper, spatel, efriedma, courbet
Reviewed By: courbet
Subscribers: dylanmckay, sdardis, nemanjai, jvesely, nhaehnle, javed.absar, eraman, hiraditya, kbarton, jrtc27, atanasyan, jsji, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D59260
llvm-svn: 356068
Expand MULO with constant power of two operand into a shift. The
overflow is checked with (x << shift) >> shift == x, where the right
shift will be logical for umulo and arithmetic for smulo (with
exception for multiplications by signed_min).
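As a hedged illustration of the check (illustrative C++, not the in-tree
code), for a 32-bit umulo by a power-of-two constant:

#include <cstdint>

bool umuloPow2(uint32_t X, unsigned Shift, uint32_t &Result) {
  Result = X << Shift;             // the multiplication becomes a shift
  return (Result >> Shift) != X;   // logical shift right: any lost bits mean overflow
}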
Differential Revision: https://reviews.llvm.org/D59041
llvm-svn: 355937
Currently the store+load is folded and both operands of the umulo
end up being constants. To avoid this getting folded away entirely,
make sure at least one operand is non-constant.
Also remove some allocas which don't seem relevant to the test.
llvm-svn: 355776
In some loops, we end up generating loop induction variables that look like:
{(-1 * (zext i16 (%i0 * %i1) to i32))<nsw>,+,1}
As opposed to the simpler:
{(zext i16 (%i0 * %i1) to i32),+,-1}
i.e we count up from -limit to 0, not the simpler counting down from limit to
0. This is because the scores, as LSR calculates them, are the same and the
second is filtered in place of the first. We end up with a redundant SUB from 0
in the code.
This patch tries to make the calculation of the setup cost a little more
thorough, recursing into the SCEV members to better approximate the setup
required. The cost function for comparing LSR costs is:
  return std::tie(C1.NumRegs, C1.AddRecCost, C1.NumIVMuls, C1.NumBaseAdds,
                  C1.ScaleCost, C1.ImmCost, C1.SetupCost) <
         std::tie(C2.NumRegs, C2.AddRecCost, C2.NumIVMuls, C2.NumBaseAdds,
                  C2.ScaleCost, C2.ImmCost, C2.SetupCost);
So this will only alter results if none of the other variables turn out to be
different.
Differential Revision: https://reviews.llvm.org/D58770
llvm-svn: 355597
When lowering a select_cc node where the true and false values are of type f16,
we can't use a general conditional move because the FP16 instructions do not
support conditional execution. Instead, we must ensure that the condition code
is one of the four supported by the VSEL instruction.
Differential revision: https://reviews.llvm.org/D58813
llvm-svn: 355385
The isScaledConstantInRange function takes upper and lower bounds which are
checked after dividing by the scale, so the bounds checks for half, single and
double precision should all be the same. Previously, we had wrong bounds checks
for half precision, so we selected an immediate the instructions can't actually
represent.
Differential revision: https://reviews.llvm.org/D58822
llvm-svn: 355305
The new addressing mode added for the v8.2A FP16 instructions uses bit 8 of the
immediate to encode the sign of the offset, like the other FP loads/stores, so
it needs to be treated the same way.
Differential revision: https://reviews.llvm.org/D58816
llvm-svn: 355201
This function was not checking for the condition code variants which are
undefined if either input is NaN, so we were missing selection of the VSEL
instruction in some cases when using -fno-honor-nans or -ffast-math.
Differential revision: https://reviews.llvm.org/D58812
llvm-svn: 355199
There was a time when we couldn't dump target-specific flags such as
arm-sbrel etc, so the tests didn't check for them. We can now be more
specific in our tests.
llvm-svn: 355189
Add the same level of support as for ARM mode (i.e. still no TLS
support).
In most cases, it is sufficient to replace the opcodes with the
t2-equivalent, but there are some idiosyncrasies that I decided to
preserve because I don't understand the full implications:
* For ARM we use LDRi12 to load from constant pools, but for Thumb we
use t2LDRpci (I'm not sure if the ideal would be to use t2LDRi12 for
Thumb as well, or to use LDRcp for ARM).
* For Thumb we don't have an equivalent for MOV|LDRLIT_ga_pcrel_ldr, so
we have to generate MOV|LDRLIT_ga_pcrel plus a load from GOT.
The tests are in separate files because they're hard enough to read even
without doubling the number of checks.
llvm-svn: 355077
This adds a few extra Thumb1 opcodes to improve the peephole optimiser's
ability to remove redundant cmp instructions. tADC and tSBC require
a small fixup to prevent MOVS being moved past the instruction, giving
the wrong flags.
Differential Revision: https://reviews.llvm.org/D58281
llvm-svn: 354791
More or less all the instructions defined in the v8.2a full-fp16
extension are defined as UNPREDICTABLE if you put them in an IT block
(Thumb) or use with any condition other than AL (ARM). LLVM didn't
know that, and was happy to conditionalise them.
In order to force these instructions to count as not predicable, I had
to make a small Tablegen change. The code generation back end mostly
decides if an instruction was predicable by looking for something it
can identify as a predicate operand; there's an isPredicable bit flag
that overrides that check in the positive direction, but nothing that
overrides it in the negative direction.
(I considered the alternative approach of actually removing the
predicate operand from those instructions, but thought that it would
be more painful overall for instructions differing only in data type
to have different shapes of operand list. This way, the only code that
has to notice the difference is the if-converter.)
So I've added an isUnpredicable bit alongside isPredicable, and set
that bit on the right subset of FP16 instructions, and also on the
VSEL, VMAXNM/VMINNM and VRINT[ANPM] families which should be
unpredicable for all data types.
I've included a couple of representative regression tests, both of
which previously caused an fp16 instruction to be conditionalised in
ARM state and (with -arm-no-restrict-it) to be put in an IT block in
Thumb.
Reviewers: SjoerdMeijer, t.p.northover, efriedma
Reviewed By: efriedma
Subscribers: jdoerfert, javed.absar, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D57823
llvm-svn: 354768
This adds a number of missing Thumb1 opcodes so that the peephole optimiser can
remove redundant CMP instructions.
Reapplying this after the first attempt broke non-thumb1 code as the t2ADDri
instruction can be used with frame indices. In thumb1 we use tADDframe.
Differential Revision: https://reviews.llvm.org/D57833
llvm-svn: 354667
This is exactly the same as arm mode, so for the instruction selector
tests we just extract them to a new file and run with the same checks
for both arm and thumb mode.
For the legalizer we need to update the tests for soft float a bit, but
only because BL and tBL are slightly different. We could be pedantic and
check that we get a well-formed BL for arm mode and a tBL for thumb, but
for the purposes of the legalizer test it's sufficient to just skip over
the predicate operands in the checks. Also note that we have the
pedantic checks in the divmod test, so we're covered.
llvm-svn: 354665
This adds a number of missing Thumb1 opcodes so that the peephole optimiser can
remove redundant CMP instructions.
Differential Revision: https://reviews.llvm.org/D57833
llvm-svn: 354564
During type promotion, sometimes we convert an add with a negative
constant into a sub with a positive constant. The loop that
performs this transformation has two issues:
- it iterates over a set, causing non-determinism.
- it breaks, instead of continuing, when it finds the first
non-negative operand.
Differential Revision: https://reviews.llvm.org/D58452
llvm-svn: 354557
ConvertTruncs is used to replace a trunc with an AND mask; however,
this function wasn't working as expected. By performing the change
later, we can create a wide type integer mask instead of a narrow -1
value, which could then be simply removed (incorrectly). Because we
now perform this action later, it's necessary to cache the trunc type
before we perform the promotion.
Differential Revision: https://reviews.llvm.org/D57686
llvm-svn: 354108
The v8m.base ISA contains movw, which can operate on an unsigned
16-bit value. Add the pattern that converts an add with a negative
value, that could fit into 16-bits when negated, into a sub with that
positive value.
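As a hedged illustration (plain C++, illustrative constant): the two forms
compute the same value, and the subtrahend 1000 fits in the unsigned 16 bits
that movw can materialise.

#include <cstdint>

uint32_t addNegative(uint32_t X) { return X + static_cast<uint32_t>(-1000); }
uint32_t subPositive(uint32_t X) { return X - 1000u; }  // same result, uses movw #1000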
Differential Revision: https://reviews.llvm.org/D57942
llvm-svn: 353692
The whole design of generating LDMs/STMs is fragile and unreliable: it depends on
rescheduling here in the LoadStoreOptimizer that isn't register pressure aware
and regalloc that isn't aware of generating LDMs/STMs.
This patch adds a (hidden) option to control the total number of instructions that
can be re-ordered. I appreciate this looks only a tiny bit better than a hard-coded
constant, but at least it allows easier experimentation with different values
for now. Ideally we would calculate this reorder limit based on some heuristics, and take
register pressure into account. I might be looking into that next.
Differential Revision: https://reviews.llvm.org/D57954
llvm-svn: 353678
The sqrt case is faster and we already do this for the case where
the exponent is 0.25. This adds the 0.75 case which is also not
sensitive to signed zeros.
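A hedged sketch of the rewrite in plain C++ (the actual transform happens on
DAG nodes under the appropriate fast-math flags):

#include <cmath>

double pow075(double X) {
  double S = std::sqrt(X);     // X^0.5
  return S * std::sqrt(S);     // X^0.5 * X^0.25 == X^0.75
}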
Patch by Whitney Tsang (Whitney)
Differential revision: https://reviews.llvm.org/D57434
llvm-svn: 353557
Modify GenerateConstantOffsetsImpl to create offsets that can be used
by indexed addressing modes. If formulae can be generated which
result in the constant offset being the same size as the recurrence,
we can generate a pre-indexed access. This allows the pointer to be
updated via the single pre-indexed access so that (hopefully) no
add/subs are required to update it for the next iteration. For small
cores, this can significantly improve the performance of DSP-like loops.
Differential Revision: https://reviews.llvm.org/D55373
llvm-svn: 353403
A number of tests were using imm operands, not cimm. Since CSE
relies on the exact ConstantInt* pointer used, and implicit
conversions are generally evil, also enforce the bitsize of the types.
llvm-svn: 353113
Previously, LiveRegUnits was assuming that if a block has no successors
and does not return, then no registers are live at the end of it
(because the end of the block is unreachable). This was causing the
register scavenger to use callee-saved registers to materialise stack
frame addresses without saving them in the prologue. This would normally
be fine, because the end of the block is unreachable, but this is not
legal if the block ends by throwing a C++ exception. If this happens,
the scratch register will be modified, but its previous value won't be
preserved, so it doesn't get restored by the exception unwinder.
Differential revision: https://reviews.llvm.org/D57381
llvm-svn: 352844
Constants can also be materialised using the negated value and a MVN, and this
case seems to have been missed for Thumb2. To check the constant materialisation
costs, we now call getT2SOImmVal twice, once for the original constant and then
also for its negated value, and this function checks if the constant can either
be splatted or rotated.
This was revealed by a test that optimises for minsize: instead of a LDR
literal pool load and having a literal pool entry, just a MVN with an immediate
is smaller (and also faster).
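A hedged example of the kind of constant this catches (illustrative value):
0xFFFFFF00 is not itself a valid Thumb2 modified immediate, but its negation
0x000000FF is, so the constant can be produced with a single MVN instead of a
literal pool load.

#include <cstdint>

uint32_t materialiseViaMvn() {
  const uint32_t Encodable = 0x000000FFu;  // fits a Thumb2 modified immediate
  return ~Encodable;                       // == 0xFFFFFF00, what "mvn #0xFF" yields
}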
Differential Revision: https://reviews.llvm.org/D57327
llvm-svn: 352737
And instead just generate a libcall. My motivating example on ARM was a simple:
shl i64 %A, %B
for which the code bloat is quite significant. For other targets that also
accept __int128/i128 such as AArch64 and X86, it is also beneficial for these
cases to generate a libcall when optimising for minsize. On these 64-bit targets,
the 64-bits shifts are of course unaffected because the SHIFT/SHIFT_PARTS
lowering operation action is not set to custom/expand.
Differential Revision: https://reviews.llvm.org/D57386
llvm-svn: 352736
This attempts to optimise negative values used in load/store operands
a little. We currently try to select them as rr, materialising the
negative constant using a MOV/MVN pair. This instead selects ri with
an immediate of 0, forcing the add node to become a simpler sub.
Differential Revision: https://reviews.llvm.org/D57121
llvm-svn: 352475
Support G_SDIV, G_UDIV, G_SREM and G_UREM.
The only significant difference between arm and thumb mode is that we
need to check a different subtarget feature.
llvm-svn: 352346
Same as ARM.
On this occasion we split some of the instruction select tests for more
complicated instructions into their own files, so we can reuse them for
ARM and Thumb mode. Likewise for the legalizer tests.
llvm-svn: 352188
https://reviews.llvm.org/D57178
Now add a hook in TargetPassConfig to query if CSE needs to be
enabled. By default this hook returns false only for O0 opt level but
this can be overridden by the target.
As a consequence of the default of enabled for non-O0, a few tests
needed to be updated to not use CSE, by passing -O0 on the run line.
reviewed by: arsenm
llvm-svn: 352126
In the last stage of type promotion, we replace any zext that uses a
new trunc with the operand of the trunc. This is okay when we only
allowed one type to be optimised, but now it's the case that the trunc
may be needed to produce a narrower type than the one we were
optimising for. So we need to check this before doing the replacement.
Differential Revision: https://reviews.llvm.org/D57041
llvm-svn: 351935
The current check in CombineToPreIndexedLoadStore is too
conservative, preventing a pre-indexed store when the base pointer
is a predecessor of the value being stored. Instead, we should check
the pointer operand of the store.
Differential Revision: https://reviews.llvm.org/D56719
llvm-svn: 351933
Allow varargs functions to be called, both in arm and thumb mode. This
boils down to choosing the correct calling convention, which we can
easily test by making sure arm_aapcscc is used instead of
arm_aapcs_vfpcc when the callee is variadic.
llvm-svn: 351424
ReduceLoadWidth can trigger when a shifted mask is used, and this
requires that the function return a shl node to correct for the
offset. However, the way that this was implemented meant that the
returned result could be an existing node, which would be incorrect.
This fixes the method of inserting the new node and replacing uses.
Differential Revision: https://reviews.llvm.org/D50432
llvm-svn: 351310
compiler identification lines in test-cases.
(Doing so only because it's then easier to search for references which
are actually important and need fixing.)
llvm-svn: 351200
Make it possible for TableGen to produce code for selecting MOVi32imm.
This allows reasonably recent ARM targets to select a lot more constants
than before.
We achieve this by adding GISelPredicateCode to arm_i32imm. It's
impossible to use the exact same code for both DAGISel and GlobalISel,
since one uses "Subtarget->" and the other "STI." to refer to the
subtarget. Moreover, in GlobalISel we don't have ready access to the
MachineFunction, so we need to add a bit of code for obtaining it from
the instruction that we're selecting. This is also the reason why it
needs to remain a PatLeaf instead of the more specific IntImmLeaf.
llvm-svn: 351056
Part of the effort to refactor frame pointer code generation. We used
to use two function attributes "no-frame-pointer-elim" and
"no-frame-pointer-elim-non-leaf" to represent three kinds of frame
pointer usage: (all) all frames use a frame pointer, (non-leaf) only
non-leaf frames use a frame pointer, (none) no frames use a frame
pointer. This CL makes the idea explicit by using only one enum
function attribute "frame-pointer".
Option "-frame-pointer=" replaces "-disable-fp-elim" for tools such as
llc.
"no-frame-pointer-elim" and "no-frame-pointer-elim-non-leaf" are still
supported for easy migration to "frame-pointer".
Tests are mostly updated with:
// replace command line args ‘-disable-fp-elim=false’ with ‘-frame-pointer=none’
grep -iIrnl '\-disable-fp-elim=false' * | xargs sed -i '' -e "s/-disable-fp-elim=false/-frame-pointer=none/g"
// replace command line args ‘-disable-fp-elim’ with ‘-frame-pointer=all’
grep -iIrnl '\-disable-fp-elim' * | xargs sed -i '' -e "s/-disable-fp-elim/-frame-pointer=all/g"
Patch by Yuanfang Chen (tabloid.adroit)!
Differential Revision: https://reviews.llvm.org/D56351
llvm-svn: 351049
Using a PatLeaf for sext_16_node allowed matching smulbb and smlabb
instructions once the operands had been sign extended. But we also
need to use sext_inreg operands along with sext_16_node to catch a
few more cases that enable us to remove the unnecessary sxth.
Differential Revision: https://reviews.llvm.org/D55992
llvm-svn: 350613
This patch adds the sign/zero extension done by
vgetlane to ARM computeKnownBitsForTargetNode.
Differential revision: https://reviews.llvm.org/D56098
llvm-svn: 350553
All we have to do is mark it as legal.
This allows us to select a lot of new patterns handled by TableGen. This
patch adds tests for them and splits up the existing test file for
binary operators into 2 files, one for arithmetic ops and one for
logical ones.
llvm-svn: 349610
These features (fairly) recently got split out into their own feature, so we
should make CodeGen use them when available. The main change here is that the
check used to be based on the triple, but now it's based on CPU features.
llvm-svn: 349355
The transform performs a bitwise logic op in a wider type followed by
truncate when both inputs are truncated from the same source type:
logic_op (truncate x), (truncate y) --> truncate (logic_op x, y)
There are a bunch of other checks that should prevent doing this when
it might be harmful.
We already do this transform for scalars in this spot. The vector
limitation was shared with a check for the case when the operands are
extended. I'm not sure if that limit is needed either, but that would
be a separate patch.
Differential Revision: https://reviews.llvm.org/D55448
llvm-svn: 349303
Mark G_ADD, G_SUB, G_MUL, G_AND, G_OR and G_XOR as legal for both ARM
and Thumb2.
Extract the legalizer tests for these opcodes into another file.
Add tests for the instruction selector.
llvm-svn: 349142
Mark G_SEXT, G_ZEXT and G_ANYEXT to 32 bits as legal and add support for
them in the instruction selector. This uses handwritten code again
because the patterns that are generated with TableGen are tuned for what
the DAG combiner would produce and not for simple sext/zext nodes.
Luckily, we only need to update the opcodes to use the Thumb2 variants,
everything else can be reused from ARM.
llvm-svn: 349026
Summary:
All targets either just return false here or properly model `Fast`, so I
don't think there is any reason to prevent CodeGen from doing the right
thing here.
Subscribers: nemanjai, javed.absar, eraman, jsji, llvm-commits
Differential Revision: https://reviews.llvm.org/D55365
llvm-svn: 349016
Unfortunately we can't use TableGen for this because it doesn't yet
support predicates on the source pattern root. Therefore, add a bit of
handwritten code to the instruction selector to handle the most basic
cases.
Also mark them as legal and extract their legalizer test cases to a new
test file.
llvm-svn: 348920
This patch restricts the capability of G_MERGE_VALUES, and uses the new
G_BUILD_VECTOR and G_CONCAT_VECTORS opcodes instead in the appropriate places.
This patch also includes AArch64 support for selecting G_BUILD_VECTOR of <4 x s32>
and <2 x s64> vectors.
Differential Revisions: https://reviews.llvm.org/D53629
llvm-svn: 348788
When we had dynamic call frames (i.e. sp adjustment around each call) we
were including that adjustment into offsets calculated based on r6, even
though it's only sp that changes. This led to incorrect stack slot
accesses.
llvm-svn: 348591
Adds fatal errors for any target that does not support the Tiny or Kernel
codemodels by rejigging the getEffectiveCodeModel calls.
Differential Revision: https://reviews.llvm.org/D50141
llvm-svn: 348585
This has two positive effects. First, using a custom node prevents
recombination leading to an infinite loop since the output DAG is notionally a
little more complex than the input one. Using a flag-setting instruction also
allows the subtraction to be folded with the related comparison more easily.
https://reviews.llvm.org/D53190
llvm-svn: 348122
Don't expand SDIV with an immediate that is a power of 2 if we optimise for
minimum code size. For example:
sdiv %1, i32 4
gets expanded to a sequence of 3 instructions, but this is suboptimal for
minimum code size so instead we just generate a MOV and a SDIV if integer
division is supported.
Differential Revision: https://reviews.llvm.org/D54546
llvm-svn: 347965
SplitVecOp_TruncateHelper tries to promote the result type while splitting FP_TO_SINT/UINT. It then concatenates the result and introduces a truncate to the original result type. But it does this without inserting the AssertZExt/AssertSExt that the regular result type promotion would insert. Nor does it turn FP_TO_UINT into FP_TO_SINT the way normal result type promotion for these operations does. This is bad on X86 which doesn't support FP_TO_SINT until AVX512.
This patch disables the use of SplitVecOp_TruncateHelper for these operations and just lets normal promotion handle it. I've tweaked a couple things in X86ISelLowering to avoid a few obvious regressions there. I believe all the changes on X86 are improvements. The other targets look neutral.
Differential Revision: https://reviews.llvm.org/D54906
llvm-svn: 347593
We can now select CLZ via the TableGen'erated code, so support G_CTLZ
and G_CTLZ_ZERO_UNDEF throughout the pipeline for types <= s32.
Legalizer:
If the CLZ instruction is available, use it for both G_CTLZ and
G_CTLZ_ZERO_UNDEF. Otherwise, use a libcall for G_CTLZ_ZERO_UNDEF and
lower G_CTLZ in terms of it.
In order to achieve this we need to add support to the LegalizerHelper
for the legalization of G_CTLZ_ZERO_UNDEF for s32 as a libcall (__clzsi2).
We also need to allow lowering of G_CTLZ in terms of G_CTLZ_ZERO_UNDEF
if that is supported as a libcall, as opposed to just if it is Legal or
Custom. Due to a minor refactoring of the helper function in charge of
this, we will also allow the same behaviour for G_CTTZ and G_CTPOP.
This is not going to be a problem in practice since we don't yet have
support for treating G_CTTZ and G_CTPOP as libcalls (not even in
DAGISel).
Reg bank select:
Map G_CTLZ to GPR. G_CTLZ_ZERO_UNDEF should not make it to this point.
Instruction select:
Nothing to do.
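As a concrete, hedged illustration of the legalizer lowering above (using
__builtin_clz, which like G_CTLZ_ZERO_UNDEF is undefined for a zero input, in
place of the __clzsi2 libcall):

#include <cstdint>

uint32_t ctlz32(uint32_t X) {
  // Select the bit width for zero, otherwise defer to the zero-undef variant.
  return X == 0 ? 32u : static_cast<uint32_t>(__builtin_clz(X));
}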
llvm-svn: 347545
Both zext and sext are currently allowed during the search for narrow
sequences, and sext operands are later added to the mac candidates.
But operands of muls are also added, without checking whether they're
sext or zext, which means we can generate a signed smlad when we
shouldn't.
Differential Revision: https://reviews.llvm.org/D54790
llvm-svn: 347542
I am working on making FileCheck stricter (in D54769 and D53710) so that it
issues diagnostics when there's something wrong with tests.
This is a cleanup for dangling prefixes in the ARM codegen tests, e.g.:
--check-prefixes=A,B
where A occurs in the check file, but B doesn't. This can be innocent if A does
all the required checking, but can also be a bug in that test if it results in
the test actually not checking anything (if A for example only checks a common
label). Test CodeGen/ARM/smml.ll is such an example.
Differential Revision: https://reviews.llvm.org/D54842
llvm-svn: 347487
This code takes a truncate, fp_to_int, or int_to_fp with a legal result type and an input type that needs to be split and enlarges the elements in the result type before doing the split. Then inserts a follow up truncate or fp_round after concatenating the two halves back together.
But if the input type of the original op is being split on its way to ultimately being scalarized we're just going to end up building a vector from scalars and then truncating or rounding it in the vector register. Seems kind of silly to enlarge the result element type of the operation only to end up with scalar code and then building a vector with large elements only to make the elements smaller again in the vector register. Seems better to just try to get away producing smaller result types in the scalarized code.
The X86 test case that changes is a pretty contrived test case that exists because of a bug we used to have in our AVG matching code. I think the code is better now, but its not realistic anyway.
llvm-svn: 347482
We fail to canonicalize IR this way (prefer 'not' ops to arbitrary 'xor'),
but that would not matter without this patch because DAGCombiner was
reversing that transform. I think we need this transform in the backend
regardless of what happens in IR to catch cases where the shift-xor
is formed late from GEP or other ops.
https://rise4fun.com/Alive/NC1
Name: shl
Pre: (-1 << C2) == C1
%shl = shl i8 %x, C2
%r = xor i8 %shl, C1
=>
%not = xor i8 %x, -1
%r = shl i8 %not, C2
Name: shr
Pre: (-1 u>> C2) == C1
%sh = lshr i8 %x, C2
%r = xor i8 %sh, C1
=>
%not = xor i8 %x, -1
%r = lshr i8 %not, C2
https://bugs.llvm.org/show_bug.cgi?id=39657
llvm-svn: 347478
r334871 has made it possible for TableGen'erated code to select BFC, but
it has not added a test for it on the ARM side. Add it now to make sure
we don't introduce regressions if we ever change anything about that
rule.
llvm-svn: 347447
Truncs are treated as sources if they produce a value of the same
type as the one we are currently trying to promote. Truncs used to be
considered as a sink if their operand was the same value type.
We now allow smaller types in the search, so we should search through
truncs that produce a smaller value. These truncs can then be
converted to an AND mask.
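A tiny hedged illustration of the trunc-to-mask equivalence (plain C++; the
pass itself operates on IR):

#include <cstdint>

uint32_t truncAsMask(uint32_t X) {
  // Keeping the i16 value in the wide register: trunc-to-i16 followed by a
  // zext back is the same as masking with 0xFFFF.
  return X & 0xFFFFu;
}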
This leaves sinks as being:
- points where the value in the register is being observed, such as
an icmp, switch or store.
- points where value types have to match, such as calls and returns.
- zext are included to ease the transformation and are generally
removed later on.
During this change, it also became apparent that truncating sinks was
broken: if a sink used a source, its type information had already
been lost by the time the truncation happens. So I've changed the
method of caching the type information.
Differential Revision: https://reviews.llvm.org/D54515
llvm-svn: 347191
The scan was incorrectly skipping the first instruction, so a register
could appear to be dead when it was actually live. This eventually leads
to a machine verifier failure and miscompile in arm-ldst-opt.
Differential Revision: https://reviews.llvm.org/D54491
llvm-svn: 346821
Summary:
Handle extra output from index loads in cases where we wish to
forward a load value directly from a preceding store.
Fixes PR39571.
Reviewers: peter.smith, rengolin
Subscribers: javed.absar, hiraditya, arphaman, llvm-commits
Differential Revision: https://reviews.llvm.org/D54265
llvm-svn: 346654
LDRcp should be deleted when the dest register is dead in register
coalescing. Without a MemOp, a dead LDRcp will cause a dead constant
pool value which references a non-existent label.
Patch by Yin Ma.
Differential Revision: https://reviews.llvm.org/D54173
llvm-svn: 346563
Now that we have mixed type sizes, i1 values need to be explicitly
handled as we want to avoid promoting these values.
Differential Revision: https://reviews.llvm.org/D54308
llvm-svn: 346499
Previously, during the search, all values had to have the same
'TypeSize', which is equal to number of bits of the integer type of
the icmp operand. All values in the tree had to match this size;
meaning that, if we searched from i16, we wouldn't accept i8s. A
change in type size requires zext and truncs to perform the casts so,
to allow mixed narrow types, the handling of these instructions is
now slightly different:
- we allow casts if their result or operand is <= TypeSize.
- zexts are sinks if their result > TypeSize.
- truncs are still sinks if their operand == TypeSize.
- truncs are still sources if their result == TypeSize.
The transformation bails on finding an icmp that operates on data
smaller than the current TypeSize.
Differential Revision: https://reviews.llvm.org/D54108
llvm-svn: 346480
FindBetterNeighborChains simultaneously improves the chain
dependencies of a chain of related stores, avoiding the generation of
extra token factors. For chains longer than the GatherAllAliasDepths,
stores further down in the chain will necessarily fail, which is a
potentially significant waste and prevents otherwise trivial
parallelization.
This patch directly parallelizes the chains of stores before improving
each store. This generally improves DAG-level parallelism.
Reviewers: courbet, spatel, RKSimon, bogner, efriedma, craig.topper, rnk
Subscribers: sdardis, javed.absar, hiraditya, jrtc27, atanasyan, llvm-commits
Differential Revision: https://reviews.llvm.org/D53552
llvm-svn: 346432
The lowering was missing live-ins in certain cases, like a sequence of
multiple tMOVCCr_pseudo instructions. This would lead to a verifier
failure, and on pre-v6 Thumb CPSR would be incorrectly clobbered.
For reasons I don't completely understand, it's hard to get a sequence
of multiple tMOVCCr_pseudo instructions; the issue only seems to show up
with 64-bit comparisons where the result is zero-extended. I added some
extra testcases in case that changes in the future. Probably some
optimization opportunities here if anyone is interested. (@test_slt_not
is the case that was getting miscompiled.)
The code to check the liveness of CPSR was stolen from
X86ISelLowering.cpp; maybe it could be refactored into common helper,
but I have no idea where to put it.
Differential Revision: https://reviews.llvm.org/D54192
llvm-svn: 346355
Turn the assert in PrepareConstants into a condition so that we can
handle mul instructions with negative immediates.
Differential Revision: https://reviews.llvm.org/D54094
llvm-svn: 346126
r345840 slightly changed the way promotion happens which could
result in zext and truncs having the same source and destination
types. This fixes that issue.
We can now also remove the zext and trunc in the following case:
(zext (trunc (promoted op)), i32)
This means that we can no longer treat a value, that is only used by
a sink, to be safe to promote.
I've also added in some extra asserts and replaced a cast for a
dyn_cast.
Differential Revision: https://reviews.llvm.org/D54032
llvm-svn: 346125
reduceBuildVecConvertToConvertBuildVec vectorizes int2float in the DAGCombiner, which means that even if the LV/SLP has decided to keep scalar code using the cost models, this will override that decision.
While there are cases where vectorization is necessary in the DAG (mainly due to legalization artefacts), I don't think this is the case here, we should assume that the vectorizers know what they are doing.
Differential Revision: https://reviews.llvm.org/D53712
llvm-svn: 345964
While mutating instructions, we sign extended negative constant
operands for binary operators that can safely overflow. This was to
allow instructions, such as add nuw i8 %a, -2, to still be able to
perform a subtraction. However, the code to handle constants doesn't
take into consideration that instructions, such as sub nuw i8 -2, %a,
require the i8 -2 to be converted into i32 254.
This is a relatively simple fix, but I've taken the time to
reorganise the code a bit - mainly caching the instructions that can
be promoted and splitting up the Mutate function.
Differential Revision: https://reviews.llvm.org/D53972
llvm-svn: 345840
I think this is the actual important property; the previous visibility
check was an approximation.
Differential Revision: https://reviews.llvm.org/D53852
llvm-svn: 345790
Shows up rarely for 64-bit arithmetic, more frequently for the compare
patterns added in r325323.
Differential Revision: https://reviews.llvm.org/D53848
llvm-svn: 345782
The debug-use flag must be set exactly for uses on DBG_VALUEs. This is
so obvious that it can be trivially inferred while parsing. This will
reduce noise when printing while omitting information that has little
value to the user.
The parser will keep recognizing the flag for compatibility with old
`.mir` files.
Differential Revision: https://reviews.llvm.org/D53903
llvm-svn: 345671
The SchedModel allows the addition of ReadAdvances to express that certain
operands of the instructions are needed at a later point than the others.
RegAlloc may add pseudo operands that are not part of the instruction
descriptor, and therefore cannot have any read advance entries. This meant
that in some cases the desired read advance was nullified by such a pseudo
operand, which still had the original latency.
This patch fixes this by making sure that such pseudo operands get a zero
latency during DAG construction.
Review: Matthias Braun, Ulrich Weigand.
https://reviews.llvm.org/D49671
llvm-svn: 345606
- Relax hard-coded registers and stack frame sizes
- Some test cleanups
- Change phi-dbg.ll to match on mir output after phi elimination instead
of going through the whole codegen pipeline.
This is in preparation for https://reviews.llvm.org/D52010
I'm committing all the test changes upfront that work before and after
independently.
llvm-svn: 345532
The "dead" markings allow existing target-independent optimizations,
like MachineSink, to trigger more frequently. The CPSR defs would have
eventually been marked dead by LiveVariables, so this only affects
optimizations before regalloc.
The ARMBaseInstrInfo.cpp change is fixing a bug which is only visible
with this change: the transform adds a use to an otherwise dead def
of CPSR. This is covered by existing regression tests.
thumb2-tbh.ll breaks for Thumb1 due to MachineLICM changing the
generated code; I'll fix it in D53452.
Differential Revision: https://reviews.llvm.org/D53453
llvm-svn: 345420
While working on FileCheck producing better diagnostics in D53710, I noticed
that our test case is broken in a few different ways. The test was running, but
results were not checked as prefix CHECK-COMMON wasn't defined (which is what
FileCheck should warn about). Also, the output was different in 2 cases because
of recent changes in ARMCodeGenPrepare.
Differential Revision: https://reviews.llvm.org/D53746
llvm-svn: 345386
Summary:
Changes all uses of minnan/maxnan to minimum/maximum
globally. These names emphasize that the semantic difference between
these operations is more than just NaN-propagation.
Reviewers: arsenm, aheejin, dschuff, javed.absar
Subscribers: jholewinski, sdardis, wdng, sbc100, jgravelle-google, jrtc27, atanasyan, llvm-commits
Differential Revision: https://reviews.llvm.org/D53112
llvm-svn: 345218
The BKPT instruction is specified to cause a software breakpoint,
and at least on Linux results in a SIGTRAP. This makes it more
suitable for implementing debugtrap than TRAP (aka UDF #254), which
is specified to cause an undefined instruction exception and results
in a SIGILL on Linux.
Moreover, BKPT is not marked as a terminator, which is not only
consistent with the IR instruction but allows the analyzeBlock
function to correctly analyze a basic block containing the instruction,
which fixes an assertion failure in the machine block placement pass
previously triggered by the included test case.
Because BKPT is only supported starting with ARMv5T, we continue to
use UDF #254 when targeting v4T.
Differential Revision: https://reviews.llvm.org/D53614
llvm-svn: 345171
A global alias may use indices which are not considered in bounds. In
such a case, accessing the base object will fail as it only peers
through inbounds accesses. This pattern is used by the swift compiler
to create references to preceding members in the type metadata. This
would cause the code generation to fail when targeting a platform that
used ELF as the object file format. Be conservative and fail the
read-only check if we run into an alias that we cannot peer through.
llvm-svn: 345107
Previously reverted in rL343082.
Original commit message:
On failing to find sequences that can be converted into dual macs,
try to find sequential 16-bit loads that are used by muls which we
can then use smultb, smulbt, smultt with a wide load.
Differential Revision: https://reviews.llvm.org/D51983
llvm-svn: 344693
This is patch 2/2, following up on D53314, and is the functional change
to prevent fusing mul + add sequences into VFMAs.
Differential revision: https://reviews.llvm.org/D53315
llvm-svn: 344683
This is a follow up of rL342874, which stopped fusing muls and adds into VMLAs
for performance reasons on the Cortex-M4 and Cortex-M33. This is a series of 2
patches trying to achieve the same for VFMA. The second column in the
table below shows what we were generating before rL342874, the third column
what changed with rL342874, and the last column what we want to achieve with
these 2 patches:
--------------------------------------------------------
| Opt | < rL342874 | >= rL342874 | |
|------------------------------------------------------|
|-O3 | vmla | vmul | vmul |
| | | vadd | vadd |
|------------------------------------------------------|
|-Ofast | vfma | vfma | vmul |
| | | | vadd |
|------------------------------------------------------|
|-Oz | vmla | vmla | vmla |
--------------------------------------------------------
This patch 1/2, is a cleanup of the spaghetti predicate logic on the different
VMLA and VFMA codegen rules, so that we can make the final functional change in
patch 2/2. This also fixes a typo in the regression test added in rL342874.
Differential revision: https://reviews.llvm.org/D53314
llvm-svn: 344671
As I suggested on PR39281, this patch uses PADDL pairwise addition to widen from the vXi8 CTPOP result to the target vector type.
This is a blocker for moving more x86 code to generic vector CTPOP expansion (P32655 + D53258) - ARM's vXi64 CTPOP currently expands, which would generate a vXi64 MUL but ARM's custom lowering expands the general MUL case and vectors aren't well handled in LegalizeDAG - improving the CTPOP lowering was a lot easier than fixing the MUL lowering for this one case......
Differential Revision: https://reviews.llvm.org/D53257
llvm-svn: 344512
When deciding if it is safe to optimize a conditional branch to a CBZ or
CBNZ the offsets of the BasicBlocks from the start of the function are
estimated. For inline assembly the generic getInlineAsmLength() function is
used to get a worst case estimate of the inline assembly by multiplying the
number of instructions by the max instruction size of 4 bytes. This
unfortunately doesn't take into account the generation of Thumb implicit IT
instructions. In edge cases such as when all the instructions in the block
are 4-bytes in size and there is an implicit IT then the size is
underestimated. This can cause an out of range CBZ or CBNZ to be generated.
The patch takes a conservative approach and assumes that every instruction
in the inline assembly block may have an implicit IT.
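A hedged sketch of the conservative estimate (illustrative constants and
names, not the in-tree code):

unsigned worstCaseInlineAsmBytes(unsigned NumInsts) {
  const unsigned MaxInstSize = 4; // widest Thumb2 encoding
  const unsigned ITSize = 2;      // implicit IT that may precede each instruction
  return NumInsts * (MaxInstSize + ITSize);
}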
Fixes pr31805
Differential Revision: https://reviews.llvm.org/D52834
llvm-svn: 343960
Correctly check for relocations in the constant to promote. And don't
allow promoting a constant multiple times.
This partially fixes https://bugs.llvm.org//show_bug.cgi?id=32780 ;
it's not a complete fix because we also need to prevent
ARMConstantIslands from cloning the constant.
(-arm-promote-constant is currently off by default, and it stays off
with this patch. I'll look into turning it on again when all the known
issues are fixed.)
Differential Revision: https://reviews.llvm.org/D51472
llvm-svn: 343361
This mostly affects IR generated by non-clang frontends because clang
generally sets the alignment of globals explicitly.
Fixes https://bugs.llvm.org//show_bug.cgi?id=32394 .
(-arm-promote-constant is currently off by default, and it stays off
with this patch. I'll look into turning it on again when all the known
issues are fixed.)
Differential Revision: https://reviews.llvm.org/D51469
llvm-svn: 343359
The NoMovt feature prevents the use of MOVW/MOVT
instructions on Cortex-M23 for performance reasons.
These instructions are required for execute only code
so NoMovt should be disabled when that option is enabled.
Differential Revision: https://reviews.llvm.org/D52551
llvm-svn: 343302
It was the case when calling MO::dump(), but MI::dump() was still
depending on hasComplexRegisterTies().
The MIR output is not affected.
llvm-svn: 343107
When calculating whether a value can safely overflow for use by an
icmp, we weren't checking that the value couldn't wrap around. To do
this we need the icmp to be using a constant, as well as the incoming
add or sub.
bugzilla report: https://bugs.llvm.org/show_bug.cgi?id=39060
Differential Revision: https://reviews.llvm.org/D52463
llvm-svn: 343092
Adding NonNull as attributes to returned pointers has the unfortunate side
effect of disabling tail calls. This patch ignores the NonNull attribute when
we decide whether to tail merge, in the same way that we ignore the NoAlias
attribute, as it has no effect on the call sequence.
Differential Revision: https://reviews.llvm.org/D52238
llvm-svn: 343091
This broke Chromium's Android build (https://crbug.com/889390) and the
polly-aosp buildbot
(http://lab.llvm.org:8011/builders/aosp-O3-polly-before-vectorizer-unprofitable).
> Originally committed in rL342210 but was reverted in rL342260 because
> it was causing issues in vectorized code, because I had forgotten to
> ensure that we're operating on scalar values.
>
> Original commit message:
>
> On failing to find sequences that can be converted into dual macs,
> try to find sequential 16-bit loads that are used by muls which we
> can then use smultb, smulbt, smultt with a wide load.
>
> Differential Revision: https://reviews.llvm.org/D51983
llvm-svn: 343082
A sequence of VMUL and VADD instructions always give the same or better
performance than a fused VMLA instruction on the Cortex-M4 and Cortex-M33.
Executing the VMUL and VADD back-to-back requires the same cycles, but
having separate instructions allows scheduling to avoid the hazard between
these 2 instructions.
Differential Revision: https://reviews.llvm.org/D52289
llvm-svn: 342874
- The load store optimizer is currently merging multiple loads/stores into VLDM/VSTM with more than 16 doubleword registers
- This is an UNPREDICTABLE instruction and shouldn't be done
- It looks like the Limit for how many registers can be included in a merge got dropped at some point, so I am reintroducing it in this patch
- This fixes https://bugs.llvm.org/show_bug.cgi?id=38389
Differential Revision: https://reviews.llvm.org/D52085
llvm-svn: 342872
Originally committed in rL342210 but was reverted in rL342260 because
it was causing issues in vectorized code, because I had forgotten to
ensure that we're operating on scalar values.
Original commit message:
On failing to find sequences that can be converted into dual macs,
try to find sequential 16-bit loads that are used by muls which we
can then use smultb, smulbt, smultt with a wide load.
Differential Revision: https://reviews.llvm.org/D51983
llvm-svn: 342870
Fixes the unwind information generated for floating-point registers.
Previously, all padding registers were assumed to be four bytes wide. Now, the
width of the register is used to specify the amount of padding.
Patch by Jackson Woodruff!
Differential revision: https://reviews.llvm.org/D51494
llvm-svn: 342545
Since Android API version 9 the Android libm has had the sincos functions, so
they should be recognised as libcalls and sincos optimisation should be applied.
Differential Revision: https://reviews.llvm.org/D52025
llvm-svn: 342471
This reverts r342395 as it caused error
> Argument value type does not match pointer operand type!
> %0 = atomicrmw volatile xchg i8* %_Value1, i32 1 monotonic, !dbg !25
> i8in function atomic_flag_test_and_set
> fatal error: error in backend: Broken function found, compilation aborted!
on bot http://green.lab.llvm.org/green/job/clang-stage1-configure-RA/
More details are available at https://reviews.llvm.org/D52080
llvm-svn: 342431
isSupportedValue explicitly checked and accepted many types of value,
primarily for debugging reasons. Remove most of these checks and do a
bit of refactoring now that the pass is more stable. This also enables
ZExts to be sources, but this has very little practical benefit at the
moment, as extend instructions will still be introduced.
Differential Revision: https://reviews.llvm.org/D52080
llvm-svn: 342395
We allow overflowing instructions if they're decreasing and only used
by an unsigned compare. Add the extra condition that the icmp cannot
be using a negative immediate.
Differential Revision: https://reviews.llvm.org/D52102
llvm-svn: 342392
On failing to find sequences that can be converted into dual macs,
try to find sequential 16-bit loads that are used by muls which we
can then use smultb, smulbt, smultt with a wide load.
Differential Revision: https://reviews.llvm.org/D51983
llvm-svn: 342210
We previously only allowed truncs as sinks, but now allow them as
sources too. We do this by checking that the result type is the
narrow type that we're trying to optimise for.
Differential Revision: https://reviews.llvm.org/D51978
llvm-svn: 342141
Part of FixConsts wrongly assumes either an 8- or 16-bit constant,
which can result in the wrong constants being generated during
promotion.
Differential Revision: https://reviews.llvm.org/D52032
llvm-svn: 342140
The Technical Reference Manuals for these two CPUs state that branching
to an unaligned 32-bit instruction incurs an extra pipeline reload
penalty. That's bad.
This also enables the optimization at -Os since it costs on average one
byte per loop in return for 1 cycle per iteration, which is pretty good
going.
llvm-svn: 342127
SMLAD and SMLALD instructions also come in the form of SMLADX and
SMLALDX which perform an exchange on their second operand. To support
this, more of the loads in the MAC candidates are compared for
sequential access and a boolean value has been added to BinOpChain.
AddMACCandidate has been refactored into a small pattern matching
state machine to reduce the amount of duplicated code, but also to
enable the matching to be more flexible. CreateParallelMACPairs now
iterates through all the candidates to find parallel ones.
Differential Revision: https://reviews.llvm.org/D51424
llvm-svn: 342033
Search from i64 reducing phis, as well as i32, to allow the
generation of smlald instructions.
Differential Revision: https://reviews.llvm.org/D51101
llvm-svn: 341941
Because t2LDREX (& t2STREX) were marked as AddrModeNone, but did allow a
FrameIndex operand, rewriteT2FrameIndex asserted. This gives them a
proper addressing-mode and tells the rewriter about it so that encodable
offsets are exploited and others are rejected.
Should fix PR38828.
llvm-svn: 341642
SHF_ARM_PURECODE flag when being built with the -mexecute-only flag.
All code sections of an ELF must have the flag set for the final .text
section to be execute-only, otherwise the flag gets removed.
A HasData flag is added to MCSection to aid in the determination that
the section is empty. A virtual setTargetSectionFlags is added to
MCELFObjectTargetWriter to allow subclasses to set target specific
section flags to be added to sections which we then use in the ARM
backend to set SHF_ARM_PURECODE.
Patch by Ivan Lozano!
Reviewed By: echristo
Differential Revision: https://reviews.llvm.org/D48792
llvm-svn: 341593
This was proposed as an IR transform in D49306, but it was not clearly justifiable as a canonicalization.
Here, we only do the transform when the target tells us that sqrt can be lowered with inline code.
This is the basic case. Some potential enhancements are in the TODO comments:
1. Generalize the transform for other exponents (allow more than 2 sqrt calcs if that's really cheaper).
2. If we have less fast-math-flags, generate code to avoid -0.0 and/or INF.
3. Allow the transform when optimizing/minimizing size (might require a target hook to get that right).
Note that by default, x86 converts single-precision sqrt calcs into sqrt reciprocal estimate with
refinement. That codegen is controlled by CPU attributes and can be manually overridden. We have plenty
of test coverage for that already, so I didn't bother to include extra testing for that here. AArch uses
its full-precision ops in all cases (not sure if that's the intended behavior or not, but that should
also be covered by existing tests).
Differential Revision: https://reviews.llvm.org/D51630
llvm-svn: 341481
The runtime pseudo relocations can't handle the ARM format embedded
addresses in movw/movt pairs. By using stubs, the potentially
dllimported addresses can be touched up by the runtime pseudo relocation
framework.
Differential Revision: https://reviews.llvm.org/D51450
llvm-svn: 341176
Summary:
This is a continuation of https://reviews.llvm.org/D49727
Below the original text, current changes in the comments:
Currently, in line with GCC, when specifying reserved registers like sp or pc on an inline asm() clobber list, we don't always preserve the original value across the statement. And in general, overwriting reserved registers can have surprising results.
For example:
extern int bar(int[]);
int foo(int i) {
  int a[i]; // VLA
  asm volatile(
      "mov r7, #1"
      :
      :
      : "r7"
  );
  return 1 + bar(a);
}
Compiled for thumb, this gives:
$ clang --target=arm-arm-none-eabi -march=armv7a -c test.c -o - -S -O1 -mthumb
...
foo:
.fnstart
@ %bb.0: @ %entry
.save {r4, r5, r6, r7, lr}
push {r4, r5, r6, r7, lr}
.setfp r7, sp, #12
add r7, sp, #12
.pad #4
sub sp, #4
movs r1, #7
add.w r0, r1, r0, lsl #2
bic r0, r0, #7
sub.w r0, sp, r0
mov sp, r0
@APP
mov.w r7, #1
@NO_APP
bl bar
adds r0, #1
sub.w r4, r7, #12
mov sp, r4
pop {r4, r5, r6, r7, pc}
...
r7 is used as the frame pointer for thumb targets, and this function needs to restore the SP from the FP because of the variable-length stack allocation a. r7 is clobbered by the inline assembly (and r7 is included in the clobber list), but LLVM does not preserve the value of the frame pointer across the assembly block.
This type of behavior is similar to GCC's and has been discussed on the bugtracker: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=11807 . No consensus seemed to have been reached on the way forward. Clang behavior has briefly been discussed on the CFE mailing (starting here: http://lists.llvm.org/pipermail/cfe-dev/2018-July/058392.html). I've opted for following Eli Friedman's advice to print warnings when there are reserved registers on the clobber list so as not to diverge from GCC behavior for now.
The patch uses MachineRegisterInfo's target-specific knowledge of reserved registers, just before we convert the inline asm string in the AsmPrinter.
If we find a reserved register, we print a warning:
repro.c:6:7: warning: inline asm clobber list contains reserved registers: R7 [-Winline-asm]
"mov r7, #1"
^
Reviewers: efriedma, olista01, javed.absar
Reviewed By: efriedma
Subscribers: eraman, kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D51165
llvm-svn: 341062
Summary:
Global variables that are external and zero initialized are
supposed to be merged with global variables in the bss section
rather than the data section.
Reviewers: efriedma, rengolin, t.p.northover, javed.absar, asl, john.brawn, pcc
Reviewed By: efriedma
Subscribers: dmgreen, llvm-commits
Differential Revision: https://reviews.llvm.org/D51379
llvm-svn: 341008
The inline sequence is very long (about 70 bytes on Thumb1), so it's
not really a good idea to inline it, especially when optimizing for
size.
Differential Revision: https://reviews.llvm.org/D47917
llvm-svn: 340458
On Windows, movw+movt pairs with relocations are handled with a single
relocation that covers them both. Therefore we can't inject anything
between these instructions; otherwise the relocation (which LLVM
attaches only to the movw instruction, dropping the movt
instruction's relocation) will end up bogus.
These instructions are bundled up until right before the constant
islands pass, making this effectively the only place that can split
them apart.
Differential Revision: https://reviews.llvm.org/D51032
llvm-svn: 340451
This avoids a potential infinite loop setting and unsetting bits in the
mask.
Reduced from a failure on the polly-aosp bot.
Differential Revision: https://reviews.llvm.org/D51066
llvm-svn: 340446
Add intrinsic isel patterns for sxtb16, sxtab16, uxtb16 and uxtab16
so that they can perform a ror.
Differential Revision: https://reviews.llvm.org/D51034
llvm-svn: 340405
During combining, ReduceLoadWidth is used to combine AND nodes that
mask loads into narrow loads. This patch allows the mask to be a
shifted constant. This results in a narrow load which is then left
shifted to compensate for the new offset.
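A small C++ example of the source-level shape this targets (illustrative only; the combine itself operates on DAG nodes):
```
#include <cstdint>

// An AND whose mask is a shifted constant, applied to a loaded value.
// With this change the combiner can use a narrow (byte) load of the
// relevant lane and then shift left by 8 to restore the bit position.
uint32_t mask_second_byte(const uint32_t *p) {
  return *p & 0x0000ff00u;
}
```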
Differential Revision: https://reviews.llvm.org/D50432
llvm-svn: 340261
There is no way in the universe that doing a full-width division in
software will be faster than doing overflowing multiplication in
software in the first place, especially given that this same full-width
multiplication needs to be done anyway.
This patch replaces the previous implementation with a direct lowering
into an overflowing multiplication algorithm based on half-width
operations.
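A C++ sketch of one such half-width decomposition for the 64-bit case (an illustration of the idea, not the exact DAG lowering):
```
#include <cstdint>

// Unsigned 64-bit overflowing multiply built only from 32-bit half-width
// multiplies, in the spirit of the algorithm described above.
bool umul64_overflow(uint64_t a, uint64_t b, uint64_t &result) {
  uint64_t a_lo = static_cast<uint32_t>(a), a_hi = a >> 32;
  uint64_t b_lo = static_cast<uint32_t>(b), b_hi = b >> 32;

  // Each partial product multiplies two 32-bit values, so it fits in 64 bits.
  uint64_t lo_lo = a_lo * b_lo;
  uint64_t lo_hi = a_lo * b_hi;
  uint64_t hi_lo = a_hi * b_lo;
  uint64_t hi_hi = a_hi * b_hi;

  // hi_hi contributes at bit 64 and above, so any non-zero value overflows.
  bool overflow = hi_hi != 0;

  // The cross terms contribute starting at bit 32; their upper halves land
  // at bit 64 and above and therefore overflow directly.
  overflow |= (lo_hi >> 32) != 0 || (hi_lo >> 32) != 0;

  // Sum the contributions to bits 32..63 and check for a carry out.
  uint64_t upper = (lo_lo >> 32) + static_cast<uint32_t>(lo_hi) +
                   static_cast<uint32_t>(hi_lo);
  overflow |= (upper >> 32) != 0;

  result = (upper << 32) | static_cast<uint32_t>(lo_lo);
  return overflow;
}
```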
Correctness of the algorithm was verified by exhaustively checking the
output of this algorithm for overflowing multiplication of 16 bit
integers against an obviously correct widening multiplication. Barring
any oversights introduced by porting the algorithm to DAG, confidence in
correctness of this algorithm is extremely high.
The following table shows the change in both t = runtime and s = space. The
change is expressed as a multiplier of the original, so anything under 1 is
“better” and anything above 1 is worse.
+-------+-----------+-----------+-------------+-------------+
| Arch | u64*u64 t | u64*u64 s | u128*u128 t | u128*u128 s |
+-------+-----------+-----------+-------------+-------------+
| X64 | - | - | ~0.5 | ~0.64 |
| i686 | ~0.5 | ~0.6666 | ~0.05 | ~0.9 |
| armv7 | - | ~0.75 | - | ~1.4 |
+-------+-----------+-----------+-------------+-------------+
Performance numbers have been collected by running overflowing
multiplication in a loop under `perf` on two x86_64 (one Intel Haswell,
other AMD Ryzen) based machines. Size numbers have been collected by
looking at the size of the function containing an overflowing multiply in
a loop.
All in all, it can be seen that both performance and size have improved,
except in the case of armv7, where code size has regressed for 128-bit
multiply. u128*u128 overflowing multiply on 32-bit platforms seems to
benefit from this change a lot, taking only 5% of the time the original
algorithm needed to calculate the same thing.
The final benefit of this change is that LLVM is now capable of lowering
the overflowing unsigned multiply for integers of any bit-width as long
as the target is capable of lowering regular multiplication for the same
bit-width. Previously, 128-bit overflowing multiply was the widest
possible.
Patch by Simonas Kazlauskas!
Differential Revision: https://reviews.llvm.org/D50310
llvm-svn: 339922
While searching through the use-def tree, ignore GetElementPtrInst
instructions because they don't need promoting and neither do their
indices. Otherwise, the wide indices prevent the transformation from
happening.
Differential Revision: https://reviews.llvm.org/D50762
llvm-svn: 339871
Originally committed in r339755 which was reverted in r339806 due to
an asan issue. The issue was caused by my assumption that operands to
a CallInst mapped to the FunctionType Params. CallInsts are now
handled by iterating over their ArgOperands instead of Operands.
Original Message:
Treat signed icmps as 'sinks', allowing them to be in the use-def
tree, enabling more promotions to be performed. As a sink, any
promoted incoming values need to be truncated before being used by
the signed icmp.
Differential Revision: https://reviews.llvm.org/D50067
llvm-svn: 339858
We only try to promote types which are smaller than 16 bits, but we
also need to check that the type is not less than 8 bits.
Differential Revision: https://reviews.llvm.org/D50769
llvm-svn: 339770
Treat signed icmps as 'sinks', allowing them to be in the use-def
tree, enabling more promotions to be performed. As a sink, any
promoted incoming values need to be truncated before being used by
the signed icmp.
Differential Revision: https://reviews.llvm.org/D50067
llvm-svn: 339755
Add pointers to the list of allowed types, but don't try to promote
them. Also fixed a bug with the promotion of undef values, so a new
value is now created instead of mutating in place. We also now only
promote if there's an instruction in the use-def chains other than
the icmp, sinks and sources.
Differential Revision: https://reviews.llvm.org/D50054
llvm-svn: 339754
LLVM normally prefers to minimize the number of bits set in an AND
immediate, but that doesn't always match the available ARM instructions.
In Thumb1 mode, prefer uxtb or uxth where possible; otherwise, prefer
a two-instruction sequence movs+ands or movs+bics.
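For illustration, the source-level shapes involved and the Thumb1 codegen expected per the description above (not taken from the patch's tests):
```
#include <cstdint>

// Low-bits mask: preferred as uxtb/uxth rather than a literal-pool AND.
uint32_t keep_low_byte(uint32_t x)  { return x & 0xffu; }

// Inverted mask: preferred as a movs+bics pair.
uint32_t clear_low_byte(uint32_t x) { return x & ~0xffu; }
```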
Some potential improvements are outlined in
ARMTargetLowering::targetShrinkDemandedConstant, but it seems to work
pretty well already.
The ARMISelDAGToDAG fix ensures we don't generate an invalid UBFX
instruction due to a larger-than-expected mask. (It's orthogonal, in
some sense, but as far as I can tell it's either impossible or nearly
impossible to reproduce the bug without this change.)
According to my testing, this seems to consistently improve codesize by
a small amount by forming bic more often for ISD::AND with an immediate.
Differential Revision: https://reviews.llvm.org/D50030
llvm-svn: 339472
Enabling ARMCodeGenPrepare by default caused a whole load of
failures. This is due to zexts and truncs not being handled properly.
ZExts are messy so it's just easier to disable for now and truncs
are allowed only as 'sinks'. I still need to figure out why allowing
them as 'sources' causes so many failures. The other main changes are
that we are explicit about the types that we are converting to; it's now
always 'TypeSize'. Type support is also now checked while checking for
valid opcodes, as it was unnecessarily complicated to have these checks
at different stages.
I've moved the tests around too, so we have the zext and truncs in
their own file as well as the overflowing opcode tests.
Differential Revision: https://reviews.llvm.org/D50518
llvm-svn: 339432
LLVM triple normalization is handling "unknown" and empty components
differently; for example given "x86_64-unknown-linux-gnu" and
"x86_64-linux-gnu" which should be equivalent, triple normalization
returns "x86_64-unknown-linux-gnu" and "x86_64--linux-gnu". autoconf's
config.sub returns "x86_64-unknown-linux-gnu" for both
"x86_64-linux-gnu" and "x86_64-unknown-linux-gnu". This changes the
triple normalization to behave the same way, replacing empty triple
components with "unknown".
This addresses PR37129.
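A small check of the intended behaviour using the LLVM Triple API (a sketch; assumes the in-tree llvm/ADT/Triple.h header of that time):
```
#include "llvm/ADT/Triple.h"
#include <cassert>

int main() {
  // Both spellings should now normalize to the same, fully spelled triple.
  assert(llvm::Triple::normalize("x86_64-linux-gnu") ==
         "x86_64-unknown-linux-gnu");
  assert(llvm::Triple::normalize("x86_64-unknown-linux-gnu") ==
         "x86_64-unknown-linux-gnu");
  return 0;
}
```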
Differential Revision: https://reviews.llvm.org/D50219
llvm-svn: 339294
Normally, if any registers are spilled, we prefer to spill lr on Thumb1
so we can fold the "bx lr" into the "pop". However, if there are tail
calls involved, restoring lr is expensive, so skip the optimization in
that case.
The spill of r7 in the new test also isn't necessary, but that's
mostly orthogonal to this patch. (It's the same code in
ARMFrameLowering, but it's not related to tail calls.)
Differential Revision: https://reviews.llvm.org/D49459
llvm-svn: 339283
Summary:
Currently, in line with GCC, when specifying reserved registers like sp or pc on an inline asm() clobber list, we don't always preserve the original value across the statement. And in general, overwriting reserved registers can have surprising results.
For example:
```
extern int bar(int[]);
int foo(int i) {
  int a[i]; // VLA
  asm volatile(
      "mov r7, #1"
      :
      :
      : "r7"
  );
  return 1 + bar(a);
}
```
Compiled for thumb, this gives:
```
$ clang --target=arm-arm-none-eabi -march=armv7a -c test.c -o - -S -O1 -mthumb
...
foo:
.fnstart
@ %bb.0: @ %entry
.save {r4, r5, r6, r7, lr}
push {r4, r5, r6, r7, lr}
.setfp r7, sp, #12
add r7, sp, #12
.pad #4
sub sp, #4
movs r1, #7
add.w r0, r1, r0, lsl #2
bic r0, r0, #7
sub.w r0, sp, r0
mov sp, r0
@APP
mov.w r7, #1
@NO_APP
bl bar
adds r0, #1
sub.w r4, r7, #12
mov sp, r4
pop {r4, r5, r6, r7, pc}
...
```
r7 is used as the frame pointer for thumb targets, and this function needs to restore the SP from the FP because of the variable-length stack allocation a. r7 is clobbered by the inline assembly (and r7 is included in the clobber list), but LLVM does not preserve the value of the frame pointer across the assembly block.
This type of behavior is similar to GCC's and has been discussed on the bugtracker: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=11807 . No consensus seemed to have been reached on the way forward. Clang behavior has briefly been discussed on the CFE mailing (starting here: http://lists.llvm.org/pipermail/cfe-dev/2018-July/058392.html). I've opted for following Eli Friedman's advice to print warnings when there are reserved registers on the clobber list so as not to diverge from GCC behavior for now.
The patch uses MachineRegisterInfo's target-specific knowledge of reserved registers, just before we convert the inline asm string in the AsmPrinter.
If we find a reserved register, we print a warning:
```
repro.c:6:7: warning: inline asm clobber list contains reserved registers: R7 [-Winline-asm]
"mov r7, #1"
^
```
Reviewers: eli.friedman, olista01, javed.absar, efriedma
Reviewed By: efriedma
Subscribers: efriedma, eraman, kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D49727
llvm-svn: 339257
Summary: Extend fix for PR34170 to support inline assembly with multiple output operands that do not naturally go in the register class they are constrained to (e.g. double in a 32-bit GPR as in the PR).
Reviewers: bogner, t.p.northover, lattner, javed.absar, efriedma
Reviewed By: efriedma
Subscribers: efriedma, tra, eraman, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D45437
llvm-svn: 339225
Clang support for the Armv8.2-A FP16 vector intrinsics was committed in
rC328277, but this was never followed up, i.e. the LLVM part is missing.
I've raised PR38404, and this is the first step to address this. I.e.,
this adds tests for the Armv8.2-A FP16 vector intrinsics, and thus shows
which intrinsics already work, and which need further work.
Differential Revision: https://reviews.llvm.org/D50142
llvm-svn: 338568
Disable ARMCodeGenPrepare by default again. It is causing verifier
failures in V8 that look like:
Duplicate integer as switch case
switch i32 %trunc, label %if.end13 [
i32 0, label %cleanup36
i32 0, label %if.then8
], !dbg !4981
i32 0
fatal error: error in backend: Broken function found, compilation aborted!
I will continue reducing the test case and send it along.
llvm-svn: 338452
This reapplies commit r338206 reverted by r338214 since the bug that
r338206 uncovered has been fixed in r338268.
Add support for inline assembly with a matching input operand that does not
naturally go in the register class it is constrained to (e.g. double in a
32-bit GPR). Note that regular input is already handled by existing
code.
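A minimal C++ shape of the construct involved (hedged: the empty asm body is just a placeholder; the point is a 64-bit double tied to a 32-bit "r" constraint on ARM):
```
double roundtrip(double x) {
  // A double constrained to a GPR class ("r") does not naturally fit the
  // register class on 32-bit ARM; the "0" constraint makes the input a
  // matching operand tied to that output.
  asm("" : "=r"(x) : "0"(x));
  return x;
}
```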
llvm-svn: 338269
Code in `CC_ARM_AAPCS_Custom_Aggregate()` is responsible for handling
homogeneous aggregates for `CC_ARM_AAPCS_VFP`. When an aggregate ends up
fully on stack, the function tries to pack all resulting items of the
aggregate as tightly as possible according to AAPCS.
Once the first item was laid out, the alignment used for consecutive
items was the size of one item. This logic went wrong for 128-bit
vectors because their alignment is normally only 64 bits, and so could
result in inserting unexpected padding between the first and second
element.
The patch fixes the problem by updating the alignment with the item size
only if this results in reducing it.
Differential Revision: https://reviews.llvm.org/D49720
llvm-svn: 338233
Add support for inline assembly with a matching input operand that does not
naturally go in the register class it is constrained to (e.g. double in a
32-bit GPR). Note that regular input is already handled by existing
code.
llvm-svn: 338206
This feature enables the fusion of such operations on Cortex A57 and Cortex
A72, as recommended in their Software Optimisation Guides, sections 4.14 and
4.11, respectively.
Differential revision: https://reviews.llvm.org/D49563
llvm-svn: 338147
Add support for inline assembly with an output operand that does not
naturally go in the register class it is constrained to (e.g. double in a
32-bit GPR as in the PR).
llvm-svn: 337903
ARM Stage 2 builders have been suspiciously broken since the pass was
committed. Disabling to hopefully fix the bots and give me time to
debug.
llvm-svn: 337821
Arm specific codegen prepare is implemented to perform type promotion
on icmp operands, which can enable the removal of uxtb and uxth
(unsigned extend) instructions. This is possible because performing
type promotion before ISel relieves the DAG builder of this duty; the
DAG builder has to perform legalisation but has only a limited view of
data ranges.
The pass visits any instruction operand of an icmp and creates a
worklist to traverse the use-def tree to determine whether the values
can simply be promoted. Our concern is values in the registers
overflowing the narrow (i8, i16) data range, so instructions marked
with nuw can be promoted easily. For add and sub instructions, we are
able to use the parallel dsp instructions to operate on scalar data
types and avoid overflowing bits. Underflowing adds and subs are also
permitted when the result is only used by an unsigned icmp.
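A hedged C++ example of the kind of pattern the pass is aimed at (illustrative; the actual work happens on the IR use-def chains described above):
```
#include <cstdint>

// Narrow unsigned arithmetic feeding an unsigned compare. Promoting the
// i8 operations to the native register width before ISel removes the
// uxtb zero-extensions that legalisation would otherwise need to insert.
bool matches_after_add(uint8_t a, uint8_t b, uint8_t key) {
  uint8_t sum = a + b;   // may wrap in the narrow type
  return sum == key;     // the compare is the only user of 'sum'
}
```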
Differential Revision: https://reviews.llvm.org/D48832
llvm-svn: 337687
Enable the optimization of operations on DPR and SPR via a feature instead
of checking the target.
Differential revision: https://reviews.llvm.org/D49463
llvm-svn: 337575
We were emitting incorrect calls to libm functions that LLVM had decided it
knew about because the default is soft-float.
Recommitted without breaking ELF this time.
llvm-svn: 337450
If we are only extracting vector elements via EXTRACT_VECTOR_ELT(s) we may be able to use SimplifyDemandedVectorElts to avoid unnecessary vector ops.
Differential Revision: https://reviews.llvm.org/D49262
llvm-svn: 337258
See https://reviews.llvm.org/D47106 for details.
Reviewed By: probinson
Differential Revision: https://reviews.llvm.org/D47171
This commit drops that patch's changes to:
llvm/test/CodeGen/NVPTX/f16x2-instructions.ll
llvm/test/CodeGen/NVPTX/param-load-store.ll
For some reason, the DOS line endings there prevent me from committing
via the monorepo. A follow-up commit (not via the monorepo) will
finish the patch.
llvm-svn: 336843
This fixes an issue where we were not properly supporting multiple reduction
statements in a loop, and not generating SMLADs for these cases. The alias
analysis checks were done too early, making the pass too conservative.
Differential revision: https://reviews.llvm.org/D49125
llvm-svn: 336795
The original code attempted to do this, but the std::abs() call didn't
actually do anything due to implicit type conversions. Fix the type
conversions, and perform the correct check for negative immediates.
This probably has very little practical impact, but it's worth fixing
just to avoid confusion in the future, I think.
Differential Revision: https://reviews.llvm.org/D48907
llvm-svn: 336742
Added statistics for the number of SMLAD instructions created, and
also renamed the pass to -arm-parallel-dsp.
Differential Revision: https://reviews.llvm.org/D48971
llvm-svn: 336441
D48278
Allow reducing redundant shift masks.
For example:
x1 = x & 0xAB00
x2 = (x >> 8) & 0xAB
can be reduced to:
x1 = x & 0xAB00
x2 = x1 >> 8
It only allows folding when the masks and shift values are constants.
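The same fold written as C++, to make the identity explicit (illustrative only):
```
#include <cstdint>

// Before the fold: an independent shift-and-mask.
uint32_t x2_before(uint32_t x) { return (x >> 8) & 0xAB; }

// After the fold: x2 can be derived from the existing x & 0xAB00 value.
uint32_t x2_after(uint32_t x)  { return (x & 0xAB00) >> 8; }
```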
llvm-svn: 336426
We were miscompiling i8 loads, so reject them as unsupported narrow operations
for now.
Differential Revision: https://reviews.llvm.org/D48944
llvm-svn: 336319
Armv6 introduced instructions to perform 32-bit SIMD operations. The purpose of
this pass is to do some straightforward IR pattern matching to create ACLE DSP
intrinsics, which map on these 32-bit SIMD operations.
Currently, only the SMLAD instruction gets recognised. This instruction
performs two multiplications with 16-bit operands, and stores the result in an
accumulator. We will follow this up with patches to recognise SMLAD in more
cases, and also to generate other DSP instructions (like e.g. SADD16).
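A C++ sketch of the kind of loop the pass is meant to recognise (illustrative only):
```
#include <cstdint>

// Two 16-bit multiplications accumulated into a 32-bit value per
// iteration: the shape that maps onto an SMLAD instruction.
int32_t dot_product(const int16_t *a, const int16_t *b, int n) {
  int32_t acc = 0;
  for (int i = 0; i + 1 < n; i += 2)
    acc += a[i] * b[i] + a[i + 1] * b[i + 1];
  return acc;
}
```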
Patch by: Sam Parker and Sjoerd Meijer
Differential Revision: https://reviews.llvm.org/D48128
llvm-svn: 335850
This patch adds support for the q versions of the dup
(load-to-all-lanes) NEON intrinsics, such as vld2q_dup_f16() for
example.
Currently, non-q versions of the dup intrinsics are implemented
in clang by generating IR that first loads the elements of the
structure into the first lane with the lane (to-single-lane)
intrinsics, and then propagating it to the other lanes. There are at
least two problems with this approach. First, there are no
double-spaced to-single-lane byte-element instructions. For
example, there is no such instruction as 'vld2.8 { d0[0], d2[0]
}, [r0]'. That means we cannot rely on the to-single-lane
intrinsics and instructions to implement the q versions of the
dup intrinsics. Note that to-all-lanes instructions do support
all sizes of data items, including bytes.
The second problem with the current approach is that we need a
separate vdup instruction to propagate the structure to each
lane. So for vld4q_dup_f16() we would need four vdup instructions
in addition to the initial vld instruction.
This patch introduces dup LLVM intrinsics and reworks handling of
the currently supported (non-q) NEON dup intrinsics to expand
them into those LLVM intrinsics, thus eliminating the need for
using to-single-lane intrinsics and instructions.
Additionally, this patch adds support for u64 and s64 dup NEON
intrinsics. These are marked as AArch64-only in the ARM NEON
Reference, but there seems to be no reason not to support them
in AArch32 mode. Please correct me if that is wrong.
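For reference, a minimal use of one of these q-form intrinsics (a sketch; assumes a toolchain providing arm_neon.h with FP16 support):
```
#include <arm_neon.h>

// Loads one two-element fp16 structure and replicates it across all
// lanes of both q registers, using an intrinsic this patch supports.
float16x8x2_t load_all_lanes(const float16_t *p) {
  return vld2q_dup_f16(p);
}
```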
That's what we generate with this patch applied:
vld2q_dup_f16:
vld2.16 {d0[], d2[]}, [r0]
vld2.16 {d1[], d3[]}, [r0]
vld3q_dup_f16:
vld3.16 {d0[], d2[], d4[]}, [r0]
vld3.16 {d1[], d3[], d5[]}, [r0]
vld4q_dup_f16:
vld4.16 {d0[], d2[], d4[], d6[]}, [r0]
vld4.16 {d1[], d3[], d5[], d7[]}, [r0]
Differential Revision: https://reviews.llvm.org/D48439
llvm-svn: 335733
Summary:
If a routine with no stack frame makes a sibling call, we need to
preserve the stack space check even if the local stack frame is empty,
since the call target could be a "no-split" function (in which case
the linker needs to be able to fix up the prolog sequence in order to
switch to a larger stack).
This fixes PR37807.
Reviewers: cherry, javed.absar
Subscribers: srhines, llvm-commits
Differential Revision: https://reviews.llvm.org/D48444
llvm-svn: 335604
With compilation fix.
Original commit message:
D39788 added a '.stack-size' section containing metadata on function stack sizes
to output ELF files behind the new -stack-size-section flag.
This change does the following two things on top:
1) Imagine the case when the -ffunction-sections flag is given and there are text sections in COMDATs.
The patch adds the '.stack-size' section into the corresponding COMDAT group, so that the linker will be able to
eliminate them quickly when resolving the COMDATs.
2) The patch sets the SHF_LINK_ORDER flag and links '.stack-size' with the corresponding .text.
With that, the linker will be able to apply -gc-sections to dead stack-size sections.
Differential revision: https://reviews.llvm.org/D46874
llvm-svn: 335336
D39788 added a '.stack-size' section containing metadata on function stack sizes
to output ELF files behind the new -stack-size-section flag.
This change does the following two things on top:
1) Imagine the case when the -ffunction-sections flag is given and there are text sections in COMDATs.
The patch adds the '.stack-size' section into the corresponding COMDAT group, so that the linker will be able to
eliminate them quickly when resolving the COMDATs.
2) The patch sets the SHF_LINK_ORDER flag and links '.stack-size' with the corresponding .text.
With that, the linker will be able to apply -gc-sections to dead stack-size sections.
Differential revision: https://reviews.llvm.org/D46874
llvm-svn: 335332