We can handle the distinction easily enough in the generic code, and
this makes it easier to abstract the selection of type/location from
the code to insert code.
This is a bug fix in legality check.
When we encounter triangular loops such as the following form:
for (int i = 0; i < m; i++)
for (int j = 0; j < i; j++), or
for (int i = 0; i < m; i++)
for (int j = 0; j*i < n; j++),
we should not perform interchange since the number of executions
of the loop body will be different before and after interchange,
resulting in incorrect results.
Reviewed By: bmahjour
Differential Revision: https://reviews.llvm.org/D101305
When making compilation relocatable, for example in distributed
compilation scenarios, we want to set compilation dir to a relative
value like `.` but this presents a problem when generating reports
because if the file path is relative as well, for example `..`, you
may end up writing files outside of the output directory.
This change introduces a flag that allows overriding the compilation
directory that's stored inside the profile with a different value that
is absolute.
Differential Revision: https://reviews.llvm.org/D100232
These can be used to create eh-frame section fixing passes outside the usual
linker pipeline, which can be useful for tests and tools that just want to
verify or dump graphs.
This commit reorders some fields and fixes the width of others to try to
maintain more consistent columns. It also switches to long-hand scope
and linkage names, since LinkGraph dumps aren't read often enough for
single-character codes to be memorable.
Remove the requirement that the instruction is a BinaryOperator,
make the predicate check more compact and use slightly more
meaningful naming for the and operands.
The waitcnt pass would increment the number of vmem events for some buffer
invalidates that were not handled by the pass.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D102252
When an integer is converted into floating point in subword vector extract,
it can be done in 2 instructions instead of the 3+ instructions it generates
right now. This patch removes the uncessary generation.
Differential: https://reviews.llvm.org/D100604
GlobalOpt implements a heap SROA (SROA for an malloc allocatated struct or array
of structs) which is largely undertested (heap-sra-[1234].ll are basically the
same test with very little difference) and does not trigger at all when
bootstrapping clang (it only supports the case of one single store).
The heap SROA implementation causes PR50027 (GEP is not properly handled; crash or miscompile).
Just drop the implementation. I have deleted some obviously duplicated tests
but kept `heap-sra-[12]{,-no-nullopt}.ll`.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D102257
Similar to X86 D73230 and AArch64 D101872
With this change, we can set dso_local in clang's -fpic -fno-semantic-interposition mode,
for default visibility external linkage non-ifunc-non-COMDAT definitions.
For such dso_local definitions, variable access/taking the address of a
function/calling a function will go through a local alias to avoid GOT/PLT.
Reviewed By: jrtc27, luismarques
Differential Revision: https://reviews.llvm.org/D101875
Make sure the alignment of the generated operations matches the
alignment of the byval argument. Previously, we were just ignoring
alignment and getting lucky.
While I'm here, also delete the unnecessary "tail" handling.
Passing a pointer to a byval argument to a "tail" call is UB, so
rewriting to an alloca doesn't require any special handling.
Differential Revision: https://reviews.llvm.org/D89819
My thought process is that if v2i64 is an LMUL=1 type then v2i32
should be an LMUL=1/2 type. We limit the fractional LMUL so that
SEW=64 clips to LMUL=1, SEW=32 clips to LMUL=1/2, etc. This
ensures there's always a fractional LMUL available to truncate a type.
This does reduce the number of vsetvlis in some cases.
Some tests increase vsetvlis because the best container type for a
mask type is dependent on the LMUL+SEW that the mask was produced
from, but you can't tell that from the type. I think this is
something we need to solve this in the machine IR when optimizing
vsetvlis.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D101215
I've seen this in the RawSpeed's BitPumpMSB*::push() hotpath,
after fixing the buffer abstraction to a more sane one,
when looking into a +5% runtime regression.
I was hoping that this would fix it, but it does not look it does.
This seems to be at least not worse than the original pattern.
But i'm actually mainly interested in the case where we already
compute `(y+32)` (see last test),
https://alive2.llvm.org/ce/z/ZCzJio
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D101944
Limited to splats because we would need to truncate the shift
amount vector otherwise.
I tried to do this with new ISD nodes and a DAG combine to
avoid such a large pattern, but we don't form the splat until
LegalizeDAG and need DAG combine to remove a scalable->fixed->scalable
cast before it becomes visible to the shift node. By the time that
happens we've already visited the truncate node and won't revisit it.
I think I have an idea how to improve i64 on RV32 I'll save for a
follow up.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D102019
Since D87304, `align` become an invalid attribute on none pointer types and
verifier will reject bitcode that has invalid `align` attribute.
The problem is before the change, DeadArgumentElimination can easily
turn a pointer return type into a void return type without removing
`align` attribute. Teach Autograde to remove invalid `align` attribute
from return types to maintain bitcode compatibility.
rdar://77022993
Reviewed By: dexonsmith
Differential Revision: https://reviews.llvm.org/D102201
This is a bug fix in legality check.
When we encounter triangular loops such as the following form:
for (int i = 0; i < m; i++)
for (int j = 0; j < i; j++), or
for (int i = 0; i < m; i++)
for (int j = 0; j*i < n; j++),
we should not perform interchange since the number of executions of the loop body
will be different before and after interchange, resulting in incorrect results.
Reviewed By: bmahjour
Differential Revision: https://reviews.llvm.org/D101305
If the simplified VPValue is a recipe, we need to register it for Instr,
in case it needs to be recorded. The way this is handled in general may
change soon, following some post-commit comments.
This fixes PR50298.
Now that getMemoryOpCost() correctly handles all the vector variants,
we should no longer hand-roll our own version of it, but use it directly.
The AVX512 variant probably needs a similar change,
but there it is less obvious.
foldShuffleOfHorizOp only handled basic shufps(hop(x,y),hop(z,w)) folds - by moving this to canonicalizeShuffleMaskWithHorizOp we can work with more general/combined v4x32 shuffles masks, float/integer domains and support shuffle-of-packs as well.
The next step will be to support 256/512-bit vector cases.
Instead of handling power-of-two sized vector chunks,
try handling the large vector in a stream mode,
decreasing the operational vector size
once it no longer works for the elements left to process.
Notably, this improves costs for overaligned loads - loading padding is fine.
This more directly tracks when we need to insert/extract the YMM/XMM subvector,
some costs fluctuate because of that.
Reviewed By: RKSimon, ABataev
Differential Revision: https://reviews.llvm.org/D100684
The test example from https://llvm.org/PR50256 (and reduced here)
shows that we can match a load combine candidate even when there
are no "or" instructions. We can avoid that by confirming that we
do see an "or". This doesn't apply when matching an or-reduction
because that match begins from the operands of the reduction.
Differential Revision: https://reviews.llvm.org/D102074
Moving code sinking pass before structurizer creates more sinking
opportunities.
The extra flow edges introduced by the structurizer can have adverse
effects on sinking, because the sinking pass prefers moving instructions
to blocks with unique predecessors and the structurizer destroys that
property in some cases.
A notable example is moving high-latency image instructions across kills.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D101115
The stack frame update code does not take into consideration spilling
to registers for callee saved registers. The option -ppc-enable-pe-vector-spills
turns on spilling to registers for callee saved registers and may expose a bug
in the code that moves a stack frame pointer update instruction.
Reviewed By: nemanjai, #powerpc
Differential Revision: https://reviews.llvm.org/D101366
STATEPOINT is a fancy and complex pseudo instruction which
has both tied defs and regmask operand.
Basic FastRA algorithm is as follows:
1. Mark registers used by defs as free
2. If instruction has regmask operand displace clobbered registers
according to regmask.
3. Assign registers for use operands.
In case of tied defs step 1 is replaced with allocation of registers
for them. But regmask is still processed, which may displace already
allocated registers. As a result, tied use and def will get assigned
to different registers.
This patch makes FastRA to process instruction's RegMask (if any) when
checking for physical registers interference.
That way tied operands won't get registers clobbered by regmask.
Reviewed By: arsenm, skatkov
Differential Revision: https://reviews.llvm.org/D99284
As we have been missing support for WebAssembly globals on the IR level,
the lowering of WASM_SYMBOL_TYPE_GLOBAL to IR was incomplete. This
commit fleshes out the lowering support, lowering references to and
definitions of addrspace(1) values to correctly typed
WASM_SYMBOL_TYPE_GLOBAL symbols.
Depends on D101608.
Differential Revision: https://reviews.llvm.org/D101913
This patch adds support for WebAssembly globals in LLVM IR, representing
them as pointers to global values, in a non-default, non-integral
address space. Instruction selection legalizes loads and stores to
these pointers to new WebAssemblyISD nodes GLOBAL_GET and GLOBAL_SET.
Once the lowering creates the new nodes, tablegen pattern matches those
and converts them to Wasm global.get/set of the appropriate type.
Based on work by Paulo Matos in https://reviews.llvm.org/D95425.
Reviewed By: pmatos
Differential Revision: https://reviews.llvm.org/D101608
This patch adds JSON output style to llvm-symbolizer to better support CLI automation by providing a machine readable output.
Reviewed By: jhenderson
Differential Revision: https://reviews.llvm.org/D96883
For Zvlsseg spilling, we need to convert the pseudo instructions
into multiple vector load/store instructions with appropriate offsets.
For example, for PseudoVSPILL3_M2, we need to convert it to
VS2R %v2, %base
ADDI %base, %base, (vlenb x 2)
VS2R %v4, %base
ADDI %base, %base, (vlenb x 2)
VS2R %v6, %base
We need to keep the size of the offset in the pseudo spilling instructions.
In this case, it is (vlenb x 2).
In the original implementation, we use the size of frame objects divide the
number of vectors in zvlsseg types. The size of frame objects is not
necessary exactly the same as the spilling data. It may be larger than
it. So, we change it to (VLENB x LMUL) in this patch. The calculation is
more direct and easy to understand.
Differential Revision: https://reviews.llvm.org/D101869
This change was originally landed in: 5000a1b4b9
It was reverted in: 061e071d8c
This change adds support for a new WASM_SEG_FLAG_STRINGS flag in
the object format which works in a similar fashion to SHF_STRINGS
in the ELF world.
Unlike the ELF linker this support is currently limited:
- No support for SHF_MERGE (non-string merging)
- Always do full tail merging ("lo" can be merged with "hello")
- Only support single byte strings (p2align 0)
Like the ELF linker merging is only performed at `-O1` and above.
This fixes part of https://bugs.llvm.org/show_bug.cgi?id=48828,
although crucially it doesn't not currently support debug sections
because they are not represented by data segments (they are custom
sections)
Differential Revision: https://reviews.llvm.org/D97657
This is roughly equivalent to the floating point portion of
`AArch64TargetLowering::LowerVSETCC`. Main part that's missing is the v4s16 bit.
This also adds helpers equivalent to `EmitVectorComparison`, and
`changeVectorFPCCToAArch64CC`. This moves `changeFCMPPredToAArch64CC` out of
the selector into AArch64GlobalISelUtils for the sake of code reuse.
This is done in post-legalizer lowering with pseudos to simplify selection.
The imported patterns end up handling selection for us this way.
Differential Revision: https://reviews.llvm.org/D101782
The combines in `tryCombineMemCpyFamily` have heuristics (e.g.
`TLI.getMaxStoresPerMemset`) which consider size. So, theoretically, enabling
these combines on minsize functions shouldn't be harmful.
With this enabled we save 0.9% geomean on CTMark at -Oz, and 5.1% on Bullet.
There are no code size regressions.
Differential Revision: https://reviews.llvm.org/D102198
Let's say you represent (i32, i32) as an i64 from which the parts
are extracted with lshr/trunc. Then, if you compare two tuples by
parts you get something like A[0] == B[0] && A[1] == B[1], just
that the part extraction happens by lshr/trunc and not a narrow
load or similar.
The fold implemented here reduces such equality comparisons by
converting them into a comparison on a larger part of the integer
(which might be the whole integer). It handles both the "and of eq"
and the conjugated "or of ne" case.
I'm being conservative with one-use for now, though this could be
relaxed if profitable (the base pattern converts 11 instructions
into 5 instructions, but there's quite a few variations on how it
can play out).
Differential Revision: https://reviews.llvm.org/D101232
This patch adds support for Darwin's libsystem math vector functions to
TLI. Darwin's libsystem provides a range of vector functions for libm
functions.
This initial patch only adds the 2 x double and 4 x float versions,
which are available on both X86 and ARM64. On X86, wider vector versions
are supported as well.
Reviewed By: jroelofs
Differential Revision: https://reviews.llvm.org/D101856
This change adds support for a new WASM_SEG_FLAG_STRINGS flag in
the object format which works in a similar fashion to SHF_STRINGS
in the ELF world.
Unlike the ELF linker this support is currently limited:
- No support for SHF_MERGE (non-string merging)
- Always do full tail merging ("lo" can be merged with "hello")
- Only support single byte strings (p2align 0)
Like the ELF linker merging is only performed at `-O1` and above.
This fixes part of https://bugs.llvm.org/show_bug.cgi?id=48828,
although crucially it doesn't not currently support debug sections
because they are not represented by data segments (they are custom
sections)
Differential Revision: https://reviews.llvm.org/D97657
For opaque pointers, we're trying to avoid uses of
PointerType::getElementType().
A couple of ISel places use PointerType::getElementType(). Some of these
are easy to fix by using ArgListEntry's indirect types.
The inalloca type wasn't stored there, as opposed to preallocated and
byval which have their indirect types available, so add it and use it.
Differential Revision: https://reviews.llvm.org/D101713
Instead of using VMap, which may include instructions from the
caller as a result of simplification, iterate over the
(FirstNewBlock, Caller->end()) range, which will only include new
instructions.
Fixes https://bugs.llvm.org/show_bug.cgi?id=50270.
Differential Revision: https://reviews.llvm.org/D102110
If spills are to registers instead of to the stack then a copy will be used
and frame index scavenging is not required.
This patch adds debug info to frame index scavenging and makes sure that
spilling to registers does not cause frame index scavenging.
Reviewed By: nemanjai, #powerpc
Differential Revision: https://reviews.llvm.org/D101360
For contiguous ranges we drop the last bit-test case but in doing so we skip
adding the new MBB PHI edges to the list of replacement PHI edges, and as a
result we incorrectly omit them in the G_PHI in finishPendingPhis().
Was found when bootstrapping clang with -O3 and GlobalISel enabled on Apple Silicon.
This is better no-functional-change-intended than the 1st attempt.
As noted in D102002, there were at least 2 diffs that went
unchecked in pass manager regressions tests: different pass
parameters (SimplifyCFG) and an extension point/callback.
Those should be lifted from the original code blocks correctly
now.
Patch by Artem Radzikhovskyy!
Allow delinearization of fixed sized arrays if we can prove that the GEP indices do not overflow the array dimensions. The checks applied are similar to the ones that are used for delinearization of parametric size arrays. Make sure that the GEP indices are non-negative and that they are smaller than the range of that dimension.
Changes Summary:
- Updated the LIT tests with more exact values, as we are able to delinearize and apply more exact tests
- profitability.ll - now able to delinearize in all cases, no need to use -da-disable-delinearization-checks flag and run the test twice
- loop-interchange-optimization-remarks.ll - in one of the cases we are able to delinearize without using -da-disable-delinearization-checks
- SimpleSIVNoValidityCheckFixedSize.ll - removed unnecessary "-da-disable-delinearization-checks" flag. Now can get the exact answer without it.
- SimpleSIVNoValidityCheckFixedSize.ll and PreliminaryNoValidityCheckFixedSize.ll - made negative tests more explicit, in order to demonstrate the need for "-da-disable-delinearization-checks" flag
Differential Revision: https://reviews.llvm.org/D101486
The logic for x86_64 position-independent TType encodings was backwards,
using 8 bytes where 4 were wanted and 4 where 8 were wanted. For regular
x86_64, this was mostly harmless, exception tables are allowed to use
8-byte encodings even when it is not needed. For the large code model,
and for X32, however, the generated exception tables were wrong. For the
large code model, we cannot assume that the address will fit in 4 bytes.
For X32, we cannot use 64-bit relocations.
Fixes PR50148.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D102132
Analogously to https://reviews.llvm.org/D98794 this patch uses the
`alignstack` attribute to fix incorrect passing of homogeneous
aggregate (HA) arguments on AArch32. The EABI/AAPCS was recently
updated to clarify how VFP co-processor candidates are aligned:
4488e34998
Differential Revision: https://reviews.llvm.org/D100853
This reverts commit fefcb1f878.
It was supposed to be NFC, but as noted in the post-commit
comments in D102002, that was not true: SimplifyCFG uses
different parameters and there's a difference in an
extension point / callback.
Need to remove the old code for avoiding double counting of the gather
nodes with perfect diamond matches within the tree after we started
detecting perfect/shuffled matching in the previous patch D100495. We
may skip the cost for such nodes completely.
Differential Revision: https://reviews.llvm.org/D102023
Expanding a fixed length operation involves wrapping the operation in an
insert/extract subvector pair, as such, when this is done to bitcast we
end up with an extract_subvector of a bitcast. DAGCombine tries to
convert this into a bitcast of an extract_subvector which restores the
initial fixed length bitcast, causing an infinite loop of legalization.
As part of this patch, we must make sure the above DAGCombine does not
trigger after legalization if the created bitcast would not be legal.
Differential Revision: https://reviews.llvm.org/D101990
A ConstantAggregateZero may be created from a scalable vector type.
However, it still assumed fixed number of elements when queried for
them. This patch changes ConstantAggregateZero to correctly report its
element count.
This change fixes a couple of issues. Firstly, it fixes a crash in
Constant::getUniqueValue when called on a scalable-vector
zeroinitializer constant.
Secondly, it fixes a latent bug in GlobalISel's IRTranslator in which
translating a scalable-vector zeroinitializer would hit the assertion in
ConstantAggregateZero::getNumElements when casting to a FixedVectorType,
rather than reporting an error more gracefully. This is currently
hypothetical as the IRTranslator has deeper issues preventing the use of
scalable vector types.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D102082
When using predicated intrinsics, if the predicate used is all lanes active,
use an unpredicated form of the instruction, additionally this allows for
better use of immediate forms.
This only includes instructions where the unpredicated/predicated forms
matched in such a way that instruction selection would not introduce extra
ptrue instructions. This allows us to convert the intrinsics directly to
architecture independent ISD nodes.
Depends on D101062
Differential Revision: https://reviews.llvm.org/D101828
When using predicated arithmetic intrinsics, if the predicate used is all
lanes active, use an unpredicated form of the instruction, additionally
this allows for better use of immediate forms.
This also includes a new complex isel pattern which allows matching an
all active predicate when the types are different but the predicate is a
superset of the type being used. For example, to allow a b8 ptrue for a
b32 predicate operand.
This only includes instructions where the unpredicated/predicated forms
are mismatched between variants, meaning that the removal of the
predicate is done during instruction selection in order to prevent
spurious re-introductions of ptrue instructions.
Co-authored-by: Paul Walker <paul.walker@arm.com>
Differential Revision: https://reviews.llvm.org/D101062
The function template `CallLowering::setArgFlags` is invoked both
for arguments and return values. In the latter case, it calls
`getParamStackAlign` with argument index `~0u`. Nothing wrong
happens now, as the argument is safely incremented back to 0
inside `getParamStackAlign` (the type is `unsigned`), but in
principle it's fragile and may become incorrect.
Differential Revision: https://reviews.llvm.org/D102004
DAGCombiner tries to combine a (fpext (load)) to (fround (extload))
but SVE has no FP-extending loads. By marking these as expand,
the combine no longer happens.
This also fixes a similar issue for fptrunc, where the source type
is not a legal type.
Reviewed By: bsmith, kmclaughlin
Differential Revision: https://reviews.llvm.org/D102053
When using parallel loop construct, the OpenMP specification allows for
guided, auto and runtime as scheduling variants (as well as static and
dynamic which are already supported).
This adds the translation from MLIR to LLVM-IR for these scheduling
variants.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D101435
Large loads on target that does not useFlatForGlobal have to be split
in regbankselect. This did not happen in case when destination had vgpr
bank and address had sgpr bank.
Instead of checking if address bank is sgpr check bank of the destination.
Differential Revision: https://reviews.llvm.org/D101992
This patch extends VectorLegalizer::ExpandSELECT to permit expansion
also for scalable vector types. The only real change is conditionally
checking for BUILD_VECTOR or SPLAT_VECTOR legality depending on the
vector type.
We can use this to fix "cannot select" errors for scalable vector
selects on the RISCV target. Note that in future patches RISCV will
possibly custom-lower vector SELECTs to VSELECTs for branchless codegen.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D102063
Since index_vector is lowered into step_vector in D100816, we can just remove
index_vector, use step_vector for codegen directly.
Differential Revision: https://reviews.llvm.org/D101593
Generalizing this API allows work to be distributed more evenly. In particular,
query callbacks can now be dispatched (rather than running immediately on the
thread that satisfied the query). This avoids the pathalogical case where an
operation on one thread satisfies many queries simultaneously, causing large
amounts of work to be run on that thread while other threads potentially sit
idle.
Ignore ephemeral values (only feeding llvm.assume intrinsics) when
computing the instruction count to decide if a block is small enough for
threading. This is similar to the handling of these values in the
InlineCost computation. These instructions will eventually be removed
and shouldn't count against code size (similar to the existing ignoring
of phis).
Without this change, when enabling -fwhole-program-vtables, which causes
type test / assume sequences to be inserted by clang, we can get
different threading decisions. In particular, when building with
instrumentation FDO it can affect the optimizations decisions before FDO
matching, leading to some mismatches.
Differential Revision: https://reviews.llvm.org/D101494
As confirmed by exegesis measurements, and ref docs.
It does actually execute.
While there, bump latency for MULX32rr, that seems to match measurements.
applyLoopGuards() already combines conditions from multiple nested
guards. However, it cannot use multiple conditions on the same guard,
combined using and/or. Add support for this by recursing into either
`and` or `or`, depending on the direction of the branch.
Differential Revision: https://reviews.llvm.org/D101692
To the best of my knowledge, all instructions are modelled,
and have reasonable values to them; flipping the switch
doesn't cause any diff for MCA tests, so either we're good,
or we have test coverage gaps.
I'm not really sure why no other X86 sched model is marked as complete.