This is preparation for the https://reviews.llvm.org/D101025.
The D101025 will start calculating var locstats for concrete fns
that refere to an abstract origin as well.
When using predicated intrinsics, if the predicate used is all lanes active,
use an unpredicated form of the instruction, additionally this allows for
better use of immediate forms.
This only includes instructions where the unpredicated/predicated forms
matched in such a way that instruction selection would not introduce extra
ptrue instructions. This allows us to convert the intrinsics directly to
architecture independent ISD nodes.
Depends on D101062
Differential Revision: https://reviews.llvm.org/D101828
When using predicated arithmetic intrinsics, if the predicate used is all
lanes active, use an unpredicated form of the instruction, additionally
this allows for better use of immediate forms.
This also includes a new complex isel pattern which allows matching an
all active predicate when the types are different but the predicate is a
superset of the type being used. For example, to allow a b8 ptrue for a
b32 predicate operand.
This only includes instructions where the unpredicated/predicated forms
are mismatched between variants, meaning that the removal of the
predicate is done during instruction selection in order to prevent
spurious re-introductions of ptrue instructions.
Co-authored-by: Paul Walker <paul.walker@arm.com>
Differential Revision: https://reviews.llvm.org/D101062
The function template `CallLowering::setArgFlags` is invoked both
for arguments and return values. In the latter case, it calls
`getParamStackAlign` with argument index `~0u`. Nothing wrong
happens now, as the argument is safely incremented back to 0
inside `getParamStackAlign` (the type is `unsigned`), but in
principle it's fragile and may become incorrect.
Differential Revision: https://reviews.llvm.org/D102004
DAGCombiner tries to combine a (fpext (load)) to (fround (extload))
but SVE has no FP-extending loads. By marking these as expand,
the combine no longer happens.
This also fixes a similar issue for fptrunc, where the source type
is not a legal type.
Reviewed By: bsmith, kmclaughlin
Differential Revision: https://reviews.llvm.org/D102053
When using parallel loop construct, the OpenMP specification allows for
guided, auto and runtime as scheduling variants (as well as static and
dynamic which are already supported).
This adds the translation from MLIR to LLVM-IR for these scheduling
variants.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D101435
In the buffer deallocation pass, unranked memref types are not properly supported.
After investigating this issue, it turns out that the Clone and Dealloc operation
does not support unranked memref types in the current implementation.
This patch adds the missing feature and enables the transformation of any memref
type.
This patch solves this bug: https://bugs.llvm.org/show_bug.cgi?id=48385
Differential Revision: https://reviews.llvm.org/D101760
According to:
https://docs.python.org/3/library/subprocess.html#subprocess.Popen.poll
poll can return None if the process hasn't terminated.
I'm not quite sure how addr2line could end up closing the pipe without
terminating but we did see this happen on one of our bots:
```
<...>scripts/asan_symbolize.py",
line 211, in symbolize
logging.debug("addr2line exited early (broken pipe), returncode=%d"
% self.pipe.poll())
TypeError: %d format: a number is required, not NoneType
```
Handle None by printing a message that we couldn't get the return
code.
Reviewed By: delcypher
Differential Revision: https://reviews.llvm.org/D101891
This is a follow up on D101524 which:
- simplifies cpu features detection and usage,
- flattens target dependent optimizations so it's obvious which implementations are generated,
- provides an implementation targeting the host (march/mtune=native) for the mem* functions,
- makes sure all implementations are unittested (provided the host can run them),
- makes sure all implementations are benchmarkable (provided the host can run them).
Differential Revision: https://reviews.llvm.org/D101895
Large loads on target that does not useFlatForGlobal have to be split
in regbankselect. This did not happen in case when destination had vgpr
bank and address had sgpr bank.
Instead of checking if address bank is sgpr check bank of the destination.
Differential Revision: https://reviews.llvm.org/D101992
Previously, the OpenMP to LLVM IR conversion was setting the alloca insertion
point to the same position as the main compuation when converting OpenMP
`parallel` operations. This is problematic if, for example, the `parallel`
operation is placed inside a loop and would keep allocating on stack on each
iteration leading to stack overflow.
Reviewed By: kiranchandramohan
Differential Revision: https://reviews.llvm.org/D101307
Previously clang would print a binary blob into the bundled file
for amdgcn. With this patch, it will instead print textual IR as
expected.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D102065
This patch provides a way to specify the default target cpu optimizations to use when compiling llvm-libc.
This ensures we don't rely on current compiler's default and allows compiling and cross compiling for a particular target.
Differential Revision: https://reviews.llvm.org/D101991
This patch prevents runtime tests running on systems without amdgpu.
Reviewed By: protze.joachim, tianshilei1992
Differential Revision: https://reviews.llvm.org/D102054
This patch is suppose to fix the issue of hsa.h not found.
Issue was reported in D99949
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D102067
This patch extends VectorLegalizer::ExpandSELECT to permit expansion
also for scalable vector types. The only real change is conditionally
checking for BUILD_VECTOR or SPLAT_VECTOR legality depending on the
vector type.
We can use this to fix "cannot select" errors for scalable vector
selects on the RISCV target. Note that in future patches RISCV will
possibly custom-lower vector SELECTs to VSELECTs for branchless codegen.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D102063
Inside a templated function, other class members need to be called with
this->.
Otherwise we get: explicit qualification required to use member
'setDebugName' from dependent base class.
Since index_vector is lowered into step_vector in D100816, we can just remove
index_vector, use step_vector for codegen directly.
Differential Revision: https://reviews.llvm.org/D101593
Generalizing this API allows work to be distributed more evenly. In particular,
query callbacks can now be dispatched (rather than running immediately on the
thread that satisfied the query). This avoids the pathalogical case where an
operation on one thread satisfies many queries simultaneously, causing large
amounts of work to be run on that thread while other threads potentially sit
idle.
Ignore ephemeral values (only feeding llvm.assume intrinsics) when
computing the instruction count to decide if a block is small enough for
threading. This is similar to the handling of these values in the
InlineCost computation. These instructions will eventually be removed
and shouldn't count against code size (similar to the existing ignoring
of phis).
Without this change, when enabling -fwhole-program-vtables, which causes
type test / assume sequences to be inserted by clang, we can get
different threading decisions. In particular, when building with
instrumentation FDO it can affect the optimizations decisions before FDO
matching, leading to some mismatches.
Differential Revision: https://reviews.llvm.org/D101494
We are able to bind the result from native function while rewriting
pattern. In matching pattern, if we want to get some values back, we can
do that by passing parameter as return value placeholder. Besides, add
the semantic of '$_self' in NativeCodeCall while matching, it'll be the
operation that defines certain operand.
Differential Revision: https://reviews.llvm.org/D100746
On a section with alignment of 16, subsections aligned to 16-byte
boundaries should keep their 16-byte alignment.
Fixes PR50274. (The same bug could have happened with -order_file
previously.)
Differential Revision: https://reviews.llvm.org/D102139
This would cause us to pull in symbols (and code) that should
be unused.
Reviewed By: #lld-macho, thakis
Differential Revision: https://reviews.llvm.org/D102137
As confirmed by exegesis measurements, and ref docs.
It does actually execute.
While there, bump latency for MULX32rr, that seems to match measurements.
applyLoopGuards() already combines conditions from multiple nested
guards. However, it cannot use multiple conditions on the same guard,
combined using and/or. Add support for this by recursing into either
`and` or `or`, depending on the direction of the branch.
Differential Revision: https://reviews.llvm.org/D101692
To the best of my knowledge, all instructions are modelled,
and have reasonable values to them; flipping the switch
doesn't cause any diff for MCA tests, so either we're good,
or we have test coverage gaps.
I'm not really sure why no other X86 sched model is marked as complete.