Recent code for folding MINVAL() didn't allow for architectures
whose C/C++ char type is unsigned, so the value of the maximum
Fortran character was incorrect. This was caught by the
folding20.f90 test. The fix is to avoid numeric_limits<> and
use hard-coded values for the maximum signed integers of the various character kinds.
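A minimal illustration of the pitfall and the shape of the fix (the constants for kinds 2 and 4 are illustrative, not necessarily the exact values used in the patch):
```
#include <cstdint>
#include <limits>

// On hosts where plain 'char' is unsigned (e.g. ARM, POWER),
// numeric_limits<char>::max() is 255 rather than 127, so it is not a
// portable way to get the maximum signed value of a character kind.
// Hard-coded maxima do not depend on the host's char signedness:
constexpr std::int64_t MaxSignedCharValue(int kind) {
  return kind == 1 ? 0x7f : kind == 2 ? 0x7fff : 0x7fffffff;
}
```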
Pushing into llvm-project/main to restore ARM/POWER buildbots.
Currently the value is only used when calling `F->viewCFG()`, which misses out on much of its potential usefulness.
So I added the check to the printer passes as well.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D102011
Use a "double-double" accumulator, a/k/a Kahan summation,
in the SUM intrinsic in the runtime for real & complex.
This seems to be the best-recommended technique for reducing
error, as opposed to the initial implementation of SUM's
distinct accumulators for positive and negative items.
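A minimal standalone sketch of the technique (not the runtime's actual code):
```
#include <cstddef>

// Compensated (Kahan) summation: 'correction' carries the low-order
// bits that are lost when 'sum' absorbs each element.
template <typename T> T KahanSum(const T *x, std::size_t n) {
  T sum{}, correction{};
  for (std::size_t j = 0; j < n; ++j) {
    T corrected = x[j] - correction; // apply the previous error
    T next = sum + corrected;        // low-order bits may be lost here
    correction = (next - sum) - corrected; // recover the lost bits
    sum = next;
  }
  return sum;
}
```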
Differential Revision: https://reviews.llvm.org/D104338
The `MockAllocator` used in `ScudoTSDTest` wasn't allocated
properly aligned, which resulted in the `TSDs` of the shared
registry not being aligned either. This led to some failures
like: https://reviews.llvm.org/D103119#2822008
This changes how the `MockAllocator` is allocated, in the same way
Vitaly did in the combined tests: it is now properly aligned, which
results in the `TSDs` being aligned as well.
This also adds a `DCHECK` in the shared registry to check that it is.
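A sketch of the idea (names hypothetical; not the actual test code):
```
#include <memory>
#include <new>

// Obtain storage with an explicit alignment and construct the object
// in place, so members such as the TSDs inherit that alignment.
template <typename T> std::unique_ptr<T, void (*)(T *)> makeAligned() {
  void *Mem = ::operator new(sizeof(T), std::align_val_t(alignof(T)));
  return {new (Mem) T(), [](T *P) {
            P->~T();
            ::operator delete(P, std::align_val_t(alignof(T)));
          }};
}
```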
Differential Revision: https://reviews.llvm.org/D104402
Implement constant folding for the reduction transformational
intrinsic functions MAXVAL and MINVAL.
In anticipation of more folding work to follow, with (I hope)
some common infrastructure, these two have been implemented in a
new header file.
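The shape of the fold, as a standalone sketch over plain integers (the real implementation works on typed constant expressions in the new header):
```
#include <algorithm>
#include <cstdint>
#include <vector>

// Fold MAXVAL over a constant integer array: start from the identity
// element (the type's minimum value) and keep the largest element seen.
std::int64_t FoldMaxval(const std::vector<std::int64_t> &elements) {
  std::int64_t result = INT64_MIN; // identity for MAXVAL
  for (std::int64_t v : elements)
    result = std::max(result, v);
  return result;
}
```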
Differential Revision: https://reviews.llvm.org/D104337
The test added in https://reviews.llvm.org/D104305 will only work with
the new driver and should be marked as such.
Sending this without a review as it's fairly straightforward and fixes
test failures for developers that don't want to build the new driver.
When a program attempts to put something like a subprogram
into an array constructor, emit an error rather than crashing.
Differential Revision: https://reviews.llvm.org/D104336
Two-level distributed barrier is a new experimental barrier designed
for Intel hardware that has better performance in some cases than the
default hyper barrier.
This barrier is designed to handle fine granularity parallelism where
barriers are used frequently with little compute and memory access
between barriers. There is no need to use it for codes with few
barriers and coarse-grained compute, or for memory-intensive
applications, as little difference will be seen between this barrier
and the default hyper barrier. This barrier is designed to work
optimally with a fixed number of threads, and has a significant setup
time, so should NOT be used in situations where the number of threads
in a team is varied frequently.
The two-level distributed barrier is off by default -- hyper barrier
is used by default. To use this barrier, you must set all barrier
patterns to use this type, because it will not work with other barrier
patterns. Thus, to turn it on, the following settings are required:
KMP_FORKJOIN_BARRIER_PATTERN=dist,dist
KMP_PLAIN_BARRIER_PATTERN=dist,dist
KMP_REDUCTION_BARRIER_PATTERN=dist,dist
Branching factors (set with KMP_FORKJOIN_BARRIER, KMP_PLAIN_BARRIER,
and KMP_REDUCTION_BARRIER) are ignored by the two-level distributed
barrier.
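For illustration, a toy C++/OpenMP kernel with the fine-grained, frequent-barrier pattern this barrier targets; run it with the three KMP_*_BARRIER_PATTERN settings above:
```
// Many implicit barriers (one per worksharing loop) with trivial
// compute between them.
void step(double *a, int n, int iters) {
#pragma omp parallel
  for (int t = 0; t < iters; ++t) {
#pragma omp for
    for (int i = 0; i < n; ++i)
      a[i] += 1.0;
    // implicit barrier at the end of each '#pragma omp for'
  }
}
```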
Differential Revision: https://reviews.llvm.org/D103121
Currently, `access` doesn't recognize a dereferenced smart pointer. So,
`access(e, "field")` where `e = *x`, yields:
* `x->field`, for normal-pointer x,
* `(*x).field`, for smart-pointer x.
This patch normalizes the handling of smart pointers to match normal
pointers, when the smart pointer type supports `->`.
Differential Revision: https://reviews.llvm.org/D104390
Adds an integration test for the SPMM (sparse matrix multiplication) kernel, which multiplies a sparse matrix by a dense matrix, resulting in a dense matrix. This is just a simple modification of the existing matrix-vector multiplication kernel.
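For reference, the computation in plain scalar C++ (a CSR sketch, not the generated MLIR kernel):
```
// C[i][j] += A[i][k] * B[k][j], iterating only the nonzeros of the
// sparse matrix A, stored in compressed sparse row (CSR) form.
void spmm(int m, int n, const int *rowptr, const int *col,
          const double *val, const double *B, double *C) {
  for (int i = 0; i < m; ++i)
    for (int p = rowptr[i]; p < rowptr[i + 1]; ++p)
      for (int j = 0; j < n; ++j)
        C[i * n + j] += val[p] * B[col[p] * n + j];
}
```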
Reviewed By: aartbik
Differential Revision: https://reviews.llvm.org/D104334
Currently, `hasUnaryOperand` fails for the overloaded `operator*`. This patch fixes the bug and
adds tests for this case.
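For example (a hedged sketch; the in-tree tests may spell it differently), the fix lets a matcher like this hit the overloaded dereference `*p`:
```
#include "clang/ASTMatchers/ASTMatchers.h"
using namespace clang::ast_matchers;

// Matches a call to an overloaded unary 'operator*' (e.g. '*p' on a
// smart pointer) and binds its single operand.
auto Matcher = cxxOperatorCallExpr(hasUnaryOperand(declRefExpr().bind("arg")));
```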
Differential Revision: https://reviews.llvm.org/D104389
Make the store-to-load forwarding condition for -memref-dataflow-opt less
conservative. Post-dominance info is not really needed. Add an additional
check for common cases.
Differential Revision: https://reviews.llvm.org/D104174
To control the number of outer parallel loops, we need to process the
outer loops first; hence, a pre-order walk fixes the issue.
Reviewed By: bondhugula
Differential Revision: https://reviews.llvm.org/D104361
This allows dialects to do different post-processing with the inliner depending on the operations involved (my use case requires different attribute propagation rules depending on the call op). This hook runs before the regular processInlinedBlocks method.
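A hedged sketch of what a dialect override might look like (assuming the hook is the processInlinedCallBlocks method on DialectInlinerInterface; details may differ):
```
#include "mlir/Transforms/InliningUtils.h"
using namespace mlir;

// Dialect-specific post-processing that runs before the regular
// processInlinedBlocks handling.
struct MyInlinerInterface : public DialectInlinerInterface {
  using DialectInlinerInterface::DialectInlinerInterface;
  void processInlinedCallBlocks(
      Operation *call,
      iterator_range<Region::iterator> inlinedBlocks) const final {
    // e.g. apply attribute propagation rules that depend on 'call'
  }
};
```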
Differential Revision: https://reviews.llvm.org/D104399
Put the dtor of mca::CustomBehaviour into the cpp file to avoid
undefined vtable when linking libLLVMMCACustomBehaviourAMDGPU as shared
library.
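For reference, the classic pattern: declare the virtual destructor in the header and define it in exactly one .cpp file, which anchors the vtable there:
```
// CustomBehaviour.h
class CustomBehaviour {
public:
  virtual ~CustomBehaviour(); // declared here, not defined inline
};

// CustomBehaviour.cpp
CustomBehaviour::~CustomBehaviour() = default; // vtable emitted here
```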
Differential Revision: https://reviews.llvm.org/D104401
In a region with multiple blocks, the verifier will try to check
dominance and may query a block's successor list, even though the block
may be empty or may not end with a terminator.
Differential Revision: https://reviews.llvm.org/D104411
I removed them in rG5de7467e982 but @thakis pointed out that
they were useful to keep, so here they are again. I've also converted
the `!isCoalescedWeak()` asserts into `!shouldOmitFromOutput()` asserts,
since the latter check subsumes the former.
Reviewed By: #lld-macho, thakis
Differential Revision: https://reviews.llvm.org/D104169
Previously, dangling samples were represented by INT64_MAX in the sample profile, while probes that were never executed were not reported. This was based on the observation that dangling probes made up a smaller portion than zero-count probes. However, with compiler optimizations, dangling probes end up being a large portion of all probes in general, and reporting them does not make sense from a profile size point of view. This change flips sample reporting by reporting zero-count probes instead, which enables a dangling probe to be represented by nothing (a missing entry in the profile). This has several benefits:
1. Reducing sample profile size in optimize mode, even when the number of non-executed probes outnumbers the number of dangling probes, since INT64_MAX takes more space than 0 to encode.
2. Binary size savings. No need to encode dangling probe anymore, since missing probes are treated as dangling in the profile reader.
3. Reducing compiler work to track dangling probes. However, for probes that are really dead and removed, we still need the compiler to identify them so that they can be reported as zero-count, instead of being mistreated as dangling probes.
4. Improving counts quality by respecting the counts already collected on the non-dangling copy of a probe. A probe, when duplicated, gets two copies at runtime. If one of them is dangling while the other is not, merging the two probes at profile generation time will cause the real samples collected on the non-dangling one to be discarded. Not reporting the dangling counterpart will keep the real samples.
5. Better readability.
6. Consistency with the non-CS dwarf line-number-based profile. Zero counts are trusted by the compiler's count inferencer, while missing counts will be inferred by the compiler.
Note that the current patch does not include any work for #3. There will be follow-up changes.
For #1, I've seen the text profile for a large Facebook service reduced by 7%. For the extbinary profile, the size of the LBRProfileSection is reduced by 35%.
For #4, I've seen the general counts quality for SPEC2017 improve by 10%.
Reviewed By: wenlei, wlei, wmi
Differential Revision: https://reviews.llvm.org/D104129
We have several ways of introducing a scalar invariant value into
linalg generic ops (should we limit this somewhat?). This revision
makes sure we handle all of them correctly in the sparse compiler.
Reviewed By: gysit
Differential Revision: https://reviews.llvm.org/D104335
I'm not sure what behavior we want if the FP environment is
not default (also not sure if there's a way to enumerate
the full list of intrinsics programmatically), but currently
these are all defaulting to 'false' (doesn't propagate).
When chasing down another unrelated bug, I noticed that the
implementations of various character intrinsic functions assume
that the lower bounds of (some of) their arguments are 1.
This isn't necessarily the case, so I've cleaned them up, tweaked
the unit tests to exercise the fix, and regularized the allocation
pattern used for results to call SetBounds() before Allocate() rather
than going through the original Descriptor::Allocate() wrapper around
CFI_allocate().
Since there were only a few other remaining uses of that original
Descriptor::Allocate() wrapper, I also converted them to the new
pattern and deleted the old one.
Differential Revision: https://reviews.llvm.org/D104325
When using FileIndexRecord with macros, symbol references can be seen
out of source order, which made inserting the symbols into a sorted
vector a performance regression. Instead, we now sort the vector lazily. The
impact is small on most code, but in very large files with many macro
references (M) near the beginning of the file followed by many decl
references (D) it was O(M*D). A particularly bad protobuf-generated
header was observed with a 100% regression in practice.
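The pattern, as a standalone sketch with hypothetical names:
```
#include <algorithm>
#include <utility>
#include <vector>

// Append possibly out-of-order references in O(1); sort once, only
// when a consumer actually needs ordered data.
struct LazySortedRefs {
  std::vector<std::pair<unsigned, unsigned>> Refs; // (offset, symbol id)
  bool IsSorted = true;
  void add(unsigned Offset, unsigned Id) {
    if (!Refs.empty() && Offset < Refs.back().first)
      IsSorted = false; // defer the sorting work
    Refs.emplace_back(Offset, Id);
  }
  const std::vector<std::pair<unsigned, unsigned>> &sorted() {
    if (!IsSorted) {
      std::sort(Refs.begin(), Refs.end());
      IsSorted = true;
    }
    return Refs;
  }
};
```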
rdar://78628133
There is no need to differentiate whether `UseSegments` is true or
false. Unifying the cases makes the behavior closer to BinaryWriter.
This improves compatibility with objcopy because SHF_ALLOC sections not in
a PT_LOAD will not be skipped. Such cases are usually erroneous input, though.
Reviewed By: jhenderson
Differential Revision: https://reviews.llvm.org/D104186
The original change was pushed in main as commit f7a23ecece.
It was then reverted by commit a04f01bab2 because it caused linker failures
on buildbots that don't build the AMDGPU target.
--
Some instructions are not defined well enough within the target’s scheduling
model for llvm-mca to be able to properly simulate their behaviour. The ideal
solution to this situation is to modify the scheduling model, but that’s not
always a viable strategy. Maybe other parts of the backend depend on that
instruction being modelled the way that it is. Or maybe the instruction is quite
complex and it’s difficult to fully capture its behaviour with tablegen. The
CustomBehaviour class (which I will refer to as CB frequently) is designed to
provide intuitive scaffolding for developers to implement the correct modelling
for these instructions.
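For a flavour of the interface (hook name as introduced by the patch; surrounding details simplified), a target subclasses CB and reports extra stall cycles for hazards the scheduling model cannot express:
```
#include "llvm/MCA/CustomBehaviour.h"
using namespace llvm;
using namespace llvm::mca;

class MyCustomBehaviour : public CustomBehaviour {
public:
  using CustomBehaviour::CustomBehaviour;
  // Return how many cycles 'IR' must stall due to dependencies on the
  // already-issued instructions; 0 means no custom hazard.
  unsigned checkCustomHazard(ArrayRef<InstRef> IssuedInst,
                             const InstRef &IR) override {
    return 0;
  }
};
```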
More details are available in the original commit log message (f7a23ecece).
Differential Revision: https://reviews.llvm.org/D104149
This changeset removes libelf usage from the elf_common part of the plugins.
libelf is still used in x86_64 generic plugin code and in some plugins
(e.g. amdgpu) - these will have to be cleaned up in separate checkins.
Differential Revision: https://reviews.llvm.org/D103545
We already have this fold:
fadd float poison, 1.0 --> poison
...via ConstantFolding, so this makes the behavior consistent
if the other operand(s) are non-constant.
The fold for undef was added before poison existed as a
value/type in IR.
This came up in D102673 / D103169
because we're trying to sort out the more complicated handling
for constrained math ops.
We should have the handling for the regular instructions done
first, so we can build on that (or diverge as needed).
Differential Revision: https://reviews.llvm.org/D104383
Currently, the implementation combines OOP and overloads, using a template to
tie the two together. In practice, this has proven confusing with no
benefits. This patch simplifies the code to use standard OOP design (a
collection of classes deriving from an interface).
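Schematically, with generic names (not the actual classes in the patch):
```
#include <memory>
#include <string>
#include <utility>

// One interface, one concrete class per case, instead of overloads
// tied together by a template.
struct StencilInterface {
  virtual ~StencilInterface() = default;
  virtual std::string toString() const = 0;
};

struct RawTextStencil final : StencilInterface {
  std::string Text;
  explicit RawTextStencil(std::string T) : Text(std::move(T)) {}
  std::string toString() const override { return Text; }
};
```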
Differential Revision: https://reviews.llvm.org/D104317
It's a warning in ld64. While having LLD be stricter would be nice, it
makes it harder for it to be a drop-in replacement in existing builds.
Reviewed By: #lld-macho, thakis
Differential Revision: https://reviews.llvm.org/D104333
This only applies to FastISel. GlobalISel seems to sidestep
the issue.
This fixes https://bugs.llvm.org/show_bug.cgi?id=46996
One of the things we do in llvm is decide if a type needs
consecutive registers. Previously, we just checked if it
was an array or not.
(plus an SVE specific check that is not changing here)
This causes some confusion when you have arbitrary IR like:
```
%T1 = type { double, i1 };
define [ 1 x %T1 ] @foo() {
entry:
ret [ 1 x %T1 ] zeroinitializer
}
```
We see it is an array so we call CC_AArch64_Custom_Block
which bails out when it sees the i1, a type we don't want
to put into a block.
This leaves the location of the double in some kind of
intermediate state and leads to odd codegen, which then crashes
the backend because it doesn't know how to implement
what it's been asked for.
You get this:
```
renamable $d0 = FMOVD0
$w0 = COPY killed renamable $d0
```
Rather than this:
```
$d0 = FMOVD0
$w0 = COPY $wzr
```
The backend knows how to copy a 64-bit register to a 64-bit register,
but not a 64-bit to a 32-bit one. It could certainly be taught how, but
the real issue seems to be that we even try to assign a register block
in the first place.
This change makes the logic of
AArch64TargetLowering::functionArgumentNeedsConsecutiveRegisters
a bit more thorough. If we find an array, we also check that all the
nested aggregates in that array have a single member type.
Then CC_AArch64_Custom_Block's assumption of a type that looks
like [ N x type ] will be valid and we get the expected codegen.
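The shape of that stricter check, as a hedged sketch (helper name hypothetical; the real logic lives in functionArgumentNeedsConsecutiveRegisters):
```
#include "llvm/IR/DerivedTypes.h"
using namespace llvm;

// Walk an aggregate and confirm every leaf has one common type, so
// [ N x type ] is the only shape handed to CC_AArch64_Custom_Block.
static bool hasSingleElementType(Type *Ty, Type *&Elt) {
  if (auto *AT = dyn_cast<ArrayType>(Ty))
    return hasSingleElementType(AT->getElementType(), Elt);
  if (auto *ST = dyn_cast<StructType>(Ty)) {
    for (Type *M : ST->elements())
      if (!hasSingleElementType(M, Elt))
        return false;
    return true;
  }
  if (Elt && Elt != Ty)
    return false; // mixed leaf types: not a homogeneous block
  Elt = Ty;
  return true;
}
```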
New tests have been added to exercise these situations. Note that
some of the output is not ABI compliant. The aim of this change is
to simply handle these situations and not to make our processing
of arbitrary IR ABI compliant.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D104123