According to llvm-mca analysis, AVX2+ capable targets have higher throughput for VPBLENDVB and VPMOVZX ops, making it cheaper to perform shift+select patterns for vXi8 shifts or extend/shift/truncate for vXi16 shifts. Similarly, AVX512BW can perform vXi8 shifts as extend/shift/truncate patterns.
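As a scalar illustration of the extend/shift/truncate pattern, the sketch below models a single i8 lane. The function name is hypothetical, and the real lowering works on whole vectors (zero-extension, a shift in the wider element type, then truncation) rather than on one element at a time.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of one vXi8 lane: widen, shift, truncate.
std::uint8_t shlLaneViaI16(std::uint8_t lane, unsigned amount) {
  assert(amount < 8 && "shift amount must be in range for an 8-bit lane");
  std::uint16_t wide = lane;                          // zero-extend i8 -> i16
  wide = static_cast<std::uint16_t>(wide << amount);  // shift in the wider type
  return static_cast<std::uint8_t>(wide & 0xFF);      // truncate back to i8
}
```

Bits shifted past the eighth position land in the upper half of the 16-bit value and are dropped by the truncation, which matches the result of a native 8-bit shift.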
Prevent users of `iter_args` of an affine for loop from being hoisted
out of it. Otherwise, LICM leads to a violation of SSA dominance
(as demonstrated in the added test case).
Fixes: https://bugs.llvm.org/show_bug.cgi?id=50103
Reviewed By: bondhugula, ayzhuang
Differential Revision: https://reviews.llvm.org/D102984
This previously handled memref::SubviewOp, but this can be extended to
all ops implementing the interface.
Differential Revision: https://reviews.llvm.org/D103076
Currently the vector load + extract gets lowered to a single scalar
store, not accounting for the fact that the index could be
out-of-bounds, which is poison, not UB.
See PR50382.
Clang adds a Decl to a DeclContext in two phases. First it adds it invisibly and
then it makes it visible (which will add it to the lookup data structures). It's
important that we don't do lookups into the DeclContext we are currently adding
the Decl to during this process, as once the Decl has been added, any lookup will
automatically build a new lookup map and add the added Decl to it. The second
step would then add the Decl a second time to the lookup, which will lead to
weird errors later on. I made adding a Decl twice to a lookup an assertion
error in D84827.
In the first step Clang also does some computations on the added Decl if it's
for example a FieldDecl that is added to a RecordDecl.
One of these computations is checking if the FieldDecl is of a record type
and the record type has a deleted constexpr destructor which will delete
the constexpr destructor of the record that got the FieldDecl.
This can lead to a bug with the way we implement MinimalImport in LLDB
and the following code:
```
struct Outer {
typedef int HookToOuter;
struct NestedClass {
HookToOuter RefToOuter;
} NestedClassMember; // We are adding this.
};
```
1. We just imported `Outer` minimally so far.
2. We are now asked to add `NestedClassMember` as a FieldDecl.
3. We import `NestedClass` minimally.
4. We add `NestedClassMember` and clang does a lookup for a constexpr dtor in
`NestedClass`. `NestedClassMember` hasn't been added to the lookup.
5. The lookup into `NestedClass` will now load the members of `NestedClass`.
6. We try to import the type of `RefToOuter` which will try to import the `HookToOuter` typedef.
7. We import the typedef and while importing we check for conflicts in `Outer` via a lookup.
8. The lookup into `Outer` will cause the invisible `NestedClassMember` to be added to the lookup.
9. We continue normally until we get back to the `addDecl` call in step 2.
10. We now add `NestedClassMember` to the lookup even though we already did that in step 8.
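The sequence above can be replayed in miniature with a toy model. This is only a sketch: `ToyDeclContext` and its methods are hypothetical names and not Clang's API. The only assumptions are the two described in the text, namely that a lookup lazily builds the table from every stored declaration and that the second phase registers the declaration again if a table already exists.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model only; none of these names are Clang's real API.
class ToyDeclContext {
public:
  // Phase 1: store the declaration without making it visible to lookups.
  void addHidden(const std::string &name) {
    assert(!name.empty() && "declarations in this model must be named");
    Decls.push_back(name);
  }

  // Phase 2: register the declaration in the lookup table if one exists.
  void makeVisible(const std::string &name) {
    if (LookupBuilt)
      ++Lookup[name];
  }

  // Any lookup lazily builds the table from every stored declaration,
  // including ones that are still waiting for phase 2.
  int timesRegistered(const std::string &name) {
    if (!LookupBuilt) {
      for (const std::string &decl : Decls)
        ++Lookup[decl];
      LookupBuilt = true;
    }
    auto it = Lookup.find(name);
    return it == Lookup.end() ? 0 : it->second;
  }

private:
  std::vector<std::string> Decls;
  std::map<std::string, int> Lookup;
  bool LookupBuilt = false;
};
```

Calling `addHidden("NestedClassMember")`, then performing any lookup, then calling `makeVisible("NestedClassMember")` leaves the member registered twice in this model, which corresponds to the double add in step 10. Without the intervening lookup it is registered once.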
The fix here is disabling the minimal import for RecordTypes from FieldDecls. We
actually already did this, but so far we only forced the definition of the type
to be imported *after* we imported the FieldDecl. This just moves that code
*before* we import the FieldDecl to prevent the issue above.
Reviewed By: shafik, aprantl
Differential Revision: https://reviews.llvm.org/D102993
It's not clear why the whole test got disabled, but the linked bug report
has since been fixed and the only part of it that still fails is the test
for the too permissive lookup. This re-enables the test, rewrites it to use
the modern test functions we have and splits the failing part into its
own test that we can skip without disabling the rest.
The change is currently NFC, but exploited by the depending D102954.
Code to handle constants is borrowed from the general implementation
of Value::doRAUW().
Differential Revision: https://reviews.llvm.org/D103051
I really needed this, like, factually, yesterday,
when verifying dependency breaking idioms for AMD Zen 3 scheduler model.
Consider the following example:
```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=duplicate
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-4a7e50.o
---
mode: inverse_throughput
key:
instructions:
- 'VPXORYrr YMM0 YMM0 YMM0'
config: ''
register_initial_values: []
cpu_name: znver3
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
- { key: inverse_throughput, value: 0.31025, per_snippet_value: 0.31025 }
error: ''
info: ''
assembled_snippet: C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C3
...
```
What does it tell us?
So wait, it can only execute ~3 x86 AVX YMM PXOR zero-idioms per cycle?
That doesn't seem right. That's even less than there are pipes supporting this type of op.
Now, second example:
```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-2418b5.o
---
mode: inverse_throughput
key:
instructions:
- 'VPXORYrr YMM0 YMM0 YMM0'
config: ''
register_initial_values: []
cpu_name: znver3
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
- { key: inverse_throughput, value: 1.00011, per_snippet_value: 1.00011 }
error: ''
info: ''
assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3
...
```
Now that's just worse. Due to the looping, the throughput completely plummeted,
and now we can only do a single instruction/cycle!?
That's not great.
And final example:
```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop --loop-body-size=1000
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-c402e2.o
---
mode: inverse_throughput
key:
instructions:
- 'VPXORYrr YMM0 YMM0 YMM0'
config: ''
register_initial_values: []
cpu_name: znver3
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
- { key: inverse_throughput, value: 0.167087, per_snippet_value: 0.167087 }
error: ''
info: ''
assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3
...
```
So if we merge the previous two approaches, duplicate this single-instruction snippet 1000x
(loop-body-size/instruction count in snippet), and run a loop with 1000 iterations
over that duplicated/unrolled snippet, the measured throughput goes through the roof,
up to 5.9 instructions/cycle, which finally tells us that this idiom is zero-cycle!
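The instructions/cycle figures quoted for the three runs are simply the reciprocal of the reported `inverse_throughput` values. A minimal sketch of that arithmetic follows; the helper name is hypothetical.

```cpp
#include <cassert>

// Converts an inverse throughput (cycles per instruction) into
// instructions per cycle. Hypothetical helper, for illustration only.
double instructionsPerCycle(double inverseThroughput) {
  assert(inverseThroughput > 0.0 && "inverse throughput must be positive");
  return 1.0 / inverseThroughput;
}
```

Applied to the measurements above, 0.31025 gives roughly 3.2, 1.00011 gives roughly 1.0, and 0.167087 gives roughly 5.98 instructions per cycle.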
Reviewed By: courbet
Differential Revision: https://reviews.llvm.org/D102522
I cannot find documentation on this CPU, and it
is not supported by the Arm Compiler 5 product either.
It was likely a mistake or a different name for the
"ep9312", which is an Arm based Cirrus Logic chip.
Reviewed By: peter.smith
Differential Revision: https://reviews.llvm.org/D103024
Lower a 1D vector transfer op to LLVM if the last dim stride is 1. Also fixes a bug in the original unit stride computation.
Differential Revision: https://reviews.llvm.org/D102897
D82085 ("allow TRE for non-capturing calls") caused a failure during bootstrap.
This patch does the same as D82085 and also fixes the bootstrap error.
The problem with D82085 is that it does not create copies for byval
operands when replacing a function call with a branch.
Consider the following example:
```
int zoo ( S p1 );
int foo ( int count, S p1 ) {
if ( count > 10 )
return zoo(p1);
// temporary variable created for passing byvalue parameter
// p1 could be used when zoo(p1) is called (after TRE is done).
// lifetime.start p1.byvalue.temp
return foo(count+1, p1);
// lifetime.end p1.byvalue.temp
}
```
After the recursive call to foo is replaced with a jump to the
start of the function, its parameters could be passed to the
zoo function, i.e. the temporary variable created for the byvalue
parameter "p1" could be passed to zoo. As a result, zoo receives a
broken operand:
```
int foo ( int count, S p1 ) {
:tailrecurse
p1_tr = phi p1, p1.byvalue.temp
if ( count > 10 )
return zoo(p1_tr);
// temporary variable created for passing byvalue parameter
// p1 could be used when zoo(p1) is called (after TRE is done).
lifetime.start p1.byvalue.temp
memcpy (p1.byvalue.temp, p1_tr)
count = count + 1
lifetime.end p1.byvalue.temp
br tailrecurse
}
```
To prevent p1.byvalue.temp from being used after its scope is ended by
the lifetime.end marker, this patch copies the value from p1.byvalue.temp
into another temporary variable and then copies this variable
into the input parameter for the next iteration.
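The extra copy can be mimicked by hand in C++. This is only a source-level sketch of the idea, not the IR the pass produces: `fooLoop`, `carried` and `byvalTemp` are hypothetical names, and the nested scope stands in for the lifetime.start/lifetime.end markers.

```cpp
#include <cassert>

struct S { int value; };

int zoo(S p1) { return p1.value; }

// Reference version: plain recursion, as in the original example.
int fooRecursive(int count, S p1) {
  if (count > 10)
    return zoo(p1);
  return fooRecursive(count + 1, p1);
}

// Hand-written loop imitating the function after tail recursion elimination.
int fooLoop(int count, S p1) {
  for (;;) {
    if (count > 10)
      return zoo(p1);
    S carried;
    {
      S byvalTemp = p1;     // temporary created for passing p1 by value
      carried = byvalTemp;  // copy made while byvalTemp is still alive
    }                       // byvalTemp's lifetime ends here
    p1 = carried;           // the next iteration reads the copy, not byvalTemp
    count = count + 1;
  }
}

// Small self-check: both versions must agree.
void checkEquivalence() {
  assert(fooLoop(0, S{7}) == fooRecursive(0, S{7}));
}
```

The point of `carried` is that nothing read on the next iteration refers to storage whose lifetime has already ended.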
This patch passes bootstrap build and bootstrap build with AddressSanitizer.
Differential Revision: https://reviews.llvm.org/D85614
..during on-demand parsing of CTU"
During CTU, the *on-demand parsing* will read and parse the invocation
list to know how to compile the file being imported. However, it seems
that the invocation list will be parsed again if a previous parsing
has failed.
It then parses again and fails again. This patch tries to overcome the
problem by storing the error code during the first parsing and
re-creating the stored error during later parsings.
Reland without test.
Reviewed By: steakhal
Patch By: OikawaKirie!
Differential Revision: https://reviews.llvm.org/D101763
During CTU, the *on-demand parsing* will read and parse the invocation
list to know how to compile the file being imported. However, it seems
that the invocation list will be parsed again if a previous parsing
has failed.
It then parses again and fails again. This patch tries to overcome the
problem by storing the error code during the first parsing and
re-creating the stored error during later parsings.
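The caching idea can be sketched as follows. `ToyInvocationList` is a hypothetical class, not the analyzer's real one, and the parser is passed in as a callback purely so the sketch is self-contained.

```cpp
#include <cassert>
#include <functional>
#include <system_error>
#include <utility>

// Toy model: remembers the outcome of the first parse, success or failure.
class ToyInvocationList {
public:
  explicit ToyInvocationList(std::function<std::error_code()> parser)
      : Parser(std::move(parser)) {
    assert(Parser && "a parser callback must be provided");
  }

  // Parses on the first call only; later calls replay the stored result.
  std::error_code load() {
    if (!Parsed) {
      ++ParseCount;
      Stored = Parser();
      Parsed = true;
    }
    return Stored;
  }

  int parseCount() const { return ParseCount; }

private:
  std::function<std::error_code()> Parser;
  std::error_code Stored;
  bool Parsed = false;
  int ParseCount = 0;
};
```

With a parser that always fails, repeated calls to `load()` return the same error while `parseCount()` stays at 1.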
Reviewed By: steakhal
Patch By: OikawaKirie!
Differential Revision: https://reviews.llvm.org/D101763
This patch handles one particular case of one-iteration loops for which SCEV
cannot straightforwardly prove BECount = 1. The idea of the optimization is to
symbolically execute conditional branches on the 1st iteration, moving in topological
order, and only visiting blocks that may be reached on the first iteration. If we find out
that we never reach the header via the latch, then the backedge can be broken.
Differential Revision: https://reviews.llvm.org/D102615
Reviewed By: reames
This patch introduces new operations on jitlink::Blocks: setMutableContent,
getMutableContent and getAlreadyMutableContent. The setMutableContent method
will set the block content data and size members and flag the content as
mutable. The getMutableContent method will return a mutable copy of the existing
content value, auto-allocating and populating a new mutable copy if the existing
content is marked immutable. The getAlreadyMutableContent method asserts that the
existing content is already mutable and returns it.
setMutableContent should be used when updating the block with totally new
content backed by mutable memory. It can be used to change the size of the
block. The argument value should *not* be shared with any other block.
getMutableContent should be used when clients want to modify the existing
content and are unsure whether it is mutable yet.
getAlreadyMutableContent should be used when clients want to modify the existing
content and know from context that it must already be mutable.
These operations reduce copy-modify-update boilerplate and unnecessary copies
introduced when clients couldn't be sure whether the existing content was
mutable or not.
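The intended behaviour of the three operations can be sketched with a toy class. `ToyBlock` is hypothetical and deliberately simplified: it owns its copy in a `std::vector`, whereas the real jitlink::Block signatures and allocation strategy are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a block; not the real jitlink::Block interface.
class ToyBlock {
public:
  ToyBlock(const char *data, std::size_t size) : Data(data), Size(size) {}

  // Replace the content with caller-provided mutable memory (may change size).
  void setMutableContent(char *data, std::size_t size) {
    Data = data;
    MutableData = data;
    Size = size;
    IsMutable = true;
  }

  // Return writable content, copying once if the current content is immutable.
  char *getMutableContent() {
    if (!IsMutable) {
      Copy.assign(Data, Data + Size);
      MutableData = Copy.data();
      Data = MutableData;
      IsMutable = true;
    }
    return MutableData;
  }

  // Return writable content, asserting that no copy is needed.
  char *getAlreadyMutableContent() {
    assert(IsMutable && "content is not mutable yet");
    return MutableData;
  }

  const char *getContent() const { return Data; }
  std::size_t getSize() const { return Size; }

private:
  const char *Data;
  char *MutableData = nullptr;
  std::size_t Size;
  bool IsMutable = false;
  std::vector<char> Copy;
};
```

In this model the first `getMutableContent()` call copies the original bytes, so writes through the returned pointer leave the original buffer untouched, and later calls return the same pointer without copying again.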
GCC allows each target to define a set of non-letter and non-digit
escaped characters for inline assembly that will be replaced by another
string (they call these "punctuation" characters; the existing "%%" and
"%{" -- replaced by '%' and '{' at the end -- can be seen as special
cases shared by all targets).
This patch implements this feature by adding a new hook in `TargetInfo`.
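How such a hook could slot into escape expansion is sketched below. Everything here is hypothetical: `EscapeHook` and `expandEscapes` are illustrative names, not the `TargetInfo` interface, and the `%~` escape used in the usage note is invented for the example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical hook type: returns true and fills `out` if the target
// defines a replacement for the escaped punctuation character `c`.
using EscapeHook = std::function<bool(char c, std::string &out)>;

std::string expandEscapes(const std::string &text, const EscapeHook &targetHook) {
  assert(targetHook && "a hook must be provided");
  std::string result;
  for (std::size_t i = 0; i < text.size(); ++i) {
    if (text[i] != '%' || i + 1 == text.size()) {
      result += text[i];
      continue;
    }
    char next = text[i + 1];
    std::string replacement;
    if (next == '%' || next == '{') {
      result += next;  // escapes shared by all targets
      ++i;
    } else if (!std::isalnum(static_cast<unsigned char>(next)) &&
               targetHook(next, replacement)) {
      result += replacement;  // target-specific punctuation escape
      ++i;
    } else {
      result += text[i];  // anything else is left untouched in this sketch
    }
  }
  return result;
}
```

With a hook that maps `~` to `tilde`, the input `a%%b%{c%~d%0` expands to `a%b{ctilded%0`. The letter and digit escapes never reach the hook.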
Differential Revision: https://reviews.llvm.org/D103036
This fixes both https://bugs.llvm.org/show_bug.cgi?id=50309 and https://bugs.llvm.org/show_bug.cgi?id=50310.
Previously, lambdas inside functions would mark their own bodies for later analysis when encountering a potentially unavailable decl, without taking into consideration that the entire lambda itself might be correctly guarded inside an @available check. The same applied to inner class member functions. Blocks happened to work as expected already, since Sema::getEnclosingFunction() skips through block scopes.
This patch instead simply and conservatively marks the entire outermost function scope for search, and removes some special-case logic that prevented DiagnoseUnguardedAvailabilityViolations from traversing down into lambdas and nested functions. This correctly accounts for arbitrarily nested lambdas, inner classes, and blocks that may be inside appropriate @available checks at any ancestor level. It also treats all potential availability violations inside functions consistently, without being overly sensitive to the current DeclContext, which previously caused issues where e.g. nested struct members were warned about twice.
DiagnoseUnguardedAvailabilityViolations now has more work to do in some cases, particularly in functions with many (possibly deeply) nested lambdas and classes, but the big-O is the same, and the simplicity of the approach and the fact that it fixes at least two bugs feels like a strong win.
Differential Revision: https://reviews.llvm.org/D102338
Given the following scenario:
```
// Cat.cpp
struct Animal { virtual void makeNoise() const = 0; };
struct Cat : Animal { void makeNoise() const override; };
extern "C" int puts(char const *);
void Cat::makeNoise() const { puts("Meow"); }
void doThingWithCat(Animal *a) { static_cast<Cat *>(a)->makeNoise(); }
// CatUser.cpp
struct Animal { virtual void makeNoise() const = 0; };
struct Cat : Animal { void makeNoise() const override; };
void doThingWithCat(Animal *a);
void useDoThingWithCat() {
Cat *d = new Cat;
doThingWithCat(d);
}
// cat.ver
{
global: _Z17useDoThingWithCatv;
local: *;
};
$ clang++ Cat.cpp CatUser.cpp -fpic -flto=thin -fwhole-program-vtables
-shared -O3 -fuse-ld=lld -Wl,--lto-whole-program-visibility
-Wl,--version-script,cat.ver
```
We cannot devirtualize `Cat::makeNoise`. The issue is complex:
Due to `-fsplit-lto-unit` and usage of type metadata, we place the Cat
vtable declaration into module 0 and the Cat vtable definition with type
metadata into module 1, causing duplicate entries (Undefined followed by
Defined) in the `lto::InputFile::symbols()` output.
In `BitcodeFile::parse`, after processing the `Undefined` then the
`Defined`, the final state is `Defined`.
In `BitcodeCompiler::add`, for the first symbol, `computeBinding`
returns `STB_LOCAL`, then we reset it to `Undefined` because it is
prevailing (`versionId` is `preserved`). For the second symbol, because
the state is now `Undefined`, `computeBinding` returns `STB_GLOBAL`,
causing `ExportDynamic` to be true and suppressing devirtualization.
In D77280, the `computeBinding` change used a stricter `isDefined()`
condition to make weak `Lazy` symbols work.
This patch relaxes the condition to the weaker `!isLazy()` to keep that
working while making the devirtualization work as well.
Differential Revision: https://reviews.llvm.org/D98686
Instrumentation callbacks are not made aware of LoopNest passes. From the loop pass manager, we can pass the outermost loop of the LoopNest to instrumentation in the case of LoopNest passes.
The current patch makes the change in two places in StandardInstrumentation.cpp. I will submit a proper patch where the OuterMostLoop is passed from the LoopPassManager to the callbacks. That way we will avoid making changes in multiple places in StandardInstrumentation.cpp.
A test case will also be submitted.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D102463
The Mach-O object file format is limited to 4GB because of its use of
32-bit offsets in the header. It is possible for dsymutil to (silently)
emit an invalid binary. Instead of having consumers deal with this, emit
an error.
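The kind of check involved can be sketched as below. The helper names and the error text are hypothetical and are not taken from dsymutil; the only fact used is that a 32-bit offset cannot represent values of 4GB or more.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical helper: true if a byte offset can be stored in a 32-bit field.
bool fitsIn32BitOffset(std::uint64_t offset) {
  return offset <= std::numeric_limits<std::uint32_t>::max();
}

// Hypothetical driver: reports an error instead of silently truncating.
// Returns an empty string when the size is representable.
std::string checkOutputSize(std::uint64_t sizeInBytes) {
  if (!fitsIn32BitOffset(sizeInBytes))
    return "error: output exceeds the 4GB Mach-O limit";
  return "";
}

// Small self-check of the boundary.
void checkBoundary() {
  assert(checkOutputSize(0xFFFFFFFFull).empty());
  assert(!checkOutputSize(0x100000000ull).empty());
}
```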
When trying to track down a vaddr-poisoning bug, I found that the
secondary cache isn't emptied on test teardown. We should probably do
that to make the tests hermetic. Otherwise, repeating the tests lots of
times using --gtest_repeat fails after the mmap vaddr space is
exhausted.
To repro:
$ ninja check-scudo_standalone # build
$ ./projects/compiler-rt/lib/scudo/standalone/tests/ScudoUnitTest-x86_64-Test \
--gtest_filter=ScudoSecondaryTest.*:-ScudoSecondaryTest.SecondaryCombinations \
--gtest_repeat=10000
Reviewed By: cryptoad
Differential Revision: https://reviews.llvm.org/D102874
We are currently explicitly setting the flag solely based on the value of `-verify`, which ends up ignoring the situation where the user explicitly disabled this option from the command line.
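One way to express the precedence is sketched below, under the assumption that an explicit command-line setting should win and that `-verify` only supplies the default. The `Setting` enum and `resolveFlag` are hypothetical names; the message does not say which flag is involved, so none is named here.

```cpp
#include <cassert>

// Hypothetical tri-state for a command-line option.
enum class Setting { Unset, Off, On };

// An explicit user choice wins; otherwise fall back to the value of -verify.
bool resolveFlag(Setting explicitValue, bool verifyEnabled) {
  if (explicitValue != Setting::Unset)
    return explicitValue == Setting::On;
  return verifyEnabled;
}

// Small self-check of the case described above.
void checkExplicitDisable() {
  assert(!resolveFlag(Setting::Off, /*verifyEnabled=*/true));
}
```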
Differential Revision: https://reviews.llvm.org/D102952