This is another step towards trying to re-apply D110170
by eliminating conflicting transforms that cause infinite loops.
a47c8e40c7 was a previous patch in this direction.
The diffs here are mostly cosmetic, but intentional:
1. The existing code that would handle this pattern in FoldShiftByConstant()
is limited to 'shl' only now. The formatting change to IsLeftShift shows
that we could move several transforms into visitShl() directly for
efficiency because they are not common shift transforms.
2. The tests are regenerated to show new instruction names to prove that
we are getting (almost) identical logic results.
3. The one case where we differ ("trunc_sandwich_small_shift1") shows that
we now use a narrow 'and' instruction. Previously, we relied on another
transform to do that, but it is limited to legal types. That seems to
be a legacy constraint from when IR analysis and codegen were less robust.
https://alive2.llvm.org/ce/z/JxyGA4
declare void @llvm.assume(i1)
define i8 @src(i32 %x, i32 %c0, i8 %c1) {
; The sum of the shifts must not overflow the source width.
%z1 = zext i8 %c1 to i32
%sum = add i32 %c0, %z1
%ov = icmp ult i32 %sum, 32
call void @llvm.assume(i1 %ov)
%sh1 = lshr i32 %x, %c0
%tr = trunc i32 %sh1 to i8
%sh2 = lshr i8 %tr, %c1
ret i8 %sh2
}
define i8 @tgt(i32 %x, i32 %c0, i8 %c1) {
%z1 = zext i8 %c1 to i32
%sum = add i32 %c0, %z1
%maskc = lshr i8 -1, %c1
%s = lshr i32 %x, %sum
%t = trunc i32 %s to i8
%a = and i8 %t, %maskc
ret i8 %a
}
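The equivalence in the Alive2 proof above can also be checked by brute force outside of LLVM. The following is only a sketch: the function names are hypothetical, and it assumes the narrow shift amount stays below 8, since a larger `lshr i8` amount would be poison in the IR.

```cpp
#include <cstdint>

// Mirrors @src: shift right, truncate to 8 bits, then shift right again.
uint8_t shiftTruncShift(uint32_t x, uint32_t c0, uint8_t c1) {
  uint8_t narrow = static_cast<uint8_t>(x >> c0);
  return static_cast<uint8_t>(narrow >> c1);
}

// Mirrors @tgt: one wide shift by the summed amount, truncate, then mask off
// the high bits that the narrow shift would have cleared.
uint8_t wideShiftTruncMask(uint32_t x, uint32_t c0, uint8_t c1) {
  uint32_t sum = c0 + c1;
  uint8_t mask = static_cast<uint8_t>(0xFFu >> c1);
  uint8_t narrow = static_cast<uint8_t>(x >> sum);
  return static_cast<uint8_t>(narrow & mask);
}

// Compares both forms for every shift pair with c1 < 8 and c0 + c1 < 32,
// which matches the llvm.assume in @src.
bool formsAgree(uint32_t x) {
  for (uint32_t c0 = 0; c0 < 32; ++c0)
    for (uint32_t c1 = 0; c1 < 8 && c0 + c1 < 32; ++c1)
      if (shiftTruncShift(x, c0, static_cast<uint8_t>(c1)) !=
          wideShiftTruncMask(x, c0, static_cast<uint8_t>(c1)))
        return false;
  return true;
}
```

Both forms extract bits [c0 + c1, c0 + 8) of x, which is why the narrow 'and' can stand in for the second shift.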
This test appears to assume that the toolchain uses the integrated
assembler by default, since the expected output in the CHECK lines is
compilation_database.o.
However, this test fails on AIX as AIX does not utilize the integrated assembler.
On AIX, the output instead is of the form /tmp/compilation_database-*.s.
Thus, this patch explicitly adds the -fintegrated-as option to match the
assumption that the integrated assembler is used by default.
Differential Revision: https://reviews.llvm.org/D110431
The trace tests crashed on darwin because of some thread
initialization issues (thread initialization is somewhat
different on darwin).
Instead of starting real threads, create a new ThreadState
in the main thread. This makes the tests more unit-testy
and hopefully won't crash on darwin (there is almost no
platform-specific code involved now).
This will also help with future trace tests that will need
more than 1 thread. Creating more than 1 real thread and
dispatching test actions across multiple threads in the
required deterministic order is painful.
Depends on D110539.
Reviewed By: melver
Differential Revision: https://reviews.llvm.org/D110546
Currently detection of races with TLS/stack initialization
is broken because we imitate the write before thread initialization,
so it's modelled with a wrong thread/epoch.
Fix that and add a test.
Reviewed By: melver
Differential Revision: https://reviews.llvm.org/D110538
KILL instructions are sometimes present and prevent hard
clauses from being formed.
Fix this by ignoring all meta instructions in clauses.
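As a rough illustration of the idea (not the actual pass), the sketch below treats a clause as a run of consecutive memory instructions and lets meta instructions pass through without ending the run. The `InstKind` enum and `longestClause` helper are hypothetical names used only here.

```cpp
#include <cstddef>
#include <vector>

// Simplified instruction categories; Meta stands for instructions such as
// KILL that emit no machine code.
enum class InstKind { Memory, Meta, Other };

// Returns the number of memory instructions in the longest clause, where
// meta instructions are skipped and any other instruction ends the clause.
std::size_t longestClause(const std::vector<InstKind> &insts) {
  std::size_t best = 0;
  std::size_t current = 0;
  for (InstKind kind : insts) {
    if (kind == InstKind::Meta)
      continue; // Transparent: does not break the current clause.
    if (kind == InstKind::Memory) {
      ++current;
      if (current > best)
        best = current;
    } else {
      current = 0; // A real non-memory instruction closes the clause.
    }
  }
  return best;
}
```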
Differential Revision: https://reviews.llvm.org/D106042
We used to put the canonical spelling of flags after alias processing
on that line. For clang-cl in particular, that meant that we put flags
on that line that the clang-cl driver doesn't even accept, and the
"Driver args:" line wasn't usable.
Differential Revision: https://reviews.llvm.org/D110458
Add a convenience method to add supplementary registers that takes care
of adding invalidate_regs to all (potentially) overlapping registers.
Differential Revision: https://reviews.llvm.org/D110023
Function specialization was crashing on poison values and constexpr values.
The problem is that these values are not added to the solver, so it crashes
when a lookup is performed for these values. This fixes that by not
specialising on these values. For poison that is obvious, but for constexpr
this is a change in behaviour. Thus, in one way this is a bit of a stopgap, but
specialising on constexpr values wasn't done very intentionally, and needs some
more work and tests if we want to support it.
As a follow-up, we need to check whether the solver should exit more gracefully and
return a "don't know", or whether it should really support these constexprs.
This should fix PR51600 (https://bugs.llvm.org/show_bug.cgi?id=51600).
Differential Revision: https://reviews.llvm.org/D110529
This change is to add some missing details to the help text and command
guide:
- Added a note to the command guide that --debug-macro also dumps
.debug_macinfo.
- Added a note to the command guide that --debug-frame and --eh_frame
are aliases, and in cases where both sections are present one command
outputs both.
- Changed the wording in the help output for --ignore-case and --regex to
more closely match the command guide.
The StringConvert API is no longer used anywhere but in debugserver.
Since debugserver does not use LLVM API, we cannot replace it with
llvm::to_integer() and llvm::to_float() there. Let's just move
the sources into debugserver.
Differential Revision: https://reviews.llvm.org/D110478
Refactor the XML converting attribute and text getters to use LLVM API.
While at it, remove some redundant error and missing XML support
handling, as the called base functions do that anyway. Add tests
for these methods.
Note that this patch changes the getter behavior to be IMHO more
correct. In particular:
- negative and overflowing integers are now reported as failures to
convert, rather than being wrapped over or capped
- digits followed by text are now reported as failures to convert
to double, rather than their numeric part being converted
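The stricter behavior can be sketched without LLVM using the C standard library. The helper names below are hypothetical, and rejecting trailing text for integers as well as doubles is an assumption made for this sketch rather than something the patch states.

```cpp
#include <cctype>
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <string>

// Converts text to an unsigned 64-bit integer. Returns false for empty input,
// a leading sign or whitespace, trailing text, or an out-of-range value.
bool strictToUInt64(const std::string &text, uint64_t &out) {
  if (text.empty() || text[0] < '0' || text[0] > '9')
    return false; // strtoull would silently accept "-1" and wrap it.
  errno = 0;
  char *end = nullptr;
  unsigned long long value = std::strtoull(text.c_str(), &end, 10);
  if (errno == ERANGE || value > UINT64_MAX)
    return false; // Overflow is a failure, not a capped result.
  if (end != text.c_str() + text.size())
    return false; // Digits followed by text.
  out = static_cast<uint64_t>(value);
  return true;
}

// Converts text to double. Returns false unless the whole string is consumed.
bool strictToDouble(const std::string &text, double &out) {
  if (text.empty() || std::isspace(static_cast<unsigned char>(text[0])))
    return false;
  char *end = nullptr;
  double value = std::strtod(text.c_str(), &end);
  if (end != text.c_str() + text.size())
    return false; // E.g. "1.5abc": only the numeric prefix was parsed.
  out = value;
  return true;
}
```

Returning a bool with an out-parameter keeps the failure case explicit, so a caller cannot mistake a wrapped or partially parsed value for a valid one.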
Differential Revision: https://reviews.llvm.org/D110410
Use the in-project clang, llvm-link and opt if available, unless
CMake cache variables specify a different compiler. This applies
D101265 to the new DeviceRTL's CMakeLists.txt, which was copied before
D101265 was applied.
Fixes the openmp-offloading-cuda-runtime builder which was failing
since D110006.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D110251
Let the calling pass or pattern replace the uses of the original root operation. Internally, the tileAndFuse still replaces uses and updates operands but only of newly created operations.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D110169
When rebase_exec=true in DidAttach(), all modules are loaded
before the rendezvous breakpoint is set, which means the
LoadInterpreterModule() method is not called and m_interpreter_module
is not initialized.
This causes the very first rendezvous breakpoint hit with
m_initial_modules_added=false to accidentally unload the
module_sp that corresponds to the dynamic loader.
This bug (introduced in D92187) was causing the rendezvous
mechanism to not work in Android 28. The mechanism works
fine on older/newer versions of Android.
Test: Verified rendezvous on Android 28 and 29
Test: Added dlopen test
Reviewed By: labath
Differential Revision: https://reviews.llvm.org/D109797
The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3
For load we have:
https://godbolt.org/z/q6GbK89br - for intels `Block RThroughput: =18.0`; for ryzens, `Block RThroughput: <=7.0`
So pick cost of `18`.
For store we have:
https://godbolt.org/z/Yzfoo5TnW - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0`
So pick cost of `8`.
I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D110507
The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3
For load we have:
https://godbolt.org/z/Y1E7qnjz8 - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=3.5`
So pick cost of `9`.
For store we have:
https://godbolt.org/z/Y1E7qnjz8 - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.
I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D110506
The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3
For load we have:
https://godbolt.org/z/e5YE99a4P - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =2.0`
So pick cost of `6`.
For store we have:
https://godbolt.org/z/3vM4KsE1n - for intels `Block RThroughput: =3.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `3`.
I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D110505
The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3
For load we have:
https://godbolt.org/z/1j3nf3dro - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0`
So pick cost of `2`.
For store we have:
https://godbolt.org/z/4n1zvP37j - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: <=0.5`
So pick cost of `1`.
I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D110504
There are 2 reasons to do this:
1. We place hot data in the first cache line of ThreadState.
This assumed that it's cache-line-aligned, but we never actually
enforced it (or it was lost at some point).
2. The new vector clock uses vector instructions and requires
data alignment. Later the new vector clock will be embedded in
ThreadState, then ensuring vector clock alignment will be
impossible w/o ThreadState alignment.
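A minimal sketch of enforcing such alignment in C++ follows. It assumes a 64-byte cache line, and the struct and its fields are hypothetical stand-ins, not the real ThreadState layout.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size; the real runtime may use a different constant.
constexpr std::size_t kCacheLineSize = 64;

// alignas makes the compiler place every object of this type on a cache line
// boundary, so the leading hot fields always share the first line.
struct alignas(kCacheLineSize) ThreadStateSketch {
  std::uint64_t fastState; // Hypothetical hot field.
  std::uint64_t fastEpoch; // Hypothetical hot field.
  char coldData[256];      // Placeholder for the rarely touched remainder.
};

static_assert(alignof(ThreadStateSketch) == kCacheLineSize,
              "type must be cache-line-aligned");
static_assert(sizeof(ThreadStateSketch) % kCacheLineSize == 0,
              "size is padded to a whole number of cache lines");

// Runtime check that a given object really starts on a cache line boundary.
bool startsOnCacheLine(const ThreadStateSketch &state) {
  return reinterpret_cast<std::uintptr_t>(&state) % kCacheLineSize == 0;
}
```

Any member embedded later that itself requires alignment up to the struct's own alignment is then guaranteed to be placed correctly.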
Depends on D110519.
Reviewed By: melver
Differential Revision: https://reviews.llvm.org/D110520
Currently the shadow stack is located in the trace memory mapping.
The new tsan runtime will remove the trace memory mapping.
Move the shadow stack into ThreadState as a preparation step.
Reviewed By: melver
Differential Revision: https://reviews.llvm.org/D110519
This patch adds a generic DAGCombine for vector-predicated (VP) nodes.
Those for which we can determine that no vector element is active can be
replaced by either undef or, for reductions, the start value.
This is tested rather trivially at the IR level, where it's possible
that we want to teach instcombine to perform this optimization.
However, we can also see the zero-evl case arise during SelectionDAG
legalization, when wide VP operations can be split into two and the
upper operation emerges as trivially false.
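To make the "no active element" case concrete, here is a scalar reference model of a vector-predicated add reduction. It is only a sketch with hypothetical names: a lane counts as active when its index is below the explicit vector length (EVL) and its mask bit is set.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if at least one lane is below evl and enabled in the mask.
bool anyLaneActive(const std::vector<bool> &mask, std::size_t evl) {
  for (std::size_t i = 0; i < mask.size() && i < evl; ++i)
    if (mask[i])
      return true;
  return false;
}

// Reference semantics of a predicated add reduction: only active lanes
// contribute, so with no active lane the result is exactly the start value.
std::int32_t vpReduceAdd(std::int32_t start,
                         const std::vector<std::int32_t> &values,
                         const std::vector<bool> &mask, std::size_t evl) {
  std::int32_t acc = start;
  for (std::size_t i = 0; i < values.size() && i < mask.size() && i < evl; ++i)
    if (mask[i])
      acc += values[i];
  return acc;
}
```

With an EVL of zero or an all-false mask the loop body never runs, which is the situation in which the combine can fold the reduction to its start value.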
It's possible that we could perform this optimization "proactively"
(both on legal vectors and before splitting) and reduce the width of an
operation and insert it into a larger undef vector:
```
v8i32 vp_add x, y, mask, 4
->
v8i32 insert_subvector (v8i32 undef), (v4i32 vp_add xsub, ysub, mask, 4), i32 0
```
This is somewhat analogous to similar vector narrowing/widening
optimizations, but it's unclear at this point whether doing this
for VP ops is beneficial for any/all targets.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D109148
We align non-fallthrough branches under Cortex-M at O3 to lead to fewer
instruction fetches. This improves that for the block after a LE or
LETP. These blocks will still have terminating branches until the
LowOverheadLoops pass is run (as they are not handled by analyzeBranch,
the branch is not removed until later), so canFallThrough will return
false. These extra branches will eventually be removed, leaving a
fallthrough, so treat them as such and don't add unnecessary alignments.
Differential Revision: https://reviews.llvm.org/D107810
This revision adds a
```
FlatAffineValueConstraints(ValueRange ivs, ValueRange lbs, ValueRange ubs)
```
method and uses it in hoist padding.
Differential Revision: https://reviews.llvm.org/D110427
During a backtrace the `.cfi_undefined` for a float register causes an assert in libunwind.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D110144
This revision extracts padding hoisting into a new file and cleans it up in anticipation of future improvements and extensions.
Differential Revision: https://reviews.llvm.org/D110414
The intent of this code turns out to be superfluous, as we can handle this with shuffle combining, and it has a critical flaw in that it doesn't check for dependencies.
Fixes PR51974
When splitting with linalg.copy, we cannot write into the destination alloc directly. Instead, write into a subview of the alloc.
Differential Revision: https://reviews.llvm.org/D110512
Due to the way detecting the hard float ABI is currently
handled, clang fails to find the per target dir.
I am working to fix this but in the meantime disable it by
default on Arm Linux.
This particular case was creating a `VMSET_VL` using the old
fixed-length type in order to pass a mask to other custom nodes
operating on the scalable container type. This kind of thing wasn't
caught for us; I only noticed when experimenting with odd-length
vectors, where it was trying to generate an invalid `v3i1` MVT.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D110420
Add a llvm::Split() implementation that can be used via range-for loop,
e.g.:
  for (StringRef x : llvm::Split("foo,bar,baz", ','))
    ...
The implementation uses an additional SplittingIterator class that
uses StringRef::split() internally.
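The shape of such a range can be sketched without LLVM. The classes below are hypothetical stand-ins built on std::string instead of StringRef, and their handling of empty pieces is a choice made for this sketch, not a statement about llvm::Split().

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Forward iterator that yields the pieces of a string between separators.
class SplitIterator {
public:
  SplitIterator(const std::string &src, char sep, std::size_t begin)
      : src_(&src), sep_(sep), begin_(begin), end_(begin) {
    locateEnd();
  }
  std::string operator*() const { return src_->substr(begin_, end_ - begin_); }
  SplitIterator &operator++() {
    // Step past the separator, or mark exhaustion if the piece ended the input.
    begin_ = (end_ == src_->size()) ? std::string::npos : end_ + 1;
    locateEnd();
    return *this;
  }
  bool operator!=(const SplitIterator &other) const {
    return begin_ != other.begin_;
  }

private:
  // Finds the end of the piece that starts at begin_.
  void locateEnd() {
    if (begin_ == std::string::npos) {
      end_ = begin_;
      return;
    }
    std::size_t pos = src_->find(sep_, begin_);
    end_ = (pos == std::string::npos) ? src_->size() : pos;
  }
  const std::string *src_;
  char sep_;
  std::size_t begin_; // Start of the current piece, npos once exhausted.
  std::size_t end_;   // One past the last character of the current piece.
};

// Range wrapper so the iterator can be used in a range-for loop.
class SplitRange {
public:
  SplitRange(std::string src, char sep) : src_(std::move(src)), sep_(sep) {}
  SplitIterator begin() const { return SplitIterator(src_, sep_, 0); }
  SplitIterator end() const {
    return SplitIterator(src_, sep_, std::string::npos);
  }

private:
  std::string src_;
  char sep_;
};

// Convenience helper that collects all pieces, mainly for inspection.
std::vector<std::string> splitToVector(const std::string &src, char sep) {
  std::vector<std::string> pieces;
  for (const std::string &piece : SplitRange(src, sep))
    pieces.push_back(piece);
  return pieces;
}
```

The range owns a copy of the input here so that iterating over a temporary is safe; a StringRef-based version would instead rely on the caller keeping the underlying buffer alive.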
Differential Revision: https://reviews.llvm.org/D110496