If we are only extracting vector elements via EXTRACT_VECTOR_ELT(s) we may be able to use SimplifyDemandedVectorElts to avoid unnecessary vector ops.
Differential Revision: https://reviews.llvm.org/D49262
llvm-svn: 337258
The SPLICE instruction splices two vectors into one vector using a
predicate. It copies the active elements from the first vector, and
then fills the remaining elements with the low-numbered elements from
the second vector.
The instruction has the following form, e.g.
splice z0.b, p0, z0.b, z1.b
for 8-bit elements. It also supports 16, 32 and
64-bit elements.
llvm-svn: 337253
This patch adds an instruction that allows extracting
a vector from a pair of vectors, given an immediate index
that describes the element position to extract from.
The instruction has the following assembly:
ext z0.b, z0.b, z1.b, #imm
where #imm is an immediate between 0 and 255.
llvm-svn: 337251
The ta instruction will always trap, regardless of the value
of the integer condition codes. TRAPri is marked as using icc,
so we cannot use a pattern for TRAPri to implement ta 1, as
verify-machineinstrs can complain that icc is not defined.
Instead we implement ta 1 the same way as ta 5.
llvm-svn: 337236
This amounts to pretty ridiculous number of patterns. Ideally we'd canonicalize the X86ISD::VRNDSCALE earlier to reuse those patterns. I briefly looked into doing that, but some strict FP operations could still get converted to rint and nearbyint during isel. It's probably still worthwhile to look into. This patch is meant as a starting point to work from.
llvm-svn: 337234
This allows us to use 231 form to fold an insertelement on the add input to the fma. There is technically no software intrinsic that can use this until AVX512F, but it can be manually built up from other intrinsics.
llvm-svn: 337223
This support was partial and temporary. Now that we have
wasm object file support its no longer needed.
Differential Revision: https://reviews.llvm.org/D48744
llvm-svn: 337222
As discussed here:
http://lists.llvm.org/pipermail/llvm-dev/2018-May/123292.htmlhttp://lists.llvm.org/pipermail/llvm-dev/2018-July/124400.html
We want to add rotate intrinsics because the IR expansion of that pattern is 4+ instructions,
and we can lose pieces of the pattern before it gets to the backend. Generalizing the operation
by allowing 2 different input values (plus the 3rd shift/rotate amount) gives us a "funnel shift"
operation which may also be a single hardware instruction.
Initially, I thought we needed to define new DAG nodes for these ops, and I spent time working
on that (much larger patch), but then I concluded that we don't need it. At least as a first
step, we have all of the backend support necessary to match these ops...because it was required.
And shepherding these through the IR optimizer is the primary concern, so the IR intrinsics are
likely all that we'll ever need.
There was also a question about converting the intrinsics to the existing ROTL/ROTR DAG nodes
(along with improving the oversized shift documentation). Again, I don't think that's strictly
necessary (as the test results here prove). That can be an efficiency improvement as a small
follow-up patch.
So all we're left with is documentation, definition of the IR intrinsics, and DAG builder support.
Differential Revision: https://reviews.llvm.org/D49242
llvm-svn: 337221
We are using i8 for these tests, and shifting by 4,
which is exactly the half of i8.
But as it is seen from the proofs https://rise4fun.com/Alive/mgu
KeptBits = bitwidth(%x) - MaskedBits,
so with using shifts by 4, we are not really testing that
we actually properly handle the other cases with shifts not by half...
llvm-svn: 337208
This patch is an update of an older patch that never landed
(see here: https://reviews.llvm.org/D42516)
Recently various users have run into this issue and it just 100%
has to be solved at this point. The main difference in this patch
is that I use gunzip instead of unzip which should hopefully allow
tests to pass. Please review this as if it is a new patch however.
I found some issues along the way and made some minor modifications.
The binary used in this patch for testing (a zip file to make it small)
can be found here:
https://drive.google.com/file/d/1UjsnTO9edLttZibbr-2T1bJl92KEQFAO/view?usp=sharing
Differential Revision: https://reviews.llvm.org/D49206
llvm-svn: 337204
Summary:
[[ https://bugs.llvm.org/show_bug.cgi?id=38149 | PR38149 ]]
As discussed in https://reviews.llvm.org/D49179#1158957 and later,
the IR for 'check for [no] signed truncation' pattern can be improved:
https://rise4fun.com/Alive/gBf
^ that pattern will be produced by Implicit Integer Truncation sanitizer,
https://reviews.llvm.org/D48958https://bugs.llvm.org/show_bug.cgi?id=21530
in signed case, therefore it is probably a good idea to improve it.
Proofs for this transform: https://rise4fun.com/Alive/mgu
This transform is surprisingly frustrating.
This does not deal with non-splat shift amounts, or with undef shift amounts.
I've outlined what i think the solution should be:
```
// Potential handling of non-splats: for each element:
// * if both are undef, replace with constant 0.
// Because (1<<0) is OK and is 1, and ((1<<0)>>1) is also OK and is 0.
// * if both are not undef, and are different, bailout.
// * else, only one is undef, then pick the non-undef one.
```
The DAGCombine will reverse this transform, see
https://reviews.llvm.org/D49266
Reviewers: spatel, craig.topper
Reviewed By: spatel
Subscribers: JDevlieghere, rkruppe, llvm-commits
Differential Revision: https://reviews.llvm.org/D49320
llvm-svn: 337190
trivially rematerializable.
We run into a case where machineLICM hoists a large number of live ranges
outside of a big loop because it thinks those live ranges are trivially
rematerializable. In regalloc, global splitting is tried out first for those
live ranges before they are spilled and rematerialized. Because the global
splitting algorithm is quadratic, increasing a lot of global splitting
candidates causes huge compile time increase (50s to 1400s on my local
machine when compiling a module).
However, we think for live ranges which are very large and are trivially
rematerialiable, it is better to just skip global splitting so as to save
compile time with little chance of sacrificing performance. We uses the
segment size of live range to indirectly evaluate whether the global
splitting of the live range can introduce high cost, and use an option
as a knob to adjust the size limit threshold.
Differential Revision: https://reviews.llvm.org/D49353
llvm-svn: 337186
This reverts commit r337081, therefore restoring r337050 (and fix in
r337059), with test fix for bot failure described after the original
description below.
In order to always import the same copy of a linkonce function,
even when encountering it with different thresholds (a higher one then a
lower one), keep track of the summary we decided to import.
This ensures that the backend only gets a single definition to import
for each GUID, so that it doesn't need to choose one.
Move the largest threshold the GUID was considered for import into the
current module out of the ImportMap (which is part of a larger map
maintained across the whole index), and into a new map just maintained
for the current module we are computing imports for. This saves some
memory since we no longer have the thresholds maintained across the
whole index (and throughout the in-process backends when doing a normal
non-distributed ThinLTO build), at the cost of some additional
information being maintained for each invocation of ComputeImportForModule
(the selected summary pointer for each import).
There is an additional map lookup for each callee being considered for
importing, however, this was able to subsume a map lookup in the
Worklist iteration that invokes computeImportForFunction. We also are
able to avoid calling selectCallee if we already failed to import at the
same or higher threshold.
I compared the run time and peak memory for the SPEC2006 471.omnetpp
benchmark (running in-process ThinLTO backends), as well as for a large
internal benchmark with a distributed ThinLTO build (so just looking at
the thin link time/memory). Across a number of runs with and without
this change there was no significant change in the time and memory.
(I tried a few other variations of the change but they also didn't
improve time or peak memory).
The new commit removes a test that no longer makes sense
(Transforms/FunctionImport/hotness_based_import2.ll), as exposed by the
reverse-iteration bot. The test depends on the order of processing the
summary call edges, and actually depended on the old problematic
behavior of selecting more than one summary for a given GUID when
encountered with different thresholds. There was no guarantee even
before that we would eventually pick the linkonce copy with the hottest
call edges, it just happened to work with the test and the old code, and
there was no guarantee that we would end up importing the selected
version of the copy that had the hottest call edges (since the backend
would effectively import only one of the selected copies).
Reviewers: davidxl
Subscribers: mehdi_amini, inglorion, llvm-commits
Differential Revision: https://reviews.llvm.org/D48670
llvm-svn: 337184
As suggested in the review for r337007, this makes cfi-verify abort on unsupported targets instead of producing incorrect results. It also updates the design document to reflect this.
Differential Revision: https://reviews.llvm.org/D49304
llvm-svn: 337181
invariant instructions to be both more correct and much more powerful.
While testing, I continued to find issues with sinking post-load
hardening. Unfortunately, it was amazingly hard to create any useful
tests of this because we were mostly sinking across copies and other
loading instructions. The fact that we couldn't sink past normal
arithmetic was really a big oversight.
So first, I've ported roughly the same set of instructions from the data
invariant loads to also have their non-loading varieties understood to
be data invariant. I've also added a few instructions that came up so
often it again made testing complicated: inc, dec, and lea.
With this, I was able to shake out a few nasty bugs in the validity
checking. We need to restrict to hardening single-def instructions with
defined registers that match a particular form: GPRs that don't have
a NOREX constraint directly attached to their register class.
The (tiny!) test case included catches all of the issues I was seeing
(once we can sink the hardening at all) except for the NOREX issue. The
only test I have there is horrible. It is large, inexplicable, and
doesn't even produce an error unless you try to emit encodings. I can
keep looking for a way to test it, but I'm out of ideas really.
Thanks to Ben for giving me at least a sanity-check review. I'll follow
up with Craig to go over this more thoroughly post-commit, but without
it SLH crashes everywhere so landing it for now.
Differential Revision: https://reviews.llvm.org/D49378
llvm-svn: 337177
Add code for selection of G_LOAD, G_STORE, G_GEP, G_FRAMEINDEX and
G_CONSTANT. Support loads and stores of i32 values.
Patch by Petar Avramovic.
Differential Revision: https://reviews.llvm.org/D48957
llvm-svn: 337168
Summary:
[[ https://bugs.llvm.org/show_bug.cgi?id=38149 | PR38149 ]]
As discussed in https://reviews.llvm.org/D49179#1158957 and later,
the IR for 'check for [no] signed truncation' pattern can be improved:
https://rise4fun.com/Alive/gBf
^ that pattern will be produced by Implicit Integer Truncation sanitizer,
https://reviews.llvm.org/D48958https://bugs.llvm.org/show_bug.cgi?id=21530
in signed case, therefore it is probably a good idea to improve it.
But the IR-optimal patter does not lower efficiently, so we want to undo it..
This handles the simple pattern.
There is a second pattern with predicate and constants inverted.
NOTE: we do not check uses here. we always do the transform.
Reviewers: spatel, craig.topper, RKSimon, javed.absar
Reviewed By: spatel
Subscribers: kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D49266
llvm-svn: 337166
Summary: These are the names used in libgcc.
Reviewers: venkatra, jyknight, ekedaigle
Reviewed By: jyknight
Subscribers: joerg, fedor.sergeev, jrtc27, llvm-commits
Differential Revision: https://reviews.llvm.org/D48915
llvm-svn: 337164
Summary: Software trap number one is the trap used for breakpoints
in the Sparc ABI.
Reviewers: jyknight, venkatra
Reviewed By: jyknight
Subscribers: fedor.sergeev, jrtc27, llvm-commits
Differential Revision: https://reviews.llvm.org/D48637
llvm-svn: 337163
Summary:
If the high part of the load is not used the offset to the next element
will not be set correctly.
For example, on Sparc V8, the following code will read val2 from offset 4
instead of 8.
```
int val = __builtin_va_arg(va, long long);
int val2 = __builtin_va_arg(va, int);
```
Reviewers: jyknight
Reviewed By: jyknight
Subscribers: fedor.sergeev, jrtc27, llvm-commits
Differential Revision: https://reviews.llvm.org/D48595
llvm-svn: 337161
Found cases that hit the assert I added. This patch factors the validity
checking into a nice helper routine and calls it when deciding to harden
post-load, and asserts it when doing so later.
I've added tests for the various ways of loading a floating point type,
as well as loading all vector permutations. Even though many of these go
to identical instructions, it seems good to somewhat comprehensively
test them.
I'm confident there will be more fixes needed here, I'll try to add
tests each time as I get this predicate adjusted.
llvm-svn: 337160
Memory legalizer, waitcnt, and shrink passes can perturb the instructions,
which means that the post-RA hazard recognizer pass should run after them.
Otherwise, one of those passes may invalidate the work done by the hazard
recognizer. Note that this has adverse side-effect that any consecutive
S_NOP 0's, emitted by the hazard recognizer, will not be shrunk into a
single S_NOP <N>. This should be addressed in a follow-on patch.
Differential Revision: https://reviews.llvm.org/D49288
llvm-svn: 337154
Bug fix for PR37808. The regression test is a reduced version of the
original reproducer attached to the bug report. As stated in the report,
the problem was that InsertedPHIs was keeping dangling pointers to
deleted Memory-Phis. MemoryPhis are created eagerly and sometimes get
zapped shortly afterwards. I've used WeakVH instead of an expensive
removal operation from the active workset.
Differential Revision: https://reviews.llvm.org/D48372
llvm-svn: 337149
This unfortunately requires a bunch of bitcasts to be added added to SUBREG_TO_REG, COPY_TO_REGCLASS, and instructions in output patterns. Otherwise tablegen seems to default to picking f128 and then we fail when something tries to get the register class for f128 which isn't always valid.
The test changes are because we were previously mixing fr128 and vr128 due to contrainRegClass finding FR128 first and passes like live range shrinking weren't handling that well.
llvm-svn: 337147
indices used by AVX2 and AVX-512 gather instructions.
The index vector is hardened by broadcasting the predicate state
into a vector register and then or-ing. We don't even have to worry
about EFLAGS here.
I've added a test for all of the gather intrinsics to make sure that we
don't miss one. A particularly interesting creation is the gather
prefetch, which needs to be marked as potentially "loading" to get the
correct behavior. It's a memory access in many ways, and is actually
relevant for SLH. Based on discussion with Craig in review, I've moved
it to be `mayLoad` and `mayStore` rather than generic side effects. This
matches how we model other prefetch instructions.
Many thanks to Craig for the review here.
Differential Revision: https://reviews.llvm.org/D49336
llvm-svn: 337144
AVX512F only has integer domain logic instructions. AVX512DQ added FP domain logic instructions.
Execution domain fixing runs before EVEX->VEX. So if we have AVX512F and not AVX512DQ we fail to do execution domain switching of the logic operations. This leads to mismatches in execution domain and more test differences.
This patch adds custom domain fixing that switches EVEX integer logic operations to VEX fp logic operations if XMM16-31 are not used.
llvm-svn: 337137
128-bit ops implicitly zero the upper bits. This should address the comment about domain crossing for the integer version without AVX2 since we can use a 128-bit VBLENDW without AVX2.
The only bad thing I see here is that we failed to reuse an vxorps in some of the tests, but I think that's already known issue.
llvm-svn: 337134
This is almost the same as an existing IR canonicalization in instcombine,
so I'm assuming this is a good early generic DAG combine too.
The motivation comes from reduced bit-hacking for select-of-constants in IR
after rL331486. We want to restore that functionality in the DAG as noted in
the commit comments for that change and the llvm-dev discussion here:
http://lists.llvm.org/pipermail/llvm-dev/2018-July/124433.html
The PPC and AArch tests show that those targets are already doing something
similar. x86 will be neutral in the minimal case and generally better when
this pattern is extended with other ops as shown in the signbit-shift.ll tests.
Note the asymmetry: we don't include the (extend (ifneg X)) transform because
it already exists in SimplifySelectCC(), and that is verified in the later
unchanged tests in the signbit-shift.ll files. Without the 'not' op, the
general transform to use a shift is always a win because that's a single
instruction.
Alive proofs:
https://rise4fun.com/Alive/ysli
Name: if pos, get -1
%c = icmp sgt i16 %x, -1
%r = sext i1 %c to i16
=>
%n = xor i16 %x, -1
%r = ashr i16 %n, 15
Name: if pos, get 1
%c = icmp sgt i16 %x, -1
%r = zext i1 %c to i16
=>
%n = xor i16 %x, -1
%r = lshr i16 %n, 15
Differential Revision: https://reviews.llvm.org/D48970
llvm-svn: 337130
This was improved with rL337127, but I missed the failure in this test.
I'm not sure what the expected result will be, so I've generalized it
and added a FIXME comment.
llvm-svn: 337128
This fold is repeated/misplaced in instcombine, but I'm
not sure if it's safe to remove that yet because some
other folds appear to be asserting that the transform
has occurred within instcombine itself.
This isn't the best fix for PR37776, but it probably
hides the bug with the given code example:
https://bugs.llvm.org/show_bug.cgi?id=37776
We have another test to demonstrate the more general bug.
llvm-svn: 337127
This isn't the best fix for PR37776, but it probably
hides the bug with the given code example:
https://bugs.llvm.org/show_bug.cgi?id=37776
We have another test to demonstrate the more general
bug.
llvm-svn: 337126
registers.
The goal of this patch is to improve the throughput analysis in llvm-mca for the
case where instructions perform partial register writes.
On x86, partial register writes are quite difficult to model, mainly because
different processors tend to implement different register merging schemes in
hardware.
When the code contains partial register writes, the IPC (instructions per
cycles) estimated by llvm-mca tends to diverge quite significantly from the
observed IPC (using perf).
Modern AMD processors (at least, from Bulldozer onwards) don't rename partial
registers. Quoting Agner Fog's microarchitecture.pdf:
" The processor always keeps the different parts of an integer register together.
For example, AL and AH are not treated as independent by the out-of-order
execution mechanism. An instruction that writes to part of a register will
therefore have a false dependence on any previous write to the same register or
any part of it."
This patch is a first important step towards improving the analysis of partial
register updates. It changes the semantic of RegisterFile descriptors in
tablegen, and teaches llvm-mca how to identify false dependences in the presence
of partial register writes (for more details: see the new code comments in
include/Target/TargetSchedule.h - class RegisterFile).
This patch doesn't address the case where a write to a part of a register is
followed by a read from the whole register. On Intel chips, high8 registers
(AH/BH/CH/DH)) can be stored in separate physical registers. However, a later
(dirty) read of the full register (example: AX/EAX) triggers a merge uOp, which
adds extra latency (and potentially affects the pipe usage).
This is a very interesting article on the subject with a very informative answer
from Peter Cordes:
https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to
In future, the definition of RegisterFile can be extended with extra information
that may be used to identify delays caused by merge opcodes triggered by a dirty
read of a partial write.
Differential Revision: https://reviews.llvm.org/D49196
llvm-svn: 337123
The bot has a /b directory, so /? matches against that and gets expanded to it.
(Thanks to Hans's r187366, which solved the same problem for clang-cl a while
ago and which saved me much head scratching.)
llvm-svn: 337092
The MachineOutliner was doing an std::for_each from the call (inserted
before the outlined sequence) to the iterator at the end of the
sequence.
std::for_each needs the iterator past the end, so the last instruction
was not taken into account when propagating the liveness information.
This fixes the machine verifier issue in machine-outliner-disubprogram.ll.
Differential Revision: https://reviews.llvm.org/D49295
llvm-svn: 337090
no conditions.
This is only valid to do if we're hardening calls and rets with LFENCE
which results in an LFENCE guarding the entire entry block for us.
llvm-svn: 337089
The code tried to find the immediate by using getNumOperands() on the MachineInstr, but there might be implicit-defs after the immediate that get counted.
Instead use getNumOperands() from the instruction description which will only count the operands that are defined in the td file.
llvm-svn: 337088
AVX512 doesn't have an immediate controlled blend instruction. But blend throughput is still better than movss/sd on SKX.
This commit changes AVX512 to use the AVX blend instructions instead of MOVSS/MOVSD. This constrains the register allocation since it won't be able to use XMM16-31, but hopefully the increased throughput and reduced port 5 pressure makes up for that.
llvm-svn: 337083
This reverts commit r337021.
WARNING: MemorySanitizer: use-of-uninitialized-value
#0 0x1415cd65 in void write_signed<long>(llvm::raw_ostream&, long, unsigned long, llvm::IntegerStyle) /code/llvm-project/llvm/lib/Support/NativeFormatting.cpp:95:7
#1 0x1415c900 in llvm::write_integer(llvm::raw_ostream&, long, unsigned long, llvm::IntegerStyle) /code/llvm-project/llvm/lib/Support/NativeFormatting.cpp:121:3
#2 0x1472357f in llvm::raw_ostream::operator<<(long) /code/llvm-project/llvm/lib/Support/raw_ostream.cpp:117:3
#3 0x13bb9d4 in llvm::raw_ostream::operator<<(int) /code/llvm-project/llvm/include/llvm/Support/raw_ostream.h:210:18
#4 0x3c2bc18 in void printField<unsigned int, &(amd_kernel_code_s::amd_kernel_code_version_major)>(llvm::StringRef, amd_kernel_code_s const&, llvm::raw_ostream&) /code/llvm-project/llvm/lib/Target/AMDGPU/Utils/AMDKernelCodeTUtils.cpp:78:23
#5 0x3c250ba in llvm::printAmdKernelCodeField(amd_kernel_code_s const&, int, llvm::raw_ostream&) /code/llvm-project/llvm/lib/Target/AMDGPU/Utils/AMDKernelCodeTUtils.cpp:104:5
#6 0x3c27ca3 in llvm::dumpAmdKernelCode(amd_kernel_code_s const*, llvm::raw_ostream&, char const*) /code/llvm-project/llvm/lib/Target/AMDGPU/Utils/AMDKernelCodeTUtils.cpp:113:5
#7 0x3a46e6c in llvm::AMDGPUTargetAsmStreamer::EmitAMDKernelCodeT(amd_kernel_code_s const&) /code/llvm-project/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUTargetStreamer.cpp:161:3
#8 0xd371e4 in llvm::AMDGPUAsmPrinter::EmitFunctionBodyStart() /code/llvm-project/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp:204:26
[...]
Uninitialized value was created by an allocation of 'KernelCode' in the stack frame of function '_ZN4llvm16AMDGPUAsmPrinter21EmitFunctionBodyStartEv'
#0 0xd36650 in llvm::AMDGPUAsmPrinter::EmitFunctionBodyStart() /code/llvm-project/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp:192
llvm-svn: 337079
If an HVX vector register is to be coalesced into a vector pair, make
sure that the vector pair will not have a function call in its live range,
unless it already had one. All HVX vector registers are volatile, so
any vector register live across a function call will have to be spilled.
If a vector needs to be spilled, and it's coalesced into a vector pair
then the whole pair will need to be spilled (even if only a part of it is
live), taking extra stack space.
llvm-svn: 337073
Summary:
By looking at the callers of getUse(), we can see that even though
IVUsers may offer uses, but they may not be interesting to
LSR. It's possible that none of them is interesting.
Reviewers: sanjoy
Subscribers: jlebar, hiraditya, bixia, llvm-commits
Differential Revision: https://reviews.llvm.org/D49049
llvm-svn: 337072
When we're linking an alias which will be defined later, we neeed to
build a GlobalAlias, or else we'll crash later in
IRLinker::linkGlobalValueBody.
clang sometimes constructs aliases like this for C++ destructors.
Differential Revision: https://reviews.llvm.org/D49316
llvm-svn: 337053
In order to always import the same copy of a linkonce function,
even when encountering it with different thresholds (a higher one then a
lower one), keep track of the summary we decided to import.
This ensures that the backend only gets a single definition to import
for each GUID, so that it doesn't need to choose one.
Move the largest threshold the GUID was considered for import into the
current module out of the ImportMap (which is part of a larger map
maintained across the whole index), and into a new map just maintained
for the current module we are computing imports for. This saves some
memory since we no longer have the thresholds maintained across the
whole index (and throughout the in-process backends when doing a normal
non-distributed ThinLTO build), at the cost of some additional
information being maintained for each invocation of ComputeImportForModule
(the selected summary pointer for each import).
There is an additional map lookup for each callee being considered for
importing, however, this was able to subsume a map lookup in the
Worklist iteration that invokes computeImportForFunction. We also are
able to avoid calling selectCallee if we already failed to import at the
same or higher threshold.
I compared the run time and peak memory for the SPEC2006 471.omnetpp
benchmark (running in-process ThinLTO backends), as well as for a large
internal benchmark with a distributed ThinLTO build (so just looking at
the thin link time/memory). Across a number of runs with and without
this change there was no significant change in the time and memory.
(I tried a few other variations of the change but they also didn't
improve time or peak memory).
Reviewers: davidxl
Subscribers: mehdi_amini, inglorion, llvm-commits
Differential Revision: https://reviews.llvm.org/D48670
llvm-svn: 337050
Summary:
Currently LowerTypeTests emits jumptable entries for all live external
and address-taken functions; however, we could limit the number of
functions that we emit entries for significantly.
For Cross-DSO CFI, we continue to emit jumptable entries for all
exported definitions. In the non-Cross-DSO CFI case, we only need to
emit jumptable entries for live functions that are address-taken in live
functions. This ignores exported functions and functions that are only
address taken in dead functions. This change uses ThinLTO summary data
(now emitted for all modules during ThinLTO builds) to determine
address-taken and liveness info.
The logic for emitting jumptable entries is more conservative in the
regular LTO case because we don't have summary data in the case of
monolithic LTO builds; however, once summaries are emitted for all LTO
builds we can unify the Thin/monolithic LTO logic to only use summaries
to determine the liveness of address taking functions.
This change is a partial fix for PR37474. It reduces the build size for
nacl_helper by ~2-3%, the reduction is due to nacl_helper compiling in
lots of unused code and unused functions that are address taken in dead
functions no longer being being considered live due to emitted jumptable
references. The reduction for chromium is ~0.1-0.2%.
Reviewers: pcc, eugenis, javed.absar
Reviewed By: pcc
Subscribers: aheejin, dexonsmith, dschuff, mehdi_amini, eraman, steven_wu, llvm-commits, kcc
Differential Revision: https://reviews.llvm.org/D47652
llvm-svn: 337038
For instance, When dumping .apple_types, the second atom represents the
DW_TAG. In addition to printing the raw value, we now also pretty print
the value if the ATOM tells us how.
llvm-svn: 337026
This was completely broken if there was ever a struct argument, as
this information is thrown away during the argument analysis.
The offsets as passed in to LowerFormalArguments are not useful,
as they partially depend on the legalized result register type,
and they don't consider the alignment in the first place.
Ignore the Ins array, and instead figure out from the raw IR type
what we need to do. This seems to fix the padding computation
if the DAG lowering is forced (and stops breaking arguments
following padded arguments if the arguments were only partially
lowered in the IR)
llvm-svn: 337021
This reverts commit r336419: use-after-free on CallGraph::FunctionMap elements
due to the use of a stale iterator in CGPassManager::runOnModule.
The iterator may be invalidated if a pass removes a function, ex.:
llvm::LegacyInlinerBase::inlineCalls
inlineCallsImpl
llvm::CallGraph::removeFunctionFromModule
llvm-svn: 337018
See D49247, D49266
I'm only adding the sane negative tests, and not
adding the one-use tests yet. Also, not adding
negative tests for the second pattern with inverted operands yet,
since it's handling will be added in later differential.
llvm-svn: 337014
Revision r322373 fixed a bug in how we materialize constants when the CR-field
needs to be set.
However the fix is overly conservative. It will only do the transform if
AND-ing the input with the new constant produces the same new constant.
This is of course correct, but not necessarily required.
If there are no futher uses of the constant, the constant can be changed.
If there are no uses of the GPR result, the final result of the materialization
isn't important other than it needs to compare to zero correctly (lt, gt, eq).
Differential revision: https://reviews.llvm.org/D42109
llvm-svn: 337008
This patch adds support for AArch64 to cfi-verify.
This required three changes to cfi-verify. First, it generalizes checking if an instruction is a trap by adding a new isTrap flag to TableGen (and defining it for x86 and AArch64). Second, the code that ensures that the operand register is not clobbered between the CFI check and the indirect call needs to allow a single dereference (in x86 this happens as part of the jump instruction). Third, we needed to ensure that return instructions are not counted as indirect branches. Technically, returns are indirect branches and can be covered by CFI, but LLVM's forward-edge CFI does not protect them, and x86 does not consider them, so we keep that behavior.
In addition, we had to improve AArch64's code to evaluate the branch target of a MCInst to handle calls where the destination is not the first operand (which it often is not).
Differential Revision: https://reviews.llvm.org/D48836
llvm-svn: 337007
Summary: The path to the python executable can contain spaces, so it should be specified with quotes.
Reviewers: asmith, simon_tatham
Reviewed By: simon_tatham
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D49258
llvm-svn: 337006
Summary:
This commit does two things:
1. modified the existing DivergenceAnalysis::dump() so it dumps the
whole function with added DIVERGENT: annotations;
2. added code to do that dump if the appropriate -debug-only option is
on.
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D47700
Change-Id: Id97b605aab1fc6f5a11a20c58a99bbe8c565bf83
llvm-svn: 336998
Spectre variant #1 for x86.
There is a lengthy, detailed RFC thread on llvm-dev which discusses the
high level issues. High level discussion is probably best there.
I've split the design document out of this patch and will land it
separately once I update it to reflect the latest edits and updates to
the Google doc used in the RFC thread.
This patch is really just an initial step. It isn't quite ready for
prime time and is only exposed via debugging flags. It has two major
limitations currently:
1) It only supports x86-64, and only certain ABIs. Many assumptions are
currently hard-coded and need to be factored out of the code here.
2) It doesn't include any options for more fine-grained control, either
of which control flow edges are significant or which loads are
important to be hardened.
3) The code is still quite rough and the testing lighter than I'd like.
However, this is enough for people to begin using. I have had numerous
requests from people to be able to experiment with this patch to
understand the trade-offs it presents and how to use it. We would also
like to encourage work to similar effect in other toolchains.
The ARM folks are actively developing a system based on this for
AArch64. We hope to merge this with their efforts when both are far
enough along. But we also don't want to block making this available on
that effort.
Many thanks to the *numerous* people who helped along the way here. For
this patch in particular, both Eric and Craig did a ton of review to
even have confidence in it as an early, rough cut at this functionality.
Differential Revision: https://reviews.llvm.org/D44824
llvm-svn: 336990
We currently only support binary instructions in the alternate opcode shuffles.
This patch is an initial attempt at adding cast instructions as well, this raises several issues that we probably want to address as we continue to generalize the alternate mechanism:
1 - Duplication of cost determination - we should probably add scalar/vector costs helper functions and get BoUpSLP::getEntryCost to use them instead of determining costs directly.
2 - Support alternate instructions with the same opcode (e.g. casts with different src types) - alternate vectorization of calls with different IntrinsicIDs will require this.
3 - Allow alternates to be a different instruction type - mixing binary/cast/call etc.
4 - Allow passthrough of unsupported alternate instructions - related to PR30787/D28907 'copyable' elements.
Reapplied with fix to only accept 2 different casts if they come from the same source type (PR38154).
Differential Revision: https://reviews.llvm.org/D49135
llvm-svn: 336989
flow patterns including forks, merges, and even cyles.
This tries to cover a reasonably comprehensive set of patterns that
still don't require PHIs or PHI placement. The coverage was inspired by
the amazing variety of patterns produced when copy EFLAGS and restoring
it to implement Speculative Load Hardening. Without this patch, we
simply cannot make such complex and invasive changes to x86 instruction
sequences due to EFLAGS.
I've added "just" one test, but this test covers many different
complexities and corner cases of this approach. It is actually more
comprehensive, as far as I can tell, than anything that I have
encountered in the wild on SLH.
Because the test is so complex, I've tried to give somewhat thorough
comments and an ASCII-art diagram of the control flows to make it a bit
easier to read and maintain long-term.
Differential Revision: https://reviews.llvm.org/D49220
llvm-svn: 336985
This patch adds support for the following unpack instructions:
- PUNPKLO, PUNPKHI Unpack elements from low/high half and
place into elements of twice their size.
e.g. punpklo p0.h, p0.b
- UUNPKLO, UUNPKHI Unpack elements from low/high half and
SUNPKLO, SUNPKHI place into elements of twice their size
after zero- or sign-extending the values.
e.g. uunpklo z0.h, z0.b
llvm-svn: 336982
As suggested by @efriedma on D49262 - changed the extractelement to a store to prevent SimplifyDemandedVectorElts from simplifying the build vectors - this keeps the immediate generation which was the point of the tests.
llvm-svn: 336981
During the execution of long functions or functions that have a lot of
inlined code it could come to the situation where tracked value could be
transferred from one register to another. The transfer is recognized only if
destination register is a callee saved register and if source register is
killed. We do not salvage caller-saved registers since there is a great
chance that killed register would outlive it.
Patch by Nikola Prica.
Differential Revision: https://reviews.llvm.org/D44016
llvm-svn: 336978
Summary:
llvm-xray changes:
- account-mode - process-id {...} shows after thread-id
- convert-mode - process {...} shows after thread
- parses FDR and basic mode pid entries
- Checks version number for FDR log parsing.
Basic logging changes:
- Update header version from 2 -> 3
FDR logging changes:
- Update header version from 2 -> 3
- in writeBufferPreamble, there is an additional PID Metadata record (after thread id record and tsc record)
Test cases changes:
- fdr-mode.cc, fdr-single-thread.cc, fdr-thread-order.cc modified to catch process id output in the log.
Reviewers: dberris
Reviewed By: dberris
Subscribers: hiraditya, llvm-commits, #sanitizers
Differential Revision: https://reviews.llvm.org/D49153
llvm-svn: 336974
This is not an optimization we should be doing in isel. This is more suitable for a DAG combine.
My main concern is a future time when we support more FPENV. Changing a packed op to a scalar op could cause us to miss some exceptions that should have occured if we had done a packed op. A DAG combine would be better able to manage this.
llvm-svn: 336971
-v prints all directive pattern matches.
-vv additionally prints info that might be noise to users but that can
be helpful to FileCheck developers.
To maximize code reuse and to make diagnostics more consistent, this
patch also adjusts and extends some of the existing diagnostics.
CHECK-NOT failures now report variables uses. Many more diagnostics
now report the check prefix and kind of directive.
Reviewed By: probinson
Differential Revision: https://reviews.llvm.org/D47114
llvm-svn: 336967
This bug was created by rL335258 because we used to always call instsimplify
after trying the associative folds. After that change it became possible
for subsequent folds to encounter unsimplified code (and potentially assert
because of it).
Instead of carrying changed state through instcombine, we can just return
immediately. This allows instsimplify to run, so we can continue assuming
that easy folds have already occurred.
llvm-svn: 336965
Summary:
This patch is crucial for proving equality laundered/stripped
pointers. eg:
bool foo(A *a) {
return a == std::launder(a);
}
Clang with -fstrict-vtable-pointers will emit something like:
define dso_local zeroext i1 @_Z3fooP1A(%struct.A* %a) {
entry:
%c = bitcast %struct.A* %a to i8*
%call = tail call i8* @llvm.launder.invariant.group.p0i8(i8* %c)
%0 = bitcast %struct.A* %a to i8*
%1 = tail call i8* @llvm.strip.invariant.group.p0i8(i8* %0)
%2 = tail call i8* @llvm.strip.invariant.group.p0i8(i8* %call)
%cmp = icmp eq i8* %1, %2
ret i1 %cmp
}
and because %2 can be replaced with @llvm.strip.invariant.group(%0)
and that %2 and %1 will produce the same value (because strip is readnone)
we can replace compare with true.
Reviewers: rsmith, hfinkel, majnemer, amharc, kuhar
Subscribers: llvm-commits, hiraditya
Differential Revision: https://reviews.llvm.org/D47423
llvm-svn: 336963
Not all programs want section ordering when compiled with LTO.
In particular, the Linux kernel is very sensitive when it comes to linking, and
doesn't boot when each function is placed in its own sections.
Reviewed By: pcc
Differential Revision: https://reviews.llvm.org/D48756
llvm-svn: 336943
The one I noticed is sqrtsss/sd, but there could be others.
I had to add a couple new tests that don't have insertelement in there to catch this on the fast-isel path. Otherwise we trigger an abort and use SelectionDAG.
llvm-svn: 336938
and no use of DW_FORM_rnglistx with the DW_AT_ranges attribute.
Reviewer: aprantl
Differential Revision: https://reviews.llvm.org/D49214
llvm-svn: 336927
Summary:
This option appears to have been dropped as part of the refactoring in
r331663. Unfortunately, if we want to use llvm-strip as a drop-in
replacement for strip, this option should still be available.
Reviewers: alexshap
Reviewed By: alexshap
Subscribers: meikeb, kongyi, chh, jakehehrlich, llvm-commits, pirama
Differential Revision: https://reviews.llvm.org/D49226
llvm-svn: 336921
While that fold is clearly not happening [anymore],
we do now have separate test cases for these cases,
so we should be ok to slightly adjust these tests
to not potentially loose test coverage.
As suggested by Hiroshi Yamauchi in
https://reviews.llvm.org/D49179#1159345
llvm-svn: 336912
We have located a bug in AssemblyWriter::printModuleSummaryIndex(). This
function outputs path strings incorrectly. Backslashes in the strings
are not correctly escaped.
Consequently, if a path name contains a backslash followed by two
hexadecimal characters, the sequence is incorrectly interpreted when the
output is read by another component. This mangles the path and results
in error.
This patch fixes this issue by calling printEscapedString() to output
the module paths.
Patch by Chris Jackson.
Differential Revision: https://reviews.llvm.org/D49090
llvm-svn: 336908
These tests would fail with -verify-machineinstrs because the MI
generated from the IR would be merged with the one already in the MIR
files, and we get the following error:
```
*** Bad machine code: Function has NoVRegs property but there are VReg operands ***
- function: f
```
Differential Revision: https://reviews.llvm.org/D49191
llvm-svn: 336907
canWidenShuffleElements can do a better job if given a mask with ZeroableElements info. Apparently, ZeroableElements was being only used to identify AllZero candidates, but possibly we could plug it into more shuffle matchers.
Original Patch by Zvi Rackover @zvi
Differential Revision: https://reviews.llvm.org/D42044
llvm-svn: 336903
Noticed while updating D42044, lowerV2X128VectorShuffle can improve the shuffle mask with the zeroable data to create a target shuffle mask to recognise more 'zero upper 128' patterns.
NOTE: lowerV4X128VectorShuffle could benefit as well but the code needs refactoring first to discriminate between SM_SentinelUndef and SM_SentinelZero for negative shuffle indices.
Differential Revision: https://reviews.llvm.org/D49092
llvm-svn: 336900
there for a long time.
The boolean tracking whether we saw a kill of the flags was supposed to
be per-block we are scanning and instead was outside that loop and never
cleared. It requires a quite contrived test case to hit this as you have
to have multiple levels of successors and interleave them with kills.
I've included such a test case here.
This is another bug found testing SLH and extracted to its own focused
patch.
llvm-svn: 336876
multiple successors where some of the uses end up killing the EFLAGS
register.
There was a bug where rather than skipping to the next basic block
queued up with uses once we saw a kill, we stopped processing the blocks
entirely. =/
Test case produces completely nonsensical code w/o this tiny fix.
This was found testing Speculative Load Hardening and split out of that
work.
Differential Revision: https://reviews.llvm.org/D49211
llvm-svn: 336874
This converts them to what clang is now using for codegen. Unfortunately, there seem to be a few kinks to work out still. I'll try to address with follow up patches.
llvm-svn: 336871
This changes `-print-*` from transformation passes to analysis passes so
that `-print-after-all` and `-print-before-all` don't trigger. This
avoids some redundant output.
Patch by Son Tuan Vu!
llvm-svn: 336869
This is marginally helpful for removing redundant extensions, and the
code is easier to read, so it seems like an all-around win. In the new
test i8-phi-ext.ll, we used to emit an AssertSext i8; now we emit an
AssertZext i2, which allows the extension of the return value to be
eliminated.
Differential Revision: https://reviews.llvm.org/D49004
llvm-svn: 336868
This commit suppresses turning loops like this into "(bitwidth - ctlz(input))".
unsigned foo(unsigned input) {
unsigned num = 0;
do {
++num;
input >>= 1;
} while (input != 0);
return num;
}
The loop version returns a value of 1 for both an input of 0 and an input of 1. Converting to a naive ctlz does not preserve that.
Theoretically we could do better if we checked isKnownNonZero or we could insert a select to handle the divergence. But until we have motivating cases for that, this is the easiest solution.
llvm-svn: 336864
This loop executes one iteration without checking the input value. This produces a count of 1 for an input of 0 and 1. We are turning this into 32 - ctlz(n), but that returns 0 if n is 0.
llvm-svn: 336862
Differential Revision: https://reviews.llvm.org/D47171
This contains the portions of that patch that could not be committed
using the git monorepo because of dos line ending problems.
llvm-svn: 336848
That is, make CHECK-DAG skip matches that overlap the matches of any
preceding consecutive CHECK-DAG directives. This change makes
CHECK-DAG more consistent with other directives, and there is evidence
it makes CHECK-DAG more intuitive and less error-prone. See the RFC
discussion starting at:
http://lists.llvm.org/pipermail/llvm-dev/2018-May/123010.html
Moreover, this behavior enables CHECK-DAG groups for unordered,
non-unique strings or patterns. For example, it is useful for
verifying output or logs from a parallel program, such as the OpenMP
runtime.
This patch also implements the command-line option
-allow-deprecated-dag-overlap, which reverts CHECK-DAG to the old
overlapping behavior. This option should not be used in new tests.
It is meant only for the existing tests that are broken by this change
and that need time to update.
See the following bugzilla issue for tracking of such tests:
https://bugs.llvm.org/show_bug.cgi?id=37532
Patches to add -allow-deprecated-dag-overlap to those tests will
follow immediately.
Reviewed By: probinson
Differential Revision: https://reviews.llvm.org/D47106
llvm-svn: 336847
See https://reviews.llvm.org/D47106 for details.
Reviewed By: probinson
Differential Revision: https://reviews.llvm.org/D47171
This commit drops that patch's changes to:
llvm/test/CodeGen/NVPTX/f16x2-instructions.ll
llvm/test/CodeGen/NVPTX/param-load-store.ll
For some reason, the dos line endings there prevent me from commiting
via the monorepo. A follow-up commit (not via the monorepo) will
finish the patch.
llvm-svn: 336843
Summary:
https://bugs.llvm.org/show_bug.cgi?id=38123
This pattern will be produced by Implicit Integer Truncation sanitizer,
https://reviews.llvm.org/D48958https://bugs.llvm.org/show_bug.cgi?id=21530
in unsigned case, therefore it is probably a good idea to improve it.
https://rise4fun.com/Alive/Rny
^ there are more opportunities for folds, i will follow up with them afterwards.
Caveat: this somehow exposes a missing opportunities
in `test/Transforms/InstCombine/icmp-logical.ll`
It seems, the problem is in `foldLogOpOfMaskedICmps()` in `InstCombineAndOrXor.cpp`.
But i'm not quite sure what is wrong, because it calls `getMaskedTypeForICmpPair()`,
which calls `decomposeBitTestICmp()` which should already work for these cases...
As @spatel notes in https://reviews.llvm.org/D49179#1158760,
that code is a rather complex mess, so we'll let it slide.
Reviewers: spatel, craig.topper
Reviewed By: spatel
Subscribers: yamauchi, majnemer, t.p.northover, llvm-commits
Differential Revision: https://reviews.llvm.org/D49179
llvm-svn: 336834