This is an initial patch to add the necessary support for a DemandedElts argument to SimplifyDemandedBits, more closely matching computeKnownBits and to help improve vector codegen.
I've added only a small amount of the changes necessary to get at least one test to update - a lot more can be done but I'd like to add these methodically with proper test coverage, at the same time the hope is to slowly move some/all of SimplifyDemandedVectorElts into SimplifyDemandedBits as well.
Differential Revision: https://reviews.llvm.org/D55768
llvm-svn: 349374
This requires the two callers to manifest a 0 to make EmitCmp call EmitTest.
I'm looking into changing how we combine TEST and flag setting instructions to not be part of lowering. And instead be part of DAG combine or isel. Which will mean EmitTest will probably become gutted and maybe disappear entirely.
llvm-svn: 349094
PR17686 demonstrates that for some targets FP exceptions can fire in cases where the FP_TO_UINT is expanded using a FP_TO_SINT instruction.
The existing code converts both the inrange and outofrange cases using FP_TO_SINT and then selects the result, this patch changes this for 'strict' cases to pre-select the FP_TO_SINT input and the offset adjustment.
The X87 cases don't need the strict flag but generates much nicer code with it....
Differential Revision: https://reviews.llvm.org/D53794
llvm-svn: 348251
This should likely be adjusted to limit this transform
further, but these diffs should be clear wins.
If we have blendv/conditional move, then we should assume
those are cheap ops. The loads become independent of the
compare, so those can be speculated before we need to use
the values in the blend/mov.
llvm-svn: 347526
Previously, the extend_vector_inreg opcode required their input register to be the same total width as their output. But this doesn't match up with how the X86 instructions are defined. For X86 the input just needs to be a legal type with at least enough elements to cover the output.
This patch weakens the check on these nodes and allows them to be used as long as they have more input elements than output elements. I haven't changed type legalization behavior so it will still create them with matching input and output sizes.
X86 will custom legalize these nodes by shrinking the input to be a 128 bit vector and once we've done that we treat them as legal operations. We still have one case during type legalization where we must custom handle v64i8 on avx512f targets without avx512bw where v64i8 isn't a legal type. In this case we will custom type legalize to a *extend_vector_inreg with a v16i8 input. After that the input is a legal type so type legalization should ignore the node and doesn't need to know about the relaxed restriction. We are no longer allowed to use the default expansion for these nodes during vector op legalization since the default expansion uses a shuffle which required the widths to match. Custom legalization for all types will prevent us from reaching the default expansion code.
I believe DAG combine works correctly with the released restriction because it doesn't check the number of input elements.
The rest of the patch is changing X86 to use either the vector_inreg nodes or the regular zero_extend/sign_extend nodes. I had to add additional isel patterns to handle any_extend during isel since simplifydemandedbits can create them at any time so we can't legalize to zero_extend before isel. We don't yet create any_extend_vector_inreg in simplifydemandedbits.
Differential Revision: https://reviews.llvm.org/D54346
llvm-svn: 346784
The main caller of this already has an MVT and several targets called getSimpleVT inside without checking isSimple. This makes the simpleness explicit.
llvm-svn: 346180
Add X86 SimplifyDemandedBitsForTargetNode and use it to simplify PMULDQ/PMULUDQ target nodes.
This enables us to repeatedly simplify the node's arguments after the previous approach had to be reverted due to PR39398.
Differential Revision: https://reviews.llvm.org/D53643
llvm-svn: 345182
When implementing memset's today we often see this pattern:
$x0 = MOV 0xXYXYXYXYXYXYXYXY
store $x0, ...
$w1 = MOV 0xXYXYXYXY
store $w1, ...
We first create a 64bit constant in a 64bit register with all bytes the
same and then create a 32bit constant with all bytes the same in a 32bit
register. In many targets we could just access the lower byte of the
64bit register instead.
- Ideally this would be handled by the ConstantHoist pass but it runs
too early when memset isn't expanded yet.
- The memset expansion code already had this optimization implemented,
however SelectionDAG constantfolding would constantfold the
"trunc(bigconstnat)" pattern to "smallconstant".
- This patch makes the memset expansion mark the constant as Opaque and
stop DAGCombiner from constant folding in this situation. (Similar to
how ConstantHoisting marks things as Opaque to avoid folding
ADD/SUB/etc.)
Differential Revision: https://reviews.llvm.org/D53181
llvm-svn: 345102
Summary:
As discussed in D52304 / IRC, we now have pattern matching for
'bit extract' in two places - tablegen and `X86DAGToDAGISel`.
There are 4 patterns.
And we will have a problem with `x & (-1 >> (32 - y))` pattern.
* If the mask is one-use, then it is always unfolded into `x << (32 - y) >> (32 - y)` first.
Thus, the existing test coverage is already broken.
* If it is not one-use, then it is not unfolded, and is matched as BZHI.
* If it is not one-use, we will not match it as BEXTR. And if it is one-use, it will have been unfolded already.
So we will either not handle that pattern for BEXTR, or not have test coverage for it.
This is bad.
As discussed with @craig.topper, let's unify this matching, and do everything in `X86DAGToDAGISel`.
Then we will not have code duplication, and will have proper test coverage.
This indeed does not affect any tests, and this is great.
It means that for these two patterns, the `X86DAGToDAGISel` is identical to the tablegen version.
Please review carefully, i'm not fully sure about that intrinsic change, and introduction of the new `X86ISD` opcode.
Reviewers: craig.topper, RKSimon, spatel
Reviewed By: craig.topper
Subscribers: llvm-commits, craig.topper
Differential Revision: https://reviews.llvm.org/D53164
llvm-svn: 344904
Summary:
These nodes exist to overcome an isel problem where we can generate a zero extend of an AH register followed by an extract subreg, and another zero extend. The first zero extend exists to avoid a partial register update copying the AH register into the low 8-bits. The second zero extend exists if the user wanted the remainder zero extended.
To make this work we had a DAG combine to morph the DIVREM opcode to a special opcode that included the extend. But then we had to add the new node to computeKnownBits and computeNumSignBits to process the extension portion.
This patch instead removes all of that and adds a late peephole to detect the two extends.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D53449
llvm-svn: 344874
This patch adds an initial x86 SimplifyDemandedVectorEltsForTargetNode implementation to handle target shuffles.
Currently the patch only decodes a target shuffle, calls SimplifyDemandedVectorElts on its input operands and removes any shuffle that reduces to undef/zero/identity.
Future work will need to integrate this with combineX86ShufflesRecursively, add support for other x86 ops, etc.
NOTE: There is a minor regression that appears to be affecting further (extractelement?) combines which I haven't been able to solve yet - possibly something to do with how nodes are added to the worklist after simplification.
Differential Revision: https://reviews.llvm.org/D52140
llvm-svn: 342564
This is an alternative to D37896. I don't see a way to decompose multiplies
generically without a target hook to tell us when it's profitable.
ARM and AArch64 may be able to remove some duplicate code that overlaps with
this transform.
As a first step, we're only getting the most clear wins on the vector examples
requested in PR34474:
https://bugs.llvm.org/show_bug.cgi?id=34474
As noted in the code comment, it's likely that the x86 constraints are tighter
than necessary, but it may not always be a win to replace a pmullw/pmulld.
Differential Revision: https://reviews.llvm.org/D52195
llvm-svn: 342554
I noticed this along with the patterns in D51125, but when the index is variable,
we don't convert insertelement into a build_vector.
For x86, that means these get expanded at legalization time into the loading/spilling
code that we see in the tests. I think it's always better to avoid going to memory on
these, and we get the optimal 'broadcast' if it's available.
I suspect other targets may want to look at enabling the hook. AArch64 and AMDGPU have
regression tests that would be affected (although I did not check what would happen in
those cases). In the most basic cases shown here, AArch64 would probably do much
better with a splat.
Differential Revision: https://reviews.llvm.org/D51186
llvm-svn: 340705
Summary:
Previously the value being stored is the last operand in SDNode. This causes the type legalizer to visit the mask operand before the value operand. The type legalizer was more complicated because of this since we want the type of the value to drive the decisions.
This patch moves the value to be the first operand so we visit it first during type legalization. It also simplifies the type legalization code accordingly.
X86 is currently the only in tree target that uses this SDNode. Not sure if there are any users out of tree.
Reviewers: RKSimon, delena, hfinkel, eli.friedman
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D50402
llvm-svn: 340689
AVX512 added new versions of these intrinsics that take a rounding mode. If the rounding mode is 4 the new intrinsics are equivalent to the old intrinsics.
The AVX512 intrinsics were being lowered to ISD opcodes, but the legacy SSE intrinsics were left as intrinsics. This resulted in the AVX512 instructions needing separate patterns for the ISD opcodes and the legacy SSE intrinsics.
Now we convert SSE intrinsics and AVX512 intrinsics with rounding mode 4 to the same ISD opcode so we can share the isel patterns.
llvm-svn: 339749
This is an early step towards using SimplifyDemandedVectorElts for target shuffle combining - this merely moves the existing X86ISD::VBROADCAST simplification code to use the SimplifyDemandedVectorElts mechanism.
Adds X86TargetLowering::SimplifyDemandedVectorEltsForTargetNode to handle X86ISD::VBROADCAST - in time we can support all target shuffles (and other ops) here.
llvm-svn: 337547
Summary:
[[ https://bugs.llvm.org/show_bug.cgi?id=38149 | PR38149 ]]
As discussed in https://reviews.llvm.org/D49179#1158957 and later,
the IR for 'check for [no] signed truncation' pattern can be improved:
https://rise4fun.com/Alive/gBf
^ that pattern will be produced by Implicit Integer Truncation sanitizer,
https://reviews.llvm.org/D48958https://bugs.llvm.org/show_bug.cgi?id=21530
in signed case, therefore it is probably a good idea to improve it.
But the IR-optimal patter does not lower efficiently, so we want to undo it..
This handles the simple pattern.
There is a second pattern with predicate and constants inverted.
NOTE: we do not check uses here. we always do the transform.
Reviewers: spatel, craig.topper, RKSimon, javed.absar
Reviewed By: spatel
Subscribers: kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D49266
llvm-svn: 337166
We now use llvm.fma.f32/f64 or llvm.x86.fmadd.f32/f64 intrinsics that use scalar types rather than vector types. So we don't these special ISD nodes that operate on the lowest element of a vector.
llvm-svn: 336883
These ISD nodes try to select the MOVLPS and MOVLPD instructions which are special load only instructions. They load data and merge it into the lower 64-bits of an XMM register. They are logically equivalent to our MOVSD node plus a load.
There was only one place in X86ISelLowering that used MOVLPD and no places that selected MOVLPS. The one place that selected MOVLPD had to choose between it and MOVSD based on whether there was a load. But lowering is too early to tell if the load can really be folded. So in isel we have patterns that use MOVSD for MOVLPD if we can't find a load.
We also had patterns that select the MOVLPD instruction for a MOVSD if we can find a load, but didn't choose the MOVLPD ISD opcode for some reason.
So it seems better to just standardize on MOVSD ISD opcode and manage MOVSD vs MOVLPD instruction with isel patterns.
llvm-svn: 336728
Summary:
This adds a reverse transform for the instcombine canonicalizations
that were added in D47980, D47981.
As discussed later, that was worse at least for the code size,
and potentially for the performance, too.
https://rise4fun.com/Alive/Zmpl
Reviewers: craig.topper, RKSimon, spatel
Reviewed By: spatel
Subscribers: reames, llvm-commits
Differential Revision: https://reviews.llvm.org/D48768
llvm-svn: 336585
The intrinsics can be implemented with a f32/f64 llvm.fma intrinsic and an insert into a zero vector.
There are a couple regressions here due to SelectionDAG not being able to pull an fneg through an extract_vector_elt. I'm not super worried about this though as InstCombine should be able to do it before we get to SelectionDAG.
llvm-svn: 336416
I don't believe there is any real reason to have separate X86 specific opcodes for vector compares. Setcc has the same behavior just uses a different encoding for the condition code.
I had to change the CondCodeAction for SETLT and SETLE to prevent some transforms from changing SETGT lowering.
Differential Revision: https://reviews.llvm.org/D43608
llvm-svn: 335173
This is the new version of D46181, allowing setjmp/longjmp
to work correctly with the Intel CET shadow stack by storing
SSP on setjmp and fixing it on longjmp. The patch has been
updated to use the cf-protection-return module flag instead
of HasSHSTK, and the bug that caused D46181 to be reverted
has been fixed with the test expanded to track that fix.
patch by mike.dvoretsky
Differential Revision: https://reviews.llvm.org/D47311
llvm-svn: 333990
There seems to be no real reason to have these separate copies.
The existing implementations just copy each other for x86.
For Mips there is a subtle difference, which is just a bug
since it changes based on the context where which one was called.
Dropping this version, all tests pass. If I try to merge them
to match the removed version, a test fails.
llvm-svn: 333440
These do the same thing with the first and second sources swapped. They previously came from separate intrinsics that specified different masking behavior. But we can cover that with isel patterns and a single node.
This is a step towards reducing the number of intrinsics needed.
A bunch of tests change because we are now biased to choosing VPERMT over VPERMI when there is nothing to signal that commuting is beneficial.
llvm-svn: 333383
This patch adds a shadow stack fix when compiling
setjmp/longjmp with the shadow stack enabled. This
allows setjmp/longjmp to work correctly with CET.
Patch by mike.dvoretsky
Differential Revision: https://reviews.llvm.org/D46181
llvm-svn: 331748
Summary:
Split off from D46031.
In masked merge case, this degrades IPC by decreasing instruction count.
{F6108777}
The next patch should be able to recover and improve this.
This also affects the transform @spatel have added in D27489 / rL289738,
and the test coverage for X86 was missing.
But after i have added it, and looked at the changes in MCA, i'm somewhat confused.
{F6093591} {F6093592} {F6093593}
I'd say this regression is an improvement, since `IPC` increased in that case?
Reviewers: spatel, craig.topper
Reviewed By: spatel
Subscribers: andreadb, llvm-commits, spatel
Differential Revision: https://reviews.llvm.org/D46493
llvm-svn: 331684
We've been running doxygen with the autobrief option for a couple of
years now. This makes the \brief markers into our comments
redundant. Since they are a visual distraction and we don't want to
encourage more \brief markers in new code either, this patch removes
them all.
Patch produced by
for i in $(git grep -l '\\brief'); do perl -pi -e 's/\\brief //g' $i & done
Differential Revision: https://reviews.llvm.org/D46290
llvm-svn: 331272
Summary:
Previously the flag intrinsics always used the index instructions even if a mask instruction also exists.
To fix fix this I've created a single ISD node type that returns index, mask, and flags. The SelectionDAG CSE process will merge all flavors of intrinsics with the same inputs to a s ingle node. Then during isel we just have to look at which results are used to know what instruction to generate. If both mask and index are used we'll need to emit two instructions. But for all other cases we can emit a single instruction.
Since I had to do manual isel anyway, I've removed the pseudo instructions and custom inserter code that was working around tablegen limitations with multiple implicit defs.
I've also renamed the recently added sse42.ll test case to sttni.ll since it focuses on that subset of the sse4.2 instructions.
Reviewers: chandlerc, RKSimon, spatel
Reviewed By: chandlerc
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D46202
llvm-svn: 331091
Three new instructions:
umonitor - Sets up a linear address range to be
monitored by hardware and activates the monitor.
The address range should be a writeback memory
caching type.
umwait - A hint that allows the processor to
stop instruction execution and enter an
implementation-dependent optimized state
until occurrence of a class of events.
tpause - Directs the processor to enter an
implementation-dependent optimized state
until the TSC reaches the value in EDX:EAX.
Also modifying the description of the mfence
instruction, as the rep prefix (0xF3) was allowed
before, which would conflict with umonitor during
disassembly.
Before:
$ echo 0xf3,0x0f,0xae,0xf0 | llvm-mc -disassemble
.text
mfence
After:
$ echo 0xf3,0x0f,0xae,0xf0 | llvm-mc -disassemble
.text
umonitor %rax
Reviewers: craig.topper, zvi
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D45253
llvm-svn: 330462
The key idea is to lower COPY nodes populating EFLAGS by scanning the
uses of EFLAGS and introducing dedicated code to preserve the necessary
state in a GPR. In the vast majority of cases, these uses are cmovCC and
jCC instructions. For such cases, we can very easily save and restore
the necessary information by simply inserting a setCC into a GPR where
the original flags are live, and then testing that GPR directly to feed
the cmov or conditional branch.
However, things are a bit more tricky if arithmetic is using the flags.
This patch handles the vast majority of cases that seem to come up in
practice: adc, adcx, adox, rcl, and rcr; all without taking advantage of
partially preserved EFLAGS as LLVM doesn't currently model that at all.
There are a large number of operations that techinaclly observe EFLAGS
currently but shouldn't in this case -- they typically are using DF.
Currently, they will not be handled by this approach. However, I have
never seen this issue come up in practice. It is already pretty rare to
have these patterns come up in practical code with LLVM. I had to resort
to writing MIR tests to cover most of the logic in this pass already.
I suspect even with its current amount of coverage of arithmetic users
of EFLAGS it will be a significant improvement over the current use of
pushf/popf. It will also produce substantially faster code in most of
the common patterns.
This patch also removes all of the old lowering for EFLAGS copies, and
the hack that forced us to use a frame pointer when EFLAGS copies were
found anywhere in a function so that the dynamic stack adjustment wasn't
a problem. None of this is needed as we now lower all of these copies
directly in MI and without require stack adjustments.
Lots of thanks to Reid who came up with several aspects of this
approach, and Craig who helped me work out a couple of things tripping
me up while working on this.
Differential Revision: https://reviews.llvm.org/D45146
llvm-svn: 329657
X86 Supports Indirect Branch Tracking (IBT) as part of Control-Flow Enforcement Technology (CET).
IBT instruments ENDBR instructions used to specify valid targets of indirect call / jmp.
The `nocf_check` attribute has two roles in the context of X86 IBT technology:
1. Appertains to a function - do not add ENDBR instruction at the beginning of the function.
2. Appertains to a function pointer - do not track the target function of this pointer by adding nocf_check prefix to the indirect-call instruction.
This patch implements `nocf_check` context for Indirect Branch Tracking.
It also auto generates `nocf_check` prefixes before indirect branchs to jump tables that are guarded by range checks.
Differential Revision: https://reviews.llvm.org/D41879
llvm-svn: 327767
The code to match and produce more x86 vector blends was enabled for all
architectures even though the transform may pessimize the code for other
architectures that do not provide a vector blend instruction.
Added an aarch64 testcase to check that a VZIP instruction is generated instead
of byte movs.
Differential Revision: https://reviews.llvm.org/D44118
llvm-svn: 327132
ISD::ADD implies individual vector element addition with no carries between elements. But for a vXi1 type that would be the same as XOR. And we already turn ISD::ADD into ISD::XOR for all vXi1 types during lowering. So the ISD::ADD pattern would never be able to match anyway.
KADD is different, it adds the elements but also propagates a carry between them. This just a way of doing an add in k-register without bitcasting to the scalar domain. There's still no way to match the pattern, but at least its not obviously wrong.
llvm-svn: 324861
Legalization is still biased to turn LT compares in to GT by swapping operands to avoid needing extra isel patterns to commute.
I'm hoping to remove TESTM/TESTNM next and this should simplify that by making EQ/NE more similar.
llvm-svn: 323604
Summary:
For the most part its better to keep v32i1 as a mask type of a narrower width than trying to promote it to a ymm register.
I had to add some overrides to the methods that get the types for the calling convention so that we still use v32i8 for argument/return purposes.
There are still some regressions in here. I definitely saw some around shuffles. I think we probably should move vXi1 shuffle from lowering to a DAG combine where I think the extend and truncate we have to emit would be better combined.
I think we also need a DAG combine to remove trunc from (extract_vector_elt (trunc))
Overall this removes something like 13000 CHECK lines from lit tests.
Reviewers: zvi, RKSimon, delena, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D42031
llvm-svn: 323201
Summary:
First, we need to explain the core of the vulnerability. Note that this
is a very incomplete description, please see the Project Zero blog post
for details:
https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html
The basis for branch target injection is to direct speculative execution
of the processor to some "gadget" of executable code by poisoning the
prediction of indirect branches with the address of that gadget. The
gadget in turn contains an operation that provides a side channel for
reading data. Most commonly, this will look like a load of secret data
followed by a branch on the loaded value and then a load of some
predictable cache line. The attacker then uses timing of the processors
cache to determine which direction the branch took *in the speculative
execution*, and in turn what one bit of the loaded value was. Due to the
nature of these timing side channels and the branch predictor on Intel
processors, this allows an attacker to leak data only accessible to
a privileged domain (like the kernel) back into an unprivileged domain.
The goal is simple: avoid generating code which contains an indirect
branch that could have its prediction poisoned by an attacker. In many
cases, the compiler can simply use directed conditional branches and
a small search tree. LLVM already has support for lowering switches in
this way and the first step of this patch is to disable jump-table
lowering of switches and introduce a pass to rewrite explicit indirectbr
sequences into a switch over integers.
However, there is no fully general alternative to indirect calls. We
introduce a new construct we call a "retpoline" to implement indirect
calls in a non-speculatable way. It can be thought of loosely as
a trampoline for indirect calls which uses the RET instruction on x86.
Further, we arrange for a specific call->ret sequence which ensures the
processor predicts the return to go to a controlled, known location. The
retpoline then "smashes" the return address pushed onto the stack by the
call with the desired target of the original indirect call. The result
is a predicted return to the next instruction after a call (which can be
used to trap speculative execution within an infinite loop) and an
actual indirect branch to an arbitrary address.
On 64-bit x86 ABIs, this is especially easily done in the compiler by
using a guaranteed scratch register to pass the target into this device.
For 32-bit ABIs there isn't a guaranteed scratch register and so several
different retpoline variants are introduced to use a scratch register if
one is available in the calling convention and to otherwise use direct
stack push/pop sequences to pass the target address.
This "retpoline" mitigation is fully described in the following blog
post: https://support.google.com/faqs/answer/7625886
We also support a target feature that disables emission of the retpoline
thunk by the compiler to allow for custom thunks if users want them.
These are particularly useful in environments like kernels that
routinely do hot-patching on boot and want to hot-patch their thunk to
different code sequences. They can write this custom thunk and use
`-mretpoline-external-thunk` *in addition* to `-mretpoline`. In this
case, on x86-64 thu thunk names must be:
```
__llvm_external_retpoline_r11
```
or on 32-bit:
```
__llvm_external_retpoline_eax
__llvm_external_retpoline_ecx
__llvm_external_retpoline_edx
__llvm_external_retpoline_push
```
And the target of the retpoline is passed in the named register, or in
the case of the `push` suffix on the top of the stack via a `pushl`
instruction.
There is one other important source of indirect branches in x86 ELF
binaries: the PLT. These patches also include support for LLD to
generate PLT entries that perform a retpoline-style indirection.
The only other indirect branches remaining that we are aware of are from
precompiled runtimes (such as crt0.o and similar). The ones we have
found are not really attackable, and so we have not focused on them
here, but eventually these runtimes should also be replicated for
retpoline-ed configurations for completeness.
For kernels or other freestanding or fully static executables, the
compiler switch `-mretpoline` is sufficient to fully mitigate this
particular attack. For dynamic executables, you must compile *all*
libraries with `-mretpoline` and additionally link the dynamic
executable and all shared libraries with LLD and pass `-z retpolineplt`
(or use similar functionality from some other linker). We strongly
recommend also using `-z now` as non-lazy binding allows the
retpoline-mitigated PLT to be substantially smaller.
When manually apply similar transformations to `-mretpoline` to the
Linux kernel we observed very small performance hits to applications
running typical workloads, and relatively minor hits (approximately 2%)
even for extremely syscall-heavy applications. This is largely due to
the small number of indirect branches that occur in performance
sensitive paths of the kernel.
When using these patches on statically linked applications, especially
C++ applications, you should expect to see a much more dramatic
performance hit. For microbenchmarks that are switch, indirect-, or
virtual-call heavy we have seen overheads ranging from 10% to 50%.
However, real-world workloads exhibit substantially lower performance
impact. Notably, techniques such as PGO and ThinLTO dramatically reduce
the impact of hot indirect calls (by speculatively promoting them to
direct calls) and allow optimized search trees to be used to lower
switches. If you need to deploy these techniques in C++ applications, we
*strongly* recommend that you ensure all hot call targets are statically
linked (avoiding PLT indirection) and use both PGO and ThinLTO. Well
tuned servers using all of these techniques saw 5% - 10% overhead from
the use of retpoline.
We will add detailed documentation covering these components in
subsequent patches, but wanted to make the core functionality available
as soon as possible. Happy for more code review, but we'd really like to
get these patches landed and backported ASAP for obvious reasons. We're
planning to backport this to both 6.0 and 5.0 release streams and get
a 5.0 release with just this cherry picked ASAP for distros and vendors.
This patch is the work of a number of people over the past month: Eric, Reid,
Rui, and myself. I'm mailing it out as a single commit due to the time
sensitive nature of landing this and the need to backport it. Huge thanks to
everyone who helped out here, and everyone at Intel who helped out in
discussions about how to craft this. Also, credit goes to Paul Turner (at
Google, but not an LLVM contributor) for much of the underlying retpoline
design.
Reviewers: echristo, rnk, ruiu, craig.topper, DavidKreitzer
Subscribers: sanjoy, emaste, mcrosier, mgorny, mehdi_amini, hiraditya, llvm-commits
Differential Revision: https://reviews.llvm.org/D41723
llvm-svn: 323155
Summary:
This patch adds an implementation of targetShrinkDemandedConstant that tries to keep shrinkdemandedbits from removing bits that would otherwise have been recognized as a movzx.
We still need a follow patch to stop moving ands across srl if the and could be represented as a movzx before the shift but not after. I think this should help with some of the cases that D42088 ended up removing during isel.
Reviewers: spatel, RKSimon
Reviewed By: spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D42265
llvm-svn: 323048
Currently we infer the scale at isel time by analyzing whether the base is a constant 0 or not. If it is we assume scale is 1, else we take it from the element size of the pass thru or stored value. This seems a little weird and I think it makes more sense to make it explicit in the DAG rather than doing tricky things in the backend.
Most of this patch is just making sure we copy the scale around everywhere.
Differential Revision: https://reviews.llvm.org/D40055
llvm-svn: 322210
This is the last step needed to fix PR33325:
https://bugs.llvm.org/show_bug.cgi?id=33325
We're trading branch and compares for loads and logic ops.
This makes the code smaller and hopefully faster in most cases.
The 24-byte test shows an interesting construct: we load the trailing scalar
elements into vector registers and generate the same pcmpeq+movmsk code that
we expected for a pair of full vector elements (see the 32- and 64-byte tests).
Differential Revision: https://reviews.llvm.org/D41714
llvm-svn: 321934
We end up using an i8 load via an isel pattern from v8i1 anyway. This just makes it more explicit. This seems to improve codgen in some cases and I'd like to kill off some of the load patterns.
llvm-svn: 321598
The patch originally broke Chromium (crbug.com/791714) due to its failing to
specify that the new pseudo instructions clobber EFLAGS. This commit fixes
that.
> Summary: This strengthens the guard and matches MSVC.
>
> Reviewers: hans, etienneb
>
> Subscribers: hiraditya, JDevlieghere, vlad.tsyrklevich, llvm-commits
>
> Differential Revision: https://reviews.llvm.org/D40622
llvm-svn: 319824
Summary:
These instructions zero the non-scalar part of the lower 128-bits which makes them different than the FMA3 instructions which pass through the non-scalar part of the lower 128-bits.
I've only added fmadd because we should be able to derive all other variants using operand negation in the intrinsic header like we do for AVX512.
I think there are still some missed negate folding opportunities with the FMA4 instructions in light of this behavior difference that I hadn't noticed before.
I've split the tests so that we can use different intrinsics for scalar testing between the two. I just copied the tests split the RUN lines and changed out the scalar intrinsics.
fma4-fneg-combine.ll is a new test to make sure we negate the fma4 intrinsics correctly though there are a couple TODOs in it.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D39851
llvm-svn: 318984
(V)PHMINPOSUW determines the UMIN element in an v8i16 input, with suitable bit flipping it can also be used for SMAX/SMIN/UMAX cases as well.
This patch matches vXi16 SMAX/SMIN/UMAX/UMIN horizontal reductions and reduces the input down to a v8i16 vector before calling (V)PHMINPOSUW.
A later patch will use this for v16i8 reductions as well (PR32841).
Differential Revision: https://reviews.llvm.org/D39729
llvm-svn: 318917
This patch reverts change to X86TargetLowering::getScalarShiftAmountTy in
rL318727 and move the logic to DAGTypeLegalizer::SplitInteger.
The reason is that getScalarShiftAmountTy returns a shift amount type that
is suitable for common use cases in CodeGen. DAGTypeLegalizer::SplitInteger
is a rare situation which requires a shift amount type larger than what
getScalarShiftAmountTy. In this case, it is more reasonable to do special
handling of shift amount type in DAGTypeLegalizer::SplitInteger only. If
similar situations arises the logic may be moved to a separate function.
Differential Revision: https://reviews.llvm.org/D40320
llvm-svn: 318890
This makes the fact that X86 needs an explicit mask output not part of the type constraint for the ISD::MSCATTER.
This also gives the X86ISD::MGATHER/MSCATTER nodes a common base class simplifying the address selection code in X86ISelDAGToDAG.cpp
llvm-svn: 318823
DAGTypeLegalizer::SplitInteger uses default pointer size as shift amount constant type,
which causes less performant ISA in amdgcn---amdgiz target since the default pointer
type is i64 whereas the desired shift amount type is i32.
This patch fixes that by using TLI.getScalarShiftAmountTy in DAGTypeLegalizer::SplitInteger.
The X86 change is necessary since splitting i512 requires shifting amount of 256, which
cannot be held by i8.
Differential Revision: https://reviews.llvm.org/D40148
llvm-svn: 318727
All these headers already depend on CodeGen headers so moving them into
CodeGen fixes the layering (since CodeGen depends on Target, not the
other way around).
llvm-svn: 318490
The VRNDSCALE instructions implement a superset of the (V)ROUND instructions. They are equivalent if the upper 4-bits of the immediate are 0.
This patch lowers the legacy intrinsics to the VRNDSCALE ISD node and masks the upper bits of the immediate to 0. This allows us to take advantage of the larger register encoding space.
We should maybe consider converting VRNDSCALE back to VROUND in the EVEX to VEX pass if the extended registers are not being used.
I notice some load folding opportunities being missed for the VRNDSCALESS/SD instructions that I'll try to fix in future patches.
llvm-svn: 318008
I want to reuse the VRNDSCALE node for the legacy SSE rounding intrinsics so that those intrinsics can use EVEX instructions. All of these nodes share tablegen multiclasses so I split them all so that they all remain similar in their implementations.
llvm-svn: 318007
Next step is to use them for the legacy FMA scalar intrinsics as well. This will enable the legacy intrinsics to use EVEX encoded opcodes and the extended registers.
llvm-svn: 317453
Summary:
AVX512 added RCP14 and RSQRT instructions which improve accuracy over the legacy RCP and RSQRT instruction, but not enough accuracy to remove the need for a Newton Raphson refinement.
Currently we use these new instructions for the legacy packed SSE instrinics, but not the scalar instrinsics. And we use it for fast math optimization of division and reciprocal sqrt.
I think switching the legacy instrinsics maybe surprising to the user since it changes the answer based on which processor you're using regardless of any fastmath settings. It's also weird that we did something different between scalar and packed.
As far at the reciprocal estimation, I think it creates unnecessary deltas in our output behavior (and prevents EVEX->VEX). A little playing around with gcc and icc and godbolt suggest they don't change which instructions they use here.
This patch adds new X86ISD nodes for the RCP14/RSQRT14 and uses those for the new intrinsics. Leaving the old intrinsics to use the old instructions.
Going forward I think our focus should be on
-Supporting 512-bit vectors, which will have to use the RCP14/RSQRT14.
-Using RSQRT28/RCP28 to remove the Newton Raphson step on processors with AVX512ER
-Supporting double precision.
Reviewers: zvi, DavidKreitzer, RKSimon
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D39583
llvm-svn: 317413
Summary:
INC/DEC don't update the carry flag so we need to make sure we don't try to use it.
This patch introduces new X86ISD opcodes for locked INC/DEC. Teaches lowerAtomicArithWithLOCK to emit these nodes if INC/DEC is not slow or the function is being optimized for size. An additional flag is added that allows the INC/DEC to be disabled if the caller determines that the carry flag is being requested.
The test_sub_1_cmp_1_setcc_ugt test is currently showing this bug. The other test case changes are recovering cases that were regressed in r316860.
This should fully fix PR35068 finishing the fix started in r316860.
Reviewers: RKSimon, zvi, spatel
Reviewed By: zvi
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D39411
llvm-svn: 316913
This makes sure the LSDA pointer isn't truncated to 32 bit.
Make LowerINTRINSIC_WO_CHAIN a member function instead of a static
function, so that it can use the getGlobalWrapperKind method.
This solves the second half of the issues mentioned in PR34720.
Differential Revision: https://reviews.llvm.org/D38343
llvm-svn: 314767
The XOP rotations act as ROTL with +ve values and ROTR with -ve values, which means that we can treat them all as ROTL with unsigned modulo. We already check that we're only trying to lower as ROTL for XOP rotations.
Differential Revision: https://reviews.llvm.org/D37949
llvm-svn: 314207
This required changing the ISD opcode for these instructions to have the commutable operands first and the addend last. This way tablegen can autogenerate the additional patterns for us.
llvm-svn: 314083
This allows vector-sized store merging of constants in DAGCombiner using the existing code in MergeConsecutiveStores().
All of the twisted logic that decides exactly what vector operations are legal and fast for each particular CPU are
handled separately in there using the appropriate hooks.
For the motivating tests in merge-store-constants.ll, we already produce the same vector code in IR via the SLP vectorizer.
So this is just providing a backend backstop for code that doesn't go through that pass (-O1). More details in PR24449:
https://bugs.llvm.org/show_bug.cgi?id=24449 (this change should be the last step to resolve that bug)
Differential Revision: https://reviews.llvm.org/D37451
llvm-svn: 313458
Recognizing this pattern during DAG combine hides information about the 'and' and the shift from other combines. I think it should be recognized at isel so its as late as possible. But it can't be done with table based isel because you need to be able to look at both immediates. This patch moves it to custom isel in X86ISelDAGToDAG.cpp.
This does break a couple tests in tbm_patterns because we are now emitting an and_flag node or (cmp and, 0) that we dont' recognize yet. We already had this problem for several other TBM patterns so I think this fine and we can address of them together.
I've also fixed a bug where the combine to BEXTR was preventing us from using a trick of zero extending AH to handle extracts of bits 15:8. We might still want to use BEXTR if it enables load folding. But honestly I hope we narrowed the load instead before got to isel.
I think we should probably also support matching BEXTR from (srl/srl (and mask << C), C). But that should be a different patch.
Differential Revision: https://reviews.llvm.org/D37592
llvm-svn: 313054
Summary:
r275950 added support for turning (trunc (X >> N) to i1) into BT(X, N). But that's no longer necessary now that i1 isn't legal.
This patch removes the support for that, but preserves some of the refactorings done in that commit.
Reviewers: guyblank, RKSimon, spatel, zvi
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D37673
llvm-svn: 312925
This goes back to a discussion about IR canonicalization. We'd like to preserve and convert
more IR to 'select' than we currently do because that's likely the best choice in IR:
http://lists.llvm.org/pipermail/llvm-dev/2016-September/105335.html
...but that's often not true for codegen, so we need to account for this pattern coming in
to the backend and transform it to better DAG ops.
Steps in this patch:
1. Add an EVT param to the existing convertSelectOfConstantsToMath() TLI hook to more finely
enable this transform. Other targets will probably want that anyway to distinguish scalars
from vectors. We're using that here to exclude AVX512 targets, but it may not be necessary.
2. Convert a vselect to ext+add. This eliminates a constant load/materialization, and the
vector ext is often free.
Implementing a more general fold using xor+and can be a follow-up for targets that don't have
a legal vselect. It's also possible that we can remove the TLI hook for the special case fold
implemented here because we're eliminating a constant, but it needs to be tested on other
targets.
Differential Revision: https://reviews.llvm.org/D36840
llvm-svn: 311731
There's no reason to have a target specific node with the same semantics as a target independent opcode.
This should simplify D36335 so that it doesn't need to touch X86ISelDAGToDAG.cpp
Differential Revision: https://reviews.llvm.org/D36983
llvm-svn: 311568
Summary:
Without the SrcVT its hard to know what is really being asked for. For example if your target has 128, 256, and 512 bit vectors. Maybe extracting 128 from 256 is cheap, but maybe extracting 128 from 512 is not.
For x86 we do support extracting a quarter of a 512-bit register. But for i1 vectors we don't have isel patterns for extracting arbitrary pieces. So we need this to have a correct implementation of isExtractSubvectorCheap for mask vectors.
Reviewers: RKSimon, zvi, efriedma
Reviewed By: RKSimon
Subscribers: aemerson, javed.absar, kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D36649
llvm-svn: 310793
Move store merge to happen after intrinsic lowering to allow lowered
stores to be merged.
Some regressions due in MergeConsecutiveStores to missing
insert_subvector that are addressed in follow up patch.
Reviewers: craig.topper, efriedma, RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D34559
llvm-svn: 310710
Changing mask argument type from const SmallVectorImpl<int>& to
ArrayRef<int>.
This came up in D35700 where a mask is received as an ArrayRef<int> and
we want to pass it to TargetLowering::isShuffleMaskLegal().
Also saves a few lines of code.
llvm-svn: 309085
splitting patch D34601 into two part. This part changes the location of two functions.
The second part will be based on that patch. This was requested by @RKSimon.
Reviewers:
1. dorit
2. Farhana
3. RKSimon
4. guyblank
5. DavidKreitzer
llvm-svn: 309084
This patch makes LSR generate better code for SystemZ in the cases of memory
intrinsics, Load->Store pairs or comparison of immediate with memory.
In order to achieve this, the following common code changes were made:
* New TTI hook: LSRWithInstrQueries(), which defaults to false. Controls if
LSR should do instruction-based addressing evaluations by calling
isLegalAddressingMode() with the Instruction pointers.
* In LoopStrengthReduce: handle address operands of memset, memmove and memcpy
as address uses, and call isFoldableMemAccessOffset() for any LSRUse::Address,
not just loads or stores.
SystemZ changes:
* isLSRCostLess() implemented with Insns first, and without ImmCost.
* New function supportedAddressingMode() that is a helper for TTI methods
looking at Instructions passed via pointers.
Review: Ulrich Weigand, Quentin Colombet
https://reviews.llvm.org/D35262https://reviews.llvm.org/D35049
llvm-svn: 308729
Summary:
Add a combine for creating a truncate to replace a build_vector composed of extracts with
indices that form a stride-2^N series.
Example:
v8i32 V = ...
v4i32 build_vector((extract_elt V, 0), (extract_elt V, 2), (extract_elt V, 4), (extract_elt V, 6))
-->
v4i32 truncate (bitcast V to v4i64)
Related discussion in llvm-dev about canonicalizing shuffles to
truncates in LLVM IR:
http://lists.llvm.org/pipermail/llvm-dev/2017-January/108936.html.
Reviewers: spatel, RKSimon, efriedma, igorb, craig.topper, wolfgangp, delena
Reviewed By: delena
Subscribers: guyblank, delena, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D34077
llvm-svn: 307036