I removed this in r316797 because the coverage report showed no coverage and I thought it should have been handled by the auto generated table. I now see that there is code that bypasses the table if the shift amount is out of bounds.
This adds back the code. We'll codegen out of bounds i8 shifts to effectively (amount & 0x1f). The 0x1f is a strange quirk of x86 that shift amounts are always masked to 5-bits(except 64-bits). So if the masked value is still out bounds the result will be 0.
Fixes PR36731.
llvm-svn: 327540
I had to modify the bswap recognition to allow unshrunk masks to make this work.
Fixes PR36689.
Differential Revision: https://reviews.llvm.org/D44442
llvm-svn: 327530
Support G_LSHR/G_ASHR/G_SHL. We have 3 variance for
shift instructions : shift gpr, shift imm, shift 1.
Currently GlobalIsel TableGen generate patterns for
shift imm and shift 1, but with shiftCount i8.
In G_LSHR/G_ASHR/G_SHL like LLVM-IR both arguments
has the same type, so for now only shift i8 can use
auto generated TableGen patterns.
The support of G_SHL/G_ASHR enables tryCombineSExt
from LegalizationArtifactCombiner.h to hit, which
results in different legalization for the following tests:
LLVM :: CodeGen/X86/GlobalISel/ext-x86-64.ll
LLVM :: CodeGen/X86/GlobalISel/gep.ll
LLVM :: CodeGen/X86/GlobalISel/legalize-ext-x86-64.mir
-; X64-NEXT: movsbl %dil, %eax
+; X64-NEXT: movl $24, %ecx
+; X64-NEXT: # kill: def $cl killed $ecx
+; X64-NEXT: shll %cl, %edi
+; X64-NEXT: movl $24, %ecx
+; X64-NEXT: # kill: def $cl killed $ecx
+; X64-NEXT: sarl %cl, %edi
+; X64-NEXT: movl %edi, %eax
..which is not optimal and should be addressed later.
Rework of the patch by igorb
Reviewed By: igorb
Differential Revision: https://reviews.llvm.org/D44395
llvm-svn: 327499
We now only create recursive concats if we have more than two non-zero values. This keeps our subvector broadcast DAG combine functioning.
llvm-svn: 327457
This better able to detect undef and zeros pieces in the concat. Or cases when only one subvector is non-zero. This allows us to avoid silly things like double inserts into progressively larger undefs.
This still builds 512 bit concats of 128 bits by building up through 256 bits first. But I don't know if that's best.
We probably want to merge this with the vXi1 concat code since they are very similar.
llvm-svn: 327454
Summary: Unless you were intentionally avoiding this syntax? I saw you mentioned makeArrayRef in your commit that added SplitOpsAndApply.
Reviewers: RKSimon
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D44403
llvm-svn: 327418
MVT belongs to the CodeGen layer, but ShuffleDecode is used by the X86 InstPrinter which is part of the MC layer. This only worked because MVT is completely implemented in a header file with no other library dependencies.
Differential Revision: https://reviews.llvm.org/D44353
llvm-svn: 327292
We called MaskedValueIsZero with two different masks, but underneath that calls computeKnownBits before applying the mask. This means we compute the same known bits twice due to the two calls. Instead just call computeKnownBits directly and apply the two masks ourselves.
llvm-svn: 327251
64-bit MMX vector generation usually ends up lowering into SSE instructions before being spilled/reloaded as a MMX type.
This patch creates a MMX vector from MMX source values, taking the lowest element from each source and constructing broadcasts/build_vectors with direct calls to the MMX PUNPCKL/PSHUFW intrinsics.
We're missing a few consecutive load combines that could be handled in a future patch if that would be useful - my main interest here is just avoiding a lot of the MMX/SSE crossover.
Differential Revision: https://reviews.llvm.org/D43618
llvm-svn: 327247
Same as the VPERMILPS/VPERMILPD approach for v8f32/v4f64 cases, rely on PSHUFB using bits[3:0] for indexing - we can ignore the sign bit (zero element) as those index vector values are considered undefined. The select between the lo/hi permute results based on the index size.
llvm-svn: 327242
As VPERMILPS/VPERMILPD only selects elements based on the bits[1:0]/bit[1] then we can permute both the (repeated) lo/hi 128-bit vectors in each case and then select between these results based on whether the index was for for lo/hi.
For v4i64/v4f64 this avoids some rather nasty v4i64 multiples on the AVX2 implementation, which seems to be worse than the extra port5 pressure from the additional shuffles/blends.
llvm-svn: 327239
Helper function to insert a subvector into the bottom elements of a larger zero/undef vector with the same scalar type.
I've converted a couple of INSERT_SUBVECTOR calls to use it, there are plenty more although in some cases I was worried it might make the code more ambiguous.
llvm-svn: 327236
Summary:
There are 3 different operand orders for FMA instructions so figuring out the exact operation being performed requires a lot of thought.
This patch adds a comment to the end of the assembly line to print the exact operation.
I think I've got all the instructions in here except the ones with builtin rounding.
I didn't update all tests, but I assume we can get them as we regenerate tests in the future.
Reviewers: spatel, v_klochkov, RKSimon
Reviewed By: spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D44345
llvm-svn: 327225
X86InstComments.h is used by tools that only have the MC layer. We shouldn't be importing a file from CodeGen into this.
X86InstrInfo.h isn't a great place, but I couldn't find a better one.
llvm-svn: 327202
r327171 "Improve Dependency analysis when doing multi-node Instruction Selection"
r328170 "[DAG] Enforce stricter NodeId invariant during Instruction selection"
Reverting patch as NodeId invariant change is causing pathological
increases in compile time on PPC
llvm-svn: 327197
Relanding after fixing NodeId Invariant.
Cleanup cycle/validity checks in ISel (IsLegalToFold,
HandleMergeInputChains) and X86 (isFusableLoadOpStore). Now do a full
search for cycles / dependencies pruning the search when topological
property of NodeId allows.
As part of this propogate the NodeId-based cutoffs to narrow
hasPreprocessorHelper searches.
Reviewers: craig.topper, bogner
Subscribers: llvm-commits, hiraditya
Differential Revision: https://reviews.llvm.org/D41293
llvm-svn: 327171
Instruction Selection makes use of the topological ordering of nodes
by node id (a node's operands have smaller node id than it) when doing
cycle detection. During selection we may violate this property as a
selection of multiple nodes may induce a use dependence (and thus a
node id restriction) between two unrelated nodes. If a selected node
has an unselected successor this may allow us to miss a cycle in
detection an invalid selection.
This patch fixes this by marking all unselected successors of a
selected node have negated node id. We avoid pruning on such negative
ids but still can reconstruct the original id for pruning.
In-tree targets have been updated to replace DAG-level replacements
with ISel-level ones which enforce this property.
This preemptively fixes PR36312 before triggering commit r324359 relands
Reviewers: craig.topper, bogner, jyknight
Subscribers: arsenm, nhaehnle, javed.absar, llvm-commits, hiraditya
Differential Revision: https://reviews.llvm.org/D43198
llvm-svn: 327170
The retpoline mitigation for variant 2 of CVE-2017-5715 inhibits the
branch predictor, and as a result it can lead to a measurable loss of
performance. We can reduce the performance impact of retpolined virtual
calls by replacing them with a special construct known as a branch
funnel, which is an instruction sequence that implements virtual calls
to a set of known targets using a binary tree of direct branches. This
allows the processor to speculately execute valid implementations of the
virtual function without allowing for speculative execution of of calls
to arbitrary addresses.
This patch extends the whole-program devirtualization pass to replace
certain virtual calls with calls to branch funnels, which are
represented using a new llvm.icall.jumptable intrinsic. It also extends
the LowerTypeTests pass to recognize the new intrinsic, generate code
for the branch funnels (x86_64 only for now) and lay out virtual tables
as required for each branch funnel.
The implementation supports full LTO as well as ThinLTO, and extends the
ThinLTO summary format used for whole-program devirtualization to
support branch funnels.
For more details see RFC:
http://lists.llvm.org/pipermail/llvm-dev/2018-January/120672.html
Differential Revision: https://reviews.llvm.org/D42453
llvm-svn: 327163
The code to match and produce more x86 vector blends was enabled for all
architectures even though the transform may pessimize the code for other
architectures that do not provide a vector blend instruction.
Added an aarch64 testcase to check that a VZIP instruction is generated instead
of byte movs.
Differential Revision: https://reviews.llvm.org/D44118
llvm-svn: 327132
Previously we unpacked the even bytes of each input into the high byte of 16-bit elements then did an v8i16 arithmetic shift right by 8 bits to fill the upper bits of each word with sign bits. Then we did the v8i16 multiply and then masked to zero the upper 8-bits of each result. The similar was done for all the odd bytes. The results are then packed together with packuswb
Since we are masking each multiply result element to 8-bits, and those 8-bits are determined only by the lower 8-bits of each of the inputs, we don't need to fill the upper bits with sign bits. So we can just unpack into the low byte of each element and treat the upper bits as garbage. This is what gcc also does.
Differential Revision: https://reviews.llvm.org/D44267
llvm-svn: 327093
This instruction can be thought of as reading either the even elements of a vXi32 input or the lower half of each element of a vXi64 input. We currently use the vXi32 interpretation, but vXi64 matches better with its broadcast behavior in EVEX.
I'm looking at moving MULDQ/MULUDQ creation to a DAG combine so we can do it when AVX512DQ is enabled without having to go through Custom lowering. But in some of the test cases we failed to use a broadcast load due to the size difference. This should help with that.
I'm also wondering if we can model these instructions in native IR and remove the intrinsics and I think using a vXi64 type will work better with that.
llvm-svn: 326991
These patterns weren't checking the alignment of the load, but were using the aligned instructions. This will cause a GP fault if the data isn't aligned.
I believe these were introduced in r312450.
llvm-svn: 326967
The v8i32 conversion on AVX1 targets was only working after LowerMUL splits 256-bit vectors.
While I was there I've also made it so we don't have to check for AVX2 and BWI directly and instead just ask if the type is legal.
Differential Revision: https://reviews.llvm.org/D44190
llvm-svn: 326917
Summary:
Only IMUL16rri uses an extra P0156. IMUL32* and IMUL16rr only use
P1.
This was computed using https://github.com/google/EXEgesis/blob/master/exegesis/tools/compute_itineraries.cc
This can easily be validated by running perf on the following code:
```
int main(int argc, char**argv) {
int a = argc;
int b = argc;
int c = argc;
int d = argc;
for (int i = 0; i < LOOP_ITERATIONS; ++i) {
asm volatile(
R"(
.rept 10000
imull $0x2, %%edx, %%eax
imull $0x2, %%ecx, %%ebx
imull $0x2, %%eax, %%edx
imull $0x2, %%ebx, %%ecx
.endr
)"
: "+a"(a), "+b"(b), "+c"(c), "+d"(d)
:
:);
}
return a+b+c+d;
}
```
-> test.cc
perf stat -x, -e cycles --pfm-events=uops_executed_port:port_0:u,uops_executed_port:port_1:u,uops_executed_port:port_2:u,uops_executed_port:port_3:u,uops_executed_port:port_4:u,uops_executed_port:port_5:u,uops_executed_port:port_6:u,uops_executed_port:port_7:u test
Reviewers: craig.topper, RKSimon, gadi.haber
Subscribers: llvm-commits, gchatelet, chandlerc
Differential Revision: https://reviews.llvm.org/D43460
llvm-svn: 326877
The code checks Level == AfterLegalizeDAG which is the fourth and last of the possible DAG combine stages that we have.
There is a Level called AfterLegalVectorOps, but that's the third DAG combine and it doesn't always run.
A function called isAfterLegalVectorOps should imply it returns true in either of the DAG combines that runs after the legalize vector ops stage, but that's not what this function does.
llvm-svn: 326832
EAX can turn out to be alive here, when shrink wrapping is done
(which is allowed when using dwarf exceptions, contrary to the
normal case with WinCFI).
This fixes PR36487.
Differential Revision: https://reviews.llvm.org/D43968
llvm-svn: 326764
Almost none of these usages were FP specific. And we had no clear guideliness on when to use hasAVX vs hasFP256.
I might also remove hasInt256 too since its an alias for hasAVX2.
llvm-svn: 326682
rL322525 - mmx zero constant support
rL322553 - mmx i32 zero extended value
rL326497 - mmx i64 general constant handling
Not all constants are folded, we generate some on the GPRs (similar to SSE build vector) where appropriate
llvm-svn: 326673
We were previously doing this with isel patterns. Moving it to op legalization gives us chance to see the required bitcast earlier. And it lets us remove some isel patterns.
llvm-svn: 326669
These instructions are double-pumped, split into 2 128-bit ops and then passing through either FPU pipe.
Found while testing llvm-mca (D43951)
llvm-svn: 326597
64-bit MMX constant generation usually ends up lowering into SSE instructions before being spilled/reloaded as a MMX type.
This patch bitcasts the constant to a double value to allow correct loading directly to the MMX register.
I've added MMX constant asm comment support to improve testing, it's better to always print the double values as hex constants as MMX is mainly an integer unit (and even with 3DNow! its just floats).
Differential Revision: https://reviews.llvm.org/D43616
llvm-svn: 326497
Emulated TLS is enabled by llc flag -emulated-tls,
which is passed by clang driver.
When llc is called explicitly or from other drivers like LTO,
missing -emulated-tls flag would generate wrong TLS code for targets
that supports only this mode.
Now use useEmulatedTLS() instead of Options.EmulatedTLS to decide whether
emulated TLS code should be generated.
Unit tests are modified to run with and without the -emulated-tls flag.
Differential Revision: https://reviews.llvm.org/D42999
llvm-svn: 326341
An extract_element where the result type is larger than the scalar element type is semantically an any_extend of from the scalar element type to the result type. If we expect zeroes in the upper bits of the i8/i32 we need to mae sure those zeroes are explicit in the DAG.
For these cases the best way to accomplish this is use an insert_subvector to pad zeroes to the upper bits of the v1i1 first. We extend to either v16i1(for i32) or v8i1(for i8). Then bitcast that to a scalar and finish with a zero_extend up to i32 if necessary. We can't extend past v16i1 because that's the largest mask size on KNL. But isel is smarter enough to know that a zext of a bitcast from v16i1 to i16 can use a KMOVW instruction. The insert_subvectors will be dropped during isel because we can determine that the producing instruction already zeroed the upper bits of the k-register.
llvm-svn: 326308
While the description for the instruction does mention OR, its talking about how the individual classification test results are ORed together.
The incoming mask is used as a zeroing write mask. If the bit is 1 the classification is written to the output. The bit is 0 the output is 0. This equivalent to an AND.
Here is pseudocode from the intrinsics guide
FOR j := 0 to 1
i := j*64
IF k1[j]
k[j] := CheckFPClass_FP64(a[i+63:i], imm8[7:0])
ELSE
k[j] := 0
FI
ENDFOR
k[MAX:2] := 0
llvm-svn: 326306
These tables add 3000 lines to X86InstrInfo.cpp. And if we ever manage to auto generate them they'll be a separate file anyway.
Differential Revision: https://reviews.llvm.org/D43806
llvm-svn: 326225
Currently we assert that only non target specific opcodes can have
missing RegisterClass constraints in the MCDesc. The backend can have
instructions with register operands but don't have RegisterClass
constraints (say using unknown_class) in which case the instruction
defining the register will constrain it.
Change the assert to only fire if a def has no regclass.
https://reviews.llvm.org/D43409
llvm-svn: 326142
Agner's tables indicate that for SSE42+ targets (Core2 and later) we can reduce the FADD/FSUB/FMUL costs down to 1, which should fix the Himeno benchmark.
Note: the AVX512 FDIV costs look rather dodgy, but this isn't part of this patch.
Differential Revision: https://reviews.llvm.org/D43733
llvm-svn: 326133
There's still some shortcoming in our ability to combine binops of constants with different sizes separated by an extend. I'll try to look at that next.
llvm-svn: 326128
Summary:
We have an early DAG combine to turn these patterns into MOVMSK, but that combine doesn't work if the vXi1 type has more elements than the widest legal vXi8 type. Type legalization will eventually split it down to v16i1 or v32i1 and then the bitcast gets legalized to a truncstore and a scalar load. The truncstore will get lowered to a series of extracts and bit math.
This patch adds a custom legalization to use a sign extend and MOVMSK instead. This prevents the eventual scalarization.
Reviewers: spatel, RKSimon, zvi
Reviewed By: RKSimon
Subscribers: mgorny, llvm-commits
Differential Revision: https://reviews.llvm.org/D43593
llvm-svn: 326119
This code seemed to try to widen to 128, 256, or 512 bit vectors, but we only create X86ISD::AVG with a power of 2 number of elements. This means the only nodes that need to be legalized are less than 128-bits and need to be widened up to 128 bits.
llvm-svn: 326064
Which types are considered 'simple' is a function of the requirements of all targets that LLVM supports. That shouldn't directly affect what types we are able to handle. The remainder of this code checks that the number of elements is a power of 2 and takes care of splitting down to a legal size.
llvm-svn: 326063
Our UMIN/UMAX, vector truncation and shuffle combining is good enough to efficiently handle v8i64 with the number of leading zeros that are necessary for PSUBUS.
llvm-svn: 326034
Now that UMIN etc are Legal/Custom for SSE2+, we can efficiently match SUBUS v8i32 cases from SSSE3 which can perform efficient truncation with PSHUFB.
llvm-svn: 326033
This portion can be matched by other patterns. We don't need it to make the larger pattern valid. It's sufficient to have a v1i1 mask input without caring where it came from.
llvm-svn: 325999
This has the advantage of making release only builds more warning
free and there's no need to make this routine a class function if
it isn't using class members anyhow.
llvm-svn: 325967
These can be created by type legalization promoting the inputs to select to match scalar boolean contents.
We were trying to pattern match them away during isel, but its better to just remove them from the DAG.
I've cleaned up some patterns to not check for this 'and' anymore. But I suspect this has also opened up opportunities for pattern removal.
llvm-svn: 325949
The test changes you can see are related to the changes in ReplaceNodeResults. Though shuffle-vs-trunc-512.ll does have a test that exercises the code in LowerBITCAST. Looks like the test output didn't change because DAG combining is able to clean up the resulting type legalization. Adding the custom hook just makes type legalization work less hard.
Differential Revision: https://reviews.llvm.org/D43447
llvm-svn: 325933
Summary:
Add a target option AllowRegisterRenaming that is used to opt in to
post-register-allocation renaming of registers. This is set to 0 by
default, which causes the hasExtraSrcRegAllocReq/hasExtraDstRegAllocReq
fields of all opcodes to be set to 1, causing
MachineOperand::isRenamable to always return false.
Set the AllowRegisterRenaming flag to 1 for all in-tree targets that
have lit tests that were effected by enabling COPY forwarding in
MachineCopyPropagation (AArch64, AMDGPU, ARM, Hexagon, Mips, PowerPC,
RISCV, Sparc, SystemZ and X86).
Add some more comments describing the semantics of the
MachineOperand::isRenamable function and how it is set and maintained.
Change isRenamable to check the operand's opcode
hasExtraSrcRegAllocReq/hasExtraDstRegAllocReq bit directly instead of
relying on it being consistently reflected in the IsRenamable bit
setting.
Clear the IsRenamable bit when changing an operand's register value.
Remove target code that was clearing the IsRenamable bit when changing
registers/opcodes now that this is done conservatively by default.
Change setting of hasExtraSrcRegAllocReq in AMDGPU target to be done in
one place covering all opcodes that have constant pipe read limit
restrictions.
Reviewers: qcolombet, MatzeB
Subscribers: aemerson, arsenm, jyknight, mcrosier, sdardis, nhaehnle, javed.absar, tpr, arichardson, kristof.beyls, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, jordy.potman.lists, apazos, sabuasal, niosHD, escha, nemanjai, llvm-commits
Differential Revision: https://reviews.llvm.org/D43042
llvm-svn: 325931
We won't be able to fold the constant pool load, but its still better than materialing ones and xoring for the invert if we used PCMPEQ.
This will fix another regression from D42948.
llvm-svn: 325845
Previously this code overrode the flags and opcode used by the later code in LowerVSETCC. This makes the code difficult to read and follow.
This patch moves all the SUBUS code into its own function and makes it responsible for creating its own SDNodes on success.
Differential Revision: https://reviews.llvm.org/D43530
llvm-svn: 325827
Cannon Lake does not support CLWB, therefore it
does not include all features listed under SKX anymore.
Instead, enumerate all SKX features with the exception of CLWB.
Patch by Gabor Buella
Differential Revision: https://reviews.llvm.org/D43380
llvm-svn: 325654
SimplifyDemandedBits forces the demanded mask to all 1s if the node has multiple uses, unless the AssumeSingleUse flag is set.
So previously we were only really likely to simplify something if the condition had a single use. And on the off chance we did simplify with multiple uses the demanded mask being used was all ones so there was no reason to create a shrunkblend.
This patch now checks that the condition is only used by selects first, and then sets the AssumeSingleUse flag for the simplifcation. Then we convert the selects to shrunkblend, and finally replace condition.
Differential Revision: https://reviews.llvm.org/D43446
llvm-svn: 325604
This allows us to avoid an opsize prefix. And forcing some move immediates to i32 avoids a length changing prefix on those instructions.
This mostly replaces the existing combine we had for zext/sext+cmov of constants. I left in a case for sign extending a 32 bit cmov of constants to 64 bits.
Differential Revision: https://reviews.llvm.org/D43327
llvm-svn: 325601
For instructions like call foo and jmp foo patch changes
relocation produced from R_X86_64_PC32 to R_X86_64_PLT32.
Relocation can be used as a marker for 32-bit PC-relative branches.
Linker will reduce PLT32 relocation to PC32 if function is defined locally.
Differential revision: https://reviews.llvm.org/D43383
llvm-svn: 325569
The 128 and 256 bit versions were already not used by clang. This adds an equivalent unmasked 512 bit version. Then autoupgrades all sizes to use unmasked intrinsics plus select.
llvm-svn: 325559
Previously we used vptestmd, but the scheduling data for SKX says vpmovq2m/vpmovd2m is lower latency. We already used vpmovb2m/vpmovw2m for byte/word truncates. So this is more consistent anyway.
llvm-svn: 325534
We swapped the operands and used setle, but I don't see any reason to do that. I think this is a holdover from SSE where we swap and the invert to use pcmpgt. But with AVX512 we don't want an invert so we won't use pcmpgt. So there's no need to swap.
llvm-svn: 325527
Canonicalize EQ/NE PCMPM to have build vector all zeros on the RHS so we don't have to pattern match it in both locations. This significantly reduces the number of isel patterns needed since we also had to multiply it out with loads being in either operand of the 'and' input node and in the 'and' masking node.
This removes over 24000 bytes from the isel table.
llvm-svn: 325526
We're accidentally checking that the same node is a constant twice instead of checking the other node.
This isn't a functional problem since we didn't do anything below that explicitly requires constants. It just means we may have introduced a sign_extend or zero_extend that won't fold out.
llvm-svn: 325469
Previously we used the immediate encoding if the load was in operand 0 and the short encoding if the load was in operand 1.
This added an insane number of bytes to the size of the isel table. I'm wondering if we should always use the immediate form during isel and change to the short form during emission. This would remove the need to pattern match every combination for both the immediate form and the short form during isel. We could do the same with vpcmpgt
llvm-svn: 325456
Sadly, r324359 caused at least PR36312. There is a patch out for review
but it seems to be taking a bit and we've already had these crashers in
tree for too long. We're hitting this PR in real code now and are
blocked on shipping new compilers as a consequence so I'm reverting us
back to green.
Sorry for the churn due to the stacked changes that I had to revert. =/
llvm-svn: 325420
Summary:
Currently we convert to shuffles during lowering. This moves it to DAG combine so hopefully we can get it done before type legalization has to extend the condition.
I believe in some cases we're creating SHRUNKBLENDs that end up with constant conditions because we see the extended on the condition and think its a dynamic selelect before DAG combine gets a chance to constant fold the extend. We could add combines to turn SHRUNKBLENDs with constant condition back to vselect. But it seemed like it might be better to just send them to shuffles as early as possible so they never get a chance to become SHRUNKBLENDs. This the reason some tests went from blends controlled by a constant pool load to just move.
Some of the constant pool entries changed because the sign_extend introduced by type legalization turned undef elements in select condition into 0s. While the select->shuffle used -1 in the shuffle mask. So now the shuffle lowering can do what it wants with them.
I'll remove the lowering code as a follow up. We might be able to simplify some of the pre-checks for SHRUNKBLEND as the FIXME there says.
Reviewers: spatel, RKSimon, efriedma, zvi, andreadb
Reviewed By: spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D43367
llvm-svn: 325417
Undef in select condition means we should pick the element from one side or the other. An undef in a shuffle mask means pick any element from either source or worse.
I suspect by the time we get here most of the undefs in a constant vector have been removed by other things, but doing this for safety.
llvm-svn: 325394
This seems to interfere with a target independent brcond combine that looks for the (srl (and X, C1), C2) pattern to enable TEST instructions. Once we flip, that combine doesn't fire and we end up exposing it to the X86 specific BT combine which causes us to emit a BT instruction. BT has lower throughput than TEST.
We could try to make the brcond combine aware of the alternate pattern, but since the flip was just a code size reduction and not likely to enable other combines, it seemed easier to just delay it until after lowering.
Differential Revision: https://reviews.llvm.org/D43201
llvm-svn: 325371
Summary:
This patch makes the decoder understand old AMD 3DNow!
instructions that have never been properly supported in the X86
disassembler, despite being supported in other subsystems. Hopefully
this should make the X86 decoder more complete with respect to binaries
containing legacy code.
Reviewers: craig.topper
Reviewed By: craig.topper
Subscribers: llvm-commits, maksfb, bruno
Differential Revision: https://reviews.llvm.org/D43311
llvm-svn: 325295
We already do this for 64-bit when it won't fit into a 64-bit AND/TEST's immediate field. This adds an additional qualifier to do it for any single bit constant larger than 8-bits under optsize
Differential Revision: https://reviews.llvm.org/D43346
llvm-svn: 325290
We can't fold a large immediate into a 64-bit operation. But if we know we're only operating on a single bit we can use the bit instructions.
For now only do this for optsize.
Differential Revision: https://reviews.llvm.org/D37418
llvm-svn: 325287
We can use PACKSS to saturate each stage of the chain: PACKSSDW down to [-32768,32767] and then PACKSSWB to [-128,127].
PACKUS is a little trickier and will be handled in a separate patch.
llvm-svn: 325235
The bound instruction does not have reversed operands in gas.
Fixes PR27653.
Patch by Maya Madhavan.
Differential Revision: https://reviews.llvm.org/D43243
llvm-svn: 325178
Try to keep PACK*SDW/PACK*SWB as wide as possible, this helps ComputeNumSignBits as it can only peek through bitcasts to wider types, pre-AVX2 codegen was already doing this as it could peek through bitcasts/subvectors more easily than AVX2 could through shuffles.
This shouldn't affect existing results as calls to truncateVectorWithPACK ensure we have enough sign bits to pack to the same value, but it should make it possible to use truncateVectorWithPACK chains to perform saturation in combineTruncateWithSat with a future patch.
llvm-svn: 325149
The prologue-end line record must be emitted after the last
instruction that is part of the function frame setup code and before
the instruction that marks the beginning of the function body.
Patch by Carlos Alberto Enciso!
Differential Revision: https://reviews.llvm.org/D41762
llvm-svn: 325143
If a load follows a store and reloads data that the store has written to memory, Intel microarchitectures can in many cases forward the data directly from the store to the load, This "store forwarding" saves cycles by enabling the load to directly obtain the data instead of accessing the data from cache or memory.
A "store forward block" occurs in cases that a store cannot be forwarded to the load. The most typical case of store forward block on Intel Core microarchiticutre that a small store cannot be forwarded to a large load.
The estimated penalty for a store forward block is ~13 cycles.
This pass tries to recognize and handle cases where "store forward block" is created by the compiler when lowering memcpy calls to a sequence
of a load and a store.
The pass currently only handles cases where memcpy is lowered to XMM/YMM registers, it tries to break the memcpy into smaller copies.
breaking the memcpy should be possible since there is no atomicity guarantee for loads and stores to XMM/YMM.
Change-Id: Ic41aa9ade6512e0478db66e07e2fde41b4fb35f9
llvm-svn: 325128
While the AVX512 VTRUNCS/VTRUNCUS instructions require legal types, truncateVectorWithPACK handles cases with multiples of legal types through splitting/concatenation. So we just need to ensure that the src/dst scalar types are correct and leave truncateVectorWithPACK to handle the rest of it.
llvm-svn: 325127
Summary:
Instead of solving the hard problem of how to pass the callee to the indirect
jump thunk without a register, just use a CSR. At a call boundary, there's
nothing stopping us from using a CSR to hold the callee as long as we save and
restore it in the prologue.
Also, add tests for this mregparm=3 case. I wrote execution tests for
__llvm_retpoline_push, but they never got committed as lit tests, either
because I never rewrote them or because they got lost in merge conflicts.
Reviewers: chandlerc, dwmw2
Subscribers: javed.absar, kristof.beyls, hiraditya, llvm-commits
Differential Revision: https://reviews.llvm.org/D43214
llvm-svn: 325049
Expand existing SchedRW to encompass these like it did for the other memory offset movs - added comments to closing braces to keep track of def scopes.
We only tagged it with the itinerary class, so completeness checks were erroneously passed (PR35639).
llvm-svn: 324910
Tag AVX512 variants to match SSE/AVX originals.
We only tagged it with the itinerary class, so completeness checks were erroneously passed (PR35639).
llvm-svn: 324901
We only tagged it with the itinerary class, so completeness checks were erroneously passed (PR35639).
AMD targets can perform these a lot quicker than WriteMicrocoded so will need an override in the models.
llvm-svn: 324897
It asserts building Chromium; see PR36346.
(This also reverts the follow-up r324836.)
> If a load follows a store and reloads data that the store has written to memory, Intel microarchitectures can in many cases forward the data directly from the store to the load, This "store forwarding" saves cycles by enabling the load to directly obtain the data instead of accessing the data from cache or memory.
> A "store forward block" occurs in cases that a store cannot be forwarded to the load. The most typical case of store forward block on Intel Core microarchiticutre that a small store cannot be forwarded to a large load.
> The estimated penalty for a store forward block is ~13 cycles.
>
> This pass tries to recognize and handle cases where "store forward block" is created by the compiler when lowering memcpy calls to a sequence
> of a load and a store.
>
> The pass currently only handles cases where memcpy is lowered to XMM/YMM registers, it tries to break the memcpy into smaller copies.
> breaking the memcpy should be possible since there is no atomicity guarantee for loads and stores to XMM/YMM.
llvm-svn: 324887
I don't believe we ever create an X86ISD::SUB with a 0 constant which is what the TEST handling needs. The ternary operator at the end of this code shows up as only going one way in the llvm-cov report from the bots.
llvm-svn: 324865
ISD::ADD implies individual vector element addition with no carries between elements. But for a vXi1 type that would be the same as XOR. And we already turn ISD::ADD into ISD::XOR for all vXi1 types during lowering. So the ISD::ADD pattern would never be able to match anyway.
KADD is different, it adds the elements but also propagates a carry between them. This just a way of doing an add in k-register without bitcasting to the scalar domain. There's still no way to match the pattern, but at least its not obviously wrong.
llvm-svn: 324861
Previously we just emitted this as a MOV8rm which would likely get folded during the peephole pass anyway. This just makes it explicit earlier.
The gpr-to-mask.ll test changed because the kaddb instruction has no memory form.
llvm-svn: 324860
Summary:
Currently we only use min/max to help with ule/uge compares because it removes an invert of the result that would otherwise be needed. But we can also use it for ult/ugt compares if it will prevent the need for a sign bit flip needed to use pcmpgt at the cost of requiring an invert after the compare.
I also refactored the code so that the max/min code is self contained and does its own return instead of setting up a flag to manipulate the rest of the function's behavior.
Most of the test cases look ok with this. I did notice that we added instructions when one of the operands being sign flipped is a constant vector that we were able to constant fold the flip into.
I also noticed that sometimes the SSE min/max clobbers a register that is needed after the compare. This resulted in an extra move being inserted before the min/max to preserve the register. We could try to detect this and switch from min to max and change the compare operands to use the operand that gets reused in the compare.
Reviewers: spatel, RKSimon
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D42935
llvm-svn: 324842
This allows us to recognise more saturation patterns and also simplify some MINMAX codegen that was failing to combine CMPGE comparisons to a legal CMPGT.
Differential Revision: https://reviews.llvm.org/D43014
llvm-svn: 324837
If a load follows a store and reloads data that the store has written to memory, Intel microarchitectures can in many cases forward the data directly from the store to the load, This "store forwarding" saves cycles by enabling the load to directly obtain the data instead of accessing the data from cache or memory.
A "store forward block" occurs in cases that a store cannot be forwarded to the load. The most typical case of store forward block on Intel Core microarchiticutre that a small store cannot be forwarded to a large load.
The estimated penalty for a store forward block is ~13 cycles.
This pass tries to recognize and handle cases where "store forward block" is created by the compiler when lowering memcpy calls to a sequence
of a load and a store.
The pass currently only handles cases where memcpy is lowered to XMM/YMM registers, it tries to break the memcpy into smaller copies.
breaking the memcpy should be possible since there is no atomicity guarantee for loads and stores to XMM/YMM.
Change-Id: I620b6dc91583ad9a1444591e3ddc00dd25d81748
llvm-svn: 324835
This patch adds a new function attribute "required-vector-width" that can be set by the frontend to indicate the maximum vector width present in the original source code. The idea is that this would be set based on ABI requirements, intrinsics or explicit vector types being used, maybe simd pragmas, etc. The backend will then use this information to determine if its save to make 512-bit vectors illegal when the preference is for 256-bit vectors.
For code that has no vectors in it originally and only get vectors through the loop and slp vectorizers this allows us to generate code largely similar to our AVX2 only output while still enabling AVX512 features like mask registers and gather/scatter. The loop vectorizer doesn't always obey TTI and will create oversized vectors with the expectation the backend will legalize it. In order to avoid changing the vectorizer and potentially harm our AVX2 codegen this patch tries to make the legalizer behavior similar.
This is restricted to CPUs that support AVX512F and AVX512VL so that we have good fallback options to use 128 and 256-bit vectors and still get masking.
I've qualified every place I could find in X86ISelLowering.cpp and added tests cases for many of them with 2 different values for the attribute to see the codegen differences.
We still need to do frontend work for the attribute and teach the inliner how to merge it, etc. But this gets the codegen layer ready for it.
Differential Revision: https://reviews.llvm.org/D42724
llvm-svn: 324834
We promote these via a DAG combine now before lowering gets the chance.
Also remove the v2i1 custom handling since it will no longer be triggered.
llvm-svn: 324833
These were added as part of the refactoring for prefer vector width. At the time I thought the hasAVX512 here would be replaced with "allow 512 bit vectors" so that it would read "allow 512 bit vectors OR VLX". But now the plan is to only give the option of disabling 512 bit vectors when VLX is enabled. So we don't need this qualification at all
llvm-svn: 324831
Summary:
This patch changes the signature of the avx512 packed fp compare intrinsics to return a vXi1 vector and no longer take a mask as input. The casts to scalar type will now need to be explicit in the IR. The masking node will now be an explicit and in the IR.
This makes the intrinsic look much more similar to an fcmp instruction that we wish we could use for these but can't. We already use icmp instructions for integer compares.
Previously the lowering step of isel would turn the intrinsic into an X86 specific ISD node and a emit the masking nodes as well as some bitcasts. This means DAG combines can't see the vXi1 type until somewhat late, making it more difficult to combine out gpr<->mask transition sequences. By exposing the vXi1 type explicitly in the IR and initial SelectionDAG we give earlier DAG combines and even InstCombine the chance to see it and optimize it.
This should make any issues with gpr<->mask sequences the same between integer and fp. Meaning we only have to fix them once.
Reviewers: spatel, delena, RKSimon, zvi
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D43137
llvm-svn: 324827
Undef VLX, getSetCCResultType returns v2i1/v4i1 for v2f32/v4f32 so default type legalization will end up changing the setcc result type back to vXi1 if it had been extended. The resulting extend gets messed up further by type legalization and is difficult to recombine back to (v4i32 (setcc (v4f32))) after legalization.
I went ahead and enabled this for SSE2 and later since its always the result we want and this helps type legalization get there in less steps.
llvm-svn: 324822
This prevents extends of masks being introduced during lowering where it become difficult to combine them out.
There are a few oddities in here.
We sometimes concatenate two k-registers produced by two compares, sign_extend the combined pair, then extract two halves. This worked better previously because the sign_extend wasn't created until after the fp_to_sint was split which led to a split sign_extend being created.
We probably also need to custom type legalize (v2i32 (sext v2i1)) via widening.
llvm-svn: 324820
This avoids a constant pool load to create 1.
The int->float are showing converts to mask and back. We probably need to widen inputs to sint_to_fp/uint_to_fp before type legalization.
llvm-svn: 324805