Summary:
I've noticed that the bitcasts we introduce for these make computeKnownBits and computeNumSignBits not work well in LegalizeVectorOps. LegalizeVectorOps legalizes bottom up while LegalizeDAG legalizes top down. The bottom up strategy for LegalizeVectorOps means operands are legalized before their uses. So we promote and/or/xor before we legalize the operations that use them, making computeKnownBits/computeNumSignBits in places like LowerTruncate suboptimal. I looked at changing LegalizeVectorOps to be top down as well, but that was more disruptive and caused some regressions. I also looked at just moving promotion of binops to LegalizeDAG, but that had a few issues, one around matching AND/ANDN/OR into VSELECT because I had to create ANDN as vXi64 while the other nodes hadn't been legalized yet; I didn't look too hard at fixing that.
This patch seems to produce better results overall than my other attempts. We now form broadcasts of constants better in some cases. For at least some of them, the AND was being introduced in LegalizeDAG, promoted to vXi64, and the BUILD_VECTOR was also legalized there; I think we got a bad ordering from that. Now that the promotion is out of the legalizer, we handle this better.
In the longer term I think we really should evaluate whether we should be doing this promotion at all. It's really there to reduce isel pattern count, but I'm wondering if we'd be better served just eating the pattern cost or doing C++ based isel for vector and/or/xor in X86ISelDAGToDAG. The masked and/or/xor will definitely be difficult in patterns if a bitcast gets between the vselect and the and/or/xor node. That becomes a lot of permutations to cover.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D53107
llvm-svn: 344487
Enable enableMultipleCopyHints() on X86.
Original Patch by @jonpa:
While enabling the mischeduler for SystemZ, it was discovered that for some reason a test needed one extra seemingly needless COPY (test/CodeGen/SystemZ/call-03.ll). The handling of that resulted in this patch, which improves register coalescing by providing not just one copy hint, but a sorted list of copy hints. On SystemZ, this gives ~12,500 fewer register moves on SPEC, as well as marginally less spilling.
Instead of improving just the SystemZ backend, the improvement has been implemented in common code (calculateSpillWeightAndHint()). This causes a lot of test failures, but since this should be a general improvement, I hope that the involved targets will help review the test updates.
Differential Revision: https://reviews.llvm.org/D38128
llvm-svn: 342578
Summary:
This fixes most of the scheduling info for SKX vector operations.
I had to split a lot of the YMM/ZMM classes into separate classes for YMM and ZMM.
The before/after llvm-exegesis analysis are in the phabricator diff.
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D47721
llvm-svn: 334407
BtVer2 - fix NumMicroOp and account for the Lat+6cy GPR->XMM and Lat+1cy XMM->GPR delays (see rL332737)
The high number of MOVD/MOVQ equivalent instructions meant that there were a number of missed patterns in SNB/Znver1:
SNB - add missing GPR<->MMX costs (taken from Agner / Intel AOM)
Znver1 - add missing GPR<->XMM MOVQ costs (taken from Agner)
llvm-svn: 332745
BtVer2 - Fixes schedules for (V)CVTPS2PD instructions
A lot of the Intel models still have too many InstRW overrides for these new classes - this needs cleaning up, but I wanted to get the classes in first.
llvm-svn: 332376
These are more like cross-lane shuffles than regular shuffles - we already do this for AVX512 equivalents.
Differential Revision: https://reviews.llvm.org/D46229
llvm-svn: 331659
WriteFRcp/WriteFRsqrt are split to support scalar, XMM and YMM/ZMM instructions.
WriteFSqrt is split into single/double/long-double sizes and scalar, XMM, YMM and ZMM instructions.
This removes all InstrRW overrides for these instructions.
NOTE: There were a couple of typos in the Znver1 model - notably a 1cy throughput for SQRT that is highly unlikely and doesn't tally with Agner.
NOTE: I had to add Agner's numbers for several targets for WriteFSqrt80.
llvm-svn: 331629
This patch aims to provide correct dwarf unwind information in function
epilogue for X86.
It consists of two parts. The first part inserts CFI instructions that set
appropriate cfa offset and cfa register in emitEpilogue() in
X86FrameLowering. This part is X86 specific.
The second part is platform independent and ensures that:
* CFI instructions do not affect code generation (they are not counted as
instructions when tail duplicating or tail merging)
* Unwind information remains correct when a function is modified by
different passes. This is done in a late pass by analyzing information
about cfa offset and cfa register in BBs and inserting additional CFI
directives where necessary.
Added CFIInstrInserter pass:
* analyzes each basic block to determine the cfa offset and register that are
valid at its entry and exit
* verifies that outgoing cfa offset and register of predecessor blocks match
incoming values of their successors
* inserts additional CFI directives at basic block beginning to correct the
rule for calculating CFA
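As a rough sketch of the per-block check (plain C++ with hypothetical stand-in types Block/CFAInfo for illustration, not the real MachineBasicBlock machinery):

    #include <vector>

    struct CFAInfo { int Offset; unsigned Register; };

    struct Block {
      CFAInfo Incoming, Outgoing;   // CFA rule valid at entry and at exit
      std::vector<Block *> Successors;
    };

    // True if some successor expects a different CFA rule than this block
    // provides on exit; that is where the pass must insert corrective
    // .cfi_def_cfa_offset / .cfi_def_cfa_register directives.
    bool successorNeedsCorrection(const Block &B) {
      for (const Block *S : B.Successors)
        if (S->Incoming.Offset != B.Outgoing.Offset ||
            S->Incoming.Register != B.Outgoing.Register)
          return true;
      return false;
    }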
Having CFI instructions in function epilogue can cause incorrect CFA
calculation rule for some basic blocks. This can happen if, due to basic
block reordering, or the existence of multiple epilogue blocks, some of the
blocks have wrong cfa offset and register values set by the epilogue block
above them.
CFIInstrInserter is currently run only on X86, but can be used by any target
that implements support for adding CFI instructions in epilogue.
Patch by Violeta Vukobrat.
Differential Revision: https://reviews.llvm.org/D42848
llvm-svn: 330706
Split the fp and integer vector logical instruction scheduler classes - older CPUs in particular often handled these on different pipes.
This unearthed a couple of things that are also handled in this patch:
(1) We were tagging avx512 fp logic ops as WriteFAdd, probably because of the lack of WriteFLogic
(2) SandyBridge had integer logic ops only using Port5, when afaict they can use Ports015.
(3) Cleaned up x86 FCHS/FABS scheduling as they are typically treated as fp logic ops.
Differential Revision: https://reviews.llvm.org/D45629
llvm-svn: 330480
As mentioned on D44647, this patch increases the default memory latency to +5cy, which more closely matches what most custom cases are doing for reg-mem instructions.
I've bumped LoadLatency, ReadAfterLd and WriteLoad values to 5cy to be consistent.
As Sandy Bridge is currently our default generic model, this affects a lot of scheduling tests...
Differential Revision: https://reviews.llvm.org/D44654
llvm-svn: 329388
We were forcing the latency of these instructions to 5 cycles, but every other scheduler model had them as 1 cycle. I'm sure I didn't get everything, but this gets a big portion.
llvm-svn: 329339
It's failing on the bots and I'm not sure why.
This reverts:
[X86] Synchronize the SchedRW on some EVEX instructions with their VEX equivalents.
[X86] Use WriteFShuffle256 for VEXTRACTF128 to be consistent with VEXTRACTI128 which uses WriteShuffle256.
[X86] Remove some InstRWs for plain store instructions on Sandy Bridge.
[X86] Auto-generate complete checks. NFC
llvm-svn: 329256
Also restrict to port 0 and 1 for SkylakeClient. It looks like the scheduler models don't account for the client parts not having the full vector ALU on port 5 that the server parts have.
Fixes PR36808.
llvm-svn: 328061
An extract_element where the result type is larger than the scalar element type is semantically an any_extend from the scalar element type to the result type. If we expect zeroes in the upper bits of the i8/i32, we need to make sure those zeroes are explicit in the DAG.
For these cases the best way to accomplish this is to use an insert_subvector to pad the v1i1 with zeroes in the upper bits first. We extend to either v16i1 (for i32) or v8i1 (for i8), then bitcast that to a scalar and finish with a zero_extend up to i32 if necessary. We can't extend past v16i1 because that's the largest mask size on KNL. But isel is smart enough to know that a zext of a bitcast from v16i1 to i16 can use a KMOVW instruction. The insert_subvectors will be dropped during isel because we can determine that the producing instruction already zeroed the upper bits of the k-register.
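As a plain C++ analogy (illustration only, not the backend code) of why the zeroing must be explicit before the final extend:

    #include <cstdint>

    // Widening one mask bit is an any_extend: bits above bit 0 are
    // unspecified. Zeroing them first (as the insert_subvector pad into
    // v16i1 does) makes the later zext to i32 well defined.
    uint32_t readMaskBit(uint16_t anyExtendedBit) {
      uint16_t padded = anyExtendedBit & 0x1; // explicit zeroes in bits 1..15
      return static_cast<uint32_t>(padded);   // zext up to i32
    }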
llvm-svn: 326308
This is split off from D42948 and includes just the cases that constant fold to true or false. It also includes some refactoring to keep predicate checks together.
This supports things like
(setcc uge X, 0) -> true
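In scalar C++ terms these constant folds are simple tautologies, e.g.:

    #include <cstdint>

    // Scalar analogues of the folds (illustration only):
    bool ugeZero(uint32_t x) { return x >= 0u; } // setcc uge X, 0 -> true
    bool ultZero(uint32_t x) { return x < 0u; }  // setcc ult X, 0 -> false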
Differential Revision: https://reviews.llvm.org/D43489
llvm-svn: 325627
This allows us to avoid an opsize prefix. And forcing some move immediates to i32 avoids a length-changing prefix on those instructions.
This mostly replaces the existing combine we had for zext/sext+cmov of constants. I left in a case for sign extending a 32 bit cmov of constants to 64 bits.
Differential Revision: https://reviews.llvm.org/D43327
llvm-svn: 325601
Previously we used vptestmd, but the scheduling data for SKX says vpmovq2m/vpmovd2m is lower latency. We already used vpmovb2m/vpmovw2m for byte/word truncates. So this is more consistent anyway.
llvm-svn: 325534
Under VLX, getSetCCResultType returns v2i1/v4i1 for v2f32/v4f32, so default type legalization will end up changing the setcc result type back to vXi1 if it had been extended. The resulting extend gets messed up further by type legalization and is difficult to recombine back to (v4i32 (setcc (v4f32))) after legalization.
I went ahead and enabled this for SSE2 and later since it's always the result we want and it helps type legalization get there in fewer steps.
llvm-svn: 324822
This prevents extends of masks being introduced during lowering, where it becomes difficult to combine them out.
There are a few oddities in here.
We sometimes concatenate two k-registers produced by two compares, sign_extend the combined pair, then extract two halves. This worked better previously because the sign_extend wasn't created until after the fp_to_sint was split, which led to a split sign_extend being created.
We probably also need to custom type legalize (v2i32 (sext v2i1)) via widening.
llvm-svn: 324820
This avoids a constant pool load to create 1.
The int->float cases are showing converts to mask and back. We probably need to widen inputs to sint_to_fp/uint_to_fp before type legalization.
llvm-svn: 324805
Most vXi1 constant build vectors have to be implemented in the scalar domain anyway, so we'll probably end up with a cast there later. But by then it's too late to do the combine to get rid of it.
llvm-svn: 324662
The KTEST instruction sets the C flag if the result of anding both operands together is all 1s. We can use this to lower (icmp eq/ne (bitcast (vXi1 X)), -1).
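In scalar terms, the condition being lowered is just an all-ones test on the bitcast mask (a sketch for v16i1, relying on the KTEST behavior described above):

    #include <cstdint>

    // For a v16i1 mask bitcast to i16: per the description above, KTEST's
    // C flag reports whether (Mask & Mask) is all ones, i.e. this test.
    bool allLanesSet(uint16_t Mask) { return (Mask & Mask) == 0xFFFF; }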
Differential Revision: https://reviews.llvm.org/D42772
llvm-svn: 324577
These used things like unsigned less than zero, which is always false because there is no unsigned number less than zero.
I plan to teach DAG combine to optimize these so need to stop using them.
llvm-svn: 324315
We now allow all signed comparisons and not equal. The complement that needs to be added for this is no worse than the extend. And the vector output forms of pcmpeq/pcmpgt have better latency than the k-register version on SKX.
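A scalar sketch of the per-lane trick (illustration only): pcmpeq produces an all-ones lane for equal elements, and a single complement turns that into 'not equal':

    #include <cstdint>

    uint32_t laneEq(uint32_t a, uint32_t b) { return a == b ? 0xFFFFFFFFu : 0u; } // pcmpeqd per lane
    uint32_t laneNe(uint32_t a, uint32_t b) { return ~laneEq(a, b); }             // one extra complement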
llvm-svn: 324294
This allows the immediate to be folded into the and instead of being forced to move into a register. This can sometimes result in shorter encodings, since the and can sign extend an immediate.
This also allows us to match an and to a movzx after a not.
This can cause an extra move if the input to the separate NOT has an additional user which requires a copy before the NOT.
llvm-svn: 324260