The corresponding CFE patch replaces these intrinsics with vector initializers
in avxintrin.h. This patch removes the LLVM intrinsics from the backend.
We now stop lowering at X86ISD::VBROADCAST custom node rather than lowering
that further to the intrinsics.
The patch only changes VBROADCASTS* and leaves VBROADCAST[FI]128 to continue
to use intrinsics. As explained in the CFE patch, the reason is that we
currently don't generate as good code for them without the intrinsics.
CodeGen/X86/avx-vbroadcast.ll already provides coverage for this change. It
checks that for a series of insertelements we generate the appropriate
vbroadcast instruction.
Also verified that there was no assembly change in the test-suite before and
after this patch.
llvm-svn: 209864
Summary:
When inserting an element that's coming from a vector load or a broadcast
of a vector (or scalar) load, combine the load into the insertps
instruction.
Added PerformINSERTPSCombine for the case where we need to fix the load
(load of a vector + insertps with a non-zero CountS).
Added patterns for the broadcasts.
Also added tests for SSE4.1, AVX, and AVX2.
Reviewers: delena, nadav, craig.topper
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D3581
llvm-svn: 209156
Previously, TableGen assumed that every aliased operand consumed precisely 1
MachineInstr slot (this was reasonable because until a couple of days ago,
nothing more complicated was eligible for printing).
This allows a couple more ARM64 aliases to print so we can remove the special
code.
On the X86 side, I've gone for explicit AT&T size specifiers as the default, so
turned off a few of the aliases that would have just started printing.
llvm-svn: 208880
Summary:
The INSERTPS pattern fragment was called insrtps (mising 'e'), which
would make it harder to grep for the patterns related to this instruction.
Renaming it to use the proper instruction name.
Reviewers: nadav
CC: llvm-commits
Differential Revision: http://reviews.llvm.org/D3443
llvm-svn: 206779
AVX supports logical operations using an operand from memory. Unfortunately
because integer operations were not added until AVX2 the AVX1 logical
operation's types were preventing the isel from folding the loads. In a limited
number of cases the peephole optimizer would fold the loads, but most were
missed. This patch adds explicit patterns with appropriate casts in order for
these loads to be folded.
The included test cases run on reduced examples and disable the peephole
optimizer to ensure the folds are being pattern matched.
Patch by Louis Gerbarg <lgg@apple.com>
rdar://16355124
llvm-svn: 205938
Those patterns are used when the load cannot be folded into the related broadcast
during the select phase.
This happens when the load gets additional uses that were not anticipated during
the previous lowering phases (constant vector to constant load, then constant
load reused) or when selection DAG is not able to prove that folding the load
will not create a cycle in the DAG.
<rdar://problem/16074331>
llvm-svn: 204631
The patch defines new or refines existing generic scheduling classes to match
the behavior of the SSE instructions.
It also maps those scheduling classes on the related SSE instructions.
<rdar://problem/15607571>
llvm-svn: 202065
Original commits messages:
Add MRMXr/MRMXm form to X86 for use by instructions which treat the 'reg' field of modrm byte as a don't care value. Will allow for simplification of disassembler code.
Simplify a bunch of code by removing the need for the x86 disassembler table builder to know about extended opcodes. The modrm forms are sufficient to convey the information.
llvm-svn: 201065
r201059 appears to cause a crash in a bootstrapped build of clang. Craig
isn't available to look at it right now, so I'm reverting it while he
investigates.
llvm-svn: 201064
Generalize the AArch64 .td nodes for AssertZext and AssertSext. Use
them to match the relevant pextr store instructions.
The test widen_load-2.ll requires a slight change because with the
stores gone, the remaining instructions are scheduled in a different
order.
Add test cases for SSE4 and AVX variants.
Resolves rdar://13414672.
Patch by Adam Nemet <anemet@apple.com>.
llvm-svn: 200957
I believe VZEXT_MOVL means "zero all vector elements except the first" (and
should have identical input & output types) whereas VZEXT means "zero extend
each element of a vector (discarding higher elements if necessary)".
For example:
(v4i32 (vzext (v16i8 ...)))
should zero extend the low 4 bytes of the incoming vector to 32-bits,
discarding higher bytes.
However, somewhere in the past, these two concepts had become confused, even
leading to a nonsensical VSEXT_MOVL.
This re-merges the nodes where appropriate (all VSEXT_MOVL -> VSEXT, VZEXT_MOVL
-> VZEXT when it's an actual extension).
rdar://problem/15981990
llvm-svn: 200918
That's what it actually means, and with 16-bit support it's going to be
a little more relevant since in a few corner cases we may actually want
to distinguish between 16-bit and 32-bit mode (for example the bare 'push'
aliases to pushw/pushl etc.)
Patch by David Woodhouse
llvm-svn: 197768
Added scalar compare VCMPSS, VCMPSD.
Implemented LowerSELECT for scalar FP operations.
I replaced FSETCCss, FSETCCsd with one node type FSETCCs.
Node extract_vector_elt(v16i1/v8i1, idx) returns an element of type i1.
llvm-svn: 197384
a vector packed single/double fp operation followed by a vector insert.
The effect is that the backend coverts the packed fp instruction
followed by a vectro insert into a SSE or AVX scalar fp instruction.
For example, given the following code:
__m128 foo(__m128 A, __m128 B) {
__m128 C = A + B;
return (__m128) {c[0], a[1], a[2], a[3]};
}
previously we generated:
addps %xmm0, %xmm1
movss %xmm1, %xmm0
we now generate:
addss %xmm1, %xmm0
llvm-svn: 197145
immediately after SSE scalar fp instructions like addss or mulss.
Added patterns to select SSE scalar fp arithmetic instructions from a scalar
fp operation followed by a blend.
For example, given the following code:
__m128 foo(__m128 A, __m128 B) {
A[0] += B[0];
return A;
}
previously we generated:
addss %xmm0, %xmm1
movss %xmm1, %xmm0
now we generate:
addss %xmm1, %xmm0
llvm-svn: 196925
On sandy bridge (PR17654) we now get
vpxor %xmm1, %xmm1, %xmm1
vpunpckhbw %xmm1, %xmm0, %xmm2
vpunpcklbw %xmm1, %xmm0, %xmm0
vinsertf128 $1, %xmm2, %ymm0, %ymm0
On haswell it's a simple
vpmovzxbw %xmm0, %ymm0
There is a maze of duplicated and dead transforms and patterns in this
area. Remove the dead custom lowering of zext v8i16 to v8i32, that's
already handled by LowerAVXExtend.
llvm-svn: 193262
the instruction defenitions and ISEL reflect this.
Prior to this patch these instructions took an i32i8imm, and the high bits were
dropped during encoding. This led to incorrect behavior for shifts by
immediates higher than 255. This patch fixes that issue by detecting large
immediate shifts and returning constant zero (for logical shifts) or capping
the shift amount at an encodable value (for arithmetic shifts).
Fixes <rdar://problem/14968098>
llvm-svn: 193096
This allows the instruction to be encoded using the 2-byte VEX form instead of the 3-byte VEX form. The GNU assembler has similar behavior and instruction selection already does this.
llvm-svn: 192088
Add llvm.x86.* intrinsics for all of the Intel SHA Extensions instructions, as
well as tests. Also remove mayLoad and hasSideEffects, which can be inferred
from the instruction patterns.
llvm-svn: 190864
Implements Instruction scheduler latencies for Silvermont,
using latencies from the Intel Silvermont Optimization Guide.
Auto detects SLM.
Turns on post RA scheduler when generating code for SLM.
llvm-svn: 190717
Add basic assembly/disassembly support for the first Intel SHA
instruction 'sha1rnds4'. Also includes feature flag, and test cases.
Support for the remaining instructions will follow in a separate patch.
llvm-svn: 190611
All insertf*/extractf* functions replaced with insert/extract since we have insertf and inserti forms.
Added lowering for INSERT_VECTOR_ELT / EXTRACT_VECTOR_ELT for 512-bit vectors.
Added lowering for EXTRACT/INSERT subvector for 512-bit vectors.
Added a test.
llvm-svn: 187491
Add earlyclobber constaints to prevent input register being allocated as
the output register because, according to Intel spec [1], "If any pair
of the index, mask, or destination registers are the same, this
instruction results a UD fault."
---
[1] http://software.intel.com/sites/default/files/319433-014.pdf
llvm-svn: 183327
Now all x86 instructions that have itinerary classes also have SchedRW
lists. This is required before the new scheduling models can be used.
There are still unannotated instructions remaining, but they don't have
itinerary classes either.
llvm-svn: 178051
All the instructions tagged with IIC_DEFAULT had nothing in common, and
we already have a NoItineraries class to represent untagged
instructions.
llvm-svn: 177937
We hitch a ride with the existing OpndItins class that was used to add
instruction itinerary classes in the many multiclasses in this file.
Use the link provided by the X86FoldableSchedWrite.Folded to find the
right SchedWrite for folded loads.
llvm-svn: 177326
cvtsi2* should parse with an 'l' or 'q' suffix or no suffix at all. No suffix should be treated the same as 'l' suffix. Printing should always print a suffix. Previously we didn't parse or print an 'l' suffix.
cvtt*2si/cvt*2si should parse with an 'l' or 'q' suffix or not suffix at all. No suffix should use the destination register size to choose encoding. Printing should not print a suffix.
Original 'l' suffix issue with cvtsi2* pointed out by Michael Kuperstein.
llvm-svn: 171668
We match the pattern "x >= y ? x-y : 0" into "subus x, y" and two special cases
if y is a constant. DAGCombiner canonicalizes those so we first have to undo the
canonicalization for those cases. The pattern occurs in gzip when the loop
vectorizer is enabled. Part of PR14613.
llvm-svn: 170273
the first source operand is tied to the destination operand.
This is to accurately model the corresponding instructions where the upper
bits are unmodified.
rdar://12558838
PR14221
llvm-svn: 167064
- As there's no 64-bit GPRs in 32-bit mode, a custom conversion from v2u32 to
v2f32 is added to improve the efficiency of the code generated.
llvm-svn: 166545
- Due to the current matching vector elements constraints in
ISD::FP_ROUND, rounding from v2f64 to v4f32 (after legalization from
v2f32) is scalarized. Add a customized v2f32 widening to convert it
into a target-specific X86ISD::VFPROUND to work around this
constraints.
llvm-svn: 165631
It had patterns for zext-loading and extending. This commit adds patterns for loading a wide type, performing a bitcast,
and extending. This is an odd pattern, but it is commonly used when writing code with intrinsics.
rdar://11897677
llvm-svn: 163995
The assembly string for the VMOVPQIto64rr instruction incorrectly lacked the 'v'
prefix, resulting in mis-assembly of the vanilla movd instruction.
llvm-svn: 162963
- Add 'UseSSEx' to force SSE legacy insn not being selected when AVX is
enabled.
As the penalty of inter-mixing SSE and AVX instructions, we need
prevent SSE legacy insn from being generated except explicitly
specified through some intrinsics. For patterns supported by both
SSE and AVX, so far, we force AVX insn will be tried first relying on
AddedComplexity or position in td file. It's error-prone and
introduces bugs accidentally.
'UseSSEx' is disabled when AVX is turned on. For SSE insns inherited
by AVX, we need this predicate to force VEX encoding or SSE legacy
encoding only.
For insns not inherited by AVX, we still use the previous predicates,
i.e. 'HasSSEx'. So far, these insns fall into the following
categories:
* SSE insns with MMX operands
* SSE insns with GPR/MEM operands only (xFENCE, PREFETCH, CLFLUSH,
CRC, and etc.)
* SSE4A insns.
* MMX insns.
* x87 insns added by SSE.
2 test cases are modified:
- test/CodeGen/X86/fast-isel-x86-64.ll
AVX code generation is different from SSE one. 'vcvtsi2sdq' cannot be
selected by fast-isel due to complicated pattern and fast-isel
fallback to materialize it from constant pool.
- test/CodeGen/X86/widen_load-1.ll
AVX code generation is different from SSE one after fixing SSE/AVX
inter-mixing. Exec-domain fixing prefers 'vmovapd' instead of
'vmovaps'.
llvm-svn: 162919
this allows for better code generation.
Added a new DAGCombine transformation to convert FMAX and FMIN to FMANC and
FMINC, which are commutative.
For example:
movaps %xmm0, %xmm1
movsd LC(%rip), %xmm0
minsd %xmm1, %xmm0
becomes:
minsd LC(%rip), %xmm0
llvm-svn: 162187
- FP_EXTEND only support extending from vectors with matching elements.
This results in the scalarization of extending to v2f64 from v2f32,
which will be legalized to v4f32 not matching with v2f64.
- add X86-specific VFPEXT supproting extending from v4f32 to v2f64.
- add BUILD_VECTOR lowering helper to recover back the original
extending from v4f32 to v2f64.
- test case is enhanced to include different vector width.
llvm-svn: 161894
The (COPY_TO_REGCLASS GR32:$src, VR128) pattern looks odd, but
copyPhysReg does the right thing with it. (The old pattern would
eventually produce the same cross-class copy).
llvm-svn: 160830
The SUBREG_TO_REG instruction has magic semantics asserting that the
source value was defined by an instruction that cleared the high half of
the register. Those semantics are never actually exploited for xmm
registers.
llvm-svn: 160818
These idempotent sub-register indices don't do anything --- They simply
map XMM registers to themselves. They no longer affect register classes
either since the SubRegClasses field has been removed from Target.td.
This patch replaces XMM->XMM EXTRACT_SUBREG and INSERT_SUBREG patterns
with COPY_TO_REGCLASS patterns which simply become COPY instructions.
The number of IMPLICIT_DEF instructions before register allocation is
reduced, and that is the cause of the test case changes.
llvm-svn: 160816
load source operand is used by multiple nodes. The v2i64 broadcast was emulated
by shuffling the two lower i32 elements to the upper two.
We had a bug in the immediate used for the broadcast.
Replacing 0 to 0x44.
0x44 means [01|00|01|00] which corresponds to the correct lane.
Patch by Michael Kuperstein.
llvm-svn: 160430
Allow the folding of vbroadcastRR to vbroadcastRM, where the memory operand is a spill slot.
PR12782.
Together with Michael Kuperstein <michael.m.kuperstein@intel.com>
llvm-svn: 160230
Corrected type for index of llvm.x86.avx2.gather.d.pd.256
from 256-bit to 128-bit.
Corrected types for src|dst|mask of llvm.x86.avx2.gather.q.ps.256
from 256-bit to 128-bit.
Support the following intrinsics:
llvm.x86.avx2.gather.d.q, llvm.x86.avx2.gather.q.q
llvm.x86.avx2.gather.d.q.256, llvm.x86.avx2.gather.q.q.256
llvm.x86.avx2.gather.d.d, llvm.x86.avx2.gather.q.d
llvm.x86.avx2.gather.d.d.256, llvm.x86.avx2.gather.q.d.256
llvm-svn: 159402
This required light surgery on the assembler and disassembler
because the instructions use an uncommon encoding. They are
the only two instructions in x86 that use register operands
and two immediates.
llvm-svn: 157634
using the pattern (vbroadcast (i32load src)). In some cases, after we generate
this pattern new users are added to the load node, which prevent the selection
of the blend pattern. This commit provides fallback patterns which perform
in-vector broadcast (using in-vector vbroadcast in AVX2 and pshufd on AVX1).
llvm-svn: 155437
Original message:
Modify the code that lowers shuffles to blends from using blendvXX to vblendXX.
blendV uses a register for the selection while Vblend uses an immediate.
On sandybridge they still have the same latency and execute on the same execution ports.
llvm-svn: 154483
blendv uses a register for the selection while vblend uses an immediate.
On sandybridge they still have the same latency and execute on the same execution ports.
llvm-svn: 154396
This results in things such as
vmovups 16(%rdi), %xmm0
vinsertf128 $1, %xmm0, %ymm0, %ymm0
to be combined to
vinsertf128 $1, 16(%rdi), %ymm0, %ymm0
rdar://11076953
llvm-svn: 153092