This instructions doesn't have intrincis.
Added tests for lowering and encoding.
Differential Revision: http://reviews.llvm.org/D12317
llvm-svn: 249688
The C standard has historically not specified whether or not these functions should raise the inexact flag. Traditionally on Darwin, these functions *did* raise inexact, and the llvm lowerings followed that conventions. n1778 (C bindings for IEEE-754 (2008)) clarifies that these functions should not set inexact. This patch brings the lowerings for arm64 and x86 in line with the newly specified behavior. This also lets us fold some logic into TD patterns, which is nice.
Differential Revision: http://reviews.llvm.org/D12969
llvm-svn: 248266
Fixes PR23464: one way to use the broadcast intrinsics is:
_mm256_broadcastw_epi16(_mm_cvtsi32_si128(*(int*)src));
We don't currently fold this, but now that we use native IR for
the intrinsics (r245605), we can look through one bitcast to find
the broadcast scalar.
Differential Revision: http://reviews.llvm.org/D10557
llvm-svn: 245613
Since r245605, the clang headers don't use these anymore.
r245165 updated some of the tests already; update the others, add
an autoupgrade, remove the intrinsics, and cleanup the definitions.
Differential Revision: http://reviews.llvm.org/D10555
llvm-svn: 245606
COMISD should receive QWORD because it is defined as
(V)COMISD xmm1, xmm2/m64
COMISS should receive DWORD because it is defined as
(V)COMISS xmm1, xmm2/m32
Differential Revision: http://reviews.llvm.org/D11712
llvm-svn: 245551
This is a follow-up to the FIXME that was added with D7474 ( http://reviews.llvm.org/rL229531 ).
I thought this load folding bug had been made hard-to-hit, but it turns out to be very easy
when targeting 32-bit x86 and causes a miscompile/crash in Wine:
https://bugs.winehq.org/show_bug.cgi?id=38826https://llvm.org/bugs/show_bug.cgi?id=22371#c25
The quick fix is to simply remove the scalar FP logical instructions from the load folding table
in X86InstrInfo, but that causes us to miss load folds that should be possible when lowering fabs,
fneg, fcopysign. So the majority of this patch is altering those lowerings to use *vector* FP
logical instructions (because that's all x86 gives us anyway). That lets us do the load folding
legally.
Differential Revision: http://reviews.llvm.org/D11477
llvm-svn: 243361
SKX supports conversion for all FP types. Integer types include doublewords and quardwords.
I added "Legal" status for these nodes and a bunch of tests.
I added "NoVLX" for AVX DAG selection to force VLX instructions selection when VLX is supported.
Differential Revision: http://reviews.llvm.org/D11255
llvm-svn: 242637
This patch adds support for v8i16 and v16i8 shuffle lowering using the immediate versions of the SSE4A EXTRQ and INSERTQ instructions. Although rather limited (they can only act on the lower 64-bits of the source vectors, leave the upper 64-bits of the result vector undefined and don't have VEX encoded variants), the instructions are still useful for the zero extension of any lane (EXTRQ) or inserting a lane into another vector (INSERTQ). Testing demonstrated that it wasn't typically worth it to use these instructions for v2i64 or v4i32 vector shuffles although they are capable of it.
As well as adding specific pattern matching for the shuffles, the patch uses EXTRQ for zero extension cases where SSE41 isn't available and its more efficient than the SSE2 'unpack' default approach. It also adds shuffle decode support for the EXTRQ / INSERTQ cases when the instructions are handling full byte-sized extractions / insertions.
From this foundation, future patches will be able to make use of the instructions for situations that use their ability to extract/insert at the bit level.
Differential Revision: http://reviews.llvm.org/D10146
llvm-svn: 241508
With the completion of D9746 there is now a common implementation of integer signed/unsigned min/max nodes, removing the need for the equivalent X86 specific implementations.
This patch removes the old X86ISD nodes, legalizes the relevant SSE2/SSE41/AVX2/AVX512 instructions for the ISD versions and converts the small amount of existing X86 code.
Differential Revision: http://reviews.llvm.org/D10947
llvm-svn: 241506
We used to erroneously match:
(v4i64 shuffle (v2i64 load), <0,0,0,0>)
Whereas vbroadcasti128 is more like:
(v4i64 shuffle (v2i64 load), <0,1,0,1>)
This problem doesn't exist for vbroadcastf128, which kept matching
the intrinsic after r231182. We should perhaps re-introduce the
intrinsic here as well, but that's a separate issue still being
discussed.
While there, add some proper vbroadcastf128 tests. We don't currently
match those, like for loading vbroadcastsd/ss on AVX (the reg-reg
broadcasts where added in AVX2).
Fixes PR23886.
llvm-svn: 240488
This patch enables support for the conversion of v2i32 to v2f64 to use the CVTDQ2PD xmm instruction and stay on the SSE unit instead of scalarizing, sign extending to i64 and using CVTSI2SDQ scalar conversions.
Differential Revision: http://reviews.llvm.org/D10433
llvm-svn: 239855
This patch removes the old X86ISD::FSRL op - which allowed float vectors to use the byte right shift operations (causing a domain switch....).
Since the refactoring of the shuffle lowering code this no longer has any use.
Differential Revision: http://reviews.llvm.org/D10169
llvm-svn: 238906
in-register LUT technique.
Summary:
A description of this technique can be found here:
http://wm.ite.pl/articles/sse-popcount.html
The core of the idea is to use an in-register lookup table and the
PSHUFB instruction to compute the population count for the low and high
nibbles of each byte, and then to use horizontal sums to aggregate these
into vector population counts with wider element types.
On x86 there is an instruction that will directly compute the horizontal
sum for the low 8 and high 8 bytes, giving vNi64 popcount very easily.
Various tricks are used to get vNi32 and vNi16 from the vNi8 that the
LUT computes.
The base implemantion of this, and most of the work, was done by Bruno
in a follow up to D6531. See Bruno's detailed post there for lots of
timing information about these changes.
I have extended Bruno's patch in the following ways:
0) I committed the new tests with baseline sequences so this shows
a diff, and regenerated the tests using the update scripts.
1) Bruno had noticed and mentioned in IRC a redundant mask that
I removed.
2) I introduced a particular optimization for the i32 vector cases where
we use PSHL + PSADBW to compute the the low i32 popcounts, and PSHUFD
+ PSADBW to compute doubled high i32 popcounts. This takes advantage
of the fact that to line up the high i32 popcounts we have to shift
them anyways, and we can shift them by one fewer bit to effectively
divide the count by two. While the PSHUFD based horizontal add is no
faster, it doesn't require registers or load traffic the way a mask
would, and provides more ILP as it happens on different ports with
high throughput.
3) I did some code cleanups throughout to simplify the implementation
logic.
4) I refactored it to continue to use the parallel bitmath lowering when
SSSE3 is not available to preserve the performance of that version on
SSE2 targets where it is still much better than scalarizing as we'll
still do a bitmath implementation of popcount even in scalar code
there.
With #1 and #2 above, I analyzed the result in IACA for sandybridge,
ivybridge, and haswell. In every case I measured, the throughput is the
same or better using the LUT lowering, even v2i64 and v4i64, and even
compared with using the native popcnt instruction! The latency of the
LUT lowering is often higher than the latency of the scalarized popcnt
instruction sequence, but I think those latency measurements are deeply
misleading. Keeping the operation fully in the vector unit and having
many chances for increased throughput seems much more likely to win.
With this, we can lower every integer vector popcount implementation
using the LUT strategy if we have SSSE3 or better (and thus have
PSHUFB). I've updated the operation lowering to reflect this. This also
fixes an issue where we were scalarizing horribly some AVX lowerings.
Finally, there are some remaining cleanups. There is duplication between
the two techniques in how they perform the horizontal sum once the byte
population count is computed. I'm going to factor and merge those two in
a separate follow-up commit.
Differential Revision: http://reviews.llvm.org/D10084
llvm-svn: 238636
Predicate UseAVX depricates pattern selection on AVX-512.
This predicate is necessary for DAG selection to select EVEX form.
But mapping SSE intrinsics to AVX-512 instructions is not ready yet.
So I replaced UseAVX with HasAVX for intrinsics patterns.
llvm-svn: 237903
This is a follow-on to r236740 where I took Andrea's advice
in D9504 to remove a redundant pattern...except that I removed
the wrong pattern!
AFAICT, there is no change in the final code produced because
subsequent passes would clean up the extra instructions created
by the more complicated pattern.
llvm-svn: 236743
Finish the job that was abandoned in D6958 following the refactoring in
http://reviews.llvm.org/rL230221:
1. Uncomment the intrinsic def for the AVX r_Int instruction.
2. Add missing r_Int entries to the load folding tables; there are already
tests that check these in "test/Codegen/X86/fold-load-unops.ll", so I
haven't added any more in this patch.
3. Add patterns to solve PR21507 ( https://llvm.org/bugs/show_bug.cgi?id=21507 ).
So instead of this:
movaps %xmm0, %xmm1
rcpss %xmm1, %xmm1
movss %xmm1, %xmm0
We should now get:
rcpss %xmm0, %xmm0
And instead of this:
vsqrtss %xmm0, %xmm0, %xmm1
vblendps $1, %xmm1, %xmm0, %xmm0 ## xmm0 = xmm1[0],xmm0[1,2,3]
We should now get:
vsqrtss %xmm0, %xmm0, %xmm0
Differential Revision: http://reviews.llvm.org/D9504
llvm-svn: 236740
We don't need codegen-only intrinsic instructions for the vector forms of these instructions.
This makes the reciprocal estimate instruction lowering identical to how we handle normal
square roots: (V)SQRTPS / (V)SQRTPD.
No existing regression tests fail with this patch.
Differential Revision: http://reviews.llvm.org/D9301
llvm-svn: 236013
This fixes a regression introduced at revision 218263.
On AVX, if we optimize for size, a splat build_vector of a load
is lowered into a VBROADCAST node. This is done even if the value type of the
splat build_vector node is v2i64.
Since AVX doesn't support v2f64/v2i64 broadcasts, revision 218263 added two
extra tablegen patterns to allow selecting a VMOVDDUPrm from an X86VBroadcast
where the scalar element comes from a loadi64/loadf64.
However, revision 218263 forgot to add an extra fallback pattern for the case
where we have a X86VBroadcast of a loadi64 with multiple uses.
This patch adds the missing tablegen pattern in X86InstrSSE.td.
This patch also adds an extra test to 'splat-for-size.ll' to verify that ISel
doesn't crash with a 'fatal error in the backend' due to a missing AVX pattern
to select v2i64 X86ISD::BROADCAST nodes.
llvm-svn: 235509
This is an updated version of Chandler's patch D7402 that got accepted but never committed, and has bit-rotted a bit since.
I've updated the execution domain declarations to match the approach of the packed templates and also added some extra scalar unary tests.
Differential Revision: http://reviews.llvm.org/D9095
llvm-svn: 235372
For code like this:
define <8 x i32> @load_v8i32() {
ret <8 x i32> <i32 7, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
}
We produce this AVX code:
_load_v8i32: ## @load_v8i32
movl $7, %eax
vmovd %eax, %xmm0
vxorps %ymm1, %ymm1, %ymm1
vblendps $1, %ymm0, %ymm1, %ymm0 ## ymm0 = ymm0[0],ymm1[1,2,3,4,5,6,7]
retq
There are at least 2 bugs in play here:
We're generating a blend when a move scalar does the same job using 2 less instruction bytes (see FIXMEs).
We're not matching an existing pattern that would eliminate the xor and blend entirely. The zero bytes are free with vmovd.
The 2nd fix involves an adjustment of "AddedComplexity" [1] and mostly masks the 1st problem.
[1] AddedComplexity has close to no documentation in the source.
The best we have is this comment: "roughly corresponds to the number of nodes that are covered".
It appears that x86 has bastardized this definition by inflating its values for some other
undocumented reason. For example, we have a pattern with "AddedComplexity = 400" (!).
I searched my way to this page:
https://groups.google.com/forum/#!topic/llvm-dev/5UX-Og9M0xQ
Differential Revision: http://reviews.llvm.org/D8794
llvm-svn: 233931
We used to miss non-Q YMM integer vectors, and, non-Q/D XMM integer
vectors.
While there, change the v4i32 patterns to prefer MOVNTDQ.
llvm-svn: 233668
This should complete the job started in r231794 and continued in r232045:
We want to replace as much custom x86 shuffling via intrinsics
as possible because pushing the code down the generic shuffle
optimization path allows for better codegen and less complexity
in LLVM.
AVX2 introduced proper integer variants of the hacked integer insert/extract
C intrinsics that were created for this same functionality with AVX1.
This should complete the removal of insert/extract128 intrinsics.
The Clang precursor patch for this change was checked in at r232109.
llvm-svn: 232120
This is a follow-up to r231182. This adds the "vbroadcasti128" instruction
back, but without the intrinsic mapping. Also add a test to check the
instriction encoding.
This is related to rdar://problem/18742778.
llvm-svn: 231945
The intrinsic is no longer generated by the front-end. Remove the intrinsic and
auto-upgrade it to a vector shuffle.
Reviewed by Nadav
This is related to rdar://problem/18742778.
llvm-svn: 231182
I made the templates general, no need to define pattern separately for each instruction/intrinsic.
Now only need to add r_Int pattern for AVX.
llvm-svn: 230221
This canonicalization step saves us 3 pattern matching possibilities * 4 math ops
for scalar FP math that uses xmm regs. The backend can re-commute the operands
post-instruction-selection if that makes register allocation better.
The tests in llvm/test/CodeGen/X86/sse-scalar-fp-arith.ll cover this scenario already,
so there are no new tests with this patch.
Differential Revision: http://reviews.llvm.org/D7777
llvm-svn: 230024
Change the memory operands in sse12_fp_packed_scalar_logical_alias from scalars to vectors.
That's what the hardware packed logical FP instructions define: 128-bit memory operands.
There are no scalar versions of these instructions...because this is x86.
Generating the wrong code (folding a scalar load into a 128-bit load) is still possible
using the peephole optimization pass and the load folding tables. We won't completely
solve this bug until we either fix the lowering in fabs/fneg/fcopysign and any other
places where scalar FP logic is created or fix the load folding in foldMemoryOperandImpl()
to make sure it isn't changing the size of the load.
Differential Revision: http://reviews.llvm.org/D7474
llvm-svn: 229531
Patch to explicitly add the SSE MOVQ (rr,mr,rm) instructions to SSEPackedInt domain - prevents a number of costly domain switches.
Differential Revision: http://reviews.llvm.org/D7600
llvm-svn: 229439