Commit Graph

4 Commits

Author SHA1 Message Date
Chandler Carruth 0a8151e69a [x86] Revert my over-eager commit in r217332.
I hadn't actually run all the tests yet and these combines have somewhat
surprisingly far reaching effects.

llvm-svn: 217333
2014-09-07 12:37:11 +00:00
Chandler Carruth 8405e8fff9 [x86] Tweak the rules surrounding 0,0 and 1,1 v2f64 shuffles and add
support for MOVDDUP which is really important for matrix multiply style
operations that do lots of non-vector-aligned load and splats.

The original motivation was to add support for MOVDDUP as the lack of it
regresses matmul_f64_4x4 by 5% or so. However, all of the rules here
were somewhat suspicious.

First, we should always be using the floating point domain shuffles,
regardless of how many copies we have to make as a movapd is *crazy*
faster than the domain switching cost on some chips. (Mostly because
movapd is crazy cheap.) Because SHUFPD can't do the copy-for-free trick
of the PSHUF instructions, there is no need to avoid canonicalizing on
UNPCK variants, so do that canonicalizing. This also ensures we have the
chance to form MOVDDUP. =]

Second, we assume SSE2 support when doing any vector lowering, and given
that we should just use UNPCKLPD and UNPCKHPD as they can operate on
registers or memory. If vectors get spilled or come from memory at all
this is going to allow the load to be folded into the operation. If we
want to optimize for encoding size (the only difference, and only
a 2 byte difference) it should be done *much* later, likely after RA.

llvm-svn: 217332
2014-09-07 12:02:14 +00:00
Chandler Carruth e020f117ce [x86] Teach lots of the new vector shuffle lowering to use UNPCK
instructions for blend operations at 128 bits. This was a serious hole
in our prior blend lowering.

llvm-svn: 215819
2014-08-16 09:42:15 +00:00
Chandler Carruth 83860cfcfa [x86] Begin a significant overhaul of how vector lowering is done in the
x86 backend.

This sketches out a new code path for vector lowering, hidden behind an
off-by-default flag while it is under development. The fundamental idea
behind the new code path is to aggressively break down the problem space
in ways that ease selecting the odd set of instructions available on
x86, and carefully avoid scalarizing code even when forced to use older
ISAs. Notably, this starts off restricting itself to SSE2 and implements
the complete vector shuffle and blend space for 128-bit vectors in SSE2
without scalarizing. The plan is to layer on top of this ISA extensions
where we can bail out of the complex SSE2 lowering and opt for
a cheaper, specialized instruction (or set of instructions). It also
needs to be generalized to AVX and AVX512 vector widths.

Currently, this does a decent but not perfect job for SSE2. There are
some specific shortcomings that I plan to address:
- We need a peephole combine to fold together shuffles where possible.
  There are cases where a previous shuffle could be modified slightly to
  arrange for elements to be in the correct position and a later shuffle
  eliminated. Doing this eagerly added quite a bit of complexity, and
  so my plan is to combine away these redundancies afterward.
- There are a lot more clever ways to use unpck and pack that need to be
  added. This is essential for real world shuffles as it turns out...

Once SSE2 is polished a bit I should be able to get interesting numbers
on performance improvements on benchmarks conducive to vectorization.
All of this will be off by default until it is functionally equivalent
of course.

Differential Revision: http://reviews.llvm.org/D4225

llvm-svn: 211888
2014-06-27 11:23:44 +00:00