Newer ppc supports unaligned memory access, it reduces the cost of unaligned memory access significantly. This patch handles this case in PPCTTIImpl::getMemoryOpCost.
This patch fixes pr31492.
Differential Revision: https://reviews.llvm.org/D28630
This is resubmit of r292680, which was reverted by r293092. The internal application failures were actually caused by a source code bug.
llvm-svn: 295506
This reverts commit r292680. It is causing significantly worse
performance and test timeouts in our internal builds. I have already
routed reproduction instructions your way.
llvm-svn: 293092
Newer ppc supports unaligned memory access, it reduces the cost of unaligned memory access significantly. This patch handles this case in PPCTTIImpl::getMemoryOpCost.
This patch fixes pr31492.
Differential Revision: https://reviews.llvm.org/D28630
llvm-svn: 292680
VSX has instructions lxsiwax/lxsdx that can load 32/64 bit value into VSX register cheaply. That patch makes it known to memory cost model, so the vectorization of the test case in pr30990 is beneficial.
Differential Revision: https://reviews.llvm.org/D26713
llvm-svn: 288560
The cost model should not assume vector casts get completely scalarized, since
on targets that have vector support, the common case is a partial split up to
the legal vector size. So, when a vector cast gets split, the resulting casts
end up legal and cheap.
Instead of pessimistically assuming scalarization, base TTI can use the costs
the concrete TTI provides for the split vector, plus a fudge factor to account
for the cost of the split itself. This fudge factor is currently 1 by default,
except on AMDGPU where inserts and extracts are considered free.
Differential Revision: http://reviews.llvm.org/D21251
llvm-svn: 274642
This patch corresponds to review:
http://reviews.llvm.org/D20443
It changes the legalization strategy for illegal vector types from integer
promotion to widening. This only applies for vectors with elements of width
that is a multiple of a byte since we have hardware support for vectors with
1, 2, 3, 8 and 16 byte elements.
Integer promotion for vectors is quite expensive on PPC due to the sequence
of breaking apart the vector, extending the elements and reconstituting the
vector. Two of these operations are expensive.
This patch causes between minor and major improvements in performance on most
benchmarks. There are very few benchmarks whose performance regresses. These
regressions can be handled in a subsequent patch with a DAG combine (similar
to how this patch handles int -> fp conversions of illegal vector types).
llvm-svn: 274535
PPC has a vector popcount, this lets the vectorizer use the correct cost
for it. Tweak X86 test to use an intrinsic that's actually scalarized (we
have a somewhat efficient lowering for vector popcount using SSE, the
cost model finds that now).
llvm-svn: 265005
Pre-P8, when we generate code for unaligned vector loads (for Altivec and QPX
types), even when accounting for the combining that takes place for multiple
consecutive such loads, there is at least one load instructions and one
permutation for each load. Make sure the cost reported reflects the cost of the
permutes as well.
llvm-svn: 246807
I'm adding a regression test to better cover code generation for unaligned
vector loads and stores, but there's no functional change to the code
generation here. There is an improvement to the cost model for unaligned vector
loads and stores, mostly for QPX (for which we were not previously accounting
for the permutation-based loads), and the cost model implementation is cleaner.
llvm-svn: 246712
Essentially the same as the GEP change in r230786.
A similar migration script can be used to update test cases, though a few more
test case improvements/changes were required this time around: (r229269-r229278)
import fileinput
import sys
import re
pat = re.compile(r"((?:=|:|^)\s*load (?:atomic )?(?:volatile )?(.*?))(| addrspace\(\d+\) *)\*($| *(?:%|@|null|undef|blockaddress|getelementptr|addrspacecast|bitcast|inttoptr|\[\[[a-zA-Z]|\{\{).*$)")
for line in sys.stdin:
sys.stdout.write(re.sub(pat, r"\1, \2\3*\4", line))
Reviewers: rafael, dexonsmith, grosser
Differential Revision: http://reviews.llvm.org/D7649
llvm-svn: 230794
The default implementation of getCmpSelInstrCost, which provides the cost of
icmp/fcmp/select instructions, did not deal sensibly with illegal vector types
that were scalarized. We'd ask for the legalization cost of the vector type,
which would return something like (4, f64) given an input of <4 x double>, and
we'd then check the TLI status of the ISD opcode on that scalar type. This would
result in querying (ISD::VSELECT, f64), for example. Amusingly enough,
ISD::VSELECT on scalar types is marked as Legal by default (as with most other
operations), and most backends never change this because VSELECT is never
generated on scalars. However, seeing the resulting operation as Legal, we'd
neglect to add the scalarization cost before returning. The result is that we'd
grossly under-estimate the cost of cmps/selects on illegal vector types.
Now, if type legalization clearly results in scalarization, we skip the early
return and add the scalarization cost.
llvm-svn: 217859
BasicTTI::getMemoryOpCost must explicitly check for non-simple types; setting
AllowUnknown=true with TLI->getSimpleValueType is not sufficient because, for
example, non-power-of-two vector types return non-simple EVTs (not MVT::Other).
llvm-svn: 206150
This provides more realistic costs for the insert/extractelement instructions
(which are load/store pairs), accounts for the cheap unaligned Altivec load
sequence, and for unaligned VSX load/stores.
Bad news:
MultiSource/Applications/sgefa/sgefa - 35% slowdown (this will require more investigation)
SingleSource/Benchmarks/McGill/queens - 20% slowdown (we no longer vectorize this, but it was a constant store that was scalarized)
MultiSource/Benchmarks/FreeBench/pcompress2/pcompress2 - 2% slowdown
Good news:
SingleSource/Benchmarks/Shootout/ary3 - 54% speedup
SingleSource/Benchmarks/Shootout-C++/ary - 40% speedup
MultiSource/Benchmarks/Ptrdist/ks/ks - 35% speedup
MultiSource/Benchmarks/FreeBench/neural/neural - 30% speedup
MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt - 20% speedup
Unfortunately, estimating the costs of the stack-based scalarization sequences
is hard, and adjusting these costs is like a game of whac-a-mole :( I'll
revisit this again after we have better codegen for vector extloads and
truncstores and unaligned load/stores.
llvm-svn: 205658
When a vector type legalizes to a larger vector type, and the target does not
support the associated extending load (or truncating store), then legalization
will scalarize the load (or store) resulting in an associated scalarization
cost. BasicTTI::getMemoryOpCost needs to account for this.
Between this, and r205487, PowerPC on the P7 with VSX enabled shows:
MultiSource/Benchmarks/PAQ8p/paq8p: 43% speedup
SingleSource/Benchmarks/BenchmarkGame/puzzle: 51% speedup
SingleSource/UnitTests/Vectorizer/gcc-loops 28% speedup
(some of these are new; some of these, such as PAQ8p, just reverse regressions
that VSX support would trigger)
llvm-svn: 205495
For an cast (extension, etc.), the currently logic predicts a low cost if the
associated operation (keyed on the destination type) is legal (or promoted).
This is not true when the number of values required to legalize the type is
changing. For example, <8 x i16> being sign extended by <8 x i32> is not
generically cheap on PPC with VSX, even though sign extension to v4i32 is
legal, because two output v4i32 values are required compared to the single
v8i16 input value, and without custom logic in the target, this conversion will
scalarize.
llvm-svn: 205487
- Instead of setting the suffixes in a bunch of places, just set one master
list in the top-level config. We now only modify the suffix list in a few
suites that have one particular unique suffix (.ml, .mc, .yaml, .td, .py).
- Aside from removing the need for a bunch of lit.local.cfg files, this enables
4 tests that were inadvertently being skipped (one in
Transforms/BranchFolding, a .s file each in DebugInfo/AArch64 and
CodeGen/PowerPC, and one in CodeGen/SI which is now failing and has been
XFAILED).
- This commit also fixes a bunch of config files to use config.root instead of
older copy-pasted code.
llvm-svn: 188513
Thanks to help from Nadav and Hal, I have a more reasonable (and even
correct!) approach. This specifically penalizes the insertelement
and extractelement operations for the performance hit that will occur
on PowerPC processors.
llvm-svn: 174725
This provides a place to add customized operation cost information and
control some other target-specific IR-level transformations.
The only non-trivial logic in this checkin assigns a higher cost to
unaligned loads and stores (covered by the included test case).
llvm-svn: 173520