There are a couple of potential improvements as seen in the IR and asm:
1. We're unnecessarily extending to a larger type to compare values.
2. The codegen for (select cond, 1, -1) could avoid a cmov.
(or we could change the order of the compares, so we have a select with 0 operand)
llvm-svn: 305802
This seems to be interacting badly with ASan somehow, causing false reports of
heap-buffer overflows: PR33514.
> Summary:
> The patch makes instruction count the highest priority for
> LSR solution for X86 (previously registers had highest priority).
>
> Reviewers: qcolombet
>
> Differential Revision: http://reviews.llvm.org/D30562
>
> From: Evgeny Stupachenko <evstupac@gmail.com>
llvm-svn: 305720
Summary:
The patch makes instruction count the highest priority for
LSR solution for X86 (previously registers had highest priority).
Reviewers: qcolombet
Differential Revision: http://reviews.llvm.org/D30562
From: Evgeny Stupachenko <evstupac@gmail.com>
llvm-svn: 304824
Summary:
Expanding the loop idiom test for memcpy to also recognize
unordered atomic memcpy. The only difference for recognizing
an unordered atomic memcpy and instead of a normal memcpy is
that the loads and/or stores involved are unordered atomic operations.
Background: http://lists.llvm.org/pipermail/llvm-dev/2017-May/112779.html
Patch by Daniel Neilson!
Reviewers: reames, anna, skatkov
Reviewed By: reames, anna
Subscribers: llvm-commits, mzolotukhin
Differential Revision: https://reviews.llvm.org/D33243
llvm-svn: 304806
Account for subvector extraction/insertion, helps prevent the vectorizers from selecting 256-bit vectors that will have to be split anyhow on AVX1 targets.
llvm-svn: 302378
getArithmeticInstrCost(), getShuffleCost(), getCastInstrCost(),
getCmpSelInstrCost(), getVectorInstrCost(), getMemoryOpCost(),
getInterleavedMemoryOpCost() implemented.
Interleaved access vectorization enabled.
BasicTTIImpl::getCastInstrCost() improved to check for legal extending loads,
in which case the cost of the z/sext instruction becomes 0.
Review: Ulrich Weigand, Renato Golin.
https://reviews.llvm.org/D29631
llvm-svn: 300052
Summary:
LSV wants to know the maximum size that can be loaded to a vector register.
On X86, this always matches the maximum register width. Implement this
accordingly and add a test to make sure that LSV can vectorize up to the
maximum permissible width on X86.
Reviewers: delena, arsenm
Reviewed By: arsenm
Subscribers: wdng, llvm-commits
Differential Revision: https://reviews.llvm.org/D31504
llvm-svn: 299589
getIntrinsicInstrCost() used to only compute scalarization cost based on types.
This patch improves this so that the actual arguments are checked when they are
available, in order to handle only unique non-constant operands.
Tests updates:
Analysis/CostModel/X86/arith-fp.ll
Transforms/LoopVectorize/AArch64/interleaved_cost.ll
Transforms/LoopVectorize/ARM/interleaved_cost.ll
The improvement in getOperandsScalarizationOverhead() to differentiate on
constants made it necessary to update the interleaved_cost.ll tests even
though they do not relate to intrinsics.
Review: Hal Finkel
https://reviews.llvm.org/D29540
llvm-svn: 297705
Refactoring to remove duplications of this method.
New method getOperandsScalarizationOverhead() that looks at the present unique
operands and add extract costs for them. Old behaviour was to just add extract
costs for one operand of the type always, which still happens in
getArithmeticInstrCost() if no operands are provided by the caller.
This is a good start of improving on this, but there are more places
that can be improved by using getOperandsScalarizationOverhead().
Review: Hal Finkel
https://reviews.llvm.org/D29017
llvm-svn: 293155
updated instructions:
pmulld, pmullw, pmulhw, mulsd, mulps, mulpd, divss, divps, divsd, divpd, addpd and subpd.
special optimization case which replaces pmulld with pmullw\pmulhw\pshuf seq.
In case if the real operands bitwidth <= 16.
Differential Revision: https://reviews.llvm.org/D28104
llvm-svn: 291657
The 'fast' costs should only work for shifts by uniform constants (uniform non-constant are lowered using the slow default implementation).
Logical shifts were not taking into account that we must mask the psrlw result, so the costs needed to be doubled.
Added missing AVX2/AVX512BW costs as well.
llvm-svn: 291391
This code seems to be target dependent which may not be the same for all targets.
Passed the decision whether the given stride is complex or not to the target by sending stride information via SCEV to getAddressComputationCost instead of 'IsComplex'.
Specifically at X86 targets we dont see any significant address computation cost in case of the strided access in general.
Differential Revision: https://reviews.llvm.org/D27518
llvm-svn: 291106
X86 target does not provide any target specific cost calculation for interleave patterns.It uses the common target-independent calculation, which gives very high numbers. As a result, the scalar version is chosen in many cases. The situation on AVX-512 is even worse, since we have 3-src shuffles that significantly reduce the cost.
In this patch I calculate the cost on AVX-512. It will allow to compare interleave pattern with gather/scatter and choose a better solution (PR31426).
* Shiffle-broadcast cost will be changed in Simon's upcoming patch.
Differential Revision: https://reviews.llvm.org/D28118
llvm-svn: 290810
Add explicit v16i16/v32i8 ADD/SUB costs, matching the costs of v4i64/v8i32 - they were missing for some reason.
This has side effects on the LV max bandwidth tests (AVX1 now prefers 128-bit vectors vs AVX2 which still prefers 256-bit)
llvm-svn: 286832
This patch avoids scalarization of CTLZ by instead expanding to use CTPOP (ref: "Hacker's Delight") when the necessary operations are available.
This also adds the necessary cost models for X86 SSE2 targets (the main beneficiary) to ensure vectorization only happens when its useful.
Differential Revision: https://reviews.llvm.org/D25910
llvm-svn: 286233
There is a bug describing poor cost model for floating point operations:
Bug 29083 - [X86][SSE] Improve costs for floating point operations. This
patch is the second one in series of patches dealing with cost model.
Differential Revision: https://reviews.llvm.org/D25722
llvm-svn: 285564
With DQI but without VLX, lower v2i64 and v4i64 MUL operations with v8i64 MUL (vpmullq).
Updated cost table accordingly.
Differential Revision: https://reviews.llvm.org/D26011
llvm-svn: 285304
We were defaulting to SSE2 costs which weren't taking into account the availability of PBLENDW/PBLENDVB to improve merging of per-element shift results.
llvm-svn: 284939
We weren't accounting for legal types on every subtarget, meaning that many of the costs were using defaults.
We still don't correctly cost (or test) the 512-bit sdiv/udiv by uniform const cases, nor the power-of-2 cases.
llvm-svn: 284744
As discussed on PR28461 we currently miss the chance to lower "fptosi <2 x double> %arg to <2 x i32>" to cvttpd2dq due to its use of illegal types.
This patch adds support for fptosi to 2i32 from both 2f64 and 2f32.
It also recognises that cvttpd2dq zeroes the upper 64-bits of the xmm result (similar to D23797) - we still don't do this for the cvttpd2dq/cvttps2dq intrinsics - this can be done in a future patch.
Differential Revision: https://reviews.llvm.org/D23808
llvm-svn: 284459
The current Cost Model implementation is very inaccurate and has to be
updated, improved, re-implemented to be able to take into account the
concrete CPU models and the concrete targets where this Cost Model is
being used. For example, the Latency Cost Model should be differ from
Code Size Cost Model, etc.
This patch is the first step to launch the developing and implementation
of a new Cost Model generation.
Differential Revision: https://reviews.llvm.org/D25186
llvm-svn: 284012
This is a mechanical change of comments in switches like fallthrough,
fall-through, or fall-thru to use the LLVM_FALLTHROUGH macro instead.
llvm-svn: 278902
This reverts commit r278048. Something changed between the last time I
built this--it takes awhile on my ridiculously slow and ancient
computer--and now that broke this.
llvm-svn: 278053
Summary:
Based on two patches by Michael Mueller.
This is a target attribute that causes a function marked with it to be
emitted as "hotpatchable". This particular mechanism was originally
devised by Microsoft for patching their binaries (which they are
constantly updating to stay ahead of crackers, script kiddies, and other
ne'er-do-wells on the Internet), but is now commonly abused by Windows
programs to hook API functions.
This mechanism is target-specific. For x86, a two-byte no-op instruction
is emitted at the function's entry point; the entry point must be
immediately preceded by 64 (32-bit) or 128 (64-bit) bytes of padding.
This padding is where the patch code is written. The two byte no-op is
then overwritten with a short jump into this code. The no-op is usually
a `movl %edi, %edi` instruction; this is used as a magic value
indicating that this is a hotpatchable function.
Reviewers: majnemer, sanjoy, rnk
Subscribers: dberris, llvm-commits
Differential Revision: https://reviews.llvm.org/D19908
llvm-svn: 278048
Shifts with a uniform but non-constant count were considered very expensive to
vectorize, because the splat of the uniform count and the shift would tend to
appear in different blocks. That made the splat invisible to ISel, and we'd
scalarize the shift at codegen time.
Since r201655, CodeGenPrepare sinks those splats to be next to their use, and we
are able to select the appropriate vector shifts. This updates the cost model to
to take this into account by making shifts by a uniform cheap again.
Differential Revision: https://reviews.llvm.org/D23049
llvm-svn: 277782
This patch adds costs for the vectorized implementations of CTPOP, the default values were seriously underestimating the cost of these and was encouraging vectorization on targets where serialized use of POPCNT would be much better.
Differential Revision: https://reviews.llvm.org/D22456
llvm-svn: 276104
Make some AVX and AVX512 cast costs more precise.
Based on part of a patch by Elena Demikhovsky (D15604).
Differential Revision: http://reviews.llvm.org/D22064
llvm-svn: 275106
This is "cvtdq2ps" which does not appear to be particularly slow on any CPU
according to Agner's tables. Choosing "5" as a cost here as suggested in:
https://llvm.org/bugs/show_bug.cgi?id=21356
...but it seems very conservative given that the instruction is fully pipelined,
and I think these costs are supposed to model throughput.
Note that related costs are also most likely too high, but this fixes PR21356
and partly fixes PR28434.
llvm-svn: 274658
Cast cost tables are now sorted, for each cast type, lexicographically on
[source base type, source vector width, dest base type, base vector width].
llvm-svn: 274653
The BSWAP of vector types is quite efficiently implemented using vector shuffles on SSE/AVX targets, we should reflect the typical cost of this to encourage vectorization.
Differential Revision: http://reviews.llvm.org/D21521
llvm-svn: 273217
The costs are somewhat hand-wavy, but should be much closer to the truth
than what we get from BasicTTI.
Differential Revision: http://reviews.llvm.org/D21156
llvm-svn: 272406
By making pointer extraction from a vector more expensive in the cost model,
we avoid the vectorization of a loop that is very likely to be memory-bound:
https://llvm.org/bugs/show_bug.cgi?id=27826
There are still bugs related to this, so we may need a more general solution
to avoid vectorizing obviously memory-bound loops when we don't have HW gather
support.
Differential Revision: http://reviews.llvm.org/D20601
llvm-svn: 270729
As discussed on PR24888, until SSE42 we don't have access to PCMPGTQ for v2i64 comparisons, but the cost models don't reflect this, resulting in over-optimistic vectorizaton.
This patch adds SSE2 'base level' costs that match what a typical target is capable of and only reduces the v2i64 costs at SSE42.
Technically SSE41 provides a PCMPEQQ v2i64 equality test, but as getCmpSelInstrCost doesn't give us a way to discriminate between comparison test types we can't easily make use of this, otherwise we could split the cost of integer equality and greater-than tests to give better costings of each.
Differential Revision: http://reviews.llvm.org/D20057
llvm-svn: 268972
Summary:
rL256194 transforms truncations between vectors of integers into PACKUS/PACKSS
operations during DAG combine. This generates better code for truncate, so cost
of truncate needs to be changed but looks like it got changed only in SSE2 table
Whereas this change is also applicable for SSE4.1, so the cost of truncate needs
to be changed for that as well. Cost of “TRUNCATE v16i32 to v16i8” & “TRUNCATE
v16i16 to v16i8” should be same in SSE4.1 & SSE2 table. Removing their cost from
SSE4.1, so it will fall back to SSE2.
Reviewers: Simon Pilgrim
llvm-svn: 267123
This code was creating a new type in the global context, regardless
of which context the user is sitting in, what can possibly go wrong?
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 266275
The irony of this patch is that one CPU that is affected is AMD Jaguar, and Jaguar
has a completely double-pumped AVX implementation. But getting the cost model to
reflect that is a much bigger problem. The small goal here is simply to improve on
the lie that !AVX2 == SandyBridge.
Differential Revision: http://reviews.llvm.org/D18000
llvm-svn: 263069
Use AVX1 FP instructions (vmaskmovps/pd) in place of the AVX2 int instructions (vpmaskmovd/q).
Differential Revision: http://reviews.llvm.org/D16528
llvm-svn: 258675
The cost is calculated for all X86 targets. When gather/scatter instruction
is not supported we calculate the cost of scalar sequence.
Differential revision: http://reviews.llvm.org/D15677
llvm-svn: 256519
This patch transforms truncation between vectors of integers into
X86ISD::PACKUS/PACKSS operations during DAG combine. We don't do it in
lowering phase because after type legalization, the original truncation
will be turned into a BUILD_VECTOR with each element that is extracted
from a vector and then truncated, and from them it is difficult to do
this optimization. This greatly improves the performance of truncations
on some specific types.
Cost table is updated accordingly.
Differential revision: http://reviews.llvm.org/D14588
llvm-svn: 256194