Summary:
Extend AArch64RedundantCopyElimination to catch cases where the register
that is known to be zero is COPY'd in the predecessor block. Before
this change, this pass would catch cases like:
CBZW %W0, <BB#1>
BB#1:
%W0 = COPY %WZR // removed
After this change, cases like the one below are also caught:
%W0 = COPY %W1
CBZW %W1, <BB#1>
BB#1:
%W0 = COPY %WZR // removed
This change results in a 4% increase in static copies removed by this
pass when compiling the llvm test-suite. It also fixes regressions
caused by doing post-RA copy propagation (a separate change to be put up
for review shortly).
Reviewers: junbuml, mcrosier, t.p.northover, qcolombet, MatzeB
Subscribers: aemerson, rengolin, llvm-commits
Differential Revision: https://reviews.llvm.org/D30113
llvm-svn: 295863
LLVM CodeGen emits references to external symbols that are never declared in
LLVM IR level, so they have no declared signature. However, WebAssembly requires
all functions be declared with signatures. This patch adds a table for providing
signatures for known runtime libcalls that will be used in subsequent patches to
emit declarations for such functions.
llvm-svn: 295857
Minor optimization, don't create temporary mask APInts that are just going to be OR'd into the accumulate masks - insert directly instead.
llvm-svn: 295848
The pass tries to fix a spill of LR that turns out to be unnecessary.
So it removes the tPOP but forgets to remove tPUSH.
This causes the stack be misaligned upon returning the function.
Thus, remove the tPUSH as well in this case.
Differential Revision: https://reviews.llvm.org/D30207
llvm-svn: 295816
This patch adds missing sched classes for Thumb2 instructions.
This has been missing so far, and as a consequence, machine
scheduler models for individual sub-targets have tended to
be larger than they needed to be. These patches should help
write schedulers better and faster in the future
for ARM sub-targets.
Reviewer: Diana Picus
Differential Revision: https://reviews.llvm.org/D29953
llvm-svn: 295811
This patch introduces new X86ISD::FMAXS and X86ISD::FMINS opcodes. The legacy intrinsics now lower to this node. As do the AVX-512 masked intrinsics when the rounding mode is CUR_DIRECTION.
I've merged a copy of the tablegen multiclass avx512_fp_scalar into avx512_fp_scalar_sae. avx512_fp_scalar still needs to support CUR_DIRECTION appearing as a rounding mode for X86ISD::FADD_ROUND and others.
Differential revision: https://reviews.llvm.org/D30186
llvm-svn: 295810
This just adds the basic skeleton for supporting a new object file format.
All of the actual encoding will be implemented in followup patches.
Differential Revision: https://reviews.llvm.org/D26722
llvm-svn: 295803
Change implementation to use max instead of add.
min/max/med3 do not flush denormals regardless of the mode,
so it is OK to use it whether or not they are enabled.
Also allow using clamp with f16, and use knowledge
of dx10_clamp.
llvm-svn: 295788
Original code only used vector loads/stores for explicit vector arguments.
It could also do more loads/stores than necessary (e.g v5f32 would
touch 8 f32 values). Aggregate types were loaded one element at a time,
even the vectors contained within.
This change attempts to generalize (and simplify) parameter space
loads/stores so that vector loads/stores can be used more broadly.
Functionality of the patch has been verified by compiling thrust
test suite and manually checking the differences between PTX
generated by llvm with and without the patch.
General algorithm:
* ComputePTXValueVTs() flattens input/output argument into a flat list
of scalars to load/store and returns their types and offsets.
* VectorizePTXValueVTs() uses that data to create vectorization plan
which returns an array of flags marking boundaries of vectorized
load/stores. Scalars are represented as 1-element vectors.
* Code that generates loads/stores implements a simple state machine
that constructs a vector according to the plan.
Differential Revision: https://reviews.llvm.org/D30011
llvm-svn: 295784
Before frame offsets are calculated, try to eliminate the
frame indexes used by SGPR spills. Then we can delete them
after.
I think for now we can be sure that no other instruction
will be re-using the same frame indexes. It should be easy
to notice if this assumption ever breaks since everything
asserts if it tries to use a dead frame index later.
The unused emergency stack slot seems to still be left behind,
so an additional 4 bytes is still wasted.
llvm-svn: 295753
Summary:
Rework the code that was sinking/duplicating (icmp and, 0) sequences
into blocks where they were being used by conditional branches to form
more tbz instructions on AArch64. The new code is more general in that
it just looks for 'and's that have all icmp 0's as users, with a target
hook used to select which subset of 'and' instructions to consider.
This change also enables 'and' sinking for X86, where it is more widely
beneficial than on AArch64.
The 'and' sinking/duplicating code is moved into the optimizeInst phase
of CodeGenPrepare, where it can take advantage of the fact the
OptimizeCmpExpression has already sunk/duplicated any icmps into the
blocks where they are used. One minor complication from this change is
that optimizeLoadExt needed to be updated to always mark 'and's it has
determined should be in the same block as their feeding load in the
InsertedInsts set to avoid an infinite loop of hoisting and sinking the
same 'and'.
This change fixes a regression on X86 in the tsan runtime caused by
moving GVNHoist to a later place in the optimization pipeline (see
PR31382).
Reviewers: t.p.northover, qcolombet, MatzeB
Subscribers: aemerson, mcrosier, sebpop, llvm-commits
Differential Revision: https://reviews.llvm.org/D28813
llvm-svn: 295746
PC isn't allowed in the source operand of t2MOVr, so change the register class
to one without PC. SP handling is slightly trickier and changes depending on if
we're in ARMv8, so do that in checkTargetMatchPredicate.
Differential Revision: https://reviews.llvm.org/D30199
llvm-svn: 295732
This matches what is already done during shuffle lowering and helps prevent the need for a zero-vector in cases where shuffles match both patterns.
llvm-svn: 295723
Summary:
Sandy Bridge and later CPUs have better throughput using a SHLD to implement rotate versus the normal rotate instructions. Additionally it saves one uop and avoids a partial flag update dependency.
This patch implements this change on any Sandy Bridge or later processor without BMI2 instructions. With BMI2 we will use RORX as we currently do.
Reviewers: zvi
Reviewed By: zvi
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30181
llvm-svn: 295697
Pull out repeated code for extraction index operand and source vector value type.
Use isNullConstant helper to check for zero extraction index.
llvm-svn: 295670
There used to be a check in the IRTranslator that prevented us from having to
deal with atomic loads/stores. That check has been removed in r294993 and the
AArch64 backend was updated accordingly. This commit does the same thing for the
ARM backend.
In general, in the ARM backend we introduce fences during the atomic expand
pass, so we don't have to worry about atomics, *except* for the 32-bit ARMv8
target, which handles atomics more like AArch64. Since we don't want to worry
about that yet, just bail out of instruction selection if we find any atomic
loads.
llvm-svn: 295662
Its more profitable to go through memory (1 cycles throughput)
than using VMOVD + VPERMV/PSHUFB sequence ( 2/3 cycles throughput) to implement EXTRACT_VECTOR_ELT with variable index.
IACA tool was used to get performace estimation (https://software.intel.com/en-us/articles/intel-architecture-code-analyzer)
For example for var_shuffle_v16i8_v16i8_xxxxxxxxxxxxxxxx_i8 test from vector-shuffle-variable-128.ll I get 26 cycles vs 79 cycles.
Removing the VINSERT node, we don't need it any more.
Differential Revision: https://reviews.llvm.org/D29690
llvm-svn: 295660
Use tablegen to autogenerate isBranchtarget helper functions. This is a cleanup
that removes almost identical functions that differ only in a few constants.
Differential Revision: https://reviews.llvm.org/D30160
llvm-svn: 295649
Add WIG value to all of AVX instructions which ignore the W-bit in their encoding, instead of giving them the default value of 0.
This patch is needed for a follow up work on EVEX2VEX pass (replacing EVEX encoded instructions with their corresponding VEX version when possible).
Differential Revision: https://reviews.llvm.org/D29876
llvm-svn: 295643
Replaces existing approach that could only search BUILD_VECTOR nodes.
Requires getTargetConstantBitsFromNode to discriminate cases with all/partial UNDEF bits in each element - this should also be useful when we get around to supporting getTargetShuffleMaskIndices with UNDEF elements.
llvm-svn: 295613
As discussed on D27692, this permits another domain to be used to combine a shuffle at high depths.
We currently set the required depth at 4 or more combined shuffles, this is probably too high for most targets but is a good starting point and already helps avoid a number of costly variable shuffles.
llvm-svn: 295608
Add the infrastructure to flag whether float and/or int domains are permitable.
A future patch will enable domain crossing based off shuffle depth and the value types of the source vectors.
llvm-svn: 295604
The instructions are marked commutable, but without special handling we don't get the immediate correct.
While here also remove the masked memory forms that aren't commutable.
llvm-svn: 295602
This requires some instructions to be renamed to move the Y earlier in the instruction name. The new names are more consistent with other instructions.
llvm-svn: 295579
It seems we were already upgrading 128-bit VPCMOV, but the intrinsic was still defined and being used in isel patterns. While I was here I also simplified the tablegen multiclasses.
llvm-svn: 295564
When promoting the Load of a Store-Load pair to a COPY all kill flags
between the store and the load need to be cleared.
rdar://30402435
Differential Revision: https://reviews.llvm.org/D30110
llvm-svn: 295512
Newer ppc supports unaligned memory access, it reduces the cost of unaligned memory access significantly. This patch handles this case in PPCTTIImpl::getMemoryOpCost.
This patch fixes pr31492.
Differential Revision: https://reviews.llvm.org/D28630
This is resubmit of r292680, which was reverted by r293092. The internal application failures were actually caused by a source code bug.
llvm-svn: 295506
This set of patches adds support for Cavium ThunderX ARM64 processors:
* ThunderX
* ThunderX T81
* ThunderX T83
* ThunderX T88
Patch by Stefan Teleman
Differential Revision: https://reviews.llvm.org/D28891
llvm-svn: 295475
Removed the HasT2ExtractPack feature and replaced its references
with HasDSP. This then allows the Thumb2 extend instructions to be
selected for ARMv8M +dsp. These instruction descriptions have also
been refactored and more target tests have been added for their isel.
Differential Revision: https://reviews.llvm.org/D29623
llvm-svn: 295452
Add some asserts to make sure we're using the mappings that we think we're
using. This is to keep us from accidentally breaking functionality while moving
to TableGen'erated mappings.
llvm-svn: 295441
Start using the Subtarget to make decisions about what's legal. In particular,
we only mark floating point operations as legal if we have VFP2, which is
something we should've done from the very start.
llvm-svn: 295439
The original commit was reverted in r283329 due to a miscompile in
Chromium. That turned out to be the same issue as PR31257, which was
fixed in r295262.
llvm-svn: 295357
Defining nodes should not alias with one another, while clobbering
nodes can. When pushing defs on stacks, push clobbers first, link
non-clobbering defs, then push the defs.
The data flow in a statement is now: uses -> clobbers -> defs.
llvm-svn: 295356
The existing code always saves the xmm registers for 64-bit targets even if the
target doesn't support SSE (which is common for kernels). Thus, the compiler
inserts movaps instructions which lead to CPU exceptions when an interrupt
handler is invoked.
This commit fixes this bug by returning a register set without xmm registers
from getCalleeSavedRegs and getCallPreservedMask for such targets.
Patch by Philipp Oppermann.
Differential Revision: https://reviews.llvm.org/D29959
llvm-svn: 295347