Enable multiple COPY hints to eliminate more COPYs during register allocation.
Note that this is something all targets should do, see
https://reviews.llvm.org/D38128.
Review: Yonghong Song
llvm-svn: 325457
Previously we used the immediate encoding if the load was in operand 0 and the short encoding if the load was in operand 1.
This added an insane number of bytes to the size of the isel table. I'm wondering if we should always use the immediate form during isel and change to the short form during emission. This would remove the need to pattern match every combination for both the immediate form and the short form during isel. We could do the same with vpcmpgt
llvm-svn: 325456
This makes sure that alloca() function calls properly probe the
stack as needed.
Differential Revision: https://reviews.llvm.org/D42356
llvm-svn: 325433
Enable multiple COPY hints to eliminate more COPYs during register allocation.
Note that this is something all targets should do, see
https://reviews.llvm.org/D38128.
Review: Stanislav Mekhanoshin, Tom Stellard.
llvm-svn: 325425
Sadly, r324359 caused at least PR36312. There is a patch out for review
but it seems to be taking a bit and we've already had these crashers in
tree for too long. We're hitting this PR in real code now and are
blocked on shipping new compilers as a consequence so I'm reverting us
back to green.
Sorry for the churn due to the stacked changes that I had to revert. =/
llvm-svn: 325420
Summary:
Currently we convert to shuffles during lowering. This moves it to DAG combine so hopefully we can get it done before type legalization has to extend the condition.
I believe in some cases we're creating SHRUNKBLENDs that end up with constant conditions because we see the extended on the condition and think its a dynamic selelect before DAG combine gets a chance to constant fold the extend. We could add combines to turn SHRUNKBLENDs with constant condition back to vselect. But it seemed like it might be better to just send them to shuffles as early as possible so they never get a chance to become SHRUNKBLENDs. This the reason some tests went from blends controlled by a constant pool load to just move.
Some of the constant pool entries changed because the sign_extend introduced by type legalization turned undef elements in select condition into 0s. While the select->shuffle used -1 in the shuffle mask. So now the shuffle lowering can do what it wants with them.
I'll remove the lowering code as a follow up. We might be able to simplify some of the pre-checks for SHRUNKBLEND as the FIXME there says.
Reviewers: spatel, RKSimon, efriedma, zvi, andreadb
Reviewed By: spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D43367
llvm-svn: 325417
Undef in select condition means we should pick the element from one side or the other. An undef in a shuffle mask means pick any element from either source or worse.
I suspect by the time we get here most of the undefs in a constant vector have been removed by other things, but doing this for safety.
llvm-svn: 325394
The data type is assumed to be a vector, but sometimes it is not, leading
to an assertion.
Add simple test-case to verify this.
Differential revision: https://reviews.llvm.org/D42599
llvm-svn: 325378
Summary:
This patch extends the promotion of alloca to vector to the arrays of up to 16 elements. Also we introduce
an option, -disable-promote-alloca-to-vector, to switch promotion to vector off, if needed.
Reviewers:
arsenm
Differential Revision:
https://reviews.llvm.org/D33559
llvm-svn: 325372
This seems to interfere with a target independent brcond combine that looks for the (srl (and X, C1), C2) pattern to enable TEST instructions. Once we flip, that combine doesn't fire and we end up exposing it to the X86 specific BT combine which causes us to emit a BT instruction. BT has lower throughput than TEST.
We could try to make the brcond combine aware of the alternate pattern, but since the flip was just a code size reduction and not likely to enable other combines, it seemed easier to just delay it until after lowering.
Differential Revision: https://reviews.llvm.org/D43201
llvm-svn: 325371
Add an explicit check before looking up symbol in SymbolIndices.
This was previously silently succeeding and returning zero for such
unnamed temporaries.
Differential Revision: https://reviews.llvm.org/D43365
llvm-svn: 325367
Summary:
In the current implementation of GPR Indexing Mode when the index is of non-uniform, the s_set_gpr_idx_off instruction
is incorrectly inserted after the loop. This will lead the instructions with vgpr operands (v_readfirstlane for example) to read incorrect
vgpr.
In this patch, we fix the issue by inserting s_set_gpr_idx_on/off immediately around the interested instruction.
Reviewers:
rampitec
Differential Revision:
https://reviews.llvm.org/D43297
llvm-svn: 325355
Running a bootstrap build with UBSan produces a number of instances where
we have signed integer overflow due to this transform. Change the type to
long to prevent this UB on 64-bit build machines.
llvm-svn: 325347
These instructions conflict with their full length variants
for the purposes of FastISel as they cannot be distingushed
based on the number and type of operands and predicates.
Reviewers: atanasyan
Differential Revision: https://reviews.llvm.org/D41285
llvm-svn: 325341
Enable multiple COPY hints to eliminate more COPYs during register allocation.
Note that this is something all targets should do, see
https://reviews.llvm.org/D38128.
Review: Eli Friedman
llvm-svn: 325327
This patch combines some cases of ARMISD::CMOV for integers that arise in comparisons of the form
a != b ? x : 0
a == b ? 0 : x
and that currently (e.g. in Thumb1) are emitted as branches.
Differential Revision: https://reviews.llvm.org/D34515
llvm-svn: 325323
Summary:
This patch makes the decoder understand old AMD 3DNow!
instructions that have never been properly supported in the X86
disassembler, despite being supported in other subsystems. Hopefully
this should make the X86 decoder more complete with respect to binaries
containing legacy code.
Reviewers: craig.topper
Reviewed By: craig.topper
Subscribers: llvm-commits, maksfb, bruno
Differential Revision: https://reviews.llvm.org/D43311
llvm-svn: 325295
We already do this for 64-bit when it won't fit into a 64-bit AND/TEST's immediate field. This adds an additional qualifier to do it for any single bit constant larger than 8-bits under optsize
Differential Revision: https://reviews.llvm.org/D43346
llvm-svn: 325290
We can't fold a large immediate into a 64-bit operation. But if we know we're only operating on a single bit we can use the bit instructions.
For now only do this for optsize.
Differential Revision: https://reviews.llvm.org/D37418
llvm-svn: 325287
Use the FunctionType of the callee when it's available. It may not be
available for synthetic calls to functions specified by external symbols.
llvm-svn: 325269
The reference '&' is missing in the function parameter. If there are
back-to-back optimizations in terms of dag node list like below:
t29: i64,ch = load<LD4[bitcast (%struct.test_t* @test.t to i8*)+12](dereferenceable), zext from i32> t3, t43, undef:i64
t34: i64,ch = load<LD4[bitcast (%struct.test_t* @test.t to i8*)](dereferenceable), zext from i32> t3, t41, undef:i64
The bug will trigger a segfault for the added test case remove_truncate_5.ll:
LLVMSymbolizer: error reading file: No such file or directory
#0 0x000000000241c4d9 (llc+0x241c4d9)
#1 0x000000000241c56a (llc+0x241c56a)
#2 0x000000000241aa50 (llc+0x241aa50)
...
#22 0x0000000000fd5edf (llc+0xfd5edf)
#23 0x00007f0fe03bec05 __libc_start_main (/lib64/libc.so.6+0x21c05)
#24 0x0000000000fd3e69 (llc+0xfd3e69)
...
Segmentation fault
Signed-off-by: Yonghong Song <yhs@fb.com>
llvm-svn: 325267
The FunctionType of the callee is always available, even if the Function
of the callee is not. Use that to get the number of fixed parameters.
llvm-svn: 325259
Summary:
In LLVM, 't' selects a floating-point/SIMD register and only supports
32-bit values. This is appropriately documented in the LLVM Language
Reference Manual. However, this behaviour diverges from that of GCC, where
't' selects the s0-s31 registers and its qX and dX variants depending on
additional operand modifiers (q/P).
For example, the following C code:
#include <arm_neon.h>
float32x4_t a, b, x;
asm("vadd.f32 %0, %1, %2" : "=t" (x) : "t" (a), "t" (b))
results in the following assembly if compiled with GCC:
vadd.f32 s0, s0, s1
whereas LLVM will show "error: couldn't allocate output register for
constraint 't'", since a, b, x are 128-bit variables, not 32-bit.
This patch extends the use of 't' to mean that of GCC, thus allowing
selection of the lower Q vector regs and their D/S variants. For example,
the earlier code will now compile as:
vadd.f32 q0, q0, q1
This behaviour still differs from that of GCC but I think it is actually
more correct, since LLVM picks up the right register type based on the
datatype of x, while GCC would need an extra operand modifier to achieve
the same result, as follows:
asm("vadd.f32 %q0, %q1, %q2" : "=t" (x) : "t" (a), "t" (b))
Since this is only an extension of functionality, existing code should not
be affected by this change. Note that operand modifiers q/P are already
supported by LLVM, so this patch should suffice to support inline
assembly with constraint 't' originally built for GCC.
Reviewers: grosbach, rengolin
Reviewed By: rengolin
Subscribers: rogfer01, efriedma, olista01, aemerson, javed.absar, eraman, kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D42962
llvm-svn: 325244
We can use PACKSS to saturate each stage of the chain: PACKSSDW down to [-32768,32767] and then PACKSSWB to [-128,127].
PACKUS is a little trickier and will be handled in a separate patch.
llvm-svn: 325235
The bound instruction does not have reversed operands in gas.
Fixes PR27653.
Patch by Maya Madhavan.
Differential Revision: https://reviews.llvm.org/D43243
llvm-svn: 325178
Try to keep PACK*SDW/PACK*SWB as wide as possible, this helps ComputeNumSignBits as it can only peek through bitcasts to wider types, pre-AVX2 codegen was already doing this as it could peek through bitcasts/subvectors more easily than AVX2 could through shuffles.
This shouldn't affect existing results as calls to truncateVectorWithPACK ensure we have enough sign bits to pack to the same value, but it should make it possible to use truncateVectorWithPACK chains to perform saturation in combineTruncateWithSat with a future patch.
llvm-svn: 325149
Kernel arguments likely read by all workitems and should not bypass
cache. Fixes performance hit in sub-dword argument loads.
Differential Revision: https://reviews.llvm.org/D43249
llvm-svn: 325146
The prologue-end line record must be emitted after the last
instruction that is part of the function frame setup code and before
the instruction that marks the beginning of the function body.
Patch by Carlos Alberto Enciso!
Differential Revision: https://reviews.llvm.org/D41762
llvm-svn: 325143
If a load follows a store and reloads data that the store has written to memory, Intel microarchitectures can in many cases forward the data directly from the store to the load, This "store forwarding" saves cycles by enabling the load to directly obtain the data instead of accessing the data from cache or memory.
A "store forward block" occurs in cases that a store cannot be forwarded to the load. The most typical case of store forward block on Intel Core microarchiticutre that a small store cannot be forwarded to a large load.
The estimated penalty for a store forward block is ~13 cycles.
This pass tries to recognize and handle cases where "store forward block" is created by the compiler when lowering memcpy calls to a sequence
of a load and a store.
The pass currently only handles cases where memcpy is lowered to XMM/YMM registers, it tries to break the memcpy into smaller copies.
breaking the memcpy should be possible since there is no atomicity guarantee for loads and stores to XMM/YMM.
Change-Id: Ic41aa9ade6512e0478db66e07e2fde41b4fb35f9
llvm-svn: 325128
While the AVX512 VTRUNCS/VTRUNCUS instructions require legal types, truncateVectorWithPACK handles cases with multiples of legal types through splitting/concatenation. So we just need to ensure that the src/dst scalar types are correct and leave truncateVectorWithPACK to handle the rest of it.
llvm-svn: 325127
* Document most API's
* Delete a useless function call
* Fix a discrepancy between the single and multi-opcode variants of
getActionDefinitions().
The multi-opcode variant now requires that more than one opcode is requested.
Previously it acted much like the single-opcode form but unnecessarily
enforced the requirements of the multi-opcode form.
llvm-svn: 325067
Summary:
Instead of solving the hard problem of how to pass the callee to the indirect
jump thunk without a register, just use a CSR. At a call boundary, there's
nothing stopping us from using a CSR to hold the callee as long as we save and
restore it in the prologue.
Also, add tests for this mregparm=3 case. I wrote execution tests for
__llvm_retpoline_push, but they never got committed as lit tests, either
because I never rewrote them or because they got lost in merge conflicts.
Reviewers: chandlerc, dwmw2
Subscribers: javed.absar, kristof.beyls, hiraditya, llvm-commits
Differential Revision: https://reviews.llvm.org/D43214
llvm-svn: 325049
It caused "Cannot select: t33: f64 = AArch64ISD::FMOV Constant:i32<0>"
in Chromium builds. See PR36369.
> Get rid of icky goto loops and make the code easier to maintain (NFC).
>
> Differential revision: https://reviews.llvm.org/D42723
llvm-svn: 325034
Change ARMConstantIslandPass to:
- accept f16 literals as litpool entries,
- if the litpool needs to be inserted in the middle of a big block, then we
need to 4-byte align the next instruction in ARM mode.
Differential Revision: https://reviews.llvm.org/D42784
llvm-svn: 325012
Summary:
wasm32-unknown-unknown-elf has MCSymbols that are not MCSymbolWasms, so
we need a non-asserting cast here.
Reviewers: dschuff, sunfish
Subscribers: jfb, sbc100, aheejin, llvm-commits
Differential Revision: https://reviews.llvm.org/D43205
llvm-svn: 324942
Expand existing SchedRW to encompass these like it did for the other memory offset movs - added comments to closing braces to keep track of def scopes.
We only tagged it with the itinerary class, so completeness checks were erroneously passed (PR35639).
llvm-svn: 324910
Armv8.1-A added an atomic load-clear instruction (which performs bitwise
and with the complement of it's operand), but not a load-and
instruction. Our current code-generation for atomic load-and always
inserts an MVN instruction to invert its argument, even if it could be
folded into a constant or another instruction.
This adds lowering early in selection DAG to convert a load-and
operation into an xor with -1 and a load-clear, allowing the normal DAG
optimisations to work on it.
To do this, I've had to add a new ISD opcode, ATOMIC_LOAD_CLR. I don't
see any easy way to do this with an AArch64-specific ISD node, because
the code-generation for atomic operations assumes the SDNodes are of
type AtomicSDNode.
I've left the old tablegen patterns in because they are still needed for
global isel.
Differential revision: https://reviews.llvm.org/D42478
llvm-svn: 324908
Tag AVX512 variants to match SSE/AVX originals.
We only tagged it with the itinerary class, so completeness checks were erroneously passed (PR35639).
llvm-svn: 324901
We only tagged it with the itinerary class, so completeness checks were erroneously passed (PR35639).
AMD targets can perform these a lot quicker than WriteMicrocoded so will need an override in the models.
llvm-svn: 324897
Armv8.1-A added an atomic load-add instruction, but not a load-subtract
instruction. Our current code-generation for atomic load-subtract always
inserts a NEG instruction to negate it's argument, even if it could be
folded into a constant or another instruction.
This adds lowering early in selection DAG to convert a load-subtract
operation into a subtract and a load-add, allowing the normal DAG
optimisations to work on it.
I've left the old tablegen patterns in because they are still needed for
global isel.
Some of the tests in this patch are copied from D35375 by Chad Rosier (which
was abandoned).
Differential revision: https://reviews.llvm.org/D42477
llvm-svn: 324892
It asserts building Chromium; see PR36346.
(This also reverts the follow-up r324836.)
> If a load follows a store and reloads data that the store has written to memory, Intel microarchitectures can in many cases forward the data directly from the store to the load, This "store forwarding" saves cycles by enabling the load to directly obtain the data instead of accessing the data from cache or memory.
> A "store forward block" occurs in cases that a store cannot be forwarded to the load. The most typical case of store forward block on Intel Core microarchiticutre that a small store cannot be forwarded to a large load.
> The estimated penalty for a store forward block is ~13 cycles.
>
> This pass tries to recognize and handle cases where "store forward block" is created by the compiler when lowering memcpy calls to a sequence
> of a load and a store.
>
> The pass currently only handles cases where memcpy is lowered to XMM/YMM registers, it tries to break the memcpy into smaller copies.
> breaking the memcpy should be possible since there is no atomicity guarantee for loads and stores to XMM/YMM.
llvm-svn: 324887
In case of correct using of the 'l' constraint llvm now generates valid
code; otherwise it shows an error message. Initially these triggers an
assertion.
This commit is the same as r324869 with fixed the test's file name.
llvm-svn: 324885
Add a common -trap-unreachable option, similar to the target
specific hexagon equivalent, which has been replaced. This
turns unreachable instructions into traps, which is useful for
debugging.
Differential Revision: https://reviews.llvm.org/D42965
llvm-svn: 324880
In case of correct using of the 'l' constraint llvm now generates valid
code; otherwise it shows an error message. Initially these triggers an
assertion.
llvm-svn: 324869
I don't believe we ever create an X86ISD::SUB with a 0 constant which is what the TEST handling needs. The ternary operator at the end of this code shows up as only going one way in the llvm-cov report from the bots.
llvm-svn: 324865
ISD::ADD implies individual vector element addition with no carries between elements. But for a vXi1 type that would be the same as XOR. And we already turn ISD::ADD into ISD::XOR for all vXi1 types during lowering. So the ISD::ADD pattern would never be able to match anyway.
KADD is different, it adds the elements but also propagates a carry between them. This just a way of doing an add in k-register without bitcasting to the scalar domain. There's still no way to match the pattern, but at least its not obviously wrong.
llvm-svn: 324861
Previously we just emitted this as a MOV8rm which would likely get folded during the peephole pass anyway. This just makes it explicit earlier.
The gpr-to-mask.ll test changed because the kaddb instruction has no memory form.
llvm-svn: 324860
Summary:
Currently we only use min/max to help with ule/uge compares because it removes an invert of the result that would otherwise be needed. But we can also use it for ult/ugt compares if it will prevent the need for a sign bit flip needed to use pcmpgt at the cost of requiring an invert after the compare.
I also refactored the code so that the max/min code is self contained and does its own return instead of setting up a flag to manipulate the rest of the function's behavior.
Most of the test cases look ok with this. I did notice that we added instructions when one of the operands being sign flipped is a constant vector that we were able to constant fold the flip into.
I also noticed that sometimes the SSE min/max clobbers a register that is needed after the compare. This resulted in an extra move being inserted before the min/max to preserve the register. We could try to detect this and switch from min to max and change the compare operands to use the operand that gets reused in the compare.
Reviewers: spatel, RKSimon
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D42935
llvm-svn: 324842
This allows us to recognise more saturation patterns and also simplify some MINMAX codegen that was failing to combine CMPGE comparisons to a legal CMPGT.
Differential Revision: https://reviews.llvm.org/D43014
llvm-svn: 324837
If a load follows a store and reloads data that the store has written to memory, Intel microarchitectures can in many cases forward the data directly from the store to the load, This "store forwarding" saves cycles by enabling the load to directly obtain the data instead of accessing the data from cache or memory.
A "store forward block" occurs in cases that a store cannot be forwarded to the load. The most typical case of store forward block on Intel Core microarchiticutre that a small store cannot be forwarded to a large load.
The estimated penalty for a store forward block is ~13 cycles.
This pass tries to recognize and handle cases where "store forward block" is created by the compiler when lowering memcpy calls to a sequence
of a load and a store.
The pass currently only handles cases where memcpy is lowered to XMM/YMM registers, it tries to break the memcpy into smaller copies.
breaking the memcpy should be possible since there is no atomicity guarantee for loads and stores to XMM/YMM.
Change-Id: I620b6dc91583ad9a1444591e3ddc00dd25d81748
llvm-svn: 324835
This patch adds a new function attribute "required-vector-width" that can be set by the frontend to indicate the maximum vector width present in the original source code. The idea is that this would be set based on ABI requirements, intrinsics or explicit vector types being used, maybe simd pragmas, etc. The backend will then use this information to determine if its save to make 512-bit vectors illegal when the preference is for 256-bit vectors.
For code that has no vectors in it originally and only get vectors through the loop and slp vectorizers this allows us to generate code largely similar to our AVX2 only output while still enabling AVX512 features like mask registers and gather/scatter. The loop vectorizer doesn't always obey TTI and will create oversized vectors with the expectation the backend will legalize it. In order to avoid changing the vectorizer and potentially harm our AVX2 codegen this patch tries to make the legalizer behavior similar.
This is restricted to CPUs that support AVX512F and AVX512VL so that we have good fallback options to use 128 and 256-bit vectors and still get masking.
I've qualified every place I could find in X86ISelLowering.cpp and added tests cases for many of them with 2 different values for the attribute to see the codegen differences.
We still need to do frontend work for the attribute and teach the inliner how to merge it, etc. But this gets the codegen layer ready for it.
Differential Revision: https://reviews.llvm.org/D42724
llvm-svn: 324834
We promote these via a DAG combine now before lowering gets the chance.
Also remove the v2i1 custom handling since it will no longer be triggered.
llvm-svn: 324833
These were added as part of the refactoring for prefer vector width. At the time I thought the hasAVX512 here would be replaced with "allow 512 bit vectors" so that it would read "allow 512 bit vectors OR VLX". But now the plan is to only give the option of disabling 512 bit vectors when VLX is enabled. So we don't need this qualification at all
llvm-svn: 324831
Summary:
This patch changes the signature of the avx512 packed fp compare intrinsics to return a vXi1 vector and no longer take a mask as input. The casts to scalar type will now need to be explicit in the IR. The masking node will now be an explicit and in the IR.
This makes the intrinsic look much more similar to an fcmp instruction that we wish we could use for these but can't. We already use icmp instructions for integer compares.
Previously the lowering step of isel would turn the intrinsic into an X86 specific ISD node and a emit the masking nodes as well as some bitcasts. This means DAG combines can't see the vXi1 type until somewhat late, making it more difficult to combine out gpr<->mask transition sequences. By exposing the vXi1 type explicitly in the IR and initial SelectionDAG we give earlier DAG combines and even InstCombine the chance to see it and optimize it.
This should make any issues with gpr<->mask sequences the same between integer and fp. Meaning we only have to fix them once.
Reviewers: spatel, delena, RKSimon, zvi
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D43137
llvm-svn: 324827
Undef VLX, getSetCCResultType returns v2i1/v4i1 for v2f32/v4f32 so default type legalization will end up changing the setcc result type back to vXi1 if it had been extended. The resulting extend gets messed up further by type legalization and is difficult to recombine back to (v4i32 (setcc (v4f32))) after legalization.
I went ahead and enabled this for SSE2 and later since its always the result we want and this helps type legalization get there in less steps.
llvm-svn: 324822
This prevents extends of masks being introduced during lowering where it become difficult to combine them out.
There are a few oddities in here.
We sometimes concatenate two k-registers produced by two compares, sign_extend the combined pair, then extract two halves. This worked better previously because the sign_extend wasn't created until after the fp_to_sint was split which led to a split sign_extend being created.
We probably also need to custom type legalize (v2i32 (sext v2i1)) via widening.
llvm-svn: 324820
This avoids a constant pool load to create 1.
The int->float are showing converts to mask and back. We probably need to widen inputs to sint_to_fp/uint_to_fp before type legalization.
llvm-svn: 324805
Summary:
This change is part of step five in the series of changes to remove alignment argument from
memcpy/memmove/memset in favour of alignment attributes. In particular, this changes the
Hexagon LoopIdiom pass to cease using the old IRBuilder createMemCpy/createMemMove
single-alignment APIs in favour of the new API that allows setting source and
destination alignments independently.
Steps:
Step 1) Remove alignment parameter and create alignment parameter attributes for
memcpy/memmove/memset. ( rL322965, rC322964, rL322963 )
Step 2) Expand the IRBuilder API to allow creation of memcpy/memmove with differing
source and dest alignments. ( rL323597 )
Step 3) Update Clang to use the new IRBuilder API. ( rC323617 )
Step 4) Update Polly to use the new IRBuilder API. ( rL323618 )
Step 5) Update LLVM passes that create memcpy/memmove calls to use the new IRBuilder API,
and those that use use MemIntrinsicInst::[get|set]Alignment() to use [get|set]DestAlignment()
and [get|set]SourceAlignment() instead. ( rL323886, rL323891, rL324148, rL324273, rL324278,
rL324384, rL324395, rL324402, rL324626, rL324642, rL324653, rL324654, rL324773, rL324774,
rL324781 )
Step 6) Remove the single-alignment IRBuilder API for memcpy/memmove, and the
MemIntrinsicInst::[get|set]Alignment() methods.
Reference
http://lists.llvm.org/pipermail/llvm-dev/2015-August/089384.htmlhttp://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20151109/312083.html
llvm-svn: 324784
This change is part of step five in the series of changes to remove alignment argument from
memcpy/memmove/memset in favour of alignment attributes. In particular, this changes
ARMFastISel to cease using the old getAlignment() API of MemoryIntrinsic in favour of getting
source & dest specific alignments through the new API.
Steps:
Step 1) Remove alignment parameter and create alignment parameter attributes for
memcpy/memmove/memset. ( rL322965, rC322964, rL322963 )
Step 2) Expand the IRBuilder API to allow creation of memcpy/memmove with differing
source and dest alignments. ( rL323597 )
Step 3) Update Clang to use the new IRBuilder API. ( rC323617 )
Step 4) Update Polly to use the new IRBuilder API. ( rL323618 )
Step 5) Update LLVM passes that create memcpy/memmove calls to use the new IRBuilder API,
and those that use use MemIntrinsicInst::[get|set]Alignment() to use [get|set]DestAlignment()
and [get|set]SourceAlignment() instead. ( rL323886, rL323891, rL324148, rL324273, rL324278,
rL324384, rL324395, rL324402, rL324626, rL324642, rL324653, rL324654, rL324773, rL324774 )
Step 6) Remove the single-alignment IRBuilder API for memcpy/memmove, and the
MemIntrinsicInst::[get|set]Alignment() methods.
Reference
http://lists.llvm.org/pipermail/llvm-dev/2015-August/089384.htmlhttp://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20151109/312083.html
llvm-svn: 324781
This adds a wasm-import-module function attribute and a .import_module
assembler directive, for specifying module import names for WebAssembly.
Currently these may only be used for function symbols; global variables
may be considered in the future.
WebAssembly has a two-level namespace scheme for symbols, and it's
normally the linker's job to assign the module name, which is the
first-level name. The attributes here allow users to specify their
own module names explicitly, which is useful for tools generating
bindings to modules defined in other languages.
This feature is not fully usable yet. It will evolve along with the
ongoing symbol table and lld changes.
Differential Revision: https://reviews.llvm.org/D42520
llvm-svn: 324778
Summary:
This change is part of step five in the series of changes to remove alignment argument from
memcpy/memmove/memset in favour of alignment attributes. In particular, this changes the
AMDGPUPromoteAlloca pass to cease using:
1) The old getAlignment() API of MemoryIntrinsic in favour of getting source & dest specific
alignments through the new API.
2) The old IRBuilder createMemCpy/createMemMove single-alignment APIs in favour of the new
API that allows setting source and destination alignments independently.
Steps:
Step 1) Remove alignment parameter and create alignment parameter attributes for
memcpy/memmove/memset. ( rL322965, rC322964, rL322963 )
Step 2) Expand the IRBuilder API to allow creation of memcpy/memmove with differing
source and dest alignments. ( rL323597 )
Step 3) Update Clang to use the new IRBuilder API. ( rC323617 )
Step 4) Update Polly to use the new IRBuilder API. ( rL323618 )
Step 5) Update LLVM passes that create memcpy/memmove calls to use the new IRBuilder API,
and those that use use MemIntrinsicInst::[get|set]Alignment() to use [get|set]DestAlignment()
and [get|set]SourceAlignment() instead. ( rL323886, r323891, rL324148, rL324273, rL324278,
rL324384, rL324395, rL324402, rL324626, rL324642, rL324653, rL324654, rL324773 )
Step 6) Remove the single-alignment IRBuilder API for memcpy/memmove, and the
MemIntrinsicInst::[get|set]Alignment() methods.
Reference
http://lists.llvm.org/pipermail/llvm-dev/2015-August/089384.htmlhttp://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20151109/312083.html
llvm-svn: 324774
Summary:
This change is part of step five in the series of changes to remove alignment argument from
memcpy/memmove/memset in favour of alignment attributes. In particular, this changes
AArch64FastISel to cease using the old getAlignment() API of MemoryIntrinsic in favour of getting
source & dest specific alignments through the new API.
Steps:
Step 1) Remove alignment parameter and create alignment parameter attributes for
memcpy/memmove/memset. ( rL322965, rC322964, rL322963 )
Step 2) Expand the IRBuilder API to allow creation of memcpy/memmove with differing
source and dest alignments. ( rL323597 )
Step 3) Update Clang to use the new IRBuilder API. ( rC323617 )
Step 4) Update Polly to use the new IRBuilder API. ( rL323618 )
Step 5) Update LLVM passes that create memcpy/memmove calls to use the new IRBuilder API,
and those that use use MemIntrinsicInst::[get|set]Alignment() to use [get|set]DestAlignment()
and [get|set]SourceAlignment() instead. ( rL323886, r323891, rL324148, rL324273, rL324278,
rL324384, rL324395, rL324402, rL324626, rL324642, rL324653, rL324654 )
Step 6) Remove the single-alignment IRBuilder API for memcpy/memmove, and the
MemIntrinsicInst::[get|set]Alignment() methods.
Reference
http://lists.llvm.org/pipermail/llvm-dev/2015-August/089384.htmlhttp://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20151109/312083.html
llvm-svn: 324773
In the rare case where the input contains rip-relative addressing with
immediate displacements, *and* the instruction ends with an immediate,
we encode the instruction in the wrong way:
movl $12345678, 0x400(%rdi) // all good, no rip-relative addr
movl %eax, 0x400(%rip) // all good, no immediate at the end of the instruction
movl $12345678, 0x400(%rip) // fails, encodes address as 0x3fc(%rip)
Offset is a label:
movl $12345678, foo(%rip)
we want to account for the size of the immediate (in this case,
$12345678, 4 bytes).
Offset is an immediate:
movl $12345678, 0x400(%rip)
we should not account for the size of the immediate, assuming the
immediate offset is what the user wanted.
Differential Revision: https://reviews.llvm.org/D43050
llvm-svn: 324772
Enable multiple COPY hints to eliminate more COPYs during register allocation.
Note that this is something all targets should do, see
https://reviews.llvm.org/D38128.
Review: Martin Storsjö
llvm-svn: 324720
Previously we extracted two subvectors and concatenate. But the concatenate will be lowered to two insert subvectors. Then DAG combine will merge once of the inserts and one of the extracts back into the original vector. We might as well just directly use one extract and one insert.
llvm-svn: 324710
This regresses a couple cases in the shuffle combining test. But those cases use intrinsics that InstCombine knows how to turn into a generic shuffle earlier. This should give opportunities to fold this earlier in InstCombine or DAG combine.
llvm-svn: 324709
Right now this loops over the entire function every time there
is a change, which is not very efficient. There's no practical
reason to track this so globally, since the code motion optimization
passes should be sinking instructions with single uses and
the pass currently will not fold with multiple uses.
llvm-svn: 324667
The patch essentially makes sure that X86CallLowering adds proper
G_COPY/G_TRUNC and G_ANYEXT/G_COPY when we are doing lowering of
arguments/returns for floating point values passed on registers.
Tests are updated accordingly
Reviewed By: qcolombet
Differential Revision: https://reviews.llvm.org/D42287
llvm-svn: 324665
Most vxi1 constant build vectors have to be implemented in the scalar domain anyway so we'll probably end up with a cast there later. But by then its too late to do the combine to get rid of it.
llvm-svn: 324662
This allows the register name to be printed without the leading '%'.
This can be used for emitting calls to the retpoline thunks from inline
asm.
llvm-svn: 324645
ARMDisassembler now depends on the banked register tables in ARMUtils, so the
LLVMBuild.txt needed updating to reflect this.
Original commit mesage:
[ARM] Fix disassembly of invalid banked register moves
When disassembling banked register move instructions, we don't have an
assembly syntax for the unallocated register numbers, so we have to
return Fail rather than SoftFail. Previously we were returning SoftFail,
then crashing in the InstPrinter as we have no way to represent these
encodings in an assembly string.
This also switches the decoder to use the table-generated list of banked
registers, removing the duplicated list of encodings.
Differential revision: https://reviews.llvm.org/D43066
llvm-svn: 324606
The broken bot (clang-ppc64le-linux-multistage) is doign a shared-object build,
so I guess using lookupBankedRegByEncoding in the disassembler is a layering
violation?
llvm-svn: 324604
When disassembling banked register move instructions, we don't have an
assembly syntax for the unallocated register numbers, so we have to
return Fail rather than SoftFail. Previously we were returning SoftFail,
then crashing in the InstPrinter as we have no way to represent these
encodings in an assembly string.
This also switches the decoder to use the table-generated list of banked
registers, removing the duplicated list of encodings.
Differential revision: https://reviews.llvm.org/D43066
llvm-svn: 324600
Instructions affected:
mthc1, mfhc1, add.d, sub.d, mul.d, div.d,
mov.d, neg.d, cvt.w.d, cvt.d.s, cvt.d.w, cvt.s.d
These instructions are now defined for
microMIPS32r3 + microMIPS32r6 in MicroMipsInstrFPU.td
since they shared their encoding with those already defined
in microMIPS32r6InstrInfo.td and have been therefore
removed from the latter file.
Some instructions present in MicroMipsInstrFPU.td which
did not have both AFGR64 and FGR64 variants defined have
been altered to do so.
Differential revision: https://reviews.llvm.org/D42738
llvm-svn: 324584
We were generating "fmov h0, wzr" instructions when FullFP16 is not enabled.
I've not added any tests, because the problem was visible in:
test/CodeGen/AArch64/arm64-zero-cycle-zeroing.ll,
which I had to change: I don't think Cyclone has FullFP16 enabled
by default, so it shouldn't be using this v8.2a instruction.
I've also removed these rdar tags, please shout if there are any objections.
Differential Revision: https://reviews.llvm.org/D43020
llvm-svn: 324581
The KTEST instruction sets the C flag if the result of anding both operands together is all 1s. We can use this to lower (icmp eq/ne (bitcast (vXi1 X), -1)
Differential Revision: https://reviews.llvm.org/D42772
llvm-svn: 324577
Summary:
KTEST has weird flag behavior. The Z flag is set for all bits in the AND of the k-registers being 0, and the C flag is set for all bits being 1. All other flags are cleared.
We currently emit this instruction in EmitTEST and don't check the condition code. This can lead to strange things like using the S flag after a KTEST for a signed compare.
The domain reassignment pass can also transform TEST instructions into KTEST and is not protected against the flag usage either. For now I've disabled this part of the domain reassignment pass. I tried to comment out the checks in the mir test so that we could recover them later, but I couldn't figure out how to get that to work.
This patch moves the KTEST handling into LowerSETCC and now creates a ktest+x86setcc. I've chosen this approach because I'd like to add support for the C flag for all ones in a followup patch. To do that requires that I can rewrite the condition code going in the x86setcc to be different than the original SETCC condition code.
This fixes PR36182. I'll file a PR to fix domain reassignment once this goes in. Should this be merged to 6.0?
Reviewers: spatel, guyblank, RKSimon, zvi
Reviewed By: guyblank
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D42770
llvm-svn: 324576
LowerSELECT_CC is not generating optimal Select_Ri pattern at the moment. It
is not guaranteed to place ConstantNode at RHS which would miss matching
Select_Ri.
A new testcase added into the existing select_ri.ll, also there is an
existing case in cmp.ll which would be improved to use Select_Ri after this
patch, it is adjusted accordingly.
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reviewed-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
llvm-svn: 324560
The code reusing existing wait counts is incorrect since it keeps
adding new operands to an old instruction instead of replacing
the immediate. It was also effectively switched off by the condition
that wait count is not an AMDGPU::S_WAITCNT.
Also switched to BuildMI instead of creating instructions directly.
Differential Revision: https://reviews.llvm.org/D42997
llvm-svn: 324547
hit from IR but creates a minefield for MI passes.
The x86 backend has fairly powerful logic to try and fold loads that
feed register operands to instructions into a memory operand on the
instruction. This is almost always a good thing, but there are specific
relocated loads that are only allowed to appear in specific
instructions. Notably, R_X86_64_GOTTPOFF is only allowed in `movq` and
`addq`. This patch blocks folding of memory operands using this
relocation unless the target is in fact `addq`.
The particular relocation indicates why we simply don't hit this under
normal circumstances. This relocation is only used for TLS, and it gets
used in very specific ways in conjunction with %fs-relative addressing.
The result is that loads using this relocation are essentially never
eligible for folding into an instruction's memory operands. Unless, of
course, you have an MI pass that inserts usage of such a load. I have
exactly such an MI pass and was greeted by truly mysterious miscompiles
where the linker replaced my instruction with a completely garbage byte
sequence. Go team.
This is the only such relocation I'm aware of in x86, but there may be
others that need to be similarly restricted.
Fixes PR36165.
Differential Revision: https://reviews.llvm.org/D42732
llvm-svn: 324546
We were doing a lot of whitelisting of what we handle in these routines, but setOperationAction constrains what we can get here. So just add some asserts and prune the unreachable paths.
llvm-svn: 324538
If we are saving/restoring k-registers, the default behavior of getMinimalRegisterClass will find the VK64 class with a spill size of 64 bits. This will cause the KMOVQ opcode to be used for save/restore. If we don't have have BWI instructions we need to constrain the class returned to give us VK16 with a 16-bit spill size. We can do this by passing the either v16i1 or v64i1 into getMinimalRegisterClass.
Also add asserts to make sure BWI is enabled anytime we use KMOVD/KMOVQ. These are what caught this bug.
Fixes PR36256
Differential Revision: https://reviews.llvm.org/D42989
llvm-svn: 324533
Note: This is a candidate for LLVM 6.0, because it was planned to be
in that release but was delayed due to a long review period.
Merge conflict in release_60 - resolution:
Add "-p6:32:32" into the second (non-amdgiz) string.
Only scalar loads support 32-bit pointers. An address in a VGPR will
fail to compile. That's OK because the results of loads will only be used
in places where VGPRs are forbidden.
Updated AMDGPUAliasAnalysis and used SReg_64_XEXEC.
The tests cover all uses cases we need for Mesa.
Reviewers: arsenm, nhaehnle
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D41651
llvm-svn: 324487
Summary:
I checked the AMD closed source compiler and the workaround is only
needed when x3 is emulated as x4, which we don't do in LLVM.
SMEM x3 opcodes don't exist, and instead there is a possibility to use x4
with the last component being unused. If the last component is out of
buffer bounds and falls on the next 4K page, the hw hangs.
Reviewers: arsenm, nhaehnle
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D42756
llvm-svn: 324486
Both operand codes now work the same way in case of register or memory
operands. It print high-order or low-order word in a double-word
register or memory location.
llvm-svn: 324476
This is a follow up of r324321, adding a match pattern for mov with a FP16
immediate (also fixing operand vfp_f16imm that wasn't even compiling).
Differential Revision: https://reviews.llvm.org/D42973
llvm-svn: 324456
that happened to end up in GCC.
This is really unfortunate, as the names don't have much rhyme or reason
to them. Originally in the discussions it seemed fine to rely on aliases
to map different names to whatever external thunk code developers wished
to use but there are practical problems with that in the kernel it turns
out. And since we're discovering this practical problems late and since
GCC has already shipped a release with one set of names, we are forced,
yet again, to blindly match what is there.
Somewhat rushing this patch out for the Linux kernel folks to test and
so we can get it patched into our releases.
Differential Revision: https://reviews.llvm.org/D42998
llvm-svn: 324449
1. Run the memory legalizer prior to the waitcnt pass; keep the policy that the waitcnt pass does not remove any waitcnts within the incoming IR.
2. The waitcnt pass doesn't (yet) track waitcnts that exist prior to the waitcnt pass (it just skips over them); because the waitcnt pass is ignorant of them, it may insert a redundant waitcnt. To avoid this, check the prev instr. If it and the to-be-inserted waitcnt are the same, suppress the insertion. We keep the existing waitcnt under the assumption that whomever, e.g., the memory legalizer, inserted it knows what they were doing.
3. Follow-on work: teach the waitcnt pass to record the pre-existing waitcnts for better waitcnt production.
Differential Revision: https://reviews.llvm.org/D42854
llvm-svn: 324440
X86 currently has a late DAG combine after cttz/ctlz are turned into BSR+BSF+CMOV to detect this and remove the CMOV. But we should be able to do this much earlier and avoid creating the cmov all together.
For the changed AMDGPU test case it appears that previously the i8 cttz was type legalized to i16 which introduced an OR with 256 in order to limit the result to 8 on the widened type. At this point the result is known to never be zero, but nothing checked that. Then operation legalization is told to promote all i16 cttz to i32. This introduces an extend and a truncate and another OR with 65536 to limit the result to 16. With the DAG combiner change we are able to prevent the creation of the second OR since the opcode will have been changed to cttz_zero_undef after the first OR. I the lack of the OR caused the instruction to change to v_ffbl_b32_sdwa
Differential Revision: https://reviews.llvm.org/D42985
llvm-svn: 324427
Following up on the discussion from
http://lists.llvm.org/pipermail/llvm-dev/2017-April/112305.html, undef
values are now placed in the .bss as well as null values. This prevents
undef global values taking up potentially huge amounts of space in the
.data section.
The following two lines now both generate equivalent .bss data:
@vals1 = internal unnamed_addr global [20000000 x i32] zeroinitializer, align 4
@vals2 = internal unnamed_addr global [20000000 x i32] undef, align 4 ; previously unaccounted for
This is primarily motivated by the corresponding issue in the Rust
compiler (https://github.com/rust-lang/rust/issues/41315).
Differential Revision: https://reviews.llvm.org/D41705
Patch by varkor!
llvm-svn: 324424
It was always using cmpxchg path and in rmw and cmpxchg instructions
are not distinguishable in the BE.
Differential Revision: https://reviews.llvm.org/D42976
llvm-svn: 324383
This is a follow up of r324321, adding f16 <-> f32 and f16 <-> f64 conversion
match patterns.
Differential Revision: https://reviews.llvm.org/D42954
llvm-svn: 324360
Instruction Selection
Cleanup cycle/validity checks in ISel (IsLegalToFold,
HandleMergeInputChains) and X86 (isFusableLoadOpStore). Now do a full
search for cycles / dependencies pruning the search when topological
property of NodeId allows.
As part of this propogate the NodeId-based cutoffs to narrow
hasPreprocessorHelper searches.
Reviewers: craig.topper, bogner
Subscribers: llvm-commits, hiraditya
Differential Revision: https://reviews.llvm.org/D41293
llvm-svn: 324359
Vector pairs are legal types, but not every operation can work on pairs.
For those operations that are legal for single vectors, generate a concat
of their results on pair halves.
llvm-svn: 324350
It was expanded directly into instructions earlier. That was to avoid
loads from a constant pool for a vector negation: "xor x, splat(i1 -1)".
Implement ISD opcodes QTRUE and QFALSE to denote logical vectors of
all true and all false values, and handle setcc with negations through
selection patterns.
llvm-svn: 324348
Followup to D42544 that matches PACKUSWB cases for non-AVX512, SSE and PACKUSDW cases will have to wait until we can add support for general SMIN/SMAX matching.
llvm-svn: 324347
Summary:
Now we generate PAL metadata for the amdpal os type, there is no need to
generate the .AMDGPU.config section.
Reviewers: arsenm, nhaehnle, dstuttard
Subscribers: kzhuravl, wdng, yaxunl, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D37760
Change-Id: I303c5fad66656ce97293da60621afac6595b4c18
llvm-svn: 324346
Summary: Adds support for the SVE AND instruction with vector and logical-immediate operands, and their corresponding aliases.
Reviewers: fhahn, rengolin, samparker, echristo, aadg, kristof.beyls
Reviewed By: fhahn
Subscribers: aemerson, javed.absar, tschuett, llvm-commits
Differential Revision: https://reviews.llvm.org/D42295
llvm-svn: 324343
Followup to D42544 that matches PACKSSWB cases for non-AVX512, SSE and PACKSSDW cases will have to wait until we can add support for general SMIN/SMAX matching.
llvm-svn: 324339
This adds most of the FP16 codegen support, but these areas need further work:
- FP16 literals and immediates are not properly supported yet (e.g. literal
pool needs work),
- Instructions that are generated from intrinsics (e.g. vabs) haven't been
added.
This will be addressed in follow-up patches.
Differential Revision: https://reviews.llvm.org/D42849
llvm-svn: 324321
We now allow all signed comparisons and not equal. The complement that needs to be added for this is no worse than the extend. And the vector output forms of pcmpeq/pcmpgt have better latency than the k-register version on SKX.
llvm-svn: 324294
In the motivating case from PR35681 and represented by the macro-fuse-cmp test:
https://bugs.llvm.org/show_bug.cgi?id=35681
...there's a 37 -> 31 byte size win for the loop because we eliminate the big base
address offsets.
SPEC2017 on Ryzen shows no significant perf difference.
Differential Revision: https://reviews.llvm.org/D42607
llvm-svn: 324289
This allows the immediate to folded into the and instead of being forced to move into a register. This can sometimes result in shorter encodings since the and can sign extend an immediate.
This also allows us to match an and to a movzx after a not.
This can cause an extra move if the input to the separate NOT has an additional user which requires a copy before the NOT.
llvm-svn: 324260
If the upper 32 bits of a 64 bit mask are all zeros, we have special isel patterns to use a 32-bit and instead of a 64-bit and by relying on the impliciting zeroing of 32 bit ops.
This patch teachs shrinkAndImmediate not to break that optimization.
Differential Revision: https://reviews.llvm.org/D42899
llvm-svn: 324249