Summary:
This patch fixes comparison of 64-bit atomic with its expected value in CMP_SWAP_64 expansion.
Currently, the low words are compared with CMP, while the high words are compared with SBC. SBC expects the carry flag to be set if CMP detects a difference. CMP might leave the carry unset for unequal arguments though if the first one is >= than the second. This might cause the comparison logic to detect false equality.
Example of the broken C++ code:
```
std::atomic<long long> at(2);
long long ll = 1;
std::atomic_compare_exchange_strong(&at, &ll, 3);
```
Even though the atomic `at` and the expected value `ll` are not equal and `atomic_compare_exchange_strong` returns `false`, `at` is changed to 3.
The patch replaces SBC with CMPEQ.
Reviewers: t.p.northover
Subscribers: aemerson, rengolin, llvm-commits, asl
Differential Revision: https://reviews.llvm.org/D27315
llvm-svn: 288433
The coalescer eliminates copies from reserved registers of the form:
%vregX = COPY %rY
in the case where %rY is a reserved register. However this turns out to
be invalid if only some of the subregisters are reserved (see also
https://reviews.llvm.org/D26648).
Differential Revision: https://reviews.llvm.org/D26687
llvm-svn: 288428
This change fixes a regression in r279537 and
makes getRawSubclassData behave like r279536.
Without this change, the fp128-g.ll test case will have an
infinite loop involving SoftenFloatRes_LOAD.
Differential Revision: http://reviews.llvm.org/D26942
llvm-svn: 288420
This time the issue is fortunately just a simple mistake rather than a horrible
design spectre. I thought SUBS/SBCS provided sufficient NZCV flags for
comparing two 64-bit values, but they don't.
The fix is slightly clunkier in AArch64 because we can't use conditional
execution to emit a pair of CMPs. Traditionally an "icmp ne i128" would map to
an EOR/EOR/ORR/CBNZ, but that uses more registers so it's easier to go with a
CSET/CINC/CBNZ combination. Slightly less efficient, but this is -O0 anyway.
Thanks to Anton Korobeynikov for pointing out the issue.
llvm-svn: 288418
Support a new assembler directive, .import_global, to declare imported
global variables (i.e. those with external linkage and no
initializer). The linker turns these into wasm imports.
Patch by Jacob Gravelle
Differential Revision: https://reviews.llvm.org/D26875
llvm-svn: 288296
Choosing a "cfi" name makes the intend a bit clearer in an assembly dump
and more importantly the assembly dumps are slightly more stable as the
numbers don't move around anymore when unrelated code calls
createTempSymbol() more or less often.
As they are temp labels the name doesn't influence the generated object
code.
Differential Revision: https://reviews.llvm.org/D27244
llvm-svn: 288290
The LLDB tests are now ready for this patch.
DWARF specifies that "line 0" really means "no appropriate source
location" in the line table. Use this for branch targets and some
other cases that have no specified source location, to prevent
inheriting unfortunate line numbers from physically preceding
instructions (which might be from completely unrelated source).
Differential Revision: http://reviews.llvm.org/D24180
llvm-svn: 288283
Summary:
When computing useful bits for a BFM instruction, we need
to take into consideration the case where both operands
of the BFM are equal and provide data that we need to track.
Not doing this can cause us to miss useful bits.
Fixes PR31138 (https://llvm.org/bugs/show_bug.cgi?id=31138)
Reviewers: t.p.northover, jmolloy
Subscribers: evandro, gberry, srhines, pirama, mcrosier, aemerson, llvm-commits, rengolin
Differential Revision: https://reviews.llvm.org/D27130
llvm-svn: 288253
Initial support for target shuffle constant folding in cases where all shuffle inputs are constant. We may be able to relax this and merge shuffles with only some constant inputs in the future.
I've added the helper function getTargetConstantBitsFromNode (based off a similar function in X86ShuffleDecodeConstantPool.cpp) that could be reused for other cases requiring constant vector extraction.
Differential Revision: https://reviews.llvm.org/D27220
llvm-svn: 288250
Summary: Previously 0 and -1 was matched via tablegen rules. But this could cause problems where a physical register was being used where a virtual register was expected (seen in optimizeSelect and TwoAddressInstructionPass). Instead follow AArch64 and match in DAGToDAGISel.
Reviewers: eliben, majnemer
Subscribers: llvm-commits, aemerson
Differential Revision: https://reviews.llvm.org/D27171
llvm-svn: 288215
This commit caused some miscompiles that did not show up on any of the bots.
Reverting until we can investigate the cause of those failures.
llvm-svn: 288214
DWARF specifies that "line 0" really means "no appropriate source
location" in the line table. Use this for branch targets and some
other cases that have no specified source location, to prevent
inheriting unfortunate line numbers from physically preceding
instructions (which might be from completely unrelated source).
Differential Revision: http://reviews.llvm.org/D24180
llvm-svn: 288212
This is not in the list of valid inputs for the encoding.
When spilling, copies from exec can be folded directly
into the spill instruction which results in broken
stores.
This only fixes the operand constraints, more codegen
work is required to avoid emitting the invalid
spills.
This sort of breaks the dbg.value test. Because the
register class of the s_load_dwordx2 changes, there
is a copy to SReg_64, and the copy is the operand
of dbg_value. The copy is later dead, and removed
from the dbg_value.
llvm-svn: 288191
It isn't generally safe to fold the frame index
directly into the operand since it will possibly
not be an inline immediate after it is expanded.
This surprisingly seems to produce better code, since
the FI doesn't prevent folding other immediate operands.
llvm-svn: 288185
Summary:
In AArch64InstrInfo::foldMemoryOperandImpl, catch more cases where the
COPY being spilled is copying from WZR/XZR, but the source register is
not in the COPY destination register's regclass.
For example, when spilling:
%vreg0 = COPY %XZR ; %vreg0:GPR64common
without this change, the code in TargetInstrInfo::foldMemoryOperand()
and canFoldCopy() that normally handles cases like this would fail to
optimize since %XZR is not in GPR64common. So the spill code generated
would be:
%vreg0 = COPY %XZR
STR %vreg
instead of the new code generated:
STR %XZR
Reviewers: qcolombet, MatzeB
Subscribers: mcrosier, aemerson, t.p.northover, llvm-commits, rengolin
Differential Revision: https://reviews.llvm.org/D26976
llvm-svn: 288176
This patch corresponds to review:
https://reviews.llvm.org/D25912
This is the first patch in a series of 4 that improve the lowering and combining
for BUILD_VECTOR nodes on PowerPC.
llvm-svn: 288152
It looks like this logic was duplicated long ago and the GCC side of
things has grown additional functionality. We need ${:uid} at least to
generate unique MS inline asm labels (PR23715), so expose these.
llvm-svn: 288092
Codegen prepare sinks comparisons close to a user is we have only one register
for conditions. For AMDGPU we have many SGPRs capable to hold vector conditions.
Changed BE to report we have many condition registers. That way IR LICM pass
would hoist an invariant comparison out of a loop and codegen prepare will not
sink it.
With that done a condition is calculated in one block and used in another.
Current behavior is to store workitem's condition in a VGPR using v_cndmask_b32
and then restore it with yet another v_cmp instruction from that v_cndmask's
result. To mitigate the issue a propagation of source SGPR pair in place of v_cmp
is implemented. Additional side effect of this is that we may consume less VGPRs
at a cost of more SGPRs in case if holding of multiple conditions is needed, and
that is a clear win in most cases.
Differential Revision: https://reviews.llvm.org/D26114
llvm-svn: 288053
Bit-shifts by a whole number of bytes can be represented as a shuffle mask suitable for combining.
Added a 'getFauxShuffleMask' function to allow us to create shuffle masks from other suitable operations.
llvm-svn: 288040
This patch adds assembler support for the remaining branch instructions:
the non-relative branch on count variants, and all variants of branch
on index.
The only one of those that can be readily exploited for code generation
is BRCTH (branch on count using a high 32-bit register as count). Do
use it, however, it is necessary to also introduce a hew CHIMux pseudo
to allow comparisons of a 32-bit value agains a short immediate to go
into a high register as well (implemented via CHI/CIH).
This causes a bit of codegen changes overall, but those have proven to
be neutral (or even beneficial) in performance measurements.
llvm-svn: 288029
This patch moves formation of LOC-type instructions from (late)
IfConversion to the early if-conversion pass, and in some cases
additionally creates them directly from select instructions
during DAG instruction selection.
To make early if-conversion work, the patch implements the
canInsertSelect / insertSelect callbacks. It also implements
the commuteInstructionImpl and FoldImmediate callbacks to
enable generation of the full range of LOC instructions.
Finally, the patch adds support for all instructions of the
load-store-on-condition-2 facility, which allows using LOC
instructions also for high registers.
Due to the use of the GRX32 register class to enable high registers,
we now also have to handle the cases where there are still no single
hardware instructions (conditional move from a low register to a high
register or vice versa). These are converted back to a branch sequence
after register allocation. Since the expandRAPseudos callback is not
allowed to create new basic blocks, this requires a simple new pass,
modelled after the ARM/AArch64 ExpandPseudos pass.
Overall, this patch causes significantly more LOC-type instructions
to be used, and results in a measurable performance improvement.
llvm-svn: 288028
Summary: When selectScalarSSELoad is looking for a scalar_to_vector of a scalar load, it makes sure the load is only used by the scalar_to_vector. But it doesn't make sure the scalar_to_vector is only used once. This can cause the same load to be folded multiple times. This can be bad for performance. This also causes the chain output to be duplicated, but not connected to anything so chain dependencies will not be satisfied.
Reviewers: RKSimon, zvi, delena, spatel
Subscribers: andreadb, llvm-commits
Differential Revision: https://reviews.llvm.org/D26790
llvm-svn: 287983
Summary:
Shuffle lowering may have widened the element size of a i32 shuffle to i64 before selecting X86ISD::SHUF128. If this shuffle was used by a vselect this can prevent us from selecting masked operations.
This patch detects this and changes the element size to match the vselect.
I don't handle changing integer to floating point or vice versa as its not clear if its better to push such a bitcast to the inputs of the shuffle or to the user of the vselect. So I'm ignoring that case for now.
Reviewers: delena, zvi, RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D27087
llvm-svn: 287939
Vectorize UINT_TO_FP v2i32 -> v2f64 instead of scalarization (albeit still on the SIMD unit).
The codegen matches that generated by legalization (and is in fact used by AVX for UINT_TO_FP v4i32 -> v4f64), but has to be done in the x86 backend to account for legalization via 4i32.
Differential Revision: https://reviews.llvm.org/D26938
llvm-svn: 287886
The bug arises during register allocation on i686 for
CMPXCHG8B instruction when base pointer is needed. CMPXCHG8B
needs 4 implicit registers (EAX, EBX, ECX, EDX) and a memory address,
plus ESI is reserved as the base pointer. With such constraints the only
way register allocator would do its job successfully is when the addressing
mode of the instruction requires only one register. If that is not the case
- we are emitting additional LEA instruction to compute the address.
It fixes PR28755.
Patch by Alexander Ivchenko <alexander.ivchenko@intel.com>
Differential Revision: https://reviews.llvm.org/D25088
llvm-svn: 287875
m0 may need to be written for spill code, so
we don't want general code uses relying on the
value stored in it.
This introduces a few code quality regressions where copies
from m0 are not coalesced into copies of a copy of m0.
llvm-svn: 287841
The size and offset were wrong. The size of the object was
being used for the size of the access, when here it is really
being split into 4-byte accesses. The underlying object size
is set in the MachinePointerInfo, which also didn't have the
offset set.
llvm-svn: 287806
We did not support subregs in InlineSpiller:foldMemoryOperand() because targets
may not deal with them correctly.
This adds a target hook to let the spiller know that a target can handle
subregs, and actually enables it for x86 for the case of stack slot reloads.
This fixes PR30832.
Differential Revision: https://reviews.llvm.org/D26521
llvm-svn: 287792
Forward store values to matching loads down through token
factors. Factored from D14834.
Reviewers: jyknight, hfinkel
Subscribers: hfinkel, nemanjai, llvm-commits
Differential Revision: https://reviews.llvm.org/D26080
llvm-svn: 287773
We have the following DAGCombiner transformations:
(mul (shl X, c1), c2) -> (mul X, c2 << c1)
(mul (shl X, C), Y) -> (shl (mul X, Y), C)
(shl (mul x, c1), c2) -> (mul x, c1 << c2)
Usually the constant shift is optimised by SelectionDAG::getNode when it is
constructed, by SelectionDAG::FoldConstantArithmetic, but when we're dealing
with vectors and one of those vector constants contains an undef element
FoldConstantArithmetic does not fold and we enter an infinite loop.
Fix this by making FoldConstantArithmetic use getNode to decide how to fold each
vector element, the same as FoldConstantVectorArithmetic does, and rather than
adding the constant shift to the work list instead only apply the transformation
if it's already been folded into a constant, as if it's not we're going to loop
endlessly. Additionally add missing NoOpaques to one of those transformations,
which I noticed when writing the tests for this.
Differential Revision: https://reviews.llvm.org/D26605
llvm-svn: 287766
Implemented widening (v2f32) and splitting (v16f64).
On splitting, I use "popcnt" to calculate memory increment.
More type legalization work will come in the next patches.
llvm-svn: 287761
This occurs during UINT_TO_FP v2f64 lowering.
We can easily generalize this to other horizontal ops (FHSUB, PACKSS, PACKUS) as required - we are doing something similar with PACKUS in lowerV2I64VectorShuffle
llvm-svn: 287676
Add basic ComputeNumSignBits support for TRUNCATE ops for cases where the source's number of sign bits overlaps with the truncated size.
Improves X86 SIGN_EXTEND_IN_REG vector cases which were needlessly sign extending boolean vector results.
Differential Revision: https://reviews.llvm.org/D26851
llvm-svn: 287635
Summary:
The index and one of the table operands can be swapped by changing the opcode to the other version. Neither of these operands are the one that can load from memory so this can't be used to increase memory folding opportunities.
We need to handle the unmasked forms and the kz forms. Since the load operand isn't being commuted we can commute the load and broadcast instructions too.
Reviewers: igorb, delena, Ayal, Farhana, RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D25652
llvm-svn: 287621
Summary:
Shuffle lowering widens the element size of a shuffle if elements are contiguous. This is sometimes help because wider element types have more shuffle options. If the shuffle is one of the arguments to a vselect this shuffle widening can introduce a bitcast between the vselect and the shuffle. This will prevent isel from selecting a masked operation. If the shuffle can be written equally efficiently with a different element size to match the vselect type we should change the shuffle type to allow masking.
This patch does this conversion for all VALIGND/VALIGNQ sizes. It also supports turning 128-bit PALIGNR into VALIGND/VALIGNQ. This fixes the case shown in PR31018.
I plan to add support for more operations in future patches.
Reviewers: RKSimon, zvi, delena
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D26902
llvm-svn: 287612
A target intrinsic may be defined as possibly reading memory,
but the call site may have additional knowledge that it doesn't read
memory. The intrinsic lowering will expect the pessimistic
assumption of the intrinsic definition, so the chain should
still be used.
llvm-svn: 287593
Summary:
When searching for load/store instructions to pair/merge don't treat
writes to WZR/XZR as clobbers since they don't change the value read
from WZR/XZR (which is always 0).
Reviewers: mcrosier, junbuml, jmolloy, t.p.northover
Subscribers: aemerson, llvm-commits, rengolin
Differential Revision: https://reviews.llvm.org/D26921
llvm-svn: 287592
Summary: Merging an empty case block into the header block of switch could cause
ISel to add COPY instructions in the header of switch, instead of the case
block, if the case block is used as an incoming block of a PHI. This could
potentially increase dynamic instructions, especially when the switch is in a
loop. I added a test case which was reduced from the benchmark I was targetting.
Reviewers: t.p.northover, mcrosier, manmanren, wmi, davidxl
Subscribers: qcolombet, danielcdh, hfinkel, mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D22696
llvm-svn: 287553
The test is currently broken, and this CL should fix it.
Patch by Adrian Kuegel!
Differential Revision: https://reviews.llvm.org/D26910
llvm-svn: 287536
At the moment we only use truncateVectorCompareWithPACKSS with direct vector comparison results (just one example of a known all/none signbits input).
This change relaxes the direct matching of a SETCC opcode by moving the logic up into SelectionDAG::ComputeNumSignBits and accepting any input with a known splatted signbit.
llvm-svn: 287535
- teach RelocVisitor to recognize bpf relocations
- fix AsmInfo->PointerSize to make sure dwarf is emitted correctly
- add a test for the above
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
llvm-svn: 287521
This patch adds a test for the assembly code emitted with XRay
instrumentation. It also fixes a bug where the operand of a jump
instruction must be not the number of bytes to jump over, but rather the
number of 4-byte instructions.
Author: rSerge
Reviewers: dberris, rengolin
Differential Revision: https://reviews.llvm.org/D26805
llvm-svn: 287516
The tail call optimization was being used without proper consideration of
ABI requirements for saving and restoring the GP. This patch restricts tail
call optimization to functions within the same translation unit.
Reviewers: vkalintiris
Differential Revision: https://reviews.llvm.org/D24763
llvm-svn: 287505
Many of these problems are because shuffle lowering widens element size and reduces element count when possible. This causes the shuffle to become separated from the select by a bitcast. Future patches will work to improve these cases by rewriting the shuffle back to a narrow element type if we think it can result in folding the mask.
llvm-svn: 287503
The change is part of RegCall calling convention support for LLVM.
Long double (f80) requires special treatment as the first f80 parameter is saved in FP0 (floating point stack).
This review present the change and the corresponding tests.
Differential Revision: https://reviews.llvm.org/D26151
llvm-svn: 287485
add BPF disassembler, so tools like llvm-objdump can be used:
$ llvm-objdump -d -no-show-raw-insn ./sockex1_kern.o
./sockex1_kern.o: file format ELF64-BPF
Disassembly of section socket1:
bpf_prog1:
0: r6 = r1
8: r0 = *(u8 *)skb[23]
10: *(u32 *)(r10 - 4) = r0
18: r1 = *(u32 *)(r6 + 4)
20: if r1 != 4 goto 8
28: r2 = r10
30: r2 += -4
ld_imm64 (the only 16-byte insn) and special ld_abs/ld_ind instructions
had to be treated in a special way. The decoders for the rest of the insns
are automatically generated.
Add tests to cover new functionality.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
llvm-svn: 287477
Summary:
The 32-bit instructions don't zero the high 16-bits like the 16-bit
instructions do.
Reviewers: arsenm
Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, llvm-commits, tony-tye
Differential Revision: https://reviews.llvm.org/D26828
llvm-svn: 287342
Summary:
The addr64-based legalization is incorrect for MUBUF instructions with idxen
set as well as for BUFFER_LOAD/STORE_FORMAT_* instructions. This affects
e.g. shaders that access buffer textures.
Since we never actually need the addr64-legalization in shaders, this patch
takes the easy route and keys off the calling convention. If this ever
affects (non-OpenGL) compute, the type of legalization needs to be chosen
based on some TSFlag.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98664
Reviewers: arsenm, tstellarAMD
Subscribers: kzhuravl, wdng, yaxunl, tony-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D26747
llvm-svn: 287339
When we see a SETCC whose only users are zero extend operations, we can replace
it with a subtraction. This results in doing all calculations in GPRs and
avoids CR use.
Currently we do this only for ULT, ULE, UGT and UGE condition codes. There are
ways that this can be extended. For example for signed condition codes. In that
case we will be introducing additional sign extend instructions, so more careful
profitability analysis may be required.
Another direction to extend this is for equal, not equal conditions. Also when
users of SETCC are any_ext or sign_ext, we might be able to do something
similar.
llvm-svn: 287329
The same thing was done to 32-bit and 64-bit element sizes previously.
This will allow us to support these shuffls in InstCombineCalls along with the other variable shift intrinsics.
llvm-svn: 287312
since bpf instruction set was introduced people learned to
read and understand kernel verifier output whereas llvm asm
output stayed obscure and unknown. Convert llvm to emit
assembler text similar to kernel to avoid this discrepancy
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
llvm-svn: 287300
Summary:
This extends FCOPYSIGN support to 512-bit vectors.
I've also added tests to show what the 128-bit and 256-bit cases look like with broadcast loads.
Reviewers: delena, zvi, RKSimon, spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D26791
llvm-svn: 287298
vXi64 multiplication is lowered into 3 calls of vpmuludq with the upper/lower 32-bit halves.
If any of these halves are zero then we can remove individual calls. Although there was isBuildVectorAllZeros code to do this I don't think it ever worked (maybe just for constant folded cases that don't seem to be tested for any longer).
This requires additional X86ISD support for computeKnownBitsForTargetNode, so far I've just added support for X86ISD::VZEXT (VPMOVZX* - helping the AVX2+ cases).
Partial fix for PR30845
Differential Revision: https://reviews.llvm.org/D26590
llvm-svn: 287223
Summary:
Variadic functions can be treated in the same way as normal functions
with respect to the number and types of parameters.
Reviewers: grosbach, olista01, t.p.northover, rengolin
Subscribers: javed.absar, aemerson, llvm-commits
Differential Revision: https://reviews.llvm.org/D26748
llvm-svn: 287219
Register Calling Convention defines a new behavior for v64i1 types.
This type should be saved in GPR.
However for 32 bit machine we need to split the value into 2 GPRs (because each is 32 bit).
Differential Revision: https://reviews.llvm.org/D26181
llvm-svn: 287217
ImplicitNullCheck keeps track of one instruction that the memory
operation depends on that it also hoists with the memory operation.
When hoisting this dependency, it would sometimes clobber a live-in
value to the basic block we were hoisting the two things out of. Fix
this by explicitly looking for such dependencies.
I also noticed two redundant checks on `MO.isDef()` in IsMIOperandSafe.
They're redundant since register MachineOperands are either Defs or Uses
-- there is no third kind. I'll change the checks to asserts in a later
commit.
llvm-svn: 287213
We save an inter-register file move this way. If there's any CPU where
the FP logic is slower, we could transform this back to int-logic in
MachineCombiner.
This helps, but doesn't solve, PR6137:
https://llvm.org/bugs/show_bug.cgi?id=6137
The 'andn' test shows that we're missing a pattern match to
recognize the xor with -1 constant as a 'not' op.
llvm-svn: 287171
Summary:
A lot of the pseudo instructions are required because LLVM assumes that
all integers of the same size as the pointer size are legal. This means
that it will not currently expand 16-bit instructions to their 8-bit
variants because it thinks 16-bit types are legal for the operations.
This also adds all of the CodeGen tests that required the pass to run.
Reviewers: arsenm, kparzysz
Subscribers: wdng, mgorny, modocache, llvm-commits
Differential Revision: https://reviews.llvm.org/D26577
llvm-svn: 287162
We don't track callee clobbered registers correctly, so avoid hoisting
across calls.
Note: for this bug to trigger we need a `readonly` call target, since we
already have logic to not hoist across potentially storing instructions
either.
llvm-svn: 287159
One half of the shifts obviously needed conditional selection based on whether
the shift amount is more than 32-bits, but leaving the other half as the
natural shift isn't acceptable either: it's undefined behaviour to shift a
32-bit value by more than 31.
llvm-svn: 287149
Summary:
Extend replaceZeroVectorStore to handle more vector type stores,
floating point zero vectors and set alignment more accurately on split
stores.
This is a follow-up change to r286875.
This change fixes PR31038.
Reviewers: MatzeB
Subscribers: mcrosier, aemerson, llvm-commits, rengolin
Differential Revision: https://reviews.llvm.org/D26682
llvm-svn: 287142
Summary:
1. Don't try to copy values to and from the same register class.
2. Replace copies with of registers with immediate values with v_mov/s_mov
instructions.
The main purpose of this change is to make MachineSink do a better job of
determining when it is beneficial to split a critical edge, since the pass
assumes that copies will become move instructions.
This prevents a regression in uniform-cfg.ll if we enable critical edge
splitting for AMDGPU.
Reviewers: arsenm
Subscribers: arsenm, kzhuravl, llvm-commits
Differential Revision: https://reviews.llvm.org/D23408
llvm-svn: 287131
We can replace "scalar" FP-bitwise-logic with other forms of bitwise-logic instructions.
Scalar SSE/AVX FP-logic instructions only exist in your imagination and/or the bowels of
compilers, but logically equivalent int, float, and double variants of bitwise-logic
instructions are reality in x86, and the float variant may be a shorter instruction
depending on which flavor (SSE or AVX) of vector ISA you have...so just prefer float all
the time.
This is a preliminary step towards solving PR6137:
https://llvm.org/bugs/show_bug.cgi?id=6137
Differential Revision:
https://reviews.llvm.org/D26712
llvm-svn: 287122
Both the (V)CVTDQ2PD (i32 to f64) and (V)CVTUDQ2PD (u32 to f64) conversion instructions are lossless and can be safely represented as generic SINT_TO_FP/UINT_TO_FP calls instead of x86 intrinsics without affecting final codegen.
LLVM counterpart to D26686
Differential Revision: https://reviews.llvm.org/D26736
llvm-svn: 287108
MipsFastISel uses a a class to represent addresses with a signed member
to represent the offset. MipsFastISel::emitStore, emitLoad and computeAddress
all treated the offset as being positive. In cases where the offset was
actually negative and a frame pointer was used, this would cause the constant
synthesis routine to crash as it would generate an unexpected instruction
sequence when frame indexes are replaced.
Reviewers: vkalintiris
Differential Revision: https://reviews.llvm.org/D26192
llvm-svn: 287099
Summary: These intrinsics have been unused for clang for a while. This patch removes them. We auto upgrade them to extractelements, a scalar operation and then an insertelement. This matches the sequence used by clangs intrinsic file.
Reviewers: zvi, delena, RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D26660
llvm-svn: 287083
Doing this before register allocation reduces register pressure as we do
not even have to allocate a register for those dead definitions.
Differential Revision: https://reviews.llvm.org/D26111
llvm-svn: 287076
In https://reviews.llvm.org/D25347, Geoff noticed that we still have
useless copy that we can eliminate after register allocation. At the
time the allocation is chosen for those copies, they are not useless
but, because of changes in the surrounding code, later on they might
become useless.
The Greedy allocator already has a mechanism to deal with such cases
with a late recoloring. However, we missed to record the some of the
missed hints.
This commit fixes that.
llvm-svn: 287070
For the default, small and medium code model, use the existing
difference from the jump table towards the label. For all other code
models, setup the picbase and use the difference between the picbase and
the block address.
Overall, this results in smaller data tables at the expensive of one or
two more arithmetic operation at the jump site. Given that we only create
jump tables with a lot more than two entries, it is a net win in size.
For larger code models the assumption remains that individual functions
are no larger than 2GB.
Differential Revision: https://reviews.llvm.org/D26336
llvm-svn: 287059
wbinvl.* are vector instruction that do not sue vector registers.
v2: check only M?BUF instructions
Differential Revision: https://reviews.llvm.org/D26633
llvm-svn: 287056
Lower a = b * C where C = (2^n + 1) * 2^m to
add w0, w0, w0, lsl n
lsl w0, w0, m
Differential Revision: https://reviews.llvm.org/D229245
llvm-svn: 287019
The wave barrier represents the discardable barrier. Its main purpose is to
carry convergent attribute, thus preventing illegal CFG optimizations. All lanes
in a wave come to convergence point simultaneously with SIMT, thus no special
instruction is needed in the ISA. The barrier is discarded during code generation.
Differential Revision: https://reviews.llvm.org/D26585
llvm-svn: 287007
Also, fix the test params to use an attribute rather than a CPU model
and remove the AVX run because that does nothing but check for a 'v'
prefix in all of these tests.
llvm-svn: 287003
In RateRegister of existing LSR, if a formula contains a Reg which is a SCEVAddRecExpr,
and this SCEVAddRecExpr's loop is an outerloop, the formula will be marked as Loser
and dropped.
Suppose we have an IR that %for.body is outerloop and %for.body2 is innerloop. LSR only
handle inner loop now so only %for.body2 will be handled.
Using the logic above, formula like
reg(%array) + reg({1,+, %size}<%for.body>) + 1*reg({0,+,1}<%for.body2>) will be dropped
no matter what because reg({1,+, %size}<%for.body>) is a SCEVAddRecExpr type reg related
with outerloop. Only formula like
reg(%array) + 1*reg({{1,+, %size}<%for.body>,+,1}<nuw><nsw><%for.body2>) will be kept
because the SCEVAddRecExpr related with outerloop is folded into the initial value of the
SCEVAddRecExpr related with current loop.
But in some cases, we do need to share the basic induction variable
reg{0 ,+, 1}<%for.body2> among LSR Uses to reduce the final total number of induction
variables used by LSR, so we don't want to drop the formula like
reg(%array) + reg({1,+, %size}<%for.body>) + 1*reg({0,+,1}<%for.body2>) unconditionally.
From the existing comment, it tries to avoid considering multiple level loops at the same time.
However, existing LSR only handles innermost loop, so for any SCEVAddRecExpr with a loop other
than current loop, it is an invariant and will be simple to handle, and the formula doesn't have
to be dropped.
Differential Revision: https://reviews.llvm.org/D26429
llvm-svn: 286999
Summary:
This fixes the runtime results produces by the fallback multiplication expansion introduced in r270720.
For tests I created a fuzz tester that compares the results with Boost.Multiprecision.
Reviewers: hfinkel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D26628
llvm-svn: 286998
This patch helps avoids poor legalization of boolean vector results (e.g. 8f32 -> 8i1 -> 8i16) that feed into SINT_TO_FP by inserting an early SIGN_EXTEND and so help improve the truncation logic.
This is not necessary for AVX512 targets where boolean vectors are legal - AVX512 manages to lower ( sint_to_fp vXi1 ) into some form of ( select mask, 1.0f , 0.0f ) in most cases.
Fix for PR13248
Differential Revision: https://reviews.llvm.org/D26583
llvm-svn: 286979
This patch implements all the overloads for vec_xl_be and vec_xst_be. On BE,
they behaves exactly the same with vec_xl and vec_xst, therefore they are
simply implemented by defining a matching macro. On LE, they are implemented
by defining new builtins and intrinsics. For int/float/long long/double, it
is just a load (lxvw4x/lxvd2x) or store(stxvw4x/stxvd2x). For char/char/short,
we also need some extra shuffling before or after call the builtins to get the
desired BE order. For int128, simply call vec_xl or vec_xst.
llvm-svn: 286967
Summary:
Fix a case where the overflow value of type i1, which is legal on AVX512, was assigned to a VK1 register class.
We always want this value to be assigned to a GPR since the overflow return value is lowered to a SETO instruction.
Fixes pr30981.
Reviewers: mkuper, igorb, craig.topper, guyblank, qcolombet
Subscribers: qcolombet, llvm-commits
Differential Revision: https://reviews.llvm.org/D26620
llvm-svn: 286958
This patch adds the Sched Machine Model for Cortex-R52.
Details of the pipeline and descriptions are in comments
in file ARMScheduleR52.td included in this patch.
Reviewers: rengolin, jmolloy
Differential Revision: https://reviews.llvm.org/D26500
llvm-svn: 286949