This is similar logic/motivation to the select splitting in D62969.
In D63233, the pattern changes so that we no longer have an extract_subvector of vselect,
but the operands of the select are still being concatenated.
The closest case is represented in either the first or last test diffs here - we have an
extra instruction, but we converted 3-4 ymm instructions into 4-5 xmm instructions.
I think that's the right trade-off for most AVX1 targets.
In the example based on PR37428:
https://bugs.llvm.org/show_bug.cgi?id=37428
...this makes the loop about 30% faster (tested on Haswell by compiling with -mavx).
Differential Revision: https://reviews.llvm.org/D63364
llvm-svn: 363508
Summary:
If the nested loop is an innermost loop, prefer to a 32-byte alignment, so that
we can decrease cache misses and branch-prediction misses. Actual alignment of
the loop will depend on the hotness check and other logic in alignBlocks.
The old code will only align hot loop to 32 bytes when the LoopSize larger than
16 bytes and smaller than 32 bytes, this patch will align the innermost hot loop
to 32 bytes not only for the hot loop whose size is 16~32 bytes.
Reviewed By: steven.zhang, jsji
Differential Revision: https://reviews.llvm.org/D61228
llvm-svn: 363495
Currently you get extra waits, because waits are inserted for the
register dependencies of the call, and the function prolog waits on
everything.
Currently waits are still inserted on returns. It may make sense to
not do this, and wait in the caller instead.
llvm-svn: 363465
The way SelectionDAG treats memory operands is very frustrating, and
by default drops them unless a property is set on the pattern. There
is no pattern for manually selected instructions, so this requires
manually setting them.
llvm-svn: 363455
I recently discovered a bug on the x86 platform: The fp80 type was not handled well by x86 for constrained floating point nodes, as their regular counterparts are replaced by extending loads and truncating stores during the preprocess phase. Normally, platforms don't have this issue, as they don't typically attempt to perform such legalizations during instruction selection preprocessing. Before this change, strict_fp nodes survived until they were mutated to normal nodes, which happened shortly after preprocessing on other platforms. This modification lowers these nodes at the same phase while properly utilizing the chain.5
Submitted by: Drew Wock <drew.wock@sas.com>
Reviewed by: Craig Topper, Kevin P. Neal
Approved by: Craig Topper
Differential Revision: https://reviews.llvm.org/D63271
llvm-svn: 363417
Earlier commit has added AMDGPUOperand::isBoolReg(). Turns out
gcc issues warning about unused function since D63204 is not
yet submitted.
Added NFC part of D63204 to have a use of that function and
mute the warning.
llvm-svn: 363416
Avoid producing illegal register bank copies for reg_sequence and
phi. The default implementation assumes it is possible to pick any
operand's bank and use that for the result, introducing a copy for
operands with a different bank. This does not check for illegal
copies. It is not legal to introduce a VGPR->SGPR copy, so any VGPR
operand requires the result to be a VGPR.
The changes in getInstrMappingImpl aren't strictly necessary, since
AMDGPU now just bypasses this for reg_sequence/phi. This could be
replaced with an assert in case other targets run into this. It is
currently responsible for producing the error for unsatisfiable
copies, but this will be better served with a verifier check.
For phis, for now assume any undetermined operands must be
VGPRs. Eventually, this needs to be able to defer mapping these
operations. This also does not yet have a way to check for whether the
block is in a divergent region.
llvm-svn: 363410
This is the family of vector instructions that combine all the lanes
in their input vector(s), and output a value in one or two GPRs.
Differential Revision: https://reviews.llvm.org/D62670
llvm-svn: 363403
Initial commit of a new pass to create vector predication blocks, called VPT
blocks, that are supported by the Armv8.1-M MVE architecture.
This is a first naive implementation. I.e., for 2 consecutive predicated
instructions I1 and I2, for example, it will generate 2 VPT blocks:
VPST
I1
VPST
I2
A more optimal implementation would obviously put instructions in the same VPT
block when they are predicated on the same condition and when it is allowed to
do this:
VPTT
I1
I2
We will address this optimisation with follow up patches when the groundwork is
in. Creating VPT Blocks is very similar to IT Blocks, which is the reason I
added this to Thumb2ITBlocks.cpp. This allows reuse of the def use analysis
that we need for the more optimal implementation.
VPT blocks cannot be nested in IT blocks, and vice versa, and so these 2 passes
cannot interact with each other. Instructions allowed in VPT blocks must
be MVE instructions that are marked as VPT compatible.
Differential Revision: https://reviews.llvm.org/D63247
llvm-svn: 363370
Merging the two bits shrinks the context table from 16384 bytes to 8192 bytes.
Remove the ATTRIBUTE_BITS macro and just create an enum directly. Then fix the ATTR_max define to be 8192 to reflect the table size so we stop hardcoding it separately.
llvm-svn: 363330
Previously it copied over MachineMemOperands verbatim which caused MOV32rm to have store flags set, and MOV32mr to have load flags set. This fixes some assertions being thrown with EXPENSIVE_CHECKS on.
Committed on behalf of @luke (Luke Lau)
Differential Revision: https://reviews.llvm.org/D62726
llvm-svn: 363268
This commit prepares the way to start adding the main collection of
MVE instructions, which operate on the 128-bit vector registers.
The most obvious thing that's needed, and the simplest, is to add the
MQPR register class, which is like the existing QPR except that it has
fewer registers in it.
The more complicated part: MVE defines a system of vector predication,
in which instructions operating on 128-bit vector registers can be
constrained to operate on only a subset of the lanes, using a system
of prefix instructions similar to the existing Thumb IT, in that you
have one prefix instruction which designates up to 4 following
instructions as subject to predication, and within that sequence, the
predicate can be inverted by means of T/E suffixes ('Then' / 'Else').
To support instructions of this type, we've added two new Tablegen
classes `vpred_n` and `vpred_r` for standard clusters of MC operands
to add to a predicated instruction. Both include a flag indicating how
the instruction is predicated at all (options are T, E and 'not
predicated'), and an input register field for the register controlling
the set of active lanes. They differ from each other in that `vpred_r`
also includes an input operand for the previous value of the output
register, for instructions that leave inactive lanes unchanged.
`vpred_n` lacks that extra operand; it will be used for instructions
that don't preserve inactive lanes in their output register (either
because inactive lanes are zeroed, as the MVE load instructions do, or
because the output register isn't a vector at all).
This commit also adds the family of prefix instructions themselves
(VPT / VPST), and all the machinery needed to work with them in
assembly and disassembly (e.g. generating the 't' and 'e' mnemonic
suffixes on disassembled instructions within a predicated block)
I've added a couple of demo instructions that derive from the new
Tablegen base classes and use those two operand clusters. The bulk of
the vector instructions will come in followup commits small enough to
be manageable. (One exception is that I've added the full version of
`isMnemonicVPTPredicable` in the AsmParser, because it seemed
pointless to carefully split it up.)
Reviewers: dmgreen, samparker, SjoerdMeijer, t.p.northover
Subscribers: javed.absar, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D62669
llvm-svn: 363258
During assembly, the mask operand to an IT instruction (storing the
sequence of T/E for 'Then' and 'Else') is parsed out of the mnemonic
into a representation that encodes 'Then' and 'Else' in the same way
regardless of the condition code. At some point during encoding it has
to be converted into the instruction encoding used in the
architecture, in which the mask encodes a sequence of replacement
low-order bits for the condition code, so that which bit value means
'then' and which 'else' depends on whether the original condition code
had its low bit set.
Previously, that transformation was done by processInstruction(), half
way through assembly. So an MCOperand storing an IT mask would
sometimes store it in one format, and sometimes in the other,
depending on where in the assembly pipeline you were. You can see this
in diagnostics from `llvm-mc -debug -triple=thumbv8a -show-inst`, for
example: if you give it an instruction such as `itete eq`, you'd see
an `<MCOperand Imm:5>` in a diagnostic become `<MCOperand Imm:11>` in
the final output.
Having the same data structure store values with time-dependent
semantics is confusing already, and it will get more confusing when we
introduce the MVE VPT instruction which reuses the Then/Else bitmask
idea in a different context. So I'm refactoring: now, all `ARMOperand`
and `MCOperand` representations of an IT mask work exactly the same
way, namely, 0 means 'Then' and 1 means 'Else', regardless of what
original predicate is being referred to. The architectural encoding of
IT that depends on the original condition is now constructed at the
point when we turn the `MCOperand` into the final instruction bit
pattern, and decoded similarly in the disassembler.
The previous condition-independent parse-time format used 0 for Else
and 1 for Then. I've taken the opportunity to flip the sense of it
while I'm changing all of this anyway, because it seems to me more
natural to use 0 for 'leave the starting condition unchanged' and 1
for 'invert it', as if those bits were an XOR mask.
Reviewers: ostannard
Subscribers: javed.absar, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D63219
llvm-svn: 363244
TTI should report that it's not profitable to generate a hardware loop
if it, or one of its child loops, has already been converted.
Differential Revision: https://reviews.llvm.org/D63212
llvm-svn: 363234
Summary:
- Remove redundant initializations from pass constructors that were
already being initialized by LLVMInitializeX86Target().
- Add initialization function for the FPS pass.
Reviewers: craig.topper
Reviewed By: craig.topper
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D63218
llvm-svn: 363221
Add support for s.d instruction for Mips1 which expands into two swc1
instructions.
Patch by Mirko Brkusanin.
Differential Revision: https://reviews.llvm.org/D63199
llvm-svn: 363184
As discussed on D62910, we need to check whether particular types of memory access are allowed, not just their alignment/address-space.
This NFC patch adds a MachineMemOperand::Flags argument to allowsMemoryAccess and allowsMisalignedMemoryAccesses, and wires up calls to pass the relevant flags to them.
If people are happy with this approach I can then update X86TargetLowering::allowsMisalignedMemoryAccesses to handle misaligned NT load/stores.
Differential Revision: https://reviews.llvm.org/D63075
llvm-svn: 363179
Without this fix clang 3.6 complains with:
../lib/Target/ARM/ARMAsmPrinter.cpp:1473:18: error: variable 'BranchTarget' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
} else if (MI->getOperand(1).isSymbol()) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~
../lib/Target/ARM/ARMAsmPrinter.cpp:1479:22: note: uninitialized use occurs here
MCInst.addExpr(BranchTarget);
^~~~~~~~~~~~
../lib/Target/ARM/ARMAsmPrinter.cpp:1473:14: note: remove the 'if' if its condition is always true
} else if (MI->getOperand(1).isSymbol()) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../lib/Target/ARM/ARMAsmPrinter.cpp:1465:33: note: initialize the variable 'BranchTarget' to silence this warning
const MCExpr *BranchTarget;
^
= nullptr
1 error generated.
Discussed here:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20190610/661417.html
llvm-svn: 363166
Implement the backend target hook to drive the HardwareLoops pass.
The low-overhead branch extension for Arm M-class cores is flexible
enough that we don't have to ensure correctness at this point, except
checking that the loop counter variable can be stored in LR - a
32-bit register. For it to be profitable, we want to avoid loops that
contain function calls, or any other instruction that alters the PC.
This implementation uses TargetLoweringInfo, to query type and
operation actions, looks at intrinsic calls and also performs some
manual checks for remainder/division and FP operations.
I think this should be a good base to start and extra details can be
filled out later.
Differential Revision: https://reviews.llvm.org/D62907
llvm-svn: 363149
Noticed in D63075 - there was a allowsMisalignedMemoryAccesses call to check for unaligned loads and a check for aligned legal type loads - which is exactly what allowsMemoryAccess does.
llvm-svn: 363141
Noticed in D63075 - there was a allowsMisalignedMemoryAccesses call to check for unaligned loads and a check for aligned legal type loads - which is exactly what allowsMemoryAccess does.
llvm-svn: 363137