If no pal metadata is given, default to the msgpack format instead of
the legacy metadata. This makes tests better readable.
Differential Revision: https://reviews.llvm.org/D90035
Support atomic load instruction and add a regression test.
VE uses release consitency, so need to insert fence around
atomic instructions. This patch enable AtomicExpandPass
and use emitLeadingFence and emitTrailingFence mechanism
for such purpose.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D90135
This adds a MultiHazardRecognizer and starts to make use of it in the
ARM backend. The idea of the class is to allow multiple independent
hazard recognizers to be added to a single base MultiHazardRecognizer,
allowing them to all work in parallel without requiring them to be
chained into subclasses. They can then be added or not based on cpu or
subtarget features, which will become useful in the ARM backend once
more hazard recognizers are being used for various things.
This also renames ARMHazardRecognizer to ARMHazardRecognizerFPMLx in the
process, to more clearly explain what that recognizer is designed for.
Differential Revision: https://reviews.llvm.org/D72939
Support atomic fence instruction and add a regression test.
Add MEMBARRIER pseudo insturction also to use it as a barrier
against to the compiler optimizations.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D90112
We use an absolute address for stack objects and
it would be necessary to have a constant 0 for soffset field.
Fixes: SWDEV-228562
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D89234
The 0xf3 prefix has been defined as wbnoinvd on Icelake Server. So
the prefix isn't ignored by the CPU. AMD documentation suggests that
wbnoinvd is treated as wbinvd on older processors. Intel documentation
is not clear. Perhaps 0xf2 and 0x66 are treated the same, but its
not documented.
This patch changes TB to PS in the td file so 0xf2 and 0x66 will
be treated as errors. This matches versions of objdump after
wbnoinvd was added.
For now, we lost the encoding information if we using inline assembly.
The encoding for the inline assembly will keep default even if we add
the vex/evex prefix.
Differential Revision: https://reviews.llvm.org/D90009
We have been producing R_X86_64_REX_GOTPCRELX (MOV64rm/TEST64rm/...) and
R_X86_64_GOTPCRELX for CALL64m/JMP64m without the REX prefix since 2016 (to be
consistent with GNU as), but not for MOV32rm/TEST32rm/...
1. Throughput and codesize costs estimations was separated and updated.
2. Updated fdiv cost estimation for different cases.
3. Added scalarization processing for types that are treated as !isSimple() to
improve codesize estimation in getArithmeticInstrCost() and
getArithmeticInstrCost(). The code was borrowed from TCK_RecipThroughput path
of base implementation.
Next step is unify scalarization part in base class that is currently works for
TCK_RecipThroughput path only.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D89973
Replace the X86 specific isSplatZeroExtended helper with a generic BuildVectorSDNode method.
I've just used this to simplify the X86ISD::BROADCASTM lowering so far (and remove isSplatZeroExtended), but we should be able to use this in more places to lower to complex broadcast patterns.
Differential Revision: https://reviews.llvm.org/D87930
This value had the default value of 4 which caused branch relaxation to fail.
Review: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D90065
This does not change anything at the moment, but needed for
D89170. In that change I am probing a physical SGPR to see if
it is legal. RC is SReg_32, but DRC for scratch instructions
is SReg_32_XEXEC_HI and test fails.
That is sufficient just to check if DRC contains a register
here in case of physreg. Physregs also do not use subregs
so the subreg handling below is irrelevant for these.
Differential Revision: https://reviews.llvm.org/D90064
There are two optimizations here:
1. Consider the following code:
FCMPSrr %0, %1, implicit-def $nzcv
%sel1:gpr32 = CSELWr %_, %_, 12, implicit $nzcv
%sub:gpr32 = SUBSWrr %_, %_, implicit-def $nzcv
FCMPSrr %0, %1, implicit-def $nzcv
%sel2:gpr32 = CSELWr %_, %_, 12, implicit $nzcv
This kind of code where we have 2 FCMPs each feeding a CSEL can happen
when we have a single IR fcmp being used by two selects. During selection,
to ensure that there can be no clobbering of nzcv between the fcmp and the
csel, we have to generate an fcmp immediately before each csel is
selected.
However, often we can essentially CSE these together later in MachineCSE.
This doesn't work though if there are unrelated flag-setting instructions
in between the two FCMPs. In this case, the SUBS defines NZCV
but it doesn't have any users, being overwritten by the second FCMP.
Our solution here is to try to convert flag setting operations between
a interval of identical FCMPs, so that CSE will be able to eliminate one.
2. SelectionDAG imported patterns for arithmetic ops currently select the
flag-setting ops for CSE reasons, and add the implicit-def $nzcv operand
to those instructions. However if those impdef operands are not marked as
dead, the peephole optimizations are not able to optimize them into non-flag
setting variants. The optimization here is to find these dead imp-defs and
mark them as such.
This pass is only enabled when optimizations are enabled.
Differential Revision: https://reviews.llvm.org/D89415
Immediate must be in an integer range [0,255] for umin/umax instruction.
Extend pattern matching helper SelectSVEArithImm() to take in value type
bitwidth when checking immediate value is in range or not.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D89831
In this patch, Predicates fix added for the following:
* disable prefix-instrs will disable pcrelative-memops
* set two predicates PairedVectorMemops and PrefixInstrs for PLXVP/PSTXVP definitions
Differential Revision: https://reviews.llvm.org/D89727
Reviewed by: amyk, steven.zhang
I was wrong in thinking that MRI.use_instructions return unique instructions and mislead Jay in his previous patch D64393.
First loop counted more instructions than it was in reality and the second loop went beyond the basic block with that counter.
I used Jay's previous code that relied on MRI.use_operands to constrain the number of instructions to check among.
modifiesRegister is inlined to reduce the number of passes over instruction operands and added assert on BB end boundary.
Differential Revision: https://reviews.llvm.org/D89386
Implementation of instructions table.get, table.set, table.grow,
table.size, table.fill, table.copy.
Missing instructions are table.init and elem.drop as they deal with
element sections which are not yet implemented.
Added more tests to tables.s
Differential Revision: https://reviews.llvm.org/D89797
This follows on from D89558 which added the new intrinsic and D88955
which added similar combines for llvm.amdgcn.fmul.legacy.
Differential Revision: https://reviews.llvm.org/D90028
This patch adds a specialized implementation of getIntrinsicInstrCost
and add initial cost-modeling for min/max vector intrinsics.
AArch64 NEON support umin/smin/umax/smax for vectors
<8 x i8>, <16 x i8>, <4 x i16>, <8 x i16>, <2 x i32> and <4 x i32>.
Notably, it does not support vectors with i64 elements.
This change by itself should have very little impact on codegen, but in
follow-up patches I plan to teach the vectorizers to consider using
those intrinsics on platforms where it is profitable, e.g. because there
is no general 'select'-like instruction.
The current cost returned should be better for throughput, latency and size.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D89953
Move the code which adjusts the immediate/predicate on a G_ICMP to
AArch64PostLegalizerLowering.
This
- Reduces the number of places we need to test for optimized compares in the
selector. We know that the compare should have been simplified by the time it
hits the selector, so we can avoid testing this in selects, brconds, etc.
- Allows us to potentially fold more compares (previously, this optimization
was only done after calling `tryFoldCompare`, this may allow us to hit some more
TST cases)
- Simplifies the selection code in `emitIntegerCompare` significantly; we can
just use an emitSUBS function.
- Allows us to avoid checking that the predicate has been updated after
`emitIntegerCompare`.
Also add a utility header file for things that may be useful in the selector
and various combiners. No need for an implementation file at this point, since
it's just one constexpr function for now. I've run into a couple cases where
having one of these would be handy, so might as well add it here. There are
a couple functions in the selector that can probably be factored out into
here.
Differential Revision: https://reviews.llvm.org/D89823
There are a lot of combines in AArch64PostLegalizerCombiner which exist to
facilitate instruction matching in the selector. (E.g. matching for G_ZIP and
other shuffle vector pseudos)
It still makes sense to select these instructions at -O0.
Matching earlier in a combiner can reduce complexity in the selector
significantly. For example, a good portion of our selection code for compares
would be a lot easier to represent in a combine.
This patch moves matching combines into a "AArch64PostLegalizerLowering"
combiner which runs at all optimization levels.
Also, while we're here, improve the documentation for the
AArch64PostLegalizerCombiner, and fix up the filepath in its file comment.
And also add a 'r' which somehow got dropped from a bunch of function names.
https://reviews.llvm.org/D89820
Add new loop metadata amdgpu.loop.unroll.threshold to allow the initial AMDGPU
specific unroll threshold value to be specified on a loop by loop basis.
The intention is to be able to to allow more nuanced hints, e.g. specifying a
low threshold value to indicate that a loop may be unrolled if cheap enough
rather than using the all or nothing llvm.loop.unroll.disable metadata.
Differential Revision: https://reviews.llvm.org/D84779
This commit marks i16 MULH as expand in AMDGPU backend,
which is necessary after the refactoring in D80485.
Differential Revision: https://reviews.llvm.org/D89965
Add the MVT equivalent handling for EVT changeTypeToInteger/changeVectorElementType/changeVectorElementTypeToInteger.
All the SimpleVT code already exists inside the EVT equivalents, but by splitting this out we can use these directly inside MVT types without converting to/from EVT.
This patch touches two optimizations, TwoAddressInstruction and X86's
FixupLEAs pass, both of which optimize by re-creating instructions. For
LEAs, various bits of arithmetic are better represented as LEAs on X86,
while TwoAddressInstruction sometimes converts instrs into three address
instructions if it's profitable.
For debug instruction referencing, both of these require substitutions to
be created -- the old instruction number must be pointed to the new
instruction number, as illustrated in the added test. If this isn't done,
any variable locations based on the optimized instruction are
conservatively dropped.
Differential Revision: https://reviews.llvm.org/D85756
- Make the SIMemoryLegalizer insertAcquire function be in the same
order for each target to be consistent.
Differential Revision: https://reviews.llvm.org/D89880
Some of our conversion algorithms produce -0.0 when converting unsigned i64 to double when the rounding mode is round toward negative. This switches them to other algorithms that don't have this problem. Since it is undefined behavior to change rounding mode with the non-strict nodes, this patch only changes the behavior for strict nodes.
There are still problems with unsigned i32 conversions too which I'll try to fix in another patch.
Fixes part of PR47393
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D87115
1. Fixed liveness issue with implicit kills.
2. Fixed potential problem with an indirect mov.
Fixes: SWDEV-256848
Differential Revision: https://reviews.llvm.org/D89599
We use the Real vs Pseudo instruction abstraction for other
types of instructions to facilitate changes in opcode
between gpu generations.
This patch introduces that abstraction to SOPC and SOPP.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D89738
Change-Id: I59d53c2c7058b49d05b60350f4062a9b542d3138
Fixes being overly conservative with the register counts in called
functions. This should try to do a conservative range merge, but for
now just clone.
Also fix not being able to functionally run the pass standalone.
For historical reasons, the R6 register is a callee-saved argument
register. This means that if it is used to pass an argument to a function
that does not clobber it, it is live throughout the function.
This patch makes sure that in this special case any kill flags of it are
removed.
Review: Ulrich Weigand, Eli Friedman
Differential Revision: https://reviews.llvm.org/D89451
Some instructions may be removable through processes such as IfConversion,
however DefinesPredicate can not be made aware of when this should be considered.
This parameter allows DefinesPredicate to distinguish these removable instructions
on a per-call basis, allowing for more fine-grained control from processes like
ifConversion.
Renames DefinesPredicate to ClobbersPredicate, to better reflect it's purpose
Differential Revision: https://reviews.llvm.org/D88494
Passes that are run after the post-RA scheduler may insert instructions like
waitcnt which eliminate the need for certain noops. After this patch the
scheduler is still aware of possible latency from hazards but noops will
not be inserted until the dedicated hazard recognizer pass is run.
Depends on D89753.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D89754
If a target can encode multiple wait-states into a noop allow emitting such
instructions directly.
Reviewed By: rampitec, dmgreen
Differential Revision: https://reviews.llvm.org/D89753
Change waitcnt insertion to check the memory operand tokens to see if
flat memory operations access VMEM in the same way it does to check if
accessing LDS. This avoids adding waitcnt for counters for address
spaces that are not accessed.
In addition, only generate the pessimistic waitcnt 0 if a flat memory
operation appears to access both VMEM and LDS.
This benefits flat memory operations that explicitly specify the
address space as GLOBAL or LOCAL.
Differential Revision: https://reviews.llvm.org/D89618
Remove getAllVGPR32() interface and update the SGPR spill code to use
a proper method to get the relevant VGPR registers list.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D89806
- In general, a generic point may alias to pointers in all other address
spaces. However, for certain cases enforced by the programming model,
we may found a generic point won't alias to pointers to local objects.
* When a generic pointer is loaded from the constant address space, it
could only be a pointer to the GLOBAL or CONSTANT address space.
Thus, it won't alias to pointers to the PRIVATE or LOCAL address
space.
* When a generic pointer is passed as a kernel argument, it also could
only be a pointer to the GLOBAL or CONSTANT address space. Thus, it
also won't alias to pointers to the PRIVATE or LOCAL address space.
Differential Revision: https://reviews.llvm.org/D89525
Remove immediate operand from SI_ELSE which indicates if EXEC has
been modified. Instead always emit code that handles EXEC and
remove unnecessary instructions during pre-RA optimisation.
This facilitates passes (i.e. SIWholeQuadMode) adding exec mask
manipulation post control flow lowering, and pre control flow
lower passes do not need to be aware of SI_ELSE handling.
Reviewed By: nhaehnle
Differential Revision: https://reviews.llvm.org/D89644
Remove duplicate code and move things around to make it easier to
add additional optimisations to the pass.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D89619
The "Size" value returned by SystemZDisassembler::getInstruction is
used by common code even in the case where the routine returns
failure. If that Size value exceeds the number of bytes remaining
in the section, that could cause disassembler crashes.
Fixed by never returning more than the number of bytes remaining.
This reverts commit 38f625d0d1.
This commit contains some holes in its logic and has been causing
issues since it was commited. The idea sounds OK but some cases were not
handled correctly. Instead of trying to fix that up later it is probably
simpler to revert it and work to reimplement it in a more reliable way.
Create the LLVM / CodeView register mappings for the 32-bit ARM Window targets.
Reviewed By: compnerd
Differential Revision: https://reviews.llvm.org/D89622
extract_vector_elt will turn type vxi1 into i8, which triggers the assertion fail.
Since we don't really handle vxi1 cases in below code, we can just return from here.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D89096
GFX10 enables third addressing mode for flat scratch instructions,
an ST mode. In that mode both register operands are omitted and
only swizzled offset is used in addition to flat_scratch base.
Differential Revision: https://reviews.llvm.org/D89501
Before the change attempt to link libLTO.so against shared
LLVM library failed as:
```
[ 76%] Linking CXX shared library ../../lib/libLTO.so
... /usr/bin/cmake -E cmake_link_script CMakeFiles/LTO.dir/link.txt --verbose=1
c++ -o ...libLTO.so.12git ...ibLLVM-12git.so
ld: CMakeFiles/LTO.dir/lto.cpp.o: in function `llvm::InitializeAllTargetInfos()':
include/llvm/Config/Targets.def:31: undefined reference to `LLVMInitializeVETargetInfo'
```
It happens because on linux llvm build system sets default
symbol visibility to "hidden". The fix is to set visibility
back to "default" for exported APIs with LLVM_EXTERNAL_VISIBILITY.
Bug: https://bugs.llvm.org/show_bug.cgi?id=47847
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D89633
Summary:
Initializer merging generates pretty inefficient code for large allocas
that also happens to trigger an exponential algorithm somewhere in
Machine Instruction Scheduler. See https://bugs.llvm.org/show_bug.cgi?id=47867.
This change adds an upper limit for the alloca size. The default limit
is selected such that worst case size of memtag-generated code is
similar to non-memtag (but because of the ISA quirks, this case is
realized at the different value of alloca size, ex. memset inlining
triggers at sizes below 512, but stack tagging instructions are 2x
shorter, so limit is approx. 256).
We could try harder to emit more compact code with initializer merging,
but that would only affect large, sparsely initialized allocas, and
those are doing fine already.
Reviewers: vitalybuka, pcc
Subscribers: llvm-commits
We have pseudo instructions we use for bitcasts between these types.
We have them in the load folding table, but not the store folding
table. This adds them there so they can be used for stack spills.
I added an exact size check so that we don't fold when the stack slot
is larger than the GPR. Otherwise the upper bits in the stack slot
would be garbage. That would be fine for Eli's test case in PR47874,
but I'm not sure its safe in general.
A step towards fixing PR47874. Next steps are to change the ADDSSrr_Int
pseudo instructions to use FR32 as the second source register class
instead of VR128. That will keep the coalescer from promoting the
register class of the bitcast instruction which will make the stack
slot 4 bytes instead of 16 bytes.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D89656
These were introduced in r279902 on the grounds that using separate
MUL_U24/MUL_I24 and MULHI_U24/MULHI_I24 nodes would introduce multiple
uses of the operands, which would prevent SimplifyDemandedBits from
simplifying the operands.
This has since been fixed by D24672 "AMDGPU/SI: Use new SimplifyDemandedBits helper for multi-use operations"
No functional change intended. At least it has no effect on lit tests.
Differential Revision: https://reviews.llvm.org/D89706
MULH is often expanded on targets.
This patch removes the isMulhCheaperThanMulShift hook and uses
isOperationLegalOrCustom instead.
Differential Revision: https://reviews.llvm.org/D80485
S_CMP_LG_U64 was added in gfx8 and is guarded by hasScalarCompareEq64().
Rewrite S_CMP_LG_U64 to S_OR_B32 + S_CMP_LG_U32 for targets that
do not support 64-bit scalar compare.
Differential Revision: https://reviews.llvm.org/D89536
Add setcc for fp128 and clean existing ISel patterns. Also add
a regression test.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D89683
Add missing ISel patterns related to select_cc DAG nodes.
Add regression test of all combination of possible scalar types.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D89672
Add VBRD/VMV vector instructions. In order to do that, also support
VM512 registers and RV instruction format in MC layer. Also add
regression tests for new instructions.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D89641
Add LSV/LVS/LVM/SVM vector instructions and regression tests.
Also update AsmParser to support new format of operands.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D89499
Support br_cc instruction comparing fp128 values. Add a br_cc.ll
regression test for all kind of br_cc instructions. And, clean
existing branch regression tests, this time. Clean a brcond.ll
regression test for brcond instruction. Remove mixed branch1.ll
regression test.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D89627
Add an ISel pattern for fp128 select instruction and optimize generated
code for other types' select. instructions. Add a regression test also.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D89509
In many places in the AArch64 backend we are comparing TypeSize objects,
but in fact we are only ever expecting fixed width types. I've changed
all such comparisons to use their integer equivalents by replacing
calls to getSizeInBits() with getFixedSizeInBits(), etc.
Differential Revision: https://reviews.llvm.org/D89116
loadiwkey and aesenc128kl share the same opcode but one is memory
and one is register. But they're behavior is quite different. We
were crashing because one has an output register and one doesn't
and the backend couldn't account for that. But since they aren't
foldable we can just add NotMemoryFoldable so they won't be looked at.
This adds some basic costs for MVE reductions - currently just costing
the simple legal add vectors as a single MVE instruction. More complex
costing can be added in the future when the framework more readily
allows it.
Differential Revision: https://reviews.llvm.org/D88980
This adds a very basic cost for active_lane_mask under MVE - making the
assumption that they will be free and then apologizing for that in a
comment.
In reality they may either be free (by being nicely folded into a tail
predicated loop), cost the same as a VCTP or be expanded into vdup's,
adds and cmp's. It is difficult to detect the difference from a single
getIntrinsicInstrCost call, so makes the assumption that the vectorizer
is adding them, and only added them where it makes sense.
We may need to change this in the future to better model predicate costs
in the vectorizer, especially at -Os or non-tail predicated loops. The
vectorizer currently does not query the cost of these instructions but
that will change in the future and a zero cost there probably makes the
most sense at the moment.
Differential Revision: https://reviews.llvm.org/D88989
This lets external consumers customize the output, similar to how
AssemblyAnnotationWriter lets the caller define callbacks when printing
IR. The array of handlers already existed, this just cleans up the code
so that it can be exposed publically.
Differential Revision: https://reviews.llvm.org/D74158
If instructions were removed in peephole passes after the hazard recognizer was
run it is possible that new hazards could be introduced.
Fixes: SWDEV-253090
Reviewed By: rampitec, arsenm
Differential Revision: https://reviews.llvm.org/D89077
NEON is pretty limited in it's reduction support. As a first step add some
basic rules for the legal types we can select.
Differential Revision: https://reviews.llvm.org/D89070
This would end up killing part of the result super-register, resulting
in a verifier error on a later use of the overlapping registers. We
could add kills of any non-aliasing registers, but we should be moving
away from relying on kill flags.
- The goal of this patch is improve option compatible with RISCV-V GCC,
-mcpu support on GCC side will sent patch in next few days.
- -mtune only affect the pipeline model and non-arch/extension related
target feature, e.g. instruction fusion; in td file it called
TuneFeatures, which is introduced by X86 back-end[1].
- -mtune accept all valid option for -mcpu and extra alias processor
option, e.g. `generic`, `rocket` and `sifive-7-series`, the purpose is
option compatible with RISCV-V GCC.
- Processor alias for -mtune will resolve according the current target arch,
rv32 or rv64, e.g. `rocket` will resolve to `rocket-rv32` or `rocket-rv64`.
- Interaction between -mcpu and -mtune:
* -mtune has higher priority than -mcpu for pipeline model and
TuneFeatures.
[1] https://reviews.llvm.org/D85165
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D89025
Simplify emitIntegerCompare and improve comments + asserts.
Mostly making the code a little easier to follow.
Also, this code is only used for G_ICMP. The legalizer ensures that the LHS/RHS
for every G_ICMP is either a s32 or s64. So, there's no need to handle anything
else. This lets us remove a bunch of checks for whether or not we successfully
emitted the compare.
Differential Revision: https://reviews.llvm.org/D89433
removeMBBifRedundant normally tries to keep predecessors fallthrough when removing redundant MBB.
It has to change MBBs layout to keep the new successor to immediately follow the predecessor of removed MBB.
It only may be allowed in case the new successor itself has no successors to which it fall through.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D89397
Implement stack frame reordering in the AArch64 backend.
Unlike the X86 implementation, AArch64 does not seem to benefit from
"access density" based frame reordering, mainly because it has a much
smaller variety of addressing modes, and the fact that all instructions
are 4 bytes so each frame object is either in range of an instruction
(and then the access is "free") or not (and that has a code size cost
of 4 bytes).
This change improves Memory Tagging codegen by
* Placing an object that has been chosen as the base tagged pointer of
the function at SP + 0. This saves one instruction to setup the pointer
(IRG does not have an offset immediate), and more because that object
can now be referenced without materializing its tagged address in a
scratch register.
* Placing objects that go out of scope simultaneously together. This
exposes opportunities for instruction merging in tryMergeAdjacentSTG.
Differential Revision: https://reviews.llvm.org/D72366
Summary:
Pin the tagged base pointer to one of the stack slots, and (if
necessary) rewrite tag offsets so that an object that occupies that
slot has both address and tag offsets of 0. This allows ADDG
instructions for that object to be eliminated and their uses replaced
with the tagged base pointer itself.
This optimization must be done in machine instructions and not in the IR
instrumentation pass, because referring to a stack slot through an IRG
pointer would confuse the stack coloring pass.
The optimization makes a (pretty naive) attempt to find the slot that
would benefit the most by counting the uses of stack slots in the
function.
Reviewers: ostannard, pcc
Subscribers: merge_guards_bot, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72365
Prototype the newly proposed load_lane instructions, as specified in
https://github.com/WebAssembly/simd/pull/350. Since these instructions are not
available to origin trial users on Chrome stable, make them opt-in by only
selecting them from intrinsics rather than normal ISel patterns. Since we only
need rough prototypes to measure performance right now, this commit does not
implement all the load and store patterns that would be necessary to make full
use of the offset immediate. However, the full suite of offset tests is included
to make it easy to track improvements in the future.
Since these are the first instructions to have a memarg immediate as well as an
additional immediate, the disassembler needed some additional hacks to be able
to parse them correctly. Making that code more principled is left as future
work.
Differential Revision: https://reviews.llvm.org/D89366
This does unfortunately end up with extra waitcnts getting inserted
that were avoided before. Ideally we would avoid the spills of these
undef components in the first place.
Generate the minimal set of s_mov instructions required when
expanding a SGPR copy operation in copyPhysReg.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D89187
In most of lib/Target we know that we are not dealing with scalable
types so it's perfectly fine to replace TypeSize comparison operators
with their fixed width equivalents, making use of getFixedSize()
and so on.
Differential Revision: https://reviews.llvm.org/D89101
VE doesn't have SHL_PARTS/SRA_PARTS/SRL_PARTS instructions, so need
to expand them. Add regression tests too.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D89396
These cause problems for later optimizations, just using an unused vreg like
SelectionDAG generates better code in the end, and obviates the need for some
GISel specific flag optimizations.
Differential Revision: https://reviews.llvm.org/D89419
NVPTXLowerArgs works as follows.
* Create a regular alloca with alignment identical to arg.
* Copy arg from param space (and ASC'ing it from generic AS first) to
the alloca (it's still in generic AS).
* Replace loads of arg with loads of alloca.
The bug here is that we did not preserve the arg's alignment when
loading from the alloca.
The impact of this bug is that sometimes param loads would be lowered as
a series of u8 loads, because we're incorrectly assuming everything has
alignment 1.
Differential Revision: https://reviews.llvm.org/D89404
In order to correctly load an all-ones FP NaN value into a floating point
register with a VGBM, the analyzed 32/64 FP bits must first be shifted left
(into element 0 of the vector register).
SystemZVectorConstantInfo has so far relied on element replication which has
bypassed the need to do this shift, but now it is clear that this must be
done in order to handle NaNs.
Review: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D89389
Generate (at runtime) the table used to drive getSubRegFromChannel,
base on AMDGPUSubRegIdxRanges from TableGen data.
The is a step closer to it being staticly generated by TableGen and
allows getSubRegFromChannel handle all bitwidths in the mean time.
Reviewed By: rampitec, arsenm, foad
Differential Revision: https://reviews.llvm.org/D89217
When passing SVE types as arguments to function calls we can run
out of hardware SVE registers. This is normally fine, since we
switch to an indirect mode where we pass a pointer to a SVE stack
object in a GPR. However, if we switch over part-way through
processing a SVE tuple then part of it will be in registers and
the other part will be on the stack. This is wrong and we'd like
to avoid any silent ABI compatibility issues in future. For now,
I've added a fatal error when this happens until we can get a
proper fix.
Differential Revision: https://reviews.llvm.org/D89326
This instruction was introduced in GFX10.3, reusing the opcode of
v_mac_legacy_f32 from GFX10.1.
Differential Revision: https://reviews.llvm.org/D89247
This patch adds support for assemble disassemble intrinsics
for MMA.
Reviewed By: bsaleil, #powerpc
Differential Revision: https://reviews.llvm.org/D88739
Implement computeKnownBitsForTargetInstr for G_AMDGPU_BUFFER_LOAD_UBYTE
and G_AMDGPU_BUFFER_LOAD_USHORT. This allows generic combines to remove
some unnecessary G_ANDs.
Differential Revision: https://reviews.llvm.org/D89316
Adds more testing in basic-assembly.s and a new test tables.s.
Adds support to yaml reading and writing of tables as well.
Differential Revision: https://reviews.llvm.org/D88815
A dynamic linker with lazy binding support may need to handle variant
PCS function symbols specially, so an ELF symbol table marking
STO_AARCH64_VARIANT_PCS [1] was added to address this.
Function symbols that follow the vector PCS are marked via the
.variant_pcs assembler directive, which takes a single parameter
specifying the symbol name and sets the STO_AARCH64_VARIANT_PCS st_other
flag in the object file.
[1] https://github.com/ARM-software/abi-aa/blob/master/aaelf64/aaelf64.rst#st-other-values
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D89138
This passes existing X86 test but I'm not sure if it handles all type
legalization cases it needs to.
Alternative to D89200
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D89222
This reverts commit 432e4e56d3, which reverted 542523a61a. Two issues from
the original commit have been fixed. First, MSVC does not like when std::array
is initialized with only single braces, so this commit switches to using the
more portable double braces. Second, there was a subtle endianness bug that
prevented the original commit from working correctly on big-endian machines,
which has been fixed by switching to using endianness-agnostic bit twiddling
instead of type punning.
Differential Revision: https://reviews.llvm.org/D88773
This can fix an asan failure like below.
==15856==ERROR: AddressSanitizer: use-after-poison on address ...
READ of size 8 at 0x6210001a3cb0 thread T0
#0 llvm::MachineInstr::getParent()
#1 llvm::LiveVariables::VarInfo::findKill()
#2 TwoAddressInstructionPass::rescheduleMIBelowKill()
#3 TwoAddressInstructionPass::tryInstructionTransform()
#4 TwoAddressInstructionPass::runOnMachineFunction()
We need to update the Kills if we replace instructions. The Kills
may be later accessed within TwoAddressInstruction pass.
Differential Revision: https://reviews.llvm.org/D89092
The change starts from LiveRangeMatrix and also checks the users of the
APIs are typed accordingly.
Differential Revision: https://reviews.llvm.org/D89145
Extend loadSRsrcFromVGPR to allow moving a range of instructions into
the loop. The call instruction is surrounded by copies into physical
registers which should be part of the waterfall loop.
Differential Revision: https://reviews.llvm.org/D88291
VE doesn't have instruction for copysign, so expand it. Add a
regression test also.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D89228
VE doesn't have fneg or frem instruction, so change them to expand. Add
regression tests also.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D89205
VE doesn't have BRCOND instruction, so need to expand it. Also add
a regression test.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D89173
I have introduced a new template PolySize class, where the template
parameter determines the type of quantity, i.e. for an element
count this is just an unsigned value. The ElementCount class is
now just a simple derivation of PolySize<unsigned>, whereas TypeSize
is more complicated because it still needs to contain the uint64_t
cast operator, since there are still many places in the code that
rely upon this implicit cast. As such the class also still needs
some of it's own operators.
I've tried to minimise the amount of code in the base PolySize
class, which led to a couple of changes:
1. In some places we were relying on '==' operator comparisons
between ElementCounts and the scalar value 1. I didn't put this
operator in the new PolySize class, and thought it was actually
clearer to use the isScalar() function instead.
2. I removed the isByteSized function and replaced it with calls
to isKnownMultipleOf(8).
I've also renamed NextPowerOf2 to be coefficientNextPowerOf2 so
that it's more consistent with coefficientDivideBy.
Differential Revision: https://reviews.llvm.org/D88409
This is my first LLVM patch, so please tell me if there are any process issues.
The main observation for this patch is that we can lower UMIN/UMAX with v8i16 by using unsigned saturated subtractions in a clever way. Previously this operation was lowered by turning the signbit of both inputs and the output which turns the unsigned minimum/maximum into a signed one.
We could use this trick in reverse for lowering SMIN/SMAX with v16i8 instead. In terms of latency/throughput this is the needs one large move instruction. It's just that the sign bit turning has an increased chance of being optimized further. This is particularly apparent in the "reduce" test cases. However due to the slight regression in the single use case, this patch no longer proposes this.
Unfortunately this argument also applies in reverse to the new lowering of UMIN/UMAX with v8i16 which regresses the "horizontal-reduce-umax", "horizontal-reduce-umin", "vector-reduce-umin" and "vector-reduce-umax" test cases a bit with this patch. Maybe some extra casework would be possible to avoid this. However independent of that I believe that the benefits in the common case of just 1 to 3 chained min/max instructions outweighs the downsides in that specific case.
Patch By: @TomHender (Tom Hender) ActuallyaDeviloper
Differential Revision: https://reviews.llvm.org/D87236
The bextri intrinsic has a ImmArg attribute which will be converted
in SelectionDAG using TargetConstant. We previously converted this
to a plain Constant to allow X86ISD::BEXTR to call SimplifyDemandedBits
on it.
But while trying to decide if D89178 was safe, I realized that
this conversion of TargetConstant to Constant would be one case
where that would break.
So this patch adds a new opcode specifically for the immediate case.
And then teaches computeKnownBits and SimplifyDemandedBits to also
handle it, but not try to SimplifyDemandedBits on it. To make up
for that, I immediately masked the constant to 16 bits when
converting from the intrinsic node to the X86ISD node.
At AMD, in an internal audit of our code, we found some corner cases
where we were not quite differentiating targets enough for some old
hardware. This commit is part of fixing that by adding three new
targets:
* The "Oland" and "Hainan" variants of gfx601 are now split out into
gfx602. LLPC (in the GPUOpen driver) and other front-ends could use
that to avoid using the shaderZExport workaround on gfx602.
* One variant of gfx703 is now split out into gfx705. LLPC and other
front-ends could use that to avoid using the
shaderSpiCsRegAllocFragmentation workaround on gfx705.
* The "TongaPro" variant of gfx802 is now split out into gfx805.
TongaPro has a faster 64-bit shift than its former friends in gfx802,
and a subtarget feature could be set up for that to take advantage of
it. This commit does not make that change; it just adds the target.
V2: Add clang changes. Put TargetParser list in order.
V3: AMDGCNGPUs table in TargetParser.cpp needs to be in GPUKind order,
so fix the GPUKind order.
Differential Revision: https://reviews.llvm.org/D88916
Change-Id: Ia901a7157eb2f73ccd9f25dbacec38427312377d
There are a number of places in RDA where we assume the block will not
be empty. This isn't necessarily true for tail predicated loops where we
have removed instructions. This attempt to make the pass more resilient
to empty blocks, not casting pointers to machine instructions where they
would be invalid.
The test contains a case that was previously failing, but recently been
hidden on trunk. It contains an empty block to begin with to show a
similar error.
Differential Revision: https://reviews.llvm.org/D88926
This patch introduce files that just enough for lib/Target/CSKY to compile.
Notably a basic CSKYTargetMachine and CSKYTargetInfo.
Differential Revision: https://reviews.llvm.org/D88466
The notrack prefix is a relaxation of CET policies which makes it possible to indirectly call targets which do not have an ENDBR instruction in the landing address. To emit a call with this prefix, the special attribute "nocf_check" is used. When used as a function attribute, a CallInst targeting the respective function will return true for the method "doesNoCfCheck()", no matter if it is a direct call (and such should remain like this, as the information that the to-be-called function won't perform control-flow checks is useful in other contexts). Yet, when emitting an X86ISD::NT_CALL, the respective CallInst should be verified for its indirection, allowing that the prefixed calls are only emitted in the right situations.
Update the respective testing unit to also verify for direct calls to functions with ''nocf_check'' attributes.
The bug can also be reproduced through compiling the following C code using the -fcf-protection=full flag.
int __attribute__((nocf_check)) foo(int a) {};
int main() {
foo(42);
}
Differential Revision: https://reviews.llvm.org/D87320
The selection of HVX shuffles can produce more nodes in the DAG,
which need special handling, or otherwise they would be left
unselected by the main selection code. Make the handling of such
nodes more general.
I suspect getAddressFromInstr and addFullAddress are not handling
all addresses cases properly based on a report from MaskRay.
So just copy the operands directly. This should be more efficient
anyway.
The expansion code creates a copy to RBX before the real LCMPXCHG16B.
It's possible this copy uses a register that is also used by the
real LCMPXCHG16B. If we set the kill flag on the use in the copy,
then we'll fail the machine verifier on the use on the LCMPXCHG16B.
Differential Revision: https://reviews.llvm.org/D89151