Explicitly set the exec mask for SGPR spills and reloads.
This fixes a bug where SGPR spills to memory could be incorrect
if the exec mask was 0 (or differed between spill and reload).
Additionally, pack scalar subregisters (up to 16/32 per VGPR),
so that the majority of scalar types can be spilled or reloaded
with a simple memory access. This should amortize some of the
additional overhead of manipulating the exec mask.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D80282
These are scalar instructions that change the behavior of vector
instructions, so they should not be executed when there are no active lanes.
The implementation of -amdgpu-skip-threshold also seems to be backwards
from what is expected, since decreasing it prevents removal.
The pass which infers when it is legal to load from the global address space
with SMRD was only considering amdgpu_kernel, and was ignoring the shader
entry point calling conventions.
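As an illustration (a hypothetical example, not taken from the patch), a
uniform load in a shader entry point such as the following can now be
considered for scalar (SMRD) selection, not just loads in amdgpu_kernel
functions:

  define amdgpu_ps float @uniform_load(float addrspace(1)* inreg %p) {
    %v = load float, float addrspace(1)* %p
    ret float %v
  }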
Summary:
Combine unmerge(trunc) to enable other merge combines.
Without this combine, the scalar unmerge(trunc(merge))
pattern cannot be combined and easily leads to
hard-to-legalize merge/unmerge artifacts.
Reviewed By: arsenm
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D79567
This is a custom inserter because it was less work than teaching
tablegen a way to indicate that it is sometimes OK to have a no side
effect instruction in the output of a side effecting pattern.
The asm is needed to look like a read of the mode register to prevent
it from being deleted. However, there seems to be a bug where the mode
register def instructions are moved across the asm sideeffect by the
post-RA scheduler.
Another oddity is that the immediate is formatted differently between
s_denorm_mode and s_round_mode.
Summary:
A PHI's result register class is set to VGPR or SGPR depending on the cross-block value divergence.
In some cases a uniform PHI needs to be converted to return a VGPR to avoid moving values from VGPR to SGPR and back.
A PHI should certainly return a VGPR if it has at least one VGPR input. This change adds an exception:
we don't want to convert a uniform PHI to VGPRs when its only VGPR input is a VGPR-to-SGPR COPY and the
definition of the source VGPR in that COPY is a move of an immediate, as in:
bb.0:
  %0:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
  %2:sreg_32 = .....
bb.1:
  %3:sreg_32 = PHI %1, %bb.3, %2, %bb.1
  S_BRANCH %bb.3
bb.3:
  %1:sreg_32 = COPY %0
  S_BRANCH %bb.2
Reviewers: rampitec
Reviewed By: rampitec
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D80434
OpenMP emits these for some reason, so handle them. Assume these use
4096 bytes by default, with a flag to override this. Also change the
related stack assumption for calls to have a flag.
We do not have register classes for all possible vector
sizes, so round it up for extract vector element.
Also fixes selection of G_MERGE_VALUES when vectors are
not a power of two.
This required refactoring getRegSplitParts() so that
it can handle vectors whose size is not a power of two.
Ideally we would like RegSplitParts to be generated by
tablegen.
Differential Revision: https://reviews.llvm.org/D80457
- This allows us to specify the (minimal) alignment on an intrinsic's
arguments and, more importantly, the return value.
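For example (the intrinsic and alignment value below are illustrative
assumptions, not part of the change), the effect is equivalent to an align
return attribute on the intrinsic declaration:

  declare align 4 i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()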
Differential Revision: https://reviews.llvm.org/D80422
This is the groundwork required to implement strictfp. For now, this
should be NFC for regular instructions (many instructions just gain an
extra use of a reserved register). Regalloc won't rematerialize
instructions with reads of physical registers, but we were suffering
from that anyway with the exec reads.
Should add it for all the related FP uses (possibly with some
extras). I did not add it to the gpr index mode instructions
(or to every single VALU instruction) since it's a ridiculous feature
already modeled as an arbitrary side effect.
Also work towards marking instructions with FP exceptions. This
doesn't actually set the bit yet since this would start to change
codegen. It seems nofpexcept is currently not implied from the regular
IR FP operations. Add it to some MIR tests where I think it might
matter.
All 3 passes that change instruction encodings were dropping MI
flags. This avoids scheduling regressions caused by setting
mayRaiseFPExceptions on FP instructions for non-strictfp functions.
The vector equivalent has backwards operands, but the scalar version
does not. The passes that use these hooks aren't enabled by default,
so this doesn't really change anything.
I'm guessing this was a holdover from when 0 was an invalid stack
pointer, but surprised nobody has discovered this before.
Also don't allow offset folding for -1 pointers, since it looks weird
to partially fold this.
I consider this to be a hack, since we probably should not mark any
16-bit extract as legal, and require all extracts to be done on
multiples of 32. There are quite a few more battles to fight in the
legalizer for sub-dword vectors, so just select this for now so we can
pass OpenCL conformance without crashing.
Also fix the same assert for G_INSERTs. Unlike G_EXTRACT there's not a
trivial way to select this so just fail on it.
Confusingly, these were unrelated and had different semantics. The
G_PTR_MASK instruction predates the llvm.ptrmask intrinsic, but has a
different format. G_PTR_MASK only allows clearing the low bits of a
pointer, and only a constant number of bits. The ptrmask intrinsic
allows an arbitrary mask. Replace G_PTR_MASK to match the intrinsic.
Only selects the cases that look like the old instruction. More work
is needed to select the general case. Also new legalization code is
still needed to deal with the case where the incoming mask size does
not match the pointer size, which has a specified behavior in the
langref.
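For reference, a hypothetical IR sketch: the first call clears a constant
number of low bits, which is what the old G_PTR_MASK could express, while the
second uses an arbitrary mask, which only the intrinsic allows:

  declare i8* @llvm.ptrmask.p0i8.i64(i8*, i64)

  %lo  = call i8* @llvm.ptrmask.p0i8.i64(i8* %p, i64 -16)   ; clear the low 4 bits
  %any = call i8* @llvm.ptrmask.p0i8.i64(i8* %p, i64 %mask) ; arbitrary, possibly non-constant mask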
This is currently missing most of the hard parts to lower correctly,
so disable it for now. This fixes at least one OpenCL conformance test
and allows it to pass with fallback. Hide this behind an option for
now.
Summary: 'A' constraint requires an immediate int or fp constant that can be inlined in an instruction encoding.
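A hypothetical use (the instruction and value here are made up for
illustration); the operand must be encodable as an inline constant:

  %v = call i32 asm "s_mov_b32 $0, $1", "=s,A"(i32 64)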
Reviewers: arsenm, rampitec
Differential Revision: https://reviews.llvm.org/D78494
EarlyCSE was added with D75145, but the motivating test is
not regressed by removing the extra pass now. That might be
because VectorCombine altered the way it processes instructions,
or it might be from (re)moving VectorCombine in the pipeline.
The extra round of EarlyCSE appears to cost approximately
0.26% in compile-time as discussed in D80236, so we need some
evidence to justify its inclusion here, but we do not have
that (yet).
I suspect that between SLP and VectorCombine, we are creating
patterns that InstCombine and/or codegen are not prepared for,
but we will need to reduce those examples and include them as
PhaseOrdering and/or test-suite benchmarks.
As noted in D80236, moving the pass in the pipeline exposed this
shortcoming. Extra work to recalculate the alias results showed
up as a compile-time slowdown.
There are 2 known problem patterns shown in the test diffs here:
vector horizontal ops (an x86 specialization) and vector reductions.
SLP has greater ability to match and fold those than vector-combine,
so let SLP have first chance at that.
This is a quick fix while we continue to improve vector-combine and
possibly canonicalize to reduction intrinsics.
In the longer term, we should improve matching of these patterns
because if they were created in the "bad" forms shown here, then we
would miss optimizing them.
I'm not sure what is happening with alias analysis on the addsub test.
The old pass manager now shows an extra line for that, and we see an
improvement that comes from SLP vectorizing a store. I don't know
what's missing with the new pass manager to make that happen.
Strangely, I can't reproduce the behavior if I compile from C++ with
clang and invoke the new PM with "-fexperimental-new-pass-manager".
Differential Revision: https://reviews.llvm.org/D80236
Unlike SelectionDAGBuilder, IRTranslator omits the unconditional
branch in fallthrough cases. Confusingly, the control flow pseudos
function in the opposite way the intrinsics are used, and the branch
targets always need to be swapped. We're inverting the target blocks,
so we need to figure out the old fallthrough block and insert a branch
to the original unconditional branch target.
This only affects assembly and -filetype=asm codegen of PAL metadata.
Differential Revision: https://reviews.llvm.org/D78860
Change-Id: I7b822e1917bf7b403486820d31afc483be207652
Promote alloca to vector before SROA and loop unroll. If we manage
to eliminate allocas before unroll we may choose to unroll less.
Differential Revision: https://reviews.llvm.org/D80386
The threshold was set in terms of total vector size while the idea was to
limit the number of instructions. Now that it has started to work with
doubles, the thresholds need to be updated.
Differential Revision: https://reviews.llvm.org/D80322
Even though a series of cmp/cndmask instructions can produce quite a lot
of code, that is still better than a loop. In the case of doubles we
would even produce two loops.
Differential Revision: https://reviews.llvm.org/D80032
Relying on any MachineFunction state in the MachineFunctionInfo
constructor is hazardous, because the construction time is unclear and
determined by the first use. The function may be only partially
constructed, which is part of why we have many of these hacky string
attributes to track what we need for ABI lowering.
For SelectionDAG, all stack objects are created up-front before
calling convention lowering so stack objects are visible at
construction time. For GlobalISel, none of the IR function has been
visited yet and the allocas haven't been added to the MachineFrameInfo
yet. This should fix failing to set flat_scratch_init in GlobalISel
when needed.
This pass really needs to be turned into some kind of analysis, but I
haven't found a nice way to use one here.
This should be directly implied from the register class, and there's
no need to special case live ins here. This was getting the wrong
answer for the queue ptr argument in callable functions, since it's
not an explicit IR argument and is always uniform.
Fixes not using scalar loads for the aperture in addrspacecast
lowering, and any other places that use implicit SGPR arguments.
This was assumed to be a simple move, interpreting the immediate
modifier operand as a materialized immediate. Apparently the SDWA pass
never produces these, but GlobalISel does emit these for some vector
shuffles.
When the callee requires a dynamic stack realignment,
it is not possible to correctly access the incoming
stack arguments using the stack pointer. We reserve a
base pointer in such cases to access the function arguments
inside the callee. The base pointer will hold the incoming
stack pointer value before any kind of delta is added to it.
Reviewed By: arsenm, scott.linder
Differential Revision: https://reviews.llvm.org/D78811
Along the lines of D77454 and D79968. Unlike loads and stores, the
default alignment is getPrefTypeAlign, to match the existing handling in
various places, including SelectionDAG and InstCombine.
Differential Revision: https://reviews.llvm.org/D80044
Summary:
When spilling in the entry function we should be able to borrow
StackPtrOffsetReg as a last resort. This restores behaviour
removed in D75138, and fixes failures when shaders use all
SGPRs, VGPRs and spill in the entry function.
Reviewers: scott.linder, arsenm, tpr
Reviewed By: scott.linder, arsenm
Subscribers: qcolombet, foad, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D79776
For IR generated by a compiler, this is really simple: you just take the
datalayout from the beginning of the file, and apply it to all the IR
later in the file. For optimization testcases that don't care about the
datalayout, this is also really simple: we just use the default
datalayout.
The complexity here comes from the fact that some LLVM tools allow
overriding the datalayout: some tools have an explicit flag for this,
some tools will infer a datalayout based on the code generation target.
Supporting this properly required plumbing through a bunch of new
machinery: we want to allow overriding the datalayout after the
datalayout is parsed from the file, but before we use any information
from it. Therefore, IR/bitcode parsing now has a callback to allow tools
to compute the datalayout at the appropriate time.
Not sure if I covered all the LLVM tools that want to use the callback.
(clang? lli? Misc IR manipulation tools like llvm-link?). But this is at
least enough for all the LLVM regression tests, and IR without a
datalayout is not something frontends should generate.
This change had some sort of weird effects for certain CodeGen
regression tests: if the datalayout is overridden with a datalayout with
a different program or stack address space, we now parse IR based on the
overridden datalayout, instead of the one written in the file (or the
default one, if none is specified). This broke a few AVR tests, and one
AMDGPU test.
Outside the CodeGen tests I mentioned, the test changes are all just
fixing CHECK lines and moving around datalayout lines in weird places.
Differential Revision: https://reviews.llvm.org/D78403
Enable clausing of memory loads on gfx10 by adding a new pass to insert
the s_clause instructions that mark the start of each hard clause.
Differential Revision: https://reviews.llvm.org/D79792
SelectMOVRELOffset prevents peeling of a constant from an index
if the final base could be negative. isBaseWithConstantOffset() succeeds
if a value is an "add" or "or" operator. In the case of "or" it must be
an add-like "or", which never changes the sign of the sum given a
non-negative offset. I.e. we can safely allow peeling if the operator is
an "or".
Differential Revision: https://reviews.llvm.org/D79898
We can produce such vectors in the Promote Alloca pass,
but we are unable to use movrel to operate on them and instead
lower them via scratch. Making these types legal makes SI_INDIRECT
patterns work.
There is more work to do in subsequent changes:
1. We initialize m0 twice to access each dword. It shall
be possible to only do it once and increment base register
number instead.
2. We also need v16i64/v16f64 but these first need to be
added to tablegen.
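The kind of IR this targets (a hypothetical sketch) is a dynamic index into a
wide vector produced by Promote Alloca:

  %vec = load <32 x i32>, <32 x i32> addrspace(5)* %ptr
  %elt = extractelement <32 x i32> %vec, i32 %idx   ; dynamic index -> movrel/SI_INDIRECT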
Differential Revision: https://reviews.llvm.org/D79808
Summary:
ConstantExprs involving operations on <1 x Ty> could translate into MIR
that failed to verify with:
*** Bad machine code: Reading virtual register without a def ***
The problem was that translate(const Constant &C, Register Reg) had
recursive calls that passed the same Reg in for the translation of a
subexpression, but without updating VMap for the subexpression first as
translate(const Constant &C, Register Reg) expects.
Fix this by using the same translateCopy helper function that we use for
translating Instructions. In some cases this causes extra G_COPY
MIR instructions to be generated.
Fixes https://bugs.llvm.org/show_bug.cgi?id=45576
Reviewers: arsenm, volkan, t.p.northover, aditya_nandakumar
Subscribers: jvesely, wdng, nhaehnle, rovka, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78378
When splitting loads in RegBankSelect, G_EXTRACT_VECTOR_ELT instructions were
being added which could not be selected. Since invoking the legalizer will generate
instructions that split and combine wide loads, we can remove the redundant
repair instructions which are added by RegBankSelect.
Differential Revision: https://reviews.llvm.org/D75547
allocas in LLVM IR have a specified alignment. When that alignment is
specified, the alloca has at least that alignment at runtime.
If the specified type of the alloca has a higher preferred alignment,
SelectionDAG currently ignores that specified alignment, and increases
the alignment. It does this even if it would trigger stack realignment.
I don't think this makes sense, so this patch changes that.
I was looking into this for SVE in particular: for SVE, overaligning
vscale'ed types is extra expensive because it requires realigning the
stack multiple times, or using dynamic allocation. (This currently isn't
implemented.)
I updated the expected assembly for a couple tests; in particular, for
arg-copy-elide.ll, the optimization in question does not increase the
alignment the way SelectionDAG normally would. For the rest, I just
increased the specified alignment on the allocas to match what
SelectionDAG was inferring.
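A minimal sketch of the behavior change (hypothetical values):

  ; the specified align 4 is now kept; previously SelectionDAG could raise it to
  ; the preferred alignment of <4 x i32> (typically 16), possibly forcing realignment
  %buf = alloca <4 x i32>, align 4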
Differential Revision: https://reviews.llvm.org/D79532
If there are no available lanes in a reserved VGPR, no free SGPR, and no unused CSR
VGPR when trying to save the FP, it needs to be spilled to memory as a last
resort. This can be done in the prolog/epilog if we manually add the spill
and manage exec.
Differential Revision: https://reviews.llvm.org/D79610
Just do not touch loads and stores which are already vector.
Previously the pass was unable to see these loads and stores
because they were hidden by bitcasts.
Differential Revision: https://reviews.llvm.org/D79738
Currently this code exists in widenScalar for G_MERGE_VALUES
sources. I'm not sure if the existing expansion in widenScalar should
be removed or not. The widenScalar variant tries to extend to the
requested size, but this just uses the original bitwidth.
G_BITCAST can be lowered with a pair of G_UNMERGE_VALUES and
G_MERGE_VALUES with different types, but G_UNMERGE_VALUES of a vector
can also be implemented with a bitcast to a scalar, which introduces
the possibility of infinite loops. Try to eliminate an illegal source
register type in the artifact combiner to prevent this from happening.
Avoids infinite looping in the legalizer in a future patch which
allows lowering G_UNMERGE_VALUES of a vector source with a G_BITCAST.
Check the address space first before searching for the object
definition to save compile time. As an added bonus, this will now
treat casts to constant addrspace as constant.
We also seemed to be missing targeted tests for this, so add a few
missing other cases too.
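A hypothetical example of the newly handled case, a cast into the constant
address space (4 on AMDGPU):

  %c = addrspacecast i32 addrspace(1)* %p to i32 addrspace(4)*
  %v = load i32, i32 addrspace(4)* %c   ; now treated as a load from constant memory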
This is mostly useful if the alloca element type is not an integer
and is then cast to an integer for a load or store. We now can
vectorize an [i32] alloca but cannot do so for [float].
There is also a separate patch needed to properly lower 64-bit
types after they are vectorized. At the moment these are lowered
via scratch anyway.
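The kind of pattern in question (a hypothetical sketch): a non-integer alloca
whose accesses go through a bitcast to an integer pointer:

  %stack = alloca [4 x float], align 4, addrspace(5)
  %p = bitcast [4 x float] addrspace(5)* %stack to i32 addrspace(5)*
  %q = getelementptr i32, i32 addrspace(5)* %p, i32 %idx
  store i32 %val, i32 addrspace(5)* %q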
Differential Revision: https://reviews.llvm.org/D79641
If the SimplifyMultipleUseDemandedBits calls peek through BITCASTs back to the original type then we can remove the BITCASTs entirely.
Differential Revision: https://reviews.llvm.org/D79572
These were testing byval private kernel arguments, which doesn't make
any sense and has never been used. There didn't seem to be any tests
for real by-value struct arguments, which are used.
We do not want to break asm syntax. These suffixes are
quite useful for debugging, so add an option to print
them. Right now it is NFC.
Differential Revision: https://reviews.llvm.org/D79435
Since SRSRC has alignment requirements, first find non GIT pointer clobbered
registers for SRSRC and then if those registers clobber preloaded Scratch Wave
Offset register, copy the Scratch Wave Offset register to a free SGPR.
The AMDGPU target has a convention that defines all VGPRs
(except the initial 32 argument registers) as callee-saved.
This convention is not always efficient: when the callee
requires more registers, it ends up emitting a large number of
spills, even though its caller requires only a few.
This patch revises the ABI by introducing more scratch registers
that a callee can freely use.
The 256 vgpr registers now become:
32 argument registers
112 scratch registers and
112 callee saved registers.
The scratch registers and the CSRs are intermixed at regular
intervals (a split boundary of 8) to obtain a better occupancy.
Reviewers: arsenm, t-tye, rampitec, b-sumner, mjbedy, tpr
Reviewed By: arsenm, t-tye
Differential Revision: https://reviews.llvm.org/D76356
VMEM soft clauses only contain VMEM and FLAT instructions. Teaching
GCNHazardRecognizer::checkSoftClauseHazards that other kinds of
instructions will naturally break the clause means there are far fewer
cases where it has to insert an s_nop instruction to forcibly break the
clause.
Differential Revision: https://reviews.llvm.org/D79353
Marking a section as ALLOC tells the ELF loader to load the section into memory.
As we do not want to load the notes into VRAM, the flag should not be there.
On AMDHSA, .note is still marked as ALLOC, apparently this is currently
needed for OpenCL (see https://reviews.llvm.org/D74995).
Differential Revision: https://reviews.llvm.org/D76278
This is a hack to fix illegal 32-bit to 16-bit copies.
The problem is that when we make 16-bit subregs legal it creates
a huge number of failures which could only be resolved all at once
without a temporary hack like this.
The next step is to change operands, instruction definitions
and patterns until this hack is not needed.
Differential Revision: https://reviews.llvm.org/D79119
Summary: This change enables all kinds of carry-out ISD opcodes to be selected according to the node divergence.
Reviewers: rampitec, arsenm, vpykhtin
Reviewed By: rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78091
The two code paths have the same goal, legalizing a load of a non-byte-sized vector by loading the "flattened" representation in memory, slicing off each single element and then building a vector out of those pieces.
The technique employed by `ExpandLoad` is slightly more convoluted and produces slightly better codegen on ARM, AMDGPU and x86 but suffers from some bugs (D78480) and is wrong for BE machines.
Differential Revision: https://reviews.llvm.org/D79096
VMEM loads of the same type (sampler vs no sampler) are guaranteed to
write their result registers in order, so there is no need for an
s_waitcnt even if they write to overlapping vgprs.
Differential Revision: https://reviews.llvm.org/D79176
This allows the pass not to crash and to analyze 16-bit subregs if those
appear in the instructions. At the same time it does not attempt
to reassign them. It can still correctly identify register
banks to let larger registers be reassigned.
More work will be needed here when real instructions use
these registers, along with more tests.
Differential Revision: https://reviews.llvm.org/D78772
These are used in SReg_32, and when we start to use SGPR_LO16
there will be complaints that not all registers in the RC support
all subreg indexes. For now it is NFC.
Unused regunits are reserved so that verifier does not complain
about missing phys reg live-ins.
Differential Revision: https://reviews.llvm.org/D78591
This is to fix performance regressions introduced by
86c944d790.
The old search would collect all potentially mergeable instructions in
the entire block. In this case, the same address is written in
multiple places in the block on the other side of a fence. When sorted
by offset, the two unmergeable, identical addresses would be next to
each other and the merge would give up.
Break the search space when we encounter an instruction we won't be
able to merge across. This will keep the identical addresses in
different merge attempts.
This may also improve compile time by reducing the merge list size.
Summary:
Frontend guarantees that coherent accesses have
corresponding cache policy bits set (glc, dlc).
Therefore there is no need for extra instructions
that invalidate cache.
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78800
If f32 denormals were enabled pre-gfx9, we would still try to
implement this with v_max_f32. Pre-gfx9, these instructions ignored
the denormal mode and did not flush. Switch to the multiply form for
f32 as a workaround which should always work in any case.
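For reference (a sketch assuming the standard canonicalize intrinsic is the
input being lowered):

  %c = call float @llvm.canonicalize.f32(float %x)
  ; lowered as a multiply by 1.0 rather than v_max_f32 %x, %x, since the max
  ; form ignores the denormal mode pre-gfx9 and does not flush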
This fixes conformance failures when the library implementation of
fmin/fmax were accidentally not inlined, forcing the assumption of no
flushing on targets where denormals are not enabled by default. This
is a workaround, since really we should not be mixing code with
different FP mode expectations, but prefer the lowering that will work
in any mode.
Now this will always use max to implement canonicalize on gfx9+. This
is only really beneficial for f64. For f32/f16 it's a neutral choice
(and worse in terms of code size in 1 case), but possibly worse for
the compiler since it does add an extra register use operand. Leave
this change for later.
12994a70cf did this for 128-bit classes:
SGPR_128 only includes the real allocatable SGPRs, and SReg_128 adds
the additional non-allocatable TTMP registers. There's no point in
allocating SReg_128 vregs. This shrinks the size of the classes
regalloc needs to consider, which is usually good.
This patch extends it to all classes > 64 bits, for consistency.
Differential Revision: https://reviews.llvm.org/D78622
These are needed as a counterpart for VGPR subregs even though
there are no scalar instructions which can operate on 16-bit values.
When we materialize a constant it is done into an SGPR,
and that SGPR may/will be copied into a 16-bit VGPR subreg. Such
a copy is illegal. There are also similar problems if a source
operand of a 16-bit VALU instruction is an SGPR. In addition
we need to get a register with a lo16 subregister of an SGPR
RC during selection, and this fails as well.
All of that makes me believe we need these subregisters as
syntactic glue.
Differential Revision: https://reviews.llvm.org/D78250
Summary:
The INLINEASM MIR instructions use immediate operands to encode the values of some operands.
The MachineInstr pretty printer function already handles those operands and prints human-readable annotations instead of the immediates. This patch adds similar annotations to the output of the MIRPrinter, however it uses the new MIROperandComment feature.
Reviewers: SjoerdMeijer, arsenm, efriedma
Reviewed By: arsenm
Subscribers: qcolombet, sdardis, jvesely, wdng, nhaehnle, hiraditya, jrtc27, atanasyan, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78088
We generally only combine starting from users to defs in the artifact combiner,
but this doesn't catch cases where, at the point of combining a G_UNMERGE, we don't
yet have the opposite G_MERGE on the input since we haven't legalized that far.
This change adds the users of a G_MERGE to the artifact combiner worklist if one
of the uses is a G_UNMERGE or G_TRUNC.
Differential Revision: https://reviews.llvm.org/D77931
Summary:
The combine for unmerge(cast(merge)) is only valid for vectors, but was
missing a corresponding check. Add a check that the operands are vectors
to avoid an invalid combine.
Without this check, the combiner would emit incorrect code for scalars
and pointers because the artifact cast (trunc/ext) only affects bits at
the end of the type, while this combine assumes that the casted bits
appear between meaningful bits.
This also uncovered a segmentation fault in the AMDGPU
InstructionSelector. The tests triggering this bug have been moved to
their own file and a check for the segmentation fault has been added.
Reviewers: arsenm, dsanders, aemerson, paquette, aditya_nandakumar
Reviewed By: arsenm
Subscribers: tpr, jvesely, wdng, nhaehnle, rovka, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78191
This was hitting the default instruction constraint code which uses
the register classes in the instruction def, which REG_SEQUENCE does
not have.
Fixes not constraining the register class for AMDGPU fneg/fabs
patterns, which would fail when the use was another generic,
unconstrained instruction.
Another oddity I noticed is that the temporary registers are created
with an unnecessary (and incorrect) 16-bit LLT, but this shouldn't
matter.
I'm also still unclear why root and sub-instructions have to be
handled differently.
Previously, getWithOffset() would drop the offset if the base was null.
Because of this, MachineMemOperand would return the wrong result from
getAlign() in these cases. MachineMemOperand stores the alignment of
the pointer without the offset.
A bunch of MIR tests changed because we print the offset now.
Split off from D77687.
Differential Revision: https://reviews.llvm.org/D78049
Ports the existing DAG combines, minus the simplify demanded bits
which seems to have no equivalent now. Without these, this isn't
particularly helpful in most of the IR sample cases.
This is equivalent in terms of LLVM IR semantics, but we want to
transition away from using MaybeAlign to represent the alignment of
these instructions.
Differential Revision: https://reviews.llvm.org/D77984
Since 1725f28841, this should check
isFMADLegalForFAddFSub rather than the plain isOperationLegal.
This would assert in a subset of cases due to an oddity in how FMAD is
selected. We will allow FMA formation pre-legalize, but not FMAD even
in cases where it would be valid.
The current hook requires passing in the root fadd/fsub. However, in
this distributed case, this would be far more complicated to pass in
the relevant operand. AMDGPU doesn't get any value from the node,
only needs the type, and is the only implementor, so I'm not sure why
we have this complexity. Just rename and expand the assert to avoid
the more complicated checks spread through the distribution logic.