R600 relies on this behaviour.
Fixes: 6e18266aa4 ('Partially revert D61491 "AMDGPU: Be explicit about whether the high-word in SI_PC_ADD_REL_OFFSET is 0"')
Fixes ~100 piglit regressions since 6e18266
Differential Revision: https://reviews.llvm.org/D72991
The pattern is also mishandled by the generated matcher, so workaround
this as in the DAG path.
The existing DAG tests aren't particularly targeted to just this one
intrinsic. These also end up differing in scheduling from SGPR->VGPR
operand constraint copies.
The waterfall utility function blindly inserts a phi for every def in
the loop. We don't need this one to be preserved for every
iteration. Saves an extra phi and copy inside the loop body.
The result and source vector are going to be tied, so these need to be
the same bank.
The inserted value also needs to be broken down based on the result
bank, not the inserted value itself.
Handle dynamic vector extracts that use an index that's an add of a
constant offset into moving the base subregister of the indexing
operation.
Force the add into the loop in regbankselect, which will be recognized
when selected.
DAGCombiner does this, but divisions expanded here miss this
optimization. Since 67aa18f165,
divisions have been expanded here and missed out on this
optimization. Avoids test regressions in a future patch.
This using the wrong result register, and dropping the result entirely
for v2f16. This would fail to select on the scalar case. I believe it
was also mishandling packed/unpacked subtargets.
Summary:
Check that a s_cbranch_execz is not a loop exit before removing it.
As the pass is generating infinite loops.
Reviewers: cdevadas, arsenm, nhaehnle
Reviewed By: nhaehnle
Subscribers: kzhuravl, jvesely, wdng, yaxunl, tpr, t-tye, hiraditya, kerbowa, llvm-commits, dstuttard, foad
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72997
The current implementation of skip insertion (SIInsertSkip) makes it a
mandatory pass required for correctness. Initially, the idea was to
have an optional pass. This patch inserts the s_cbranch_execz upfront
during SILowerControlFlow to skip over the sections of code when no
lanes are active. Later, SIRemoveShortExecBranches removes the skips
for short branches, unless there is a sideeffect and the skip branch is
really necessary.
This new pass will replace the handling of skip insertion in the
existing SIInsertSkip Pass.
Differential revision: https://reviews.llvm.org/D68092
Pointers of unrecognized address spaces shoudl be treated as
global-like pointers. Even if loads and stores of them aren't handled,
dumb operations that just operate on the bits should work.
This was unconditionally folding this to the source operand, even if the access was out of bounds. Use undef instead of the extract is not the first element.
This helps with some cases where 3-vectors are legalized and avoids processing the 4th component.
Original Patch by: arsenm (Matt Arsenault)
Differential Revision: https://reviews.llvm.org/D51589
There's no reason to introduce a new, unnaturally sized value
here. This has a chance to produce worse code with
legalization. Avoids regression in a future patch.
Currently the lowering for i16 image coordinates asserts on gfx10. I'm
somewhat confused by this though. The feature is missing from the
gfx10 feature lists, but the a16 bit appears to be present in the
manual for MIMG instructions.
This was assuming the narrow target was the source type. Respect the
requested type when these don't match by using intermediate
merges. This avoids producing very wide, illegal shift expansions.
This was dropping the invariant metadata on dead argument loads, so
they weren't deleted.
Atomics still need to be fixed the same way. Also, apparently store
was never preserving dereferencable which should also be fixed.
Summary:
This patch could be treated as a rebase of D33960. It also fixes PR35547.
A fix for `llvm/test/Other/close-stderr.ll` is proposed in D68164. Seems
the consensus is that the test is passing by chance and I'm not
sure how important it is for us. So it is removed like in D33960 for now.
The rest of the test fixes are just adding `--crash` flag to `not` tool.
** The reason it fixes PR35547 is
`exit` does cleanup including calling class destructor whereas `abort`
does not do any cleanup. In multithreading environment such as ThinLTO or JIT,
threads may share states which mostly are ManagedStatic<>. If faulting thread
tearing down a class when another thread is using it, there are chances of
memory corruption. This is bad 1. It will stop error reporting like pretty
stack printer; 2. The memory corruption is distracting and nondeterministic in
terms of error message, and corruption type (depending one the timing, it
could be double free, heap free after use, etc.).
Reviewers: rnk, chandlerc, zturner, sepavloff, MaskRay, espindola
Reviewed By: rnk, MaskRay
Subscribers: wuzish, jholewinski, qcolombet, dschuff, jyknight, emaste, sdardis, nemanjai, jvesely, nhaehnle, sbc100, arichardson, jgravelle-google, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, lenary, s.egerton, pzheng, cfe-commits, MaskRay, filcab, davide, MatzeB, mehdi_amini, hiraditya, steven_wu, dexonsmith, rupprecht, seiya, llvm-commits
Tags: #llvm, #clang
Differential Revision: https://reviews.llvm.org/D67847
When tail duplication estimates a size of tail it uses instruction
count. Account for a number of instrictions in a bundle too.
Differential Revision: https://reviews.llvm.org/D72783
This does produce slightly different code. Now a unique IMPLICIT_DEF
is emitted for each of the implicit_def operands, rather than reusing
the same one.
This now develops the same problem G_ZEXT/G_ANYEXT have where the
requested type is assumed to be the source type. This will be fixed
separately by creating intermediate merges.
Bitcast only really applies between scalars and vectors. Implement as
an unmerge and remerge. The test needs to tolerate failure since one
of the unmerges currently fails to legalize.
The current implementation of skip insertion (SIInsertSkip) makes it a
mandatory pass required for correctness. Initially, the idea was to
have an optional pass. This patch inserts the s_cbranch_execz upfront
during SILowerControlFlow to skip over the sections of code when no
lanes are active. Later, SIRemoveShortExecBranches removes the skips
for short branches, unless there is a sideeffect and the skip branch is
really necessary.
This new pass will replace the handling of skip insertion in the
existing SIInsertSkip Pass.
Differential revision: https://reviews.llvm.org/D68092
Summary:
- `dead-mi-elimination` assumes MIR in the SSA form and cannot be
arranged after phi elimination or DeSSA. It's enhanced to handle the
dead register definition by skipping use check on it. Once a register
def is `dead`, all its uses, if any, should be `undef`.
- Re-arrange the DIE in RA phase for AMDGPU by placing it directly after
`detect-dead-lanes`.
- Many relevant tests are refined due to different register assignment.
Reviewers: rampitec, qcolombet, sunfish
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72709
Summary:
Mem op clustering adds a weak edge in the DAG between two loads or
stores that should be clustered, but the direction of this edge is
pretty arbitrary (it depends on the sort order of MemOpInfo, which
represents the operands of a load or store). This often means that two
loads or stores will get reordered even if they would naturally have
been scheduled together anyway, which leads to test case churn and goes
against the scheduler's "do no harm" philosophy.
The fix makes sure that the direction of the edge always matches the
original code order of the instructions.
Reviewers: atrick, MatzeB, arsenm, rampitec, t.p.northover
Subscribers: jvesely, wdng, nhaehnle, kristof.beyls, hiraditya, javed.absar, arphaman, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72706
This change allows to model the height of the instruction
within a bundle for latency adjustment purposes.
Differential Revision: https://reviews.llvm.org/D72669
We do not have InstrItinerary so generic getInstLatency() was always
defaulting to return 1 cycle. We need to use TargetSchedModel instead
to compute an instruction's latency.
Differential Revision: https://reviews.llvm.org/D72655
The branch target needs to be changed depending on whether there is an
unconditional branch or not.
Loops also need to be similarly fixed, but compiling a simple testcase
end to end requires another set of patches that aren't upstream yet.
As detailed in https://blog.regehr.org/archives/1709 we don't make use of the known leading/trailing zeros for shifted values in cases where we don't know the shift amount value.
This patch adds support to SelectionDAG::ComputeKnownBits to use KnownBits::countMinTrailingZeros and countMinLeadingZeros to set the minimum guaranteed leading/trailing known zero bits.
Differential Revision: https://reviews.llvm.org/D72573
Summary:
- SI Whole Quad Mode phase is replacing WQM pseudo instructions with v_mov instructions.
While this is necessary for the special handling of moving results out of WWM live ranges,
it is not necessary for WQM live ranges. The result is a v_mov from a register to itself after every
WQM operation. This change uses a COPY psuedo in these cases, which allows the register
allocator to coalesce the moves away.
Reviewers: tpr, dstuttard, foad, nhaehnle
Reviewed By: nhaehnle
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71386
If an SGPR vector is indexed with a VGPR, the actual indexing will be
done on the SGPR and produce an SGPR. A copy needs to be inserted
inside the waterwall loop to the VGPR result.
The current implementation assumes there is an instruction associated
with the transform, but this is not the case for
timm/TargetConstant/immarg values. These transforms should directly
operate on a specific MachineOperand in the source
instruction. TableGen would assert if you attempted to define an
equivalent GISDNodeXFormEquiv using timm when it failed to find the
instruction matcher.
Specially recognize SDNodeXForms on timm, and pass the operand index
to the render function.
Ideally this would be a separate render function type that looks like
void renderFoo(MachineInstrBuilder, const MachineOperand&), but this
proved to be somewhat mechanically painful. Add an optional operand
index which will only be passed if the transform should only look at
the one source operand.
Theoretically it would also be possible to only ever pass the
MachineOperand, and the existing renderers would check the parent. I
think that would be somewhat ugly for the standard usage which may
want to inspect other operands, and I also think MachineOperand should
eventually not carry a pointer to the parent instruction.
Use it in one sample pattern. This isn't a great example, since the
transform exists to satisfy DAG type constraints. This could also be
avoided by just changing the MachineInstr's arbitrary choice of
operand type from i16 to i32. Other patterns have nontrivial uses, but
this serves as the simplest example.
One flaw this still has is if you try to use an SDNodeXForm defined
for imm, but the source pattern uses timm, you still see the "Failed
to lookup instruction" assert. However, there is now a way to avoid
it.
When these arguments are broken down by the EVT based callbacks, the
pointer information is lost. Hack around this by coercing the register
types to be the expected pointer element type when building the
remerge operations.
This should be legal, but will require future selection work. 16-bit
shift amounts were already removed from being legal, but this didn't
adjust the transformation rules.
Summary:
Added MIRFormatter for target specific MIR formating and parsing with
immediate and custom pseudo source values. Target machine can subclass
MIRFormatter and implement custom logic for printing and parsing
immediate and custom pseudo source values for better readability.
* Target specific immediate mnemonic need to start with "." follows by
identifier string. When MIR parser sees immediate it will call target
specific parsing function.
* Custom pseudo source value need to start with custom follows by
double-quoted string. MIR parser will pass the quoted string to target
specific PSV parsing function.
* MIRFormatter have 2 helper functions to facilitate LLVM value printing
and parsing for custom PSV if they refers LLVM values.
Patch by Peng Guo
Reviewers: dsanders, arsenm
Reviewed By: dsanders
Subscribers: wdng, jvesely, nhaehnle, hiraditya, jfb, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69836
Summary:
Added MIRFormatter for target specific MIR formating and parsing with
immediate and custom pseudo source values. Target machine can subclass
MIRFormatter and implement custom logic for printing and parsing
immediate and custom pseudo source values for better readability.
* Target specific immediate mnemonic need to start with "." follows by
identifier string. When MIR parser sees immediate it will call target
specific parsing function.
* Custom pseudo source value need to start with custom follows by
double-quoted string. MIR parser will pass the quoted string to target
specific PSV parsing function.
* MIRFormatter have 2 helper functions to facilitate LLVM value printing
and parsing for custom PSV if they refers LLVM values.
Reviewers: dsanders, arsenm
Reviewed By: dsanders
Subscribers: wdng, jvesely, nhaehnle, hiraditya, jfb, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69836
Don't overwrite existing target-cpu attributes.
I've often found the replacement behavior annoying, and this is
inconsistent with how the fast math command line flags interact with
the function attributes.
Does not yet change target-features, since I think that should behave
as a concatenation.
This solves selection failures with generated selection patterns,
which would fail due to inferring the SGPR reg bank for virtual
registers with a set register class instead of VCC bank. Use
instruction selection would constrain the virtual register to a
specific class, so when the def was selected later the bank no longer
was set to VCC.
Remove the SCC reg bank. SCC isn't directly addressable, so it
requires copying from SCC to an allocatable 32-bit register during
selection, so these might as well be treated as 32-bit SGPR values.
Now any scalar boolean value that will produce an outupt in SCC should
be widened during RegBankSelect to s32. Any s1 value should be a
vector boolean during selection. This makes the vcc register bank
unambiguous with a normal SGPR during selection.
Summary of how this should now work:
- G_TRUNC is always a no-op, and never should use a vcc bank result.
- SALU boolean operations should be promoted to s32 in RegBankSelect
apply mapping
- An s1 value means vcc bank at selection. The exception is for
legalization artifacts that use s1, which are never VCC. All other
contexts should infer the VCC register classes for s1 typed
registers. The LLT for the register is now needed to infer the
correct register class. Extensions with vcc sources should be
legalized to a select of constants during RegBankSelect.
- Copy from non-vcc to vcc ensures high bits of the input value are
cleared during selection.
- SALU boolean inputs should ensure the inputs are 0/1. This includes
select, conditional branches, and carry-ins.
There are a few somewhat dirty details. One is that G_TRUNC/G_*EXT
selection ignores the usual register-bank from register class
functions, and can't handle truncates with VCC result banks. I think
this is OK, since the artifacts are specially treated anyway. This
does require some care to avoid producing cases with vcc. There will
also be no 100% reliable way to verify this rule is followed in
selection in case of register classes, and violations manifests
themselves as invalid copy instructions much later.
Standard phi handling also only considers the bank of the result
register, and doesn't insert copies to make the source banks
match. This doesn't work for vcc, so we have to manually correct phi
inputs in this case. We should add a verifier check to make sure there
are no phis with mixed vcc and non-vcc register bank inputs.
There's also some duplication with the LegalizerHelper, and some code
which should live in the helper. I don't see a good way to share
special knowledge about what types to use for intermediate operations
depending on the bank for example. Using the helper to replace
extensions with selects also seems somewhat awkward to me.
Another issue is there are some contexts calling
getRegBankFromRegClass that apparently don't have the LLT type for the
register, but I haven't yet run into a real issue from this.
This also introduces new unnecessary instructions in most cases, since
we don't yet try to optimize out the zext when the source is known to
come from a compare.
This would complain about invalid legalizer rules otherwise.
Mark some operations as unsupported for AMDGPU. This currently seems
to produce the same legalize error as when no rules are defined, but
eventually this should produce a proper user facing error.
The existing test only covered one case for r600. The use of
mul_legacy also looks suspicious to me, but leave it for now. The
patterns are also not making use of source modifiers.