This assumed a 32-bit extract size, which would produce invalid copies
with 64-bit extracts. Handle the easy case. Ideally we would have a
way to get the proper subreg index for any 32-bit offset, but there
should probably be a TableGen-generated way of getting the subreg index
for any size and offset.
This produces more intelligible looking results, more comparable to
the DAG output in the simplest cases. This is probably wrong in
complex control flow, but RegBankSelect doesn't attempt analyzing if
this is on a masked path for selecting the bank yet.
This was increasing the number of instructions when fsub was legalized
on AMDGPU with no signed zeros enabled. This fold should be guarded by
hasOneUse, and I don't think getNode should be doing that. The same
fold is already done as a regular combine through isNegatibleForFree.
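The no-signed-zeros requirement on this fold can be seen with ordinary IEEE arithmetic. A minimal sketch, assuming default (non fast-math) floating-point compilation; the helper names are hypothetical and are not LLVM APIs:

```cpp
#include <cassert>
#include <cmath>

// fneg (fsub a, b), evaluated literally.
double negOfSub(double a, double b) { return -(a - b); }

// The folded form: fsub b, a.
double swappedSub(double a, double b) { return b - a; }

// True when both forms agree in value and in the sign of a zero result.
bool foldIsExact(double a, double b) {
  double x = negOfSub(a, b);
  double y = swappedSub(a, b);
  return x == y && std::signbit(x) == std::signbit(y);
}
```

For a == b the literal form yields -0.0 while the folded form yields +0.0, which is why the fold is only valid when signed zeros can be ignored. The sketch says nothing about instruction count; that part depends on whether the fsub has other users.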
This does require duplicating, even though isNegatibleForFree does
this combine already (and properly checks hasOneUse) to avoid one PPC
regression. In the regression, the outer fneg has nsz but the fsub
operand does not. isNegatibleForFree only sees the operand, and
doesn't see it's used from a nsz context. A nsz parameter needs to be
added and threaded through isNegatibleForFree to avoid this.
G_BITREVERSE is generated from llvm.bitreverse.<type> intrinsics;
clang generates these intrinsics from __builtin_bitreverse32 and
__builtin_bitreverse64.
Add lower and narrowscalar for G_BITREVERSE.
Lower G_BITREVERSE on MIPS32.
Recommit notes:
Introduce temporary variables in order to make sure
instructions get inserted into MachineFunction in same order
regardless of compiler used to build llvm.
Differential Revision: https://reviews.llvm.org/D71363
Sometimes the result bank of the phi is already assigned to something,
and should not be ignored. This is in preparation for additional
boolean phi handling changes.
Also refine the logic to fix some cases that were incorrectly deciding
to use SGPRs.
G_BITREVERSE is generated from llvm.bitreverse.<type> intrinsics;
clang generates these intrinsics from __builtin_bitreverse32 and
__builtin_bitreverse64.
Add lower and narrowscalar for G_BITREVERSE.
Lower G_BITREVERSE on MIPS32.
Differential Revision: https://reviews.llvm.org/D71363
The early tail duplicator pass introduces new ones, so a MIR test that
infers no phis since there were none on the input would fail the
verifier after running.
There ended up being two result registers, which would fail on
select. It was really defining a new temp register in the correct def
position, instead of the correct result register.
Summary:
The only useful information the UndefValue conveys is the address space,
which MachinePointerInfo can represent directly without referring to an
IR value.
Reviewers: arsenm, rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, Petar.Avramovic, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71838
Summary:
Without this check unnecessary FMA instructions are generated when the FSUB terms are reused.
This also has the side-effect that the same value is computed to different levels of precision, which can create undesirable effects if the results are used together in subsequent computation.
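The precision difference mentioned above can be reproduced in isolation. A minimal sketch, assuming IEEE single precision with round-to-nearest-even; the function names are hypothetical and the operand values are chosen only so that the product falls exactly on a rounding tie:

```cpp
#include <cassert>
#include <cmath>

// a*a - c with the product rounded to float before the subtraction.
float unfusedMulSub(float a, float c) {
  volatile float product = a * a; // volatile keeps the rounded intermediate
  return product - c;
}

// a*a - c as one fused operation: a single rounding at the end.
float fusedMulSub(float a, float c) { return std::fma(a, a, -c); }
```

With a = 1 + 2^-12 and c = 1 + 2^-11, the exact product is 1 + 2^-11 + 2^-24. The unfused form rounds that product to c and returns 0, while the fused form returns 2^-24, so mixing both forms of the same expression gives inconsistent results.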
Reviewers: arsenm, nhaehnle, foad, tpr, dstuttard, spatel
Reviewed By: arsenm
Subscribers: jvesely, wdng, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71656
Confusingly, the intrinsic operands do not match the
instruction/custom node. The order is shuffled, and the 3rd operand is
an immediate to select operands.
I'm not 100% sure I did this right, but fdiv still doesn't select end
to end and it will be easier to tell when it does. This at least
avoids an assertion in RegBankSelect and allows hitting the fallback
on selection.
Summary:
The typo has been present since memOpsHaveSameBasePtr was introduced in
r313208.
It caused SIInstrInfo::shouldClusterMemOps to cluster more mem ops than
it was supposed to.
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71616
Summary:
At present, the code calculating known bits of AMDGPU MUL_I24 confuses the concepts of "non-negative number" and "positive number".
In some situations, it results in incorrect code. I have a case where the optimizer replaces the result of calculating MUL_I24(-5, 0) with -8.
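The distinction matters because a non-negative operand may be zero. A minimal sketch of a signed 24-bit multiply and of a sound sign rule; all names here are hypothetical, and this is a model of the arithmetic rather than the actual known-bits code:

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend the low 24 bits of x to a 32-bit signed value.
int32_t signExtend24(uint32_t x) {
  x &= 0xFFFFFFu;
  return static_cast<int32_t>(x ^ 0x800000u) - 0x800000;
}

// Model of a signed 24-bit multiply that keeps the low 32 bits.
int32_t mulI24(int32_t a, int32_t b) {
  int64_t p = static_cast<int64_t>(signExtend24(static_cast<uint32_t>(a))) *
              signExtend24(static_cast<uint32_t>(b));
  return static_cast<int32_t>(static_cast<uint32_t>(p));
}

// The product is known negative only when one operand is negative and the
// other is strictly positive, i.e. non-negative AND known non-zero.
bool productKnownNegative(bool lhsNegative, bool rhsNonNegative,
                          bool rhsKnownNonZero) {
  return lhsNegative && rhsNonNegative && rhsKnownNonZero;
}
```

MUL_I24(-5, 0) is 0, so concluding "negative" from "negative times non-negative" would set sign bits that are not actually set.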
Reviewers: foad, arsenm
Reviewed By: arsenm
Subscribers: foad, arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Patch by Eugene Kuznetsov.
Differential Revision: https://reviews.llvm.org/D70367
Legalization algorithm is complicated by two facts:
1) While regular instructions should be possible to legalize in
an isolated, per-instruction, context-free manner, legalization
artifacts can only be eliminated in pairs, which could be deeply, and
ultimately arbitrarily, nested: { [ () ] }, where each parenthesis kind
depicts an artifact kind, like extend, unmerge, etc. Such structure
can only be fully eliminated by simple local combines if they are
attempted in a particular order (inside out), or alternatively by
repeated scans each eliminating only one innermost pair, resulting in
O(n^2) complexity.
2) Some artifacts might in fact be regular instructions that could (and
sometimes should) be legalized by the target-specific rules. Which
means failure to eliminate all artifacts on the first iteration is
not a failure, they need to be tried as instructions, which may
produce more artifacts, including the ones that are in fact regular
instructions, resulting in a non-constant number of iterations
required to finish the process.
I trust the recently introduced termination condition (no new artifacts
were created during as-a-regular-instruction-retrial of artifacts not
eliminated on the previous iteration) to be efficient in providing
termination, but only performing the legalization in full if and only if
at each step such chains of artifacts are successfully eliminated in
full as well.
Which is currently not guaranteed, as the artifact combines are applied
only once and in an arbitrary order that has to do with the order of
creation or insertion of artifacts into their worklist, which is no
particular order.
In this patch I make a small change to the artifact combiner, making it
re-insert into the worklist the immediate (modulo look-through copies)
artifact users of each vreg that changes its definition due to an
artifact combine.
Here the first scan through the artifacts worklist, while not
being done in any guaranteed order, only needs to find the innermost
pair(s) of artifacts that could be immediately combined out. After that
the process follows def-use chains, making them shorter at each step, thus
combining everything that can be combined in O(n) time.
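The complexity argument can be illustrated with a toy model in which artifacts are bracket characters and a combine removes an adjacent matching pair. This is only a sketch under that assumption; the types and function names are hypothetical and do not correspond to the legalizer's real data structures:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct Result {
  std::string remaining; // artifacts that could not be combined
  std::size_t visits;    // number of artifacts examined
};

bool matches(char open, char close) {
  return (open == '(' && close == ')') || (open == '[' && close == ']') ||
         (open == '{' && close == '}');
}

// Repeated scans: each pass removes one innermost pair, then starts over.
Result combineByRescanning(std::string s) {
  std::size_t visits = 0;
  bool changed = true;
  while (changed) {
    changed = false;
    for (std::size_t i = 0; i + 1 < s.size(); ++i) {
      ++visits;
      if (matches(s[i], s[i + 1])) {
        s.erase(i, 2);
        changed = true;
        break;
      }
    }
  }
  return {s, visits};
}

// Following the chain: after a combine, the neighbours that just became
// adjacent are examined next, so every artifact is visited once.
Result combineByFollowingChain(const std::string &s) {
  std::string stack;
  std::size_t visits = 0;
  for (char c : s) {
    ++visits;
    if (!stack.empty() && matches(stack.back(), c))
      stack.pop_back();
    else
      stack.push_back(c);
  }
  return {stack, visits};
}
```

For six nested pairs the rescanning variant examines 21 artifacts and the chain-following variant examines 12; the gap grows quadratically with nesting depth.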
Reviewers: volkan, aditya_nandakumar, qcolombet, paquette, aemerson, dsanders
Reviewed By: aditya_nandakumar, paquette
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71448
This reverts commit 69fcfb7d35.
As shown in the test I attached to this commit, the change I reverted
causes a problem with "zext(cc1) - zext(cc2)". It commuted
the operands to the sub and used different logic to select the addc/subc
instruction:
sub zext (setcc), x => addcarry 0, x, setcc
sub sext (setcc), x => subcarry 0, x, setcc
... but that is bogus. I believe it is not possible to fold those commuted
patterns into any form of addcarry or subcarry. It may have worked as
intended before "AMDGPU: Change boolean content type to 0 or 1" because
the setcc was considered to be -1 rather than 1.
Differential Revision: https://reviews.llvm.org/D70978
Change-Id: If2139421aa6c935cbd1d925af58fe4a4aa9e8f43
Current tail duplication integrated in bb layout is designed to increase the fallthrough from a BB's predecessor to its successor, but we have observed cases where duplication doesn't increase fallthrough, or where it brings too much size overhead.
To overcome these two issues, I add two checks in function canTailDuplicateUnplacedPreds:
1) Make sure there is at least one duplication in the current work set.
2) The number of duplications should not exceed the number of successors.
The modification in hasBetterLayoutPredecessor fixes a bug: a potential predecessor must be at the bottom of a chain.
Differential Revision: https://reviews.llvm.org/D64376
Summary:
Catch the (admittedly unusual) case where SIFoldOperands attempts to fold 2
constant operands into the same SALU operation, with neither operand able to be
encoded as an inline constant.
Change-Id: Ibc48d662c9ffd8bbacd154976b0b1c257ace0927
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70896
Summary:
Pre gfx9 we need to scavenge a 64-bit SGPR to use as the carry out for an Add.
If only one SGPR was available this crashed when trying to scavenge another
32-bit SGPR to materialize the offset.
Instead, reuse a 32-bit SGPR from the carry out as the offset register.
Also prefer to use vcc for the unused carry out when it is available.
Reviewers: arsenm, rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70614
Summary:
I was seeing some failures on a test with slightly different instruction
ordering. Adding in some DAG directives solved the issue.
Change-Id: If5a3d3969055fb19279943bd45161bb70a3dabce
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, tpr, t-tye, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70531
This reapplies: 8ff85ed905
Original commit message:
As a follow-up to my initial mail to llvm-dev here's a first pass at the O1 described there.
This change doesn't include any change to move from selection dag to fast isel
and that will come with other numbers that should help inform that decision.
There also haven't been any real debuggability studies with this pipeline yet,
this is just the initial start done so that people could see it and we could start
tweaking after.
Test updates: Outside of the newpm tests most of the updates are coming from either
optimization passes not run anymore (and without a compelling argument at the moment)
that were largely used for canonicalization in clang.
Original post:
http://lists.llvm.org/pipermail/llvm-dev/2019-April/131494.html
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65410
This reverts commit c9ddb02659.
This change doesn't include any change to move from selection dag to fast isel
and that will come with other numbers that should help inform that decision.
There also haven't been any real debuggability studies with this pipeline yet,
this is just the initial start done so that people could see it and we could start
tweaking after.
Test updates: Outside of the newpm tests most of the updates are coming from either
optimization passes not run anymore (and without a compelling argument at the moment)
that were largely used for canonicalization in clang.
Original post:
http://lists.llvm.org/pipermail/llvm-dev/2019-April/131494.html
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65410
Summary:
The waitcnt pass can overflow the counters when the number of outstanding events
for a type exceeds the capacity of the counter. This can lead to inefficient
insertion of waitcnts, or to waitcnt instructions with max values for each type.
The last situation can cause an instruction which when disassembled appears to
be an illegal waitcnt without an operand.
In these cases we should add a wait for the 'counter maximum' - 1, and update the
waitcnt brackets accordingly.
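One way to read the rule above, as a rough sketch only: the Bracket type, its fields, and both functions are hypothetical stand-ins for the pass's bookkeeping, and the sketch assumes the forced wait is inserted before the event that would overflow the counter.

```cpp
#include <cassert>

// Hypothetical model of one counter: events in (lower, upper] are outstanding.
struct Bracket {
  unsigned lower = 0;
  unsigned upper = 0;
  unsigned pending() const { return upper - lower; }
};

// Record one new event. Returns the wait value that has to be inserted
// first, or -1 when the counter still has room. Assumes counterMax >= 1.
int recordEvent(Bracket &b, unsigned counterMax) {
  int forcedWait = -1;
  if (b.pending() + 1 > counterMax) {
    forcedWait = static_cast<int>(counterMax) - 1;
    // After waiting, at most forcedWait events are still outstanding.
    b.lower = b.upper - static_cast<unsigned>(forcedWait);
  }
  ++b.upper;
  return forcedWait;
}

// Issue `events` events back to back and report how many remain outstanding.
unsigned pendingAfter(unsigned events, unsigned counterMax) {
  Bracket b;
  for (unsigned i = 0; i < events; ++i)
    recordEvent(b, counterMax);
  return b.pending();
}
```

In this model the number of outstanding events never exceeds the counter capacity, so the tracked value always stays encodable.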
Reviewers: rampitec, arsenm
Reviewed By: rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70418
Summary:
Add a function attribute to allow the target specific default loop unroll threshold
to be specified on a per-function basis. This allows a front-end to give guidance
where it has insight that is not available to the back-end, while still allowing the
target specific heuristics to also have an effect.
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68873
Hostcall is a service that allows a kernel to submit requests to the
host using shared buffers, and block until a response is
received. This will eventually replace the shared buffer currently
used for printf, and repurposes the same hidden kernel argument. This
change introduces a new ValueKind in the HSA metadata to represent the
hostcall buffer.
Differential Revision: https://reviews.llvm.org/D70038
Start moving towards treating this as a property of the calling
convention, and not the subtarget. The default denormal mode should
not be part of the subtarget, and be moved into a separate function
attribute.
This patch is still NFC. The denormal mode remains as a subtarget
feature for now, but make the necessary changes to switch to using an
attribute.
Summary:
Most IR instructions got better code size estimations after commit 47a5c36b,
so the default parameter values should be updated to improve inlining and
unrolling for the target.
Reviewers: rampitec, arsenm
Reviewed By: rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, zzheng, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70391
The usage of target boolean checks is overly inflexible, since sext
and zext of a compare are equally cheap. The choice is arbitrary, but
using 0/1 to some degree is the path of least resistance since
that's what most targets use. This enables a few combines that don't
bother to support ZeroOrNegativeOneBooleanContent.
Previously this would default to 256, not the maximum supported size
of 1024. Using a maximum lower than the hardware maximum requires
language runtimes to enforce this limit for correctness, which no
language has correctly done. Switch the default to the conservatively
correct maximum, and force frontends to opt-in to the more optimal 256
default maximum.
I don't really understand why the changes in occupancy-levels.ll
increased the computed occupancy, which I expected to decrease. I'm
not sure if these tests should be forcing the old maximum.
In MachineCopyPropagation, when propagating the source of a copy into
the operand of a later instruction, bail if a destination overlaps
(partly defines) the copy source. If the instruction where the
substitution is happening is also a copy, allowing the propagation
confuses the tracking mechanism.
Differential Revision: https://reviews.llvm.org/D69953
Change-Id: Ic570754f878f2d91a4a50a9bdcf96fbaa240726d
The combine G_UNMERGE_VALUES with G_CONCAT_VECTORS used to only be performed
when the result type of the G_UNMERGE_VALUES was a vector type.
In other words, we were expecting that the G_UNMERGE_VALUES was effectively
the exact opposite of the G_CONCAT_VECTORS.
Lift that constraint by allowing any G_UNMERGE_VALUES to be combined
with any G_CONCAT_VECTORS (as long as the size of the different pieces
that we merge/unmerge match).
Differential Revision: https://reviews.llvm.org/D69288
Summary:
G_GEP is rather poorly named. It's a simple pointer+scalar addition and
doesn't support any of the complexities of getelementptr. I therefore
propose that we rename it. There's a G_PTR_MASK so let's follow that
convention and go with G_PTR_ADD
Reviewers: volkan, aditya_nandakumar, bogner, rovka, arsenm
Subscribers: sdardis, jvesely, wdng, nhaehnle, hiraditya, jrtc27, atanasyan, arphaman, Petar.Avramovic, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69734
readlane and writelane instructions are not allowed to use m0 as the
data operand, so spilling them is tricky and would require an
intermediate SGPR to spill it. Constrain the virtual register class in
this case to disallow the inline spiller from folding the m0 operand
directly into the spill instruction.
I copied this hack from AArch64 which has the same problem for $sp.
Summary:
VCCZBugHandledSet was used to make sure we don't apply the same
workaround more than once to a single cbranch instruction, but it's not
necessary because the workaround involves inserting an s_waitcnt
instruction, which is enough for subsequent iterations to detect that no
further workaround is necessary.
Also beef up the test case to check that the workaround was only applied
once. I have also manually verified that the test still passes even if I
hack the big do-while loop in runOnMachineFunction to run a minimum of
five iterations.
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69621
Summary:
This is used on AMDGPU for rounding from v3f64 (which is illegal) to
v3f32 (which is legal).
Subscribers: jvesely, nhaehnle, tpr, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69339
When TableGen is inferring register classes from contexts, it uses a
sorting function based on the number of registers in the class. Since
this was being treated as an alias of VGPR_32, they had exactly the
same size. The sort used wasn't a stable sort, and even if it were, I
believe the tie breaker would effectively end up being the
alphabetical ordering of the class name. There appear to be issues
trying to use an empty set of registers, so add only one so this will
always sort to the end.
Also add a comment explaining how VReg_1 is a dirty hack for
SelectionDAG.
This does end up changing the behavior of i1 with inline asm and VGPR
constraints, but the existing behavior was already nonsensical and
inconsistent. It should probably be disallowed anyway.
Fixes bug 43699
Summary:
An outstanding load with the same destination SGPR as a call could cause the PC
to be updated with a junk value on return.
Reviewers: arsenm, rampitec
Reviewed By: arsenm
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69474
There is a minor flaw in the implementation of function lowerPhis.
This function replaces values of regclass Vreg_1 (boolean values)
involved in PHIs into an SGPR. Currently it iterates over the MBBs
and performs an inplace lowering of PHIs and fails to lower any
incoming value that itself is another PHI of Vreg_1 regclass.
The failure occurs only when the MBB where the incoming PHI value
belongs is not visited/lowered yet.
To fix this problem, collect all Vreg_1 PHIs upfront and then
perform the lowering.
Differential Revision: https://reviews.llvm.org/D69182
This makes the DAG behavior consistent with IR's extractelement after:
rGb32e4664a715
https://bugs.llvm.org/show_bug.cgi?id=42689
I've tried to maintain test intent for WebAssembly.
The AMDGPU test is trying to test for crashing or other bad behavior,
but I'm not sure if that's possible after this change.
That used to fail in the last testcase function because after
%0:sreg_64.sub0 was folded into %3:sreg_32_xm0_xexec COPY, it
was further folded into S_STORE_DWORD_IMM. Its legal effective
subreg class is SReg_32 while instruction expects more restricted
SReg_32_XM0_EXEC. However, SIInstrInfo::isLegalRegOperand()
passed the legality check and it was caught in the verifier.
Borrowed code from the verifier to check for RC legality.
Differential Revision: https://reviews.llvm.org/D69445
Custom lower this to a target instruction with the merge operands. I
think it might be better to directly select this and emit a
REG_SEQUENCE, but this would be more work since it would require
splitting the tablegen patterns for these cases from the other
atomics.
Summary:
In loadSRsrcFromVGPR, if MBB is the same as Succ, Remainder is not the immediate dominator of Succ.
Reviewers: arsenm
Differential Revision: https://reviews.llvm.org/D69358
The test split-arg-dbg-value.ll has a host-specific path in the
full output captured by update_llc_test_checks.
Fix for test failures introduced in https://reviews.llvm.org/D69402
Tags: #llvm
Summary:
Some tests have been hand edited without removing the
update_llc_test_checks header, some have slightly outdated CHECK lines
which still pass, and some have additional comments which
update_llc_test_checks pushes towards the function body.
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69402
An SUnit can be neither an instruction nor an SDNode. Both are
null if it represents a nop. Fixed a crash on using SU->getInstr().
Differential Revision: https://reviews.llvm.org/D69395
Potentially sgpr to sgpr copy should also be possible.
That is however trickier because we may end up with a
wrong register class at use because of xm0/xexec permutations.
Differential Revision: https://reviews.llvm.org/D69280
Only handle simple inter-block redefs of m0 to the same value. This
avoids interference from redefs of m0 in SILoadStoreOptimizer. I was
initially teaching that pass to ignore redefs of m0, but having them
not exist beforehand is much simpler.
This is in preparation for deleting the current special m0 handling in
SIFixSGPRCopies to allow the register coalescer to handle the
difficult cases.
llvm-svn: 375449
r375293 removed the SGPR spilling with scalar stores path, so this is
no longer necessary. This also always had the defect of adding the def
even when this path wasn't in use.
llvm-svn: 375448
If a PHI defines AGPR legalize its operands to AGPR.
At the moment we can get an AGPR PHI with VGPR operands.
I am not aware of any problems as it seems to be handled
gracefully in RA, but this is not right anyway.
It also slightly decreases VGPR pressure in some cases
because we do not have to copy via a VGPR.
Differential Revision: https://reviews.llvm.org/D69206
llvm-svn: 375446
Current implementation of Instruction::mayReadFromMemory()
returns !doesNotAccessMemory() which is !ReadNone. This
does not take into account that the writeonly attribute
also indicates that the call does not read from memory.
The patch changes the predicate to !doesNotReadMemory()
that reflects the intended behavior.
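The difference between the two predicates can be modelled in a few lines. A minimal sketch; the MemAttr enum and the mayReadOld/mayReadNew names are hypothetical stand-ins for the attribute queries, not the real Instruction API:

```cpp
#include <cassert>

// Hypothetical stand-in for the memory attribute on a call.
enum class MemAttr { None, ReadNone, ReadOnly, WriteOnly };

bool doesNotAccessMemory(MemAttr a) { return a == MemAttr::ReadNone; }

// A call that accesses nothing, or that only writes, does not read.
bool doesNotReadMemory(MemAttr a) {
  return doesNotAccessMemory(a) || a == MemAttr::WriteOnly;
}

// Old predicate: treats writeonly calls as possible readers.
bool mayReadOld(MemAttr a) { return !doesNotAccessMemory(a); }

// New predicate: writeonly calls are known not to read.
bool mayReadNew(MemAttr a) { return !doesNotReadMemory(a); }
```

Only the WriteOnly case changes its answer; the other three attribute states give the same result under both predicates.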
Differential Revision: https://reviews.llvm.org/D69086
llvm-svn: 375389
We handle it this way for some other address spaces.
Since r349196, SILoadStoreOptimizer has been trying to do this. This
is after SIFoldOperands runs, which can change the addressing
patterns. It's simpler to just split this earlier.
llvm-svn: 375366
It's not clear why the test had this. I'm unable to break the original
case with the original patch reverted with or without optnone.
This avoids a failure in a future commit.
llvm-svn: 375321
If a subregister def was moved across another subregister def and
another use, the main range was not correctly updated. The end point
of the moved interval ended too early and missed the use from the
other lanes in the subreg def.
llvm-svn: 375300
If all uses of a PHI are in AGPR register class we should
avoid unneeded copies via VGPRs.
Differential Revision: https://reviews.llvm.org/D69200
llvm-svn: 375297
Summary: The implementation was never completed and never used except in tests.
Reviewers: arsenm, mareko
Subscribers: qcolombet, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69163
llvm-svn: 375293
Mostly use SReg_32 instead of SReg_32_XM0 for arbitrary values. This
will allow the register coalescer to do a better job eliminating
copies to m0.
For GlobalISel, as a terrible hack, use SGPR_32 for things that should
use SCC until booleans are solved.
llvm-svn: 375267
Summary:
Even though writelane doesn't have the same constraints as other VALU
instructions, it still can't violate the >1 SGPR operand constraint.
Due to later register propagation (e.g. fixing up vgpr operands via
readfirstlane) changing writelane to only have a single SGPR is tricky.
This implementation puts a new check after SIFixSGPRCopies that prevents
multiple SGPRs being used in any writelane instructions.
The algorithm used is to check for trivial copy prop of suitable constants into
one of the SGPR operands and perform that if possible. If this isn't possible
put an explicit copy of Src1 SGPR into M0 and use that instead (this is
allowable for writelane as the constraint is for SGPR read-port and not
constant-bus access).
Reviewers: rampitec, tpr, arsenm, nhaehnle
Reviewed By: rampitec, arsenm, nhaehnle
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, mgorny, yaxunl, tpr, t-tye, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D51932
Change-Id: Ic7553fa57440f208d4dbc4794fc24345d7e0e9ea
llvm-svn: 375004
Summary:
Extend the SI Load/Store optimizer to merge MIMG load instructions. Handle
different flavours of image_load and image_sample instructions.
When the instructions of the same subclass differ only in dmask, merge
them and update dmask accordingly.
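A rough sketch of the dmask update, assuming that "accordingly" means taking the union of the two channel masks and that only non-overlapping masks are merged. Both points are assumptions for illustration, and the function names are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Number of enabled channels in a 4-bit dmask.
unsigned channelCount(uint32_t dmask) {
  unsigned n = 0;
  for (dmask &= 0xFu; dmask != 0; dmask &= dmask - 1)
    ++n;
  return n;
}

// Assumed precondition for merging: the two loads fetch different channels.
bool dmasksAreDisjoint(uint32_t a, uint32_t b) { return (a & b & 0xFu) == 0; }

// The merged instruction fetches every channel either original fetched.
uint32_t mergedDMask(uint32_t a, uint32_t b) { return (a | b) & 0xFu; }
```

Merging a load with dmask 0x1 and one with dmask 0x6 gives dmask 0x7, and the channel count of the result tells how wide the merged destination has to be.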
Reviewers: nhaehnle
Reviewed By: nhaehnle
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D64911
llvm-svn: 374984
Summary:
Two conditions could lead to infinite loops when processing PHI nodes in
SIFixSGPRCopies.
The first condition involves a REG_SEQUENCE that uses registers defined by both
a PHI and a COPY.
The second condition arises when a physical register is copied to a virtual
register which is then used in a PHI node. If the same virtual register is
copied to the same physical register, the result is an endless loop.
%0:sgpr_64 = COPY $sgpr0_sgpr1
%2 = PHI %0, %bb.0, %1, %bb.1
$sgpr0_sgpr1 = COPY %0
Reviewers: alex-t, rampitec, arsenm
Reviewed By: rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68970
llvm-svn: 374944
We define mov/update dpp intrinsics as overloaded but do not
support i64, which is a practically useful type. Fix the
selection and lowering.
Differential Revision: https://reviews.llvm.org/D68673
llvm-svn: 374910
Previously we would generate literal check lines with no regexes for
vregs, as MI flags (nsw, ninf, etc.) were not recognized as a part of an MI.
Fix that. This includes updating the MIR tests that suffered from the
problem.
Reviewed By: bogner
Differential Revision: https://reviews.llvm.org/D68905
llvm-svn: 374829
This defaults to zero fi operand, but we do not expose it
anyway. Should we expose it later it needs to be added to
the pseudo.
This enables dpp combining on gfx10.
Differential Revision: https://reviews.llvm.org/D68888
llvm-svn: 374604
In GISel we have both G_CONSTANT and G_FCONSTANT, but because
in GISel we don't really have a concept of Float vs Int value
the only difference between the two is where the data originates
from.
What both G_CONSTANT and G_FCONSTANT return is just a bag of bits
with the constant representation in it.
By making getConstantVRegVal() return G_FCONSTANTs bit representation
as well we allow ConstantFold and other things to operate with
G_FCONSTANT.
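The "bag of bits" view can be shown outside of LLVM. A minimal sketch, assuming a 32-bit IEEE float; bitsOf, floatFromBits and foldAdd are hypothetical helpers and are not the getConstantVRegVal implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Bit pattern of a float constant, as an integer.
uint32_t bitsOf(float f) {
  static_assert(sizeof(uint32_t) == sizeof(float), "expects 32-bit float");
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  return bits;
}

// Inverse of bitsOf.
float floatFromBits(uint32_t bits) {
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// An integer fold works on the raw bits and does not care whether an
// operand originated from an integer or a floating-point constant.
uint32_t foldAdd(uint32_t lhsBits, uint32_t rhsBits) {
  return lhsBits + rhsBits;
}
```

Once both kinds of constant are reduced to their bits, a fold such as an integer add of a float constant's pattern and an integer constant is well defined.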
Adding tests that show ConstantFolding to work on mixed G_CONSTANT
and G_FCONSTANT sources.
Differential Revision: https://reviews.llvm.org/D68739
llvm-svn: 374458
SGPR_128 only includes the real allocatable SGPRs, and SReg_128 adds
the additional non-allocatable TTMP registers. There's no point in
allocating SReg_128 vregs. This shrinks the size of the classes
regalloc needs to consider, which is usually good.
llvm-svn: 374284
In a future patch, this will help cleanup m0 handling.
The register coalescer handles copies from a register that
materializes an immediate, but doesn't handle move immediates
itself. The virtual register uses will often be allocated to the same
register, so there ends up being no real copy.
llvm-svn: 374257
This was ignoring the register bank of the input pointer, and
isUniformMMO seems overly aggressive.
This will now conservatively assume a VGPR in cases where the incoming
bank hasn't been determined yet (i.e. is from a loop phi).
llvm-svn: 374255
If original instruction did not have source modifiers they were
not added to the new DPP instruction as well, even if needed.
Differential Revision: https://reviews.llvm.org/D68729
llvm-svn: 374241
There were 2 problems here. First, these patterns were duplicated to
handle the inverted shift operands instead of using the commuted
PatFrags.
Second, the point of the zext folding patterns doesn't apply to the
non-zeroing high subtargets. They should be skipped instead of inserting
the extension. The zeroing high code would be emitted when necessary
anyway. This was also emitting unnecessary zexts in cases where the
high bits were undefined.
llvm-svn: 374092
Summary:
Without offsets on the MachineMemOperands (MMOs),
MachineInstr::mayAlias() will return true for all reads and writes to the
same resource descriptor. This leads to O(N^2) complexity in the MachineScheduler
when analyzing dependencies of buffer loads and stores. It also limits
the SILoadStoreOptimizer from merging more instructions.
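Why the offsets help can be seen from a simple overlap test. A minimal sketch, assuming both accesses use the same resource descriptor; the MemAccess struct and this mayAlias are hypothetical simplifications, not MachineInstr::mayAlias():

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of one access through a given resource descriptor.
struct MemAccess {
  bool hasOffset; // false models a memory operand without offset information
  int64_t offset; // byte offset, only meaningful when hasOffset is true
  int64_t size;   // access size in bytes
};

// Conservative answer for two accesses to the same resource descriptor.
bool mayAlias(const MemAccess &a, const MemAccess &b) {
  if (!a.hasOffset || !b.hasOffset)
    return true; // nothing known: must assume overlap
  // Half-open byte ranges [offset, offset + size) overlap.
  return a.offset < b.offset + b.size && b.offset < a.offset + a.size;
}
```

Without offsets every pair of accesses is a potential dependency; with offsets, accesses to disjoint byte ranges can be treated as independent.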
This patch reduces the compile time of one pathological compute shader
from 12 seconds to 1 second.
Reviewers: arsenm, nhaehnle
Reviewed By: arsenm
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, jfb, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65097
llvm-svn: 374087
Summary:
When getValueInMiddleOfBlock happens to be called for a basic block
that has no incoming value at all, an IMPLICIT_DEF is inserted in that
block via GetValueAtEndOfBlockInternal. This IMPLICIT_DEF must be at
the top of its basic block or it will likely not reach the use that
the caller intends to insert.
Issue: https://github.com/GPUOpen-Drivers/llpc/issues/204
Reviewers: arsenm, rampitec
Subscribers: jvesely, wdng, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68183
llvm-svn: 374040
Start manually writing a table to get the subreg index. TableGen
should probably generate this, but I'm not sure what it looks like in
the arbitrary case where subregisters are allowed to not fully cover
the super-registers.
llvm-svn: 373947
At minimum handle the s64 insert type, which are emitted in real cases
during legalization.
We really need TableGen to emit something like the
inverse of composeSubRegIndices to determine the subreg index to use.
llvm-svn: 373938
Allows targets to introduce regbankselectable
pseudo-instructions. Currently the closest feature to this is an
intrinsic. However this requires creating a public intrinsic
declaration. This litters the public intrinsic namespace with
operations we don't necessarily want to expose to IR producers, and
would rather leave as private to the backend.
Use a new instruction bit. A previous attempt tried to keep using enum
value ranges, but it turned into a mess.
llvm-svn: 373937
The GFX10-DENORM-STRICT checks were only passing by accident. Fix them
to make the test more robust in the face of scheduling or register
allocation changes.
llvm-svn: 373893
Register indexing 64-bit elements is possible on the SALU, but not the
VALU. Handle splitting this into two 32-bit indexes. Extend waterfall
loop handling to allow moving a range of instructions.
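The index split can be pictured on a plain array of 32-bit registers. A minimal sketch, assuming that 64-bit element i occupies 32-bit slots 2*i (low half) and 2*i + 1 (high half); that layout and the function name are assumptions for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Read 64-bit element `index` using two 32-bit indexed accesses.
uint64_t extract64(const std::vector<uint32_t> &regs, std::size_t index) {
  uint32_t lo = regs[2 * index];     // first 32-bit index
  uint32_t hi = regs[2 * index + 1]; // second 32-bit index
  return (static_cast<uint64_t>(hi) << 32) | lo;
}
```

Each 64-bit access becomes two 32-bit accesses whose indexes are derived from the original one, which is the shape needed when only 32-bit register indexing is available.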
llvm-svn: 373638
We can still do a waterfall loop over the index if using a VGPR to
index an SGPR. The result will still be a VGPR, but we can avoid the
wide copy of the source register to a VGPR.
llvm-svn: 373637
When SIFixSGPRCopies attempts to fix an illegal copy from a vector to a
scalar register, it calls moveToVALU(). A copy from an agpr to an sgpr
becomes a copy from agpr to agpr, which may result in an illegal
register class at a use of this copy.
The solution is to always copy it into a vgpr. This may result in a
subsequent copy into an agpr if that is what is really needed, but this
should not happen too often and will likely be folded later.
The opposite situation cannot happen because an sgpr is always
illegal where an agpr is legal, so such user instructions cannot
exist.
Differential Revision: https://reviews.llvm.org/D68358
llvm-svn: 373544
Summary:
Extend the cachepolicy operand in the new VMEM buffer intrinsics
to indicate whether the buffer data is swizzled.
Also, propagate this information to MIR.
Intrinsics updated:
int_amdgcn_raw_buffer_load
int_amdgcn_raw_buffer_load_format
int_amdgcn_raw_buffer_store
int_amdgcn_raw_buffer_store_format
int_amdgcn_raw_tbuffer_load
int_amdgcn_raw_tbuffer_store
int_amdgcn_struct_buffer_load
int_amdgcn_struct_buffer_load_format
int_amdgcn_struct_buffer_store
int_amdgcn_struct_buffer_store_format
int_amdgcn_struct_tbuffer_load
int_amdgcn_struct_tbuffer_store
Furthermore, disable merging of VMEM buffer instructions
in SI Load/Store optimizer, if the "swizzled" bit on the instruction
is on.
The default value of the bit is 0, meaning that data in buffer
is linear and buffer instructions can be merged.
There is no difference in the generated code with this commit.
However, in the future it will be expected that front-ends
use buffer intrinsics with correct "swizzled" bit set.
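A minimal sketch of the merge guard described above. The bit position chosen for the "swizzled" flag, the BufferAccess struct, and canMerge are all assumptions made for illustration and are not taken from the SI Load/Store optimizer.

```cpp
#include <cstdint>

// Assumed position of the "swizzled" bit within the cachepolicy operand.
constexpr uint32_t kSwizzledBit = 1u << 3;

// Hypothetical, simplified description of one buffer access.
struct BufferAccess {
  uint32_t offset;      // byte offset into the buffer
  uint32_t sizeBytes;   // number of bytes accessed
  uint32_t cachePolicy; // cachepolicy operand, including the swizzled bit
};

// Two accesses are merge candidates only if neither is swizzled, their
// cache policies agree, and the second starts where the first ends.
bool canMerge(const BufferAccess &a, const BufferAccess &b) {
  if ((a.cachePolicy | b.cachePolicy) & kSwizzledBit)
    return false; // swizzled data is not linear, so adjacency means nothing
  if (a.cachePolicy != b.cachePolicy)
    return false;
  return a.offset + a.sizeBytes == b.offset;
}
```

With the default value of 0 for the bit, the guard never fires, which matches the statement that generated code is unchanged.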
Reviewers: arsenm, nhaehnle, tpr
Reviewed By: nhaehnle
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, arphaman, jfb, Petar.Avramovic, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68200
llvm-svn: 373491
In principle this should behave as any other constant. However
eliminateFrameIndex currently assumes a VALU use and uses a vector
shift. Work around this by selecting to VGPR for now until
eliminateFrameIndex is fixed.
llvm-svn: 373415
Account and report agprs separately on gfx908. Other targets
do not change the reporting.
Differential Revision: https://reviews.llvm.org/D68307
llvm-svn: 373411
There are 1024 bit register classes defined for AGPRs. Additionally
OpenCL defines vectors up to 16 x i64, and this helps those tests
legalize.
llvm-svn: 373350
SelectionDAG has a bunch of machinery to defer this to selection time
for some reason. Just directly emit a copy during IRTranslator. The
x86 usage does somewhat questionably check hasFP, which could depend
on the whole function being at minimum translated.
This does lose the convergent bit if the callsite had it, which may be
a problem. We also lose that in general for intrinsics, which may also
be a problem.
llvm-svn: 373294
This is sort of papering over the fact that we don't run a combiner
anywhere, but avoiding creating 2 instructions in the first place is
easy.
llvm-svn: 373293
ISD::SADDO uses the sequence suggested in §2.4 of
the RISCV Spec v2.2. ISD::SSUBO uses the dual approach, but checks for a
(non-zero) positive operand.
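The checks can be sketched in C++ as follows. This is an illustration of the comparison logic, not the actual lowering code: the addition overflows exactly when "the second operand is negative" disagrees with "the wrapped sum is less than the first operand", and subtraction uses the dual test on a positive second operand. The function names are hypothetical.

```cpp
#include <cstdint>

// Wrapping signed add with overflow flag. The arithmetic is done in
// unsigned to avoid undefined behaviour; the conversion back to int32_t
// wraps modulo 2^32 (guaranteed since C++20, universal in practice before).
bool saddo32(int32_t a, int32_t b, int32_t &sum) {
  sum = static_cast<int32_t>(static_cast<uint32_t>(a) +
                             static_cast<uint32_t>(b));
  return (b < 0) != (sum < a);
}

// Dual form for subtraction: subtracting a positive value must make the
// result smaller, anything else must not.
bool ssubo32(int32_t a, int32_t b, int32_t &diff) {
  diff = static_cast<int32_t>(static_cast<uint32_t>(a) -
                              static_cast<uint32_t>(b));
  return (b > 0) != (diff < a);
}
```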
Differential Revision: https://reviews.llvm.org/D47927
llvm-svn: 373187
I believe all of the uniform/divergent pattern predicates are
redundant and can be removed. The uniformity bit already influences
the register class, and nothing has broken when I've removed this and
others.
llvm-svn: 372450
According to the documentation, the method returns the predecessor
if the given loop's header has exactly one unique predecessor
outside the loop, and otherwise returns null.
In reality it asserts if there is no predecessor outside of
the loop.
The testcase has a loop whose predecessors outside of the
loop were not identified because analyzeBranch() was unable to
process the mask branch and returned true. It is also
incorrect to assert for truly dead loops.
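The documented contract can be sketched on a toy graph representation. Blocks are plain integers here and the function name is hypothetical; it is not the MachineLoop implementation. The point is that "no outside predecessor" is reported as an empty result rather than an assertion failure.

```cpp
#include <optional>
#include <set>
#include <vector>

// Return the unique predecessor of the loop header that lies outside the
// loop. Returns std::nullopt when there is no such predecessor (for
// example, a dead loop) or when there is more than one.
std::optional<int> getLoopPredecessor(const std::vector<int> &headerPreds,
                                      const std::set<int> &loopBlocks) {
  std::optional<int> outside;
  for (int pred : headerPreds) {
    if (loopBlocks.count(pred))
      continue; // back edge from inside the loop
    if (outside && *outside != pred)
      return std::nullopt; // more than one distinct outside predecessor
    outside = pred;
  }
  return outside;
}
```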
Differential Revision: https://reviews.llvm.org/D67634
llvm-svn: 372405
This reverts commit 52621307bc.
Tests have been failing all night with
[0/2] ACTION //llvm/test:check-llvm(//llvm/utils/gn/build/toolchain:unix)
-- Testing: 33647 tests, 64 threads --
Testing: 0 .. 10..
UNRESOLVED: LLVM :: CodeGen/AMDGPU/GlobalISel/isel-blendi-gettargetconstant.ll (6943 of 33647)
******************** TEST 'LLVM :: CodeGen/AMDGPU/GlobalISel/isel-blendi-gettargetconstant.ll' FAILED ********************
Test has no run line!
********************
Since there were other concerns on https://reviews.llvm.org/D67785,
I'm just reverting for now.
llvm-svn: 372383
Summary: This fixes a crasher introduced by r372338.
Reviewers: echristo, arsenm
Subscribers: jvesely, wdng, nhaehnle, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67785
Tighten up the test case.
llvm-svn: 372366
If an instruction had multiple subregister defs, and one of them was
undef, this would improperly conclude all other lanes are
killed. There could still be other defs of those read-undef lanes in
other operands. This would improperly remove register uses from
CurrentVRegUses, so the visitation of later operands would not find
the necessary register dependency. This would also mean this would
fail or not depending on how different subregister def operands were
ordered.
On an undef subregister def, scan the instruction for other
subregister defs and avoid killing those.
This possibly should be deferring removing anything from
CurrentVRegUses until the entire instruction has been processed
instead.
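A simplified model of the fix, assuming lanes are bits in a 32-bit mask and all listed defs refer to the same virtual register. The struct and function names are hypothetical; the real code works on MachineOperands and LaneBitmask.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One subregister def operand of an instruction.
struct SubRegDef {
  uint32_t laneMask; // lanes written by this operand
  bool isUndef;      // operand carries the undef (read-undef) flag
};

// Lanes that may be treated as killed when visiting defs[i]. An undef
// subregister def would kill every lane of the register, except that lanes
// written by other def operands of the same instruction must be kept.
uint32_t killedLanes(const std::vector<SubRegDef> &defs, std::size_t i,
                     uint32_t allLanes) {
  assert(i < defs.size() && "operand index out of range");
  if (!defs[i].isUndef)
    return defs[i].laneMask;
  uint32_t kill = allLanes;
  for (std::size_t j = 0; j < defs.size(); ++j)
    if (j != i)
      kill &= ~defs[j].laneMask; // another operand still defines these lanes
  return kill;
}
```

Because the other defs are scanned regardless of position, the result no longer depends on the order of the subregister def operands.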
llvm-svn: 372362
This reverts r372314, reapplying r372285 and the commits which depend
on it (r372286-r372293, and r372296-r372297)
This was missing one switch to getTargetConstant in an untested case.
llvm-svn: 372338
This broke the Chromium build, causing it to fail with e.g.
fatal error: error in backend: Cannot select: t362: v4i32 = X86ISD::VSHLI t392, Constant:i8<15>
See llvm-commits thread of r372285 for details.
This also reverts r372286, r372287, r372288, r372289, r372290, r372291,
r372292, r372293, r372296, and r372297, which seemed to depend on the
main commit.
> Encode them directly as an imm argument to G_INTRINSIC*.
>
> Since intrinsics can now define what parameters are required to be
> immediates, avoid using registers for them. Intrinsics could
> potentially want a constant that isn't a legal register type. Also,
> since G_CONSTANT is subject to CSE and legalization, transforms could
> potentially obscure the value (and create extra work for the
> selector). The register bank of a G_CONSTANT is also meaningful, so
> this could throw off future folding and legalization logic for AMDGPU.
>
> This will be much more convenient to work with than needing to call
> getConstantVRegVal and checking if it may have failed for every
> constant intrinsic parameter. AMDGPU has quite a lot of intrinsics with
> immarg operands, many of which need inspection during lowering. Having
> to find the value in a register is going to add a lot of boilerplate
> and waste compile time.
>
> SelectionDAG has always provided TargetConstant for constants which
> should not be legalized or materialized in a register. The distinction
> between Constant and TargetConstant was somewhat fuzzy, and there was
> no automatic way to force usage of TargetConstant for certain
> intrinsic parameters. They were both ultimately ConstantSDNode, and it
> was inconsistently used. It was quite easy to mis-select an
> instruction requiring an immediate. For SelectionDAG, start emitting
> TargetConstant for these arguments, and using timm to match them.
>
> Most of the work here is to cleanup target handling of constants. Some
> targets process intrinsics through intermediate custom nodes, which
> need to preserve TargetConstant usage to match the intrinsic
> expectation. Pattern inputs now need to distinguish whether a constant
> is merely compatible with an operand or whether it is mandatory.
>
> The GlobalISelEmitter needs to treat timm as a special case of a leaf
> node, similar to MachineBasicBlock operands. This should also enable
> handling of patterns for some G_* instructions with immediates, like
> G_FENCE or G_EXTRACT.
>
> This does include a workaround for a crash in GlobalISelEmitter when
> ARM tries to use "imm" in an output with a "timm" pattern source.
llvm-svn: 372314