The MachineFunction wasn't used in getOptimalMemOpType, but more importantly,
this allows reuse of findOptimalMemOpLowering that is calling getOptimalMemOpType.
This is the groundwork for the changes in D59766 and D59787, that allows
implementation of TTI::getMemcpyCost.
Differential Revision: https://reviews.llvm.org/D59785
llvm-svn: 359537
Now we have vec3 MVTs, this commit implements dwordx3 variants of the
buffer intrinsics.
On gfx6, a dwordx3 buffer load intrinsic is implemented as a dwordx4
instruction, and a dwordx3 buffer store intrinsic is not supported.
We need to support the dwordx3 load intrinsic because it is generated by
subtarget-unaware code in InstCombine.
Differential Revision: https://reviews.llvm.org/D58904
Change-Id: I016729d8557b98a52f529638ae97c340a5922a4e
llvm-svn: 356755
Reassociate adds to collect scalar operands in a single
instruction when possible. That will result in a scalar
add followed by vector instead of two vector adds, thus
better utilizing SALU.
Differential Revision: https://reviews.llvm.org/D58220
llvm-svn: 354066
to reflect the new license.
We understand that people may be surprised that we're moving the header
entirely to discuss the new license. We checked this carefully with the
Foundation's lawyer and we believe this is the correct approach.
Essentially, all code in the project is now made available by the LLVM
project under our new license, so you will see that the license headers
include that license only. Some of our contributors have contributed
code under our old license, and accordingly, we have retained a copy of
our old license notice in the top-level files in each project and
repository.
llvm-svn: 351636
Summary:
Moving SMRD to VMEM in SIFixSGPRCopies is rather bad for performance if
the load is really uniform. So select the scalar load intrinsics directly
to either VMEM or SMRD buffer loads based on divergence analysis.
If an offset happens to end up in a VGPR -- either because a floating
point calculation was involved, or due to other remaining deficiencies
in SIFixSGPRCopies -- we use v_readfirstlane.
There is some unrelated churn in tests since we now select MUBUF offsets
in a unified way with non-scalar buffer loads.
Change-Id: I170e6816323beb1348677b358c9d380865cd1a19
Reviewers: arsenm, alex-t, rampitec, tpr
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D53283
llvm-svn: 348050
This allows to avoid scratch use or indirect VGPR addressing for
small vectors.
Differential Revision: https://reviews.llvm.org/D54606
llvm-svn: 347231
The main caller of this already has an MVT and several targets called getSimpleVT inside without checking isSimple. This makes the simpleness explicit.
llvm-svn: 346180
Introduce new versions that follow the IEEE semantics
to help with legalization that may need quieted inputs.
There are some regressions from inserting unnecessary
canonicalizes when these are matched from fast math
fcmp + select which should be fixed in a future commit.
llvm-svn: 344914
Summary:
Moving SMRD to VMEM in SIFixSGPRCopies is rather bad for performance if
the load is really uniform. So select the scalar load intrinsics directly
to either VMEM or SMRD buffer loads based on divergence analysis.
If an offset happens to end up in a VGPR -- either because a floating
point calculation was involved, or due to other remaining deficiencies
in SIFixSGPRCopies -- we use v_readfirstlane.
There is some unrelated churn in tests since we now select MUBUF offsets
in a unified way with non-scalar buffer loads.
Change-Id: I170e6816323beb1348677b358c9d380865cd1a19
Reviewers: arsenm, alex-t, rampitec, tpr
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D53283
llvm-svn: 344696
Summary:
This is patch 1 of the new DivergenceAnalysis (https://reviews.llvm.org/D50433).
The purpose of this patch is to free up the name DivergenceAnalysis for the new generic
implementation. The generic implementation class will be shared by specialized
divergence analysis classes.
Patch by: Simon Moll
Reviewed By: nhaehnle
Subscribers: jvesely, jholewinski, arsenm, nhaehnle, mgorny, jfb, llvm-commits
Differential Revision: https://reviews.llvm.org/D50434
Change-Id: Ie8146b11be2c50d5312f30e11c7a3036a15b48cb
llvm-svn: 341071
This was hackily adding in the 4-bytes reserved for the callee's
emergency stack slot. Treat it like a normal stack allocation
so we get the correct alignment padding behavior. This fixes
an inconsistency between the caller and callee.
llvm-svn: 340396
Summary:
This commit adds new intrinsics
llvm.amdgcn.raw.buffer.load
llvm.amdgcn.raw.buffer.load.format
llvm.amdgcn.raw.buffer.load.format.d16
llvm.amdgcn.struct.buffer.load
llvm.amdgcn.struct.buffer.load.format
llvm.amdgcn.struct.buffer.load.format.d16
llvm.amdgcn.raw.buffer.store
llvm.amdgcn.raw.buffer.store.format
llvm.amdgcn.raw.buffer.store.format.d16
llvm.amdgcn.struct.buffer.store
llvm.amdgcn.struct.buffer.store.format
llvm.amdgcn.struct.buffer.store.format.d16
llvm.amdgcn.raw.buffer.atomic.*
llvm.amdgcn.struct.buffer.atomic.*
with the following changes from the llvm.amdgcn.buffer.*
intrinsics:
* there are separate raw and struct versions: raw does not have an
index arg and sets idxen=0 in the instruction, and struct always sets
idxen=1 in the instruction even if the index is 0, to allow for the
fact that gfx9 does bounds checking differently depending on whether
idxen is set;
* there is a combined cachepolicy arg (glc+slc)
* there are now only two offset args: one for the offset that is
included in bounds checking and swizzling, to be split between the
instruction's voffset and immoffset fields, and one for the offset
that is excluded from bounds checking and swizzling, to go into the
instruction's soffset field.
The AMDISD::BUFFER_* SD nodes always have an index operand, all three
offset operands, combined cachepolicy operand, and an extra idxen
operand.
The obsolescent llvm.amdgcn.buffer.* intrinsics continue to work.
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, t-tye, jfb, llvm-commits
Differential Revision: https://reviews.llvm.org/D50306
Change-Id: If897ea7dc34fcbf4d5496e98cc99a934f62fc205
llvm-svn: 340269
Summary:
This commit adds new intrinsics
llvm.amdgcn.raw.tbuffer.load
llvm.amdgcn.struct.tbuffer.load
llvm.amdgcn.raw.tbuffer.store
llvm.amdgcn.struct.tbuffer.store
with the following changes from the llvm.amdgcn.tbuffer.* intrinsics:
* there are separate raw and struct versions: raw does not have an index
arg and sets idxen=0 in the instruction, and struct always sets
idxen=1 in the instruction even if the index is 0, to allow for the
fact that gfx9 does bounds checking differently depending on whether
idxen is set;
* there is a combined format arg (dfmt+nfmt)
* there is a combined cachepolicy arg (glc+slc)
* there are now only two offset args: one for the offset that is
included in bounds checking and swizzling, to be split between the
instruction's voffset and immoffset fields, and one for the offset
that is excluded from bounds checking and swizzling, to go into the
instruction's soffset field.
The AMDISD::TBUFFER_* SD nodes always have an index operand, all three
offset operands, combined format operand, combined cachepolicy operand,
and an extra idxen operand.
The tbuffer pseudo- and real instructions now also have a combined
format operand.
The obsolescent llvm.amdgcn.tbuffer.* and llvm.SI.tbuffer.store
intrinsics continue to work.
V2: Separate raw and struct intrinsics.
V3: Moved extract_glc and extract_slc defs to a more sensible place.
V4: Rebased on D49995.
V5: Only two separate offset args instead of three.
V6: Pseudo- and real instructions have joint format operand.
V7: Restored optionality of dfmt and nfmt in assembler.
V8: Addressed minor review comments.
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D49026
Change-Id: If22ad77e349fac3a5d2f72dda53c010377d470d4
llvm-svn: 340268
If denormals are enabled, denormals are canonical.
Also fix a few other issues. minnum/maxnum are supposed
to canonicalize. Temporarily improve workaround for the
instruction behavior change in gfx9.
Handle selects and fcopysign.
The tests were also largely broken, since they were
checking for a flush used on some targets after the
store of the result.
llvm-svn: 339061
Summary:
By not reconstructing the operand list of the SDNode, this change makes
it easier to add the forthcoming new tbuffer and buffer intrinsics.
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D49995
Change-Id: I0cb79ef0801532645d7dd954a6d7355139db7b38
llvm-svn: 338784
SelectionDAGBuilder widens v3i32/v3f32 arguments to
to v4i32/v4f32 which consume an additional register.
In addition to wasting argument space, this produces extra
instructions since now it appears the 4th vector component has
a meaningful value to most combines.
llvm-svn: 338197
Summary:
This is a follow-up to r335942.
- Merge SISubtarget into AMDGPUSubtarget and rename to GCNSubtarget
- Rename AMDGPUCommonSubtarget to AMDGPUSubtarget
- Merge R600Subtarget::Generation and GCNSubtarget::Generation into
AMDGPUSubtarget::Generation.
Reviewers: arsenm, jvesely
Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D49037
llvm-svn: 336851
Summary:
We now have two sets of generated TableGen files, one for R600 and one
for GCN, so each sub-target now has its own tables of instructions,
registers, ISel patterns, etc. This should help reduce compile time
since each sub-target now only has to consider information that
is specific to itself. This will also help prevent the R600
sub-target from slowing down new features for GCN, like disassembler
support, GlobalISel, etc.
Reviewers: arsenm, nhaehnle, jvesely
Reviewed By: arsenm
Subscribers: MatzeB, kzhuravl, wdng, mgorny, yaxunl, dstuttard, tpr, t-tye, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D46365
llvm-svn: 335942
If a source of rcp instruction is a result of any conversion from
an integer convert it into rcp_iflag instruction. No FP exception
can ever happen except division by zero if a single precision rcp
argument is a representation of an integral number.
Differential Revision: https://reviews.llvm.org/D48569
llvm-svn: 335742
Summary:
Having TableGen patterns for image intrinsics is hitting limitations:
for D16 we already have to manually pre-lower the packing of data
values, and we will have to do the same for A16 eventually.
Since there is already some custom C++ code anyway, it is arguably easier
to just do everything in C++, now that we can use the beefed-up generic
tables backend of TableGen to provide all the required metadata and map
intrinsics to corresponding opcodes. With this approach, all image
intrinsic lowering happens in SITargetLowering::lowerImage. That code is
dense due to all the cases that it handles, but it should still be easier
to follow than what we had before, by virtue of it all being done in a
single location, and by virtue of not relying on the TableGen pattern
magic that very few people really understand.
This means that we will have MachineSDNodes with MIMG instructions
during DAG combining, but that seems alright: previously we had
intrinsic nodes instead, but those are similarly opaque to the generic
CodeGen infrastructure, and the final pattern matching just did a 1:1
translation to machine instructions anyway. If anything, the fact that
we now merge the address words into a vector before DAG combine should
be an advantage.
Change-Id: I417f26bd88f54ce9781c1668acc01f3f99774de6
Reviewers: arsenm, rampitec, rtaylor, tstellar
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D48017
llvm-svn: 335228
Summary:
The code that handles ISD:Register and ISD::CopyFromReg assumes
the target is amdgcn, so this is broken on r600. We don't
need this analysis on r600 anyway so we can safely move
it to SITargetLowering.
Reviewers: alex-t, arsenm, nhaehnle
Reviewed By: arsenm
Subscribers: msearles, kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D46298
llvm-svn: 334607
This has two main components. First, widen
widen short constant loads in DAG when they have
the correct alignment. This is already done a bit in
AMDGPUCodeGenPrepare, since that has access to
DivergenceAnalysis. This can't help kernarg loads
created in the DAG. Start to use DAG divergence analysis
to help this case.
The second part is to avoid kernel argument lowering
breaking the alignment of short vector elements because
calling convention lowering wants to split everything
into legal register types.
When loading a split type, load the nearest 4-byte aligned
segment and shift to get the desired bits. This extra
load of the earlier argument piece ends up merging,
and the bit extract hopefully folds out.
There are a number of improvements and regressions with
this, but I think as-is this is a better compromise between
several of the worst parts of SelectionDAG.
Particularly when i16 is legal, this produces worse code
for i8 and i16 element vector kernel arguments. This is
partially due to the very weak load merging the DAG does.
It only looks for fairly specific combines between pairs
of loads which no longer appear. In particular this
causes v4i16 loads to be split into 2 components when
previously the two halves were merged.
Worse, because of the newly introduced shifts, there
is a lot more unnecessary vector packing and unpacking code
emitted. At least some of this is due to reporting
false for isTypeDesirableForOp for i16 as a workaround for
the lack of divergence information in the DAG. The cases
where this happens it doesn't actually matter, but the
relevant code in SimplifyDemandedBits doens't have the context
to know to ignore this.
The use of the scalar cache is probably more important
than the mess of mostly scalar instructions doing this packing
and unpacking. Future work can fix this, possibly by making better
use of the new DAG divergence information for controlling promotion
decisions, or adding another version of shift + trunc + shift
combines that doesn't only know about the used types.
llvm-svn: 334180
This was just emitting loads with the ABI alignment
for the raw type. The true alignment is often better,
especially when an illegal vector type was scalarized.
The better alignment allows using a scalar load
more often.
llvm-svn: 333558
This usually results in better code. Fixes using
inline asm with short2, and also fixes having a different
ABI for function parameters between VI and gfx9.
Partially cleans up the mess used for lowering of the d16
operations. Making v4f16 legal will help clean this up more,
but this requires additional work.
llvm-svn: 332953
We've been running doxygen with the autobrief option for a couple of
years now. This makes the \brief markers into our comments
redundant. Since they are a visual distraction and we don't want to
encourage more \brief markers in new code either, this patch removes
them all.
Patch produced by
for i in $(git grep -l '\\brief'); do perl -pi -e 's/\\brief //g' $i & done
Differential Revision: https://reviews.llvm.org/D46290
llvm-svn: 331272
Move the entire optimization to one place. Before it was possible
to adjust dmask without changing the register class of the output
instruction, since they were done in separate places. Fix all
lane sizes and move all of the optimization into the DAG folding.
llvm-svn: 319705
This was using a custom function that didn't handle the
addressing modes properly for private. Use
isLegalAddressingMode to avoid duplicating this.
Additionally, skip the combine if there is only one use
since the standard combine will handle it.
llvm-svn: 318013
Includes a hack to fix the type selected for
the GlobalAddress of the function, which will be
fixed by changing the default datalayout to use
generic pointers for 0.
llvm-svn: 309732
Also refine the flat check to respect flat-for-global feature,
and constant fallback should check global handling, not
specifically MUBUF.
llvm-svn: 309471
We need to pass something to functions for this to work.
It isn't derivable just from the kernarg segment pointer
because the implicit arguments are placed after the
kernel arguments.
Also fixes missing test for the intrinsic.
llvm-svn: 309398