Summary:
MCTargetDesc/AMDGPUMCTargetDesc.h contains enums for all the instuction
and register defintions, which are huge so we only want to include
them where needed.
This will also make it easier if we want to split the R600 and GCN
definitions into separate tablegenerated files.
I was unable to remove AMDGPUMCTargetDesc.h from SIMachineFunctionInfo.h
because it uses some enums from the header to initialize default values
for the SIMachineFunction class, so I ended up having to remove includes of
SIMachineFunctionInfo.h from headers too.
Reviewers: arsenm, nhaehnle
Reviewed By: nhaehnle
Subscribers: MatzeB, kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D46272
llvm-svn: 332930
It is legal for the type passed to isLegalAddressingMode to be
unsized or, more specifically, VoidTy. In this case, we must
check the legality of load / stores for all legal types. Directly
trying to call getTypeStoreSize is incorrect, and leads to breakage
in e.g. Loop Strength Reduction. This change guards against that
behaviour.
Differential Revision: https://reviews.llvm.org/D40405
llvm-svn: 332409
Summary: Adding support for Fast flags in the SDNode to leverage fast math sub flag usage.
Reviewers: spatel, arsenm, jbhateja, hfinkel, escha, qcolombet, echristo, wristow, javed.absar
Reviewed By: spatel
Subscribers: llvm-commits, rampitec, nhaehnle, tstellar, FarhanaAleen, nemanjai, javed.absar, jbhateja, hfinkel, wdng
Differential Revision: https://reviews.llvm.org/D45710
llvm-svn: 331547
Summary: performAddCombine should run after DAG is legalized; Otherwise generic optimization
in the DAGCombiner can optimize an addcarry+trunc into an addcarry instruction with
illegal types.
Author: FarhanaAleen
Reviewed By: rampitec
Subscribers: llvm-commits, AMDGPU
Differential Revision: https://reviews.llvm.org/D46337
llvm-svn: 331368
We've been running doxygen with the autobrief option for a couple of
years now. This makes the \brief markers into our comments
redundant. Since they are a visual distraction and we don't want to
encourage more \brief markers in new code either, this patch removes
them all.
Patch produced by
for i in $(git grep -l '\\brief'); do perl -pi -e 's/\\brief //g' $i & done
Differential Revision: https://reviews.llvm.org/D46290
llvm-svn: 331272
Author: Samuel Pitoiset
ds_read_b128 and ds_write_b128 have been recently enabled
under the amdgpu-ds128 option because the performance benefit
is unclear.
Though, using 128-bit loads/stores for the local address space
appears to introduce regressions in tessellation shaders. Not
sure what is broken, but as ds_read_b128/ds_write_b128 are not
enabled by default, just introduce a global option and enable
128-bit only if requested (until it's fixed/used correctly).
v2: - fix regressions in merge-stores.ll and multiple_tails.ll
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105464
llvm-svn: 329764
Author: Samuel Pitoiset
ds_read_b128 and ds_write_b128 have been recently enabled
under the amdgpu-ds128 option because the performance benefit
is unclear.
Though, using 128-bit loads/stores for the local address space
appears to introduce regressions in tessellation shaders. Not
sure what is broken, but as ds_read_b128/ds_write_b128 are not
enabled by default, just introduce a global option and enable
128-bit only if requested (until it's fixed/used correctly).
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105464
llvm-svn: 329591
Summary:
These new image intrinsics contain the texture type as part of
their name and have each component of the address/coordinate as
individual parameters.
This is a preparatory step for implementing the A16 feature, where
coordinates are passed as half-floats or -ints, but the Z compare
value and texel offsets are still full dwords, making it difficult
or impossible to distinguish between A16 on or off in the old-style
intrinsics.
Additionally, these intrinsics pass the 'texfailpolicy' and
'cachectrl' as i32 bit fields to reduce operand clutter and allow
for future extensibility.
v2:
- gather4 supports 2darray images
- fix a bug with 1D images on SI
Change-Id: I099f309e0a394082a5901ea196c3967afb867f04
Reviewers: arsenm, rampitec, b-sumner
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D44939
llvm-svn: 329166
Summary: There are no packed instructions for min3 or max3. So, performMinMaxCombine should not optimize vectors of f16 to min3/max3.
Author: FarhanaAleen
Reviewed By: arsenm
Subscribers: llvm-commits, AMDGPU
Differential Revision: https://reviews.llvm.org/D45219
llvm-svn: 329131
Summary:
Avoids having to list all intrinsics manually.
This is in preparation for the new dimension-aware image intrinsics,
which I'd rather not have to list here by hand.
Change-Id: If7ced04998397ef68c4cb8f7de66b5050fb767e5
Reviewers: arsenm, rampitec, b-sumner
Subscribers: kzhuravl, wdng, mgorny, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D44937
llvm-svn: 328938
Currently EVT is in the IR layer only because of Function.cpp needing a very small piece of the functionality of EVT::getEVTString(). The rest of EVT is used in codegen making CodeGen a better place for it.
The previous code converted a Type* to EVT and then called getEVTString. This was only expected to handle the primitive types from Type*. Since there only a few primitive types, we can just print them as strings directly.
Differential Revision: https://reviews.llvm.org/D45017
llvm-svn: 328806
The combine on a select of a load only triggers for
addrspace 0, and discards the MachinePointerInfo. The
conservative default needs to be used for this.
llvm-svn: 328652
In a function, s5 is used as the frame base SGPR. If a function
is calling another function, during the call sequence
it is copied to a preserved SGPR and restored.
Before it was possible for the scheduler to move stack operations
before the restore of s5, since there's nothing to associate
a frame index access with the restore.
Add an implicit use of s5 to the adjcallstack pseudo which ends
the call sequence to preven this from happening. I'm not 100%
satisfied with this solution, but I'm not sure what else would be
better.
llvm-svn: 328650
This is used by llvm tblgen as well as by LLVM Targets, so the only
common place is Support for now. (maybe we need another target for these
sorts of things - but for now I'm at least making them correct & we can
make them better if/when people have strong feelings)
llvm-svn: 328395
Summary: Starting from GCN 2nd generation, ISA supports ds_read_b128 on top of ds_read_b64.
This patch supports ds_read_b128 instruction pattern and generation of this instruction.
In the vectorizer, this patch also widen the vector length so that vectorizer generates
128 bit loads for local address-space which gets translated to ds_read_b128.
Since the performance benefit is not clear; compiler generates ds_read_b128 under -amdgpu-ds128.
Author: FarhanaAleen
Reviewed By: rampitec, arsenm
Subscribers: llvm-commits, AMDGPU
Differential Revision: https://reviews.llvm.org/D44210
llvm-svn: 327153
Summary: GCN ISA supports instructions that can read 16 consecutive dwords from memory through the scalar data cache;
loadstoreVectorizer should take advantage of the wider vector length and pack 16/8 elements of dwords/quadwords.
Author: FarhanaAleen
Reviewed By: rampitec
Subscribers: llvm-commits, AMDGPU
Differential Revision: https://reviews.llvm.org/D44179
llvm-svn: 326910
The code checks Level == AfterLegalizeDAG which is the fourth and last of the possible DAG combine stages that we have.
There is a Level called AfterLegalVectorOps, but that's the third DAG combine and it doesn't always run.
A function called isAfterLegalVectorOps should imply it returns true in either of the DAG combines that runs after the legalize vector ops stage, but that's not what this function does.
llvm-svn: 326832
i16 capable ASICs do not support i16 operands for this instruction.
Add tablegen pattern to merge chained i16 additions.
Differential Revision: https://reviews.llvm.org/D43985
llvm-svn: 326535
Summary:
In the current implementation of GPR Indexing Mode when the index is of non-uniform, the s_set_gpr_idx_off instruction
is incorrectly inserted after the loop. This will lead the instructions with vgpr operands (v_readfirstlane for example) to read incorrect
vgpr.
In this patch, we fix the issue by inserting s_set_gpr_idx_on/off immediately around the interested instruction.
Reviewers:
rampitec
Differential Revision:
https://reviews.llvm.org/D43297
llvm-svn: 325355
Kernel arguments likely read by all workitems and should not bypass
cache. Fixes performance hit in sub-dword argument loads.
Differential Revision: https://reviews.llvm.org/D43249
llvm-svn: 325146
Note: This is a candidate for LLVM 6.0, because it was planned to be
in that release but was delayed due to a long review period.
Merge conflict in release_60 - resolution:
Add "-p6:32:32" into the second (non-amdgiz) string.
Only scalar loads support 32-bit pointers. An address in a VGPR will
fail to compile. That's OK because the results of loads will only be used
in places where VGPRs are forbidden.
Updated AMDGPUAliasAnalysis and used SReg_64_XEXEC.
The tests cover all uses cases we need for Mesa.
Reviewers: arsenm, nhaehnle
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D41651
llvm-svn: 324487
- using qualified pointer addrspace in intrinsics class to avoid .f32 mangling
- changed too common atomic mangling to ds
- added missing intrinsics to AMDGPUTTIImpl::getTgtMemIntrinsic
Reviewed by: b-sumner
Differential Revision: https://reviews.llvm.org/D42383
llvm-svn: 323516
Summary:
This patch implements d16 support for image load, image store and image sample intrinsics.
Reviewers:
Matt, Brian.
Differential Revision:
https://reviews.llvm.org/D3991
llvm-svn: 322903
Currently all images are lowered to have a single
image PseudoSourceValue. Image stores happen to have
overly strict mayLoad/mayStore/hasSideEffects flags
set on them, so this happens to work. When these
are fixed to be correct, the scheduler breaks
this because the identical PSVs are assumed to
be the same address. These need to be unique
to the image resource value.
llvm-svn: 321555
Rather than adding more bits to express every
MMO flag you could want, just directly use the
MMO flags. Also fixes using a bunch of bool arguments to
getMemIntrinsicNode.
On AMDGPU, buffer and image intrinsics should always
have MODereferencable set, but currently there is no
way to do that directly during the initial intrinsic
lowering.
llvm-svn: 320746
Stores failed to decode at all since they didn't have a
DecoderNamespace set. Loads worked, but did not change
the register width displayed to match the numbmer of
enabled channels.
The number of printed registers for vaddr is still wrong,
but I don't think that's encoded in the instruction so
there's not much we can do about that.
Image atomics are still broken. MIMG is the same
encoding for SI/VI, but the image atomic classes
are split up into encoding specific versions unlike
every other MIMG instruction. They have isAsmParserOnly
set on them for some reason. dmask is also special for
these, so we probably should not have it as an explicit
operand as it is now.
llvm-svn: 320614
Move the entire optimization to one place. Before it was possible
to adjust dmask without changing the register class of the output
instruction, since they were done in separate places. Fix all
lane sizes and move all of the optimization into the DAG folding.
llvm-svn: 319705
The object can't straddle the address space
wrap around, so I think it's OK to assume any
offsets added to the base object pointer can't
overflow. Similar logic already appears to be
applied in SelectionDAGBuilder when lowering
aggregate returns.
llvm-svn: 319272
As part of the unification of the debug format and the MIR format,
always print registers as lowercase.
* Only debug printing is affected. It now follows MIR.
Differential Revision: https://reviews.llvm.org/D40417
llvm-svn: 319187
SITargetLowering::LowerCall uses dummy pointer info for byval argument, which causes
flat load instead of buffer load.
This patch fixes that.
Differential Revision: https://reviews.llvm.org/D40040
llvm-svn: 318844
All these headers already depend on CodeGen headers so moving them into
CodeGen fixes the layering (since CodeGen depends on Target, not the
other way around).
llvm-svn: 318490
Use VOP3 add/addc like usual.
This has some tradeoffs. Inline immediates fold
a little better, but other constants are worse off.
SIShrinkInstructions could be made smarter to handle
these cases.
This allows us to avoid selecting scalar adds where we
need to track the carry in scc and replace its users.
This makes it easier to use the carryless VALU adds.
llvm-svn: 318340
This was using a custom function that didn't handle the
addressing modes properly for private. Use
isLegalAddressingMode to avoid duplicating this.
Additionally, skip the combine if there is only one use
since the standard combine will handle it.
llvm-svn: 318013
The backend assumes pointer in default addr space is 32 bit, which is not
true for the new addr space mapping and causes assertion for unresolved
functions.
This patch fixes that.
Differential Revision: https://reviews.llvm.org/D39643
llvm-svn: 317476
Summary:
Kill the thread if operand 0 == false.
llvm.amdgcn.wqm.vote can be applied to the operand.
Also allow kill in all shader stages.
Reviewers: arsenm, nhaehnle
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D38544
llvm-svn: 316427
Summary:
The interpolation mode workaround ensures that at least one
interpolation mode is enabled in PSInputAddr. It does not also check
PSInputEna on the basis that the user might enable bits in that
depending on run-time state.
However, for amdpal os type, the user does not enable some bits after
compilation based on run-time states; the register values being
generated here are the final ones set in the hardware. Therefore, apply
the workaround to PSInputAddr and PSInputEnable together. (The case
where a bit is set in PSInputAddr but not in PSInputEnable is where the
frontend set up an input arg for a particular interpolation mode, but
nothing uses that input arg. Really we should have an earlier pass that
removes such an arg.)
Reviewers: arsenm, nhaehnle, dstuttard
Subscribers: kzhuravl, wdng, yaxunl, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D37758
llvm-svn: 315591
The hardware will only forward EXEC_LO; the high 32 bits will be zero.
Additionally, inline constants do not work. At least,
v_addc_u32_e64 v0, vcc, v0, v1, -1
which could conceivably be used to combine (v0 + v1 + 1) into a single
instruction, acts as if all carry-in bits are zero.
The llvm.amdgcn.ps.live test is adjusted; it would be nice to combine
s_mov_b64 s[0:1], exec
v_cndmask_b32_e64 v0, v1, v2, s[0:1]
into
v_mov_b32 v0, v3
but it's not particularly high priority.
Fixes dEQP-GLES31.functional.shaders.helper_invocation.value.*
llvm-svn: 314522
Because the stack growth direction and addressing is done
in the same direction, modifying SP at the beginning of the
call sequence was incorrect. If we had a stack passed argument,
we would end up skipping that number of bytes before pushing
arguments, leaving unused/inconsistent space.
The callee creates fixed stack objects in its frame, so
the space necessary for these is already logically allocated
in the callee, so we just let the callee increment SP if
it really requires it.
llvm-svn: 313279
Using SplitCSR for the frame register was very broken. Often
the copies in the prolog and epilog were optimized out, in addition
to them being inserted after the true prolog where the FP
was clobbered.
I have a hacky solution which works that continues to use
split CSR, but for now this is simpler and will get to working
programs.
llvm-svn: 313274
Summary:
Whole Wavefront Wode (WWM) is similar to WQM, except that all of the
lanes are always enabled, regardless of control flow. This is required
for implementing wavefront reductions in non-uniform control flow, where
we need to use the inactive lanes to propagate intermediate results, so
they need to be enabled. We need to propagate WWM to uses (unless
they're explicitly marked as exact) so that they also propagate
intermediate results correctly. We do the analysis and exec mask munging
during the WQM pass, since there are interactions with WQM for things
that require both WQM and WWM. For simplicity, WWM is entirely
block-local -- blocks are never WWM on entry or exit of a block, and WWM
is not propagated to the block level. This means that computations
involving WWM cannot involve control flow, but we only ever plan to use
WWM for a few limited purposes (none of which involve control flow)
anyways.
Shaders can ask for WWM using the @llvm.amdgcn.wwm intrinsic. There
isn't yet a way to turn WWM off -- that will be added in a future
change.
Finally, it turns out that turning on inactive lanes causes a number of
problems with register allocation. While the best long-term solution
seems like teaching LLVM's register allocator about predication, for now
we need to add some hacks to prevent ourselves from getting into trouble
due to constraints that aren't currently expressed in LLVM. For the gory
details, see the comments at the top of SIFixWWMLiveness.cpp.
Reviewers: arsenm, nhaehnle, tpr
Subscribers: kzhuravl, wdng, mgorny, yaxunl, dstuttard, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D35524
llvm-svn: 310087
Summary:
Previously, we assumed that certain types of instructions needed WQM in
pixel shaders, particularly DS instructions and image sampling
instructions. This was ok because with OpenGL, the assumption was
correct. But we want to start using DPP instructions for derivatives as
well as other things, so the assumption that we can infer whether to use
WQM based on the instruction won't continue to hold. This intrinsic lets
frontends like Mesa indicate what things need WQM based on their
knowledge of the API, rather than second-guessing them in the backend.
We need to keep around the old method of enabling WQM, but eventually we
should remove it once Mesa catches up. For now, this will let us use DPP
instructions for computing derivatives correctly.
Reviewers: arsenm, tpr, nhaehnle
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D35167
llvm-svn: 310085
The src0 register must match src1 or src2, but if these
were undefined they could end up using different implicit_defed
virtual registers. Force these to use one undef vreg or pick the
defined other register.
Also fixes producing invalid nodes without the right number of
inputs when src2 is undef.
llvm-svn: 309743
Includes a hack to fix the type selected for
the GlobalAddress of the function, which will be
fixed by changing the default datalayout to use
generic pointers for 0.
llvm-svn: 309732
Also refine the flat check to respect flat-for-global feature,
and constant fallback should check global handling, not
specifically MUBUF.
llvm-svn: 309471
We need to pass something to functions for this to work.
It isn't derivable just from the kernarg segment pointer
because the implicit arguments are placed after the
kernel arguments.
Also fixes missing test for the intrinsic.
llvm-svn: 309398
Changing mask argument type from const SmallVectorImpl<int>& to
ArrayRef<int>.
This came up in D35700 where a mask is received as an ArrayRef<int> and
we want to pass it to TargetLowering::isShuffleMaskLegal().
Also saves a few lines of code.
llvm-svn: 309085
This patch makes LSR generate better code for SystemZ in the cases of memory
intrinsics, Load->Store pairs or comparison of immediate with memory.
In order to achieve this, the following common code changes were made:
* New TTI hook: LSRWithInstrQueries(), which defaults to false. Controls if
LSR should do instruction-based addressing evaluations by calling
isLegalAddressingMode() with the Instruction pointers.
* In LoopStrengthReduce: handle address operands of memset, memmove and memcpy
as address uses, and call isFoldableMemAccessOffset() for any LSRUse::Address,
not just loads or stores.
SystemZ changes:
* isLSRCostLess() implemented with Insns first, and without ImmCost.
* New function supportedAddressingMode() that is a helper for TTI methods
looking at Instructions passed via pointers.
Review: Ulrich Weigand, Quentin Colombet
https://reviews.llvm.org/D35262https://reviews.llvm.org/D35049
llvm-svn: 308729
Introduce pseudo-registers for registers needed for stack
access, which are replaced during finalizeLowering.
Note these pseudo-registers are currently only used for the
used register location, and not for determining their
input argument register.
This is better because it avoids the need to try to predict
whether a call will be emitted from the IR, and also
detects stack objects introduced by legalization.
Test changes are from the HasStackObjects check being more
accurate since stack objects introduced during legalization
are now known.
llvm-svn: 308325
The type needs to be casted back to the original argument type.
Fixes an assert that for some reason is only run when
using -debug.
Includes an additional combine to avoid test regressions
from having conversions mixed with multiple Assert[SZ]ext
nodes. On subtargets where i16 is legal, this was producing an i32
register with an i16 AssertZExt, truncated to i16 with another i8
AssertZExt.
t2: i32,ch = CopyFromReg t0, Register:i32 %vreg0
t3: i16 = truncate t2
t5: i16 = AssertZext t3, ValueType:ch:i8
t6: i8 = truncate t5
t7: i32 = zero_extend t6
llvm-svn: 308082
Since GFX9 supports denorm modes for v_min_f32/v_max_f32 that
is possible to further optimize fcanonicalize and remove it
if applied to min/max given their operands are known not to be
an sNaN or that sNaNs are not supported.
Additionally we can remove fcanonicalize if denorms are supported
for the VT and we know that its argument is never a NaN.
Differential Revision: https://reviews.llvm.org/D35335
llvm-svn: 307976
We are using multiplication by 1.0 to flush denormals and quiet sNaNs.
That is possible to omit this multiplication if source of the
fcanonicalize instruction is known to be flushed/quieted, i.e.
if it comes from another instruction known to do the normalization
and we are using IEEE mode to quiet sNaNs.
Differential Revision: https://reviews.llvm.org/D35218
llvm-svn: 307848
Regardless of relaxation options such as -cl-fast-relaxed-math
we are producing rather long code for fdiv via amdgcn_fdiv_fast
intrinsic. This intrinsic is used to replace fdiv with 2.5ulp
metadata and does not handle denormals, thus believed to be fast.
An fdiv instruction can also have fast math flag either by itself
or together with fpmath metadata. Clang used with a relaxation flag
always produces both metadata and fast flag:
%div = fdiv fast float %v, %0, !fpmath !12!12 = !{float 2.500000e+00}
Current implementation ignores fast flag and favors metadata. An
instruction with just fast flag would be lowered to a fastest rcp +
mul, but that never happen on practice because of described mutual
clang and BE behavior.
This change allows an "fdiv fast" to be always lowered as rcp + mul.
Differential Revision: https://reviews.llvm.org/D34844
llvm-svn: 307308
Going through the Constant methods requires redetermining that the Constant is a ConstantInt and then calling isZero/isOne/isMinusOne.
llvm-svn: 307292
Depending on the compare code that can be either an argument of
sext or negate of it. This helps to avoid v_cndmask_b64 instruction
for sext. A reversed value can be further simplified and folded into
its parent comparison if possible.
Differential Revision: https://reviews.llvm.org/D34545
llvm-svn: 306446
Also factored out function to check if a boolean is an already
deserialized value which does not require v_cndmask_b32 to be
loaded. Added binary logical operators to its check.
Differential Revision: https://reviews.llvm.org/D34500
llvm-svn: 306439
This should not be treated as a different version of
private_segment_buffer. These are distinct things with
different uses and register classes, and requires the
function argument info to have more context about the
function's type and environment.
Also add missing test coverage for the intrinsic, and
emit an error for HSA. This also encovers that the intrinsic
is broken unless there happen to be stack objects.
llvm-svn: 306264
Intrinsic already existed for llvm.SI.tbuffer.store
Needed tbuffer.load and also re-implementing the intrinsic as llvm.amdgcn.tbuffer.*
Added CodeGen tests for the 2 new variants added.
Left the original llvm.SI.tbuffer.store implementation to avoid issues with existing code
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, tony-tye, tpr
Differential Revision: https://reviews.llvm.org/D30687
llvm-svn: 306031
If one of the arguments of adde/sube is zero we can fold another
add/sub into it.
Differential Revision: https://reviews.llvm.org/D34374
llvm-svn: 305964
This simplification allows to avoid generating v_cndmask_b32
to serialize condition code between compare and use.
Differential Revision: https://reviews.llvm.org/D34300
llvm-svn: 305962
I did this a long time ago with a janky python script, but now
clang-format has built-in support for this. I fed clang-format every
line with a #include and let it re-sort things according to the precise
LLVM rules for include ordering baked into clang-format these days.
I've reverted a number of files where the results of sorting includes
isn't healthy. Either places where we have legacy code relying on
particular include ordering (where possible, I'll fix these separately)
or where we have particular formatting around #include lines that
I didn't want to disturb in this patch.
This patch is *entirely* mechanical. If you get merge conflicts or
anything, just ignore the changes in this patch and run clang-format
over your #include lines in the files.
Sorry for any noise here, but it is important to keep these things
stable. I was seeing an increasing number of patches with irrelevant
re-ordering of #include lines because clang-format was used. This patch
at least isolates that churn, makes it easy to skip when resolving
conflicts, and gets us to a clean baseline (again).
llvm-svn: 304787
Various address spaces on the SI and R600 subtargets have stricter
limits on memory access size that other address spaces. Use
canMergeStoresTo predicate to prevent the DAGCombiner from creating
these stores as they will be split up during legalization.
llvm-svn: 303767
Perform DAG combine:
and (srl x, c), mask => shl (bfe x, nb + c, mask >> nb), nb
Where nb is a number of trailing zeroes in mask.
It replaces two instructions with two and BFE is generally a more
expensive one. However this is only done if we are selecting a byte
or word at an aligned boundary which results in a proper SDWA
operand pattern. It is only done if SDWA is supported.
TODO: improve SDWA pass to actually convert this pattern. It is not
done now because we have an immediate in the instruction, which has
be moved into a VGPR.
Differential Revision: https://reviews.llvm.org/D33455
llvm-svn: 303681
Partially implement callee-side for arguments and return values.
byval doesn't work properly, and most likely sret or other on-stack
return values most as well.
llvm-svn: 303308
This patch replaces the separate APInts for KnownZero/KnownOne with a single KnownBits struct. This is similar to what was done to ValueTracking's version recently.
This is largely a mechanical transformation from KnownZero to Known.Zero.
Differential Revision: https://reviews.llvm.org/D32569
llvm-svn: 301620
1. RegisterClass::getSize() is split into two functions:
- TargetRegisterInfo::getRegSizeInBits(const TargetRegisterClass &RC) const;
- TargetRegisterInfo::getSpillSize(const TargetRegisterClass &RC) const;
2. RegisterClass::getAlignment() is replaced by:
- TargetRegisterInfo::getSpillAlignment(const TargetRegisterClass &RC) const;
This will allow making those values depend on subtarget features in the
future.
Differential Revision: https://reviews.llvm.org/D31783
llvm-svn: 301221
Fixes traps in any block besides the entry block,
and fixes depending on a live-in physical register
by using a virtual register copy.
Also happens to stop emitting a nop in the case
debug trap is not supported.
llvm-svn: 301206
immediate operands.
This commit adds an AArch64 dag-combine that optimizes code generation
for logical instructions taking immediate operands. The optimization
uses demanded bits to change a logical instruction's immediate operand
so that the immediate can be folded into the immediate field of the
instruction.
This recommits r300932 and r300930, which was causing dag-combine to
loop forever. The problem was that optimizeLogicalImm was returning
true even when there was no change to the immediate node (which happened
when the immediate was all zeros or ones), which caused dag-combine to
push and pop the same node to the work list over and over again without
making any progress.
This commit fixes the bug by returning false early in optimizeLogicalImm
if the immediate is all zeros or ones. Also, it changes the code to
compare the immediate with 0 or Mask rather than calling
countPopulation.
rdar://problem/18231627
Differential Revision: https://reviews.llvm.org/D5591
llvm-svn: 301019
It seems that r300930 was creating an infinite loop in dag-combine when
compling the following file:
MultiSource/Benchmarks/MiBench/consumer-typeset/z21.c
llvm-svn: 300940
immediate operands.
This commit adds an AArch64 dag-combine that optimizes code generation
for logical instructions taking immediate operands. The optimization
uses demanded bits to change a logical instruction's immediate operand
so that the immediate can be folded into the immediate field of the
instruction.
This recommits r300913, which broke bots because I didn't fix a call to
ShrinkDemandedConstant in SIISelLowering.cpp after changing the APIs of
TargetLoweringOpt and TargetLowering.
rdar://problem/18231627
Differential Revision: https://reviews.llvm.org/D5591
llvm-svn: 300930
If a workgroup size is known to be not greater than wavefront size
the s_barrier instruction is not needed since all threads are guarantied
to come to the same point at the same time.
Differential Revision: https://reviews.llvm.org/D31731
llvm-svn: 299659
Add a new node to act as a fancy bitcast from f16 operations to
i32 that implicitly zero the high 16-bits of the result.
Alternatively could try making v2f16 legal and canonicalizing
on build_vectors.
llvm-svn: 299246
As we introduced target triple environment amdgiz and amdgizcl, the address
space values are no longer enums. We have to decide the value by target triple.
The basic idea is to use struct AMDGPUAS to represent address space values.
For address space values which are not depend on target triple, use static
const members, so that they don't occupy extra memory space and is equivalent
to a compile time constant.
Since the struct is lightweight and cheap, it can be created on the fly at
the point of usage. Or it can be added as member to a pass and created at
the beginning of the run* function.
Differential Revision: https://reviews.llvm.org/D31284
llvm-svn: 298846
This is used for a specific type of return to a shader part's
epilog code. Rename to try avoiding confusion from a true
call's return.
llvm-svn: 298452
Move backend internal intrinsics along with the rest of the
normal intrinsics, and use the Intrinsic::getDeclaration
API instead of manually constructing the type list.
It's surprising this was working before. fdiv.fast had
the wrong number of parameters. The control flow intrinsic
declaration attributes were not being applied, and
their types were inconsistent. The actual IR use types
did not match the declaration, and were closer to the
types used for the patterns. The brcond lowering
was changing the types, so introduce new nodes for those.
llvm-svn: 298119
Change implementation to use max instead of add.
min/max/med3 do not flush denormals regardless of the mode,
so it is OK to use it whether or not they are enabled.
Also allow using clamp with f16, and use knowledge
of dx10_clamp.
llvm-svn: 295788
Before frame offsets are calculated, try to eliminate the
frame indexes used by SGPR spills. Then we can delete them
after.
I think for now we can be sure that no other instruction
will be re-using the same frame indexes. It should be easy
to notice if this assumption ever breaks since everything
asserts if it tries to use a dead frame index later.
The unused emergency stack slot seems to still be left behind,
so an additional 4 bytes is still wasted.
llvm-svn: 295753
I think this is safe as long as no inputs are known to ever
be nans.
Also add an intrinsic for fmed3 to be able to handle all safe
math cases.
llvm-svn: 293598
Summary:
This lets you select which sort of spilling you want, either s[0:1] or 64-bit loads from s[0:1].
Patch By: Dave Airlie
Reviewers: nhaehnle, arsenm, tstellarAMD
Reviewed By: arsenm
Subscribers: mareko, llvm-commits, kzhuravl, wdng, yaxunl, tony-tye
Differential Revision: https://reviews.llvm.org/D25428
llvm-svn: 293000
Rename from addOperand to just add, to match the other method that has been
added to MachineInstrBuilder for adding more than just 1 operand.
See https://reviews.llvm.org/D28057 for the whole discussion.
Differential Revision: https://reviews.llvm.org/D28556
llvm-svn: 291891
For i16 zeroext arguments when i16 was a legal type, the
known bits information from the truncate was lost. Insert
a zeroext so the known bits optimizations work with the 32-bit
loads.
Fixes code quality regressions vs. SI in min.ll test.
llvm-svn: 291461
Summary:
Without a MachineMemOperand, the scheduler was assuming MIMG instructions
were ordered memory references, so no loads or stores could be reordered
across them.
Reviewers: arsenm
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, tony-tye
Differential Revision: https://reviews.llvm.org/D27536
llvm-svn: 290179
Not having this legal led to combine failures, resulting
in dumb things like bitcasts of constants not being folded
away.
The only reason I'm leaving the v_mov_b32 hack that f32
already uses is to avoid madak formation test regressions.
PeepholeOptimizer has an ordering issue where the immediate
fold attempt is into the sgpr->vgpr copy instead of the actual
use. Running it twice avoids that problem.
llvm-svn: 289096
Summary:
LC can currently select scalar load for uniform memory access
basing on readonly memory address space only. This restriction
originated from the fact that in HW prior to VI vector and scalar caches
are not coherent. With MemoryDependenceAnalysis we can check that the
memory location corresponding to the memory operand of the LOAD is not
clobbered along the all paths from the function entry.
Reviewers: rampitec, tstellarAMD, arsenm
Subscribers: wdng, arsenm, nhaehnle
Differential Revision: https://reviews.llvm.org/D26917
llvm-svn: 289076
Structure the definitions a bit more like the other classes.
The main change here is to split EXP with the done bit set
to a separate opcode, so we can set mayLoad = 1 so that it won't
be reordered before the other exp stores, since this has the special
constraint that if the done bit is set then this should be the last
exp in she shader.
Previously all exp instructions were inferred to have unmodeled
side effects.
llvm-svn: 288695
m0 may need to be written for spill code, so
we don't want general code uses relying on the
value stored in it.
This introduces a few code quality regressions where copies
from m0 are not coalesced into copies of a copy of m0.
llvm-svn: 287841
Summary:
Extend image intrinsics to support data types of V1F32 and V2F32.
TODO: we should define a mapping table to change the opcode for data type of V2F32 but just one channel is active,
even though such case should be very rare.
Reviewers:
tstellarAMD
Differential Revision:
http://reviews.llvm.org/D26472
llvm-svn: 286860
Summary:
The v_movreld machine instruction is used with three operands that are
in a sense tied to each other (the explicit VGPR_32 def and the implicit
VGPR_NN def and use). There is no way to express that using the currently
available operand bits, and indeed there are cases where the Two Address
instructions pass does the wrong thing.
This patch introduces a new set of pseudo instructions that are identical
in intended semantics as v_movreld, but they only have two tied operands.
Having to add a new set of pseudo instructions is admittedly annoying, but
it's a fairly straightforward and solid approach. The only alternative I
see is to try to teach the Two Address instructions pass about Three Address
instructions, and I'm afraid that's trickier and is going to end up more
fragile.
Note that v_movrels does not suffer from this problem, and so this patch
does not touch it.
This fixes several GL45-CTS.shaders.indexing.* tests.
Reviewers: tstellarAMD, arsenm
Subscribers: kzhuravl, wdng, yaxunl, llvm-commits, tony-tye
Differential Revision: https://reviews.llvm.org/D25633
llvm-svn: 284980
This will prevent following regression when enabling i16 support (D18049):
test/CodeGen/AMDGPU/cvt_f32_ubyte.ll
Differential Revision: https://reviews.llvm.org/D25805
llvm-svn: 284891
Summary:
If we are loading an i16 value from a 32-bit memory location, then
we need to be able to truncate the loaded value to i16.
Reviewers: arsenm
Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, tony-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D25198
llvm-svn: 284397
Because everything live is spilled at the end of a
block by fast regalloc, assume this will happen and
avoid the copies of the resource descriptor.
llvm-svn: 284119
This is the most basic handling of the indirect access
pseudos using GPR indexing mode. This currently only enables
the mode for a single v_mov_b32 and then disables it.
This is much more complicated to use than the movrel instructions,
so a new optimization pass is probably needed to fold the access
into the uses and keep the mode enabled for them.
llvm-svn: 284031