When adding operands to machine instructions in case of
RegisterSDNodes, generate a COPY node in case the register class
does not match the one in the instruction definition.
Differental Revision: https://reviews.llvm.org/D35561
llvm-svn: 324733
Note: This is a candidate for LLVM 6.0, because it was planned to be
in that release but was delayed due to a long review period.
Merge conflict in release_60 - resolution:
Add "-p6:32:32" into the second (non-amdgiz) string.
Only scalar loads support 32-bit pointers. An address in a VGPR will
fail to compile. That's OK because the results of loads will only be used
in places where VGPRs are forbidden.
Updated AMDGPUAliasAnalysis and used SReg_64_XEXEC.
The tests cover all uses cases we need for Mesa.
Reviewers: arsenm, nhaehnle
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D41651
llvm-svn: 324487
Summary:
I checked the AMD closed source compiler and the workaround is only
needed when x3 is emulated as x4, which we don't do in LLVM.
SMEM x3 opcodes don't exist, and instead there is a possibility to use x4
with the last component being unused. If the last component is out of
buffer bounds and falls on the next 4K page, the hw hangs.
Reviewers: arsenm, nhaehnle
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D42756
llvm-svn: 324486
1. Run the memory legalizer prior to the waitcnt pass; keep the policy that the waitcnt pass does not remove any waitcnts within the incoming IR.
2. The waitcnt pass doesn't (yet) track waitcnts that exist prior to the waitcnt pass (it just skips over them); because the waitcnt pass is ignorant of them, it may insert a redundant waitcnt. To avoid this, check the prev instr. If it and the to-be-inserted waitcnt are the same, suppress the insertion. We keep the existing waitcnt under the assumption that whomever, e.g., the memory legalizer, inserted it knows what they were doing.
3. Follow-on work: teach the waitcnt pass to record the pre-existing waitcnts for better waitcnt production.
Differential Revision: https://reviews.llvm.org/D42854
llvm-svn: 324440
X86 currently has a late DAG combine after cttz/ctlz are turned into BSR+BSF+CMOV to detect this and remove the CMOV. But we should be able to do this much earlier and avoid creating the cmov all together.
For the changed AMDGPU test case it appears that previously the i8 cttz was type legalized to i16 which introduced an OR with 256 in order to limit the result to 8 on the widened type. At this point the result is known to never be zero, but nothing checked that. Then operation legalization is told to promote all i16 cttz to i32. This introduces an extend and a truncate and another OR with 65536 to limit the result to 16. With the DAG combiner change we are able to prevent the creation of the second OR since the opcode will have been changed to cttz_zero_undef after the first OR. I the lack of the OR caused the instruction to change to v_ffbl_b32_sdwa
Differential Revision: https://reviews.llvm.org/D42985
llvm-svn: 324427
Summary:
Now we generate PAL metadata for the amdpal os type, there is no need to
generate the .AMDGPU.config section.
Reviewers: arsenm, nhaehnle, dstuttard
Subscribers: kzhuravl, wdng, yaxunl, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D37760
Change-Id: I303c5fad66656ce97293da60621afac6595b4c18
llvm-svn: 324346
Summary:
This change extends MachineCopyPropagation to do COPY source forwarding
and adds an additional run of the pass to the default pass pipeline just
after register allocation.
This version of this patch uses the newly added
MachineOperand::isRenamable bit to avoid forwarding registers is such a
way as to violate constraints that aren't captured in the
Machine IR (e.g. ABI or ISA constraints).
This change is a continuation of the work started in D30751.
Reviewers: qcolombet, javed.absar, MatzeB, jonpa, tstellar
Subscribers: tpr, mgorny, mcrosier, nhaehnle, nemanjai, jyknight, hfinkel, arsenm, inouehrs, eraman, sdardis, guyblank, fedor.sergeev, aheejin, dschuff, jfb, myatsina, llvm-commits
Differential Revision: https://reviews.llvm.org/D41835
llvm-svn: 323991
Discussed here:
http://lists.llvm.org/pipermail/llvm-dev/2018-January/120320.html
In preparation for adding support for named vregs we are changing the sigil for
physical registers in MIR to '$' from '%'. This will prevent name clashes of
named physical register with named vregs.
llvm-svn: 323922
Sometimes users do not specify data layout in LLVM assembly and let llc set the
data layout by target triple after loading the LLVM assembly.
Currently the parser checks alloca address space no matter whether the LLVM
assembly contains data layout definition, which causes false alarm since the
default data layout does not contain the correct alloca address space.
The parser also calls verifier to check debug info and updating invalid debug
info. Currently there is no way to let the verifier to check debug info only.
If the verifier finds non-debug-info issues the parser will fail.
For llc, the fix is to remove the check of alloca addr space in the parser and
disable updating debug info, and defer the updating of debug info and
verification to be after setting data layout of the IR by target.
For other llvm tools, since they do not override data layout by target but
instead can override data layout by a command line option, an argument for
overriding data layout is added to the parser. In cases where data layout
overriding is necessary for the parser, the data layout can be provided by
command line.
Differential Revision: https://reviews.llvm.org/D41832
llvm-svn: 323826
Mark more opcodes as hasExtraSrcRegAllocReq so that their operands will
be marked as not renamable, to avoid copy forwarding violating the
constraint that only one operand may use the constant bus.
These changes fix a few mis-compiles when copy forwarding is enabled in
MachineCopyPropagation by D41835 (and were reviewed as part of that change).
llvm-svn: 323794
-amdgpu-waitcnt-forcezero={1|0} Force all waitcnt instrs to be emitted as s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-amdgpu-waitcnt-forceexp=<n> Force emit a s_waitcnt expcnt(0) before the first <n> instrs
-amdgpu-waitcnt-forcelgkm=<n> Force emit a s_waitcnt lgkmcnt(0) before the first <n> instrs
-amdgpu-waitcnt-forcevm=<n> Force emit a s_waitcnt vmcnt(0) before the first <n> instrs
This patch was pushed ( abb190fd51cd2f9a9eef08c024e109f7f7e909fc ), which caused a buildbot failure, reverted ( 6227480d74da507cf8e1b4bcaffbdb9fb875b4b8 ), and then updated to fix buildbot failures (this patch).
Differential Revision: https://reviews.llvm.org/D40091
llvm-svn: 323788
Patch by: Bas Nieuwenhuizen
Just use the _e64 variant if needed. This should be possible as per
def : Pat <
(int_amdgcn_kill (i1 (setcc f32:$src, InlineFPImm<f32>:$imm, cond:$cond))),
(SI_KILL_F32_COND_IMM_PSEUDO $src, (bitcast_fpimm_to_i32 $imm), (cond_as_i32imm $cond))
> ;
I don't think we can get an immediate for the other operand for which we
need the second 32-bit word.
https://reviews.llvm.org/D42302
llvm-svn: 323706
Add support for printing / parsing the addrspace of a MachineMemOperand.
Fixes PR35970.
Differential Revision: https://reviews.llvm.org/D42502
llvm-svn: 323521
- using qualified pointer addrspace in intrinsics class to avoid .f32 mangling
- changed too common atomic mangling to ds
- added missing intrinsics to AMDGPUTTIImpl::getTgtMemIntrinsic
Reviewed by: b-sumner
Differential Revision: https://reviews.llvm.org/D42383
llvm-svn: 323516
It causes regressions in various OpenGL test suites.
Keep the test cases introduced by r321751 as XFAIL, and add a test case
for the regression.
Change-Id: I90b4cc354f68cebe5fcef1f2422dc8fe1c6d3514
Bugzilla: https://bugs.llvm.org/show_bug.cgi?id=36015
llvm-svn: 323355
For the included test case, the DAG transformation
concat_vectors(scalar, undef) -> scalar_to_vector(sclr)
would attempt to create a v2i32 vector for a v9i8
concat_vector. Bail out to avoid creating a bitcast with
mismatching sizes later on.
Differential Revision: https://reviews.llvm.org/D42379
llvm-svn: 323312
Fix a bug in ScheduleDAGMILive::scheduleMI which causes BotRPTracker not tracking CurrentBottom in some rare cases involving llvm.dbg.value.
This issues causes amdgcn target to assert when compiling some user codes with -g.
Differential Revision: https://reviews.llvm.org/D42394
llvm-svn: 323214
- Change inserted add ( V_ADD_{I|U}32_e32 ) to _e64 version ( V_ADD_{I|U}32_e64 ) so that the add uses a vreg for the carry; this prevents inserted v_add from killing VCC; the _e64 version doesn't accept a literal in its encoding, so we need to introduce a mov instr as well to get the imm into a register.
- Change pass name to "SI Load Store Optimizer"; this removes the '/', which complicates scripts.
Differential Revision: https://reviews.llvm.org/D42124
llvm-svn: 323153
Summary:
This patch implements d16 support for image load, image store and image sample intrinsics.
Reviewers:
Matt, Brian.
Differential Revision:
https://reviews.llvm.org/D3991
llvm-svn: 322903
Summary:
A recent change
321556: AMDGPU: Remove mayLoad/hasSideEffects from MIMG stores
can allow the machine instruction scheduler to move an image store past
an image load using the same descriptor.
V2: Fixed by marking image ops as mayAlias and isAliased. This may be
overly conservative, and we may need to revisit.
V3: Reverted test change done on 321556.
Reviewers: arsenm, nhaehnle, dstuttard
Subscribers: llvm-commits, t-tye, yaxunl, wdng, kzhuravl
Differential Revision: https://reviews.llvm.org/D41969
llvm-svn: 322419
While updating clang tests for having clang set dso_local I noticed
that:
- There are *a lot* of tests to update.
- Many of the updates are redundant.
They are redundant because a GV is "obviously dso_local". This patch
starts formalizing that a bit by requiring that internal and private
GVs be dso_local too. Since they all are, we don't have to print
dso_local to the textual representation, making it a bit more compact
and easier to read.
llvm-svn: 322317
Planning to add support for named vregs. This puts is in a conundrum since
physregs are named as well. To rectify this we need to use a sigil other than
'%' for physregs in MIR. We've settled on using '$' for physregs but first we
must repurpose it from external symbols using it, which is what this commit is
all about. We think '&' will have familiar semantics for C/C++ users.
llvm-svn: 322146
Summary:
In the case of an fp_extend of v1f16 to v1f32 where the v1f16 is the
result of a bitcast from i16, avoid creating an illegal fp16_to_fp where
the input is not a vector and the result is a v1f32.
V2: The fix is now to avoid vector scalarization creating a v1->scalar
bitcast.
Reviewers: srhines, t.p.northover
Subscribers: nhaehnle, llvm-commits, dstuttard, t-tye, yaxunl, wdng, kzhuravl, arsenm
Differential Revision: https://reviews.llvm.org/D41126
llvm-svn: 322120
Summary:
I had a case where multiple nested uniform ifs resulted in code that did
v_cmp comparisons, combining the results with s_and_b64, s_or_b64 and
s_xor_b64 and using the resulting mask in s_cbranch_vccnz, without first
ensuring that bits for inactive lanes were clear.
There was already code for inserting an "s_and_b64 vcc, exec, vcc" to
clear bits for inactive lanes in the case that the branch is instruction
selected as s_cbranch_scc1 and is then changed to s_cbranch_vccnz in
SIFixSGPRCopies. I have added the same code into SILowerControlFlow for
the case that the branch is instruction selected as s_cbranch_vccnz.
This de-optimizes the code in some cases where the s_and is not needed,
because vcc is the result of a v_cmp, or multiple v_cmp instructions
combined by s_and/s_or. We should add a pass to re-optimize those cases.
Reviewers: arsenm, kzhuravl
Subscribers: wdng, yaxunl, t-tye, llvm-commits, dstuttard, timcorringham, nhaehnle
Differential Revision: https://reviews.llvm.org/D41292
llvm-svn: 322119
Since register classes and banks are already printed with the register
definition, don't print it at the end of every instruction anymore.
This follows MIR in this regard and is another step to the unification
of the two formats.
llvm-svn: 322086
The work order was changed in r228186 from SCC order
to RPO with an arbitrary sorting function. The sorting
function attempted to move inner loop nodes earlier. This
was was apparently relying on an assumption that every block
in a given loop / the same loop depth would be seen before
visiting another loop. In the broken testcase, a block
outside of the loop was encountered before moving onto
another block in the same loop. The testcase would then
structurize such that one blocks unconditional successor
could never be reached.
Revert to plain RPO for the analysis phase. This fixes
detecting edges as backedges that aren't really.
The processing phase does use another visited set, and
I'm unclear on whether the order there is as important.
An arbitrary order doesn't work, and triggers some infinite
loops. The reversed RPO list seems to work and is closer
to the order that was used before, minus the arbitary
custom sorting.
A few of the changed tests now produce smaller code,
and a few are slightly worse looking.
llvm-svn: 321751