Commit Graph

55651 Commits

Author SHA1 Message Date
Stanislav Mekhanoshin 53eb0f8c07 [AMDGPU] Attempt to reschedule withou clustering
We want to have more load/store clustering but we also want
to maintain low register pressure which are oposit targets.
Allow scheduler to reschedule regions without mutations
applied if we hit a register limit.

Differential Revision: https://reviews.llvm.org/D73386
2020-01-27 10:27:16 -08:00
Matt Arsenault 97711228fd AMDGPU/GlobalISel: Select llvm.amdgcn.struct.buffer.load.format 2020-01-27 13:23:35 -05:00
Matt Arsenault ce7ca2caf2 AMDGPU/GlobalISel: Select llvm.amdgcn.struct.buffer.load 2020-01-27 13:05:55 -05:00
Matt Arsenault 198624c39d AMDGPU/GlobalISel: Select llvm.amdgcn.raw.buffer.load.format 2020-01-27 13:02:19 -05:00
Matt Arsenault fc90222a91 AMDGPU/GlobalISel: Select llvm.amdgcn.raw.buffer.load
Use intermediate instructions, unlike with buffer stores. This is
necessary because of the need to have an internal way to distinguish
between signed and unsigned extloads. This introduces some duplication
and near duplication with the buffer store selection path. The store
handling should maybe be moved into legalization to match and
eliminate the duplication.
2020-01-27 12:49:23 -05:00
Matt Arsenault e60d658260 AMDGPU/GlobalISel: Handle VOP3NoMods 2020-01-27 09:03:44 -08:00
Matt Arsenault 0968234590 AMDGPU/GlobalISel: Minor refactor of MUBUF complex patterns
This will make it easier to support the small variants in the complex
patterns for atomics.
2020-01-27 09:00:00 -08:00
Matt Arsenault bef27175c7 AMDGPU: Fix not using f16 fsin/fcos
I noticed this because this accidentally started working for
GlobalISel.
2020-01-27 08:59:59 -08:00
Simon Pilgrim 2d5e281b0f [X86][AVX] Add a more aggressive SimplifyMultipleUseDemandedBits to simplify masked store masks.
Fixes a poor codegen issue noticed in PR11210.
2020-01-27 16:44:25 +00:00
Matt Arsenault a1d33ce73a AMDGPU/GlobalISel: Custom legalize v2s16 G_SHUFFLE_VECTOR
Try to keep simple v2s16 cases as-is. This will more naturally map to
how the VOP3P op_sel modifiers work compared to the expansion
involving bitcasts and bitshifts.

This could maybe try harder with wider source vector types, although
that could be handled with a pre-legalize combine.
2020-01-27 08:28:05 -08:00
Matt Arsenault 4e69df091d Revert "AMDGPU: Temporary drop s_mul_hi_i/u32 patterns"
This reverts commit fe23ed2c68.

It was never really clear this was responsible for the performance
regressions that caused this to be reverted. It's been a long time,
and we need to have scalar patterns for this to get GlobalISel
working.
2020-01-27 08:07:21 -08:00
Matt Arsenault bc3d900fa5 AMDGPU/GlobalISel: Fix not using global atomics on gfx9+
For some reason the flat/global atomics end up in the generated
matcher table in a different order from SelectionDAG. Use
AddedComplexity to prefer checking for global atomics first.
2020-01-27 07:42:42 -08:00
Matt Arsenault ac0b9b4ccf AMDPGPU/GlobalISel: Select more MUBUF global addressing modes
The handling of the high bits of the resource descriptor seem weird to
me, where the 3rd dword changes based on the instruction.
2020-01-27 07:28:36 -08:00
Matt Arsenault fdaad485e6 AMDGPU/GlobalISel: Initial selection of MUBUF addr64 load/store
Fixes the main reason for compile failures on SI, but doesn't really
try to use the addressing modes yet.
2020-01-27 07:13:56 -08:00
Matt Arsenault 2214bc81d0 AMDGPU: Allow i16 shader arguments
Not allowing this just creates unnecessary complications when writing
simple tests.
2020-01-27 06:55:32 -08:00
Jay Foad 1bf00219fc [AMDGPU] Handle multiple base operands in areMemAccessesTriviallyDisjoint
Summary:
This is in preparation for getMemOperandsWithOffset returning more base
operands.

Depends on D73455.

Reviewers: arsenm, rampitec

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73456
2020-01-27 14:45:21 +00:00
Jay Foad 6461eadf8f [AMDGPU] Handle multiple base operands in shouldClusterMemOps
Summary:
This is in preparation for getMemOperandsWithOffset returning more base
operands.

Depends on D73454.

Reviewers: arsenm, rampitec

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73455
2020-01-27 14:45:21 +00:00
Jay Foad fcf5254fa7 [AMDGPU] Handle frame index base operands in memOpsHaveSameBasePtr
Summary:
This is in preparation for getMemOperandsWithOffset returning more base
operands.

Reviewers: arsenm, rampitec

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, arphaman, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73454
2020-01-27 14:45:21 +00:00
vpykhtin 4332f1a4c8 [AMDGPU] Fix GCN regpressure trackers for INLINEASM instructions.
Differential revision: https://reviews.llvm.org/D73338
2020-01-27 17:25:25 +03:00
Matt Arsenault 2a160ba5b0 GlobalISel: Reimplement widenScalar for G_UNMERGE_VALUES results
Only use shifts if the requested type exactly matches the source type,
and create sub-unmerges otherwise.
2020-01-27 06:18:26 -08:00
David Green 8a6b948eb5 [MVE] Fixup order of gather writeback intrinsic outputs
The MVE_VLDRWU32_qi_pre gather loads, like the other _pre/_post mve
loads returns the writeback as result 0, the value as result 1. The llvm
ir intrinsic seems to have this the other way around though, and so when
lowering from one to the other we need to switch the first two outputs.

I've also fixed up the types of _pre/_post on normal MVE loads. There we
were already getting the values the right way around, just not for the
types. I don't believe this was causing anything to go wrong, but it was
very confusing to read in the debug output.

Differential Revision: https://reviews.llvm.org/D73370
2020-01-27 14:08:06 +00:00
Sjoerd Meijer b567ff2fa0 [ARM][MVE] Tail-predication: support constant trip count
We had support for runtime trip count values, but not constants, and this adds
supports for that.

And added a minor optimisation while I was add it: don't invoke Cleanup when
there's nothing to clean up.

Differential Revision: https://reviews.llvm.org/D73198
2020-01-27 11:05:26 +00:00
Sam Parker 6c2df5d14f [ARM][LowOverheadLoops] Dont ignore VCTP
When expanding the LoopStart, we try to remove the iteration count
calculation. However, if part of the calculation was also used to
calculate the number of elements we could end up deleting
instructions that were required to feed DLSTP/WLSTP.

Differential Revision: https://reviews.llvm.org/D73275
2020-01-27 10:59:12 +00:00
Petar Avramovic cbf03aee6d [MIPS GlobalISel] Select population count (popcount)
G_CTPOP is generated from llvm.ctpop.<type> intrinsics, clang generates
these intrinsics from __builtin_popcount and __builtin_popcountll.
Add lower and narrow scalar for G_CTPOP.
Lower G_CTPOP for MIPS32.

Differential Revision: https://reviews.llvm.org/D73216
2020-01-27 09:59:50 +01:00
Petar Avramovic 8bc7ba5b9e [MIPS GlobalISel] Select count trailing zeros
llvm.cttz.<type> intrinsic has additional i1 argument is_zero_undef,
it tells whether zero as the first argument produces a defined result.
G_CTTZ is generated from llvm.cttz.<type> (<type> <src>, i1 false)
intrinsics, clang generates these intrinsics from __builtin_ctz and
__builtin_ctzll.
G_CTTZ_ZERO_UNDEF comes from llvm.cttz.<type> (<type> <src>, i1 true).
Clang generates such intrinsics as parts of expansion of builtin_ffs
and builtin_ffsll. It is also traditionally part of and many
algorithms that are now predicated on avoiding zero-value inputs.

Add narrow scalar (algorithm uses G_CTTZ_ZERO_UNDEF) for G_CTTZ.
Lower G_CTTZ and G_CTTZ_ZERO_UNDEF for MIPS32.

Differential Revision: https://reviews.llvm.org/D73215
2020-01-27 09:51:06 +01:00
Petar Avramovic 2b66d32f3f [MIPS GlobalISel] Select count leading zeros
llvm.ctlz.<type> intrinsic has additional i1 argument is_zero_undef,
it tells whether zero as the first argument produces a defined result.
MIPS clz instruction returns 32 for zero input.
G_CTLZ is generated from llvm.ctlz.<type> (<type> <src>, i1 false)
intrinsics, clang generates these intrinsics from __builtin_clz and
__builtin_clzll.
G_CTLZ_ZERO_UNDEF can also be generated from llvm.ctlz with true as
second argument. It is also traditionally part of and many algorithms
that are now predicated on avoiding zero-value inputs.

Add narrow scalar for G_CTLZ (algorithm uses G_CTLZ_ZERO_UNDEF).
Lower G_CTLZ_ZERO_UNDEF and select G_CTLZ for MIPS32.

Differential Revision: https://reviews.llvm.org/D73214
2020-01-27 09:43:38 +01:00
Roman Lebedev 76fcf900d5
[X86][BdVer2] Polish LEA instruction scheduling info
Based on exhaustive llvm-exegesis measurements.
There may still be some imperfections for LEA16r/LEA32r.

Much like was observed in D68646, i'm also measuring some outliers
with some specific registers.
2020-01-26 22:17:27 +03:00
Simon Pilgrim fa19d67a2a [X86][AVX] Extend combineCommutableSHUFP to handle v8f32 and v16f32 commutable shufps patterns 2020-01-26 19:04:12 +00:00
Simon Pilgrim 1a81b296cd [X86][SSE] combineCommutableSHUFP - permilps(shufps(load(),x)) --> permilps(shufps(x,load()))
Pull out combineTargetShuffle code added in rG3fd5d1c6e7db into a helper function and extend it to handle shufps(shufps(load(),x),y) and shufps(y,shufps(load(),x)) cases as well.
2020-01-26 14:36:23 +00:00
Maheaha Shivamallappa 66f93071cd AMDGPU/GlobalISel: Clean-up code around ISel for Intrinsics.
Summary:
A minor code clean-up around ISel for intrinsic llvm.amdgcn.end.cf()

Reviewers: arsenm, mshivama

Reviewed By: arsenm

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73358
2020-01-26 14:09:31 +05:30
Craig Topper 3fdd435a4b [X86] Use a macro to convert X86ISD names to strings in getTargetNodeName.
Every case in the switch had a string version of themselves. Two
of them had a typo that used : instead of ::

By using a macro we can automate the string creation and avoid
the possibility of typos like this.

This is similar to what is done on the AMDGPU target.
2020-01-25 18:27:29 -08:00
Tom Stellard cb297050bb AMDGPU/SILoadStoreOptimizer: Fix uninitialized variable error
This was introduced by 86c944d790 and
caught by the sanitizer-x86_64-linux-fast bot.
2020-01-24 21:53:05 -08:00
Tom Stellard 86c944d790 AMDGPU/SILoadStoreOptimizer: Improve merging of out of order offsets
Summary:
This improves merging of sequences like:

store a, ptr + 4
store b, ptr + 8
store c, ptr + 12
store d, ptr + 16
store e, ptr + 20
store f, ptr

Prior to this patch the basic block was scanned in order to find instructions
to merge and the above sequence would be transformed to:

store4 <a, b, c, d>, ptr + 4
store e, ptr + 20
store r, ptr

With this change, we now sort all the candidate merge instructions by their offset,
so instructions are visited in offset order rather than in the order they appear
in the basic block.  We now transform this sequnce into:

store4 <f, a, b, c>, ptr
store2 <d, e>, ptr + 16

Another benefit of this change is that since we have sorted the mergeable lists
by offset, we can easily check if an instruction is mergeable by checking the
offset of the instruction that becomes before or after it in the sorted list.
Once we determine an instruction is not mergeable we can remove it from the list
and avoid having to do the more expensive mergeablilty checks.

Reviewers: arsenm, pendingchaos, rampitec, nhaehnle, vpykhtin

Reviewed By: arsenm, nhaehnle

Subscribers: kerbowa, merge_guards_bot, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D65966
2020-01-24 19:45:56 -08:00
@justice_adams (Justice Adams) daee63f974 [SelectionDag] Updated FoldConstantArithmetic method signature in preparation for merge with FoldConstantVectorArithmetic
Updated FoldConstantArithmetic method signature to match that of
FoldConstantVectorArithmetic in preparation for merging the two
functions together

https://bugs.llvm.org/show_bug.cgi?id=36544

This is the first step in combining the various
FoldConstantVectorArithmetic and FoldConstantVectorArithmetic
functions into one FoldConstantArithmetic function.

Differential Revision: https://reviews.llvm.org/D72870
2020-01-24 18:00:58 -05:00
Craig Topper 2c1decc040 [X86] Break the loop in LowerReturn into 2 loops. NFCI
I believe for STRICT_FP I need to use a STRICT_FP_EXTEND for the extending to f80 for returning f32/f64 in 32-bit mode when SSE is enabled. The STRICT_FP_EXTEND node requires a Chain. I need to get that node onto the chain before any CopyToRegs are emitted. This is because all the CopyToRegs are glued and chained together. So I can't put a STRICT_FP_EXTEND on the chain between the glued nodes without also glueing the STRICT_ FP_EXTEND.

This patch moves all the extend creation to a first pass and then creates the copytoregs and fills out RetOps in a second pass.

Differential Revision: https://reviews.llvm.org/D72665
2020-01-24 14:44:38 -08:00
Roman Lebedev 70cbf8c71c
[X86] Make `llc --help` output readable again
Long `cl::value_desc()` is added right after the flag name,
before `cl::desc()` column. And thus the `cl::desc()` column,
for all flags, is padded to the right,
which makes the output unreadable.
2020-01-25 01:43:52 +03:00
Heejin Ahn 65eb11306e [WebAssembly] Update bleeding-edge CPU features
Summary:
This adds bulk memory and tail call to "bleeding-edge" CPU, since their
implementation in LLVM/clang seems mostly complete.

Reviewers: tlively

Subscribers: dschuff, sbc100, jgravelle-google, sunfish, cfe-commits

Tags: #clang

Differential Revision: https://reviews.llvm.org/D73322
2020-01-24 14:27:35 -08:00
Heejin Ahn 764f4089e8 [WebAssembly] Add reference types target feature
Summary:
This adds the reference types target feature. This does not enable any
more functionality in LLVM/clang for now, but this is necessary to embed
the info in the target features section, which is used by Binaryen and
Emscripten. It turned out that after D69832 `-fwasm-exceptions` crashed
because we didn't have the reference types target feature.

Reviewers: tlively

Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D73320
2020-01-24 14:26:27 -08:00
Matt Arsenault 3b93945587 AMDGPU/GlobalISel: Select wqm, softwqm and wwm intrinsics 2020-01-24 13:06:44 -08:00
Matt Arsenault 87c46a3129 AMDGPU: Don't error on ds.ordered intrinsic in function
These should be assumed to be called from a compute context. Also
don't use a 2 entry switch over constants.
2020-01-24 13:06:44 -08:00
Stanislav Mekhanoshin be8e38cbd9 Correct NumLoads in clustering
Scheduler sends NumLoads argument into shouldClusterMemOps()
one less the actual cluster length. So for 2 instructions
it will pass just 1. Correct this number.

This is NFC for in tree targets.

Differential Revision: https://reviews.llvm.org/D73292
2020-01-24 12:45:28 -08:00
Matt Arsenault 84e035d8f1 AMDGPU: Don't check constant address space for atomic stores
We define a separate list for storable address spaces. This saves
entry in the matcher table address space list.
2020-01-24 12:15:09 -08:00
Stanislav Mekhanoshin 555d8f4ef5 [AMDGPU] Bundle loads before post-RA scheduler
We are relying on atrificial DAG edges inserted by the
MemOpClusterMutation to keep loads and stores together in the
post-RA scheduler. This does not work all the time since it
allows to schedule a completely independent instruction in the
middle of the cluster.

Removed the DAG mutation and added pass to bundle already
clustered instructions. These bundles are unpacked before the
memory legalizer because it does not work with bundles but also
because it allows to insert waitcounts in the middle of a store
cluster.

Removing artificial edges also allows a more relaxed scheduling.

Differential Revision: https://reviews.llvm.org/D72737
2020-01-24 11:33:38 -08:00
Stanislav Mekhanoshin 44b865fa7f [AMDGPU] Allow narrowing muti-dword loads
Currently BE allows only a little load narrowing because
of the fear it will produce sub-dword ext loads. However,
we can always allow narrowing if we are shrinking one
multi-dword load to another multi-dword load.

In particular we were unable to reduce s_load_dwordx8 into
s_load_dwordx4 if identity shuffle was used to extract
low 4 dwords.

Differential Revision: https://reviews.llvm.org/D73133
2020-01-24 11:03:41 -08:00
Austin Kerbow c226646337 Resubmit: [DA][TTI][AMDGPU] Add option to select GPUDA with TTI
Summary:
Enable the new diveregence analysis by default for AMDGPU.

Resubmit with test updates since GPUDA was causing failures on Windows.

Reviewers: rampitec, nhaehnle, arsenm, thakis

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73315
2020-01-24 10:39:40 -08:00
Yuta Saito c5bd3d0726 Support Swift calling convention for WebAssembly targets
This adds basic support for the Swift calling convention with WebAssembly
targets.

Reviewed By: dschuff

Differential Revision: https://reviews.llvm.org/D71823
2020-01-24 10:30:46 -08:00
David Green b535aa405a [ARM] Use reduction intrinsics for larger than legal reductions
The codegen for splitting a llvm.vector.reduction intrinsic into parts
will be better than the codegen for the generic reductions. This will
only directly effect when vectorization factors are specified by the
user.

Also added tests to make sure the codegen for larger reductions is OK.

Differential Revision: https://reviews.llvm.org/D72257
2020-01-24 17:07:24 +00:00
Kazushi (Jam) Marukawa 0fca35c652 [VE] global variable isel patterns
Summary: Asm expr fixups, isel patterns and tests for global variables addresses.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D73355
2020-01-24 17:35:14 +01:00
Simon Pilgrim 3fd5d1c6e7 [X86][SSE] combineTargetShuffle - permilps(shufps(load(),x)) --> permilps(shufps(x,load()))
Moves lowerShuffleWithSHUFPS commutation code from rG30fcd29fe479 to catch cases during combine
2020-01-24 15:23:20 +00:00
Kazushi (Jam) Marukawa 08ebd8c79e [VE] aligned load/store isel patterns
Summary:
Aligned load/store isel patterns and tests for
i1/i8/16/32/64 (including extension and truncation) and fp32/64.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D73276
2020-01-24 15:16:54 +01:00