Try out using combine definition rules.
This really should be a post-legalizer combine, but the combiner pass
is currently pre-legalize. Most of the target combines are really
post-legalize, so we should probably move the pass.
I believe this also fixes bugs with CI 32-bit handling, which was
incorrectly skipping offsets that look like signed 32-bit values. Also
validate the offsets are dword aligned before folding.
On targets that don't have the normal packed f16 layout, handle these
during legalization. Directly modify the register types. We can infer
this was a d16 load based on the mem operand size during selection.
A16 operands should possibly be handled here as well, but don't worry
about that yet.
This trivially avoids violating the constant bus restriction.
Previously this was allowing one SGPR in the first source
operand, which technically also avoided violating this for most
operations (but not for special cases reading vcc).
We do need to write some new, smarter operand folds to pick the
optimal SGPR to use in some kind of post-isel fold, but that's purely
an optimization.
I was originally thinking we would pick which operands should be SGPRs
in RegBankSelect, but I think this isn't really manageable. There
would be additional complexity to handle every G_* instruction, and
then any nontrivial instruction patterns would need to know when to
avoid violating it, which is likely to be very error prone.
I think having all inputs being canonically copies to VGPRs will
simplify the operand folding logic. The current folding we do is
backwards, and only considers one operand at a time, relative to
operands it already has. It therefore poorly handles the case where
there is already a constant bus operand user. If all operands are
copies, it's somewhat simpler to consider all input operands at once
to choose the optimal constant bus user.
Since the failure mode for constant bus violations is now a verifier
error and not an selection failure, this moves towards a place where
we can turn on the fallback mode. The SGPR copy folding optimizations
can be left for later.
For pow2 constants we should use G_SHL for pattern matching (and perf)
purposes later.
Vector support not yet implemented.
Differential Revision: https://reviews.llvm.org/D73659
Irritatingly the failure output is different in release vs. debug
because of the legality check is removed without asserts, so a register
ends up constrained only in release builds.
Fixes selection for scalar G_SMULH/G_UMULH. Also switches to using
tablegen selected add/sub, which switch to the signed version of the
opcode. This matches the current DAG behavior. We can't drop the
manual selection for add/sub yet, because it's still both for VALU
add/sub and for G_PTR_ADD.
Convert to the style most others use with one test instruction per
function, and use an implicit use to ensure the result register class
is constrained.
Change-Id: I6109148b0e3c80aa5535796a37abca583c19a936
This should be no problem to support with a pattern, but it turns out
there are just too many yaks to shave. The main problem is in the DAG
emitter, which I have no desire to sink effort into fixing.
If we had a bit to disable patterns in the DAG importer, fixing the
GlobalISelEmitter is more manageable.
Trivial type predicates should be moved into the tablegen pattern
itself, and not checked inside complex patterns. This eliminates a
redundant complex pattern, and fixes select source modifiers for
GlobalISel.
I have further patches which fully handle select in tablegen and
remove all of the C++ selection, although it requires the ugliness to
support the entire range of legal register types.
Use intermediate instructions, unlike with buffer stores. This is
necessary because of the need to have an internal way to distinguish
between signed and unsigned extloads. This introduces some duplication
and near duplication with the buffer store selection path. The store
handling should maybe be moved into legalization to match and
eliminate the duplication.
Try to keep simple v2s16 cases as-is. This will more naturally map to
how the VOP3P op_sel modifiers work compared to the expansion
involving bitcasts and bitshifts.
This could maybe try harder with wider source vector types, although
that could be handled with a pre-legalize combine.
For some reason the flat/global atomics end up in the generated
matcher table in a different order from SelectionDAG. Use
AddedComplexity to prefer checking for global atomics first.
We are relying on atrificial DAG edges inserted by the
MemOpClusterMutation to keep loads and stores together in the
post-RA scheduler. This does not work all the time since it
allows to schedule a completely independent instruction in the
middle of the cluster.
Removed the DAG mutation and added pass to bundle already
clustered instructions. These bundles are unpacked before the
memory legalizer because it does not work with bundles but also
because it allows to insert waitcounts in the middle of a store
cluster.
Removing artificial edges also allows a more relaxed scheduling.
Differential Revision: https://reviews.llvm.org/D72737
The other 3-op patterns should also be theoretically handled, but
currently there's a bug in the inferred pattern complexity.
I'm not sure what the error handling strategy should be for potential
constant bus violations. I think the correct strategy is to never
produce mixed SGPR and VGPR operands in a typical VOP instruction,
which will trivially avoid them. However, it's possible to still have
hand written MIR (or erroneously transformed code) with these
operands. When these fold, the restriction will be violated. We
currently don't have any verifiers for reg bank legality. For now,
just ignore the restriction.
It might be worth triggering a DAG fallback on verifier error.
The pattern is also mishandled by the generated matcher, so workaround
this as in the DAG path.
The existing DAG tests aren't particularly targeted to just this one
intrinsic. These also end up differing in scheduling from SGPR->VGPR
operand constraint copies.