AsyncCopyOp lowering converted "size in elements" to "size in bytes"
assuming the element type size is at least one byte. This removes
that restriction, allowing for types such as i4 and b1 to be handled
correctly.
Differential Revision: https://reviews.llvm.org/D125838
This changes adds missing support for the i4 data type. Tests are added
to ensure proper lowering of an nvgpu.mma.sync operation targeting the
16x8x64xi4 and 16x8x32xi4 MMA variants in the NVVM dialect.
Differential Revision: https://reviews.llvm.org/D126092
Move async copy operations to NVGPU as they only exist on NV target and are
designed to match ptx semantic. This allows us to also add more fine grain
caching hint attribute to the op.
Add hint to bypass L1 and hook it up to NVVM op.
Differential Revision: https://reviews.llvm.org/D125244
Adds missing logic in the lowering from NvGPU to NVVM to support fp32
(in an accumulator operand) and tf32 (in multiplicand operand) types.
Fixes logic in one of the helper functions for converting the result
of a mma.sync operation with multiple 8x256bit output tiles, which is
the case for f32 outputs.
Differential Revision: https://reviews.llvm.org/D124533