PTX ISA spec, s9.7.8.6. Data Movement and Conversion Instructions:
shfl.sync
PTX ISA Notes
Introduced in PTX ISA version 6.0.
Target ISA Notes
Requires sm_30 or higher.
Differential Revision: https://reviews.llvm.org/D123039
PTX ISA spec, s9.7.12.4. Parallel Synchronization and Communication
Instructions: atom
Target ISA Notes
64-bit atom.{and,or,xor,min,max} require sm_32 or higher.
Differential Revision: https://reviews.llvm.org/D123038
NVVM IR specification defines them with i32 return type:
declare i32 @llvm.nvvm.match.any.sync.i64(i32 %membermask, i64 %value)
declare {i32, i1} @llvm.nvvm.match.all.sync.i64(i32 %membermask, i64 %value)
...
The i32 return value is a 32-bit mask where bit position in mask corresponds
to thread’s laneid.
as well as PTX ISA:
9.7.12.8. Parallel Synchronization and Communication Instructions: match.sync
match.any.sync.type d, a, membermask;
match.all.sync.type d[|p], a, membermask;
...
Destination d is a 32-bit mask where bit position in mask corresponds
to thread’s laneid.
Additionally, ptxas doesn't accept intructions, produced by NVPTX backend.
After this patch, it compiles with no issues.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D120499
his patch adds builtins and intrinsics for the f16 and f16x2 variants of the ex2
instruction.
These two variants were added in PTX7.0, and are supported by sm_75 and above.
Note that this isn't wired with the exp2 llvm intrinsic because the ex2
instruction is only available in its approx variant.
Running ptxas on the assembly generated by the test f16-ex2.ll works as
expected.
Differential Revision: https://reviews.llvm.org/D119157
Texture/sampler/surface operands can be either a register or an
immediate (an index of .texref, .samplerref or .surfref).
TableGen declarations for these instructions used to only have
Int64Regs operands, so this caused issues when machine verifier
is turned on:
*** Bad machine code: Expected a register operand. ***
- function: bar
- basic block: %bb.0 (0x55b144d99ab8)
- instruction: %4:int32regs = SULD_1D_I32_TRAP 0, killed %2:int32regs
- operand 1: 0
The solution is to duplicate these instructions for all possible
operand types (i16imm and Int64Regs). Since this would
essentially double the amount code in TableGen, the patch also
does some refactoring for the original instructions to keep
things manageable.
Differential Revision: https://reviews.llvm.org/D112232
Also, amend constraints for non-sync variants that are no longer
available on sm_70+ with PTX6.4+.
Differential Revision: https://reviews.llvm.org/D68892
llvm-svn: 374790
All of the new instructions are still handled mostly by tablegen. I've slightly
refactored the code to drive intrinsic/instruction generation from a master
list of supported variants, so all irregularities have to be implemented in one place only.
The test generation script wmma.py has been refactored in a similar way.
Differential Revision: https://reviews.llvm.org/D60015
llvm-svn: 359247
PTX 6.3 requires using ".aligned" in the MMA instruction names.
In order to generate correct name, now we pass current
PTX version to each instruction as an extra constant operand
and InstPrinter adjusts its output accordingly.
Differential Revision: https://reviews.llvm.org/D59393
llvm-svn: 359246
Generalized constructions of 'fragments' of MMA operations to provide
common primitives for construction of the ops. This will make it easier
to add new variants of the instructions that operate on integer types.
Use nested foreach loops which makes it possible to better control
naming of the intrinsics.
This patch does not affect LLVM's output, so there are no test changes.
Differential Revision: https://reviews.llvm.org/D59389
llvm-svn: 359245
to reflect the new license.
We understand that people may be surprised that we're moving the header
entirely to discuss the new license. We checked this carefully with the
Foundation's lawyer and we believe this is the correct approach.
Essentially, all code in the project is now made available by the LLVM
project under our new license, so you will see that the license headers
include that license only. Some of our contributors have contributed
code under our old license, and accordingly, we have retained a copy of
our old license notice in the top-level files in each project and
repository.
llvm-svn: 351636
A TableGen instruction record usually contains a DAG pattern that will
describe the SelectionDAG operation that can be implemented by this
instruction. However, there will be cases where several different DAG
patterns can all be implemented by the same instruction. The way to
represent this today is to write additional patterns in the Pattern
(or usually Pat) class that map those extra DAG patterns to the
instruction. This usually also works fine.
However, I've noticed cases where the current setup seems to require
quite a bit of extra (and duplicated) text in the target .td files.
For example, in the SystemZ back-end, there are quite a number of
instructions that can implement an "add-with-overflow" operation.
The same instructions also need to be used to implement just plain
addition (simply ignoring the extra overflow output). The current
solution requires creating extra Pat pattern for every instruction,
duplicating the information about which particular add operands
map best to which particular instruction.
This patch enhances TableGen to support a new PatFrags class, which
can be used to encapsulate multiple alternative patterns that may
all match to the same instruction. It operates the same way as the
existing PatFrag class, except that it accepts a list of DAG patterns
to match instead of just a single one. As an example, we can now define
a PatFrags to match either an "add-with-overflow" or a regular add
operation:
def z_sadd : PatFrags<(ops node:$src1, node:$src2),
[(z_saddo node:$src1, node:$src2),
(add node:$src1, node:$src2)]>;
and then use this in the add instruction pattern:
defm AR : BinaryRRAndK<"ar", 0x1A, 0xB9F8, z_sadd, GR32, GR32>;
These SystemZ target changes are implemented here as well.
Note that PatFrag is now defined as a subclass of PatFrags, which
means that some users of internals of PatFrag need to be updated.
(E.g. instead of using PatFrag.Fragment you now need to use
!head(PatFrag.Fragments).)
The implementation is based on the following main ideas:
- InlinePatternFragments may now replace each original pattern
with several result patterns, not just one.
- parseInstructionPattern delays calling InlinePatternFragments
and InferAllTypes. Instead, it extracts a single DAG match
pattern from the main instruction pattern.
- Processing of the DAG match pattern part of the main instruction
pattern now shares most code with processing match patterns from
the Pattern class.
- Direct use of main instruction patterns in InferFromPattern and
EmitResultInstructionAsOperand is removed; everything now operates
solely on DAG match patterns.
Reviewed by: hfinkel
Differential Revision: https://reviews.llvm.org/D48545
llvm-svn: 336999
Const/local/shared address spaces are all < 4GB and we can always use
32-bit pointers to access them. This has substantial performance impact
on kernels that uses shared memory for intermediary results.
The feature is disabled by default.
Differential Revision: https://reviews.llvm.org/D46147
llvm-svn: 331941
This is needed for the upcoming implementation of the
new 8x32x16 and 32x8x16 variants of WMMA instructions
introduced in CUDA 9.1.
Differential Revision: https://reviews.llvm.org/D44719
llvm-svn: 328158
This way we can support address-space specific variants without explicitly
encoding the space in the name of the intrinsic. Less intrinsics to deal with ->
less boilerplate.
Added a bit of tablegen magic to match/replace an intrinsics with a pointer
argument in particular address space with the space-specific instruction
variant.
Updated tests to use non-default address spaces.
Differential Revision: https://reviews.llvm.org/D43268
llvm-svn: 328006
Now that patterns can handle intrinsics returning multiple results,
use tablegen'ed pattern matching instead of custom lowering.
Differential Revision: https://reviews.llvm.org/D43890
llvm-svn: 326457
NVPTX stopped supporting GPUs older than sm_20 (Fermi) quite a while back.
Removal of support of pre-Fermi GPUs made a lot of predicates in the NVPTX
backend pointless as they can't ever be false any more.
It's time to retire them. NFC intended.
Differential Revision: https://reviews.llvm.org/D43843
llvm-svn: 326349
Summary:
This just seems to have been an oversight. We already supported the f64
atomic add with an explicit scope (e.g. "cta"), but not the scopeless
version.
Reviewers: tra
Subscribers: jholewinski, sanjoy, cfe-commits, llvm-commits, hiraditya
Differential Revision: https://reviews.llvm.org/D39638
llvm-svn: 317623
WMMA = "Warp Level Matrix Multiply-Accumulate".
These are the new instructions introduced in PTX6.0 and available
on sm_70 GPUs.
Differential Revision: https://reviews.llvm.org/D38645
llvm-svn: 315601
This patch enables support for .f16x2 operations.
Added new register type Float16x2.
Added support for .f16x2 instructions.
Added handling of vectorized loads/stores of v2f16 values.
Differential Revision: https://reviews.llvm.org/D30057
Differential Revision: https://reviews.llvm.org/D30310
llvm-svn: 296032
Support for barrier synchronization between a subset of threads
in a CTA through one of sixteen explicitly specified barriers.
These intrinsics are not directly exposed in CUDA but are
critical for forthcoming support of OpenMP on NVPTX GPUs.
The intrinsics allow the synchronization of an arbitrary
(multiple of 32) number of threads in a CTA at one of 16
distinct barriers. The two intrinsics added are as follows:
call void @llvm.nvvm.barrier.n(i32 10)
waits for all threads in a CTA to arrive at named barrier #10.
call void @llvm.nvvm.barrier(i32 15, i32 992)
waits for 992 threads in a CTA to arrive at barrier #15.
Detailed description of these intrinsics are available in the PTX manual.
http://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions
Reviewers: hfinkel, jlebar
Differential Revision: https://reviews.llvm.org/D17657
llvm-svn: 293384
Summary:
Specifically, we upgrade llvm.nvvm.:
* brev{32,64}
* clz.{i,ll}
* popc.{i,ll}
* abs.{i,ll}
* {min,max}.{i,ll,u,ull}
* h2f
These either map directly to an existing LLVM target-generic
intrinsic or map to a simple LLVM target-generic idiom.
In all cases, we check that the code we generate is lowered to PTX as we
expect.
These builtins don't need to be backfilled in clang: They're not
accessible to user code from nvcc.
Reviewers: tra
Subscribers: majnemer, cfe-commits, llvm-commits, jholewinski
Differential Revision: https://reviews.llvm.org/D28793
llvm-svn: 292694