forked from OSchip/llvm-project
1129931a62
Perform the second reduce only with the first warp. This requires an additional __syncthreads(), but doesn't need special handling when the last warp is small. This simplifies support for block sizes that are not a multiple of 32. Supporting partial warp reduce will be done in a separate CL. PiperOrigin-RevId: 272168917
all-reduce.mlir
gpu-to-cubin.mlir
lit.local.cfg
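
The commit message above describes a two-stage block reduction in which only the first warp performs the second stage. The following is a minimal host-side Python model of that idea, not the code from this commit: the names `WARP_SIZE`, `warp_reduce`, and `block_all_reduce` are hypothetical, addition is assumed as the reduction operator, and inactive lanes are modelled by padding with 0, which is a simplification of this sketch (the commit defers partial warp reduce to a separate CL).

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warp_reduce(lanes):
    """Sum the values held by one warp with a shuffle-down style tree.

    Inactive lanes of a partial warp are modelled as holding 0, the
    identity for addition. This padding is an assumption of the sketch.
    """
    vals = list(lanes) + [0] * (WARP_SIZE - len(lanes))
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each lane i adds the value held by lane i + offset, if any.
        vals = [
            vals[i] + (vals[i + offset] if i + offset < WARP_SIZE else 0)
            for i in range(WARP_SIZE)
        ]
        offset //= 2
    return vals[0]  # lane 0 now holds the warp's total


def block_all_reduce(values):
    """Model an all-reduce over one thread block; values[i] belongs to thread i."""
    num_threads = len(values)
    num_warps = (num_threads + WARP_SIZE - 1) // WARP_SIZE
    # One warp must be able to hold every per-warp partial result.
    assert 0 < num_warps <= WARP_SIZE

    # Stage 1: every warp reduces its own lanes; lane 0 of each warp
    # writes the partial result to "shared memory".
    shared = [
        warp_reduce(values[w * WARP_SIZE:(w + 1) * WARP_SIZE])
        for w in range(num_warps)
    ]
    # First barrier: all partial results are visible to the first warp.

    # Stage 2: only the first warp reduces the per-warp partials.
    shared[0] = warp_reduce(shared)
    # Second barrier (assumed to be the additional one): the final
    # value is visible to every thread before it is read back.

    return [shared[0]] * num_threads
```

For example, `block_all_reduce(list(range(70)))` models a block of 70 threads, which is not a multiple of 32: the third warp has only 6 active lanes, yet the second stage still runs on a single warp that holds the three partial sums.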