diff --git a/crates/burn-cube/README.md b/crates/burn-cube/README.md new file mode 100644 index 000000000..feea89f29 --- /dev/null +++ b/crates/burn-cube/README.md @@ -0,0 +1,202 @@ +
+ + +
+
+ +[![Rust Version](https://img.shields.io/badge/Rust-1.75.0+-blue)](https://releases.rs/docs/1.75.0) +![license](https://shields.io/badge/license-MIT%2FApache--2.0-blue) + +--- + +**Multi-platform high-performance compute language extension for Rust.** +
+ +
## TL;DR

With CubeCL, you can program your GPU in Rust, leveraging zero-cost abstractions to create maintainable, flexible, and optimal compute kernels.

## Motivation

The goal of CubeCL is to ease the pain of writing highly optimized compute kernels that are portable across hardware.
There is currently no adequate solution when you want optimal performance while still being multi-platform.
You either have to write custom kernels for each hardware target, often in different languages such as CUDA, Metal, or ROCm, or settle for a portable abstraction that leaves performance on the table.
To fix this, we created a Just-in-Time compiler with three core features: **automatic vectorization**, **comptime**, and **autotune**!

These features are extremely useful for anyone writing high-performance kernels, even when portability is not a concern.
They improve code composability, reusability, testability, and maintainability, all while staying optimal.

### Disclaimer & History

CubeCL is currently in **alpha**.
The only supported runtimes are CUDA and WebGPU for now.
It's easy to add more GPU runtimes, and we intend to support Metal, ROCm, and Vulkan; contributions are welcome!
We also want to have an optimized JIT CPU runtime with SIMD instructions, leveraging [Cranelift](https://cranelift.dev).

While CubeCL is currently in use in [Burn](https://burn.dev), there are still a lot of rough edges; it isn't refined yet.
The project started as a WebGPU-only backend for Burn.
As we optimized it, we realized that we needed an intermediate representation (IR) that could be optimized and then compiled to WGSL.
Having an IR made it easy to support another compilation target, so we made a CUDA runtime.
However, writing kernels directly in that IR wasn't easy, so we created a Rust frontend using the [syn](https://github.com/dtolnay/syn) crate.
Navigating the differences between CUDA and WebGPU, while leveraging both platforms, forced us to come up with general concepts that worked everywhere.
Hence, CubeCL was born!
## Design

CubeCL is designed around - you guessed it - Cubes! More specifically, it's based on cuboids, because not all axes need to be the same size.
Since every compute API maps to the same kind of hardware, tiles of compute units that can be addressed with a 3D representation, our topology can easily be mapped to concepts from other APIs.
+ +### CubeCL - Topology + + +
+
+
+ +_A cube is composed of units, so a 3x3x3 cube has 27 units that can be accessed by their positions along the x, y, and z axes. +Similarly, a hyper-cube is composed of cubes, just as a cube is composed of units. +Each cube in the hyper-cube can be accessed by its position relative to the hyper-cube along the x, y, and z axes. +Hence, a hyper-cube of 3x3x3 will have 27 cubes. +In this example, the total number of working units would be 27 x 27 = 729._ + +
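The arithmetic above can be checked with a few lines of plain CPU-side Rust (the `total_units` helper is purely illustrative, not a CubeCL API):

```rust
/// Total number of working units in a hyper-cube: the number of cubes
/// (product over the hyper-cube axes) times the number of units per cube
/// (product over the cube axes).
fn total_units(cube_count: [u32; 3], cube_dim: [u32; 3]) -> u32 {
    let cubes: u32 = cube_count.iter().product();
    let units_per_cube: u32 = cube_dim.iter().product();
    cubes * units_per_cube
}

fn main() {
    // A 3x3x3 hyper-cube of 3x3x3 cubes: 27 cubes x 27 units = 729 working units.
    assert_eq!(total_units([3, 3, 3], [3, 3, 3]), 729);
    println!("{}", total_units([3, 3, 3], [3, 3, 3])); // prints 729
}
```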
+Topology Equivalence 👇 +
Since all topology variables are constant within the kernel entry point, we chose to use the Rust constant syntax with capital letters.
When creating kernels, we often don't care about the position of a unit within its cube along each individual axis; we only care about its overall position.
Therefore, each kind of variable also has an axis-independent variant, which is rarely found in other languages, WebGPU's `local_invocation_index` being an exception.
+ +| CubeCL | CUDA | WebGPU | +| -------------- | ----------- | ---------------------- | +| CUBE_COUNT | N/A | N/A | +| CUBE_COUNT_X | gridDim.x | num_workgroups.x | +| CUBE_COUNT_Y | gridDim.y | num_workgroups.y | +| CUBE_COUNT_Z | gridDim.z | num_workgroups.z | +| CUBE_POS | N/A | N/A | +| CUBE_POS_X | blockIdx.x | workgroup.x | +| CUBE_POS_Y | blockIdx.y | workgroup.y | +| CUBE_POS_Z | blockIdx.z | workgroup.z | +| CUBE_DIM | N/A | N/A | +| CUBE_DIM_X | blockDim.x | workgroup_size.x | +| CUBE_DIM_Y | blockDim.y | workgroup_size.y | +| CUBE_DIM_Z | blockDim.z | workgroup_size.z | +| UNIT_POS | N/A | local_invocation_index | +| UNIT_POS_X | threadIdx.x | local_invocation_id.x | +| UNIT_POS_Y | threadIdx.y | local_invocation_id.y | +| UNIT_POS_Z | threadIdx.z | local_invocation_id.z | +| SUBCUBE_DIM | warpSize | subgroup_size | +| ABSOLUTE_POS | N/A | N/A | +| ABSOLUTE_POS_X | N/A | global_id.x | +| ABSOLUTE_POS_Y | N/A | global_id.y | +| ABSOLUTE_POS_Z | N/A | global_id.z | + +
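To make the table concrete, here is how the axis-independent indices can be derived from the per-axis ones, assuming the usual row-major linearization (the `absolute_pos` helper below is an illustrative sketch in plain Rust, not a CubeCL API):

```rust
/// Sketch of how an absolute linear index relates to the per-axis topology
/// variables, assuming row-major linearization (illustrative only).
fn absolute_pos(cube_pos: [u32; 3], cube_dim: [u32; 3], cube_count: [u32; 3], unit_pos: [u32; 3]) -> u32 {
    // Global position of the unit along each axis:
    // ABSOLUTE_POS_{X,Y,Z} = CUBE_POS_{X,Y,Z} * CUBE_DIM_{X,Y,Z} + UNIT_POS_{X,Y,Z}
    let gx = cube_pos[0] * cube_dim[0] + unit_pos[0];
    let gy = cube_pos[1] * cube_dim[1] + unit_pos[1];
    let gz = cube_pos[2] * cube_dim[2] + unit_pos[2];
    // Total number of units along x and y across the whole hyper-cube.
    let size_x = cube_count[0] * cube_dim[0];
    let size_y = cube_count[1] * cube_dim[1];
    // A single linear index over all working units.
    gx + gy * size_x + gz * size_x * size_y
}

fn main() {
    // First unit of the first cube.
    assert_eq!(absolute_pos([0; 3], [3; 3], [3; 3], [0; 3]), 0);
    // Last unit of the last cube in a 3x3x3 hyper-cube of 3x3x3 cubes: index 728 of 729.
    assert_eq!(absolute_pos([2; 3], [3; 3], [3; 3], [2; 3]), 728);
}
```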
## Special Features

#### Automatic Vectorization

High-performance kernels should rely on SIMD instructions whenever possible, but doing so can quickly get pretty complicated!
With CubeCL, you can specify the vectorization factor of each input variable when launching a kernel.
Inside the kernel code, you still use only one type, which is dynamically vectorized and supports automatic broadcasting.
Since the runtimes compile kernels just-in-time, they have all the information necessary to emit the best instructions!
However, since the algorithmic behavior may depend on the vectorization factor, CubeCL allows you to access it directly in the kernel when needed, without any performance loss, using the comptime system!

#### Comptime

CubeCL isn't just a new compute language: though it feels like you are writing GPU kernels, you are, in fact, writing compiler plugins that you can fully customize!
Comptime is a way to modify the compiler IR at runtime when compiling a kernel for the first time.

This enables lots of optimizations and flexibility without having to write many separate variants of the same kernels to ensure maximal performance.

| Feature                        | Description                                                                                                                                            |
| ------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Instruction Specialization** | Not all instructions are available on all hardware, but when a specialized one exists, it should be enabled with a simple if statement.                 |
| **Automatic Vectorization**    | When you can use SIMD instructions, you should! But since not all hardware supports the same vectorization factors, it can be injected at runtime!      |
| **Loop Unrolling**             | You may want multiple flavors of the same kernel, with loop unrolling for only a certain range of values. This can be configured easily with Comptime.
| **Shape Specialization**       | For deep learning kernels, it's often crucial to rely on different kernels for different input sizes; you can do it by passing the shape information as Comptime values.    |
| **Compile Time Calculation**   | In general, you can calculate a constant using Rust runtime properties and inject it into a kernel during its compilation, to avoid recalculating it during each execution. |

#### Autotuning

Autotuning drastically simplifies kernel selection by running small benchmarks at runtime to figure out the best kernels and configurations to run on the current hardware; an essential feature for portability.
This feature combines gracefully with comptime to test the effect of different comptime values on performance; sometimes the results can be surprising!

Even if the benchmarks may add some overhead when running the application for the first time, the information gets cached on the device and will be reused.
It is usually a no-brainer trade-off for throughput-oriented programs such as deep learning models.
You can even ship the autotune cache with your program, reducing cold start time when you have more control over the deployment target.

## Example

CubeCL is designed to be easy to use for Rust programmers: it relies on the same syntax and is fully integrated with the language.
You can simply add an attribute on top of a Rust function for it to be executed on the GPU.
```rust
#[cube(launch)]
fn gelu(input: &Array<F32>, output: &mut Array<F32>) {
    if ABSOLUTE_POS < input.len() {
        let x = input[ABSOLUTE_POS];
        let gelu = x * (F32::erf(x / F32::sqrt(2.0)) + 1.0) / 2.0;
        output[ABSOLUTE_POS] = gelu;
    }
}

fn main() {
    type Runtime = CudaRuntime;

    let device = Default::default();
    let client = Runtime::client(&device);

    let input = &[-1., 0., 1., 5.];
    let input_handle = client.create(f32::as_bytes(input));
    let output_handle = client.empty(input.len() * core::mem::size_of::<f32>());

    gelu::launch::<Runtime>(
        &client,
        CubeCount::new(1, 1, 1),
        CubeDim::new(4, 1, 1),
        &input_handle,
        &output_handle,
    );

    let output = client.read(output_handle.binding()).read_sync().unwrap();
    let output = f32::from_bytes(&output);

    // Should be [-0.1587, 0.0000, 0.8413, 5.0000]
    println!("{output:?}");
}
```

The `cube` attribute generates the code that is needed to compile a kernel.
In the example above, the functions `gelu_expand` and `gelu_launch` are automatically generated.
This allows you to compose Cube functions easily:

```rust
#[cube]
fn gelu_scalar<F: Float>(x: F) -> F {
    x * (F::erf(x / F::sqrt(2.0.into())) + 1.0) / 2.0
}

#[cube(launch)]
fn gelu<F: Float>(input: &Array<F>, output: &mut Array<F>) {
    if ABSOLUTE_POS < input.len() {
        output[ABSOLUTE_POS] = gelu_scalar::<F>(input[ABSOLUTE_POS]);
    }
}
```

Note that you don't have to specify `launch` for a function that is only used by another Cube function.
In addition, Cube functions can have return types without any problem, which isn't the case for a kernel entry point written with the `launch` attribute.
The function `gelu_expand` will actually call `gelu_scalar_expand`, making it easy to combine your functions.

## Resources

If you have any questions or want to contribute, don't hesitate to join the [Discord](https://discord.gg/uPEBbYYDB6).
diff --git a/crates/burn-cube/assets/CubeCL.webp b/crates/burn-cube/assets/CubeCL.webp new file mode 100644 index 000000000..305c1b882 Binary files /dev/null and b/crates/burn-cube/assets/CubeCL.webp differ diff --git a/crates/burn-cube/assets/cubecl.drawio.svg b/crates/burn-cube/assets/cubecl.drawio.svg new file mode 100644 index 000000000..e0278413a --- /dev/null +++ b/crates/burn-cube/assets/cubecl.drawio.svg @@ -0,0 +1,4 @@ + + + +
(2, 2, 0)
(1, 2, 0)
(0, 2, 0)
(2, 1, 0)
(1, 1, 0)
(0, 1, 0)
(2, 0, 0)
(1, 0, 0)
(0, 0, 0)
Cube
[2, 2, 0]
[1, 2, 0]
[0, 2, 0]
[2, 1, 0]
[1, 1, 0]
[0, 1, 0]
[2, 0, 0]
[1, 0, 0]
Hyper-Cube
(0, 0, 0)
Unit
(0, 0, 0)
(0, 0, 0)
diff --git a/crates/burn-cube/assets/logo.drawio.svg b/crates/burn-cube/assets/logo.drawio.svg new file mode 100644 index 000000000..a83fc3520 --- /dev/null +++ b/crates/burn-cube/assets/logo.drawio.svg @@ -0,0 +1,4 @@ + + + +

CubeCL

\ No newline at end of file