candle/candle-kernels
Laurent Mazare ce9fbc3682
Optimize the cat operation on contiguous tensors (#1855)
* Add a specialized kernel for copy2d.

* Move the cat operations.

* Avoid transpositions in cat.

* Bugfix.

* Bugfix for the cuda kernel.

* Add a benchmark.

* Add more testing.

* Test fix.

* Faster kernel.

* Add the missing kernel.

* Tweak the test.

* Add a metal kernel.

* Fix for the metal kernel.

* Get the tests to pass on metal.

* Also use this opportunity to fix the metal kernel for ELU.

* Add some bf16 kernels.

* Clippy fixes.
2024-03-17 10:49:13 +01:00
src Optimize the cat operation on contiguous tensors (#1855) 2024-03-17 10:49:13 +01:00
Cargo.toml Bump the crate versions to 0.4.2. (#1821) 2024-03-08 22:01:51 +01:00
README.md Revert "Add the layer norm files. (#222)" (#223) 2023-07-22 16:51:11 +01:00
build.rs Moving to a proper build crate `bindgen_cuda`. (#1531) 2024-01-07 12:29:24 +01:00

README.md

candle-kernels

This crate contains the CUDA kernels used by candle. Some of these implementations come from the dfdx crate.
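
The latest commit listed above adds a specialized `copy2d` kernel so that `cat` on contiguous tensors can avoid transpositions. As a rough illustration of the idea only, here is a CPU reference sketch in Rust. The function names, signatures, and argument order are assumptions made for this example and are not taken from candle's source; the actual kernels in this crate are written in CUDA.

```rust
/// CPU reference for a 2D strided copy (hypothetical signature, not candle's API).
/// Copies `d1` rows of `d2` contiguous elements each. Consecutive rows start
/// `src_stride` elements apart in `src` and `dst_stride` elements apart in `dst`.
fn copy2d<T: Copy>(
    src: &[T],
    dst: &mut [T],
    d1: usize,
    d2: usize,
    src_stride: usize,
    dst_stride: usize,
    src_offset: usize,
    dst_offset: usize,
) {
    for row in 0..d1 {
        let s = src_offset + row * src_stride;
        let d = dst_offset + row * dst_stride;
        // Each row is a contiguous run, so it can be copied in one go.
        dst[d..d + d2].copy_from_slice(&src[s..s + d2]);
    }
}

/// Concatenates two contiguous row-major matrices along the last dimension
/// by issuing one strided copy per input (illustrative helper, assumed name).
fn cat_last_dim<T: Copy + Default>(
    a: &[T],
    a_cols: usize,
    b: &[T],
    b_cols: usize,
    rows: usize,
) -> Vec<T> {
    let out_cols = a_cols + b_cols;
    let mut out = vec![T::default(); rows * out_cols];
    // `a` fills the first `a_cols` columns of every output row.
    copy2d(a, &mut out, rows, a_cols, a_cols, out_cols, 0, 0);
    // `b` fills the remaining columns, starting at column `a_cols`.
    copy2d(b, &mut out, rows, b_cols, b_cols, out_cols, 0, a_cols);
    out
}

fn main() {
    let a = [1, 2, 3, 4]; // 2x2 matrix: [[1, 2], [3, 4]]
    let b = [5, 6]; // 2x1 matrix: [[5], [6]]
    let out = cat_last_dim(&a, 2, &b, 1, 2);
    println!("{:?}", out); // prints [1, 2, 5, 3, 4, 6]
}
```

In this sketch each input is written directly into its slot in the output buffer, so no intermediate transposed copy is needed. A GPU version would presumably assign threads to (row, column) pairs instead of looping over rows.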