* new reduce half working
* surprisingly working
* good on elongated matrices, bad on balanced ones
* working and clean
* autotune not tested, tests fail at non-contiguous
* fixed
* autotune tested
* mean dim
* some fixes
* clippy
* wip autotune compute
* too many generics
* wip
* megawip
* in progress
* first test passes
* first test passes
* fixed test
* refactor for cache hit and miss
* cleanup and fixes
* doc and stuff
* doc and stuff
* clippy
* format
* remove lifetime
* cleanup operation
* wip
* wip
* compiles
* wip mutable borrow
* refactor with autotune server
* wip tune benchmark
* test passes
* fix autotune key
* cache hit miss tests
* refactor wgpu to match burn-compute
* better operation execution
* cleanup & refactor
* test for parameterized kernel
* fmt
* fmt
* clippy
* allow clippy
* fix no-std
* fmt
* review and ci
* Fix CI
* delete dummy benchmarks again
---------
Co-authored-by: nathaniel <nathaniel.simard.42@gmail.com>
* Update kernel mod.rs
* Wgpu crate implementations and add shader files
* Direct backends to the correct implementation
* Use mask method for candle
* Add index out of bounds protection
* Use a macro to avoid duplication
* Use unary_scalar templates
* New shaders for clamp and clamp_inplace
* Remove unnecessary clamp shaders
* Clamp implementation and test
* Use new clamp implementation for float and int ops
* Better variable names for clamp_min/max
* Revert changes to tensor/ops/tensor.rs
* Fix clamp.wgsl
* Fix shader types
* Use native candle clamp
* Use candle ops for clamp_min/max and revert tensor.rs
* Maximum/minimum were reversed
* Add a pipeline_counter and methods for retaining the best kernel
* Put a tune flag on the Context
* Put counts into cache instead of using pipeline_counter
* Formatting
* Add optimize_cache flag and rework ComputePipeline clearing process
* Update tune() so that it starts Context tuning and flags the Context as ready for clearing
* Consistent single quotes
* Use AtomicBool for is_tuning, prevent caching during tuning
* Collect TemplateIds during tuning and clean them out after tuning
* Fix comment
* Move cache cleanup to stop_tuning function