* wip autotune compute
* too many generics
* wip
* megawip
* in progress
* first test passes
* fixed test
* refactor for cache hit and miss
* cleanup and fixes
* doc and stuff
* clippy
* format
* remove lifetime
* cleanup operation
* wip
* compiles
* wip mutable borrow
* refactor with autotune server
* wip tune benchmark
* test passes
* fix autotune key
* cache hit miss tests
* refactor wgpu to match burn-compute
* better operation execution
* cleanup & refactor
* test for parameterized kernel
* fmt
* clippy
* allow clippy
* fix no-std
* fmt
* review and ci
* Fix CI
* delete dummy benchmarks again
---------
Co-authored-by: nathaniel <nathaniel.simard.42@gmail.com>
We've been exploring splitting our dataset into multiple batches and
training those batches in parallel. I noticed that performance did not
scale with core count, and after some digging I found that this was
mainly due to the Mutex used to generate IDs. With the following
change, training across 16 cores went from 21s to 4.2s.
thread_rng was previously discussed in #703, but I don't believe that
discussion applies here, since the RNG is only used for UUID creation.
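This is not the actual diff, only a minimal sketch of the kind of change described, assuming the rand 0.8 API; `SHARED_RNG`, `generate_id_locked`, and `generate_id_thread_local` are hypothetical names, not the real identifiers in the codebase.

```rust
use std::sync::Mutex;

use rand::{rngs::StdRng, Rng, SeedableRng};

// Before (hypothetical): every ID draws from one global RNG behind a Mutex,
// so all training threads serialize on this lock.
static SHARED_RNG: Mutex<Option<StdRng>> = Mutex::new(None);

fn generate_id_locked() -> u128 {
    let mut guard = SHARED_RNG.lock().unwrap();
    let rng = guard.get_or_insert_with(StdRng::from_entropy);
    rng.gen()
}

// After (hypothetical): each thread uses its own thread-local RNG,
// so ID generation no longer contends on a shared lock.
fn generate_id_thread_local() -> u128 {
    rand::thread_rng().gen()
}

fn main() {
    // Both paths produce random IDs; only the thread-local one scales
    // when many threads generate IDs concurrently.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            std::thread::spawn(|| {
                let _locked = generate_id_locked();
                let _local = generate_id_thread_local();
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}
```

Since UUIDs only need uniqueness rather than a reproducible stream, a per-thread RNG is a reasonable swap for the shared, lock-guarded one.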