* Refactor serialization of benchmarks
* flatten benchmark data to make it easier to save documents to a database and
query them
* split some information into its own fields, such as backend and device
* add new serialized info (the flattened record is sketched below):
- computed values (mean, median, variance, min, max)
- number of samples
- operation name
- tensor shapes if any
* serialize to separate files, one file per benchmark run
* simplify the persistence module to a single save method
* Update bench save file format to use name and uuid
* Compute the serialized field count automatically via a macro
* Rework naming of benchmarks and shapes, and add an options field
Remove the operations field
Correctly create one file per benchmark run
* Serialize benchmark num_repeats
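Purely for illustration, a minimal sketch of what the flattened record could look like, assuming serde; every field and type name here is an assumption drawn from the notes above, not the crate's actual definition:

```rust
use serde::Serialize;

/// Hypothetical flattened record; one document like this is written per
/// benchmark run, so it can be stored and queried in a database as-is.
#[derive(Serialize)]
struct BenchmarkRecord {
    // Identity: the save file name could combine these two fields.
    name: String,
    uuid: String,
    // Split out into their own fields for easier querying.
    backend: String,
    device: String,
    operation: String,
    shapes: Vec<Vec<usize>>,
    options: Option<String>,
    num_repeats: usize,
    num_samples: usize,
    // Computed values over the sampled durations.
    mean: f64,
    median: f64,
    variance: f64,
    min: f64,
    max: f64,
}
```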
* Fix expect message to follow the 'should' convention
* Cargo fmt :-)
* Make Clippy happy
* Save files in the burn subdirectory
* Change name of custom_gelu bench to just gelu
* Remove num_repeats from backend-comparison benchmarks
* Fix wrong variable name used to compute the median (correct computation sketched below)
* Remove false positive possibility in test_mean_duration
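As a reference for the median fix above, a correct computation takes the midpoint of the sorted samples; this is only an illustrative sketch, not the crate's code:

```rust
fn mean(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}

fn median(samples: &[f64]) -> f64 {
    // Sort a copy, then pick the middle value (or average the two middles).
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("samples should be comparable"));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    }
}
```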
* Add new persistent cache to the tune cache
* Serialize the autotune persistent cache using vectors (save/load sketched below)
* Properly load and save the persistent cache
* Print an error when autotune cache cannot be loaded
* Add tests for persistent cache
Use the same logic as the already implemented tests
* Cargo fmt
* Silence clippy check about implementing default for CliMetricsRenderer
* Add burn-compute feature flag autotune-persistent-cache
This allows burn-compute to remain no-std compliant
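A minimal sketch of that save/load pair, assuming serde_json and entries stored as a vector of pairs (so keys need not be strings); the types and file layout are assumptions, not burn-compute's actual API:

```rust
#[cfg(feature = "autotune-persistent-cache")]
mod persistent {
    use serde::{Deserialize, Serialize};
    use std::path::Path;

    /// Hypothetical on-disk form: a vector of (key, fastest-kernel index)
    /// pairs rather than a map.
    #[derive(Serialize, Deserialize)]
    pub struct PersistentCache {
        pub entries: Vec<(String, usize)>,
    }

    pub fn save(cache: &PersistentCache, path: &Path) -> std::io::Result<()> {
        let file = std::fs::File::create(path)?;
        serde_json::to_writer(file, cache)
            .map_err(|err| std::io::Error::new(std::io::ErrorKind::Other, err))
    }

    pub fn load(path: &Path) -> Option<PersistentCache> {
        let file = std::fs::File::open(path).ok()?;
        match serde_json::from_reader(file) {
            Ok(cache) => Some(cache),
            Err(err) => {
                // Matches the note above: warn instead of failing silently.
                log::warn!("Unable to load the autotune cache: {err}");
                None
            }
        }
    }
}
```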
* debug
* Git ignore .dir-locals.el files
* Update documentation for compute_checksum implementation
* Expect messages should be an expectation not an error message
* Replace silent eprintln! with the log::warn! macro
* Remove clippy allow attribute
* Fix typos in documentation
* Move creation of additional client into the test fn that requires it
* Create compute clients in test function to test different checksum
* Revert tui as a default feature in burn-train cargo file
* Use structs for autotune cache entries
* Unpack InMemoryCacheEntry for even better readability
* Remove unneeded checksum_checked field in no-std env
* Make sure that the autotune cache directory exists
* Add test for autotune cache file path creation
* Prefix the autotune cache file name with device info (path sketched below)
* Use new compute in autotune cache integration tests
This prevents race conditions by always reloading the cache for
each test.
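Putting the last few cache notes together, a hedged sketch of the entry struct and the cache file path; the directory layout, feature name, and identifiers are illustrative guesses, not the actual code:

```rust
use std::path::{Path, PathBuf};

/// Hypothetical cache entry struct (per the "use structs" note above). The
/// checksum_checked field is compiled out where it is unneeded.
struct InMemoryCacheEntry {
    fastest_index: usize,
    #[cfg(feature = "std")]
    checksum_checked: bool,
}

/// Illustrative path: files live in a `burn` subdirectory, and the file name
/// is prefixed with device info so different devices get separate caches.
fn autotune_cache_file(cache_dir: &Path, device_info: &str) -> PathBuf {
    cache_dir
        .join("burn")
        .join(format!("{device_info}-autotune-cache.json"))
}
```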
* Move the burn-compute rand dependency to dev-dependencies
* Avoid creating the formatted message except in case of an actual error
* Fix burn-compute unused code warning in no-std env
* wip autotune compute
* too many generics
* wip
* megawip
* in progress
* first test passes
* first test passes
* fixed test
* refactor for cache hit and miss (flow sketched at the end of this list)
* cleanup and fixes
* doc and stuff
* doc and stuff
* clippy
* format
* remove lifetime
* cleanup operation
* wip
* wip
* compiles
* wip mutable borrow
* refactor with autotune server
* wip tune benchmark
* test passes
* fix autotune key
* cache hit miss tests
* refactor wgpu to match burn-compute
* better operation execution
* cleanup & refactor
* test for parameterized kernel
* fmt
* fmt
* clippy
* allow clippy
* fix no-std
* fmt
* review and ci
* Fix CI
* delete dummy benchmarks again
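Taken together, the hit/miss notes above amount to a flow like the following sketch; the types and names are assumptions for illustration, not the actual autotune server code:

```rust
use std::collections::HashMap;
use std::time::Duration;

/// On a cache hit, run the kernel previously found fastest for this key; on a
/// miss, benchmark every candidate, remember the winner, then run it.
fn execute_autotune(
    cache: &mut HashMap<String, usize>,
    key: String,
    candidates: &[Box<dyn Fn()>],
    benchmark: &dyn Fn(&dyn Fn()) -> Duration,
) {
    let index = *cache.entry(key).or_insert_with(|| {
        let mut best = 0;
        let mut best_time = Duration::MAX;
        for (i, candidate) in candidates.iter().enumerate() {
            let time = benchmark(candidate.as_ref());
            if time < best_time {
                best = i;
                best_time = time;
            }
        }
        best
    });
    candidates[index]();
}
```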
---------
Co-authored-by: nathaniel <nathaniel.simard.42@gmail.com>
We've been exploring dividing our dataset into multiple batches and
training those batches in parallel. I noticed that performance did
not scale with core count, and after some digging, found that this was
mainly due to the Mutex being used to generate IDs. With the following
change, training across 16 cores went from 21s to 4.2s.
thread_rng was previously discussed in #703, but I don't believe that
applies here, as this is just used for UUID creation.
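A minimal sketch of the contention pattern and the change, assuming the rand crate; the function names and the global are illustrative, not the crate's actual ID generator:

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};
use std::sync::Mutex;

// Before (illustrative): every generated ID takes one global lock, so all
// training threads serialize on it and throughput stops scaling with cores.
static RNG: Mutex<Option<StdRng>> = Mutex::new(None);

fn generate_id_locked() -> u64 {
    let mut rng = RNG.lock().expect("rng mutex should not be poisoned");
    rng.get_or_insert_with(StdRng::from_entropy).gen()
}

// After (illustrative): each thread draws from its own thread-local RNG, so
// there is no shared lock to contend on.
fn generate_id_thread_local() -> u64 {
    rand::thread_rng().gen()
}
```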