candle

Commit Graph

Author	SHA1	Message	Date
Laurent Mazare	318cb82f16	Quantized cuda tweaks. (#1981 ) * Quantized cuda tweaks. * Add some safety checks. * Factorize the dequantization bits.	2024-04-01 11:06:42 +02:00
Laurent Mazare	c7557b65dc	Switch the default to using the faster kernels. (#1978 ) * Switch the default to using the faster kernels. * Add the force-dmmv flag.	2024-04-01 10:00:11 +02:00
Laurent Mazare	cd29c7ccd4	More ggml cuda kernels (#1977 ) * Add more cuda kernels for quantized matmul. * Add the vec-dot bits. * Expose the quantized matmul-vec kernels. * Also include the quantize-q8-1 kernel. * Glue code for the q8-1 quantization. * mm-vec product via q8-1 quantization. * Add a test. * Add a mm test. * Get the test to return some sensible results. * Also test dmmv. * Fix the launch params. * Allow for tweaking the force_dmmv parameter while it's experimental.	2024-04-01 00:15:48 +02:00
Laurent Mazare	3144150b8d	Move the tensor-tools binary in a separate crate. (#1969 )	2024-03-30 15:49:37 +01:00
Laurent Mazare	b190fd8592	Remove some unnecessary calls to contiguous. (#1968 ) * Remove some unnecessary calls to contiguous. * Slightly improved kv cache concatenation.	2024-03-30 13:22:00 +01:00
Laurent Mazare	efe4a0c84b	Add a print command to tensor-tools. (#1967 ) * Add a print command to tensor-tools. * Add some flags to tweak the formatting.	2024-03-30 11:34:33 +01:00
Laurent Mazare	665da30487	Backend refactoring. (#1966 ) * Backend refactoring. * Metal tweaks. * Move the cudnn module.	2024-03-29 23:02:11 +01:00
Marco Inacio	7ecbc6d50b	fix minor typo (#1924 )	2024-03-29 18:09:57 +01:00
Laurent Mazare	b3484e7a5e	Fix for the RWKV models. (#1955 ) * Fix for the RWKV models. * More general fix + revert the rwkv hack. * Remove the old hack.	2024-03-28 10:17:38 +01:00
Laurent Mazare	ab86cd37c8	Support i64 in index-select on metal. (#1951 ) * Support i64 in index-select on metal. * Add some testing of index-select for all dtypes.	2024-03-27 16:30:07 +01:00
Laurent Mazare	a9abde5f93	More flexible matmul contiguity checks. (#1949 ) * More flexible matmul contiguity checks. * Also relax the checks on the metal side.	2024-03-27 10:59:05 +01:00
Laurent Mazare	66f0a4eeea	Another fix for squeezing. (#1943 )	2024-03-26 17:05:26 +01:00
Thomas Santerre	f5dfe883d7	Extend supported dtypes for metal (im2col & upsample_2d) (#1938 ) * update im2col dtype implementations * update dtypes for upsample	2024-03-26 06:48:56 +01:00
Laurent Mazare	cd254074f3	Really unique identifier for metal device ids. (#1932 ) * Really unique identifier for metal device ids. * Same device.	2024-03-25 11:48:16 +01:00
Laurent Mazare	fdfe8fd129	Preliminary support for inplace ops. (#1921 ) * Preliminary support for inplace ops. * Add a test.	2024-03-23 14:16:19 +01:00
Kirpal Grewal	cc856db9ce	Backwards for ConvTranspose2D (#1910 ) * add documentation for backprop * add backwards for ConvTranspose2D * add test python code to test	2024-03-23 07:05:55 +01:00
Thomas Santerre	fee33b45c2	Add support for strided index-select on Metal (#1909 ) * initial implementation * use correct index, but still not breaking like it should have... * fix test	2024-03-22 07:30:02 +01:00
Laurent Mazare	6708870e63	Add the alloc_uninit function. (#1901 ) * Add the alloc_uninit function. * Dummy metal fix. * Lazy initialization.	2024-03-22 07:25:23 +01:00
Thomas Santerre	9563a5fee4	Add support for conv_transpose2d on Metal backend (#1903 ) * add support for conv transpose 2d and add bench mark for float types * update bench calculation * enable testing all conv operations on metal	2024-03-21 18:08:45 +01:00
Laurent Mazare	ec97c98e81	Async tensor copying. (#1900 )	2024-03-21 13:09:42 +01:00
Laurent Mazare	74b7f59261	Prepare for the custom-op extension. (#1892 )	2024-03-21 07:02:20 +01:00
Laurent Mazare	b219903d0f	Cuda backend optimization (#1886 ) * Attempt at making the kernel faster. * Also adapt the cast kernels. * Also apply to binary ops.	2024-03-20 18:32:55 +01:00
Laurent Mazare	469635a3eb	Minor cleanup. (#1885 )	2024-03-20 14:38:27 +01:00
Laurent Mazare	455c42aa72	Avoid copying the data on squeeze and unsqueeze. (#1884 ) * Avoid copying the data on squeeze and unsqueeze. * Fix the quantized llama example. * Unrelated fix for the quantized stable-lm example on cuda. * Fix for mamba on cuda (unrelated to the PR).	2024-03-20 13:04:36 +01:00
Thomas Santerre	2a8679509e	Add support for conv_transpose1d for metal backend (#1874 ) * first attempt * progress * integrate into metal backend * finish and get test passing * add other dtype support * update transpose1d dtypes supported	2024-03-19 08:46:58 +01:00
Thomas Santerre	04a61a9c72	Add avg_pool2d metal implementation for the metal backend (#1869 ) * implement metal avg pool 2d * fix * add suggested precision workaround for the accumulator	2024-03-18 18:50:14 +01:00
Thomas Santerre	754fa1e813	Add support for max_pool2d for Metal backend (#1863 ) * first pass at implementation of maxpool2d * Add definitions for other dtypes * add tests for other dtypes * Cosmetic tweaks + re-enable maxpool2d tests for metal. --------- Co-authored-by: Laurent <laurent.mazare@gmail.com>	2024-03-18 08:33:30 +01:00
Thomas Santerre	184105792f	add test for index add and add missing match statements (#1862 )	2024-03-17 22:19:12 +01:00
Thomas Santerre	e316cb6997	add support for casting between all datatypes (#1860 )	2024-03-17 20:55:11 +01:00
Laurent Mazare	ce9fbc3682	Optimize the cat operation on contiguous tensors (#1855 ) * Add a specialized kernel for copy2d. * Move the cat operations. * Avoid transpositions in cat. * Bugfix. * Bugfix for the cuda kernel. * Add a benchmark. * Add more testing. * Test fix. * Faster kernel. * Add the missing kernel. * Tweak the test. * Add a metal kernel. * Fix for the metal kernel. * Get the tests to pass on metal. * Also use this opportunity to fix the metal kernel for ELU. * Add some bf16 kernels. * Clippy fixes.	2024-03-17 10:49:13 +01:00
Thomas Santerre	db8b24ae92	Add support for index u8/i64 and input f16/bf16 scatter-add on metal (#1849 ) * add support and tests for scatter add on metal * add support for all datatypes	2024-03-17 08:09:43 +01:00
Laurent Mazare	cdc4c172c4	Implement the error trait for DTypeParseError. (#1852 )	2024-03-15 08:37:27 +01:00
Laurent Mazare	df5f69444e	Properly handle the batch dimension in cuda quantized matmul. (#1832 )	2024-03-10 20:23:43 +01:00
Laurent Mazare	936f6a4840	Fix dequantization. (#1823 )	2024-03-08 23:12:13 +01:00
Laurent Mazare	3440cec3a0	Fast CPU kernel for transposed 1d convolutions. (#1822 ) * Fast CPU kernel for transposed 1d convolutions. * Bugfix.	2024-03-08 22:43:07 +01:00
Niklas Hallqvist	be5b68cd0b	Metal random-generation bug fixes (#1811 ) * use_resource API misunderstood. It is not additive. Several usages must be bit-ORed together. * The seeding was incorrect and used the address instead of the value of the passed in seed. * Add a check that likely exhibits failure to update the seed between generation of random tensors. * Buffer overrun, the length given to the std::ptr::copy call was in bytes, and not 32-bit units. * By default seed the RNG with a time-based value, so that different runs may produce different output, just like the CPU engine. Use device.set_seed if determinism is warranted. * Revert "By default seed the RNG with a time-based value, so that different runs may produce different output, just like the CPU engine. Use device.set_seed if determinism is warranted." This reverts commit `d7302de9` Discussion in https://github.com/huggingface/candle/pull/1811#issuecomment-1983079119 * The Metal random kernel failed to set element N/2 of tensors with N elements, N being even. The reason was that all threads but thread 0 all created 2 random samples, but thread 0 only one, i.e. an odd number. In order to produce an even number of samples, the early termination of thread 0 should only ever occur for odd sized tensors. * Add a test catching any deterministic tensor element in rand and randn output. --------- Co-authored-by: niklas <niklas@appli.se> Co-authored-by: Ivar Flakstad <69173633+ivarflakstad@users.noreply.github.com>	2024-03-08 16:11:50 +01:00
Laurent Mazare	ea984d0421	Expose more printer options. (#1817 )	2024-03-08 15:04:18 +01:00
Laurent Mazare	9634583781	Expose a couple layout methods. (#1816 )	2024-03-08 10:52:22 +01:00
ivarflakstad	0c09d10f32	Improve metal buffer usage (#1807 ) * Improve metal buffer usage * Clone cpu storage when loading to reduce wait_until_complete calls * Use powers of two for buffer sizes so reuse is more likely. * Select best available buffer by size. * Add count to MetalStorage -> can use buffer with different size Co-authored-by: Chris Fleetwood <christopher.fleetwood@huggingface.co> * Simplify new buffer creation without blit copy. Revert &[] -> Vec * Add documentation on newBufferWithBytes safety / synchronization * Drop unused buffers after command buffer is done syncing. --------- Co-authored-by: Chris Fleetwood <christopher.fleetwood@huggingface.co>	2024-03-07 09:42:34 +01:00
Laurent Mazare	bd9ab9bc04	Add a cuda kernel for dequantizing q8_0. (#1804 )	2024-03-05 09:50:37 +01:00
Laurent Mazare	09e0148cce	Tweaks to run metavoice on metal (#1792 ) * Enable tanh + tweak conv-transpose. * Run the encodec decoding on cpu. * Clippy fixes.	2024-03-03 07:46:44 +01:00
laurent	2c95b7394a	Handle Q5_0 and Q5_1 quants in cuda.	2024-02-29 10:54:01 +01:00
Laurent Mazare	6400e1b0a0	Fix the block size for some cuda kernels. (#1767 )	2024-02-27 14:08:33 +01:00
Laurent Mazare	badf886583	Cuda kernel for dequantizing q8k. (#1760 ) * Cuda kernel for dequantizing q8k. * Clippy lints.	2024-02-26 08:42:44 +01:00
Laurent Mazare	2f22afd80e	Cuda acceleration for quantized model. (#1754 ) * Boilerplate for the quantized cuda support. * More basic cuda support. * More cuda quantization (quantize on cpu for now). * Add the dequantization bit. * Start adding some dedicated cuda kernels from llama.cpp. * Move the kernel code. * Start interfacing with the kernel. * Tweak the kernel launch params. * Bugfix for quantized metal. * Fix some clippy lints. * Tweak the launch parameters. * Tweak cuda basics to perform a quantized matmul. * Perform the dequantization on the cpu + use cublas for matmul. * Add the dequantization kernel. * Test the qmatmul. * More kernels. * Matmul-vec kernel. * Add a couple kernels. * More dequantization kernels.	2024-02-25 18:11:47 +01:00
Laurent Mazare	c753f72c85	Support for attention bias in gemma + refactor things a bit. (#1744 ) * Support for attention bias in gemma + refactor things a bit. * Fix the cuda tests.	2024-02-22 09:35:28 +01:00
Kirpal Grewal	8013b50829	Add grads for interpolate1d (#1742 ) * add backprop for interpolate1d * fix clippy lint * correct fix clippy lint	2024-02-22 08:44:01 +01:00
Laurent Mazare	a2cb2edead	Add a couple backtraces on cpu errors. (#1738 )	2024-02-20 19:54:13 +01:00
Laurent Mazare	fc67d878bb	Bugfix for conv-transpose1d (#1734 ) * Add a currently broken test. * Bugfix + fix test.	2024-02-19 09:04:49 +01:00
Laurent Mazare	1fb728772d	Support for groups in conv-transpose1d. (#1731 ) * Groups support in conv-transpose-1d. * Remove dangling file.	2024-02-18 21:28:07 +01:00

661 Commits