Commit Graph

58 Commits

Author SHA1 Message Date
Woosuk Kwon 6ef00b03a2
Enable CUDA graph for GPTQ & SqueezeLLM (#2318) 2024-01-03 09:52:29 -08:00
Jee Li 77af974b40
[FIX] Support non-zero CUDA devices in custom kernels (#1959) 2024-01-02 19:09:59 -08:00
kliuae 1b7c791d60
[ROCm] Fixes for GPTQ on ROCm (#2180) 2023-12-18 10:41:04 -08:00
Woosuk Kwon 76a7983b23
[BugFix] Fix RoPE kernel on long sequences (#2164) 2023-12-17 17:09:10 -08:00
CHU Tianxiang 0fbfc4b81b
Add GPTQ support (#916) 2023-12-15 03:04:22 -08:00
Mingcan Xiang 614856da25
Avoid multiple redefinitions (#1817) 2023-12-14 09:35:58 -08:00
wbn dacaf5a400
Replace head_mapping params with num_kv_heads in the attention kernel (#1997)
Co-authored-by: wangguoya <wangguoya@baidu.com>
Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>
2023-12-10 10:12:53 -08:00
TJian 6ccc0bfffb
Merge EmbeddedLLM/vllm-rocm into vLLM main (#1836)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>
2023-12-07 23:16:52 -08:00
Yanming W e0c6f556e8
[Build] Avoid building too many extensions (#1624) 2023-11-23 16:31:19 -08:00
ljss e1054247ba
[Optimization] Implement fused add rmsnorm (#1667) 2023-11-18 18:18:02 -08:00
Antoni Baum 9f669a9a7c
Support YaRN models (#1264)
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Viktor Ferenczi <viktor@ferenczi.eu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-11-03 14:12:48 -07:00
Woosuk Kwon 0ce8647dc5
Fix integer overflows in attention & cache ops (#1514) 2023-10-31 15:19:30 -07:00
chooper1 1f24755bf8
Support SqueezeLLM (#1326)
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-10-21 23:14:59 -07:00
Woosuk Kwon c1376e0f82
Change scheduler & input tensor shape (#1381) 2023-10-16 17:48:42 -07:00
Woosuk Kwon 928de46888
Implement PagedAttention V2 (#1348) 2023-10-16 00:59:57 -07:00
Woosuk Kwon 29678cd213
Minor fix on AWQ kernel launch (#1356) 2023-10-15 21:53:56 -07:00
CHU Tianxiang 980dd4a2c4
Fix overflow in AWQ kernel (#1295)
Co-authored-by: 楚天翔 <tianxiang.ctx@alibaba-inc.com>
2023-10-11 00:19:53 -07:00
twaka 8285736840
Add workaround for AWQ on Turing GPUs (#1252) 2023-10-10 19:48:16 -07:00
Liang ebe4d1db3a
Fix boundary check in paged attention kernel (#1241) 2023-10-01 11:35:06 -07:00
Antoni Baum cf5cb1e33e
Allocate more shared memory to attention kernel (#1154) 2023-09-26 22:27:13 -07:00
Woosuk Kwon 2b1c116b5a
Add minimum capability requirement for AWQ (#1064) 2023-09-18 12:02:01 -07:00
Woosuk Kwon e3e79e9e8a
Implement AWQ quantization support for LLaMA (#1032)
Co-authored-by: Robert Irvine <robert@seamlessml.com>
Co-authored-by: root <rirv938@gmail.com>
Co-authored-by: Casper <casperbh.96@gmail.com>
Co-authored-by: julian-q <julianhquevedo@gmail.com>
2023-09-16 00:03:37 -07:00
Zhuohan Li db09d4ad83
[FIX] Fix Alibi implementation in PagedAttention kernel (#945)
* Fix test_attention

* Fix

Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Oliver-ss <yuansongwx@outlook.com>
2023-09-07 15:53:14 -07:00
Woosuk Kwon 320a622ec4
[BugFix] Implement RoPE for GPT-J (#941) 2023-09-06 11:54:33 +09:00
Woosuk Kwon bf87484efa
[BugFix] Fix NaN errors in paged attention kernel (#936) 2023-09-04 09:20:06 +09:00
Woosuk Kwon 8ce9c50d40
Avoid compiling kernels for double data type (#933) 2023-09-02 14:59:47 +09:00
Woosuk Kwon d64bf1646c
Implement approximate GELU kernels (#828) 2023-08-23 07:43:21 +09:00
Dean Leitersdorf 79af7e96a0
[OPTIMIZATION] Optimizes the single_query_cached_kv_attention kernel (#420) 2023-08-04 10:57:29 -07:00
Zhuohan Li 1b0bd0fe8a
Add Falcon support (new) (#592) 2023-08-02 14:04:39 -07:00
Zhuohan Li 6fc2a38b11
Add support for LLaMA-2 (#505) 2023-07-20 11:38:27 -07:00
Zhuohan Li 96853af5a8
Optimize MQA Kernel (#452) 2023-07-14 20:06:40 -04:00
Andre Slavescu c894836108
[Model] Add support for GPT-J (#226)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-07-08 17:55:16 -07:00
Woosuk Kwon 404422f42e
[Model] Add support for MPT (#334) 2023-07-03 16:47:53 -07:00
Woosuk Kwon e41f06702c
Add support for BLOOM (#331) 2023-07-03 13:12:35 -07:00
Woosuk Kwon 0b98ba15c7
Change the name to vLLM (#150) 2023-06-17 03:07:40 -07:00
Woosuk Kwon e38074b1e6
Support FP32 (#141) 2023-06-07 00:40:21 -07:00
Woosuk Kwon d721168449
Improve setup script & Add a guard for bfloat16 kernels (#130) 2023-05-27 00:59:32 -07:00
Woosuk Kwon 667ba3995c
Add copyright headers to source files adapted from FT (#104) 2023-05-14 22:19:19 -07:00
Woosuk Kwon 130d5fd8c7
Fix a bug in attention kernel (#68) 2023-05-04 02:56:09 -07:00
Woosuk Kwon e070829ae8
Support bfloat16 data type (#54) 2023-05-03 14:09:44 -07:00
Woosuk Kwon 436e523bf1
Refactor attention kernels (#53) 2023-05-03 13:40:13 -07:00
Woosuk Kwon a96d63c21d
Add support for GPT-NeoX (Pythia) (#50) 2023-04-28 00:32:10 -07:00
Woosuk Kwon 0f4b32199e
Support various block sizes & Change default block size to 16 (#38) 2023-04-15 09:03:24 -07:00
Siyuan (Ryans) Zhuang e3cec88aa5
Memcpy kernel for flash attention (#29)
* optimize

* add benchmark

* add assert

* add test
2023-04-10 18:22:49 -07:00
Woosuk Kwon b9926f7f66
Support block size 32 (#35) 2023-04-09 23:07:18 -07:00
Woosuk Kwon c267b1a02c
Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script (#27)
* Add query stride to multi_query_cached_kv_attention

* Add kernel benchmark script
2023-04-08 13:36:09 -07:00
Woosuk Kwon 0f40557af6
Implement block copy kernel to optimize beam search (#32) 2023-04-07 17:45:07 -07:00
Siyuan (Ryans) Zhuang 21b3671bbc
Basic attention kernel that supports cached KV + (multi-)prompts (#24) 2023-04-04 20:34:46 -07:00
Woosuk Kwon 897cb2ae28
Optimize data movement (#20) 2023-04-02 00:30:17 -07:00
Woosuk Kwon 09e9245478
Add custom kernel for RMS normalization (#16) 2023-04-01 00:51:22 +08:00