Commit Graph

132 Commits

Author SHA1 Message Date
Antoni Baum c07ece5ca4
Make `AsyncLLMEngine` more robust & fix batched abort (#969)
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
2023-09-07 13:43:45 -07:00
Woosuk Kwon 320a622ec4
[BugFix] Implement RoPE for GPT-J (#941) 2023-09-06 11:54:33 +09:00
Antoni Baum c9927c1a6a
Use queue for finished requests (#957) 2023-09-05 19:27:23 -07:00
Woosuk Kwon fbd80ad409
Clean up kernel unit tests (#938) 2023-09-05 16:57:38 -07:00
Zhuohan Li 002800f081
Align vLLM's beam search implementation with HF generate (#857) 2023-09-04 17:29:42 -07:00
Woosuk Kwon 32b6816e55
Add tests for models (#922) 2023-09-01 11:19:43 +09:00
Aman Gupta Karmani 75471386de
use flash-attn via xformers (#877) 2023-08-29 21:52:13 -07:00
Woosuk Kwon d64bf1646c
Implement approximate GELU kernels (#828) 2023-08-23 07:43:21 +09:00
Tao Peng d7a1c6d614
Fix paged attention testing. (#495)
Signed-off-by: Tao Peng <jiankeng.pt@alibaba-inc.com>
2023-07-24 21:01:56 -07:00
Song bda41c70dd
hotfix attn alibi wo head mapping (#496)
Co-authored-by: oliveryuan <oliveryuan@basemind.com>
2023-07-18 11:31:48 -07:00
Andre Slavescu c894836108
[Model] Add support for GPT-J (#226)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-07-08 17:55:16 -07:00
Woosuk Kwon e41f06702c
Add support for BLOOM (#331) 2023-07-03 13:12:35 -07:00
Zhuohan Li d6fa1be3a8
[Quality] Add code formatter and linter (#326) 2023-07-03 11:31:55 -07:00
Woosuk Kwon 0b98ba15c7
Change the name to vLLM (#150) 2023-06-17 03:07:40 -07:00
Woosuk Kwon e38074b1e6
Support FP32 (#141) 2023-06-07 00:40:21 -07:00
Woosuk Kwon a283ec2eec
Add contributing guideline and mypy config (#122) 2023-05-23 17:58:51 -07:00
Woosuk Kwon 825d8892b5
Use pytest format for unit tests (#107) 2023-05-17 17:11:23 -07:00
Woosuk Kwon c9d5b6d4a8
Replace FlashAttention with xformers (#70) 2023-05-05 02:01:08 -07:00
Woosuk Kwon 436e523bf1
Refactor attention kernels (#53) 2023-05-03 13:40:13 -07:00
Woosuk Kwon a96d63c21d
Add support for GPT-NeoX (Pythia) (#50) 2023-04-28 00:32:10 -07:00
Siyuan (Ryans) Zhuang e3cec88aa5
Memcpy kernel for flash attention (#29)
* optimize

* add benchmark

* add assert

* add test
2023-04-10 18:22:49 -07:00
Woosuk Kwon b9926f7f66
Support block size 32 (#35) 2023-04-09 23:07:18 -07:00
Woosuk Kwon c267b1a02c
Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script (#27)
* Add query stride to multi_query_cached_kv_attention

* Add kernel benchmark script
2023-04-08 13:36:09 -07:00
Woosuk Kwon 0f40557af6
Implement block copy kernel to optimize beam search (#32) 2023-04-07 17:45:07 -07:00
Siyuan (Ryans) Zhuang 21b3671bbc
Basic attention kernel that supports cached KV + (multi-)prompts (#24) 2023-04-04 20:34:46 -07:00
Woosuk Kwon 897cb2ae28
Optimize data movement (#20) 2023-04-02 00:30:17 -07:00
Woosuk Kwon 09e9245478
Add custom kernel for RMS normalization (#16) 2023-04-01 00:51:22 +08:00
Woosuk Kwon 88c0268a18
Implement custom kernel for LLaMA rotary embedding (#14) 2023-03-30 11:04:21 -07:00
Woosuk Kwon a1b3de86cd
Refactor the test code for attention kernels (#13) 2023-03-29 18:59:27 -07:00
Woosuk Kwon 3e9f991d6a
Use FlashAttention for `multi_query_kv_attention` (#4) 2023-03-01 21:13:08 -08:00
Woosuk Kwon 0deacbce6e
Implement `single_query_cached_kv_attention` kernel (#3) 2023-03-01 15:02:19 -08:00
Woosuk Kwon af68ec1c5c
Add tests for kernels 2023-02-18 19:23:07 +00:00