Author | Commit | Subject | Date
Antoni Baum | c07ece5ca4 | Make `AsyncLLMEngine` more robust & fix batched abort (#969) | 2023-09-07 13:43:45 -07:00
    Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
    Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Woosuk Kwon | 320a622ec4 | [BugFix] Implement RoPE for GPT-J (#941) | 2023-09-06 11:54:33 +09:00
Antoni Baum | c9927c1a6a | Use queue for finished requests (#957) | 2023-09-05 19:27:23 -07:00
Woosuk Kwon | fbd80ad409 | Clean up kernel unit tests (#938) | 2023-09-05 16:57:38 -07:00
Zhuohan Li | 002800f081 | Align vLLM's beam search implementation with HF generate (#857) | 2023-09-04 17:29:42 -07:00
Woosuk Kwon | 32b6816e55 | Add tests for models (#922) | 2023-09-01 11:19:43 +09:00
Aman Gupta Karmani | 75471386de | use flash-attn via xformers (#877) | 2023-08-29 21:52:13 -07:00
Woosuk Kwon | d64bf1646c | Implement approximate GELU kernels (#828) | 2023-08-23 07:43:21 +09:00
Tao Peng | d7a1c6d614 | Fix paged attention testing. (#495) | 2023-07-24 21:01:56 -07:00
    Signed-off-by: Tao Peng <jiankeng.pt@alibaba-inc.com>
Song | bda41c70dd | hotfix attn alibi wo head mapping (#496) | 2023-07-18 11:31:48 -07:00
    Co-authored-by: oliveryuan <oliveryuan@basemind.com>
Andre Slavescu | c894836108 | [Model] Add support for GPT-J (#226) | 2023-07-08 17:55:16 -07:00
    Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Woosuk Kwon | e41f06702c | Add support for BLOOM (#331) | 2023-07-03 13:12:35 -07:00
Zhuohan Li | d6fa1be3a8 | [Quality] Add code formatter and linter (#326) | 2023-07-03 11:31:55 -07:00
Woosuk Kwon | 0b98ba15c7 | Change the name to vLLM (#150) | 2023-06-17 03:07:40 -07:00
Woosuk Kwon | e38074b1e6 | Support FP32 (#141) | 2023-06-07 00:40:21 -07:00
Woosuk Kwon | a283ec2eec | Add contributing guideline and mypy config (#122) | 2023-05-23 17:58:51 -07:00
Woosuk Kwon | 825d8892b5 | Use pytest format for unit tests (#107) | 2023-05-17 17:11:23 -07:00
Woosuk Kwon | c9d5b6d4a8 | Replace FlashAttention with xformers (#70) | 2023-05-05 02:01:08 -07:00
Woosuk Kwon | 436e523bf1 | Refactor attention kernels (#53) | 2023-05-03 13:40:13 -07:00
Woosuk Kwon | a96d63c21d | Add support for GPT-NeoX (Pythia) (#50) | 2023-04-28 00:32:10 -07:00
Siyuan (Ryans) Zhuang | e3cec88aa5 | Memcpy kernel for flash attention (#29) | 2023-04-10 18:22:49 -07:00
    * optimize
    * add benchmark
    * add assert
    * add test
Woosuk Kwon | b9926f7f66 | Support block size 32 (#35) | 2023-04-09 23:07:18 -07:00
Woosuk Kwon | c267b1a02c | Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script (#27) | 2023-04-08 13:36:09 -07:00
Woosuk Kwon | 0f40557af6 | Implement block copy kernel to optimize beam search (#32) | 2023-04-07 17:45:07 -07:00
Siyuan (Ryans) Zhuang | 21b3671bbc | Basic attention kernel that supports cached KV + (multi-)prompts (#24) | 2023-04-04 20:34:46 -07:00
Woosuk Kwon | 897cb2ae28 | Optimize data movement (#20) | 2023-04-02 00:30:17 -07:00
Woosuk Kwon | 09e9245478 | Add custom kernel for RMS normalization (#16) | 2023-04-01 00:51:22 +08:00
Woosuk Kwon | 88c0268a18 | Implement custom kernel for LLaMA rotary embedding (#14) | 2023-03-30 11:04:21 -07:00
Woosuk Kwon | a1b3de86cd | Refactor the test code for attention kernels (#13) | 2023-03-29 18:59:27 -07:00
Woosuk Kwon | 3e9f991d6a | Use FlashAttention for `multi_query_kv_attention` (#4) | 2023-03-01 21:13:08 -08:00
Woosuk Kwon | 0deacbce6e | Implement `single_query_cached_kv_attention` kernel (#3) | 2023-03-01 15:02:19 -08:00
Woosuk Kwon | af68ec1c5c | Add tests for kernels | 2023-02-18 19:23:07 +00:00