Philipp Moritz | 657061fdce | [docs] Add LoRA support information for models (#3299) | 2024-03-11 00:54:51 -07:00
Zhuohan Li | 2f8844ba08 | Re-enable the 80 char line width limit (#3305) | 2024-03-10 19:49:14 -07:00
Nick Hill | 4b59f00e91 | [Fix] Fix best_of behavior when n=1 (#3298) | 2024-03-10 19:17:46 -07:00
Roy | 9e8744a545 | [BugFix] Fix get tokenizer when using ray (#3301) | 2024-03-10 19:17:16 -07:00
Douglas Lehr | e4a28e5316 | [ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA (#3262) | 2024-03-10 15:27:45 -07:00
Terry | 0bba88df03 | Enhance lora tests with more layer and rank variations (#3243) | 2024-03-09 17:14:16 -08:00
Cade Daniel | 8437bae6ef | [Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103) | 2024-03-08 23:32:46 -08:00
Zhuohan Li | f48c6791b7 | [FIX] Fix prefix test error on main (#3286) | 2024-03-08 17:16:14 -08:00
Michael Goin | c2c5e0909a | Move model filelocks from `/tmp/` to `~/.cache/vllm/locks/` dir (#3241) | 2024-03-08 13:33:10 -08:00
Woosuk Kwon | 1cb0cc2975 | [FIX] Make `flash_attn` optional (#3269) | 2024-03-08 10:52:20 -08:00
Roger Wang | 99c3cfb83c | [Docs] Fix Unmocked Imports (#3275) | 2024-03-08 09:58:01 -08:00
TianYu GUO | 1ece1ae829 | [Minor Fix] Fix comments in benchmark_serving (#3252) | 2024-03-07 22:22:59 -08:00
whyiug | c59e120c55 | Feature add lora support for Qwen2 (#3177) | 2024-03-07 21:58:24 -08:00
Nick Hill | d2339d6840 | Connect engine healthcheck to openai server (#3260) | 2024-03-07 16:38:12 -08:00
ElizaWszola | b35cc93420 | Fix auto prefix bug (#3239) | 2024-03-07 16:37:28 -08:00
jacobthebanana | 8cbba4622c | Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) (#3263) | 2024-03-07 23:03:22 +00:00
Michael Goin | 385da2dae2 | Measure model memory usage (#3120) | 2024-03-07 11:42:42 -08:00
Woosuk Kwon | 2daf23ab0c | Separate attention backends (#3005) | 2024-03-07 01:45:50 -08:00
Chen Wang | cbf4c05b15 | Update requirements-dev.txt to include package for benchmarking scripts. (#3181) | 2024-03-07 08:39:28 +00:00
    Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
TechxGenus | d3c04b6a39 | Add GPTQ support for Gemma (#3200) | 2024-03-07 08:19:14 +08:00
Chujie Zheng | 4cb3b924cd | Add tqdm `dynamic_ncols=True` (#3242) | 2024-03-06 22:41:42 +00:00
Cade Daniel | a33ce60c66 | [Testing] Fix core tests (#3224) | 2024-03-06 01:04:23 -08:00
SangBin Cho | 24aecf421a | [Tests] Add block manager and scheduler tests (#3108) | 2024-03-05 18:23:34 -08:00
Nick Hill | 2efce05dc3 | [Fix] Avoid pickling entire LLMEngine for Ray workers (#3207) | 2024-03-06 00:17:20 +00:00
    Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Nick Hill | 8999ec3c16 | Store `eos_token_id` in `Sequence` for easy access (#3166) | 2024-03-05 15:35:43 -08:00
Hongxia Yang | 05af6da8d9 | [ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs (#3123) | 2024-03-04 18:14:53 -08:00
    Co-authored-by: lcskrishna <lollachaitanya@gmail.com>
Chen Wang | 9a4548bae7 | Fix the openai benchmarking requests to work with latest OpenAI apis (#2992) | 2024-03-04 15:51:56 -08:00
    Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Antoni Baum | ff578cae54 | Add health check, make async Engine more robust (#3015) | 2024-03-04 22:01:40 +00:00
    Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Antoni Baum | 22de45235c | Push logprob generation to LLMEngine (#3065) | 2024-03-04 19:54:06 +00:00
    Co-authored-by: Avnish Narayan <avnish@anyscale.com>
ttbachyinsda | 76e8a70476 | [Minor fix] The domain dns.google may cause a socket.gaierror exception (#3176) | 2024-03-04 19:17:12 +00:00
    Co-authored-by: guofangze <guofangze@kuaishou.com>
Allen.Dou | 9cbc7e5f3b | enable --gpu-memory-utilization in benchmark_throughput.py (#3175) | 2024-03-04 10:37:58 -08:00
    Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Jialun Lyu | 27a7b070db | Add document for vllm paged attention kernel. (#2978) | 2024-03-04 09:23:34 -08:00
TianYu GUO | 901cf4c52b | [Minor Fix] Remove unused code in benchmark_prefix_caching.py (#3171) | 2024-03-03 22:48:27 -08:00
Liangfu Chen | d0fae88114 | [DOC] add setup document to support neuron backend (#2777) | 2024-03-04 01:03:51 +00:00
Philipp Moritz | 17c3103c56 | Make it easy to profile workers with nsight (#3162) | 2024-03-03 16:19:13 -08:00
    Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Zhuohan Li | 996d095c54 | [FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark (#3158) | 2024-03-03 14:37:18 -08:00
Jason Cox | d65fac2738 | Add vLLM version info to logs and openai API server (#3161) | 2024-03-02 21:00:29 -08:00
Sage Moore | ce4f5a29fb | Add Automatic Prefix Caching (#2762) | 2024-03-02 00:50:01 -08:00
    Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
    Co-authored-by: Michael Goin <michael@neuralmagic.com>
cloudhan | baee28c46c | Reorder kv dtype check to avoid nvcc not found error on AMD platform (#3104) | 2024-03-02 14:34:48 +08:00
Allen.Dou | 29e70e3e88 | allow user chose log level by --log-level instead of fixed 'info'. (#3109) | 2024-03-01 23:28:41 +00:00
    Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
    Co-authored-by: Simon Mo <simon.mo@hey.com>
Woosuk Kwon | 82091b864a | Bump up to v0.3.3 (#3129) | 2024-03-01 12:58:06 -08:00
Robert Shaw | c0c2335ce0 | Integrate Marlin Kernels for Int4 GPTQ inference (#2497) | 2024-03-01 12:47:51 -08:00
    Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
    Co-authored-by: alexm <alexm@neuralmagic.com>
Huarong | 90fbf12540 | fix relative import path of protocol.py (#3134) | 2024-03-01 19:42:06 +00:00
    Co-authored-by: huohuarong <huohuarong@zuoshouyisheng.com>
Yuan Tang | 49d849b3ab | docs: Add tutorial on deploying vLLM model with KServe (#2586) | 2024-03-01 11:04:14 -08:00
    Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Seonghyeon | 27ca23dc00 | Remove exclude_unset in streaming response (#3143) | 2024-03-01 09:59:06 -08:00
Sherry | 54d3544784 | Fix: Output text is always truncated in some models (#3016) | 2024-03-01 07:52:22 +00:00
felixzhu555 | 703e42ee4b | Add guided decoding for OpenAI API server (#2819) | 2024-02-29 22:13:08 +00:00
    Co-authored-by: br3no <breno@veltefaria.de>
    Co-authored-by: simon-mo <simon.mo@hey.com>
Nick Hill | 29a8d6a554 | [Fix] Don't deep-copy LogitsProcessors when copying SamplingParams (#3099) | 2024-02-29 19:20:42 +00:00
Billy Cao | 2c08ff23c0 | Fix building from source on WSL (#3112) | 2024-02-29 11:13:58 -08:00
Seonghyeon | bfdcfa6a05 | Support starcoder2 architecture (#3089) | 2024-02-29 00:51:48 -08:00