陈序
54be8a0be2
Fix assertion failure in Qwen 1.5 with prefix caching enabled ( #3373 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-03-14 13:56:57 -07:00
Terry
7e9bd08f60
Add batched RoPE kernel ( #3095 )
2024-03-13 13:45:26 -07:00
Or Sharir
ae0ccb4017
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. ( #3350 )
2024-03-13 12:18:25 -07:00
Woosuk Kwon
602358f8a8
Add kernel for GeGLU with approximate GELU ( #3337 )
2024-03-12 22:06:17 -07:00
Breno Faria
49a3c8662b
Fixes #1556 double free ( #3347 )
2024-03-13 00:30:08 +00:00
Zhuohan Li
4c922709b6
Add distributed model executor abstraction ( #3191 )
2024-03-11 11:03:45 -07:00
Zhuohan Li
2f8844ba08
Re-enable the 80 char line width limit ( #3305 )
2024-03-10 19:49:14 -07:00
Roy
9e8744a545
[BugFix] Fix get tokenizer when using ray ( #3301 )
2024-03-10 19:17:16 -07:00
Terry
0bba88df03
Enhance lora tests with more layer and rank variations ( #3243 )
2024-03-09 17:14:16 -08:00
Cade Daniel
8437bae6ef
[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling ( #3103 )
2024-03-08 23:32:46 -08:00
ElizaWszola
b35cc93420
Fix auto prefix bug ( #3239 )
2024-03-07 16:37:28 -08:00
jacobthebanana
8cbba4622c
Possible fix for conflict between Automated Prefix Caching ( #2762 ) and multi-LoRA support ( #1804 ) ( #3263 )
2024-03-07 23:03:22 +00:00
Woosuk Kwon
2daf23ab0c
Separate attention backends ( #3005 )
2024-03-07 01:45:50 -08:00
Cade Daniel
a33ce60c66
[Testing] Fix core tests ( #3224 )
2024-03-06 01:04:23 -08:00
SangBin Cho
24aecf421a
[Tests] Add block manager and scheduler tests ( #3108 )
2024-03-05 18:23:34 -08:00
Nick Hill
8999ec3c16
Store `eos_token_id` in `Sequence` for easy access ( #3166 )
2024-03-05 15:35:43 -08:00
Antoni Baum
ff578cae54
Add health check, make async Engine more robust ( #3015 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-04 22:01:40 +00:00
Antoni Baum
22de45235c
Push logprob generation to LLMEngine ( #3065 )
...
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-03-04 19:54:06 +00:00
Sage Moore
ce4f5a29fb
Add Automatic Prefix Caching ( #2762 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
Robert Shaw
c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference ( #2497 )
...
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
felixzhu555
703e42ee4b
Add guided decoding for OpenAI API server ( #2819 )
...
Co-authored-by: br3no <breno@veltefaria.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-02-29 22:13:08 +00:00
Seonghyeon
bfdcfa6a05
Support starcoder2 architecture ( #3089 )
2024-02-29 00:51:48 -08:00
Woosuk Kwon
929b4f2973
Add LoRA support for Gemma ( #3050 )
2024-02-28 13:03:28 -08:00
Liangfu Chen
3b7178cfa4
[Neuron] Support inference with transformers-neuronx ( #2569 )
2024-02-28 09:34:34 -08:00
Tao He
71bcaf99e2
Enable GQA support in the prefix prefill kernels ( #3007 )
...
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-02-27 01:14:31 -08:00
Dylan Hawk
e0ade06d63
Support logit bias for OpenAI API ( #3027 )
2024-02-27 11:51:53 +08:00
Jared Moore
70f3e8e3a1
Add LogProbs for Chat Completions in OpenAI ( #2918 )
2024-02-26 10:39:34 +08:00
Harry Mellor
ef978fe411
Port metrics from `aioprometheus` to `prometheus_client` ( #2730 )
2024-02-25 11:54:00 -08:00
Ronen Schaffer
4caf7044e0
Include tokens from prompt phase in `counter_generation_tokens` ( #2802 )
2024-02-22 14:00:12 -08:00
Woosuk Kwon
fd5dcc5c81
Optimize GeGLU layer in Gemma ( #2975 )
2024-02-21 20:17:52 -08:00
Massimiliano Pronesti
93dc5a2870
chore(vllm): codespell for spell checking ( #2820 )
2024-02-21 18:56:01 -08:00
Nick Hill
7d2dcce175
Support per-request seed ( #2514 )
2024-02-21 11:47:00 -08:00
Antoni Baum
017d9f1515
Add metrics to RequestOutput ( #2876 )
2024-02-20 21:55:57 -08:00
Zhuohan Li
63e2a6419d
[FIX] Fix beam search test ( #2930 )
2024-02-20 14:37:39 -08:00
Ronen Schaffer
e433c115bc
Fix `vllm:prompt_tokens_total` metric calculation ( #2869 )
2024-02-18 23:55:41 -08:00
Isotr0py
ab3a5a8259
Support OLMo models. ( #2832 )
2024-02-18 21:05:15 -08:00
Zhuohan Li
a61f0521b8
[Test] Add basic correctness test ( #2908 )
2024-02-18 16:44:50 -08:00
jvmncs
8f36444c4f
multi-LoRA as extra models in OpenAI server ( #2775 )
...
how to serve the loras (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py )):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.api_server \
--model meta-llama/Llama-2-7b-hf \
--enable-lora \
--lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
the above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified lora modules. in this case sql-lora and sql-lora2 point to the same underlying lora, but this need not be the case. lora config values take the same values they do in EngineArgs
no work has been done here to scope client permissions to specific models
2024-02-17 12:00:48 -08:00
Woosuk Kwon
d7afab6d3a
[BugFix] Fix GC bug for `LLM` class ( #2882 )
2024-02-14 22:17:44 -08:00
Terry
2a543d6efe
Add LoRA support for Mixtral ( #2831 )
...
* add mixtral lora support
* formatting
* fix incorrectly ported logic
* polish tests
* minor fixes and refactoring
* minor fixes
* formatting
* rename and remove redundant logic
* refactoring
* refactoring
* minor fix
* minor refactoring
* fix code smell
2024-02-14 00:55:45 +01:00
Lily Liu
fe6d09ae61
[Minor] More fix of test_cache.py CI test failure ( #2750 )
2024-02-06 11:38:38 -08:00
Woosuk Kwon
f0d4e14557
Add fused top-K softmax kernel for MoE ( #2769 )
2024-02-05 17:38:02 -08:00
Hongxia Yang
56f738ae9b
[ROCm] Fix some kernels failed unit tests ( #2498 )
2024-02-05 14:25:36 -08:00
Kunshang Ji
96b6f475dd
Remove hardcoded `device="cuda" ` to support more devices ( #2503 )
...
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-02-01 15:46:39 -08:00
Philipp Moritz
d0d93b92b1
Add unit test for Mixtral MoE layer ( #2677 )
2024-01-31 14:34:17 -08:00
Philipp Moritz
89efcf1ce5
[Minor] Fix test_cache.py CI test failure ( #2684 )
2024-01-31 10:12:11 -08:00
Vladimir
4f65af0e25
Add swap_blocks unit tests ( #2616 )
2024-01-30 09:30:50 -08:00
wangding zeng
5d60def02c
DeepseekMoE support with Fused MoE kernel ( #2453 )
...
Co-authored-by: roy <jasonailu87@gmail.com>
2024-01-29 21:19:48 -08:00
zhaoyang-star
9090bf02e7
Support FP8-E5M2 KV Cache ( #2279 )
...
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Hanzhi Zhou
380170038e
Implement custom all reduce kernels ( #2192 )
2024-01-27 12:46:35 -08:00