Commit Graph

1375 Commits

Author SHA1 Message Date
Simon Mo c7f2cf2b7f
[CI] Reduce wheel size by not shipping debug symbols (#4602) 2024-05-04 21:28:58 -07:00
Simon Mo 8d8357c8ed
bump version to v0.4.2 (#4600) 2024-05-04 17:09:49 -07:00
DearPlanet 4302987069
[Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937) 2024-05-04 15:39:34 -07:00
Simon Mo 021b1a2ab7
[CI] check size of the wheels (#4319) 2024-05-04 20:44:36 +00:00
Michael Goin 2a052011ca
[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)
Follow on to #4332 to enable FP8 checkpoint loading for Mixtral and supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:

Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
Supports static or dynamic activation quantization with static weight quantization (all per tensor)
Supports different scales for each expert weight
Supports Fp8 in QKV layer
Notes:

The Expert Gate/Router always runs at half / full precision for now.
If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.
2024-05-04 11:45:16 -07:00
SangBin Cho 36fb68f947
[Doc] Chunked Prefill Documentation (#4580) 2024-05-04 00:18:00 -07:00
Cody Yu bc8ad68455
[Misc][Refactor] Introduce ExecuteModelData (#4540) 2024-05-03 17:47:07 -07:00
youkaichao 344bf7cd2d
[Misc] add installation time env vars (#4574) 2024-05-03 15:55:56 -07:00
Cade Daniel ab50275111
[Speculative decoding] Support target-model logprobs (#4378) 2024-05-03 15:52:01 -07:00
Lily Liu 43c413ec57
[Kernel] Use flashinfer for decoding (#4353)
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
2024-05-03 15:51:27 -07:00
Sebastian Schoennenbeck f8e7adda21
Fix/async chat serving (#2727) 2024-05-03 11:04:14 -07:00
Michael Goin 7e65477e5e
[Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None (#4586) 2024-05-03 10:32:21 -07:00
SangBin Cho 3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) 2024-05-03 10:20:12 -07:00
youkaichao 2d7bce9cd5
[Doc] add env vars to the doc (#4572) 2024-05-03 05:13:49 +00:00
DefTruth ce3f1eedf8
[Misc] remove chunk detected debug logs (#4571) 2024-05-03 04:48:08 +00:00
Yang, Bo 808632d3b4
[BugFix] Prevent the task of `_force_log` from being garbage collected (#4567) 2024-05-03 01:35:18 +00:00
youkaichao 344a5d0c33
[Core][Distributed] enable allreduce for multiple tp groups (#4566) 2024-05-02 17:32:33 -07:00
SangBin Cho 0f8a91401c
[Core] Ignore infeasible swap requests. (#4557) 2024-05-02 14:31:20 -07:00
Alexei-V-Ivanov-AMD 9b5c9f9484
[CI/Build] AMD CI pipeline with extended set of tests. (#4267)
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-05-02 12:29:07 -07:00
Michał Moskal 32881f3f31
[kernel] fix sliding window in prefix prefill Triton kernel (#4405)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2024-05-02 11:23:37 -07:00
youkaichao 5b8a7c1cb0
[Misc] centralize all usage of environment variables (#4548) 2024-05-02 11:13:25 -07:00
Mark McLoughlin 1ff0c73a79
[BugFix] Include target-device specific requirements.txt in sdist (#4559) 2024-05-02 10:52:51 -07:00
Hu Dong 5ad60b0cbd
[Misc] Exclude the `tests` directory from being packaged (#4552) 2024-05-02 10:50:25 -07:00
SangBin Cho fb087af52e
[mypy][7/N] Cover all directories (#4555) 2024-05-02 10:47:41 -07:00
alexm-nm 7038e8b803
[Kernel] Support running GPTQ 8-bit models in Marlin (#4533) 2024-05-02 12:56:22 -04:00
youkaichao 2a85f93007
[Core][Distributed] enable multiple tp group (#4512)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-05-02 04:28:21 +00:00
SangBin Cho cf8cac8c70
[mypy][6/N] Fix all the core subdirectory typing (#4450)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-05-02 03:01:00 +00:00
Ronen Schaffer 5e401bce17
[CI]Add regression tests to ensure the async engine generates metrics (#4524) 2024-05-01 19:57:12 -07:00
SangBin Cho 0d62fe58db
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption (#4451) 2024-05-01 19:24:13 -07:00
Danny Guinther b8afa8b95a
[MISC] Rework logger to enable pythonic custom logging configuration to be provided (#4273) 2024-05-01 17:34:40 -07:00
Woosuk Kwon 826b82a260
[Misc] Fix expert_ids shape in MoE (#4517) 2024-05-01 23:47:59 +00:00
Philipp Moritz c9d852d601
[Misc] Remove Mixtral device="cuda" declarations (#4543)
Remove the device="cuda" declarations in mixtral as promised in #4343
2024-05-01 16:30:52 -07:00
youkaichao 6ef09b08f8
[Core][Distributed] fix pynccl del error (#4508) 2024-05-01 15:23:06 -07:00
Roy 3a922c1e7e
[Bugfix][Core] Fix and refactor logging stats (#4336) 2024-05-01 20:08:14 +00:00
sasha0552 c47ba4aaa9
[Bugfix] Add validation for seed (#4529) 2024-05-01 19:31:22 +00:00
Philipp Moritz 24bb4fe432
[Kernel] Update fused_moe tuning script for FP8 (#4457)
This PR updates the tuning script for the fused_moe kernel to support FP8 and also adds configurations for TP4. Note that for the configuration I removed num_warps and num_stages for small batch sizes since that improved performance and brought the benchmarks on par with the numbers before in that regime to make sure this is a strict improvement over the status quo.

All the numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1, 1000 input and 50 output tokens.

Before this PR (with static activation scaling):

qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency 
qps = 4: 10.1 ms ITL, 0.52s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 14.0 ms ITL, 0.70s e2e latency
qps = 10: 15.7 ms ITL, 0.79s e2e latency

After this PR (with static activation scaling):

qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.2 ms ITL, 0.53s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 11.9 ms ITL, 0.59s e2e latency
qps = 10: 12.1 ms ITL, 0.61s e2e latency
2024-05-01 11:47:38 -07:00
Nick Hill a657bfc48a
[Core] Add `multiproc_worker_utils` for multiprocessing-based workers (#4357) 2024-05-01 18:41:59 +00:00
leiwen83 24750f4cad
[Core] Enable prefix caching with block manager v2 enabled (#4142)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
2024-05-01 11:20:32 -07:00
leiwen83 b38e42fbca
[Speculative decoding] Add ngram prompt lookup decoding (#4237)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-05-01 11:13:03 -07:00
Travis Johnson 8b798eec75
[CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation (#4534)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-05-01 18:01:50 +00:00
sasha0552 69909126a7
[Bugfix] Use random seed if seed is -1 (#4531) 2024-05-01 10:41:17 -07:00
Frαnçois e491c7e053
[Doc] update(example model): for OpenAI compatible serving (#4503) 2024-05-01 10:14:16 -07:00
Robert Shaw 4dc8026d86
[Bugfix] Fix 307 Redirect for `/metrics` (#4523) 2024-05-01 09:14:13 -07:00
AnyISalIn a88bb9b032
[Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. (#4173)
Signed-off-by: AnyISalIn <anyisalin@gmail.com>
2024-05-01 09:11:03 -07:00
SangBin Cho 6f1df80436
[Test] Add ignore_eos test (#4519) 2024-05-01 08:45:42 -04:00
Jee Li d6f4bd7cdd
[Misc]Add customized information for models (#4132) 2024-04-30 21:18:14 -07:00
Robert Caulk c3845d82dc
Allow user to define whitespace pattern for outlines (#4305) 2024-04-30 20:48:39 -07:00
Pastel! a822eb3413
[Misc] fix typo in block manager (#4453) 2024-04-30 20:41:32 -07:00
harrywu f458112e8a
[Misc][Typo] type annotation fix (#4495) 2024-04-30 20:21:39 -07:00
Nick Hill 2e240c69a9
[Core] Centralize GPU Worker construction (#4419) 2024-05-01 01:06:34 +00:00