Commit Graph

1192 Commits

Author SHA1 Message Date
SangBin Cho 603ad84815
[Core] Refactoring sampler and support prompt logprob for chunked prefill (#4309) 2024-04-26 13:02:02 +00:00
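With this refactor, prompt logprobs can be requested even when chunked prefill is enabled. A hedged sketch of how the combination might be exercised, assuming vLLM's `enable_chunked_prefill` engine flag and `prompt_logprobs` sampling option:

```python
from vllm import LLM, SamplingParams

# Assumes `enable_chunked_prefill` (engine flag) and `prompt_logprobs`
# (sampling option) as in vLLM around this release.
llm = LLM(model="facebook/opt-125m", enable_chunked_prefill=True)
params = SamplingParams(temperature=0.0, logprobs=1, prompt_logprobs=1)
out = llm.generate(["The capital of France is"], params)
print(out[0].prompt_logprobs)  # per-token logprobs for the prompt itself
```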
SangBin Cho a88081bf76
[CI] Disable non-lazy string operation on logging (#4326)
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
2024-04-26 00:16:58 -07:00
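This change converts eager string formatting in log calls to lazy %-style arguments, so the message is only built when the log level is actually enabled. A minimal sketch of the pattern (the logger name is illustrative):

```python
import logging

logger = logging.getLogger("vllm")  # illustrative logger name

def log_batch(batch_size: int) -> None:
    # Non-lazy: the f-string is evaluated even when INFO is disabled.
    # logger.info(f"Processing batch of size {batch_size}")
    # Lazy: formatting is deferred until the record is actually emitted.
    logger.info("Processing batch of size %d", batch_size)
```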
Norman Mu 2f30e7c72f
[Frontend] Add --log-level option to api server (#4377) 2024-04-26 05:36:01 +00:00
Cyrus Leung a74dee9b62
[Bugfix] Fix parameter name in `get_tokenizer` (#4107) 2024-04-25 19:10:48 -07:00
Hongxia Yang cf29b7eda4
[ROCm][Hardware][AMD][Doc] Documentation update for ROCm (#4376)
Co-authored-by: WoosukKwon <woosuk.kwon@berkeley.edu>
2024-04-25 18:12:25 -07:00
Nick Hill efffb63f58
[Core] Move function tracing setup to util function (#4352) 2024-04-25 16:45:12 -07:00
Nick Hill 15e7c675b0
[Core] Add `shutdown()` method to `ExecutorBase` (#4349) 2024-04-25 16:32:48 -07:00
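Adding a `shutdown()` hook to the executor base class gives callers a uniform way to release workers. A minimal sketch of the shape of such an API (classes and method bodies are illustrative, not vLLM's actual code):

```python
from abc import ABC, abstractmethod

class ExecutorBase(ABC):
    """Illustrative executor interface with an explicit shutdown hook."""

    @abstractmethod
    def execute_model(self, request: str) -> str: ...

    def shutdown(self) -> None:
        """Release workers/resources; default is a no-op for local executors."""

class RayExecutor(ExecutorBase):
    def execute_model(self, request: str) -> str:
        return f"ran {request}"

    def shutdown(self) -> None:
        # A distributed executor would tear down its worker actors here.
        print("workers stopped")
```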
Roy b6dcb4d442
[Misc] Fix flash attention backend log (#4368) 2024-04-25 12:43:32 -07:00
SangBin Cho b5b4a398a7
[Mypy] Typing lora folder (#4337) 2024-04-25 19:13:50 +00:00
Kunshang Ji f4bc4de1b1
[Core] Refactor aqlm quant ops (#4351) 2024-04-25 15:03:56 -04:00
Caio Mendes bd7a8eef25
[Doc] README Phi-3 name fix. (#4372)
Co-authored-by: Caio Mendes <caiocesart@microsoft.com>
2024-04-25 10:32:00 -07:00
Alexei-V-Ivanov-AMD 7ee82bef1e
[CI/Build] Adding functionality to reset the node's GPUs before processing. (#4213) 2024-04-25 09:37:20 -07:00
Isotr0py fbf152d976
[Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 (#4324)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-25 09:35:56 -07:00
Nick Hill 479d69fad0
[Core] Move ray_utils.py from `engine` to `executor` package (#4347) 2024-04-25 06:52:22 +00:00
Caio Mendes 96e90fdeb3
[Model] Adds Phi-3 support (#4298) 2024-04-25 03:06:57 +00:00
zifeitong a395a638c2
[Misc] Use public API in benchmark_throughput (#4300) 2024-04-24 21:10:24 +00:00
youkaichao 2768884ac4
[Doc] Add note for docker user (#4340)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-24 21:09:44 +00:00
alexm-nm aae08249ac
[Bugfix] Fix marlin kernel crash on H100 (#4218)
This PR addresses the Marlin kernel H100 crash that was reported here: neuralmagic#187.
The crash was caused by inline PTX assembly that introduced async_copy with streaming behavior. The fix is to use the more standard PTX for async_copy (without the fractional L2 cache policy for "evict_first"). There is no performance difference between the standard async_copy PTX and the previous version.
2024-04-24 10:35:01 -07:00
Roger Wang 7923dcad12
[Misc] Update ShareGPT Dataset Sampling in Serving Benchmark (#4279) 2024-04-24 09:49:13 -07:00
youkaichao 3cd9b5bb2d
[Core][Distributed] use existing torch.cuda.device context manager (#4318)
2024-04-24 09:00:20 -07:00
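`torch.cuda.device` is itself a context manager that switches the current device on entry and restores the previous one on exit, so no custom wrapper is needed. A minimal sketch of the pattern (requires a CUDA-capable GPU):

```python
import torch

def run_on(rank: int) -> None:
    with torch.cuda.device(rank):
        x = torch.empty(1024, device="cuda")  # allocated on GPU `rank`
        torch.cuda.synchronize()
    # the previously current device is restored here
```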
Woosuk Kwon 468d761b32
[Misc] Reduce supported Punica dtypes (#4304) 2024-04-23 18:54:33 -07:00
youkaichao e4bf860a54
[CI][Build] change pynvml to nvidia-ml-py (#4302) 2024-04-23 18:33:12 -07:00
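`nvidia-ml-py` is NVIDIA's officially maintained NVML binding and installs the same `pynvml` module name, so imports stay unchanged. A quick smoke test:

```python
import pynvml  # provided by the nvidia-ml-py distribution

pynvml.nvmlInit()
print("GPUs visible:", pynvml.nvmlDeviceGetCount())
pynvml.nvmlShutdown()
```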
youkaichao 91f50a6fe2
[Core][Distributed] use cpu/gloo to initialize pynccl (#4248) 2024-04-23 18:32:19 -07:00
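Bootstrapping NCCL over a CPU-side gloo group lets ranks exchange setup data (such as a NCCL unique ID) before any CUDA context exists. A hedged sketch with stock `torch.distributed`, not vLLM's pynccl wrapper, assuming the usual MASTER_ADDR/MASTER_PORT environment is set; the payload is a stand-in for the real unique ID:

```python
import torch.distributed as dist

def bootstrap(rank: int, world_size: int) -> None:
    # CPU-only gloo group: no GPU is touched during initialization.
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
    payload = ["nccl-unique-id"] if rank == 0 else [None]
    dist.broadcast_object_list(payload, src=0)  # every rank now has the ID
```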
Robert Shaw 79a268c4ab
[BUG] fixed fp8 conflict with aqlm (#4307)
Fixes the fp8 interface, which broke in the AQLM merge.
2024-04-23 18:26:33 -07:00
Philipp Moritz eace8bf0b9
[Kernel] FP8 support for MoE kernel / Mixtral (#4244)
This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users neither need to compute activation scales on a calibration dataset nor convert their model checkpoints; specifying the `quantization="fp8"` argument is enough. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

**Performance**: For this PR, the focus is on keeping the code clean (while still getting reasonable performance); there are a number of optimizations that we will submit in a follow-up PR that significantly improve performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954). With this PR, the results are as follows:

[Screenshot: performance results with this PR, 2024-04-21]

**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
This compares favorably with the fp16 results, which are:
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```

Happy hacking!
2024-04-24 01:18:23 +00:00
Cyrus Leung 1e8f4252aa
[Bugfix][Frontend] Raise exception when file-like chat template fails to be opened (#4292) 2024-04-23 18:19:03 +00:00
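The fix is to surface a clear error when a chat-template path cannot be read, rather than failing silently. A minimal sketch of the pattern (the helper name is hypothetical):

```python
from pathlib import Path

def load_chat_template(path: str) -> str:
    # Hypothetical helper: raise a descriptive error instead of
    # swallowing a failed open of the template file.
    try:
        return Path(path).read_text()
    except OSError as e:
        raise ValueError(f"Could not open chat template {path!r}") from e
```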
James Fleming 2b7949c1c2
AQLM CUDA support (#3287)
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-23 13:59:33 -04:00
Simon Mo 62b5166bd4
[CI] Add ccache for wheel builds job (#4281) 2024-04-23 09:51:41 -07:00
youkaichao d86285a4a4
[Core][Logging] Add last frame information for better debugging (#4278) 2024-04-23 09:45:52 -07:00
DefTruth d87f39e9a9
[Bugfix] Add init_cached_hf_modules to RayWorkerWrapper (#4286) 2024-04-23 09:28:35 -07:00
Jack Gordley d3c8180ac4
[Bugfix] Fixing max token error message for openai compatible server (#4016) 2024-04-23 19:06:29 +08:00
Cade Daniel 62b8aebc6f
[Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951) 2024-04-23 08:02:36 +00:00
SangBin Cho 050f285ff6
[Core] Scheduling optimization 2 (#4280) 2024-04-23 08:02:11 +00:00
Nick Hill 8f2ea22bde
[Core] Some simplification of WorkerWrapper changes (#4183) 2024-04-23 07:49:08 +00:00
SangBin Cho 0ae11f78ab
[Mypy] Part 3 fix typing for nested directories for most of directory (#4161) 2024-04-22 21:32:44 -07:00
Harry Mellor 34128a697e
Fix `autodoc` directives (#4272)
Co-authored-by: Harry Mellor <hmellor@oxts.com>
2024-04-23 01:53:01 +00:00
youkaichao c1b4e4157c
[Core][Distributed] use absolute path for library file (#4271) 2024-04-22 17:21:48 -07:00
Zhanghao Wu ceaf4ed003
[Doc] Update the SkyPilot doc with serving and Llama-3 (#4276) 2024-04-22 15:34:31 -07:00
SangBin Cho ad8d696a99
[Core] Scheduler perf fix (#4270) 2024-04-22 21:11:06 +00:00
Harry Mellor 3d925165f2
Add example scripts to documentation (#4225)
Co-authored-by: Harry Mellor <hmellor@oxts.com>
2024-04-22 16:36:54 +00:00
alexm-nm 1543680691
[Bugfix] Ensure download_weights_from_hf(..) inside loader is using the revision parameter (#4217) 2024-04-22 09:10:48 -07:00
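The bug was that a caller-supplied `revision` was not forwarded to the Hugging Face download call. A sketch of the corrected flow using `huggingface_hub` (the wrapper here is an illustrative stand-in for vLLM's loader helper, not its actual code):

```python
from typing import Optional
from huggingface_hub import snapshot_download

def download_weights_from_hf(model_id: str, revision: Optional[str] = None) -> str:
    # Forward `revision` so a pinned branch/tag/commit is honored
    # instead of silently falling back to the default branch.
    return snapshot_download(model_id, revision=revision,
                             allow_patterns=["*.safetensors", "*.json"])
```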
Tao He 077f0a2e8a
[Frontend] Enable support for CPU backend in AsyncLLMEngine. (#3993)
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-04-22 09:19:51 +00:00
Woosuk Kwon e73ed0f1c6
[Bugfix] Fix type annotations in CPU model runner (#4256) 2024-04-22 00:54:16 -07:00
Isotr0py 296cdf8ac7
[Misc] Add vision language model support to CPU backend (#3968) 2024-04-22 00:44:16 -07:00
youkaichao 747b1a7147
[Core][Distributed] fix _is_full_nvlink detection (#4233) 2024-04-21 23:04:16 -07:00
Hongxia Yang 95e5b087cf
[AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI fixes and refactoring (#4129) 2024-04-21 21:57:24 -07:00
GeauxEric a37d815b83
Make initialization of tokenizer and detokenizer optional (#3748)
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-21 22:06:46 +00:00
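When tokenization happens outside the engine, skipping tokenizer construction avoids redundant loading. A hedged sketch, assuming the option introduced here is named `skip_tokenizer_init` and that prompts are then supplied as token IDs:

```python
from vllm import LLM, SamplingParams

# Assumption: the flag is `skip_tokenizer_init`. With no tokenizer,
# prompts are token IDs and outputs carry token IDs, not text.
llm = LLM(model="facebook/opt-125m", skip_tokenizer_init=True)
outs = llm.generate(prompt_token_ids=[[2, 100, 200]],
                    sampling_params=SamplingParams(max_tokens=8))
print(outs[0].outputs[0].token_ids)
```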
xiaoji 7f2593b164
[Doc]: Update the doc of adding new models (#4236) 2024-04-21 09:57:08 -07:00
Harry Mellor fe7d648fe5
Don't show default value for flags in `EngineArgs` (#4223)
Co-authored-by: Harry Mellor <hmellor@oxts.com>
2024-04-21 09:15:28 -07:00
Noam Gat cc74b2b232
Updating lm-format-enforcer version and adding links to decoding libraries in docs (#4222) 2024-04-20 08:33:16 +00:00