Commit Graph

1375 Commits

Author SHA1 Message Date
alexm-nm aae08249ac
[Bugfix] Fix marlin kernel crash on H100 (#4218)
This PR addresses the Marlin kernel H100 crash that was reported here: neuralmagic#187.
The crash was caused by the inline PTX assembly that performed async_copy with streaming behavior. The solution is to use the more standard PTX for async_copy (without the fractional L2 policy for "evict_first"). There is no performance difference between the standard async_copy PTX and the previous version.
2024-04-24 10:35:01 -07:00
Roger Wang 7923dcad12
[Misc] Update ShareGPT Dataset Sampling in Serving Benchmark (#4279) 2024-04-24 09:49:13 -07:00
youkaichao 3cd9b5bb2d
[Core][Distributed] use existing torch.cuda.device context manager (#4318)
2024-04-24 09:00:20 -07:00
Woosuk Kwon 468d761b32
[Misc] Reduce supported Punica dtypes (#4304) 2024-04-23 18:54:33 -07:00
youkaichao e4bf860a54
[CI][Build] change pynvml to nvidia-ml-py (#4302) 2024-04-23 18:33:12 -07:00
youkaichao 91f50a6fe2
[Core][Distributed] use cpu/gloo to initialize pynccl (#4248) 2024-04-23 18:32:19 -07:00
Robert Shaw 79a268c4ab
[BUG] fixed fp8 conflict with aqlm (#4307)
Fixes the fp8 interface, which broke in the AQLM merge.
2024-04-23 18:26:33 -07:00
Philipp Moritz eace8bf0b9
[Kernel] FP8 support for MoE kernel / Mixtral (#4244)
This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users do not need to compute activation scales on a calibration dataset and they also don't need to convert their model checkpoints. It is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

**Performance**: For this PR, the focus is on keeping the code clean (while still getting reasonable performance); there are a number of optimizations that significantly improve performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954), which we will submit as a follow-up PR. With this PR, the results are as follows:

Performance results (screenshot, 2024-04-21): https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03


**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
This compares favorably with the fp16 results, which are:
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```

Happy hacking!
2024-04-24 01:18:23 +00:00
Cyrus Leung 1e8f4252aa
[Bugfix][Frontend] Raise exception when file-like chat template fails to be opened (#4292) 2024-04-23 18:19:03 +00:00
James Fleming 2b7949c1c2
AQLM CUDA support (#3287)
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-23 13:59:33 -04:00
Simon Mo 62b5166bd4
[CI] Add ccache for wheel builds job (#4281) 2024-04-23 09:51:41 -07:00
youkaichao d86285a4a4
[Core][Logging] Add last frame information for better debugging (#4278) 2024-04-23 09:45:52 -07:00
DefTruth d87f39e9a9
[Bugfix] Add init_cached_hf_modules to RayWorkerWrapper (#4286) 2024-04-23 09:28:35 -07:00
Jack Gordley d3c8180ac4
[Bugfix] Fixing max token error message for openai compatible server (#4016) 2024-04-23 19:06:29 +08:00
Cade Daniel 62b8aebc6f
[Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951) 2024-04-23 08:02:36 +00:00
SangBin Cho 050f285ff6
[Core] Scheduling optimization 2 (#4280) 2024-04-23 08:02:11 +00:00
Nick Hill 8f2ea22bde
[Core] Some simplification of WorkerWrapper changes (#4183) 2024-04-23 07:49:08 +00:00
SangBin Cho 0ae11f78ab
[Mypy] Part 3 fix typing for nested directories for most of directory (#4161) 2024-04-22 21:32:44 -07:00
Harry Mellor 34128a697e
Fix `autodoc` directives (#4272)
Co-authored-by: Harry Mellor <hmellor@oxts.com>
2024-04-23 01:53:01 +00:00
youkaichao c1b4e4157c
[Core][Distributed] use absolute path for library file (#4271) 2024-04-22 17:21:48 -07:00
Zhanghao Wu ceaf4ed003
[Doc] Update the SkyPilot doc with serving and Llama-3 (#4276) 2024-04-22 15:34:31 -07:00
SangBin Cho ad8d696a99
[Core] Scheduler perf fix (#4270) 2024-04-22 21:11:06 +00:00
Harry Mellor 3d925165f2
Add example scripts to documentation (#4225)
Co-authored-by: Harry Mellor <hmellor@oxts.com>
2024-04-22 16:36:54 +00:00
alexm-nm 1543680691
[Bugfix] Ensure download_weights_from_hf(..) inside loader is using the revision parameter (#4217) 2024-04-22 09:10:48 -07:00
Tao He 077f0a2e8a
[Frontend] Enable support for CPU backend in AsyncLLMEngine. (#3993)
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-04-22 09:19:51 +00:00
Woosuk Kwon e73ed0f1c6
[Bugfix] Fix type annotations in CPU model runner (#4256) 2024-04-22 00:54:16 -07:00
Isotr0py 296cdf8ac7
[Misc] Add vision language model support to CPU backend (#3968) 2024-04-22 00:44:16 -07:00
youkaichao 747b1a7147
[Core][Distributed] fix _is_full_nvlink detection (#4233) 2024-04-21 23:04:16 -07:00
Hongxia Yang 95e5b087cf
[AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI fixes and refactoring (#4129) 2024-04-21 21:57:24 -07:00
GeauxEric a37d815b83
Make initialization of tokenizer and detokenizer optional (#3748)
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-21 22:06:46 +00:00
xiaoji 7f2593b164
[Doc]: Update the doc of adding new models (#4236) 2024-04-21 09:57:08 -07:00
Harry Mellor fe7d648fe5
Don't show default value for flags in `EngineArgs` (#4223)
Co-authored-by: Harry Mellor <hmellor@oxts.com>
2024-04-21 09:15:28 -07:00
Noam Gat cc74b2b232
Updating lm-format-enforcer version and adding links to decoding libraries in docs (#4222) 2024-04-20 08:33:16 +00:00
nunjunj 91528575ec
[Frontend] multiple sampling params support (#3570) 2024-04-20 00:11:57 -07:00
Cody Yu a22cdea371
[Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118)
Provides initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.
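For reference, here is a minimal sketch of doing the same through the Python API (the model name is just an example):

```python
from vllm import LLM

# Equivalent to launching the engine with --quantization fp8 (or -q fp8).
llm = LLM(model="mistralai/Mistral-7B-v0.1", quantization="fp8")
print(llm.generate(["The capital of France is"])[0].outputs[0].text)
```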

Algorithm:
We still load the model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of the weights and quantizes them accordingly. The scaling factor is then stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.
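The following is a minimal sketch of this scheme; `Fp8Linear` and `quantize_per_tensor` are illustrative names rather than the actual Fp8LinearMethod code, and it assumes a PyTorch build with `torch.float8_e4m3fn`:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_per_tensor(x: torch.Tensor):
    # One scale for the whole tensor: map the largest magnitude onto the FP8 range.
    scale = x.abs().max().clamp(min=1e-12) / FP8_MAX
    x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

class Fp8Linear(torch.nn.Module):
    """Weights are quantized once after loading; activations on every forward pass."""

    def __init__(self, weight: torch.Tensor):
        super().__init__()
        w_fp8, w_scale = quantize_per_tensor(weight)
        self.register_buffer("weight_fp8", w_fp8)
        self.register_buffer("weight_scale", w_scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dynamic per-tensor activation scale, recomputed on every forward pass.
        x_fp8, x_scale = quantize_per_tensor(x)
        # A real kernel would run the matmul natively in FP8; we dequantize here
        # only to keep the sketch simple.
        w = self.weight_fp8.to(x.dtype) * self.weight_scale
        a = x_fp8.to(x.dtype) * x_scale
        return a @ w.t()
```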

Initial Results:
Currently tested with Mistral-7B on 1xH100, with prompt length ~5 and decoding length 128:

BF16: 1.47s
FP8: 1.66s
I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.
2024-04-20 04:28:57 +00:00
Harry Mellor 682789d402
Fix missing docs and out of sync `EngineArgs` (#4219)
Co-authored-by: Harry Mellor <hmellor@oxts.com>
2024-04-19 20:51:33 -07:00
Ayush Rautwar 138485a82d
[Bugfix] Add fix for JSON whitespace (#4189)
Co-authored-by: Ubuntu <ubuntu@ip-172-31-13-147.ec2.internal>
2024-04-19 20:49:22 -07:00
Chirag Jain bc9df1571b
Pass `tokenizer_revision` when getting tokenizer in openai serving (#4214) 2024-04-19 17:13:56 -07:00
youkaichao 15b86408a8
[Misc] add nccl in collect env (#4211) 2024-04-19 19:44:51 +00:00
Ronen Schaffer 7be4f5628f
[Bugfix][Core] Restore logging of stats in the async engine (#4150) 2024-04-19 08:08:26 -07:00
Uranus 8f20fc04bf
[Misc] fix docstrings (#4191)
Co-authored-by: Zhong Wang <wangzhong@infini-ai.com>
2024-04-19 08:18:33 +00:00
Simon Mo 221d93ecbf
Bump version of 0.4.1 (#4177) 2024-04-19 01:00:22 -07:00
Jee Li d17c8477f1
[Bugfix] Fix LoRA loading check (#4138)
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-04-19 00:59:54 -07:00
Simon Mo a134ef6f5e
Support eos_token_id from generation_config.json (#4182) 2024-04-19 04:13:36 +00:00
youkaichao 8a7a3e4436
[Core] add an option to log every function call for debugging hang/crash in distributed inference (#4079)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-18 16:15:12 -07:00
Adam Tilghman 8f9c28fd40
[Bugfix] Fix CustomAllreduce pcie nvlink topology detection (#3974) (#4159)
2024-04-18 15:32:47 -07:00
Liangfu Chen cd2f63fb36
[CI/CD] add neuron docker and ci test scripts (#3571) 2024-04-18 15:26:01 -07:00
Nick Hill 87fa80c91f
[Misc] Bump transformers to latest version (#4176) 2024-04-18 14:36:39 -07:00
James Whedbee e1bb2fd52d
[Bugfix] Support logprobs when using guided_json and other constrained decoding fields (#4149) 2024-04-18 21:12:55 +00:00
Simon Mo 705578ae14
[Docs] document that Meta Llama 3 is supported (#4175) 2024-04-18 10:55:48 -07:00