mirror of https://github.com/vllm-project/vllm

[Doc] Convert docs to use colon fences (#12471)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

commit dd6a3a02cb (parent a7e3eba66f)
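The change applied throughout this diff is syntactic: MyST directive blocks previously written with backtick fences are rewritten as colon fences, which render the same way (colon fences are typically enabled through the MyST parser's `colon_fence` extension) while keeping the enclosed text readable as plain Markdown. A minimal before/after sketch of the pattern; the admonition text is illustrative, not taken from any file in this commit:

````markdown
<!-- before: backtick directive fence -->
```{note}
Example admonition body.
```

<!-- after: colon directive fence -->
:::{note}
Example admonition body.
:::
````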
@@ -1,10 +1,10 @@
sphinx==6.2.1
sphinx-argparse==0.4.0
sphinx-book-theme==1.0.1
sphinx-copybutton==0.5.2
myst-parser==3.0.1
sphinx-argparse==0.4.0
sphinx-design==0.6.1
sphinx-togglebutton==0.3.2
myst-parser==3.0.1
msgspec
cloudpickle

@@ -8,10 +8,10 @@
.. currentmodule:: vllm.engine
```

```{toctree}
:::{toctree}
:caption: Engines
:maxdepth: 2

llm_engine
async_llm_engine
```
:::

@@ -2,10 +2,10 @@

## Submodules

```{toctree}
:::{toctree}
:maxdepth: 1

interfaces_base
interfaces
adapters
```
:::

@@ -17,7 +17,7 @@ Looking to add your own multi-modal model? Please follow the instructions listed

## Submodules

```{toctree}
:::{toctree}
:maxdepth: 1

inputs

@@ -25,4 +25,4 @@ parse
processing
profiling
registry
```
:::

@@ -1,9 +1,9 @@
# Offline Inference

```{toctree}
:::{toctree}
:caption: Contents
:maxdepth: 1

llm
llm_inputs
```
:::

@@ -17,11 +17,11 @@ The edges of the build graph represent:

- `RUN --mount=(.\*)from=...` dependencies (with a dotted line and an empty diamond arrow head)

> ```{figure} /assets/contributing/dockerfile-stages-dependency.png
> :::{figure} /assets/contributing/dockerfile-stages-dependency.png
> :align: center
> :alt: query
> :width: 100%
> ```
> :::
>
> Made using: <https://github.com/patrickhoefler/dockerfilegraph>
>

@@ -10,9 +10,9 @@ First, clone the PyTorch model code from the source repository.
For instance, vLLM's [OPT model](gh-file:vllm/model_executor/models/opt.py) was adapted from
HuggingFace's [modeling_opt.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) file.

```{warning}
:::{warning}
Make sure to review and adhere to the original code's copyright and licensing terms!
```
:::

## 2. Make your code compatible with vLLM

@@ -80,10 +80,10 @@ def forward(
...
```

```{note}
:::{note}
Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
```
:::

For reference, check out our [Llama implementation](gh-file:vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out <gh-dir:vllm/model_executor/models> for more examples.

@@ -4,7 +4,7 @@

This section provides more information on how to integrate a [PyTorch](https://pytorch.org/) model into vLLM.

```{toctree}
:::{toctree}
:caption: Contents
:maxdepth: 1

@@ -12,16 +12,16 @@ basic
registration
tests
multimodal
```
:::

```{note}
:::{note}
The complexity of adding a new model depends heavily on the model's architecture.
The process is considerably straightforward if the model shares a similar architecture with an existing model in vLLM.
However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.
```
:::

```{tip}
:::{tip}
If you are encountering issues while integrating your model into vLLM, feel free to open a [GitHub issue](https://github.com/vllm-project/vllm/issues)
or ask on our [developer slack](https://slack.vllm.ai).
We will be happy to help you out!
```
:::

@@ -48,9 +48,9 @@ Further update the model as follows:
return vision_embeddings
```

```{important}
:::{important}
The returned `multimodal_embeddings` must be either a **3D {class}`torch.Tensor`** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D {class}`torch.Tensor`'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g, image) of the request.
```
:::

- Implement {meth}`~vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings` to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.

@@ -89,10 +89,10 @@ Further update the model as follows:
+ class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
```

```{note}
:::{note}
The model class does not have to be named {code}`*ForCausalLM`.
Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.
```
:::

## 2. Specify processing information

@@ -120,8 +120,8 @@ When calling the model, the output embeddings from the visual encoder are assign
containing placeholder feature tokens. Therefore, the number of placeholder feature tokens should be equal
to the size of the output embeddings.

::::{tab-set}
:::{tab-item} Basic example: LLaVA
:::::{tab-set}
::::{tab-item} Basic example: LLaVA
:sync: llava

Looking at the code of HF's `LlavaForConditionalGeneration`:

@@ -254,12 +254,12 @@ def get_mm_max_tokens_per_item(self, seq_len: int) -> Mapping[str, int]:
return {"image": self.get_max_image_tokens()}
```

```{note}
:::{note}
Our [actual code](gh-file:vllm/model_executor/models/llava.py) is more abstracted to support vision encoders other than CLIP.
```

:::

::::
:::::

## 3. Specify dummy inputs

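As the tab-set hunks above and below show, colon fences express nesting through fence length: each enclosing container needs more colons than anything it contains. That is why converting the inner `{note}` block to `:::` pushes the surrounding `{tab-item}` from `:::` to `::::` and the `{tab-set}` from `::::` to `:::::`. A small illustrative sketch of the nesting (content invented for the example):

````markdown
:::::{tab-set}
::::{tab-item} Example tab
:::{note}
The innermost block uses three colons; each enclosing container adds one.
:::
::::
:::::
````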
@@ -315,17 +315,17 @@ def get_dummy_processor_inputs(
Afterwards, create a subclass of {class}`~vllm.multimodal.processing.BaseMultiModalProcessor`
to fill in the missing details about HF processing.

```{seealso}
:::{seealso}
[Multi-Modal Data Processing](#mm-processing)
```
:::

### Multi-modal fields

Override {class}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config` to
return a schema of the tensors outputted by the HF processor that are related to the input multi-modal items.

::::{tab-set}
:::{tab-item} Basic example: LLaVA
:::::{tab-set}
::::{tab-item} Basic example: LLaVA
:sync: llava

Looking at the model's `forward` method:

@@ -367,13 +367,13 @@ def _get_mm_fields_config(
)
```

```{note}
:::{note}
Our [actual code](gh-file:vllm/model_executor/models/llava.py) additionally supports
pre-computed image embeddings, which can be passed to be model via the `image_embeds` argument.
```

:::

::::
:::::

### Prompt replacements

@@ -17,17 +17,17 @@ After you have implemented your model (see [tutorial](#new-model-basic)), put it
Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
Finally, update our [list of supported models](#supported-models) to promote your model!

```{important}
:::{important}
The list of models in each section should be maintained in alphabetical order.
```
:::

## Out-of-tree models

You can load an external model using a plugin without modifying the vLLM codebase.

```{seealso}
:::{seealso}
[vLLM's Plugin System](#plugin-system)
```
:::

To register the model, use the following code:

@@ -45,11 +45,11 @@ from vllm import ModelRegistry
ModelRegistry.register_model("YourModelForCausalLM", "your_code:YourModelForCausalLM")
```

```{important}
:::{important}
If your model is a multimodal model, ensure the model class implements the {class}`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.
Read more about that [here](#supports-multimodal).
```
:::

```{note}
:::{note}
Although you can directly put these code snippets in your script using `vllm.LLM`, the recommended way is to place these snippets in a vLLM plugin. This ensures compatibility with various vLLM features like distributed inference and the API server.
```
:::

@@ -14,14 +14,14 @@ Without them, the CI for your PR will fail.
Include an example HuggingFace repository for your model in <gh-file:tests/models/registry.py>.
This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM.

```{important}
:::{important}
The list of models in each section should be maintained in alphabetical order.
```
:::

```{tip}
:::{tip}
If your model requires a development version of HF Transformers, you can set
`min_transformers_version` to skip the test in CI until the model is released.
```
:::

## Optional Tests

@@ -35,17 +35,17 @@ pre-commit run --all-files
pytest tests/
```

```{note}
:::{note}
Currently, the repository is not fully checked by `mypy`.
```
:::

## Issues

If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.

```{important}
:::{important}
If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
```
:::

## Pull Requests & Code Reviews

@@ -81,9 +81,9 @@ appropriately to indicate the type of change. Please use one of the following:
- `[Misc]` for PRs that do not fit the above categories. Please use this
sparingly.

```{note}
:::{note}
If the PR spans more than one category, please include all relevant prefixes.
```
:::

### Code Quality

@@ -6,21 +6,21 @@ The OpenAI server also needs to be started with the `VLLM_TORCH_PROFILER_DIR` en

When using `benchmarks/benchmark_serving.py`, you can enable profiling by passing the `--profile` flag.

```{warning}
:::{warning}
Only enable profiling in a development environment.
```
:::

Traces can be visualized using <https://ui.perfetto.dev/>.

```{tip}
:::{tip}
Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, no need to untar the traces, they can be viewed directly.
```
:::

```{tip}
:::{tip}
To stop the profiler - it flushes out all the profile trace files to the directory. This takes time, for example for about 100 requests worth of data for a llama 70b, it takes about 10 minutes to flush out on a H100.
Set the env variable VLLM_RPC_TIMEOUT to a big number before you start the server. Say something like 30 minutes.
`export VLLM_RPC_TIMEOUT=1800000`
```
:::

## Example commands and usage

@@ -21,11 +21,11 @@ $ docker run --runtime nvidia --gpus all \

You can add any other <project:#engine-args> you need after the image tag (`vllm/vllm-openai:latest`).

```{note}
:::{note}
You can either use the `ipc=host` flag or `--shm-size` flag to allow the
container to access the host's shared memory. vLLM uses PyTorch, which uses shared
memory to share data between processes under the hood, particularly for tensor parallel inference.
```
:::

(deployment-docker-build-image-from-source)=

@@ -38,25 +38,25 @@ You can build and run vLLM from source via the provided <gh-file:Dockerfile>. To
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai
```

```{note}
:::{note}
By default vLLM will build for all GPU types for widest distribution. If you are just building for the
current GPU type the machine is running on, you can add the argument `--build-arg torch_cuda_arch_list=""`
for vLLM to find the current GPU type and build for that.

If you are using Podman instead of Docker, you might need to disable SELinux labeling by
adding `--security-opt label=disable` when running `podman build` command to avoid certain [existing issues](https://github.com/containers/buildah/discussions/4184).
```
:::

## Building for Arm64/aarch64

A docker container can be built for aarch64 systems such as the Nvidia Grace-Hopper. At time of this writing, this requires the use
of PyTorch Nightly and should be considered **experimental**. Using the flag `--platform "linux/arm64"` will attempt to build for arm64.

```{note}
:::{note}
Multiple modules must be compiled, so this process can take a while. Recommend using `--build-arg max_jobs=` & `--build-arg nvcc_threads=`
flags to speed up build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits.
Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).
```
:::

```console
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)

@@ -85,6 +85,6 @@ $ docker run --runtime nvidia --gpus all \

The argument `vllm/vllm-openai` specifies the image to run, and should be replaced with the name of the custom-built image (the `-t` tag from the build command).

```{note}
:::{note}
**For version 0.4.1 and 0.4.2 only** - the vLLM docker images under these versions are supposed to be run under the root user since a library under the root user's home directory, i.e. `/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1` is required to be loaded during runtime. If you are running the container under a different user, you may need to first change the permissions of the library (and all the parent directories) to allow the user to access it, then run vLLM with environment variable `VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1` .
```
:::

@@ -2,11 +2,11 @@

# Cerebrium

```{raw} html
:::{raw} html
<p align="center">
<img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>
</p>
```
:::

vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebrium.ai/), a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI based applications.

@@ -2,11 +2,11 @@

# dstack

```{raw} html
:::{raw} html
<p align="center">
<img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/>
</p>
```
:::

vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/), an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, gateway, and GPU quotas on your cloud environment.

@@ -97,6 +97,6 @@ completion = client.chat.completions.create(
print(completion.choices[0].message.content)
```

```{note}
:::{note}
dstack automatically handles authentication on the gateway using dstack's tokens. Meanwhile, if you don't want to configure a gateway, you can provision dstack `Task` instead of `Service`. The `Task` is for development purpose only. If you want to know more about hands-on materials how to serve vLLM using dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm)
```
:::

@ -38,213 +38,213 @@ chart **including persistent volumes** and deletes the release.
|
|||
|
||||
## Architecture
|
||||
|
||||
```{image} /assets/deployment/architecture_helm_deployment.png
|
||||
```
|
||||
:::{image} /assets/deployment/architecture_helm_deployment.png
|
||||
:::
|
||||
|
||||
## Values
|
||||
|
||||
```{list-table}
|
||||
:::{list-table}
|
||||
:widths: 25 25 25 25
|
||||
:header-rows: 1
|
||||
|
||||
* - Key
|
||||
- Type
|
||||
- Default
|
||||
- Description
|
||||
* - autoscaling
|
||||
- object
|
||||
- {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80}
|
||||
- Autoscaling configuration
|
||||
* - autoscaling.enabled
|
||||
- bool
|
||||
- false
|
||||
- Enable autoscaling
|
||||
* - autoscaling.maxReplicas
|
||||
- int
|
||||
- 100
|
||||
- Maximum replicas
|
||||
* - autoscaling.minReplicas
|
||||
- int
|
||||
- 1
|
||||
- Minimum replicas
|
||||
* - autoscaling.targetCPUUtilizationPercentage
|
||||
- int
|
||||
- 80
|
||||
- Target CPU utilization for autoscaling
|
||||
* - configs
|
||||
- object
|
||||
- {}
|
||||
- Configmap
|
||||
* - containerPort
|
||||
- int
|
||||
- 8000
|
||||
- Container port
|
||||
* - customObjects
|
||||
- list
|
||||
- []
|
||||
- Custom Objects configuration
|
||||
* - deploymentStrategy
|
||||
- object
|
||||
- {}
|
||||
- Deployment strategy configuration
|
||||
* - externalConfigs
|
||||
- list
|
||||
- []
|
||||
- External configuration
|
||||
* - extraContainers
|
||||
- list
|
||||
- []
|
||||
- Additional containers configuration
|
||||
* - extraInit
|
||||
- object
|
||||
- {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true}
|
||||
- Additional configuration for the init container
|
||||
* - extraInit.pvcStorage
|
||||
- string
|
||||
- "50Gi"
|
||||
- Storage size of the s3
|
||||
* - extraInit.s3modelpath
|
||||
- string
|
||||
- "relative_s3_model_path/opt-125m"
|
||||
- Path of the model on the s3 which hosts model weights and config files
|
||||
* - extraInit.awsEc2MetadataDisabled
|
||||
- boolean
|
||||
- true
|
||||
- Disables the use of the Amazon EC2 instance metadata service
|
||||
* - extraPorts
|
||||
- list
|
||||
- []
|
||||
- Additional ports configuration
|
||||
* - gpuModels
|
||||
- list
|
||||
- ["TYPE_GPU_USED"]
|
||||
- Type of gpu used
|
||||
* - image
|
||||
- object
|
||||
- {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"}
|
||||
- Image configuration
|
||||
* - image.command
|
||||
- list
|
||||
- ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"]
|
||||
- Container launch command
|
||||
* - image.repository
|
||||
- string
|
||||
- "vllm/vllm-openai"
|
||||
- Image repository
|
||||
* - image.tag
|
||||
- string
|
||||
- "latest"
|
||||
- Image tag
|
||||
* - livenessProbe
|
||||
- object
|
||||
- {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10}
|
||||
- Liveness probe configuration
|
||||
* - livenessProbe.failureThreshold
|
||||
- int
|
||||
- 3
|
||||
- Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not alive
|
||||
* - livenessProbe.httpGet
|
||||
- object
|
||||
- {"path":"/health","port":8000}
|
||||
- Configuration of the Kubelet http request on the server
|
||||
* - livenessProbe.httpGet.path
|
||||
- string
|
||||
- "/health"
|
||||
- Path to access on the HTTP server
|
||||
* - livenessProbe.httpGet.port
|
||||
- int
|
||||
- 8000
|
||||
- Name or number of the port to access on the container, on which the server is listening
|
||||
* - livenessProbe.initialDelaySeconds
|
||||
- int
|
||||
- 15
|
||||
- Number of seconds after the container has started before liveness probe is initiated
|
||||
* - livenessProbe.periodSeconds
|
||||
- int
|
||||
- 10
|
||||
- How often (in seconds) to perform the liveness probe
|
||||
* - maxUnavailablePodDisruptionBudget
|
||||
- string
|
||||
- ""
|
||||
- Disruption Budget Configuration
|
||||
* - readinessProbe
|
||||
- object
|
||||
- {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5}
|
||||
- Readiness probe configuration
|
||||
* - readinessProbe.failureThreshold
|
||||
- int
|
||||
- 3
|
||||
- Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready
|
||||
* - readinessProbe.httpGet
|
||||
- object
|
||||
- {"path":"/health","port":8000}
|
||||
- Configuration of the Kubelet http request on the server
|
||||
* - readinessProbe.httpGet.path
|
||||
- string
|
||||
- "/health"
|
||||
- Path to access on the HTTP server
|
||||
* - readinessProbe.httpGet.port
|
||||
- int
|
||||
- 8000
|
||||
- Name or number of the port to access on the container, on which the server is listening
|
||||
* - readinessProbe.initialDelaySeconds
|
||||
- int
|
||||
- 5
|
||||
- Number of seconds after the container has started before readiness probe is initiated
|
||||
* - readinessProbe.periodSeconds
|
||||
- int
|
||||
- 5
|
||||
- How often (in seconds) to perform the readiness probe
|
||||
* - replicaCount
|
||||
- int
|
||||
- 1
|
||||
- Number of replicas
|
||||
* - resources
|
||||
- object
|
||||
- {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}}
|
||||
- Resource configuration
|
||||
* - resources.limits."nvidia.com/gpu"
|
||||
- int
|
||||
- 1
|
||||
- Number of gpus used
|
||||
* - resources.limits.cpu
|
||||
- int
|
||||
- 4
|
||||
- Number of CPUs
|
||||
* - resources.limits.memory
|
||||
- string
|
||||
- "16Gi"
|
||||
- CPU memory configuration
|
||||
* - resources.requests."nvidia.com/gpu"
|
||||
- int
|
||||
- 1
|
||||
- Number of gpus used
|
||||
* - resources.requests.cpu
|
||||
- int
|
||||
- 4
|
||||
- Number of CPUs
|
||||
* - resources.requests.memory
|
||||
- string
|
||||
- "16Gi"
|
||||
- CPU memory configuration
|
||||
* - secrets
|
||||
- object
|
||||
- {}
|
||||
- Secrets configuration
|
||||
* - serviceName
|
||||
- string
|
||||
-
|
||||
- Service name
|
||||
* - servicePort
|
||||
- int
|
||||
- 80
|
||||
- Service port
|
||||
* - labels.environment
|
||||
- string
|
||||
- test
|
||||
- Environment name
|
||||
* - labels.release
|
||||
- string
|
||||
- test
|
||||
- Release name
|
||||
```
|
||||
- * Key
|
||||
* Type
|
||||
* Default
|
||||
* Description
|
||||
- * autoscaling
|
||||
* object
|
||||
* {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80}
|
||||
* Autoscaling configuration
|
||||
- * autoscaling.enabled
|
||||
* bool
|
||||
* false
|
||||
* Enable autoscaling
|
||||
- * autoscaling.maxReplicas
|
||||
* int
|
||||
* 100
|
||||
* Maximum replicas
|
||||
- * autoscaling.minReplicas
|
||||
* int
|
||||
* 1
|
||||
* Minimum replicas
|
||||
- * autoscaling.targetCPUUtilizationPercentage
|
||||
* int
|
||||
* 80
|
||||
* Target CPU utilization for autoscaling
|
||||
- * configs
|
||||
* object
|
||||
* {}
|
||||
* Configmap
|
||||
- * containerPort
|
||||
* int
|
||||
* 8000
|
||||
* Container port
|
||||
- * customObjects
|
||||
* list
|
||||
* []
|
||||
* Custom Objects configuration
|
||||
- * deploymentStrategy
|
||||
* object
|
||||
* {}
|
||||
* Deployment strategy configuration
|
||||
- * externalConfigs
|
||||
* list
|
||||
* []
|
||||
* External configuration
|
||||
- * extraContainers
|
||||
* list
|
||||
* []
|
||||
* Additional containers configuration
|
||||
- * extraInit
|
||||
* object
|
||||
* {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true}
|
||||
* Additional configuration for the init container
|
||||
- * extraInit.pvcStorage
|
||||
* string
|
||||
* "50Gi"
|
||||
* Storage size of the s3
|
||||
- * extraInit.s3modelpath
|
||||
* string
|
||||
* "relative_s3_model_path/opt-125m"
|
||||
* Path of the model on the s3 which hosts model weights and config files
|
||||
- * extraInit.awsEc2MetadataDisabled
|
||||
* boolean
|
||||
* true
|
||||
* Disables the use of the Amazon EC2 instance metadata service
|
||||
- * extraPorts
|
||||
* list
|
||||
* []
|
||||
* Additional ports configuration
|
||||
- * gpuModels
|
||||
* list
|
||||
* ["TYPE_GPU_USED"]
|
||||
* Type of gpu used
|
||||
- * image
|
||||
* object
|
||||
* {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"}
|
||||
* Image configuration
|
||||
- * image.command
|
||||
* list
|
||||
* ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"]
|
||||
* Container launch command
|
||||
- * image.repository
|
||||
* string
|
||||
* "vllm/vllm-openai"
|
||||
* Image repository
|
||||
- * image.tag
|
||||
* string
|
||||
* "latest"
|
||||
* Image tag
|
||||
- * livenessProbe
|
||||
* object
|
||||
* {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10}
|
||||
* Liveness probe configuration
|
||||
- * livenessProbe.failureThreshold
|
||||
* int
|
||||
* 3
|
||||
* Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not alive
|
||||
- * livenessProbe.httpGet
|
||||
* object
|
||||
* {"path":"/health","port":8000}
|
||||
* Configuration of the Kubelet http request on the server
|
||||
- * livenessProbe.httpGet.path
|
||||
* string
|
||||
* "/health"
|
||||
* Path to access on the HTTP server
|
||||
- * livenessProbe.httpGet.port
|
||||
* int
|
||||
* 8000
|
||||
* Name or number of the port to access on the container, on which the server is listening
|
||||
- * livenessProbe.initialDelaySeconds
|
||||
* int
|
||||
* 15
|
||||
* Number of seconds after the container has started before liveness probe is initiated
|
||||
- * livenessProbe.periodSeconds
|
||||
* int
|
||||
* 10
|
||||
* How often (in seconds) to perform the liveness probe
|
||||
- * maxUnavailablePodDisruptionBudget
|
||||
* string
|
||||
* ""
|
||||
* Disruption Budget Configuration
|
||||
- * readinessProbe
|
||||
* object
|
||||
* {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5}
|
||||
* Readiness probe configuration
|
||||
- * readinessProbe.failureThreshold
|
||||
* int
|
||||
* 3
|
||||
* Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready
|
||||
- * readinessProbe.httpGet
|
||||
* object
|
||||
* {"path":"/health","port":8000}
|
||||
* Configuration of the Kubelet http request on the server
|
||||
- * readinessProbe.httpGet.path
|
||||
* string
|
||||
* "/health"
|
||||
* Path to access on the HTTP server
|
||||
- * readinessProbe.httpGet.port
|
||||
* int
|
||||
* 8000
|
||||
* Name or number of the port to access on the container, on which the server is listening
|
||||
- * readinessProbe.initialDelaySeconds
|
||||
* int
|
||||
* 5
|
||||
* Number of seconds after the container has started before readiness probe is initiated
|
||||
- * readinessProbe.periodSeconds
|
||||
* int
|
||||
* 5
|
||||
* How often (in seconds) to perform the readiness probe
|
||||
- * replicaCount
|
||||
* int
|
||||
* 1
|
||||
* Number of replicas
|
||||
- * resources
|
||||
* object
|
||||
* {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}}
|
||||
* Resource configuration
|
||||
- * resources.limits."nvidia.com/gpu"
|
||||
* int
|
||||
* 1
|
||||
* Number of gpus used
|
||||
- * resources.limits.cpu
|
||||
* int
|
||||
* 4
|
||||
* Number of CPUs
|
||||
- * resources.limits.memory
|
||||
* string
|
||||
* "16Gi"
|
||||
* CPU memory configuration
|
||||
- * resources.requests."nvidia.com/gpu"
|
||||
* int
|
||||
* 1
|
||||
* Number of gpus used
|
||||
- * resources.requests.cpu
|
||||
* int
|
||||
* 4
|
||||
* Number of CPUs
|
||||
- * resources.requests.memory
|
||||
* string
|
||||
* "16Gi"
|
||||
* CPU memory configuration
|
||||
- * secrets
|
||||
* object
|
||||
* {}
|
||||
* Secrets configuration
|
||||
- * serviceName
|
||||
* string
|
||||
*
|
||||
* Service name
|
||||
- * servicePort
|
||||
* int
|
||||
* 80
|
||||
* Service port
|
||||
- * labels.environment
|
||||
* string
|
||||
* test
|
||||
* Environment name
|
||||
- * labels.release
|
||||
* string
|
||||
* test
|
||||
* Release name
|
||||
:::
|
||||
|
|
|
@@ -1,6 +1,6 @@
# Using other frameworks

```{toctree}
:::{toctree}
:maxdepth: 1

bentoml

@@ -11,4 +11,4 @@ lws
modal
skypilot
triton
```
:::

@@ -2,11 +2,11 @@

# SkyPilot

```{raw} html
:::{raw} html
<p align="center">
<img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
</p>
```
:::

vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with [SkyPilot](https://github.com/skypilot-org/skypilot), an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc, can be found in [SkyPilot AI gallery](https://skypilot.readthedocs.io/en/latest/gallery/index.html).

@@ -104,10 +104,10 @@ service:
max_completion_tokens: 1
```

```{raw} html
:::{raw} html
<details>
<summary>Click to see the full recipe YAML</summary>
```
:::

```yaml
service:

@@ -153,9 +153,9 @@ run: |
2>&1 | tee api_server.log
```

```{raw} html
:::{raw} html
</details>
```
:::

Start the serving the Llama-3 8B model on multiple replicas:

@@ -169,10 +169,10 @@ Wait until the service is ready:
watch -n10 sky serve status vllm
```

```{raw} html
:::{raw} html
<details>
<summary>Example outputs:</summary>
```
:::

```console
Services

@@ -185,9 +185,9 @@ vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'L4': 1}) R
vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'L4': 1}) READY us-east4
```

```{raw} html
:::{raw} html
</details>
```
:::

After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:

@@ -223,10 +223,10 @@ service:

This will scale the service up to when the QPS exceeds 2 for each replica.

```{raw} html
:::{raw} html
<details>
<summary>Click to see the full recipe YAML</summary>
```
:::

```yaml
service:

@@ -275,9 +275,9 @@ run: |
2>&1 | tee api_server.log
```

```{raw} html
:::{raw} html
</details>
```
:::

To update the service with the new config:

@@ -295,10 +295,10 @@ sky serve down vllm

It is also possible to access the Llama-3 service with a separate GUI frontend, so the user requests send to the GUI will be load-balanced across replicas.

```{raw} html
:::{raw} html
<details>
<summary>Click to see the full GUI YAML</summary>
```
:::

```yaml
envs:

@@ -328,9 +328,9 @@ run: |
--stop-token-ids 128009,128001 | tee ~/gradio.log
```

```{raw} html
:::{raw} html
</details>
```
:::

1. Start the chat web UI:

@@ -1,9 +1,9 @@
# External Integrations

```{toctree}
:::{toctree}
:maxdepth: 1

kserve
kubeai
llamastack
```
:::

@@ -105,9 +105,9 @@ docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-si
docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
```

```{note}
:::{note}
If you are behind proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
```
:::

(nginxloadbalancer-nginx-launch-nginx)=

@@ -4,19 +4,19 @@

This document provides an overview of the vLLM architecture.

```{contents} Table of Contents
:::{contents} Table of Contents
:depth: 2
:local: true
```
:::

## Entrypoints

vLLM provides a number of entrypoints for interacting with the system. The
following diagram shows the relationship between them.

```{image} /assets/design/arch_overview/entrypoints.excalidraw.png
:::{image} /assets/design/arch_overview/entrypoints.excalidraw.png
:alt: Entrypoints Diagram
```
:::

### LLM Class

@@ -84,9 +84,9 @@ More details on the API server can be found in the [OpenAI-Compatible Server](#o
The `LLMEngine` and `AsyncLLMEngine` classes are central to the functioning of
the vLLM system, handling model inference and asynchronous request processing.

```{image} /assets/design/arch_overview/llm_engine.excalidraw.png
:::{image} /assets/design/arch_overview/llm_engine.excalidraw.png
:alt: LLMEngine Diagram
```
:::

### LLMEngine

@@ -144,11 +144,11 @@ configurations affect the class we ultimately get.

The following figure shows the class hierarchy of vLLM:

> ```{figure} /assets/design/hierarchy.png
> :::{figure} /assets/design/hierarchy.png
> :align: center
> :alt: query
> :width: 100%
> ```
> :::

There are several important design choices behind this class hierarchy:

@@ -178,7 +178,7 @@ of a vision model and a language model. By making the constructor uniform, we
can easily create a vision model and a language model and compose them into a
vision-language model.

````{note}
:::{note}
To support this change, all vLLM models' signatures have been updated to:

```python

@@ -215,7 +215,7 @@ else:
```

This way, the model can work with both old and new versions of vLLM.
````
:::

3\. **Sharding and Quantization at Initialization**: Certain features require
changing the model weights. For example, tensor parallelism needs to shard the

@@ -139,26 +139,26 @@
const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
```

```{figure} ../../assets/kernel/query.png
:::{figure} ../../assets/kernel/query.png
:align: center
:alt: query
:width: 70%

Query data of one token at one head
```
:::

- Each thread defines its own `q_ptr` which points to the assigned
query token data on global memory. For example, if `VEC_SIZE` is 4
and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
total of 128 elements divided into 128 / 4 = 32 vecs.

```{figure} ../../assets/kernel/q_vecs.png
:::{figure} ../../assets/kernel/q_vecs.png
:align: center
:alt: q_vecs
:width: 70%

`q_vecs` for one thread group
```
:::

```cpp
__shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];

@@ -195,13 +195,13 @@
points to key token data based on `k_cache` at assigned block,
assigned head and assigned token.

```{figure} ../../assets/kernel/key.png
:::{figure} ../../assets/kernel/key.png
:align: center
:alt: key
:width: 70%

Key data of all context tokens at one head
```
:::

- The diagram above illustrates the memory layout for key data. It
assumes that the `BLOCK_SIZE` is 16, `HEAD_SIZE` is 128, `x` is

@@ -214,13 +214,13 @@
elements for one token) that will be processed by 2 threads (one
thread group) separately.

```{figure} ../../assets/kernel/k_vecs.png
:::{figure} ../../assets/kernel/k_vecs.png
:align: center
:alt: k_vecs
:width: 70%

`k_vecs` for one thread
```
:::

```cpp
K_vec k_vecs[NUM_VECS_PER_THREAD]

@@ -289,14 +289,14 @@
should be performed across the entire thread block, encompassing
results between the query token and all context key tokens.

```{math}
:::{math}
:nowrap: true

\begin{gather*}
m(x):=\max _i \quad x_i \\ \quad f(x):=\left[\begin{array}{lll}e^{x_1-m(x)} & \ldots & e^{x_B-m(x)}\end{array}\right]\\ \quad \ell(x):=\sum_i f(x)_i \\
\quad \operatorname{softmax}(x):=\frac{f(x)}{\ell(x)}
\end{gather*}
```
:::

### `qk_max` and `logits`

@@ -379,29 +379,29 @@

## Value

```{figure} ../../assets/kernel/value.png
:::{figure} ../../assets/kernel/value.png
:align: center
:alt: value
:width: 70%

Value data of all context tokens at one head
```
:::

```{figure} ../../assets/kernel/logits_vec.png
:::{figure} ../../assets/kernel/logits_vec.png
:align: center
:alt: logits_vec
:width: 50%

`logits_vec` for one thread
```
:::

```{figure} ../../assets/kernel/v_vec.png
:::{figure} ../../assets/kernel/v_vec.png
:align: center
:alt: v_vec
:width: 70%

List of `v_vec` for one thread
```
:::

- Now we need to retrieve the value data and perform dot multiplication
with `logits`. Unlike query and key, there is no thread group

@@ -7,9 +7,9 @@ page for information on known issues and how to solve them.

## Introduction

```{important}
:::{important}
The source code references are to the state of the code at the time of writing in December, 2024.
```
:::

The use of Python multiprocessing in vLLM is complicated by:

@@ -6,9 +6,9 @@

Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.

```{note}
:::{note}
Technical details on how vLLM implements APC can be found [here](#design-automatic-prefix-caching).
```
:::

## Enabling APC in vLLM

@@ -4,13 +4,13 @@

The tables below show mutually exclusive features and the support on some hardware.

```{note}
:::{note}
Check the '✗' with links to see tracking issue for unsupported feature/hardware combination.
```
:::

## Feature x Feature

```{raw} html
:::{raw} html
<style>
/* Make smaller to try to improve readability */
td {

@ -23,448 +23,447 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
|
|||
font-size: 0.8rem;
|
||||
}
|
||||
</style>
|
||||
```
|
||||
:::
|
||||
|
||||
```{list-table}
|
||||
:header-rows: 1
|
||||
:stub-columns: 1
|
||||
:widths: auto
|
||||
:::{list-table}
|
||||
:header-rows: 1
|
||||
:stub-columns: 1
|
||||
:widths: auto
|
||||
|
||||
* - Feature
|
||||
- [CP](#chunked-prefill)
|
||||
- [APC](#automatic-prefix-caching)
|
||||
- [LoRA](#lora-adapter)
|
||||
- <abbr title="Prompt Adapter">prmpt adptr</abbr>
|
||||
- [SD](#spec_decode)
|
||||
- CUDA graph
|
||||
- <abbr title="Pooling Models">pooling</abbr>
|
||||
- <abbr title="Encoder-Decoder Models">enc-dec</abbr>
|
||||
- <abbr title="Logprobs">logP</abbr>
|
||||
- <abbr title="Prompt Logprobs">prmpt logP</abbr>
|
||||
- <abbr title="Async Output Processing">async output</abbr>
|
||||
- multi-step
|
||||
- <abbr title="Multimodal Inputs">mm</abbr>
|
||||
- best-of
|
||||
- beam-search
|
||||
- <abbr title="Guided Decoding">guided dec</abbr>
|
||||
* - [CP](#chunked-prefill)
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - [APC](#automatic-prefix-caching)
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - [LoRA](#lora-adapter)
|
||||
- [✗](gh-pr:9057)
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - <abbr title="Prompt Adapter">prmpt adptr</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - [SD](#spec_decode)
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - CUDA graph
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - <abbr title="Pooling Models">pooling</abbr>
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
|
||||
- ✗
|
||||
- [✗](gh-issue:7366)
|
||||
- ✗
|
||||
- ✗
|
||||
- [✗](gh-issue:7366)
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - <abbr title="Logprobs">logP</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - <abbr title="Prompt Logprobs">prmpt logP</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- [✗](gh-pr:8199)
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - <abbr title="Async Output Processing">async output</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✗
|
||||
- ✗
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - multi-step
|
||||
- ✗
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✗
|
||||
- ✗
|
||||
- ✅
|
||||
- [✗](gh-issue:8198)
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - <abbr title="Multimodal Inputs">mm</abbr>
|
||||
- ✅
|
||||
- [✗](gh-pr:8348)
|
||||
- [✗](gh-pr:7199)
|
||||
- ?
|
||||
- ?
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
-
|
||||
-
|
||||
-
|
||||
-
|
||||
* - best-of
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- [✗](gh-issue:6137)
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
- [✗](gh-issue:7968)
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
-
|
||||
* - beam-search
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- [✗](gh-issue:6137)
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
- [✗](gh-issue:7968>)
|
||||
- ?
|
||||
- ✅
|
||||
-
|
||||
-
|
||||
* - <abbr title="Guided Decoding">guided dec</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
- ?
|
||||
- [✗](gh-issue:11484)
|
||||
- ✅
|
||||
- ✗
|
||||
- ?
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- [✗](gh-issue:9893)
|
||||
- ?
|
||||
- ✅
|
||||
- ✅
|
||||
-
|
||||
|
||||
```
|
||||
- * Feature
|
||||
* [CP](#chunked-prefill)
|
||||
* [APC](#automatic-prefix-caching)
|
||||
* [LoRA](#lora-adapter)
|
||||
* <abbr title="Prompt Adapter">prmpt adptr</abbr>
|
||||
* [SD](#spec_decode)
|
||||
* CUDA graph
|
||||
* <abbr title="Pooling Models">pooling</abbr>
|
||||
* <abbr title="Encoder-Decoder Models">enc-dec</abbr>
|
||||
* <abbr title="Logprobs">logP</abbr>
|
||||
* <abbr title="Prompt Logprobs">prmpt logP</abbr>
|
||||
* <abbr title="Async Output Processing">async output</abbr>
|
||||
* multi-step
|
||||
* <abbr title="Multimodal Inputs">mm</abbr>
|
||||
* best-of
|
||||
* beam-search
|
||||
* <abbr title="Guided Decoding">guided dec</abbr>
|
||||
- * [CP](#chunked-prefill)
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
- * [APC](#automatic-prefix-caching)
|
||||
* ✅
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
- * [LoRA](#lora-adapter)
|
||||
* [✗](gh-pr:9057)
|
||||
* ✅
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
- * <abbr title="Prompt Adapter">prmpt adptr</abbr>
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
- * [SD](#spec_decode)
|
||||
* ✅
|
||||
* ✅
|
||||
* ✗
|
||||
* ✅
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
- * CUDA graph
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
- * <abbr title="Pooling Models">pooling</abbr>
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
- * <abbr title="Encoder-Decoder Models">enc-dec</abbr>
|
||||
* ✗
|
||||
* [✗](gh-issue:7366)
|
||||
* ✗
|
||||
* ✗
|
||||
* [✗](gh-issue:7366)
|
||||
* ✅
|
||||
* ✅
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
- * <abbr title="Logprobs">logP</abbr>
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✗
|
||||
* ✅
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
- * <abbr title="Prompt Logprobs">prmpt logP</abbr>
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* [✗](gh-pr:8199)
|
||||
* ✅
|
||||
* ✗
|
||||
* ✅
|
||||
* ✅
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
- * <abbr title="Async Output Processing">async output</abbr>
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✗
|
||||
* ✅
|
||||
* ✗
|
||||
* ✗
|
||||
* ✅
|
||||
* ✅
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
- * multi-step
|
||||
* ✗
|
||||
* ✅
|
||||
* ✗
|
||||
* ✅
|
||||
* ✗
|
||||
* ✅
|
||||
* ✗
|
||||
* ✗
|
||||
* ✅
|
||||
* [✗](gh-issue:8198)
|
||||
* ✅
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
- * <abbr title="Multimodal Inputs">mm</abbr>
|
||||
* ✅
|
||||
* [✗](gh-pr:8348)
|
||||
* [✗](gh-pr:7199)
|
||||
* ?
|
||||
* ?
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ?
|
||||
*
|
||||
*
|
||||
*
|
||||
*
|
||||
- * best-of
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* [✗](gh-issue:6137)
|
||||
* ✅
|
||||
* ✗
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ?
|
||||
* [✗](gh-issue:7968)
|
||||
* ✅
|
||||
*
|
||||
*
|
||||
*
|
||||
- * beam-search
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* [✗](gh-issue:6137)
|
||||
* ✅
|
||||
* ✗
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ?
|
||||
* [✗](gh-issue:7968>)
|
||||
* ?
|
||||
* ✅
|
||||
*
|
||||
*
|
||||
- * <abbr title="Guided Decoding">guided dec</abbr>
|
||||
* ✅
|
||||
* ✅
|
||||
* ?
|
||||
* ?
|
||||
* [✗](gh-issue:11484)
|
||||
* ✅
|
||||
* ✗
|
||||
* ?
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* [✗](gh-issue:9893)
|
||||
* ?
|
||||
* ✅
|
||||
* ✅
|
||||
*
|
||||
:::
|
||||
|
||||
(feature-x-hardware)=
|
||||
|
||||
## Feature x Hardware
|
||||
|
||||
```{list-table}
|
||||
:header-rows: 1
|
||||
:stub-columns: 1
|
||||
:widths: auto
|
||||
:::{list-table}
|
||||
:header-rows: 1
|
||||
:stub-columns: 1
|
||||
:widths: auto
|
||||
|
||||
* - Feature
|
||||
- Volta
|
||||
- Turing
|
||||
- Ampere
|
||||
- Ada
|
||||
- Hopper
|
||||
- CPU
|
||||
- AMD
|
||||
* - [CP](#chunked-prefill)
|
||||
- [✗](gh-issue:2729)
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - [APC](#automatic-prefix-caching)
|
||||
- [✗](gh-issue:3687)
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - [LoRA](#lora-adapter)
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - <abbr title="Prompt Adapter">prmpt adptr</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- [✗](gh-issue:8475)
|
||||
- ✅
|
||||
* - [SD](#spec_decode)
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - CUDA graph
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✅
|
||||
* - <abbr title="Pooling Models">pooling</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ?
|
||||
* - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
* - <abbr title="Multimodal Inputs">mm</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - <abbr title="Logprobs">logP</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - <abbr title="Prompt Logprobs">prmpt logP</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - <abbr title="Async Output Processing">async output</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✗
|
||||
- ✗
|
||||
* - multi-step
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- [✗](gh-issue:8477)
|
||||
- ✅
|
||||
* - best-of
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - beam-search
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
* - <abbr title="Guided Decoding">guided dec</abbr>
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
- ✅
|
||||
```
|
||||
- * Feature
|
||||
* Volta
|
||||
* Turing
|
||||
* Ampere
|
||||
* Ada
|
||||
* Hopper
|
||||
* CPU
|
||||
* AMD
|
||||
- * [CP](#chunked-prefill)
|
||||
* [✗](gh-issue:2729)
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
- * [APC](#automatic-prefix-caching)
|
||||
* [✗](gh-issue:3687)
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
- * [LoRA](#lora-adapter)
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
- * <abbr title="Prompt Adapter">prmpt adptr</abbr>
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* [✗](gh-issue:8475)
|
||||
* ✅
|
||||
- * [SD](#spec_decode)
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
- * CUDA graph
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✗
|
||||
* ✅
|
||||
- * <abbr title="Pooling Models">pooling</abbr>
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ?
|
||||
- * <abbr title="Encoder-Decoder Models">enc-dec</abbr>
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✗
|
||||
- * <abbr title="Multimodal Inputs">mm</abbr>
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
- * <abbr title="Logprobs">logP</abbr>
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
- * <abbr title="Prompt Logprobs">prmpt logP</abbr>
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
- * <abbr title="Async Output Processing">async output</abbr>
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✗
|
||||
* ✗
|
||||
- * multi-step
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* [✗](gh-issue:8477)
|
||||
* ✅
|
||||
- * best-of
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
- * beam-search
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
- * <abbr title="Guided Decoding">guided dec</abbr>
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
* ✅
|
||||
:::
|
||||
|
|
|
@@ -4,9 +4,9 @@

This page introduces you the disaggregated prefilling feature in vLLM.

```{note}
:::{note}
This feature is experimental and subject to change.
```
:::

## Why disaggregated prefilling?

@@ -15,9 +15,9 @@ Two main reasons:
- **Tuning time-to-first-token (TTFT) and inter-token-latency (ITL) separately**. Disaggregated prefilling put prefill and decode phase of LLM inference inside different vLLM instances. This gives you the flexibility to assign different parallel strategies (e.g. `tp` and `pp`) to tune TTFT without affecting ITL, or to tune ITL without affecting TTFT.
- **Controlling tail ITL**. Without disaggregated prefilling, vLLM may insert some prefill jobs during the decoding of one request. This results in higher tail latency. Disaggregated prefilling helps you solve this issue and control tail ITL. Chunked prefill with a proper chunk size also can achieve the same goal, but in practice it's hard to figure out the correct chunk size value. So disaggregated prefilling is a much more reliable way to control tail ITL.

```{note}
:::{note}
Disaggregated prefill DOES NOT improve throughput.
```
:::

## Usage example

@ -39,21 +39,21 @@ Key abstractions for disaggregated prefilling:
|
|||
- **LookupBuffer**: LookupBuffer provides two API: `insert` KV cache and `drop_select` KV cache. The semantics of `insert` and `drop_select` are similar to SQL, where `insert` inserts a KV cache into the buffer, and `drop_select` returns the KV cache that matches the given condition and drop it from the buffer.
|
||||
- **Pipe**: A single-direction FIFO pipe for tensor transmission. It supports `send_tensor` and `recv_tensor`.
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
`insert` is non-blocking operation but `drop_select` is blocking operation.
|
||||
```
|
||||
:::
|
||||
|
||||
Here is a figure illustrating how the above 3 abstractions are organized:
|
||||
|
||||
```{image} /assets/features/disagg_prefill/abstraction.jpg
|
||||
:::{image} /assets/features/disagg_prefill/abstraction.jpg
|
||||
:alt: Disaggregated prefilling abstractions
|
||||
```
|
||||
:::
|
||||
|
||||
The workflow of disaggregated prefilling is as follows:
|
||||
|
||||
```{image} /assets/features/disagg_prefill/overview.jpg
|
||||
:::{image} /assets/features/disagg_prefill/overview.jpg
|
||||
:alt: Disaggregated prefilling workflow
|
||||
```
|
||||
:::
|
||||
|
||||
The `buffer` corresponds to the `insert` API in LookupBuffer, and the `drop_select` corresponds to the `drop_select` API in LookupBuffer.
|
||||
|
||||
|
|
|
@ -60,9 +60,9 @@ vllm serve meta-llama/Llama-2-7b-hf \
|
|||
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
|
||||
```
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
|
||||
```
|
||||
:::
|
||||
|
||||
The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`,
|
||||
etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along
|
||||
|
|
|
@ -2,11 +2,11 @@
|
|||
|
||||
# AutoAWQ
|
||||
|
||||
```{warning}
|
||||
:::{warning}
|
||||
Please note that AWQ support in vLLM is under-optimized at the moment. We would recommend using the unquantized version of the model for better
|
||||
accuracy and higher throughput. Currently, you can use AWQ as a way to reduce memory footprint. As of now, it is more suitable for low latency
|
||||
inference with a small number of concurrent requests. vLLM's AWQ implementation has lower throughput than the unquantized version.
|
||||
```
|
||||
:::
|
||||
|
||||
To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
|
||||
Quantizing reduces the model's precision from FP16 to INT4, which effectively reduces the file size by ~70%.
|
||||
|
|
|
@ -14,10 +14,10 @@ The FP8 types typically supported in hardware have two distinct representations,
|
|||
- **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
|
||||
- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.
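
If you want to double-check these ranges locally, `torch.finfo` reports them for the corresponding PyTorch dtypes (this assumes a PyTorch build with FP8 dtypes, i.e. 2.1 or newer):

```python
import torch

# E4M3 (torch.float8_e4m3fn): largest finite value is 448, with support for nan.
print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0

# E5M2 (torch.float8_e5m2): largest finite value is 57344, with support for inf and nan.
print(torch.finfo(torch.float8_e5m2).max)    # 57344.0
```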
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
|
||||
FP8 models will run on compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
|
||||
```
|
||||
:::
|
||||
|
||||
## Quick Start with Online Dynamic Quantization
|
||||
|
||||
|
@ -32,9 +32,9 @@ model = LLM("facebook/opt-125m", quantization="fp8")
|
|||
result = model.generate("Hello, my name is")
|
||||
```
|
||||
|
||||
```{warning}
|
||||
:::{warning}
|
||||
Currently, we load the model at its original precision before quantizing it down to 8 bits, so you need enough memory to load the whole model.
|
||||
```
|
||||
:::
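
As a rough back-of-the-envelope check of whether the unquantized weights will fit, you can estimate the FP16 weight footprint first. The parameter count below is an assumption for illustration, and the estimate ignores activations, KV cache, and CUDA context:

```python
# Rough estimate only: real usage also includes activations, KV cache and CUDA context.
num_params = 8e9      # e.g. an 8B-parameter model (assumed for illustration)
bytes_per_param = 2   # FP16/BF16 weights before online FP8 quantization

weights_gib = num_params * bytes_per_param / 1024**3
print(f"~{weights_gib:.1f} GiB needed just to load the FP16 weights")  # ~14.9 GiB
```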
|
||||
|
||||
## Installation
|
||||
|
||||
|
@ -110,9 +110,9 @@ model.generate("Hello my name is")
|
|||
|
||||
Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
|
||||
```
|
||||
:::
|
||||
|
||||
```console
|
||||
$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
|
||||
|
@ -137,10 +137,10 @@ If you encounter any issues or have feature requests, please open an issue on th
|
|||
|
||||
## Deprecated Flow
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
The following information is preserved for reference and search purposes.
|
||||
The quantization method described below is deprecated in favor of the `llmcompressor` method described above.
|
||||
```
|
||||
:::
|
||||
|
||||
For static per-tensor offline quantization to FP8, please install the [AutoFP8 library](https://github.com/neuralmagic/autofp8).
|
||||
|
||||
|
|
|
@ -2,13 +2,13 @@
|
|||
|
||||
# GGUF
|
||||
|
||||
```{warning}
|
||||
:::{warning}
|
||||
Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, and it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
|
||||
```
|
||||
:::
|
||||
|
||||
```{warning}
|
||||
:::{warning}
|
||||
Currently, vLLM only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge it into a single-file model.
|
||||
```
|
||||
:::
|
||||
|
||||
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
|
||||
|
||||
|
@ -25,9 +25,9 @@ You can also add `--tensor-parallel-size 2` to enable tensor parallelism inferen
|
|||
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
|
||||
```
|
||||
|
||||
```{warning}
|
||||
:::{warning}
|
||||
We recommend using the tokenizer from the base model instead of the GGUF model, because the tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocabulary size.
|
||||
```
|
||||
:::
|
||||
|
||||
You can also use the GGUF model directly through the LLM entrypoint:
|
||||
|
||||
|
|
|
@ -4,7 +4,7 @@
|
|||
|
||||
Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
|
||||
|
||||
```{toctree}
|
||||
:::{toctree}
|
||||
:caption: Contents
|
||||
:maxdepth: 1
|
||||
|
||||
|
@ -15,4 +15,4 @@ gguf
|
|||
int8
|
||||
fp8
|
||||
quantized_kvcache
|
||||
```
|
||||
:::
|
||||
|
|
|
@ -7,9 +7,9 @@ This quantization method is particularly useful for reducing model size while ma
|
|||
|
||||
Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415).
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper).
|
||||
```
|
||||
:::
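
As a minimal sketch of serving one of the linked INT8 checkpoints, vLLM picks up the quantization configuration from the checkpoint itself. The model name below is only an example; substitute whichever W8A8 checkpoint from the linked collection you actually use:

```python
from vllm import LLM

# Example checkpoint name; replace with any INT8 (W8A8) model from the linked collection.
llm = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a8")
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```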
|
||||
|
||||
## Prerequisites
|
||||
|
||||
|
@ -119,9 +119,9 @@ $ lm_eval --model vllm \
|
|||
--batch_size 'auto'
|
||||
```
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
|
||||
```
|
||||
:::
|
||||
|
||||
## Best Practices
|
||||
|
||||
|
|
|
@ -4,128 +4,129 @@
|
|||
|
||||
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
|
||||
|
||||
```{list-table}
|
||||
:::{list-table}
|
||||
:header-rows: 1
|
||||
:widths: 20 8 8 8 8 8 8 8 8 8 8
|
||||
|
||||
* - Implementation
|
||||
- Volta
|
||||
- Turing
|
||||
- Ampere
|
||||
- Ada
|
||||
- Hopper
|
||||
- AMD GPU
|
||||
- Intel GPU
|
||||
- x86 CPU
|
||||
- AWS Inferentia
|
||||
- Google TPU
|
||||
* - AWQ
|
||||
- ✗
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✗
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✗
|
||||
- ✗
|
||||
* - GPTQ
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✗
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✗
|
||||
- ✗
|
||||
* - Marlin (GPTQ/AWQ/FP8)
|
||||
- ✗
|
||||
- ✗
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
* - INT8 (W8A8)
|
||||
- ✗
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✗
|
||||
- ✗
|
||||
- ✅︎
|
||||
- ✗
|
||||
- ✗
|
||||
* - FP8 (W8A8)
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
* - AQLM
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
* - bitsandbytes
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
* - DeepSpeedFP
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
* - GGUF
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✅︎
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
- ✗
|
||||
```
|
||||
- * Implementation
|
||||
* Volta
|
||||
* Turing
|
||||
* Ampere
|
||||
* Ada
|
||||
* Hopper
|
||||
* AMD GPU
|
||||
* Intel GPU
|
||||
* x86 CPU
|
||||
* AWS Inferentia
|
||||
* Google TPU
|
||||
- * AWQ
|
||||
* ✗
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✗
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✗
|
||||
* ✗
|
||||
- * GPTQ
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✗
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✗
|
||||
* ✗
|
||||
- * Marlin (GPTQ/AWQ/FP8)
|
||||
* ✗
|
||||
* ✗
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
- * INT8 (W8A8)
|
||||
* ✗
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✗
|
||||
* ✗
|
||||
* ✅︎
|
||||
* ✗
|
||||
* ✗
|
||||
- * FP8 (W8A8)
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
- * AQLM
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
- * bitsandbytes
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
- * DeepSpeedFP
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
- * GGUF
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✅︎
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
* ✗
|
||||
|
||||
:::
|
||||
|
||||
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
|
||||
- "✅︎" indicates that the quantization method is supported on the specified hardware.
|
||||
- "✗" indicates that the quantization method is not supported on the specified hardware.
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
|
||||
|
||||
For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
|
||||
```
|
||||
:::
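
As a minimal sketch of how this table translates into engine configuration, the implementation can be selected explicitly via the `quantization` argument when constructing the engine. The checkpoint name here is only an example; substitute a model quantized with whichever method you picked from the table:

```python
from vllm import LLM

# Example AWQ checkpoint; replace with your own quantized model.
# On hardware where AWQ is marked "✅︎" above, this selects vLLM's AWQ kernels.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```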
|
||||
|
|
|
@ -2,15 +2,15 @@
|
|||
|
||||
# Speculative Decoding
|
||||
|
||||
```{warning}
|
||||
:::{warning}
|
||||
Please note that speculative decoding in vLLM is not yet optimized and does
|
||||
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
|
||||
The work to optimize it is ongoing and can be followed here: <gh-issue:4630>
|
||||
```
|
||||
:::
|
||||
|
||||
```{warning}
|
||||
:::{warning}
|
||||
Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
|
||||
```
|
||||
:::
|
||||
|
||||
This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM.
|
||||
Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
|
||||
|
|
|
@ -95,10 +95,10 @@ completion = client.chat.completions.create(
|
|||
print(completion.choices[0].message.content)
|
||||
```
|
||||
|
||||
```{tip}
|
||||
:::{tip}
|
||||
While not strictly necessary, it's usually better to indicate in the prompt that JSON needs to be generated, and to specify which fields the LLM should fill and how.
|
||||
This can improve the results noticeably in most cases.
|
||||
```
|
||||
:::
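
For instance, here is a minimal sketch of such a prompt. It assumes a local vLLM OpenAI-compatible server and the `guided_json` extra body used earlier on this page; the base URL, model name, schema, and field names are made up for illustration:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local server

json_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # example model name
    messages=[{
        "role": "user",
        # Spell out in the prompt that JSON is expected and what each field means.
        "content": "Return a JSON object with the fields 'name' (string) and "
                   "'age' (integer) describing a fictional person.",
    }],
    extra_body={"guided_json": json_schema},
)
print(completion.choices[0].message.content)
```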
|
||||
|
||||
Finally, we have `guided_grammar`, which is probably the most difficult one to use but is really powerful, as it allows us to define complete languages such as SQL queries.
|
||||
It works by using a context-free EBNF grammar, which we can use, for example, to define a specific format of simplified SQL queries, as in the example below:
|
||||
|
|
|
@ -57,9 +57,9 @@ class Index:
|
|||
|
||||
def generate(self) -> str:
|
||||
content = f"# {self.title}\n\n{self.description}\n\n"
|
||||
content += "```{toctree}\n"
|
||||
content += ":::{toctree}\n"
|
||||
content += f":caption: {self.caption}\n:maxdepth: {self.maxdepth}\n"
|
||||
content += "\n".join(self.documents) + "\n```\n"
|
||||
content += "\n".join(self.documents) + "\n:::\n"
|
||||
return content
|
||||
|
||||
|
||||
|
|
|
@ -86,9 +86,9 @@ docker build -f Dockerfile.hpu -t vllm-hpu-env .
|
|||
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
|
||||
```
|
||||
|
||||
```{tip}
|
||||
:::{tip}
|
||||
If you observe the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to the "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have the `habana-container-runtime` package installed and that the `habana` container runtime is registered.
|
||||
```
|
||||
:::
|
||||
|
||||
## Extra information
|
||||
|
||||
|
@ -155,30 +155,30 @@ Gaudi2 devices. Configurations that are not listed may or may not work.
|
|||
|
||||
Currently, vLLM for HPU supports four execution modes, depending on the selected HPU PyTorch Bridge backend (via the `PT_HPU_LAZY_MODE` environment variable) and the `--enforce-eager` flag.
|
||||
|
||||
```{list-table} vLLM execution modes
|
||||
:::{list-table} vLLM execution modes
|
||||
:widths: 25 25 50
|
||||
:header-rows: 1
|
||||
|
||||
* - `PT_HPU_LAZY_MODE`
|
||||
- `enforce_eager`
|
||||
- execution mode
|
||||
* - 0
|
||||
- 0
|
||||
- torch.compile
|
||||
* - 0
|
||||
- 1
|
||||
- PyTorch eager mode
|
||||
* - 1
|
||||
- 0
|
||||
- HPU Graphs
|
||||
* - 1
|
||||
- 1
|
||||
- PyTorch lazy mode
|
||||
```
|
||||
- * `PT_HPU_LAZY_MODE`
|
||||
* `enforce_eager`
|
||||
* execution mode
|
||||
- * 0
|
||||
* 0
|
||||
* torch.compile
|
||||
- * 0
|
||||
* 1
|
||||
* PyTorch eager mode
|
||||
- * 1
|
||||
* 0
|
||||
* HPU Graphs
|
||||
- * 1
|
||||
* 1
|
||||
* PyTorch lazy mode
|
||||
:::
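
As a hedged sketch of selecting the "HPU Graphs" row from this table in an offline-inference script (the model name is only an example, and the assumption that the bridge mode must be set before vLLM and the Habana stack are imported is ours, not a documented guarantee):

```python
import os

# Assumption: choose the bridge mode before importing vLLM / the Habana stack.
os.environ["PT_HPU_LAZY_MODE"] = "1"

from vllm import LLM

# Lazy bridge plus graph execution enabled (enforce_eager=False) corresponds to
# the "HPU Graphs" row above. Replace the model name with your own.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enforce_eager=False)
```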
|
||||
|
||||
```{warning}
|
||||
:::{warning}
|
||||
In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should only be used for validating functional correctness. Their performance will be improved in future releases. For the best performance in 1.18.0, please use HPU Graphs or PyTorch lazy mode.
|
||||
```
|
||||
:::
|
||||
|
||||
(gaudi-bucketing-mechanism)=
|
||||
|
||||
|
@ -187,9 +187,9 @@ In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and
|
|||
Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. [Intel Gaudi Graph Compiler](https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime) is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution.
|
||||
In a dynamic inference serving scenario, there is a need to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently, this is achieved by "bucketing" the model's forward pass across two dimensions - `batch_size` and `sequence_length`.
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
Bucketing allows us to reduce the number of required graphs significantly, but it does not handle any graph compilation and device code generation - this is done in warmup and HPUGraph capture phase.
|
||||
```
|
||||
:::
|
||||
|
||||
Bucketing ranges are determined with 3 parameters - `min`, `step` and `max`. They can be set separately for the prompt and decode phases, and for the batch size and sequence length dimensions. These parameters can be observed in logs during vLLM startup:
|
||||
|
||||
|
@ -222,15 +222,15 @@ min = 128, step = 128, max = 512
|
|||
|
||||
In the logged scenario, 24 buckets were generated for prompt (prefill) runs, and 48 buckets for decode runs. Each bucket corresponds to a separate optimized device binary for a given model with specified tensor shapes. Whenever a batch of requests is processed, it is padded across the batch and sequence length dimensions to the smallest possible bucket.
|
||||
|
||||
```{warning}
|
||||
:::{warning}
|
||||
If a request exceeds the maximum bucket size in any dimension, it will be processed without padding, and its processing may require a graph compilation, potentially significantly increasing end-to-end latency. The boundaries of the buckets are user-configurable via environment variables, and the upper bucket boundaries can be increased to avoid such a scenario.
|
||||
```
|
||||
:::
|
||||
|
||||
As an example, if a request of 3 sequences with a max sequence length of 412 comes in to an idle vLLM server, it will be padded and executed as a `(4, 512)` prefill bucket, as `batch_size` (number of sequences) will be padded to 4 (the closest batch size bucket higher than 3), and the max sequence length will be padded to 512 (the closest sequence length bucket higher than 412). After the prefill stage, it will be executed as a `(4, 512)` decode bucket and will continue as that bucket until either the batch dimension changes (due to a request being finished), in which case it will become a `(2, 512)` bucket, or the context length increases above 512 tokens, in which case it will become a `(4, 640)` bucket.
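
Here is a small sketch of that rounding logic. The bucket lists below are assumptions chosen only to match the example; the real bucket values are derived from the `min`/`step`/`max` parameters logged at startup:

```python
def round_up_to_bucket(value, buckets):
    """Return the smallest bucket that can hold `value` (illustrative helper)."""
    return min(b for b in sorted(buckets) if b >= value)

batch_size_buckets = [1, 2, 4, 8, 16, 32, 64]        # assumed for illustration
sequence_length_buckets = [128, 256, 384, 512, 640]  # assumed for illustration

print(round_up_to_bucket(3, batch_size_buckets))         # 4
print(round_up_to_bucket(412, sequence_length_buckets))  # 512
```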
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
Bucketing is transparent to the client -- padding in the sequence length dimension is never returned to the client, and padding in the batch dimension does not create new requests.
|
||||
```
|
||||
:::
|
||||
|
||||
### Warmup
|
||||
|
||||
|
@ -252,9 +252,9 @@ INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size
|
|||
|
||||
This example uses the same buckets as in the [Bucketing Mechanism](#gaudi-bucketing-mechanism) section. Each output line corresponds to the execution of a single bucket. When a bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations.
|
||||
|
||||
```{tip}
|
||||
:::{tip}
|
||||
Compiling all the buckets might take some time and can be turned off with the `VLLM_SKIP_WARMUP=true` environment variable. Keep in mind that if you do that, you may face graph compilations when executing a given bucket for the first time. It is fine to disable warmup for development, but it's highly recommended to enable it in deployment.
|
||||
```
|
||||
:::
|
||||
|
||||
### HPU Graph capture
|
||||
|
||||
|
@ -269,9 +269,9 @@ With its default value (`VLLM_GRAPH_RESERVED_MEM=0.1`), 10% of usable memory wil
|
|||
Environment variable `VLLM_GRAPH_PROMPT_RATIO` determines the ratio of usable graph memory reserved for prefill and decode graphs. By default (`VLLM_GRAPH_PROMPT_RATIO=0.3`), both stages have equal memory constraints.
|
||||
A lower value corresponds to less usable graph memory reserved for the prefill stage; e.g. `VLLM_GRAPH_PROMPT_RATIO=0.2` will reserve 20% of usable graph memory for prefill graphs, and 80% of usable graph memory for decode graphs.
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
`gpu_memory_utilization` does not correspond to the absolute memory usage across the HPU. It specifies the memory margin after loading the model and performing a profile run. If the device has 100 GiB of total memory and 50 GiB of free memory after loading the model weights and executing the profiling run, `gpu_memory_utilization` at its default value will mark 90% of the 50 GiB as usable, leaving 5 GiB of margin, regardless of total device memory.
|
||||
```
|
||||
:::
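
Putting these numbers together, here is a quick arithmetic sketch. All figures are the example values from the text above (default `gpu_memory_utilization` and `VLLM_GRAPH_RESERVED_MEM`, plus the `VLLM_GRAPH_PROMPT_RATIO=0.2` example); the split of the remainder is a rough approximation, not an exact accounting:

```python
free_after_load_gib = 50.0       # free device memory after weights + profiling run
gpu_memory_utilization = 0.9     # vLLM default
vllm_graph_reserved_mem = 0.1    # default share of usable memory kept for HPU Graphs
vllm_graph_prompt_ratio = 0.2    # example value from the paragraph above

usable_gib = free_after_load_gib * gpu_memory_utilization  # 45.0 GiB usable
graph_gib = usable_gib * vllm_graph_reserved_mem            # 4.5 GiB for HPU Graphs
prefill_graph_gib = graph_gib * vllm_graph_prompt_ratio     # 0.9 GiB for prefill graphs
decode_graph_gib = graph_gib - prefill_graph_gib            # 3.6 GiB for decode graphs
remainder_gib = usable_gib - graph_gib                      # ~40.5 GiB, roughly what is
                                                            # left for the KV cache etc.
print(usable_gib, graph_gib, prefill_graph_gib, decode_graph_gib, remainder_gib)
```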
|
||||
|
||||
The user can also configure the strategy for capturing HPU Graphs for the prompt and decode stages separately. The strategy affects the order of capturing graphs. There are two strategies implemented:
|
||||
\- `max_bs` - the graph capture queue will be sorted in descending order by batch size. Buckets with equal batch sizes are sorted by sequence length in ascending order (e.g. `(64, 128)`, `(64, 256)`, `(32, 128)`, `(32, 256)`, `(1, 128)`, `(1, 256)`); this is the default strategy for decode
|
||||
|
@ -279,9 +279,9 @@ User can also configure the strategy for capturing HPU Graphs for prompt and dec
|
|||
|
||||
When there is a large amount of requests pending, the vLLM scheduler will attempt to fill the maximum batch size for decode as soon as possible. When a request is finished, the decode batch size decreases. When that happens, vLLM will attempt to schedule a prefill iteration for requests in the waiting queue, to fill the decode batch size back to its previous state. This means that in a full load scenario, the decode batch size is often at its maximum, which makes large batch size HPU Graphs crucial to capture, as reflected by the `max_bs` strategy. On the other hand, prefills will most frequently be executed with very low batch sizes (1-4), which is reflected in the `min_tokens` strategy.
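
As a quick sketch of the `max_bs` ordering described above, with bucket tuples written as `(batch_size, seq_len)` and the values taken from the earlier example:

```python
buckets = [(1, 128), (1, 256), (32, 128), (32, 256), (64, 128), (64, 256)]

# max_bs: descending batch size first, then ascending sequence length.
capture_order = sorted(buckets, key=lambda b: (-b[0], b[1]))
print(capture_order)
# [(64, 128), (64, 256), (32, 128), (32, 256), (1, 128), (1, 256)]
```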
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
`VLLM_GRAPH_PROMPT_RATIO` does not set a hard limit on the memory taken by graphs for each stage (prefill and decode). vLLM will first attempt to use up the entirety of the usable prefill graph memory (usable graph memory * `VLLM_GRAPH_PROMPT_RATIO`) for capturing prefill HPU Graphs, then it will attempt to do the same for decode graphs and the usable decode graph memory pool. If one stage is fully captured and there is unused memory left within the usable graph memory pool, vLLM will attempt further graph capture for the other stage, until no more HPU Graphs can be captured without exceeding the reserved memory pool. The behavior of this mechanism can be observed in the example below.
|
||||
```
|
||||
:::
|
||||
|
||||
Each described step is logged by vLLM server, as follows (negative values correspond to memory being released):
|
||||
|
||||
|
@ -352,13 +352,13 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi
|
|||
|
||||
- `VLLM_{phase}_{dim}_BUCKET_{param}` - collection of 12 environment variables configuring ranges of bucketing mechanism
|
||||
|
||||
- `{phase}` is either `PROMPT` or `DECODE`
|
||||
* `{phase}` is either `PROMPT` or `DECODE`
|
||||
|
||||
- `{dim}` is either `BS`, `SEQ` or `BLOCK`
|
||||
* `{dim}` is either `BS`, `SEQ` or `BLOCK`
|
||||
|
||||
- `{param}` is either `MIN`, `STEP` or `MAX`
|
||||
* `{param}` is either `MIN`, `STEP` or `MAX`
|
||||
|
||||
- Default values:
|
||||
* Default values:
|
||||
|
||||
- Prompt:
|
||||
- batch size min (`VLLM_PROMPT_BS_BUCKET_MIN`): `1`
|
||||
|
|
|
@ -2,374 +2,374 @@
|
|||
|
||||
vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor specific instructions:
|
||||
|
||||
::::{tab-set}
|
||||
:::::{tab-set}
|
||||
:sync-group: device
|
||||
|
||||
:::{tab-item} TPU
|
||||
::::{tab-item} TPU
|
||||
:sync: tpu
|
||||
|
||||
```{include} tpu.inc.md
|
||||
:::{include} tpu.inc.md
|
||||
:start-after: "# Installation"
|
||||
:end-before: "## Requirements"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Intel Gaudi
|
||||
:sync: hpu-gaudi
|
||||
|
||||
```{include} hpu-gaudi.inc.md
|
||||
:start-after: "# Installation"
|
||||
:end-before: "## Requirements"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Neuron
|
||||
:sync: neuron
|
||||
|
||||
```{include} neuron.inc.md
|
||||
:start-after: "# Installation"
|
||||
:end-before: "## Requirements"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} OpenVINO
|
||||
:sync: openvino
|
||||
|
||||
```{include} openvino.inc.md
|
||||
:start-after: "# Installation"
|
||||
:end-before: "## Requirements"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Intel Gaudi
|
||||
:sync: hpu-gaudi
|
||||
|
||||
:::{include} hpu-gaudi.inc.md
|
||||
:start-after: "# Installation"
|
||||
:end-before: "## Requirements"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Neuron
|
||||
:sync: neuron
|
||||
|
||||
:::{include} neuron.inc.md
|
||||
:start-after: "# Installation"
|
||||
:end-before: "## Requirements"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} OpenVINO
|
||||
:sync: openvino
|
||||
|
||||
:::{include} openvino.inc.md
|
||||
:start-after: "# Installation"
|
||||
:end-before: "## Requirements"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
:::::
|
||||
|
||||
## Requirements
|
||||
|
||||
::::{tab-set}
|
||||
:::::{tab-set}
|
||||
:sync-group: device
|
||||
|
||||
:::{tab-item} TPU
|
||||
::::{tab-item} TPU
|
||||
:sync: tpu
|
||||
|
||||
```{include} tpu.inc.md
|
||||
:::{include} tpu.inc.md
|
||||
:start-after: "## Requirements"
|
||||
:end-before: "## Configure a new environment"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Intel Gaudi
|
||||
:sync: hpu-gaudi
|
||||
|
||||
```{include} hpu-gaudi.inc.md
|
||||
:start-after: "## Requirements"
|
||||
:end-before: "## Configure a new environment"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Neuron
|
||||
:sync: neuron
|
||||
|
||||
```{include} neuron.inc.md
|
||||
:start-after: "## Requirements"
|
||||
:end-before: "## Configure a new environment"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} OpenVINO
|
||||
:sync: openvino
|
||||
|
||||
```{include} openvino.inc.md
|
||||
:start-after: "## Requirements"
|
||||
:end-before: "## Set up using Python"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Intel Gaudi
|
||||
:sync: hpu-gaudi
|
||||
|
||||
:::{include} hpu-gaudi.inc.md
|
||||
:start-after: "## Requirements"
|
||||
:end-before: "## Configure a new environment"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Neuron
|
||||
:sync: neuron
|
||||
|
||||
:::{include} neuron.inc.md
|
||||
:start-after: "## Requirements"
|
||||
:end-before: "## Configure a new environment"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} OpenVINO
|
||||
:sync: openvino
|
||||
|
||||
:::{include} openvino.inc.md
|
||||
:start-after: "## Requirements"
|
||||
:end-before: "## Set up using Python"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
:::::
|
||||
|
||||
## Configure a new environment
|
||||
|
||||
::::{tab-set}
|
||||
:::::{tab-set}
|
||||
:sync-group: device
|
||||
|
||||
:::{tab-item} TPU
|
||||
::::{tab-item} TPU
|
||||
:sync: tpu
|
||||
|
||||
```{include} tpu.inc.md
|
||||
:::{include} tpu.inc.md
|
||||
:start-after: "## Configure a new environment"
|
||||
:end-before: "## Set up using Python"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Intel Gaudi
|
||||
:sync: hpu-gaudi
|
||||
|
||||
```{include} hpu-gaudi.inc.md
|
||||
:start-after: "## Configure a new environment"
|
||||
:end-before: "## Set up using Python"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Neuron
|
||||
:sync: neuron
|
||||
|
||||
```{include} neuron.inc.md
|
||||
:start-after: "## Configure a new environment"
|
||||
:end-before: "## Set up using Python"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} OpenVINO
|
||||
:sync: openvino
|
||||
|
||||
```{include} ../python_env_setup.inc.md
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Intel Gaudi
|
||||
:sync: hpu-gaudi
|
||||
|
||||
:::{include} hpu-gaudi.inc.md
|
||||
:start-after: "## Configure a new environment"
|
||||
:end-before: "## Set up using Python"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Neuron
|
||||
:sync: neuron
|
||||
|
||||
:::{include} neuron.inc.md
|
||||
:start-after: "## Configure a new environment"
|
||||
:end-before: "## Set up using Python"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} OpenVINO
|
||||
:sync: openvino
|
||||
|
||||
:::{include} ../python_env_setup.inc.md
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
:::::
|
||||
|
||||
## Set up using Python
|
||||
|
||||
### Pre-built wheels
|
||||
|
||||
::::{tab-set}
|
||||
:::::{tab-set}
|
||||
:sync-group: device
|
||||
|
||||
:::{tab-item} TPU
|
||||
::::{tab-item} TPU
|
||||
:sync: tpu
|
||||
|
||||
```{include} tpu.inc.md
|
||||
:::{include} tpu.inc.md
|
||||
:start-after: "### Pre-built wheels"
|
||||
:end-before: "### Build wheel from source"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Intel Gaudi
|
||||
:sync: hpu-gaudi
|
||||
|
||||
```{include} hpu-gaudi.inc.md
|
||||
:start-after: "### Pre-built wheels"
|
||||
:end-before: "### Build wheel from source"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Neuron
|
||||
:sync: neuron
|
||||
|
||||
```{include} neuron.inc.md
|
||||
:start-after: "### Pre-built wheels"
|
||||
:end-before: "### Build wheel from source"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} OpenVINO
|
||||
:sync: openvino
|
||||
|
||||
```{include} openvino.inc.md
|
||||
:start-after: "### Pre-built wheels"
|
||||
:end-before: "### Build wheel from source"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Intel Gaudi
|
||||
:sync: hpu-gaudi
|
||||
|
||||
:::{include} hpu-gaudi.inc.md
|
||||
:start-after: "### Pre-built wheels"
|
||||
:end-before: "### Build wheel from source"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Neuron
|
||||
:sync: neuron
|
||||
|
||||
:::{include} neuron.inc.md
|
||||
:start-after: "### Pre-built wheels"
|
||||
:end-before: "### Build wheel from source"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} OpenVINO
|
||||
:sync: openvino
|
||||
|
||||
:::{include} openvino.inc.md
|
||||
:start-after: "### Pre-built wheels"
|
||||
:end-before: "### Build wheel from source"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
:::::
|
||||
|
||||
### Build wheel from source
|
||||
|
||||
::::{tab-set}
|
||||
:::::{tab-set}
|
||||
:sync-group: device
|
||||
|
||||
:::{tab-item} TPU
|
||||
::::{tab-item} TPU
|
||||
:sync: tpu
|
||||
|
||||
```{include} tpu.inc.md
|
||||
:::{include} tpu.inc.md
|
||||
:start-after: "### Build wheel from source"
|
||||
:end-before: "## Set up using Docker"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Intel Gaudi
|
||||
:sync: hpu-gaudi
|
||||
|
||||
```{include} hpu-gaudi.inc.md
|
||||
:start-after: "### Build wheel from source"
|
||||
:end-before: "## Set up using Docker"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Neuron
|
||||
:sync: neuron
|
||||
|
||||
```{include} neuron.inc.md
|
||||
:start-after: "### Build wheel from source"
|
||||
:end-before: "## Set up using Docker"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} OpenVINO
|
||||
:sync: openvino
|
||||
|
||||
```{include} openvino.inc.md
|
||||
:start-after: "### Build wheel from source"
|
||||
:end-before: "## Set up using Docker"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Intel Gaudi
|
||||
:sync: hpu-gaudi
|
||||
|
||||
:::{include} hpu-gaudi.inc.md
|
||||
:start-after: "### Build wheel from source"
|
||||
:end-before: "## Set up using Docker"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Neuron
|
||||
:sync: neuron
|
||||
|
||||
:::{include} neuron.inc.md
|
||||
:start-after: "### Build wheel from source"
|
||||
:end-before: "## Set up using Docker"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} OpenVINO
|
||||
:sync: openvino
|
||||
|
||||
:::{include} openvino.inc.md
|
||||
:start-after: "### Build wheel from source"
|
||||
:end-before: "## Set up using Docker"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
:::::
|
||||
|
||||
## Set up using Docker
|
||||
|
||||
### Pre-built images
|
||||
|
||||
::::{tab-set}
|
||||
:::::{tab-set}
|
||||
:sync-group: device
|
||||
|
||||
:::{tab-item} TPU
|
||||
::::{tab-item} TPU
|
||||
:sync: tpu
|
||||
|
||||
```{include} tpu.inc.md
|
||||
:::{include} tpu.inc.md
|
||||
:start-after: "### Pre-built images"
|
||||
:end-before: "### Build image from source"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Intel Gaudi
|
||||
:sync: hpu-gaudi
|
||||
|
||||
```{include} hpu-gaudi.inc.md
|
||||
:start-after: "### Pre-built images"
|
||||
:end-before: "### Build image from source"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Neuron
|
||||
:sync: neuron
|
||||
|
||||
```{include} neuron.inc.md
|
||||
:start-after: "### Pre-built images"
|
||||
:end-before: "### Build image from source"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} OpenVINO
|
||||
:sync: openvino
|
||||
|
||||
```{include} openvino.inc.md
|
||||
:start-after: "### Pre-built images"
|
||||
:end-before: "### Build image from source"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Intel Gaudi
|
||||
:sync: hpu-gaudi
|
||||
|
||||
:::{include} hpu-gaudi.inc.md
|
||||
:start-after: "### Pre-built images"
|
||||
:end-before: "### Build image from source"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Neuron
|
||||
:sync: neuron
|
||||
|
||||
:::{include} neuron.inc.md
|
||||
:start-after: "### Pre-built images"
|
||||
:end-before: "### Build image from source"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} OpenVINO
|
||||
:sync: openvino
|
||||
|
||||
:::{include} openvino.inc.md
|
||||
:start-after: "### Pre-built images"
|
||||
:end-before: "### Build image from source"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
:::::
|
||||
|
||||
### Build image from source
|
||||
|
||||
::::{tab-set}
|
||||
:::::{tab-set}
|
||||
:sync-group: device
|
||||
|
||||
:::{tab-item} TPU
|
||||
::::{tab-item} TPU
|
||||
:sync: tpu
|
||||
|
||||
```{include} tpu.inc.md
|
||||
:::{include} tpu.inc.md
|
||||
:start-after: "### Build image from source"
|
||||
:end-before: "## Extra information"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Intel Gaudi
|
||||
:sync: hpu-gaudi
|
||||
|
||||
```{include} hpu-gaudi.inc.md
|
||||
:start-after: "### Build image from source"
|
||||
:end-before: "## Extra information"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Neuron
|
||||
:sync: neuron
|
||||
|
||||
```{include} neuron.inc.md
|
||||
:start-after: "### Build image from source"
|
||||
:end-before: "## Extra information"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} OpenVINO
|
||||
:sync: openvino
|
||||
|
||||
```{include} openvino.inc.md
|
||||
:start-after: "### Build image from source"
|
||||
:end-before: "## Extra information"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Intel Gaudi
|
||||
:sync: hpu-gaudi
|
||||
|
||||
:::{include} hpu-gaudi.inc.md
|
||||
:start-after: "### Build image from source"
|
||||
:end-before: "## Extra information"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Neuron
|
||||
:sync: neuron
|
||||
|
||||
:::{include} neuron.inc.md
|
||||
:start-after: "### Build image from source"
|
||||
:end-before: "## Extra information"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} OpenVINO
|
||||
:sync: openvino
|
||||
|
||||
:::{include} openvino.inc.md
|
||||
:start-after: "### Build image from source"
|
||||
:end-before: "## Extra information"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
:::::
|
||||
|
||||
## Extra information
|
||||
|
||||
::::{tab-set}
|
||||
:::::{tab-set}
|
||||
:sync-group: device
|
||||
|
||||
:::{tab-item} TPU
|
||||
::::{tab-item} TPU
|
||||
:sync: tpu
|
||||
|
||||
```{include} tpu.inc.md
|
||||
:::{include} tpu.inc.md
|
||||
:start-after: "## Extra information"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Intel Gaudi
|
||||
:sync: hpu-gaudi
|
||||
|
||||
```{include} hpu-gaudi.inc.md
|
||||
:start-after: "## Extra information"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Neuron
|
||||
:sync: neuron
|
||||
|
||||
```{include} neuron.inc.md
|
||||
:start-after: "## Extra information"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} OpenVINO
|
||||
:sync: openvino
|
||||
|
||||
```{include} openvino.inc.md
|
||||
:start-after: "## Extra information"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Intel Gaudi
|
||||
:sync: hpu-gaudi
|
||||
|
||||
:::{include} hpu-gaudi.inc.md
|
||||
:start-after: "## Extra information"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Neuron
|
||||
:sync: neuron
|
||||
|
||||
:::{include} neuron.inc.md
|
||||
:start-after: "## Extra information"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} OpenVINO
|
||||
:sync: openvino
|
||||
|
||||
:::{include} openvino.inc.md
|
||||
:start-after: "## Extra information"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
:::::
|
||||
|
|
|
@ -67,9 +67,9 @@ Currently, there are no pre-built Neuron wheels.
|
|||
|
||||
### Build wheel from source
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
The currently supported version of PyTorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
|
||||
```
|
||||
:::
|
||||
|
||||
The following instructions are applicable to Neuron SDK 2.16 and beyond.
|
||||
|
||||
|
|
|
@ -47,10 +47,10 @@ When you request queued resources, the request is added to a queue maintained by
|
|||
the Cloud TPU service. When the requested resource becomes available, it's
|
||||
assigned to your Google Cloud project for your immediate exclusive use.
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
In all of the following commands, replace the ALL CAPS parameter names with
|
||||
appropriate values. See the parameter descriptions table for more information.
|
||||
```
|
||||
:::
|
||||
|
||||
### Provision Cloud TPUs with GKE
|
||||
|
||||
|
@ -75,33 +75,33 @@ gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
|
|||
--service-account SERVICE_ACCOUNT
|
||||
```
|
||||
|
||||
```{list-table} Parameter descriptions
|
||||
:::{list-table} Parameter descriptions
|
||||
:header-rows: 1
|
||||
|
||||
* - Parameter name
|
||||
- Description
|
||||
* - QUEUED_RESOURCE_ID
|
||||
- The user-assigned ID of the queued resource request.
|
||||
* - TPU_NAME
|
||||
- The user-assigned name of the TPU which is created when the queued
|
||||
- * Parameter name
|
||||
* Description
|
||||
- * QUEUED_RESOURCE_ID
|
||||
* The user-assigned ID of the queued resource request.
|
||||
- * TPU_NAME
|
||||
* The user-assigned name of the TPU which is created when the queued
|
||||
resource request is allocated.
|
||||
* - PROJECT_ID
|
||||
- Your Google Cloud project
|
||||
* - ZONE
|
||||
- The GCP zone where you want to create your Cloud TPU. The value you use
|
||||
- * PROJECT_ID
|
||||
* Your Google Cloud project
|
||||
- * ZONE
|
||||
* The GCP zone where you want to create your Cloud TPU. The value you use
|
||||
depends on the version of TPUs you are using. For more information, see
|
||||
`TPU regions and zones <https://cloud.google.com/tpu/docs/regions-zones>`_
|
||||
* - ACCELERATOR_TYPE
|
||||
- The TPU version you want to use. Specify the TPU version, for example
|
||||
- * ACCELERATOR_TYPE
|
||||
* The TPU version you want to use. Specify the TPU version, for example
|
||||
`v5litepod-4` specifies a v5e TPU with 4 cores. For more information,
|
||||
see `TPU versions <https://cloud.devsite.corp.google.com/tpu/docs/system-architecture-tpu-vm#versions>`_.
|
||||
* - RUNTIME_VERSION
|
||||
- The TPU VM runtime version to use. For more information see `TPU VM images <https://cloud.google.com/tpu/docs/runtimes>`_.
|
||||
* - SERVICE_ACCOUNT
|
||||
- The email address for your service account. You can find it in the IAM
|
||||
- * RUNTIME_VERSION
|
||||
* The TPU VM runtime version to use. For more information see `TPU VM images <https://cloud.google.com/tpu/docs/runtimes>`_.
|
||||
- * SERVICE_ACCOUNT
|
||||
* The email address for your service account. You can find it in the IAM
|
||||
Cloud Console under *Service Accounts*. For example:
|
||||
`tpu-service-account@<your_project_ID>.iam.gserviceaccount.com`
|
||||
```
|
||||
:::
|
||||
|
||||
Connect to your TPU using SSH:
|
||||
|
||||
|
@ -178,15 +178,15 @@ Run the Docker image with the following command:
|
|||
docker run --privileged --net host --shm-size=16G -it vllm-tpu
|
||||
```
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
Since TPU relies on XLA which requires static shapes, vLLM bucketizes the
|
||||
possible input shapes and compiles an XLA graph for each shape. The
|
||||
compilation time may take 20~30 minutes on the first run. However, the
|
||||
compilation time reduces to ~5 minutes afterwards because the XLA graphs are
|
||||
cached on disk (in {code}`VLLM_XLA_CACHE_PATH` or {code}`~/.cache/vllm/xla_cache` by default).
|
||||
```
|
||||
:::
|
||||
|
||||
````{tip}
|
||||
:::{tip}
|
||||
If you encounter the following error:
|
||||
|
||||
```console
|
||||
|
@ -198,9 +198,10 @@ file or directory
|
|||
Install OpenBLAS with the following command:
|
||||
|
||||
```console
|
||||
$ sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
|
||||
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
|
||||
```
|
||||
````
|
||||
|
||||
:::
|
||||
|
||||
## Extra information
|
||||
|
||||
|
|
|
@ -25,9 +25,9 @@ pip install -r requirements-cpu.txt
|
|||
pip install -e .
|
||||
```
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device.
|
||||
```
|
||||
:::
|
||||
|
||||
#### Troubleshooting
|
||||
|
||||
|
|
|
@ -2,86 +2,86 @@
|
|||
|
||||
vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor specific instructions:
|
||||
|
||||
::::{tab-set}
|
||||
:::::{tab-set}
|
||||
:sync-group: device
|
||||
|
||||
:::{tab-item} x86
|
||||
::::{tab-item} x86
|
||||
:sync: x86
|
||||
|
||||
```{include} x86.inc.md
|
||||
:::{include} x86.inc.md
|
||||
:start-after: "# Installation"
|
||||
:end-before: "## Requirements"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} ARM
|
||||
:sync: arm
|
||||
|
||||
```{include} arm.inc.md
|
||||
:start-after: "# Installation"
|
||||
:end-before: "## Requirements"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Apple silicon
|
||||
:sync: apple
|
||||
|
||||
```{include} apple.inc.md
|
||||
:start-after: "# Installation"
|
||||
:end-before: "## Requirements"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} ARM
|
||||
:sync: arm
|
||||
|
||||
:::{include} arm.inc.md
|
||||
:start-after: "# Installation"
|
||||
:end-before: "## Requirements"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Apple silicon
|
||||
:sync: apple
|
||||
|
||||
:::{include} apple.inc.md
|
||||
:start-after: "# Installation"
|
||||
:end-before: "## Requirements"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
:::::
|
||||
|
||||
## Requirements
|
||||
|
||||
- Python: 3.9 -- 3.12
|
||||
|
||||
::::{tab-set}
|
||||
:::::{tab-set}
|
||||
:sync-group: device
|
||||
|
||||
:::{tab-item} x86
|
||||
::::{tab-item} x86
|
||||
:sync: x86
|
||||
|
||||
```{include} x86.inc.md
|
||||
:::{include} x86.inc.md
|
||||
:start-after: "## Requirements"
|
||||
:end-before: "## Set up using Python"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} ARM
|
||||
:sync: arm
|
||||
|
||||
```{include} arm.inc.md
|
||||
:start-after: "## Requirements"
|
||||
:end-before: "## Set up using Python"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Apple silicon
|
||||
:sync: apple
|
||||
|
||||
```{include} apple.inc.md
|
||||
:start-after: "## Requirements"
|
||||
:end-before: "## Set up using Python"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} ARM
|
||||
:sync: arm
|
||||
|
||||
:::{include} arm.inc.md
|
||||
:start-after: "## Requirements"
|
||||
:end-before: "## Set up using Python"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Apple silicon
|
||||
:sync: apple
|
||||
|
||||
:::{include} apple.inc.md
|
||||
:start-after: "## Requirements"
|
||||
:end-before: "## Set up using Python"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
:::::
|
||||
|
||||
## Set up using Python
|
||||
|
||||
### Create a new Python environment
|
||||
|
||||
```{include} ../python_env_setup.inc.md
|
||||
```
|
||||
:::{include} ../python_env_setup.inc.md
|
||||
:::
|
||||
|
||||
### Pre-built wheels
|
||||
|
||||
|
@ -89,41 +89,41 @@ Currently, there are no pre-built CPU wheels.
|
|||
|
||||
### Build wheel from source
|
||||
|
||||
::::{tab-set}
|
||||
:::::{tab-set}
|
||||
:sync-group: device
|
||||
|
||||
:::{tab-item} x86
|
||||
::::{tab-item} x86
|
||||
:sync: x86
|
||||
|
||||
```{include} x86.inc.md
|
||||
:::{include} x86.inc.md
|
||||
:start-after: "### Build wheel from source"
|
||||
:end-before: "## Set up using Docker"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} ARM
|
||||
:sync: arm
|
||||
|
||||
```{include} arm.inc.md
|
||||
:start-after: "### Build wheel from source"
|
||||
:end-before: "## Set up using Docker"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Apple silicon
|
||||
:sync: apple
|
||||
|
||||
```{include} apple.inc.md
|
||||
:start-after: "### Build wheel from source"
|
||||
:end-before: "## Set up using Docker"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} ARM
|
||||
:sync: arm
|
||||
|
||||
:::{include} arm.inc.md
|
||||
:start-after: "### Build wheel from source"
|
||||
:end-before: "## Set up using Docker"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Apple silicon
|
||||
:sync: apple
|
||||
|
||||
:::{include} apple.inc.md
|
||||
:start-after: "### Build wheel from source"
|
||||
:end-before: "## Set up using Docker"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
:::::
|
||||
|
||||
## Set up using Docker
|
||||
|
||||
### Pre-built images
|
||||
|
@ -142,9 +142,9 @@ $ docker run -it \
|
|||
vllm-cpu-env
|
||||
```
|
||||
|
||||
:::{tip}
|
||||
::::{tip}
|
||||
For ARM or Apple silicon, use `Dockerfile.arm`
|
||||
:::
|
||||
::::
|
||||
|
||||
## Supported features
|
||||
|
||||
|
|
|
@ -17,10 +17,10 @@ vLLM initially supports basic model inferencing and serving on x86 CPU platform,
|
|||
:::{include} build.inc.md
|
||||
:::
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
- AVX512_BF16 is an ISA extension that provides native BF16 data type conversion and vector product instructions, which brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16.
|
||||
- If you want to force-enable AVX512_BF16 for cross-compilation, please set the environment variable `VLLM_CPU_AVX512BF16=1` before building.
|
||||
```
|
||||
:::
|
||||
|
||||
## Set up using Docker
|
||||
|
||||
|
|
|
@ -10,9 +10,9 @@ vLLM contains pre-compiled C++ and CUDA (12.1) binaries.
|
|||
|
||||
### Create a new Python environment
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
PyTorch installed via `conda` will statically link the `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <gh-issue:8420> for more details.
|
||||
```
|
||||
:::
|
||||
|
||||
In order to be performant, vLLM has to compile many CUDA kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different build configurations.
|
||||
|
||||
|
@ -100,10 +100,10 @@ pip install --editable .
|
|||
|
||||
You can find more information about vLLM's wheels in <project:#install-the-latest-code>.
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
There is a possibility that your source code may have a different commit ID compared to the latest vLLM wheel, which could potentially lead to unknown errors.
|
||||
It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to <project:#install-the-latest-code> for instructions on how to install a specified wheel.
|
||||
```
|
||||
:::
|
||||
|
||||
#### Full build (with compilation)
|
||||
|
||||
|
@ -115,7 +115,7 @@ cd vllm
|
|||
pip install -e .
|
||||
```
|
||||
|
||||
```{tip}
|
||||
:::{tip}
|
||||
Building from source requires a lot of compilation. If you are building from source repeatedly, it's more efficient to cache the compilation results.
|
||||
|
||||
For example, you can install [ccache](https://github.com/ccache/ccache) using `conda install ccache` or `apt install ccache` .
|
||||
|
@ -123,7 +123,7 @@ As long as `which ccache` command can find the `ccache` binary, it will be used
|
|||
|
||||
[sccache](https://github.com/mozilla/sccache) works similarly to `ccache`, but has the capability to utilize caching in remote storage environments.
|
||||
The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`.
|
||||
```
|
||||
:::
|
||||
|
||||
##### Use an existing PyTorch installation
|
||||
|
||||
|
|
|
@ -2,299 +2,299 @@
|
|||
|
||||
vLLM is a Python library that supports the following GPU variants. Select your GPU type to see vendor specific instructions:
|
||||
|
||||
::::{tab-set}
|
||||
:::::{tab-set}
|
||||
:sync-group: device
|
||||
|
||||
:::{tab-item} CUDA
|
||||
::::{tab-item} CUDA
|
||||
:sync: cuda
|
||||
|
||||
```{include} cuda.inc.md
|
||||
:::{include} cuda.inc.md
|
||||
:start-after: "# Installation"
|
||||
:end-before: "## Requirements"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} ROCm
|
||||
:sync: rocm
|
||||
|
||||
```{include} rocm.inc.md
|
||||
:start-after: "# Installation"
|
||||
:end-before: "## Requirements"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} XPU
|
||||
:sync: xpu
|
||||
|
||||
```{include} xpu.inc.md
|
||||
:start-after: "# Installation"
|
||||
:end-before: "## Requirements"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} ROCm
|
||||
:sync: rocm
|
||||
|
||||
:::{include} rocm.inc.md
|
||||
:start-after: "# Installation"
|
||||
:end-before: "## Requirements"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} XPU
|
||||
:sync: xpu
|
||||
|
||||
:::{include} xpu.inc.md
|
||||
:start-after: "# Installation"
|
||||
:end-before: "## Requirements"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
:::::
|
||||
|
||||
## Requirements
|
||||
|
||||
- OS: Linux
|
||||
- Python: 3.9 -- 3.12
|
||||
|
||||
::::{tab-set}
|
||||
:::::{tab-set}
|
||||
:sync-group: device
|
||||
|
||||
:::{tab-item} CUDA
|
||||
::::{tab-item} CUDA
|
||||
:sync: cuda
|
||||
|
||||
```{include} cuda.inc.md
|
||||
:::{include} cuda.inc.md
|
||||
:start-after: "## Requirements"
|
||||
:end-before: "## Set up using Python"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} ROCm
|
||||
:sync: rocm
|
||||
|
||||
```{include} rocm.inc.md
|
||||
:start-after: "## Requirements"
|
||||
:end-before: "## Set up using Python"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} XPU
|
||||
:sync: xpu
|
||||
|
||||
```{include} xpu.inc.md
|
||||
:start-after: "## Requirements"
|
||||
:end-before: "## Set up using Python"
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} ROCm
|
||||
:sync: rocm
|
||||
|
||||
:::{include} rocm.inc.md
|
||||
:start-after: "## Requirements"
|
||||
:end-before: "## Set up using Python"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} XPU
|
||||
:sync: xpu
|
||||
|
||||
:::{include} xpu.inc.md
|
||||
:start-after: "## Requirements"
|
||||
:end-before: "## Set up using Python"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
:::::
|
||||
|
||||
## Set up using Python
|
||||
|
||||
### Create a new Python environment
|
||||
|
||||
```{include} ../python_env_setup.inc.md
|
||||
```
|
||||
|
||||
::::{tab-set}
|
||||
:sync-group: device
|
||||
|
||||
:::{tab-item} CUDA
|
||||
:sync: cuda
|
||||
|
||||
```{include} cuda.inc.md
|
||||
:start-after: "## Create a new Python environment"
|
||||
:end-before: "### Pre-built wheels"
|
||||
```
|
||||
|
||||
:::{include} ../python_env_setup.inc.md
|
||||
:::
|
||||
|
||||
:::{tab-item} ROCm
|
||||
:::::{tab-set}
|
||||
:sync-group: device
|
||||
|
||||
::::{tab-item} CUDA
|
||||
:sync: cuda
|
||||
|
||||
:::{include} cuda.inc.md
|
||||
:start-after: "## Create a new Python environment"
|
||||
:end-before: "### Pre-built wheels"
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} ROCm
|
||||
:sync: rocm
|
||||
|
||||
There is no extra information on creating a new Python environment for this device.
|
||||
|
||||
:::
|
||||
::::
|
||||
|
||||
:::{tab-item} XPU
|
||||
::::{tab-item} XPU
|
||||
:sync: xpu
|
||||
|
||||
There is no extra information on creating a new Python environment for this device.

:::

::::

:::::

### Pre-built wheels

::::{tab-set}
:::::{tab-set}
:sync-group: device

:::{tab-item} CUDA
::::{tab-item} CUDA
:sync: cuda

```{include} cuda.inc.md
:::{include} cuda.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
```

:::

:::{tab-item} ROCm
:sync: rocm

```{include} rocm.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
```

:::

:::{tab-item} XPU
:sync: xpu

```{include} xpu.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
```

:::

::::

::::{tab-item} ROCm
:sync: rocm

:::{include} rocm.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::

::::

::::{tab-item} XPU
:sync: xpu

:::{include} xpu.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::

::::

:::::

(build-from-source)=

### Build wheel from source

::::{tab-set}
:::::{tab-set}
:sync-group: device

:::{tab-item} CUDA
::::{tab-item} CUDA
:sync: cuda

```{include} cuda.inc.md
:::{include} cuda.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
```

:::

:::{tab-item} ROCm
:sync: rocm

```{include} rocm.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
```

:::

:::{tab-item} XPU
:sync: xpu

```{include} xpu.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
```

:::

::::

::::{tab-item} ROCm
:sync: rocm

:::{include} rocm.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::

::::

::::{tab-item} XPU
:sync: xpu

:::{include} xpu.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::

::::

:::::

## Set up using Docker

### Pre-built images

::::{tab-set}
:::::{tab-set}
:sync-group: device

:::{tab-item} CUDA
::::{tab-item} CUDA
:sync: cuda

```{include} cuda.inc.md
:::{include} cuda.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
```

:::

:::{tab-item} ROCm
:sync: rocm

```{include} rocm.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
```

:::

:::{tab-item} XPU
:sync: xpu

```{include} xpu.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
```

:::

::::

::::{tab-item} ROCm
:sync: rocm

:::{include} rocm.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::

::::

::::{tab-item} XPU
:sync: xpu

:::{include} xpu.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::

::::

:::::

### Build image from source

::::{tab-set}
:::::{tab-set}
:sync-group: device

:::{tab-item} CUDA
::::{tab-item} CUDA
:sync: cuda

```{include} cuda.inc.md
:::{include} cuda.inc.md
:start-after: "### Build image from source"
:end-before: "## Supported features"
```

:::

:::{tab-item} ROCm
:sync: rocm

```{include} rocm.inc.md
:start-after: "### Build image from source"
:end-before: "## Supported features"
```

:::

:::{tab-item} XPU
:sync: xpu

```{include} xpu.inc.md
:start-after: "### Build image from source"
:end-before: "## Supported features"
```

:::

::::

::::{tab-item} ROCm
:sync: rocm

:::{include} rocm.inc.md
:start-after: "### Build image from source"
:end-before: "## Supported features"
:::

::::

::::{tab-item} XPU
:sync: xpu

:::{include} xpu.inc.md
:start-after: "### Build image from source"
:end-before: "## Supported features"
:::

::::

:::::

## Supported features

::::{tab-set}
:::::{tab-set}
:sync-group: device

:::{tab-item} CUDA
::::{tab-item} CUDA
:sync: cuda

```{include} cuda.inc.md
:::{include} cuda.inc.md
:start-after: "## Supported features"
```

:::

:::{tab-item} ROCm
:sync: rocm

```{include} rocm.inc.md
:start-after: "## Supported features"
```

:::

:::{tab-item} XPU
:sync: xpu

```{include} xpu.inc.md
:start-after: "## Supported features"
```

:::

::::

::::{tab-item} ROCm
:sync: rocm

:::{include} rocm.inc.md
:start-after: "## Supported features"
:::

::::

::::{tab-item} XPU
:sync: xpu

:::{include} xpu.inc.md
:start-after: "## Supported features"
:::

::::

:::::

@ -16,10 +16,10 @@ Currently, there are no pre-built ROCm wheels.
However, the [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized
docker image designed for validating inference performance on the AMD Instinct™ MI300X accelerator.

```{tip}
:::{tip}
Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
for instructions on how to use this prebuilt docker image.
```
:::

### Build wheel from source

@ -47,9 +47,9 @@ for instructions on how to use this prebuilt docker image.
cd ../..
```

```{note}
- If you see an HTTP issue related to downloading packages while building Triton, please try again, as the HTTP error is intermittent.
```
:::{note}
If you see an HTTP issue related to downloading packages while building Triton, please try again, as the HTTP error is intermittent.
:::

2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention/tree/ck_tile)

@ -67,9 +67,9 @@ for instructions on how to use this prebuilt docker image.
cd ..
```

```{note}
- You might need to downgrade the "ninja" version to 1.10, as it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
```
:::{note}
You might need to downgrade the "ninja" version to 1.10, as it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
:::

3. Build vLLM. For example, vLLM on ROCM 6.2 can be built with the following steps:

@ -95,17 +95,18 @@ for instructions on how to use this prebuilt docker image.
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.

```{tip}
<!--- pyml disable-num-lines 5 ul-indent-->
:::{tip}
- Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
- Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
- To use CK flash-attention or PyTorch naive attention, please use this flag `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off triton flash attention.
- The ROCm version of PyTorch, ideally, should match the ROCm driver version.
```
:::

```{tip}
:::{tip}
- For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
  For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).
```
:::

## Set up using Docker

@ -30,10 +30,10 @@ pip install -v -r requirements-xpu.txt
VLLM_TARGET_DEVICE=xpu python setup.py install
```

```{note}
:::{note}
- FP16 is the default data type in the current XPU backend. The BF16 data
  type will be supported in the future.
```
:::

## Set up using Docker

@ -4,10 +4,10 @@

vLLM supports the following hardware platforms:

```{toctree}
:::{toctree}
:maxdepth: 1

gpu/index
cpu/index
ai_accelerator/index
```
:::

@ -6,9 +6,9 @@ conda create -n myenv python=3.12 -y
conda activate myenv
```

```{note}
:::{note}
[PyTorch has deprecated the conda release channel](https://github.com/pytorch/pytorch/issues/138506). If you use `conda`, please only use it to create Python environment rather than installing packages.
```
:::

Or you can create a new Python environment using [uv](https://docs.astral.sh/uv/), a very fast Python environment manager. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following command:

@ -32,9 +32,9 @@ conda activate myenv
pip install vllm
```

```{note}
:::{note}
For non-CUDA platforms, please refer [here](#installation-index) for specific instructions on how to install vLLM.
```
:::

(quickstart-offline)=

@ -69,9 +69,9 @@ The {class}`~vllm.LLM` class initializes vLLM's engine and the [OPT-125M model](
llm = LLM(model="facebook/opt-125m")
```

```{note}
:::{note}
By default, vLLM downloads models from [HuggingFace](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine.
```
:::
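
As a minimal sketch (assuming the model is also available on ModelScope), the switch is just an environment variable set before the engine is created:

```python
import os

# Assumption: the variable must be set before vLLM is imported/initialized.
os.environ["VLLM_USE_MODELSCOPE"] = "True"

from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # any ModelScope model ID would work here
```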
Now, the fun part! The outputs are generated using `llm.generate`. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of `RequestOutput` objects, which include all of the output tokens.
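
A minimal sketch of that flow (the prompts and sampling settings here are illustrative):

```python
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)  # list of RequestOutput objects

for output in outputs:
    # Each RequestOutput carries the original prompt and its generated completions.
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```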

@ -97,10 +97,10 @@ Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instru
vllm serve Qwen/Qwen2.5-1.5B-Instruct
```

```{note}
:::{note}
By default, the server uses a predefined chat template stored in the tokenizer.
You can learn about overriding it [here](#chat-template).
```
:::

This server can be queried in the same format as OpenAI API. For example, to list the models:

@ -4,9 +4,9 @@

This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.

```{note}
:::{note}
Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
```
:::

## Hangs downloading a model

@ -18,9 +18,9 @@ It's recommended to download the model first using the [huggingface-cli](https:/
If the model is large, it can take a long time to load it from disk. Pay attention to where you store the model. Some clusters have shared filesystems across nodes, e.g. a distributed filesystem or a network filesystem, which can be slow.
It'd be better to store the model in a local disk. Additionally, have a look at the CPU memory usage, when the model is too large it might take a lot of CPU memory, slowing down the operating system because it needs to frequently swap between disk and memory.

```{note}
:::{note}
To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck.
```
:::
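
A rough offline sketch of the same idea (assuming the `load_format` keyword argument mirrors the `--load-format` CLI flag):

```python
from vllm import LLM

# Assumption: "dummy" skips reading real weights, so only the download and
# initialization paths are exercised.
llm = LLM(model="facebook/opt-125m", load_format="dummy")
```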

## Out of memory

@ -132,14 +132,14 @@ If the script runs successfully, you should see the message `sanity check is suc
If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1` to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.

```{note}
:::{note}
A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:

- In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
- In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.

Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
```
:::

(troubleshooting-python-multiprocessing)=

@ -1,13 +1,13 @@
# Welcome to vLLM

```{figure} ./assets/logos/vllm-logo-text-light.png
:::{figure} ./assets/logos/vllm-logo-text-light.png
:align: center
:alt: vLLM
:class: no-scaled-link
:width: 60%
```
:::

```{raw} html
:::{raw} html
<p style="text-align:center">
<strong>Easy, fast, and cheap LLM serving for everyone
</strong>

@ -19,7 +19,7 @@
<a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
</p>
```
:::

vLLM is a fast and easy-to-use library for LLM inference and serving.

@ -58,7 +58,7 @@ For more information, check out the following:
% How to start using vLLM?

```{toctree}
:::{toctree}
:caption: Getting Started
:maxdepth: 1

@ -67,11 +67,11 @@ getting_started/quickstart
getting_started/examples/examples_index
getting_started/troubleshooting
getting_started/faq
```
:::

% What does vLLM support?

```{toctree}
:::{toctree}
:caption: Models
:maxdepth: 1

@ -79,11 +79,11 @@ models/generative_models
models/pooling_models
models/supported_models
models/extensions/index
```
:::

% Additional capabilities

```{toctree}
:::{toctree}
:caption: Features
:maxdepth: 1

@ -96,11 +96,11 @@ features/automatic_prefix_caching
features/disagg_prefill
features/spec_decode
features/compatibility_matrix
```
:::

% Details about running vLLM

```{toctree}
:::{toctree}
:caption: Inference and Serving
:maxdepth: 1

@ -113,11 +113,11 @@ serving/engine_args
serving/env_vars
serving/usage_stats
serving/integrations/index
```
:::

% Scaling up vLLM for production

```{toctree}
:::{toctree}
:caption: Deployment
:maxdepth: 1

@ -126,21 +126,21 @@ deployment/k8s
deployment/nginx
deployment/frameworks/index
deployment/integrations/index
```
:::

% Making the most out of vLLM

```{toctree}
:::{toctree}
:caption: Performance
:maxdepth: 1

performance/optimization
performance/benchmarks
```
:::

% Explanation of vLLM internals

```{toctree}
:::{toctree}
:caption: Design Documents
:maxdepth: 2

@ -151,11 +151,11 @@ design/kernel/paged_attention
design/mm_processing
design/automatic_prefix_caching
design/multiprocessing
```
:::

% How to contribute to the vLLM project

```{toctree}
:::{toctree}
:caption: Developer Guide
:maxdepth: 2

@ -164,11 +164,11 @@ contributing/profiling/profiling_index
contributing/dockerfile/dockerfile
contributing/model/index
contributing/vulnerability_management
```
:::

% Technical API specifications

```{toctree}
:::{toctree}
:caption: API Reference
:maxdepth: 2

@ -177,18 +177,18 @@ api/engine/index
api/inference_params
api/multimodal/index
api/model/index
```
:::

% Latest news and acknowledgements

```{toctree}
:::{toctree}
:caption: Community
:maxdepth: 1

community/blog
community/meetups
community/sponsors
```
:::

## Indices and tables

@ -1,8 +1,8 @@
# Built-in Extensions

```{toctree}
:::{toctree}
:maxdepth: 1

runai_model_streamer
tensorizer
```
:::

@ -48,6 +48,6 @@ You can read further about CPU buffer memory limiting [here](https://github.com/
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"memory_limit":5368709120}'
```

```{note}
:::{note}
For further instructions about tunable parameters and additional parameters configurable through environment variables, read the [Environment Variables Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md).
```
:::

@ -11,6 +11,6 @@ For more information on CoreWeave's Tensorizer, please refer to
[CoreWeave's Tensorizer documentation](https://github.com/coreweave/tensorizer). For more information on serializing a vLLM model, as well a general usage guide to using Tensorizer with vLLM, see
the [vLLM example script](https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference/tensorize_vllm_model.html).

```{note}
:::{note}
Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`.
```
:::

@ -70,10 +70,10 @@ The {class}`~vllm.LLM.chat` method implements chat functionality on top of {clas
In particular, it accepts input similar to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
and automatically applies the model's [chat template](https://huggingface.co/docs/transformers/en/chat_templating) to format the prompt.

```{important}
:::{important}
In general, only instruction-tuned models have a chat template.
Base models may perform poorly as they are not trained to respond to the chat conversation.
```
:::

```python
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

@ -8,54 +8,54 @@ In vLLM, pooling models implement the {class}`~vllm.model_executor.models.VllmMo
These models use a {class}`~vllm.model_executor.layers.Pooler` to extract the final hidden states of the input
before returning them.

```{note}
:::{note}
We currently support pooling models primarily as a matter of convenience.
As shown in the [Compatibility Matrix](#compatibility-matrix), most vLLM features are not applicable to
pooling models as they only work on the generation or decode stage, so performance may not improve as much.
```
:::

For pooling models, we support the following `--task` options.
The selected option sets the default pooler used to extract the final hidden states:

```{list-table}
:::{list-table}
:widths: 50 25 25 25
:header-rows: 1

* - Task
  - Pooling Type
  - Normalization
  - Softmax
* - Embedding (`embed`)
  - `LAST`
  - ✅︎
  - ✗
* - Classification (`classify`)
  - `LAST`
  - ✗
  - ✅︎
* - Sentence Pair Scoring (`score`)
  - \*
  - \*
  - \*
* - Reward Modeling (`reward`)
  - `ALL`
  - ✗
  - ✗
```
- * Task
  * Pooling Type
  * Normalization
  * Softmax
- * Embedding (`embed`)
  * `LAST`
  * ✅︎
  * ✗
- * Classification (`classify`)
  * `LAST`
  * ✗
  * ✅︎
- * Sentence Pair Scoring (`score`)
  * \*
  * \*
  * \*
- * Reward Modeling (`reward`)
  * `ALL`
  * ✗
  * ✗
:::

\*The default pooler is always defined by the model.

```{note}
:::{note}
If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table.
```
:::
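
As a rough sketch of how the default pooler is exercised (the model name and output fields below are assumptions for illustration):

```python
from vllm import LLM

# Assumption: intfloat/e5-mistral-7b-instruct is used here only as an example
# of a supported embedding model.
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")

(output,) = llm.embed("Hello, my name is")
embedding = output.outputs.embedding  # list of floats produced by the pooler
print(f"Embedding dimension: {len(embedding)}")
```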

When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
we attempt to override the default pooler based on its Sentence Transformers configuration file (`modules.json`).

```{tip}
:::{tip}
You can customize the model's pooling method via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers's defaults.
```
:::

## Offline Inference

@ -111,10 +111,10 @@ The {class}`~vllm.LLM.score` method outputs similarity scores between sentence p
It is primarily designed for [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html).
These types of models serve as rerankers between candidate query-document pairs in RAG systems.

```{note}
:::{note}
vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
To handle RAG at a higher level, you should use integration frameworks such as [LangChain](https://github.com/langchain-ai/langchain).
```
:::

```python
llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
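
# Illustrative usage sketch (the exact example in the docs may differ):
# score() compares a query against one or more candidate texts and returns one
# output per pair, each carrying a relevance score from the cross-encoder.
(output,) = llm.score("What is the capital of France?",
                      "The capital of France is Paris.")
print(output.outputs.score)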

File diff suppressed because it is too large

@ -14,9 +14,9 @@ In short, you should increase the number of GPUs and the number of nodes until y

After adding enough GPUs and nodes to hold the model, you can run vLLM first, which will print some logs like `# GPU blocks: 790`. Multiply the number by `16` (the block size), and you can get roughly the maximum number of tokens that can be served on the current configuration. If this number is not satisfying, e.g. you want higher throughput, you can further increase the number of GPUs or nodes, until the number of blocks is enough.
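
As a quick worked example with the numbers above (a rough capacity estimate, not an exact limit):

```python
# "# GPU blocks: 790" with the default block size of 16 tokens per block:
gpu_blocks = 790
block_size = 16
print(gpu_blocks * block_size)  # 12640 tokens can be held in the KV cache at once
```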

```{note}
:::{note}
There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
```
:::

## Running vLLM on a single node

@ -94,12 +94,12 @@ vllm serve /path/to/the/model/in/the/container \

To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...` and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.

```{warning}
:::{warning}
After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script](#troubleshooting-incorrect-hardware-driver) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information.
```
:::

```{warning}
:::{warning}
Please make sure you downloaded the model to all the nodes (with the same path), or the model is downloaded to some distributed file system that is accessible by all nodes.

When you use huggingface repo id to refer to the model, you should append your huggingface token to the `run_cluster.sh` script, e.g. `-e HF_TOKEN=`. The recommended way is to download the model first, and then use the path to refer to the model.
```
:::

@ -4,6 +4,7 @@

Below, you can find an explanation of every engine argument for vLLM:

<!--- pyml disable-num-lines 7 no-space-in-emphasis-->
```{eval-rst}
.. argparse::
    :module: vllm.engine.arg_utils

@ -16,6 +17,7 @@ Below are the additional arguments related to the asynchronous engine:

Below are the additional arguments related to the asynchronous engine:

<!--- pyml disable-num-lines 7 no-space-in-emphasis-->
```{eval-rst}
.. argparse::
    :module: vllm.engine.arg_utils

@ -2,14 +2,14 @@

vLLM uses the following environment variables to configure the system:

```{warning}
:::{warning}
Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and ip for vLLM's **internal usage**. It is not the port and ip for the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.

All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
```
:::

```{literalinclude} ../../../vllm/envs.py
:::{literalinclude} ../../../vllm/envs.py
:end-before: end-env-vars-definition
:language: python
:start-after: begin-env-vars-definition
```
:::

@ -1,8 +1,8 @@
# External Integrations

```{toctree}
:::{toctree}
:maxdepth: 1

langchain
llamaindex
```
:::

@ -31,8 +31,8 @@ vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-I

The following metrics are exposed:

```{literalinclude} ../../../vllm/engine/metrics.py
:::{literalinclude} ../../../vllm/engine/metrics.py
:end-before: end-metrics-definitions
:language: python
:start-after: begin-metrics-definitions
```
:::

@ -4,10 +4,10 @@

This page teaches you how to pass multi-modal inputs to [multi-modal models](#supported-mm-models) in vLLM.

```{note}
:::{note}
We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
```
:::

## Offline Inference
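
A rough sketch of the offline flow (the model, prompt format, and image file below are illustrative assumptions):

```python
from PIL import Image

from vllm import LLM

# Assumption: LLaVA-1.5 stands in for any supported multi-modal model.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

image = Image.open("example.jpg")  # any RGB image
outputs = llm.generate({
    "prompt": "USER: <image>\nWhat is shown in this image?\nASSISTANT:",
    "multi_modal_data": {"image": image},
})
print(outputs[0].outputs[0].text)
```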

@ -203,13 +203,13 @@ for o in outputs:

Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).

```{important}
:::{important}
A chat template is **required** to use Chat Completions API.

Although most models come with a chat template, for others you have to define one yourself.
The chat template can be inferred based on the documentation on the model's HuggingFace repo.
For example, LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`) requires a chat template that can be found here: <gh-file:examples/template_llava.jinja>
```
:::

### Image

@ -273,24 +273,25 @@ print("Chat completion output:", chat_response.choices[0].message.content)

Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>

```{tip}
:::{tip}
Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
and pass the file path as `url` in the API request.
```
:::

```{tip}
:::{tip}
There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
```
:::

````{note}
:::{note}
By default, the timeout for fetching images through HTTP URL is `5` seconds.
You can override this by setting the environment variable:

```console
$ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
```
````

:::

### Video

@ -345,14 +346,15 @@ print("Chat completion output from image url:", result)

Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>

````{note}
:::{note}
By default, the timeout for fetching videos through HTTP URL is `30` seconds.
You can override this by setting the environment variable:

```console
$ export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
```
````

:::

### Audio

@ -448,24 +450,25 @@ print("Chat completion output from audio url:", result)

Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>

````{note}
:::{note}
By default, the timeout for fetching audios through HTTP URL is `10` seconds.
You can override this by setting the environment variable:

```console
$ export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
```
````

:::

### Embedding

vLLM's Embeddings API is a superset of OpenAI's [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings),
where a list of chat `messages` can be passed instead of batched `inputs`. This enables multi-modal inputs to be passed to embedding models.

```{tip}
:::{tip}
The schema of `messages` is exactly the same as in Chat Completions API.
You can refer to the above tutorials for more details on how to pass each type of multi-modal data.
```
:::

Usually, embedding models do not expect chat-based input, so we need to use a custom chat template to format the text and images.
Refer to the examples below for illustration.

@ -477,13 +480,13 @@ vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
--trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
```

```{important}
:::{important}
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
to run this model in embedding mode instead of text generation mode.

The custom chat template is completely different from the original one for this model,
and can be found here: <gh-file:examples/template_vlm2vec.jinja>
```
:::

Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:

@ -518,16 +521,16 @@ vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
--trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
```

```{important}
:::{important}
Like with VLM2Vec, we have to explicitly pass `--task embed`.

Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
```
:::

```{important}
:::{important}
Also important, `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
example below for details.
```
:::

Full example: <gh-file:examples/online_serving/openai_chat_embedding_client_for_multimodal.py>

@ -22,9 +22,9 @@ The available APIs depend on the type of model that is being run:

Please refer to the above pages for more details about each API.

```{seealso}
:::{seealso}
[API Reference](/api/offline_inference/index)
```
:::

## Configuration Options

@ -70,12 +70,12 @@ llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          tensor_parallel_size=2)
```

```{important}
:::{important}
To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. {func}`torch.cuda.set_device`)
before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.

To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
```
:::
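
A minimal sketch of the recommended pattern (the device IDs are illustrative):

```python
import os

# Select GPUs 0 and 1 for this process *before* vLLM touches CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from vllm import LLM

llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          tensor_parallel_size=2)
```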

#### Quantization

@ -161,11 +161,11 @@ print(completion._request_id)

The `vllm serve` command is used to launch the OpenAI-compatible server.

```{argparse}
:::{argparse}
:module: vllm.entrypoints.openai.cli_args
:func: create_parser_for_docs
:prog: vllm serve
```
:::

#### Configuration file

@ -188,10 +188,10 @@ To use the above config file:
vllm serve SOME_MODEL --config config.yaml
```

```{note}
:::{note}
In case an argument is supplied simultaneously using command line and the config file, the value from the command line will take precedence.
The order of priorities is `command line > config file values > defaults`.
```
:::

## API Reference

@ -208,19 +208,19 @@ Code example: <gh-file:examples/online_serving/openai_completion_client.py>

The following [sampling parameters](#sampling-params) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-completion-sampling-params
:end-before: end-completion-sampling-params
```
:::

The following extra parameters are supported:

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-completion-extra-params
:end-before: end-completion-extra-params
```
:::

(chat-api)=

@ -240,19 +240,19 @@ Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>

The following [sampling parameters](#sampling-params) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-chat-completion-sampling-params
:end-before: end-chat-completion-sampling-params
```
:::

The following extra parameters are supported:

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-chat-completion-extra-params
:end-before: end-chat-completion-extra-params
```
:::

(embeddings-api)=

@ -264,9 +264,9 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api))
which will be treated as a single prompt to the model.

```{tip}
:::{tip}
This enables multi-modal inputs to be passed to embedding models, see [this page](#multimodal-inputs) for details.
```
:::

Code example: <gh-file:examples/online_serving/openai_embedding_client.py>

@ -274,27 +274,27 @@ Code example: <gh-file:examples/online_serving/openai_embedding_client.py>

The following [pooling parameters](#pooling-params) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-embedding-pooling-params
:end-before: end-embedding-pooling-params
```
:::

The following extra parameters are supported by default:

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-embedding-extra-params
:end-before: end-embedding-extra-params
```
:::

For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead:

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-chat-embedding-extra-params
:end-before: end-chat-embedding-extra-params
```
:::

(tokenizer-api)=

@ -465,19 +465,19 @@ Response:

The following [pooling parameters](#pooling-params) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-score-pooling-params
:end-before: end-score-pooling-params
```
:::

The following extra parameters are supported:

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-score-extra-params
:end-before: end-score-extra-params
```
:::

(rerank-api)=

@ -552,16 +552,16 @@ Response:

The following [pooling parameters](#pooling-params) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-rerank-pooling-params
:end-before: end-rerank-pooling-params
```
:::

The following extra parameters are supported:

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-rerank-extra-params
:end-before: end-rerank-extra-params
```
:::

@ -111,6 +111,7 @@ markers = [
]

[tool.pymarkdown]
plugins.md004.style = "sublist" # ul-style
plugins.md013.enabled = false # line-length
plugins.md041.enabled = false # first-line-h1
plugins.md033.enabled = false # inline-html