louis_lifu/vllm - vllm - Trustie: Git with trustie

Zhuohan Li 0b7db411b5 [Bug] Fix the OOM condition for CPU cache (#260)	2023-06-26 11:16:13 -07:00
benchmarks	Remove benchmark_async_llm_server.py (#155)	2023-06-19 11:12:37 +08:00
csrc	Change the name to vLLM (#150)	2023-06-17 03:07:40 -07:00
docs	[Docs] Add GPTBigCode to supported models (#213)	2023-06-22 15:05:11 -07:00
examples	[Bugfix] Fix a bug in RequestOutput.finished (#202)	2023-06-22 00:17:24 -07:00
tests/kernels	Change the name to vLLM (#150)	2023-06-17 03:07:40 -07:00
vllm	[Bug] Fix the OOM condition for CPU cache (#260)	2023-06-26 11:16:13 -07:00
.gitignore	Add logo and polish readme (#156)	2023-06-19 16:31:13 +08:00
.readthedocs.yaml	Add .readthedocs.yaml (#136)	2023-06-02 22:27:44 -07:00
CONTRIBUTING.md	Change the name to vLLM (#150)	2023-06-17 03:07:40 -07:00
LICENSE	Add Apache-2.0 license (#102)	2023-05-14 18:05:19 -07:00
MANIFEST.in	[PyPI] Packaging for PyPI distribution (#140)	2023-06-05 20:03:14 -07:00
README.md	Update README.md (#236)	2023-06-25 16:58:06 -07:00
mypy.ini	Change the name to vLLM (#150)	2023-06-17 03:07:40 -07:00
pyproject.toml	[PyPI] Packaging for PyPI distribution (#140)	2023-06-05 20:03:14 -07:00
requirements-dev.txt	Add contributing guideline and mypy config (#122)	2023-05-23 17:58:51 -07:00
requirements.txt	OpenAI Compatible Frontend (#116)	2023-05-23 21:39:50 -07:00
setup.py	[PyPI] Fix package info in setup.py (#158)	2023-06-19 18:05:01 -07:00

README.md

Easy, fast, and cheap LLM serving for everyone

| Documentation | Blog | Discussions |

Latest News 🔥

[2023/06] We officially released vLLM! FastChat-vLLM integration has powered LMSYS Vicuna and Chatbot Arena since mid-April. Check out our blog post.

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Dynamic batching of incoming requests
- Optimized CUDA kernels
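
To give a feel for the paged memory management mentioned above, here is a toy sketch in plain Python. It is not vLLM's implementation: the names `BlockAllocator`, `Sequence`, `append_token`, and the value of `BLOCK_SIZE` are all hypothetical and chosen only for illustration. The idea it shows is that key/value memory is handed out in fixed-size blocks on demand, and each sequence keeps a table of the blocks it owns, instead of reserving one large contiguous region up front.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per block; illustrative value, not vLLM's setting


@dataclass
class BlockAllocator:
    """Hands out fixed-size cache blocks from a free list (hypothetical helper)."""
    num_blocks: int
    free: list = field(init=False)

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, blocks) -> None:
        self.free.extend(blocks)


@dataclass
class Sequence:
    """Tracks how many tokens a request holds and which blocks store them."""
    num_tokens: int = 0
    block_table: list = field(default_factory=list)


def append_token(seq: Sequence, allocator: BlockAllocator) -> None:
    # A new block is needed only when every block the sequence owns is full.
    if seq.num_tokens % BLOCK_SIZE == 0:
        seq.block_table.append(allocator.allocate())
    seq.num_tokens += 1


allocator = BlockAllocator(num_blocks=8)
seq = Sequence()
for _ in range(6):
    append_token(seq, allocator)

# Six tokens with four tokens per block occupy two blocks.
blocks_in_use = len(seq.block_table)
```

Because blocks are allocated lazily, at most one partially filled block per sequence is wasted in this toy model, and finished sequences can return their blocks to the free list with `allocator.release(seq.block_table)`.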

vLLM is flexible and easy to use with:

- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
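
As a rough illustration of what talking to the OpenAI-compatible server can look like, the sketch below builds a completions request using only the Python standard library. Several details are assumptions rather than facts from this README: the server address `http://localhost:8000`, the `/v1/completions` path (taken from the OpenAI API convention), the model name `facebook/opt-125m`, and the helper name `build_completion_request`. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_completion_request(prompt, model, base_url="http://localhost:8000",
                             n=1, max_tokens=16, stream=False):
    """Build (but do not send) an OpenAI-style completions request.

    base_url and the /v1/completions path are assumptions about where the
    server is listening; adjust them to match your deployment.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "n": n,                    # number of completions per prompt
        "max_tokens": max_tokens,  # cap on generated tokens per completion
        "stream": stream,          # request streamed output if True
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Ask for three completions of one prompt (parallel sampling).
request = build_completion_request("San Francisco is a",
                                   model="facebook/opt-125m", n=3)
```

With a server running (at the time of writing, it could be started with `python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m`), the request can be sent with `urllib.request.urlopen(request)` and the JSON response parsed with `json.load`.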

vLLM seamlessly supports many HuggingFace models, including the following architectures:

- GPT-2 (gpt2, gpt2-xl, etc.)
- GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
- GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
- LLaMA (lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
- OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)

Install vLLM with pip or from source:

pip install vllm

Getting Started

Visit our documentation to get started.
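
For orientation, a minimal offline-inference sketch is shown below. It assumes vLLM is installed on a machine with a supported GPU and uses the `LLM` and `SamplingParams` classes from the `vllm` package; the wrapper name `run_offline_inference`, the model `facebook/opt-125m`, and the sampling values are illustrative choices, not recommendations. The API may differ between versions, so treat the documentation as authoritative.

```python
def run_offline_inference(prompts, model="facebook/opt-125m", num_samples=1):
    """Generate completions for a batch of prompts (hypothetical wrapper).

    Returns a dict mapping each prompt to its list of generated texts.
    """
    # Imported lazily so this file can be loaded on machines without vLLM.
    from vllm import LLM, SamplingParams

    # Sampling values are illustrative; n > 1 requests parallel samples.
    sampling_params = SamplingParams(
        n=num_samples, temperature=0.8, top_p=0.95, max_tokens=32
    )
    llm = LLM(model=model)  # downloads and loads the model weights
    outputs = llm.generate(prompts, sampling_params)
    return {
        output.prompt: [completion.text for completion in output.outputs]
        for output in outputs
    }


# Example call (requires vLLM and a GPU):
# results = run_offline_inference(["Hello, my name is"], num_samples=3)
```

All prompts are passed to `generate` in a single call so the engine can batch them, rather than looping over prompts one at a time.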

Performance

vLLM achieves up to 24x higher throughput than HuggingFace Transformers (HF) and up to 3.5x higher throughput than Text Generation Inference (TGI). For details, check out our blog post.

Serving throughput when each request asks for 1 output completion.

Serving throughput when each request asks for 3 output completions.

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.