Commit Graph

1375 Commits

Author SHA1 Message Date
Wenwei Zhang 546a97ef69
[Misc]: allow user to specify port in distributed setting (#4914) 2024-05-20 17:45:06 +00:00
Alexander Matveev da5a0b539d
Remove marlin warning (#4918) 2024-05-20 14:55:34 +00:00
Cyrus Leung 6287537a0c
[Model] LLaVA model refactor (#4910) 2024-05-20 08:11:25 +00:00
Woosuk Kwon b57e6c5949
[Kernel] Add flash-attn back (#4907) 2024-05-19 18:11:30 -07:00
Alexander Matveev 27ce85476e
[Kernel] Add marlin_24 unit tests (#4901) 2024-05-19 11:37:34 -04:00
Cyrus Leung f68470e803
[Bugfix][Model] Add base class for vision-language models (#4809) 2024-05-19 00:13:33 -07:00
SangBin Cho 2e9a2227ec
[Lora] Support long context lora (#4787)
Currently we need to call the rotary embedding kernel once per LoRA, which makes it hard to serve multiple long-context LoRAs. Add a batched rotary embedding kernel and pipe it through.

It replaces the rotary embedding layer with one that is aware of multiple cos/sin caches, one per scaling factor (a rough sketch follows this entry).

Follow-up of https://github.com/vllm-project/vllm/pull/3095/files
2024-05-18 16:05:23 +09:00
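As a rough illustration of the approach in #4787 above, here is a minimal, hypothetical Python sketch (names and signatures are illustrative assumptions, not vLLM's actual API) of a rotary embedding that concatenates one cos/sin cache per scaling factor and offsets each token's position into its own slice, so a single lookup can serve a batch mixing several long-context LoRAs:

```python
import torch


class BatchedRotaryEmbedding(torch.nn.Module):
    """Hypothetical sketch: keep one cos/sin cache per RoPE scaling factor,
    concatenated into a single buffer, so one lookup serves a batch that
    mixes LoRAs with different long-context scaling factors."""

    def __init__(self, head_dim, max_len, base, scaling_factors):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        caches, self.offsets = [], {}
        for factor in scaling_factors:
            # Linear RoPE scaling: stretch positions by the scaling factor.
            t = torch.arange(int(max_len * factor)).float() / factor
            freqs = torch.outer(t, inv_freq)
            self.offsets[factor] = sum(c.shape[0] for c in caches)
            caches.append(torch.cat([freqs.cos(), freqs.sin()], dim=-1))
        self.register_buffer("cos_sin_cache", torch.cat(caches, dim=0))

    def forward(self, positions, query, key, factor_per_token):
        # Per-token offsets turn "which scaling factor" into "which cache slice",
        # so one gather over the shared buffer covers the whole mixed batch.
        offsets = positions.new_tensor([self.offsets[f] for f in factor_per_token])
        cos_sin = self.cos_sin_cache[positions + offsets]
        # ...apply the rotation to query/key using cos_sin (omitted)...
        return query, key
```

The key point of the design is that per-request scaling factors become per-token offsets into one shared cache, which is what keeps the kernel call batched.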
alexeykondrat c0724fc915
[ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if Triton is not used (#4658) 2024-05-18 05:09:11 +00:00
Michael Goin 86b45ae065
[Bugfix] Relax tiktoken to >= 0.6.0 (#4890) 2024-05-17 12:58:52 -06:00
Antoni Baum c5711ef985
[Doc] Update Ray Data distributed offline inference example (#4871) 2024-05-17 10:52:11 -07:00
eigenLiu 48d5985a08
Sync huggingface modifications of qwen Moe model (#4774) 2024-05-17 09:43:19 -07:00
Jinzhen Lin 33e0823de5
[Bugfix] fix rope error when load models with different dtypes (#4835) 2024-05-17 18:43:34 +09:00
Alexei-V-Ivanov-AMD 26148120b3
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797) 2024-05-16 20:58:25 -07:00
bofeng huang 0150a10630
[Frontend] OpenAI API server: Do not add bos token by default when encoding (#4688) 2024-05-16 18:47:22 -07:00
Kante Yin 8e7fb5d43a
Support to serve vLLM on Kubernetes with LWS (#4829)
Signed-off-by: kerthcet <kerthcet@gmail.com>
2024-05-16 16:37:29 -07:00
Woosuk Kwon 9a31a817a8
[Bugfix] Fix FP8 KV cache support (#4869) 2024-05-16 22:42:29 +00:00
Tyler Michael Smith 2060e93659
[Kernel] Add w8a8 CUTLASS kernels (#4749) 2024-05-16 18:32:50 -04:00
Silencio 8435b207af
[Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850)
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>
2024-05-16 11:16:09 -07:00
youkaichao 10fa9eea21
[Misc] remove old comments (#4866) 2024-05-16 11:07:41 -07:00
youkaichao e08188081b
[Core][Distributed] remove graph mode function (#4818) 2024-05-16 10:59:52 -07:00
Hongxia Yang b5853f9963
[ROCm][AMD][Bugfix] adding a missing triton autotune config (#4845) 2024-05-16 10:46:52 -07:00
Simon Mo f09edd8a25
Add JSON output support for benchmark_latency and benchmark_throughput (#4848) 2024-05-16 10:02:56 -07:00
Alexander Matveev 6979ade384
Add GPTQ Marlin 2:4 sparse structured support (#4790)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-05-16 12:56:15 -04:00
Pierre Dulac 9216b9cc38
[Bugfix] Bypass authorization API token for preflight requests (#4862) 2024-05-16 09:42:21 -07:00
Alex Wu 5e0391c040
[Frontend] Separate OpenAI Batch Runner usage from API Server (#4851) 2024-05-17 00:42:41 +09:00
Alex Wu dbc0754ddf
[docs] Fix typo in examples filename openi -> openai (#4864) 2024-05-17 00:42:17 +09:00
Jinzhen Lin 99caa49106
[Kernel] add bfloat16 support for gptq marlin kernel (#4788) 2024-05-16 09:55:29 -04:00
alexm-nm 5c342570d7
Add marlin unit tests and marlin benchmark script (#4815) 2024-05-16 09:36:49 -04:00
Cody Yu 973617ae02
[Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840)
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
2024-05-16 00:53:51 -07:00
Aurick Qiao 30e754390c
[Core] Implement sharded state loader (#4690)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-05-15 22:11:54 -07:00
Alex Wu 52f8107cf2
[Frontend] Support OpenAI batch file format (#4794)
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-15 19:13:36 -04:00
Cyrus Leung fc0d9dfc3a
[Frontend] Re-enable custom roles in Chat Completions API (#4758) 2024-05-15 14:58:46 -07:00
Zhuohan Li 361c461a12
[Doc] Highlight the fourth meetup in the README (#4842) 2024-05-15 11:38:49 -07:00
zifeitong a5675d348b
[Bugfix] Properly set distributed_executor_backend in ParallelConfig (#4816) 2024-05-15 07:22:09 -07:00
Cyrus Leung e9cdd2b1e2
[CI/Build] Further decouple HuggingFace implementation from ours during tests (#4166) 2024-05-14 23:38:40 -07:00
SangBin Cho 65bf2ac165
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681)
This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the prefill/decode attn metadata into a single class that can be sliced when running the attn backend (a rough sketch follows this entry).

In addition, it refactors subquery_start_loc, which was not refactored in the previous PR.
2024-05-15 14:00:10 +09:00
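For the metadata coalescing described in #4681 above, here is a minimal, hypothetical sketch (field and method names are illustrative assumptions, not the actual vLLM classes) of a single attn-metadata object laid out with prefill tokens first, so the attention backend can slice out whichever part it needs:

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class CoalescedAttnMetadata:
    """Hypothetical sketch: prefill and decode tokens share one metadata object,
    with prefill tokens laid out first, so the model runner prepares a single
    batch and the attention backend slices out either part."""

    num_prefill_tokens: int
    num_decode_tokens: int
    slot_mapping: torch.Tensor                   # one slot per token, prefills first
    seq_lens: Optional[torch.Tensor] = None      # per-sequence lengths (prefill)
    block_tables: Optional[torch.Tensor] = None  # KV-cache block tables (decode)

    def prefill_slice(self) -> torch.Tensor:
        # Tokens [0, num_prefill_tokens) belong to prefill requests.
        return self.slot_mapping[: self.num_prefill_tokens]

    def decode_slice(self) -> torch.Tensor:
        # The remaining tokens belong to decode requests.
        return self.slot_mapping[self.num_prefill_tokens:]
```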
SangBin Cho 8a7cc254a0
Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820)
The LoRA 3 & 4 tests seem to hit an illegal memory access failure after this commit:

[2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered
Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241

This reverts commit 1356df5.

2024-05-15 11:52:45 +09:00
Simon Mo 29bc01bf3b
Add 4th meetup announcement to readme (#4817) 2024-05-14 18:33:06 -04:00
Nick Hill 676a99982f
[Core] Add MultiprocessingGPUExecutor (#4539)
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
2024-05-14 10:38:59 -07:00
Cyrus Leung dc72402b57
[Bugfix][Doc] Fix CI failure in docs (#4804)
This PR fixes the CI failure introduced by #4798.

The failure originates from having duplicate target names in reST, and is fixed by changing the ref targets to anonymous ones. For more information, see the discussion linked in the PR.

I have also changed the format of the links to be more distinct from each other.
2024-05-15 01:57:08 +09:00
Kuntai Du ccb63a8245
[Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies (#4696) 2024-05-14 21:34:33 +09:00
Zhuohan Li c579b750a0
[Doc] Add meetups to the doc (#4798) 2024-05-13 18:48:00 -07:00
Cyrus Leung 4bfa7e7f75
[Doc] Add API reference for offline inference (#4710) 2024-05-13 17:47:42 -07:00
Zhuohan Li ac1fbf7fd2
[Doc] Shorten README by removing supported model list (#4796) 2024-05-13 16:23:54 -07:00
Philipp Moritz 33d3914b1e
[Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793) 2024-05-13 19:00:27 -04:00
Stephen Krider 1356df53bd
[Kernel] Use flash-attn for decoding (#3648)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2024-05-13 15:50:33 -07:00
Cody Yu ce532ff45c
[Speculative decoding] Improve n-gram efficiency (#4724) 2024-05-13 15:00:13 -07:00
Sanger Steel 8bc68e198c
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208) 2024-05-13 14:57:07 -07:00
Woosuk Kwon 0fca3cdcf2
[Misc] Enhance attention selector (#4751) 2024-05-13 10:47:25 -07:00
SangBin Cho e7c46b9527
[Scheduler] Warning upon preemption and Swapping (#4647)
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-13 23:50:44 +09:00