Dinghow Yang
|
cf6ff18246
|
Fix Baichuan chat template (#3340)
|
2024-03-15 21:02:12 -07:00 |
Ronen Schaffer
|
14e3f9a1b2
|
Replace `lstrip()` with `removeprefix()` to fix Ruff linter warning (#2958)
|
2024-03-15 21:01:30 -07:00 |
Tao He
|
3123f15138
|
Fixes the incorrect argument in the prefix-prefill test cases (#3246)
|
2024-03-15 20:58:10 -07:00 |
youkaichao
|
413366e9a2
|
[Misc] PR templates (#3413)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
|
2024-03-15 18:25:51 -07:00 |
Robert Shaw
|
10585e035e
|
Removed Extraneous Print Message From OAI Server (#3440)
|
2024-03-16 00:35:36 +00:00 |
Antoni Baum
|
fb96c1e98c
|
Asynchronous tokenization (#2879)
|
2024-03-15 23:37:01 +00:00 |
laneeee
|
8fa7357f2d
|
fix document error for value and v_vec illustration (#3421)
|
2024-03-15 16:06:09 -07:00 |
Harry Mellor
|
a7af4538ca
|
Fix issue templates (#3436)
|
2024-03-15 21:26:00 +00:00 |
youkaichao
|
604f235937
|
[Misc] add error message in non linux platform (#3438)
|
2024-03-15 21:21:37 +00:00 |
Tao He
|
14b8ae02e7
|
Fixes the misuse/mixuse of time.time()/time.monotonic() (#3220)
Signed-off-by: Tao He <sighingnow@gmail.com>
Co-authored-by: simon-mo <simon.mo@hey.com>
|
2024-03-15 18:25:43 +00:00 |
Dan Clark
|
03d37f2441
|
[Fix] Add args for mTLS support (#3430)
Co-authored-by: declark1 <daniel.clark@ibm.com>
|
2024-03-15 09:56:13 -07:00 |
Yang Fan
|
a7c871680e
|
Fix tie_word_embeddings for Qwen2. (#3344)
|
2024-03-15 09:36:53 -07:00 |
Junda Chen
|
429284dc37
|
Fix `dist.broadcast` stall without group argument (#3408)
|
2024-03-14 23:25:05 -07:00 |
Dinghow Yang
|
253a98078a
|
Add chat templates for ChatGLM (#3418)
|
2024-03-14 23:19:22 -07:00 |
Dinghow Yang
|
21539e6856
|
Add chat templates for Falcon (#3420)
|
2024-03-14 23:19:02 -07:00 |
youkaichao
|
b522c4476f
|
[Misc] add HOST_IP env var (#3419)
Co-authored-by: Simon Mo <simon.mo@hey.com>
|
2024-03-14 21:32:52 -07:00 |
akhoroshev
|
78b6c4845a
|
Dynamically configure shared memory size for moe_align_block_size_kernel (#3376)
|
2024-03-14 18:18:07 -07:00 |
Enrique Shockwave
|
b983ba35bd
|
fix marlin config repr (#3414)
|
2024-03-14 16:26:19 -07:00 |
陈序
|
54be8a0be2
|
Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373)
Co-authored-by: Cade Daniel <edacih@gmail.com>
|
2024-03-14 13:56:57 -07:00 |
youkaichao
|
dfc77408bd
|
[issue templates] add some issue templates (#3412)
|
2024-03-14 13:16:00 -07:00 |
Dan Clark
|
c17ca8ef18
|
Add args for mTLS support (#3410)
Co-authored-by: Daniel Clark <daniel.clark@ibm.com>
|
2024-03-14 13:11:45 -07:00 |
Thomas Parnell
|
06ec486794
|
Install `flash_attn` in Docker image (#3396)
|
2024-03-14 10:55:54 -07:00 |
youkaichao
|
8fe8386591
|
[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389)
|
2024-03-14 08:11:48 +00:00 |
Allen.Dou
|
a37415c31b
|
allow user to choose which vllm's metrics to display in grafana (#3393)
|
2024-03-14 06:35:13 +00:00 |
Simon Mo
|
81653d9688
|
[Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion (#3383)
|
2024-03-13 17:02:21 -07:00 |
Zhuohan Li
|
eeab52a4ff
|
[FIX] Simpler fix for async engine running on ray (#3371)
|
2024-03-13 14:18:40 -07:00 |
Antoni Baum
|
c33afd89f5
|
Fix lint (#3388)
|
2024-03-13 13:56:49 -07:00 |
Terry
|
7e9bd08f60
|
Add batched RoPE kernel (#3095)
|
2024-03-13 13:45:26 -07:00 |
Or Sharir
|
ae0ccb4017
|
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. (#3350)
|
2024-03-13 12:18:25 -07:00 |
陈序
|
739c350c19
|
[Minor Fix] Use cupy-cuda11x in CUDA 11.8 build (#3256)
|
2024-03-13 09:43:24 -07:00 |
Hui Liu
|
ba8dc958a3
|
[Minor] Fix bias in if to remove ambiguity (#3259)
|
2024-03-13 09:16:55 -07:00 |
Ronan McGovern
|
e221910e77
|
add hf_transfer to requirements.txt (#3031)
|
2024-03-12 23:33:43 -07:00 |
Bo-Wen Wang
|
b167109ba1
|
[Fix] Fix quantization="gptq" when using Marlin (#3319)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2024-03-12 22:51:42 -07:00 |
Woosuk Kwon
|
602358f8a8
|
Add kernel for GeGLU with approximate GELU (#3337)
|
2024-03-12 22:06:17 -07:00 |
Breno Faria
|
49a3c8662b
|
Fixes #1556 double free (#3347)
|
2024-03-13 00:30:08 +00:00 |
Sherlock Xu
|
b0925b3878
|
docs: Add BentoML deployment doc (#3336)
Signed-off-by: Sherlock113 <sherlockxu07@gmail.com>
|
2024-03-12 10:34:30 -07:00 |
DAIZHENWEI
|
654865e21d
|
Support Mistral Model Inference with transformers-neuronx (#3153)
|
2024-03-11 13:19:51 -07:00 |
kliuae
|
c9415c19d3
|
[ROCm] Fix warp and lane calculation in blockReduceSum (#3321)
|
2024-03-11 13:14:07 -07:00 |
Zhuohan Li
|
4c922709b6
|
Add distributed model executor abstraction (#3191)
|
2024-03-11 11:03:45 -07:00 |
Philipp Moritz
|
657061fdce
|
[docs] Add LoRA support information for models (#3299)
|
2024-03-11 00:54:51 -07:00 |
Zhuohan Li
|
2f8844ba08
|
Re-enable the 80 char line width limit (#3305)
|
2024-03-10 19:49:14 -07:00 |
Nick Hill
|
4b59f00e91
|
[Fix] Fix best_of behavior when n=1 (#3298)
|
2024-03-10 19:17:46 -07:00 |
Roy
|
9e8744a545
|
[BugFix] Fix get tokenizer when using ray (#3301)
|
2024-03-10 19:17:16 -07:00 |
Douglas Lehr
|
e4a28e5316
|
[ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA (#3262)
|
2024-03-10 15:27:45 -07:00 |
Terry
|
0bba88df03
|
Enhance lora tests with more layer and rank variations (#3243)
|
2024-03-09 17:14:16 -08:00 |
Cade Daniel
|
8437bae6ef
|
[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103)
|
2024-03-08 23:32:46 -08:00 |
Zhuohan Li
|
f48c6791b7
|
[FIX] Fix prefix test error on main (#3286)
|
2024-03-08 17:16:14 -08:00 |
Michael Goin
|
c2c5e0909a
|
Move model filelocks from `/tmp/` to `~/.cache/vllm/locks/` dir (#3241)
|
2024-03-08 13:33:10 -08:00 |
Woosuk Kwon
|
1cb0cc2975
|
[FIX] Make `flash_attn` optional (#3269)
|
2024-03-08 10:52:20 -08:00 |
Roger Wang
|
99c3cfb83c
|
[Docs] Fix Unmocked Imports (#3275)
|
2024-03-08 09:58:01 -08:00 |