Commit Graph

890 Commits

Author SHA1 Message Date
Dinghow Yang cf6ff18246
Fix Baichuan chat template (#3340) 2024-03-15 21:02:12 -07:00
Ronen Schaffer 14e3f9a1b2
Replace `lstrip()` with `removeprefix()` to fix Ruff linter warning (#2958) 2024-03-15 21:01:30 -07:00
Tao He 3123f15138
Fixes the incorrect argument in the prefix-prefill test cases (#3246) 2024-03-15 20:58:10 -07:00
youkaichao 413366e9a2
[Misc] PR templates (#3413) 2024-03-15 18:25:51 -07:00
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Robert Shaw 10585e035e
Removed Extraneous Print Message From OAI Server (#3440) 2024-03-16 00:35:36 +00:00
Antoni Baum fb96c1e98c
Asynchronous tokenization (#2879) 2024-03-15 23:37:01 +00:00
laneeee 8fa7357f2d
Fix documentation error in the value and v_vec illustration (#3421) 2024-03-15 16:06:09 -07:00
Harry Mellor a7af4538ca
Fix issue templates (#3436) 2024-03-15 21:26:00 +00:00
youkaichao 604f235937
[Misc] add error message in non linux platform (#3438) 2024-03-15 21:21:37 +00:00
Tao He 14b8ae02e7
Fixes the misuse of time.time()/time.monotonic() (#3220) 2024-03-15 18:25:43 +00:00
Signed-off-by: Tao He <sighingnow@gmail.com>
Co-authored-by: simon-mo <simon.mo@hey.com>
Dan Clark 03d37f2441
[Fix] Add args for mTLS support (#3430) 2024-03-15 09:56:13 -07:00
Co-authored-by: declark1 <daniel.clark@ibm.com>
Yang Fan a7c871680e
Fix tie_word_embeddings for Qwen2. (#3344) 2024-03-15 09:36:53 -07:00
Junda Chen 429284dc37
Fix `dist.broadcast` stall without group argument (#3408) 2024-03-14 23:25:05 -07:00
Dinghow Yang 253a98078a
Add chat templates for ChatGLM (#3418) 2024-03-14 23:19:22 -07:00
Dinghow Yang 21539e6856
Add chat templates for Falcon (#3420) 2024-03-14 23:19:02 -07:00
youkaichao b522c4476f
[Misc] add HOST_IP env var (#3419) 2024-03-14 21:32:52 -07:00
Co-authored-by: Simon Mo <simon.mo@hey.com>
akhoroshev 78b6c4845a
Dynamically configure shared memory size for moe_align_block_size_kernel (#3376) 2024-03-14 18:18:07 -07:00
Enrique Shockwave b983ba35bd
Fix Marlin config repr (#3414) 2024-03-14 16:26:19 -07:00
陈序 54be8a0be2
Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373) 2024-03-14 13:56:57 -07:00
Co-authored-by: Cade Daniel <edacih@gmail.com>
youkaichao dfc77408bd
[issue templates] add some issue templates (#3412) 2024-03-14 13:16:00 -07:00
Dan Clark c17ca8ef18
Add args for mTLS support (#3410) 2024-03-14 13:11:45 -07:00
Co-authored-by: Daniel Clark <daniel.clark@ibm.com>
Thomas Parnell 06ec486794
Install `flash_attn` in Docker image (#3396) 2024-03-14 10:55:54 -07:00
youkaichao 8fe8386591
[Kernel] Change benchmark script so that results can be used directly; tune MoE kernel on A100/H100 with tp=2,4,8 (#3389) 2024-03-14 08:11:48 +00:00
Allen.Dou a37415c31b
Allow user to choose which vLLM metrics to display in Grafana (#3393) 2024-03-14 06:35:13 +00:00
Simon Mo 81653d9688
[Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion (#3383) 2024-03-13 17:02:21 -07:00
Zhuohan Li eeab52a4ff
[FIX] Simpler fix for async engine running on ray (#3371) 2024-03-13 14:18:40 -07:00
Antoni Baum c33afd89f5
Fix lint (#3388) 2024-03-13 13:56:49 -07:00
Terry 7e9bd08f60
Add batched RoPE kernel (#3095) 2024-03-13 13:45:26 -07:00
Or Sharir ae0ccb4017
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. (#3350) 2024-03-13 12:18:25 -07:00
陈序 739c350c19
[Minor Fix] Use cupy-cuda11x in CUDA 11.8 build (#3256) 2024-03-13 09:43:24 -07:00
Hui Liu ba8dc958a3
[Minor] Fix `bias` check in `if` condition to remove ambiguity (#3259) 2024-03-13 09:16:55 -08:00
Ronan McGovern e221910e77
add hf_transfer to requirements.txt (#3031) 2024-03-12 23:33:43 -07:00
Bo-Wen Wang b167109ba1
[Fix] Fix quantization="gptq" when using Marlin (#3319) 2024-03-12 22:51:42 -07:00
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Woosuk Kwon 602358f8a8
Add kernel for GeGLU with approximate GELU (#3337) 2024-03-12 22:06:17 -07:00
Breno Faria 49a3c8662b
Fixes #1556 double free (#3347) 2024-03-13 00:30:08 +00:00
Sherlock Xu b0925b3878
docs: Add BentoML deployment doc (#3336) 2024-03-12 10:34:30 -07:00
Signed-off-by: Sherlock113 <sherlockxu07@gmail.com>
DAIZHENWEI 654865e21d
Support Mistral Model Inference with transformers-neuronx (#3153) 2024-03-11 13:19:51 -07:00
kliuae c9415c19d3
[ROCm] Fix warp and lane calculation in blockReduceSum (#3321) 2024-03-11 13:14:07 -07:00
Zhuohan Li 4c922709b6
Add distributed model executor abstraction (#3191) 2024-03-11 11:03:45 -07:00
Philipp Moritz 657061fdce
[docs] Add LoRA support information for models (#3299) 2024-03-11 00:54:51 -07:00
Zhuohan Li 2f8844ba08
Re-enable the 80 char line width limit (#3305) 2024-03-10 19:49:14 -07:00
Nick Hill 4b59f00e91
[Fix] Fix best_of behavior when n=1 (#3298) 2024-03-10 19:17:46 -07:00
Roy 9e8744a545
[BugFix] Fix get tokenizer when using ray (#3301) 2024-03-10 19:17:16 -07:00
Douglas Lehr e4a28e5316
[ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA (#3262) 2024-03-10 15:27:45 -07:00
Terry 0bba88df03
Enhance lora tests with more layer and rank variations (#3243) 2024-03-09 17:14:16 -08:00
Cade Daniel 8437bae6ef
[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103) 2024-03-08 23:32:46 -08:00
Zhuohan Li f48c6791b7
[FIX] Fix prefix test error on main (#3286) 2024-03-08 17:16:14 -08:00
Michael Goin c2c5e0909a
Move model filelocks from `/tmp/` to `~/.cache/vllm/locks/` dir (#3241) 2024-03-08 13:33:10 -08:00
Woosuk Kwon 1cb0cc2975
[FIX] Make `flash_attn` optional (#3269) 2024-03-08 10:52:20 -08:00
Roger Wang 99c3cfb83c
[Docs] Fix Unmocked Imports (#3275) 2024-03-08 09:58:01 -08:00