Yuan | cafb8e06c5 | [CI/BUILD] enable intel queue for longer CPU tests (#4113) | 2024-06-03 10:39:50 -07:00 |
Tyler Michael Smith | cbb2f59cc8 | [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) | 2024-06-03 09:52:30 -07:00 |
Antoni Baum | 0ab278ca31 | [Core] Remove unnecessary copies in flash attn backend (#5138) | 2024-06-03 09:39:31 -07:00 |
Cyrus Leung | 7a64d24aad | [Core] Support image processor (#4197) | 2024-06-02 22:56:41 -07:00 |
Cyrus Leung | dfbe60dc62 | [Misc] Simplify code and fix type annotations in `conftest.py` (#5118) | 2024-06-02 16:05:50 -07:00 |
Divakar Verma | a66cf40b20 | [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927); This PR enables the fused topk_softmax kernel used in moe layer for HIP | 2024-06-02 14:13:26 -07:00 |
Avinash Raj | f790ad3c50 | [Frontend][OpenAI] Support for returning max_model_len on /v1/models response (#4643) | 2024-06-02 08:06:13 +00:00 |
Simon Mo | ed59a7ed23 | Update test_ignore_eos (#4898) | 2024-06-02 02:21:53 +00:00 |
Robert Shaw | 044793d8df | [BugFix] Prevent `LLM.encode` for non-generation Models (#5184); Co-authored-by: mgoin <michael@neuralmagic.com> | 2024-06-01 23:35:41 +00:00 |
Daniil Arapov | c2d6d2f960 | [Bugfix]: Fix issues related to prefix caching example (#5177) (#5180) | 2024-06-01 15:53:52 -07:00 |
Zhuohan Li | 8279078e21 | [Bugfix] Remove deprecated @abstractproperty (#5174) | 2024-06-01 22:40:25 +00:00 |
chenqianfzh | b9c0605a8e | [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) | 2024-06-01 14:51:10 -06:00 |
Nadav Shmayovits | 37464a0f74 | [Bugfix] Fix call to init_logger in openai server (#4765) | 2024-06-01 17:18:50 +00:00 |
Ye Cao | c354072828 | [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py (#5151); Signed-off-by: Ye Cao <caoye.cao@alibaba-inc.com> | 2024-06-01 17:11:22 +00:00 |
Varun Sundar Rabindranath | f081c3ce4b | [Kernel] Update Cutlass fp8 configs (#5144); Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>; Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com> | 2024-06-01 08:46:07 +00:00 |
Tyler Michael Smith | 260d119e86 | [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) | 2024-06-01 06:45:32 +00:00 |
Daniele | a360ff80bb | [CI/Build] CMakeLists: build all extensions' cmake targets at the same time (#5034) | 2024-05-31 22:06:45 -06:00 |
Tyler Michael Smith | 1197e02141 | [Build] Guard against older CUDA versions when building CUTLASS 3.x kernels (#5168) | 2024-05-31 17:21:38 -07:00 |
Nick Hill | 657579113f | [Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support (#5171) | 2024-05-31 17:20:19 -07:00 |
Cody Yu | e9899fb7a4 | [Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039) | 2024-05-31 14:29:19 -07:00 |
functionxu123 | a377f0bd5e | [Misc]: optimize eager mode host time (#4196); Co-authored-by: xuhao <xuhao@cambricon.com> | 2024-05-31 13:14:50 +08:00 |
Simon Mo | e9d3aa04f6 | Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" (#5149) | 2024-05-30 22:00:26 -07:00 |
SnowDist | a22dea54d3 | [Model] Support MAP-NEO model (#5081); Co-authored-by: Zhuohan Li <zhuohan123@gmail.com> | 2024-05-30 19:24:41 -07:00 |
simon-mo | 533c217792 | Fix cutlass sm_90a vesrion in CMakeList | 2024-05-31 02:13:01 +00:00 |
Alexander Matveev | 6d21fa1cad | [Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) (#5136) | 2024-05-30 21:02:11 -05:00 |
Robert Shaw | b35be5403f | [Bugfix] Avoid Warnings in SparseML Activation Quantization (#5120) | 2024-05-30 17:04:37 -07:00 |
Simon Mo | 45a1a69b98 | [Build] Disable sm_90a in cu11 (#5141) | 2024-05-30 14:37:16 -07:00 |
Simon Mo | 87a658c812 | Bump version to v0.4.3 (#5046) | 2024-05-30 11:13:46 -07:00 |
Chansung Park | 429d89720e | add doc about serving option on dstack (#3074); Co-authored-by: Roger Wang <ywang@roblox.com> | 2024-05-30 10:11:07 -07:00 |
Cyrus Leung | a9bcc7afb2 | [Doc] Use intersphinx and update entrypoints docs (#5125) | 2024-05-30 09:59:23 -07:00 |
Hyunsung Lee | d79d9eaaff | [Misc] remove duplicate definition of `seq_lens_tensor` in model_runner.py (#5129) | 2024-05-30 06:56:19 -07:00 |
youkaichao | f758505c73 | [CI/Build] increase wheel size limit to 200 MB (#5130) | 2024-05-30 06:29:48 -07:00 |
Robert Shaw | d910816c73 | [Bugfix] Automatically Detect SparseML models (#5119) | 2024-05-30 12:58:37 +00:00 |
Breno Faria | 87d41c849d | [BUGFIX] [FRONTEND] Correct chat logprobs (#5029); Co-authored-by: Breno Faria <breno.faria@intrafind.com> | 2024-05-30 02:52:14 -07:00 |
omkar kakarparthi | e07aff9e52 | [CI/Build] Docker cleanup functionality for amd servers (#5112); Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com>; Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>; Co-authored-by: Alexei V. Ivanov <alexei.ivanov@amd.com>; Co-authored-by: omkarkakarparthi <okakarpa> | 2024-05-30 03:27:39 +00:00 |
Alexander Matveev | 5bf185a1c4 | [Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter (#5108) | 2024-05-30 00:30:18 +00:00 |
youkaichao | 4fbcb0f27e | [Doc][Build] update after removing vllm-nccl (#5103); Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> | 2024-05-29 23:51:18 +00:00 |
Itay Etelis | 7c3604fb68 | [Bugfix] logprobs is not compatible with the OpenAI spec #4795 (#5031) | 2024-05-29 16:13:22 -07:00 |
Cyrus Leung | b1c255630d | [Core] Avoid the need to pass `None` values to `Sequence.inputs` (#5099) | 2024-05-29 16:05:01 -07:00 |
Cyrus Leung | eb6c50cdc2 | [Bugfix][CI/Build] Fix codespell failing to skip files in `git diff` (#5097) | 2024-05-29 16:02:54 -07:00 |
Cyrus Leung | eecd864388 | [Bugfix][CI/Build] Fix test and improve code for `merge_async_iterators` (#5096) | 2024-05-29 16:02:25 -07:00 |
Ronen Schaffer | ae495c74ea | [Doc]Replace deprecated flag in readme (#4526) | 2024-05-29 22:26:33 +00:00 |
afeldman-nm | 4238bc82f2 | [Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) (#4837) | 2024-05-29 16:09:13 +00:00 |
youkaichao | 594392d27a | [Core][Distributed] improve p2p access check (#4992) | 2024-05-29 11:29:07 +00:00 |
Cyrus Leung | 18c1f16d86 | [Bugfix] Fix arguments passed to `Sequence` in stop checker test (#5092) | 2024-05-29 07:16:41 +00:00 |
youkaichao | 5bd3c65072 | [Core][Optimization] remove vllm-nccl (#5091) | 2024-05-29 05:13:52 +00:00 |
Marut Pandya | 616e600e0b | [Misc] add gpu_memory_utilization arg (#5079); Signed-off-by: pandyamarut <pandyamarut@gmail.com> | 2024-05-28 17:16:18 -07:00 |
Junichi Sato | dfba529b40 | [Bugfix] Remove the last EOS token unless explicitly specified (#5077) | 2024-05-28 17:15:35 -07:00 |
Cyrus Leung | 5ae5ed1e60 | [Core] Consolidate prompt arguments to LLM engines (#4328); Co-authored-by: Roger Wang <ywang@roblox.com> | 2024-05-28 13:29:31 -07:00 |
Simon Mo | 290f4ada2b | [Docs] Add Dropbox as sponsors (#5089) | 2024-05-28 10:29:09 -07:00 |