f4bc4de1b1 | 2024-04-25 15:03:56 -04:00 | Kunshang Ji | [Core]refactor aqlm quant ops (#4351)
a395a638c2 | 2024-04-24 21:10:24 +00:00 | zifeitong | [Misc] Use public API in benchmark_throughput (#4300)
7923dcad12 | 2024-04-24 09:49:13 -07:00 | Roger Wang | [Misc] Update ShareGPT Dataset Sampling in Serving Benchmark (#4279)
2b7949c1c2 | 2024-04-23 13:59:33 -04:00 | James Fleming | AQLM CUDA support (#3287)
    Co-authored-by: mgoin <michael@neuralmagic.com>
53b018edcb | 2024-04-18 00:21:55 -07:00 | Michael Goin | [Bugfix] Get available quantization methods from quantization registry (#4098)
fe3b5bbc23 | 2024-04-17 11:07:23 +00:00 | Elinx | [Bugfix] fix output parsing error for trtllm backend (#4137)
    Co-authored-by: Roger Wang <ywang@roblox.com>
c2b4a1bce9 | 2024-04-11 17:17:21 -07:00 | Michael Feil | [Doc] Add typing hints / mypy types cleanup (#3816)
    Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
e9da5a40c6 | 2024-04-10 20:26:07 -07:00 | Kunshang Ji | [Misc] Add indirection layer for custom ops (#3913)
67b4221a61 | 2024-04-10 17:56:48 -07:00 | SangBin Cho | [Core][5/N] Fully working chunked prefill e2e (#3884)
c013d32c75 | 2024-04-09 21:30:03 -07:00 | Zedong Peng | [Benchmark] Add cpu options to bench scripts (#3915)
e4be7d70bb | 2024-04-06 21:32:30 +00:00 | youkaichao | [CI/Benchmark] add more iteration and use median for robust latency benchmark (#3889)
b7782002e1 | 2024-04-04 09:56:22 +00:00 | TianYu GUO | [Benchmark] Refactor sample_requests in benchmark_throughput (#3613)
    Co-authored-by: Roger Wang <ywang@roblox.com>
819a309c0f | 2024-04-04 07:41:05 +00:00 | Chang Su | [Bugfix] Fix args in benchmark_serving (#3836)
    Co-authored-by: Roger Wang <ywang@roblox.com>
2ff767b513 | 2024-04-03 14:15:55 -07:00 | Adrian Abeyta | Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
    Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
    Co-authored-by: HaiShaw <hixiao@gmail.com>
    Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
    Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
    Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
    Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
    Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
    Co-authored-by: guofangze <guofangze@kuaishou.com>
    Co-authored-by: Michael Goin <mgoin64@gmail.com>
    Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
    Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
ccb58b23e6 | 2024-04-01 15:24:30 -07:00 | Roger Wang | [Misc] Fix Benchmark TTFT Calculation for Chat Completions (#3768)
98a42e7078 | 2024-03-28 17:33:52 -07:00 | Yile (Michael) Gu | [Benchmark] Change mii to use persistent deployment and support tensor parallel (#3628)
b51c1cc9d2 | 2024-03-28 10:06:01 -07:00 | SangBin Cho | [2/N] Chunked prefill data update (#3538)
45b6ef6513 | 2024-03-27 13:39:26 -07:00 | Roger Wang | feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277)
1956931436 | 2024-03-27 13:39:05 -07:00 | AmadeusChan | [Misc] add the "download-dir" option to the latency/throughput benchmarks (#3621)
01bfb22b41 | 2024-03-25 07:59:47 -07:00 | SangBin Cho | [CI] Try introducing isort. (#3495)
8e67598aa6 | 2024-03-16 00:36:29 -07:00 | Simon Mo | [Misc] fix line length for entire codebase (#3444)
14e3f9a1b2 | 2024-03-15 21:01:30 -07:00 | Ronen Schaffer | Replace `lstrip()` with `removeprefix()` to fix Ruff linter warning (#2958)
8fe8386591 | 2024-03-14 08:11:48 +00:00 | youkaichao | [Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389)
7e9bd08f60 | 2024-03-13 13:45:26 -07:00 | Terry | Add batched RoPE kernel (#3095)
1ece1ae829 | 2024-03-07 22:22:59 -08:00 | TianYu GUO | [Minor Fix] Fix comments in benchmark_serving (#3252)
9a4548bae7 | 2024-03-04 15:51:56 -08:00 | Chen Wang | Fix the openai benchmarking requests to work with latest OpenAI apis (#2992)
    Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
9cbc7e5f3b | 2024-03-04 10:37:58 -08:00 | Allen.Dou | enable --gpu-memory-utilization in benchmark_throughput.py (#3175)
    Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
901cf4c52b | 2024-03-03 22:48:27 -08:00 | TianYu GUO | [Minor Fix] Remove unused code in benchmark_prefix_caching.py (#3171)
17c3103c56 | 2024-03-03 16:19:13 -08:00 | Philipp Moritz | Make it easy to profile workers with nsight (#3162)
    Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
996d095c54 | 2024-03-03 14:37:18 -08:00 | Zhuohan Li | [FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark (#3158)
ce4f5a29fb | 2024-03-02 00:50:01 -08:00 | Sage Moore | Add Automatic Prefix Caching (#2762)
    Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
    Co-authored-by: Michael Goin <michael@neuralmagic.com>
cfc15a1031 | 2024-02-26 13:48:56 -08:00 | Philipp Moritz | Optimize Triton MoE Kernel (#2979)
    Co-authored-by: Cade Daniel <edacih@gmail.com>
93dc5a2870 | 2024-02-21 18:56:01 -08:00 | Massimiliano Pronesti | chore(vllm): codespell for spell checking (#2820)
d7f396486e | 2024-02-21 18:18:37 -08:00 | Ronen Schaffer | Update comment (#2934)
a4211a4dc3 | 2024-02-12 22:53:00 -08:00 | Roger Wang | Serving Benchmark Refactoring (#2433)
72d3a30c63 | 2024-02-05 12:45:37 -08:00 | Woosuk Kwon | [Minor] Fix benchmark_latency script (#2765)
96b6f475dd | 2024-02-01 15:46:39 -08:00 | Kunshang Ji | Remove hardcoded `device="cuda"` to support more devices (#2503)
    Co-authored-by: Jiang Li <jiang1.li@intel.com>
    Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
9090bf02e7 | 2024-01-28 16:43:54 -08:00 | zhaoyang-star | Support FP8-E5M2 KV Cache (#2279)
    Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
    Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
1e4277d2d1 | 2024-01-23 15:53:06 -08:00 | Simon Mo | lint: format all python file instead of just source code (#2567)
9b945daaf1 | 2024-01-23 15:26:37 -08:00 | Antoni Baum | [Experimental] Add multi-LoRA support (#1804)
    Co-authored-by: Chen Shen <scv119@gmail.com>
    Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
    Co-authored-by: Avnish Narayan <avnish@anyscale.com>
63e835cbcc | 2024-01-22 14:40:31 -08:00 | Harry Mellor | Fix progress bar and allow HTTPS in `benchmark_serving.py` (#2552)
2709c0009a | 2024-01-18 20:34:08 -08:00 | Harry Mellor | Support OpenAI API server in `benchmark_serving.py` (#2172)
37ca558103 | 2023-12-16 21:12:08 -08:00 | Woosuk Kwon | Optimize model execution with CUDA graph (#1926)
    Co-authored-by: Chen Shen <scv119@gmail.com>
    Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
0fbfc4b81b | 2023-12-15 03:04:22 -08:00 | CHU Tianxiang | Add GPTQ support (#916)
5dd80d3777 | 2023-12-11 11:19:08 -08:00 | Woosuk Kwon | Fix latency benchmark script (#2035)
dacaf5a400 | 2023-12-10 10:12:53 -08:00 | wbn | Replace head_mapping params with num_kv_heads to attention kernel. (#1997)
    Co-authored-by: wangguoya <wangguoya@baidu.com>
    Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>
05ff90b692 | 2023-12-05 20:55:55 -08:00 | Antoni Baum | Save pytorch profiler output for latency benchmark (#1871)
    * Save profiler output
    * Apply feedback from code review
8d8c2f6ffe | 2023-11-30 08:10:24 -08:00 | aisensiy | Support max-model-len argument for throughput benchmark (#1858)
51d3cb951d | 2023-11-30 00:00:32 -08:00 | Woosuk Kwon | Remove max_num_seqs in latency benchmark script (#1855)
e74b1736a1 | 2023-11-29 23:42:52 -08:00 | Woosuk Kwon | Add profile option to latency benchmark script (#1839)