Commit Graph

73 Commits

Kunshang Ji f4bc4de1b1
[Core]refactor aqlm quant ops (#4351) 2024-04-25 15:03:56 -04:00
zifeitong a395a638c2
[Misc] Use public API in benchmark_throughput (#4300) 2024-04-24 21:10:24 +00:00
Roger Wang 7923dcad12
[Misc] Update ShareGPT Dataset Sampling in Serving Benchmark (#4279) 2024-04-24 09:49:13 -07:00
James Fleming 2b7949c1c2
AQLM CUDA support (#3287)
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-23 13:59:33 -04:00
Michael Goin 53b018edcb
[Bugfix] Get available quantization methods from quantization registry (#4098) 2024-04-18 00:21:55 -07:00
Elinx fe3b5bbc23
[Bugfix] fix output parsing error for trtllm backend (#4137)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-17 11:07:23 +00:00
Michael Feil c2b4a1bce9
[Doc] Add typing hints / mypy types cleanup (#3816)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-04-11 17:17:21 -07:00
Kunshang Ji e9da5a40c6
[Misc] Add indirection layer for custom ops (#3913) 2024-04-10 20:26:07 -07:00
SangBin Cho 67b4221a61
[Core][5/N] Fully working chunked prefill e2e (#3884) 2024-04-10 17:56:48 -07:00
Zedong Peng c013d32c75
[Benchmark] Add cpu options to bench scripts (#3915) 2024-04-09 21:30:03 -07:00
youkaichao e4be7d70bb
[CI/Benchmark] add more iteration and use median for robust latency benchmark (#3889) 2024-04-06 21:32:30 +00:00
TianYu GUO b7782002e1
[Benchmark] Refactor sample_requests in benchmark_throughput (#3613)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-04 09:56:22 +00:00
Chang Su 819a309c0f
[Bugfix] Fix args in benchmark_serving (#3836)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-04 07:41:05 +00:00
Adrian Abeyta 2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
Roger Wang ccb58b23e6
[Misc] Fix Benchmark TTFT Calculation for Chat Completions (#3768) 2024-04-01 15:24:30 -07:00
Yile (Michael) Gu 98a42e7078
[Benchmark] Change mii to use persistent deployment and support tensor parallel (#3628) 2024-03-28 17:33:52 -07:00
SangBin Cho b51c1cc9d2
[2/N] Chunked prefill data update (#3538) 2024-03-28 10:06:01 -07:00
Roger Wang 45b6ef6513
feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277) 2024-03-27 13:39:26 -07:00
AmadeusChan 1956931436
[Misc] add the "download-dir" option to the latency/throughput benchmarks (#3621) 2024-03-27 13:39:05 -07:00
SangBin Cho 01bfb22b41
[CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
Simon Mo 8e67598aa6
[Misc] fix line length for entire codebase (#3444) 2024-03-16 00:36:29 -07:00
Ronen Schaffer 14e3f9a1b2
Replace `lstrip()` with `removeprefix()` to fix Ruff linter warning (#2958) 2024-03-15 21:01:30 -07:00
youkaichao 8fe8386591
[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389) 2024-03-14 08:11:48 +00:00
Terry 7e9bd08f60
Add batched RoPE kernel (#3095) 2024-03-13 13:45:26 -07:00
TianYu GUO 1ece1ae829
[Minor Fix] Fix comments in benchmark_serving (#3252) 2024-03-07 22:22:59 -08:00
Chen Wang 9a4548bae7
Fix the openai benchmarking requests to work with latest OpenAI apis (#2992)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-04 15:51:56 -08:00
Allen.Dou 9cbc7e5f3b
enable --gpu-memory-utilization in benchmark_throughput.py (#3175)
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
2024-03-04 10:37:58 -08:00
TianYu GUO 901cf4c52b
[Minor Fix] Remove unused code in benchmark_prefix_caching.py (#3171) 2024-03-03 22:48:27 -08:00
Philipp Moritz 17c3103c56
Make it easy to profile workers with nsight (#3162)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-03 16:19:13 -08:00
Zhuohan Li 996d095c54
[FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark (#3158) 2024-03-03 14:37:18 -08:00
Sage Moore ce4f5a29fb
Add Automatic Prefix Caching (#2762)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
Philipp Moritz cfc15a1031
Optimize Triton MoE Kernel (#2979)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-02-26 13:48:56 -08:00
Massimiliano Pronesti 93dc5a2870
chore(vllm): codespell for spell checking (#2820) 2024-02-21 18:56:01 -08:00
Ronen Schaffer d7f396486e
Update comment (#2934) 2024-02-21 18:18:37 -08:00
Roger Wang a4211a4dc3
Serving Benchmark Refactoring (#2433) 2024-02-12 22:53:00 -08:00
Woosuk Kwon 72d3a30c63
[Minor] Fix benchmark_latency script (#2765) 2024-02-05 12:45:37 -08:00
Kunshang Ji 96b6f475dd
Remove hardcoded `device="cuda"` to support more devices (#2503)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-02-01 15:46:39 -08:00
zhaoyang-star 9090bf02e7
Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Simon Mo 1e4277d2d1
lint: format all python file instead of just source code (#2567) 2024-01-23 15:53:06 -08:00
Antoni Baum 9b945daaf1
[Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Harry Mellor 63e835cbcc
Fix progress bar and allow HTTPS in `benchmark_serving.py` (#2552) 2024-01-22 14:40:31 -08:00
Harry Mellor 2709c0009a
Support OpenAI API server in `benchmark_serving.py` (#2172) 2024-01-18 20:34:08 -08:00
Woosuk Kwon 37ca558103
Optimize model execution with CUDA graph (#1926)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-16 21:12:08 -08:00
CHU Tianxiang 0fbfc4b81b
Add GPTQ support (#916) 2023-12-15 03:04:22 -08:00
Woosuk Kwon 5dd80d3777
Fix latency benchmark script (#2035) 2023-12-11 11:19:08 -08:00
wbn dacaf5a400
Replace head_mapping params with num_kv_heads to attention kernel. (#1997)
Co-authored-by: wangguoya <wangguoya@baidu.com>
Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>
2023-12-10 10:12:53 -08:00
Antoni Baum 05ff90b692
Save pytorch profiler output for latency benchmark (#1871)
* Save profiler output

* Apply feedback from code review
2023-12-05 20:55:55 -08:00
aisensiy 8d8c2f6ffe
Support max-model-len argument for throughput benchmark (#1858) 2023-11-30 08:10:24 -08:00
Woosuk Kwon 51d3cb951d
Remove max_num_seqs in latency benchmark script (#1855) 2023-11-30 00:00:32 -08:00
Woosuk Kwon e74b1736a1
Add profile option to latency benchmark script (#1839) 2023-11-29 23:42:52 -08:00