Commit Graph

73 Commits

Kunshang Ji f4bc4de1b1
[Core]refactor aqlm quant ops (#4351) 2024-04-25 15:03:56 -04:00
zifeitong a395a638c2
[Misc] Use public API in benchmark_throughput (#4300) 2024-04-24 21:10:24 +00:00
Roger Wang 7923dcad12
[Misc] Update ShareGPT Dataset Sampling in Serving Benchmark (#4279) 2024-04-24 09:49:13 -07:00
James Fleming 2b7949c1c2
AQLM CUDA support (#3287)
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-23 13:59:33 -04:00
Michael Goin 53b018edcb
[Bugfix] Get available quantization methods from quantization registry (#4098) 2024-04-18 00:21:55 -07:00
Elinx fe3b5bbc23
[Bugfix] fix output parsing error for trtllm backend (#4137)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-17 11:07:23 +00:00
Michael Feil c2b4a1bce9
[Doc] Add typing hints / mypy types cleanup (#3816)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-04-11 17:17:21 -07:00
Kunshang Ji e9da5a40c6
[Misc] Add indirection layer for custom ops (#3913) 2024-04-10 20:26:07 -07:00
SangBin Cho 67b4221a61
[Core][5/N] Fully working chunked prefill e2e (#3884) 2024-04-10 17:56:48 -07:00
Zedong Peng c013d32c75
[Benchmark] Add cpu options to bench scripts (#3915) 2024-04-09 21:30:03 -07:00
youkaichao e4be7d70bb
[CI/Benchmark] add more iteration and use median for robust latency benchmark (#3889) 2024-04-06 21:32:30 +00:00
TianYu GUO b7782002e1
[Benchmark] Refactor sample_requests in benchmark_throughput (#3613)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-04 09:56:22 +00:00
Chang Su 819a309c0f
[Bugfix] Fix args in benchmark_serving (#3836)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-04 07:41:05 +00:00
Adrian Abeyta 2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
Roger Wang ccb58b23e6
[Misc] Fix Benchmark TTFT Calculation for Chat Completions (#3768) 2024-04-01 15:24:30 -07:00
Yile (Michael) Gu 98a42e7078
[Benchmark] Change mii to use persistent deployment and support tensor parallel (#3628) 2024-03-28 17:33:52 -07:00
SangBin Cho b51c1cc9d2
[2/N] Chunked prefill data update (#3538) 2024-03-28 10:06:01 -07:00
Roger Wang 45b6ef6513
feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277) 2024-03-27 13:39:26 -07:00
AmadeusChan 1956931436
[Misc] add the "download-dir" option to the latency/throughput benchmarks (#3621) 2024-03-27 13:39:05 -07:00
SangBin Cho 01bfb22b41
[CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
Simon Mo 8e67598aa6
[Misc] fix line length for entire codebase (#3444) 2024-03-16 00:36:29 -07:00
Ronen Schaffer 14e3f9a1b2
Replace `lstrip()` with `removeprefix()` to fix Ruff linter warning (#2958) 2024-03-15 21:01:30 -07:00
youkaichao 8fe8386591
[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389) 2024-03-14 08:11:48 +00:00
Terry 7e9bd08f60
Add batched RoPE kernel (#3095) 2024-03-13 13:45:26 -07:00
TianYu GUO 1ece1ae829
[Minor Fix] Fix comments in benchmark_serving (#3252) 2024-03-07 22:22:59 -08:00
Chen Wang 9a4548bae7
Fix the openai benchmarking requests to work with latest OpenAI apis (#2992)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-04 15:51:56 -08:00
Allen.Dou 9cbc7e5f3b
enable --gpu-memory-utilization in benchmark_throughput.py (#3175)
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
2024-03-04 10:37:58 -08:00
TianYu GUO 901cf4c52b
[Minor Fix] Remove unused code in benchmark_prefix_caching.py (#3171) 2024-03-03 22:48:27 -08:00
Philipp Moritz 17c3103c56
Make it easy to profile workers with nsight (#3162)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-03 16:19:13 -08:00
Zhuohan Li 996d095c54
[FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark (#3158) 2024-03-03 14:37:18 -08:00
Sage Moore ce4f5a29fb
Add Automatic Prefix Caching (#2762)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
Philipp Moritz cfc15a1031
Optimize Triton MoE Kernel (#2979)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-02-26 13:48:56 -08:00
Massimiliano Pronesti 93dc5a2870
chore(vllm): codespell for spell checking (#2820) 2024-02-21 18:56:01 -08:00
Ronen Schaffer d7f396486e
Update comment (#2934) 2024-02-21 18:18:37 -08:00
Roger Wang a4211a4dc3
Serving Benchmark Refactoring (#2433) 2024-02-12 22:53:00 -08:00
Woosuk Kwon 72d3a30c63
[Minor] Fix benchmark_latency script (#2765) 2024-02-05 12:45:37 -08:00
Kunshang Ji 96b6f475dd
Remove hardcoded `device="cuda"` to support more devices (#2503)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-02-01 15:46:39 -08:00
zhaoyang-star 9090bf02e7
Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Simon Mo 1e4277d2d1
lint: format all python file instead of just source code (#2567) 2024-01-23 15:53:06 -08:00
Antoni Baum 9b945daaf1
[Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Harry Mellor 63e835cbcc
Fix progress bar and allow HTTPS in `benchmark_serving.py` (#2552) 2024-01-22 14:40:31 -08:00
Harry Mellor 2709c0009a
Support OpenAI API server in `benchmark_serving.py` (#2172) 2024-01-18 20:34:08 -08:00
Woosuk Kwon 37ca558103
Optimize model execution with CUDA graph (#1926)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-16 21:12:08 -08:00
CHU Tianxiang 0fbfc4b81b
Add GPTQ support (#916) 2023-12-15 03:04:22 -08:00
Woosuk Kwon 5dd80d3777
Fix latency benchmark script (#2035) 2023-12-11 11:19:08 -08:00
wbn dacaf5a400
Replace head_mapping params with num_kv_heads to attention kernel. (#1997)
Co-authored-by: wangguoya <wangguoya@baidu.com>
Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>
2023-12-10 10:12:53 -08:00
Antoni Baum 05ff90b692
Save pytorch profiler output for latency benchmark (#1871)
* Save profiler output

* Apply feedback from code review
2023-12-05 20:55:55 -08:00
aisensiy 8d8c2f6ffe
Support max-model-len argument for throughput benchmark (#1858) 2023-11-30 08:10:24 -08:00
Woosuk Kwon 51d3cb951d
Remove max_num_seqs in latency benchmark script (#1855) 2023-11-30 00:00:32 -08:00
Woosuk Kwon e74b1736a1
Add profile option to latency benchmark script (#1839) 2023-11-29 23:42:52 -08:00