
Commit 9d404cc

amogkam, youkaichao, jeejeelee, hmellor, and zucchini-nlp authored and committed
Pull Upstream main 6/16 (#7)
* [doc] clarify windows support (vllm-project#19088) Signed-off-by: youkaichao <[email protected]> * [CI/Build] Remove V0 LoRA test (vllm-project#19066) Signed-off-by: Jee Jee Li <[email protected]> * Fix underscores in dict keys passed via CLI (vllm-project#19030) Signed-off-by: Harry Mellor <[email protected]> * [Bugfix] disable processor cache (vllm-project#19068) Signed-off-by: raushan <[email protected]> * [Doc] Improve the Pull Request template with key components (vllm-project#19086) Signed-off-by: Lu Fang <[email protected]> * [Misc] Add missing `_Backend` enums (vllm-project#19081) Signed-off-by: nicklucche <[email protected]> * [Misc] fix: add miss best_of param validation (vllm-project#18555) Signed-off-by: googs1025 <[email protected]> * [Misc] Add SPDX-FileCopyrightText (vllm-project#19100) Signed-off-by: simon-mo <[email protected]> * [Doc] Readme standardization (vllm-project#18695) Co-authored-by: Soren Dreano <[email protected]> * [doc] update docker version (vllm-project#19074) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Kernel] DeepEP dispatch-combine kernel integration (vllm-project#18434) Signed-off-by: Varun <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * [V1] Support cross-layer KV sharing (vllm-project#18212) Signed-off-by: Yong Hoon Shin <[email protected]> * [Perf] Tune `scaled_fp8_quant` by increasing vectorization (vllm-project#18844) Signed-off-by: mgoin <[email protected]> * Fix interaction between `Optional` and `Annotated` in CLI typing (vllm-project#19093) Signed-off-by: Harry Mellor <[email protected]> Co-authored-by: Yikun Jiang <[email protected]> * [v1] Re-init input batch for multiple kv cache groups (vllm-project#18654) Signed-off-by: Chen Zhang <[email protected]> * [V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix (vllm-project#18971) * [Bugfix] get_num_blocks_to_allocate with null_block (vllm-project#19031) Signed-off-by: Chen Zhang <[email protected]> * [Bugfix]: Fix the incompatibility issue with tool_choice 'required' when Thinking is enabled (vllm-project#19075) Signed-off-by: chaunceyjiang <[email protected]> * [Bugfix][P/D] Fix Prefix Cache Bug (vllm-project#18411) Signed-off-by: nicklucche <[email protected]> Co-authored-by: Robert Shaw <[email protected]> * [Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers (vllm-project#19029) Signed-off-by: Chen Zhang <[email protected]> * feat: add data parallel rank to KVEventBatch (vllm-project#18925) * [Misc] Fix path and python alias errors in disagg_prefill exmaples (vllm-project#18919) * [Docs] Add developer doc about CI failures (vllm-project#18782) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [CPU] V1 support for the CPU backend (vllm-project#16441) * [Core] Cast multimodal input in hf processor (vllm-project#18862) Signed-off-by: Lukas Geiger <[email protected]> * [KERNEL] Sampler. 
CUDA kernel for applying repetition penalty (vllm-project#18437) * [Cleanup][v1]:remote guided-decoding-backend for example (vllm-project#19059) Signed-off-by: calvin chen <[email protected]> * [NVIDIA] Add Cutlass MLA backend (vllm-project#17625) * [Bugfix] Fix FA3 full cuda graph correctness (vllm-project#19106) Signed-off-by: Woosuk Kwon <[email protected]> * Fix vllm-project#19130 (vllm-project#19132) Signed-off-by: 汪志鹏 <[email protected]> * [TPU] Skip hanging tests (vllm-project#19115) Signed-off-by: Siyuan Liu <[email protected]> * Fix ValueError: Missing value for tag key(s): model_name,engine. (vllm-project#19113) Signed-off-by: Seiji Eicher <[email protected]> * [Misc] Add packages for benchmark as extra dependency (vllm-project#19089) Signed-off-by: Isotr0py <[email protected]> * Improve the output precision of embedding models (vllm-project#19092) * [CI/Build][Bugfix] Ensure compatibility with transformers 4.52 (vllm-project#18678) Signed-off-by: DarkLight1337 <[email protected]> * Add DeepSeek-R1-0528 function call chat template (vllm-project#18874) Signed-off-by: 许文卿 <[email protected]> * Sm100 blockwise fp8 swap ab (vllm-project#18564) * [Doc] Update V1 Guide for embedding models (vllm-project#19141) Signed-off-by: DarkLight1337 <[email protected]> * Allow AsyncLLMEngine.generate to target a specific DP rank (vllm-project#19102) Signed-off-by: Jon Swenson <[email protected]> * [Bugfix][EP+DP] Fix internode check (vllm-project#19112) Signed-off-by: Tyler Michael Smith <[email protected]> * [Perf] Tunings for SM100 FP8 CUTLASS kernel (vllm-project#18778) Signed-off-by: mgoin <[email protected]> * [TPU] Update dynamo dump file name in compilation test (vllm-project#19108) Signed-off-by: Siyuan Liu <[email protected]> * [Bugfix] fix v1 cpu worker fails on macOS (vllm-project#19121) * [Kernel] Integrate batched/masked deepgemm kernel (vllm-project#19111) Signed-off-by: Varun <[email protected]> Co-authored-by: Varun <[email protected]> * [Misc] refactor: simplify EngineCoreClient.make_async_mp_client in AsyncLLM (vllm-project#18817) Signed-off-by: googs1025 <[email protected]> * [P/D] Heterogeneous TP (vllm-project#18833) Signed-off-by: nicklucche <[email protected]> * [doc] small fix (vllm-project#19167) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix][Nixl] Fix full prefix cache hit bug (vllm-project#18632) Signed-off-by: [email protected] <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]> * [Bugfix] Fix port handling in make_zmq_path (vllm-project#19117) * [Torch Nightly]add missing dependency (vllm-project#18770) Signed-off-by: Yang Wang <[email protected]> * Handle non-serializable objects when dumping benchmark results (vllm-project#19114) * [BugFix][Minor] Fix full cuda graph bug when max_num_seqs < 512 (vllm-project#19171) Signed-off-by: Woosuk Kwon <[email protected]> * [Bugfix]: Fix the incompatibility issue with stream when Thinking is disabled (vllm-project#19135) Signed-off-by: chaunceyjiang <[email protected]> * [Build] Annotate wheel and container path for release workflow (vllm-project#19162) Signed-off-by: simon-mo <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Misc] Remove unnecessary fallback to prefill-decode attention (vllm-project#19138) Signed-off-by: vllmellm <[email protected]> * [Misc] Do not override NCCL_CUMEM_ENABLE if set explicitly (vllm-project#19105) 
Signed-off-by: 22quinn <[email protected]> * [Frontend] improve vllm run-batch --help display (vllm-project#19187) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix] properly catch PIL-related errors for vision models when incorrect data urls are provided (vllm-project#19202) Signed-off-by: Guillaume Calmettes <[email protected]> * [mistral_common] Add v11 tokenizer (vllm-project#19193) Signed-off-by: Patrick von Platen <[email protected]> * Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (vllm-project#19205) * [Hardware][NVIDIA] FP4 MoE kernel optimization (vllm-project#19110) Signed-off-by: Chiyue Wei <[email protected]> Co-authored-by: Chiyue Wei <[email protected]> * [MISC][Bugfix] Use less CPU when message queue has been empty for some time (vllm-project#16226) Signed-off-by: Povilas Kanapickas <[email protected]> * [P/D][NixlConnector] Enable FlashInfer backend (vllm-project#19090) * [Quantization] Skip Fp4 Test for `compressed-tensors` (vllm-project#19217) * [V1] Use FlashInfer by default on Blackwell GPUs (vllm-project#19118) * [Model] NemotronH support (vllm-project#18863) Signed-off-by: Luis Vega <[email protected]> Co-authored-by: Luis Vega <[email protected]> * Fix AOPerModuleConfig name changes (vllm-project#18869) Signed-off-by: Jerry Zhang <[email protected]> * [Bugfix] Fix EAGLE vocab embedding construction for Llama 70B (vllm-project#19033) Signed-off-by: Benjamin Chislett <[email protected]> * [v1] Hybrid Memory Allocator (vllm-project#17996) Signed-off-by: Chen Zhang <[email protected]> * [TPU] update torch_xla pin (vllm-project#19231) Signed-off-by: Chengji Yao <[email protected]> * Support allowed_token_ids in ChatCompletionRequest (vllm-project#19143) Signed-off-by: Xu Song <[email protected]> * [Chore] update CODEOWNERS (vllm-project#19247) Signed-off-by: Aaron Pham <[email protected]> * [v1][P/D] Fix a edge case in kv cache schedule (vllm-project#19182) Co-authored-by: jinghui <[email protected]> * [TPU] fix kv cache dtype in model runner (vllm-project#19244) Signed-off-by: Chengji Yao <[email protected]> * [Quantization] Bump compressed-tensors version; update NVFP4A16 test model (vllm-project#19224) Signed-off-by: Dipika Sikka <[email protected]> * [Docs] Improve V1 KVConnector interface documentation (vllm-project#19172) Signed-off-by: Nick Hill <[email protected]> * Fix CompilationConfig repr (vllm-project#19091) Signed-off-by: rzou <[email protected]> * Unit Test for run_dp_sharded_vision_model (vllm-project#19103) Signed-off-by: Siqi Yan <[email protected]> Co-authored-by: Siqi Yan <[email protected]> * [Model] Optimize nemotron_h implementation (vllm-project#19249) Signed-off-by: Jee Jee Li <[email protected]> * [Core] Raise when non-multi-instance DP clients target a DP rank (vllm-project#19227) Signed-off-by: Jon Swenson <[email protected]> * improve logits bias (vllm-project#19041) * Fixed ppc build when it runs on non-RHEL based linux distros (vllm-project#18422) Signed-off-by: Nishidha Panpaliya <[email protected]> Signed-off-by: Md. Shafi Hussain <[email protected]> Signed-off-by: npanpaliya <[email protected]> Co-authored-by: Md. 
Shafi Hussain <[email protected]> * [BugFix] Fix MultiConnector test after HMA changes (vllm-project#19291) Signed-off-by: Nick Hill <[email protected]> * [Bugfix][Core] Update cancellation logic in `generate()` to handle Generator exits (vllm-project#19225) Co-authored-by: Adolfo Victoria <[email protected]> * [Core] Fix abrupt request abort (vllm-project#18485) Signed-off-by: nicklucche <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]> * [BugFix] Fix tpu_model_runner block_id concatenation (vllm-project#19228) Signed-off-by: Nick Hill <[email protected]> * [Misc][Tools][Benchmark] Fix and improve auto tune script (vllm-project#19163) Signed-off-by: Chenyaaang <[email protected]> * [Build][ROCm] Update Dockerfile.rocm (vllm-project#19296) Signed-off-by: Alexei V. Ivanov <[email protected]> * [Easy][Test] Simplify test_function_tool_use with multiple parametrizes (vllm-project#19269) Signed-off-by: Lu Fang <[email protected]> * [Kernel] Integrate CUTLASS MoE kernel with PPLX (vllm-project#18762) Signed-off-by: ElizaWszola <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> * [TPU][Test] Add script to run benchmark on TPU for buildkite (vllm-project#19039) Signed-off-by: Qiliang Cui <[email protected]> * [CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py (vllm-project#19253) Signed-off-by: Aaruni Aggarwal <[email protected]> * Add FlexAttention to V1 (vllm-project#16078) Signed-off-by: drisspg <[email protected]> * [Misc] refactor context extension (vllm-project#19246) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [CI/Build] Improve Llama GGUF test robustness (vllm-project#19287) Signed-off-by: Isotr0py <[email protected]> * [Nit][Benchmark]Fix example in benchmark_serving_structured_output.py (vllm-project#19311) Signed-off-by: Lifan Shen <[email protected]> * [AMD] Update compatible packaging version (vllm-project#19309) Signed-off-by: pramkuma <[email protected]> * [BugFix][V1] Fix memory profiling bug (vllm-project#18974) Signed-off-by: luka <[email protected]> * [Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer (vllm-project#19283) Signed-off-by: chaunceyjiang <[email protected]> * [Bugfix] Re-enable use_cudagraph in vLLM v1 (vllm-project#19299) Signed-off-by: Richard Zou <[email protected]> * [Misc] Change tests/compile to use VLLM_V1 by default (vllm-project#19302) Signed-off-by: rzou <[email protected]> * Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B (vllm-project#19315) Signed-off-by: Xu Wenqing <[email protected]> * [Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection (vllm-project#19082) Signed-off-by: Akash Kaothalkar <[email protected]> Co-authored-by: Akash Kaothalkar <[email protected]> * [Quantization] Add compressed-tensors NVFP4 support (vllm-project#18312) * [Multi Modal] Add an env var for message queue max chunk bytes (vllm-project#19242) Signed-off-by: yZhen <[email protected]> Co-authored-by: yZhen <[email protected]> * [Bugfix] model_max_length should consider max_model_len in tokenizer_config (vllm-project#19201) * [Deprecation] Remove `inputs` arg fallback in Engine classes (vllm-project#18799) Signed-off-by: DarkLight1337 <[email protected]> * [Misc] Add documentation update reminder to PR template (vllm-project#19289) Signed-off-by: Isotr0py <[email protected]> * 
[Frontend] Remove unreachable code from llm.py (vllm-project#19288) Signed-off-by: KsuParkhamchuk <[email protected]> * [Misc] Cleanup compilation tests (vllm-project#19343) Signed-off-by: rzou <[email protected]> * [doc] improve ci doc (vllm-project#19307) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Doc] Fix description in the Automatic Prefix Caching design doc (vllm-project#19333) Signed-off-by: cr7258 <[email protected]> * [CI/Build] Fix LoRA test (vllm-project#19350) Signed-off-by: Jee Jee Li <[email protected]> * [Fix] Allow kernel compilation for CUDA capability 8.7 (vllm-project#19328) Signed-off-by: Conroy Cheers <[email protected]> * [CI] Introduce rules for llama auto-label (vllm-project#19323) Signed-off-by: Lu Fang <[email protected]> * [Docs] Fix a bullet list in usage/security.md (vllm-project#19358) Signed-off-by: windsonsea <[email protected]> * [full_graph] Fix query_start_loc padding (vllm-project#19321) Signed-off-by: Yinghai Lu <[email protected]> * [v1] Add fp32 support to v1 engine through flex attn (vllm-project#19319) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Isotr0py <[email protected]> * [Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. (vllm-project#19298) Signed-off-by: Varun <[email protected]> Co-authored-by: Varun <[email protected]> * [Bugfix][Core] Prevent token lengths exceeding `max_model_len` in V0 (vllm-project#19348) Signed-off-by: 22quinn <[email protected]> * [Quantization] Bump compressed-tensors version (vllm-project#19295) Signed-off-by: Kyle Sayers <[email protected]> * [Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var (vllm-project#18472) Signed-off-by: liusiqian <[email protected]> * [TPU]Fix KV cache sharing tests (vllm-project#19371) * [HOT-FIX] Add `kv_sharing_target_layer_name` argument to cutlass_mla backend (vllm-project#19374) Signed-off-by: Pavani Majety <[email protected]> * [Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration (vllm-project#19383) Signed-off-by: Siyuan Liu <[email protected]> * [V1] Reuse V0's memory_profiling util for gpu worker memory profiling (vllm-project#19312) Signed-off-by: Ye (Charlotte) Qi <[email protected]> * [Bugfix] Fix benchmark_moe.py (vllm-project#19016) Signed-off-by: Tianyu Guo <[email protected]> * Use xla flag to improve the quantized model performance (vllm-project#19303) Signed-off-by: Xiongfei Wei <[email protected]> * Fix docs/mkdocs/hooks/remove_announcement.py (vllm-project#19382) * [Frontend] Add tqdm_leave_pbar to control progress bar visibility (vllm-project#19357) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Core] Use tuple for kv cache group block ids (vllm-project#19175) Signed-off-by: Nick Hill <[email protected]> * [Bugfix] Fix modelscope token passed in (vllm-project#19389) Signed-off-by: wangli <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> * [Core] Batch multi modal input using pinned memory (vllm-project#19169) Signed-off-by: Lukas Geiger <[email protected]> * Add security warning to bug report template (vllm-project#19365) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Copilot <[email protected]> * [Misc] refactor neuron_multimodal and profiling (vllm-project#19397) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * Add clear documentation around the impact of debugging flag 
(vllm-project#19369) Signed-off-by: Anna Pendleton <[email protected]> * Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. (vllm-project#17930) Signed-off-by: Tsai, Louie <[email protected]> Co-authored-by: Li, Jiang <[email protected]> * Revert "[v1] Add fp32 support to v1 engine through flex attn" (vllm-project#19404) * [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword `use_irope` (vllm-project#19134) Signed-off-by: Yunqiu Guo <[email protected]> * [BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral (vllm-project#19411) Signed-off-by: jiang.li <[email protected]> * Simplify ep kernels installation (vllm-project#19412) Signed-off-by: youkaichao <[email protected]> * [Misc] Slight improvement of the BNB (vllm-project#19418) Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Docs] Note that alternative structured output backends are supported (vllm-project#19426) Signed-off-by: Russell Bryant <[email protected]> * [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default (vllm-project#19440) Signed-off-by: Gregory Shtrasberg <[email protected]> * [Model] use AutoWeightsLoader for commandr (vllm-project#19399) Signed-off-by: py-andy-c <[email protected]> * Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 (vllm-project#19401) Signed-off-by: 许文卿 <[email protected]> * [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 (vllm-project#19390) Signed-off-by: rzou <[email protected]> * [New Model]: Support Qwen3 Embedding & Reranker (vllm-project#19260) * [BugFix] Fix docker build cpu-dev image error (vllm-project#19394) Signed-off-by: niu_he <[email protected]> * Fix test_max_model_len in tests/entrypoints/llm/test_generate.py (vllm-project#19451) Signed-off-by: Lu Fang <[email protected]> * [CI] Disable failing GGUF model test (vllm-project#19454) Signed-off-by: mgoin <[email protected]> * [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` (vllm-project#19422) Signed-off-by: Lukas Geiger <[email protected]> * Add fused MOE config for Qwen3 30B A3B on B200 (vllm-project#19455) Signed-off-by: Junhao Li <[email protected]> * Fix Typo in Documentation and Function Name (vllm-project#19442) * [ROCm] Add rules to automatically label ROCm related PRs (vllm-project#19405) Signed-off-by: Lu Fang <[email protected]> * [Kernel] Support deep_gemm for linear methods (vllm-project#19085) Signed-off-by: artetaout <[email protected]> * [Doc] Update V1 User Guide for Hardware and Models (vllm-project#19474) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Fix quantization link titles (vllm-project#19478) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Support "important" and "announcement" admonitions (vllm-project#19479) Signed-off-by: DarkLight1337 <[email protected]> * [Misc] Reduce warning message introduced in env_override (vllm-project#19476) Signed-off-by: Lu Fang <[email protected]> * Support non-string values in JSON keys from CLI (vllm-project#19471) Signed-off-by: DarkLight1337 <[email protected]> * Add cache to cuda get_device_capability (vllm-project#19436) Signed-off-by: mgoin <[email protected]> * Fix some typo (vllm-project#19475) Signed-off-by: ximing.wxm <[email protected]> Co-authored-by: ximing.wxm <[email protected]> * Support no privileged mode on CPU for docker and kubernetes deployments (vllm-project#19241) Signed-off-by: 
Tsai, Louie <[email protected]> * [Bugfix] Update the example code, make it work with the latest lmcache (vllm-project#19453) Signed-off-by: Runzhen Wang <[email protected]> * [CI] Update FlashInfer to 0.2.6.post1 (vllm-project#19297) Signed-off-by: mgoin <[email protected]> * [doc] fix "Other AI accelerators" getting started page (vllm-project#19457) Signed-off-by: David Xia <[email protected]> * [Misc] Fix misleading ROCm warning (vllm-project#19486) Signed-off-by: Jee Jee Li <[email protected]> * [Docs] Remove WIP features in V1 guide (vllm-project#19498) Signed-off-by: Woosuk Kwon <[email protected]> * [Kernels] Add activation chunking logic to FusedMoEModularKernel (vllm-project#19168) Signed-off-by: Bill Nell <[email protected]> * [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger (vllm-project#17331) Signed-off-by: Randall Smith <[email protected]> * [UX] Add Feedback During CUDAGraph Capture (vllm-project#19501) Signed-off-by: [email protected] <[email protected]> * [CI/Build] Fix torch nightly CI dependencies (vllm-project#19505) Signed-off-by: Richard Zou <[email protected]> * [CI] change spell checker from codespell to typos (vllm-project#18711) Signed-off-by: Andy Xie <[email protected]> * [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import (vllm-project#19514) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * Add Triton Fused MoE kernel config for E=16 on B200 (vllm-project#19518) Signed-off-by: Brayden Zhong <[email protected]> * [Frontend] Improve error message in tool_choice validation (vllm-project#19239) Signed-off-by: 22quinn <[email protected]> * [BugFix] Work-around incremental detokenization edge case error (vllm-project#19449) Signed-off-by: Nick Hill <[email protected]> * [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API (vllm-project#19522) Signed-off-by: strutive07 <[email protected]> * [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm (vllm-project#19509) Signed-off-by: Randall Smith <[email protected]> * Fix typo (vllm-project#19525) Signed-off-by: 2niuhe <[email protected]> * [Security] Prevent new imports of (cloud)pickle (vllm-project#18018) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Aaron Pham <[email protected]> * [Bugfix][V1] Allow manual FlashAttention for Blackwell (vllm-project#19492) Signed-off-by: mgoin <[email protected]> * [Bugfix] Respect num-gpu-blocks-override in v1 (vllm-project#19503) Signed-off-by: Jon Swenson <[email protected]> * [Quantization] Improve AWQ logic (vllm-project#19431) Signed-off-by: Jee Jee Li <[email protected]> * [Doc] Add V1 column to supported models list (vllm-project#19523) Signed-off-by: DarkLight1337 <[email protected]> * [V1][NixlConnector] Drop `num_blocks` check (vllm-project#19532) Signed-off-by: NickLucche <[email protected]> * [Perf] Vectorize static / dynamic INT8 quant kernels (vllm-project#19233) Signed-off-by: yewentao256 <[email protected]> * Fix TorchAOConfig skip layers (vllm-project#19265) Signed-off-by: mobicham <[email protected]> * [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass (vllm-project#16756) Signed-off-by: Luka Govedič <[email protected]> Co-authored-by: Sage Moore <[email protected]> * [doc] Make top navigation sticky (vllm-project#19540) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Spec Decode][Benchmark] 
Generalize spec decode offline benchmark to more methods and datasets (vllm-project#18847) * [Misc] Turn MOE_DP_CHUNK_SIZE into an env var (vllm-project#19506) * [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant (vllm-project#19452) Signed-off-by: mgoin <[email protected]> * [Doc] Unify structured outputs examples (vllm-project#18196) Signed-off-by: Aaron Pham <[email protected]> * [V1] Resolve failed concurrent structured output requests (vllm-project#19565) Signed-off-by: Russell Bryant <[email protected]> * Revert "[Build/CI] Add tracing deps to vllm container image (vllm-project#15224)" (vllm-project#19378) * [BugFix] : Fix Batched DeepGemm Experts (vllm-project#19515) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * [Bugfix] Fix EAGLE vocab embedding for multimodal target model (vllm-project#19570) Signed-off-by: qizixi <[email protected]> * [Doc] uses absolute links for structured outputs (vllm-project#19582) Signed-off-by: Aaron Pham <[email protected]> * [doc] fix incorrect link (vllm-project#19586) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Misc] Correct broken docs link (vllm-project#19553) Signed-off-by: Zerohertz <[email protected]> * [CPU] Refine default config for the CPU backend (vllm-project#19539) Signed-off-by: jiang1.li <[email protected]> * [Fix] bump mistral common to support magistral (vllm-project#19533) Signed-off-by: 汪志鹏 <[email protected]> * [Fix] The zip function in Python 3.9 does not have the strict argument (vllm-project#19549) Signed-off-by: 汪志鹏 <[email protected]> * use base version for version comparison (vllm-project#19587) Signed-off-by: Boyuan Feng <[email protected]> * [torch.compile] reorganize the cache directory to support compiling multiple models (vllm-project#19064) Signed-off-by: youkaichao <[email protected]> * [BugFix] Honor `enable_caching` in connector-delayed kvcache load case (vllm-project#19435) Signed-off-by: Nick Hill <[email protected]> * [Model] Fix minimax model cache & lm_head precision (vllm-project#19592) Signed-off-by: qingjun <[email protected]> * [Refactor] Remove unused variables in `moe_permute_unpermute_kernel.inl` (vllm-project#19573) Signed-off-by: yewentao256 <[email protected]> * [doc][mkdocs] fix the duplicate Supported features sections in GPU docs (vllm-project#19606) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [CUDA] Enable full cudagraph for FlashMLA (vllm-project#18581) Signed-off-by: luka <[email protected]> * [Doc] Add troubleshooting section to k8s deployment (vllm-project#19377) Signed-off-by: Anna Pendleton <[email protected]> * [torch.compile] Use custom ops when use_inductor=False (vllm-project#19618) * Adding "AMD: Multi-step Tests" to amdproduction. (vllm-project#19508) Signed-off-by: Yida Wu <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cyrus Leung <[email protected]> * [BugFix] Fix DP Coordinator incorrect debug log message (vllm-project#19624) Signed-off-by: Nick Hill <[email protected]> * [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. 
(vllm-project#18354) Signed-off-by: Saheli Bhattacharjee <[email protected]> * [Bugfix] Fix the speculative decoding test by setting the target dtype (vllm-project#19633) * [Misc] Modularize CLI Argument Parsing in Benchmark Scripts (vllm-project#19593) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix] Fix auto dtype casting for BatchFeature (vllm-project#19316) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Isotr0py <[email protected]> * [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization (vllm-project#19500) * Only build CUTLASS MoE kernels on Hopper (vllm-project#19648) * [Bugfix] Don't attempt to use triton if no driver is active (vllm-project#19561) * [Fix] Convert kv_transfer_config from dict to KVTransferConfig (vllm-project#19262) * [Perf] Further tunings for SM100 FP8 CUTLASS kernel (vllm-project#19566) * [Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness (vllm-project#19644) * [Kernel] Raise verbose error and consolidate `num_heads/num_kv_heads` divisibility check (vllm-project#19339) Signed-off-by: 22quinn <[email protected]> * [Benchmark] Refactor benchmark script for fp8 & int8 (vllm-project#19627) Signed-off-by: yewentao256 <[email protected]> * Enable prefix caching with full cuda graphs (vllm-project#19617) Signed-off-by: Woosuk Kwon <[email protected]> * [CI/Build] Fix torch nightly CI dependencies part 2 (vllm-project#19589) * [Misc] Remove duplicate multiproc method setting for CPU platform (vllm-project#19649) Signed-off-by: Isotr0py <[email protected]> * [MISC] Remove unused variableds in C++ (vllm-project#19609) Signed-off-by: Lu Fang <[email protected]> * [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker (vllm-project#18957) Signed-off-by: 刘全 <[email protected]> Co-authored-by: 刘全 <[email protected]> * [Misc][Frontend] passthrough `bad_words` (vllm-project#19564) Signed-off-by: Francesco Bertolotti <[email protected]> Co-authored-by: Francesco Bertolotti <[email protected]> Co-authored-by: Aaron Pham <[email protected]> * [Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config (vllm-project#19660) Signed-off-by: Ye (Charlotte) Qi <[email protected]> * [TPU] support attention head dim smaller than 128 (vllm-project#19620) Signed-off-by: Chengji Yao <[email protected]> Co-authored-by: mgoin <[email protected]> * [MISC] typo fix (vllm-project#19672) Signed-off-by: Andy Xie <[email protected]> * [CI] Add mteb testing for rerank models (vllm-project#19344) * [Docs] Move multiproc doc to v1 dir (vllm-project#19651) Signed-off-by: Russell Bryant <[email protected]> * [Kernel] GGUF MMVQ kernel for multiple input vectors (vllm-project#18754) Signed-off-by: SzymonOzog <[email protected]> * [BugFix] Don't catch BaseException when dumping execute_model errors (vllm-project#19626) Signed-off-by: Nick Hill <[email protected]> * [DOC] Add reasoning capability to vLLM streamlit code (vllm-project#19557) * [Feature]:Allow for Granite MoE Hybrid models with _only_ shared experts. 
(vllm-project#19652) Signed-off-by: Shawn Tan <[email protected]> * [Bugfix] Fix TP inference for Flex attention backend (vllm-project#19657) Signed-off-by: Isotr0py <[email protected]> * [MISC] bump huggingface_hub pkg to 0.33.0 (vllm-project#19547) Signed-off-by: Andy Xie <[email protected]> * [Bugfix] fix missing 'finish_reason': null in streaming chat (vllm-project#19662) Signed-off-by: chaunceyjiang <[email protected]> * [Kernels] Use empty for modular MoE workspaces (vllm-project#19667) Signed-off-by: Bill Nell <[email protected]> * [Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) (vllm-project#19677) Signed-off-by: QscQ <[email protected]> * [V1] Change return type on get_multimodal_embeddings() (vllm-project#19446) Signed-off-by: Russell Bryant <[email protected]> --------- Signed-off-by: youkaichao <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Signed-off-by: raushan <[email protected]> Signed-off-by: Lu Fang <[email protected]> Signed-off-by: nicklucche <[email protected]> Signed-off-by: googs1025 <[email protected]> Signed-off-by: simon-mo <[email protected]> Signed-off-by: reidliu41 <[email protected]> Signed-off-by: Varun <[email protected]> Signed-off-by: Yong Hoon Shin <[email protected]> Signed-off-by: mgoin <[email protected]> Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: chaunceyjiang <[email protected]> Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Lukas Geiger <[email protected]> Signed-off-by: calvin chen <[email protected]> Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: 汪志鹏 <[email protected]> Signed-off-by: Siyuan Liu <[email protected]> Signed-off-by: Seiji Eicher <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: 许文卿 <[email protected]> Signed-off-by: Jon Swenson <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: [email protected] <[email protected]> Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Yang Wang <[email protected]> Signed-off-by: vllmellm <[email protected]> Signed-off-by: 22quinn <[email protected]> Signed-off-by: Guillaume Calmettes <[email protected]> Signed-off-by: Patrick von Platen <[email protected]> Signed-off-by: Chiyue Wei <[email protected]> Signed-off-by: Povilas Kanapickas <[email protected]> Signed-off-by: Luis Vega <[email protected]> Signed-off-by: Jerry Zhang <[email protected]> Signed-off-by: Benjamin Chislett <[email protected]> Signed-off-by: Chengji Yao <[email protected]> Signed-off-by: Xu Song <[email protected]> Signed-off-by: Aaron Pham <[email protected]> Signed-off-by: Dipika Sikka <[email protected]> Signed-off-by: rzou <[email protected]> Signed-off-by: Siqi Yan <[email protected]> Signed-off-by: Nishidha Panpaliya <[email protected]> Signed-off-by: Md. Shafi Hussain <[email protected]> Signed-off-by: npanpaliya <[email protected]> Signed-off-by: Chenyaaang <[email protected]> Signed-off-by: Alexei V. 
Ivanov <[email protected]> Signed-off-by: ElizaWszola <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: Qiliang Cui <[email protected]> Signed-off-by: Aaruni Aggarwal <[email protected]> Signed-off-by: drisspg <[email protected]> Signed-off-by: Lifan Shen <[email protected]> Signed-off-by: pramkuma <[email protected]> Signed-off-by: luka <[email protected]> Signed-off-by: Richard Zou <[email protected]> Signed-off-by: Xu Wenqing <[email protected]> Signed-off-by: Akash Kaothalkar <[email protected]> Signed-off-by: yZhen <[email protected]> Signed-off-by: KsuParkhamchuk <[email protected]> Signed-off-by: cr7258 <[email protected]> Signed-off-by: Conroy Cheers <[email protected]> Signed-off-by: windsonsea <[email protected]> Signed-off-by: Yinghai Lu <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Kyle Sayers <[email protected]> Signed-off-by: liusiqian <[email protected]> Signed-off-by: Pavani Majety <[email protected]> Signed-off-by: Ye (Charlotte) Qi <[email protected]> Signed-off-by: Tianyu Guo <[email protected]> Signed-off-by: Xiongfei Wei <[email protected]> Signed-off-by: wangli <[email protected]> Signed-off-by: Anna Pendleton <[email protected]> Signed-off-by: Tsai, Louie <[email protected]> Signed-off-by: Yunqiu Guo <[email protected]> Signed-off-by: jiang.li <[email protected]> Signed-off-by: Gregory Shtrasberg <[email protected]> Signed-off-by: py-andy-c <[email protected]> Signed-off-by: niu_he <[email protected]> Signed-off-by: Junhao Li <[email protected]> Signed-off-by: artetaout <[email protected]> Signed-off-by: ximing.wxm <[email protected]> Signed-off-by: Runzhen Wang <[email protected]> Signed-off-by: David Xia <[email protected]> Signed-off-by: Bill Nell <[email protected]> Signed-off-by: Randall Smith <[email protected]> Signed-off-by: Andy Xie <[email protected]> Signed-off-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Brayden Zhong <[email protected]> Signed-off-by: strutive07 <[email protected]> Signed-off-by: 2niuhe <[email protected]> Signed-off-by: NickLucche <[email protected]> Signed-off-by: yewentao256 <[email protected]> Signed-off-by: mobicham <[email protected]> Signed-off-by: Luka Govedič <[email protected]> Signed-off-by: qizixi <[email protected]> Signed-off-by: Zerohertz <[email protected]> Signed-off-by: jiang1.li <[email protected]> Signed-off-by: Boyuan Feng <[email protected]> Signed-off-by: qingjun <[email protected]> Signed-off-by: Yida Wu <[email protected]> Signed-off-by: Saheli Bhattacharjee <[email protected]> Signed-off-by: 刘全 <[email protected]> Signed-off-by: Francesco Bertolotti <[email protected]> Signed-off-by: SzymonOzog <[email protected]> Signed-off-by: Shawn Tan <[email protected]> Signed-off-by: QscQ <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Co-authored-by: Raushan Turganbay <[email protected]> Co-authored-by: Lu Fang <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> Co-authored-by: CYJiang <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: SorenDreano <[email protected]> Co-authored-by: Soren Dreano <[email protected]> Co-authored-by: Reid <[email protected]> Co-authored-by: reidliu41 <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Yong Hoon Shin <[email 
protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Yikun Jiang <[email protected]> Co-authored-by: Chen Zhang <[email protected]> Co-authored-by: Ekagra Ranjan <[email protected]> Co-authored-by: Chauncey <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Yan Ru Pei <[email protected]> Co-authored-by: Jiaxin Shan <[email protected]> Co-authored-by: Russell Bryant <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Li, Jiang <[email protected]> Co-authored-by: Lukas Geiger <[email protected]> Co-authored-by: Vadim Gimpelson <[email protected]> Co-authored-by: Calvin Chen <[email protected]> Co-authored-by: Kaixi Hou <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: 汪志鹏 <[email protected]> Co-authored-by: Siyuan Liu <[email protected]> Co-authored-by: Seiji Eicher <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: wang.yuqi <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Xu Wenqing <[email protected]> Co-authored-by: Lain <[email protected]> Co-authored-by: jmswen <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: Kebe <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Yang Wang <[email protected]> Co-authored-by: Huy Do <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: vllmellm <[email protected]> Co-authored-by: 22quinn <[email protected]> Co-authored-by: Guillaume Calmettes <[email protected]> Co-authored-by: Patrick von Platen <[email protected]> Co-authored-by: Chiyue Wei <[email protected]> Co-authored-by: Chiyue Wei <[email protected]> Co-authored-by: Povilas Kanapickas <[email protected]> Co-authored-by: Dipika Sikka <[email protected]> Co-authored-by: Luis Vega <[email protected]> Co-authored-by: Luis Vega <[email protected]> Co-authored-by: Jerry Zhang <[email protected]> Co-authored-by: Benjamin Chislett <[email protected]> Co-authored-by: Chengji Yao <[email protected]> Co-authored-by: Xu Song <[email protected]> Co-authored-by: Aaron Pham <[email protected]> Co-authored-by: Jinghui Zhang <[email protected]> Co-authored-by: jinghui <[email protected]> Co-authored-by: Richard Zou <[email protected]> Co-authored-by: Siqi Yan <[email protected]> Co-authored-by: Siqi Yan <[email protected]> Co-authored-by: Yu Guo <[email protected]> Co-authored-by: Nishidha <[email protected]> Co-authored-by: Md. 
Shafi Hussain <[email protected]> Co-authored-by: Adolfo Victoria <[email protected]> Co-authored-by: Adolfo Victoria <[email protected]> Co-authored-by: Chenyaaang <[email protected]> Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]> Co-authored-by: ElizaWszola <[email protected]> Co-authored-by: QiliangCui <[email protected]> Co-authored-by: Aaruni Aggarwal <[email protected]> Co-authored-by: Driss Guessous <[email protected]> Co-authored-by: Lifans <[email protected]> Co-authored-by: pramenku <[email protected]> Co-authored-by: Luka Govedič <[email protected]> Co-authored-by: Akash kaothalkar <[email protected]> Co-authored-by: Akash Kaothalkar <[email protected]> Co-authored-by: jennyyyyzhen <[email protected]> Co-authored-by: yZhen <[email protected]> Co-authored-by: Kseniya Parkhamchuk <[email protected]> Co-authored-by: Se7en <[email protected]> Co-authored-by: Conroy Cheers <[email protected]> Co-authored-by: Michael Yao <[email protected]> Co-authored-by: Yinghai Lu <[email protected]> Co-authored-by: Kyle Sayers <[email protected]> Co-authored-by: liusiqian-tal <[email protected]> Co-authored-by: Pavani Majety <[email protected]> Co-authored-by: Ye (Charlotte) Qi <[email protected]> Co-authored-by: Tianyu Guo <[email protected]> Co-authored-by: XiongfeiWei <[email protected]> Co-authored-by: Li Wang <[email protected]> Co-authored-by: Copilot <[email protected]> Co-authored-by: Anna Pendleton <[email protected]> Co-authored-by: Louie Tsai <[email protected]> Co-authored-by: Li, Jiang <[email protected]> Co-authored-by: Rachel Guo <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: py-andy-c <[email protected]> Co-authored-by: niu_he <[email protected]> Co-authored-by: Junhao Li <[email protected]> Co-authored-by: leopardracer <[email protected]> Co-authored-by: artetaout <[email protected]> Co-authored-by: Ximingwang-09 <[email protected]> Co-authored-by: ximing.wxm <[email protected]> Co-authored-by: runzhen <[email protected]> Co-authored-by: David Xia <[email protected]> Co-authored-by: bnellnm <[email protected]> Co-authored-by: rasmith <[email protected]> Co-authored-by: Ning Xie <[email protected]> Co-authored-by: Brayden Zhong <[email protected]> Co-authored-by: wonjun Jang <[email protected]> Co-authored-by: Aaron Pham <[email protected]> Co-authored-by: Wentao Ye <[email protected]> Co-authored-by: mobicham <[email protected]> Co-authored-by: Sage Moore <[email protected]> Co-authored-by: kourosh hakhamaneshi <[email protected]> Co-authored-by: qizixi <[email protected]> Co-authored-by: Hyogeun Oh (오효근) <[email protected]> Co-authored-by: Boyuan Feng <[email protected]> Co-authored-by: qscqesze <[email protected]> Co-authored-by: Concurrensee <[email protected]> Co-authored-by: Saheli Bhattacharjee <[email protected]> Co-authored-by: jiahanc <[email protected]> Co-authored-by: Konrad Zawora <[email protected]> Co-authored-by: maobaolong <[email protected]> Co-authored-by: Ilya Markov <[email protected]> Co-authored-by: quanliu <[email protected]> Co-authored-by: 刘全 <[email protected]> Co-authored-by: Francesco Bertolotti <[email protected]> Co-authored-by: Francesco Bertolotti <[email protected]> Co-authored-by: Szymon Ożóg <[email protected]> Co-authored-by: Navanit Dubey <[email protected]> Co-authored-by: Shawn Tan <[email protected]> Co-authored-by: qscqesze <[email protected]>
1 parent 67b1083 commit 9d404cc


163 files changed: +3827 additions, -539 deletions


.github/mergify.yml

Lines changed: 20 additions & 0 deletions
@@ -65,6 +65,26 @@ pull_request_rules:
       add:
         - multi-modality
 
+- name: label-rocm
+  description: Automatically apply rocm label
+  conditions:
+    - or:
+      - files~=^csrc/rocm/
+      - files~=^docker/Dockerfile.rocm
+      - files~=^requirements/rocm.*\.txt
+      - files~=^vllm/attention/backends/rocm.*\.py
+      - files~=^vllm/attention/ops/rocm.*\.py
+      - files~=^vllm/model_executor/layers/fused_moe/rocm.*\.py
+      - files~=^vllm/v1/attention/backends/mla/rocm.*\.py
+      - files~=^tests/kernels/.*_rocm.*\.py
+      - files=vllm/platforms/rocm.py
+      - title~=(?i)AMD
+      - title~=(?i)ROCm
+  actions:
+    label:
+      add:
+        - rocm
+
 - name: label-structured-output
   description: Automatically apply structured-output label
   conditions:
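
The two title~= conditions are regular-expression matches, and the inline (?i) flag makes them case-insensitive, so a PR titled, say, "[Bugfix] fix rocm build" picks up the rocm label even if it touches none of the listed paths. A rough illustration of that matching behaviour, sketched with Python's re module (Mergify evaluates these conditions server-side; the helper below is purely illustrative):

import re

# Title patterns copied from the rule above; (?i) makes each match case-insensitive.
TITLE_PATTERNS = [r"(?i)AMD", r"(?i)ROCm"]

def title_triggers_rocm_label(title: str) -> bool:
    # Mergify's ~= operator is a regex match, approximated here with re.search.
    return any(re.search(p, title) for p in TITLE_PATTERNS)

print(title_triggers_rocm_label("[Bugfix] fix rocm build on MI300"))  # True
print(title_triggers_rocm_label("[CPU] refine default config"))       # False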

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -200,5 +200,5 @@ benchmarks/**/*.json
 actionlint
 shellcheck*/
 
-# Ingore moe/marlin_moe gen code
+# Ignore moe/marlin_moe gen code
 csrc/moe/marlin_moe_wna16/kernel_*

benchmarks/benchmark_long_document_qa_throughput.py

Lines changed: 7 additions & 1 deletion
@@ -142,7 +142,7 @@ def main(args):
     )
 
 
-if __name__ == "__main__":
+def create_argument_parser():
     parser = FlexibleArgumentParser(
         description="Benchmark the performance with or "
         "without automatic prefix caching."
@@ -192,5 +192,11 @@ def main(args):
     )
 
     parser = EngineArgs.add_cli_args(parser)
+
+    return parser
+
+
+if __name__ == "__main__":
+    parser = create_argument_parser()
     args = parser.parse_args()
     main(args)
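
This script and the other benchmark scripts below all receive the same refactor from "[Misc] Modularize CLI Argument Parsing in Benchmark Scripts" (vllm-project#19593): parser construction moves out of the if __name__ == "__main__": block into a create_argument_parser() helper, so tooling can build the parser without executing the benchmark. A minimal sketch of the kind of reuse this enables (hypothetical usage; it assumes the script's directory is on sys.path and that a model and GPU are available when main() is eventually called):

# Hypothetical reuse of the refactored module; run from the benchmarks/ directory.
from benchmark_long_document_qa_throughput import create_argument_parser, main

parser = create_argument_parser()
# Parse a canned argument list instead of sys.argv; --model comes from EngineArgs.add_cli_args().
args = parser.parse_args(["--model", "meta-llama/Llama-3.1-8B-Instruct"])
main(args)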

benchmarks/benchmark_prefix_caching.py

Lines changed: 7 additions & 1 deletion
@@ -218,7 +218,7 @@ def main(args):
     )
 
 
-if __name__ == "__main__":
+def create_argument_parser():
     parser = FlexibleArgumentParser(
         description="Benchmark the performance with or without "
         "automatic prefix caching."
@@ -268,5 +268,11 @@ def main(args):
     )
 
     parser = EngineArgs.add_cli_args(parser)
+
+    return parser
+
+
+if __name__ == "__main__":
+    parser = create_argument_parser()
     args = parser.parse_args()
     main(args)

benchmarks/benchmark_prioritization.py

Lines changed: 7 additions & 1 deletion
@@ -161,7 +161,7 @@ def main(args: argparse.Namespace):
             json.dump(results, f, indent=4)
 
 
-if __name__ == "__main__":
+def create_argument_parser():
     parser = FlexibleArgumentParser(description="Benchmark the throughput.")
     parser.add_argument(
         "--backend", type=str, choices=["vllm", "hf", "mii"], default="vllm"
@@ -204,6 +204,12 @@ def main(args: argparse.Namespace):
     )
 
     parser = EngineArgs.add_cli_args(parser)
+
+    return parser
+
+
+if __name__ == "__main__":
+    parser = create_argument_parser()
     args = parser.parse_args()
     if args.tokenizer is None:
         args.tokenizer = args.model

benchmarks/benchmark_serving.py

Lines changed: 6 additions & 2 deletions
@@ -875,7 +875,7 @@ def main(args: argparse.Namespace):
         save_to_pytorch_benchmark_format(args, result_json, file_name)
 
 
-if __name__ == "__main__":
+def create_argument_parser():
     parser = FlexibleArgumentParser(
         description="Benchmark the online serving throughput."
     )
@@ -1225,6 +1225,10 @@ def main(args: argparse.Namespace):
         "script chooses a LoRA module at random.",
     )
 
-    args = parser.parse_args()
+    return parser
+
 
+if __name__ == "__main__":
+    parser = create_argument_parser()
+    args = parser.parse_args()
     main(args)

benchmarks/benchmark_serving_structured_output.py

Lines changed: 6 additions & 1 deletion
@@ -850,7 +850,7 @@ def main(args: argparse.Namespace):
             json.dump(results, outfile, indent=4)
 
 
-if __name__ == "__main__":
+def create_argument_parser():
     parser = FlexibleArgumentParser(
         description="Benchmark the online serving throughput."
     )
@@ -1034,5 +1034,10 @@ def main(args: argparse.Namespace):
         help="Ratio of Structured Outputs requests",
     )
 
+    return parser
+
+
+if __name__ == "__main__":
+    parser = create_argument_parser()
     args = parser.parse_args()
     main(args)

benchmarks/benchmark_throughput.py

Lines changed: 7 additions & 1 deletion
@@ -595,7 +595,7 @@ def validate_args(args):
         )
 
 
-if __name__ == "__main__":
+def create_argument_parser():
     parser = FlexibleArgumentParser(description="Benchmark the throughput.")
     parser.add_argument(
         "--backend",
@@ -717,6 +717,12 @@ def validate_args(args):
     )
 
     parser = AsyncEngineArgs.add_cli_args(parser)
+
+    return parser
+
+
+if __name__ == "__main__":
+    parser = create_argument_parser()
     args = parser.parse_args()
     if args.tokenizer is None:
         args.tokenizer = args.model

benchmarks/kernels/bench_int8_gemm.py

Lines changed: 169 additions & 0 deletions
@@ -0,0 +1,169 @@ (new file; all 169 lines added)

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import argparse
import copy
import itertools

import torch
from weight_shapes import WEIGHT_SHAPES

from vllm._custom_ops import cutlass_scaled_mm as vllm_scaled_mm
from vllm._custom_ops import scaled_int8_quant as vllm_scaled_int8_quant
from vllm.triton_utils import triton

PROVIDER_CFGS = {
    "torch-bf16": dict(enabled=True),
    "int8-tensor-w-token-a": dict(
        w="tensor", a="token", no_a_quant=False, enabled=False
    ),
    "int8-tensor-w-tensor-a": dict(
        w="tensor", a="tensor", no_a_quant=False, enabled=True
    ),
    "int8-channel-w-token-a": dict(
        w="channel", a="token", no_a_quant=False, enabled=True
    ),
    "int8-channel-w-tensor-a": dict(
        w="channel", a="tensor", no_a_quant=False, enabled=False
    ),
    "int8-tensor-w-token-a-noquant": dict(
        w="tensor", a="token", no_a_quant=True, enabled=False
    ),
    "int8-tensor-w-tensor-a-noquant": dict(
        w="tensor", a="tensor", no_a_quant=True, enabled=True
    ),
    "int8-channel-w-token-a-noquant": dict(
        w="channel", a="token", no_a_quant=True, enabled=True
    ),
    "int8-channel-w-tensor-a-noquant": dict(
        w="channel", a="tensor", no_a_quant=True, enabled=False
    ),
}


def _quant_weight(b, w_type, device):
    if w_type == "tensor":
        scale_b = torch.ones(1, device=device, dtype=torch.float32)
        b_int8, scale_b_int8, _ = vllm_scaled_int8_quant(b, scale_b)
        assert scale_b_int8.numel() == 1
    else:  # channel
        b_int8, scale_b_int8, _ = vllm_scaled_int8_quant(b)
        assert scale_b_int8.numel() == b.shape[0]
    return b_int8.t(), scale_b_int8


def build_int8_runner(cfg, a, b, dtype, device):
    # quant before running the kernel
    b_int8, scale_b_int8 = _quant_weight(b, cfg["w"], device)

    scale_a_const = None
    if cfg["a"] == "tensor":
        scale_a_const = torch.ones(1, device=device, dtype=torch.float32)

    # no quant, create activation ahead
    if cfg["no_a_quant"]:
        if cfg["a"] == "tensor":
            a_int8, scale_a_int8, _ = vllm_scaled_int8_quant(a, scale_a_const)
        else:  # token
            a_int8, scale_a_int8, _ = vllm_scaled_int8_quant(a)

        def run_quant():
            return vllm_scaled_mm(a_int8, b_int8, scale_a_int8, scale_b_int8, dtype)

        return run_quant

    # dynamic quant, create activation inside
    if cfg["a"] == "tensor":

        def run_quant():
            a_int8, scale_a_int8, _ = vllm_scaled_int8_quant(a, scale_a_const)
            return vllm_scaled_mm(a_int8, b_int8, scale_a_int8, scale_b_int8, dtype)

    else:  # token

        def run_quant():
            a_int8, scale_a_int8, _ = vllm_scaled_int8_quant(a)
            return vllm_scaled_mm(a_int8, b_int8, scale_a_int8, scale_b_int8, dtype)

    return run_quant


_enabled = [k for k, v in PROVIDER_CFGS.items() if v.get("enabled")]


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["batch_size"],
        x_vals=[1, 16, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384],
        x_log=False,
        line_arg="provider",
        line_vals=_enabled,
        line_names=[k for k in _enabled],
        ylabel="TFLOP/s (larger is better)",
        plot_name="BF16 vs INT8 GEMMs",
        args={},
    )
)
def benchmark(batch_size, provider, N, K):
    M = batch_size
    device = "cuda"
    dtype = torch.bfloat16
    a = torch.randn((M, K), device=device, dtype=dtype)
    b = torch.randn((N, K), device=device, dtype=dtype)

    quantiles = [0.5, 0.2, 0.8]

    if provider == "torch-bf16":
        ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(
            lambda: torch.nn.functional.linear(a, b), quantiles=quantiles
        )
    else:
        cfg = PROVIDER_CFGS[provider]
        run_quant = build_int8_runner(cfg, a, b, dtype, device)
        ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(
            lambda: run_quant(), quantiles=quantiles
        )

    to_tflops = lambda t_ms: (2 * M * N * K) * 1e-12 / (t_ms * 1e-3)
    return to_tflops(ms), to_tflops(max_ms), to_tflops(min_ms)


def prepare_shapes(args):
    KN_model_names = []
    for model, tp_size in itertools.product(args.models, args.tp_sizes):
        for KN, tp_dim in copy.deepcopy(WEIGHT_SHAPES[model]):
            KN[tp_dim] //= tp_size
            KN.append(model)
            KN_model_names.append(KN)
    return KN_model_names


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--models",
        nargs="+",
        type=str,
        default=["meta-llama/Llama-3.1-8B-Instruct"],
        choices=list(WEIGHT_SHAPES.keys()),
        help="List of models to benchmark",
    )
    parser.add_argument(
        "--tp-sizes",
        nargs="+",
        type=int,
        default=[1],
        help="List of tensor parallel sizes",
    )
    args = parser.parse_args()

    for K, N, model in prepare_shapes(args):
        print(f"{model}, N={N} K={K}, BF16 vs INT8 GEMMs TFLOP/s:")
        benchmark.run(
            print_data=True,
            show_plots=True,
            save_path=f"bench_int8_res_n{N}_k{K}",
            N=N,
            K=K,
        )

    print("Benchmark finished!")

csrc/attention/paged_attention_v1.cu

Lines changed: 1 addition & 4 deletions
@@ -65,9 +65,6 @@ void paged_attention_v1_launcher(
   int kv_block_stride = key_cache.stride(0);
   int kv_head_stride = key_cache.stride(1);
 
-  [[maybe_unused]] int thread_group_size = MAX(WARP_SIZE / BLOCK_SIZE, 1);
-  assert(head_size % thread_group_size == 0);
-
   // NOTE: alibi_slopes is optional.
   const float* alibi_slopes_ptr =
       alibi_slopes
@@ -193,4 +190,4 @@
 #undef WARP_SIZE
 #undef MAX
 #undef MIN
-#undef DIVIDE_ROUND_UP
+#undef DIVIDE_ROUND_UP
(only the end-of-file newline differs on the last line)
