[Bugfix][Nixl] Fix full prefix cache hit bug #18632
base: main
Conversation
Signed-off-by: [email protected] <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Signed-off-by: [email protected] <[email protected]>
@njhill - can you let me know if this works okay with multi-connector?
```diff
+# If remote_blocks and num_external_tokens = 0, we have
+# a full prefix cache hit on the D worker. We need to call
+# send_notif in _read_blocks to free the memory on the P.
+local_block_ids = (blocks.get_unhashed_block_ids()
+                   if num_external_tokens > 0 else [])
 # Get unhashed blocks to pull from remote.
 self._reqs_need_recv[request.request_id] = (
-    request, blocks.get_unhashed_block_ids())
+    request, local_block_ids)
```
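For context, a minimal sketch of the receive-side behavior this comment describes. `send_notif` is mentioned in the diff above; everything else (function and helper names, signatures) is an illustrative assumption, not the PR's actual code:

```python
def read_blocks_or_notify(connector, request_id: str, dst_engine_id: str,
                          local_block_ids: list[int]) -> None:
    """Sketch of the _read_blocks fix: even with no blocks to pull,
    the prefill worker must still be told to free its copy."""
    if not local_block_ids:
        # Full prefix cache hit on the decode (D) worker: nothing to
        # transfer, but the prefill (P) worker still holds the blocks.
        connector.send_notif(dst_engine_id, request_id)  # assumed helper
        return
    # Normal path: issue NIXL reads for the unhashed local blocks.
    connector.start_reads(request_id, local_block_ids)  # assumed helper
```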
@robertgshaw2-redhat I'm still not sure that this part or the change to always call `update_state_after_alloc` is needed. I'd already added logic for this case in `get_num_new_matched_tokens` above:

vllm/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py, lines 215 to 222 in f203673:

```python
# NOTE: if count is 0 here, we have less than block_size
# tokens to pull after subtracting the local prefix cache hit.
# The remote only sends fully computed blocks, so there is
# nothing to transfer but we still need to notify the
# prefill worker so that the remote blocks are freed.
if all(p in params for p in ("remote_engine_id", "remote_host",
                             "remote_port")):
    self._reqs_need_recv[request.request_id] = (request, [])
```
I can see that the other two fixes below in `build_connector_meta` and `_read_blocks` are of course needed though.

If you think it's better to have this logic in this method then we can remove it from the other one. But again I feel it's logically clearer to not call `update_state_after_alloc` if 0 was returned from `get_num_new_matched_tokens`.
I think that `get_num_new_matched_tokens` should be a pure function. Adding a side effect to it is surprising given the name of the method, and we get different behavior depending on whether or not the request is able to be scheduled. This issue is actually causing a bug right now:

- If `allocate_slots` returns None, the request will remain in the waiting queue. This will cause us to add the request to `reqs_need_recv` more than once, and as a result we will call `_read_blocks` twice, which will do a double free on the P worker side. Similarly, this will happen if the request is preempted (it will get re-added to waiting). This is because we are not properly updating the request to have `do_remote_prefill=False` when it is added to `reqs_need_recv` from the `get_num_new_matched_tokens` function.

This is all just evidence that putting a side effect into this function is not a good idea. `update_state_after_alloc` is where we should handle everything related to `reqs_need_recv`, so we have a single place where all the logic is handled.
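To make the failure mode concrete, here is a simplified, hypothetical sketch of the scheduling flow. `allocate_slots`, `get_num_new_matched_tokens`, and `update_state_after_alloc` are taken from the discussion above, but the loop structure is an illustrative assumption, not the actual vLLM scheduler:

```python
from collections import deque

def schedule_waiting(waiting: deque, connector, kv_cache_manager) -> None:
    while waiting:
        request = waiting[0]
        # BUG (pre-fix): this call had a side effect that appended the
        # request to _reqs_need_recv, so it fires on *every* attempt.
        count, _load_async = connector.get_num_new_matched_tokens(
            request, request.num_computed_tokens)
        new_blocks = kv_cache_manager.allocate_slots(request, count)
        if new_blocks is None:
            # Request stays in waiting; the next pass repeats the side
            # effect, _read_blocks then runs twice, and the P worker's
            # blocks are freed twice.
            break
        connector.update_state_after_alloc(request, new_blocks, count)
        waiting.popleft()
```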
I removed those lines from `get_num_new_matched_tokens`.
@robertgshaw2-redhat that makes sense, I agree about the pure function thing. I did also notice that this could result in a double free on the P worker side in the case that it can't be scheduled, which isn't ideal (though I think it would probably be harmless).

But to me, thinking from the POV of a generic connector interface, it still feels a bit odd given the connector isn't offering any tokens. I guess we should very clearly document the semantics and expectations for the interface.

A related quirk is that in the async load case, I think `update_state_after_alloc` will currently be called twice for a request (a second time once the request moves out of `WAITING_FOR_REMOTE_KVS`).
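Given those concerns, the interface contract being argued for could be documented roughly as follows. `KVConnectorBase_V1` matches the v1 connector module referenced above, but the docstring wording is illustrative, not from the PR:

```python
class KVConnectorBase_V1:
    def get_num_new_matched_tokens(self, request, num_computed_tokens):
        """Pure query: report how many tokens the connector can supply
        beyond the local prefix cache hit. Must not mutate connector
        state; may be called repeatedly for the same request."""
        raise NotImplementedError

    def update_state_after_alloc(self, request, blocks,
                                 num_external_tokens):
        """Single mutation point, called after allocate_slots succeeds.
        All _reqs_need_recv bookkeeping belongs here."""
        raise NotImplementedError
```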
Signed-off-by: [email protected] <[email protected]>
```diff
@@ -212,15 +212,6 @@ def get_num_new_matched_tokens(
         if count > 0:
             return count, True

-        # NOTE: if count is 0 here, we have less than block_size
```
This is now handled in `update_state_after_alloc`.
@robertgshaw2-redhat changes will be needed to multi-connector too; I've pushed them to a branch, feel free to pull into this PR: njhill@4150a41
LGTM, with the multi-connector changes
SUMMARY: when there is a full prefix cache hit on the D worker, we never call `send_notif` to free the blocks on the P worker, since we skip calling `update_state_after_alloc`. This PR fixes that.
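Putting the threads of the review together, a hedged sketch of what the fixed `update_state_after_alloc` could look like (reconstructed from the diff and comments above, not copied from the PR):

```python
def update_state_after_alloc(self, request, blocks,
                             num_external_tokens: int) -> None:
    params = request.kv_transfer_params
    if not params or not params.get("do_remote_prefill"):
        return
    if all(p in params for p in ("remote_engine_id", "remote_host",
                                 "remote_port")):
        # num_external_tokens == 0 means a full prefix cache hit on the
        # D worker: record an empty block list so _read_blocks still
        # sends the notification that frees the blocks on the P worker.
        local_block_ids = (blocks.get_unhashed_block_ids()
                           if num_external_tokens > 0 else [])
        self._reqs_need_recv[request.request_id] = (request,
                                                    local_block_ids)
    # Clear the flag so a preempted or re-scheduled request is not
    # added to _reqs_need_recv again (avoiding the double free).
    params["do_remote_prefill"] = False
```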