[Perf] Overlap workspace_buffer.fill_(0) with compute in MLA attention #32973
base: main
Conversation
Move workspace_buffer.fill_(0) for TRT-LLM ragged attention to run in a separate CUDA stream (aux_stream) so it can overlap with the compute operations that precede the attention kernel:
- gather_and_maybe_dequant_cache (or cp_gather_cache for context parallel)
- kv_b_proj
- _concat_k_nope_k_pe

The fill operation is launched at the start of each loop iteration in _compute_prefill_context and _context_parallel_compute_prefill_context, allowing it to execute concurrently with these compute operations on the main stream. The main stream then waits for completion right before the attention kernel uses the workspace buffer.

This optimization only affects the TRT-LLM ragged DeepSeek prefill path. Other attention backends are unaffected.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
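For readers unfamiliar with the pattern, here is a minimal sketch of the stream/event mechanics described above, using PyTorch's public CUDA APIs. The free functions and module-level stream/event are simplified stand-ins for the PR's `_start_workspace_fill_async` / `_wait_workspace_fill` methods, not the actual implementation:

```python
import torch

# Sketch only: in the PR these live on the MLA attention backend.
aux_stream = torch.cuda.Stream()
fill_done = torch.cuda.Event()

def start_workspace_fill_async(workspace_buffer: torch.Tensor) -> None:
    # Enqueue the zero-fill on the aux stream so it can overlap with the
    # gather / kv_b_proj / concat work enqueued on the main stream.
    with torch.cuda.stream(aux_stream):
        workspace_buffer.fill_(0)
    fill_done.record(aux_stream)

def wait_workspace_fill() -> None:
    # Called right before the attention kernel: the main stream blocks
    # (on-device, not on the host) until the aux-stream fill has finished.
    torch.cuda.current_stream().wait_event(fill_done)
```

The key point is that `wait_event` orders the work on the GPU without a host-side synchronization, so the overlap costs nothing on the CPU side.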
You should try including the output buffer creation too, since that is also zero-initialized.
Code Review
This pull request introduces a performance optimization for the TRT-LLM ragged DeepSeek prefill path in MLA attention. The change moves the workspace_buffer.fill_(0) operation to a separate CUDA stream, allowing it to overlap with preceding compute operations. This is achieved by introducing _start_workspace_fill_async and _wait_workspace_fill methods, which use a CUDA event for synchronization between the auxiliary stream and the main computation stream. The implementation is clean, includes a fallback for non-CUDA environments, and correctly places the synchronization points to maximize overlap. The changes are well-contained and only affect the intended prefill path. Overall, this is a solid performance improvement.
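The review also mentions a fallback for non-CUDA environments; a plausible (hypothetical) shape for such a guard is simply to fill synchronously when no aux stream is available:

```python
import torch
from typing import Optional

def start_workspace_fill(
    workspace_buffer: torch.Tensor,
    aux_stream: Optional["torch.cuda.Stream"],
    fill_done: Optional["torch.cuda.Event"],
) -> None:
    if aux_stream is None or not torch.cuda.is_available():
        # No second stream to overlap with: plain synchronous zero-fill.
        workspace_buffer.fill_(0)
        return
    with torch.cuda.stream(aux_stream):
        workspace_buffer.fill_(0)
    fill_done.record(aux_stream)
```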
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
Comment @cursor review or bugbot run to trigger another review on this PR
for i in range(iters):
    # Start workspace buffer fill in aux stream to overlap with
    # gather_and_maybe_dequant_cache, kv_b_proj, and _concat_k_nope_k_pe
    self._start_workspace_fill_async(prefill_metadata)
Race condition: aux stream overwrites buffer during kernel execution
High Severity
The _start_workspace_fill_async() call at the start of each loop iteration can cause the aux stream to write zeros to workspace_buffer while the previous iteration's attention kernel is still reading from it. The aux stream proceeds immediately after its previous fill completes, but there's no synchronization ensuring the main stream has finished using the buffer. Since CUDA streams execute independently, the fill for iteration i+1 and the attention kernel for iteration i can run concurrently, corrupting the workspace data and causing incorrect attention outputs.
Additional Locations (1)
for i in range(iters):
    # Start workspace buffer fill in aux stream to overlap with
    # gather_and_maybe_dequant_cache, kv_b_proj, and _concat_k_nope_k_pe
    self._start_workspace_fill_async(prefill_metadata)
Race between new tokens kernel and first context fill
High Severity
When has_context is true, _run_prefill_new_tokens_trtllm_ragged uses workspace_buffer on the main stream, then _compute_prefill_context immediately calls _start_workspace_fill_async() which queues a fill on the aux stream. Since the "new tokens" attention kernel may still be executing on the main stream when the aux stream starts the fill, both operations access workspace_buffer concurrently without synchronization. This is the same root cause as the inter-iteration race but affects a different code path.
Additional Locations (1)
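Both reports boil down to the same missing ordering: nothing tells the aux stream that the main stream is done with workspace_buffer before the next fill starts. A common fix for this class of race (a sketch under that assumption, not code from this PR) is a second event, recorded on the main stream after every kernel that uses the buffer, which the aux stream waits on before zeroing:

```python
import torch

buffer_free = torch.cuda.Event()  # hypothetical extra event

def start_workspace_fill_async(workspace_buffer, aux_stream, fill_done):
    with torch.cuda.stream(aux_stream):
        # Wait until the main stream has signalled that the previous user of
        # the buffer (attention or new-tokens kernel) has finished. Waiting
        # on a never-recorded event is a no-op, so the first fill is fine.
        aux_stream.wait_event(buffer_free)
        workspace_buffer.fill_(0)
    fill_done.record(aux_stream)

def mark_workspace_free() -> None:
    # Call on the main stream right after enqueueing any kernel that reads
    # or writes workspace_buffer.
    buffer_free.record(torch.cuda.current_stream())
```

Recording the event after both the per-iteration attention kernel and the new-tokens kernel would cover the inter-iteration and has_context cases alike.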
Add a new `separate_stream` context manager that provides a cleaner interface for running operations on a separate CUDA stream with optional deferred synchronization. This allows overlapping work on the main stream while operations run in the background, with explicit control over when to synchronize.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
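A minimal sketch of what such a context manager might look like, built on the torch CUDA stream/event APIs; the signature, the `defer_sync` flag, and the yielded event are guesses based on the commit message, not the actual PR code:

```python
import contextlib
import torch

@contextlib.contextmanager
def separate_stream(stream: torch.cuda.Stream, defer_sync: bool = False):
    """Run the enclosed ops on `stream`, optionally deferring synchronization."""
    done = torch.cuda.Event()
    with torch.cuda.stream(stream):
        yield done  # caller may keep the event to synchronize later
        done.record(stream)
    if not defer_sync:
        # Default: order the main stream after the background work right away.
        torch.cuda.current_stream().wait_event(done)

# Example usage with deferred synchronization (overlap until wait_event):
#
#   with separate_stream(aux_stream, defer_sync=True) as done:
#       workspace_buffer.fill_(0)
#   ...main-stream compute...
#   torch.cuda.current_stream().wait_event(done)
```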
Move workspace_buffer.fill_(0) for TRT-LLM ragged attention to run in a separate CUDA stream (aux_stream) so it can overlap with the compute operations that precede the attention kernel:
- gather_and_maybe_dequant_cache (or cp_gather_cache for context parallel)
- kv_b_proj
- _concat_k_nope_k_pe
The fill operation is launched at the start of each loop iteration in _compute_prefill_context and _context_parallel_compute_prefill_context, allowing it to execute concurrently with these compute operations on the main stream. The main stream then waits for completion right before the attention kernel uses the workspace buffer.
This optimization only affects the TRT-LLM ragged DeepSeek prefill path. Other attention backends are unaffected.
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.