
Conversation

@robertgshaw2-redhat (Collaborator) commented Jan 23, 2026

Move workspace_buffer.fill_(0) for TRT-LLM ragged attention to run in a separate CUDA stream (aux_stream) so it can overlap with the compute operations that precede the attention kernel:

  • gather_and_maybe_dequant_cache (or cp_gather_cache for context parallel)
  • kv_b_proj
  • _concat_k_nope_k_pe

The fill operation is launched at the start of each loop iteration in _compute_prefill_context and _context_parallel_compute_prefill_context, allowing it to execute concurrently with these compute operations on the main stream. The main stream then waits for completion right before the attention kernel uses the workspace buffer.

This optimization only affects the TRT-LLM ragged DeepSeek prefill path. Other attention backends are unaffected.
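For illustration, here is a minimal, self-contained sketch of the overlap pattern described above. The compute function, buffer sizes, and iteration count are placeholders, not the actual vLLM MLA code:

import torch

# Placeholder for the main-stream work (gather_and_maybe_dequant_cache,
# kv_b_proj, _concat_k_nope_k_pe) that does not touch workspace_buffer.
def main_stream_compute(x):
    return x @ x.t()

workspace_buffer = torch.empty(1 << 22, device="cuda")
x = torch.randn(2048, 2048, device="cuda")

aux_stream = torch.cuda.Stream()
fill_done = torch.cuda.Event()
main_stream = torch.cuda.current_stream()

for _ in range(4):  # stands in for the prefill-context chunk loop
    # Enqueue the zero-fill on the aux stream; it overlaps with the
    # compute ops enqueued on the main stream below.
    with torch.cuda.stream(aux_stream):
        workspace_buffer.fill_(0)
        fill_done.record()

    kv = main_stream_compute(x)  # main stream keeps working meanwhile

    # The main stream waits for the fill only right before the attention
    # kernel that consumes workspace_buffer would be launched.
    main_stream.wait_event(fill_done)
    # ... TRT-LLM ragged attention kernel using workspace_buffer here ...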

Purpose

Test Plan

Test Result


@mgoin (Member) commented Jan 23, 2026

You should try including the output buffer creation too, since that is also zero-initialized.
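If that suggestion were adopted, the aux-stream block from the sketch above might also zero the output tensor. This is illustrative only, reusing the names from the earlier sketch; the shape is a dummy, the real allocation site differs, and caching-allocator stream semantics would need care in real code:

output = torch.empty(2048, 16, 512, device="cuda")  # dummy shape; the real one comes from the prefill metadata
with torch.cuda.stream(aux_stream):
    workspace_buffer.fill_(0)
    # Zero-init the attention output on the aux stream too, rather than
    # paying for a torch.zeros() allocation on the main stream.
    output.fill_(0)
    fill_done.record()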

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces a performance optimization for the TRT-LLM ragged DeepSeek prefill path in MLA attention. The change moves the workspace_buffer.fill_(0) operation to a separate CUDA stream, allowing it to overlap with preceding compute operations. This is achieved by introducing _start_workspace_fill_async and _wait_workspace_fill methods, which use a CUDA event for synchronization between the auxiliary stream and the main computation stream. The implementation is clean, includes a fallback for non-CUDA environments, and correctly places the synchronization points to maximize overlap. The changes are well-contained and only affect the intended prefill path. Overall, this is a solid performance improvement.
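Based on this description, the two helpers presumably look roughly like the following sketch. The real methods take prefill_metadata and live on the MLA backend class; the names and signatures here are simplified and not the actual vLLM code:

import torch

class WorkspaceFillOverlap:
    # Sketch of the event-based helpers described above.

    def __init__(self, workspace_buffer):
        self.workspace_buffer = workspace_buffer
        self.use_cuda = torch.cuda.is_available()
        self.aux_stream = torch.cuda.Stream() if self.use_cuda else None
        self.fill_event = torch.cuda.Event() if self.use_cuda else None

    def _start_workspace_fill_async(self):
        if not self.use_cuda:
            # Fallback for non-CUDA environments: fill synchronously.
            self.workspace_buffer.fill_(0)
            return
        # Enqueue the zero-fill on the aux stream and mark its completion.
        with torch.cuda.stream(self.aux_stream):
            self.workspace_buffer.fill_(0)
            self.fill_event.record()

    def _wait_workspace_fill(self):
        if not self.use_cuda:
            return
        # The main stream blocks here, right before the attention kernel
        # reads the freshly zeroed workspace buffer.
        torch.cuda.current_stream().wait_event(self.fill_event)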

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


for i in range(iters):
    # Start workspace buffer fill in aux stream to overlap with
    # gather_and_maybe_dequant_cache, kv_b_proj, and _concat_k_nope_k_pe
    self._start_workspace_fill_async(prefill_metadata)

Race condition: aux stream overwrites buffer during kernel execution

High Severity

The _start_workspace_fill_async() call at the start of each loop iteration can cause the aux stream to write zeros to workspace_buffer while the previous iteration's attention kernel is still reading from it. The aux stream proceeds immediately after its previous fill completes, but there is no synchronization ensuring the main stream has finished using the buffer. Since CUDA streams execute independently, the fill for iteration i+1 and the attention kernel for iteration i can run concurrently, corrupting the workspace data and producing incorrect attention outputs.

Additional Locations (1)

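One way to close this race (a sketch only, not necessarily how the PR ends up addressing it, and reusing the names from the earlier sketches) is a second event recorded on the main stream once the kernel that consumes workspace_buffer has been enqueued; the aux stream then waits on it before filling:

buffer_free = torch.cuda.Event()

# Main stream, end of iteration i: the attention kernel that uses
# workspace_buffer has just been enqueued.
buffer_free.record()

# Aux stream, start of iteration i+1: wait until that use has finished
# before overwriting the buffer with zeros.
aux_stream.wait_event(buffer_free)
with torch.cuda.stream(aux_stream):
    workspace_buffer.fill_(0)
    fill_done.record()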

for i in range(iters):
    # Start workspace buffer fill in aux stream to overlap with
    # gather_and_maybe_dequant_cache, kv_b_proj, and _concat_k_nope_k_pe
    self._start_workspace_fill_async(prefill_metadata)

Race between new tokens kernel and first context fill

High Severity

When has_context is true, _run_prefill_new_tokens_trtllm_ragged uses workspace_buffer on the main stream, then _compute_prefill_context immediately calls _start_workspace_fill_async() which queues a fill on the aux stream. Since the "new tokens" attention kernel may still be executing on the main stream when the aux stream starts the fill, both operations access workspace_buffer concurrently without synchronization. This is the same root cause as the inter-iteration race but affects a different code path.

Additional Locations (1)

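The same event would cover this path too: record it on the main stream right after the new-tokens kernel is enqueued, so the first fill inside _compute_prefill_context waits on it. A sketch reusing the names above, where run_new_tokens_attention is a stand-in for _run_prefill_new_tokens_trtllm_ragged:

def run_new_tokens_attention(workspace_buffer):
    # Stand-in for _run_prefill_new_tokens_trtllm_ragged, which reads and
    # writes workspace_buffer on the main stream.
    workspace_buffer.add_(1)

run_new_tokens_attention(workspace_buffer)
buffer_free.record()  # main stream: new-tokens use of the buffer enqueued
# The first _start_workspace_fill_async() in the context loop would then
# issue aux_stream.wait_event(buffer_free) before filling, as above.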

Add a new `separate_stream` context manager that provides a cleaner
interface for running operations on a separate CUDA stream with optional
deferred synchronization. This allows overlapping work on the main stream
while operations run in the background, with explicit control over when
to synchronize.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Signed-off-by: Robert Shaw <robshaw@redhat.com>
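A sketch of what a separate_stream context manager with deferred synchronization might look like, based only on the commit message above; the actual interface added in this PR may differ:

from contextlib import contextmanager
import torch

@contextmanager
def separate_stream(stream, defer_sync=False):
    # Ops inside the block run on `stream`; an event is recorded when the
    # block exits. With defer_sync=False the current (main) stream waits
    # immediately; with defer_sync=True the caller waits on the yielded
    # event whenever it actually needs the results.
    event = torch.cuda.Event()
    with torch.cuda.stream(stream):
        yield event
        event.record()
    if not defer_sync:
        torch.cuda.current_stream().wait_event(event)

# Usage: overlap the workspace zero-fill with main-stream compute.
aux = torch.cuda.Stream()
workspace = torch.empty(1 << 22, device="cuda")
with separate_stream(aux, defer_sync=True) as fill_done:
    workspace.fill_(0)
# ... main-stream ops that do not touch `workspace` run here ...
torch.cuda.current_stream().wait_event(fill_done)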