[Perf] Overlap workspace_buffer.fill_(0) with compute in MLA attention #32973
base: main
Conversation
Move workspace_buffer.fill_(0) for TRT-LLM ragged attention to run in a separate CUDA stream (aux_stream) so it can overlap with the compute operations that precede the attention kernel:
- gather_and_maybe_dequant_cache (or cp_gather_cache for context parallel)
- kv_b_proj
- _concat_k_nope_k_pe

The fill operation is launched at the start of each loop iteration in _compute_prefill_context and _context_parallel_compute_prefill_context, allowing it to execute concurrently with these compute operations on the main stream. The main stream then waits for completion right before the attention kernel uses the workspace buffer.

This optimization only affects the TRT-LLM ragged DeepSeek prefill path. Other attention backends are unaffected.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
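For readers unfamiliar with the pattern, here is a minimal sketch of the stream/event mechanics described above, using PyTorch's public CUDA APIs. The free functions and module-level stream/event are simplified stand-ins for the PR's `_start_workspace_fill_async` / `_wait_workspace_fill` methods, not the actual implementation:

```python
import torch

# Sketch only: in the PR these live on the MLA attention backend.
aux_stream = torch.cuda.Stream()
fill_done = torch.cuda.Event()

def start_workspace_fill_async(workspace_buffer: torch.Tensor) -> None:
    # Enqueue the zero-fill on the aux stream so it can overlap with the
    # gather / kv_b_proj / concat work enqueued on the main stream.
    with torch.cuda.stream(aux_stream):
        workspace_buffer.fill_(0)
    fill_done.record(aux_stream)

def wait_workspace_fill() -> None:
    # Called right before the attention kernel: the main stream blocks
    # (on-device, not on the host) until the aux-stream fill has finished.
    torch.cuda.current_stream().wait_event(fill_done)
```

The key point is that `wait_event` orders the work on the GPU without a host-side synchronization, so the overlap costs nothing on the CPU side.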
You should try including the output buffer creation too, since that is also zero-initialized.
Code Review
This pull request introduces a performance optimization for the TRT-LLM ragged DeepSeek prefill path in MLA attention. The change moves the workspace_buffer.fill_(0) operation to a separate CUDA stream, allowing it to overlap with preceding compute operations. This is achieved by introducing _start_workspace_fill_async and _wait_workspace_fill methods, which use a CUDA event for synchronization between the auxiliary stream and the main computation stream. The implementation is clean, includes a fallback for non-CUDA environments, and correctly places the synchronization points to maximize overlap. The changes are well-contained and only affect the intended prefill path. Overall, this is a solid performance improvement.
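The review also mentions a fallback for non-CUDA environments; a plausible (hypothetical) shape for such a guard is simply to fill synchronously when no aux stream is available:

```python
import torch
from typing import Optional

def start_workspace_fill(
    workspace_buffer: torch.Tensor,
    aux_stream: Optional["torch.cuda.Stream"],
    fill_done: Optional["torch.cuda.Event"],
) -> None:
    if aux_stream is None or not torch.cuda.is_available():
        # No second stream to overlap with: plain synchronous zero-fill.
        workspace_buffer.fill_(0)
        return
    with torch.cuda.stream(aux_stream):
        workspace_buffer.fill_(0)
    fill_done.record(aux_stream)
```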
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
Comment @cursor review or bugbot run to trigger another review on this PR
for i in range(iters):
    # Start workspace buffer fill in aux stream to overlap with
    # gather_and_maybe_dequant_cache, kv_b_proj, and _concat_k_nope_k_pe
    self._start_workspace_fill_async(prefill_metadata)
Race condition: aux stream overwrites buffer during kernel execution
High Severity
The _start_workspace_fill_async() call at the start of each loop iteration can cause the aux stream to write zeros to workspace_buffer while the previous iteration's attention kernel is still reading from it. The aux stream proceeds immediately after its previous fill completes, but there's no synchronization ensuring the main stream has finished using the buffer. Since CUDA streams execute independently, the fill for iteration i+1 and the attention kernel for iteration i can run concurrently, corrupting the workspace data and causing incorrect attention outputs.
Additional Locations (1)
for i in range(iters):
    # Start workspace buffer fill in aux stream to overlap with
    # gather_and_maybe_dequant_cache, kv_b_proj, and _concat_k_nope_k_pe
    self._start_workspace_fill_async(prefill_metadata)
Race between new tokens kernel and first context fill
High Severity
When has_context is true, _run_prefill_new_tokens_trtllm_ragged uses workspace_buffer on the main stream, then _compute_prefill_context immediately calls _start_workspace_fill_async() which queues a fill on the aux stream. Since the "new tokens" attention kernel may still be executing on the main stream when the aux stream starts the fill, both operations access workspace_buffer concurrently without synchronization. This is the same root cause as the inter-iteration race but affects a different code path.
Additional Locations (1)
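Both reports boil down to the same missing ordering: nothing tells the aux stream that the main stream is done with workspace_buffer before the next fill starts. A common fix for this class of race (a sketch under that assumption, not code from this PR) is a second event, recorded on the main stream after every kernel that uses the buffer, which the aux stream waits on before zeroing:

```python
import torch

buffer_free = torch.cuda.Event()  # hypothetical extra event

def start_workspace_fill_async(workspace_buffer, aux_stream, fill_done):
    with torch.cuda.stream(aux_stream):
        # Wait until the main stream has signalled that the previous user of
        # the buffer (attention or new-tokens kernel) has finished. Waiting
        # on a never-recorded event is a no-op, so the first fill is fine.
        aux_stream.wait_event(buffer_free)
        workspace_buffer.fill_(0)
    fill_done.record(aux_stream)

def mark_workspace_free() -> None:
    # Call on the main stream right after enqueueing any kernel that reads
    # or writes workspace_buffer.
    buffer_free.record(torch.cuda.current_stream())
```

Recording the event after both the per-iteration attention kernel and the new-tokens kernel would cover the inter-iteration and has_context cases alike.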
Add a new `separate_stream` context manager that provides a cleaner interface for running operations on a separate CUDA stream with optional deferred synchronization. This allows overlapping work on the main stream while operations run in the background, with explicit control over when to synchronize.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
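A minimal sketch of what such a context manager might look like, built on the torch CUDA stream/event APIs; the signature, the `defer_sync` flag, and the yielded event are guesses based on the commit message, not the actual PR code:

```python
import contextlib
import torch

@contextlib.contextmanager
def separate_stream(stream: torch.cuda.Stream, defer_sync: bool = False):
    """Run the enclosed ops on `stream`, optionally deferring synchronization."""
    done = torch.cuda.Event()
    with torch.cuda.stream(stream):
        yield done  # caller may keep the event to synchronize later
        done.record(stream)
    if not defer_sync:
        # Default: order the main stream after the background work right away.
        torch.cuda.current_stream().wait_event(done)

# Example usage with deferred synchronization (overlap until wait_event):
#
#   with separate_stream(aux_stream, defer_sync=True) as done:
#       workspace_buffer.fill_(0)
#   ...main-stream compute...
#   torch.cuda.current_stream().wait_event(done)
```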
Move workspace_buffer.fill_(0) for TRT-LLM ragged attention to run in a separate CUDA stream (aux_stream) so it can overlap with the compute operations that precede the attention kernel:
- gather_and_maybe_dequant_cache (or cp_gather_cache for context parallel)
- kv_b_proj
- _concat_k_nope_k_pe
The fill operation is launched at the start of each loop iteration in _compute_prefill_context and _context_parallel_compute_prefill_context, allowing it to execute concurrently with these compute operations on the main stream. The main stream then waits for completion right before the attention kernel uses the workspace buffer.
This optimization only affects the TRT-LLM ragged DeepSeek prefill path. Other attention backends are unaffected.
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.