[CUDA] Enable full cudagraph for FlashMLA #18581
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Could you give some details on the speedup associated with this modification?
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/v1/worker/gpu_model_runner.py (Outdated)
direct_call = has_prefill(attn_metadata) and self.full_cuda_graph
if direct_call:
    # Skip the outer model layer as inner model is compiled
    model_output = self.model.model.forward(
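For reference, a minimal sketch of what the has_prefill check above could look like; the num_prefill_tokens field is an assumption for illustration, not necessarily what the PR's attention metadata actually exposes:

```python
def has_prefill(attn_metadata) -> bool:
    # Hypothetical helper: the batch contains prefill work whenever the
    # attention metadata reports a non-zero number of prefill tokens.
    return attn_metadata is not None and getattr(
        attn_metadata, "num_prefill_tokens", 0) > 0
```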
I'm thinking about setting a context to bypass the Inductor backend directly here, but I guess we could also separately compile (and capture) the prefill stage - any thoughts?
cc @youkaichao
This is hacky and depends on the fact that self.model.model is the underlying nn.Module we compile (which might not be the case if we are running multi-modality models).

I think we can have some fields in the forward context, like not_use_cudagraph, and set it to True only when we capture full cudagraph but get prefill data in the current batch. Then, when we decide to replay the cudagraph, we can check this field.
It should just be changing this line:
    entry.cudagraph.replay()
to fall back to
    return entry.runnable(*args)
when that field is set.
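A minimal sketch of that suggestion, using hypothetical names (ForwardContext, not_use_cudagraph, and CUDAGraphEntry are illustrative stand-ins, not vLLM's actual forward-context or cudagraph-wrapper code):

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ForwardContext:
    # Set to True when full cudagraph is enabled but the current batch
    # contains prefill tokens, so the captured graph must not be replayed.
    not_use_cudagraph: bool = False


_forward_context = ForwardContext()


@contextmanager
def set_forward_context(not_use_cudagraph: bool):
    # Toggle the flag for the duration of one model invocation.
    prev = _forward_context.not_use_cudagraph
    _forward_context.not_use_cudagraph = not_use_cudagraph
    try:
        yield
    finally:
        _forward_context.not_use_cudagraph = prev


@dataclass
class CUDAGraphEntry:
    runnable: Callable[..., Any]      # eager/compiled callable to fall back to
    cudagraph: Optional[Any] = None   # captured torch.cuda.CUDAGraph
    output: Any = None                # static output buffer filled by replay


def call_or_replay(entry: CUDAGraphEntry, *args):
    # The one-line change suggested above: consult the flag before replaying.
    if _forward_context.not_use_cudagraph or entry.cudagraph is None:
        return entry.runnable(*args)
    entry.cudagraph.replay()
    return entry.output
```

The model runner would then wrap mixed prefill/decode batches in set_forward_context(not_use_cudagraph=True) instead of reaching into self.model.model directly.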
I haven't profiled this specifically, but it's meant to enable the double-batch-overlap optimization (prototype in #18415).
Force-pushed from d5c7a35 to c794889
Force-pushed from c794889 to 80f20ce
Force-pushed from 976e852 to 40e7248
This pull request has merge conflicts that must be resolved before it can be merged.
Hi, any further progress on this PR?
Signed-off-by: luka <[email protected]>
Signed-off-by: luka <[email protected]>
Almost ready for review!
Force-pushed from 5e3f7ab to 30562a2
Signed-off-by: luka <[email protected]>
Force-pushed from 30562a2 to 9ea599d
Currently experiencing some issues when batching (in the unit test); need to investigate further.
This pull request has merge conflicts that must be resolved before it can be merged.
Enable full cudagraph capture for the FlashMLA decode case.
Hacks:
Tested with:
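For reference, a minimal usage sketch; it assumes the pre-existing CompilationConfig.full_cuda_graph flag is what gates this path and uses a DeepSeek MLA model purely as an example, not the PR's actual test setup:

```python
from vllm import LLM, SamplingParams
from vllm.config import CompilationConfig

# full_cuda_graph=True requests capture of the whole forward pass
# (including attention) as a single CUDA graph for decode batches.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # example MLA model, not from the PR
    compilation_config=CompilationConfig(full_cuda_graph=True),
)

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```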