
@minosfuture (Contributor) commented on Nov 29, 2025

Purpose

Reduce k tensor concatenation latency from 3.16 ms to 1.61 ms for batch size 32768 (i.e., k.shape=torch.Size([32768, 128, 192]), k_nope.shape=torch.Size([32768, 128, 128]), k_pe.shape=torch.Size([32768, 1, 64])).
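
The fix preallocates the output tensor and fills it with a plain slice copy plus a broadcasting assignment, rather than calling torch.cat on an expanded view of k_pe. Below is a minimal sketch of the idea, assuming the shapes above (the method name follows the review comment further down; the exact signature in the real code may differ):

```python
import torch

def _concat_k_nope_k_pe(k_nope: torch.Tensor, k_pe: torch.Tensor) -> torch.Tensor:
    """Build k = [k_nope | k_pe] along the last dim without torch.cat.

    k_nope: [num_tokens, num_heads, nope_dim], e.g. [32768, 128, 128]
    k_pe:   [num_tokens, 1, pe_dim],           e.g. [32768, 1, 64]
    returns [num_tokens, num_heads, nope_dim + pe_dim], e.g. [32768, 128, 192]
    """
    num_tokens, num_heads, nope_dim = k_nope.shape
    pe_dim = k_pe.shape[-1]
    # Allocate the contiguous output buffer once.
    k = k_nope.new_empty((num_tokens, num_heads, nope_dim + pe_dim))
    # Contiguous slice copy for the non-positional part.
    k[..., :nope_dim] = k_nope
    # This assignment broadcasts k_pe's singleton head dim across all
    # heads, so no expanded, non-contiguous intermediate is created.
    k[..., nope_dim:] = k_pe
    return k
```

Both writes go straight into the final buffer, so the expanded k_pe never exists as a separate non-contiguous tensor that torch.cat would have to gather from.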

Test Plan

  • Verified the performance improvement with a profiler trace.
  • Verified accuracy with lm-eval on gsm8k (results below).

Test Result

`local-completions (model=nvidia/DeepSeek-R1-0528-FP4, base_url=http://127.0.0.1:8000/v1/completions, num_concurrent=32), gen_kwargs: (None), limit: 100.0, num_fewshot: None, batch_size: 1`

| Tasks | Version | Filter           | n-shot | Metric      | Value |   | Stderr |
|-------|--------:|------------------|-------:|-------------|------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match |  0.96 | ± | 0.0197 |
|       |         | strict-match     |      5 | exact_match |  0.96 | ± | 0.0197 |

mergify bot added the v1 label on Nov 29, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request introduces a performance optimization for MLA prefill by replacing torch.cat with a more efficient direct copy and broadcast mechanism in a new _concat_k_nope_k_pe method. This change correctly avoids the overhead of concatenating expanded, non-contiguous tensors and is applied consistently in _compute_prefill_context, _context_parallel_compute_prefill_context, and _forward_prefill. The implementation is clean and well-documented, leading to the significant latency reduction shown in the PR description. The change looks good.
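
For contrast, the "expanded, non-contiguous tensors" the review refers to suggest the replaced pattern was roughly the following (a reconstruction, not the exact removed code):

```python
# Slow path: expand() yields a non-contiguous view of k_pe, and
# torch.cat over that view falls off the fast contiguous-copy path.
k = torch.cat([k_nope, k_pe.expand(-1, k_nope.shape[1], -1)], dim=-1)
```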
