[perf] Use direct copy (broadcast) instead of cat for k_nope/k_pe in MLA prefill #29710
+30
−3
Purpose
Reduce the latency of the k tensor concatenation from 3.16 ms to 1.61 ms at batch size 32768 (i.e., k.shape=torch.Size([32768, 128, 192]), k_nope.shape=torch.Size([32768, 128, 128]), k_pe.shape=torch.Size([32768, 1, 64])).

Test Plan
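As a reference for what is being measured, the cat-vs-direct-copy change can be sketched and checked for numerical equivalence as below. This is a minimal illustration with reduced shapes, not the exact code in this PR; all variable names are assumptions.

```python
import torch

# Illustrative shapes; the PR's real case is B=32768, H=128.
B, H, D_nope, D_pe = 32, 128, 128, 64

k_nope = torch.randn(B, H, D_nope)
k_pe = torch.randn(B, 1, D_pe)

# Baseline: expand k_pe across the head dimension, then concatenate,
# which allocates the output inside the cat kernel.
k_cat = torch.cat([k_nope, k_pe.expand(B, H, D_pe)], dim=-1)

# Direct copy: preallocate the output once and write each part in place.
# The k_pe slice assignment broadcasts (B, 1, D_pe) over the head
# dimension with no explicit expand.
k = k_nope.new_empty(B, H, D_nope + D_pe)
k[..., :D_nope] = k_nope
k[..., D_nope:] = k_pe

assert torch.equal(k, k_cat)
```

Both paths produce bit-identical output; the direct-copy version separates the allocation from the two writes, which is what the latency numbers above compare.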
Test Result
local-completions (model=nvidia/DeepSeek-R1-0528-FP4,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=32), gen_kwargs: (None), limit: 100.0, num_fewshot: None, batch_size: 1
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.