
[0.9.1][Bugfix] fix oom issue in mla and enable mla_pa for deepseek mla decode #1311

Merged: 6 commits into vllm-project:v0.9.1-dev on Jun 22, 2025

Conversation

ganyi1996ppo
Collaborator

What this PR does / why we need it?

After the disaggregated PD feature was merged, the KV cache for DeepSeek is split into two independent buffers for KV transfer and computation. However, the current kernel, paged_attention_mla, can only accept k_cache as a single parameter, which forces us to concatenate the two pieces of KV cache before the attention call and thus incurs a memory peak inside attention in eager mode. In this PR we introduce torch_npu.atb.npu_multi_head_latent_attention for the MLA decode path, which will become the default path for both eager mode and ACL graph once the related torch_npu is publicly available. Since it is still a restricted package, we add VLLM_ASCEND_MLA_PA to control its usage. This flag will be removed in the future.
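
For illustration, here is a minimal, hedged sketch of the decode-path dispatch this description implies. Everything below is an assumption made for readability: the kernel wrappers, their argument lists, and the treatment of `VLLM_ASCEND_MLA_PA` as a truthy env flag are placeholders, not the real torch_npu signatures or the actual vllm-ascend code.

```python
import os
import torch

def _mla_pa_kernel(q_nope, q_pe, nope_cache, rope_cache, attn_metadata):
    # Placeholder for torch_npu.atb.npu_multi_head_latent_attention: the real kernel
    # consumes the two KV-cache pieces separately, so no concatenated buffer is needed.
    raise NotImplementedError("requires the restricted torch_npu build")

def _paged_attention_mla_kernel(q_nope, q_pe, combined_cache, attn_metadata):
    # Placeholder for the existing paged_attention_mla kernel, which only accepts a
    # single k_cache tensor.
    raise NotImplementedError("requires torch_npu")

def mla_decode(q_nope, q_pe, kv_cache, attn_metadata):
    """kv_cache is assumed to be a pair (nope_cache, rope_cache) after the
    disaggregated-PD change; all names here are illustrative."""
    if os.environ.get("VLLM_ASCEND_MLA_PA"):
        # New path: hand both cache pieces to the MLA paged-attention kernel directly.
        return _mla_pa_kernel(q_nope, q_pe, kv_cache[0], kv_cache[1], attn_metadata)
    # Fallback path: concatenating the two pieces allocates a transient buffer roughly
    # as large as the whole KV cache, which is the eager-mode memory peak this PR fixes.
    combined_cache = torch.cat([kv_cache[0], kv_cache[1]], dim=-1)
    return _paged_attention_mla_kernel(q_nope, q_pe, combined_cache, attn_metadata)
```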

Does this PR introduce any user-facing change?

Yes, this PR adds a new flag named VLLM_ASCEND_MLA_PA, but it will be removed eventually once the newest torch_npu is released.
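
A hedged usage sketch for the flag (treating `"1"` as "enabled" is an assumption; the model name is taken from the CI test referenced below, and actually serving it needs a multi-NPU setup):

```python
import os

# Assumed: VLLM_ASCEND_MLA_PA is read as an environment flag at engine start,
# so set it before constructing the engine.
os.environ["VLLM_ASCEND_MLA_PA"] = "1"

from vllm import LLM

llm = LLM(model="deepseek-ai/DeepSeek-V3", trust_remote_code=True)
```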

How was this patch tested?

@ganyi1996ppo
Collaborator Author

I'll cherry-pick this PR back to the main branch after it's tested and merged.

@@ -1193,10 +1207,11 @@ def forward(
decode_k_nope, decode_k_pe,
kv_cache, attn_metadata)
else:
combined_cache = torch.cat([kv_cache[0], kv_cache[1]], dim=-1)
Collaborator
@MengqingCao MengqingCao Jun 20, 2025

It seems the OOM issue is mainly caused by this combined_cache; is removing it enough for this PR? If so, maybe we could add npu_multi_head_latent_attention once torch_npu is officially available.

Collaborator Author

Removing it is not enough in this case because, as you can see, paged_attention_mla can only receive the concatenated cache as its input parameter; without npu_multi_head_latent_attention the concatenation seems inevitable.
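
For intuition on the peak, here is a rough arithmetic sketch. All sizes are illustrative assumptions (the block count is picked so the result lands near the 11.70 GiB allocation in the CI log below); nothing is measured from the real model or config.

```python
# Illustrative back-of-the-envelope estimate of the transient allocation caused by
# torch.cat; none of these numbers come from the actual deployment.
num_blocks, block_size = 85_000, 128   # assumed KV-cache shape
nope_dim, rope_dim = 512, 64           # assumed DeepSeek MLA latent / rope dims
dtype_bytes = 2                        # bf16

tokens = num_blocks * block_size
nope_gib = tokens * nope_dim * dtype_bytes / 2**30
rope_gib = tokens * rope_dim * dtype_bytes / 2**30

# torch.cat builds a brand-new buffer the size of both pieces combined while the
# originals stay alive, so the extra transient memory is their sum.
print(f"extra transient allocation: ~{nope_gib + rope_gib:.2f} GiB")  # ~11.67 GiB
```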

@ganyi1996ppo
Collaborator Author

The DBO test seems to OOM in CI:

Processed prompts: 0%| | 0/41 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Process EngineCore_0:
Traceback (most recent call last):
File "/usr/local/python3.10.17/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/python3.10.17/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/engine/core.py", line 519, in run_engine_core
raise e
File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/engine/core.py", line 508, in run_engine_core
engine_core.run_busy_loop()
File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/engine/core.py", line 535, in run_busy_loop
self._process_engine_step()
File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/engine/core.py", line 560, in _process_engine_step
outputs, model_executed = self.step_fn()
File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/engine/core.py", line 231, in step
model_output = self.execute_model(scheduler_output)
File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/engine/core.py", line 217, in execute_model
raise err
File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/engine/core.py", line 211, in execute_model
return self.model_executor.execute_model(scheduler_output)
File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/executor/multiproc_executor.py", line 163, in execute_model
(output, ) = self.collective_rpc("execute_model",
File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/executor/multiproc_executor.py", line 220, in collective_rpc
result = get_response(w, dequeue_timeout)
File "/__w/vllm-ascend/vllm-ascend/vllm-empty/vllm/v1/executor/multiproc_executor.py", line 207, in get_response
raise RuntimeError(
RuntimeError: Worker failed with error 'NPU out of memory. Tried to allocate 11.70 GiB (NPU 0; 29.50 GiB total capacity; 25.09 GiB already allocated; 25.09 GiB current active; 3.49 GiB free; 25.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.', please check the stack trace above for the root cause
FAILED

=================================== FAILURES ===================================
____________________ test_models_distributed_DeepSeekV3_dbo ____________________

@Yikun Yikun mentioned this pull request Jun 20, 2025
@wangxiyuan wangxiyuan changed the title fix oom issue in mla and enable mla_pa for deepseek mla decode [0.9.1]fix oom issue in mla and enable mla_pa for deepseek mla decode Jun 20, 2025
@github-actions github-actions bot added documentation Improvements or additions to documentation module:quantization labels Jun 20, 2025
@ganyi1996ppo ganyi1996ppo changed the title [0.9.1]fix oom issue in mla and enable mla_pa for deepseek mla decode [0.9.1][Bugfix] fix oom issue in mla and enable mla_pa for deepseek mla decode Jun 21, 2025
@ganyi1996ppo ganyi1996ppo force-pushed the ganyi/fix_oom branch 2 times, most recently from b4439dd to 71ec2b4 Compare June 22, 2025 01:52
@ganyi1996ppo ganyi1996ppo merged commit 30ac3d8 into vllm-project:v0.9.1-dev Jun 22, 2025
17 checks passed
Labels: documentation (Improvements or additions to documentation), module:core, module:quantization