[0.9.1][Bugfix] fix oom issue in mla and enable mla_pa for deepseek mla decode #1311
Conversation
I'll cherry-pick this PR back to the main branch after it's tested and merged.
@@ -1193,10 +1207,11 @@ def forward(
                decode_k_nope, decode_k_pe,
                kv_cache, attn_metadata)
        else:
            combined_cache = torch.cat([kv_cache[0], kv_cache[1]], dim=-1)
It seems the oom issue is mainly caused by this combined_cache. Is removing it enough for this PR? If so, maybe we could add npu_multi_head_latent_attention once the torch-npu is officially available.
Removing it is not enough in this case because, as you can see, paged_attention_mla can only receive the concatenated cache as its input parameter; without npu_multi_head_latent_attention, the cat seems inevitable.
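For context, here is a minimal plain-PyTorch sketch of why the concat creates the peak (shapes and helper names are assumed for illustration, not the vllm-ascend code): torch.cat materializes a brand-new buffer holding both cache pieces, so cache memory is roughly doubled while combined_cache is alive.

```python
import torch

# Assumed shapes, for illustration only.
num_blocks, block_size, num_heads = 64, 128, 1
kv_lora_rank, rope_dim = 512, 64

# The two independent kv cache pieces left after the disaggregated PD change.
nope_cache = torch.zeros(num_blocks, block_size, num_heads, kv_lora_rank)
rope_cache = torch.zeros(num_blocks, block_size, num_heads, rope_dim)

# paged_attention_mla path: the kernel wants a single k_cache, so the two
# pieces must be concatenated. torch.cat copies both into a freshly allocated
# buffer, so peak cache memory roughly doubles while `combined_cache` exists.
combined_cache = torch.cat([nope_cache, rope_cache], dim=-1)

# A kernel that accepts both pieces separately (what this PR enables via
# npu_multi_head_latent_attention) avoids that copy; this stand-in only shows
# that no extra allocation is needed to hand the pieces over.
def decode_with_split_cache(nope: torch.Tensor, rope: torch.Tensor) -> None:
    # Hypothetical stand-in for the split-cache kernel call.
    assert nope.shape[-1] == kv_lora_rank and rope.shape[-1] == rope_dim

decode_with_split_cache(nope_cache, rope_cache)
```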
The dbo test seems to OOM in CI:

Processed prompts: 0%| | 0/41 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
=================================== FAILURES ===================================
What this PR does / why we need it?
After the disaggregated PD was merged, the kv cache for DeepSeek becomes two independent buffers for kv transfer and computation. However, the current kernel, paged_attention_mla, can only accept k_cache as a single parameter, which forces us to concatenate the two kv cache pieces before the attention call and therefore incurs a memory peak inside the attention in eager mode. In this PR we introduce torch_npu.atb.npu_multi_head_latent_attention for the mla decode path, which will become the default path for both eager mode and aclgraph once the related torch_npu is publicly available. Since it is still a restricted package, we add VLLM_ASCEND_MLA_PA to control its usage. This flag will be removed in the future.
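Conceptually, the gate works like the sketch below; this is a hedged illustration, not the actual vllm-ascend code, and select_decode_cache plus the cache tensor names are hypothetical.

```python
import os

import torch

# Hypothetical illustration of how a VLLM_ASCEND_MLA_PA-style switch could
# select the decode kernel; not the actual vllm-ascend implementation.
USE_MLA_PA = os.environ.get("VLLM_ASCEND_MLA_PA", "0") == "1"

def select_decode_cache(nope_cache: torch.Tensor, rope_cache: torch.Tensor):
    """Return the kernel name and the cache tensor(s) it would consume."""
    if USE_MLA_PA:
        # npu_multi_head_latent_attention takes the two cache pieces as-is,
        # so no extra allocation happens.
        return "npu_multi_head_latent_attention", (nope_cache, rope_cache)
    # Fallback: paged_attention_mla needs one combined cache, so concatenate
    # and accept the temporary memory peak in eager mode.
    return "paged_attention_mla", (torch.cat([nope_cache, rope_cache], dim=-1),)
```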
Does this PR introduce any user-facing change?
Yes, it adds a new flag named VLLM_ASCEND_MLA_PA, but it will be removed eventually after the newest torch_npu is released.

How was this patch tested?