
Conversation

@Pr0Wh1teGivee (Contributor) commented Jun 20, 2025

What this PR does / why we need it?

Use the fused op torch_npu.npu_top_k_top_p(logits, p, k) when p and k are not None; otherwise fall back to the original implementation. The replacement takes place automatically when `VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1`.

This patch uses `npu_top_k_top_p`, which requires torch_npu >= 2.5.1.post1.dev20250619.
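
A minimal sketch of the patched path (the wrapper shape and the unfused fallback helper name are illustrative, not the literal patch code):

```python
import torch
import torch_npu  # fused op requires torch_npu >= 2.5.1.post1.dev20250619


def apply_top_k_top_p(
    logits: torch.Tensor,
    p: torch.Tensor,  # per-request top-p thresholds, or None
    k: torch.Tensor,  # per-request top-k counts, or None
) -> torch.Tensor:
    # Fast path: a single fused NPU kernel when both filters are active.
    if p is not None and k is not None:
        return torch_npu.npu_top_k_top_p(logits, p, k)
    # Otherwise fall back to the original unfused implementation
    # (hypothetical helper name standing in for the pre-patch code path).
    return _apply_top_k_top_p_unfused(logits, p, k)
```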

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested with DeepSeek R1.

github-actions bot added the documentation label Jun 20, 2025
    k: torch.Tensor,
) -> torch.Tensor:
    if p is not None and k is not None:
        return torch_npu.npu_top_k_top_p(logits, p, k)
Member:

Which torch_npu version supports this call?

Contributor Author:

since rc2 b050

@Yikun (Member) commented Jun 23, 2025:

Please update the commit message with the published torch_npu version.

such as: https://mirrors.huaweicloud.com/ascend/repos/pypi/torch-npu/

Member:

This API was introduced in torch_npu-2.5.1.post1.dev20250619.

Contributor Author:

fixed

@Yikun (Member) commented Jun 23, 2025

1. Please do a rebase because we updated the torch version this morning; better to do an e2e test as well.
2. Please add a UT to make sure npu_top_k_top_p is called and returns the expected results, under: https://github.com/vllm-project/vllm-ascend/tree/main/tests/ut (a sketch follows below).
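
A rough sketch of such a UT (the mock target, module path, and patched function name are assumptions for illustration; the real test lives under tests/ut):

```python
from unittest import mock

import torch


def test_fused_topk_topp_called():
    logits = torch.randn(4, 128)
    p = torch.full((4,), 0.9)
    k = torch.full((4,), 50, dtype=torch.int32)
    # Stub the NPU op so the test runs without Ascend hardware.
    with mock.patch("torch_npu.npu_top_k_top_p",
                    return_value=logits) as mock_npu_op:
        from vllm_ascend.patch.worker.patch_common import patch_sampler
        out = patch_sampler.apply_top_k_top_p(logits, p, k)
    # The fused op must be invoked exactly once with the given tensors.
    mock_npu_op.assert_called_once_with(logits, p, k)
    assert out is logits
```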

    logits: torch.Tensor,
    p: torch.Tensor,
    k: torch.Tensor,
    p: torch.Tensor,
Member:

Hmm... so the order of _apply_top_k_top_p was wrong.

I suggest keeping the same order as upstream: https://github.com/vllm-project/vllm/blob/9a3b88328f7e434cac35b90ee463de6689f9a833/vllm/model_executor/layers/sampler.py#L398

Please change L98.
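
For reference, a stub in the suggested order (assuming the upstream helper's (logits, p, k) parameter order, as the linked sampler.py suggests; illustrative only):

```python
import torch


def _apply_top_k_top_p(
    logits: torch.Tensor,
    p: torch.Tensor,
    k: torch.Tensor,
) -> torch.Tensor:
    ...  # top-k/top-p filtering over logits
```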

Contributor Author:

fixed

@Pr0Wh1teGivee (Contributor Author), replying to the rebase/UT requests above:

fixed

Pr0Wh1teGivee changed the title "use fused ops npu_top_k_top_p" to "[Perf] Use fused ops npu_top_k_top_p" Jun 24, 2025
mock_npu_op.assert_called_once_with(logits, p, k)


if __name__ == "__main__":
Collaborator:

no need for this

Contributor Author:

fixed

github-actions bot removed the documentation label Jun 24, 2025
codecov bot commented Jun 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 27.55%. Comparing base (c30ddb8) to head (d88eb37).
⚠️ Report is 550 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1308      +/-   ##
==========================================
+ Coverage   27.39%   27.55%   +0.16%     
==========================================
  Files          56       57       +1     
  Lines        6191     6238      +47     
==========================================
+ Hits         1696     1719      +23     
- Misses       4495     4519      +24     
Flag Coverage Δ
unittests 27.55% <100.00%> (+0.16%) ⬆️


@Yikun (Member) left a comment:

LGTM, please address comments in a new PR.

Comment on lines +15 to +16
import vllm_ascend.patch.worker.patch_common.patch_sampler
importlib.reload(vllm_ascend.patch.worker.patch_common.patch_sampler)
Member:

Please refer to this [1] to patch the test and base TestTopKTopPSamplerOptimize on TestBase.

[1] https://github.com/vllm-project/vllm-ascend/pull/1386/files#diff-eae86bf6e7a9a6ef5d079fa80ca12e946ecff4e587e5b66d3761f2cc7f6bb9c5R4
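
A hedged sketch of the requested shape (the TestBase import path and method name are assumptions):

```python
import importlib

from tests.ut.base import TestBase  # assumed location of TestBase


class TestTopKTopPSamplerOptimize(TestBase):

    def test_patch_reapplied(self):
        # Reload so the module-level monkey-patching runs again under
        # this test's environment and mocks.
        import vllm_ascend.patch.worker.patch_common.patch_sampler as ps
        importlib.reload(ps)
```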

    return logits


def _apply_top_k_top_p(
Member:

I tend to rename this _apply_top_k_top_p to apply_top_k_top_p to avoid confusion and keep the same name as:
https://github.com/vllm-project/vllm/blob/c53fec1fcb27aca9475e55c2d1e74c532f5f0364/vllm/v1/sample/ops/topk_topp_sampler.py#L165

Yikun merged commit 2fda604 into vllm-project:main Jun 25, 2025
24 checks passed
Yikun added this to the v0.9.1 milestone Jun 26, 2025
weijinqian0 pushed a commit to weijinqian0/vllm-ascend that referenced this pull request Jun 30, 2025. (Commit message duplicates this PR's description, noting "Tested by DeepSeek R1 and UT passed"; signed off by Pr0Wh1teGivee.)
@wangxiyuan (Collaborator):

apply_min_p has been removed from vllm main (vllm-project/vllm@48fb076); we'll clean up this patch code once vLLM 0.9.2 comes out. Please update to the newest code if you still want this feature.

wangxiyuan pushed a commit that referenced this pull request Aug 1, 2025
### What this PR does / why we need it?
Fixed a 310P failure when using the sampler feature.
The root cause: torch_npu.npu_top_k_top_p uses the operator
aclnnApplyTopKTopP, which currently does not support 310P.
The first PR with the issue is #1308.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@207b750

Signed-off-by: leo-pony <[email protected]>
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025. (Commit message duplicates the 310P fix above.)
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025. (Commit message duplicates this PR's description.)
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025. (Commit message duplicates this PR's description.)
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025. (Commit message duplicates the 310P fix above.)
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 9, 2025. (Commit message duplicates the 310P fix above.)