[Kernel] Remove cumsum in groupedmatmul #987

Merged: 1 commit merged on Jun 6, 2025

Conversation

@hahazhky (Contributor) commented May 28, 2025:

What this PR does / why we need it?

Remove the cumsum operator in the MoE path to improve performance.

Does this PR introduce any user-facing change?

How was this patch tested?

It should be tested on a case with the mc2 operator and graph mode enabled.

  gate_up_out_list = torch_npu.npu_grouped_matmul(
      x=[expand_x],
      weight=[w1],
      split_item=2,
-     group_list_type=0,
+     group_list_type=1,
Collaborator:

Maybe we could add some comments to explain why group_list_type=1 is used here, e.g., to avoid the cumulative calculation of the group list.

Contributor Author (@hahazhky):

fixed
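For reference, a minimal sketch (with made-up token counts, not the PR's actual tensors) of the difference between the two group-list conventions: group_list_type=0 expects cumulative offsets, which needs an extra cumsum over the per-expert token counts, while group_list_type=1 accepts the raw counts directly, so the cumsum can be dropped.

    import torch

    # Hypothetical number of tokens routed to each of four experts.
    tokens_per_expert = torch.tensor([3, 5, 2, 6], dtype=torch.int64)

    # group_list_type=0: cumulative offsets, produced by an extra cumsum kernel.
    group_list_cumulative = torch.cumsum(tokens_per_expert, dim=0)  # tensor([ 3,  8, 10, 16])

    # group_list_type=1: per-group counts, passed through with no cumsum at all.
    group_list_counts = tokens_per_expert  # tensor([3, 5, 2, 6])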


    if VLLM_ENABLE_FIX_ROUTE:
        uniform_group_list = hidden_states.shape[0] * all_to_all_group_size * top_k // moe_expert_num
        group_list = torch.Tensor([uniform_group_list] * w1.shape[0]).long().npu()
Collaborator:

Suggested change:
- group_list = torch.Tensor([uniform_group_list] * w1.shape[0]).long().npu()
+ group_list = torch.Tensor([uniform_group_list] * w1.shape[0]).long().to(hidden_states.device)

Contributor Author (@hahazhky):

fixed
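As an aside, a minimal sketch of the design point behind this suggestion (hypothetical shapes; hidden_states stands in for any tensor that already lives on the target device): deriving the device from an existing tensor keeps the code backend-agnostic, whereas a hard-coded .npu() only works on NPU builds.

    import torch

    # Hypothetical stand-in for activations already on the target device
    # (CPU here, NPU in the real code path).
    hidden_states = torch.randn(4, 8)

    uniform_group_list = 2
    num_groups = 4

    # Device-agnostic construction: follows hidden_states wherever it lives,
    # instead of hard-coding .npu().
    group_list = torch.tensor([uniform_group_list] * num_groups,
                              dtype=torch.long,
                              device=hidden_states.device)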

    uniform_topk_list = [
        (i + rank) % moe_expert_num for i in range(rank * step, (rank + 1) * step)
    ]
    topk_ids = torch.Tensor(uniform_topk_list).int().view(hidden_states.shape[0], -1).npu()
Collaborator:

Suggested change:
- topk_ids = torch.Tensor(uniform_topk_list).int().view(hidden_states.shape[0], -1).npu()
+ topk_ids = torch.Tensor(uniform_topk_list).int().view(hidden_states.shape[0], -1).to(hidden_states.device)

Contributor Author (@hahazhky):

fixed

@hahazhky changed the title from "add fix routing for performance test" to "[Feature] add fix routing for performance test" on May 28, 2025
@hahazhky changed the title from "[Feature] add fix routing for performance test" to "[Core][Kernel] add fix routing for performance test" on May 28, 2025
@MengqingCao (Collaborator):
overall lgtm now, thanks!

@hahazhky (Contributor Author):
@wangxiyuan @Yikun @ganyi1996ppo hello, please review this PR; @MengqingCao has already replied LGTM.

@@ -50,6 +51,14 @@ def fused_experts_with_mc2(
) -> torch.Tensor:
    global_bs = 0
    moe_expert_num = len(expert_map)

    rank = torch.distributed.get_rank()
    if VLLM_ENABLE_FIX_ROUTE:
Collaborator:

A performance test under the fix-route scenario seems pointless; do we really need this env variable in the repo?

@ganyi1996ppo (Collaborator) commented May 30, 2025:

Looks good except for the fix-routing part; do we really need to add this? Performance may improve with it enabled, but that performance data means nothing in a real production environment, right? @hahazhky

@hahazhky (Contributor Author) commented Jun 3, 2025:

> Looks good except for the fix-routing part; do we really need to add this? Performance may improve with it enabled, but that performance data means nothing in a real production environment, right? @hahazhky

Yes, we cannot enable this in a real production environment, but we can use it as a performance-debugging option to check the upper limit; as training continues, the actual time cost will approach this performance.

github-actions bot commented Jun 3, 2025:

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@Yikun changed the title from "[Core][Kernel] add fix routing for performance test" to "[Core][Kernel] Remove cumsum in MOE and add fix routing ENV for performance test" on Jun 4, 2025
Comment on lines 72 to 73
"VLLM_ENABLE_FIX_ROUTE":
lambda: bool(int(os.getenv("VLLM_ENABLE_FIX_ROUTE", '0'))),
Collaborator:

Suggested change:
- "VLLM_ENABLE_FIX_ROUTE":
-     lambda: bool(int(os.getenv("VLLM_ENABLE_FIX_ROUTE", '0'))),
+ "VLLM_ASCEND_ENABLE_FIX_ROUTE":
+     lambda: bool(int(os.getenv("VLLM_ASCEND_ENABLE_FIX_ROUTE", '0'))),

Contributor Author (@hahazhky):

deleted
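For context, a minimal standalone sketch of the boolean-env-flag pattern the suggestion refers to (the flag itself was later removed from this PR, and the names here are illustrative only): an unset variable or "0" parses to False, "1" to True.

    import os

    # Registry mapping flag names to lazy parsers, mirroring the diff above.
    env_variables = {
        "VLLM_ASCEND_ENABLE_FIX_ROUTE":
        lambda: bool(int(os.getenv("VLLM_ASCEND_ENABLE_FIX_ROUTE", '0'))),
    }

    # Reading the flag: unset or "0" -> False, "1" -> True.
    enable_fix_route = env_variables["VLLM_ASCEND_ENABLE_FIX_ROUTE"]()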

@@ -61,3 +61,22 @@ def test_models_distributed_DeepSeek():
        distributed_executor_backend="mp",
    ) as vllm_model:
        vllm_model.generate_greedy(example_prompts, max_tokens)


def test_models_distributed_fix_route_DeepSeek():
Collaborator:

We should unset the env variable after the test executes, e.g. @patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_TOPK_OPTIMZE": "1"}).

Suggested change:
- def test_models_distributed_fix_route_DeepSeek():
+ @patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_TOPK_OPTIMZE": "1"})
+ def test_models_distributed_fix_route_DeepSeek():

[1] https://docs.python.org/3/library/unittest.mock.html#unittest.mock.patch.dict

Contributor Author (@hahazhky):

fixed
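For reference, a minimal standalone sketch of the patch.dict pattern suggested above: the decorator sets the variable for the duration of the test and restores os.environ afterwards (the test body here is illustrative, not the PR's actual test).

    import os
    from unittest.mock import patch

    @patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_TOPK_OPTIMZE": "1"})
    def test_env_flag_is_scoped_to_this_test():
        # Inside the decorated test the variable is visible;
        # patch.dict removes it again once the test returns.
        assert os.environ["VLLM_ASCEND_ENABLE_TOPK_OPTIMZE"] == "1"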

@@ -50,6 +51,14 @@ def fused_experts_with_mc2(
) -> torch.Tensor:
    global_bs = 0
    moe_expert_num = len(expert_map)

    rank = torch.distributed.get_rank()
    if VLLM_ENABLE_FIX_ROUTE:
@Yikun (Collaborator) commented Jun 4, 2025:

Please check whether VLLM_ENABLE_FIX_ROUTE is still needed after afc4c0c (enable_force_load_balance).

Contributor Author (@hahazhky):

Fix route is no longer needed; the relevant code has been deleted.

@hahazhky force-pushed the main branch 3 times, most recently from a00ba7b to 936189e, on June 4, 2025 15:31
@hahazhky changed the title from "[Core][Kernel] Remove cumsum in MOE and add fix routing ENV for performance test" to "[Kernel] Remove cumsum in groupedmatmul" on Jun 4, 2025
@Yikun (Collaborator) left a comment:


LGTM except inline comments

"deepseek-ai/DeepSeek-V2-Lite",
dtype=dtype,
tensor_parallel_size=8,
enable_expert_parallel=True,
Collaborator:

Could this change be more general, e.g. @pytest.mark.parametrize("enable_expert_parallel", [True, False])?
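A minimal sketch of that suggestion (VllmRunner details elided; the body is a placeholder, not the PR's actual test): pytest generates one test case per value of enable_expert_parallel.

    import pytest

    @pytest.mark.parametrize("enable_expert_parallel", [True, False])
    def test_models_distributed_DeepSeek(enable_expert_parallel):
        # Placeholder body: the real test would construct VllmRunner with
        # enable_expert_parallel=enable_expert_parallel and run generation.
        assert enable_expert_parallel in (True, False)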

    with VllmRunner(
        "deepseek-ai/DeepSeek-V2-Lite",
        dtype=dtype,
        tensor_parallel_size=8,
Collaborator:

CI fails because only 4 cards are available, but you used tp=8.

@hahazhky force-pushed the main branch 3 times, most recently from 1ccff57 to ac92d4b, on June 5, 2025 06:21

github-actions bot commented Jun 5, 2025:

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@hahazhky force-pushed the main branch 2 times, most recently from 588cd4b to db1d487, on June 5, 2025 11:16
@hahazhky force-pushed the main branch 2 times, most recently from 9a7bcaa to f057135, on June 5, 2025 14:18
@wangxiyuan added the ready (read for review) label on Jun 6, 2025
@wangxiyuan merged commit 0b12c2a into vllm-project:main on Jun 6, 2025
30 of 31 checks passed
Labels: module:ops, ready (read for review)
7 participants