[https://nvbugs/6412108][fix] Restore original order — all_reduce the routed partial first, then add the…#15922

Open
trtllm-agent wants to merge 1 commit into NVIDIA:main from tensorrt-cicd:repair-bot-bug6412108

trtllm-agent commented Jul 3, 2026

Collaborator

Summary

  • Root cause: The Sharding-IR refactor moved the expert_output + shared_expert_output addition ahead of the all_reduce, so the replicated (unsharded) shared_expert output was summed across ranks and scaled by world_size (8×), corrupting MMLU/GSM8K outputs.
  • Fix: Restore the original order (all_reduce the routed partial first, then add the replicated shared output) and update the comment; also remove the nvbugs/6412108 waiver line.
  • Automated fix generated by repair-bot
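
The scaling described in the root cause can be reproduced with a small single-process simulation. This is only a sketch, not the repository's code: `all_reduce_sum`, `routed_partials`, and `shared_output` are hypothetical names, the values are made up, and the all-reduce is modeled as a plain sum over a list of per-rank values.

```python
# Single-process model of the two orderings; no real collective is involved.
WORLD_SIZE = 8  # matches the 8x scaling noted in the summary


def all_reduce_sum(per_rank_values):
    """Model a sum all-reduce: every rank ends up holding the same total."""
    total = sum(per_rank_values)
    return [total] * len(per_rank_values)


# Hypothetical values: each rank holds a partial routed output (sharded),
# while the shared expert output is identical (replicated) on every rank.
routed_partials = [0.25 * (rank + 1) for rank in range(WORLD_SIZE)]
shared_output = 3.0

expected = sum(routed_partials) + shared_output

# Regressed order: add first, then all_reduce, so shared_output is summed
# once per rank.
buggy = all_reduce_sum([p + shared_output for p in routed_partials])

# Restored order: all_reduce the routed partial, then add the replicated value.
fixed = [r + shared_output for r in all_reduce_sum(routed_partials)]

print("expected:", expected)
print("buggy:   ", buggy[0])
print("fixed:   ", fixed[0])
```

In this model the regressed ordering overshoots the expected value by `(WORLD_SIZE - 1) * shared_output`, while the restored ordering matches it on every rank.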

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

  • Bug Fixes

    • Improved the handling of mixture-of-experts outputs to make the final result more stable and consistent across distributed runs.
  • Tests

    • Removed a test waiver, so one previously skipped accuracy check will now run normally.

…replicated shared expert

The shared expert (Qwen3_5MoeMLP) intentionally omits the layer_type hint on
its torch_linear_simple ops and the qwen3.5_moe_400b.yaml shard_layers whitelist
excludes it, so its output is already the full value on every rank. The previous
single-merge-point ordering (add then all_reduce) scaled the replicated shared
output by world_size and dropped MMLU from ~85% to ~0.07%. Restore the original
order: all_reduce the routed partial first, then add the replicated shared output.
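
The arithmetic behind this can be written out explicitly. The symbols are introduced here for illustration and do not appear in the source: $W$ is world_size, $y_r$ is the routed partial held by rank $r$, and $s$ is the shared expert output, assumed identical on every rank.

```latex
\begin{align*}
\text{add, then all\_reduce:}\quad
  & \sum_{r=0}^{W-1} \left( y_r + s \right) = \sum_{r=0}^{W-1} y_r + W s \\
\text{all\_reduce, then add:}\quad
  & \left( \sum_{r=0}^{W-1} y_r \right) + s
\end{align*}
```

The two results differ by $(W - 1)\,s$, so with $W = 8$ the shared contribution is counted eight times instead of once.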

Signed-off-by: trtllm-agent <296075020+trtllm-agent@users.noreply.github.com>
coderabbitai Bot commented Jul 3, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4c98e406-4168-4fda-82ff-43cf04fc61f1

📥 Commits

Reviewing files that changed from the base of the PR and between 3b23a9f and da126a4.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe.py
  • tests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt

Walkthrough

The reduction order in Qwen3_5MoeSparseMoeBlock.forward is changed so the routed expert output is all-reduced before adding the shared expert output, rather than summing first then reducing. A test waiver entry for a Qwen3.5 MoE NVFP4 accuracy test is removed.
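
The reordered reduction can be sketched structurally as follows. This is an illustrative stand-in, not the code in modeling_qwen3_5_moe.py: `SparseMoeBlockSketch`, `routed_fn`, `shared_fn`, and the injected `all_reduce` callable are hypothetical, and tensors are modeled as lists of floats.

```python
from typing import Callable, List

Vector = List[float]


class SparseMoeBlockSketch:
    """Hypothetical stand-in for a sparse MoE block with a replicated shared expert."""

    def __init__(
        self,
        routed_fn: Callable[[Vector], Vector],
        shared_fn: Callable[[Vector], Vector],
        all_reduce: Callable[[Vector], Vector],
    ) -> None:
        self.routed_fn = routed_fn    # returns this rank's partial routed output
        self.shared_fn = shared_fn    # returns the full shared output on every rank
        self.all_reduce = all_reduce  # sums a partial vector across ranks

    def forward(self, hidden: Vector) -> Vector:
        routed_partial = self.routed_fn(hidden)
        shared_full = self.shared_fn(hidden)
        # Reduce only the sharded (partial) term ...
        routed_full = self.all_reduce(routed_partial)
        # ... then add the replicated term exactly once.
        return [r + s for r, s in zip(routed_full, shared_full)]


# Usage with a fake 4-rank all_reduce in which every rank holds the same partial.
world_size = 4
block = SparseMoeBlockSketch(
    routed_fn=lambda h: [x / world_size for x in h],
    shared_fn=lambda h: [2.0 * x for x in h],
    all_reduce=lambda v: [x * world_size for x in v],
)
result = block.forward([1.0, 2.0])
print(result)
```

Injecting the reduction as a callable keeps the ordering visible in one place and lets the sketch run without a distributed backend.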

Changes

MoE reduction fix and waiver update

| Layer / File(s) | Summary |
| --- | --- |
| Reorder MoE output all-reduce and shared expert addition: tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5_moe.py, tests/integration/test_lists/waives.txt | All-reduce is now applied to the routed expert output first, then the shared expert output is added; the previously waived NVFP4 accuracy test for TestQwen3_5_397B_MoE is unwaived. |

Estimated code review effort: 2 (Simple) | ~10 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#15914: Both PRs modify tests/integration/test_lists/waives.txt around Qwen3_5_397B MoE NVFP4 skip/waiver entries.

Suggested reviewers: xinhe-nv, jiaganc, Superjomn

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly identifies the bug fix and matches the main code change. |
| Description check | ✅ Passed | The description explains the root cause, fix, tests, and bug link, so it is mostly complete. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
