
Qwen3 MoE Preliminary: add intermediate_size argument to MLP modules #2046


Open
ysjprojects wants to merge 4 commits into main

Conversation

@ysjprojects (Contributor) commented May 17, 2025

# From transformers' Qwen3 MoE modeling code (imports added here for self-containment):
import torch.nn as nn

from transformers.activations import ACT2FN


class Qwen3MoeMLP(nn.Module):
    def __init__(self, config, intermediate_size=None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        # Fall back to the dense intermediate size when no override is given.
        self.intermediate_size = intermediate_size if intermediate_size is not None else config.intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
        return down_proj

In Qwen3 MoE, the MLP module is instantiated with one of two intermediate sizes: the sparse MoE block builds its expert MLPs with config.moe_intermediate_size, while the decoder layer's dense MLP uses config.intermediate_size.
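
For context, the two call sites in transformers look roughly like this (a paraphrased sketch; the surrounding class names Qwen3MoeSparseMoeBlock / Qwen3MoeDecoderLayer and the config.num_experts field are recalled from the upstream code, not quoted verbatim):

class Qwen3MoeSparseMoeBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Every expert is an MLP built with the narrower MoE-specific width.
        self.experts = nn.ModuleList(
            [Qwen3MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(config.num_experts)]
        )


class Qwen3MoeDecoderLayer(nn.Module):
    def __init__(self, config, layer_idx):
        super().__init__()
        # Dense (non-MoE) layers fall back to the full intermediate size.
        self.mlp = Qwen3MoeMLP(config, intermediate_size=config.intermediate_size)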

The same pattern appears in DeepseekV3 and will likely appear in more MoE models to come, so this PR extends the same flexibility to LitGPT's own MLP modules; see the sketch below.
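
A minimal sketch of what that could look like for LitGPT's LLaMAMLP (the fc_1 / fc_2 / proj projection names and the n_embd / bias config fields follow LitGPT's model.py; this is an illustrative sketch under those assumptions, not the PR's actual diff):

import torch
import torch.nn as nn


class LLaMAMLP(nn.Module):
    def __init__(self, config, intermediate_size=None):
        super().__init__()
        # Optional override so MoE blocks can request a per-expert width
        # (e.g. config.moe_intermediate_size) while dense layers keep the default.
        self.intermediate_size = intermediate_size if intermediate_size is not None else config.intermediate_size
        self.fc_1 = nn.Linear(config.n_embd, self.intermediate_size, bias=config.bias)
        self.fc_2 = nn.Linear(config.n_embd, self.intermediate_size, bias=config.bias)
        self.proj = nn.Linear(self.intermediate_size, config.n_embd, bias=config.bias)
        self.config = config

    def forward(self, x):
        x_fc_1 = self.fc_1(x)
        x_fc_2 = self.fc_2(x)
        return self.proj(torch.nn.functional.silu(x_fc_1) * x_fc_2)

With this signature, a sparse MoE block can build its experts as LLaMAMLP(config, intermediate_size=config.moe_intermediate_size), while dense layers keep calling LLaMAMLP(config) unchanged.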

@Borda enabled auto-merge (squash) May 22, 2025 12:15