Add intra_group_size to topology #3696

Ali-Tehrani · 2026-01-28T03:05:54Z

Summary:
Context

Every GB200 node has 2 B200 GPUs attached to it, but up to 72 B200 GPUs can be connected via NVLink. The planner needs to know how large the intra-topology group is going to be.

As a result, local_world_size can differ from intra_group_size.

Implementation

Topology class:
- Adds pod_size to the Topology class and uses it to calculate intra_group_size (the maximum number of processes linked with high intra-group bandwidth). If pod_size isn't given, intra_group_size defaults to local_world_size.
shard_estimators.py:
- The shard estimators now use intra_group_size instead of local_world_size. This allows RW/TW/CW to properly account for the larger NVLink group that comes with the pods.
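
To make the fallback concrete, here is a minimal sketch of the relationship described above. It is not the code from this PR: `TopologySketch` and `stays_on_fast_links` are hypothetical names, and the sketch assumes that pod_size counts hosts per NVLink group, so that intra_group_size is local_world_size * pod_size. The PR text does not state the exact formula, so treat that product as an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TopologySketch:
    """Hypothetical stand-in for the planner's Topology; not the real class."""

    local_world_size: int           # processes (GPUs) per host
    pod_size: Optional[int] = None  # ASSUMPTION: hosts per NVLink group

    @property
    def intra_group_size(self) -> int:
        # Fallback described in the PR: without pod_size, the
        # high-bandwidth group is just the host.
        if self.pod_size is None:
            return self.local_world_size
        # ASSUMPTION: the group spans every process on every host in the pod.
        return self.local_world_size * self.pod_size


def stays_on_fast_links(num_ranks: int, topology: TopologySketch) -> bool:
    """Hypothetical estimator check: does a shard group fit in one intra group?"""
    return num_ranks <= topology.intra_group_size


# GB200 example from the Context section: 2 GPUs per node, 72 per NVLink group.
gb200 = TopologySketch(local_world_size=2, pod_size=36)
no_pod = TopologySketch(local_world_size=8)

print(gb200.intra_group_size)   # 72
print(no_pod.intra_group_size)  # 8
print(stays_on_fast_links(16, gb200), stays_on_fast_links(16, no_pod))  # True False
```

Under these assumptions, a 16-rank shard group stays within the high-bandwidth group on the GB200 pod but spans hosts on the topology without pod_size, which is the distinction an estimator keyed on local_world_size would miss.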

Differential Revision: D91617887

meta-codesync · 2026-01-28T03:06:27Z

@Ali-Tehrani has exported this pull request. If you are a Meta employee, you can view the originating Diff in D91617887.

Reviewed By: isururanawaka

Differential Revision: D91617887

meta-cla bot added the CLA Signed label Jan 28, 2026

meta-codesync bot added the fb-exported and meta-exported labels Jan 28, 2026

Ali-Tehrani force-pushed the export-D91617887 branch from 2128c83 to 9d1ba65 January 30, 2026 00:19
