Skip to content

WIP: Compose TP + DDP/FSDP2 #3498

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from
Draft

WIP: Compose TP + DDP/FSDP2 #3498

wants to merge 3 commits into from

Conversation

S1ro1
Copy link
Member

@S1ro1 S1ro1 commented Apr 10, 2025

Opening this PR to enable discussion on the extra changes this will introduce:

  • for this to work with tp from transformers, we need extra args to recreate the device mesh used there
  • we will need to refactor out TP to not be a self.distributed_type rather an extra attribute to be able to use it with other parallelisms
  • this approach enables us to easily extend towards PP/EP/CtxP etc by just adding more arguments (should probably hide them a bit better to not have the function signature explode -> i.e. ParallelismConfig)
  • hopefully this approach is gonna result in minimal changes in transformers

To iterate more on this right now standing questions are:

  • how we pass paralellism degrees used by transformers to accelerate, currently we use specific plugins (i.e. TP Plugin for tp_size, Deepspeed plugin for dp/tp degree), this approach will stop working as we would like to combine multiple plugins (TP + DDP/FSDP + optionally PP)

cc @SunMarc @kmehant

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@S1ro1
Copy link
Member Author

S1ro1 commented Apr 11, 2025

Managed to make a minimal repro of composable tp+fsdp2 working, though it requires nightly torch 🚀

Copy link
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@kmehant
Copy link
Contributor

kmehant commented May 12, 2025

pulse ping

@S1ro1
Copy link
Member Author

S1ro1 commented May 12, 2025

@kmehant we have internally decided to have as much logic as possible in transformers, so this is postponed until that is ~resolved. You can watch that here

@kmehant
Copy link
Contributor

kmehant commented May 12, 2025

I see @S1ro1 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants