Your issue may already be reported!
Please search on the issue tracker before creating one.
Context
- PyTorch version: 2.6.0
- Operating System and version: Linux
Your Environment
- Installed using source? [yes/no]: yes
- Are you planning to deploy it using docker container? [yes/no]: no
- Is it a CPU or GPU environment?: GPU
- Which example are you using: fsdp2/examples/distributed/FSDP2/train.py
- Link to code or data to repro [if any]: no
Expected Behavior
(1) A normal, decreasing loss, e.g. 1.95, 1.86, 1.73, etc.
(2) unsharded_param.grad of each module is non-zero.
Current Behavior
(1) Abnormal loss values: -13857836160.0, -15615669120.0, -17379222400.0
(2) unsharded_param.grad of the module is zero in every layer when I use a logger to debug.
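For reference, this is roughly how the gradients were inspected. A minimal sketch (not the exact debugging code from the run), assuming a standard training loop where `log_grad_norms` is a hypothetical helper called after `backward()`; with FSDP2 (`fully_shard`), parameters may be DTensors, so the local shard is inspected:

```python
# Sketch: log per-parameter gradient norms after backward() to check
# whether gradients are actually zero. Hypothetical helper, not part
# of the example script.
import logging

import torch

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("grad_check")


def log_grad_norms(model: torch.nn.Module) -> None:
    for name, param in model.named_parameters():
        if param.grad is None:
            logger.info("%s: grad is None", name)
            continue
        grad = param.grad
        # FSDP2 parameters are DTensors; inspect the local shard.
        if hasattr(grad, "to_local"):
            grad = grad.to_local()
        logger.info("%s: grad norm = %.6g", name, grad.norm().item())
```

A grad norm of exactly 0.0 for every layer, combined with a diverging loss, suggests the optimizer step and the backward pass are not seeing the same parameters.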
Possible Solution
Steps to Reproduce
1. Run fsdp2/examples/distributed/FSDP2/train.py
Failure Logs [if any]