fsdp2 example unsharded_param.grad zero #1377

@Nju-Ben

Description

Context

  • PyTorch version: 2.6.0
  • Operating System and version: Linux

Your Environment

  • Installed using source? [yes/no]: yes
  • Are you planning to deploy it using a Docker container? [yes/no]: no
  • Is it a CPU or GPU environment?: GPU
  • Which example are you using: fsdp2/examples/distributed/FSDP2/train.py
  • Link to code or data to repro [if any]: none

Expected Behavior

(1) Normal, decreasing loss values (e.g. 1.95, 1.86, 1.73, ...).
(2) unsharded_param.grad of each module is nonzero after backward.

Current Behavior

(1) Abnormal loss values: -13857836160.0, -15615669120.0, -17379222400.0.
(2) unsharded_param.grad is zero in every layer's module when I use a logger to debug.
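
For what it's worth, a minimal sketch of how the gradients can be checked without relying on the internal unsharded_param (assuming the model was wrapped with torch.distributed.fsdp.fully_shard as in train.py, so its parameters are DTensors; model here stands for the sharded module):

```python
from torch.distributed.tensor import DTensor

# After loss.backward(), inspect the sharded gradients directly.
# With fully_shard (FSDP2), model.parameters() are DTensors and .grad
# holds the reduce-scattered gradient shard; the internal unsharded_param
# is typically resharded after backward, so its .grad may not reflect
# the real gradient state.
for name, param in model.named_parameters():
    grad = param.grad
    if grad is None:
        print(f"{name}: grad is None")
        continue
    # Take the local shard of the DTensor gradient before computing a norm.
    local = grad.to_local() if isinstance(grad, DTensor) else grad
    print(f"{name}: local grad norm = {local.norm().item():.4e}")
```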

Possible Solution

Steps to Reproduce

1. Run the example fsdp2/examples/distributed/FSDP2/train.py (a typical launch command is shown after this list).
2.
...
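
For reference, the FSDP2 example is normally launched with torchrun; the GPU count below is an assumption and should match your environment:

```bash
# Assumed single-node launch; adjust --nproc_per_node to your number of GPUs.
torchrun --nproc_per_node 2 train.py
```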

Failure Logs [if any]
