
CUDA memory usage does not decrease when increasing the number of CUDA devices (fsdp_tp_example.py) #1319

@YangHui90

Description


Based on the example's source code, I ran several experiments to study the script's running time and CUDA memory usage (a sketch of how such measurements can be collected follows the list):

  • exp1: nproc_per_node=4, nnodes=1 => CUDA memory = 2161~2411 MB, runtime = 63.04 s
  • exp2: nproc_per_node=8, nnodes=1 => CUDA memory = 2141~2395 MB, runtime = 70.52 s
  • exp3: nproc_per_node=4, nnodes=2 => CUDA memory = 2141~2145 MB, runtime = 233.03 s

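For reference, here is a minimal sketch of how per-rank peak memory and runtime figures like the ones above could be collected. The issue does not say how the measurements were taken, so the helper `report_run_stats`, the use of `torch.cuda.max_memory_allocated`, and `time.perf_counter` are assumptions rather than the example's actual instrumentation.

```python
import time

import torch


def report_run_stats(rank: int, start_time: float) -> None:
    """Print this rank's peak CUDA memory (MB) and wall-clock runtime (s).

    Hypothetical helper; the issue does not state how its numbers were measured.
    """
    peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
    runtime_s = time.perf_counter() - start_time
    print(f"[rank {rank}] peak CUDA memory: {peak_mb:.0f} MB, runtime: {runtime_s:.2f} s")
```
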
These results show that as the number of GPUs increases, per-GPU CUDA memory usage does not decrease significantly, while the script's running time increases.

Why?

I am trying to understand the reason: based on how FSDP and TP work, both per-GPU CUDA memory and running time should decrease as the number of GPUs grows.

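For context, the example's structure is (roughly) a 2D device mesh with a "tp" dimension for tensor parallelism inside a node and a "dp" dimension that FSDP shards across. The sketch below is a hedged outline of that technique, not the example's exact code: it uses a toy two-layer MLP (`ToyModel`) in place of the example's Transformer, and `tp_size = 2` is illustrative.

```python
# Sketch of 2D parallelism: TP over the "tp" mesh dim, FSDP over the "dp" dim.
# Assumes it is launched with torchrun so the env:// rendezvous variables exist.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class ToyModel(nn.Module):
    """Toy stand-in for the example's Transformer."""

    def __init__(self) -> None:
        super().__init__()
        self.in_proj = nn.Linear(1024, 4096)
        self.out_proj = nn.Linear(4096, 1024)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out_proj(torch.relu(self.in_proj(x)))


dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

tp_size = 2  # illustrative; the example derives this from its CLI arguments
dp_size = dist.get_world_size() // tp_size

# 2D mesh: FSDP shards across "dp", tensor parallelism shards across "tp".
mesh_2d = init_device_mesh("cuda", (dp_size, tp_size), mesh_dim_names=("dp", "tp"))

model = ToyModel().cuda()
# Column-/row-wise shard the two linears over the "tp" sub-mesh ...
model = parallelize_module(
    model,
    mesh_2d["tp"],
    {"in_proj": ColwiseParallel(), "out_proj": RowwiseParallel()},
)
# ... then wrap the TP-parallelized module in FSDP over the "dp" sub-mesh,
# so each rank holds roughly 1/(dp_size * tp_size) of the parameters.
model = FSDP(model, device_mesh=mesh_2d["dp"], use_orig_params=True)
```

Under this layout, only the sharded states (parameters, gradients, optimizer state) are expected to shrink as the mesh grows; activations and the fixed per-process CUDA context do not shard, which is worth keeping in mind when comparing the peak-memory numbers above.
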
My Environment

  • Pytorch version: 3.11.7
  • Operating System and version: Linux version 3.10.0-1160.114.2.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) )
  • Installed using source? [yes/no]: yes
  • Are you planning to deploy it using docker container? [yes/no]: no
  • Is it a CPU or GPU environment?: GPU
  • Which example are you using: fsdp_tp_example.py
  • Link to code or data to repro [if any]: https://github.com/pytorch/examples/tree/main/distributed/tensor_parallelism
