Based on the source code of the example, I ran several experiments to study the script runtime and CUDA memory usage (a sketch of the assumed measurement setup follows the list):
- exp1: nproc_per_node=4, nnodes=1 => cuda=2161~2411MB, runtime=63.04s
- exp2: nproc_per_node=8, nnodes=1 => cuda=2141~2395MB, runtime=70.52s
- exp3: nproc_per_node=4, nnodes=2 => cuda=2141~2145MB, runtime=233.03s
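The issue does not include the measurement code itself, so the following is only a minimal sketch of how numbers like these could be collected; `run_fn` is a hypothetical handle to the training loop of fsdp_tp_example.py, not code from the example.

```python
import time
import torch

def measure(run_fn):
    """Time one run of run_fn and report peak CUDA memory on this rank.

    Assumed to be launched via torchrun, e.g. for exp1:
      torchrun --nnodes=1 --nproc_per_node=4 fsdp_tp_example.py
    """
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    run_fn()                              # hypothetical training-loop callable
    torch.cuda.synchronize()              # drain queued kernels before timing
    elapsed = time.perf_counter() - start
    peak_mb = torch.cuda.max_memory_allocated() / 1024**2
    print(f"runtime={elapsed:.2f}s, peak cuda memory={peak_mb:.0f}MB")
```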
From the results of these three experiments, we can see that as the number of GPUs increases, CUDA memory usage does not decrease significantly, while the script runtime increases.
Why?
I am looking for the reasons: by the algorithm principles (FSDP and TP), per-GPU CUDA memory usage and runtime should both decrease as the number of GPUs increases.
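For reference, fsdp_tp_example.py builds a 2D device mesh roughly like the simplified sketch below (sizes chosen to match exp3's 2 nodes x 4 GPUs; the exact wrapping code is in the linked example):

```python
from torch.distributed.device_mesh import init_device_mesh

# exp3 layout: 2 nodes x 4 GPUs = 8 ranks total (illustrative sizes).
tp_size = 4                      # TP shards live within one node (fast intra-node links)
dp_size = 8 // tp_size           # FSDP shards span nodes (slower interconnect)

# One mesh dimension for FSDP ("dp"), one for tensor parallelism ("tp").
# Must be run under torchrun so rank/world-size env vars are set.
mesh_2d = init_device_mesh("cuda", (dp_size, tp_size), mesh_dim_names=("dp", "tp"))

# Under this scheme, per-GPU *parameter* memory scales roughly as
# 1 / (dp_size * tp_size), but activations, the CUDA context, and allocator
# caching do not shrink with it, which can keep observed per-GPU memory
# nearly flat (an assumption about the flat numbers above, not a confirmed
# diagnosis).
```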
My Environment
- Python version: 3.11.7
- Operating System and version: Linux version 3.10.0-1160.114.2.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) )
- Installed using source? [yes/no]: yes
- Are you planning to deploy it using docker container? [yes/no]: no
- Is it a CPU or GPU environment?: GPU
- Which example are you using: fsdp_tp_example.py
- Link to code or data to repro [if any]: https://github.com/pytorch/examples/tree/main/distributed/tensor_parallelism