Based on the source code of the example, I ran several experiments to study the script runtime and CUDA memory usage (a sketch of the assumed measurement setup follows the list):
- exp1: nproc_per_node=4, nnodes=1 => cuda=2161~2411MB, runtime=63.04s
- exp2: nproc_per_node=8, nnodes=1 => cuda=2141~2395MB, runtime=70.52s
- exp3: nproc_per_node=4, nnodes=2 => cuda=2141~2145MB, runtime=233.03s
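The issue does not include the measurement code itself, so the following is only a minimal sketch of how numbers like these could be collected; `run_fn` is a hypothetical handle to the training loop of fsdp_tp_example.py, not code from the example.

```python
import time
import torch

def measure(run_fn):
    """Time one run of run_fn and report peak CUDA memory on this rank.

    Assumed to be launched via torchrun, e.g. for exp1:
      torchrun --nnodes=1 --nproc_per_node=4 fsdp_tp_example.py
    """
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    run_fn()                              # hypothetical training-loop callable
    torch.cuda.synchronize()              # drain queued kernels before timing
    elapsed = time.perf_counter() - start
    peak_mb = torch.cuda.max_memory_allocated() / 1024**2
    print(f"runtime={elapsed:.2f}s, peak cuda memory={peak_mb:.0f}MB")
```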
From the results of these three experiments, we can see that as the number of GPUs increases, CUDA memory usage does not decrease significantly, while the script runtime increases.
Why?
I am looking for the reasons: by the algorithm principles (FSDP and TP), per-GPU CUDA memory usage and runtime should both decrease as the number of GPUs increases.
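For reference, fsdp_tp_example.py builds a 2D device mesh roughly like the simplified sketch below (sizes chosen to match exp3's 2 nodes x 4 GPUs; the exact wrapping code is in the linked example):

```python
from torch.distributed.device_mesh import init_device_mesh

# exp3 layout: 2 nodes x 4 GPUs = 8 ranks total (illustrative sizes).
tp_size = 4                      # TP shards live within one node (fast intra-node links)
dp_size = 8 // tp_size           # FSDP shards span nodes (slower interconnect)

# One mesh dimension for FSDP ("dp"), one for tensor parallelism ("tp").
# Must be run under torchrun so rank/world-size env vars are set.
mesh_2d = init_device_mesh("cuda", (dp_size, tp_size), mesh_dim_names=("dp", "tp"))

# Under this scheme, per-GPU *parameter* memory scales roughly as
# 1 / (dp_size * tp_size), but activations, the CUDA context, and allocator
# caching do not shrink with it, which can keep observed per-GPU memory
# nearly flat (an assumption about the flat numbers above, not a confirmed
# diagnosis).
```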
My Environment
- Python version: 3.11.7
- Operating System and version: Linux version 3.10.0-1160.114.2.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) )
- Installed using source? [yes/no]: yes
- Are you planning to deploy it using docker container? [yes/no]: no
- Is it a CPU or GPU environment?: GPU
- Which example are you using: fsdp_tp_example.py
- Link to code or data to repro [if any]: https://github.com/pytorch/examples/tree/main/distributed/tensor_parallelism