It seems PyTorch offers two (maybe three, if you count torchx) different methods to handle multi-GPU training: spawning (`torch.multiprocessing.spawn`) and TorchElastic (newer, launched with `torchrun`). The goal here is to compare the speed of these methods.
PyTorch version: >= 1.10 (`torchrun` was added in 1.10)
- Spawning (see the first sketch below):

  ```bash
  python main_spawn.py --dist-url 'tcp://localhost:23456' --multiprocessing-distributed --world-size 1 --rank 0
  ```

- TorchElastic (see the second sketch below):

  ```bash
  torchrun --standalone --nnodes=1 --nproc_per_node=$NUM_GPU main_launch.py
  ```
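For context, a minimal sketch of what a spawn-style entry point like `main_spawn.py` might look like (the structure, the `worker` function, and the placeholder model are assumptions for illustration, not the actual script):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(local_rank, world_size, dist_url):
    # Each spawned process joins the process group itself; on a single node,
    # rank == local GPU index.
    dist.init_process_group(
        backend="nccl", init_method=dist_url,
        world_size=world_size, rank=local_rank,
    )
    torch.cuda.set_device(local_rank)
    model = torch.nn.Linear(10, 10).cuda(local_rank)  # placeholder model
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    # ... training / evaluation loop ...
    dist.destroy_process_group()


if __name__ == "__main__":
    num_gpus = torch.cuda.device_count()
    # The parent process forks one worker per GPU (single-node assumption).
    mp.spawn(worker, args=(num_gpus, "tcp://localhost:23456"), nprocs=num_gpus)
```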
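And a corresponding sketch of a `torchrun`-style entry point like `main_launch.py` (again an assumed structure, not the actual script): `torchrun` spawns the workers itself and passes rank information through environment variables, so the script no longer calls `mp.spawn`:

```python
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT;
    # the default env:// rendezvous picks them up automatically.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = torch.nn.Linear(10, 10).cuda(local_rank)  # placeholder model
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    # ... training / evaluation loop ...
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```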
On CIFAR-10, with 8× TITAN V GPUs:
| Time per epoch   | Spawning | TorchElastic |
|------------------|----------|--------------|
| Training (sec)   | 8.52     | 5.73         |
| Evaluation (sec) | 2.62     | 1.64         |
=> In my setting, TorchElastic is about 50% faster than spawning (8.52/5.73 ≈ 1.49× for training, 2.62/1.64 ≈ 1.60× for evaluation).
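For reference, a minimal sketch of one way to measure per-epoch wall-clock time on GPU; `run_epoch` is a hypothetical callable standing in for one training or evaluation pass, and this is not necessarily how the numbers above were collected:

```python
import time

import torch


def timed_epoch(run_epoch):
    torch.cuda.synchronize()  # don't let pending GPU work leak into the timer
    start = time.time()
    run_epoch()
    torch.cuda.synchronize()  # wait for all kernels launched during the epoch
    return time.time() - start
```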