Commit bb40cf8

fixed instructions
1 parent 301837a commit bb40cf8

1 file changed: +4 -0 lines changed

README.md

Lines changed: 4 additions & 0 deletions
@@ -12,6 +12,8 @@ Distributed training of an attention model. Forked from: [hkproj/pytorch-transfo
 
 ### Setup
 
+Log in to each machine and perform the following operations:
+
 1. `sudo apt-get update`
 2. `sudo apt-get install net-tools`
 3. If you get an error about `seahorse` while installing `net-tools`, do the following:
@@ -39,6 +41,8 @@ Distributed training of an attention model. Forked from: [hkproj/pytorch-transfo
 
 ### Local training
 
+Run the following command on one machine only. Do not run it on both; otherwise they will overwrite each other's checkpoints.
+
 `torchrun --nproc_per_node=2 --nnodes=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:48123 train.py --batch_size 8 --model_folder "/mnt/training-data/weights"`
 
 ### Distributed training
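
Note on the local training command in the diff above: `torchrun` exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to every worker process it starts, so with `--nproc_per_node=2 --nnodes=1` the script runs as two processes with ranks 0 and 1. The sketch below shows one common way a training script might read those variables and restrict checkpoint writes to a single process. The helper names `torchrun_context` and `should_save_checkpoint` are hypothetical, and nothing in this commit shows whether `train.py` actually works this way.

```python
import os


def torchrun_context(env=None):
    """Read the rank information that torchrun exports to each worker process."""
    env = os.environ if env is None else env
    # The defaults let the script also run as a plain single process.
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return rank, local_rank, world_size


def should_save_checkpoint(rank):
    """Hypothetical policy: only global rank 0 writes into the model folder."""
    return rank == 0


if __name__ == "__main__":
    rank, local_rank, world_size = torchrun_context()
    print(
        f"rank={rank} local_rank={local_rank} world_size={world_size} "
        f"saves_checkpoints={should_save_checkpoint(rank)}"
    )
```

A guard like this only coordinates the processes of a single launch. Two separate single-node launches on different machines each have their own rank 0, so if both write to the same model folder they would still collide, which is consistent with the README's warning.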
