1 file changed: +4 -0 lines changed
@@ -12,6 +12,8 @@ Distributed training of an attention model. Forked from: [hkproj/pytorch-transfo
### Setup
+ Log in to each machine and perform the following steps:
+
1. `sudo apt-get update`
2. `sudo apt-get install net-tools`
3. If you get an error about `seahorse` while installing `net-tools`, do the following:
@@ -39,6 +41,8 @@ Distributed training of an attention model. Forked from: [hkproj/pytorch-transfo
### Local training
+ Run the following command on only one machine. Do not run it on both machines; otherwise they will overwrite each other's checkpoints.
+
`torchrun --nproc_per_node=2 --nnodes=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:48123 train.py --batch_size 8 --model_folder "/mnt/training-data/weights"`
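
With `--nproc_per_node=2 --nnodes=1`, `torchrun` starts two worker processes on the machine and passes each one its position through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. The sketch below shows one way a training script could read those variables and let only the first process write checkpoints. The names `LaunchInfo`, `read_launch_info`, and `should_write_checkpoint` are hypothetical, and whether `train.py` does anything like this is not shown in this diff.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LaunchInfo:
    """Process position as reported by torchrun (hypothetical helper type)."""
    rank: int        # global rank across all nodes
    local_rank: int  # rank within this machine
    world_size: int  # total number of worker processes


def read_launch_info(env: Optional[Mapping[str, str]] = None) -> LaunchInfo:
    """Read the variables torchrun sets; fall back to a single-process default."""
    env = os.environ if env is None else env
    return LaunchInfo(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


def should_write_checkpoint(info: LaunchInfo) -> bool:
    """Assumed policy: only global rank 0 saves weights."""
    return info.rank == 0


if __name__ == "__main__":
    info = read_launch_info()
    print(f"rank={info.rank} local_rank={info.local_rank} world_size={info.world_size}")
    print(f"writes checkpoints: {should_write_checkpoint(info)}")
```

A rank check like this only coordinates processes within one launch. Two separate launches pointed at the same `--model_folder` each have their own rank 0, which is why the command should run on one machine only.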
### Distributed training