EMA accuracy problems #1092
-
Thanks for your excellent work! @rwightman

sh scripts/distributed_train.sh 8 /data/public/imagenet2012 --model final -b 512 --sched step --epochs 450 --decay-epochs 2.4 --decay-rate .97 --opt rmsproptf --opt-eps .001 -j 8 --warmup-lr 1e-6 --weight-decay 1e-5 --drop 0.3 --drop-connect 0.2 --model-ema --model-ema-decay 0.9999 --aa rand-m9-mstd0.5 --remode pixel --reprob 0.2 --amp --lr .256

After 66 epochs, the EMA accuracy stays at about 0.1.

Train: 64 [ 0/312 ( 0%)] Loss: 2.770 (2.77) Time: 21.410s, 191.31/s (21.410s, 191.31/s) LR: 1.160e-01 Data: 19.130 (19.130)
Train: 64 [ 50/312 ( 16%)] Loss: 2.804 (2.79) Time: 0.902s, 4542.33/s (2.633s, 1555.38/s) LR: 1.160e-01 Data: 0.102 (1.246)
Train: 64 [ 100/312 ( 32%)] Loss: 2.797 (2.79) Time: 0.846s, 4841.70/s (2.455s, 1668.13/s) LR: 1.160e-01 Data: 0.083 (0.962)
Train: 64 [ 150/312 ( 48%)] Loss: 2.790 (2.79) Time: 0.923s, 4435.88/s (2.409s, 1700.63/s) LR: 1.160e-01 Data: 0.091 (0.699)
Train: 64 [ 200/312 ( 64%)] Loss: 2.766 (2.79) Time: 7.823s, 523.61/s (2.411s, 1699.05/s) LR: 1.160e-01 Data: 0.070 (0.551)
Train: 64 [ 250/312 ( 80%)] Loss: 2.835 (2.79) Time: 1.092s, 3750.54/s (2.375s, 1724.58/s) LR: 1.160e-01 Data: 0.084 (0.471)
Train: 64 [ 300/312 ( 96%)] Loss: 2.852 (2.80) Time: 0.847s, 4836.48/s (2.376s, 1724.12/s) LR: 1.160e-01 Data: 0.074 (0.414)
Train: 64 [ 311/312 (100%)] Loss: 2.896 (2.81) Time: 0.728s, 5629.42/s (2.332s, 1756.69/s) LR: 1.160e-01 Data: 0.000 (0.401)
Distributing BatchNorm running means and vars
Test: [ 0/12] Time: 21.367 (21.367) Loss: 1.2891 (1.2891) Acc@1: 69.1162 (69.1162) Acc@5: 89.9414 (89.9414)
Test: [ 12/12] Time: 0.078 (2.470) Loss: 0.9946 (1.6403) Acc@1: 77.0047 (62.6120) Acc@5: 93.0425 (84.7840)
Test (EMA): [ 0/12] Time: 18.612 (18.612) Loss: 6.8945 (6.8945) Acc@1: 0.0000 ( 0.0000) Acc@5: 1.0498 ( 1.0498)
Test (EMA): [ 12/12] Time: 0.034 (2.172) Loss: 7.3594 (7.1392) Acc@1: 0.0000 ( 0.0980) Acc@5: 0.0000 ( 0.5000)
Train: 65 [ 0/312 ( 0%)] Loss: 2.802 (2.80) Time: 22.782s, 179.79/s (22.782s, 179.79/s) LR: 1.125e-01 Data: 18.689 (18.689)
Train: 65 [ 50/312 ( 16%)] Loss: 2.785 (2.79) Time: 0.833s, 4915.50/s (2.595s, 1578.53/s) LR: 1.125e-01 Data: 0.087 (1.400)
Train: 65 [ 100/312 ( 32%)] Loss: 2.806 (2.80) Time: 0.820s, 4997.90/s (2.481s, 1650.78/s) LR: 1.125e-01 Data: 0.074 (1.261)
Train: 65 [ 150/312 ( 48%)] Loss: 2.826 (2.80) Time: 0.920s, 4450.98/s (2.426s, 1688.32/s) LR: 1.125e-01 Data: 0.110 (1.078)
Train: 65 [ 200/312 ( 64%)] Loss: 2.823 (2.81) Time: 0.928s, 4412.98/s (2.425s, 1688.87/s) LR: 1.125e-01 Data: 0.137 (0.931)
Train: 65 [ 250/312 ( 80%)] Loss: 2.789 (2.81) Time: 0.831s, 4930.29/s (2.398s, 1708.20/s) LR: 1.125e-01 Data: 0.072 (0.899)
Train: 65 [ 300/312 ( 96%)] Loss: 2.861 (2.81) Time: 0.840s, 4878.27/s (2.375s, 1724.34/s) LR: 1.125e-01 Data: 0.062 (0.926)
Train: 65 [ 311/312 (100%)] Loss: 2.851 (2.82) Time: 0.728s, 5629.82/s (2.347s, 1745.29/s) LR: 1.125e-01 Data: 0.000 (0.918)
Distributing BatchNorm running means and vars
Test: [ 0/12] Time: 20.165 (20.165) Loss: 1.3135 (1.3135) Acc@1: 70.4590 (70.4590) Acc@5: 90.2100 (90.2100)
Test: [ 12/12] Time: 0.034 (2.633) Loss: 0.9727 (1.5545) Acc@1: 78.0660 (64.8620) Acc@5: 94.1038 (86.2480)
Test (EMA): [ 0/12] Time: 17.258 (17.258) Loss: 6.8984 (6.8984) Acc@1: 0.0000 ( 0.0000) Acc@5: 1.2695 ( 1.2695)
Test (EMA): [ 12/12] Time: 0.034 (2.116) Loss: 7.3516 (7.1292) Acc@1: 0.0000 ( 0.0980) Acc@5: 0.0000 ( 0.5160)
Train: 66 [ 0/312 ( 0%)] Loss: 2.767 (2.77) Time: 25.191s, 162.59/s (25.191s, 162.59/s) LR: 1.125e-01 Data: 21.310 (21.310)
Train: 66 [ 50/312 ( 16%)] Loss: 2.748 (2.76) Time: 0.844s, 4853.36/s (2.846s, 1439.17/s) LR: 1.125e-01 Data: 0.084 (1.743)
Train: 66 [ 100/312 ( 32%)] Loss: 2.826 (2.78) Time: 0.859s, 4770.71/s (2.588s, 1582.69/s) LR: 1.125e-01 Data: 0.106 (1.434)
Train: 66 [ 150/312 ( 48%)] Loss: 2.809 (2.79) Time: 0.854s, 4797.23/s (2.482s, 1650.40/s) LR: 1.125e-01 Data: 0.094 (1.272)
Train: 66 [ 200/312 ( 64%)] Loss: 2.808 (2.79) Time: 0.834s, 4909.97/s (2.477s, 1653.39/s) LR: 1.125e-01 Data: 0.078 (1.120)
Train: 66 [ 250/312 ( 80%)] Loss: 2.856 (2.80) Time: 0.849s, 4825.69/s (2.426s, 1688.60/s) LR: 1.125e-01 Data: 0.079 (0.948)
Train: 66 [ 300/312 ( 96%)] Loss: 2.836 (2.81) Time: 0.922s, 4441.56/s (2.374s, 1725.46/s) LR: 1.125e-01 Data: 0.088 (0.830)
Train: 66 [ 311/312 (100%)] Loss: 2.843 (2.81) Time: 0.727s, 5637.55/s (2.332s, 1756.42/s) LR: 1.125e-01 Data: 0.000 (0.803)
Distributing BatchNorm running means and vars
Test: [ 0/12] Time: 19.141 (19.141) Loss: 1.2900 (1.2900) Acc@1: 70.1172 (70.1172) Acc@5: 89.6973 (89.6973)
Test: [ 12/12] Time: 0.036 (2.268) Loss: 0.9883 (1.5837) Acc@1: 78.0660 (63.6980) Acc@5: 92.9245 (85.6780)
Test (EMA): [ 0/12] Time: 17.182 (17.182) Loss: 6.9023 (6.9023) Acc@1: 0.0000 ( 0.0000) Acc@5: 1.6357 ( 1.6357)
Test (EMA): [ 12/12] Time: 0.034 (2.031) Loss: 7.3477 (7.1205) Acc@1: 0.0000 ( 0.1000) Acc@5: 0.0000 ( 0.5480)

That is a summary of the training log. Any help is appreciated.
-
The time period of the EMA weight average is measured in optimizer steps, so it needs to be set relative to your steps per epoch. You have a large global batch size (4096), so very few steps per epoch, and you need to change your decay factor to make sense (have equivalence to maybe 30-100 epochs; I usually target 10-25% of the training duration). Right now your EMA weights probably won't be 'good' until a few hundred epochs have passed... you can look up details on EMA periods, etc.

As an aside, it's unlikely a 4096 global batch with RMSProp will be 'great'; the best results I've had with that optimizer have been in the 256-768 range. Maybe a tweaked version of the LAMB hparams from ResNet strikes back could be more competitive? Or SGD with grad clipping?
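Not part of the reply above, just a rough sketch of the arithmetic it alludes to, assuming the EMA decay is applied once per optimizer step (as timm's model EMA update does) and using ImageNet-1k's ~1.28M training images; the 15% target below is only an illustrative pick from the reply's 10-25% guideline:

```python
# Back-of-envelope arithmetic for choosing --model-ema-decay at a given global batch size.
# Assumption: the decay is applied once per optimizer step, and 1 / (1 - decay) is taken
# as the effective averaging window in steps (a standard EMA heuristic).

dataset_size = 1_281_167        # ImageNet-1k training images
global_batch = 8 * 512          # 8 GPUs x 512 per GPU = 4096
steps_per_epoch = dataset_size // global_batch    # ~312, matches the log above

decay = 0.9999                                    # value used in the run
window_steps = 1 / (1 - decay)                    # ~10,000 steps of effective "memory"
window_epochs = window_steps / steps_per_epoch    # ~32 epochs

# Fraction of the EMA still carrying the (random) initial weights after N epochs.
epochs_trained = 66
init_residual = decay ** (epochs_trained * steps_per_epoch)   # ~0.13 at epoch 66

# Decay that would give an averaging window of `target_epochs` epochs; pick
# target_epochs as a fraction of the total training length, per the reply.
target_epochs = 0.15 * 450
target_decay = 1 - 1 / (target_epochs * steps_per_epoch)

print(f"steps/epoch: {steps_per_epoch}")
print(f"window at decay={decay}: ~{window_epochs:.0f} epochs")
print(f"initial-weight residual at epoch {epochs_trained}: {init_residual:.2f}")
print(f"decay for a ~{target_epochs:.0f}-epoch window: {target_decay:.6f}")
```

The point of the arithmetic: the same decay value corresponds to a very different number of epochs once the steps per epoch drop by an order of magnitude, and the EMA only forgets its initialization after several multiples of that window, which is roughly consistent with the EMA accuracy still sitting near 0.1 at epoch 66 in the log above.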