EMA accuracy problems #1092
-
Thanks for your excellent work! @rwightman

sh scripts/distributed_train.sh 8 /data/public/imagenet2012 --model final -b 512 --sched step --epochs 450 --decay-epochs 2.4 --decay-rate .97 --opt rmsproptf --opt-eps .001 -j 8 --warmup-lr 1e-6 --weight-decay 1e-5 --drop 0.3 --drop-connect 0.2 --model-ema --model-ema-decay 0.9999 --aa rand-m9-mstd0.5 --remode pixel --reprob 0.2 --amp --lr .256

After 66 epochs, the EMA accuracy stays at about 0.1.

Train: 64 [ 0/312 ( 0%)] Loss: 2.770 (2.77) Time: 21.410s, 191.31/s (21.410s, 191.31/s) LR: 1.160e-01 Data: 19.130 (19.130)
Train: 64 [ 50/312 ( 16%)] Loss: 2.804 (2.79) Time: 0.902s, 4542.33/s (2.633s, 1555.38/s) LR: 1.160e-01 Data: 0.102 (1.246)
Train: 64 [ 100/312 ( 32%)] Loss: 2.797 (2.79) Time: 0.846s, 4841.70/s (2.455s, 1668.13/s) LR: 1.160e-01 Data: 0.083 (0.962)
Train: 64 [ 150/312 ( 48%)] Loss: 2.790 (2.79) Time: 0.923s, 4435.88/s (2.409s, 1700.63/s) LR: 1.160e-01 Data: 0.091 (0.699)
Train: 64 [ 200/312 ( 64%)] Loss: 2.766 (2.79) Time: 7.823s, 523.61/s (2.411s, 1699.05/s) LR: 1.160e-01 Data: 0.070 (0.551)
Train: 64 [ 250/312 ( 80%)] Loss: 2.835 (2.79) Time: 1.092s, 3750.54/s (2.375s, 1724.58/s) LR: 1.160e-01 Data: 0.084 (0.471)
Train: 64 [ 300/312 ( 96%)] Loss: 2.852 (2.80) Time: 0.847s, 4836.48/s (2.376s, 1724.12/s) LR: 1.160e-01 Data: 0.074 (0.414)
Train: 64 [ 311/312 (100%)] Loss: 2.896 (2.81) Time: 0.728s, 5629.42/s (2.332s, 1756.69/s) LR: 1.160e-01 Data: 0.000 (0.401)
Distributing BatchNorm running means and vars
Test: [ 0/12] Time: 21.367 (21.367) Loss: 1.2891 (1.2891) Acc@1: 69.1162 (69.1162) Acc@5: 89.9414 (89.9414)
Test: [ 12/12] Time: 0.078 (2.470) Loss: 0.9946 (1.6403) Acc@1: 77.0047 (62.6120) Acc@5: 93.0425 (84.7840)
Test (EMA): [ 0/12] Time: 18.612 (18.612) Loss: 6.8945 (6.8945) Acc@1: 0.0000 ( 0.0000) Acc@5: 1.0498 ( 1.0498)
Test (EMA): [ 12/12] Time: 0.034 (2.172) Loss: 7.3594 (7.1392) Acc@1: 0.0000 ( 0.0980) Acc@5: 0.0000 ( 0.5000)
Train: 65 [ 0/312 ( 0%)] Loss: 2.802 (2.80) Time: 22.782s, 179.79/s (22.782s, 179.79/s) LR: 1.125e-01 Data: 18.689 (18.689)
Train: 65 [ 50/312 ( 16%)] Loss: 2.785 (2.79) Time: 0.833s, 4915.50/s (2.595s, 1578.53/s) LR: 1.125e-01 Data: 0.087 (1.400)
Train: 65 [ 100/312 ( 32%)] Loss: 2.806 (2.80) Time: 0.820s, 4997.90/s (2.481s, 1650.78/s) LR: 1.125e-01 Data: 0.074 (1.261)
Train: 65 [ 150/312 ( 48%)] Loss: 2.826 (2.80) Time: 0.920s, 4450.98/s (2.426s, 1688.32/s) LR: 1.125e-01 Data: 0.110 (1.078)
Train: 65 [ 200/312 ( 64%)] Loss: 2.823 (2.81) Time: 0.928s, 4412.98/s (2.425s, 1688.87/s) LR: 1.125e-01 Data: 0.137 (0.931)
Train: 65 [ 250/312 ( 80%)] Loss: 2.789 (2.81) Time: 0.831s, 4930.29/s (2.398s, 1708.20/s) LR: 1.125e-01 Data: 0.072 (0.899)
Train: 65 [ 300/312 ( 96%)] Loss: 2.861 (2.81) Time: 0.840s, 4878.27/s (2.375s, 1724.34/s) LR: 1.125e-01 Data: 0.062 (0.926)
Train: 65 [ 311/312 (100%)] Loss: 2.851 (2.82) Time: 0.728s, 5629.82/s (2.347s, 1745.29/s) LR: 1.125e-01 Data: 0.000 (0.918)
Distributing BatchNorm running means and vars
Test: [ 0/12] Time: 20.165 (20.165) Loss: 1.3135 (1.3135) Acc@1: 70.4590 (70.4590) Acc@5: 90.2100 (90.2100)
Test: [ 12/12] Time: 0.034 (2.633) Loss: 0.9727 (1.5545) Acc@1: 78.0660 (64.8620) Acc@5: 94.1038 (86.2480)
Test (EMA): [ 0/12] Time: 17.258 (17.258) Loss: 6.8984 (6.8984) Acc@1: 0.0000 ( 0.0000) Acc@5: 1.2695 ( 1.2695)
Test (EMA): [ 12/12] Time: 0.034 (2.116) Loss: 7.3516 (7.1292) Acc@1: 0.0000 ( 0.0980) Acc@5: 0.0000 ( 0.5160)
Train: 66 [ 0/312 ( 0%)] Loss: 2.767 (2.77) Time: 25.191s, 162.59/s (25.191s, 162.59/s) LR: 1.125e-01 Data: 21.310 (21.310)
Train: 66 [ 50/312 ( 16%)] Loss: 2.748 (2.76) Time: 0.844s, 4853.36/s (2.846s, 1439.17/s) LR: 1.125e-01 Data: 0.084 (1.743)
Train: 66 [ 100/312 ( 32%)] Loss: 2.826 (2.78) Time: 0.859s, 4770.71/s (2.588s, 1582.69/s) LR: 1.125e-01 Data: 0.106 (1.434)
Train: 66 [ 150/312 ( 48%)] Loss: 2.809 (2.79) Time: 0.854s, 4797.23/s (2.482s, 1650.40/s) LR: 1.125e-01 Data: 0.094 (1.272)
Train: 66 [ 200/312 ( 64%)] Loss: 2.808 (2.79) Time: 0.834s, 4909.97/s (2.477s, 1653.39/s) LR: 1.125e-01 Data: 0.078 (1.120)
Train: 66 [ 250/312 ( 80%)] Loss: 2.856 (2.80) Time: 0.849s, 4825.69/s (2.426s, 1688.60/s) LR: 1.125e-01 Data: 0.079 (0.948)
Train: 66 [ 300/312 ( 96%)] Loss: 2.836 (2.81) Time: 0.922s, 4441.56/s (2.374s, 1725.46/s) LR: 1.125e-01 Data: 0.088 (0.830)
Train: 66 [ 311/312 (100%)] Loss: 2.843 (2.81) Time: 0.727s, 5637.55/s (2.332s, 1756.42/s) LR: 1.125e-01 Data: 0.000 (0.803)
Distributing BatchNorm running means and vars
Test: [ 0/12] Time: 19.141 (19.141) Loss: 1.2900 (1.2900) Acc@1: 70.1172 (70.1172) Acc@5: 89.6973 (89.6973)
Test: [ 12/12] Time: 0.036 (2.268) Loss: 0.9883 (1.5837) Acc@1: 78.0660 (63.6980) Acc@5: 92.9245 (85.6780)
Test (EMA): [ 0/12] Time: 17.182 (17.182) Loss: 6.9023 (6.9023) Acc@1: 0.0000 ( 0.0000) Acc@5: 1.6357 ( 1.6357)
Test (EMA): [ 12/12] Time: 0.034 (2.031) Loss: 7.3477 (7.1205) Acc@1: 0.0000 ( 0.1000) Acc@5: 0.0000 ( 0.5480)

That is a summary of the training log. Any help is appreciated.
-
The time period of the EMA weight average is measured in optimizer steps, so it needs to be set relative to your steps per epoch. You have a large global batch size (4096), so very few steps per epoch, and you need to change your decay factor to make sense (have equivalence to maybe 30-100 epochs; I usually target 10-25% of the training duration). Right now your EMA weights probably won't be 'good' until a few hundred epochs have passed... you can look up details on EMA periods, etc.

As an aside, it's unlikely a 4096 global batch with RMSProp will be 'great'; the best results I've had with that optimizer have been in the 256-768 range. Maybe a tweaked version of the LAMB hparams from ResNet strikes back could be more competitive? Or SGD with grad clipping?
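Not part of the reply above, just a rough sketch of the arithmetic it alludes to, assuming the EMA decay is applied once per optimizer step (as timm's model EMA update does) and using ImageNet-1k's ~1.28M training images; the 15% target below is only an illustrative pick from the reply's 10-25% guideline:

```python
# Back-of-envelope arithmetic for choosing --model-ema-decay at a given global batch size.
# Assumption: the decay is applied once per optimizer step, and 1 / (1 - decay) is taken
# as the effective averaging window in steps (a standard EMA heuristic).

dataset_size = 1_281_167        # ImageNet-1k training images
global_batch = 8 * 512          # 8 GPUs x 512 per GPU = 4096
steps_per_epoch = dataset_size // global_batch    # ~312, matches the log above

decay = 0.9999                                    # value used in the run
window_steps = 1 / (1 - decay)                    # ~10,000 steps of effective "memory"
window_epochs = window_steps / steps_per_epoch    # ~32 epochs

# Fraction of the EMA still carrying the (random) initial weights after N epochs.
epochs_trained = 66
init_residual = decay ** (epochs_trained * steps_per_epoch)   # ~0.13 at epoch 66

# Decay that would give an averaging window of `target_epochs` epochs; pick
# target_epochs as a fraction of the total training length, per the reply.
target_epochs = 0.15 * 450
target_decay = 1 - 1 / (target_epochs * steps_per_epoch)

print(f"steps/epoch: {steps_per_epoch}")
print(f"window at decay={decay}: ~{window_epochs:.0f} epochs")
print(f"initial-weight residual at epoch {epochs_trained}: {init_residual:.2f}")
print(f"decay for a ~{target_epochs:.0f}-epoch window: {target_decay:.6f}")
```

The point of the arithmetic: the same decay value corresponds to a very different number of epochs once the steps per epoch drop by an order of magnitude, and the EMA only forgets its initialization after several multiples of that window, which is roughly consistent with the EMA accuracy still sitting near 0.1 at epoch 66 in the log above.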