Why are all 1D parameters optimized without weight decay? #1744
Unanswered
function2-llx asked this question in Q&A
Replies: 0 comments
When constructing optimizers, timm sets weight decay to 0 for all 1D parameters. E.g.:
pytorch-image-models/timm/optim/optim_factory.py, lines 57 to 58 in 56b9031
pytorch-image-models/timm/optim/optim_factory.py, lines 129 to 132 in 56b9031
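To make the rule concrete, here is a minimal sketch of how I read it. It is a paraphrase, not the timm source: the function name `split_by_weight_decay`, the default `weight_decay=0.05`, and the use of plain `(name, shape)` pairs instead of real tensors are all assumptions made to keep the example self-contained. Only the dimensionality test is modelled. In real use the pairs would come from something like `model.named_parameters()`.

```python
def split_by_weight_decay(named_shapes, weight_decay=0.05):
    """Group parameter names into no-decay / decay sets by dimensionality.

    named_shapes: iterable of (name, shape) pairs, where shape is a tuple.
    Hypothetical helper for illustration; not part of timm.
    """
    decay, no_decay = [], []
    for name, shape in named_shapes:
        # 0-D (scalar) and 1-D parameters are excluded from weight decay.
        if len(shape) <= 1:
            no_decay.append(name)
        else:
            decay.append(name)
    # Same layout as optimizer parameter groups: one dict per group.
    return [
        {"params": no_decay, "weight_decay": 0.0},
        {"params": decay, "weight_decay": weight_decay},
    ]


groups = split_by_weight_decay([
    ("fc.weight", (10, 64)),   # 2-D: decayed
    ("fc.bias", (10,)),        # 1-D: not decayed
    ("norm.weight", (64,)),    # 1-D: not decayed
])
print(groups[0]["params"])  # ['fc.bias', 'norm.weight']
print(groups[1]["params"])  # ['fc.weight']
```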
I'm wondering whether this rule is always correct. I understand that 1D parameters may be embeddings that are not used in matrix multiplication. But they can still be used in dot products, for example.
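A small sketch of the case I have in mind, in plain Python so that it runs without any dependencies. The helper `gets_weight_decay` is hypothetical and only encodes the dimensionality rule as described in this question. The same three numbers produce the same output whether they are stored as a 1D vector or as a 1x3 matrix, yet a shape-based rule would treat the two differently.

```python
def gets_weight_decay(shape):
    # Hypothetical helper: decay only parameters with more than one dimension.
    return len(shape) > 1


def dot(u, v):
    # Plain dot product of two equal-length sequences.
    return sum(a * b for a, b in zip(u, v))


x = [1.0, 2.0, 3.0]

w_vec = [0.5, -1.0, 2.0]   # stored as a 1D parameter, shape (3,)
w_mat = [w_vec]            # same values stored as a 1x3 matrix, shape (1, 3)

score_vec = dot(w_vec, x)                        # dot product with the vector
score_mat = [dot(row, x) for row in w_mat][0]    # matrix-vector product, single row

assert score_vec == score_mat  # identical computation

print(gets_weight_decay((3,)))    # False
print(gets_weight_decay((1, 3)))  # True
```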
Any explanation will be appreciated.