Why all 1D parameters are optimized without weight decay? #1744

function2-llx · 2023-03-27T11:57:06Z


When constructing optimizers, timm sets weight decay to 0 for all 1D parameters. E.g.:

pytorch-image-models/timm/optim/optim_factory.py

Lines 57 to 58 in 56b9031

    
    if param.ndim <= 1 or name.endswith(".bias") or name in no_weight_decay_list:
        no_decay.append(param)

pytorch-image-models/timm/optim/optim_factory.py

Lines 129 to 132 in 56b9031

    
    # no decay: all 1D parameters and model specific ones
    if param.ndim == 1 or name in no_weight_decay_list:
        g_decay = "no_decay"
        this_decay = 0.

I'm wondering whether this rule is always correct. I understand that 1D parameters may be embeddings that are not used in matrix multiplication, but they can still be used in, e.g., dot products.
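To make the question concrete, here is a minimal sketch of the grouping rule quoted above, written so it runs without timm or PyTorch. The function name `split_decay_groups`, the `force_decay` override, the parameter name `query_vec`, and the `StubParam` stand-in are all hypothetical and not part of timm; they only illustrate how a 1D parameter used in a dot product could be opted back into weight decay.

```python
from dataclasses import dataclass


@dataclass
class StubParam:
    """Stand-in for a tensor parameter; only the attributes the rule reads."""
    ndim: int
    requires_grad: bool = True


def split_decay_groups(named_params, weight_decay=0.05, force_decay=()):
    """Split (name, param) pairs into no-decay and decay optimizer groups.

    Follows the quoted rule (ndim <= 1 or a ".bias" suffix means no decay).
    `force_decay` is a hypothetical override that keeps decay on the named
    parameters even when they are 1D.
    """
    decay, no_decay = [], []
    for name, param in named_params:
        if not getattr(param, "requires_grad", True):
            continue  # frozen parameters are not handed to the optimizer
        if name in force_decay:
            decay.append(param)
        elif param.ndim <= 1 or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": no_decay, "weight_decay": 0.0},
        {"params": decay, "weight_decay": weight_decay},
    ]


# Example: `query_vec` is a 1D vector that would be used in a dot product.
params = [
    ("fc.weight", StubParam(ndim=2)),
    ("fc.bias", StubParam(ndim=1)),
    ("query_vec", StubParam(ndim=1)),
]
groups = split_decay_groups(params, weight_decay=0.05, force_decay={"query_vec"})
```

With real models, the same function should accept `model.named_parameters()` directly, and the returned list of dicts has the per-group format that PyTorch optimizers such as `torch.optim.AdamW` accept.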

Any explanation will be appreciated.

Replies: 0 comments
