Replies: 3 comments 7 replies
-
Hi @lpstudy, this is the dropout layer after the embedding layer. Dropout in general is a regularization technique used to reduce overfitting; the concept is explained in section 3.5.2 of the book in the context of the attention weights. Embedding dropout works at the word level by randomly dropping entire word vectors in the embedding matrix. It forces the model to randomly "forget" a little bit of information about certain words in each training pass, so it learns to infer meaning from context even when some words are missing. The idea is to make the LLM better at understanding new text without relying too much on patterns in the training data. This paper describes the general idea of a dropout layer after the embedding layer.
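In case a concrete example helps, here is a minimal PyTorch sketch (the sizes and `drop_rate` are made up for illustration). It contrasts plain element-wise dropout, which is what a standard `nn.Dropout` placed after the embeddings does, with the word-level variant described above, where whole token vectors are zeroed out:

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

vocab_size, emb_dim, drop_rate = 10, 4, 0.5   # illustrative values only
emb = nn.Embedding(vocab_size, emb_dim)

token_ids = torch.tensor([[1, 2, 3, 4]])      # (batch, seq_len)
x = emb(token_ids)                            # (batch, seq_len, emb_dim)

# Element-wise dropout: individual entries of the embedding vectors
# are zeroed at random (what a plain nn.Dropout after the embeddings does).
elementwise = nn.Dropout(drop_rate)(x)

# Word-level embedding dropout: an entire token's embedding vector is zeroed,
# so the model must infer that token's meaning from the surrounding context.
keep = (torch.rand(x.shape[:2]) > drop_rate).float()
wordlevel = x * keep.unsqueeze(-1) / (1.0 - drop_rate)  # rescale (inverted dropout)

print(elementwise)
print(wordlevel)
```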
-
I agree, it's less intuitive at the start compared to placing dropout after an important block. One small clarification: in practice nowadays, dropout at the entry is barely a thing anymore (judging from the recent open-source SOTA LLMs I've seen), because positional information is handled with RoPE + YaRN scaling (pretty much what Sebastian does for the Llama conversion here). Since RoPE is applied at the attention level and its angles are precomputed rather than learned, I guess there isn't much benefit to using dropout at the entry just for the learned embeddings.
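To make the "precomputed, not learned" point concrete, here is a rough sketch of how RoPE angles are typically precomputed (the function name and defaults are my own, not the exact code from the conversion notebook). The values depend only on position and dimension index, so there are no extra learned positional parameters at the entry that dropout could regularize:

```python
import torch

def precompute_rope_angles(head_dim, context_length, theta_base=10_000.0):
    # Frequencies are a fixed function of the dimension index; nothing here
    # is a learnable parameter, unlike an absolute positional embedding table.
    inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(context_length).float()
    angles = positions[:, None] * inv_freq[None, :]    # (context_length, head_dim // 2)
    return torch.cos(angles), torch.sin(angles)

cos, sin = precompute_rope_angles(head_dim=64, context_length=1024)
print(cos.shape, sin.shape)  # torch.Size([1024, 32]) torch.Size([1024, 32])
```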
-
Thanks a lot!
-
Hi, I have bought this book and am implementing it by following your step-by-step instructions. The book is really great and has helped me a lot. However, I am a little confused about the usage of `drop_emb` after `tok_embeds + pos_embeds` in `GPTModel`. Could you give some explanation or point me to some material that would help me understand it?
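For context, here is a self-contained sketch (not the book's exact code; sizes roughly match the 124M-parameter config) showing where `drop_emb` sits: right after the token and positional embeddings are summed, so during training it randomly zeroes parts of that combined embedding before it enters the transformer blocks.

```python
import torch
import torch.nn as nn

# Minimal stand-in for the embedding part of GPTModel's forward pass
# (illustrative sizes; the real model continues with transformer blocks).
vocab_size, context_length, emb_dim, drop_rate = 50257, 1024, 768, 0.1

tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)
drop_emb = nn.Dropout(drop_rate)

in_idx = torch.tensor([[6109, 3626, 6100, 345]])   # (batch, seq_len) example token IDs
seq_len = in_idx.shape[1]

tok_embeds = tok_emb(in_idx)                       # token meaning
pos_embeds = pos_emb(torch.arange(seq_len))        # learned position info
x = drop_emb(tok_embeds + pos_embeds)              # <-- drop_emb applied here, training only
# Calling drop_emb.eval() (or model.eval() in the full model) disables dropout at inference.
```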