Replies: 3 comments 7 replies
-
Hi @lpstudy, this is the dropout layer after the embedding layer. Dropout in general is a regularization technique used to reduce overfitting; the concept is explained in section 3.5.2 of the book in the context of the attention weights. Embedding dropout works at the word level by randomly dropping entire word vectors in the embedding matrix. It forces the model to randomly "forget" a little bit of information about certain words in each training pass, so it learns to infer meaning from context even when some words are missing. The idea is to make the LLM better at understanding new text without relying too much on patterns in the training data. This paper describes the general idea of a dropout layer after the embedding layer.
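In case a concrete example helps, here is a minimal PyTorch sketch (the sizes and `drop_rate` are made up for illustration). It contrasts plain element-wise dropout, which is what a standard `nn.Dropout` placed after the embeddings does, with the word-level variant described above, where whole token vectors are zeroed out:

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

vocab_size, emb_dim, drop_rate = 10, 4, 0.5   # illustrative values only
emb = nn.Embedding(vocab_size, emb_dim)

token_ids = torch.tensor([[1, 2, 3, 4]])      # (batch, seq_len)
x = emb(token_ids)                            # (batch, seq_len, emb_dim)

# Element-wise dropout: individual entries of the embedding vectors
# are zeroed at random (what a plain nn.Dropout after the embeddings does).
elementwise = nn.Dropout(drop_rate)(x)

# Word-level embedding dropout: an entire token's embedding vector is zeroed,
# so the model must infer that token's meaning from the surrounding context.
keep = (torch.rand(x.shape[:2]) > drop_rate).float()
wordlevel = x * keep.unsqueeze(-1) / (1.0 - drop_rate)  # rescale (inverted dropout)

print(elementwise)
print(wordlevel)
```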
-
I agree, it's less intuitive at the start compared to placing dropout after an important block. One small clarification: in practice nowadays, dropout at the entry is barely a thing anymore (judging from the recent open-source SOTA LLMs I've seen), because positional information is handled with RoPE + YaRN scaling (pretty much what Sebastian does for the Llama conversion here). Since RoPE is applied at the attention level and its angles are precomputed rather than learned, I guess there isn't much benefit to using dropout at the entry just for the learned embeddings.
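To make the "precomputed, not learned" point concrete, here is a rough sketch of how RoPE angles are typically precomputed (the function name and defaults are my own, not the exact code from the conversion notebook). The values depend only on position and dimension index, so there are no extra learned positional parameters at the entry that dropout could regularize:

```python
import torch

def precompute_rope_angles(head_dim, context_length, theta_base=10_000.0):
    # Frequencies are a fixed function of the dimension index; nothing here
    # is a learnable parameter, unlike an absolute positional embedding table.
    inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(context_length).float()
    angles = positions[:, None] * inv_freq[None, :]    # (context_length, head_dim // 2)
    return torch.cos(angles), torch.sin(angles)

cos, sin = precompute_rope_angles(head_dim=64, context_length=1024)
print(cos.shape, sin.shape)  # torch.Size([1024, 32]) torch.Size([1024, 32])
```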
-
Thanks a lot!
-
Hi, I have bought this book and am implementing it by following your step-by-step instructions. The book is really great and has helped me a lot. However, I am a little confused about the usage of `drop_emb` after `tok_embeds + pos_embeds` in `GPTModel`. Could you give some explanation or point me to some material that would help me understand it?
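For context, here is a self-contained sketch (not the book's exact code; sizes roughly match the 124M-parameter config) showing where `drop_emb` sits: right after the token and positional embeddings are summed, so during training it randomly zeroes parts of that combined embedding before it enters the transformer blocks.

```python
import torch
import torch.nn as nn

# Minimal stand-in for the embedding part of GPTModel's forward pass
# (illustrative sizes; the real model continues with transformer blocks).
vocab_size, context_length, emb_dim, drop_rate = 50257, 1024, 768, 0.1

tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)
drop_emb = nn.Dropout(drop_rate)

in_idx = torch.tensor([[6109, 3626, 6100, 345]])   # (batch, seq_len) example token IDs
seq_len = in_idx.shape[1]

tok_embeds = tok_emb(in_idx)                       # token meaning
pos_embeds = pos_emb(torch.arange(seq_len))        # learned position info
x = drop_emb(tok_embeds + pos_embeds)              # <-- drop_emb applied here, training only
# Calling drop_emb.eval() (or model.eval() in the full model) disables dropout at inference.
```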