[Chapter 8] Paper Replicating, 244. Creating the Patch Embedding Layer with PyTorch - are we not missing a step? #485
---
Hey, I'd like to start off by thanking you for this amazing course, it's been extremely useful and engaging :) I'm having some trouble understanding the paper replication section. I understand that we are trying to reproduce the Hybrid approach proposed in the paper. In this section, the authors say:

> In this hybrid model, the patch embedding projection E (Eq. 1) is applied to patches extracted from a CNN feature map.
However, in our implementation we have this:

```python
from torch import nn

# 1. Create a class which subclasses nn.Module
class PatchEmbedding(nn.Module):
    """Turns a 2D input image into a 1D sequence learnable embedding vector.

    Args:
        in_channels (int): Number of color channels for the input images. Defaults to 3.
        patch_size (int): Size of patches to convert input image into. Defaults to 16.
        embedding_dim (int): Size of embedding to turn image into. Defaults to 768.
    """
    # 2. Initialize the class with appropriate variables
    def __init__(self,
                 in_channels: int = 3,
                 patch_size: int = 16,
                 embedding_dim: int = 768):
        super().__init__()
        self.patch_size = patch_size  # store for the assertion in forward()

        # 3. Create a layer to turn an image into patches
        self.patcher = nn.Conv2d(in_channels=in_channels,
                                 out_channels=embedding_dim,
                                 kernel_size=patch_size,
                                 stride=patch_size,
                                 padding=0)

        # 4. Create a layer to flatten the patch feature maps into a single dimension
        self.flatten = nn.Flatten(start_dim=2,  # only flatten the feature map dimensions into a single vector
                                  end_dim=3)

    # 5. Define the forward method
    def forward(self, x):
        # Create assertion to check that inputs are the correct shape
        image_resolution = x.shape[-1]
        assert image_resolution % self.patch_size == 0, f"Input image size must be divisible by patch size, image shape: {image_resolution}, patch size: {self.patch_size}"

        # Perform the forward pass
        x_patched = self.patcher(x)
        x_flattened = self.flatten(x_patched)

        # 6. Make sure the output shape has the right order
        return x_flattened.permute(0, 2, 1)  # adjust so the embedding is on the final dimension [batch_size, P^2•C, N] -> [batch_size, N, P^2•C]
```

The CNN gives us a feature map which we flatten, but it seems that we are missing the projection. Maybe I missed something, and if so, I'd appreciate it if someone could clarify this for me :)
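For context, here is a minimal sanity check of the shapes (my own sketch, assuming a standard 224x224 RGB input rather than anything from the course):

```python
import torch

# Minimal sanity check: pass a fake 224x224 RGB image
# through the PatchEmbedding layer defined above.
patchify = PatchEmbedding(in_channels=3, patch_size=16, embedding_dim=768)
dummy_image = torch.randn(1, 3, 224, 224)  # [batch_size, color_channels, height, width]
patch_embeddings = patchify(dummy_image)
print(patch_embeddings.shape)  # torch.Size([1, 196, 768]) -> [batch_size, N, embedding_dim], N = (224/16)^2
```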
---
Hi @ivan-rivera,

Thank you for the kind words, I'm glad to hear you're enjoying the course :)

As for the projection, if I understand your question correctly, it would be possible with an `nn.Linear` layer. Or the `nn.Conv2d` layer can perform it for us. This is because we take: image -> `nn.Conv2d` layer (this extracts features, aka the embedding) -> flatten into a single embedding vector (this is our patch embedding). We used `nn.Conv2d` because, with `kernel_size=patch_size` and `stride=patch_size`, it splits the image into patches and linearly projects each patch in a single operation.

Is this what you meant by extracting a projection? Or did you mean something else?
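To make that concrete, here's a minimal sketch (my own, not from the course materials) showing that a `nn.Conv2d` with `kernel_size=stride=patch_size` computes the same thing as explicitly splitting the image into patches and applying a `nn.Linear` projection, once the two layers share weights:

```python
import torch
from torch import nn

torch.manual_seed(42)
patch_size, in_channels, embedding_dim = 16, 3, 768
image = torch.randn(1, in_channels, 224, 224)

# Path 1: the Conv2d "patcher" (patch split + linear projection in one op)
conv = nn.Conv2d(in_channels, embedding_dim,
                 kernel_size=patch_size, stride=patch_size)
out_conv = conv(image).flatten(start_dim=2).permute(0, 2, 1)  # [1, 196, 768]

# Path 2: explicitly split into patches, then apply a Linear projection,
# reusing the conv's weights so both paths compute the same function
linear = nn.Linear(in_channels * patch_size**2, embedding_dim)
linear.weight.data = conv.weight.data.reshape(embedding_dim, -1)
linear.bias.data = conv.bias.data

patches = nn.functional.unfold(image, kernel_size=patch_size, stride=patch_size)
patches = patches.permute(0, 2, 1)  # [1, 196, 3*16*16]
out_linear = linear(patches)        # [1, 196, 768]

print(torch.allclose(out_conv, out_linear, atol=1e-5))  # True
```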
---
Hi @ivan-rivera,
Good questions! And you're definitely making sense!
To answer in short, the feature map from the CNN is the embedding layer.
This may be a bit confusing due to the demo in the materials showcasing a feature map of a piece of pizza (I think this was the example).
And the feature map of that specific image showcases certain features of that particular image.
However, the important concept is that the feature map (the embedding) is learned during training.
So although at the beginning, it may represent a specific sample, over time, it will be adjusted to (hopefully) represent the training data (in a generalized fashion).
In a CNN, a feature map is one form of projection of the input into the network's learned feature space.
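If it helps, here's a quick sketch (my own, not from the course) showing that the patch embedding's conv weights are ordinary learnable parameters, so a gradient step changes the embedding:

```python
import torch
from torch import nn

# The patch embedding's conv weights are nn.Parameters, so training
# updates the "feature map" (the embedding) it produces.
patcher = nn.Conv2d(3, 768, kernel_size=16, stride=16)
print(patcher.weight.requires_grad)  # True -> the embedding is learnable

weights_before = patcher.weight.clone()
optimizer = torch.optim.SGD(patcher.parameters(), lr=0.1)

loss = patcher(torch.randn(1, 3, 224, 224)).mean()  # dummy loss to drive one update
loss.backward()
optimizer.step()

print(torch.equal(weights_before, patcher.weight))  # False -> the embedding changed
```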