[Chapter 8] Paper Replicating, 244. Creating the Patch Embedding Layer with PyTorch - are we not missing a step? #485
---
Hey, I'd like to start off by thanking you for this amazing course, it's been extremely useful and engaging :) I'm having some trouble understanding the paper replication section. I understand that we are trying to reproduce the Hybrid approach proposed in the paper. In this section, the authors say:

> In this hybrid model, the patch embedding projection E (Eq. 1) is applied to patches extracted from a CNN feature map.
However, in our implementation we have this:

```python
from torch import nn

# 1. Create a class which subclasses nn.Module
class PatchEmbedding(nn.Module):
    """Turns a 2D input image into a 1D sequence learnable embedding vector.

    Args:
        in_channels (int): Number of color channels for the input images. Defaults to 3.
        patch_size (int): Size of patches to convert input image into. Defaults to 16.
        embedding_dim (int): Size of embedding to turn image into. Defaults to 768.
    """
    # 2. Initialize the class with appropriate variables
    def __init__(self,
                 in_channels: int = 3,
                 patch_size: int = 16,
                 embedding_dim: int = 768):
        super().__init__()
        self.patch_size = patch_size  # store for the assertion in forward()

        # 3. Create a layer to turn an image into patches
        self.patcher = nn.Conv2d(in_channels=in_channels,
                                 out_channels=embedding_dim,
                                 kernel_size=patch_size,
                                 stride=patch_size,
                                 padding=0)

        # 4. Create a layer to flatten the patch feature maps into a single dimension
        self.flatten = nn.Flatten(start_dim=2,  # only flatten the feature map dimensions into a single vector
                                  end_dim=3)

    # 5. Define the forward method
    def forward(self, x):
        # Create assertion to check that inputs are the correct shape
        image_resolution = x.shape[-1]
        assert image_resolution % self.patch_size == 0, f"Input image size must be divisible by patch size, image shape: {image_resolution}, patch size: {self.patch_size}"

        # Perform the forward pass
        x_patched = self.patcher(x)
        x_flattened = self.flatten(x_patched)

        # 6. Make sure the output shape has the right order
        return x_flattened.permute(0, 2, 1)  # adjust so the embedding is on the final dimension [batch_size, P^2•C, N] -> [batch_size, N, P^2•C]
```

The CNN gives us a feature map which we flatten, but it seems that we are missing the projection. Maybe I missed something, and if so, I'd appreciate it if someone could clarify this for me :)
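For context, here is a minimal sanity check of the shapes (my own sketch, assuming a standard 224x224 RGB input rather than anything from the course):

```python
import torch

# Minimal sanity check: pass a fake 224x224 RGB image
# through the PatchEmbedding layer defined above.
patchify = PatchEmbedding(in_channels=3, patch_size=16, embedding_dim=768)
dummy_image = torch.randn(1, 3, 224, 224)  # [batch_size, color_channels, height, width]
patch_embeddings = patchify(dummy_image)
print(patch_embeddings.shape)  # torch.Size([1, 196, 768]) -> [batch_size, N, embedding_dim], N = (224/16)^2
```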
---
Hi @ivan-rivera,

Thank you for the kind words, I'm glad to hear you're enjoying the course :)

As for the projection, if I understand your question correctly, it would be possible with an `nn.Linear` layer. Or the `nn.Conv2d` layer can perform it for us. This is because we take: image -> `nn.Conv2d` layer (this extracts features, aka the embedding) -> flatten into a single embedding vector (this is our patch embedding). We used `nn.Conv2d` because, with `kernel_size=patch_size` and `stride=patch_size`, it splits the image into patches and linearly projects each patch in a single operation.

Is this what you meant by extracting a projection? Or did you mean something else?
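To make that concrete, here's a minimal sketch (my own, not from the course materials) showing that a `nn.Conv2d` with `kernel_size=stride=patch_size` computes the same thing as explicitly splitting the image into patches and applying a `nn.Linear` projection, once the two layers share weights:

```python
import torch
from torch import nn

torch.manual_seed(42)
patch_size, in_channels, embedding_dim = 16, 3, 768
image = torch.randn(1, in_channels, 224, 224)

# Path 1: the Conv2d "patcher" (patch split + linear projection in one op)
conv = nn.Conv2d(in_channels, embedding_dim,
                 kernel_size=patch_size, stride=patch_size)
out_conv = conv(image).flatten(start_dim=2).permute(0, 2, 1)  # [1, 196, 768]

# Path 2: explicitly split into patches, then apply a Linear projection,
# reusing the conv's weights so both paths compute the same function
linear = nn.Linear(in_channels * patch_size**2, embedding_dim)
linear.weight.data = conv.weight.data.reshape(embedding_dim, -1)
linear.bias.data = conv.bias.data

patches = nn.functional.unfold(image, kernel_size=patch_size, stride=patch_size)
patches = patches.permute(0, 2, 1)  # [1, 196, 3*16*16]
out_linear = linear(patches)        # [1, 196, 768]

print(torch.allclose(out_conv, out_linear, atol=1e-5))  # True
```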
---
Hi @ivan-rivera,
Good questions! And you're definitely making sense!
To answer in short, the feature map from the CNN is the embedding layer.
This may be a bit confusing due to the demo in the materials showcasing a feature map of a piece of pizza (I think this was the example).
And the feature map of that specific image showcases certain features of that particular image.
However, the important concept is that the feature map (the embedding) is learned during training.
So although at the beginning, it may represent a specific sample, over time, it will be adjusted to (hopefully) represent the training data (in a generalized fashion).
In a CNN, a feature map is one form of projection of the input into the network's learned feature space.
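If it helps, here's a quick sketch (my own, not from the course) showing that the patch embedding's conv weights are ordinary learnable parameters, so a gradient step changes the embedding:

```python
import torch
from torch import nn

# The patch embedding's conv weights are nn.Parameters, so training
# updates the "feature map" (the embedding) it produces.
patcher = nn.Conv2d(3, 768, kernel_size=16, stride=16)
print(patcher.weight.requires_grad)  # True -> the embedding is learnable

weights_before = patcher.weight.clone()
optimizer = torch.optim.SGD(patcher.parameters(), lr=0.1)

loss = patcher(torch.randn(1, 3, 224, 224)).mean()  # dummy loss to drive one update
loss.backward()
optimizer.step()

print(torch.equal(weights_before, patcher.weight))  # False -> the embedding changed
```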