Hi there,

I am interested in comparing the performance of ViT-L/14@336px and MLCD-ViT-bigG-14-448px, but while the former gives me a single embedding vector, the latter gives me an array - I guess with values for each patch? What would your recommendation be to condense the embeddings of the big patch model to a single vector - just averaging across one of the dimensions? (e.g., either N=1025 or N=1664?)

Along the lines of trying your image encoders with higher resolutions: I was wondering whether the ViT-L/14@518px model that you refer to in the paper is available somewhere?

Thanks,
Moritz

Replies: 2 comments

-
Regarding the MLCD-ViT-bigG-14-448px model, you should simply use the first (CLS) token as your single-vector representation. This token is specifically designed to capture the global information of the entire image and is the most straightforward way to represent it. As for the ViT-L/14@518px model you inquired about, it was only used at increased resolution during fine-tuning on ImageNet and doesn't have good general-purpose performance. Given its limited application scope (mainly specific experiments), we don't plan to release it publicly. Please let me know if you have any other questions about using these models.
-
hey, @anxiangsir, many thanks for your feedback - the first vector in the array does behave as expected, i.e., it gives an all-image summary