Hi there,

I am interested in comparing the performance of ViT-L/14@336px and MLCD-ViT-bigG-14-448px, but while the former gives me a single embedding vector, the latter gives me an array - I guess with values for each patch? What would your recommendation be to condense the embeddings of the big patch model to a single vector - just averaging across one of the dimensions? (e.g., either N=1025 or N=1664?)

Along the lines of trying your image encoders with higher resolutions: I was wondering whether the ViT-L/14@518px model that you refer to in the paper is available somewhere?

Thanks,
Moritz

Replies: 2 comments

-
Regarding the MLCD-ViT-bigG-14-448px model, you should simply use the first (CLS) token as your single-vector representation. This token is specifically designed to capture the global information of the entire image and is the most straightforward way to represent it. As for the ViT-L/14@518px model you inquired about, it was only used at increased resolution during fine-tuning on ImageNet and doesn't have good general-purpose performance. Given its limited application scope (mainly specific experiments), we don't plan to release it publicly. Please let me know if you have any other questions about using these models.
-
hey, @anxiangsir, many thanks for your feedback - the first vector in the array does behave as expected, i.e., it gives an all-image summary