RotAttentionPool2d
Performance Discrepancy and Comparison with naver-ai/rope-vit
#2528
-
A note in the class documentation for RotAttentionPool2d, added roughly 4 years ago (pytorch-image-models/timm/layers/attention_pool2d.py, lines 29 to 30 in 7101adb), suggests a significant performance degradation in downstream tasks at different resolutions when using it. However, from my understanding, the implementation here appears to be similar to what is done in naver-ai/rope-vit. Any insights or clarification on this matter would be greatly appreciated. Thank you!

Or is my understanding incorrect, and are the implementations of RotAttentionPool2d here and the RoPE in naver-ai/rope-vit fundamentally different in a way that would explain this discrepancy?
-
@ryan-minato I haven't looked too closely at the naver impl; there are often subtle differences in implementations of ROPE, though they are usually equivalent. It might be possible to port those vits to timm using an existing vit as a base, or to make a new model if it's sufficiently different.

The comment there was specific to the ROPE attention pool. I tried it once as a replacement for a standard attention pool with a ResNet model or similar and it didn't generalize well to other resolutions. I think this might have been before I added resolution scaling support to ROPE though; it was some time ago.

However, the ROPE embedding impl (pytorch-image-models/timm/layers/pos_embed_sincos.py, line 289 in 7101adb) does work well in a ViT model. Most (all?) of the ROPE ViTs in timm are in the EVA ViT, as that was the first model to use ROPE, and I've based a number of other (non-EVA) models on it since, including the Meta Perception Encoder ViTs.
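For reference, most axial 2D RoPE variants (including, as far as I can tell, both the timm and naver-ai/rope-vit versions) reduce to the same underlying rotation: half the per-head channels rotate with the y coordinate, half with x, applied to q and k before attention. A minimal sketch of that common form; all names and the helper structure here are my own illustration, not code from either repo:

```python
import torch


def build_axial_rope(dim: int, height: int, width: int, temperature: float = 10000.0):
    """Sin/cos tables for axial 2D RoPE on an (height x width) patch grid.

    `dim` is the per-head channel dim and must be divisible by 4:
    half the channel pairs rotate with y, half with x.
    """
    assert dim % 4 == 0
    num_bands = dim // 4
    freqs = 1.0 / (temperature ** (torch.arange(num_bands).float() / num_bands))
    fy = torch.outer(torch.arange(height).float(), freqs)  # (H, dim//4)
    fx = torch.outer(torch.arange(width).float(), freqs)   # (W, dim//4)
    # Broadcast both axes over the full grid: y-bands then x-bands per position.
    fy = fy[:, None, :].expand(height, width, num_bands)
    fx = fx[None, :, :].expand(height, width, num_bands)
    ang = torch.cat([fy, fx], dim=-1).reshape(height * width, dim // 2)
    # Duplicate each angle so it lines up with (even, odd) channel pairs.
    ang = ang.repeat_interleave(2, dim=-1)  # (H*W, dim)
    return ang.sin(), ang.cos()


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Map each channel pair (a, b) to (-b, a), interleaved layout.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([-x2, x1], dim=-1).reshape(x.shape)


def apply_rope(x: torch.Tensor, sin: torch.Tensor, cos: torch.Tensor) -> torch.Tensor:
    # x: (..., seq, dim); sin/cos: (seq, dim). Pure rotation, norm-preserving.
    return x * cos + rotate_half(x) * sin
```

Implementations mostly differ in bookkeeping (interleaved vs. split-half channel layout, whether tables are cached, how a class token is excluded), which is why weights often port across despite looking different.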
-
@ryan-minato FWIW the axial rope-vits loaded pretty much as-is into my extended EVA ViT impl. I just merged support for the 'mixed mode' ROPE and added the pretrained models to the library, then ran eval at 224 and upscaled to 320:
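On the resolution point mentioned above: one common way to let RoPE generalize when evaluating on a larger grid (e.g. 320 after training at 224) is to rescale the patch coordinates so the rotation angles stay within the range seen during training. This is only a hypothetical illustration of that idea, not the actual timm scaling code:

```python
import torch


def scaled_rope_coords(grid_size: int, ref_grid_size: int) -> torch.Tensor:
    """Patch coordinates for RoPE, rescaled to the training grid range.

    Hypothetical sketch: at a larger eval resolution the coordinates are
    squeezed back into [0, ref_grid_size), so the RoPE angles derived from
    them never exceed those the model was trained with.
    """
    coords = torch.arange(grid_size, dtype=torch.float32)
    return coords * (ref_grid_size / grid_size)
```

For example, with 16px patches, 224px training gives a 14x14 grid; evaluating at 320px gives a 20x20 grid, and passing ref_grid_size=14 keeps every coordinate (and hence every angle) inside the trained range.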