Get the test set of image-caption pairs (IIITD-20K dataset).
Compute the image and text embeddings with my fine-tuned CLIP.
Compute the cosine distance between each caption and all images.
For each caption, take the k closest images; if the corresponding image is among them, add 1 to the score.
Get the recall by dividing the score by the size of the test dataset.
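The steps above can be sketched roughly as follows. This is only a minimal illustration in plain Python, not the actual evaluation code: the names `l2_normalize` and `recall_at_k` and the toy embeddings are hypothetical, and it assumes that image i is the single ground-truth match for caption i and that no embedding is the zero vector.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def recall_at_k(text_embs, image_embs, k):
    """Fraction of captions whose paired image (same index) ranks in the top k."""
    texts = [l2_normalize(t) for t in text_embs]
    images = [l2_normalize(v) for v in image_embs]
    hits = 0
    for i, t in enumerate(texts):
        # Cosine similarity of caption i to every image.
        # Smallest cosine distance is the same as largest cosine similarity.
        sims = [sum(a * b for a, b in zip(t, v)) for v in images]
        # Indices of the k most similar images, best first.
        top_k = sorted(range(len(images)), key=lambda j: sims[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(texts)

# Toy data (hypothetical): caption i is paired with image i.
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_embs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.7]]
r1 = recall_at_k(text_embs, image_embs, k=1)
r2 = recall_at_k(text_embs, image_embs, k=2)
```

With real CLIP outputs the same logic would normally be vectorized as one similarity matrix, but the ranking and the index-matching assumption are the parts worth checking first.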
This is my recall at k. I obtain an R@1 of 17%, while most papers that fine-tune CLIP report at least 60% recall at 1. Any idea what I could be doing wrong?