@arthur-75
Contributor

No description provided.

Raphael Sourty and others added 3 commits April 18, 2024 00:41
I have implemented XTR from Google's paper "Rethinking the Role of Token Retrieval in Multi-Vector Retrieval". We still need to optimize the code and add missing similarity imputation. Please let me know if you have any questions.
@arthur-75 changed the title from "Patch 2" to "Adding XTR from Rethinking the Role of Token Retrieval in Multi-Vector Retrieval" May 4, 2024
@raphaelsty
Owner

Thank you @arthur-75 for this MR. I think the best approach would be to add an index directory containing a file annoy.py.

The class would be Annoy(), with the parameters needed to create the vector database: https://github.com/spotify/annoy

The Annoy index would have an add() method which takes the documents_embeddings parameter as input and uploads those embeddings to the index.

It would then have a __call__ method which takes as input queries_embeddings: dict[str, torch.Tensor], k: int = 100, batch_size: int = 32, and retrieves, in batches, the top k document embeddings for each set of query embeddings.
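The add() / __call__ shape described above could look roughly like the sketch below. It is only an illustration: ExhaustiveIndex is a hypothetical stand-in that scans every stored token vector in pure Python, and it uses nested lists instead of torch tensors so that it runs without dependencies. In the real annoy.py, the scan would presumably be replaced by Annoy's AnnoyIndex.add_item, build and get_nns_by_vector calls.

```python
import heapq
import math


def _normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length so that a dot product is a cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


class ExhaustiveIndex:
    """Hypothetical stand-in for the proposed Annoy() class (exhaustive search, no trees)."""

    def __init__(self) -> None:
        self.vectors: list[list[float]] = []  # one entry per document token
        self.owners: list[str] = []  # id of the document that owns each token

    def add(self, documents_embeddings: dict[str, list[list[float]]]) -> "ExhaustiveIndex":
        """Store every token embedding together with the id of its document."""
        for document_id, tokens in documents_embeddings.items():
            for token in tokens:
                self.vectors.append(_normalize(token))
                self.owners.append(document_id)
        return self

    def __call__(
        self,
        queries_embeddings: dict[str, list[list[float]]],
        k: int = 100,
        batch_size: int = 32,
    ) -> dict[str, list[list[tuple[str, float]]]]:
        """For each query, return one top-k list of (document_id, similarity) per query token."""
        results = {}
        query_ids = list(queries_embeddings)
        # Batching is a no-op for a Python scan; it is kept to mirror the proposed signature.
        for start in range(0, len(query_ids), batch_size):
            for query_id in query_ids[start : start + batch_size]:
                results[query_id] = [
                    self._search(_normalize(token), k)
                    for token in queries_embeddings[query_id]
                ]
        return results

    def _search(self, query_token: list[float], k: int) -> list[tuple[str, float]]:
        """Score the query token against all stored tokens and keep the k best."""
        scored = (
            (owner, sum(a * b for a, b in zip(query_token, vector)))
            for vector, owner in zip(self.vectors, self.owners)
        )
        return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

Returning the hits per query token (rather than per query) is a design assumption: it keeps the token-level similarities available for the XTR scoring step.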

Once we have the index class, we can create an XTR object which will take as input an index object such as Annoy, along with key, on, and model.

The XTR object will have an add method, which will simply call the add method of the index.

The XTR object should inherit from ColBERT retriever.

The __call__ method of XTR will query the index and then post-process the embeddings similarities in order to compute the XTR score.
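As a rough illustration of that post-processing step, here is a minimal sketch of one way to turn token-level hits into document scores. It reflects my reading of the XTR paper and is not the code from this MR: for each query token, keep the best similarity per retrieved document, fill in the tokens for which a document was not retrieved with that token's lowest retrieved similarity (missing similarity imputation), then average over the query tokens. The function name xtr_scores, the token_hits format, and the choice to score missing tokens as 0 when imputation is off are all assumptions.

```python
def xtr_scores(
    token_hits: list[list[tuple[str, float]]],
    impute_missing: bool = True,
) -> list[tuple[str, float]]:
    """Rank documents from per-query-token hits of (document_id, similarity)."""
    n_tokens = len(token_hits)

    # best[document_id][position] = highest similarity between that query token
    # and any retrieved token of the document.
    best: dict[str, dict[int, float]] = {}
    for position, hits in enumerate(token_hits):
        for document_id, similarity in hits:
            per_token = best.setdefault(document_id, {})
            per_token[position] = max(similarity, per_token.get(position, float("-inf")))

    # Fallback similarity per query token: the lowest retrieved similarity when
    # imputing, otherwise 0.0 (an assumption made for this sketch).
    floors = [
        min((similarity for _, similarity in hits), default=0.0) if impute_missing else 0.0
        for hits in token_hits
    ]

    scores = {
        document_id: sum(per_token.get(p, floors[p]) for p in range(n_tokens)) / n_tokens
        for document_id, per_token in best.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Two query tokens; document "a" is only retrieved by the first one.
hits = [[("a", 0.9), ("b", 0.5)], [("b", 0.8), ("b", 0.6)]]
ranked = xtr_scores(hits)
ranked_without_imputation = xtr_scores(hits, impute_missing=False)
```

In this toy input, imputation lets document "a" borrow the lowest retrieved similarity for the second query token instead of being scored 0 there, which changes which document ranks first.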

Also, you should set up ruff properly to format your code; it is really useful 👍

