
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

Kuicai Dong* · Yujing Chang* · Derrick Xin Deik Goh* · Dexun Li · Ruiming Tang · Yong Liu

📖Paper | 🏠Homepage | 🤗Huggingface | 👉Github

Multimodal document retrieval aims to identify and retrieve various forms of multimodal content, such as figures, tables, charts, and layout information, from extensive documents. Despite its increasing popularity, there is a notable lack of a comprehensive and robust benchmark to effectively evaluate the performance of systems in such tasks. To address this gap, this work introduces a new benchmark, named MMDocIR, that encompasses two distinct tasks: page-level and layout-level retrieval. The former evaluates the performance of identifying the most relevant pages within a long document, while the latter assesses the ability to detect specific layouts, providing a more fine-grained measure than whole-page analysis. A layout refers to a variety of elements, including textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring 1,685 questions annotated by experts and 173,843 questions with bootstrapped labels, making it a valuable resource in multimodal document retrieval for both training and evaluation. Through rigorous experiments, we demonstrate that (i) visual retrievers significantly outperform their text counterparts, (ii) the MMDocIR training set effectively enhances the performance of multimodal document retrieval, and (iii) text retrievers leveraging VLM-text significantly outperform retrievers relying on OCR-text.


🔮Evaluation Dataset

1. Download Datasets

Download MMDocIR_pages.parquet and MMDocIR_layouts.parquet from huggingface: MMDocIR/MMDocIR_Evaluation_Dataset

Place the two parquet files under ./dataset/
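The files can also be fetched programmatically. This is only a minimal sketch using huggingface_hub, assuming both parquet files sit at the top level of the dataset repo:

from huggingface_hub import hf_hub_download

# Fetch both evaluation parquet files into ./dataset/
for filename in ["MMDocIR_pages.parquet", "MMDocIR_layouts.parquet"]:
    hf_hub_download(
        repo_id="MMDocIR/MMDocIR_Evaluation_Dataset",
        filename=filename,
        repo_type="dataset",
        local_dir="./dataset",
    )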

2. Download Retriever Checkpoints

Download the relevant retrievers (either text or visual retrievers) from huggingface: MMDocIR/MMDocIR_Retrievers.

For text retrievers: BGE, E5, GTE, Contriever, DPR, ColBERT

For visual retrievers: ColPali, ColQwen, DSE-docmatix, DSE-wikiss

Place these checkpoints under ./checkpoint/
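As a convenience, a checkpoint can be pulled with huggingface_hub as well; the sketch below assumes the BGE checkpoint lives in a BGE/ subfolder of the MMDocIR/MMDocIR_Retrievers repo, which may need adjusting:

from huggingface_hub import snapshot_download

# Pull one retriever (here BGE) into ./checkpoint/; the "BGE/*" pattern is an
# assumption about the repo's folder layout -- drop allow_patterns to fetch all.
snapshot_download(
    repo_id="MMDocIR/MMDocIR_Retrievers",
    allow_patterns=["BGE/*"],
    local_dir="./checkpoint",
)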

Environment

python 3.9
torch==2.4.0+cu121
transformers==4.45.0
sentence-transformers==2.2.2   # for BGE, GTE, E5 retrievers
colbert-ai==0.2.21             # for ColBERT retriever
flash-attn==2.7.4.post1        # for DSE retrievers with flash attention

3. Inference Command

You can run inference using the command:

python encode.py BGE --bs 256 --mode vlm_text --encode query,page,layout

model: the model name (for example, "BGE") is compulsory. All available models are ["BGE", "E5", "GTE", "Contriever", "DPR", "ColBERT", "ColPali", "ColQwen", "DSE-docmatix", "DSE-wikiss"]

--mode parameter (choices=['vlm_text', 'ocr_text', 'image_binary', 'image_hybrid'], default='vlm_text') controls whether pages or layouts are passed as vlm_text, ocr_text, image_binary, or image_hybrid.

--encode parameter (default="query,page,layout") controls what is encoded; by default all queries, pages, and layouts are encoded (a sketch of these arguments follows the list below).

  • You can select any subset of [query, page, layout] and use , to separate them.
  • For example: encode queries and pages via --encode query,page, or pages only via --encode page.
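The real argument handling lives in encode.py; the following is only a minimal sketch, assuming a plain argparse interface, of how the positional model name and the --bs, --mode, and --encode flags described above fit together:

import argparse

MODELS = ["BGE", "E5", "GTE", "Contriever", "DPR", "ColBERT",
          "ColPali", "ColQwen", "DSE-docmatix", "DSE-wikiss"]

parser = argparse.ArgumentParser()
parser.add_argument("model", choices=MODELS)                  # compulsory positional model name
parser.add_argument("--bs", type=int, default=256)            # batch size
parser.add_argument("--mode", default="vlm_text",
                    choices=["vlm_text", "ocr_text", "image_binary", "image_hybrid"])
parser.add_argument("--encode", default="query,page,layout")  # comma-separated targets
args = parser.parse_args()

# "query,page,layout" -> ["query", "page", "layout"]
targets = [t.strip() for t in args.encode.split(",")]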

4. Evaluation Command

You can evaluate using the command:

python search.py BGE --encode page,layout --encode_path encode

model: the model name (for example, "BGE") is compulsory. All available models are ["BGE", "E5", "GTE", "Contriever", "DPR", "ColBERT", "ColPali", "ColQwen", "DSE-docmatix", "DSE-wikiss"]

--encode parameter (default="page,layout") controls which levels are scored; by default top-k recall is computed for both page-level and layout-level retrieval.

  • You can obtain only page-level scores via --encode page, or only layout-level scores via --encode layout.

--encode_path parameter (default="encode") indicates the directory storing the query, page, and layout embeddings.

  • For example, to score BGE results, by default we look for 3 pickle files named (an illustrative scoring sketch follows this list):
  • ./encode/encoded_query_BGE.pkl
  • ./encode/encoded_page_BGE.pkl
  • ./encode/encoded_layout_BGE.pkl
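search.py implements the actual scoring. Purely as an illustration, the sketch below computes page-level top-k recall for a dense retriever from such pickle files, under the assumption (which may not match the real on-disk format) that the query file maps each query id to an embedding plus its gold page ids, and the page file maps page ids to embeddings:

import pickle
import numpy as np

# Assumed formats (may differ from what encode.py actually writes):
#   encoded_query_BGE.pkl : {query_id: (embedding, gold_page_ids)}
#   encoded_page_BGE.pkl  : {page_id: embedding}
with open("./encode/encoded_query_BGE.pkl", "rb") as f:
    queries = pickle.load(f)
with open("./encode/encoded_page_BGE.pkl", "rb") as f:
    pages = pickle.load(f)

page_ids = list(pages)
page_mat = np.stack([np.asarray(pages[p]) for p in page_ids])

def page_recall_at_k(k: int) -> float:
    hits = 0
    for q_emb, gold in queries.values():
        # Dot-product similarity; late-interaction models (ColBERT, ColPali, ColQwen)
        # score differently and would need their own scoring function.
        scores = page_mat @ np.asarray(q_emb)
        topk = {page_ids[i] for i in np.argsort(-scores)[:k]}
        hits += bool(topk & set(gold))
    return hits / len(queries)

print("page-level recall@5:", page_recall_at_k(5))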

🛠️Training Dataset

1. Download Datasets

Download all parquet and JSON Lines files from huggingface: MMDocIR/MMDocIR_Train_Dataset
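One possible way to grab the whole training repo with huggingface_hub, assuming it is hosted as a dataset repo (the local folder name below is only a placeholder):

from huggingface_hub import snapshot_download

# Download every parquet / JSON Lines file in the training dataset repo.
snapshot_download(
    repo_id="MMDocIR/MMDocIR_Train_Dataset",
    repo_type="dataset",
    local_dir="./train_dataset",   # placeholder local folder; adjust as needed
)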

2. Dataset Class

Refer to train_dataset.py
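train_dataset.py is the authoritative implementation. As a rough, self-contained sketch, a PyTorch Dataset over one of the JSON Lines training files might look like the following; the field names "question" and "positive" are assumptions, not the real schema:

import json
from torch.utils.data import Dataset

class MMDocIRTrainDataset(Dataset):
    """Minimal sketch: yields (question, positive passage) pairs from a JSONL file."""

    def __init__(self, jsonl_path: str):
        with open(jsonl_path, "r", encoding="utf-8") as f:
            self.examples = [json.loads(line) for line in f]

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, idx: int):
        ex = self.examples[idx]
        # "question" / "positive" are assumed field names, not the real schema
        return ex["question"], ex["positive"]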

3. Training Code

Coming soon

💾Citation

@misc{dong2025mmdocirbenchmarkingmultimodalretrieval,
      title={MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents}, 
      author={Kuicai Dong and Yujing Chang and Xin Deik Goh and Dexun Li and Ruiming Tang and Yong Liu},
      year={2025},
      eprint={2501.08828},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2501.08828}, 
}

📄 License

Usage and License Notices: The data and code are intended and licensed for research use only, under the Attribution-NonCommercial 4.0 International license. Usage should abide by the policy of OpenAI: https://openai.com/policies/terms-of-use