¹Johns Hopkins University
²DEVCOM Army Research Laboratory
Project Page / Paper / Huggingface Data Card 🤗 / Code
Official implementation of the CVPR 2025 (Highlight) paper:
Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
Spatial457 is a diagnostic benchmark designed to evaluate the 6D spatial reasoning capabilities of large multimodal models (LMMs). It systematically introduces four key capabilities—multi-object understanding, 2D and 3D localization, and 3D orientation—across five difficulty levels and seven question types, progressing from basic recognition to complex physical interaction.
You can access the full dataset and evaluation toolkit:
- Dataset: Hugging Face
- Code: GitHub Repository
- Paper: arXiv 2502.08636
🔥 Run the benchmark with VLMEvalKit.
Spatial457 is also supported by VLMEvalKit! You can use it for quick evaluation on most VLMs. Evaluation can be done by running run.py
in VLMEvalKit:
python run.py --data Spatial457 --model <model_name>
We use Blender to render the scenes, so you can also add custom objects to the dataset. You can also define your own question types / templates for your studies. The source code for dataset generation will be available soon.
See image_generation/README.md
The result will contain a folder of images and a JSON file with scene annotations.
Run the bash script to generate questions at all levels. Set input_scene_file
to the scene annotation JSON file.
bash scripts/generate_questions.sh
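As a minimal sketch of the pipeline above, the script below writes and reads back a scene annotation JSON of the kind the question generator consumes. The field names (`image_filename`, `objects`, `3d_coords`, `rotation`) are illustrative assumptions, not the official Spatial457 schema:

```python
import json
from pathlib import Path

# Illustrative scene annotation -- field names are assumptions,
# not the official Spatial457 schema.
scene = {
    "image_filename": "scene_000000.png",
    "objects": [
        {"name": "car", "3d_coords": [1.2, -0.5, 0.35], "rotation": 45.0},
        {"name": "bus", "3d_coords": [-0.8, 1.1, 0.40], "rotation": 180.0},
    ],
}

# Write the annotation file that input_scene_file would point to.
path = Path("sample_scene.json")
path.write_text(json.dumps(scene, indent=2))

# Load it back, as a question-generation script would.
loaded = json.loads(path.read_text())
names = [obj["name"] for obj in loaded["objects"]]
print(names)  # ['car', 'bus']
```

In the real pipeline, `input_scene_file` in scripts/generate_questions.sh would point at the annotation JSON produced by the rendering step rather than a hand-written file.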
@inproceedings{wang2025spatial457,
title = {Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models},
author = {Wang, Xingrui and Ma, Wufei and Zhang, Tiezheng and de Melo, Celso M and Chen, Jieneng and Yuille, Alan},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2025},
url = {https://arxiv.org/abs/2502.08636}
}
Content and toolkit are actively being updated. Stay tuned!