Recent advances in using reinforcement learning to enhance LLM reasoning have yielded remarkably promising results, exemplified by DeepSeek-R1, Kimi k1.5, OpenAI o3-mini, and Grok 3. These exciting achievements herald the ascendance of Large Reasoning Models and mark another step along the thorny path toward Artificial General Intelligence (AGI). The study of LLM reasoning has garnered significant attention within the community, and researchers have concurrently curated awesome collections of RL-based LLM reasoning work. Meanwhile, we have observed that much excellent work is also emerging in the domain of Multimodal Large Language Models (MLLMs), spanning both multimodal understanding and autoregressive text-to-image generation.
- [2503] [TimeZero] TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM (RUC) Model 🤗 Code 💻
- [2503] [Skywork R1V] Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought (Skywork AI) Model 🤗 Code 💻
- [2503] [R1-AQA] Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering (Xiaomi) Model 🤗 Code 💻
- [2503] [LMM-R1] LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL (SEU) Code 💻
- [2503] [VisualThinker-R1-Zero] R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model (UCLA) Code 💻
- [2503] [R1-Omni] R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning (Alibaba) Model 🤗 Code 💻
- [2503] [Vision-R1] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models (ECNU) Code 💻
- [2503] [Seg-Zero] Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement (CUHK) Model 🤗 Dataset 🤗 Code 💻
- [2503] [Audio-Reasoner] Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models (NTU) Project 🌐 Model 🤗 Code 💻
- [2503] [MM-Eureka] MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning (Shanghai AI Laboratory) Models 🤗 Dataset 🤗 Code 💻
- [2503] [Visual-RFT] Visual-RFT: Visual Reinforcement Fine-Tuning (SJTU) Project 🌐 Datasets 🤗 Code 💻
- [2502] [MedVLM-R1] MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning (TUM)
- [2501] [Kimi k1.5] Kimi k1.5: Scaling Reinforcement Learning with LLMs (MoonshotAI) Project 🌐
- [2501] [Mulberry] Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search (THU) Model 🤗 Code 💻
- [2501] [Virgo] Virgo: A Preliminary Exploration on Reproducing o1-like MLLM (RUC) Model 🤗 Code 💻
- [2501] [Text-to-image CoT] Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step (CUHK) Project 🌐 Model 🤗 Code 💻
- [2501] [LlamaV-o1] LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs (MBZUAI) Project 🌐 Model 🤗 Code 💻
- [2411] [InternVL2-MPO] Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization (Shanghai AI Laboratory) Project 🌐 Model 🤗 Code 💻
- [2411] [Insight-V] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models (NTU) Model 🤗 Code 💻
- [2411] [LLaVA-CoT] LLaVA-CoT: Let Vision Language Models Reason Step-by-Step (PKU) Project 🌐 Model 🤗 Demo 🤗 Code 💻
- [2502] [MM-IQ] MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models (Tencent) Project 🌐 Dataset 🤗 Code 💻
- [2502] [MME-CoT] MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency (CUHK) Project 🌐 Dataset 🤗 Code 💻
- [2502] [ZeroBench] ZeroBench: An Impossible* Visual Benchmark for Contemporary Large Multimodal Models (Cambridge) Project 🌐 Dataset 🤗 Code 💻
- EasyR1 💻 (An Efficient, Scalable, Multi-Modality RL Training Framework)
- R1-Multimodal-Journey 💻 (Latest progress at MM-Eureka)
- VisualThinker-R1-Zero 💻 Report 📝 (Aha Moment on a 2B non-SFT Model)
- MetaSpatial Code 💻 Dataset 🤗 (3D Spatial Reasoning)