Recent advances in using reinforcement learning to enhance LLM reasoning have yielded remarkably promising results, exemplified by DeepSeek-R1, Kimi k1.5, OpenAI o3-mini, and Grok 3. These exciting achievements herald the ascendance of Large Reasoning Models and mark another step along the thorny path toward Artificial General Intelligence (AGI). The study of LLM reasoning has garnered significant attention within the community, and researchers have concurrently curated awesome collections of RL-based LLM reasoning work. Meanwhile, we have observed that much excellent work is also emerging in the domain of Multimodal Large Language Models (MLLMs), spanning both multimodal understanding and autoregressive text-to-image generation.
- [2503] [TimeZero] TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM (RUC) Model 🤗 Code 💻
- [2503] [Skywork R1V] Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought (Skywork AI) Model 🤗 Code 💻
- [2503] [R1-AQA] Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering (Xiaomi) Model 🤗 Code 💻
- [2503] [LMM-R1] LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL (SEU) Code 💻
- [2503] [VisualThinker-R1-Zero] R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model (UCLA) Code 💻
- [2503] [R1-Omni] R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning (Alibaba) Model 🤗 Code 💻
- [2503] [Vision-R1] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models (ECNU) Code 💻
- [2503] [Seg-Zero] Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement (CUHK) Model 🤗 Dataset 🤗 Code 💻
- [2503] [Audio-Reasoner] Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models (NTU) Project 🌐 Model 🤗 Code 💻
- [2503] [MM-Eureka] MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning (Shanghai AI Laboratory) Models 🤗 Dataset 🤗 Code 💻
- [2503] [Visual-RFT] Visual-RFT: Visual Reinforcement Fine-Tuning (SJTU) Project 🌐 Datasets 🤗 Code 💻
- [2502] [MedVLM-R1] MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning (TUM)
- [2501] [Kimi k1.5] Kimi k1.5: Scaling Reinforcement Learning with LLMs (MoonshotAI) Project 🌐
- [2501] [Mulberry] Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search (THU) Model 🤗 Code 💻
- [2501] [Virgo] Virgo: A Preliminary Exploration on Reproducing o1-like MLLM (RUC) Model 🤗 Code 💻
- [2501] [Text-to-image CoT] Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step (CUHK) Project 🌐 Model 🤗 Code 💻
- [2501] [LlamaV-o1] LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs (MBZUAI) Project 🌐 Model 🤗 Code 💻
- [2411] [InternVL2-MPO] Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization (Shanghai AI Laboratory) Project 🌐 Model 🤗 Code 💻
- [2411] [Insight-V] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models (NTU) Model 🤗 Code 💻
- [2411] [LLaVA-CoT] LLaVA-CoT: Let Vision Language Models Reason Step-by-Step (PKU) Project 🌐 Model 🤗 Demo 🤗 Code 💻
- [2502] [MM-IQ] MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models (Tencent) Project 🌐 Dataset 🤗 Code 💻
- [2502] [MME-CoT] MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency (CUHK) Project 🌐 Dataset 🤗 Code 💻
- [2502] [ZeroBench] ZeroBench: An Impossible* Visual Benchmark for Contemporary Large Multimodal Models (Cambridge) Project 🌐 Dataset 🤗 Code 💻
- EasyR1 💻 (An Efficient, Scalable, Multi-Modality RL Training Framework)
- R1-Multimodal-Journey 💻 (Latest progress at MM-Eureka)
- VisualThinker-R1-Zero 💻 Report 📝 (Aha Moment on a 2B non-SFT Model)
- MetaSpatial Code 💻 Dataset 🤗 (3D Spatial Reasoning)