- (2025.12) Stabilizing Reinforcement Learning with LLMs: Formulation and Practices, [Zheng+]
- (2025.10) Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers, [Ma+]
- (2025.08) Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning, [Shrivastava+]
- (2025.08) Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning, [Liu+]
- (2025.07) Group Sequence Policy Optimization, [Zheng+]
- (2025.06) Truncated Proximal Policy Optimization, [Fan+]
- (2025.04) VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks, [Yue+]
- (2025.03) Understanding R1-Zero-Like Training: A Critical Perspective, [Liu+, COLM 2025]
- (2025.03) DAPO: An Open-Source LLM Reinforcement Learning System at Scale, [Yu+, NeurIPS 2025]
- (2024.02) DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, [Shao+]
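The group above traces the main objective-design thread: DeepSeekMath introduced GRPO's group-relative baseline, and most of the later entries (DAPO, GSPO, GFPO, the Dr. GRPO fix from the R1-Zero critique) are local edits to it. For orientation, a minimal sketch of the group-normalized advantage and clipped surrogate, in illustrative Python; the function name, shapes, and normalization choices are assumptions, not any paper's reference code.

```python
import torch

def grpo_surrogate(logp_new, logp_old, rewards, mask, eps=0.2):
    """Illustrative GRPO-style loss for one prompt's group of samples.

    logp_new, logp_old: (G, T) per-token log-probs under the current
        and sampling policies for G completions of the same prompt.
    rewards: (G,) scalar reward per completion.
    mask: (G, T) 1 on response tokens, 0 on padding.
    """
    # Group-relative advantage: normalize rewards within the group,
    # replacing a learned value network with a Monte Carlo baseline.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(-1)                      # broadcast over tokens

    ratio = torch.exp(logp_new - logp_old)       # per-token importance ratio
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)

    # Token-level averaging; token- vs. sequence-level normalization is
    # one of the knobs Dr. GRPO and DAPO analyze for length bias.
    return -(surrogate * mask).sum() / mask.sum()
```

Most variants are recognizable from here: DAPO decouples the clip bounds (clip-higher) and drops groups whose rewards are all identical, GSPO swaps the per-token ratio for a single sequence-level one, and GFPO samples larger groups but trains only on a filtered subset to keep responses concise.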
- (2025.10) The Art of Scaling Reinforcement Learning Compute for LLMs, [Khatri+]
- (2025.06) AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy, [Liu+]
- (2025.05) ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models, [Liu+, NeurIPS 2025]
- (2025.05) AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning, [Chen+, NeurIPS 2025]
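Common to this scaling line, ProRL especially, is keeping long runs stable with a KL penalty against a reference policy that is periodically reset to a recent checkpoint. Schematically, assuming the standard KL-regularized objective (exact estimators and reset schedules differ per paper):

```math
J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\big)
```

Resetting $\pi_{\mathrm{ref}}$ lets the penalty constrain drift from a recent policy rather than pinning the model to its starting point, which is part of what ProRL credits for stable prolonged training.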
- (2025.09) PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation, [Piché+]
- (2025.05) AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning, [Fu+, NeurIPS 2025]
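These two systems decouple generation from training: PipelineRL streams fresh weights into in-flight generations, while AReaL accepts asynchronously generated trajectories but bounds and corrects for their staleness. A toy sketch of the staleness-bounding half of that idea; the class, field names, and bound are illustrative assumptions, not either system's API.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    tokens: list[int]     # generated token ids
    policy_version: int   # weight version that produced the sample

def admit(batch: list[Trajectory], current_version: int, max_lag: int = 4):
    """Keep trajectories whose generating policy is recent enough.

    A staleness bound alone is not the whole story: AReaL pairs it with
    an off-policy correction (a decoupled PPO objective) rather than
    relying on filtering by itself.
    """
    return [t for t in batch if current_version - t.policy_version <= max_lag]
```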
- (2025.12) MiMo-V2-Flash Technical Report, [LLM-Core Xiaomi]
- (2025.11) Olmo 3, [Olmo Team]
- (2025.10) Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model, [Ling Team]
- (2025.09) DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention, [DeepSeek-AI]
- (2025.09) CWM: An Open-Weights LLM for Research on Code Generation with World Models, [Meta FAIR CodeGen Team]
- (2025.09) LongCat-Flash-Thinking Technical Report, [Meituan LongCat Team]
- (2025.08) GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models, [GLM-4.5 Team]
- (2025.07) Kimi K2: Open Agentic Intelligence, [Kimi Team]
- (2025.06) Hunyuan-A13B Technical Report, [Tencent Hunyuan Team]
- (2025.06) MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention, [MiniMax]
- (2025.06) Magistral, [Mistral-AI]
- (2025.05) Skywork Open Reasoner 1 Technical Report, [Skywork AI]
- (2025.05) Qwen3 Technical Report, [Qwen Team]
- (2025.05) Llama-Nemotron: Efficient Reasoning Models, [NVIDIA]
- (2025.04) Phi-4-reasoning Technical Report, [Microsoft]
- (2025.04) Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning, [ByteDance Seed]
- (2025.01) DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, [DeepSeek-AI]
- (2025.01) Kimi k1.5: Scaling Reinforcement Learning with LLMs, [Kimi Team]
- (2025.11) No More Train-Inference Mismatch: Bitwise Consistent On-Policy Reinforcement Learning with vLLM and TorchTitan, [vLLM and TorchTitan Teams]
- (2025.09) Towards Deterministic Inference in SGLang and Reproducible RL Training, [The SGLang Team]
- (2025.09) Defeating Nondeterminism in LLM Inference, [Horace He and Thinking Machines Lab]
- (2025.09) When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch, [Liu+]
- (2025.08) Your Efficient RL Framework Secretly Brings You Off-Policy RL Training, [Yao+] (truncated importance sampling; sketched after this list)
- (2025.08) ProRL V2 - Prolonged Training Validates RL Scaling Laws, [Hu+]
- (2025.04) DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level, [Agentica x Together AI]
- (2025.02) DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL, [Luo+]
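A recurring remedy in the mismatch entries above (Yao+ and Liu+ in particular), short of the bitwise-deterministic stacks described by the vLLM/TorchTitan and SGLang posts, is to treat the inference engine as a behavior policy and reweight the update with a truncated importance ratio. A minimal sketch, assuming per-token log-probs are available from both engines; the function name and cap value are illustrative.

```python
import torch

def truncated_is_weight(logp_train, logp_infer, cap=2.0):
    """Truncated importance-sampling weight for the train-inference gap.

    logp_train: per-token log-probs recomputed by the training engine.
    logp_infer: per-token log-probs from the inference engine
        (e.g. vLLM or SGLang) that actually sampled the tokens.
    cap: truncation constant (a hyperparameter; 2.0 is illustrative).
    """
    ratio = torch.exp(logp_train - logp_infer)
    # Capping keeps a few badly mismatched tokens from dominating the
    # gradient; the weight scales the loss and is not backpropagated.
    return torch.clamp(ratio, max=cap).detach()
```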