This repository is a curated list of survey papers for Long-context Data.
Long-context-modeling-space/
├── CCMT2024-Slides-CN.pdf
├── NLPCC2024-Long_context_model.pdf
├── README.md
├── README_tutorial.md
└── long-context-model-data-survey.pdf
You can click on a paper's title to jump directly to the corresponding PDF.
-
A Survey on Long Text Modeling with Transformers. Zican Dong, Tianyi Tang, Lunyi Li, Wayne Xin Zhao. Arxiv 2023.
-
Thus Spake Long-Context Large Language Model. Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu. Arxiv 2025.
-
A Comprehensive Survey on Long Context Language Modeling. Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, Yuanxing Zhang, Zhuo Chen, Hangyu Guo, Shilong Li, Ziqiang Liu, Yong Shan, Yifan Song, Jiayi Tian, Wenhao Wu, Zhejian Zhou, Ruijie Zhu, Junlan Feng, Yang Gao, Shizhu He, Zhoujun Li, Tianyu Liu, Fanyu Meng, Wenbo Su, Yingshui Tan, Zili Wang, Jian Yang, Wei Ye, Bo Zheng, Wangchunshu Zhou, Wenhao Huang, Sujian Li, Zhaoxiang Zhang. Arxiv 2025.
-
Scaling Language Models: Methods, Analysis & Insights from Training Gopher. Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d'Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, Geoffrey Irving. Arxiv 2021.
-
In-context Pretraining: Language Modeling Beyond Document Boundaries. Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Gergely Szilvasy, Rich James, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Yih, Mike Lewis. Arxiv 2023.
-
Effective Long-Context Scaling of Foundation Models. Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, Hao Ma. Arxiv 2023.
-
SemDeDup: Data-efficient learning at web-scale through semantic deduplication. Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, Ari S. Morcos. Arxiv 2023.
-
Structured Packing in LLM Training Improves Long Context Utilization. Konrad Staniszewski, Szymon Tworkowski, Sebastian Jaszczur, Henryk Michalewski, Łukasz Kuciński, Piotr Miłoś. Arxiv 2023.
-
WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models. Conghui He, Zhenjiang Jin, Chao Xu, Jiantao Qiu, Bin Wang, Wei Li, Hang Yan, Jiaqi Wang, Dahua Lin. Arxiv 2023.
-
Extending Context Window of Large Language Models via Positional Interpolation. Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian. Arxiv 2023.
-
Untie the Knots: An Efficient Data Augmentation Strategy for Long-Context Pre-Training in Language Models. Junfeng Tian, Da Zheng, Yang Cheng, Rui Wang, Colin Zhang, Debing Zhang. Arxiv 2024.
-
How to Train Long-Context Language Models (Effectively). Tianyu Gao, Alexander Wettig, Howard Yen, Danqi Chen. Arxiv 2024.
-
Data Engineering for Scaling Language Models to 128K Context. Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, Hao Peng. Arxiv 2024.
-
ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities. Peng Xu, Wei Ping, Xianchao Wu, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro. Arxiv 2024.
-
Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language Model. Chaochen Gao, Xing Wu, Qi Fu, Songlin Hu. Arxiv 2024.
-
RegMix: Data Mixture as Regression for Language Model Pre-training. Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin. Arxiv 2024.
-
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series. Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, Chou Leuang Yu, Danny Pan, Esther Cheng, Jie Liu, Qunshu Lin, Raven Yuan, Tuney Zheng, Wei Pang, Xinrun Du, Yiming Liang, Yinghao Ma, Yizhi Li, Ziyang Ma, Bill Lin, Emmanouil Benetos, Huan Yang, Junting Zhou, Kaijing Ma, Minghao Liu, Morry Niu, Noah Wang, Quehry Que, Ruibo Liu, Sine Liu, Shawn Guo, Soren Gao, Wangchunshu Zhou, Xinyue Zhang, Yizhi Zhou, Yubo Wang, Yuelin Bai, Yuhan Zhang, Yuxiang Zhang, Zenith Wang, Zhenzhu Yang, Zijian Zhao, Jiajun Zhang, Wanli Ouyang, Wenhao Huang, Wenhu Chen. Arxiv 2024.
-
LongWanjuan: Towards Systematic Measurement for Long Text Quality. Kai Lv, Xiaoran Liu, Qipeng Guo, Hang Yan, Conghui He, Xipeng Qiu, Dahua Lin. Arxiv 2024.
-
Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models. Longze Chen, Ziqiang Liu, Wanwei He, Yunshui Li, Run Luo, Min Yang. Arxiv 2024.
-
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance. Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, Xipeng Qiu. Arxiv 2024.
-
Generating Long-form Story Using Dynamic Hierarchical Outlining with Memory-Enhancement. Qianyue Wang, Jinwu Hu, Zhengping Li, Yufeng Wang, Daiyuan Li, Yu Hu, Mingkui Tan. Arxiv 2024.
-
LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs. Jianghao Chen, Junhong Wu, Yangyifan Xu, Jiajun Zhang. Arxiv 2025.
-
LongAttn: Selecting Long-context Training Data via Token-level Attention. Longyun Wu, Dawei Zhu, Guangxiang Zhao, Zhuocheng Yu, Junfeng Ran, Xiangyu Wong, Lin Sun, Sujian Li. Arxiv 2025.
-
NExtLong: Toward Effective Long-Context Training without Long Documents. Chaochen Gao, Xing Wu, Zijia Lin, Debing Zhang, Songlin Hu. Arxiv 2025.
-
Diversity Enhances an LLM's Performance in RAG and Long-context Task. Zhichao Wang, Bin Bi, Yanqi Luo, Sitaram Asur, Claire Na Cheng. Arxiv 2025.
-
Scaling Context, Not Parameters: Training a Compact 7B Language Model for Efficient Long-Context Processing. Chen Wu, Yin Song. Arxiv 2025.
-
Never Lost in the Middle: Mastering Long-Context Question Answering with Position-Agnostic Decompositional Training. Junqing He, Kunhao Pan, Xiaoqun Dong, Zhuoyang Song, Yibo Liu, Qianguo Sun, Yuxin Liang, Hao Wang, Enming Zhang, Jiaxing Zhang. Arxiv 2025.
-
Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key. Yingda Chen, Xingjun Wang, Jintao Huang, Yunlin Mao, Daoze Zhang, Yuze Zhao. Arxiv 2024.
-
MDCure: A Scalable Pipeline for Multi-Document Instruction-Following. Gabrielle Kaili-May Liu, Bowen Shi, Avi Caciularu, Idan Szpektor, Arman Cohan. Arxiv 2024.
-
LongAlign: A Recipe for Long Context Alignment of Large Language Models. Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, Juanzi Li. Arxiv 2024.
-
Long Context Alignment with Short Instructions and Synthesized Positions. Wenhao Wu, Yizhong Wang, Yao Fu, Xiang Yue, Dawei Zhu, Sujian Li. Arxiv 2024.
-
Large Language Models Can Self-Improve in Long-context Reasoning. Siheng Li, Cheng Yang, Zesen Cheng, Lemao Liu, Mo Yu, Yujiu Yang, Wai Lam. Arxiv 2024.
-
Extending Llama-3's Context Ten-Fold Overnight. Peitian Zhang, Ninglu Shao, Zheng Liu, Shitao Xiao, Hongjin Qian, Qiwei Ye, Zhicheng Dou. Arxiv 2024.
-
ChatQA: Surpassing GPT-4 on Conversational QA and RAG. Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro. Arxiv 2024.
-
USDC: A Dataset of User Stance and Dogmatism in Long Conversations. Mounika Marreddy, Subba Reddy Oota, Venkata Charan Chinni, Manish Gupta, Lucie Flek. Arxiv 2024.
-
From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data. Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos. Arxiv 2024.
-
LOGO -- Long cOntext aliGnment via efficient preference Optimization. Zecheng Tang, Zechen Sun, Juntao Li, Qiaoming Zhu, Min Zhang. ICML 2025.
-
LongReward: Improving Long-context Large Language Models with AI Feedback. Jiajie Zhang, Zhongni Hou, Xin Lv, Shulin Cao, Zhenyu Hou, Yilin Niu, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li. Arxiv 2024.
-
LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs. Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li. Arxiv 2024.
-
Make Your LLM Fully Utilize the Context. Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou. Arxiv 2024.
-
ORPO: Monolithic Preference Optimization without Reference Model. Jiwoo Hong, Noah Lee, James Thorne. Arxiv 2024.
-
Weaver: Foundation Models for Creative Writing. Tiannan Wang, Jiamin Chen, Qingrui Jia, Shuai Wang, Ruoyu Fang, Huilin Wang, Zhaowei Gao, Chunzhao Xie, Chuou Xu, Jihong Dai, Yibin Liu, Jialong Wu, Shengwei Ding, Long Li, Zhiwei Huang, Xinle Deng, Teng Yu, Gangan Ma, Han Xiao, Zixin Chen, Danjun Xiang, Yunxia Wang, Yuanyuan Zhu, Yi Xiao, Jing Wang, Yiru Wang, Siran Ding, Jiayang Huang, Jiayi Xu, Yilihamu Tayier, Zhenyu Hu, Yuan Gao, Chengfeng Zheng, Yueshu Ye, Yihang Li, Lei Wan, Xinyue Jiang, Yujie Wang, Siyu Cheng, Zhule Song, Xiangru Tang, Xiaohua Xu, Ningyu Zhang, Huajun Chen, Yuchen Eleanor Jiang, Wangchunshu Zhou. Arxiv 2024.
-
What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices. Zhi Chen, Qiguang Chen, Libo Qin, Qipeng Guo, Haijun Lv, Yicheng Zou, Wanxiang Che, Hang Yan, Kai Chen, Dahua Lin. Arxiv 2024.
-
Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond. Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang. Arxiv 2025.
-
InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models. Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, Yueting Zhuang. Arxiv 2025.
-
Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning. Wenhao Zhu, Pinzhen Chen, Hanxu Hu, Shujian Huang, Fei Yuan, Jiajun Chen, Alexandra Birch. Arxiv 2025.
-
WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale. Jiaxi Li, Xingxing Zhang, Xun Wang, Xiaolong Huang, Li Dong, Liang Wang, Si-Qing Chen, Wei Lu, Furu Wei. Arxiv 2025.
-
CLIPPER: Compression enables long-context synthetic data generation. Chau Minh Pham, Yapei Chang, Mohit Iyyer. Arxiv 2025.
-
DeFine: A Decomposed and Fine-Grained Annotated Dataset for Long-form Article Generation. Ming Wang, Fang Wang, Minghao Hu, Li He, Haiyang Wang, Jun Zhang, Tianwei Yan, Li Li, Zhunchen Luo, Wei Luo, Xiaoying Bai, Guotong Geng. Arxiv 2025.
-
Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation. Linda He, Jue Wang, Maurice Weber, Shang Zhu, Ben Athiwaratkun, Ce Zhang. Arxiv 2025.
-
LongDPO: Unlock Better Long-form Generation Abilities for LLMs via Critique-augmented Stepwise Information. Bowen Ping, Jiali Zeng, Fandong Meng, Shuo Wang, Jie Zhou, Shanghang Zhang. Arxiv 2025.
-
QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning. Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan. Arxiv 2025.
-
LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data. Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Shengjie Ma, Aofan Liu, Hui Xiong, Jian Guo. Arxiv 2025.
-
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li. Arxiv 2023.
-
ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding. Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, Omer Levy. Arxiv 2023.
-
LooGLE: Long Context Evaluation for Long-Context Language Models. Jiaqi Li, Mengmeng Wang, Zilong Zheng, Muhan Zhang. Arxiv 2023.
-
Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors. Ido Amos, Jonathan Berant, Ankit Gupta. ICLR 2024 Oral.
-
∞Bench: Extending Long Context Evaluation Beyond 100K Tokens. Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, Maosong Sun. Arxiv 2024.
-
LongIns: A Challenging Long-context Instruction-based Exam for LLMs. Shawn Gavin, Tuney Zheng, Jiaheng Liu, Quehry Que, Noah Wang, Jian Yang, Chenchen Zhang, Wenhao Huang, Wenhu Chen, Ge Zhang. Arxiv 2024.
-
HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly. Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, Danqi Chen. Arxiv 2024.
-
LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios. Xiaodong Wu, Minhao Wang, Yichen Liu, Xiaoming Shi, He Yan, Xiangju Lu, Junmin Zhu, Wei Zhang. Arxiv 2024.
-
XL2Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies. Xuanfan Ni, Hengyi Cai, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Piji Li. Arxiv 2024.
-
NovelQA: A Benchmark for Long-Range Novel Question Answering. Cunxiang Wang, Ruoxi Ning, Boqi Pan, Tonghui Wu, Qipeng Guo, Cheng Deng, Guangsheng Bao, Qian Wang, Yue Zhang. Arxiv 2024.
-
LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K. Tao Yuan, Xuefei Ning, Dong Zhou, Zhijie Yang, Shiyao Li, Minghui Zhuang, Zheyue Tan, Zhuyu Yao, Dahua Lin, Boxun Li, Guohao Dai, Shengen Yan, Yu Wang. Arxiv 2024.
-
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks. Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li. Arxiv 2024.
-
RULER: What's the Real Context Size of Your Long-Context Language Models? Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Boris Ginsburg. Arxiv 2024.
-
Analyzing Temporal Complex Events with Large Language Models? A Benchmark towards Temporal, Long Context Understanding. Zhihan Zhang, Yixin Cao, Chenchen Ye, Yunshan Ma, Lizi Liao, Tat-Seng Chua. Arxiv 2024.
-
L-CiteEval: Do Long-Context Models Truly Leverage Context for Responding? Zecheng Tang, Keyan Zhou, Juntao Li, Baibei Ji, Jianye Hou, Min Zhang. Arxiv 2024.
-
Long Code Arena: a Set of Benchmarks for Long-Context Code Models. Egor Bogomolov, Aleksandra Eliseeva, Timur Galimzyanov, Evgeniy Glukhov, Anton Shapkin, Maria Tigina, Yaroslav Golubev, Alexander Kovrigin, Arie van Deursen, Maliheh Izadi, Timofey Bryksin. Arxiv 2024.
-
Long Input Benchmark for Russian Analysis. Igor Churin, Murat Apishev, Maria Tikhonova, Denis Shevelev, Aydar Bulatov, Yuri Kuratov, Sergej Averkiev, Alena Fenogenova. Arxiv 2024.
-
Long-context LLMs Struggle with Long In-context Learning. Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen. Arxiv 2024.
-
LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA. Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li. Arxiv 2024.
-
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, Dong Yu. Arxiv 2024.
-
MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens. Yongqi Fan, Hongli Sun, Kui Xue, Xiaofan Zhang, Shaoting Zhang, Tong Ruan. Arxiv 2024.
-
Retrieval or Global Context Understanding? On Many-Shot In-Context Learning for Long-Context Evaluation. Kaijian Zou, Muhammad Khalifa, Lu Wang. Arxiv 2024.
-
How Long Can Context Length of Open-Source LLMs truly Promise? Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, Hao Zhang.
-
Landmark Attention: Random-Access Infinite Context Length for Transformers. Amirkeivan Mohtashami, Martin Jaggi. Arxiv 2023.
-
Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems. Philippe Laban, Alexander R. Fabbri, Caiming Xiong, Chien-Sheng Wu. Arxiv 2024.
-
Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation. Bernd Bohnet, Kevin Swersky, Rosanne Liu, Pranjal Awasthi, Azade Nova, Javier Snaider, Hanie Sedghi, Aaron T Parisi, Michael Collins, Angeliki Lazaridou, Orhan Firat, Noah Fiedel. Arxiv 2024.
-
Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack. Xiaoyue Xu, Qinyuan Ye, Xiang Ren. Arxiv 2024.
-
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? Jonathan Roberts, Kai Han, Samuel Albanie. Arxiv 2024.
-
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, Sébastien M. R. Arnold, Vincent Perot, Siddharth Dalmia, Hexiang Hu, Xudong Lin, Panupong Pasupat, Aida Amini, Jeremy R. Cole, Sebastian Riedel, Iftekhar Naim, Ming-Wei Chang, Kelvin Guu. Arxiv 2024.
-
DENIAHL: In-Context Features Influence LLM Needle-In-A-Haystack Abilities. Hui Dai, Dan Pechi, Xinyi Yang, Garvit Banga, Raghav Mantri. Arxiv 2024.
-
Distance between Relevant Information Pieces Causes Bias in Long-Context LLMs. Runchu Tian, Yanghao Li, Yuepeng Fu, Siyang Deng, Qinyu Luo, Cheng Qian, Shuo Wang, Xin Cong, Zhong Zhang, Yesai Wu, Yankai Lin, Huadong Wang, Xiaojiang Liu. Arxiv 2024.
-
ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage. Taewhoo Lee, Chanwoong Yoon, Kyochul Jang, Donghyeon Lee, Minju Song, Hyunjae Kim, Jaewoo Kang. Arxiv 2024.
-
Evaluating Multilingual Long-Context Models for Retrieval and Reasoning. Ameeta Agrawal, Andy Dang, Sina Bagheri Nezhad, Rhitabrat Pokharel, Russell Scheinberg. Arxiv 2024.
-
Examining Long-Context Large Language Models for Environmental Review Document Comprehension. Hung Phan, Anurag Acharya, Rounak Meyur, Sarthak Chaturvedi, Shivam Sharma, Mike Parker, Dan Nally, Ali Jannesari, Karl Pazdernik, Mahantesh Halappanavar, Sai Munikoti, Sameera Horawalavithana. Arxiv 2024.
-
Long2RAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall. Zehan Qi, Rongwu Xu, Zhijiang Guo, Cunxiang Wang, Hao Zhang, Wei Xu. EMNLP 2024.
-
Multilingual Needle in a Haystack: Investigating Long-Context Behavior of Multilingual Large Language Models. Amey Hengle, Prasoon Bajpai, Soham Dan, Tanmoy Chakraborty. Arxiv 2024.
-
NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window? Mo Li, Songyang Zhang, Yunxin Liu, Kai Chen. Arxiv 2024.
-
CLAPNQ: Cohesive Long-form Answers from Passages in Natural Questions for RAG systems. Sara Rosenthal, Avirup Sil, Radu Florian, Salim Roukos. Arxiv 2024.
-
LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs - No Silver Bullet for LC or RAG Routing. Kuan Li, Liwen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Shuai Wang, Minhao Cheng. Arxiv 2025.
-
NoLiMa: Long-Context Evaluation Beyond Literal Matching. Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze. Arxiv 2025.
-
U-NIAH: Unified RAG and LLM Evaluation for Long Context Needle-In-A-Haystack. Yunfan Gao, Yun Xiong, Wenlong Wu, Zijing Huang, Bohan Li, Haofen Wang. Arxiv 2025.
-
Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, Akiko Aizawa. Arxiv 2020.
-
A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers. Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, Matt Gardner. Arxiv 2021.
-
MuSiQue: Multihop Questions via Single-hop Question Composition. Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, Ashish Sabharwal. Arxiv 2021.
-
L-Eval: Instituting Standardized Evaluation for Long Context Language Models. Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, Xipeng Qiu. Arxiv 2023.
-
WikiHowQA: A Comprehensive Benchmark for Multi-Document Non-Factoid Question Answering. Valeriia Bolotova-Baranova, Vladislav Blinov, Sofya Filippova, Falk Scholer, Mark Sanderson. ACL 2023.
-
S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models. Fangyu Lei, Qian Liu, Yiming Huang, Shizhu He, Jun Zhao, Kang Liu. NAACL 2024.
-
Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA. Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, Yunshui Li, Min Yang, Fei Huang, Yongbin Li. Arxiv 2024.
-
FanOutQA: A Multi-Hop, Multi-Document Question Answering Benchmark for Large Language Models. Andrew Zhu, Alyssa Hwang, Liam Dugan, Chris Callison-Burch. Arxiv 2024.
-
DocFinQA: A Long-Context Financial Reasoning Dataset. Varshini Reddy, Rik Koncel-Kedziorski, Viet Dac Lai, Chris Tanner. Arxiv 2024.
-
FinTextQA: A Dataset for Long-form Financial Question Answering. Jian Chen, Peilin Zhou, Yining Hua, Yingxin Loh, Kehui Chen, Ziyuan Li, Bing Zhu, Junwei Liang. Arxiv 2024.
-
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack. Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev. Arxiv 2024.
-
Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks. Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen. Arxiv 2024.
-
CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models. Zexuan Qiu, Jingjing Li, Shijue Huang, Wanjun Zhong, Irwin King. Arxiv 2024.
-
DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels. Zhe Xu, Jiasheng Ye, Xiangyang Liu, Tianxiang Sun, Xiaoran Liu, Qipeng Guo, Linlin Li, Qun Liu, Xuanjing Huang, Xipeng Qiu. Arxiv 2024.
-
MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs. Lei Wang, Shan Dong, Yuhui Xu, Hanze Dong, Yalu Wang, Amrita Saha, Ee-Peng Lim, Caiming Xiong, Doyen Sahoo. Arxiv 2024.
-
One Thousand and One Pairs: A "novel" challenge for long-context language models. Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, Mohit Iyyer. Arxiv 2024.
-
A Benchmark for Long-Form Medical Question Answering. Pedram Hosseini, Jessica M. Sin, Bing Ren, Bryceton G. Thomas, Elnaz Nouri, Ali Farahanchi, Saeed Hassanpour. NeurIPS 2024.
-
MIR-Bench: Benchmarking LLM's Long-Context Intelligence via Many-Shot In-Context Inductive Reasoning. Kai Yan, Zhan Ling, Kang Liu, Yifan Yang, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen. Arxiv 2025.
-
LongCodeBench: Evaluating Coding LLMs at 1M Context Windows. Stefano Rando, Luca Romani, Alessio Sampieri, Yuta Kyuragi, Luca Franco, Fabio Galasso, Tatsunori Hashimoto, John Yang. Arxiv 2025.
-
L2M: Mutual Information Scaling Law for Long-Context Language Modeling. Zhuo Chen, Oriol Mayné i Comas, Zhuotao Jin, Di Luo, Marin Soljačić. Arxiv 2025.
-
LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion. Zhan Ling, Kang Liu, Kai Yan, Yifan Yang, Weijian Lin, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen. Arxiv 2025.
-
Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model. Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, Dragomir Radev. ACL 2019.
-
AQuaMuSe: Automatically Generating Datasets for Query-Based Multi-Document Summarization. Sayali Kulkarni, Sheide Chammas, Wan Zhu, Fei Sha, Eugene Ie. Arxiv 2020.
-
Efficient Attentions for Long Document Summarization. Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, Lu Wang. Arxiv 2021.
-
QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization. Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, Dragomir Radev. Arxiv 2021.
-
LCFO: Long Context and Long Form Output Dataset and Benchmarking. Marta R. Costa-jussà, Pierre Andrews, Mariano Coria Meglioli, Joy Chen, Joe Chuang, David Dale, Christophe Ropers, Alexandre Mourachko, Eduardo Sánchez, Holger Schwenk, Tuan Tran, Arina Turkatenko, Carleigh Wood. Arxiv 2024.
-
Holistic Reasoning with Long-Context LMs: A Benchmark for Database Operations on Massive Textual Data. Seiji Maekawa, Hayate Iso, Nikita Bhutani. Arxiv 2024.
-
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. Ofir Press, Noah A. Smith, Mike Lewis. ICLR 2022.
-
PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training. Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li. Arxiv 2023.
-
LongForm: Effective Instruction Tuning with Reverse Instructions. Abdullatif Köksal, Timo Schick, Anna Korhonen, Hinrich Schütze. Arxiv 2023.
-
LongLaMP: A Benchmark for Personalized Long-form Text Generation. Ishita Kumar, Snigdha Viswanathan, Sushrita Yerra, Alireza Salemi, Ryan A. Rossi, Franck Dernoncourt, Hanieh Deilamsalehy, Xiang Chen, Ruiyi Zhang, Shubham Agarwal, Nedim Lipka, Hamed Zamani. Arxiv 2024.
-
Long-form factuality in large language models. Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le. Arxiv 2024.
-
HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models. Haoran Que, Feiyu Duan, Liqun He, Yutao Mou, Wangchunshu Zhou, Jiaheng Liu, Wenge Rong, Zekun Moore Wang, Jian Yang, Ge Zhang, Junran Peng, Zhaoxiang Zhang, Songyang Zhang, Kai Chen. Arxiv 2024.
-
LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs. Yuhao Wu, Ming Shan Hee, Zhiqing Hu, Roy Ka-Wei Lee. Arxiv 2024.
-
DOLOMITES: Domain-Specific Long-Form Methodical Tasks. Chaitanya Malaviya, Priyanka Agrawal, Kuzman Ganchev, Pranesh Srinivasan, Fantine Huot, Jonathan Berant, Mark Yatskar, Dipanjan Das, Mirella Lapata, Chris Alberti. Arxiv 2024.
-
Large Language Models Still Exhibit Bias in Long Text. Wonje Jeung, Dongjae Jeon, Ashkan Yousefpour, Jonghyun Choi. Arxiv 2024.
-
OLAPH: Improving Factuality in Biomedical Long-form Question Answering. Minbyul Jeong, Hyeon Hwang, Chanwoong Yoon, Taewhoo Lee, Jaewoo Kang. Arxiv 2024.
-
PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models. Haochen Tan, Zhijiang Guo, Zhan Shi, Lu Xu, Zhili Liu, Xiaoguang Li, Yasheng Wang, Lifeng Shang, Qun Liu, Linqi Song. Arxiv 2024.
-
Suri: Multi-constraint Instruction Following for Long-form Text Generation. Chau Minh Pham, Simeng Sun, Mohit Iyyer. Arxiv 2024.
-
Lost-in-the-Middle in Long-Text Generation: Synthetic Dataset, Evaluation Framework, and Mitigation. Junhao Zhang, Richong Zhang, Fanshuang Kong, Ziyang Miao, Yanhan Ye, Yaowei Zheng. Arxiv 2025.
-
DeFine: A Decomposed and Fine-Grained Annotated Dataset for Long-form Article Generation. Ming Wang, Fang Wang, Minghao Hu, Li He, Haiyang Wang, Jun Zhang, Tianwei Yan, Li Li, Zhunchen Luo, Wei Luo, Xiaoying Bai, Guotong Geng. Arxiv 2025.
-
Beyond Outlining: Heterogeneous Recursive Planning for Adaptive Long-form Writing with Language Models. Ruibin Xiong, Yimeng Chen, Dmitrii Khizbullin, Jürgen Schmidhuber. Arxiv 2025.
-
LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation. Xi Ye, Fangcong Yin, Yinghui He, Joie Zhang, Howard Yen, Tianyu Gao, Greg Durrett, Danqi Chen. Arxiv 2025.
-
RAPID: Efficient Retrieval-Augmented Long Text Generation with Writing Planning and Information Discovery. Hongchao Gu, Dexun Li, Kuicai Dong, Hao Zhang, Hang Lv, Hao Wang, Defu Lian, Yong Liu, Enhong Chen. Arxiv 2025.
-
The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input. Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldshtein, Dipanjan Das. Arxiv 2025.
Contributions are welcome! If you'd like to add a new category or improve an existing one:
- Fork the repo.
- Create a new branch (git checkout -b feature/new-category).
- Make your changes.
- Commit and push your changes.
- Open a pull request.
Please follow the contribution guidelines (you can create a CONTRIBUTING.md file for details).
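The branch, edit, and commit steps above can be tried out safely before touching a real fork. The sketch below is illustrative only: it uses a throwaway local repository as a stand-in for a clone of your fork, and the directory name, identity values, commit messages, and the example entry are all placeholders rather than anything prescribed by this repository. Only the branch name comes from the steps above.

```shell
#!/bin/sh
# Sketch of the branch-and-commit steps, run in a throwaway local repository.
set -eu

# Stand-in for a clone of your fork (hypothetical directory name).
workdir="$(mktemp -d)"
cd "$workdir"
git init -q demo-fork
cd demo-fork
git config user.name "Example Contributor"       # placeholder identity
git config user.email "contributor@example.com"  # placeholder identity
printf '%s\n' "# Demo list" > README.md
git add README.md
git commit -q -m "Initial commit"

# Create a new branch, using the branch name from the steps above.
git checkout -q -b feature/new-category

# Make your changes: append a placeholder entry in the list's format.
printf '%s\n' "-" "Example Paper Title. Author One, Author Two. Arxiv 2025." >> README.md

# Commit the change. Pushing needs a real remote, so it stays commented out.
git commit -q -am "Add example entry"
# git push origin feature/new-category

branch="$(git rev-parse --abbrev-ref HEAD)"
echo "Current branch: $branch"
```

In a real contribution, the setup half is replaced by cloning your own fork, and the commented-out push is followed by opening a pull request from the new branch.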
This project is licensed under the MIT License — see the LICENSE file for details.
If you have any questions or suggestions, feel free to open an issue or contact the maintainers.