Document Reconstruction Unlocks Scalable Long-Context RLVR
Abstract
An unsupervised reinforcement learning approach enhances large language models' long-context capabilities by training them to reconstruct documents through paragraph identification and sequencing, achieving improved performance on benchmark datasets without requiring human annotations or teacher models.
Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent paradigm for enhancing the capabilities (e.g., long-context reasoning) of Large Language Models (LLMs). However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming to obtain. In this work, we investigate unsupervised approaches to enhancing the long-context capabilities of LLMs, eliminating the need for heavy human annotation or supervision from teacher models. Specifically, we first replace a few paragraphs in a long document with special placeholders. LLMs are then trained through reinforcement learning to reconstruct the document by correctly identifying and sequencing the missing paragraphs from a set of candidate options. This training paradigm enables the model to capture global narrative coherence, significantly boosting long-context performance. We validate the effectiveness of our method on two widely used benchmarks, RULER and LongBench v2. Beyond noticeable gains on RULER, it also achieves a reasonable improvement on LongBench v2 without any manually curated long-context QA data. Furthermore, we conduct extensive ablation studies to analyze the impact of reward design, data curation strategies, training schemes, and data scaling on model performance. We publicly release our code, data, and models.
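The abstract describes the core recipe: mask a few paragraphs in a long document, offer a shuffled candidate pool (the true paragraphs plus distractors), and reward the model only when it verifiably reconstructs the document. A minimal sketch of that task construction and reward, in Python, is shown below; the function names, the placeholder format, and the exact-match reward are illustrative assumptions, not details taken from the paper.

```python
import random
from typing import List, Tuple

def build_task(paragraphs: List[str], distractors: List[str],
               num_masked: int = 3,
               rng: random.Random = None) -> Tuple[str, List[str], List[int]]:
    """Replace `num_masked` paragraphs with placeholders and build a shuffled
    candidate pool containing the true paragraphs plus distractors (e.g.,
    paragraphs sampled from other documents)."""
    rng = rng or random.Random(0)
    masked = sorted(rng.sample(range(len(paragraphs)), num_masked))
    # Hypothetical placeholder format; the paper's actual tokens may differ.
    doc = [f"<|missing_{masked.index(i)}|>" if i in masked else p
           for i, p in enumerate(paragraphs)]
    pool = [paragraphs[i] for i in masked] + list(distractors)
    rng.shuffle(pool)
    # Gold answer: for each placeholder slot in order, the index of its
    # true paragraph within the shuffled pool (assumes unique paragraphs).
    gold = [pool.index(paragraphs[i]) for i in masked]
    return "\n\n".join(doc), pool, gold

def reward(predicted: List[int], gold: List[int]) -> float:
    """Verifiable reward: 1.0 only if every slot is filled with the correct
    candidate in the correct order, else 0.0 (one possible reward design)."""
    return 1.0 if predicted == gold else 0.0
```

Because correctness is checked mechanically against the source document itself, no human annotator or teacher model is needed to score rollouts; the reward above is the strictest (all-or-nothing) variant, and per-slot partial credit is an alternative the paper's reward-design ablations could cover.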
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards (2026)
- LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards (2026)
- In-Context Reinforcement Learning for Tool Use in Large Language Models (2026)
- Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models (2026)
- Incentivizing In-depth Reasoning over Long Contexts with Process Advantage Shaping (2026)
- RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning (2026)
- Grad2Reward: From Sparse Judgment to Dense Rewards for Improving Open-Ended LLM Reasoning (2026)