IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning

🚀 Overview

IAPO is an information-theoretic post-training framework that improves the token efficiency of Chain-of-Thought (CoT) reasoning. Instead of shaping rewards at the sequence level, as standard RL methods such as GRPO do, IAPO assigns token-wise advantages based on each token's conditional mutual information (MI) with the final answer. This gives an explicit, principled mechanism for identifying informative reasoning tokens and suppressing low-utility exploration, producing shorter reasoning traces without sacrificing accuracy.
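To make the token-level signal concrete, here is a minimal sketch (not the paper's estimator) that scores each reasoning token by how much it reduces the entropy of a distribution over candidate final answers. The realized entropy drop is used as a simple stand-in for conditional MI; the function names `entropy` and `information_gains` and the toy distributions are assumptions made for illustration only.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def information_gains(answer_dists):
    """Entropy drop of the answer distribution after each reasoning token.

    answer_dists[0] is the (hypothetical) answer distribution before any
    reasoning token; answer_dists[t] is the distribution after token t.
    """
    entropies = [entropy(d) for d in answer_dists]
    return [entropies[t - 1] - entropies[t] for t in range(1, len(entropies))]


# Toy example: three candidate answers, three reasoning tokens.
dists = [
    [1 / 3, 1 / 3, 1 / 3],  # before reasoning: maximally uncertain
    [0.6, 0.2, 0.2],        # token 1 narrows the answer
    [0.6, 0.2, 0.2],        # token 2 adds nothing
    [0.98, 0.01, 0.01],     # token 3 nearly settles the answer
]
gains = information_gains(dists)
```

In this toy setting the second token has zero gain, which is the kind of token an information-aware objective would aim to suppress.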

Specifically, this checkpoint is a version of Qwen2.5-0.5B-Instruct fine-tuned with the IAPO framework on mathematical reasoning datasets (GSM8K, MATH-500, and DAPO-Math-17k).

Paper: IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning

Code: Official GitHub Repository

🎯 Key Features

  • 🧠 Information-Aware Advantage Shaping: Assigns token-level advantages based on conditional MI w.r.t. the final answer, amplifying informative tokens and suppressing redundant ones.
  • 🔍 Exploration Adjustment: Adds a per-token exploration term that rewards confident tokens in correct trajectories and penalizes them in incorrect ones, preventing premature reasoning collapse while encouraging recovery from failed paths.
  • Efficient Conditional MI Estimation: Introduces an early-exit–based MI estimator combined with KV-cache preloading and chunk-wise forwarding, reducing the naive O(L³d) cost of per-token MI estimation to a tractable cost suitable for long reasoning.
  • 📉 Provable Length Reduction: Theoretical analysis shows that, under mild assumptions, IAPO monotonically reduces expected completion length without harming correctness.
  • 🏆 Strong Empirical Results: Reduces reasoning length by up to 47% while improving accuracy across math reasoning datasets, outperforming token-efficient RL baselines such as DAPO, GFPO, GTPO, and S-GRPO.
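The first two features above can be illustrated with a small sketch. It starts from GRPO-style group-normalized sequence advantages and then adds two per-token terms: one that raises tokens with above-average information scores, and one that rewards confident tokens in correct trajectories and penalizes them in incorrect ones. The additive form, the helper `shaped_token_advantages`, and the coefficients `alpha` and `beta` are placeholders assumed for illustration, not the formulation used in the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style sequence-level advantages: standardize rewards within a group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def shaped_token_advantages(seq_adv, info_scores, token_probs, correct,
                            alpha=0.5, beta=0.1):
    """Hypothetical token-wise shaping on top of one sequence-level advantage.

    info_scores: per-token information estimates (e.g. MI with the answer).
    token_probs: per-token sampling probabilities, used as confidence.
    correct:     whether the trajectory reached the right final answer.
    """
    mean_info = mean(info_scores)
    sign = 1.0 if correct else -1.0
    shaped = []
    for info, prob in zip(info_scores, token_probs):
        info_term = alpha * (info - mean_info)  # above-average info is boosted
        explore_term = beta * sign * prob       # confidence helps only if correct
        shaped.append(seq_adv + info_term + explore_term)
    return shaped


# Four sampled completions for one prompt, with 0/1 correctness rewards.
advs = group_advantages([1.0, 0.0, 1.0, 0.0])

# Token-wise advantages for the first (correct) completion.
shaped = shaped_token_advantages(
    advs[0], info_scores=[0.15, 0.0, 0.84], token_probs=[0.9, 0.5, 0.9],
    correct=True,
)
```

With these toy numbers the uninformative middle token ends up with the lowest advantage and the highly informative last token with the highest, while a sequence-level method would assign all three the same value.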

🔧 Loading the Checkpoints

You can load any of our released IAPO checkpoints directly from the Hugging Face Hub via the subfolder argument. Each subfolder corresponds to a {base_model}_{dataset} combination. For example:

from transformers import AutoModel

# Generic usage
model = AutoModel.from_pretrained("jonathanhe123/iapo", subfolder="model_name")

# Example: load the Qwen2.5-0.5B-Instruct checkpoint fine-tuned on MATH-500
model = AutoModel.from_pretrained("jonathanhe123/iapo", subfolder="Qwen2.5-0.5B-Instruct_MATH-500")
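If you load several checkpoints, a small helper can build the subfolder name from the {base_model}_{dataset} pattern described above. This is a sketch under assumptions: `subfolder_name` and `load_iapo_checkpoint` are hypothetical helpers, and using `AutoModelForCausalLM` (instead of `AutoModel`) assumes each subfolder holds causal-LM weights that you want to use for text generation. Which combinations are actually published is not listed here, so check the repository before relying on a given name.

```python
REPO_ID = "jonathanhe123/iapo"


def subfolder_name(base_model: str, dataset: str) -> str:
    """Build the subfolder name following the {base_model}_{dataset} pattern."""
    return f"{base_model}_{dataset}"


def load_iapo_checkpoint(base_model: str, dataset: str, repo_id: str = REPO_ID):
    """Load one checkpoint with a language-modeling head (downloads weights)."""
    # Imported lazily so the name helper works without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        repo_id, subfolder=subfolder_name(base_model, dataset)
    )


# Example (requires network access and the transformers package):
# model = load_iapo_checkpoint("Qwen2.5-0.5B-Instruct", "MATH-500")
```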

Citation

If you find this work useful, please cite our paper:

@inproceedings{he2026iapo,
  title={IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning},
  author={He, Yinhan and Zhu, Yaochen and Shi, Mingjia and Zheng, Wendy and Su, Lin and Wang, Xiaoqing and Guo, Qi and Li, Jundong},
  booktitle={International Conference on Machine Learning (ICML 2026)},
  year={2026}
}