IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning
🚀 Overview
IAPO is an information-theoretic post-training framework that improves the token efficiency of Chain-of-Thought (CoT) reasoning. Instead of shaping rewards at the sequence level — as in standard RL methods such as GRPO — IAPO assigns token-wise advantages based on each token's conditional mutual information (MI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning tokens and suppressing low-utility exploration, producing shorter reasoning traces without sacrificing accuracy.
Specifically, this checkpoint is a fine-tuned version of Qwen2.5-0.5B-Instruct using the IAPO framework on
mathematical reasoning datasets (GSM8K, MATH-500, and DAPO-Math-17k).
Paper: IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning Code: Official GitHub Repository
🎯 Key Features
- 🧠 Information-Aware Advantage Shaping: Assigns token-level advantages based on conditional MI w.r.t. the final answer, amplifying informative tokens and suppressing redundant ones.
- 🔍 Exploration Adjustment: Adds a per-token exploration term that rewards confident tokens in correct trajectories and penalizes them in incorrect ones, preventing premature reasoning collapse while encouraging recovery from failed paths.
- ⚡ Efficient Conditional MI Estimation: Introduces an early-exit–based MI estimator combined with KV-cache
preloading and chunk-wise forwarding, reducing the naive
O(L³d)cost of per-token MI estimation to a tractable cost suitable for long reasoning. - 📉 Provable Length Reduction: Theoretical analysis shows that, under mild assumptions, IAPO monotonically reduces expected completion length without harming correctness.
- 🏆 Strong Empirical Results: Reduces reasoning length by up to 47% while improving accuracy across math reasoning datasets, outperforming token-efficient RL baselines such as DAPO, GFPO, GTPO, and S-GRPO.
🔧 Loading the Checkpoints
You can load any of our released IAPO checkpoints directly from the Hugging Face Hub via the subfolder argument.
Each subfolder corresponds to a {base_model}_{dataset} combination. For example:
from transformers import AutoModel
# Generic usage
model = AutoModel.from_pretrained("jonathanhe123/iapo", subfolder="model_name")
# Example: load the Qwen2.5-0.5B-Instruct checkpoint fine-tuned on MATH-500
model = AutoModel.from_pretrained("jonathanhe123/iapo", subfolder="Qwen2.5-0.5B-Instruct_MATH-500")
Citation
If you find this work useful, please cite our paper:
@inproceedings{he2026iapo,
title={IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning},
author={He, Yinhan and Zhu, Yaochen and Shi, Mingjia and Zheng, Wendy and Su, Lin and Wang, Xiaoqing and Guo, Qi
and Li, Jundong},
booktitle={International Conference on Machine Learning (ICML 2026)},
year={2026}
}