arxiv:2604.09752

A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs

Published on Apr 15

Abstract

Deploying large language models on heterogeneous NPUs is memory-bound during autoregressive decoding: static deployment of a single model size creates a scaling paradox, and fine-grained speculative decoding suffers from kernel synchronization overhead.

AI-generated summary

During the deployment of Large Language Models (LLMs), the autoregressive decoding phase on heterogeneous NPU platforms (e.g., Ascend 910B) faces severe memory-bound challenges. This study reveals the "Model Scaling Paradox" caused by the static deployment of single-sized models. It also points out the kernel synchronization overhead of fine-grained speculative decoding (Leviathan et al., 2023; Chen et al., 2023) under NPU computational graph compilation, and the severe limitations of purely relying on micro-level acceleration algorithms like Prompt LookUp Decoding (PLD).
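For readers unfamiliar with Prompt LookUp Decoding, the sketch below shows the generic drafting step as it is commonly described: find an earlier occurrence of the most recent n-gram in the context and propose the tokens that followed it as a draft for the target model to verify. This is not the paper's implementation. The function name `prompt_lookup_draft` and the parameters `ngram_size` and `num_draft` are hypothetical, chosen for illustration only.

```python
from typing import List


def prompt_lookup_draft(
    tokens: List[int], ngram_size: int = 3, num_draft: int = 5
) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in the context.

    Illustrative only: names and defaults are assumptions, not taken from the paper.
    """
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan right to left so the most recent earlier match wins.
    # The starting index skips the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            # Draft the tokens that followed the earlier occurrence.
            return tokens[start + ngram_size:start + ngram_size + num_draft]
    return []  # no match: fall back to ordinary one-token decoding


context = [1, 2, 3, 4, 5, 6, 1, 2, 3]
draft = prompt_lookup_draft(context, ngram_size=3, num_draft=2)
```

The draft costs no model forward pass, which is why the technique is attractive for memory-bound decoding, but it only helps when the output repeats spans of the context. That dependence is one reason a purely micro-level method of this kind can be limited.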
