I-OSUM-Pangu: Intent-Aware Open-Source Speech Understanding Framework
Yujie Liao, Xuelong Geng, Shuiyuan Wang, Lei Xie
In recent years, the development of large-scale audio-language models has enabled multi-dimensional speech understanding. However, most existing open-source models rely on fixed templates or task tags, while more powerful systems are often closed-source or require massive amounts of training data.
We propose I-OSUM-Pangu, an efficient, controllable, and fully open-source speech understanding framework.
The model is built upon:
- Whisper-medium speech encoder (from OpenAI's Whisper series)
- OpenPangu-7B large language model backbone
The core objective of our framework is to enable the model to:
- Understand user instructions expressed in natural language
- Automatically identify user intent
- Route the request to the corresponding speech understanding task
- Work without relying on fixed prompt templates
Experimental results show that:
- The Instruction Following Rate (IFR) exceeds 90%
- Task performance remains comparable to traditional fixed-tag approaches
This project releases both code and model weights, aiming to provide a reproducible and extensible open-source framework for speech understanding research.
Architecture
The overall architecture of I-OSUM-Pangu is shown below:
The model mainly consists of three components:
1. Speech Encoder
Whisper-medium
Responsible for extracting speech representations.
2. Adapter
Transforms acoustic features into tokens compatible with the LLM input space.
3. Intent-aware LLM
OpenPangu-7B
Responsible for:
- Parsing natural language instructions
- Identifying user intent
- Determining which speech task to execute
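Conceptually, the three components chain into a single forward pass: encode speech, project the acoustic features into the LLM embedding space, then let the LLM read the audio tokens alongside the instruction. The sketch below is a minimal, hypothetical outline of that pipeline; the class name, parameter names, and adapter shape are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn as nn

class IOSUMPangu(nn.Module):
    """Minimal sketch of the encoder -> adapter -> LLM pipeline.
    All names and dimensions here are illustrative assumptions."""
    def __init__(self, speech_encoder, llm, enc_dim=1024, llm_dim=4096):
        super().__init__()
        self.speech_encoder = speech_encoder   # e.g. the Whisper-medium encoder
        # Adapter: maps acoustic features into tokens in the LLM input space
        self.adapter = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                          # e.g. OpenPangu-7B

    def forward(self, speech, instruction_embeds):
        feats = self.speech_encoder(speech)     # (B, T, enc_dim)
        audio_tokens = self.adapter(feats)      # (B, T, llm_dim)
        # The LLM sees audio tokens followed by the embedded natural-language
        # instruction, parses the intent, and produces the task output.
        inputs = torch.cat([audio_tokens, instruction_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```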
Training Strategy
We propose a Decoupled-then-Integrated Training Strategy, illustrated below:
Stage 1: Speech Understanding Alignment
Goal: Equip the model with multi-task speech understanding capability.
Characteristics:
- Only speech-related modules are trained
- Establishes a strong acoustic representation ability
Stage 2: Intent Understanding
Goal: Enable the model to understand natural language user instructions.
Examples:
Please transcribe this audio.
Analyze the speaker's emotion.
Identify what event happens in the audio.
The model learns:
- Instruction semantic understanding
- Task mapping capability
Stage 3: Joint Instruction Tuning
In the final stage, joint training allows the model to:
- Automatically parse user instructions
- Identify task types
- Execute the corresponding speech understanding tasks
without requiring fixed templates. For example, given diverse natural-language phrasings such as:
What is the emotion of this speech?
Can you transcribe this audio?
What event happens in the audio?
The model can correctly understand and execute all of them.
Inference Results
Dataset Configuration
The model is trained on 47,000 hours of multi-task speech data, covering seven core speech tasks. Additionally, a dedicated dataset is constructed to enhance instruction-following ability.
Instruction Following Performance (IFR)
Instruction Following Rate (IFR) measures the ability of the model to parse natural language instructions and execute the corresponding tasks.
The metric is defined as:

$$\text{IFR} = \frac{N_{correct}}{N_{total}} \times 100\%$$

where:
- $N_{correct}$ represents the number of correctly executed instructions
- $N_{total}$ represents the total number of evaluation samples
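The metric is straightforward to compute from per-sample evaluation records. The snippet below is a toy illustration (the record fields and task labels are hypothetical, not the repository's evaluation format):

```python
def instruction_following_rate(results):
    """IFR = N_correct / N_total * 100: the percentage of evaluation
    samples whose intended task was correctly identified and executed."""
    n_correct = sum(1 for r in results
                    if r["executed_task"] == r["intended_task"])
    return 100.0 * n_correct / len(results)

# Toy evaluation records (hypothetical task labels for illustration)
results = [
    {"intended_task": "ASR", "executed_task": "ASR"},
    {"intended_task": "SER", "executed_task": "SER"},
    {"intended_task": "SED", "executed_task": "ASR"},  # instruction misrouted
]
print(round(instruction_following_rate(results), 2))  # 66.67
```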
Compared with mainstream open-source models, I-OSUM-Pangu achieves significantly better performance:
Flexibility vs Accuracy
We evaluate whether natural language instructions (NL) degrade performance compared to fixed instructions (FI).
Results show that the model maintains strong flexibility while preserving task accuracy.
Conclusion:
Only minor performance drops appear in relatively niche tasks such as:
- Style recognition
- Event detection
Core tasks such as:
- ASR
- SER
- SAP
remain almost unchanged, validating the effectiveness of the Decoupled-then-Integrated strategy.
Multi-task Speech Understanding Performance
On public benchmarks, the model demonstrates competitive performance across multiple tasks, particularly in:
- Age prediction
- Emotion recognition (MER2023)
Speech-to-Text Chat (STTC) Capability
We further evaluate the model in conversational reasoning scenarios.
I-OSUM-Pangu outperforms GLM-4-Voice on the TriviaQA and WebQ benchmarks.
Ablation Study: Importance of the Decoupled Training Strategy
We compare direct joint training with our decoupled-then-integrated strategy to verify the effectiveness of our core design.
Conclusion:
Text-domain intent pretraining (Stage 2) establishes a strong semantic prior for the model and is crucial for improving instruction-following stability.
How to Use the I-OSUM-Pangu Framework for Training and Inference
Environment Setup
Before starting, please ensure that your device supports NPU and the Python environment is properly configured.
We recommend running the code on a Linux system.
If Conda is not installed, please refer to: https://blog.csdn.net/qq_41636123/article/details/130266232
# Create a new conda environment
conda create -n iosum python=3.10
conda activate iosum
# Clone the repository
git clone https://github.com/ASLP-lab/I-OSUM-Pangu.git
cd I-OSUM-Pangu
# Install dependencies
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
Model Download
from huggingface_hub import snapshot_download
# Download the I-OSUM-Pangu model weights
snapshot_download(
    repo_id="ASLP-lab/I-OSUM-Pangu",
    local_dir="path",
    local_dir_use_symlinks=False,
    endpoint="https://hf-mirror.com"
)
Inference
This project provides batch inference scripts for all tasks under `I-OSUM-Pangu/infer_code`. For example:
python infer_ASR.py
Training
To ensure a smooth training process, please follow the steps below.
1. Data Preparation
Data can be prepared in three formats:
raw, shard, and combine
Recommended: shard format
After preparing the dataset, write the generated data index into the following configuration file:
I-OSUM-Pangu/conf/data_s2t_tmp.yaml
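A hypothetical illustration of what such a data index entry might look like is given below; the actual field names and layout should be taken from the repository's template at `I-OSUM-Pangu/conf/data_s2t_tmp.yaml`:

```yaml
# Hypothetical example only; consult the repository's own
# conf/data_s2t_tmp.yaml for the real schema.
data_type: shard                        # one of: raw, shard, combine
train_data: data/train/shard_list.txt   # index of prepared training shards
dev_data: data/dev/shard_list.txt       # index of prepared dev shards
```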
2. Start Training
Run the main training script:
I-OSUM-Pangu/train.sh