Leveraging the multi-dimensional fine-grained annotations produced by our pipeline, we introduce FM-Speech, built upon the frontier Qwen3-Omni (30B MoE) architecture.

🎙️ Input: Raw Speech Audio ➔ 📊 Output: 14-Dimension Fine-Grained Speech Attributes (Structured JSON)

To overcome modality gaps and text-conditioned hallucinations, FM-Speech is trained using a Progressive Curriculum Fine-Tuning framework, decoupling complex auditory comprehension into three incremental stages: Warm-up (MCQ/QA) --> Capability Ramp-up --> Final Alignment (Full JSON).

🚀 Usage & Environment Setup

Our model is built upon the Qwen3-Omni architecture. We strongly recommend using vLLM for the inference and deployment of FM-Speech.

Step 1: Create a fresh Python environment to avoid runtime conflicts and incompatibilities.

conda create -n fmspeech python=3.12
conda activate fmspeech

Step 2: Install required packages

# Install vLLM (Specifically version 0.13.0)
pip install vllm==0.13.0
# Note: If you meet an "Undefined symbol" error while using VLLM_USE_PRECOMPILED=1, 
# please use "pip install -e . -v" to build vLLM from source.

# Install Transformers and Accelerate
pip install transformers==4.57.3
pip install accelerate

# Install Qwen Omni utilities and Flash Attention
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation

Step 3: Run Inference Prepare a sample audio file and run the inference script to generate the 14-dimension JSON output.

python infer.py

(See infer.py in our repository for detailed loading and inference examples).

Downloads last month: -; Downloads are not tracked for this model. How to track

Model tree for ASLP-lab/FM-Speech

Base model

Qwen/Qwen3-Omni-30B-A3B-Instruct

Finetuned

(19)

this model

Collection including ASLP-lab/FM-Speech

FMSU

Collection

Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model • 2 items • Updated 8 days ago