Audio Caption and Categorizer Models
Model Description
This repository provides optimized exports of audio captioning and categorization models for ARM-based mobile deployment. The pipeline consists of:
- Audio Captioning: uses `wsntxxn/effb2-trm-audiocaps-captioning` (EfficientNet-B2 encoder + Transformer decoder) to generate natural language descriptions of audio events.
- Audio Categorization: uses `sentence-transformers/all-MiniLM-L6-v2` to match generated captions to predefined sound categories via semantic similarity.
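In essence, the categorization step is a nearest-neighbour search in embedding space. The sketch below only illustrates the idea with the upstream sentence-transformers model (the shipped pipeline runs the ExecuTorch export against precomputed embeddings instead), and the category list is a made-up subset rather than the contents of categories.json:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical subset of categories; the real list lives in categories.json.
categories = ["dog barking", "rain", "car engine", "human speech"]
caption = "a dog barks repeatedly in the distance"

# Embed the caption and the category names, then rank by cosine similarity.
category_emb = model.encode(categories, convert_to_tensor=True, normalize_embeddings=True)
caption_emb = model.encode(caption, convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(caption_emb, category_emb)[0]

best = int(scores.argmax())
print(f"{categories[best]} (score: {float(scores[best]):.2f})")
```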
Export Formats
- Encoder: ONNX format with integrated preprocessing (STFT, MelSpectrogram, AmplitudeToDB)
- Decoder: ExecuTorch (`.pte`) format with dynamic quantization for reduced model size
- Categorizer: ExecuTorch (`.pte`) format with quantization
Key Features
- 5-second audio input at 16kHz
- Preprocessing baked into ONNX encoder (no external audio processing needed)
- Optimized for mobile inference with quantization
- Complete end-to-end pipeline from raw audio to categorized captions
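The STFT/mel feature extraction is baked into the exported encoder, but the input waveform itself still has to be mono, 16kHz, and exactly 80000 samples long. A minimal sketch of that preparation with torchaudio (the bundled generate_caption_hybrid.py may do this differently; sample_audio.wav is a placeholder path):

```python
import torch
import torchaudio

# Load an arbitrary recording, mix down to mono, resample to 16 kHz,
# then pad or trim to exactly 5 s (80000 samples).
waveform, sr = torchaudio.load("sample_audio.wav")      # (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)           # mono
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

target_len = 16000 * 5
if waveform.shape[1] < target_len:
    waveform = torch.nn.functional.pad(waveform, (0, target_len - waveform.shape[1]))
audio = waveform[:, :target_len].numpy()                 # (1, 80000), float32
```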
Usage
Quick Start
Generate a caption for an audio file:
# Activate environment
source .venv/bin/activate
# Generate caption
python audio-caption/generate_caption_hybrid.py --audio sample_audio.wav
Python Example
import onnxruntime as ort
import numpy as np
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch
from transformers import AutoTokenizer

# Load models
encoder_session = ort.InferenceSession("audio-caption/effb2_encoder_preprocess.onnx")
decoder = _load_for_executorch("audio-caption/effb2_decoder_5sec.pte")
tokenizer = AutoTokenizer.from_pretrained("wsntxxn/audiocaps-simple-tokenizer", trust_remote_code=True)

# Process audio (16kHz, 5 seconds = 80000 samples)
audio = np.random.randn(1, 80000).astype(np.float32)

# Encode
attn_emb = encoder_session.run(["attn_emb"], {"audio": audio})[0]

# Decode (greedy search)
generated = [tokenizer.bos_token_id]
for _ in range(30):
    logits = decoder.forward((
        torch.tensor([generated]),
        torch.tensor(attn_emb),
        torch.tensor([attn_emb.shape[1] - 1]),
    ))[0]
    next_token = int(torch.argmax(logits[0, -1, :]))
    generated.append(next_token)
    if next_token == tokenizer.eos_token_id:
        break

caption = tokenizer.decode(generated, skip_special_tokens=True)
print(caption)
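The example above stops at the caption; matching it to a category needs the exported categorizer and the precomputed embeddings. The sketch below is an outline only: the categorizer's `.pte` filename, its (input_ids, attention_mask) forward signature, and the layout of category_embeddings.json (category name mapped to a vector) are assumptions, so check export_sentence_transformers_executorch.py and generate_category_embeddings.py for the real interfaces.

```python
import json
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch
from transformers import AutoTokenizer

caption = "a dog barks repeatedly in the distance"  # or the caption produced above

# Assumed filename and forward signature for the exported categorizer
# (verify against export_sentence_transformers_executorch.py).
categorizer = _load_for_executorch("sentence-transformers-embbedings/minilm_categorizer.pte")
# Tokenizer assumed to match the base model; the export may ship its own.
cat_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Assumed layout: {"category name": [list of floats], ...}
with open("sentence-transformers-embbedings/category_embeddings.json") as f:
    category_embeddings = {k: torch.tensor(v, dtype=torch.float32)
                           for k, v in json.load(f).items()}

enc = cat_tokenizer(caption, padding="max_length", truncation=True,
                    max_length=128, return_tensors="pt")
token_emb = categorizer.forward((enc["input_ids"], enc["attention_mask"]))[0]

# Mean-pool over non-padding tokens, then L2-normalize (standard MiniLM recipe).
mask = enc["attention_mask"].unsqueeze(-1).float()
caption_emb = torch.nn.functional.normalize((token_emb * mask).sum(1) / mask.sum(1), dim=-1)

scores = {name: float(caption_emb @ torch.nn.functional.normalize(vec, dim=0))
          for name, vec in category_embeddings.items()}
print(max(scores, key=scores.get))
```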
Training Details
Base Models
This repository does not train any models; it exports the following pre-trained models to optimized formats:
| Component | Base Model | Training Dataset | Parameters |
|---|---|---|---|
| Audio Encoder | EfficientNet-B2 | AudioCaps | ~7.7M |
| Caption Decoder | Transformer (2 layers) | AudioCaps | ~4.3M |
| Categorizer | all-MiniLM-L6-v2 | 1B+ sentence pairs | ~22.7M |
Export Configuration
Audio Captioning:
- Preprocessing: `n_mels=64`, `n_fft=512`, `hop_length=160`, `win_length=512`
- Input: Raw audio waveform (16kHz, 5 seconds)
- Encoder: ONNX opset 17 with dynamic axes
- Decoder: ExecuTorch with dynamic quantization (int8)
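For reference, these settings correspond to a torchaudio front-end roughly like the one below. This is an illustration of the listed parameters, not a copy of export_encoder_preprocess_onnx.py, which may differ in details such as normalization:

```python
import torch
import torchaudio

class AudioFrontend(torch.nn.Module):
    """Log-mel front-end built from the parameters listed above."""
    def __init__(self):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=512, win_length=512,
            hop_length=160, n_mels=64,
        )
        self.to_db = torchaudio.transforms.AmplitudeToDB()

    def forward(self, audio):                    # audio: (batch, 80000) raw waveform
        return self.to_db(self.melspec(audio))   # (batch, 64, frames) log-mel

frontend = AudioFrontend()
print(frontend(torch.randn(1, 80000)).shape)     # torch.Size([1, 64, 501])
```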
Categorizer:
- Tokenizer: RoBERTa-based (max length: 128)
- Export: ExecuTorch with dynamic quantization
- Categories: 50+ predefined audio event categories
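As a pointer to where the precomputed embeddings come from, the sketch below approximates what generate_category_embeddings.py does, assuming categories.json is a flat JSON list of category names (the real file may carry more structure, such as descriptions or groupings):

```python
import json
from sentence_transformers import SentenceTransformer

# Assumption: categories.json is a flat JSON list of category names.
with open("categories.json") as f:
    categories = json.load(f)

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(categories, normalize_embeddings=True)

# Store one normalized vector per category for later cosine-similarity lookup.
with open("sentence-transformers-embbedings/category_embeddings.json", "w") as f:
    json.dump({name: emb.tolist() for name, emb in zip(categories, embeddings)}, f)
```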
Project Structure
.
├── audio-caption/
│   ├── export_encoder_preprocess_onnx.py    # Export ONNX encoder
│   ├── export_decoder_executorch.py         # Export ExecuTorch decoder
│   ├── generate_caption_hybrid.py           # Inference pipeline
│   ├── effb2_encoder_preprocess.onnx        # Exported encoder
│   └── effb2_decoder_5sec.pte               # Exported decoder
│
├── sentence-transformers-embbedings/
│   ├── export_sentence_transformers_executorch.py
│   ├── generate_category_embeddings.py
│   └── category_embeddings.json
│
└── categories.json                          # Category definitions
Setup
Prerequisites
# Install uv package manager
pip install uv
# Create environment
uv venv
source .venv/bin/activate
# Install dependencies
uv pip install -r pyproject.toml
Configuration
Create a .env file:
# Hugging Face Token (for gated models)
HF_TOKEN=your_token_here
# Optional: Custom cache directory
# HF_HOME=./.cache/huggingface
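If a script does not pick up the .env file on its own, the token can be loaded and passed to the Hugging Face Hub explicitly; a small sketch assuming python-dotenv is installed:

```python
import os
from dotenv import load_dotenv
from huggingface_hub import login

# Read HF_TOKEN (and optional HF_HOME) from .env into the environment.
load_dotenv()
if os.getenv("HF_TOKEN"):
    login(token=os.environ["HF_TOKEN"])
```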
Export Models
# Export audio captioning models
python audio-caption/export_encoder_preprocess_onnx.py
python audio-caption/export_decoder_executorch.py
# Export categorization model
python sentence-transformers-embbedings/export_sentence_transformers_executorch.py
# Generate category embeddings
python sentence-transformers-embbedings/generate_category_embeddings.py
License
Apache License 2.0
Citations
Audio Captioning Model
@inproceedings{xu2024efficient,
title={Efficient Audio Captioning with Encoder-Level Knowledge Distillation},
author={Xu, Xuenan and Liu, Haohe and Wu, Mengyue and Wang, Wenwu and Plumbley, Mark D.},
booktitle={Interspeech 2024},
year={2024},
doi={10.48550/arXiv.2407.14329},
url={https://arxiv.org/abs/2407.14329}
}
Sentence Transformer
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}