Audio Caption and Categorizer Models

Model Description

This repository provides optimized exports of audio captioning and categorization models for ARM-based mobile deployment. The pipeline consists of:

  1. Audio Captioning: Uses wsntxxn/effb2-trm-audiocaps-captioning (EfficientNet-B2 encoder + Transformer decoder) to generate natural language descriptions of audio events.

  2. Audio Categorization: Uses sentence-transformers/all-MiniLM-L6-v2 to match generated captions to predefined sound categories via semantic similarity.
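
The categorization step is plain embedding similarity: the generated caption is embedded and compared against precomputed category embeddings, and the closest category wins. Below is a minimal sketch of that idea using the sentence-transformers Python package and a hand-picked illustrative category list; on device, the exported ExecuTorch categorizer together with the precomputed category_embeddings.json serves this role.

from sentence_transformers import SentenceTransformer
import numpy as np

# Embed the caption and candidate categories with the same base model
# that the categorizer export is derived from.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

caption = "a dog barks while cars pass by"
categories = ["dog barking", "traffic noise", "music", "speech"]  # illustrative subset

caption_emb = model.encode(caption, normalize_embeddings=True)
category_embs = model.encode(categories, normalize_embeddings=True)

# With normalized embeddings, cosine similarity reduces to a dot product.
scores = category_embs @ caption_emb
print(categories[int(np.argmax(scores))])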

Export Formats

  • Encoder: ONNX format with integrated preprocessing (STFT, MelSpectrogram, AmplitudeToDB)
  • Decoder: ExecuTorch (.pte) format with dynamic quantization for reduced model size
  • Categorizer: ExecuTorch (.pte) format with quantization
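
For orientation on the .pte artifacts, here is a minimal sketch of the generic ExecuTorch lowering flow on a stand-in module, without quantization; the repository's export scripts wrap the actual decoder and categorizer and add dynamic quantization, and the exact APIs can shift between ExecuTorch releases.

import torch
from executorch.exir import to_edge

# Stand-in module; the real exports wrap the caption decoder / categorizer.
model = torch.nn.Linear(16, 4).eval()
example_inputs = (torch.randn(1, 16),)

# torch.export -> edge dialect -> ExecuTorch program -> .pte file
exported = torch.export.export(model, example_inputs)
et_program = to_edge(exported).to_executorch()

with open("model.pte", "wb") as f:
    f.write(et_program.buffer)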

Key Features

  • 5-second audio input at 16kHz
  • Preprocessing baked into ONNX encoder (no external audio processing needed)
  • Optimized for mobile inference with quantization
  • Complete end-to-end pipeline from raw audio to categorized captions

Usage

Quick Start

Generate a caption for an audio file:

# Activate environment
source .venv/bin/activate

# Generate caption
python audio-caption/generate_caption_hybrid.py --audio sample_audio.wav

Python Example

import onnxruntime as ort
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch
from transformers import AutoTokenizer
import numpy as np

# Load models
encoder_session = ort.InferenceSession("audio-caption/effb2_encoder_preprocess.onnx")
decoder = _load_for_executorch("audio-caption/effb2_decoder_5sec.pte")
tokenizer = AutoTokenizer.from_pretrained("wsntxxn/audiocaps-simple-tokenizer", trust_remote_code=True)

# Process audio (16kHz, 5 seconds = 80000 samples)
audio = np.random.randn(1, 80000).astype(np.float32)

# Encode
attn_emb = encoder_session.run(["attn_emb"], {"audio": audio})[0]

# Decode (greedy search)
generated = [tokenizer.bos_token_id]
for _ in range(30):
    logits = decoder.forward((
        torch.tensor([generated]),
        torch.tensor(attn_emb),
        torch.tensor([attn_emb.shape[1] - 1])
    ))[0]
    next_token = int(torch.argmax(logits[0, -1, :]))
    generated.append(next_token)
    if next_token == tokenizer.eos_token_id:
        break

caption = tokenizer.decode(generated, skip_special_tokens=True)
print(caption)
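
Note that the loop above re-feeds the full token prefix to the decoder on every step (no key/value caching), which keeps the exported decoder's interface simple at the cost of some redundant compute for longer captions.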

Training Details

Base Models

This repository does not train models but exports pre-trained models to optimized formats:

Component         Base Model               Training Dataset     Parameters
----------------  -----------------------  -------------------  ----------
Audio Encoder     EfficientNet-B2          AudioCaps            ~7.7M
Caption Decoder   Transformer (2 layers)   AudioCaps            ~4.3M
Categorizer       all-MiniLM-L6-v2         1B+ sentence pairs   ~22.7M

Export Configuration

Audio Captioning:

  • Preprocessing: n_mels=64, n_fft=512, hop_length=160, win_length=512
  • Input: Raw audio waveform (16kHz, 5 seconds)
  • Encoder: ONNX opset 17 with dynamic axes
  • Decoder: ExecuTorch with dynamic quantization (int8)
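
A minimal sketch of how an encoder with baked-in preprocessing can be exported under these settings, using torchaudio transforms and a placeholder encoder; the actual export_encoder_preprocess_onnx.py wraps the EfficientNet-B2 encoder and may differ in detail.

import torch
import torchaudio

class PreprocessEncoder(torch.nn.Module):
    def __init__(self, encoder):
        super().__init__()
        # STFT / mel / dB steps become part of the exported graph
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=512, win_length=512,
            hop_length=160, n_mels=64)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        self.encoder = encoder

    def forward(self, audio):  # audio: (batch, samples) raw waveform
        feats = self.to_db(self.melspec(audio))
        return self.encoder(feats)

model = PreprocessEncoder(torch.nn.Identity()).eval()  # Identity = placeholder

dummy = torch.randn(1, 80000)  # 5 seconds at 16kHz
torch.onnx.export(
    model, (dummy,), "effb2_encoder_preprocess.onnx",
    opset_version=17,  # opset 17 adds the STFT operator
    input_names=["audio"], output_names=["attn_emb"],
    dynamic_axes={"audio": {0: "batch"}, "attn_emb": {0: "batch"}})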

Categorizer:

  • Tokenizer: RoBERTa-based (max length: 128)
  • Export: ExecuTorch with dynamic quantization
  • Categories: 50+ predefined audio event categories
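
A minimal sketch of how the category embeddings might be precomputed and stored, assuming categories.json holds a flat list of category names; generate_category_embeddings.py in this repository is the authoritative version.

import json
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Assumption: categories.json is a flat JSON list of category names.
with open("categories.json") as f:
    categories = json.load(f)

embeddings = model.encode(categories, normalize_embeddings=True)

# Store name -> embedding so the runtime only needs to embed captions.
with open("category_embeddings.json", "w") as f:
    json.dump({name: emb.tolist() for name, emb in zip(categories, embeddings)}, f)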

Project Structure

.
├── audio-caption/
│   ├── export_encoder_preprocess_onnx.py  # Export ONNX encoder
│   ├── export_decoder_executorch.py       # Export ExecuTorch decoder
│   ├── generate_caption_hybrid.py         # Inference pipeline
│   ├── effb2_encoder_preprocess.onnx      # Exported encoder
│   └── effb2_decoder_5sec.pte             # Exported decoder
│
├── sentence-transformers-embbedings/
│   ├── export_sentence_transformers_executorch.py
│   ├── generate_category_embeddings.py
│   └── category_embeddings.json
│
└── categories.json                         # Category definitions

Setup

Prerequisites

# Install uv package manager
pip install uv

# Create environment
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -r pyproject.toml

Configuration

Create a .env file:

# Hugging Face Token (for gated models)
HF_TOKEN=your_token_here

# Optional: Custom cache directory
# HF_HOME=./.cache/huggingface
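
A minimal sketch of how a script might pick these values up, assuming python-dotenv is available; the repository's scripts may read the environment differently.

import os
from dotenv import load_dotenv
from huggingface_hub import login

# Load HF_TOKEN (and optionally HF_HOME) from .env into the environment.
load_dotenv()

token = os.environ.get("HF_TOKEN")
if token:
    login(token=token)  # authenticate for gated model downloads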

Export Models

# Export audio captioning models
python audio-caption/export_encoder_preprocess_onnx.py
python audio-caption/export_decoder_executorch.py

# Export categorization model
python sentence-transformers-embbedings/export_sentence_transformers_executorch.py

# Generate category embeddings
python sentence-transformers-embbedings/generate_category_embeddings.py

License

Apache License 2.0

Citations

Audio Captioning Model

@inproceedings{xu2024efficient,
  title={Efficient Audio Captioning with Encoder-Level Knowledge Distillation},
  author={Xu, Xuenan and Liu, Haohe and Wu, Mengyue and Wang, Wenwu and Plumbley, Mark D.},
  booktitle={Interspeech 2024},
  year={2024},
  doi={10.48550/arXiv.2407.14329},
  url={https://arxiv.org/abs/2407.14329}
}

Sentence Transformer

@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}