Audio Caption and Categorizer Models

Model Description

This repository provides optimized exports of audio captioning and categorization models for ARM-based mobile deployment. The pipeline consists of:

  1. Audio Captioning: Uses wsntxxn/effb2-trm-audiocaps-captioning (EfficientNet-B2 encoder + Transformer decoder) to generate natural language descriptions of audio events.

  2. Audio Categorization: Uses sentence-transformers/all-MiniLM-L6-v2 to match generated captions to predefined sound categories via semantic similarity.
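
The categorization step is plain embedding similarity: the generated caption is embedded and compared against precomputed category embeddings, and the closest category wins. Below is a minimal sketch of that idea using the sentence-transformers Python package and a hand-picked illustrative category list; on device, the exported ExecuTorch categorizer together with the precomputed category_embeddings.json serves this role.

from sentence_transformers import SentenceTransformer
import numpy as np

# Embed the caption and candidate categories with the same base model
# that the categorizer export is derived from.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

caption = "a dog barks while cars pass by"
categories = ["dog barking", "traffic noise", "music", "speech"]  # illustrative subset

caption_emb = model.encode(caption, normalize_embeddings=True)
category_embs = model.encode(categories, normalize_embeddings=True)

# With normalized embeddings, cosine similarity reduces to a dot product.
scores = category_embs @ caption_emb
print(categories[int(np.argmax(scores))])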

Export Formats

  • Encoder: ONNX format with integrated preprocessing (STFT, MelSpectrogram, AmplitudeToDB)
  • Decoder: ExecuTorch (.pte) format with dynamic quantization for reduced model size
  • Categorizer: ExecuTorch (.pte) format with quantization
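
For orientation on the .pte artifacts, here is a minimal sketch of the generic ExecuTorch lowering flow on a stand-in module, without quantization; the repository's export scripts wrap the actual decoder and categorizer and add dynamic quantization, and the exact APIs can shift between ExecuTorch releases.

import torch
from executorch.exir import to_edge

# Stand-in module; the real exports wrap the caption decoder / categorizer.
model = torch.nn.Linear(16, 4).eval()
example_inputs = (torch.randn(1, 16),)

# torch.export -> edge dialect -> ExecuTorch program -> .pte file
exported = torch.export.export(model, example_inputs)
et_program = to_edge(exported).to_executorch()

with open("model.pte", "wb") as f:
    f.write(et_program.buffer)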

Key Features

  • 5-second audio input at 16kHz
  • Preprocessing baked into ONNX encoder (no external audio processing needed)
  • Optimized for mobile inference with quantization
  • Complete end-to-end pipeline from raw audio to categorized captions

Usage

Quick Start

Generate a caption for an audio file:

# Activate environment
source .venv/bin/activate

# Generate caption
python audio-caption/generate_caption_hybrid.py --audio sample_audio.wav

Python Example

import onnxruntime as ort
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch
from transformers import AutoTokenizer
import numpy as np

# Load models
encoder_session = ort.InferenceSession("audio-caption/effb2_encoder_preprocess.onnx")
decoder = _load_for_executorch("audio-caption/effb2_decoder_5sec.pte")
tokenizer = AutoTokenizer.from_pretrained("wsntxxn/audiocaps-simple-tokenizer", trust_remote_code=True)

# Process audio (16kHz, 5 seconds = 80000 samples)
audio = np.random.randn(1, 80000).astype(np.float32)

# Encode
attn_emb = encoder_session.run(["attn_emb"], {"audio": audio})[0]

# Decode (greedy search)
generated = [tokenizer.bos_token_id]
for _ in range(30):
    logits = decoder.forward((
        torch.tensor([generated]),
        torch.tensor(attn_emb),
        torch.tensor([attn_emb.shape[1] - 1])
    ))[0]
    next_token = int(torch.argmax(logits[0, -1, :]))
    generated.append(next_token)
    if next_token == tokenizer.eos_token_id:
        break

caption = tokenizer.decode(generated, skip_special_tokens=True)
print(caption)
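
Note that the loop above re-feeds the full token prefix to the decoder on every step (no key/value caching), which keeps the exported decoder's interface simple at the cost of some redundant compute for longer captions.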

Training Details

Base Models

This repository does not train models but exports pre-trained models to optimized formats:

Component         Base Model               Training Dataset     Parameters
----------------  -----------------------  -------------------  ----------
Audio Encoder     EfficientNet-B2          AudioCaps            ~7.7M
Caption Decoder   Transformer (2 layers)   AudioCaps            ~4.3M
Categorizer       all-MiniLM-L6-v2         1B+ sentence pairs   ~22.7M

Export Configuration

Audio Captioning:

  • Preprocessing: n_mels=64, n_fft=512, hop_length=160, win_length=512
  • Input: Raw audio waveform (16kHz, 5 seconds)
  • Encoder: ONNX opset 17 with dynamic axes
  • Decoder: ExecuTorch with dynamic quantization (int8)
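
A minimal sketch of how an encoder with baked-in preprocessing can be exported under these settings, using torchaudio transforms and a placeholder encoder; the actual export_encoder_preprocess_onnx.py wraps the EfficientNet-B2 encoder and may differ in detail.

import torch
import torchaudio

class PreprocessEncoder(torch.nn.Module):
    def __init__(self, encoder):
        super().__init__()
        # STFT / mel / dB steps become part of the exported graph
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=512, win_length=512,
            hop_length=160, n_mels=64)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        self.encoder = encoder

    def forward(self, audio):  # audio: (batch, samples) raw waveform
        feats = self.to_db(self.melspec(audio))
        return self.encoder(feats)

model = PreprocessEncoder(torch.nn.Identity()).eval()  # Identity = placeholder

dummy = torch.randn(1, 80000)  # 5 seconds at 16kHz
torch.onnx.export(
    model, (dummy,), "effb2_encoder_preprocess.onnx",
    opset_version=17,  # opset 17 adds the STFT operator
    input_names=["audio"], output_names=["attn_emb"],
    dynamic_axes={"audio": {0: "batch"}, "attn_emb": {0: "batch"}})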

Categorizer:

  • Tokenizer: RoBERTa-based (max length: 128)
  • Export: ExecuTorch with dynamic quantization
  • Categories: 50+ predefined audio event categories
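
A minimal sketch of how the category embeddings might be precomputed and stored, assuming categories.json holds a flat list of category names; generate_category_embeddings.py in this repository is the authoritative version.

import json
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Assumption: categories.json is a flat JSON list of category names.
with open("categories.json") as f:
    categories = json.load(f)

embeddings = model.encode(categories, normalize_embeddings=True)

# Store name -> embedding so the runtime only needs to embed captions.
with open("category_embeddings.json", "w") as f:
    json.dump({name: emb.tolist() for name, emb in zip(categories, embeddings)}, f)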

Project Structure

.
├── audio-caption/
│   ├── export_encoder_preprocess_onnx.py  # Export ONNX encoder
│   ├── export_decoder_executorch.py       # Export ExecuTorch decoder
│   ├── generate_caption_hybrid.py         # Inference pipeline
│   ├── effb2_encoder_preprocess.onnx      # Exported encoder
│   └── effb2_decoder_5sec.pte             # Exported decoder
│
├── sentence-transformers-embbedings/
│   ├── export_sentence_transformers_executorch.py
│   ├── generate_category_embeddings.py
│   └── category_embeddings.json
│
└── categories.json                         # Category definitions

Setup

Prerequisites

# Install uv package manager
pip install uv

# Create environment
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -r pyproject.toml

Configuration

Create a .env file:

# Hugging Face Token (for gated models)
HF_TOKEN=your_token_here

# Optional: Custom cache directory
# HF_HOME=./.cache/huggingface
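
A minimal sketch of how a script might pick these values up, assuming python-dotenv is available; the repository's scripts may read the environment differently.

import os
from dotenv import load_dotenv
from huggingface_hub import login

# Load HF_TOKEN (and optionally HF_HOME) from .env into the environment.
load_dotenv()

token = os.environ.get("HF_TOKEN")
if token:
    login(token=token)  # authenticate for gated model downloads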

Export Models

# Export audio captioning models
python audio-caption/export_encoder_preprocess_onnx.py
python audio-caption/export_decoder_executorch.py

# Export categorization model
python sentence-transformers-embbedings/export_sentence_transformers_executorch.py

# Generate category embeddings
python sentence-transformers-embbedings/generate_category_embeddings.py

License

Apache License 2.0

Citations

Audio Captioning Model

@inproceedings{xu2024efficient,
  title={Efficient Audio Captioning with Encoder-Level Knowledge Distillation},
  author={Xu, Xuenan and Liu, Haohe and Wu, Mengyue and Wang, Wenwu and Plumbley, Mark D.},
  booktitle={Interspeech 2024},
  year={2024},
  doi={10.48550/arXiv.2407.14329},
  url={https://arxiv.org/abs/2407.14329}
}

Sentence Transformer

@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}