# IndexTTS-1.5-vLLM
This is the IndexTTS-1.5 model converted for vLLM acceleration.
## Model Description
IndexTTS-1.5 is a high-quality text-to-speech model optimized with vLLM for fast inference.
## Performance

- **RTF (Real-Time Factor):** ~0.10 for a single request (10x faster than real-time)
- **Peak throughput:** 8.49 requests/second at 96 concurrent requests
- **Concurrency:** supports 128+ concurrent requests
- **Aggregate speedup:** 52x real-time audio generation at peak load
- **GPU:** tested on an L40S at 75% GPU memory utilization
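The RTF figure works out as follows; a minimal sketch with hypothetical timings chosen to match the ~0.10 value above:

```python
# RTF (real-time factor) = wall-clock synthesis time / duration of audio produced.
# The timings below are hypothetical, picked to illustrate the ~0.10 figure.
synthesis_seconds = 0.6   # time spent generating the clip
audio_seconds = 6.0       # length of the generated clip

rtf = synthesis_seconds / audio_seconds        # lower is better; < 1.0 is faster than real-time
speedup = audio_seconds / synthesis_seconds    # inverse of RTF

print(f"RTF = {rtf:.2f} ({speedup:.0f}x faster than real-time)")
```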
## Usage

### Installation

```bash
pip install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip install vllm==0.10.2
pip install -r requirements.txt
```
### Running the Server

```bash
python api_server.py \
    --model_dir /path/to/IndexTTS-1.5-vLLM \
    --host 0.0.0.0 \
    --port 6006 \
    --gpu_memory_utilization 0.75
```
### API Example

```python
import requests

url = "http://localhost:6006/tts_url"
data = {
    "text": "Hello, this is IndexTTS speaking!",
    "audio_paths": ["reference_audio.wav"],
}

response = requests.post(url, json=data)
response.raise_for_status()  # fail loudly on an error status

with open("output.wav", "wb") as f:
    f.write(response.content)
```
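The concurrency figures above can be exercised with a simple fan-out client. A sketch, assuming the server from "Running the Server" is listening on localhost:6006 and `reference_audio.wav` is a placeholder reference clip; it degrades gracefully when the server is not reachable:

```python
# Fan out several /tts_url requests in parallel with a thread pool; a sketch,
# not a benchmark harness. Assumes the API server above is running locally.
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:6006/tts_url"

def synthesize(i: int) -> bytes:
    payload = {
        "text": f"Concurrent request number {i}.",
        "audio_paths": ["reference_audio.wav"],  # placeholder reference clip
    }
    resp = requests.post(URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.content

results = []
try:
    with ThreadPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(synthesize, range(16)))
except requests.RequestException:
    pass  # server not reachable; nothing to write

for i, wav in enumerate(results):
    with open(f"output_{i}.wav", "wb") as f:
        f.write(wav)
```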
## Model Details

- **Base model:** IndexTTS-1.5
- **Framework:** vLLM 0.10.2
- **PyTorch version:** 2.8.0+cu126
- **CUDA version:** 12.6
- **KV cache:** 283,200 tokens (at 75% GPU memory utilization)
- **Max sequence length:** 803 tokens
## Original Model

Based on [IndexTeam/IndexTTS-1.5](https://huggingface.co/IndexTeam/IndexTTS-1.5).
## Citation

If you use this model, please cite:

```bibtex
@misc{indextts-vllm,
  title={IndexTTS-1.5 with vLLM Acceleration},
  author={Original: IndexTeam, vLLM conversion: Community},
  year={2024},
  howpublished={\url{https://huggingface.co/Hariprasath28/IndexTTS-1.5-vLLM}}
}
```
## License

MIT License
## Performance Benchmarks

### Single Request

- RTF: ~0.10 (10x faster than real-time)

### High Concurrency (75% GPU memory utilization)

| Concurrent Requests | Throughput | Per-Request RTF | Aggregate Speedup |
|---|---|---|---|
| 16 | 8.18 req/s | 0.225 | 50.2x |
| 32 | 8.23 req/s | 0.382 | 50.5x |
| 64 | 8.39 req/s | 0.683 | 51.5x |
| 96 | 8.49 req/s | 0.979 | 52.1x |
| 128 | 8.45 req/s | 1.298 | 51.8x |
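The columns of the table are linked by a simple identity: aggregate speedup is the seconds of audio produced per wall-clock second, i.e. throughput times average clip length. The average clip length is not stated in this card, so the sketch below infers it from the 96-concurrent row and cross-checks it against another row:

```python
# Aggregate speedup = throughput * average clip length (seconds of audio
# generated per wall-clock second). The average clip length below is
# inferred from the table, not stated in the card.
throughput_96 = 8.49   # req/s at 96 concurrent
speedup_96 = 52.1      # x real-time at 96 concurrent

avg_clip_seconds = speedup_96 / throughput_96
print(f"implied average clip length: {avg_clip_seconds:.1f} s")

# Cross-check against the 64-concurrent row: 8.39 req/s should then give
# roughly 51.5x, matching the table.
predicted_speedup_64 = 8.39 * avg_clip_seconds
print(f"predicted speedup at 64 concurrent: {predicted_speedup_64:.1f}x")
```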
## Repository

Original vLLM implementation: index-tts-vllm