---
title: ACE-Step 1.5 XL Music Generation (CPU)
emoji: 🎵
colorFrom: indigo
colorTo: yellow
sdk: docker
pinned: false
license: mit
tags:
  - music-generation
  - ace-step
  - gguf
  - lora
  - training
  - cpu
  - mcp-server
short_description: ACE-Step 1.5 XL - CPU music generation + LoRA training
models:
  - ACE-Step/Ace-Step1.5
startup_duration_timeout: 2h
---

# ACE-Step 1.5 XL Music Generation (CPU)

GGUF inference + LoRA training on free CPU Spaces. Powered by `acestep.cpp`.

## Features

- **Music Generation** -- text/lyrics to stereo 48 kHz MP3 via GGUF-quantized models
- **LoRA Training** -- fine-tune on your own audio (~11 s/epoch on CPU, ~1.4 s/epoch on GPU)
- **Auto-Captioning** -- librosa BPM/key/time-signature detection plus LM understand mode (caption + lyrics extraction)
- **Multiple LM Sizes** -- 0.6B / 1.7B / 4B language models (downloaded on demand)
- **Cancel + Download** -- cancel training mid-epoch, download the trained LoRA adapter

## Music Generation

1. Enter a music description
2. Enter lyrics, or check **Instrumental**
3. Adjust BPM, duration, steps, and seed
4. Select a LoRA adapter if you have trained one
5. Click **Generate Music**

**Timing:** ~270 s for 10 s of audio with the 1.7B LM at 8 steps on CPU.

## LoRA Training

1. Upload audio files (any length; the VAE auto-tiles them into 30 s chunks)
2. Set the LoRA name, epochs, learning rate, and rank
3. Click **Train** -- ace-server stops during training and restarts afterwards
4. Use **Cancel** to stop early (a checkpoint is saved)
5. Download the trained adapter file
6. The trained adapter appears in the LoRA dropdown

**Timing:** ~170 s preprocessing + ~11 s/epoch on CPU; ~1.4 s/epoch on GPU.

**Limits:** 30 minutes of audio total across all files (files exceeding the cap are truncated with a warning), 50 files max, 8 h training timeout.
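The upload limits and 30 s tiling described above can be sketched as follows. This is a minimal illustration only; `enforce_caps`, `tile_count`, and the constant names are hypothetical, not the Space's actual code:

```python
MAX_FILES = 50
MAX_TOTAL_SEC = 30 * 60   # 30 minutes of audio across all files
TILE_SEC = 30             # the VAE processes fixed 30 s chunks

def enforce_caps(durations):
    """Given per-file durations (seconds), return the durations actually
    used for training, truncating the file that crosses the total cap."""
    if len(durations) > MAX_FILES:
        raise ValueError(f"at most {MAX_FILES} files allowed")
    used, total = [], 0.0
    for d in durations:
        if total + d > MAX_TOTAL_SEC:
            remaining = MAX_TOTAL_SEC - total
            if remaining > 0:
                used.append(remaining)  # truncated (with a warning in the UI)
            break
        used.append(d)
        total += d
    return used

def tile_count(duration_sec):
    """Number of 30 s VAE tiles for one file (last tile may be partial)."""
    return -(-int(duration_sec) // TILE_SEC)  # ceiling division
```

For example, three 10-minute files fit exactly, but a third 15-minute file would be cut to its first 10 minutes.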

**Settings** (per the Side-Step author's recommendations):

- LR: 3e-4
- Rank: 32, Alpha: 64
- Epochs: 200-500 for 3-10 files
- Optimizer: Adafactor (minimal memory)
- Variant: standard turbo (not XL -- XL hits swap on 18 GB RAM)
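In PEFT/Transformers terms, the recommended settings translate to roughly the following configuration. This is a sketch only: the `target_modules` names are an assumption about the DiT's attention projections, not taken from Side-Step's actual code:

```python
from peft import LoraConfig
from transformers.optimization import Adafactor

lora_cfg = LoraConfig(
    r=32,           # rank
    lora_alpha=64,  # alpha = 2 * rank
    target_modules=["to_q", "to_k", "to_v"],  # assumed attention projections
)

# Adafactor with a fixed LR keeps optimizer state small, which is what
# makes training fit on a CPU Space:
# optimizer = Adafactor(model.parameters(), lr=3e-4,
#                       scale_parameter=False, relative_step=False)
```

Adafactor's low memory footprint (no full-size momentum buffers, unlike Adam) is the reason it is recommended here.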

## Captioning Pipeline

Training audio is auto-captioned before preprocessing:

| Method | What it extracts | Speed |
| --- | --- | --- |
| librosa | BPM, key, time signature | ~3 s/file |
| LM understand (GPU) | Rich caption + lyrics + metadata | ~52 s/file |
| ace-server `/understand` (Space) | Same as LM, via GGUF | ~30 s/file |
| `.txt`/`.json` sidecar | User-provided caption (if present) | instant |

On the Space, ace-server `/understand` runs before training; locally, the PyTorch LM understand mode is used instead.
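The source-selection logic implied by the table (a user-provided sidecar wins; otherwise the LM backend depends on where the app runs) could be dispatched roughly like this. `pick_caption_backend` and the `on_space` flag are illustrative names, not the Space's actual code:

```python
import os

def pick_caption_backend(audio_path, on_space=None):
    """Return which captioning method would run for one training file,
    following the priority implied by the table above."""
    if on_space is None:
        on_space = bool(os.environ.get("SPACE_ID"))  # set on HF Spaces
    # 1. A user-provided .txt/.json sidecar wins outright.
    base, _ = os.path.splitext(audio_path)
    for ext in (".txt", ".json"):
        if os.path.exists(base + ext):
            return "sidecar"
    # 2. Otherwise use the LM: GGUF /understand on the Space, the
    #    PyTorch LM locally. The librosa BPM/key/time-signature pass
    #    runs in both cases and is merged into the caption.
    return "ace-server /understand" if on_space else "LM understand"
```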

## Models

| Component | GGUF | Size | Purpose |
| --- | --- | --- | --- |
| DiT XL turbo | `acestep-v15-xl-turbo-Q4_K_M` | 2.8 GB | Music generation (no LoRA) |
| DiT standard turbo | `acestep-v15-turbo-Q4_K_M` | 1.1 GB | Music generation (with LoRA) |
| LM 1.7B | `acestep-5Hz-lm-1.7B-Q8_0` | 1.7 GB | Caption understanding |
| Text Encoder | `Qwen3-Embedding-0.6B-Q8_0` | 0.75 GB | Text encoding |
| VAE | `vae-BF16` | 0.32 GB | Audio encode/decode |

## API

### Generate Music

```python
from gradio_client import Client

client = Client("WeReCooking/ACE-Step-CPU")
result = client.predict(
    caption="upbeat electronic dance music",
    lyrics="[Instrumental]",
    instrumental=True, bpm=120, duration=10, seed=-1, steps=8,
    lora_select="None (no LoRA)",
    lm_model_select="acestep-5Hz-lm-1.7B-Q8_0.gguf",
    api_name="/generate",
)
```

### Train LoRA

```python
from gradio_client import Client, handle_file

client = Client("WeReCooking/ACE-Step-CPU")
result = client.predict(
    audio_files=[handle_file("song.mp3")],
    lora_name="my-style", epochs=200, lr=0.0003, rank=32,
    api_name="/train_lora",
)
```

## MCP (Model Context Protocol)

```json
{
  "mcpServers": {
    "ace-step": {"url": "https://werecooking-ace-step-cpu.hf.space/gradio_api/mcp/"}
  }
}
```

## CLI

```bash
python app.py "upbeat electronic dance music" --duration 10 --steps 8
python app.py "jazz piano" --adapter my-style --seed 42
```

## Architecture

- Inference: GGUF via `acestep.cpp`
- Training: PyTorch, ported from Side-Step (commit ecd13bd)
- Captioning: librosa + LM understand (PyTorch or ace-server `/understand`)
- Training stops ace-server to free RAM, then restarts it with the new adapters
- Inference is blocked during training, with a clear message
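The last two points, freeing RAM for training and rejecting generation requests while ace-server is down, can be sketched with a simple gate. `TrainingGate` and its method names are illustrative, not the Space's actual code:

```python
import threading

class TrainingGate:
    """Blocks inference while a LoRA training run owns the model RAM."""

    def __init__(self):
        self._training = threading.Event()

    def start_training(self):
        # In the real Space this is where ace-server would be stopped
        # to free RAM for the PyTorch trainer.
        self._training.set()

    def finish_training(self):
        # ...and where ace-server would be restarted, picking up any
        # newly trained adapters for the LoRA dropdown.
        self._training.clear()

    def generate(self, caption):
        if self._training.is_set():
            return "Training in progress - generation is disabled until it finishes."
        return f"<audio for: {caption}>"
```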

## Credits