diffuse-cpp is now Apache-2.0 — 3.3x faster than llama.cpp on CPU
by Carmenest
We just open-sourced diffuse-cpp under Apache-2.0!
diffuse-cpp is the first C++ inference engine for Diffusion Language Models, built on GGML. It brings LLaDA-8B and Dream-7B to CPU with near-interactive performance.
## Why diffusion on CPU?
Autoregressive LLMs are memory-bound on CPU: generating each token requires streaming the entire weight matrix from memory for a single matrix-vector product. Diffusion LLMs denoise all response tokens in parallel, turning each layer into a matrix-matrix multiplication, so inference becomes compute-bound and throughput actually scales with core count.
## Benchmarks (AMD EPYC, 12 cores, Q4_K_M)
| Prompt | diffuse-cpp (LLaDA) | llama.cpp | Speedup |
|---|---|---|---|
| Translate to French | 27.7 tok/s | 8.51 tok/s | 3.3x |
| Capital of France? | 24.4 tok/s | 8.51 tok/s | 2.9x |
| Translate to Spanish | 22.9 tok/s | 8.51 tok/s | 2.7x |
| Math (15x23) | 15.7 tok/s | 8.51 tok/s | 1.8x |
| Average (8 prompts) | 15.3 tok/s | 8.51 tok/s | 1.8x |
## Key features
- `entropy_exit`: the model chooses its own denoising budget, using as few as 2 steps for easy prompts and up to 16 for hard ones
- Inter-step KV cache: 1.6x average speedup from reusing K/V entries that stay stable across denoising steps
- Thread scaling: 7.4x speedup at 12 cores (vs. 2.4x for an autoregressive baseline)
## Quick Start

```shell
# Download model
huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf

# Build
git clone --recursive https://github.com/iafiscal1212/diffuse-cpp.git
cd diffuse-cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run (--tokens takes pre-tokenized input IDs; tokenizer integration is not yet built in)
./build/diffuse-cli -m llada-8b-q4km.gguf --tokens "128000,3923,374,279,6864,315,9822,30" -n 256 -s 16 -t 12 --remasking entropy_exit
```
## Links
- GitHub: https://github.com/iafiscal1212/diffuse-cpp
- Paper: https://doi.org/10.5281/zenodo.19119814
- Dream-7B GGUF: https://huggingface.co/diffuse-cpp/Dream-v0-Instruct-7B-GGUF
Contributions welcome: tokenizer integration, GPU offload, Apple Silicon benchmarks, and support for additional model architectures are all open areas.