---
license: apache-2.0
tags:
- diffusion
- llada
- gguf
- cpu-inference
- diffuse-cpp
language:
- en
base_model: GSAI-ML/LLaDA-8B-Instruct
pipeline_tag: text-generation
---

# LLaDA-8B-Instruct-GGUF

GGUF quantizations of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct) for use with [diffuse-cpp](https://github.com/iafiscal1212/diffuse-cpp), the first C++ inference engine for Diffusion Language Models.

LLaDA is a masked diffusion language model built on a Llama backbone. Unlike autoregressive models, which generate one token at a time, LLaDA generates all tokens in parallel through iterative refinement, which makes it compute-bound rather than memory-bound on CPU.

**On a 12-core CPU, LLaDA with diffuse-cpp reaches 27.7 tok/s on translation tasks, 3.3x faster than llama.cpp (8.51 tok/s) on the same hardware.**

## Available Quantizations

| File | Type | Size | Description |
|------|------|------|-------------|
| `llada-8b-f16.gguf` | F16 | ~14.9 GB | Full precision, best quality |
| `llada-8b-q8_0.gguf` | Q8_0 | ~8.4 GB | 8-bit quantization, near-lossless |
| `llada-8b-q4km.gguf` | Q4_K_M | ~5.1 GB | 4-bit mixed, best speed/quality ratio |

**Recommended:** Q4_K_M for most users.

## Quick Start

```bash
# Download the Q4_K_M quantization into the current directory
huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf --local-dir .

# Build diffuse-cpp
git clone --recursive https://github.com/iafiscal1212/diffuse-cpp.git
cd diffuse-cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run
./build/diffuse-cli -m ../llada-8b-q4km.gguf \
  --tokens "128000,3923,374,279,6864,315,9822,30" \
  -n 256 -s 16 -t 12 --remasking entropy_exit
```

## Performance

Benchmarked on an AMD EPYC 4465P (12 cores), Q4_K_M, entropy_exit + inter-step cache, B=256:

| Prompt | No-Cache (tok/s) | Cache (tok/s) | Steps | Speedup vs llama.cpp |
|--------|------------------|---------------|-------|----------------------|
| Capital of France? | 17.5 | **24.4** | 3 | 2.9x |
| Translate to French | 25.9 | **27.7** | 2 | **3.3x** |
| 15 x 23? | 12.8 | **15.7** | 4 | 1.8x |
| Translate to Spanish | 7.6 | **22.9** | 7 | 2.7x |
| Python is_prime() | 3.2 | **4.9** | 16 | 0.6x |
| Poem about ocean | 3.2 | **5.3** | 16 | 0.6x |
| Why is sky blue? | 3.3 | **12.0** | 16 | 1.4x |
| List the planets | 3.3 | **9.4** | 15 | 1.1x |
| **Average** | **9.6** | **15.3** | | **1.8x** |

- Inter-step cache: 1.6x average speedup with no quality degradation
- 6 of 8 prompts outperform the llama.cpp baseline (8.51 tok/s)
- LLaDA excels at translation tasks, converging in 2-5 steps

## Model Details

- **Architecture:** Llama backbone with bidirectional (non-causal) attention
- **Parameters:** 8B
- **Layers:** 32
- **Hidden size:** 4096
- **Attention:** MHA (32 query heads, 32 KV heads)
- **FFN:** SwiGLU, intermediate 12288
- **Vocabulary:** 126,464 tokens
- **RoPE theta:** 500,000
- **Mask token ID:** 126336

## Also Available

- **[Dream-v0-Instruct-7B-GGUF](https://huggingface.co/diffuse-cpp/Dream-v0-Instruct-7B-GGUF)**: Qwen2.5 backbone, GQA. Excels at math and code (21.6 tok/s, correctly solves arithmetic in 2 steps).

## Citation

```bibtex
@software{diffuse_cpp_2026,
  title={diffuse-cpp: High-Performance Inference for Diffusion Language Models},
  author={Carmen Esteban},
  year={2026},
  url={https://github.com/iafiscal1212/diffuse-cpp}
}
```

## License

Apache 2.0
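## How Diffusion Decoding Works (Sketch)

As background, the parallel iterative refinement described above can be sketched in a few lines of Python. Everything here is a toy illustration, not diffuse-cpp's implementation: `toy_denoiser`, its confidence formula, and the `conf_exit` threshold are invented for the demo, and the real `entropy_exit` criterion presumably uses predictive entropy rather than a fixed confidence cutoff.

```python
MASK = -1  # toy stand-in for LLaDA's mask token (real ID: 126336)

def toy_denoiser(tokens, step):
    """Toy stand-in for the transformer: proposes (position, token,
    confidence) for every masked position at once. Deterministic so the
    demo is reproducible; a real model scores the full sequence."""
    return [(i, (i % 7) + 1, min(0.99, 0.3 + 0.2 * step + 0.01 * i))
            for i, t in enumerate(tokens) if t == MASK]

def diffusion_decode(length, max_steps=16, conf_exit=0.9):
    """Minimal masked-diffusion decoding loop: start fully masked, and
    on each step commit all sufficiently confident proposals in
    parallel, remasking the rest for the next refinement pass.
    Easy prompts converge in few steps, matching the step counts in
    the benchmark table. Returns (tokens, steps_used)."""
    tokens = [MASK] * length
    for step in range(1, max_steps + 1):
        proposals = toy_denoiser(tokens, step)
        # commit every proposal above the confidence threshold
        committed = [(i, tok) for i, tok, c in proposals if c >= conf_exit]
        if not committed:
            # guarantee progress: commit the single most confident one
            i, tok, _ = max(proposals, key=lambda p: p[2])
            committed = [(i, tok)]
        for i, tok in committed:
            tokens[i] = tok
        if MASK not in tokens:
            return tokens, step
    return tokens, max_steps

tokens, steps = diffusion_decode(8)
print(tokens, steps)  # all 8 positions filled after 3 refinement steps
```

The key contrast with autoregressive decoding is that every masked position is scored on every step, so total work scales with the number of refinement steps rather than the number of tokens; this is why convergence in 2-3 steps (translation, short factual answers) is so much faster than the 15-16 steps needed for open-ended generation.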