Carmenest committed on Mar 20

Commit e00b07f (verified)

Parent: 0092801

Update model card with B=256 real-prompt benchmarks

Files changed (1)

README.md +24 -15

@@ -18,7 +18,7 @@ GGUF quantized versions of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GS
 LLaDA is a **diffusion language model** that generates text by iterative unmasking rather than autoregressive token-by-token prediction.
-> **Paper:** [Diffusion Language Models are Faster than Autoregressive on CPU](https://doi.org/10.5281/zenodo.19119814) -- C. Esteban, 2026
 ## Available Quantizations
@@ -28,21 +28,26 @@ LLaDA is a **diffusion language model** that generates text by iterative unmaski
 | llada-8b-q8_0.gguf | Q8_0 | 8.4 GB | High quality, good throughput |
 | llada-8b-f16.gguf | F16 | 14.9 GB | Full precision reference |
-## Benchmark (AMD EPYC 4465P 12-Core, 64 tokens, steps=16, threads=12)
 ### Real Prompt Performance (Q4_K_M + entropy_exit)
-| Prompt type | tok/s | Steps used | Speedup |
-|---|---|---|---|
-| Factual ("Capital of France?") | **9.22** | 4 | 3.9x |
-| Translation ("Translate to French") | **10.23** | 3 | 4.6x |
-| Arithmetic ("15 x 23?") | **11.49** | 3 | 5.5x |
-| Code (is_prime function) | **2.53** | 15 | 1.1x |
-| Creative (poem, explanation) | 2.33 | 17 | 1.0x |
-entropy_exit adapts to prompt difficulty: 3-4 steps for easy, 16 for hard. Never slower than baseline.
-### Quantization Comparison (low_confidence baseline)
 | Model | Size | tok/s | vs F16 |
 |-------|------|-------|--------|
@@ -52,10 +57,10 @@ entropy_exit adapts to prompt difficulty: 3-4 steps for easy, 16 for hard. Never
 ### Summary
-- **~10 tok/s on easy real prompts** (Q4_K_M + entropy_exit)
-- **~6x faster than F16 baseline** on factual/translation tasks
 - **7.5x thread scaling** from 1 to 12 threads
-- **40+ tok/s peak** on synthetic benchmarks (single forward pass)
 Full results: [research/benchmark/RESULTS.md](https://github.com/iafiscal1212/diffuse-cpp/blob/main/research/benchmark/RESULTS.md)
@@ -68,5 +73,9 @@ cmake -B build -DCMAKE_BUILD_TYPE=Release
 cmake --build build -j$(nproc)
 # Generate with entropy_exit (recommended)
-python tools/generate.py     --model-dir /path/to/LLaDA-8B-Instruct     --gguf llada-8b-q4km.gguf     -p "What is the capital of France?"     -s 16 -t 12 --remasking entropy_exit
 ```

 LLaDA is a **diffusion language model** that generates text by iterative unmasking rather than autoregressive token-by-token prediction.
+> **Paper:** [Diffusion Language Models are Faster than Autoregressive on CPU](https://doi.org/10.5281/zenodo.19119813) — C. Esteban, 2026
 ## Available Quantizations
 | llada-8b-q8_0.gguf | Q8_0 | 8.4 GB | High quality, good throughput |
 | llada-8b-f16.gguf | F16 | 14.9 GB | Full precision reference |
+## Benchmark (AMD EPYC 4465P 12-Core, steps=16, threads=12)
 ### Real Prompt Performance (Q4_K_M + entropy_exit)
+| Prompt | B=64 tok/s | B=256 tok/s | Steps | vs llama.cpp |
+|---|---|---|---|---|
+| Capital of France? | 9.22 | **15.60** | 4 | 1.8x |
+| Translate to French | 10.23 | **21.78** | 3 | 2.6x |
+| 15 × 23? | 11.49 | **11.45** | 5 | 1.3x |
+| Translate to Spanish | 4.59 | **7.17** | 8 | 0.8x |
+| Python is_prime() | 2.53 | **3.12** | 17 | 0.4x |
+| Poem about ocean | 2.33 | **3.10** | 17 | 0.4x |
+| Why is sky blue? | 2.21 | **3.18** | 17 | 0.4x |
+| List the planets | 2.33 | **3.19** | 17 | 0.4x |
+*B = generation buffer size (tokens generated per call). llama.cpp baseline: 8.51 tok/s (Llama-3-8B Q4_K_M, same hardware).*
+entropy_exit adapts to prompt difficulty: 3–4 steps for easy, 16 for hard. Never slower than baseline.
+### Quantization Comparison (low_confidence baseline, B=64)
 | Model | Size | tok/s | vs F16 |
 |-------|------|-------|--------|
 ### Summary
+- **11–22 tok/s on easy real prompts** (Q4_K_M + entropy_exit, B=256)
+- **Up to 2.6x faster than llama.cpp** on the same hardware
+- **256-token generation** with 20% lower per-token cost vs 64-token batches
 - **7.5x thread scaling** from 1 to 12 threads
 Full results: [research/benchmark/RESULTS.md](https://github.com/iafiscal1212/diffuse-cpp/blob/main/research/benchmark/RESULTS.md)
 cmake --build build -j$(nproc)
 # Generate with entropy_exit (recommended)
+python tools/generate.py \
+    --model-dir /path/to/LLaDA-8B-Instruct \
+    --gguf llada-8b-q4km.gguf \
+    -p "What is the capital of France?" \
+    -n 256 -s 16 -t 12 --remasking entropy_exit
 ```
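
The "vs llama.cpp" column in the updated table appears to be each B=256 throughput divided by the 8.51 tok/s llama.cpp baseline quoted in the table footnote. The following is a minimal sketch that recomputes the column under that assumption. The names `ROWS`, `LLAMA_CPP_BASELINE_TOK_S`, and `speedup` are illustrative and are not part of diffuse-cpp.

```python
# Recompute the "vs llama.cpp" column from the B=256 figures in the model card.
# Assumption: ratio = (B=256 tok/s) / (llama.cpp baseline tok/s), shown to one decimal.

LLAMA_CPP_BASELINE_TOK_S = 8.51  # Llama-3-8B Q4_K_M, same hardware (table footnote)

# (prompt, B=256 tok/s, ratio as reported in the model card)
ROWS = [
    ("Capital of France?", 15.60, 1.8),
    ("Translate to French", 21.78, 2.6),
    ("15 x 23?", 11.45, 1.3),
    ("Translate to Spanish", 7.17, 0.8),
    ("Python is_prime()", 3.12, 0.4),
    ("Poem about ocean", 3.10, 0.4),
    ("Why is sky blue?", 3.18, 0.4),
    ("List the planets", 3.19, 0.4),
]


def speedup(tok_s, baseline=LLAMA_CPP_BASELINE_TOK_S):
    """Throughput relative to the baseline (values above 1.0 mean faster)."""
    return tok_s / baseline


for prompt, tok_s, reported in ROWS:
    ratio = speedup(tok_s)
    print(f"{prompt:<22} {tok_s:6.2f} tok/s  {ratio:.1f}x (reported {reported}x)")
```

Under this reading, every recomputed ratio rounds to the reported value; three of the eight prompts come out above 1.0x and five below it.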
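
The model card says entropy_exit uses fewer unmasking steps on easy prompts but does not define the stopping rule. One plausible reading is that unmasking stops once the model's predicted distributions over the still-masked positions become confident enough. The sketch below illustrates that idea only: the mean-entropy criterion, the `threshold` value, and the function names are all assumptions and are not taken from the diffuse-cpp source.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a single probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_exit(masked_distributions, threshold=0.5):
    """Hypothetical early-exit test for an iterative unmasking loop.

    masked_distributions: one probability distribution per still-masked position.
    Returns True when the mean entropy falls below `threshold` (an assumed value),
    or when no masked positions remain.
    """
    if not masked_distributions:
        return True
    mean_entropy = sum(shannon_entropy(d) for d in masked_distributions) / len(
        masked_distributions
    )
    return mean_entropy < threshold


# Toy distributions over a three-token vocabulary.
confident = [[0.99, 0.005, 0.005], [0.98, 0.01, 0.01]]
uncertain = [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]]

print(should_exit(confident))
print(should_exit(uncertain))
```

With a rule of this kind, a prompt whose answer the model pins down quickly would stop after a few steps, while an open-ended prompt would keep running up to the step limit set by `-s`.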