How to use in Python code?
#6 opened by riddlechen
Thanks for the great quant. Is there any way to use this GGUF checkpoint in raw Python code, other than through ComfyUI? Thanks.
Technically it could work with diffusers, but after checking the diffusers repo, support for Z-Image Turbo GGUF hasn't been added yet. Someone probably needs to open a pull request or a feature request.
https://github.com/huggingface/diffusers/pull/12756
Someone just created a PR for this, and it works fine. If you want to try it, you can:
git clone https://github.com/huggingface/diffusers.git
cd diffusers
git fetch origin pull/12756/head:pr-12756
git switch pr-12756
pip install -e .
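To confirm the editable install actually picked up the PR branch (a quick sanity check I'm adding here, not part of the PR instructions), the new Z-Image classes should import without error:
import diffusers
# These classes only exist on the PR branch; an ImportError means the install didn't take effect.
from diffusers import ZImagePipeline, ZImageTransformer2DModel, GGUFQuantizationConfig
print(diffusers.__version__)  # should report the dev version installed from source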
Example script (I'll put it on the model card later once the PR is merged):
from diffusers import ZImagePipeline, ZImageTransformer2DModel, GGUFQuantizationConfig
import torch
prompt = "Young Chinese woman in red Hanfu, intricate embroidery. Impeccable makeup, red floral forehead pattern. Elaborate high bun, golden phoenix headdress, red flowers, beads. Holds round folding fan with lady, trees, bird. Neon lightning-bolt lamp (⚡️), bright yellow glow, above extended left palm. Soft-lit outdoor night background, silhouetted tiered pagoda (西安大雁塔), blurred colorful distant lights."
height = 1024
width = 1024
seed = 42
#hf_path = "https://huggingface.co/jayn7/Z-Image-Turbo-GGUF/blob/main/z_image_turbo-Q3_K_M.gguf"
local_path = "path/to/local/model/z_image_turbo-Q3_K_M.gguf"  # use forward slashes or a raw string; backslashes are escape characters in Python strings
transformer = ZImageTransformer2DModel.from_single_file(
    local_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    dtype=torch.bfloat16,
)
pipeline = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    transformer=transformer,
    dtype=torch.bfloat16,
).to("cuda")
# [Optional] Attention Backend
# Diffusers uses SDPA by default. Switch to Custom attention backend for better efficiency if supported:
#pipeline.transformer.set_attention_backend("_sage_qk_int8_pv_fp16_triton") # Enable Sage Attention
#pipeline.transformer.set_attention_backend("flash") # Enable Flash-Attention-2
#pipeline.transformer.set_attention_backend("_flash_3") # Enable Flash-Attention-3
# [Optional] Model Compilation
# Compiling the DiT model accelerates inference, but the first run will take longer to compile.
#pipeline.transformer.compile()
# [Optional] CPU Offloading
# Enable CPU offloading for memory-constrained devices.
#pipeline.enable_model_cpu_offload()
image = pipeline(
    prompt=prompt,
    num_inference_steps=9,  # This actually results in 8 DiT forwards
    guidance_scale=0.0,  # Guidance should be 0 for the Turbo models
    height=height,
    width=width,
    generator=torch.Generator("cuda").manual_seed(seed),
).images[0]
image.save("zimage.png")
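If you prefer pulling the GGUF straight from the Hub instead of pointing at a local file, here is a small sketch (my addition; the repo and filename are taken from the commented-out hf_path above) using huggingface_hub:
from huggingface_hub import hf_hub_download
# Downloads the quantized checkpoint into the local HF cache and returns its path
local_path = hf_hub_download(
    repo_id="jayn7/Z-Image-Turbo-GGUF",
    filename="z_image_turbo-Q3_K_M.gguf",
)
# Then load it exactly as above with ZImageTransformer2DModel.from_single_file(local_path, ...)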
Thank you very much.
jayn7 changed discussion status to closed