How to use in Python code?
#6 opened by riddlechen
Thanks for the great quant. Is there any way to use this GGUF checkpoint in raw Python code, other than through ComfyUI? Thanks.
Technically it could work with diffusers, but after checking the diffusers repo, support for Z-Image Turbo GGUF hasn't been added yet. Someone probably needs to open a pull request or a feature request.
https://github.com/huggingface/diffusers/pull/12756
Someone just created a PR for this, and it works fine. If you want to try it, you can:
git clone https://github.com/huggingface/diffusers.git
cd diffusers
git fetch origin pull/12756/head:pr-12756
git switch pr-12756
pip install -e .
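To confirm the editable install actually picked up the PR branch (a quick sanity check I'm adding here, not part of the PR instructions), the new Z-Image classes should import without error:
import diffusers
# These classes only exist on the PR branch; an ImportError means the install didn't take effect.
from diffusers import ZImagePipeline, ZImageTransformer2DModel, GGUFQuantizationConfig
print(diffusers.__version__)  # should report the dev version installed from source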
Example script (I'll put it on the model card later once the PR is merged):
from diffusers import ZImagePipeline, ZImageTransformer2DModel, GGUFQuantizationConfig
import torch
prompt = "Young Chinese woman in red Hanfu, intricate embroidery. Impeccable makeup, red floral forehead pattern. Elaborate high bun, golden phoenix headdress, red flowers, beads. Holds round folding fan with lady, trees, bird. Neon lightning-bolt lamp (⚡️), bright yellow glow, above extended left palm. Soft-lit outdoor night background, silhouetted tiered pagoda (西安大雁塔), blurred colorful distant lights."
height = 1024
width = 1024
seed = 42
#hf_path = "https://huggingface.co/jayn7/Z-Image-Turbo-GGUF/blob/main/z_image_turbo-Q3_K_M.gguf"
local_path = "path/to/local/model/z_image_turbo-Q3_K_M.gguf"  # use forward slashes or a raw string; backslashes are escape characters in Python strings
transformer = ZImageTransformer2DModel.from_single_file(
    local_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    dtype=torch.bfloat16,
)
pipeline = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    transformer=transformer,
    dtype=torch.bfloat16,
).to("cuda")
# [Optional] Attention Backend
# Diffusers uses SDPA by default. Switch to Custom attention backend for better efficiency if supported:
#pipeline.transformer.set_attention_backend("_sage_qk_int8_pv_fp16_triton") # Enable Sage Attention
#pipeline.transformer.set_attention_backend("flash") # Enable Flash-Attention-2
#pipeline.transformer.set_attention_backend("_flash_3") # Enable Flash-Attention-3
# [Optional] Model Compilation
# Compiling the DiT model accelerates inference, but the first run will take longer to compile.
#pipeline.transformer.compile()
# [Optional] CPU Offloading
# Enable CPU offloading for memory-constrained devices.
#pipeline.enable_model_cpu_offload()
image = pipeline(
    prompt=prompt,
    num_inference_steps=9,  # This actually results in 8 DiT forwards
    guidance_scale=0.0,  # Guidance should be 0 for the Turbo models
    height=height,
    width=width,
    generator=torch.Generator("cuda").manual_seed(seed),
).images[0]
image.save("zimage.png")
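If you prefer pulling the GGUF straight from the Hub instead of pointing at a local file, here is a small sketch (my addition; the repo and filename are taken from the commented-out hf_path above) using huggingface_hub:
from huggingface_hub import hf_hub_download
# Downloads the quantized checkpoint into the local HF cache and returns its path
local_path = hf_hub_download(
    repo_id="jayn7/Z-Image-Turbo-GGUF",
    filename="z_image_turbo-Q3_K_M.gguf",
)
# Then load it exactly as above with ZImageTransformer2DModel.from_single_file(local_path, ...)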
Thank you very much.
jayn7 changed discussion status to closed