# data-archetype/semdisdiffae_p32_v2
semdisdiffae_p32_v2 is a native patch-32 SemDisDiffAE diffusion autoencoder. It
keeps the same FCDM decoder family as
SemDisDiffAE, with an
8-block encoder, an 8-block decoder, and a 384-channel spatial latent at
H/32 x W/32.
Relative to the original SemDisDiffAE, this model is optimized for a
lower-resolution latent grid and downstream latent diffusion: patch size 32
instead of 16, 384 latent channels instead of 128, an 8-block encoder
instead of a 4-block encoder, and DINOv3 ConvNeXt-B semantic alignment instead
of the original DINO semantic alignment setup.
For details, see the semdisdiffae_p32_v2 technical report. For additional shared FCDM / VP decoder background, see the original SemDisDiffAE technical report.
The p32 checkpoint was trained at 384 resolution rather than the original
256-scale recipe. With patch size 32, this gives a 12x12 latent grid
instead of 8x8, reducing the impact of 7x7-convolution border effects during
training.
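As a quick sanity check on those shapes (plain arithmetic from the figures above; no model needed):

```python
# Latent grid arithmetic: patch size 32, 384 latent channels.
patch, channels = 32, 384
H, W = 384, 384                          # p32 training resolution
grid = (H // patch, W // patch)          # (12, 12) vs. (8, 8) at 256
latent_shape = (1, channels, *grid)      # [B, C, H/32, W/32]
print(latent_shape)                      # (1, 384, 12, 12)
```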
## 2k PSNR Benchmark
Evaluated on 2000 images, split into 1333 Pexels images and 667 Amazon book
covers. Reconstruction uses the default 1-step VP/DDIM path in bfloat16.
| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | P5 (dB) | P95 (dB) |
|---|---|---|---|---|---|
| semdisdiffae_p32_v2 | 36.06 | 5.47 | 35.80 | 27.63 | 45.02 |
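The benchmark script itself is not part of this card; a minimal sketch of per-image PSNR for [-1, 1] images, assuming they are mapped back to [0, 1] before the dB computation, looks like:

```python
import torch

def psnr_db(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Per-image PSNR in dB for [B, 3, H, W] tensors in [-1, 1]."""
    # Map to [0, 1] so the peak signal value is 1.0 (assumed convention).
    recon01 = recon.clamp(-1, 1).add(1).div(2)
    target01 = target.clamp(-1, 1).add(1).div(2)
    mse = (recon01 - target01).pow(2).flatten(1).mean(dim=1)
    return 10.0 * torch.log10(1.0 / mse)
```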
## Reconstruction Viewer
The 39-image reconstruction viewer shows originals, semdisdiffae_p32_v2 reconstructions, RGB error deltas, and latent PCA side by side, with FLUX.2 VAE included for comparison: semdisdiffae_p32_v2 reconstruction viewer.
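The latent PCA panels are a standard visualization; one plausible way to produce them (the viewer's exact projection is not documented here):

```python
import torch

def latent_pca_rgb(z: torch.Tensor) -> torch.Tensor:
    """Map a [C, h, w] latent (C = 384 here) to a [3, h, w] PCA image."""
    c, h, w = z.shape
    flat = z.reshape(c, -1).T.float()             # [h*w, C], pixels as rows
    flat = flat - flat.mean(dim=0, keepdim=True)  # center before PCA
    _, _, v = torch.pca_lowrank(flat, q=3, center=False)
    rgb = flat @ v[:, :3]                         # top-3 principal scores
    rgb = (rgb - rgb.min()) / (rgb.max() - rgb.min() + 1e-8)
    return rgb.T.reshape(3, h, w)
```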
## Encode Throughput
Measured on an NVIDIA GeForce RTX 5090 in bfloat16, averaging 20
repeated batched `encode()` calls after 5 warmup batches.
| Resolution | Batch Size | Mean (ms/batch) | ms/image | Images/s | Peak Allocated VRAM |
|---|---|---|---|---|---|
| 256x256 | 128 | 12.38 | 0.097 | 10336.1 | 567.8 MiB |
| 512x512 | 128 | 53.49 | 0.418 | 2393.0 | 1353.8 MiB |
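A minimal sketch of that timing loop, with illustrative helper names rather than the actual benchmark script:

```python
import time

import torch

def bench_encode(model, batch: torch.Tensor, warmup: int = 5, iters: int = 20):
    """Warm up, then average `iters` batched encode() calls."""
    with torch.inference_mode():
        for _ in range(warmup):
            model.encode(batch)
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
        start = time.perf_counter()
        for _ in range(iters):
            model.encode(batch)
        torch.cuda.synchronize()
    ms_per_batch = (time.perf_counter() - start) * 1000.0 / iters
    peak_mib = torch.cuda.max_memory_allocated() / 2**20
    return ms_per_batch, ms_per_batch / batch.shape[0], peak_mib
```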
## Decode Latency
Measured on the same NVIDIA GeForce RTX 5090 in bfloat16. This is
decode-only latency: images are encoded once, latents are cached, and timing is
sequential batch-1 `decode()` over the cached latent set with the default 1-step
sampler and PDG disabled.
| Resolution | Batch Size | Images | Mean (ms/image) | Images/s | Peak Allocated VRAM |
|---|---|---|---|---|---|
| 512x512 | 1 | 20 | 3.89 | 256.8 | 340.8 MiB |
| 1024x1024 | 1 | 20 | 9.79 | 102.2 | 409.6 MiB |
| 2048x2048 | 1 | 20 | 51.90 | 19.3 | 720.9 MiB |
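A matching sketch for the decode protocol; `bench_decode` is illustrative, and the passed-in inference config is assumed to carry the 1-step / PDG-off settings:

```python
import time

import torch

def bench_decode(model, images: torch.Tensor, cfg) -> float:
    """Return mean decode-only ms/image over a cached latent set."""
    h, w = int(images.shape[-2]), int(images.shape[-1])
    with torch.inference_mode():
        latents = [model.encode(img[None]) for img in images]  # encode once
        torch.cuda.synchronize()
        start = time.perf_counter()
        for z in latents:                                      # batch-1 decode
            model.decode(z, height=h, width=w, inference_config=cfg)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0 / len(latents)
```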
## Latent Interface
- `encode()` returns whitened latents using the model's saved running statistics.
- `decode()` expects those whitened latents and dewhitens internally.
- `whiten()` and `dewhiten()` expose the transform explicitly.
- `encode_posterior()` returns the raw exported posterior before whitening.
Weights are stored in float32. The recommended runtime path is bfloat16 for
the encoder and decoder, while whitening, dewhitening, posterior moment math,
VP schedule math, and sampler state updates are kept in float32.
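A short sketch tying these together, reusing `model`, `image`, and `device` from the Usage snippet below and assuming `whiten()`/`dewhiten()` accept plain latent tensors:

```python
# Encoder runs in bfloat16; the whitening round-trip is kept in float32.
z = model.encode(image.to(device=device, dtype=torch.bfloat16))
z_raw = model.dewhiten(z.float())   # back to the raw (unwhitened) scale
z_again = model.whiten(z_raw)       # should recover the whitened latents
assert torch.allclose(z_again, z.float(), atol=1e-4)
```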
## Usage
```python
import torch

from fcdm_diffae import FCDMDiffAE, FCDMDiffAEInferenceConfig

device = "cuda"
model = FCDMDiffAE.from_pretrained(
    "data-archetype/semdisdiffae_p32_v2",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [B, 3, H, W] in [-1, 1], H and W divisible by 32

with torch.inference_mode():
    latents = model.encode(image.to(device=device, dtype=torch.bfloat16))
    recon = model.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=FCDMDiffAEInferenceConfig(num_steps=1),
    )
```
## Details
- Architecture: patch-32 FCDM DiffAE, 156.6M parameters, 384 latent channels.
- Encoder / decoder depth: 8 blocks each.
- Training resolution: 384 AR buckets and 384x384 square crops.
- Semantic alignment: DINOv3 ConvNeXt-B/LVD1689M, 50/50 MSE plus negative cosine (see the sketch after this list).
- Posterior: diagonal Gaussian with VP log-SNR parameterization.
- Export variant: EMA weights.
- Technical report
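Read literally, the 50/50 split suggests an equal-weight sum of the two alignment terms; a hypothetical sketch of that objective (pooling, projection, and normalization details are assumptions, not the report's recipe):

```python
import torch
import torch.nn.functional as F

def alignment_loss(pred: torch.Tensor, dino_feat: torch.Tensor) -> torch.Tensor:
    """Equal-weight MSE plus negative cosine against [B, D] DINOv3 features."""
    mse = F.mse_loss(pred, dino_feat)
    neg_cos = -F.cosine_similarity(pred, dino_feat, dim=-1).mean()
    return 0.5 * mse + 0.5 * neg_cos
```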
## Citation
```bibtex
@misc{semdisdiffae_p32_v2,
  title  = {SemDisDiffAE p32 v2: a patch-32 FCDM diffusion autoencoder},
  author = {data-archetype},
  email  = {[email protected]},
  year   = {2026},
  month  = apr,
  url    = {https://huggingface.co/data-archetype/semdisdiffae_p32_v2},
}
```