V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising
Abstract
Pixel-space diffusion models can be enhanced through visual co-denoising that incorporates pretrained visual features; a systematic analysis identifies the architectural and training components that drive the improvement.
Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. Yet existing co-denoising approaches often entangle multiple design choices, making it unclear which of them are truly essential. We therefore present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective, and our study reveals four of them. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.
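The abstract names RMS-based feature rescaling as the cross-stream calibration step but does not spell out its form. Below is a minimal sketch of one plausible reading, assuming the pretrained visual features are rescaled per sample to a target root-mean-square magnitude before being fed to the feature stream; the function name `rms_rescale`, the tensor shapes, and the idea of matching the pixel-stream scale are illustrative assumptions, not the paper's exact formulation.

```python
import torch


def rms_rescale(features: torch.Tensor, target_rms: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Rescale features to a target per-sample RMS (hypothetical calibration step).

    `features` is assumed to have shape (batch, tokens, dim); the RMS is computed
    over all non-batch dimensions so each sample is rescaled independently.
    """
    rms = features.pow(2).mean(dim=tuple(range(1, features.dim())), keepdim=True).sqrt()
    return features * (target_rms / (rms + eps))


# Illustrative usage: calibrate encoder features against the noisy pixel tokens so
# both streams enter the dual-stream transformer at comparable magnitudes.
pixel_tokens = torch.randn(8, 256, 768)           # hypothetical noisy pixel-stream tokens
encoder_feats = 5.0 * torch.randn(8, 256, 768)    # hypothetical pretrained features (larger scale)
target = pixel_tokens.pow(2).mean().sqrt().item()
calibrated = rms_rescale(encoder_feats, target_rms=target)
```

The design intuition is that pretrained encoder features and noisy pixel tokens typically live on very different scales, so without some normalization the feature stream can dominate or vanish in cross-stream attention; rescaling to a shared RMS is one simple way to keep the two streams balanced.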
Community
The following similar papers were recommended by the Semantic Scholar API (automated message from the Librarian Bot):
- Representation Alignment for Just Image Transformers is not Easier than You Think (2026)
- PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss (2026)
- SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training (2026)
- Training-Free Representation Guidance for Diffusion Models with a Representation Alignment Projector (2026)
- OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder (2026)
- Reconstruction-Anchored Diffusion Model for Text-to-Motion Generation (2026)
- Latent Forcing: Reordering the Diffusion Trajectory for Pixel-Space Image Generation (2026)
