# Stable Video Diffusion

Stable Video Diffusion was proposed in [Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets](https://hf.co/papers/2311.15127) by Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach.

The abstract from the paper is:

*We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at this https URL.*

> [!TIP]
> To learn how to use Stable Video Diffusion, take a look at the [Stable Video Diffusion](../../../using-diffusers/svd) guide.
>
> Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the [base](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid) and [extended frame](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt) checkpoints!

## Tips

Video generation is memory-intensive, and one way to reduce memory usage is to call `enable_forward_chunking` on the pipeline's UNet so you don't run the entire feedforward layer at once. Breaking it up into chunks and running them in a loop is more memory efficient.
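
Below is a minimal sketch of image-to-video generation with forward chunking enabled. The `stabilityai/stable-video-diffusion-img2vid-xt` checkpoint, the example image URL, and the `decode_chunk_size`/`num_frames` values are illustrative choices rather than requirements.

```py
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipeline = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipeline.enable_model_cpu_offload()
# run the UNet feedforward layers in smaller chunks instead of all at once to lower peak memory
pipeline.unet.enable_forward_chunking()

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png"
)
frames = pipeline(
    image, decode_chunk_size=2, generator=torch.manual_seed(42), num_frames=25
).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```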

Check out the [Text or image-to-video](../../../using-diffusers/text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage.

## StableVideoDiffusionPipeline[[diffusers.StableVideoDiffusionPipeline]]

#### diffusers.StableVideoDiffusionPipeline[[diffusers.StableVideoDiffusionPipeline]]

[Source](https://github.com/huggingface/diffusers/blob/v0.38.0/src/diffusers/pipelines/stable_video_diffusion/pipeline_stable_video_diffusion.py#L147)

Pipeline to generate video from an input image using Stable Video Diffusion.

This model inherits from [DiffusionPipeline](/docs/diffusers/v0.38.0/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods
implemented for all pipelines (downloading, saving, running on a particular device, etc.).

**Parameters:**

vae (`AutoencoderKLTemporalDecoder`) : Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.

image_encoder ([CLIPVisionModelWithProjection](https://huggingface.co/docs/transformers/v5.7.0/en/model_doc/clip#transformers.CLIPVisionModelWithProjection)) : Frozen CLIP image-encoder ([laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K)).

unet (`UNetSpatioTemporalConditionModel`) : A `UNetSpatioTemporalConditionModel` to denoise the encoded image latents.

scheduler ([EulerDiscreteScheduler](/docs/diffusers/v0.38.0/en/api/schedulers/euler#diffusers.EulerDiscreteScheduler)) : A scheduler to be used in combination with `unet` to denoise the encoded image latents.

feature_extractor ([CLIPImageProcessor](https://huggingface.co/docs/transformers/v5.7.0/en/model_doc/clip#transformers.CLIPImageProcessor)) : A `CLIPImageProcessor` to extract features from generated images.
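
The parameters above map one-to-one onto the pipeline's components. As a sketch of that composition, the snippet below loads each component from its subfolder and assembles the pipeline by hand; the subfolder names follow the usual diffusers checkpoint layout and are assumed here, and calling `StableVideoDiffusionPipeline.from_pretrained` on the repository does roughly the same thing in one step.

```py
import torch
from diffusers import (
    AutoencoderKLTemporalDecoder,
    EulerDiscreteScheduler,
    StableVideoDiffusionPipeline,
    UNetSpatioTemporalConditionModel,
)
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

repo = "stabilityai/stable-video-diffusion-img2vid"  # assumed checkpoint layout

# each component lives in its own subfolder of the checkpoint
vae = AutoencoderKLTemporalDecoder.from_pretrained(repo, subfolder="vae", torch_dtype=torch.float16)
image_encoder = CLIPVisionModelWithProjection.from_pretrained(repo, subfolder="image_encoder", torch_dtype=torch.float16)
unet = UNetSpatioTemporalConditionModel.from_pretrained(repo, subfolder="unet", torch_dtype=torch.float16)
scheduler = EulerDiscreteScheduler.from_pretrained(repo, subfolder="scheduler")
feature_extractor = CLIPImageProcessor.from_pretrained(repo, subfolder="feature_extractor")

# roughly equivalent to StableVideoDiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
pipeline = StableVideoDiffusionPipeline(
    vae=vae,
    image_encoder=image_encoder,
    unet=unet,
    scheduler=scheduler,
    feature_extractor=feature_extractor,
)
```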

## StableVideoDiffusionPipelineOutput[[diffusers.pipelines.stable_video_diffusion.StableVideoDiffusionPipelineOutput]]

#### diffusers.pipelines.stable_video_diffusion.StableVideoDiffusionPipelineOutput[[diffusers.pipelines.stable_video_diffusion.StableVideoDiffusionPipelineOutput]]

[Source](https://github.com/huggingface/diffusers/blob/v0.38.0/src/diffusers/pipelines/stable_video_diffusion/pipeline_stable_video_diffusion.py#L134)

Output class for Stable Video Diffusion pipeline.

**Parameters:**

frames (`list[list[PIL.Image.Image]]`, `np.ndarray`, or `torch.Tensor`) : List of denoised PIL images of length `batch_size`, or a NumPy array or torch tensor of shape `(batch_size, num_frames, height, width, num_channels)`.
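
A short sketch of consuming the output object under both output types; the checkpoint name, image URL, and `decode_chunk_size` value are illustrative assumptions.

```py
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

pipeline = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png"
)

# with the default output_type="pil", `frames` is a list of length batch_size,
# each entry a list of PIL images (one per generated frame)
output = pipeline(image, decode_chunk_size=2, generator=torch.manual_seed(0))
export_to_video(output.frames[0], "rocket.mp4", fps=7)

# with output_type="np", `frames` is an array of shape
# (batch_size, num_frames, height, width, num_channels)
output = pipeline(image, decode_chunk_size=2, output_type="np", generator=torch.manual_seed(0))
print(output.frames.shape)
```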

