# Marigold Computer Vision

![marigold](https://marigoldmonodepth.github.io/images/teaser_collage_compressed.jpg)

Marigold was proposed in 
[Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation](https://huggingface.co/papers/2312.02145), 
a CVPR 2024 Oral paper by 
[Bingxin Ke](http://www.kebingxin.com/), 
[Anton Obukhov](https://www.obukhov.ai/), 
[Shengyu Huang](https://shengyuh.github.io/), 
[Nando Metzger](https://nandometzger.github.io/), 
[Rodrigo Caye Daudt](https://rcdaudt.github.io/), and 
[Konrad Schindler](https://scholar.google.com/citations?user=FZuNgqIAAAAJ&hl=en).
The core idea is to **repurpose the generative prior of Text-to-Image Latent Diffusion Models (LDMs) for traditional 
computer vision tasks**.
This approach was explored by fine-tuning Stable Diffusion for **Monocular Depth Estimation**, as demonstrated in the 
teaser above.

Marigold was later extended in the follow-up paper, 
[Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis](https://huggingface.co/papers/2505.09358), 
authored by 
[Bingxin Ke](http://www.kebingxin.com/), 
[Kevin Qu](https://www.linkedin.com/in/kevin-qu-b3417621b/?locale=en_US), 
[Tianfu Wang](https://tianfwang.github.io/), 
[Nando Metzger](https://nandometzger.github.io/), 
[Shengyu Huang](https://shengyuh.github.io/), 
[Bo Li](https://www.linkedin.com/in/bobboli0202/), 
[Anton Obukhov](https://www.obukhov.ai/), and 
[Konrad Schindler](https://scholar.google.com/citations?user=FZuNgqIAAAAJ&hl=en).
This work expanded Marigold to support new modalities such as **Surface Normals** and **Intrinsic Image Decomposition** 
(IID), introduced a training protocol for **Latent Consistency Models** (LCM), and demonstrated **High-Resolution** (HR) 
processing capability.

> [!TIP]
> The early Marigold models (`v1-0` and earlier) were optimized for best results with at least 10 inference steps.
> LCM models were later developed to enable high-quality inference in just 1 to 4 steps.
> Marigold models `v1-1` and later use the LCM scheduler to achieve optimal 
> results in as few as 1 to 4 steps.

## Available Pipelines

Each pipeline is tailored for a specific computer vision task, processing an input RGB image and generating a 
corresponding prediction.
Currently, the following computer vision tasks are implemented:

| Pipeline                                                                                                                                          | Recommended Model Checkpoints                                                                                                                                                                           |                              Spaces (Interactive Apps)                               | Predicted Modalities                                                                                                                                                               |
|---------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [MarigoldDepthPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py)           | [prs-eth/marigold-depth-v1-1](https://huggingface.co/prs-eth/marigold-depth-v1-1)                                                                                                                       |          [Depth Estimation](https://huggingface.co/spaces/prs-eth/marigold)          | [Depth](https://en.wikipedia.org/wiki/Depth_map), [Disparity](https://en.wikipedia.org/wiki/Binocular_disparity)                                                                   |
| [MarigoldNormalsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py)       | [prs-eth/marigold-normals-v1-1](https://huggingface.co/prs-eth/marigold-normals-v1-1)                                                                                                                   | [Surface Normals Estimation](https://huggingface.co/spaces/prs-eth/marigold-normals) | [Surface normals](https://en.wikipedia.org/wiki/Normal_mapping)                                                                                                                    |
| [MarigoldIntrinsicsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py) | [prs-eth/marigold-iid-appearance-v1-1](https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1), [prs-eth/marigold-iid-lighting-v1-1](https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1) | [Intrinsic Image Decomposition](https://huggingface.co/spaces/prs-eth/marigold-iid)  | [Albedo](https://en.wikipedia.org/wiki/Albedo), [Materials](https://www.n.aiq3d.com/wiki/roughnessmetalnessao-map), [Lighting](https://en.wikipedia.org/wiki/Diffuse_reflection)   |

## Available Checkpoints

All original checkpoints are available under the [PRS-ETH](https://huggingface.co/prs-eth/) organization on Hugging Face.
They are designed for use with diffusers pipelines and the [original codebase](https://github.com/prs-eth/marigold), which can also be used to train 
new model checkpoints.
The following is a summary of the recommended checkpoints, all of which produce reliable results with 1 to 4 steps. 

| Checkpoint                                                                                          | Modality     | Comment                                                                                                                                                                              |
|-----------------------------------------------------------------------------------------------------|--------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [prs-eth/marigold-depth-v1-1](https://huggingface.co/prs-eth/marigold-depth-v1-1)                   | Depth        | Affine-invariant depth prediction assigns each pixel a value between 0 (near plane) and 1 (far plane), with both planes determined by the model during inference.                    |
| [prs-eth/marigold-normals-v1-1](https://huggingface.co/prs-eth/marigold-normals-v1-1)               | Normals      | The surface normals predictions are unit-length 3D vectors in the screen space of the camera, with values in the range from -1 to 1.                                                 |
| [prs-eth/marigold-iid-appearance-v1-1](https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1) | Intrinsics   | InteriorVerse decomposition comprises Albedo and two BRDF material properties: Roughness and Metallicity.                                                                            | 
| [prs-eth/marigold-iid-lighting-v1-1](https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1)     | Intrinsics   | HyperSim decomposition of an image $I$ comprises Albedo $A$, Diffuse shading $S$, and Non-diffuse residual $R$: $I = A \cdot S + R$. |
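
The HyperSim relation $I = A \cdot S + R$ can be checked numerically. The following is a small illustrative NumPy sketch; the arrays are random stand-ins for the intrinsic components, not real model outputs:

```python
import numpy as np

# Random stand-ins for the intrinsic components (not real model outputs).
rng = np.random.default_rng(0)
A = rng.uniform(0.0, 1.0, size=(4, 4, 3))  # Albedo
S = rng.uniform(0.0, 1.0, size=(4, 4, 3))  # Diffuse shading
R = rng.uniform(0.0, 0.1, size=(4, 4, 3))  # Non-diffuse residual

# Recompose the image from its intrinsic components: I = A * S + R.
I = A * S + R

# The residual is exactly the part of the image not explained by A * S.
print(np.allclose(I - A * S, R))  # True
```
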

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff 
> between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to 
> efficiently load the same components into multiple pipelines. 
> Also, to learn more about reducing the memory usage of this pipeline, refer to the 
> [Reduce memory usage](../../using-diffusers/svd#reduce-memory-usage) section.

> [!WARNING]
> Marigold pipelines were designed and tested with the scheduler embedded in the model checkpoint.
> The optimal number of inference steps varies by scheduler, with no universal value that works best across all cases.
> To accommodate this, the `num_inference_steps` parameter in the pipeline's `__call__` method defaults to `None` (see the 
> API reference).
> Unless set explicitly, it inherits the value from the `default_denoising_steps` field in the checkpoint configuration 
> file (`model_index.json`).
> This ensures high-quality predictions when invoking the pipeline with only the `image` argument.
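
The fallback described above can be sketched in a few lines. `resolve_steps` is a hypothetical helper for illustration only (not the pipeline's internal code), and the default of 4 steps is an assumed placeholder, not a value taken from any specific checkpoint:

```python
# Hypothetical helper, illustrating the fallback described above: an
# explicit num_inference_steps overrides the checkpoint's
# default_denoising_steps from model_index.json.
def resolve_steps(num_inference_steps=None, default_denoising_steps=4):
    if num_inference_steps is not None:
        return num_inference_steps
    return default_denoising_steps

print(resolve_steps())                        # 4  (checkpoint default)
print(resolve_steps(num_inference_steps=10))  # 10 (explicit override)
```
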

See also Marigold [usage examples](../../using-diffusers/marigold_usage).

## Marigold Depth Prediction API[[diffusers.MarigoldDepthPipeline]]

#### diffusers.MarigoldDepthPipeline[[diffusers.MarigoldDepthPipeline]]

[Source](https://github.com/huggingface/diffusers/blob/v0.38.0/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py#L104)

Pipeline for monocular depth estimation using the Marigold method: https://marigoldmonodepth.github.io.

This model inherits from [DiffusionPipeline](/docs/diffusers/v0.38.0/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

#### diffusers.MarigoldDepthPipeline.\_\_call\_\_[[diffusers.MarigoldDepthPipeline.__call__]]

[Source](https://github.com/huggingface/diffusers/blob/v0.38.0/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py#L347)

- **image** (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `list[PIL.Image.Image]`, `list[np.ndarray]`, `list[torch.Tensor]`) --
  An input image or images used as an input for the depth estimation task. For arrays and tensors, the
  expected value range is between `[0, 1]`. Passing a batch of images is possible by providing a
  four-dimensional array or a tensor. Additionally, a list of two- or three-dimensional arrays or tensors
  can be passed. In the latter case, all list elements must have the same width and height.
- **num_inference_steps** (`int`, *optional*, defaults to `None`) --
  Number of denoising diffusion steps during inference. The default value `None` results in automatic
  selection.
- **ensemble_size** (`int`, defaults to `1`) --
  Number of ensemble predictions. Higher values result in measurable improvements at the cost of longer
  inference, and may smooth out fine visual detail.
- **processing_resolution** (`int`, *optional*, defaults to `None`) --
  Effective processing resolution. When set to `0`, matches the larger input image dimension. This
  produces crisper predictions, but may also lead to the overall loss of global context. The default
  value `None` resolves to the optimal value from the model config.
- **match_input_resolution** (`bool`, *optional*, defaults to `True`) --
  When enabled, the output prediction is resized to match the input dimensions. When disabled, the longer
  side of the output will equal `processing_resolution`.
- **resample_method_input** (`str`, *optional*, defaults to `"bilinear"`) --
  Resampling method used to resize input images to `processing_resolution`. The accepted values are:
  `"nearest"`, `"nearest-exact"`, `"bilinear"`, `"bicubic"`, or `"area"`.
- **resample_method_output** (`str`, *optional*, defaults to `"bilinear"`) --
  Resampling method used to resize output predictions to match the input resolution. The accepted values
  are `"nearest"`, `"nearest-exact"`, `"bilinear"`, `"bicubic"`, or `"area"`.
- **batch_size** (`int`, *optional*, defaults to `1`) --
  Batch size; only matters when setting `ensemble_size` or passing a tensor of images.
- **ensembling_kwargs** (`dict`, *optional*, defaults to `None`) --
  Extra dictionary with arguments for precise ensembling control. The following options are available:
  - reduction (`str`, *optional*, defaults to `"median"`): Defines the ensembling function applied in
    every pixel location, can be either `"median"` or `"mean"`.
  - regularizer_strength (`float`, *optional*, defaults to `0.02`): Strength of the regularizer that
    pulls the aligned predictions to the unit range from 0 to 1.
  - max_iter (`int`, *optional*, defaults to `2`): Maximum number of the alignment solver steps. Refer to
    `scipy.optimize.minimize` function, `options` argument.
  - tol (`float`, *optional*, defaults to `1e-3`): Alignment solver tolerance. The solver stops when the
    tolerance is reached.
  - max_res (`int`, *optional*, defaults to `None`): Resolution at which the alignment is performed;
    `None` matches the `processing_resolution`.
- **latents** (`torch.Tensor`, or `list[torch.Tensor]`, *optional*, defaults to `None`) --
  Latent noise tensors to replace the random initialization. These can be taken from the previous
  function call's output.
- **generator** (`torch.Generator`, or `list[torch.Generator]`, *optional*, defaults to `None`) --
  Random number generator object to ensure reproducibility.
- **output_type** (`str`, *optional*, defaults to `"np"`) --
  Preferred format of the output's `prediction` and the optional `uncertainty` fields. The accepted
  values are: `"np"` (numpy array) or `"pt"` (torch tensor).
- **output_uncertainty** (`bool`, *optional*, defaults to `False`) --
  When enabled, the output's `uncertainty` field contains the predictive uncertainty map, provided that
  the `ensemble_size` argument is set to a value above 2.
- **output_latent** (`bool`, *optional*, defaults to `False`) --
  When enabled, the output's `latent` field contains the latent codes corresponding to the predictions
  within the ensemble. These codes can be saved, modified, and used for subsequent calls with the
  `latents` argument.
- **return_dict** (`bool`, *optional*, defaults to `True`) --
  Whether or not to return a [MarigoldDepthOutput](/docs/diffusers/v0.38.0/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldDepthOutput) instead of a plain tuple.

Function invoked when calling the pipeline.

Examples:
```py
>>> import diffusers
>>> import torch

>>> pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
...     "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
... ).to("cuda")

>>> image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
>>> depth = pipe(image)

>>> vis = pipe.image_processor.visualize_depth(depth.prediction)
>>> vis[0].save("einstein_depth.png")

>>> depth_16bit = pipe.image_processor.export_depth_to_16bit_png(depth.prediction)
>>> depth_16bit[0].save("einstein_depth_16bit.png")
```
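
The `reduction` option of `ensembling_kwargs` controls how the aligned per-pixel predictions are collapsed into one map. A minimal NumPy sketch of the idea (illustrative only, not the pipeline's internals):

```python
import numpy as np

# Three toy aligned depth predictions for a 2x2 image; the last member is
# an outlier. Illustrative only -- not the pipeline's internal code.
preds = np.stack([
    np.full((2, 2), 0.2),
    np.full((2, 2), 0.3),
    np.full((2, 2), 0.9),  # outlier ensemble member
])

median_map = np.median(preds, axis=0)  # reduction="median" (default)
mean_map = preds.mean(axis=0)          # reduction="mean"

print(float(median_map[0, 0]))          # 0.3 -- robust to the outlier
print(round(float(mean_map[0, 0]), 4))  # 0.4667 -- pulled toward it
```
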

**Parameters:**

unet (`UNet2DConditionModel`) : Conditional U-Net to denoise the depth latent, conditioned on image latent.

vae (`AutoencoderKL`) : Variational Auto-Encoder (VAE) Model to encode and decode images and predictions to and from latent representations.

scheduler (`DDIMScheduler` or `LCMScheduler`) : A scheduler to be used in combination with `unet` to denoise the encoded image latents.

text_encoder (`CLIPTextModel`) : Text-encoder, for empty text embedding.

tokenizer (`CLIPTokenizer`) : CLIP tokenizer.

prediction_type (`str`, *optional*) : Type of predictions made by the model.

scale_invariant (`bool`, *optional*) : A model property specifying whether the predicted depth maps are scale-invariant. This value must be set in the model config. When used together with the `shift_invariant=True` flag, the model is also called "affine-invariant". NB: overriding this value is not supported.

shift_invariant (`bool`, *optional*) : A model property specifying whether the predicted depth maps are shift-invariant. This value must be set in the model config. When used together with the `scale_invariant=True` flag, the model is also called "affine-invariant". NB: overriding this value is not supported.

default_denoising_steps (`int`, *optional*) : The minimum number of denoising diffusion steps that are required to produce a prediction of reasonable quality with the given model. This value must be set in the model config. When the pipeline is called without explicitly setting `num_inference_steps`, the default value is used. This is required to ensure reasonable results with various model flavors compatible with the pipeline, such as those relying on very short denoising schedules (`LCMScheduler`) and those with full diffusion schedules (`DDIMScheduler`).

default_processing_resolution (`int`, *optional*) : The recommended value of the `processing_resolution` parameter of the pipeline. This value must be set in the model config. When the pipeline is called without explicitly setting `processing_resolution`, the default value is used. This is required to ensure reasonable results with various model flavors trained with varying optimal processing resolution values.

**Returns:**

[MarigoldDepthOutput](/docs/diffusers/v0.38.0/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldDepthOutput) or `tuple`

If `return_dict` is `True`, [MarigoldDepthOutput](/docs/diffusers/v0.38.0/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldDepthOutput) is returned, otherwise a
`tuple` is returned where the first element is the prediction, the second element is the uncertainty
(or `None`), and the third is the latent (or `None`).

#### diffusers.pipelines.marigold.MarigoldDepthOutput[[diffusers.pipelines.marigold.MarigoldDepthOutput]]

[Source](https://github.com/huggingface/diffusers/blob/v0.38.0/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py#L83)

Output class for Marigold monocular depth prediction pipeline.

**Parameters:**

prediction (`np.ndarray`, `torch.Tensor`) : Predicted depth maps with values in the range [0, 1]. The shape is `num_images × 1 × height × width` for `torch.Tensor` or `num_images × height × width × 1` for `np.ndarray`.

uncertainty (`None`, `np.ndarray`, `torch.Tensor`) : Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is `num_images × 1 × height × width` for `torch.Tensor` or `num_images × height × width × 1` for `np.ndarray`.

latent (`None`, `torch.Tensor`) : Latent features corresponding to the predictions, compatible with the `latents` argument of the pipeline. The shape is `num_images * num_ensemble × 4 × latent_height × latent_width`.

#### diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_depth[[diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_depth]]

[Source](https://github.com/huggingface/diffusers/blob/v0.38.0/src/diffusers/pipelines/marigold/marigold_image_processing.py#L387)

Visualizes depth maps, such as predictions of the `MarigoldDepthPipeline`.

Returns: `list[PIL.Image.Image]` with depth maps visualization.

**Parameters:**

depth (`PIL.Image.Image | np.ndarray | torch.Tensor | list[PIL.Image.Image] | list[np.ndarray] | list[torch.Tensor]`) : Depth maps.

val_min (`float`, *optional*, defaults to `0.0`) : Minimum value of the visualized depth range.

val_max (`float`, *optional*, defaults to `1.0`) : Maximum value of the visualized depth range.

color_map (`str`, *optional*, defaults to `"Spectral"`) : Color map used to convert a single-channel depth prediction into colored representation.
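
The `val_min`/`val_max` arguments define the depth range mapped onto the color map. A minimal sketch of this normalization step, assuming a simple linear mapping with clamping (not the library's actual implementation):

```python
import numpy as np

# Assumed behavior for illustration: linearly map [val_min, val_max] to
# [0, 1] and clamp values outside the range, before applying a color map.
def normalize_depth(depth, val_min=0.0, val_max=1.0):
    if val_max <= val_min:
        raise ValueError("`val_max` must be greater than `val_min`")
    return np.clip((depth - val_min) / (val_max - val_min), 0.0, 1.0)

depth = np.array([[0.0, 0.25], [0.5, 1.0]])
# Visualize only the [0, 0.5] range; values beyond it saturate to 1.
print(normalize_depth(depth, val_min=0.0, val_max=0.5))
```
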

## Marigold Normals Estimation API[[diffusers.MarigoldNormalsPipeline]]

#### diffusers.MarigoldNormalsPipeline[[diffusers.MarigoldNormalsPipeline]]

[Source](https://github.com/huggingface/diffusers/blob/v0.38.0/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py#L99)

Pipeline for monocular normals estimation using the Marigold method: https://marigoldmonodepth.github.io.

This model inherits from [DiffusionPipeline](/docs/diffusers/v0.38.0/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

#### diffusers.MarigoldNormalsPipeline.\_\_call\_\_[[diffusers.MarigoldNormalsPipeline.__call__]]

[Source](https://github.com/huggingface/diffusers/blob/v0.38.0/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py#L332)

- **image** (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `list[PIL.Image.Image]`, `list[np.ndarray]`, `list[torch.Tensor]`) --
  An input image or images used as an input for the normals estimation task. For arrays and tensors, the
  expected value range is between `[0, 1]`. Passing a batch of images is possible by providing a
  four-dimensional array or a tensor. Additionally, a list of two- or three-dimensional arrays or tensors
  can be passed. In the latter case, all list elements must have the same width and height.
- **num_inference_steps** (`int`, *optional*, defaults to `None`) --
  Number of denoising diffusion steps during inference. The default value `None` results in automatic
  selection.
- **ensemble_size** (`int`, defaults to `1`) --
  Number of ensemble predictions. Higher values result in measurable improvements at the cost of longer
  inference, and may smooth out fine visual detail.
- **processing_resolution** (`int`, *optional*, defaults to `None`) --
  Effective processing resolution. When set to `0`, matches the larger input image dimension. This
  produces crisper predictions, but may also lead to the overall loss of global context. The default
  value `None` resolves to the optimal value from the model config.
- **match_input_resolution** (`bool`, *optional*, defaults to `True`) --
  When enabled, the output prediction is resized to match the input dimensions. When disabled, the longer
  side of the output will equal `processing_resolution`.
- **resample_method_input** (`str`, *optional*, defaults to `"bilinear"`) --
  Resampling method used to resize input images to `processing_resolution`. The accepted values are:
  `"nearest"`, `"nearest-exact"`, `"bilinear"`, `"bicubic"`, or `"area"`.
- **resample_method_output** (`str`, *optional*, defaults to `"bilinear"`) --
  Resampling method used to resize output predictions to match the input resolution. The accepted values
  are `"nearest"`, `"nearest-exact"`, `"bilinear"`, `"bicubic"`, or `"area"`.
- **batch_size** (`int`, *optional*, defaults to `1`) --
  Batch size; only matters when setting `ensemble_size` or passing a tensor of images.
- **ensembling_kwargs** (`dict`, *optional*, defaults to `None`) --
  Extra dictionary with arguments for precise ensembling control. The following options are available:
  - reduction (`str`, *optional*, defaults to `"closest"`): Defines the ensembling function applied in
    every pixel location, can be either `"closest"` or `"mean"`.
- **latents** (`torch.Tensor`, or `list[torch.Tensor]`, *optional*, defaults to `None`) --
  Latent noise tensors to replace the random initialization. These can be taken from the previous
  function call's output.
- **generator** (`torch.Generator`, or `list[torch.Generator]`, *optional*, defaults to `None`) --
  Random number generator object to ensure reproducibility.
- **output_type** (`str`, *optional*, defaults to `"np"`) --
  Preferred format of the output's `prediction` and the optional `uncertainty` fields. The accepted
  values are: `"np"` (numpy array) or `"pt"` (torch tensor).
- **output_uncertainty** (`bool`, *optional*, defaults to `False`) --
  When enabled, the output's `uncertainty` field contains the predictive uncertainty map, provided that
  the `ensemble_size` argument is set to a value above 2.
- **output_latent** (`bool`, *optional*, defaults to `False`) --
  When enabled, the output's `latent` field contains the latent codes corresponding to the predictions
  within the ensemble. These codes can be saved, modified, and used for subsequent calls with the
  `latents` argument.
- **return_dict** (`bool`, *optional*, defaults to `True`) --
  Whether or not to return a [MarigoldNormalsOutput](/docs/diffusers/v0.38.0/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldNormalsOutput) instead of a plain tuple.

Function invoked when calling the pipeline.

Examples:
```py
>>> import diffusers
>>> import torch

>>> pipe = diffusers.MarigoldNormalsPipeline.from_pretrained(
...     "prs-eth/marigold-normals-v1-1", variant="fp16", torch_dtype=torch.float16
... ).to("cuda")

>>> image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
>>> normals = pipe(image)

>>> vis = pipe.image_processor.visualize_normals(normals.prediction)
>>> vis[0].save("einstein_normals.png")
```
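
One plausible reading of the `"closest"` reduction for normals ensembling (an assumption for illustration, not the library's verified internals): per pixel, keep the ensemble member most aligned with the renormalized mean normal, so the output remains a genuine unit vector rather than an averaged one:

```python
import numpy as np

# Three toy unit normals for a single pixel (ensemble axis first).
# Illustrative sketch only -- an assumed reading of reduction="closest".
preds = np.array([
    [0.0, 0.0, 1.0],
    [0.1, 0.0, 0.995],
    [0.8, 0.0, 0.6],
])

mean = preds.mean(axis=0)
mean /= np.linalg.norm(mean)  # renormalize the consensus to unit length

closest = preds[np.argmax(preds @ mean)]  # highest cosine similarity
print(closest)  # the member closest to the ensemble consensus
```
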

**Parameters:**

unet (`UNet2DConditionModel`) : Conditional U-Net to denoise the normals latent, conditioned on image latent.

vae (`AutoencoderKL`) : Variational Auto-Encoder (VAE) Model to encode and decode images and predictions to and from latent representations.

scheduler (`DDIMScheduler` or `LCMScheduler`) : A scheduler to be used in combination with `unet` to denoise the encoded image latents.

text_encoder (`CLIPTextModel`) : Text-encoder, for empty text embedding.

tokenizer (`CLIPTokenizer`) : CLIP tokenizer.

prediction_type (`str`, *optional*) : Type of predictions made by the model.

use_full_z_range (`bool`, *optional*) : Whether the normals predicted by this model utilize the full range of the Z dimension, or only its positive half.

default_denoising_steps (`int`, *optional*) : The minimum number of denoising diffusion steps that are required to produce a prediction of reasonable quality with the given model. This value must be set in the model config. When the pipeline is called without explicitly setting `num_inference_steps`, the default value is used. This is required to ensure reasonable results with various model flavors compatible with the pipeline, such as those relying on very short denoising schedules (`LCMScheduler`) and those with full diffusion schedules (`DDIMScheduler`).

default_processing_resolution (`int`, *optional*) : The recommended value of the `processing_resolution` parameter of the pipeline. This value must be set in the model config. When the pipeline is called without explicitly setting `processing_resolution`, the default value is used. This is required to ensure reasonable results with various model flavors trained with varying optimal processing resolution values.

**Returns:**

[MarigoldNormalsOutput](/docs/diffusers/v0.38.0/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldNormalsOutput) or `tuple`

If `return_dict` is `True`, [MarigoldNormalsOutput](/docs/diffusers/v0.38.0/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldNormalsOutput) is returned, otherwise a
`tuple` is returned where the first element is the prediction, the second element is the uncertainty
(or `None`), and the third is the latent (or `None`).

#### diffusers.pipelines.marigold.MarigoldNormalsOutput[[diffusers.pipelines.marigold.MarigoldNormalsOutput]]

[Source](https://github.com/huggingface/diffusers/blob/v0.38.0/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py#L78)

Output class for Marigold monocular normals prediction pipeline.

**Parameters:**

prediction (`np.ndarray`, `torch.Tensor`) : Predicted normals with values in the range [-1, 1]. The shape is `num_images × 3 × height × width` for `torch.Tensor` or `num_images × height × width × 3` for `np.ndarray`.

uncertainty (`None`, `np.ndarray`, `torch.Tensor`) : Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is `num_images × 1 × height × width` for `torch.Tensor` or `num_images × height × width × 1` for `np.ndarray`.

latent (`None`, `torch.Tensor`) : Latent features corresponding to the predictions, compatible with the `latents` argument of the pipeline. The shape is `num_images * num_ensemble × 4 × latent_height × latent_width`.

#### diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_normals[[diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_normals]]

[Source](https://github.com/huggingface/diffusers/blob/v0.38.0/src/diffusers/pipelines/marigold/marigold_image_processing.py#L486)

Visualizes surface normals, such as predictions of the `MarigoldNormalsPipeline`.

Returns: `list[PIL.Image.Image]` with surface normals visualization.

**Parameters:**

normals (`np.ndarray | torch.Tensor | list[np.ndarray] | list[torch.Tensor]`) : Surface normals.

flip_x (`bool`, *optional*, defaults to `False`) : Flips the X axis of the normals frame of reference. Default direction is right.

flip_y (`bool`, *optional*, defaults to `False`) : Flips the Y axis of the normals frame of reference. Default direction is top.

flip_z (`bool`, *optional*, defaults to `False`) : Flips the Z axis of the normals frame of reference. Default direction is facing the observer.
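
A minimal sketch of what the flip flags amount to, assuming they simply negate the corresponding component of each normal before visualization (`flip_normals` is a hypothetical helper for illustration, not the library API):

```python
import numpy as np

# Hypothetical helper, not the library API: negate the chosen components.
def flip_normals(normals, flip_x=False, flip_y=False, flip_z=False):
    # normals: (H, W, 3) array of unit vectors with components in [-1, 1]
    signs = np.array([
        -1.0 if flip_x else 1.0,
        -1.0 if flip_y else 1.0,
        -1.0 if flip_z else 1.0,
    ])
    return normals * signs

n = np.array([[[0.0, 0.0, 1.0]]])       # a normal facing the observer
flipped = flip_normals(n, flip_z=True)  # now facing away from the observer
```
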

## Marigold Intrinsic Image Decomposition API[[diffusers.MarigoldIntrinsicsPipeline]]

#### diffusers.MarigoldIntrinsicsPipeline[[diffusers.MarigoldIntrinsicsPipeline]]

[Source](https://github.com/huggingface/diffusers/blob/v0.38.0/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py#L120)

Pipeline for Intrinsic Image Decomposition (IID) using the Marigold method:
https://marigoldcomputervision.github.io.

This model inherits from [DiffusionPipeline](/docs/diffusers/v0.38.0/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

#### diffusers.MarigoldIntrinsicsPipeline.\_\_call\_\_[[diffusers.MarigoldIntrinsicsPipeline.__call__]]

[Source](https://github.com/huggingface/diffusers/blob/v0.38.0/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py#L359)

- **image** (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `list[PIL.Image.Image]`, `list[np.ndarray]`,
  `list[torch.Tensor]`) --
  An input image or images used as an input for the intrinsic decomposition task.
  For arrays and tensors, the expected value range is between `[0, 1]`. Passing a batch of images is
  possible by providing a four-dimensional array or a tensor. Additionally, a list of images of two- or
  three-dimensional arrays or tensors can be passed. In the latter case, all list elements must have the
  same width and height.
- **num_inference_steps** (`int`, *optional*, defaults to `None`) --
  Number of denoising diffusion steps during inference. The default value `None` results in automatic
  selection.
- **ensemble_size** (`int`, defaults to `1`) --
  Number of ensemble predictions. Higher values yield measurably better predictions at the cost of
  proportionally slower inference.
- **processing_resolution** (`int`, *optional*, defaults to `None`) --
  Effective processing resolution. When set to `0`, matches the larger input image dimension. This
  produces crisper predictions, but may also lead to the overall loss of global context. The default
  value `None` resolves to the optimal value from the model config.
- **match_input_resolution** (`bool`, *optional*, defaults to `True`) --
  When enabled, the output prediction is resized to match the input dimensions. When disabled, the longer
  side of the output will equal `processing_resolution`.
- **resample_method_input** (`str`, *optional*, defaults to `"bilinear"`) --
  Resampling method used to resize input images to `processing_resolution`. The accepted values are:
  `"nearest"`, `"nearest-exact"`, `"bilinear"`, `"bicubic"`, or `"area"`.
- **resample_method_output** (`str`, *optional*, defaults to `"bilinear"`) --
  Resampling method used to resize output predictions to match the input resolution. The accepted values
  are `"nearest"`, `"nearest-exact"`, `"bilinear"`, `"bicubic"`, or `"area"`.
- **batch_size** (`int`, *optional*, defaults to `1`) --
  Batch size; only matters when setting `ensemble_size` or passing a tensor of images.
- **ensembling_kwargs** (`dict`, *optional*, defaults to `None`) --
  Extra dictionary with arguments for precise ensembling control. The following options are available:
  - reduction (`str`, *optional*, defaults to `"median"`): Defines the ensembling function applied in
    every pixel location, can be either `"median"` or `"mean"`.
- **latents** (`torch.Tensor`, *optional*, defaults to `None`) --
  Latent noise tensors to replace the random initialization. These can be taken from the previous
  function call's output.
- **generator** (`torch.Generator`, or `list[torch.Generator]`, *optional*, defaults to `None`) --
  Random number generator object to ensure reproducibility.
- **output_type** (`str`, *optional*, defaults to `"np"`) --
  Preferred format of the output's `prediction` and the optional `uncertainty` fields. The accepted
  values are: `"np"` (numpy array) or `"pt"` (torch tensor).
- **output_uncertainty** (`bool`, *optional*, defaults to `False`) --
  When enabled, the output's `uncertainty` field contains the predictive uncertainty map, provided that
  the `ensemble_size` argument is set to a value above 2.
- **output_latent** (`bool`, *optional*, defaults to `False`) --
  When enabled, the output's `latent` field contains the latent codes corresponding to the predictions
  within the ensemble. These codes can be saved, modified, and used for subsequent calls with the
  `latents` argument.
- **return_dict** (`bool`, *optional*, defaults to `True`) --
  Whether or not to return a [MarigoldIntrinsicsOutput](/docs/diffusers/v0.38.0/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldIntrinsicsOutput) instead of a plain tuple.0[MarigoldIntrinsicsOutput](/docs/diffusers/v0.38.0/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldIntrinsicsOutput) or `tuple`If `return_dict` is `True`, [MarigoldIntrinsicsOutput](/docs/diffusers/v0.38.0/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldIntrinsicsOutput) is returned, otherwise a
`tuple` is returned where the first element is the prediction, the second element is the uncertainty
(or `None`), and the third is the latent (or `None`).

Function invoked when calling the pipeline.

Examples:
```py
>>> import diffusers
>>> import torch

>>> pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained(
...     "prs-eth/marigold-iid-appearance-v1-1", variant="fp16", torch_dtype=torch.float16
... ).to("cuda")

>>> image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
>>> intrinsics = pipe(image)

>>> vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties)
>>> vis[0]["albedo"].save("einstein_albedo.png")
>>> vis[0]["roughness"].save("einstein_roughness.png")
>>> vis[0]["metallicity"].save("einstein_metallicity.png")
```

```py
>>> import diffusers
>>> import torch

>>> pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained(
...     "prs-eth/marigold-iid-lighting-v1-1", variant="fp16", torch_dtype=torch.float16
... ).to("cuda")

>>> image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
>>> intrinsics = pipe(image)

>>> vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties)
>>> vis[0]["albedo"].save("einstein_albedo.png")
>>> vis[0]["shading"].save("einstein_shading.png")
>>> vis[0]["residual"].save("einstein_residual.png")
```
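The `reduction` option of `ensembling_kwargs` above selects the per-pixel aggregation applied across the ensemble. A rough numpy sketch of median versus mean reduction, purely illustrative and not the pipeline's internal code:

```python
import numpy as np

# Hypothetical ensemble of 5 single-channel predictions, 2x2 pixels each,
# with one outlier member (0.9).
ensemble = np.stack(
    [np.full((2, 2), v, dtype=np.float32) for v in (0.1, 0.2, 0.2, 0.3, 0.9)]
)

median_pred = np.median(ensemble, axis=0)  # robust to the 0.9 outlier
mean_pred = ensemble.mean(axis=0)          # pulled up by the outlier

print(median_pred[0, 0])  # 0.2
print(mean_pred[0, 0])    # ≈ 0.34
```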

**Parameters:**

unet (`UNet2DConditionModel`) : Conditional U-Net to denoise the target latents, conditioned on the image latent.

vae (`AutoencoderKL`) : Variational Auto-Encoder (VAE) Model to encode and decode images and predictions to and from latent representations.

scheduler (`DDIMScheduler` or `LCMScheduler`) : A scheduler to be used in combination with `unet` to denoise the encoded image latents.

text_encoder (`CLIPTextModel`) : Text encoder used to compute the empty text embedding.

tokenizer (`CLIPTokenizer`) : CLIP tokenizer.

prediction_type (`str`, *optional*) : Type of predictions made by the model.

target_properties (`dict[str, Any]`, *optional*) : Properties of the predicted modalities, such as `target_names`, a `list[str]` used to define the number, order and names of the predicted modalities, and any other metadata that may be required to interpret the predictions.

default_denoising_steps (`int`, *optional*) : The minimum number of denoising diffusion steps that are required to produce a prediction of reasonable quality with the given model. This value must be set in the model config. When the pipeline is called without explicitly setting `num_inference_steps`, the default value is used. This is required to ensure reasonable results with various model flavors compatible with the pipeline, such as those relying on very short denoising schedules (`LCMScheduler`) and those with full diffusion schedules (`DDIMScheduler`).

default_processing_resolution (`int`, *optional*) : The recommended value of the `processing_resolution` parameter of the pipeline. This value must be set in the model config. When the pipeline is called without explicitly setting `processing_resolution`, the default value is used. This is required to ensure reasonable results with various model flavors trained with varying optimal processing resolution values.

**Returns:**

[MarigoldIntrinsicsOutput](/docs/diffusers/v0.38.0/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldIntrinsicsOutput) or `tuple`

If `return_dict` is `True`, [MarigoldIntrinsicsOutput](/docs/diffusers/v0.38.0/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldIntrinsicsOutput) is returned, otherwise a
`tuple` is returned where the first element is the prediction, the second element is the uncertainty
(or `None`), and the third is the latent (or `None`).

#### diffusers.pipelines.marigold.MarigoldIntrinsicsOutput[[diffusers.pipelines.marigold.MarigoldIntrinsicsOutput]]

[Source](https://github.com/huggingface/diffusers/blob/v0.38.0/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py#L96)

Output class for Marigold Intrinsic Image Decomposition pipeline.

**Parameters:**

prediction (`np.ndarray`, `torch.Tensor`) : Predicted image intrinsics with values in the range [0, 1]. The shape is `(num_images * num_targets) × 3 × height × width` for `torch.Tensor` or `(num_images * num_targets) × height × width × 3` for `np.ndarray`, where `num_targets` corresponds to the number of predicted target modalities of the intrinsic image decomposition.

uncertainty (`None`, `np.ndarray`, `torch.Tensor`) : Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is `(num_images * num_targets) × 3 × height × width` for `torch.Tensor` or `(num_images * num_targets) × height × width × 3` for `np.ndarray`.

latent (`None`, `torch.Tensor`) : Latent features corresponding to the predictions, compatible with the `latents` argument of the pipeline. The shape is `(num_images * num_ensemble) × (num_targets * 4) × latent_height × latent_width`.
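Since the prediction stacks all target modalities along the first axis, splitting it back into per-target arrays is a single reshape. A hedged sketch with made-up dimensions and the appearance model's target names from the example above (`albedo`, `roughness`, `metallicity`); it assumes targets vary fastest within each image, so consult `pipe.target_properties["target_names"]` for the actual count and order:

```python
import numpy as np

# Hypothetical: 2 input images, 3 target modalities, 4x4 RGB predictions
# stacked along the first axis as (num_images * num_targets, H, W, 3).
target_names = ["albedo", "roughness", "metallicity"]  # order from the model config
num_images, num_targets = 2, len(target_names)
prediction = np.random.rand(num_images * num_targets, 4, 4, 3).astype(np.float32)

# Group per image, assuming targets vary fastest within each image.
per_image = prediction.reshape(num_images, num_targets, 4, 4, 3)
decomposed = [dict(zip(target_names, img_targets)) for img_targets in per_image]

print(len(decomposed))                # 2
print(decomposed[0]["albedo"].shape)  # (4, 4, 3)
```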

#### diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_intrinsics[[diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_intrinsics]]

[Source](https://github.com/huggingface/diffusers/blob/v0.38.0/src/diffusers/pipelines/marigold/marigold_image_processing.py#L542)

Visualizes intrinsic image decomposition, such as predictions of the `MarigoldIntrinsicsPipeline`.

Returns: `list[dict[str, PIL.Image.Image]]` with intrinsic image decomposition visualization.

**Parameters:**

prediction (`np.ndarray | torch.Tensor | list[np.ndarray] | list[torch.Tensor]`) : Intrinsic image decomposition.

target_properties (`dict[str, Any]`) : Decomposition properties. Expected entries: `target_names: list[str]` and, for each target and sub-target, a dictionary with keys `prediction_space: str`, `sub_target_names: list[str | None]` (must have 3 entries, with `None` for missing modalities), and `up_to_scale: bool`.

color_map (`str | dict[str, str]`, *optional*, defaults to `"Spectral"`) : Color map used to convert single-channel predictions into colored representations. When a dictionary is passed, each modality can be colored with its own color map.
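In practice `target_properties` is taken directly from `pipe.target_properties` of a loaded pipeline rather than constructed by hand. A purely illustrative sketch of the structure the description above implies; the target names and all field values here are hypothetical, not taken from any actual model config:

```python
# Hypothetical target_properties layout implied by the description above;
# in practice, pass `pipe.target_properties` from a loaded pipeline.
target_properties = {
    "target_names": ["albedo", "material"],
    "albedo": {
        "prediction_space": "srgb",              # assumed value, not from the docs
        "sub_target_names": [None, None, None],  # 3 entries, no sub-targets
        "up_to_scale": False,
    },
    "material": {
        "prediction_space": "linear",            # assumed value, not from the docs
        "sub_target_names": ["roughness", "metallicity", None],
        "up_to_scale": False,
    },
}

print(target_properties["target_names"])  # ['albedo', 'material']
```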

