Transformers documentation
CHMv2
This model was released on 2026-03-11 and added to Hugging Face Transformers on 2026-03-11.
Overview
The Canopy Height Maps v2 (CHMv2) model was proposed in CHMv2: Improvements in Global Canopy Height Mapping using DINOv3. Building on our original high-resolution canopy height maps released in 2024, CHMv2 delivers substantial improvements in accuracy, detail, and global consistency by leveraging DINOv3, Meta’s self-supervised vision model.
You can find more information here, and the original code here.
The abstract from the paper is the following:
Accurate canopy height information is essential for quantifying forest carbon, monitoring restoration and degradation, and assessing habitat structure, yet high-fidelity measurements from airborne laser scanning (ALS) remain unevenly available globally. Here we present CHMv2, a global, meter-resolution canopy height map derived from high-resolution optical satellite imagery using a depth-estimation model built on DINOv3 and trained against ALS canopy height models. Compared to existing products, CHMv2 substantially improves accuracy, reduces bias in tall forests, and better preserves fine-scale structure such as canopy edges and gaps. These gains are enabled by a large expansion of geographically diverse training data, automated data curation and registration, and a loss formulation and data sampling strategy tailored to canopy height distributions. We validate CHMv2 against independent ALS test sets and against tens of millions of GEDI and ICESat-2 observations, demonstrating consistent performance across major forest biomes.
Usage examples
Run inference on an image with the following code:
```python
from PIL import Image
import torch

from transformers import AutoModelForDepthEstimation, AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("facebook/dinov3-vitl16-chmv2-dpt-head")
model = AutoModelForDepthEstimation.from_pretrained("facebook/dinov3-vitl16-chmv2-dpt-head")

image = Image.open("image.tif")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

depth = processor.post_process_depth_estimation(
    outputs, target_sizes=[(image.height, image.width)]
)[0]["predicted_depth"]
```
CHMv2Config
class transformers.CHMv2Config
< source >( backbone_config: dict | None = None patch_size: int | None = 16 initializer_range: float | None = 0.02 reassemble_factors: list[float] | None = None post_process_channels: list[int] | None = None fusion_hidden_size: int | None = 256 head_hidden_size: int | None = 128 number_output_channels: int | None = 256 readout_type: str | None = 'project' min_depth: float | None = 0.001 max_depth: float | None = 96.0 bins_strategy: str | None = 'chmv2_mixlog' norm_strategy: str | None = 'chmv2_mixlog' **kwargs )
Parameters
- backbone_config (`Union[dict, PreTrainedConfig]`, *optional*) — The configuration of the backbone model. Only `DINOv3ViTConfig` is currently supported.
- patch_size (`int`, *optional*, defaults to 16) — The patch size used by the backbone vision transformer.
- initializer_range (`float`, *optional*, defaults to 0.02) — The standard deviation of the truncated normal initializer for initializing all weight matrices.
- reassemble_factors (`list[float]`, *optional*, defaults to `[4, 2, 1, 0.5]`) — The up/downsampling factors of the reassemble layers.
- post_process_channels (`list[int]`, *optional*, defaults to `[128, 256, 512, 1024]`) — The output channel sizes of the reassemble stage for each backbone feature level.
- fusion_hidden_size (`int`, *optional*, defaults to 256) — The number of channels before fusion.
- head_hidden_size (`int`, *optional*, defaults to 128) — The number of channels in the hidden layer of the depth estimation head.
- number_output_channels (`int`, *optional*, defaults to 256) — The number of output channels of the CHMv2 head (the number of depth bins).
- readout_type (`str`, *optional*, defaults to `"project"`) — The readout operation applied to the CLS token. One of `["ignore", "add", "project"]`.
- min_depth (`float`, *optional*, defaults to 0.001) — The minimum depth value used for depth bin calculation.
- max_depth (`float`, *optional*, defaults to 96.0) — The maximum depth value used for depth bin calculation.
- bins_strategy (`str`, *optional*, defaults to `"chmv2_mixlog"`) — The strategy for distributing depth bins. One of `["linear", "log", "chmv2_mixlog"]`.
- norm_strategy (`str`, *optional*, defaults to `"chmv2_mixlog"`) — The normalization strategy for the depth prediction. One of `["linear", "softmax", "sigmoid", "chmv2_mixlog"]`.

This is the configuration class to store the configuration of a CHMv2ForDepthEstimation. It is used to instantiate a CHMv2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of facebook/dinov3-vitl16-chmv2-dpt-head.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Example:

```python
from transformers import CHMv2Config, CHMv2ForDepthEstimation

# Initializing a CHMv2 configuration with default values
configuration = CHMv2Config()

# Initializing a model (with random weights) from the configuration
model = CHMv2ForDepthEstimation(configuration)

# Accessing the model configuration
configuration = model.config
```
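The `bins_strategy` choices can be illustrated with a small sketch. This is not the library's implementation; the `chmv2_mixlog` variant is model-specific and not reproduced here, but the `linear` and `log` strategies plausibly correspond to bin centers evenly spaced in depth and in log-depth respectively, using the config defaults above:

```python
import numpy as np

# Config defaults: min_depth=0.001, max_depth=96.0, number_output_channels=256
min_depth, max_depth, n_bins = 0.001, 96.0, 256

# "linear": bin centers evenly spaced in depth
linear_bins = np.linspace(min_depth, max_depth, n_bins)

# "log": bin centers evenly spaced in log-depth, which resolves
# low canopy heights more finely than tall ones
log_bins = np.exp(np.linspace(np.log(min_depth), np.log(max_depth), n_bins))
```

Both layouts span the same `[min_depth, max_depth]` range; they differ only in where the bin centers concentrate.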
CHMv2ImageProcessorFast
class transformers.CHMv2ImageProcessorFast
< source >( **kwargs: typing_extensions.Unpack[transformers.models.chmv2.image_processing_chmv2.CHMv2ImageProcessorKwargs] )
Parameters
- ensure_multiple_of (`int`, *optional*, defaults to 1) — If `do_resize` is `True`, the image is resized to a size that is a multiple of this value. Can be overridden by `ensure_multiple_of` in `preprocess`.
- keep_aspect_ratio (`bool`, *optional*, defaults to `False`) — If `True`, the image is resized to the largest possible size such that the aspect ratio is preserved. Can be overridden by `keep_aspect_ratio` in `preprocess`.
- do_reduce_labels (`bool`, *optional*, defaults to `self.do_reduce_labels`) — Whether or not to reduce all label values of segmentation maps by 1. Usually used for datasets where 0 is used for background, and background itself is not included in all classes of a dataset (e.g. ADE20k). The background label will be replaced by 255.
- **kwargs (`ImagesKwargs`, *optional*) — Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.
Constructs a CHMv2ImageProcessorFast image processor.
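The `do_reduce_labels` transformation can be sketched as follows. This is a minimal stand-in illustration, not the library's code: class ids shift down by one, and the former background id 0 becomes the ignore index 255.

```python
import numpy as np

def reduce_labels(seg: np.ndarray) -> np.ndarray:
    """Shift all label values down by 1; background (0) becomes 255."""
    out = seg.astype(np.int64) - 1
    out[out == -1] = 255  # former background pixels map to the ignore index
    return out

# background 0 -> 255, classes 1..N -> 0..N-1
reduced = reduce_labels(np.array([[0, 1, 5]]))
```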
preprocess
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None **kwargs: typing_extensions.Unpack[transformers.models.chmv2.image_processing_chmv2.CHMv2ImageProcessorKwargs] ) → ~image_processing_base.BatchFeature
Parameters
- images (`Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list[PIL.Image.Image], list[numpy.ndarray], list[torch.Tensor]]`) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set `do_rescale=False`.
- segmentation_maps (`ImageInput`, *optional*) — The segmentation maps to preprocess.
- ensure_multiple_of (`int`, *optional*, defaults to 1) — If `do_resize` is `True`, the image is resized to a size that is a multiple of this value. Can be overridden by `ensure_multiple_of` in `preprocess`.
- keep_aspect_ratio (`bool`, *optional*, defaults to `False`) — If `True`, the image is resized to the largest possible size such that the aspect ratio is preserved. Can be overridden by `keep_aspect_ratio` in `preprocess`.
- do_reduce_labels (`bool`, *optional*, defaults to `self.do_reduce_labels`) — Whether or not to reduce all label values of segmentation maps by 1. Usually used for datasets where 0 is used for background, and background itself is not included in all classes of a dataset (e.g. ADE20k). The background label will be replaced by 255.
- return_tensors (`str` or `TensorType`, *optional*) — Returns stacked tensors if set to `'pt'`, otherwise returns a list of tensors.
- **kwargs (`ImagesKwargs`, *optional*) — Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.
Returns
~image_processing_base.BatchFeature
- data (`dict`) — Dictionary of lists/arrays/tensors returned by the call method ('pixel_values', etc.).
- tensor_type (`Union[None, str, TensorType]`, *optional*) — If set, converts the lists of integers to PyTorch/NumPy tensors at initialization.
post_process_depth_estimation
< source >( outputs: DepthEstimatorOutput target_sizes: transformers.utils.generic.TensorType | list[tuple[int, int]] | None = None ) → List[Dict[str, TensorType]]
Parameters
- outputs (`DepthEstimatorOutput`) — Raw outputs of the model.
- target_sizes (`TensorType` or `List[Tuple[int, int]]`, *optional*) — Tensor of shape `(batch_size, 2)` or list of tuples (`Tuple[int, int]`) containing the target size (height, width) of each image in the batch. If left to `None`, predictions will not be resized.
Returns
List[Dict[str, TensorType]]
A list of dictionaries of tensors representing the processed depth predictions.
Converts the raw output of DepthEstimatorOutput into final depth predictions and depth PIL images.
Only supports PyTorch.
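The resizing step can be sketched as follows. This is an illustration of the general mechanism, not the processor's exact code, and the interpolation mode the library actually uses may differ:

```python
import torch
import torch.nn.functional as F

# Stand-in for a raw predicted depth map at the model's working resolution
raw_depth = torch.rand(1, 1, 48, 48)  # (batch, channel, height, width)

# Resize to the original image size given in target_sizes
target_h, target_w = 100, 150
resized = F.interpolate(
    raw_depth, size=(target_h, target_w), mode="bicubic", align_corners=False
)
predicted_depth = resized.squeeze()  # (target_h, target_w)
```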
CHMv2ForDepthEstimation
class transformers.CHMv2ForDepthEstimation
< source >( config: CHMv2Config )
Parameters
- config (CHMv2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
CHMv2 Model with a depth estimation head on top (consisting of convolutional layers) e.g. for canopy height estimation.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( pixel_values: FloatTensor labels: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → DepthEstimatorOutput or tuple(torch.FloatTensor)
Parameters
- pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`) — The tensors corresponding to the input images. Pixel values can be obtained using CHMv2ImageProcessorFast. See `CHMv2ImageProcessorFast.__call__()` for details.
- labels (`torch.LongTensor` of shape `(batch_size, height, width)`, *optional*) — Ground truth depth estimation maps for computing the loss.
Returns
DepthEstimatorOutput or tuple(torch.FloatTensor)
A DepthEstimatorOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (CHMv2Config) and inputs.
The CHMv2ForDepthEstimation forward method, overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
- loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) — Classification (or regression if `config.num_labels==1`) loss.
- predicted_depth (`torch.FloatTensor` of shape `(batch_size, height, width)`) — Predicted depth for each pixel.
- hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape `(batch_size, num_channels, height, width)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, patch_size, sequence_length)`. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Example:
```python
>>> from transformers import AutoImageProcessor, CHMv2ForDepthEstimation
>>> import torch
>>> from PIL import Image
>>> import httpx
>>> from io import BytesIO

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> with httpx.stream("GET", url) as response:
...     image = Image.open(BytesIO(response.read())).convert("RGB")

>>> processor = AutoImageProcessor.from_pretrained("facebook/dinov3-vitl16-chmv2-dpt-head")
>>> model = CHMv2ForDepthEstimation.from_pretrained("facebook/dinov3-vitl16-chmv2-dpt-head")
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> model.to(device)

>>> # prepare image for the model
>>> inputs = processor(images=image, return_tensors="pt").to(device)
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # interpolate to original size
>>> post_processed_output = processor.post_process_depth_estimation(
...     outputs, [(image.height, image.width)],
... )
>>> predicted_depth = post_processed_output[0]["predicted_depth"]
```
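To inspect the result, the predicted canopy heights can be rescaled into an 8-bit preview image. This is a generic visualization sketch, not part of the CHMv2 API; the random array below stands in for `predicted_depth` converted to NumPy:

```python
import numpy as np
from PIL import Image

# Stand-in for predicted_depth.squeeze().cpu().numpy(), heights in metres
height_m = np.random.default_rng(0).uniform(0.0, 40.0, size=(100, 150)).astype(np.float32)

# Min-max scale to 0-255 for an 8-bit preview
lo, hi = float(height_m.min()), float(height_m.max())
scaled = (255.0 * (height_m - lo) / (hi - lo + 1e-8)).astype(np.uint8)
preview = Image.fromarray(scaled)  # note: PIL reports size as (width, height)
```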