Cosmos-Predict2.5: A Suite of Diffusion-based World Foundation Models
Cosmos | Code | White Paper | Website
NVIDIA Cosmos™ is a platform of state-of-the-art generative world foundation models, advanced tokenizers, guardrails, and an accelerated data processing and curation pipeline, purpose-built to accelerate the development of physical AI systems, such as autonomous vehicles (AVs) and robots.
Model Overview
Description
Cosmos-Predict2.5: A family of highly performant pre-trained world foundation models purpose-built for generating physics-aware images, videos and world states for physical AI development.
Cosmos-Predict2.5 is a collection of diffusion-based world foundation models that generate dynamic, high-quality images and videos from text, image, or video inputs. These models can serve as building blocks for applications and research related to world generation.
This model is ready for commercial/non-commercial use.
Model Developer: NVIDIA
Model Versions
The Cosmos-Predict2.5 diffusion-based model family includes the following models:
Cosmos-Predict2.5-14B/Pre-trained
- Given a text description, an image as the first frame, and/or a video, predict the future frames.
- Produces 720P video at 16 FPS
Cosmos-Predict2.5-14B/Post-trained
- Given a text description, an image as the first frame, and/or a video, predict the future frames.
- Produces 720P video at 16 FPS
License
This model is released under the NVIDIA Open Model License. Additional Information: Apache License 2.0.
For a custom license, please contact [email protected].
Under the NVIDIA Open Model License, NVIDIA confirms:
- Models are commercially usable.
- You are free to create and distribute Derivative Models.
- NVIDIA does not claim ownership to any outputs generated using the Models or Derivative Models.
Important Note: If you bypass, disable, reduce the efficacy of, or circumvent any technical limitation, safety guardrail or associated safety guardrail hyperparameter, encryption, security, digital rights management, or authentication mechanism contained in the Model, your rights under NVIDIA Open Model License Agreement will automatically terminate.
Deployment Geography:
Global
Use Case:
Physical AI: encompassing robotics, autonomous vehicles (AV), and more.
Release Date:
GitHub [12/04/2025] via https://github.com/nvidia-cosmos/cosmos-predict2.5
Hugging Face [12/04/2025] via https://huggingface.co/collections/nvidia/cosmos-predict25
Model Architecture
Cosmos-Predict2.5-14B is a diffusion transformer model designed for video denoising in the latent space. The network is composed of interleaved self-attention, cross-attention, and feedforward layers as its building blocks. The cross-attention layers allow the model to condition on the input text throughout the denoising process. Before each layer, adaptive layer normalization is applied to embed the timestep information for denoising. When an image or video is provided as input, its latent frames are concatenated with the generated frames along the temporal dimension, and augmentation noise is added to the conditional latent frames to bridge the gap between training and inference. An illustrative sketch of the block structure follows below.
This model was developed based on: Cosmos-Predict2-14B
Number of model parameters: 14,368,048,004
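The block structure described above can be sketched in PyTorch. This is a minimal illustration, not the released implementation: the hidden size, number of heads, text-embedding dimension, and the exact adaptive layer normalization parameterization are all assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Illustrative diffusion-transformer block: self-attention over video latent
    tokens, cross-attention to text embeddings, and a feedforward layer, each
    preceded by adaptive layer normalization driven by the timestep embedding."""

    def __init__(self, dim: int = 2048, heads: int = 16, text_dim: int = 1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim, vdim=text_dim, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norms = nn.ModuleList(nn.LayerNorm(dim, elementwise_affine=False) for _ in range(3))
        # Adaptive LayerNorm: predict a (scale, shift) pair per sub-layer from the timestep embedding.
        self.ada = nn.Linear(dim, 6 * dim)

    def forward(self, x: torch.Tensor, text_emb: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) video latent tokens; text_emb: (B, T, text_dim); t_emb: (B, dim)
        params = self.ada(t_emb).chunk(6, dim=-1)
        for i, layer in enumerate((self.self_attn, self.cross_attn, self.ffn)):
            scale, shift = params[2 * i], params[2 * i + 1]
            h = self.norms[i](x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
            if layer is self.self_attn:
                h, _ = layer(h, h, h)                  # self-attention over latent tokens
            elif layer is self.cross_attn:
                h, _ = layer(h, text_emb, text_emb)    # condition on the text embedding
            else:
                h = layer(h)                           # feedforward
            x = x + h                                  # residual connection
        return x

# Example with placeholder shapes: 2 sequences of 1024 latent tokens, 77 text tokens.
block = DiTBlock()
out = block(torch.randn(2, 1024, 2048), torch.randn(2, 77, 1024), torch.randn(2, 2048))
```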
Input/Output Specifications
Input
- Input Type(s): Text+Image, Text+Video
- Input Format(s):
- Text: String
- Image: jpg, png, jpeg, webp
- Video: mp4
- Input Parameters:
- Text: One-dimensional (1D)
- Image: Two-dimensional (2D)
- Video: Three-dimensional (3D)
- Other Properties Related to Input:
- The input string should contain fewer than 300 words and should provide descriptive content for world generation, such as a scene description, key objects or characters, background, and any specific actions or motions to be depicted within the 5-second duration.
- For the 720P model, the input image should be 1280×704; for the 480P model, use 832×480.
- The input video should consist of 5 frames, each with a resolution of 1280×704 for the 720P model, or 832×480 for the 480P model. A minimal input-preparation sketch follows this list.
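The sketch below checks a prompt against the word limit and resizes a conditioning image to the expected resolution, assuming Pillow is available. The function name, the simple resize strategy, and the file handling are illustrative assumptions rather than the repository's actual preprocessing code.

```python
from PIL import Image

MAX_PROMPT_WORDS = 300
TARGET_720P = (1280, 704)  # width x height for the 720P model
TARGET_480P = (832, 480)   # width x height for the 480P model

def prepare_inputs(prompt: str, image_path: str, resolution: str = "720P") -> Image.Image:
    """Check the prompt length and resize a conditioning image to the expected resolution."""
    if len(prompt.split()) >= MAX_PROMPT_WORDS:
        raise ValueError("Prompt should contain fewer than 300 words.")
    target = TARGET_720P if resolution == "720P" else TARGET_480P
    image = Image.open(image_path).convert("RGB")
    if image.size != target:
        # Simple resize; an aspect-preserving crop or pad may be preferable in practice.
        image = image.resize(target, Image.LANCZOS)
    return image

# Example:
# first_frame = prepare_inputs("A robot arm stacks red blocks on a table.", "frame.png")
```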
Output
- Output Type(s): Video
- Output Format(s): mp4
- Output Parameters: Three-dimensional (3D)
- Other Properties Related to Output: The generated video is a 5-second clip, with resolution and frame rate determined by the model variant used. For example, the 720P 16FPS model produces a video with a resolution of 1280×704 and a frame rate of 16 FPS.
The video content visualizes the input text description as a short animated scene, capturing key elements within the specified time constraints.
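As a simple illustration of consuming this output, the sketch below writes a stack of RGB frames to an mp4 at 16 FPS. The frame array is a placeholder, and the use of imageio with its ffmpeg backend is an assumption, not part of the released tooling.

```python
import numpy as np
import imageio.v2 as imageio  # requires the imageio-ffmpeg backend for mp4 output

# Placeholder standing in for frames returned by the model: (num_frames, height, width, 3) uint8 RGB.
frames = np.zeros((80, 704, 1280, 3), dtype=np.uint8)

with imageio.get_writer("generated_world.mp4", fps=16) as writer:
    for frame in frames:
        writer.append_data(frame)
```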
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper
Note: Only BF16 precision is tested. Other precisions like FP16 or FP32 are not officially supported.
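Since only BF16 is tested, a minimal sketch of running a PyTorch module under bfloat16 is shown below; the module here is a stand-in placeholder, not the actual Cosmos-Predict2.5 loading API.

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the actual world foundation model; requires a CUDA-capable GPU.
model = nn.Linear(16, 16).to(device="cuda", dtype=torch.bfloat16)

x = torch.randn(1, 16, device="cuda", dtype=torch.bfloat16)
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)
print(y.dtype)  # torch.bfloat16
```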
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Training Dataset:
Data Modality
- [Image]
- [Text]
- [Video]
Data Collection Method by dataset
- [Automated]
Labeling Method by dataset
- [Hybrid: Human, Automated]
Testing Dataset:
Data Collection Method by dataset
- [Automated]
Labeling Method by dataset
- [Hybrid: Human, Automated]
Evaluation
Please see our technical paper for detailed evaluations of the base model.
Data Collection Method:
- Automated
Labeling Method:
- Hybrid: Human, Automated
System Requirements and Performance
The inference time for a single generation across different NVIDIA GPU hardware will be published soon.
Operating System(s):
- Linux (We have not tested on other operating systems.)
Usage
- See Cosmos-Predict2.5 for details.
Limitations
Despite various improvements in world generation for Physical AI, Cosmos-Predict2.5 video2world models still face technical and application limitations for world prediction. In particular, they struggle to generate long, high-resolution videos without artifacts. Common issues include temporal inconsistency, camera and object motion instability, and imprecise interactions. The models may inaccurately represent 3D space, 4D space-time, or physical laws in the generated videos, leading to artifacts such as disappearing or morphing objects, unrealistic interactions, and implausible motions. As a result, applying these models to applications that require simulating environments grounded in physical laws or complex multi-agent dynamics remains challenging.
Inference:
Acceleration Engine: PyTorch, Transformer Engine
Test Hardware: H100, A100, B200
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.
For more detailed information on ethical considerations for this model, please see the subcards of Explainability, Bias, Safety & Security, and Privacy below. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
Plus Plus (++) Promise
We value you, the datasets, the diversity they represent, and what we have been entrusted with. This model and its associated data have been:
- Verified to comply with current applicable disclosure laws, regulations, and industry standards.
- Verified to comply with applicable privacy labeling requirements.
- Annotated to describe the collector/source (NVIDIA or a third-party).
- Characterized for technical limitations.
- Reviewed to ensure proper disclosure is accessible to, maintained for, and in compliance with NVIDIA data subjects and their requests.
- Reviewed before release.
- Tagged for known restrictions and potential safety implications.
Bias
| Field | Response |
|---|---|
| Participation considerations from adversely impacted groups (protected classes) in model design and testing: | None |
| Measures taken to mitigate against unwanted bias: | None |
Explainability
| Field | Response |
|---|---|
| Intended Application & Domain: | World Generation |
| Model Type: | Transformer |
| Intended Users: | Physical AI developers |
| Output: | Videos |
| Describe how the model works: | Generates videos based on video and text inputs |
| Technical Limitations: | The model may not follow the video or text input accurately in challenging cases, where the input video shows complex scene composition and temporal dynamics. Examples of challenging scenes include: fast camera movements, overlapping human-object interactions, low lighting with high motion blur, and multiple people performing different actions simultaneously. |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
| Performance Metrics: | Quantitative and Qualitative Evaluation. We evaluate on PAI-Bench’s predict task and report two main scores: the Domain Score, which measures performance on domain-specific physical AI tasks, and the Quality Score, which reflects the quality of generated videos. The Quality Score is derived from eight text-to-video and image-to-video metrics adapted from VBench. In contrast, the Domain Score is obtained through VQA-based evaluation across seven domains: av, common, human, industry, misc, physics, and robotics. The final PAI-Bench Overall Score is computed as the average of the Quality and Domain scores. |
| Potential Known Risks: | The model's output can generate all forms of videos, including what may be considered toxic, offensive, or indecent. |
| Licensing: | NVIDIA Open Model License. Additional Information: Apache License 2.0. |
Privacy
| Field | Response |
|---|---|
| Generatable or reverse engineerable personal data? | No |
| Personal data used to create this model? | None Known |
| Was consent obtained for any personal data used? | None Known |
| How often is dataset reviewed? | Before Release |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data. |
| Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/ |
Safety
| Field | Response |
|---|---|
| Model Application(s): | World Generation |
| Describe the life critical impact (if present). | None Known |
| Use Case Restrictions: | NVIDIA Open Model License. Additional Information: Apache License 2.0. |
| Model and dataset restrictions: | The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Dataset access is restricted during training, and dataset license constraints are adhered to. Model checkpoints are made available on Hugging Face and may become available on cloud providers' model catalogs. |