Cosmos 3: Physical AI Foundation Model Explained

What Cosmos 3 Is and Why Physical AI Needs It

NVIDIA Cosmos 3 is an open physical AI foundation model that unifies scene understanding, world prediction, and action generation in a single architecture, so robots and autonomous systems can perceive, reason, and act in changing real-world environments. Physical AI systems need a reliable world model before they can safely move through traffic, grasp objects, or monitor warehouses. Previous pipelines often stitched together separate perception, simulation, and policy models, which produced brittle integrations and long training cycles. Cosmos 3 targets this gap with a shared world model that ties perception and control into one model family. NVIDIA describes Cosmos 3 as powering “perception, prediction and action,” which positions robot training models as infrastructure, much as language models have become infrastructure for software. For teams building robots, autonomous vehicles, and smart spaces, the model is meant to be a starting point, so they do not have to build a bespoke system from scratch.

NVIDIA Cosmos 3 Gives Physical AI a Shared World Model

Mixture-of-Transformers: Reasoner and Generator in One System

Cosmos 3 is built on a mixture-of-transformers architecture organized into two tightly coupled towers: a reasoner and a generator. The reasoner tower is a vision-language model that ingests images, video, and text, and uses an autoregressive transformer to parse motion, object interactions, and physical context. It acts as the system’s “brain,” forming a latent world state before any content is produced. The generator tower then takes this state and produces future observations and action sequences through a diffusion-based process designed to stay consistent with physics. Because the generator always calls the reasoner, every generated frame or action sequence is grounded in an interpreted scene. As a result, one model can support edge-case video generation for driving, predictive world modeling, and policy generation for robot learning, without orchestrating multiple separate networks and inference pipelines.
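The data flow described above (reasoner first, generator second) can be illustrated with a toy sketch. Everything in it is hypothetical: `WorldState`, `ToyReasoner`, and `ToyGenerator` are invented names, the "encoding" is a simple average, and the refinement loop is only a stand-in for a denoising process. It is not Cosmos 3 code and does not reflect NVIDIA's API; it shows only the ordering constraint that generation is always conditioned on an interpreted scene.

```python
import random
from dataclasses import dataclass


@dataclass
class WorldState:
    """Hypothetical latent summary of a scene produced by the reasoner."""
    features: list[float]


class ToyReasoner:
    """Stand-in for the vision-language reasoner tower (assumption)."""

    def encode(self, frames: list[list[float]], prompt: str) -> WorldState:
        # Average each feature dimension across frames. A real reasoner
        # would run an autoregressive transformer over tokens instead.
        dims = len(frames[0])
        mean = [sum(f[d] for f in frames) / len(frames) for d in range(dims)]
        # Fold the prompt in as one crude extra feature: its word count.
        return WorldState(features=mean + [float(len(prompt.split()))])


class ToyGenerator:
    """Stand-in for the diffusion-based generator tower (assumption)."""

    def __init__(self, reasoner: ToyReasoner, steps: int = 20, seed: int = 0):
        self.reasoner = reasoner
        self.steps = steps
        self.rng = random.Random(seed)

    def generate(self, frames: list[list[float]], prompt: str) -> list[float]:
        # The generator always obtains a world state from the reasoner first.
        state = self.reasoner.encode(frames, prompt)
        # Start from Gaussian noise and refine step by step toward the
        # conditioning signal. This mimics the shape of a denoising loop,
        # not the mathematics of a real diffusion model.
        sample = [self.rng.gauss(0.0, 1.0) for _ in state.features]
        for _ in range(self.steps):
            sample = [s + 0.5 * (c - s) for s, c in zip(sample, state.features)]
        return sample


generator = ToyGenerator(ToyReasoner())
output = generator.generate([[0.0, 2.0], [2.0, 4.0]], "pick up the cup")
print([round(x, 3) for x in output])
```

The design point is that `generate` has no code path that skips `encode`, which is the toy analogue of every generated frame or action being grounded in the reasoner's interpretation.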

From Scene Reasoning to Robot and Vehicle Actions

Cosmos 3 is designed to move from perception to decision to control inside one world-modeling stack. It accepts text, images, and video as inputs and can output text, videos, and action sequences, which matches how physical AI systems sense and act. For autonomous vehicles, it can generate physics-aware driving scenarios, including rare edge cases, informed by the reasoner tower’s understanding of traffic dynamics. In robotics, it can serve as a world action model or policy model, turning camera feeds and task descriptions into plausible video rollouts and candidate robot actions. According to NVIDIA, Cosmos 3 is an omnimodel that can “natively understand and generate text, images, video, ambient sound and actions,” which helps compress traditional training and evaluation cycles. Instead of fabricating synthetic scenes with one tool and learning control policies with another, teams can iterate within a single, scene-aware model.
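One way to picture how predicted rollouts and candidate actions fit together is a small planning loop. The source does not say how candidates are ranked or selected, so the sketch below assumes a simple random-shooting scheme: sample action sequences, roll each one forward through a world model, and keep the sequence whose predicted outcome is closest to the goal. The function `toy_world_model` and its 1-D dynamics are hypothetical placeholders for a learned model.

```python
import random


def toy_world_model(position: float, actions: list[float]) -> list[float]:
    """Hypothetical stand-in for a learned world model: predicts the
    positions that follow from applying each action in turn."""
    rollout = []
    for action in actions:
        position += action  # toy dynamics: each action is a displacement
        rollout.append(position)
    return rollout


def plan(position: float, goal: float, horizon: int = 5,
         candidates: int = 200, seed: int = 0) -> tuple[list[float], float]:
    """Random shooting: sample action sequences, roll each out, and keep
    the one whose predicted final position lands closest to the goal."""
    rng = random.Random(seed)
    best_actions: list[float] = []
    best_cost = float("inf")
    for _ in range(candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        final = toy_world_model(position, actions)[-1]
        cost = abs(final - goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


actions, cost = plan(position=0.0, goal=2.0)
print(f"best predicted distance to goal: {cost:.3f}")
```

In a real system the rollout would be predicted video or latent states and the cost would come from a task-specific evaluator; the loop structure is the only part this sketch is meant to convey.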

Open Foundation Model and OpenMDW Packaging for Developers

Rather than a closed, monolithic stack, Cosmos 3 is released as an open physical AI foundation model with checkpoints, code, and datasets available for adaptation. Cosmos 3 Nano, at 16B parameters, is tuned for workstation-class GPUs and real-time robotics inference, while Cosmos 3 Super, at 64B parameters, targets datacenter workloads and high-end synthetic data generation. Both are part of a broader open release: NVIDIA is publishing training scripts, deployment tools, and physical AI world-model datasets on platforms such as Hugging Face and GitHub so teams can post-train or specialize the model for their own robots and environments. At the packaging level, Cosmos 3 supports OpenMDW-1.1, giving developers a single, model-centric license that keeps weights, code, documentation, and data under one legal structure instead of scattering them across multiple licenses.
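The split between a workstation-class and a datacenter-class model becomes concrete with a rough estimate of weight memory. The sketch below multiplies the parameter counts quoted above by the bytes per parameter of a few common numeric formats. The source does not state which precision the checkpoints ship in, so the formats listed are assumptions, and the figures cover weights only (no activations, caches, or runtime overhead).

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}

# Parameter counts, in billions, as quoted in the text.
MODELS = {"Cosmos 3 Nano": 16, "Cosmos 3 Super": 64}


def weight_memory_gb(params_billions: float, dtype: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    simplifies to params_billions * bytes_per_param.
    """
    return params_billions * BYTES_PER_PARAM[dtype]


for name, size in MODELS.items():
    for dtype in BYTES_PER_PARAM:
        print(f"{name} @ {dtype}: {weight_memory_gb(size, dtype):.0f} GB")
```

At 2 bytes per parameter, for example, the 16B model needs about 32 GB for weights and the 64B model about 128 GB, a fourfold gap in weight memory between the two variants.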

Ecosystem Impact: A Common World Model for Physical AI

The Cosmos 3 launch is framed as part of a wider push toward shared world models for physical AI. NVIDIA introduced the NVIDIA Cosmos Coalition, which brings together AI labs and robotics-focused companies including Agile Robots, Black Forest Labs, Generalist, LTX, Runway, and Skild AI to advance open world modeling. The coalition aims to standardize how robot training models are built, evaluated, and integrated into production systems. Cosmos 3’s open datasets span robotic manipulation, physics simulations, spatial reasoning, human motion, driving, and warehouse setups, giving developers concrete starting points for their own domains. With Cosmos NIM microservices, teams can deploy the model on NVIDIA GPUs without rewriting their infrastructure. The result is a more accessible path for physical AI developers who want a shared, extensible world model that supports scene reasoning, autonomous behavior, and long-term policy learning across many embodiments.