NVIDIA Cosmos 3: World Foundation Model for Physical AI

What NVIDIA Cosmos 3 Is and Why Physical AI Needs It

NVIDIA Cosmos 3 is an open world foundation model for physical AI. It combines scene understanding, world prediction, and action generation in a single system, so robots, autonomous vehicles, and vision agents can reason about the real world and decide what to do next from multimodal inputs such as text, images, video, sound, and actions. Unlike language-centric foundation models, Cosmos 3 is built to answer three practical questions for machines: what is happening in this scene, what is likely to happen next, and which actions should be taken now? NVIDIA describes Cosmos 3 as a fully open “omnimodel” that natively processes and generates text, images, video, ambient sound, and action trajectories, with physics-aware behavior that supports advanced robot vision systems and autonomous vehicle AI. In effect, Cosmos 3 aims to be the shared world foundation model that physical AI devices have been missing.

NVIDIA Cosmos 3 Gives Physical AI a World Model Brain

Mixture-of-Transformers: A Two-Tower Brain for Scene Reasoning

At the core of NVIDIA Cosmos 3 is a mixture-of-transformers architecture that splits reasoning and generation into coordinated towers instead of relying on a single monolithic network. The reasoner tower is a vision-language model that interprets multimodal observations, using an autoregressive transformer to parse video, images, and text into an internal representation of motion, object interactions, and spatial-temporal context. The generator tower is an expert transformer paired with diffusion-based video synthesis that produces future frames and action sequences conditioned on the reasoner’s understanding. According to NVIDIA, this pairing “enables Cosmos 3 to understand object interactions, motion and spatial-temporal relationships before generating video and action trajectories.” By unifying these capabilities in one model, developers avoid stitching together separate world foundation models, physical AI models, and video generators, simplifying both training and deployment for demanding physical AI reasoning tasks.
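To make the two-tower data flow concrete, here is a toy sketch in Python. It is not NVIDIA's implementation or API: `SceneState`, `reason`, and `generate` are hypothetical names, and a constant-velocity rule stands in for both the autoregressive reasoner and the diffusion-based generator. It only shows the ordering described above, in which the scene is understood first and generation is then conditioned on that understanding.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class SceneState:
    """Hypothetical stand-in for the reasoner tower's internal representation."""
    positions: List[Point]   # last observed (x, y) for each tracked object
    velocities: List[Point]  # estimated (vx, vy) per frame for each object


def reason(frames: List[List[Point]]) -> SceneState:
    """Toy 'reasoner': estimate per-object motion from the last two frames."""
    prev, last = frames[-2], frames[-1]
    velocities = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(prev, last)]
    return SceneState(positions=list(last), velocities=velocities)


def generate(state: SceneState, horizon: int) -> List[List[Point]]:
    """Toy 'generator': roll the scene forward, conditioned on the reasoner output."""
    future = []
    for step in range(1, horizon + 1):
        frame = [
            (x + vx * step, y + vy * step)
            for (x, y), (vx, vy) in zip(state.positions, state.velocities)
        ]
        future.append(frame)
    return future


# Two observed frames, each listing the positions of two objects.
observed = [[(0.0, 0.0), (1.0, 1.0)], [(1.0, 0.0), (1.0, 2.0)]]
state = reason(observed)
predicted = generate(state, horizon=2)
print(predicted)
```

In the real model the intermediate representation is learned rather than hand-written, but the sketch keeps the same separation: the generator never looks at raw frames, only at what the reasoner extracted from them.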

From Perception to Action: How Cosmos 3 Supports Robots and AVs

Cosmos 3 is designed as a toolkit for the entire perception–prediction–action loop that powers robots and autonomous vehicle AI. It can serve as a vision-language model that answers questions about scenes, as a world model that simulates future states, and as a backbone for world action models that output control policies or action traces. Text, image, video, and action are supported as both inputs and outputs, enabling tasks such as physics-aware video prediction, action-conditioned video generation, and simulation of rare edge cases for safety validation. For robot vision systems, Cosmos 3 can interpret camera feeds and propose action sequences, while autonomous vehicles can use it to predict pedestrian motion or vehicle trajectories under different maneuvers. By embedding temporal dynamics and physical consistency into a single world foundation model, Cosmos 3 targets the technical gap that has kept many physical AI prototypes from moving into dependable deployment.
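The loop described above can be illustrated with a small planning sketch. Everything here is an assumption made for illustration: the one-dimensional world, the `predict_gap` and `choose_action` functions, and the numeric values are hypothetical, and a few lines of kinematics replace the learned world model. The point is the pattern of predicting the outcome of each candidate maneuver before committing to one.

```python
def predict_gap(gap, ego_speed, obstacle_speed, action, steps=5, dt=0.5):
    """Toy world model: return the smallest gap (m) to an obstacle ahead
    over the rollout horizon, given a candidate action."""
    decel = {"keep": 0.0, "brake": 4.0}[action]  # m/s^2, illustrative values
    min_gap = gap
    for _ in range(steps):
        ego_speed = max(0.0, ego_speed - decel * dt)  # apply the maneuver
        gap -= (ego_speed - obstacle_speed) * dt      # gap closes at relative speed
        min_gap = min(min_gap, gap)
    return min_gap


def choose_action(gap, ego_speed, obstacle_speed, safe_gap=2.0):
    """Pick the least disruptive action whose predicted rollout stays safe."""
    for action in ("keep", "brake"):
        if predict_gap(gap, ego_speed, obstacle_speed, action) >= safe_gap:
            return action
    return "brake"  # no candidate is predicted safe: fall back to braking


# Perception supplies the current state; prediction scores each maneuver.
print(choose_action(gap=50.0, ego_speed=10.0, obstacle_speed=0.0))
print(choose_action(gap=15.0, ego_speed=10.0, obstacle_speed=0.0))
```

A learned world model would replace `predict_gap` with action-conditioned prediction over full scenes, but the surrounding control logic keeps this shape: enumerate candidate actions, simulate, then select.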

Open Models, OpenMDW, and Faster Integration into Physical AI Stacks

To make physical AI models more practical to adopt, NVIDIA is releasing Cosmos 3 as open checkpoints, code, datasets, and deployment tools. Two variants are available today: Cosmos 3 Nano, a 16B-parameter model suited to workstation-grade GPUs for real-time robotics inference, and Cosmos 3 Super, a 64B-parameter model aimed at datacenter-scale synthetic data generation and advanced reasoning. Packaging follows the OpenMDW-1.1 framework, which places model weights, architecture, documentation, datasets, and benchmarks under a single model-centric license covering training, modification, contribution, redistribution, and deployment. Developers can access Cosmos 3 through build.nvidia.com, open repositories, and Cosmos NIM microservices optimized for NVIDIA GPUs. This unified distribution lowers friction for robotics and autonomous vehicle teams that need to integrate a world foundation model into existing pipelines rather than stitching together fragmented simulation and perception stacks.
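A rough back-of-the-envelope calculation helps explain why the two sizes target different hardware. The sketch below estimates weight memory only, as parameter count times bytes per parameter. The precisions listed are assumptions for illustration, since the article does not state which numeric formats the released checkpoints use, and real deployments also need memory for activations and other runtime state.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Parameter counts from the article; precisions are hypothetical choices.
for name, params in (("Cosmos 3 Nano", 16e9), ("Cosmos 3 Super", 64e9)):
    for precision in ("bf16", "fp8"):
        print(f"{name} @ {precision}: {weight_memory_gb(params, precision):.0f} GB")
```

Under these assumptions, a 16B-parameter model needs about 32 GB for weights at 16-bit precision, while a 64B-parameter model needs about 128 GB, which is more than a single typical workstation GPU holds.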

A Shift from General-Purpose AI to Specialized Physical AI Systems

Cosmos 3 marks a shift from general-purpose, chat-oriented AI toward specialized physical AI systems built to handle spatial and temporal dynamics. Trained on billions of multimodal samples spanning text, images, video, sound, and action trajectories, it provides a pretrained base that can be adapted to domains such as warehouse safety monitoring, manipulation, or complex urban driving with less data and lower training cost than building a world model from scratch. NVIDIA has also formed the Cosmos Coalition with partners such as Agile Robots, Black Forest Labs, Generalist, LTX, Runway, and Skild AI to accelerate the development of open world models. As Jensen Huang puts it, “The Cosmos 3 family of open, frontier omnimodels gives developers a generational leap in ability to build robots, autonomous vehicles and vision AI that perceive, reason, plan and act in the physical world.”