Cosmos 3: Physical AI World Model for Robotics

What Cosmos 3 Is and Why It Matters for Physical AI

NVIDIA Cosmos 3 is an open world foundation model for physical AI. It combines scene reasoning, world simulation, and action generation in a single architecture so that robots and autonomous systems can understand, predict, and act in complex real-world environments. Physical AI models for robots, autonomous vehicles, and smart spaces have long struggled with the gap between perception and action: systems can label objects but often lack a coherent model of how those objects move and interact over time. Cosmos 3 is designed to close that gap. It reasons over multimodal inputs such as images, video, and text, predicts what is likely to happen next, and then proposes actions tailored to specific embodiments and tasks. For developers building robotics stacks on foundation models, this makes world modeling a first-class capability instead of a fragile patchwork of separate systems.

Mixture-of-Transformers: Unifying Reasoning, World Modeling, and Action

At the core of Cosmos 3 is a mixture-of-transformers architecture split into two cooperating towers, one for reasoning and one for generation. The reasoner tower is an autoregressive vision-language model that interprets multimodal observations (images, video, and text) to understand motion, object interactions, and physical context. It acts as the system’s “brain” for scene reasoning. The generator tower then uses that understanding to drive diffusion-based generation of future observations and action sequences, so physics-aware video prediction and control policies come from one model. The result is a single system that covers scene reasoning, robot world modeling, and action generation without complex orchestration between multiple networks and pipelines. For autonomous vehicles and advanced robotics, that unified flow from perception to predicted futures to actions matters for reliable, real-time decision-making in open environments.
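
The two-tower flow described above can be illustrated with a deliberately small toy. This is a sketch under loud assumptions, not Cosmos 3's API or its actual math: the function names reason and generate, the running-mean "context", and the fixed-rate pull toward a target are all hypothetical stand-ins. The only point is the shape of the data flow: a sequential (autoregressive-style) pass builds a context, and an iterative refinement loop that starts from noise is conditioned on that context.

```python
import random
from typing import List


def reason(observation: List[float]) -> List[float]:
    """Toy stand-in for the reasoner tower (hypothetical).

    Builds a context one element at a time, where each element depends on
    everything seen so far (here: a running mean of the observation).
    """
    context: List[float] = []
    running = 0.0
    for i, x in enumerate(observation, start=1):
        running += x
        context.append(running / i)
    return context


def generate(context: List[float], steps: int = 50, rate: float = 0.2,
             size: int = 4, seed: int = 0) -> List[float]:
    """Toy stand-in for the generator tower (hypothetical).

    Starts from Gaussian noise and refines it over several steps toward a
    target derived from the context. A real diffusion model would use a
    learned denoiser; the fixed-rate update here only mimics the loop shape.
    """
    rng = random.Random(seed)
    target = context[-1]  # condition on the reasoner's latest summary
    sample = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for _ in range(steps):
        sample = [s + rate * (target - s) for s in sample]
    return sample


context = reason([1.0, 2.0, 3.0])  # sequential pass over the observation
future = generate(context)         # iterative refinement, conditioned on context
```

In this toy, changing the observation changes the context, which in turn changes what the refinement loop converges to. That dependency is the property the unified design relies on.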

From Perception to Action: Closing the Loop for Robots and Autonomous Vehicles

Cosmos 3 is built specifically for physical AI applications, where an agent must understand what is happening before deciding how to move. Supported input and output combinations cover text, images, video, ambient sound, and actions, so the same model can act as a visual reasoner, a world simulator, and a policy generator. Robots can use Cosmos 3 as a world model to predict how objects will behave under different actions, while autonomous vehicle systems can simulate rare driving edge cases and evaluate planned maneuvers against realistic scene dynamics. According to NVIDIA, Cosmos 3 “powers perception, prediction and action,” which moves physical AI from research demos toward deployable engineering software. For Level 4 autonomy and smart spaces, this ability to reason about physical consequences before acting is a key safety and performance requirement.
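
Evaluating planned maneuvers against predicted dynamics follows a general pattern: roll each candidate action forward through a world model, score the predicted trajectory, and pick the best one. The sketch below shows that pattern with a hand-written one-dimensional model. Everything in it is an assumption for illustration (toy_world_model, rollout, score, choose_action, the safety margin, the time step); a learned world model such as Cosmos 3 would take the place of toy_world_model, and its real interface is not shown here.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]  # (position in metres, speed in metres/second)


def toy_world_model(state: State, accel: float, dt: float = 0.1) -> State:
    """Hypothetical stand-in for a learned world model: simple 1-D kinematics."""
    pos, speed = state
    speed = max(0.0, speed + accel * dt)  # the vehicle never reverses
    return (pos + speed * dt, speed)


def rollout(state: State, accel: float, horizon: int,
            model: Callable[[State, float], State]) -> List[State]:
    """Predict the next `horizon` states under a constant acceleration."""
    trajectory: List[State] = []
    for _ in range(horizon):
        state = model(state, accel)
        trajectory.append(state)
    return trajectory


def score(trajectory: Sequence[State], obstacle_pos: float,
          margin: float = 1.0) -> float:
    """Reject any trajectory that enters the safety margin; otherwise reward progress."""
    if any(pos >= obstacle_pos - margin for pos, _ in trajectory):
        return float("-inf")
    return trajectory[-1][0]


def choose_action(state: State, candidates: Sequence[float], obstacle_pos: float,
                  horizon: int = 30,
                  model: Callable[[State, float], State] = toy_world_model) -> float:
    """Return the candidate acceleration whose predicted rollout scores highest."""
    return max(candidates,
               key=lambda a: score(rollout(state, a, horizon, model), obstacle_pos))


# An obstacle 10 m ahead while travelling at 5 m/s: only braking stays safe.
near = choose_action((0.0, 5.0), [-3.0, 0.0, 2.0], obstacle_pos=10.0)
# The same candidates with the obstacle 100 m away: accelerating makes the most progress.
far = choose_action((0.0, 5.0), [-3.0, 0.0, 2.0], obstacle_pos=100.0)
```

The design choice worth noting is that the planner never acts in the real world to learn that a maneuver is unsafe. Unsafe candidates are rejected on predicted trajectories alone, which is what makes the fidelity of the world model a safety question.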

Open Models, OpenMDW, and Custom Physical AI Stacks

Cosmos 3 is positioned as a fully open omnimodel, with Nano and Super checkpoints on public hubs, training code on open repositories, and datasets for robotics, spatial reasoning, human motion, driving, and warehouse scenes. NVIDIA is also aligning the release with OpenMDW-1.1, a Linux Foundation framework that gives developers a single model-centric license covering weights, architecture, documentation, benchmarks, and data. This structure lets teams train, modify, contribute, redistribute, and deploy without juggling multiple legal bundles or vendor-specific packaging. For robotics and autonomous system builders, the open approach reduces vendor lock-in while still supporting optimized deployment via Cosmos NIM microservices on NVIDIA GPUs. Teams can fine-tune Cosmos 3 to their own robot embodiments and environments, turning a general world foundation model into domain-specific physical AI models that match their safety cases and operational constraints.

Toward Level 4 Autonomy and Smart Spaces with World Foundation Models

By reducing physical AI training and evaluation cycles from months to days, Cosmos 3 aims to speed progress toward Level 4 autonomous systems and highly instrumented smart spaces. Its open datasets for warehouse safety, manipulation, and driving give developers material for synthetic data generation and scenario coverage that would be costly or dangerous to collect in the real world. Cosmos 3 Nano targets real-time inference on workstation-class GPUs for on-robot reasoning and control, while Cosmos 3 Super is aimed at datacenter-scale simulation and advanced world modeling workloads. The launch of the NVIDIA Cosmos Coalition, which brings together world model builders and robotics companies, signals an effort to make the development of robotics foundation models more collaborative. For teams building autonomous vehicles and large vision systems, Cosmos 3 offers a shared, open starting point to experiment with world-aware policies before deployment.