Cosmos 3: Physical AI Foundation Model for Robots

What Cosmos 3 Is and Why It Matters for Physical AI

Nvidia Cosmos 3 is an open physical AI foundation model that helps robots, autonomous vehicles, and vision systems understand real-world scenes, predict what will happen next, and generate suitable actions for their bodies and tasks within a single unified architecture. Unlike task-specific models, Cosmos 3 is designed as an “omnimodel” that combines autonomous vehicle perception, world simulation, and action generation across text, images, video, ambient sound, and control signals. This means a single model can support robot world modeling for warehouses, driving scenarios, or smart spaces without custom pipelines for each domain. Nvidia positions Cosmos 3 as a bridge between perception and control: a system that first reasons about what sensors see, then plans how machines should respond. For developers, that turns physical AI from a research project into something closer to reusable engineering infrastructure.

Nvidia Cosmos 3 Teaches Robots to Understand the World Before They Act

Mixture-of-Transformers: One Model for Seeing, Predicting and Acting

Cosmos 3 is built on a mixture-of-transformers architecture that merges reasoning and generation into a single physical AI foundation model. It has a “reasoner tower” that acts as a vision-language model, interpreting multimodal inputs like images, video, and text to infer motion, object interactions, and physical context. A separate “generator tower” then uses diffusion-based methods to create future video frames and action sequences, conditioned on that understanding. This design removes the need to orchestrate separate perception, prediction, and control models. Cosmos 3 can, for example, take a camera feed from an autonomous vehicle, infer traffic dynamics, simulate rare edge cases, and output future frames or policy actions for training. According to Nvidia, this unified approach shortens physical AI training and evaluation cycles from months to days by providing ready-made reasoning and world modeling capabilities.

From World Models to Actions: Practical Uses in Robotics and Vehicles

By combining scene understanding, world modeling, and action generation, Cosmos 3 is aimed at real deployments in foundation model robotics and large-scale vision systems. Robots can use it as a world action model: ingesting video and task descriptions, then generating physics-aware action sequences for manipulation, navigation, or safety checks. For autonomous vehicle perception, Cosmos 3 can generate synthetic driving videos, predict future scenes, or act as a policy model for training driving agents in rare or dangerous conditions. The model supports flexible input-output combinations: text-to-video for world simulation, video-to-action for control policies, and video-to-video for prediction. This breadth allows teams to reuse the same foundation for tasks like warehouse monitoring, human motion analysis, and spatial reasoning, instead of building separate stacks for each scenario. The result is a tighter loop between simulated worlds and real robot behavior.

Open Models, Datasets and Tools for Custom Physical AI

Nvidia is releasing Cosmos 3 as an open world foundation model, including model checkpoints, training scripts, deployment tools, and datasets. Two main sizes are available: Cosmos 3 Nano with 16B parameters for efficient inference on workstation-class GPUs, and Cosmos 3 Super with 64B parameters for high-quality reasoning and large-scale synthetic data generation in datacenters. Alongside the models, Nvidia is open-sourcing six synthetic datasets for robotics, physics simulation, spatial reasoning, human motion, driving, and warehouses. These datasets can be used to post-train Cosmos 3 for domain-specific robot world modeling or autonomous vehicle perception tasks. Packaging via OpenMDW-1.1 and Cosmos NIM microservices gives developers a single legal and deployment framework to train, modify, redistribute, and serve models. This foundation model approach lets teams start from pre-trained physical reasoning rather than building stack after stack from scratch.

A Faster Path from Research to Deployable Physical AI

Cosmos 3 is intended to compress the journey from experimental world models to deployable robots and autonomous systems. Instead of spending months collecting data, training separate perception and policy networks, and wiring them by hand, teams can adapt a ready-made physical AI foundation model to their environment and embodiment. Open post-training scripts make it possible to fine-tune on new robot platforms, sensor rigs, or safety constraints, while synthetic data generation fills gaps for rare events. Nvidia has also launched the Cosmos Coalition, bringing together AI labs and robotics companies to push shared world model benchmarks and tools. For developers, this ecosystem means they can focus on task design and integration, not low-level model construction. The long-term promise is that physical AI systems will understand the world before they act, and do so with models that are transparent, customizable, and reusable.