NVIDIA Cosmos 3 and the Future of Physical AI

What NVIDIA Cosmos 3 Is and Why Physical AI Needs It

NVIDIA Cosmos 3 is an open world foundation model for physical AI that unifies physical reasoning, world simulation, and action generation so robots and autonomous systems can understand, predict, and safely act in real environments. Physical AI systems cannot rely on blind trial and error: a robot arm, delivery bot, or warehouse camera must first interpret what it sees, infer how objects behave, and anticipate what might happen next. Cosmos 3 is designed as a single omnimodel that natively handles text, images, video, ambient sound, and action signals, giving machines a consistent understanding of the world before they move. By combining reasoning with generation, NVIDIA aims to shorten training and evaluation cycles for physical AI from months to days while keeping physical accuracy high. This makes it a potential core building block for robot perception, autonomous vehicles, and smart spaces.

Inside the Mixture-of-Transformers: Reasoner and Generator Towers

Cosmos 3 is built on a mixture-of-transformers architecture that combines two tightly linked towers: a reasoner and a generator. The reasoner tower is a vision-language model that processes multimodal input such as images, videos, and text in an autoregressive fashion, extracting motion cues, object interactions, and physical context. In other words, it acts as the system’s "brain" for world understanding before any output is produced. The generator tower then takes this structured understanding and produces future observations and action sequences using a diffusion-based process that respects physics. When developers call the generator, both towers activate so generation stays grounded in the reasoner’s interpretation; the reasoner, however, can run by itself for pure analysis tasks. This unified design removes the need to stitch together separate perception and simulation models, which simplifies deployment of physical AI models in robots, vehicles, and simulation pipelines.
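The activation rule described above (the reasoner can run alone, while generation always engages both towers) can be illustrated with a small toy dispatcher. This is only a sketch of the control flow: `Task`, `SceneUnderstanding`, `run_reasoner`, `run_generator`, and `handle` are hypothetical names invented for illustration, the function bodies are placeholders, and none of this reflects the actual Cosmos 3 API.

```python
from dataclasses import dataclass
from enum import Enum


class Task(Enum):
    ANALYZE = "analyze"    # text answer about a scene (reasoner only)
    GENERATE = "generate"  # future frames and/or actions (both towers)


@dataclass
class SceneUnderstanding:
    """Hypothetical stand-in for the reasoner's structured output."""
    summary: str


def run_reasoner(prompt: str) -> SceneUnderstanding:
    # Placeholder: a real reasoner would process images, video, and text
    # autoregressively. Here the prompt is simply echoed back.
    return SceneUnderstanding(summary=f"understood: {prompt}")


def run_generator(understanding: SceneUnderstanding, steps: int) -> list[str]:
    # Placeholder for diffusion-based generation conditioned on the
    # reasoner's interpretation of the scene.
    return [f"frame {i} given [{understanding.summary}]" for i in range(steps)]


def handle(task: Task, prompt: str, steps: int = 3) -> dict:
    """Return the outputs plus the list of towers that were activated."""
    understanding = run_reasoner(prompt)  # the reasoner runs for every task
    if task is Task.ANALYZE:
        return {"towers": ["reasoner"], "text": understanding.summary}
    frames = run_generator(understanding, steps)
    return {"towers": ["reasoner", "generator"], "frames": frames}
```

Calling `handle(Task.ANALYZE, "cup near table edge")` touches only the reasoner, whereas `handle(Task.GENERATE, ...)` routes the reasoner's output into the generator, which is the grounding relationship the paragraph describes.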

From Reasoning to Action in Robots, Vehicles, and Smart Spaces

Cosmos 3 is engineered to support the full chain from perception to action across diverse physical domains. It accepts text, image, and video inputs and can output text for explanations, physics-aware images and videos for prediction, or videos paired with actions for policy learning. For robot perception and manipulation, it can act as a world model that predicts object motion and contact, helping robots plan grasps and trajectories with fewer real-world trials. In autonomous vehicles, Cosmos 3 can generate rare driving edge cases and future scene evolutions, improving safety evaluation. It can also power smart spaces like warehouses, where video plus action outputs model worker movement, equipment paths, and safety events. According to NVIDIA, Cosmos 3 can "reduce physical AI training and evaluation cycles from months to days," which matters when systems must be updated quickly as environments change.

Choosing Cosmos 3 Nano or Super and Supported Modalities

To fit different deployment needs, Cosmos 3 comes in two main sizes. Cosmos 3 Nano is a 16B-parameter model tuned for efficient inference on workstation-grade GPUs such as the NVIDIA RTX PRO 6000, making it suitable for on-premise robots, factory cells, or lab environments that need real-time decisions. Cosmos 3 Super scales up to 64B parameters for maximum quality and is intended for datacenter deployment on NVIDIA Hopper and Blackwell GPUs, where teams can run large-scale synthetic data generation or complex physical reasoning workloads. Both versions share the same multimodal core: they understand and generate combinations of text, images, video, ambient sound, and action. That means teams can use one foundation model family across lightweight robot perception deployments and heavy offline training pipelines, switching sizes without rewriting their pipelines or workflows.
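A rough back-of-the-envelope calculation helps explain why the two sizes target different hardware. The sketch below estimates memory for the weights alone from the parameter counts given above. The storage precisions, the 70% headroom figure, and the helper names `weight_memory_gb` and `pick_model` are assumptions made for illustration, not published Cosmos 3 requirements, and real deployments also need memory for activations and other runtime state.

```python
# Parameter counts as stated in the text above.
PARAMS = {"nano": 16e9, "super": 64e9}

# Assumed storage precisions (bytes per parameter); whether a given
# Cosmos 3 checkpoint ships in these formats is not stated in the source.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_memory_gb(model: str, precision: str = "bf16") -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return PARAMS[model] * BYTES_PER_PARAM[precision] / 1e9


def pick_model(gpu_memory_gb: float, precision: str = "bf16",
               headroom: float = 0.7):
    """Return the largest size whose weights fit in a fraction of GPU memory.

    `headroom` is an assumed rule of thumb that reserves the remaining
    memory for activations and runtime overhead. Returns None if
    neither size fits.
    """
    budget = gpu_memory_gb * headroom
    for name in ("super", "nano"):  # try the larger model first
        if weight_memory_gb(name, precision) <= budget:
            return name
    return None


print(weight_memory_gb("nano"))   # 32.0
print(weight_memory_gb("super"))  # 128.0
```

Under these assumptions the Nano weights take about 32 GB at 16-bit precision and the Super weights about 128 GB, so `pick_model(96)` for a hypothetical single 96 GB card returns `"nano"`, while the larger model would need more memory or multiple GPUs.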

Open Datasets, Evaluation, and the Cosmos Coalition

Cosmos 3 is released with an open stack to make physical AI development more reproducible. NVIDIA provides model checkpoints for Cosmos 3 Nano and Super on Hugging Face, along with training and post-training scripts, deployment tools, and Cosmos NIM microservices to run the models efficiently on NVIDIA GPUs. The release includes six datasets for synthetic data generation that span robotics, physics simulation, spatial reasoning, human motion, autonomous driving, and warehouse environments, giving developers realistic material to adapt foundation models to their own domains. Quality is measured with the NVIDIA Cosmos Human Evaluation framework, which checks generated videos through fact-based yes-or-no questions about semantics, physical laws, geometry, and visual integrity. NVIDIA has also introduced the NVIDIA Cosmos Coalition, bringing together AI labs and robotics companies such as Agile Robots, Black Forest Labs, Generalist, LTX, Runway, and Skild AI to advance open world models.
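To make the yes-or-no evaluation idea concrete, the sketch below aggregates binary answers into a pass rate per category. The article does not describe how the Cosmos Human Evaluation framework actually scores or weights its questions, so the simple averaging here, the function name `category_pass_rates`, and the sample answers are all assumptions for illustration only.

```python
from collections import defaultdict


def category_pass_rates(answers):
    """Compute the fraction of 'yes' answers for each category.

    `answers` is an iterable of (category, passed) pairs, where `passed`
    is True when the evaluator answered 'yes' to a fact-based question.
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [yes count, asked]
    for category, passed in answers:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {c: yes / asked for c, (yes, asked) in totals.items()}


# Made-up answers for one generated video, using the four categories
# named in the text above.
sample = [
    ("semantics", True),
    ("physical laws", True),
    ("physical laws", False),
    ("geometry", True),
    ("visual integrity", False),
]
rates = category_pass_rates(sample)
print(rates["physical laws"])  # 0.5
```

Reporting a rate per category rather than a single overall score keeps a failure in one area, such as physical plausibility, from being hidden by strong results elsewhere.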