NVIDIA used its GTC Taipei keynote at COMPUTEX to release Cosmos 3, which it bills as the first open omni-model for physical AI reasoning and action. The pitch is that a single model now spans the full loop a robot, autonomous vehicle, or smart-space system needs: it perceives a scene, reasons about what is happening and what caused it, predicts what is likely to happen next, and then emits the action data to do something about it. Cosmos 3 takes text, video, images, ambient sound, and action as input, and generates physically grounded video, dense captions, scenario variations, and, critically, numerical action data such as joint angles, gripper positions, and trajectory points.
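To make "numerical action data" concrete, here is a minimal sketch of what one step of such output could look like once parsed. The `ActionStep` schema, its field names, units, and the `is_plausible` check are all hypothetical illustrations; the announcement does not specify Cosmos 3's actual action format.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ActionStep:
    """One hypothetical time step of action output (not Cosmos 3's real schema)."""
    joint_angles_rad: Tuple[float, ...]          # one angle per joint, in radians
    gripper_opening: float                       # assumed range: 0.0 closed, 1.0 open
    waypoint_xyz_m: Tuple[float, float, float]   # trajectory point, in metres


def is_plausible(step: ActionStep, joint_limit_rad: float = math.pi) -> bool:
    """Basic sanity check a downstream controller might run before executing a step."""
    joints_ok = all(abs(a) <= joint_limit_rad for a in step.joint_angles_rad)
    gripper_ok = 0.0 <= step.gripper_opening <= 1.0
    waypoint_ok = all(math.isfinite(v) for v in step.waypoint_xyz_m)
    return joints_ok and gripper_ok and waypoint_ok


good = ActionStep((0.1, -0.5), 0.8, (0.3, 0.0, 0.2))
bad = ActionStep((0.1, -0.5), 1.5, (0.3, 0.0, 0.2))  # gripper value out of range
print(is_plausible(good), is_plausible(bad))
```

The point of the sketch is only that action output is structured numbers a controller can validate and execute, unlike generated video frames.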
Architecturally, Cosmos 3 is a mixture-of-transformers split into two cooperating blocks. A reasoning block first interprets the scene, identifying which objects are moving, where paths may intersect, and what future state is likely; a generation block then conditions on that context to produce physically plausible outputs. The substantive claim is native action generation: rather than bolting a separate policy head onto a video model, the same omni-model that imagines the next frames also writes the motor commands. Developers then fine-tune it for a specific embodiment, camera layout, workspace, or task.
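The reason-then-generate flow can be illustrated with a toy sketch. This is not Cosmos 3's implementation or API: `TrackedObject`, `SceneContext`, `reason`, and `generate` are hypothetical names, and the "scene" is reduced to two objects on a one-dimensional lane. It shows only the control flow described above, in which a reasoning step produces context and a generation step conditions on it to emit trajectory points.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class TrackedObject:
    name: str
    position: float  # metres along a shared 1-D lane
    velocity: float  # metres per second


@dataclass(frozen=True)
class SceneContext:
    moving: Tuple[str, ...]         # which objects are moving
    meeting_time: Optional[float]   # seconds until the paths intersect, if they do


def reason(a: TrackedObject, b: TrackedObject, horizon: float = 5.0) -> SceneContext:
    """Reasoning step: who is moving, and do the two paths intersect within the horizon?"""
    moving = tuple(o.name for o in (a, b) if o.velocity != 0.0)
    closing_speed = a.velocity - b.velocity
    meeting_time = None
    if closing_speed != 0.0:
        t = (b.position - a.position) / closing_speed
        if 0.0 <= t <= horizon:
            meeting_time = t
    return SceneContext(moving, meeting_time)


def generate(ego: TrackedObject, context: SceneContext,
             steps: int = 4, dt: float = 0.5) -> List[float]:
    """Generation step: emit trajectory points conditioned on the reasoner's context."""
    # Toy policy: hold position if an intersection is predicted, else keep current speed.
    speed = 0.0 if context.meeting_time is not None else ego.velocity
    return [ego.position + speed * dt * (i + 1) for i in range(steps)]


robot = TrackedObject("robot", position=0.0, velocity=1.0)
cart = TrackedObject("cart", position=4.0, velocity=-1.0)
context = reason(robot, cart)
print(context.meeting_time, generate(robot, context))
```

In the real model both stages are learned transformer blocks and the outputs include video as well as actions; the sketch keeps only the dependency between the two stages.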
Two model sizes ship as open weights. Cosmos 3 Nano is an eight-billion-parameter configuration — an eight-billion reasoner paired with an eight-billion generator — tuned for efficient inference on workstation-class hardware like the RTX PRO 6000, and published on Hugging Face as nvidia/Cosmos3-Nano. Cosmos 3 Super pairs a thirty-two-billion reasoner with a thirty-two-billion generator for large-scale synthetic-data generation and research, targeting Hopper and Blackwell GPUs, as nvidia/Cosmos3-Super. There is a diffusers integration through a Cosmos3OmniPipeline class, and everything ships under the Linux Foundation's OpenMDW 1.1 license, a single model-centric license covering weights, architecture, documentation, datasets, benchmarks, and code.
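As a back-of-the-envelope check on the hardware targets, the memory needed for the weights alone can be estimated from the parameter counts. The sketch below assumes the reasoner and generator counts quoted above are separate blocks that are both resident in memory, and that weights are stored at 16 bits per parameter; neither assumption is confirmed by the announcement, and activations, caches, and framework overhead are ignored.

```python
def weight_memory_gb(reasoner_billions: float,
                     generator_billions: float,
                     bytes_per_param: float = 2.0) -> float:
    """Approximate weight footprint in decimal gigabytes.

    Billions of parameters times bytes per parameter gives gigabytes directly.
    Assumes both blocks are loaded at once (an assumption, not a published spec).
    """
    return (reasoner_billions + generator_billions) * bytes_per_param


nano_gb = weight_memory_gb(8.0, 8.0)      # 16-bit weights
super_gb = weight_memory_gb(32.0, 32.0)   # 16-bit weights
print(f"Nano: {nano_gb:.0f} GB, Super: {super_gb:.0f} GB")  # Nano: 32 GB, Super: 128 GB
```

Under these assumptions the fourfold gap in weight footprint is consistent with Nano being pitched at a single workstation GPU and Super at data-center hardware.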
On benchmarks NVIDIA claims a sweep across the categories that matter for this model class. The Cosmos 3 Nano post-trained policy is said to lead RoboLab, which tests policies in simulation across language-guided tasks, and RoboArena, which compares policies on DROID robots in the real world. As a vision-language model it is reported as the top-ranked open model on VANTAGE-Bench for smart-infrastructure scene understanding and on the TAR traffic-anomaly-reasoning challenge, and as a world generator it is said to top Physics-IQ, R-Bench, and PAI-Bench, with variants ranking first on Artificial Analysis open-weights leaderboards. Early adopters cited include the NVIDIA GEAR team building video-action models, Agile Robots generating action-conditioned trajectory data for its Thor 3 and FR3 humanoids, and Linker Vision running scene reasoning across thousands of city camera feeds.
The caveat worth flagging is that essentially all of these numbers are first-party and were announced alongside the product, on a mix of NVIDIA-associated and newer benchmarks. Independent replication, on the robotics evals in particular, is what will tell us whether the native-action story holds up outside curated settings. But the direction is clear and consequential. If the benchmark claims hold, an open-weights world-foundation model that emits actions, distributed under a single permissive license, considerably lowers the barrier to building robot and AV data pipelines. It also pulls the physical-AI stack further toward NVIDIA's ecosystem at exactly the moment that stack is consolidating.
- NVIDIA's blog frames the contribution as the reasoning-then-generation split letting systems 'think before they act' in the real world.
- Hugging Face's launch post emphasizes the open weights, the Nano-versus-Super split, and the diffusers Cosmos3OmniPipeline for hands-on use.
- Both note native action output (joint angles, gripper positions, trajectories) as the differentiator from prior video world models.