The core challenge DreamDojo aims to solve is the cost and risk of training robots in the real world, where learning is constrained by limited time, hardware wear, safety concerns, and the need for constant resets. To sidestep this, the model was pre-trained on 44,000 hours of egocentric human video. So-called “latent actions” translate human movements into a hardware-agnostic representation, letting the model learn from human videos without ever observing a robot. In a second stage, the system is adapted to the mechanics of a specific robot.
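The two-stage idea can be sketched in code. This is purely illustrative and not DreamDojo's actual API: the function names, the 4-dimensional latent code, and the gripper scaling are all assumptions. It only shows the structural point that one embodiment-agnostic code sits between human motion and robot-specific commands.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LatentAction:
    # Hypothetical low-dimensional, hardware-agnostic motion code.
    code: List[float]

def encode_human_motion(wrist_delta: List[float], grip_closing: bool) -> LatentAction:
    """Stage 1 (sketch): compress an observed human movement into a latent code.
    No robot is involved at this point."""
    return LatentAction(code=wrist_delta + [1.0 if grip_closing else 0.0])

def decode_for_robot(action: LatentAction, gripper_scale: float) -> dict:
    """Stage 2 (sketch): a robot-specific head maps the same latent code
    onto one particular embodiment's command space."""
    dx, dy, dz, grip = action.code
    return {"ee_delta": (dx, dy, dz), "gripper": grip * gripper_scale}

# A human wrist movement becomes a latent action, then a robot command.
latent = encode_human_motion([0.02, 0.0, -0.01], grip_closing=True)
cmd = decode_for_robot(latent, gripper_scale=0.8)
```

Because only `decode_for_robot` knows about a concrete robot, adapting to new hardware in this sketch means swapping that one mapping while the pre-trained encoder stays fixed.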

DreamDojo runs in real time at around 10 frames per second and supports VR teleoperation, strategy evaluation, and predictive planning directly within the world model. According to Nvidia, all model weights, code, and datasets are openly available. DreamDojo is built on Nvidia Cosmos, with further details published on the project page and in the accompanying paper.