The following is from a recent Radical Ventures AI Founders Masterclass session with NVIDIA’s Director & Distinguished Research Scientist, Jim Fan.
Artificial intelligence has advanced at a remarkable pace over the past decade. Language models now reason, write code, and synthesize information with a fluency that would have seemed implausible not long ago. Yet as AI begins to move beyond screens and servers and into the physical world, progress slows in ways that are both revealing and consequential. The challenge is no longer intelligence in the abstract, but embodiment: perception, motion, and interaction with a complex, unpredictable environment.
Robotics sits at the center of this transition.
In the session, Jim offered a grounded and technically rigorous perspective on why embodied AI remains one of the hardest problems in the field, and what it will take to move it forward. His view is not that robotics lacks ambition or innovation, but that it demands a fundamentally different approach to data, systems design, and scale than the one that powered recent breakthroughs in language models.
From End-to-End Learning to Embodied Intelligence
Jim’s career traces the arc of modern AI itself. He was first drawn to deep learning during the early breakthroughs of the 2010s, particularly AlexNet, which replaced elaborate, hand-engineered computer vision pipelines with a simple, end-to-end mapping from pixels to labels. That elegance left a lasting impression. Across his work in speech recognition, his formative internship at OpenAI on the World of Bits project, his doctoral research at Stanford under Fei-Fei Li, and his current leadership of embodied agent research at NVIDIA, a consistent principle emerges: when systems grow more complex, models should become simpler.
This philosophy is not aesthetic. It is practical. Simple, end-to-end interfaces scale more effectively with data and compute, allowing intelligence to emerge from volume and diversity rather than brittle task-specific logic. That belief shaped early efforts like World of Bits, which treated computer use as a direct mapping from pixels to actions, and it continues to shape Jim’s work today in robotics.
Why Robotics Is Still Hard
Despite dramatic progress in AI reasoning, robotics remains stubbornly difficult. Jim frames this challenge using a distinction borrowed from cognitive science: System Two intelligence governs slow, deliberate reasoning and planning, while System One governs fast, intuitive perception and motor control. Recent advances have dramatically strengthened System Two. System One, however, remains underdeveloped.
The reason, Jim argues, is not a lack of modeling ideas, but a lack of data.
Robotics data is fundamentally different from the text data that powered large language models. It consists of high-dimensional, continuous signals: camera feeds, joint positions, force vectors, and real-time feedback from physical interaction. Collecting this data is expensive and slow. Teleoperation, where humans guide robots through tasks using VR or direct control, produces high-quality demonstrations but is constrained by hardware reliability, human labor, and time. Even under ideal conditions, robots rarely collect more than a few hours of usable data per day.
Compared to the abundance of internet text available to language models, robotics operates under severe data scarcity. While language models were trained on what some have called the “fossil fuel” of the internet, robotics has no equivalent resource. Finding one is the central challenge.
Simulation, Synthetic Data, and World Models
To overcome this constraint, Jim points to simulation and synthetic data as the most promising path forward. Physics-based simulators allow robots to train through reinforcement learning at speeds far beyond real-world interaction, running thousands of parallel environments and accumulating experience at scale. By randomizing physical parameters such as friction, mass, and gravity, robots can learn policies that generalize robustly to the real world.
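The randomization idea can be sketched in a few lines. The parameter names below follow the talk's examples (friction, mass, gravity), but the ranges and the API are illustrative stand-ins, not those of any particular simulator:

```python
import random

# Hypothetical ranges for the randomized physical parameters. Real simulators
# expose many more knobs; these three mirror the examples in the talk.
PARAM_RANGES = {
    "friction": (0.5, 1.5),    # scale on the nominal friction coefficient
    "mass":     (0.8, 1.2),    # scale on nominal link masses
    "gravity":  (9.0, 10.6),   # m/s^2, perturbed around 9.81
}

def sample_physics(rng: random.Random) -> dict:
    """Draw one randomized physics configuration for a single environment."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}

def make_parallel_configs(n_envs: int, seed: int = 0) -> list[dict]:
    """Give each of the n_envs parallel environments its own physics draw,
    so a single policy must succeed across the whole distribution rather
    than in one fixed world."""
    rng = random.Random(seed)
    return [sample_physics(rng) for _ in range(n_envs)]

configs = make_parallel_configs(n_envs=4096)
```

Because every environment samples its own physics, a policy that performs well across all of them has, in effect, never seen a single canonical world, which is what makes the transfer to messy reality plausible.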
World models extend this approach even further. These pretrained video foundation models learn to predict future states of the world conditioned on actions, effectively serving as counterfactual simulators. Having absorbed massive amounts of visual data, they can generate plausible future trajectories without executing them physically. At NVIDIA, this idea has taken shape in projects like GR00T Dreams, where world models fine-tuned on robot data generate synthetic trajectories that can be used alongside real-world experience.
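The counterfactual-simulator role of a world model reduces to a simple loop: feed a state and a candidate action into a learned dynamics model, get a predicted next state, repeat. The sketch below substitutes a hand-written point-mass update for the trained video foundation model, so only the rollout structure is real:

```python
from typing import Callable, Sequence

State = tuple[float, float]   # toy 2-D state; real world models operate on video
Action = float

def rollout(model: Callable[[State, Action], State],
            start: State,
            actions: Sequence[Action]) -> list[State]:
    """Unroll a learned dynamics model over a candidate action sequence,
    producing an imagined trajectory with no physical execution."""
    traj = [start]
    for action in actions:
        traj.append(model(traj[-1], action))
    return traj

# Stand-in dynamics in place of a trained model: position and velocity of a
# point mass, with the action applied as a force each step.
def toy_model(state: State, action: Action) -> State:
    pos, vel = state
    vel = vel + 0.1 * action
    return (pos + 0.1 * vel, vel)

imagined = rollout(toy_model, start=(0.0, 0.0), actions=[1.0] * 5)
```

The same `rollout` call can score many candidate action sequences in parallel, which is what lets a world model generate diverse synthetic trajectories from a handful of real seeds.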
The goal is not to replace physical data, but to amplify it. Scarce real-world demonstrations become seeds from which far larger and more diverse training distributions can grow.
Data Maximalist, Model Minimalist
Despite the complexity of these data pipelines, Jim emphasizes that the learning models themselves should remain as simple and end-to-end as possible. He summarizes this philosophy as “data maximalist, model minimalist.” The model should not care whether its training data came from teleoperation, simulation, or world models. Pixels in, actions out.
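One way to picture the "model minimalist" side of that slogan is a data loader that erases provenance before samples ever reach the policy. Everything below is a hypothetical sketch: the loader names and toy (pixels, action) pairs are stand-ins, not any real pipeline:

```python
import random

# Hypothetical loaders. The frame strings stand in for image tensors and the
# two-element lists stand in for motor commands; every source yields the same
# (pixels, action) shape.
def load_teleop():       return [("teleop_frame", [0.1, 0.0])]
def load_simulation():   return [("sim_frame",    [0.0, 0.2])]
def load_world_model():  return [("dream_frame",  [0.3, 0.1])]

def mixed_batches(batch_size: int, seed: int = 0):
    """Pool demonstrations from all three pipelines into one uniform stream.
    The policy's interface never changes: pixels in, actions out."""
    pool = load_teleop() + load_simulation() + load_world_model()
    rng = random.Random(seed)
    while True:
        yield [rng.choice(pool) for _ in range(batch_size)]

batch = next(mixed_batches(batch_size=8))
```

The complexity lives in the pipelines that fill the pool; the model itself sees only a single, simple interface.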
This approach underlies NVIDIA’s open-source GR00T N1 model, which integrates high-level reasoning with fast, low-latency motor control in a fully differentiable, end-to-end system. The belief is that embodied intelligence will not emerge from ever more intricate architectural scaffolding, but from clean abstractions paired with massive, diverse experience.
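The coupling of slow reasoning with fast control can be sketched as a two-rate loop. This is a control-flow illustration only, with toy stand-ins for the learned modules; it does not capture the end-to-end differentiability the actual system is built around:

```python
def two_rate_loop(planner, controller, observations, plan_every=10):
    """Run a slow deliberate module alongside a fast reactive one: the planner
    refreshes the goal at a low rate, the controller emits a motor command on
    every tick."""
    goal = None
    commands = []
    for t, obs in enumerate(observations):
        if t % plan_every == 0:
            goal = planner(obs)                 # System Two: slow, deliberate
        commands.append(controller(obs, goal))  # System One: fast, reactive
    return commands

# Toy stand-ins for the learned modules.
plans = []
def toy_planner(obs):
    plans.append(obs)          # record how often deliberation actually runs
    return obs * 2.0

def toy_controller(obs, goal):
    return goal - obs          # drive toward the most recent goal

commands = two_rate_loop(toy_planner, toy_controller,
                         [float(t) for t in range(30)])
```

Running thirty ticks with `plan_every=10` invokes the planner only three times, while the controller responds on every tick, mirroring the division of labor between deliberation and reflex.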
The Physical Turing Test
To describe the ultimate ambition of robotics, Jim offers a thought experiment he calls the Physical Turing Test. Imagine hosting a large gathering that leaves your home in disarray. You leave for work, return later, and find the house cleaned and dinner prepared, with no clear indication of whether a human or a robot performed the work. Conversational AI has largely passed its own version of the Turing Test. The physical version remains far more elusive.
The difficulty lies not in reasoning, but in perception, manipulation, and safety. Physical environments are messy, dynamic, and unforgiving. Tasks that feel trivial to humans require enormous amounts of data and coordination for machines.
Where Robotics Will Break Through First
Jim is pragmatic about where embodied AI will succeed in the near term. The earliest breakthroughs are likely to appear in structured environments such as programmable factories and automated laboratories, where tasks are repeatable, safety constraints are manageable, and variability is limited. In these settings, robots can be retrained through demonstration rather than painstaking reprogramming, enabling flexibility that traditional automation lacks.
Homes, by contrast, remain a longer-term challenge. They are unstructured, safety-critical, and deeply personal, with privacy and trust considerations layered on top of technical difficulty.
The Long View
When asked about a “GPT-4 moment” for robotics, Jim avoids precise timelines but draws confidence from history. Early breakthroughs often appear unimpressive by today’s standards, yet they set exponential progress in motion. His conviction is that embodied AI will follow a similar trajectory, with steady advances compounding into systems that feel inevitable in hindsight.
The conversation serves as a reminder that the future of AI will not be defined by algorithms alone. It will be shaped by data pipelines, physical systems, and the infrastructure required to translate intelligence into action. Robotics is not waiting for smarter models. It is waiting for experience at scale — and for the systems that make that experience possible.
This post is based on insights from Radical Talks, a podcast from Radical Ventures exploring innovation at the frontier of AI. For more conversations with leaders in AI, subscribe wherever you get your podcasts.