This week, Radical Ventures portfolio company World Labs released Marble, a world model that creates 3D worlds from image or text prompts. CEO and Co-Founder Fei-Fei Li (also a Scientific Partner at Radical Ventures) shared her vision for building spatially intelligent AI in a manifesto posted on her Substack. The following is an excerpt.
Building spatially intelligent AI requires something even more ambitious than LLMs: world models, a new type of generative model whose capabilities for understanding, reasoning about, generating, and interacting with semantically, physically, geometrically, and dynamically complex worlds, virtual or real, are far beyond the reach of today’s LLMs. The field is nascent, with current methods ranging from abstract reasoning models to video generation systems. World Labs was founded in early 2024 on this conviction: the foundational approaches are still being established, and building them is the defining challenge of the next decade.
While language is a purely generative phenomenon of human cognition, worlds play by much more complex rules. Here on Earth, for instance, gravity governs motion, atomic structure determines how light produces colors and brightness, and countless physical laws constrain every interaction. Even the most fanciful, creative worlds are composed of spatial objects and agents that obey the physical laws and dynamic behaviors that define them. Reconciling all of this consistently (the semantic, the geometric, the dynamic, and the physical) demands entirely new approaches. The dimensionality of representing a world is vastly greater than that of a one-dimensional, sequential signal like language. Achieving world models that deliver the kind of universal capabilities we enjoy as humans will require overcoming several formidable technical barriers. At World Labs, our research teams are devoted to making fundamental progress toward that goal.
Here are some examples of our current research topics:
- A new, universal task function for training: Defining a universal task function as simple and elegant as next-token prediction in LLMs has long been a central goal of world model research. The complexity of world models’ input and output spaces makes such a function inherently more difficult to formulate. While much remains to be explored, this objective function and its corresponding representations must reflect the laws of geometry and physics, honoring the fundamental nature of world models as grounded representations of both imagination and reality. (For contrast, a minimal sketch of the next-token baseline follows this list.)
- Large-scale training data: Training world models requires far richer data than curated text. The promising news: massive data sources already exist. Internet-scale collections of images and videos represent abundant, accessible training material; the challenge lies in developing algorithms that can extract deeper spatial information from these two-dimensional, frame-based RGB signals. Research over the past decade has shown the power of scaling laws linking data volume and model size in language models (a rough illustration of such a law follows this list); the key unlock for world models is building architectures that can leverage existing visual data at comparable scale. In addition, I would not underestimate the power of high-quality synthetic data and additional modalities like depth and tactile information, which supplement internet-scale data at critical steps of the training process. But the path forward depends on better sensor systems, more robust signal-extraction algorithms, and far more powerful neural simulation methods.
- New model architecture and representation learning: World model research will inevitably drive advances in model architectures and learning algorithms, particularly beyond the current MLLM and video diffusion paradigms. Both typically tokenize data into 1D or 2D sequences, which makes simple spatial tasks unnecessarily difficult, like counting the unique chairs in a short video or remembering what a room looked like an hour ago. Alternative architectures may help, such as 3D- or 4D-aware methods for tokenization, context, and memory. For example, at World Labs our recent work on RTFM, a real-time generative frame-based model, demonstrates this shift: it uses spatially grounded frames as a form of spatial memory to achieve efficient real-time generation while maintaining persistence in the generated world. (A simplified, hypothetical sketch of frame-based spatial memory follows this list.)
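For contrast with the objective discussed in the first bullet, here is a minimal sketch of the next-token-prediction loss that LLMs already enjoy, written in PyTorch. It illustrates the language-model baseline only; the tensor shapes and names are assumptions for the example, and no comparably simple, universal loss is claimed here for world models.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Standard LLM training objective: predict token t+1 from tokens up to t.

    logits: (batch, seq_len, vocab_size) raw model outputs
    tokens: (batch, seq_len) integer token ids
    """
    pred = logits[:, :-1, :]   # predictions for positions 0 .. T-2
    target = tokens[:, 1:]     # ground-truth next tokens at positions 1 .. T-1
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),  # flatten batch and time dimensions
        target.reshape(-1),
    )

# Example with random tensors standing in for a real model's output.
logits = torch.randn(2, 16, 1000)
tokens = torch.randint(0, 1000, (2, 16))
print(next_token_loss(logits, tokens))
```

A world-model analogue would need a loss over scene states that respects geometry and physics rather than a distribution over the next token, and no comparably simple, universal form exists yet.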
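The scaling laws mentioned in the data bullet are often summarized with a parametric loss curve from the language-model literature (the form popularized by Hoffmann et al.). The sketch below shows only what such a law looks like; the constants are placeholder values for illustration, not a fitted law for any world model.

```python
def scaling_law_loss(n_params: float, n_tokens: float,
                     E: float = 1.7, A: float = 400.0, B: float = 410.0,
                     alpha: float = 0.34, beta: float = 0.28) -> float:
    """Parametric scaling law: L(N, D) = E + A / N^alpha + B / D^beta.

    n_params (N) is model size; n_tokens (D) is training-data size.
    The constants are placeholders, loosely echoing published
    language-model fits, used here only to show the shape of the curve.
    """
    return E + A / n_params ** alpha + B / n_tokens ** beta

# Doubling the data at a fixed model size lowers the predicted loss.
print(scaling_law_loss(7e9, 1e12))   # baseline
print(scaling_law_loss(7e9, 2e12))   # same model, twice the data
```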
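Finally, to make the idea of spatially grounded frames as memory concrete, here is a deliberately simplified, hypothetical sketch: generated frames are cached under a coarse grid of camera poses, so revisiting a nearby viewpoint retrieves earlier frames as context instead of regenerating the scene from scratch. This illustrates the general concept only; it is not a description of RTFM's actual architecture, and every name here (SpatialFrameMemory, cell_size, nearby) is invented for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

Pose = Tuple[float, float, float]   # hypothetical camera pose: (x, y, heading)
Frame = Any                         # placeholder for a generated image/frame

@dataclass
class SpatialFrameMemory:
    """Hypothetical sketch: cache generated frames on a coarse pose grid so a
    nearby viewpoint can reuse them as generation context."""
    cell_size: float = 1.0
    cells: Dict[Tuple[int, int], List[Tuple[Pose, Frame]]] = field(default_factory=dict)

    def _key(self, pose: Pose) -> Tuple[int, int]:
        # Discretize the (x, y) position into grid cells; heading is ignored.
        return (int(pose[0] // self.cell_size), int(pose[1] // self.cell_size))

    def add(self, pose: Pose, frame: Frame) -> None:
        self.cells.setdefault(self._key(pose), []).append((pose, frame))

    def nearby(self, pose: Pose) -> List[Frame]:
        # Gather frames from the surrounding 3x3 grid neighborhood.
        kx, ky = self._key(pose)
        return [f
                for dx in (-1, 0, 1)
                for dy in (-1, 0, 1)
                for _, f in self.cells.get((kx + dx, ky + dy), [])]

# Usage: strings stand in for generated frames in this toy example.
mem = SpatialFrameMemory(cell_size=2.0)
mem.add((0.5, 0.5, 0.0), "frame_A")
mem.add((10.0, 10.0, 0.0), "frame_B")
print(mem.nearby((1.0, 1.0, 3.14)))   # -> ['frame_A']; frame_B is too far away
```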
Clearly, we are still facing daunting challenges before we can fully unlock spatial intelligence through world modelling. This research isn’t just a theoretical exercise. It is the core engine for a new class of creative and productivity tools. And the progress within World Labs has been encouraging. We recently shared with a limited number of users a glimpse of Marble, the first-ever world model that can be prompted by multimodal inputs to generate and maintain consistent 3D environments for users and storytellers to explore, interact with, and build further in their creative workflow. And we are working hard to make it available to the public soon!
Marble is only our first step toward a truly spatially intelligent world model. As progress accelerates, researchers, engineers, users, and business leaders alike are beginning to recognize its extraordinary potential. The next generation of world models will enable machines to achieve spatial intelligence on an entirely new level, unlocking essential capabilities still largely absent from today’s AI systems.
Fei-Fei Li’s full manifesto is available here. You can read more about the launch of World Labs’ Marble on the company’s blog.
AI News This Week
- The State of AI in 2025: Agents, Innovation, and Transformation (McKinsey)
McKinsey’s latest State of AI survey reveals that AI adoption continues to expand, with 88% of organizations now using AI in at least one business function. Organizations report encouraging use-case-level benefits, with 64% citing innovation improvements. The report highlights that the highest-performing companies distinguish themselves by pursuing enterprise-wide transformation with AI: redesigning workflows and setting growth and innovation as core objectives alongside cost reduction.
- From AI to ROI: Some Positive Evidence (FT)
Research is showing measurable revenue gains from generative AI implementation in real-world retail operations. A study from Zhejiang and Columbia universities found that adding an AI assistant before purchase increased sales by 16.3% and conversion rates by 21.7% at a large retail platform, while a hybrid AI-human system boosted sales by 11.5%. The returns from GenAI have been clearly visible in the earnings of the companies associated with the AI infrastructure build-out for some time. What is now becoming clear is that the benefits of AI are accruing to the rest of the economy too.
- In a First, AI Models Analyze Language As Well As a Human Expert (Quanta)
Researchers tested whether AI models can analyze language structure rather than simply use it. In the study, an LLM successfully parsed complex recursive sentences, identified multiple meanings in ambiguous phrases, and inferred grammatical rules from 30 invented languages it had never seen before, suggesting it is not simply memorizing patterns from its training data. The results challenge the long-held assumption that deep language analysis requires human-specific cognition and indicate that these capabilities may emerge naturally as AI systems scale.
- That New Hit Song on Spotify? It Was Made by A.I. (The New Yorker)
AI-generated music is achieving mainstream success on streaming platforms, with aspiring musicians using AI tools to create chart-topping tracks. The workflow of an AI-enabled musician involves writing lyrics, crafting text prompts that specify genre and mood, generating dozens of versions, and using an image generator for album art. Despite concerns, listeners can distinguish AI-generated tracks from human-made music only 53% of the time. The AI-generated country song “Walk My Walk” hit No. 1 on Billboard’s Country Digital Song Sales chart with over three million Spotify streams, while the AI R&B singer Xania Monet secured a multimillion-dollar record deal.
- Research: Accumulating Context Changes the Beliefs of Language Models (CMU/Princeton/Stanford)
Determining appropriate belief flexibility in AI systems remains an open question for researchers. As language models gain larger context windows and memory capabilities, they accumulate more text autonomously, without user intervention. This creates the risk that a model’s belief profile may silently shift, leading to inconsistent experiences or misalignment. The researchers found that GPT-5 showed a 54.7% shift in stated beliefs after 10 discussion rounds on moral dilemmas, and Grok-4 demonstrated a 27.2% shift in political beliefs after reading opposing texts. Stated beliefs changed within 2-4 rounds, while behavioural shifts accumulated gradually, indicating that language models have highly malleable belief systems that shift substantially over extended interactions.
Radical Reads is edited by Ebin Tomy (Analyst, Radical Ventures)