Giving Robots Human-like Reasoning Capabilities: Introducing RFM-1

By Andrew Sohn, Anusha Nagabandi, Carlos Florensa, Daniel Adelberg, Di Wu, Hassan Farooq, Ignasi Clavera, Jeremy Welborn, Juyue Chen, Nikhil Mishra, Peter Chen, Peter Qian, Pieter Abbeel, Rocky Duan, Varun Vijay, Yang Liu, Covariant

Covariant robots in the real world, often delivering 1,000 successful cycles per hour at 99%+ precision

The last century has brought us transformative technological advancements, from computers and the internet to advanced computing and, now, artificial intelligence. These advancements have influenced every field of study, every industry, and every facet of human life.

At Covariant, we believe that the next major technological breakthrough lies in extending these advancements into the physical realm. Robotics stands at the forefront of this shift, poised to unlock efficiencies in the physical world that mirror those we’ve unlocked digitally.

Recent advances in foundation models have led to models that can produce realistic, aesthetic, and creative content across various domains, such as text, images, videos, music, and even code. The general problem-solving capabilities of these remarkable foundation models come from the fact that they are pre-trained on millions of tasks, represented by trillions of words from the internet. However, partly due to the limitations of their training data, existing models still struggle to grasp the true physical laws of reality and to achieve the accuracy, precision, and reliability that robots need for effective, autonomous real-world interaction.

RFM-1 — a Robotics Foundation Model trained on both general internet data and data rich in physical real-world interactions — represents a remarkable leap toward building generalized AI models that can accurately simulate and operate in the demanding conditions of the physical world.

Foundation model powered by real-world multimodal robotics data

Behind the more recently popularized concept of “foundation models” are multiple long-standing academic fields like self-supervised learning, generative modeling, and model-based reinforcement learning, which are broadly predicated on the idea that intelligence and generalization emerge from understanding the world through a large amount of data.

Following the embodiment hypothesis that intelligent behavior arises from an entity’s physical interactions with its environment, Covariant’s first step toward this goal of developing state-of-the-art embodied AI began back in 2017. Since then, we have deployed a fleet of high-performing robotic systems to real customer sites across the world, delivering significant value to customers while simultaneously creating a vast and multimodal real-world dataset.

Why do we need to collect our own data to train robotics foundation models? One key reason is performance. Most existing robotic datasets contain slow-moving robots in lab-like environments, interacting with objects under mostly quasi-static conditions. In contrast, our systems were already tasked with working in demanding real-world environments at high precision and high throughput. To build robotics foundation models that can power robots to perform at that level in the real world, the training data has to contain robotic interactions from those demanding environments.

Understanding physics through learning world models

Learned world models are the future of physics simulation. They offer countless benefits over more traditional simulation methods, including the ability to reason about interactions where the relevant information is not known a priori, to operate under real-time computation requirements, and to improve in accuracy over time. Especially in the current era of high-performing foundation models, the predictive capability of such world models can allow robots to develop the physics intuitions that are critical for operating in our world.
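The post does not describe RFM-1's internals, but the core idea — using a learned model to predict outcomes before acting — can be illustrated with a minimal sketch. Everything below is hypothetical: the damped linear `learned_dynamics` stands in for a trained neural network, and the random-sampling planner is just one simple way such a model could be used for action selection.

```python
import numpy as np

rng = np.random.default_rng(0)

def learned_dynamics(state, action):
    # Stand-in for a learned world model: predict the next state from the
    # current state and a candidate action. In a real system this would be
    # a network trained on large-scale interaction data; here it is just a
    # fixed damped linear map plus the action.
    return 0.95 * state + action

def cost(state, goal):
    # Distance-to-goal cost. A deployed system would also penalize
    # collisions, joint limits, and cycle time.
    return np.linalg.norm(state - goal)

def plan_action(state, goal, horizon=5, n_candidates=64):
    # Sampling-based planning: roll candidate action sequences through the
    # learned model, score the predicted outcomes, and return the first
    # action of the best-scoring sequence.
    best_first_action, best_cost = None, np.inf
    for _ in range(n_candidates):
        seq = rng.normal(scale=0.1, size=(horizon, state.shape[0]))
        predicted = state
        for action in seq:
            predicted = learned_dynamics(predicted, action)
        c = cost(predicted, goal)
        if c < best_cost:
            best_first_action, best_cost = seq[0], c
    return best_first_action

state = np.zeros(3)                # e.g., a gripper position in meters
goal = np.array([0.4, -0.2, 0.1])  # hypothetical target position
print(plan_action(state, goal))
```

The appeal of the learned version is exactly what the paragraph above notes: the predictive model can keep improving as more interaction data accumulates, rather than being fixed by hand-specified physics parameters.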

We have developed RFM-1 with exactly this goal in mind: to deal with the complex dynamics and physical constraints of real-world robotics, where the optimization landscape is sharp, the line between success and failure is thin, and accuracy requirements are tight, with even a fraction of a centimeter of error potentially halting operations. Here, the focus shifts from merely recognizing an object like an onion to managing its precise and efficient manipulation, all while minimizing risk and coordinating with other systems to maximize efficiency.

Leveraging language to help robots and people collaborate

For the past few decades, programming new robot behavior has been a laborious task reserved for experienced robotics engineers. RFM-1's ability to process text tokens as input and predict text tokens as output opens the door to intuitive natural language interfaces, enabling anyone to program new robot behavior in minutes rather than weeks or months.

RFM-1 doesn't just make robots more taskable by understanding natural language commands; it also enables robots to ask people for help. For example, if a robot is having trouble picking a particular item, it can communicate that to the robot operator or engineer, and even suggest why the item is difficult to pick. The operator can then provide new motion strategies to the robot, such as perturbing the object by moving it or knocking it down to find better grasp points. The robot can then apply this new strategy to similar situations in the future.
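As a rough illustration of this interaction pattern — not Covariant's actual API — the sketch below shows a hypothetical loop in which the robot reports a pick failure in text, an operator replies with a strategy in plain language, and the strategy is stored for reuse. The `query_model` function is a placeholder for multimodal model inference; in practice the prompt would also carry image and video tokens from the robot's cameras.

```python
strategy_memory = {}  # maps a failure reason to an operator-provided strategy

def query_model(prompt: str) -> str:
    # Placeholder for multimodal model inference. A real call would feed
    # text tokens alongside camera observations and decode either a text
    # reply or motion commands.
    return f"[model output for: {prompt!r}]"

def handle_pick_failure(item_id: str, failure_reason: str,
                        operator_hint: str) -> str:
    # 1. The robot explains the problem in natural language.
    report = query_model(
        f"Explain why item {item_id} could not be picked: {failure_reason}"
    )
    print("Robot:", report)

    # 2. The operator's plain-language suggestion is turned into a strategy.
    strategy = query_model(
        f"Convert this operator suggestion into a motion strategy: {operator_hint}"
    )

    # 3. Remember the strategy so it can be reused on similar failures.
    strategy_memory[failure_reason] = strategy
    return strategy

print(handle_pick_failure(
    "SKU-1042",
    "no collision-free grasp points found",
    "nudge the item sideways to expose a flat face",
))
```

The design point the paragraph makes is the memory step: an operator's one-off suggestion becomes a reusable behavior rather than a manual intervention that must be repeated.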

RFM-1 is the start of a new era for Covariant and Robotics Foundation Models.

By giving robots the human-like ability to reason on the fly, RFM-1 is a huge step forward toward delivering the autonomy needed to address the growing shortage of workers willing to engage in highly repetitive and dangerous tasks — ultimately lifting productivity and economic growth for decades to come.