The Global Project to Make a General Robotic Brain

How 34 labs are teaming up to tackle robotic learning

[Photo: A silver one-armed robot lifts a toy dinosaur from a table full of cluttered objects. Robots from around the world, including this robot from Google, are sharing data on object manipulation to help work toward a general-purpose robotic brain.

Open X-Embodiment Collaboration]

The generative AI revolution embodied in tools like ChatGPT, Midjourney, and many others is at its core based on a simple formula: Take a very large neural network, train it on a huge dataset scraped from the Web, and then use it to fulfill a broad range of user requests. Large language models (LLMs) can answer questions, write code, and spout poetry, while image-generating systems can create convincing cave paintings or contemporary art.

So why haven’t these amazing AI capabilities translated into the kinds of helpful and broadly useful robots we’ve seen in science fiction? Where are the robots that can clean off the table, fold your laundry, and make you breakfast?

Unfortunately, the highly successful generative AI formula—big models trained on lots of Internet-sourced data—doesn’t easily carry over into robotics, because the Internet is not full of robotic-interaction data in the same way that it’s full of text and images. Robots need robot data to learn from, and this data is typically created slowly and tediously by researchers in laboratory environments for very specific tasks. Despite tremendous progress on robot-learning algorithms, without abundant data we still can’t enable robots to perform real-world tasks (like making breakfast) outside the lab. The most impressive results typically only work in a single laboratory, on a single robot, and often involve only a handful of behaviors.

If the abilities of each robot are limited by the time and effort it takes to manually teach it to perform a new task, what if we were to pool together the experiences of many robots, so a new robot could learn from all of them at once? We decided to give it a try. In 2023, our labs at Google and the University of California, Berkeley came together with 32 other robotics laboratories in North America, Europe, and Asia to undertake the RT-X project, with the goal of assembling data, resources, and code to make general-purpose robots a reality.

Here is what we learned from the first phase of this effort.

How to create a generalist robot

Humans are far better at this kind of learning. Our brains can, with a little practice, handle what are essentially changes to our body plan, which happens when we pick up a tool, ride a bicycle, or get in a car. That is, our “embodiment” changes, but our brains adapt. RT-X is aiming for something similar in robots: to enable a single deep neural network to control many different types of robots, a capability called cross-embodiment. The question is whether a deep neural network trained on data from a sufficiently large number of different robots can learn to “drive” all of them—even robots with very different appearances, physical properties, and capabilities. If so, this approach could potentially unlock the power of large datasets for robotic learning.

The scale of this project is very large because it has to be. The RT-X dataset currently contains nearly a million robotic trials for 22 types of robots, including many of the most commonly used robotic arms on the market. The robots in this dataset perform a huge range of behaviors, including picking and placing objects, assembly, and specialized tasks like cable routing. In total, there are about 500 different skills and interactions with thousands of different objects. It’s the largest open-source dataset of real robotic actions in existence.
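To make the shape of such a dataset concrete, here is a minimal sketch of what one multirobot episode record might look like. The field names and the `Step`/`Episode` classes are illustrative assumptions, not the actual Open X-Embodiment schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    """One timestep of a robot trial (hypothetical schema)."""
    image: bytes          # encoded RGB camera observation
    instruction: str      # natural-language task, e.g. "route cable through clip"
    action: List[float]   # end-effector command, e.g. [dx, dy, dz, droll, dpitch, dyaw, gripper]

@dataclass
class Episode:
    """One complete trial by one robot embodiment (hypothetical schema)."""
    robot_type: str                          # one of the 22 embodiments, e.g. "widowx"
    steps: List[Step] = field(default_factory=list)

def count_skills(episodes: List[Episode]) -> int:
    """Rough skill inventory: number of unique instructions across all episodes."""
    return len({step.instruction for ep in episodes for step in ep.steps})
```

Pooling data this way is what lets a single training run see picking, assembly, and cable routing across many arm types at once.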

Surprisingly, we found that our multirobot data could be used with relatively simple machine-learning methods, provided that we follow the recipe of using large neural-network models with large datasets. Leveraging the same kinds of models used in current LLMs like ChatGPT, we were able to train robot-control algorithms that do not require any special features for cross-embodiment. Much like a person can drive a car or ride a bicycle using the same brain, a model trained on the RT-X dataset can simply recognize what kind of robot it’s controlling from what it sees in the robot’s own camera observations. If the robot’s camera sees a UR10 industrial arm, the model sends commands appropriate to a UR10. If the model instead sees a low-cost WidowX hobbyist arm, the model moves it accordingly.
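The key point in the paragraph above is the interface, not the architecture: one set of weights maps camera observations to actions, with no explicit robot-ID input, so the embodiment must be inferred from the pixels themselves. The tiny two-layer network below is a stand-in sketch of that interface, not the actual RT-X model:

```python
import numpy as np

rng = np.random.default_rng(0)

class CrossEmbodimentPolicy:
    """One policy, many robots: no robot-type flag anywhere in the interface."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 32):
        # A toy two-layer network standing in for a large transformer policy.
        self.w1 = rng.normal(0.0, 0.1, (obs_dim, hidden))
        self.w2 = rng.normal(0.0, 0.1, (hidden, act_dim))

    def act(self, obs: np.ndarray) -> np.ndarray:
        """Map an observation to an action command; the embodiment is implicit in obs."""
        h = np.tanh(obs @ self.w1)
        return h @ self.w2

# The same policy instance serves both arms; only the observation differs.
policy = CrossEmbodimentPolicy(obs_dim=64, act_dim=7)
ur10_obs = rng.normal(size=64)    # stand-in for a flattened UR10 camera view
widowx_obs = rng.normal(size=64)  # stand-in for a WidowX camera view
ur10_action = policy.act(ur10_obs)
widowx_action = policy.act(widowx_obs)
```

A trained model would produce embodiment-appropriate commands from each view; the sketch only shows that nothing in the calling code distinguishes the two robots.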

To test the capabilities of our model, five of the laboratories involved in the RT-X collaboration each tested it in a head-to-head comparison against the best control system they had developed independently for their own robot. Each lab’s test involved the tasks it was using for its own research, which included things like picking up and moving objects, opening doors, and routing cables through clips. Remarkably, the single unified model provided improved performance over each laboratory’s own best method, succeeding at the tasks about 50 percent more often on average.
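"Succeeding about 50 percent more often" is a relative improvement in success rate, averaged across labs, not an absolute 50-point jump. The per-lab numbers below are made up purely to illustrate the arithmetic:

```python
# Hypothetical per-lab success rates (fractions of trials that succeeded).
baseline = [0.40, 0.55, 0.30, 0.60, 0.45]  # each lab's own best controller
rtx      = [0.62, 0.80, 0.48, 0.85, 0.70]  # the single unified RT-X model

# Relative improvement per lab, then the average across labs.
rel_improvements = [(r - b) / b for b, r in zip(baseline, rtx)]
mean_improvement = sum(rel_improvements) / len(rel_improvements)  # ~0.5, i.e. ~50%
```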

While this result might seem surprising, we found that the RT-X controller could leverage the diverse experiences of other robots to improve robustness in different settings. Even within the same laboratory, every time a robot attempts a task, it finds itself in a slightly different situation, so drawing on the experiences of other robots in other situations helped the RT-X controller handle natural variability and edge cases.

Building robots that can reason

Encouraged by our success with combining data from many robot types, we next sought to investigate how such data can be incorporated into a system with more in-depth reasoning capabilities. Complex semantic reasoning is hard to learn from robot data alone. While the robot data can provide a range of physical capabilities, more complex tasks like “Move apple between can and orange” also require understanding the semantic relationships between objects in an image, basic common sense, and other symbolic knowledge that is not directly related to the robot’s physical capabilities.

So we decided to add another massive source of data to the mix: Internet-scale image and text data. We used an existing large vision-language model that is already proficient at many tasks requiring some understanding of the connection between natural language and images. The model is similar to publicly available ones such as ChatGPT or Bard. These models are trained to output text in response to prompts containing images, allowing them to solve problems such as visual question-answering, captioning, and other open-ended visual understanding tasks. We discovered that such models can be adapted to robotic control simply by training them to also output robot actions in response to prompts framed as robotic commands (such as “Put the banana on the plate”). We applied this approach to the robotics data from the RT-X collaboration.
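For a language model to "output robot actions," the continuous action values must be represented as discrete tokens, the same kind of unit the model emits when it writes words. One common approach is to discretize each action dimension into a fixed number of bins; the 256-bin uniform scheme below is an illustrative assumption, not necessarily the exact encoding RT-X uses:

```python
import numpy as np

# Represent continuous robot actions as integer tokens a language model can emit.
# Assumed scheme: uniform discretization of each dimension into 256 bins.
N_BINS = 256
ACTION_LOW, ACTION_HIGH = -1.0, 1.0  # assumed normalized action range

def action_to_tokens(action: np.ndarray) -> list:
    """Map each continuous action dimension to an integer token in [0, 255]."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    scaled = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)  # -> [0, 1]
    return np.minimum((scaled * N_BINS).astype(int), N_BINS - 1).tolist()

def tokens_to_action(tokens: list) -> np.ndarray:
    """Decode tokens back to continuous values at the center of each bin."""
    centers = (np.array(tokens) + 0.5) / N_BINS               # -> (0, 1)
    return centers * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW  # -> action range

action = np.array([0.0, 0.5, -1.0, 1.0])
tokens = action_to_tokens(action)          # integers the model could emit as "words"
recovered = tokens_to_action(tokens)       # within one bin width of the original
```

With actions encoded this way, "Put the banana on the plate" and the resulting motor command are both just token sequences, so a single vision-language model can be fine-tuned to produce either.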

The RT-X model uses images or text descriptions of specific robot arms doing different tasks to output a series of discrete actions that will allow any robot arm to do those tasks. By collecting data from many robots doing many tasks from robotics labs around the world, we are building an open-source dataset that can be used to teach robots to be generally useful.

Chris Philpot

To evaluate the combination of Internet-acquired smarts and multirobot data, we tested our RT-X model with Google’s mobile manipulator robot. We gave it our hardest generalization benchmark tests. The robot had to recognize objects and successfully manipulate them, and it also had to respond to complex text commands by making logical inferences that required integrating information from both text and images. The latter is one of the things that make humans such good generalists. Could we give our robots at least a hint of such capabilities?

We conducted two sets of evaluations. As a baseline, we trained a model using only data from Google’s robot, excluding all of the other multirobot RT-X data. Google’s robot-specific dataset is in fact the largest part of the RT-X dataset, with over 100,000 demonstrations, so whether all the other multirobot data would actually help in this case was very much an open question. Then we tried again with all of that multirobot data included.

In one of the most difficult evaluation scenarios, the Google robot needed to accomplish a task that involved reasoning about spatial relations (“Move apple between can and orange”); in another task it had to solve rudimentary math problems (“Place an object on top of a paper with the solution to ‘2+3’”). These challenges were meant to test the crucial capabilities of reasoning and drawing conclusions.

In this case, the reasoning capabilities (such as the meaning of “between” and “on top of”) came from the Web-scale data included in the training of the vision-language model, while the ability to ground the reasoning outputs in robotic behaviors—commands that actually moved the robot arm in the right direction—came from training on cross-embodiment robot data from RT-X. An example of an evaluation where we asked the robot to perform a task not included in its training data is shown in the video below.

Even without specific training, this Google research robot is able to follow the instruction “move apple between can and orange.” This capability is enabled by RT-X, a large robotic manipulation dataset and the first step towards a general robotic brain.

While these tasks are rudimentary for humans, they present a major challenge for general-purpose robots. Without robotic demonstration data that clearly illustrates concepts like “between,” “near,” and “on top of,” even a system trained on data from many different robots would not be able to figure out what these commands mean. By integrating Web-scale knowledge from the vision-language model, our complete system was able to solve such tasks, deriving the semantic concepts (in this case, spatial relations) from Internet-scale training, and the physical behaviors (picking up and moving objects) from multirobot RT-X data. To our surprise, we found that the inclusion of the multirobot data improved the Google robot’s ability to generalize to such tasks by a factor of three. This result suggests that not only was the multirobot RT-X data useful for acquiring a variety of physical skills, it could also help to better connect such skills to the semantic and symbolic knowledge in vision-language models. These connections give the robot a degree of common sense, which could one day enable robots to understand the meaning of complex and nuanced user commands like “Bring me my breakfast” while carrying out the actions to make it happen.

The next steps for RT-X

The RT-X project shows what is possible when the robot-learning community acts together. Because of this cross-institutional effort, we were able to put together a diverse robotic dataset and carry out comprehensive multirobot evaluations that wouldn’t be possible at any single institution. Since the robotics community can’t rely on scraping the Internet for training data, we need to create that data ourselves. We hope that more researchers will contribute their data to the RT-X database and join this collaborative effort. We also hope to provide tools, models, and infrastructure to support cross-embodiment research. We plan to go beyond sharing data across labs, and we hope that RT-X will grow into a collaborative effort to develop data standards, reusable models, and new techniques and algorithms.

Our early results hint at how large cross-embodiment robotics models could transform the field. Much as large language models have mastered a wide range of language-based tasks, in the future we might use the same foundation model as the basis for many real-world robotic tasks. Perhaps new robotic skills could be enabled by fine-tuning or even prompting a pretrained foundation model. In a similar way to how you can prompt ChatGPT to tell a story without first training it on that particular story, you could ask a robot to write “Happy Birthday” on a cake without having to tell it how to use a piping bag or what handwritten text looks like. Of course, much more research is needed for these models to take on that kind of general capability, as our experiments have focused on single arms with two-finger grippers doing simple manipulation tasks.

As more labs engage in cross-embodiment research, we hope to further push the frontier on what is possible with a single neural network that can control many robots. These advances might include adding diverse simulated data from generated environments, handling robots with different numbers of arms or fingers, using different sensor suites (such as depth cameras and tactile sensing), and even combining manipulation and locomotion behaviors. RT-X has opened the door for such work, but the most exciting technical developments are still ahead.

This is just the beginning. We hope that with this first step, we can together create the future of robotics: where general robotic brains can power any robot, benefiting from data shared by all robots around the world.

This article appears in the February 2024 print issue as “The Global Project to Make a General Robotic Brain.”

