Google DeepMind has introduced Robotics Transformer 2 (RT-2), a groundbreaking vision-language-action (VLA) model that enables robots to carry out new tasks without task-specific training. Much as language models learn general concepts from vast amounts of web data, RT-2 uses text and images from the internet to grasp real-world concepts and translate that knowledge into generalized instructions for robotic actions. The potential implications include context-aware, adaptable robots that can execute a range of tasks across different situations and environments with significantly less training than is currently required.
What sets Google DeepMind's RT-2 apart? In 2022, DeepMind unveiled RT-1, a multi-task model that learned from 130,000 demonstrations and enabled Everyday Robots to accomplish over 700 tasks with a 97% success rate. Now, by combining the robotic demonstration data from RT-1 with web datasets, the company has developed its successor: RT-2.
The most noteworthy feature of RT-2 is that it does not rely on hundreds of thousands of data points. Traditionally, handling complex, abstract tasks in highly variable settings has been assumed to require robot training that covers every object, environment, and situation. RT-2, by contrast, learns from a limited amount of robotic data: it performs the kind of sophisticated reasoning seen in foundation models and applies that knowledge to direct robotic actions, even for tasks it has never encountered or been trained on.
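To make the idea concrete, the published RT-2 approach represents each robot action as a short sequence of discrete tokens that the model emits the same way it emits words, which are then decoded into motor commands. The sketch below illustrates that decoding step; the dimension names, value ranges, and 256-bin resolution are illustrative assumptions, not DeepMind's actual implementation.

```python
# Minimal sketch of de-tokenizing a VLA model's action output into continuous
# robot commands. All names and ranges below are hypothetical placeholders.

NUM_BINS = 256  # assumed per-dimension resolution

# Assumed action dimensions with illustrative (min, max) physical ranges.
ACTION_RANGES = {
    "dx": (-0.1, 0.1),      # end-effector translation deltas, metres
    "dy": (-0.1, 0.1),
    "dz": (-0.1, 0.1),
    "droll": (-0.5, 0.5),   # rotation deltas, radians
    "dpitch": (-0.5, 0.5),
    "dyaw": (-0.5, 0.5),
    "gripper": (0.0, 1.0),  # 0 = open, 1 = closed
}

def detokenize(action_tokens):
    """Map integer tokens in [0, NUM_BINS) back to continuous commands."""
    command = {}
    for token, (name, (lo, hi)) in zip(action_tokens, ACTION_RANGES.items()):
        fraction = token / (NUM_BINS - 1)
        command[name] = lo + fraction * (hi - lo)
    return command

# Tokens the model might emit for an instruction such as "pick up the bag".
print(detokenize([140, 128, 200, 128, 128, 128, 255]))
```

Because these action tokens share the model's text vocabulary, the same web-scale pretraining that teaches the model what an object is can be fine-tuned to produce action strings directly.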
"RT-2 demonstrates improved generalization capabilities and enhanced semantic and visual understanding beyond the robotic data it was exposed to," explains Google. This includes interpreting new commands and responding to user instructions with rudimentary reasoning, such as categorizing objects or providing high-level descriptions.
One significant advantage of RT-2 is that it can act without task-specific training. For example, when asked to throw away trash, the model already has a general understanding of what trash is and can identify it without being explicitly taught. It can also work out how to dispose of the trash, despite never having been trained on that specific action. In internal tests, RT-2 performed as well as RT-1 on familiar tasks. On novel, unseen scenarios, however, RT-2's success rate nearly doubled RT-1's, reaching 62% compared to 32%.
The potential applications of advanced vision-language-action (VLA) models like RT-2 are vast. They could lead to context-aware robots capable of reasoning, problem-solving, and interpreting information to perform a diverse range of actions in the real world, depending on the specific situation. For instance, in warehouses, instead of robots performing repetitive actions for all objects, enterprises could employ machines that handle each item differently, considering factors like object type, weight, fragility, and other relevant parameters.
The impact of such developments is reflected in market projections. According to Markets and Markets, the AI-driven robotics segment is expected to grow from $6.9 billion in 2021 to $35.3 billion in 2026, a compound annual growth rate (CAGR) of 38.6%. As the technology advances and reaches its full potential, we can expect a transformative impact on various industries and everyday life.
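As a quick sanity check on those figures, compounding $6.9 billion at 38.6% annually over the five years from 2021 to 2026 does land on roughly $35.3 billion:

```python
# Verify the cited projection: $6.9B (2021) at a 38.6% CAGR over 5 years.
start, cagr, years = 6.9, 0.386, 5
projected = start * (1 + cagr) ** years
print(f"${projected:.1f}B")  # ~$35.3B, matching the reported 2026 figure
```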