Article reprint source: AI Trends
Source: New Wisdom
Editor: Aeneas is so sleepy
RoboGen, billed as the world's first generative robotic agent and proposed by researchers from CMU, MIT, Tsinghua, and UMass, can generate training data endlessly, letting robots train around the clock. AIGC for robotics may well be the direction of the future.
The world's first generative robot agent is released!
For a long time, unlike language or vision models that can be trained on large-scale internet data, robot policy models have required data containing dynamic physical-interaction information, and the scarcity of such data has been the biggest bottleneck for embodied intelligence.
Recently, researchers from CMU, Tsinghua, MIT, UMass and other institutions proposed a new RoboGen agent.
By combining the broad knowledge embedded in large language models and generative models with the physical grounding of a realistic simulated world, RoboGen can generate tasks, scenes, and training data "without limit," enabling fully automated robot training 24/7.
Meanwhile, the world is rapidly running out of high-quality real tokens from the internet; the data available for training AI is nearly exhausted.
Hinton, the father of deep learning, has noted that over the next 18 months, technology companies will train new models with 100 times the compute used for GPT-4. Model parameters keep growing and the required compute is enormous, but where will the data come from?
Faced with data-hungry models, AI-synthesized data is one answer.
Paper address: https://arxiv.org/abs/2311.01455
Project homepage: https://robogen-ai.github.io/
Open source address: https://github.com/Genesis-Embodied-AI
Specifically, a research team led by MIT-IBM chief scientist Chuang Gan used generative AI and differentiable physics simulation to build a "propose-generate-learn" cycle in which the agent poses its own tasks and trains the robot by itself.
First, the agent proposes a skill the robot should develop.
It then generates the corresponding environment, configuration, and skill-learning supervision to build the simulation environment.
Finally, the agent decomposes the proposed high-level task into subtasks, selects the best learning method for each, and then learns a policy to master the proposed skill.
It is worth noting that the entire process requires almost no human supervision, and the number of tasks is actually infinite!
Nvidia senior scientist Jim Fan also reposted this research.
Now, the robot has learned a series of impressive maneuvers:
Placing items in a cabinet:
Heating a bowl of soup in the microwave:
Pulling a lever to brew coffee:
And even backflips:
Simulated environments: the key to diverse skill learning
In robotics research, there has long been a hard problem: how to equip robots with multiple skills so they can operate outside of factories and perform a wide range of tasks for humans?
In recent years, researchers have taught robots a variety of complex skills, such as fluid manipulation, object throwing, playing football, and parkour. However, these skills are isolated and short-horizon, and they rely on manually designed task descriptions and training supervision.
Because real-world data collection is expensive and laborious, these skills are typically trained in simulation with appropriate domain randomization and then deployed in the real world.
Compared with exploration and data collection in the real world, simulation environments offer many advantages: privileged access to low-level states, unlimited opportunities for exploration, large-scale parallelism that greatly speeds up data collection, and the ability for robots to develop closed-loop policies and recover from errors.
However, building a simulated environment itself requires a series of tedious steps: designing tasks, selecting relevant and semantically meaningful assets, generating plausible scene layouts and configurations, and crafting training supervision such as reward or loss functions. This greatly limits the scalability of robot skill learning, even in simulation.
We therefore propose a "generative simulation" paradigm that combines advances in simulated robot skill learning with the latest progress in foundation and generative models.
Leveraging the generative capabilities of state-of-the-art foundation models, generative simulation can produce the information needed at every stage of learning diverse robot skills in simulation.
Thanks to the comprehensive knowledge encoded in these models, the scenes and task data generated this way can closely match the distribution of real-world scenes.
In addition, these models can provide decomposed low-level subtasks that are seamlessly handled by domain-specific policy learning methods, yielding closed-loop demonstrations across diverse skills and scenarios.
RoboGen Process
RoboGen is a fully automated process that allows robots to learn various skills 24/7. It consists of four stages:
1. Task proposal;
2. Scene generation;
3. Training supervision generation;
4. Skill learning using the generated information.
Leveraging the embedded common sense and generative capabilities of the latest base models, RoboGen can automatically generate tasks, scenarios, and training supervision, enabling robots to learn multiple skills at scale.
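The four stages above can be sketched as a single loop. The function names, prompts, and data structures below are illustrative assumptions, not RoboGen's actual API:

```python
# Minimal sketch of RoboGen's four-stage "propose-generate-learn" cycle.
# Every name and return value here is a stand-in for illustration.

def propose_task(robot_type, sampled_object):
    """Stage 1: ask an LLM for a meaningful high-level task."""
    return {"task": f"{robot_type} interacts with {sampled_object}",
            "object": sampled_object}

def generate_scene(task):
    """Stage 2: generate scene components and configuration."""
    return {"assets": [task["object"]], "layout": "tabletop"}

def generate_supervision(task):
    """Stage 3: decompose the task into subtasks and pick an algorithm each."""
    return [{"subtask": "approach object", "algorithm": "motion_planning"},
            {"subtask": "manipulate object", "algorithm": "rl"}]

def learn_skill(scene, supervision):
    """Stage 4: learn each subtask sequentially with its chosen algorithm."""
    return [f"learned: {s['subtask']} via {s['algorithm']}" for s in supervision]

def robogen_cycle(robot_type, sampled_object):
    task = propose_task(robot_type, sampled_object)
    scene = generate_scene(task)
    supervision = generate_supervision(task)
    return learn_skill(scene, supervision)

print(robogen_cycle("robot arm", "microwave"))
```

Running the cycle repeatedly with different sampled objects is what makes the task supply effectively unlimited.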
Task proposal
At this stage, RoboGen is able to propose high-level tasks, generate corresponding environments, decompose high-level goals into low-level subtasks, and then learn sub-skills sequentially.
First, RoboGen generates meaningful, diverse, high-level tasks for robots to learn.
The system is initialized with a specific robot type and an object randomly sampled from a pool; the robot and sampled-object information are then fed to the LLM.
This sampling process ensures the diversity of generated tasks.
For example, legged robots such as quadrupeds can acquire a variety of locomotion skills, while a robot-arm manipulator paired with different sampled objects can perform a variety of manipulation tasks.
The current pipeline uses GPT-4 for querying, and the details of RoboGen are explained in the context of robot arms and object-manipulation tasks.
The objects used for initialization are sampled from a predefined list, including articulated and non-articulated objects commonly found in home scenes, such as an oven, microwave, water dispenser, laptop, dishwasher, etc.
Because GPT-4 has been trained on massive internet datasets, it has a rich understanding of the affordances of these objects, how to interact with them, and what meaningful tasks they can be associated with.
For example, suppose the sampled articulated object is a microwave oven, where joint 0 is the revolute joint connecting the door and joint 1 is another revolute joint controlling the timer knob. GPT-4 will return a task such as: "The robot arm puts a bowl of soup into the microwave, closes the door, and sets the microwave timer for an appropriate heating time."
The other object required for the generated task is a bowl of soup, and the joints and links relevant to the task include joint 0 (for opening the microwave door), joint 1 (for setting the timer), link 0 (the door), and link 1 (the timer knob).
For articulated objects, since PartNet-Mobility is the only high-quality dataset of articulated objects and already covers a wide variety of articulated assets, tasks are generated based on the sampled assets.
By repeatedly querying different sampled objects and examples, a variety of manipulation and motion tasks can be generated.
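The proposal step can be sketched as sampling an object and building an LLM query. The prompt wording and object pool below are illustrative assumptions, not RoboGen's actual prompt:

```python
import random

# Sketch of the task-proposal step: sample an object from a predefined pool,
# then build the query that would be sent to GPT-4. Pool and prompt text are
# stand-ins for illustration.

OBJECT_POOL = ["oven", "microwave", "water dispenser", "laptop", "dishwasher"]

def build_task_proposal_prompt(robot_type, rng):
    obj = rng.choice(OBJECT_POOL)
    prompt = (
        f"You are given a {robot_type} and a {obj}. "
        "Propose a meaningful manipulation task involving this object, "
        "and list the joints and links relevant to the task."
    )
    return obj, prompt

rng = random.Random(0)  # seeded for reproducibility
obj, prompt = build_task_proposal_prompt("robot arm", rng)
print(prompt)
```

Repeating this sampling with different seeds and objects is what drives the diversity of the generated tasks.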
Scene Generation
Given a task, the corresponding simulation scene can then be generated for learning the skill needed to complete it.
As shown in the figure, scene components and configurations are generated from the task description, object assets are retrieved or generated, and the simulation scene is then populated.
Scene components and configurations consist of the following elements: queries for the relevant assets to populate the scene, their physical parameters (e.g. size), configuration (e.g. initial joint angles), and the overall spatial configuration of the assets.
Beyond the object assets required by the task generated in the previous step, to increase the complexity and diversity of the generated scenes while keeping them close to the object distribution of real scenes, the researchers also have GPT-4 return additional queries for objects semantically related to the task.
For example, for the task “open the cabinet, put the toys in it, and then close it,” the generated scene will also include a living room cushion, a desk lamp, a book, and an office chair.
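For concreteness, a hypothetical scene configuration for this cabinet task might look like the following. All field names and values are illustrative assumptions, not RoboGen's actual output format:

```python
# Hypothetical scene configuration mirroring the elements listed above:
# asset queries, physical parameters (size), initial configuration
# (joint angles), and overall spatial layout.

scene_config = {
    "task_assets": [
        {"query": "cabinet", "size_m": 1.2, "init_joint_angles": [0.0]},
        {"query": "toy", "size_m": 0.1},
    ],
    # Semantically related distractors returned by the LLM to make the
    # scene resemble a real room:
    "related_assets": ["cushion", "desk lamp", "book", "office chair"],
    "spatial_layout": {"cabinet": "against wall", "toy": "on floor near robot"},
}

for asset in scene_config["task_assets"]:
    print(asset["query"], asset.get("size_m"))
```

A configuration like this would then be used to retrieve assets and populate the simulator.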
Training supervision generation
To acquire the relevant skills, skill learning needs supervision.
RoboGen first queries GPT-4 to plan and decompose long-horizon tasks into shorter subtasks.
A key assumption is that when a task is decomposed into sufficiently short subtasks, each subtask can be reliably solved by existing algorithms such as reinforcement learning, motion planning, trajectory optimization, etc.
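As a hedged illustration of what per-subtask supervision can look like, here is a minimal dense reward for an "open the microwave door" subtask, assuming the door's revolute joint angle can be read from the simulator. The function name and interface are assumptions, not RoboGen's generated code:

```python
# Dense reward sketch for the subtask "open the microwave door":
# reward grows as the door's revolute joint angle approaches its open limit.

def door_opening_reward(joint_angle: float, open_limit: float) -> float:
    """Reward in [0, 1]: fraction of the way the door is open, clamped."""
    frac = max(0.0, min(joint_angle / open_limit, 1.0))
    return frac

print(door_opening_reward(0.0, 1.5))  # door closed
print(door_opening_reward(1.5, 1.5))  # fully open
```

A reward shaped like this gives a learning algorithm a smooth signal at every step rather than only at task completion.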
After decomposition, RoboGen queries GPT-4 and selects the appropriate algorithm to solve each subtask.
RoboGen integrates several different types of learning algorithms: reinforcement learning, evolutionary strategies, gradient-based trajectory optimization, and action initialization with motion planning.
Each is better suited for different tasks; for example, gradient-based trajectory optimization is better suited for learning fine-grained manipulation tasks involving soft bodies, such as shaping dough into a target shape.
Action initialization combined with motion planning is more reliable in solving tasks, such as approaching a target object via a collision-free path.
Reinforcement learning and evolutionary strategies are better suited for tasks that are contact-rich and involve continuous interaction with other scene components, such as legged locomotion, or when the desired action cannot be simply parameterized by a discrete end-effector pose, such as turning a knob on an oven.
In short, GPT-4 will choose which algorithm to use online based on the generated subtasks.
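A minimal sketch of this selection step is below. In RoboGen the choice is made by GPT-4 per subtask, so the keyword rules here are only a stand-in for illustration:

```python
# Stand-in for GPT-4's per-subtask algorithm choice, using the mapping
# described above: soft-body shaping -> gradient-based trajectory
# optimization; collision-free approach -> motion planning with action
# initialization; contact-rich tasks -> reinforcement learning.

def select_algorithm(subtask: str) -> str:
    s = subtask.lower()
    if "dough" in s or "soft" in s:
        return "gradient_trajectory_optimization"
    if "approach" in s or "reach" in s:
        return "motion_planning_with_action_init"
    # Default: contact-rich interaction (locomotion, knobs, etc.)
    return "reinforcement_learning"

for st in ["approach the microwave", "turn the oven knob", "shape the dough"]:
    print(st, "->", select_algorithm(st))
```

In the real pipeline this dispatch is free-form LLM reasoning rather than keyword matching, but the effect is the same: each subtask is routed to the learner best suited to it.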
With all of this in place, simulation scenarios can be built in which robots learn skills.
The robot learns to open a safe
For example, RoboGen can have the robot learn very delicate tasks, such as adjusting the direction of a desk lamp.
Interestingly, in this scene there are fragile objects, like a computer monitor, on the ground.
This is a real test of the robot's ability to perceive its environment.
To this end, RoboGen generates very detailed operation code, including scene configuration, task decomposition and supervision:
In addition, the robot will be trained to perform tasks that require many steps to complete, such as taking items out of a safe.
This involves opening the door, taking the item out, putting it down, and closing the door, all while avoiding collisions with the furniture.
The code given by RoboGen is as follows:
Or tasks like having a Boston Dynamics humanoid robot turn in a circle, a maneuver it might need in a confined space.
The code is as follows:
Experimental Results
- Task diversity
As shown in Table 1, RoboGen achieves the lowest Self-BLEU and embedding similarity compared to all previous baselines. In other words, the diversity of RoboGen-generated tasks is higher than that of hand-crafted skill learning benchmarks and datasets!
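Self-BLEU scores each generated task description against all the others, so lower values mean more diverse text. A simplified pure-Python version (n-gram precision only, without the brevity penalty of the full metric) illustrates the idea:

```python
from collections import Counter

# Simplified Self-BLEU: average clipped n-gram precision of each text
# against all the others. Lower = more diverse. This is an illustrative
# stand-in, not the exact metric used in Table 1.

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(ref, n).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

def self_bleu(texts, max_n=2):
    toks = [t.lower().split() for t in texts]
    scores = []
    for i, cand in enumerate(toks):
        refs = toks[:i] + toks[i + 1:]
        precisions = [modified_precision(cand, refs, k) for k in range(1, max_n + 1)]
        scores.append(sum(precisions) / len(precisions))
    return sum(scores) / len(scores)

diverse = ["open the microwave door", "fold the towel neatly", "turn the lamp knob"]
similar = ["open the microwave door", "open the microwave door slowly", "open the oven door"]
print(self_bleu(diverse), self_bleu(similar))
```

The near-duplicate task descriptions score much higher than the varied ones, which is exactly why a low Self-BLEU in Table 1 indicates high task diversity.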
- Scenario effectiveness
As shown in Figure 4, removing size verification causes the BLIP-2 score to drop dramatically, because object sizes in Objaverse and PartNet-Mobility differ greatly from real-world sizes. The BLIP-2 score without object verification is also lower and has larger variance.
In contrast, the verification step in RoboGen can significantly improve the effectiveness of object selection.
- Effectiveness of training instruction
Figure 3 shows the skills the robot learned on four long-horizon tasks using the training supervision (i.e., task decomposition and reward functions) generated by RoboGen.
The results show that the robot successfully learned the skills to complete the corresponding tasks. In other words, the automatically generated supervision can effectively induce meaningful and useful skills.
- Skills learning
The results in Table 2 show that being able to choose among learning algorithms improves task performance; with RL alone, skill learning fails on most tasks.
- System
As shown in Figure 1, RoboGen can generate a variety of tasks for skill learning, including rigid/articulated object manipulation, locomotion, and soft-body manipulation.
Figure 3 further demonstrates that RoboGen can deliver long-horizon manipulation skills through reasonable task decomposition.
About the authors
Yufei Wang is a third-year doctoral student at the Robotics Institute of Carnegie Mellon University, advised by Professors Zackory Erickson and David Held, with a research interest in robot learning.
He previously received a master's degree in computer science from CMU in December 2020, advised by Professor David Held, and a bachelor's degree in data science from Peking University's Yuanpei College in July 2019, advised by Professor Bin Dong.
Zhou Xian is a PhD student at the Robotics Institute at Carnegie Mellon University, with Katerina Fragkiadaki as his advisor. His research interests are robotics, computer vision, and world model learning.
Prior to entering CMU, he completed his bachelor's degree at Nanyang Technological University, Singapore, advised by Pham Quang Cuong and I-Ming Chen. He also interned at Meta AI (mentored by Akshara Rai) and the MIT-IBM AI Lab (mentored by Chuang Gan).
Currently, his research focuses on building a unified neural policy and simulation infrastructure for scalable robot learning.
In addition, co-author Chen Feng comes from Tsinghua University's Yao Class.
The team leader, Chuang Gan, is currently MIT-IBM chief scientist and an assistant professor at the University of Massachusetts. He is a student of Academician Andrew Chi-Chih Yao (Yao Qizhi). During his doctoral studies, he received the Tsinghua Special Scholarship, the Microsoft Fellowship, and the Baidu Fellowship. His research has also been funded by the Amazon Research Award, Sony Faculty Award, Cisco Faculty Award, and Microsoft Accelerate Foundation Models Research Program, among others.
References:
https://robogen-ai.github.io