“Do As I Can, Not As I Say (SayCan): Grounding Language in Robotic Affordances”, Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J. Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan (2022-04-04):

[demo video, Kilcher/interview, Twitter; more powerful closed-loop version; cf. Socratic models, Decision Transformers/Gato, Flamingo] Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a major weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment.

We propose to provide real-world grounding by means of pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model’s “hands and eyes”, while the language model supplies high-level semantic knowledge about the task.

We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment.
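The combination described above can be sketched in a few lines. This is a hedged, minimal illustration (not the authors' code): each candidate skill is scored by the product of the language model's probability of the skill's text description ("say") and the skill's learned value function ("can"); `llm_log_prob` and `affordance_value` are assumed stand-in callables.

```python
import math

def saycan_score(skills, llm_log_prob, affordance_value, instruction, history):
    """Pick the skill maximizing p_LLM(skill | instruction, history) * value(skill).

    llm_log_prob: stand-in for the LLM's log-probability of a skill description.
    affordance_value: stand-in for the skill's value function in the current state.
    """
    best_skill, best_score = None, float("-inf")
    for skill in skills:
        # Semantic term: how likely the LLM rates this skill as the next step.
        p_llm = math.exp(llm_log_prob(instruction, history, skill))
        # Grounding term: probability the skill succeeds from the current state.
        p_can = affordance_value(skill)
        score = p_llm * p_can
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill, best_score
```

With stub scorers, a skill the LLM favors and that is feasible wins over one that is merely semantically plausible:

```python
skills = ["pick up sponge", "go to table"]
llm = lambda ins, hist, s: 0.0 if s == "pick up sponge" else -2.0
val = lambda s: 0.9 if s == "pick up sponge" else 0.5
best, _ = saycan_score(skills, llm, val, "clean the spill", [])
# best == "pick up sponge"
```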

We evaluate our method, SayCan, on a number of real-world robotic tasks, where we demonstrate the need for real-world grounding and show that this approach is capable of completing long-horizon, abstract, natural-language instructions on a mobile manipulator.


The project website and videos can be found on GitHub.

Figure 10: Per-skill evaluation performance of the best policies, and number of skills, over the duration of the project. Both the performance and the number of skills the robots can handle grow over time, thanks to continuous data-collection efforts and improvements to the policy-training algorithms.

[Eric Jang on scaling: “I’m very proud of how we scaled up # of tasks vs. time in the SayCan paper. Some of these tasks (opening a drawer or flipping a bottle upright) are quite challenging. The jump 551 → 100,000 tasks will not require much additional engineering, just additional data collection.”]

Figure 2: A scoring language model is queried with a prompt-engineered context of examples and the high-level instruction to execute and outputs the probability of each skill being selected. To iteratively plan the next steps, the selected skill is added to the natural language query and the language model is queried again.
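The iterative re-querying described in the caption can be sketched as a greedy loop. This is a hedged illustration under assumed names: `score_skills` stands in for the full LLM + value-function scoring step, and the loop appends each selected skill to the query context until a terminator is selected.

```python
def plan(instruction, score_skills, max_steps=10, done_token="done"):
    """Greedy plan construction: select the best-scoring skill, append it to
    the language-model query context, and re-query until termination."""
    history = []
    for _ in range(max_steps):
        skill = score_skills(instruction, history)  # argmax over skill scores
        if skill == done_token:
            break
        history.append(skill)  # selected skill extends the next LM query
    return history
```

For example, a stub scorer that emits "find sponge", "pick up sponge", then "done" yields the two-step plan.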

Skill specification, reward functions, and action space: To complete the description of the underlying MDP that we consider, we provide the reward function as well as the skill specification that is used by the policies and value functions. As mentioned previously, for skill specification we use a set of short natural-language descriptions that are represented as language-model embeddings. We use sparse reward functions with a reward of 1.0 at the end of an episode if the language command was executed successfully, and 0.0 otherwise. Success is rated by human raters, who are shown a video of the robot performing the skill together with the given instruction. If 2 out of the 3 raters agree that the skill was accomplished successfully, the episode is labeled with a positive reward.

To additionally process the data, we also ask the raters to mark episodes as unsafe (i.e., if the robot collided with the environment), undesirable (i.e., if the robot perturbed objects that were not relevant to the skill), or infeasible (i.e., if the skill cannot be done or is already accomplished). If any of these conditions are met, the episode is excluded from training.
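The labeling rule above amounts to a majority vote plus an exclusion filter; a minimal sketch (function and flag names are my own, not from the paper):

```python
# Flags that cause an episode to be dropped from training entirely.
EXCLUSION_FLAGS = {"unsafe", "undesirable", "infeasible"}

def label_episode(rater_success, rater_flags):
    """Return (reward, keep) for one episode.

    rater_success: list of 3 booleans, one per rater.
    rater_flags:   set of flags any rater applied to the episode.
    """
    if rater_flags & EXCLUSION_FLAGS:
        return None, False  # excluded from training
    # Sparse reward: 1.0 only if at least 2 of 3 raters judged it a success.
    reward = 1.0 if sum(rater_success) >= 2 else 0.0
    return reward, True
```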

Results: …Across all instruction families in the mock kitchen, SayCan achieved a planning success rate of 70% and an execution success rate of 61%. In Table 3 we further verify the performance of SayCan outside the lab setting, in the real kitchen, on a subset of the instructions, particularly to verify the performance of the policies and value functions in this setting. We find no substantial loss of performance between the two settings, indicating that SayCan and the underlying policies generalize well to the full kitchen. The full task list and results can be found in Appendix Table 5, and videos of experiment rollouts and the decision-making process can be found on the project website.

…When comparing the performance of different instruction families in Tables 2 & 3 (see Table 1 for an explanation), we see that the natural language nouns family performed worse than natural language verbs, due to the larger number of possible nouns (15 objects and 5 locations) versus verbs (6). The structured language tasks (created to ablate the performance loss of spelling out the solution versus understanding the query) were planned correctly 100% of the time, while their natural language verb counterparts were planned correctly 80% of the time. This indicates that the skill sequence is manageable for the planner, but that the natural language was challenging to parse.

The embodiment tasks were planned correctly 64% of the time, with failures generally resulting from affordance-function misclassification. SayCan planned and executed crowd-sourced natural queries with performance on par with the other instruction families. SayCan performed worst on the most challenging long-horizon tasks, where most failures resulted from early termination by the LLM (e.g., bringing one object but not the second). We also find that SayCan struggles with negation (e.g., “bring me something that isn’t an apple from the table”) and ambiguous references (e.g., asking for drinks with caffeine, or sugary drinks), which is a known issue inherited from the underlying language models (Hosseini et al 2021). Overall, 65% of the errors were a result of LLM failures and 35% were affordance errors.