Recent works have shown how the reasoning capabilities of Large Language Models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robots [19, 20, 21]. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to language. LLMs planning in embodied environments need to consider not just what skills to do, but also how and when to do them: answers that change over time in response to the agent's own choices.
In this work, we investigate to what extent LLMs (here, PaLM) used in such embodied contexts can reason over sources of feedback provided through natural language, without any additional training. We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios.
We investigate a variety of sources of feedback, such as success detection, scene description, and human interaction.
We find that closed-loop language feedback substantially improves high-level instruction completion in three domains, including simulated and real tabletop rearrangement tasks and long-horizon mobile manipulation tasks in a real-world kitchen environment.
Figure 1: Inner Monologue enables grounded closed-loop feedback for robot planning with large language models by leveraging a collection of perception models (e.g., scene descriptors and success detectors) in tandem with pretrained language-conditioned robot skills. Experiments show our system can reason and replan to accomplish complex long-horizon tasks for (a) mobile manipulation and (b, c) tabletop manipulation in both simulated and real settings.
Finally, we show that Inner Monologue, without requiring additional training beyond a frozen language model and pre-trained robotic skills, can accomplish complex, long-horizon, and unseen tasks in simulation as well as on two real-world robotic platforms. Notably, we show that it can efficiently retry under observed stochastic failures, replan under systematic infeasibility, or request human feedback for ambiguous queries, resulting in substantially improved performance in dynamic environments. As a demonstration of the versatility of LLMs and grounded closed-loop feedback, we additionally show several surprising capabilities emerging from the inner monologue formulation, including continued adaptation to new instructions, self-proposed goals, interactive scene understanding, multilingual interactions, and more.
3.2 Inner Monologue: We formulate an ‘inner monologue’ by continually injecting information from the various sources of feedback into the LLM planning language prompts as the robot interacts with the environment. While LLMs have demonstrated exceptional planning capabilities for embodied control tasks [20], prior works have found it crucial to ground LLM predictions with external components such as affordance functions [21] in order to produce useful plans that are executable by robots. However, LLMs used in this context have thus far remained one-directional—providing a list of skills, without making corrections or leveraging opportunities to replan accordingly. In contrast, Inner Monologue studies settings where grounded environment feedback is provided directly to the LLM in a closed-loop fashion. This promotes improved LLM reasoning in complex long-horizon settings, even before any external affordance-based grounding methods are applied.
Our analysis assumes textual feedback is provided to the planner, but does not assume a single specific method of fusing LLM planning with low-level robotic control or a specific method of extracting environment feedback into language. Rather than focusing on a particular algorithmic implementation, our aim is to provide a case study on the value of incorporating different types of feedback into closed-loop LLM-based planning. Thus, Inner Monologue in §4 uses language feedback within separate systems that incorporate different LLMs, different methods of fusing planning with control, different environments and tasks, and different methods of acquiring control policies. We note that in our specific implementations of Inner Monologue, we use pre-trained LLMs for planning that are not finetuned, but rather evaluated solely with few-shot prompting; the full prompts can be found in the Appendix.
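The closed-loop formulation above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `llm`, `execute_skill`, and the prompt line formats are hypothetical stubs standing in for a frozen few-shot-prompted LLM, pretrained robot skills, and a learned success detector.

```python
# Minimal sketch of the closed-loop "inner monologue" planning loop.
# All names here are our own illustrative stand-ins, not the paper's API.

attempts = {}

def execute_skill(action: str) -> bool:
    """Stub skill executor + success detector: the first attempt fails and
    the retry succeeds, mimicking a stochastic low-level policy failure."""
    attempts[action] = attempts.get(action, 0) + 1
    return attempts[action] >= 2

def llm(prompt: str) -> str:
    """Stub planner. A real system would query a frozen LLM whose prompt
    contains few-shot examples followed by the growing monologue."""
    if prompt.rstrip().endswith("Success: False"):
        last = [l for l in prompt.splitlines()
                if l.startswith("Robot action: ")][-1]
        return last[len("Robot action: "):]   # retry the failed skill
    if "Success: True" in prompt:
        return "done"
    return "pick up the block"

def inner_monologue(instruction: str, max_steps: int = 10):
    prompt = f"Human: {instruction}\n"        # the monologue grows here
    history = []
    for _ in range(max_steps):
        action = llm(prompt)
        if action == "done":
            break
        prompt += f"Robot action: {action}\n"
        success = execute_skill(action)       # low-level control
        prompt += f"Success: {success}\n"     # feedback re-enters the prompt
        history.append((action, success))
    return history
```

The key property is that every feedback string is appended to the same prompt the planner reads on its next step, so a failed skill naturally triggers a retry without any retraining.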
Figure 2: Various types of textual feedback. Success Detection (purple) gives task-specific task completion information, Passive Scene Description (green) gives structured semantic scene information at every planning step, and Active Scene Description (blue) gives unstructured semantic information only when queried by the LLM planner.
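The three feedback types in Figure 2 differ in what they report and when they are injected. A small sketch (the function name and line formats are our own, not an interface from the paper) of how each could be rendered as prompt text:

```python
# Illustrative rendering of the three feedback types as prompt lines.
# Naming and formats are hypothetical, chosen only for this sketch.

def format_feedback(kind: str, payload) -> str:
    if kind == "success":          # Success Detection: binary, per skill
        return f"Success: {payload}"
    if kind == "passive_scene":    # Passive Scene Description: every step
        return "Scene: " + ", ".join(payload)
    if kind == "active_scene":     # Active Scene Description: only on query
        question, answer = payload
        return f"Robot ask: {question}\nScene: {answer}"
    raise ValueError(f"unknown feedback type: {kind}")
```

Passive feedback is appended unconditionally at each planning step, whereas active feedback is produced only after the planner emits a query.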
Figure 3: Different instantiations of Inner Monologue in 3 distinct domains—simulated tabletop rearrangement (top), real-world tabletop rearrangement (middle), and real-world kitchen mobile manipulation (bottom). Each domain uses different prompts and different feedback models. Shared across the domains is the same Inner Monologue formulation, which uses a pre-trained language model to take in a human instruction and decompose it into a sequence of steps (yellow) actionable by the agent, while accounting for injected embodied feedback from different models, such as object recognizers (green) and success detectors (purple). In the real-world kitchen mobile manipulation domain (bottom), we additionally ground the actions using pre-trained affordance functions built in SayCan, which do not communicate back to the language model.
4.4 Emergent Capabilities: Although LLMs can generate fluent continuations of the prompted examples, we surprisingly find that, when informed with environment feedback, Inner Monologue demonstrates many impressive reasoning and replanning behaviors beyond the examples given in the prompt. Using a pre-trained LLM as its backbone, the method also inherits many of its appealing properties, such as versatility and general-purpose language understanding. In this section, we demonstrate a few of these emergent capabilities.
Continued Adaptation to New Instructions: Although not explicitly prompted, the LLM planner can react to human interaction that changes the high-level goal mid-task. Figure 5a demonstrates a challenging case, where human feedback changes the goal during plan execution, and then changes the goal yet again by saying ‘finish the previous task’. We can see that the planner incorporates the feedback correctly by switching tasks twice. In another instance, despite not being explicitly prompted to terminate after a human says ‘please stop’, the LLM planner generalizes to this scenario and predicts a ‘done’ action.
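Mechanically, mid-task goal changes require no special machinery: a new human utterance is appended to the monologue like any other feedback line, and the planner reacts on its next step. A toy sketch, where the hard-coded branch stands in for generalization a real LLM exhibits without being prompted for it:

```python
# Toy illustration of mid-episode goal changes. The hard-coded 'please stop'
# branch is a stand-in for behaviour a real LLM generalizes to on its own.

def plan_step(prompt: str) -> str:
    if prompt.rstrip().endswith("Human: please stop"):
        return "done"                     # unseen instruction, graceful exit
    return "pick up the red block"

prompt = "Human: put the red block in the bowl\n"
first = plan_step(prompt)                 # normal planning
prompt += "Human: please stop\n"          # goal changed mid-task
second = plan_step(prompt)
```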
Figure 5a: Informing LLM with embodied feedback enables many emergent capabilities, all of which are achieved without similar prompted examples. For instance, Inner Monologue can continually adapt to new instructions given by humans, propose new goals to achieve when faced with infeasibility for the previous plan, interact with humans in different natural languages, and answer questions about the current scene given past actions and feedback. (a) Continued Adaptation to New Instructions.
Self-Proposing Goals under Infeasibility: Instead of mindlessly following human-given instructions, Inner Monologue can also act as an interactive problem solver by proposing alternative goals to achieve when the previous goal becomes infeasible. In Figure 5b, to solve the task ‘put any 2 blocks inside the purple bowl’, Inner Monologue first attempts an action of picking up the purple block—the action fails as the purple block is intentionally made to be too heavy for the robot. After a hint ‘the purple block is too heavy’, it proposes to ‘find a lighter block’ and successfully solves the task in the end.
Figure 5b: Self-Proposing Goals under Infeasibility.
Multilingual Interaction: Pre-trained LLMs are known to be able to translate from one language to another, without any finetuning. We observe that such multilingual understanding also transfers to the embodied settings considered in this work. Specifically, in Figure 5c, the human-provided new instruction is written in Chinese, but the LLM can correctly interpret it, re-narrate it as a concrete goal to execute in English, and accordingly replan its future actions. Occasionally, we find that this capability even extends to symbols and emojis.
Figure 5c: Multilingual Interaction.
Interactive Scene Understanding: We also observe that Inner Monologue demonstrates interactive understanding of the scene using the past actions and environment feedback as context. In Figure 5d, after a task instruction has been executed, we turn to ask questions about the scene, again a structure that has not appeared in the prompt. Surprisingly, we find that it can often correctly answer these questions that require temporal and embodied reasoning.
Figure 5d: Interactive Scene Understanding.
Robustness to Feedback Order: In the main experiments of the paper, we prompted the language model following certain conventions. For instance, in the simulated tabletop domain, the convention is [Robot action, Scene, and Robot thought]. In practice, we find that the LLM planner is robust to occasionally swapping the order of feedback. In Appendix Figure 9a, a new human instruction is injected in the middle of the plan execution, but this structure has not been seen in the example prompts. Yet the planner recognizes the change and generates a new ‘Robot thought: Goal state is…’ statement allowing it to solve the new task.
Robustness to Typos: Inherited from the LLM backbone, our approach is robust to typos in human instruction, as seen in Appendix Figure 9b.
Despite the appealing findings about these emergent capabilities, we observe that their consistency varies when no similar examples have been provided in the prompt, likely limited by the current capabilities of the language models. However, we believe that further investigation into these behaviors and addressing their limitations would each lead to exciting future directions.