Large foundation models can exhibit unique capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., from spreadsheets to SAT questions). As a result, these models store different forms of commonsense knowledge across different domains.
In this work, we show that this model diversity is symbiotic, and can be leveraged to build AI systems with structured Socratic dialogue—in which new multimodal tasks are formulated as a guided language-based exchange between different pre-existing foundation models, without additional finetuning.
In the context of egocentric perception, we present a case study of Socratic Models (SMs) that can provide meaningful results for complex tasks such as generating free-form answers to contextual questions about egocentric video, by formulating video Q&A as short story Q&A, i.e., summarizing the video into a short story, then answering questions about it. Additionally, SMs can generate captions for Internet images, and are competitive with the state of the art on zero-shot video-to-text retrieval with 42.8 R@1 on MSR-VTT 1k-A.
SMs demonstrate how to compose foundation models zero-shot to capture new multimodal functionalities, without domain-specific data collection.
Figure 2: In this work we propose Socratic Models (SMs), a framework that uses structured dialogue between pre-existing foundation models, each of which can exhibit unique (but complementary) capabilities depending on the distribution of data on which it is trained. On various perceptual tasks (shown), this work presents a case study of SMs with visual language models (VLMs, e.g., CLIP), large language models (LMs, e.g., GPT-3, RoBERTa), and audio language models (ALMs, e.g., Wav2CLIP, Speech2Text).
From video search, to image captioning; from generating free-form answers to contextual reasoning questions, to forecasting future activities—SMs can provide meaningful results for complex tasks across classically challenging computer vision domains, without any model finetuning.
…Across a number of tasks spanning vision, language, and audio modalities, we find that specific instantiations of SMs, using LMs together with VLMs and audio-language models (ALMs), can generate results on challenging perceptual tasks (examples in Figure 2) that are often coherent and correct. We present results on Internet image captioning (§4) and the common video understanding task of video-to-text retrieval (§5), but our highlighted application is open-ended reasoning in the context of egocentric perception (Figure 4)—from answering free-form contextual reasoning questions about first-person videos (e.g., “why did I go to the front porch today?”), to forecasting events into the future with commonsense (e.g., “what will I do 3 hours from now?”).
Our egocentric SM system consists of two primary components, each of which benefits from multimodal multi-model discussions: (1) assembling video into a language-based world-state history, i.e., a story or event log, then (2) performing various types of open-ended text-prompted tasks based on that world-state history.
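The two components above can be sketched in a few lines. This is a minimal illustration, not the system's actual interface: `caption_frame` stands in for a VLM/LM captioning exchange and `complete` for an LM completion call, both of which are hypothetical stubs.

```python
from typing import Callable, List, Tuple


def build_world_state_history(
    frames: List[Tuple[str, object]],        # (timestamp, frame) pairs
    caption_frame: Callable[[object], str],  # hypothetical VLM/LM captioner
) -> str:
    """Component (1): assemble sampled video frames into a language-based event log."""
    entries = [f"{ts}: {caption_frame(frame)}" for ts, frame in frames]
    return "\n".join(entries)


def answer_from_history(
    history: str,
    question: str,
    complete: Callable[[str], str],          # hypothetical LM completion call
) -> str:
    """Component (2): pose an open-ended question against the world-state history."""
    prompt = f"{history}\nQ: {question}\nA:"
    return complete(prompt)
```

In this framing, the world-state history is an ordinary text document, so any downstream task that can be phrased as a prompt over that document (Q&A, forecasting, summarization) comes for free.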
We find that simple scripted policies to guide a closed-loop exchange between pre-trained LM, VLM, and ALM models can (1) generate meaningful captions that respond to questions like “what am I doing?” with answers like “receiving a package” that span beyond the label set of standard vision datasets (Sigurdsson et al., 2018; Smaira et al., 2020), and (2) exhibit open-ended contextual Q&A capabilities previously thought to be out-of-reach for egocentric perception without domain-specific data collection (Grauman et al., 2021; Damen et al., 2020).
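One such scripted policy can be sketched as a propose-and-rank loop: the LM proposes candidate activity descriptions in open-ended language, and the VLM scores each candidate against the current frame. The `propose` and `score` callables and the prompt string below are illustrative assumptions, standing in for LM sampling and CLIP-style image-text similarity respectively.

```python
from typing import Callable, List


def describe_activity(
    frame: object,
    propose: Callable[[str], List[str]],    # hypothetical LM: prompt -> candidate texts
    score: Callable[[object, str], float],  # hypothetical VLM: (image, text) -> similarity
    prompt: str = "I am an egocentric camera. What am I doing?",
) -> str:
    """One step of the closed-loop exchange: the LM proposes, the VLM ranks."""
    candidates = propose(prompt)
    # Keep the candidate the VLM judges most consistent with the frame.
    return max(candidates, key=lambda text: score(frame, text))
```

Because the LM generates candidates rather than selecting from a fixed label set, the resulting captions are not restricted to the categories of any particular vision dataset.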
…In the context of egocentric perception, we find that formulating video Q&A as reading comprehension in SMs directly leverages the extent to which large LMs are capable of logical reasoning by connecting commonsense relationships with knowledge learned from Internet-scale data. For example, the system returns the following answer when presented with the world-state history log:
8:00 AM: went to grocery store to buy orange juice, chocolate, and bread.
8:15 AM: went to gas station to fill up the vehicle tank.
8:30 AM: drove back home and left the groceries in the kitchen.
8:45 AM: started cooking eggs in the pan.
9:00 AM: the dog went into the kitchen.
9:15 AM: took the dog out for a walk.
9:30 AM: the dog is sick.
Q: Why is the dog sick?
A: The dog may have eaten something it was not supposed to, such as chocolate.
Arriving at the answer requires bridging multiple connections between observations, e.g., that the dog went into the kitchen, that the groceries are still in the kitchen, and that the groceries contain chocolate.
Such results offer a glimpse of what might be possible using SMs for deductive reasoning across multiple domains of information, and raise interesting research questions on (1) how to better assemble language-based world-state histories (beyond what is presented in this work) that capture relevant evidence to improve the accuracy of conclusions, and (2) how to use chain-of-thought prompting (Wei et al., 2022) to decompose multi-step problems into intermediate ones. For example, one promising extension could be prompting the LM with chain-of-thought sequences to expand on hypotheses:
Q: What are reasons for why I might be chopping wood? A: Reasons might include: needing firewood, wanting to make a statement, or needing the exercise.
after which each hypothesis can be progressively explored by downstream subprograms, called recursively at higher resolutions until a conclusion is reached.
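The recursive exploration described above can be sketched as follows. This is a hedged illustration of the idea, not the paper's method: `complete` is a hypothetical LM interface returning a list of hypotheses, the prompt templates are invented for illustration, and a fixed depth budget stands in for a real stopping criterion.

```python
from typing import Callable, Dict, List


def explore(
    question: str,
    complete: Callable[[str], List[str]],  # hypothetical LM: prompt -> hypotheses
    depth: int = 2,
) -> Dict[str, dict]:
    """Recursively expand each hypothesis with a follow-up query, up to a depth budget."""
    if depth == 0:
        return {}
    hypotheses = complete(f"Q: {question} A: Reasons might include:")
    # Each hypothesis becomes the subject of its own downstream subprogram.
    return {
        h: explore(f"Is it likely that {h}?", complete, depth - 1)
        for h in hypotheses
    }
```

The result is a tree of progressively finer-grained questions, which a controller could then traverse to weigh evidence for each branch.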
These directions suggest pathways towards achieving increasingly meaningful utility and analysis by digital multimodal assistants.