“Ask Me Anything (AMA): A Simple Strategy for Prompting Language Models”, Simran Arora, Avanika Narayan, Mayee F. Chen, Laurel Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, Christopher Ré, 2022-10-05:

Large language models (LLMs) transfer well to new tasks out-of-the-box simply given a natural language prompt that demonstrates how to perform the task and no additional training. Prompting is a brittle process wherein small modifications to the prompt can cause large variations in the model predictions, and therefore effort is dedicated towards designing a painstakingly “perfect prompt” for a task. To mitigate the high degree of effort involved in prompt-design, we instead ask whether producing multiple effective, yet imperfect, prompts and aggregating them can lead to a high quality prompting strategy.

Our observations motivate our proposed prompting method, ASK ME ANYTHING (AMA). We first develop an understanding of the effective prompt formats, finding that question-answering (QA) prompts, which encourage open-ended generation (“Who went to the park?”), tend to outperform those that restrict the model outputs (“John went to the park. Output True or False.”).

Our approach recursively uses the LLM itself to transform task inputs to the effective QA format. We apply the collected prompts to obtain several noisy votes for the input’s true label. We find that the prompts can have very different accuracies and complex dependencies and thus propose to use weak supervision, a procedure for combining the noisy predictions, to produce the final predictions for the inputs.
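The aggregation step can be illustrated with a minimal sketch. It assumes each prompt chain is a callable that maps a task input to a predicted label, and it combines the votes by simple majority vote, the MV baseline named in the Figure 4 caption. It does not implement the weak supervision procedure, which additionally accounts for the prompts' differing accuracies and dependencies. The names `PromptChain`, `majority_vote`, and `aggregate` are hypothetical, and the stub chains stand in for real LLM calls.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical type: a prompt chain maps a task input to a predicted label.
# In practice each chain would call an LLM; here they are plain functions.
PromptChain = Callable[[str], str]


def majority_vote(votes: Sequence[str]) -> str:
    """Return the most frequent label among the votes."""
    if not votes:
        raise ValueError("need at least one vote")
    return Counter(votes).most_common(1)[0][0]


def aggregate(chains: Sequence[PromptChain], task_input: str) -> str:
    """Apply every prompt chain to the input and combine the noisy votes."""
    votes = [chain(task_input) for chain in chains]
    return majority_vote(votes)


# Stub chains (assumptions for illustration): two answer "True", one "False".
stub_chains = [
    lambda text: "True",
    lambda text: "False",
    lambda text: "True",
]

prediction = aggregate(stub_chains, "John went to the park.")
```

Majority vote treats every prompt as equally reliable and independent, which is exactly the assumption the abstract says does not hold; it is shown here only as the simplest possible aggregator.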

We evaluate AMA across open-source model families (e.g., EleutherAI, BLOOM, OPT, and T0) and model sizes (125M–175B parameters), demonstrating an average performance lift of 10.2% over the few-shot baseline. This simple strategy enables the open-source GPT-J-6B model to match and exceed the performance of few-shot GPT-3-175B on 15⁄20 popular benchmarks. Averaged across these tasks, the GPT-J-6B model outperforms few-shot GPT-3-175B.

We release our code here: https://github.com/HazyResearch/ama_prompting.


There has been recent interest in how LLMs improve primarily along the 3 axes of parameter scale, training data, and compute [Kaplan et al 2020, Hoffmann et al 2022, Wei et al 2022c]. In Figure 4, as we increase the number of prompts to be aggregated, the conditional entropy reduces. Prompt aggregation may be another useful axis for understanding LLM scaling performance.
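For concreteness, the conditional entropy H(y | ŷ) of true labels y given predictions ŷ can be estimated from paired samples. The sketch below uses the standard empirical estimate from joint and marginal counts. The function name `conditional_entropy` is hypothetical, and the choice of base-2 logarithm (bits) is an assumption, since the excerpt does not state the units used in the paper.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def conditional_entropy(y_true: Sequence[Hashable],
                        y_pred: Sequence[Hashable]) -> float:
    """Empirical H(y | y_hat) in bits (base 2 is an assumption)."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    joint = Counter(zip(y_true, y_pred))   # counts of (y, y_hat) pairs
    pred_counts = Counter(y_pred)          # counts of y_hat alone

    h = 0.0
    for (_, y_hat), count in joint.items():
        p_joint = count / n                    # p(y, y_hat)
        p_cond = count / pred_counts[y_hat]    # p(y | y_hat)
        h -= p_joint * math.log2(p_cond)
    return h


# Predictions that determine the label exactly leave no uncertainty.
h_perfect = conditional_entropy([0, 1, 0, 1], [0, 1, 0, 1])
# Predictions that carry no information about a balanced binary label.
h_uninformative = conditional_entropy([0, 1, 0, 1], [0, 0, 1, 1])
```

Lower values mean the prediction leaves less residual uncertainty about the true label, which is the sense in which the entropy falls as more prompts are aggregated.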

Figure 4: The top plots are for EleutherAI models of sizes ∈ {125M, 1.3B, 6B, 20B} and the bottom plots are for BLOOM models of sizes ∈ {560M, 1.7B, 7.1B, 175B}. The left plots show the conditional entropy metric H(y | ŷ) as a function of model size. Lines represent different prompts p with k = {0, 2, 4, 8} in-context examples and AMA prompt-chains without aggregation. The right plots show the conditional entropy as we aggregate predictions over an increasing number of AMA prompt-chains, with both the majority vote (MV) and weak supervision (WS) aggregation strategies for the GPT-J-6B and BLOOM 7.1B models. All plots are over RTE and each k-shot point is the average of 4 seeds.