Large language models (LLMs) transfer well to new tasks out-of-the-box given only a natural-language prompt that demonstrates how to perform the task, with no additional training. Prompting, however, is a brittle process: small modifications to the prompt can cause large variations in the model predictions, so considerable effort is dedicated to designing a painstakingly "perfect prompt" for each task. To mitigate the high degree of effort involved in prompt design, we instead ask whether producing multiple effective, yet imperfect, prompts and aggregating their outputs can lead to a high-quality prompting strategy.
Our observations motivate our proposed prompting method, ASK ME ANYTHING (AMA). We first develop an understanding of the effective prompt formats, finding that question-answering (QA) prompts, which encourage open-ended generation ("Who went to the park?"), tend to outperform those that restrict the model outputs ("John went to the park. Output True or False.").
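The reformatting idea above can be illustrated as a two-step prompt chain. This is a minimal sketch, not the paper's implementation: `qa_prompt_chain` and the toy stand-in model `toy_llm` are hypothetical names, and `llm` is assumed to be any text-completion callable (e.g., a wrapper around an open-source model).

```python
def qa_prompt_chain(statement, llm):
    """Recast a restrictive True/False input as open-ended QA.

    A sketch of the two-step idea: `llm` is any callable that maps a
    prompt string to a completion string (an assumption, not an API
    from the paper).
    """
    # Step 1: ask the LLM to rewrite the statement as a question.
    q_prompt = ("Rewrite the statement as a question.\n"
                f"Statement: {statement}\nQuestion:")
    question = llm(q_prompt).strip()
    # Step 2: answer the open-ended question instead of forcing the
    # model to emit "True" or "False".
    a_prompt = f"Answer the question.\nQuestion: {question}\nAnswer:"
    return llm(a_prompt).strip()

# Deterministic toy stand-in model, for illustration only.
def toy_llm(prompt):
    if prompt.startswith("Rewrite"):
        return " Who went to the park?"
    return " John"

print(qa_prompt_chain("John went to the park.", toy_llm))  # John
```

In practice each step would be a separate call to the same LLM, so the model itself performs the reformatting.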
Our approach recursively uses the LLM itself to transform task inputs to the effective QA format. We apply the collected prompts to obtain several noisy votes for the input's true label. Because the prompts can have very different accuracies and complex dependencies, we propose to use weak supervision, a procedure for combining the noisy predictions, to produce the final predictions for the inputs.
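To make the aggregation step concrete, here is a minimal sketch contrasting a plain majority vote with an accuracy-weighted vote. The weighted variant is only a stand-in for a learned weak-supervision label model (which would estimate prompt accuracies and dependencies from unlabeled data); here the per-prompt accuracies are assumed given, and all function names are illustrative.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label most prompts agree on (ties broken arbitrarily)."""
    return Counter(votes).most_common(1)[0][0]

def weighted_vote(votes, accuracies):
    """Accuracy-weighted vote: a simplified stand-in for a weak-supervision
    label model, assuming per-prompt accuracies are known up front."""
    scores = {}
    for label, acc in zip(votes, accuracies):
        scores[label] = scores.get(label, 0.0) + acc
    return max(scores, key=scores.get)

# Three noisy prompt "votes" for one input's label.
votes = ["True", "False", "True"]
print(majority_vote(votes))                    # True
print(weighted_vote(votes, [0.6, 0.9, 0.55]))  # True (0.6 + 0.55 > 0.9)
```

A single highly accurate prompt can overrule several weak ones under the weighted scheme, which is the behavior a label model learns rather than assumes.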
We evaluate AMA across open-source model families (e.g., EleutherAI, BLOOM, OPT, and T0) and model sizes (125M–175B parameters), demonstrating an average performance lift of 10.2% over the few-shot baseline. This simple strategy enables the open-source GPT-J-6B model to match and exceed the performance of few-shot GPT-3-175B on 15 of 20 popular benchmarks. Averaged across these tasks, the GPT-J-6B model outperforms few-shot GPT-3-175B.
…There has been recent interest in how LLMs improve primarily along the three axes of parameter scale, training data, and compute [Kaplan et al. 2020, Hoffmann et al. 2022, Wei et al. 2022c]. In Figure 4, as we increase the number of prompts to be aggregated, the conditional entropy decreases. Prompt aggregation may be another useful axis for understanding LLM scaling performance.
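The metric behind Figure 4 can be estimated directly from paired labels and predictions. This sketch computes the empirical conditional entropy H(y | ŷ) in bits; the function name is illustrative and the estimator is the plain plug-in one over observed counts, not the paper's exact evaluation code.

```python
import math
from collections import Counter

def conditional_entropy(y, y_hat):
    """Plug-in estimate of H(y | y_hat) in bits from paired samples."""
    n = len(y)
    joint = Counter(zip(y_hat, y))   # counts of (prediction, label) pairs
    marginal = Counter(y_hat)        # counts of each prediction
    h = 0.0
    for (pred, _label), count in joint.items():
        p_joint = count / n                 # P(y_hat = pred, y = label)
        p_cond = count / marginal[pred]     # P(y = label | y_hat = pred)
        h -= p_joint * math.log2(p_cond)
    return h

# Perfectly informative predictions leave zero uncertainty about y.
print(conditional_entropy([0, 1, 0, 1], [0, 1, 0, 1]))  # 0.0
# Constant predictions leave the full label entropy (1 bit here).
print(conditional_entropy([0, 1, 0, 1], [0, 0, 0, 0]))  # 1.0
```

Lower values mean the (aggregated) predictions ŷ carry more information about the true label y, which is why the metric falling as more prompt-chains are aggregated supports the aggregation hypothesis.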
Figure 4: The top plots are for EleutherAI models of sizes ∈ {125M, 1.3B, 6B, 20B} and the bottom plots are for BLOOM models of sizes ∈ {560M, 1.7B, 7.1B, 176B}. The left plots show the conditional entropy metric H(y|ŷ) as a function of model size. Lines represent different prompts p with k = {0, 2, 4, 8} in-context examples and AMA prompt-chains without aggregation. The right plots show the conditional entropy as we aggregate predictions over an increasing number of AMA prompt-chains, with both the majority vote (MV) and weak supervision (WS) aggregation strategies for the GPT-J-6B and BLOOM-7.1B models. All plots are over RTE and each k-shot point is the average of 4 seeds.