
ASK ME ANYTHING: A SIMPLE STRATEGY FOR PROMPTING LANGUAGE MODELS

Simran Arora1,∗ , Avanika Narayan1,∗, Mayee F. Chen1, Laurel Orr1, Neel Guha1, Kush Bhatia1, Ines Chami2, Frederic Sala3, and Christopher Ré1
1Stanford University
2Numbers Station
3University of Wisconsin-Madison

ABSTRACT

Large language models (LLMs) transfer well to new tasks out-of-the-box simply given a natural language prompt that demonstrates how to perform the task and no additional training. Prompting is a brittle process wherein small modifications to the prompt can cause large variations in the model predictions, and therefore significant effort is dedicated towards designing a painstakingly perfect prompt for a task. To mitigate the high degree of effort involved in prompting, we instead ask whether collecting multiple effective, yet imperfect, prompts and aggregating them can lead to a high quality prompting strategy. Our observations motivate our proposed prompting method, ASK ME ANYTHING PROMPTING (AMA). We first develop an understanding of the effective prompt formats, finding question-answering (QA) prompts, which encourage open-ended generation (“Who went to the park?”) tend to outperform those that restrict the model outputs (“John went to the park. Output True or False”). Our approach recursively uses the LLM to transform task inputs to the effective QA format. We apply these prompts to collect several noisy votes for the input’s true label. We find that these prompts can have very different accuracies and complex dependencies and thus propose to use weak supervision, a procedure for combining the noisy predictions, to produce the final predictions. We evaluate AMA across open-source model families (EleutherAI, BLOOM, OPT, and T0) and sizes (125M-175B parameters), demonstrating an average performance lift of 10.2% over the few-shot baseline. This simple strategy enables the open-source GPT-J-6B model to match and exceed the performance of few-shot GPT3-175B on 15 of 20 popular benchmarks. Averaged across these tasks, the GPT-J-6B model outperforms few-shot GPT3-175B. We release our code for reproducing the results here: https://github.com/HazyResearch/ama_prompting.

1 Introduction

Large language models (LLMs) are bringing us closer to the goal of task-agnostic machine learning [Brown et al., 2020, Bommasani et al., 2021]. Rather than training models for new tasks, LLMs are being applied to new tasks out-of-the box. In this paradigm, termed in-context learning, LLMs are instead controlled via natural language task specifications, or prompts. A prompt is defined by a template, which contains placeholders for the description and demonstrations of the inputs and outputs for the task.
Recent work has evaluated LLM prompting performance on a broad set of tasks and finds the process to be brittle — small changes to the prompt result in large performance variations [Zhao et al., 2021, Holtzman et al., 2021]. The performance further varies depending on the chosen LLM family [Ouyang et al., 2022, Sanh et al., 2022, inter alia.] and model size [Wei et al., 2022a, Lampinen et al., 2022]. To improve reliability, significant effort is dedicated towards designing a painstakingly perfect prompt. For instance, Mishra et al. [2021] and Wu et al. [2022] recommend that users manually explore large search-spaces of strategies to tune their prompts on a task-by-task basis.

Figure 1: AMA first recursively uses the LLM to reformat tasks and prompts to effective formats, and second aggregates the predictions across prompts using weak supervision. The reformatting is performed using prompt-chains, which consist of functional (fixed, reusable) prompts that operate over the varied task inputs. Here, given the input example, the prompt-chain includes a question()-prompt, through which the LLM converts the input claim to a question, and an answer()-prompt, through which the LLM answers the question it generated. Different prompt-chains (i.e., differing in the in-context question and answer demonstrations) lead to different predictions for the input’s true label.

This work instead considers aggregating the predictions of multiple effective, yet imperfect, prompts to improve prompting performance over a broad set of models and tasks. Given a task input, each prompt produces a vote for the input’s true label, and these votes are aggregated to produce a final prediction. In pursuit of high quality prompting via aggregation, we face the following challenges:

  1. Effective prompts: High quality prompts are a precursor to improvements from aggregation. We take the original prompts which yield near-random performance in Brown et al. [2020] for two SuperGLUE tasks (CB, RTE). Generating multiple prompts in the same format and taking majority vote prediction across prompts has a minor effect (+4% for CB) and can even hurt performance versus the average prompt performance (-2% for RTE). Many proposals for improved prompts focus on a single task type and evaluate on a single model-family and/or size [Wei et al., 2022a, Jung et al., 2022]. We need a structure for prompting that works across tasks and models.
  2. Scalable collection: After identifying effective prompt formats, we need to obtain multiple prompts in these formats — these prompts will be used to collect votes for an input’s true label. The original format of a task varies widely and prior works manually rewrite input examples to new formats in a task-specific manner [Mishra et al., 2021, Wu et al., 2022], which is challenging to scale. We need a scalable strategy for reformatting task inputs.
  3. Prompt aggregation: Using the prompts above (for CB and RTE), we see 9.5% average variation in accuracy and that the Jaccard index over errors is 69% higher than if prompt errors were i.i.d. Majority vote (MV) is the primary unsupervised aggregation strategy in prior prompting work [Jiang et al., 2020, Schick and Schütze, 2021], but it does not account for either property, making it unreliable. We need a strategy that accounts for the varying accuracies and dependencies.

In this work, we propose ASK ME ANYTHING PROMPTING (AMA), a simple approach that surprisingly enables open-source LLMs with 30x fewer parameters to exceed the few-shot performance of GPT3-175B. In AMA:

  1. We identify properties of prompts that improve effectiveness across tasks, model types, and model sizes. We study standard prompt-formats categorized by prior work [Brown et al., 2020] and find prompts that encourage open-ended answers (“Where did John go?”) to be more effective than prompts that restrict the model output to particular tokens (e.g. “John went to the park. Output True or False?”). For instance, converting three SuperGLUE tasks (CB, RTE, WSC) from the original restrictive formats in [Brown et al., 2020] to open-ended formats provides a 72% performance improvement (Section 3.2). Given a task input, we find that a simple structure of (1) forming questions based on the input and (2) prompting the LLM to answer the questions applies quite generally and improves performance across diverse benchmark tasks.
  2. We propose a strategy for scalably reformatting task inputs to the effective formats found in (1). We propose to transform task inputs to the effective open-ended question-answering format by recursively using the LLM itself in a fixed two-step pipeline: we first use question()-prompts, which contain task-agnostic examples of how to transform statements into various (e.g., yes-no, cloze) questions, and second use answer()-prompts, which demonstrate ways of answering questions (e.g., concise or lengthy answers). Applying a prompt-chain—answer(question(x))—gives a final prediction for the input x. Chains are reused across inputs, and different pairs of functional prompts can be combined to create variety. We apply the varying functional prompt-chains to an input to collect multiple votes for the input’s true label.
  3. We propose the use of weak supervision (WS) to reliably aggregate predictions. We find that the errors produced by the predictions of different chains can be highly varying and correlated. While majority vote (MV) may do well on certain sets of prompts, it performs poorly in the above cases. AMA accounts for these cases by identifying dependencies among prompts and using WS, a procedure for modeling and combining noisy predictions without any labeled data [Ratner et al., 2017, Varma et al., 2019]. We apply WS to prompting broadly for the first time in this work, showing it improves the reliability of prompting with off-the-shelf LLMs and no further training. We find that AMA can achieve up to 8.7 points of lift over MV and that on 9 tasks, it recovers dependencies among prompts to boost performance by up to 9.6 points.

We apply our proposed prompt-aggregation strategy, AMA, to 20 popular language benchmarks and 14 open-source LLMs from 4 model families (EleutherAI [Black et al., 2021, Wang and Komatsuzaki, 2021, EleutherAI], BLOOM [BigScience, 2022], OPT [Zhang et al., 2022], and T0 [Sanh et al., 2022]) spanning 3 orders-of-magnitude (125M-175B parameters). Our proof-of-concept provides an improvement over the few-shot (k = 3) baseline by an average of 10.2% ± 6.1% absolute (21.4% ± 11.2% relative) lift across models. We find the largest gains are on tasks where the knowledge required to complete the task is found in the provided context, and comparatively less on closed-book tasks (e.g., factual recall). Most excitingly, ASK ME ANYTHING PROMPTING enables an open-source LLM, which is furthermore 30x smaller, to match or exceed the challenging GPT3-175B few-shot baseline results in Brown et al. [2020] on 15 of 20 benchmarks. We hope AMA and future work help address pain points of using LLMs [Arora and Ré, 2022, Narayan et al., 2022] by improving the ability to proceed with less-than-perfect prompts and enabling the use of small, private, and open-source LLMs.

2 Related Work

Several existing works study how to improve the zero-to-few-shot task-transfer abilities of LLMs.

Training based strategies Prior works have improved prompting performance by training larger models over more or curated data, and for longer [Kaplan et al., 2020, Chowdhery et al., 2022] — or by explicitly fine-tuning LMs over prompts [Wang et al., 2022a, Wei et al., 2022b, Sanh et al., 2022, Ouyang et al., 2022]. We complementarily aim to improve the prompting performance of off-the-shelf language models with no additional fine-tuning.

Prompt-engineering Prompt-engineering is the process of designing natural language specifications of a task, which are used to condition the LLM at inference time. Prior work finds that the prompt format changes the model behavior and proposes particular formats. Some formats are designed for or evaluated on a narrow task type, model type, or model size [Wei et al., 2022a, Jung et al., 2022]. Others require users to manually rewrite task inputs to the prescribed formats on an example-by-example basis in a task-specific manner [Mishra et al., 2021, Patel et al., 2022, Wu et al., 2022]. Our recursive use of the LLM is similar to Jung et al. [2022], which focuses on commonsense reasoning. We draw inspiration from and share similar ideas with these lines of work.

Complementary work investigates how to simplify complex tasks (e.g., multi-hop reasoning), to achieve better performance in the prompting paradigm. Creswell et al. [2022], Wu et al. [2022] explicitly decompose the complex tasks into steps, which are each handled in a separate inference-pass. However, these methods draw a distinction between explicitly compositional tasks which can be naturally decomposed into multiple steps and “single-step” language tasks. These prior works do not support the single-step tasks, which are the focus of our work.

Prompt sensitivity Prior works note the sensitivity of prompting under slight modifications and propose strategies to improve the performance of single prompts [Zhao et al., 2021, Liu et al., 2021]. Complementing this, we manage the noise by aggregating over multiple prompt outputs. Prompt aggregation has been applied in several prior works. Many works train models to perform the aggregation and/or to achieve strong results with small LMs [Jiang et al., 2020, Schick and Schütze, 2021, Cobbe et al., 2021, Zelikman et al., 2022, inter alia.]. Self-Consistency Wang et al. [2022b], which requires no training, does not report improvements for small LMs (<10B parameters). We also compare AMA to Self-Consistency in Appendix B. The unsupervised aggregation strategy used in prior works is Majority Vote— we are the first to use Weak Supervision for unsupervised prompt aggregation.

Weak supervision (WS) WS is a powerful framework that learns the accuracies and correlations of multiple noisy sources and aggregates them to produce weak labels for training data [Ratner et al., 2016, 2017, 2018, Varma et al., 2019, Fu et al., 2020]. WS has been applied to prompting in the context of distilling a LLM by aggregating the outputs of hand-curated prompts into a labeled dataset and training a smaller model on it [Smith et al., 2022]. In contrast, we aim to use aggregation to improve out-of-the-box LLM performance reliably.

3 ASK ME ANYTHING PROMPTING

We propose ASK ME ANYTHING PROMPTING (AMA), a prompting approach that uses multiple imperfect prompts— rather than one painstakingly crafted perfect prompt—and reliably aggregates their outputs. We describe and motivate AMA’s prompt format (Section 3.2), how AMA scalably produces collections of prompts (Section 3.3), and AMA’s aggregation method (Section 3.4).

3.1 Preliminaries

We consider supervised tasks, (X, Y), where x ∈ X is the example and y ∈ Y is the output. We have an unlabeled dataset D = {xi} of n examples for which we wish to predict each yi. We apply LLMs to this task by using a prompt—a natural language prefix that demonstrates how to complete a task. A prompt consists of a prompt template, with placeholders for (1) zero or more in-context task demonstrations and (2) the inference example x, as shown in Figure 3. Given a prompt p, we use p : X → Y to refer to the output of the prompted LLM, which produces a prediction yˆ = p(x). Specifically, the LLM runs inference on p with x substituted for the placeholder in the template.

We denote a collection of m prompts as P = [p1, p2, …, pm]. Given input D, we (1) apply a collection P to each x ∈ D and (2) aggregate their predictions, denoted as P(x) = [p1(x), . . . , pm(x)], using an aggregator function φ : Ym → Y to produce outputs yˆ on each x. We can thus express the procedure via two key components we aim to understand, the prompts P and aggregator φ.
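To make the notation concrete, here is a minimal Python sketch of the abstraction above, assuming a hypothetical llm(text) completion function; the names are illustrative and are not taken from the released AMA code.

```python
# Minimal sketch of the P / aggregator abstraction (Section 3.1).
# `llm(prompt_text) -> str` is an assumed, hypothetical completion function.
from collections import Counter
from typing import Callable, List

Prompt = Callable[[str], str]  # p : X -> Y, realized by prompting the LLM

def make_prompt(template: str, llm: Callable[[str], str]) -> Prompt:
    """Bind a prompt template (with an {x} placeholder) to an LLM call."""
    def p(x: str) -> str:
        return llm(template.format(x=x)).strip()
    return p

def apply_prompts(P: List[Prompt], x: str) -> List[str]:
    """P(x) = [p_1(x), ..., p_m(x)]: collect one noisy vote per prompt."""
    return [p(x) for p in P]

def majority_vote(votes: List[str]) -> str:
    """The simplest aggregator phi : Y^m -> Y; AMA replaces this with weak supervision."""
    return Counter(votes).most_common(1)[0][0]
```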

Running examples For the motivating observations in the rest of this section, we use three SuperGLUE [Wang et al., 2019] tasks—CommitmentBank (CB), Recognizing Textual Entailment (RTE), and Winograd Schema Challenge (WSC)—and the DBPedia and AGNews classification tasks [Zhang et al., 2015]. We evaluate over the GPT-J-6B model [Wang and Komatsuzaki, 2021]. CB and RTE require determining the validity of a statement given some context (as in Figure 1), WSC requires outputting the subject corresponding to a given pronoun, and DBPedia and AGNews contain 14 and 4 classes respectively. We use as a running example: determine if the statement “John went to the park” is valid, given the context “John invited Mark to watch Jurassic Park with his family at the theater”.

Simple baseline To provide some intuition on the challenges around effectively designing the two levers, P and aggregator φ, we start with a naïve baseline with off-the-shelf prompts and the unsupervised majority vote prompt aggregation strategy used in prior work [Jiang et al., 2020, Schick and Schütze, 2021]. We take the prompts proposed in [Brown et al., 2020] for GPT-3 and produce P with five prompts for each task by using different sets of in-context examples. Comparing majority vote (MV), the unsupervised aggregation strategy used in prior work, to the average performance of the prompts, MV gives 39.3% (+2.2%) for CB and 54.5% (-2%) for RTE. The delta from aggregating is minor and in the worst case, harmful. Ideally, we would expect that aggregation should lead to improvement by reducing noise, but we find that performance here is only comparable to the single prompt baseline.

3.2 Effective Prompt Formats

First, we explore what makes an effective prompt format towards improving the quality of P(x).

Standard prompt formats We ground our analysis in three standard categories of prompts used in prior work, including Brown et al. [2020] and Sanh et al. [2022, inter alia]: (1) questions that restrict the model output to particular tokens (“John invited Mark to come watch Jurassic Park. Output True or False?”); (2) cloze-questions, which ask the model to fill in the remaining text (“John invited Mark to come watch Jurassic _”, using the LLM to fill in the blank, “Park”); and (3) traditional (yes-no, Wh) free-form questions (“Where did John invite Mark?”). We compare these three prompting formats and make the following observations:

  1. Open-ended prompts appear to outperform restrictive-prompts. We first group the results in Brown et al. [2020] based on the format used for the task, along the above categorizations (see Figure 2). When scaling from GPT3-6.7B to GPT3-175B, we find that the relative gain is far lower on open-ended (cloze and traditional QA) formats vs. restricted formats.
    Next, CB, RTE, and WSC are originally formatted with restrictive prompts in Brown et al. [2020], and we form copies of the tasks in the open-ended question (cloze and free-form QA) formats. This improves the performance of the small model on average from 41.7% to 71.5% (+72%).
Figure 2: Relative lift with model scale using results and prompt-styles reported in Brown et al. [2020] (Left). Ablation of the prompt-style using the GPT-J-6B model; we include calibration results from Zhao et al. [2021], and “-” indicates the method cannot be applied to the task (Right).

Intuitively, the task of answering open-ended questions is aligned with the next-token prediction language modeling objective. We observe that more precise questions give larger lifts. For WSC, the restrictive prompt form is: “The pronoun ‘his’ refers to ‘Mark’ in the context. True or False?”, given the context “Mark went to the park with his dog.”. Reformatting to “What does ‘his’ refer to?” and evaluating whether the answer is “Mark” provides 38% lift (69.2% accuracy). Further extracting the portion of the context that mentions the pronoun (“his dog”) and reformatting to the more precise question “Whose dog?” gives 49.4% lift (74.7% accuracy).

  2. The use of open-ended questions over restrictive prompts can increase the difficulty of mapping open-ended answers to valid output classes. For tasks with output spaces that are likely observed during pretraining (yes-no questions, sentiment classification), we see that the LLM naturally generates valid yˆ ∈ Y. For tasks with specialized output classes (e.g., multi-class classification), we need to map the answer to the open-ended question (e.g., “What is the document about?”) to a valid output class. For example, given “Personality and Mental Health ... is a quarterly peer-reviewed academic journal published by ...”, we observe that the LLM typically outputs semantically correct summaries of the document topic, e.g., “journal”. We find that inserting a step for the LLM to map the open-ended output “journal” to a valid category via the prompt “A ‘journal’ maps to category: written work” enables a 33.3% and 11.1% lift over the few-shot baseline on DBPedia (14-way classification) and AGNews (4-way) respectively.
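As a rough illustration of this mapping step, the sketch below appends a short mapping prompt after the open-ended answer. The llm function, the prompt wording, and the fallback matching are assumptions for illustration (the class names are the standard DBPedia-14 labels), not the exact prompts used in the paper.

```python
# Illustrative sketch of mapping an open-ended answer to a fixed label set.
# The `llm` function and the exact prompt wording are assumptions.
DBPEDIA_CLASSES = ["company", "educational institution", "artist", "athlete",
                   "office holder", "mean of transportation", "building",
                   "natural place", "village", "animal", "plant", "album",
                   "film", "written work"]

def map_to_class(open_ended_answer: str, classes, llm) -> str:
    # e.g., open_ended_answer = "journal"  ->  "written work"
    prompt = (
        "Map the phrase to one of the categories.\n"
        f"Categories: {', '.join(classes)}\n"
        f"A '{open_ended_answer}' maps to category:"
    )
    prediction = llm(prompt).strip().lower()
    # Fall back to the category with the most word overlap if the output is not exact.
    if prediction in classes:
        return prediction
    return max(classes,
               key=lambda c: len(set(c.split()) & set(prediction.split())))
```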

Why is the QA prompt format effective? We analyze the LM pretraining corpus to better understand why the proposed QA prompt template may be effective. The EleutherAI models are trained on The Pile corpus Black et al. [2021], Wang and Komatsuzaki [2021], Gao et al. [2021]. Over a 2% random sample of the ∼200B token Pile data, we find that open-ended QA structures (i.e., which ask the model “Is . . . ?”, “Who . . . ?”) appear on the order of 1000× more frequently than the restrictive-prompt structures (i.e., which instruct the model to output “True or False”, “Yes or No”). The prompt structures and frequencies are in Table 8.
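The corpus statistics above can be reproduced in spirit with simple pattern counting over a text sample; the patterns below are illustrative stand-ins, not the exact structures listed in Table 8.

```python
# Rough sketch of the corpus analysis: tally open-ended question openers vs.
# restrictive instruction patterns over a sample of text lines.
import re

OPEN_ENDED = [r"\bIs\b[^.?!]*\?", r"\bWho\b[^.?!]*\?", r"\bWhat\b[^.?!]*\?"]
RESTRICTIVE = [r"True or False", r"Yes or No"]

def count_patterns(lines, patterns):
    return sum(1 for line in lines
               for pat in patterns if re.search(pat, line))

# usage (assuming `corpus_sample` is an iterable of text lines from a Pile shard):
# open_count = count_patterns(corpus_sample, OPEN_ENDED)
# restrictive_count = count_patterns(corpus_sample, RESTRICTIVE)
```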

When applying the few-shot restrictive prompts, we observe large imbalances in the F1-scores for different classes (Table 10). We therefore ask whether answering the restrictive prompts is challenging due to biases acquired during pretraining. We find that in the Pile there are large imbalances between the frequencies of “yes” vs. “no” and “True” vs. “False”, for instance, which may instill these biases and contribute to the low quality of restrictive prompts. Detailed results of the Pile analysis are in Appendix F.

AMA’s prompt format Motivated by our observations about the effectiveness of QA prompt structures, we proceed in AMA with a two-step prompting pipeline: (1) generating questions based on the input and (2) prompting the LLM to answer the generated questions. These prompts are effective, and to further improve performance we next turn to generating and aggregating over multiple prompt-outputs for each input. For intuition, different questions (with our running example: “Who went to the park?”, “Did John go to the park?”, “Where did John go?”) emphasize different aspects of the input and can provide complementary information towards reasoning about the answer. Manually generating multiple prompts per input is challenging, and so we study how to do this at scale in the following section.

3.3 Creating Prompt Collections at Scale

Our goal is to produce a collection of prompts, P, that can be applied to tasks at scale. To produce prompts in the effective open-ended question-answering format, our insight is to recursively apply the LLM itself using a chain of functional prompts, referred to as a prompt()-chain. We describe these prompts as functional because they apply a task-agnostic operation to all inputs in the tasks, without any example-level customization. We describe the two functional prompts used in AMA below. We use Figure 1 as a running example to explain each type.
(a) question(): x → q generates a question q (such as “Did John go to the park?”) from an input x (“John went to the park.”). question() prompts simply contain demonstrations of how a statement can be transformed to an open-ended question.

Figure 3: Example prompt with the in-context demonstrations and placeholder (Left) with two different prompt variations (Right) created by changing demonstrations and question style.

(b) answer(): q → a applies the question generated by (a) to the context of x to produce intermediate answers a (such as “No” or “theater”). The answer() prompts contain demonstrations of how to answer a question (optionally) given some input context.

To create P for aggregation, AMA constructs different prompt()-chains, where each unique prompt()-chain is a different view of the task and can emphasize different aspects of x. Inspired by Sanh et al. [2022] and Liu et al. [2021], we vary chains through two key levers—the in-context demonstrations and the style of prompt questions—as shown in Figure 3. To vary the style of open-ended prompt questions, we construct question() and answer() prompts that produce and answer Yes/No, Wh-, multiple-choice, or cloze questions.
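A minimal sketch of one such prompt()-chain is shown below, assuming a generic llm(text) completion function; the in-context demonstrations are illustrative placeholders rather than the released AMA prompts.

```python
# Minimal sketch of one answer(question(x)) prompt()-chain. The demonstrations
# and the `llm(text) -> str` function are illustrative assumptions.
QUESTION_PROMPT = """Turn the statement into a yes/no question.

Statement: Jack camped with Mark.
Question: Did Jack camp with Mark?

Statement: {statement}
Question:"""

ANSWER_PROMPT = """Answer the question given the context.

Context: The dog chased the ball into the yard.
Question: Where did the ball go?
Answer: the yard

Context: {context}
Question: {question}
Answer:"""

def question(statement: str, llm) -> str:
    return llm(QUESTION_PROMPT.format(statement=statement)).strip()

def answer(q: str, context: str, llm) -> str:
    return llm(ANSWER_PROMPT.format(context=context, question=q)).strip()

def prompt_chain(claim: str, context: str, llm) -> str:
    """answer(question(x)): one noisy vote for the input's label."""
    return answer(question(claim), context, llm)
```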

3.4 Prompt Aggregation

To aggregate the prompt predictions P(x) into outputs yˆ reliably, we apply tools from weak supervision, a powerful approach for learning high-quality models from weaker sources of signal without labeled data [Ratner et al., 2017]. We first describe properties of P(x) that illustrate when the simple baseline of majority vote may perform poorly. We then describe our aggregator φWS, which explicitly identifies and then accounts for these properties.

Baseline observations To understand how to aggregate P(x), we present a set of observations on CB, RTE, and WSC. For each task, we compare two baselines for constructing P: (1) PT: varying the prompt template with no overlap in the in-context examples, and (2) PE: varying the in-context examples for a fixed prompt template, each with |P| = 5. We observe the following properties of P:

  1. Varied overall accuracies: While prompts in PE may seem more similar than those in PT, the gap between the best and worst pi ∈ P is large in both cases — 12.1% for PE and 9.6% for PT.
  2. Varied class-conditional accuracies [Zhao et al., 2021]: Beyond overall prompt accuracy, the average variance of class-conditional prompt accuracies is 9.7% across the tasks and baselines.
  3. Highly-correlated outputs: Prompt predictions have dependencies among each other. The Jaccard index over error sets averaged across tasks is 42.2 for PE and 39.9 for PT. For reference, two prompts that produce i.i.d. errors and have 60% accuracy each would have a score of 25.0 on the same scale.
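For concreteness, the error-overlap statistic can be computed as in the small sketch below (reported on a 0-100 scale to match the numbers above).

```python
# Jaccard index between the error sets of two prompts, on a 0-100 scale.
def jaccard_over_errors(preds_a, preds_b, labels) -> float:
    errors_a = {i for i, (p, y) in enumerate(zip(preds_a, labels)) if p != y}
    errors_b = {i for i, (p, y) in enumerate(zip(preds_b, labels)) if p != y}
    union = errors_a | errors_b
    return 100.0 * len(errors_a & errors_b) / len(union) if union else 0.0
```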

The three observations present challenges in aggregating predictions via simple approaches like MV. MV tends to do better than using one prompt, but it weights all prompts equally and treats them independently. Such an aggregation method may be sufficient over certain collections of prompts but is not reliable across general P that may exhibit the three properties we have observed.

AMA Aggregation Given the varied accuracies and dependencies among prompt()-chains, in AMA we draw on recent work in weak supervision [Ratner et al., 2017], which is able to account for the accuracy and dependency properties without relying on labeled data. We learn a probabilistic graphical model Pr_{G,θ}(y, P(x)) and define the aggregator as φ_WS(x) = argmax_{y∈Y} Pr_{G,θ}(y | P(x)). Here G = (V, E) is a dependency graph with V = {y, P(x)} and edgeset E, where (pi(x), pj(x)) ∈ E iff pi(x) and pj(x) are not conditionally independent given y, and θ are the accuracy parameters for P(x). Since we lack labeled data y, we cannot estimate G or θ directly from D, so our procedure is as follows:

  1. We use the structure learning approach from Varma et al. [2019] to recover the dependency structure Gˆ using P(x) applied to D.
  2. We use Gˆ, D, and P(x) to learn the accuracies θˆ of the prompts P, using the method of Ratner et al. [2018].
  3. We compute Pr_{Gˆ,θˆ}(y | P(x)) and aggregate our predictions.
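As a rough stand-in for steps 2-3, the sketch below aggregates prompt votes with the open-source Snorkel LabelModel, which implements the accuracy-estimation approach of Ratner et al. [2018]; it does not perform the structure-learning step of Varma et al. [2019] and is not the paper's released implementation.

```python
# Sketch: aggregate prompt votes with a weak-supervision label model instead of
# majority vote. Snorkel's LabelModel is used here as an illustrative stand-in.
import numpy as np
from snorkel.labeling.model import LabelModel

# L: (n_examples x m_prompts) matrix of integer votes, with -1 for abstentions.
L = np.array([[1,  1, 0],
              [0,  0, 0],
              [1, -1, 1],
              [0,  1, 0]])

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=500, seed=0)  # no ground-truth labels needed
preds = label_model.predict(L)                    # aggregated predictions y_hat
```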
Figure 4: The top plots are for EleutherAI models of sizes ∈ {125M, 1.3B, 6B, 20B} and the bottom plots are for BLOOM models of sizes ∈ {560M, 1.7B, 7.1B, 175B}. The left plots show the conditional entropy metric H(y|yˆ) as a function of model size. Lines represent different prompts p with k = {0, 2, 4, 8} in-context examples and AMA prompt-chains without aggregation. The right plots show the conditional entropy as we aggregate predictions over an increasing number of AMA prompt-chains, with both the majority vote (MV) and weak supervision (WS) aggregation strategies for the GPT-J-6B and BLOOM 7.1B models. All plots are over RTE and each k-shot point is the average of 4 seeds.

The key insight is that the inverse covariance matrix on V, Σ⁻¹, is graph-structured, meaning that the entry (Σ⁻¹)ij = 0 iff pi(x) and pj(x) are conditionally independent. This property yields systems of equations on V from which we can recover dependencies and accuracies, without any training. WS hence improves the reliability of aggregation.
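The toy simulation below (an illustrative sketch, not the paper's estimator) makes this concrete: votes that are conditionally independent given y yield a near-zero precision-matrix entry, while a vote that copies another does not.

```python
# Toy illustration of the graph-structured inverse covariance.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
y = rng.choice([-1, 1], size=n)

def noisy_vote(y, acc):
    # Flip y with probability 1 - acc.
    flip = rng.random(n) > acc
    return np.where(flip, -y, y)

p1 = noisy_vote(y, 0.75)
p2 = noisy_vote(y, 0.65)                      # independent of p1 given y
p3 = np.where(rng.random(n) > 0.9, -p1, p1)   # strongly dependent on p1

V = np.stack([y, p1, p2, p3], axis=1).astype(float)
precision = np.linalg.inv(np.cov(V, rowvar=False))
print(np.round(precision, 2))  # entry (p1, p2) is ~0; entry (p1, p3) is clearly nonzero
```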

4 Information Flow in AMA

Before evaluating end-to-end quality, we look at a simple information theoretic metric to understand the contributions of the individual components — P and φ — in the prompting procedure.
Information flow metric Specifically, we examine the conditional entropy, H(y|yˆ), which measures the amount of uncertainty remaining in the true label y given a prediction yˆ. Intuitively, H(y|yˆ) will be low when yˆ encodes information relevant to y. In our setting, yˆ = φ(P(x)) depends on the two components of the prompting procedure, the prompts P and the aggregator φ. The following simple decomposition of H(y|yˆ) enables studying the contribution of each component:

H(y|yˆ) = H(y | P(x)) + ( H(y|yˆ) − H(y | P(x)) ).   (1)

Through the first term H(y | P(x)), H(y|yˆ) depends on the quality and quantity of the individual prompts in P(x) (since H(y | P(x)) ≤ H(y | p(x)) for any p ∈ P). A set of prompts that contains relevant information for y contributes to a low H(y|yˆ). The second term, H(y|yˆ) − H(y | P(x)), shows that H(y|yˆ) also depends on how the aggregation step compresses the information in P(x) to predict yˆ. An aggregator φ that more accurately matches the true Pr(y | P(x)) reduces the information loss in this compression step.
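As a quick empirical illustration (a sketch, not the paper's evaluation code), H(y|yˆ) can be estimated from paired labels and predictions via the identity H(y|yˆ) = H(y, yˆ) − H(yˆ).

```python
# Estimate H(y | y_hat) in bits from paired labels and predictions using the
# empirical joint distribution: H(y | y_hat) = H(y, y_hat) - H(y_hat).
from collections import Counter
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def conditional_entropy(y_true, y_pred) -> float:
    joint = Counter(zip(y_true, y_pred))
    marginal_pred = Counter(y_pred)
    return entropy(joint.values()) - entropy(marginal_pred.values())

# usage: conditional_entropy(["yes", "no", "yes"], ["yes", "no", "no"])
```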

Evaluation We use (1) to evaluate our proposed solution AMA both empirically and theoretically. First considering H(y| P(x)), in Figure 4 (Left) we observe AMA outperforms k-shot baselines with expected scaling in terms of both individual prompt()-chain quality (as shown by AMA No Agg) and their quantity.

Next we consider the gap term H(y|yˆ) − H(y| P(x)). It enables us to understand why MV is insufficient: it compresses information from P(x) according to a specific construction of Pr(y, P(x)), for which pi(x) ⊥ pj(x) | y for all i, j ∈ [m], and Pr(pi(x) = c | y = c) for c ∈ Y is a single better-than-random constant across i and c. When the true distribution is vastly different—as is common—this misspecification results in a large gap between the optimal H(y| P(x)) and H(y|yˆMV) in Figure 4 (Right). Weak supervision can improve φ over the standard MV baseline to reduce the information loss H(y|yˆAMA) − H(y| P(x)).

In addition to empirical measurements, we can provide a theoretical characterization for the information flow. In Appendix D, we express H(y|yˆAMA) in terms of the individual prompt accuracies under the standard weak supervision model (i.e., Ising model on y and P(x) [Ratner et al., 2018]).
There has been recent interest in how LLMs improve primarily along the three axes of parameter scale, training data, and compute [Kaplan et al., 2020, Hoffmann et al., 2022, Wei et al., 2022c]. In Figure 4, as we increase the number of prompts to be aggregated, the conditional entropy reduces. Prompt aggregation may be another useful axis for understanding LLM scaling performance.

5 Results

We evaluate ASK ME ANYTHING PROMPTING on 20 popular language benchmarks used in Brown et al. [2020], Sanh et al. [2022]. We report results across 14 unique LLMs including 4 model families (EleutherAI [Black et al., 2021, Wang and Komatsuzaki, 2021], OPT [Zhang et al., 2022], BLOOM, and T0 [Sanh et al., 2022]) spanning 3 orders-of-magnitude in size (125M-175B). We aim to validate whether AMA provides consistent lift across diverse tasks (Section 5.1), works across model families (Section 5.2), and reliably aggregates the predictions across prompts (Section 5.3).

Experimental details We use a diverse set of tasks: SuperGLUE [Wang et al., 2019], NLI [Mostafazadeh et al., 2017, Nie et al., 2020], classification [Zhang et al., 2015, Socher et al., 2013, He and McAuley, 2016], and QA tasks [Kasai et al., 2022, Kwiatkowski et al., 2019, Berant et al., 2013, Dua et al., 2019]. For all tasks, we compare to published results of the OpenAI few-shot-prompted GPT3-175B parameter model using the numbers reported in Brown et al. [2020] and, for classification tasks, Zhao et al. [2021]. Brown et al. [2020] uses k ∈ [32..70] depending on the task and Zhao et al. [2021] uses k ∈ [1..8], providing a challenging baseline for comparison.

For AMA we use 3-6 prompt()-chains to generate predictions per input. We model the correlations between prompt predictions per task, without using any labeled training data, to obtain the final prediction per example via weak supervision (WS). We report both the average performance over the prompt()-chains (QA) and with AMA’s WS aggregation (QA + WS). We report QA + WS across 5 random seeds for the model. Model details and prompt()-chains are in the Appendix.

5.1 Main Results

We report benchmark results in Table 1 comparing the open-source GPT-J-6B and few-shot (k ∈ [32..70]) GPT3-175B. We find that the open-source 6B parameter model exceeds the average few-shot performance of the GPT3-175B model on 15 of 20 benchmarks. Over the 20 tasks, AMA gives an average improvement of 41% over the 6B parameter model’s few-shot (k = 3) performance to achieve this.
We find that AMA provides the most lift on tasks where all requisite knowledge is included in the task input (e.g., reading comprehension) and which largely rely on the model’s natural language understanding (NLU) abilities. The lift is lower on tasks that rely on the LLM’s memorized knowledge (e.g., commonsense, closed-book tasks). AMA can help close the gap on knowledge-intensive tasks. The closed-book WebQ task includes simple questions, where the answers are likely seen during pretraining. We find it effective to use an open-ended prompt that asks the LM to generate relevant context and to then prompt the model to answer the original question using the generated context. However, there are limitations, as seen on NQ.

We similarly see limitations when tasks cannot rely on this latent knowledge. We observe a small performance gap between model sizes on RealTimeQA, which includes questions that have temporally changing answers that are less likely to be memorized. Similarly, for tasks requiring domain knowledge, e.g., the “Amazon Instant Video” class in the Amazon task, all model sizes achieve near-0 performance. In such cases, information retrieval may help close the gap. The flexible LLM interface permits asking and answering questions over diverse knowledge sources such as databases or a search engine [Nakano et al., 2021]. We provide an extended error analysis of the Table 1 results in Appendix G.

Table 1: AMA results for the GPT-J-6B parameter model [Black et al., 2021] compared to the few-shot GPT3-175B. The GPT3-175B numbers are as reported in Brown et al. [2020] and Zhao et al. [2021], where the number of in-context examples is in parentheses. Note that prompts can abstain from predicting, which can lead to lower average numbers for QA, including on COPA and StoryCloze. For the question-answering tasks and ReCoRD, we report the majority vote aggregation, as using WS is complex with the open-ended output space. The same results for the BLOOM-7.1B parameter model are in Table 3 (Appendix B).

5.2 Evaluation across Models

Benchmark results We evaluate the lift from AMA over out-of-the-box few-shot performance across different sizes of four open-source LM families (EleutherAI, OPT, BLOOM, and T0) on 7 tasks (4 NLU, 2 NLI, 1 classification). In this analysis, we want to understand the effectiveness of AMA’s prompt()-chain reformattings across models, and we report the average prompt performance over the 3-6 prompt()-chains used per task. EleutherAI, OPT, and BLOOM are GPT-style models, while T0 is obtained by explicitly fine-tuning a T5 LM [Raffel et al., 2019] on prompt-input-output tuples.

Excitingly, the AMA prompt()-chains apply quite generally. We see a 10.2% ± 6.1% absolute (21.4% ± 11.2% relative) lift on average across models and tasks (see Figure 5a). We observe that the absolute lift increases with model size and then levels out; however, we note that there are few models per size grouping. The average absolute (relative) lift by model family (across tasks and sizes) is 11.0% (24.4%) for EleutherAI, 11.0% (23.4%) for BLOOM, 11.9% (22.7%) for OPT, and 2.9% (8.3%) for T0. We hypothesize the lower lift on T0 arises because the model was fine-tuned on zero-shot prompts, which may compromise its in-context learning abilities.

Diagnostics for understanding AMA lift To further understand why models see different degrees of lift, we create a set of diagnostic tasks that correspond to the steps in prompt()-chains. The diagnostics measure four basic operations—question generation, answer generation, answer selection, and extraction. For each operation, we create 1-3 tasks with 50 manually labeled samples per task. See Appendix E for task details.
We measure the average performance on each operation across different sizes of models in the four families (EleutherAI, OPT, BLOOM, and T0). We group models and sizes into four buckets: T0 (3B parameters) and GPT models (<1B, 1B, and 6-7B parameters). Figure 5b shows results where the buckets are ordered by their average AMA lift across the 7 tasks from Section 5.2, meaning T0 (3B) sees the least lift while 6-7B GPT models realize the most lift. We find that, overall, models with higher performance across the four operations see more lift with AMA. T0 performs poorly on the generative tasks, indicating the importance of text and question generation for AMA.

5.3 Evaluation against other aggregation methods

We compare our WS aggregation approach with the standard unsupervised approach, majority vote (MV), on prompt()-chains. We find that AMA can achieve up to 8.7 points of lift over MV and does not do worse than MV on 16 out of 20 tasks.

Figure 5: Evaluation across model sizes for diagnostics and benchmarks. We report the absolute lift from AMA over few-shot (k = 3) performance, averaged over 7 tasks with 95% confidence intervals (Left). Diagnostic plots are ordered by the amount of lift that models of the size category see on the 7 benchmarks (Right).

On the remaining 4 tasks, we perform worse than MV by at most 1.0 points. We also examine the effect of modeling dependencies in WS. We find that on 9 tasks our approach recovers dependencies in the data (rather than assuming a conditionally independent P(x)), which improves performance by up to 9.6 points and an average of 2.2 points. We provide more details and evaluation against labeled-data baselines in Table 5 (Appendix B.3).

Table 2: Performance of T0 as reported in Sanh et al. [2022] compared to majority vote (MV) and weak supervision (WS) over 10 different prompt formats from PromptSource. When using the PromptSource prompts, the average lift across tasks is 3.6 points for MV and 6.1 points for WS.

Next, we evaluate T0 on zero-shot prompts from the public PromptSource [Bach et al., 2022], which are better aligned with how this model has been trained. Specifically, we take 10 unique PromptSource prompts for each of CB, WIC, WSC, and RTE, and find that aggregating with MV yields an average lift of 3.7 accuracy points while aggregating with WS gives an average lift of 6.1 accuracy points (see Table 2).

6 Conclusion


In this work, we introduce ASK ME ANYTHING PROMPTING, which (1) scalably obtains multiple prompts given a task input and (2) combines the intermediate answers to these prompts using weak supervision to give the final prediction. The steps in AMA stem from our observations on the effectiveness of open-ended questions over restrictive prompts, and on the ability to model the varying accuracies and dependencies across a collection of prompts using weak supervision. Overall, AMA provides lift across four language model families and across model sizes ranging from 125M to 175B parameters. Most excitingly, we find that AMA enables a 30x smaller LM to exceed the performance of few-shot GPT3-175B averaged across 20 popular language benchmarks. Several LM applications involve private data or require operating over large amounts of data — for these applications, using APIs to access closed-source models or hosting large models locally is challenging. We hope the strategies in AMA and subsequent work help enable such applications.

7 Reproducibility Statement

We release prompts and code for reproducing all benchmark results for few-shot and AMA prompting, and our diagnostic evaluation splits here: https://github.com/HazyResearch/ama_prompting.

8 Ethics Statement

We intend for AMA to aid practitioners in their exploration and use of LLMs—especially smaller, open-source LLMs. However, we recognize that AMA could be used to perform harmful or unethical tasks. AMA is a proof-of-concept; it has error-modes and we recognize the inherent risks to using LLMs. Detailed discussions of these risks are in Bommasani et al. [2021], Weidinger et al. [2021].

Acknowledgements

The computation required in this work was provided by Together Computer (https://together.xyz/). We are grateful to the Numbers Station (https://numbersstation.ai/), Snorkel (https://snorkel.ai/), Stanford Center for Research on Foundation Models (https://crfm.stanford.edu/), and Stanford HAI (https://hai.stanford.edu/) organizations for the resources that supported this work. We thank Karan Goel, Maya Varma, Joel Johnson, Sabri Eyuboglu, Kawin Ethayarajh, Niladri Chatterji, Neha Gupta, Alex Ratner, and Rishi Bommasani for their helpful feedback and discussions. We gratefully acknowledge the support of DARPA under Nos. FA86501827865 (SDH) and FA86501827882 (ASED); NIH under No. U54EB020405 (Mobilize), NSF under Nos. CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); ONR under No. N000141712266 (Unifying Weak Supervision); the Moore Foundation, NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, the Okawa Foundation, American Family Insurance, Google Cloud, Swiss Re, Brown Institute for Media Innovation, Department of Defense (DoD) through the National Defense Science and Engineering Graduate Fellowship (NDSEG) Program, Fannie and John Hertz Foundation, National Science Foundation Graduate Research Fellowship Program, Texas Instruments, and members of the Stanford DAWN project: Teradata, Facebook, Google, Ant Financial, NEC, VMWare, and Infosys. SA is supported by a Stanford Graduate Fellowship. LO is supported by an Intelligence Community Postdoctoral Fellowship. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of DARPA, NIH, ONR, or the U.S. Government.

References

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. arXiv:2102.09690v2, 2021. URL https://arxiv.org/pdf/2102.09690.pdf.

Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. Surface form competition: Why the highest probability answer isn’t always right. arXiv:2104.08315, 2021.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.

Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, 2022.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022a.
Andrew K. Lampinen, Ishita Dasgupta, Stephanie C. Y. Chan, Kory Mathewson, Michael Henry Tessler, Antonia Creswell, James L. McClelland, Jane X. Wang, and Felix Hill. Can language models learn from explanations in context?, 2022.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, Yejin Choi, and Hannaneh Hajishirzi. Reframing instructional prompts to gptk’s language. arXiv preprint arXiv:2109.07830, 2021.

Tongshuang Wu, Michael Terry, and Carrie Jun Cai. Ai chains: Transparent and controllable human-ai interaction by chaining large language model prompts. In CHI Conference on Human Factors in Computing Systems, pages 1–22, 2022.

Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. Maieutic prompting: Logically consistent reasoning with recursive explanations, 2022. URL https://arxiv. org/abs/2205.11822.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics (TACL), 2020.

Timo Schick and Hinrich Schütze. It’s not just size that matters: Small language models are also few-shot learners. arXiv:2009.07118v2, 2021.

Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. Snorkel: Rapid training data creation with weak supervision. Proceedings of the VLDB Endowment, 11(3):269–282, November 2017. doi:10.14778/3157794.3157797. URL https://doi.org/10.14778%2F3157794.3157797.

Paroma Varma, Frederic Sala, Ann He, Alexander Ratner, and Christopher Re. Learning dependency structures for weak supervision models. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6418–6427. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/varma19a.html.

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, March 2021. URL https://doi.org/10.5281/zenodo.5297715. If you use this software, please cite it using these metadata.

Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 billion parameter autoregressive language model, 2021.

EleutherAI. URL https://www.eleuther.ai/.

BigScience large open-science open-access multilingual language model, 2022. URL https://huggingface.co/bigscience/bloom.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models, 2022. URL https://arxiv.org/abs/2205.01068.

Simran Arora and Christopher Ré. Can foundation models help us achieve perfect secrecy?, 2022. URL https://arxiv.org/abs/2205.13722.

Avanika Narayan, Ines Chami, Laurel Orr, and Christopher Ré. Can foundation models wrangle your data? arXiv preprint arXiv:2205.09911, 2022.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, et al. Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv, 2022a.

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. arXiv:2109.01652, 2022b.

Pruthvi Patel, Swaroop Mishra, Mihir Parmar, and Chitta Baral. Is a question decomposition unit all we need?, 2022. URL https://arxiv.org/abs/2205.12538.

Antonia Creswell, Murray Shanahan, and Irina Higgins. Selection-inference: Exploiting large language models for interpretable logical reasoning. arXiv preprint arXiv:2205.09712, 2022.

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for GPT-3?, 2021.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv:2110.14168v2, 2021. URL https://arxiv.org/pdf/2110.14168.pdf.

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. Star: Self-taught reasoner bootstrapping reasoning with reasoning. arXiv:2203.14465v2, 2022. URL https://arxiv.org/pdf/2203.14465.pdf.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. 2022b.

Alexander J. Ratner, Christopher M. De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/6709e8d64a5f47269ed5cea9f625f7ab-Paper.pdf.

Alexander Ratner, Braden Hancock, Jared Dunnmon, Frederic Sala, Shreyash Pandey, and Christopher Ré. Training complex models with multi-task weak supervision, 2018. URL https://arxiv.org/abs/1810.02840.

Daniel Fu, Mayee Chen, Frederic Sala, Sarah Hooper, Kayvon Fatahalian, and Christopher Re. Fast and three-rious: Speeding up weak supervision with triplet methods. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3280–3291. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/fu20a.html.

Ryan Smith, Jason A. Fries, Braden Hancock, and Stephen H. Bach. Language models in the loop: Incorporating prompting into weak supervision. arXiv:2205.02318v1, 2022.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint 1905.00537, 2019.

Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In NIPS, 2015.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800gb dataset of diverse text for language modeling, 2021. URL https://arxiv.org/abs/2101.00027.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/2203.15556.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models, 2022c. URL https://arxiv.org/abs/2206.07682.

Nasrin Mostafazadeh, Michael Roth, Annie Louis, Nathanael Chambers, and James Allen. LSDSem 2017 shared task: The story cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 46–51, 2017.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial nli: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2020.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/D13-1170.

Ruining He and Julian McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th international conference on world wide web, pages 507–517, 2016.

Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir Radev, Noah A Smith, Yejin Choi, and Kentaro Inui. Realtime qa: What’s the answer right now? arXiv preprint arXiv:2207.13332, 2022.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 2019.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/D13-1160.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proc. of NAACL, 2019.

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback. CoRR, abs/2112.09332, 2021. URL https://arxiv.org/abs/2112.09332.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683, 2019. URL http://arxiv.org/abs/1910.10683.

Stephen H Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, Nihal V Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, et al. Promptsource: An integrated development environment and repository for natural language prompts. arXiv preprint arXiv:2202.01279, 2022.

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021.

HuggingFace, Nov 2021. URL https://huggingface.co/models.

OpenAI, Nov 2021. URL https://openai.com/api/.

Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? J. ACM, 58(3), jun 2011. ISSN 0004-5411. doi:10.1145/1970392.1970395. URL https://doi.org/10.1145/1970392.1970395.

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013.

Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, 2018.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, 2019.

Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. The commitmentbank: Investigating projection in naturally occurring discourse. In proceedings of Sinn und Bedeutung, volume 23, pages 107–124, 2019.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI spring symposium: logical formalizations of commonsense reasoning, pages 90–95, 2011.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262, 2018.

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint 1810.12885, 2018.

Mohammad Taher Pilehvar and Jose Camacho-Collados. Wic: the word-in-context dataset for evaluating context- sensitive meaning representations. arXiv preprint arXiv:1808.09121, 2018.

Hector Levesque, Ernest Davis, and Leora Morgenstern. The winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning, 2012.

A Experiment Details

We use NVIDIA A100 GPUs to run all experiments.

A.1 Models

We evaluate over four open-source model families (T0, BLOOM, EleutherAI, and OPT) as well as GPT-3. In our evaluations, we use the following model variants: EleutherAI (GPT-Neo-125M, GPT-Neo-1.3B, GPT-J-6B, GPT-NeoX-20B), BLOOM (BLOOM-560M, BLOOM-1.7B, BLOOM-7.1B, BLOOM-176B), OPT (OPT-125M, OPT-1.3B, OPT-6.7B, OPT-13B, OPT-175B), T0 (T0-3B), and GPT-3 (davinci). We download the T0, BLOOM, OPT, and EleutherAI models from the HuggingFace Model Hub [HuggingFace, 2021]. All inference calls to GPT-3 are made through the OpenAI API davinci endpoint [OpenAI, 2021], which serves the original 175B-parameter GPT-3 model used in Brown et al. [2020]; we access it by passing our input prompts to the endpoint for a per-sample fee.
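For illustration, a minimal sketch of how an open-source checkpoint can be loaded and queried (the helper function and decoding settings below are illustrative, not the exact configuration in our released code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load an open-source checkpoint from the HuggingFace Model Hub.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16
).to("cuda")

def complete(prompt: str, max_new_tokens: int = 20) -> str:
    """Greedily continue a prompt; decoding settings here are illustrative."""
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# GPT3-175B is queried through the (2021-era) OpenAI completion API instead, e.g.:
#   openai.Completion.create(engine="davinci", prompt=prompt, max_tokens=20)
```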

A.2 Metrics

For RealTimeQA, the GPT-3 performance in Kasai et al. [2022] is reported using the text-davinci-002 API endpoint. Since all of our GPT-3 evaluations use davinci, we re-evaluate GPT-3 on RealTimeQA using the davinci endpoint and the few-shot prompt from RealTimeQA.
We follow the metrics used in Brown et al. [2020]. All tasks are scored using matching accuracy, except for DROP and RealTimeQA, which use text F1; WebQ and NQ, which use span-overlap accuracy; and MultiRC, which uses F1a.
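For reference, a minimal sketch of the SQuAD-style text F1 used for the generative QA tasks (the normalization here is the standard one; our evaluation code may differ in minor details):

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, remove punctuation and articles, and split into tokens (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def text_f1(prediction, gold):
    """Token-level F1 between a predicted answer string and a gold answer string."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```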

A.3 Weak Supervision

For each task, we use an unlabeled dataset constructed from the test set as well as 1000 samples from the training set (ignoring the labels). We run the structure-learning step of the weak supervision algorithm (which recovers Ĝ) with the default parameters from Varma et al. [2019]. If the recovered sparse matrix has all entries greater than 1 (i.e., the data is too noisy to learn structure from), we pass an empty edgeset to the subsequent step of learning θ̂; otherwise, we pass in the edge with the highest value in the sparse matrix.
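A minimal sketch of this edgeset-selection rule, assuming the structure-learning step has already produced a sparse matrix S over the m prompt()-chain outputs (names and the thresholding details are illustrative):

```python
import numpy as np

def select_edgeset(S: np.ndarray, threshold: float = 1.0):
    """Choose the edges passed to the accuracy-learning step from the recovered sparse matrix S."""
    off_diag = np.abs(S[~np.eye(S.shape[0], dtype=bool)])
    if np.all(off_diag > threshold):
        # Dense matrix: the data is too noisy to learn structure from, so model no dependencies.
        return []
    # Otherwise keep the single strongest recovered dependency.
    masked = np.abs(S).astype(float)
    np.fill_diagonal(masked, -np.inf)
    i, j = np.unravel_index(np.argmax(masked), masked.shape)
    return [(int(i), int(j))]
```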

B Additional Results

Table 3: AMA results for the BLOOM-7.1B parameter model compared to few-shot GPT3-175B. The GPT3-175B numbers are as reported in Brown et al. [2020], where the number of shots is given in parentheses, and the classification task baselines are from Zhao et al. [2021].

B.1 BLOOM Model Results


In Table 3, we provide results using the BLOOM-7.1B parameter model over all 20 benchmarks. We observe consistent lift over few-shot performance using AMA, though the performance remains below that of the comparably sized GPT-J-6B parameter model reported in Table 5.1. We note that the few-shot results for BLOOM-7.1B are also often lower than the few-shot results for GPT-J-6B.

Table 4: Results from applying prompt aggregation via majority vote and weak supervision to 3 random few-shot (k = 3) prompts. Here we apply no prompt reformatting to the proposed AMA QA template.

B.2 AMA Ablations

Here we extend the observations in Section 3 on additional tasks.
We study the degree to which both prompt reformatting and aggregation are required to achieve high quality. Specifically, we produce 3 few-shot prompts (each with a different set of k = 3 in-context examples), prompt with each, and aggregate the results using majority vote and weak supervision (Table 4); the proposed AMA QA reformatting is not applied. We find that aggregation alone leaves large performance gaps: it is useful relative to the average single-prompt performance, but aggregation and reformatting are both critical and complementary to an effective prompting solution.
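For concreteness, a minimal sketch of the majority-vote aggregation used in this ablation (ties are broken toward the first prompt here; weak supervision instead aggregates with the learned model of Appendix C):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent prediction across prompts, breaking ties toward the first prompt."""
    counts = Counter(predictions)
    best = max(counts.values())
    return next(p for p in predictions if counts[p] == best)

# e.g. majority_vote(["entailment", "not entailment", "entailment"]) -> "entailment"
```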

B.3 Weak Supervision Ablations

Comparison to other aggregation baselines Table 5 compares AMA's aggregation method against several other baselines for aggregating prompt()-chains, including majority vote. We compare against weighted majority vote (WMV), where we use labeled data to weight each prompt by its accuracy, predicting $\phi_{\text{WMV}}(\mathcal{P}(x)) = \arg\max_y \sum_{i=1}^{m} \exp(-\eta\, \epsilon_i)\, \mathbb{1}\{p_i(x) = y\}$. Here $\epsilon_i$ is the error of prompt $p_i$ on a training set of 1000 examples, and $\eta$ is a temperature hyperparameter, for which we perform a sweep over [0.25, 0.5, 1, 2, 4, 8, 16, 32] using a 20% validation split. We also compare against the simple strategy of using the prompt that performs best on the labeled data (Pick Best). Finally, AMA (no deps) is our method when we pass an empty edgeset to the algorithm in Ratner et al. [2018].
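A sketch of this weighted-majority-vote baseline under the definitions above (the validation-split handling and variable names are illustrative):

```python
import numpy as np

def wmv_predict(votes: np.ndarray, errors: np.ndarray, eta: float, label_set) -> np.ndarray:
    """votes: (n, m) array of per-prompt predicted label ids; errors: (m,) per-prompt error rates."""
    weights = np.exp(-eta * errors)
    preds = []
    for row in votes:
        scores = {y: weights[row == y].sum() for y in label_set}
        preds.append(max(scores, key=scores.get))
    return np.array(preds)

def sweep_eta(votes_val, y_val, errors, label_set,
              etas=(0.25, 0.5, 1, 2, 4, 8, 16, 32)):
    """Pick the temperature that maximizes accuracy on the 20% validation split."""
    accs = {eta: (wmv_predict(votes_val, errors, eta, label_set) == y_val).mean()
            for eta in etas}
    return max(accs, key=accs.get)
```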

Varying amount of additional data We study the effect of varying the amount of additional unlabeled training data used to learn the probabilistic graphical model over y and P(x). On three tasks (RTE, WSC, and AGNews), averaged over 5 runs, we run AMA with 100%, 50%, 20%, 10%, and 0% of the additional dataset while still evaluating on the fixed test set. Figure 6 shows AMA's accuracy versus the amount of additional unlabeled data used. We find that even without any of the additional data, average accuracy does not decrease on WSC or AGNews, and decreases by only 0.4 points on RTE, still outperforming few-shot GPT3-175B. This suggests that the additional data is not necessary for AMA's performance.

Latency of Weak Supervision Over RTE, WSC, and AGNews, we find that WS (both learning the graphical model and aggregating outputs) takes an average of 13.0 seconds when dependencies are not modeled. When dependencies are modeled, as in RTE (dependencies are ignored in WSC and AGNews because both recover dense structure matrices), the algorithm takes an average of 84.3 seconds. As a point of comparison, Table 6 shows the time in seconds for running inference with the GPT-J-6B model on the same tasks; the latency introduced by running weak supervision is comparatively low.

Table 5: AMA aggregation method ablation for the GPT-J-6B parameter model, along with the number of prompt()-chains used for each task. For ReCoRD and the QA tasks (DROP, WebQs, RealTimeQA, NQ), we use 3 prompts each and use majority vote as the aggregation strategy reported in the (QA + WS) columns of Table 1 and Table 3.
Figure 6: Performance on RTE, WSC, and AGNews averaged over 5 runs when using varying amounts of additional unlabeled training data for estimating Pr(y, P(x)) in WS.
Table 6: Total inference cost in applying the AMA prompt chains to achieve the results in Table 5.1, using the GPT-J-6B model.

B.4 Additional AMA Baselines

Here we compare AMA to Self-Consistency [Wang et al., 2022b], which is particularly relevant in that it also aggregates over multiple prompt outputs without requiring any additional supervised training. Self-Consistency builds on chain-of-thought prompting [Wei et al., 2022a], which guides the LM to generate reasoning paths in addition to the final prediction. We use the exact prompts and overlapping benchmark tasks provided in the Appendix of Wang et al. [2022b] with GPT-J-6B and report the results in Table 7. For Self-Consistency, we use temperature-based sampling as discussed in Wang et al. [2022b], with temperatures ∈ {0.0, 0.3, 0.5, 0.6, 0.7}.

Overall, we observe AMA outperforms Self-Consistency at this model scale. This agrees with the results in Wang et al. [2022b] and Wei et al. [2022a], which report limited performance improvements for small LMs (<10B).

Table 7: Comparison between Self-Consistency Wang et al. [2022b] and AMA using GPT-J-6B and the same number of prompts.

C Weak Supervision Algorithm

We briefly explain the weak supervision algorithm used to construct φWS. Weak supervision models learn the latent-variable graphical model over the distribution Pr(y, P(x)) using the dataset D, and aggregate votes using the learned distribution by setting φ(x) = argmax_y Pr(y | P(x)). The key insight of our aggregation approach is to parametrize Pr(y, P(x)) so that we can capture variations in prompt accuracy as well as dependencies among prompts when they exist. The overall aggregation procedure is given in Algorithm 1. Formally, we model Pr(y, P(x)) as a probabilistic graphical model with dependency graph G = (V, E), where V = {y, P(x)}. If pi(x) and pj(x) are not conditionally independent given y and the other prompt()-chains, then (pi(x), pj(x)) ∈ E. E also contains the edges (pi(x), y) for each i ∈ [m].

The algorithm uses P(x) and D to first learn the dependency structure Ĝ among prompts using the approach of Varma et al. [2019]. The key insight from that work is that the inverse covariance matrix Σ⁻¹ over y and P(x) is graph-structured, meaning that its (i, j) entry is zero if and only if pi(x) and pj(x) are conditionally independent given y. This graph structure means that the inverse covariance over just P(x) decomposes into sparse and low-rank matrices, which can be estimated together using RobustPCA [Candès et al., 2011]; the sparse matrix is then used to recover the graph. Next, the algorithm uses the recovered Ĝ along with P(x) and D to learn the accuracies of the prompts with the approach of Ratner et al. [2018]. The key insight from that work is to use the sparsity of Σ⁻¹ to construct a system of equations, set equal to zero, that recovers the latent accuracy parameters. Once the parameters of the distribution are learned, we can compute Pr_{Ĝ,θ̂}(y | P(x)) and aggregate our predictions.
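To make the aggregation step concrete, the following is a simplified sketch for the conditionally independent case (empty edgeset): scaled accuracies are estimated with the standard triplet method of moments, and votes are then combined as a naive-Bayes product. This is a sketch under those assumptions; the released implementation additionally handles the recovered dependency structure.

```python
import numpy as np

def triplet_accuracies(votes: np.ndarray) -> np.ndarray:
    """Estimate scaled accuracies mu_i = E[p_i(x) * y] from unlabeled votes in {-1, +1},
    assuming conditional independence and that every prompt is better than random (mu_i > 0)."""
    n, m = votes.shape
    M = votes.T @ votes / n                                # empirical second moments E[p_i p_j]
    mu = np.zeros(m)
    for i in range(m):
        j, k = [idx for idx in range(m) if idx != i][:2]   # any triplet works; real code averages several
        denom = M[j, k] if abs(M[j, k]) > 1e-6 else 1e-6
        mu[i] = np.sqrt(max(abs(M[i, j] * M[i, k] / denom), 1e-6))
    return np.clip(mu, 1e-3, 1 - 1e-3)

def aggregate(votes: np.ndarray, mu: np.ndarray) -> np.ndarray:
    """Combine votes as a naive-Bayes posterior with Pr(y = 1) = 1/2; returns labels in {-1, +1}."""
    acc = (1 + mu) / 2                           # Pr(p_i(x) = y)
    log_odds = votes @ np.log(acc / (1 - acc))   # log Pr(y=1 | P(x)) - log Pr(y=-1 | P(x))
    return np.where(log_odds >= 0, 1, -1)
```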

D Information-Flow Theoretical Result

In equation 1, we decompose H(y|ŷ) into H(y|P(x)) and H(y|ŷ) − H(y|P(x)). For AMA, suppose that the weak supervision algorithm exactly recovers Pr(y, P(x)); that is, ŷAMA is drawn from Pr(·|P(x)). Then the second term, H(y|ŷ) − H(y|P(x)), can be thought of as an irreducible error corresponding to how much information about y is lost in converting P(x) into a ŷ randomly drawn from Pr(·|P(x)). Since ŷ is more likely to change values when this distribution has high entropy, the second term is correlated with the first term, H(y|P(x)), the amount of randomness in Pr(y|P(x)). We thus focus on obtaining an expression for H(y|P(x)) in terms of individual prompt accuracies.

We assume that Y = {−1, 1}. We model Pr(y, P(x)) as a probabilistic graphical model with dependency graph G = (V, E), where V = {y, P(x)}. The density of Pr(y, P(x)) follows the Ising model commonly used in weak supervision [Ratner et al., 2017, Fu et al., 2020]:

$$\Pr(y, \mathcal{P}(x)) = \frac{1}{Z} \exp\Big(\theta_y\, y + \sum_{i=1}^{m} \theta_i\, y\, p_i(x) + \sum_{(i,j) \in E} \theta_{ij}\, p_i(x)\, p_j(x)\Big) \qquad (2)$$

where Z is the partition function for normalization and {θy, θi ∀ i ∈ [m], θij ∀ (i, j) ∈ E} are the canonical parameters. Each θi can be viewed as the strength of the correlation between y and pi(x), while each θij can be viewed as the strength of the dependence between pi(x) and pj(x). We assume that θy = 0, which corresponds to Pr(y = 1) = 1/2.

We present our expression for H(y|P(x)). Define Θ = [θ1, . . . , θm] to be the vector of canonical parameters corresponding to the strength of correlation between y and each pi(x). Define µ = E[P(x)y], whose i-th entry E[pi(x)y] can be written as 2 Pr(pi(x) = y) − 1, a notion of accuracy scaled to [−1, 1].

Note that the above form of the distribution is in terms of the canonical parameters θ. The distribution can also be parametrized in terms of the mean parameters corresponding to θ, which are E[y], E[pi(x)y] for i ∈ [m], and E[pi(x)pj(x)] for (pi(x), pj(x)) ∈ E.

Theorem 1. Assume Pr(y, P(x)) follows equation 2 above. Then the conditional entropy H(y|P(x)) can be expressed as

$$H(y \mid \mathcal{P}(x)) = H(y) - \Big(\Theta^\top \mu - \mathbb{E}_{\mathcal{P}(x)}\big[\log \cosh\big(\Theta^\top \mathcal{P}(x)\big)\big]\Big).$$

The quantity being subtracted from H(y) corresponds to the reduction in entropy of y given that we observe P(x). Within this expression there are two terms. First, Θ⊤µ is correlated with how much signal each pi(x) contains about y. Note that this quantity is symmetric: if pi(x) is negatively correlated with y, it still provides information, since both θi and E[pi(x)y] will be negative. The second term, E_P(x)[log cosh(Θ⊤P(x))], is for normalization (otherwise, the first term can grow arbitrarily large with Θ). Note that this quantity is independent of θij, the interactions between prompts.
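As a numerical sanity check on this expression, the sketch below brute-forces a small instance of the model in equation 2 (m = 3, one dependency edge, arbitrary parameter values) and compares the directly computed conditional entropy to the closed form (entropies in nats):

```python
import itertools
import numpy as np

m = 3
theta = np.array([0.8, 0.5, 1.2])        # Theta: correlation strengths between y and each p_i(x)
theta_edges = {(0, 1): 0.4}              # one dependency edge with strength theta_ij
configs = list(itertools.product([-1, 1], repeat=m))

def unnormalized(y, p):
    score = y * (theta @ np.array(p))
    score += sum(t * p[i] * p[j] for (i, j), t in theta_edges.items())
    return np.exp(score)

Z = sum(unnormalized(y, p) for y in (-1, 1) for p in configs)
joint = {(y, p): unnormalized(y, p) / Z for y in (-1, 1) for p in configs}

# Direct computation of H(y | P(x)) by enumeration.
H_direct = 0.0
for p in configs:
    pr_p = joint[(1, p)] + joint[(-1, p)]
    for y in (-1, 1):
        cond = joint[(y, p)] / pr_p
        H_direct -= pr_p * cond * np.log(cond)

# Closed form: H(y) - (Theta^T mu - E[log cosh(Theta^T P(x))]), with H(y) = log 2 since theta_y = 0.
mu = np.array([sum(joint[(y, p)] * y * p[i] for y in (-1, 1) for p in configs) for i in range(m)])
E_logcosh = sum((joint[(1, p)] + joint[(-1, p)]) * np.log(np.cosh(theta @ np.array(p)))
                for p in configs)
H_closed = np.log(2) - (theta @ mu - E_logcosh)

assert np.isclose(H_direct, H_closed)
```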

E AMA Diagnostics

We present a suite of 8 diagnostic tasks, which can be categorized into four task types: question generation, answer generation, answer selection, and extraction. We provide details about the tasks and scoring below.

Question Generation: We measure the ability of the model to transform a statement into a question. We construct 3 question generation tasks, which evaluate the model's ability to transform a statement into a yes/no question (see Question Generation (Yes/No)), transform a statement into a wh-question (see Question Generation (wh-)), and transform a statement about a placeholder entity into a question about the placeholder (see Question Generation (@placeholder)). All question generation tasks are scored using the ROUGE score [Lin, 2004].
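For example, scoring a generated question against its reference can be done with the rouge-score package; the snippet below uses ROUGE-L with illustrative strings, though the exact ROUGE variant in our evaluation harness may differ:

```python
from rouge_score import rouge_scorer

# ROUGE-L between a reference question and a generated question.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "Who went to the park?"
generated = "Who went to the park yesterday?"
scores = scorer.score(reference, generated)   # dict: rouge type -> Score(precision, recall, fmeasure)
print(scores["rougeL"].fmeasure)
```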

Answer Selection: We construct 2 answer selection tasks, which measure the model's ability to generate an answer that is faithful to a set of provided answer choices. Concretely, we measure the model's ability to select object categories from a fixed set of options specified in the context (see Answer Selection (category)), and its ability to complete a sentence when provided with a context and a set of sentence-completion candidates (see Answer Selection (completion)). In both tasks, an answer is marked as correct if the generated response is one of the candidates provided in the context.

Answer Generation: We construct 1 answer generation task, which measures the model's ability to generate candidate sentence completions given a context and a portion of a statement (see Answer Generation). Here, a generation is marked as correct if the model produces 2 candidate answers.

Extraction: We construct 2 extraction tasks which evaluate the ability of the model to extract spans from a given context. The first, and easier task, tests the model’s ability to extract an attribute value from a wikibio (see Extraction (Span)). The second, more difficult task, tests the model’s ability to extract the sentence from the context that mentions a specified entity (see Extraction (Sentence)). For both tasks, we use the Text-F1 score introduced in SQuAD [Rajpurkar et al., 2018].

F Understanding The Effectiveness of the Question-Answering Template

We analyze the LM pretraining corpus to better understand why the proposed QA prompt template may be effective. The EleutherAI models [Black et al., 2021, Wang and Komatsuzaki, 2021] are trained on The Pile corpus [Gao et al., 2021].

Prompt patterns We compute the frequency of regular-expression matches corresponding to restrictive prompts (i.e., those that instruct the model to output "True or False" or "Yes or No") versus open-ended questions (i.e., those that ask the model "Is . . . ?", "Who . . . ?") in a 2% random sample of the ~200B-token Pile corpus. The restrictive prompt patterns appear frequently in the original GPT-3 prompts [Brown et al., 2020]. The frequencies are in Table 8.
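A minimal sketch of this frequency measurement (the regular expressions below are illustrative stand-ins for the full pattern sets):

```python
import re
from collections import Counter

# Illustrative stand-ins for the two pattern families; the full sets used in the analysis are larger.
RESTRICTIVE = [r"[Oo]utput [Tt]rue or [Ff]alse", r"[Yy]es or [Nn]o"]
OPEN_ENDED = [r"\bIs [^?\n]{1,80}\?", r"\bWho [^?\n]{1,80}\?", r"\bWhat [^?\n]{1,80}\?"]

def pattern_counts(documents):
    """Count regex matches for each pattern family over an iterable of raw text documents."""
    counts = Counter()
    for doc in documents:
        counts["restrictive"] += sum(len(re.findall(p, doc)) for p in RESTRICTIVE)
        counts["open_ended_question"] += sum(len(re.findall(p, doc)) for p in OPEN_ENDED)
    return counts

# e.g. pattern_counts(pile_sample), where pile_sample iterates over the 2% sample of The Pile.
```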

We observe that question patterns appear more frequently than the restrictive prompts. Further, we find several instances of yes/no questions followed by "yes" or "no", which mimics the AMA format (Table 9). Overall, QA-structured text appears much more frequently in the pretraining corpus, which may help explain why the language models perform better with QA prompts.

Table 8: Frequency of each category of regular expressions in the Pile sample.
Table 9: Yes/No question patterns followed by “Yes” or “No” tokens.

Word frequencies When applying the few-shot restrictive prompts, we observe large imbalances in the F1-scores across classes (Table 10). We therefore ask whether answering the restrictive prompts is challenging due to biases acquired during pretraining. Over the same Pile sample as before, the mean word count is 25.3 ± 7309 occurrences. We compute the frequency of individual words in the "restrictive" and "open-ended question" patterns from Table 8. This leads to two hypotheses about why QA prompts perform well:

  1. First, we see imbalances between the occurrences of "yes" vs. "no" and "true" vs. "neither", for instance, which may bias the model towards certain answer choices. Zhao et al. [2021] similarly hypothesize that pretraining may instill particular biases in the model, though they do not provide any analysis of the pretraining corpus.
  2. The frequency of the words in the "question words" category is typically an order of magnitude larger than that of the words in the "restrictive words" category. We hypothesize that the representations for the "question words" will be the most context-specific, which is useful for the prompting tasks we consider. Prior work on contextual representations supports this hypothesis, finding that frequently occurring words (e.g., stop-words) have the most context-specific representations; in other words, for more frequently occurring words, the embedding produced by a transformer-based LM changes more significantly depending on the co-occurring words in the context.
Table 10: F1-Score by class for three benchmarks with three different prompting templates each: 1) 0-shot, 2) few-shot with the original GPT-3 restrictive prompts Brown et al. [2020], and 3) AMA prompts. We observe large imbalances in the scores across classes under the 0-shot and few-shot prompting.
Table 11: Frequency of individual words in the "restrictive words" and "question words" categories over the Pile sample.

Overall, designing prompting templates for an LM based on analysis of the LM pretraining corpus may be a promising path forward for future work.

G Error Analysis

We bucket the common error modes of AMA into three categories: knowledge, instruction-following, and long-context.
Knowledge errors. We find that AMA yields the most gains when the knowledge required to complete the task is explicitly provided in the context (e.g., reading comprehension, extractive QA), which is in line with the trends in Figure 7. AMA provides comparatively less lift on tasks where the model needs to (1) recall encoded factual knowledge or (2) apply common-sense or real-world knowledge to a given context. We provide concrete examples from the Natural Questions dataset (see Knowledge (Factual) below) in which the GPT-J-6B model answers incorrectly due to a lack of latent factual knowledge. We additionally provide examples from the BoolQ dataset where the model's limited real-world knowledge (e.g., its failure to recognize that smoked food has been cooked) leads it to answer incorrectly (see Knowledge (Commonsense) below).

Figure 7: Relative performance gain observed when scaling from GPT3-6.7B to GPT3-175B. Results are directly from Brown et al. [2020] and are categorized by type of knowledge required for the task.

Instruction-following errors. We find that on tasks with more restrictive output spaces (e.g., multi-way classification tasks), a common failure mode is generating an answer that is not in the desired output space of the AMA prompt, despite the model being explicitly instructed to choose from it. In Listings 3 and 4, we provide sample instances from the DBPedia classification task where GPT-J-6B does not correctly map a descriptive adjective (e.g., automobile or singer) to a valid class specified in the prompt.

Long-context errors. We find that the AMA question() functions struggle to generate accurate statement-question transformations when the input is long or contains complex sentence structures (e.g. compound sentences). We provide sample instances from the SuperGLUE record task where GPT-J-6B fails to transform a sentence with a placeholder subject to a question about the placeholder subject (see Long-context (question()) below). Additionally, we find that the AMA answer() functions struggle to extract the correct span in long contexts (greater than 6 sentences). We show a sample instance from the DROP QA task where GPT-J-6B fails to extract the correct span from the long provided context (see Long-context (answer()) below).

H Datasets and Prompts

We evaluate over 20 datasets which fall into 4 categories: SuperGLUE (BoolQ [Clark et al., 2019], CB [De Marneffe et al., 2019], COPA [Roemmele et al., 2011], MultiRC [Khashabi et al., 2018], ReCoRD [Zhang et al., 2018], RTE [Wang et al., 2019], WiC [Pilehvar and Camacho-Collados, 2018], WSC [Levesque et al., 2012]), NLI (ANLI R1, ANLI R2, ANLI R3 [Nie et al., 2020], StoryCloze [Mostafazadeh et al., 2017]), Classification (DBPedia [Zhang et al., 2015], AGNews [Zhang et al., 2015], SST2 [Socher et al., 2013], Amazon [He and McAuley, 2016]), and Question-Answering (RealTimeQA [Kasai et al., 2022], DROP [Dua et al., 2019], Natural Questions [Kwiatkowski et al., 2019], WebQuestions [Berant et al., 2013]). We provide dataset details along with the few-shot and AMA prompts for each dataset below.

H.1 AGNews

Description: News article classification dataset with 4 topics. Zhang et al. [2015]
Train Size: 120000, Test Size: 76000

H.2 ANLI R1

Description: Adversarially mined natural language inference dataset from Wikipedia. Nie et al. [2020]
Train Size: 16946, Test Size: 1000

H.3 ANLI R2

Description: Adversarially mined natural language inference dataset from Wikipedia. Nie et al. [2020]
Train Size: 45460, Test Size: 1000

H.4 ANLI R3

Description: Adversarially mined natural language inference dataset from Wikipedia, News and other data sources. Nie et al. [2020]
Train Size: 100459, Test Size: 1200

H.5 Amazon

Description: Amazon product classification dataset with 9 classes. He and McAuley [2016]
Train Size: 9000, Test Size: 9000

H.6 BoolQ

Description: Yes/no QA task over small wikipedia passages. Clark et al. [2019]
Train Size: 9427, Test Size: 3245

H.7 CB

Description: Three-class textual entailment task. Wang et al. [2019]
Train Size: 250, Test Size: 56

H.8 COPA

Description: Causal reasoning dataset where the task is to select the alternative that more plausibly has a causal relation with the premise. Wang et al. [2019]
Train Size: 400, Test Size: 100

H.9 DBPedia

Description: Ontology classification dataset with 14 classes. Zhang et al. [2015]
Train Size: 560000, Test Size: 70000

H.10 DROP

Description: A reading comprehension benchmark requiring discrete reasoning over paragraphs. Dua et al. [2019]
Train Size: 77409, Test Size: 9536

H.11 MultiRC

Description: Multi-sentence reading comprehension dataset. Wang et al. [2019]
Train Size: 27243, Test Size: 953

H.12 Natural Questions (NQ)

Description: Open-domain question answering that contains questions from real users. Kwiatkowski et al. [2019]
Train Size: 307373, Test Size: 7830

H.13 RTE

Description: Dataset where the task is to predict whether a proposed premise sentence entails a given hypothesis sentence. Wang et al. [2019]
Train Size: 2490, Test Size: 277

H.14 ReCoRD

Description: Reading comprehension dataset which requires commonsense reasoning. Wang et al. [2019]
Train Size: 100730, Test Size: 10000

H.15 RealTime QA

Description: Dynamic question answering dataset that asks questions about current world facts. Kasai et al. [2022]
Train Size: 90, Test Size: 187

H.16 SST2

Description: Movie review binary sentiment classification dataset. Socher et al. [2013]
Train Size: 6920, Test Size: 1821

H.17 Story Cloze

Description: Commonsense reasoning task that requires choosing the correct ending to a four-sentence story. Mostafazadeh et al. [2017]

Train Size: 1871, Test Size: 1871

H.18 WSC

Description: Task that requires reading a sentence with a pronoun and selecting the referent of that pronoun from a list of choices. Wang et al. [2019]
Train Size: 554, Test Size: 104

H.19 WebQuestions (WQ)

Description: Question answering dataset with questions that can be answered using Freebase, a large knowledge graph. Berant et al. [2013]
Train Size: 3778, Test Size: 2032

H.20 WiC

Description: Word sense disambiguation task cast as binary classification over sentence pairs. Wang et al. [2019]
Train Size: 5428, Test Size: 638