Shayne Longpre∗ Le Hou Tu Vu Albert Webson Hyung Won Chung Yi Tay Denny Zhou Quoc V. Le Barret Zoph Jason Wei Adam Roberts
Google Research
Abstract
We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 models (Chung et al., 2022). Through careful ablation studies on the Flan Collection of instruction tuning tasks and methods, we tease apart the effect of design decisions that enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks—motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available.1
1 Introduction
Large language models such as PaLM (Chowdhery et al., 2022), Chinchilla (Hoffmann et al., 2022), and ChatGPT, among others (Brown et al., 2020; Ouyang et al., 2022), have unlocked new capabilities in performing natural language processing (NLP) tasks from reading instructive prompts. Prior art has shown that instruction tuning—finetuning language models on a collection of NLP tasks formatted with instructions—further enhances the ability of language models to perform an unseen task from an instruction (Wei et al., 2021; Sanh et al., 2021; Min et al., 2022).
In this work, we evaluate the methods and results of open sourced instruction generalization efforts, comparing their finetuning techniques and methods. In particular, we identify and evaluate the critical methodological improvements in the “Flan 2022 Collection”, which is the term we use for the collection of data and methods for data augmentation and instruction tuning, first implemented and used in Chung et al. (2022). Where Chung et al. (2022) focuses on the emergent and state-of-the-art results of combining Flan 2022 with PaLM 540B, this work focuses on the details of the instruction tuning methods themselves, ablating individual factors and comparing them directly to prior work by keeping the pretrained model size and checkpoint consistent.
The Flan 2022 Collection offers the most extensive publicly available set of tasks and methods for instruction tuning, which we have compiled in one place. We have also supplemented this with hundreds more of our own high-quality templates, richer formatting patterns, and data augmentations. We show that a model trained on this collection outperforms other public collections on all tested evaluation benchmarks, including the original Flan 2021 (Wei et al., 2021), T0++ (Sanh et al., 2021), Super-Natural Instructions (Wang et al., 2022c), and the concurrent work on OPT-IML (Iyer et al., 2022). As shown in Figure 1, this includes 4.2%+ and 8.5% improvements on the MMLU (Hendrycks et al., 2020) and BIG-Bench Hard (Suzgun et al., 2022) evaluation benchmarks respectively, for equally sized models.
Analysis of the Flan 2022 method suggests the strong results stem not only from the larger and more diverse set of tasks, but also from a set of simple finetuning and data augmentation techniques. In particular, training on a mix of examples templatized with zero-shot, few-shot, and chain-of-thought prompts improves performance in all of these settings together. For instance, adding just 10% few-shot prompts improves zero-shot prompting results by 2%+. Additionally, enriching task diversity by inverting input-output pairs, as used by Sanh et al. (2021) and Min et al. (2022), and balancing task sources are both shown to be critical to performance. The resulting Flan-T5 model converges faster and at a higher performance than T5 models in single-task finetuning—suggesting instruction-tuned models offer a more computationally-efficient starting checkpoint for downstream applications, corroborating Aribandi et al. (2021) and Liu et al. (2022b).
We hope making these findings and resources publicly available will unify resources around instruction tuning and accelerate research into more general-purpose language models. We summarize this work’s core contributions as follows:
• Methodological: Show that training with mixed zero- and few-shot prompts yields much better performance in both settings (Section 3.2).
• Methodological: Measure and demonstrate the critical techniques for effective instruction tuning: scaling the number of tasks (Section 3.3), enriching task variety with input inversion (Section 3.4), adding chain-of-thought training data, and balancing different data sources (Section 3.5).
• Results: Demonstrate these technical choices yield 3-17% Held-Out task improvements over existing open source instruction tuning collections (Figure 1).
• Results: Demonstrate Flan-T5 serves as a stronger and more computationally-efficient starting checkpoint for single-task finetuning (Section 4).
• Open source the new Flan 2022 task collection, templates, and methods for public research.
2 Public Instruction Tuning Collections
Note that the numbers of tasks and examples vary under different assumptions and so are estimates. For instance, the definitions of “task” and “task category” vary by work, and are not easily reduced to one ontology. Task counts are reported using the task definitions from the respective works.
† indicates concurrent work.
Large Language Models Instruction tuning has emerged as a tool to make large language models (LLMs) and their abilities more useful for interactive dialog and functional tasks. Previous work (Raffel et al., 2020; Liu et al., 2019; Aghajanyan et al., 2021; Aribandi et al., 2021) experimented with large-scale multi-task finetuning to improve downstream single-target finetuning, but without instruction prompts. UnifiedQA and others (Khashabi et al., 2020; McCann et al., 2018; Keskar et al., 2019) unified a wide range of NLP tasks into a single generative question answering format, using prompt instructions for multi-task finetuning and evaluation.
The First Wave Since 2020, several instruction tuning task collections have been released in rapid succession, outlined in Figure 2. Natural Instructions (Mishra et al., 2021), Flan 2021 (Wei et al., 2021), and P3 (the Public Pool of Prompts, Bach et al., 2022) aggregated large NLP task collections and templatized them with instructions (zero-shot prompting), specifically for finetuning models to generalize to unseen instructions. MetaICL (Min et al., 2022) also consolidated other task collections (Ye et al., 2021; Khashabi et al., 2020) to train models to learn tasks “in-context” from several input-output examples, known as few-shot prompting, but in this case without instructions. Each of these works affirmed the scaling benefits of task and template diversity, and some reported strong benefits from inverting the inputs and outputs in templates to produce new tasks (“noisy channel” in Min et al., 2022).
The Second Wave A second wave of instruction tuning collections expanded prior resources: combining more datasets and tasks into one resource, like Super-Natural Instructions (Wang et al., 2022c) or OPT-IML (Iyer et al., 2022), adding multilingual instruction tuning in xP3 (Muennighoff et al., 2022), and Chain-of-Thought training prompts in Flan 2022 (Chung et al., 2022). Both the Flan Collection and OPT-IML contain most tasks represented in prior collections.2 Our work is positioned here, coalescing most of these collections (of collections) and their methods, as the strongest starting point for future open source work.
New Directions Concurrent and future work is beginning to explore two new directions: (a) expanding task diversity even more aggressively with synthetic data generation, particularly in creative and open-ended dialogue (Wang et al., 2022b; Honovich et al., 2022; Ye et al., 2022; Gupta et al., 2022), and (b) offering human feedback signals on model responses (Ouyang et al., 2022; Glaese et al., 2022; Bai et al., 2022a; Nakano et al., 2021; Bai et al., 2022b). We view most of these new directions as likely additive to a foundation of instruction tuning methods.
Tuning with Human Feedback Instruction tuning on human feedback has demonstrated strong results on open-ended tasks, but at the expense of performance on a wide array of more traditional NLP tasks (Ouyang et al., 2022; Glaese et al., 2022; Bai et al., 2022a; Nakano et al., 2021). (See Ouyang et al. (2022)’s discussion of the “alignment tax”.) Our work focuses specifically on instruction generalization, without human feedback, for two reasons. First, human feedback datasets are far less publicly available than instruction tuning datasets (and may be model-specific). Second, by itself, instruction generalization shows great promise in enhancing human-preferred responses on open-ended tasks, as well as improving traditional NLP metrics (Chung et al., 2022). The extent of obtainable progress without expensive human response demonstrations or ratings remains an open question, and an important pursuit to narrow the gap between public and non-public research.
The Importance of Open Source High profile research is increasingly driven by non-public data, as in the case of GPT-3 and others (Ouyang et al., 2022; Glaese et al., 2022). The inaccessibility of these resources inhibits the research community’s ability to analyze and improve these methods in the public domain. We narrow our purview to open source and accessible data collections, motivated by the goal of democratizing accessibility to research.
3 Flan 2022 Instruction Tuning Experiments
Recent research has yet to coalesce around a unified set of techniques, with different tasks, model sizes, and target input formats all represented. We open source a new collection, first introduced in Chung et al. (2022), denoted “Flan 2022”, which combines Flan 2021, P3++,3 and Super-Natural Instructions with additional reasoning, dialog, and program synthesis datasets. We defer to Chung et al. (2022) for details of templatization and collection; in this work we take a deeper look at key methodological improvements and compare the collection, at equivalent model sizes, against existing collections.
In this section, we evaluate the design decisions in Flan and discuss four in particular that yield strong improvements to the instruction tuning recipe. These design components, outlined in Section 2, are: (I) using mixed zero-shot, few-shot, and Chain-of-Thought templates at training (Section 3.2), (II) scaling T5-sized models to 1800+ tasks (Section 3.3), (III) enriching tasks with input inversion (Section 3.4), and (IV) balancing these task mixtures (Section 3.5). In Section 3.1, we begin by measuring the value of each component and comparing the final model against alternative instruction tuning collections (and their methods).
Experimental Setup We finetune the prefix language model adapted T5-LM (Lester et al., 2021), using the XL (3B) size for all models for consistency, unless otherwise stated. While other sizes of Flan-T5 are available, we felt XL was appropriately sized to run large-scale systematic ablations, while being sufficiently large to draw general conclusions. We evaluate on (a) a suite of 8 “Held-In” tasks represented within the 1800+ training task collection (4 question answering and 4 natural language inference validation sets), (b) Chain-of-Thought (CoT) tasks (5 validation sets), and (c) the MMLU (Hendrycks et al., 2020) and BBH (Suzgun et al., 2022) benchmarks as our set of “Held-Out” tasks, as they are not included as part of Flan 2022 finetuning. The Massively Multitask Language Understanding benchmark (MMLU) broadly tests reasoning and knowledge capacity across 57 tasks in the sciences, social sciences, humanities, business, and health, among other subjects. BIG-Bench Hard (BBH) includes 23 challenging tasks from BIG-Bench (Srivastava et al., 2022) where PaLM underperforms human raters. In our ablations, we also evaluate BBH with Chain-of-Thought inputs, following Chung et al. (2022). Additional finetuning and evaluation details are provided in Appendix A.
3.1 Ablation Studies
Table 1 summarizes the mean contribution to Held-In, Held-Out, and Chain-of-Thought tasks when removing each method individually: mixture weight balancing (“- Mixture Balancing”), Chain-of-Thought tasks (“- CoT”), mixed prompt settings (“- Few Shot Templates”), and input inversion (“- Input Inversion”). Flan-T5 XL leverages all four of these methods together. We also finetune T5-XL-LM on other collections, including Flan 2021, P3++, and Super-Natural Instructions, for comparison.
Each of the ablated components of Flan contributes improvements to different metrics: Chain-of-Thought training to Chain-of-Thought evaluation, input inversion to Held-Out evaluations (MMLU and BBH), few-shot prompt training to few-shot evaluations, and mixture balancing to all metrics.
As compared to T5-XL models trained on alternative instruction tuning collections (and their methods), Flan outperforms in almost every setting. While previous collections are tuned specifically to zero-shot prompts, Flan-T5 XL is tuned for either zero- or few-shot prompts. This yields performance margins of +3-10% for most of the zero-shot settings, and margins of 8-17% for the few-shot settings. Most impressively, Flan 2022 outperforms the much larger OPT-IML-Max 30B (10x) and 175B (58x) models. Next, we isolate some of Flan 2022’s ablated methods individually, to examine the benefits of each.
3.2 Training with Mixed Prompt Settings
Prior work has shown that a wide variety of input templates per task can improve performance. However, separate from the wording of the instruction template, these prior efforts mostly tune with template sets targeted to a single prompt setting: either zero-shot prompting (Wei et al., 2021; Sanh et al., 2021; Aghajanyan et al., 2021; Aribandi et al., 2021) or few-shot prompting (Min et al., 2022; Wang et al., 2022c).
An underappreciated design decision in InstructGPT (Ouyang et al., 2022) was to mix training templates for each of these prompt settings, rather than target a single setting. However, since Ouyang et al. (2022) do not examine this choice, we expected a performance trade-off in finetuning for zero-shot or few-shot prompting performance – particularly for smaller models. Instead, we find training with mixed zero- and few-shot prompts significantly improves performance in both settings – most surprisingly, even for models with only 3B parameters.
Figure 3 shows that (1) adding as little as 5% few-shot training templates can dramatically improve zero-shot performance, and (2) adding 10%+ zero-shot data improves few-shot performance too. Both Held-In and Held-Out task performance peak anywhere between 10-90% few-shot data, but performance across this range is consistently higher than training with only one prompt setting.
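To make the mixed prompt setting concrete, the following is a minimal sketch (not the released Flan templatizing code; field names, separators, and the few_shot_rate default are illustrative) of how a single task's examples can be templatized as a mixture of zero-shot and few-shot formats during finetuning:

```python
import random

def format_zero_shot(example, instruction):
    # Zero-shot: instruction followed directly by the example input.
    return f"{instruction}\n{example['input']}", example["target"]

def format_few_shot(example, instruction, exemplars):
    # Few-shot: instruction, then several input/target exemplars, then the query.
    shots = "\n\n".join(f"{e['input']}\n{e['target']}" for e in exemplars)
    return f"{instruction}\n{shots}\n\n{example['input']}", example["target"]

def templatize(task_examples, instruction, few_shot_rate=0.1, num_shots=3, seed=0):
    """Yield (input, target) pairs, a few_shot_rate fraction of which are few-shot formatted."""
    rng = random.Random(seed)
    for i, example in enumerate(task_examples):
        if rng.random() < few_shot_rate:
            # Exemplars are drawn from other examples of the same task.
            pool = task_examples[:i] + task_examples[i + 1:]
            exemplars = rng.sample(pool, k=min(num_shots, len(pool)))
            yield format_few_shot(example, instruction, exemplars)
        else:
            yield format_zero_shot(example, instruction)
```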
3.3 Scaling Small Models to 1.8k+ Tasks
The most recent and concurrent publicly available instruction tuning efforts, like Flan 2022, train on thousands of tasks (Wang et al., 2022c; Iyer et al., 2022), but operate on different task compositions and underlying training methods. To measure the impact of scaling model sizes and tasks for the Flan 2022 collection, we finetune T5-LM adapted models (Small, Base, Large, XL, XXL) on randomly selected task subsets (8, 25, 50, 100, 200, 400, 800, all 1873). Every finetuning run is guaranteed to include the Held-In tasks, so we can estimate how task scaling impacts the model’s capacity to maintain performance on tasks it has already seen.
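As a rough illustration of this setup, the sketch below (task names and the helper are placeholders, not the actual T5X mixture code) samples finetuning subsets of increasing size while always retaining the Held-In tasks:

```python
import random

def sample_task_subsets(all_tasks, held_in_tasks,
                        sizes=(8, 25, 50, 100, 200, 400, 800), seed=0):
    """Return {subset_size: task list}, each subset containing every Held-In task."""
    rng = random.Random(seed)
    held_in = list(held_in_tasks)
    remaining = [t for t in all_tasks if t not in set(held_in)]
    subsets = {}
    for size in sizes:
        extra = rng.sample(remaining, k=max(0, size - len(held_in)))
        subsets[size] = held_in + extra
    subsets[len(all_tasks)] = list(all_tasks)  # the full collection
    return subsets
```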
Figure 4 demonstrates that both Held-In and Held-Out tasks appear to benefit from adding hundreds of finetuning tasks. Held-in task evaluations peak around 200 total tasks, and diminish in performance as more tasks are added, though larger models peak later and diminish less. Held-out task performance increases log-linearly with the number of tasks, achieving the highest performances with all 1836 tasks.
Surprisingly, only T5-Small appears to peak in Held-Out task performance before 1836 tasks, while larger model sizes continue to improve. These results suggest (a) even T5-Base may not have exhausted its capacity with thousands of tasks, and (b) the largest LMs could benefit from thousands more tasks for both Held-In and Held-Out task performance.
One necessary assumption of this analysis is that all tasks are defined and counted equally. Section 3.5 demonstrates that not all task sources are equally beneficial to training, and that model performance may saturate from too many tasks from one source (e.g. Super-Natural Instructions). We would caution against concluding that task scaling beyond 1800 tasks would translate to increased returns without also paying attention to task diversity and quality.
3.4 Task Enrichment with Input Inversion
Prior instruction tuning work has enriched task diversity by inverting the (x, y) input-output pairs in supervised tasks—referred to as “prompts not intended for the original task” in P3 (Bach et al., 2022) or the “noisy channel” in MetaICL (Min et al., 2022). For example, a dataset may originally be designed to evaluate whether, given a question x, a model can produce the answer y. Input inversion instead gives a model the answer y and trains it to generate the question x. This is an easy method to enrich the task variety given a limited set of data sources. However, it is not clear that this method remains helpful when hundreds of unique data sources and thousands of tasks are already available.
To assess this, we enrich our mixtures with input-inverted tasks (details and examples in Appendix B) and measure the effect. In Table 1 we find this is not beneficial for Held-In performance, but strongly beneficial for Held-Out performance. These benefits invigorate the prospect of data augmentation techniques for LLM finetuning, which had previously been shown to have diminishing returns the longer models are pretrained (Longpre et al., 2020).
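The following is a minimal sketch of input inversion for a question answering example; the field names and instruction wording are illustrative, and Appendix B describes the actual inversions used:

```python
def invert_example(example):
    """Return the original (x -> y) task plus an inverted (y -> x) variant."""
    original = {
        "input": f"Question: {example['question']}\nAnswer:",
        "target": example["answer"],
    }
    inverted = {
        "input": f"Write a question whose answer is: {example['answer']}",
        "target": example["question"],
    }
    return [original, inverted]

# Example usage:
# invert_example({"question": "What is the capital of France?", "answer": "Paris"})
```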
3.5 Balancing Data Sources
Scaling architecture size and the number of tasks are effective, but our results suggest the mixture weighting deserves as much attention to optimize results. To converge on a balanced weighting, we omit different task sources, one at a time (Flan 2021, T0-SF, Super-Natural Instructions, Chain-of-Thought, Dialog, and Program Synthesis), and rank their contributions on the MMLU benchmark.4
As shown in Table 2, Flan 2021 and T0-SF are among the most beneficial mixtures, followed by Super-Natural Instructions and Chain-of-Thought, with Dialog and Program Synthesis last. These findings are corroborated by Iyer et al. (2022) who extensively test data mixing proportions, and also determine their Flan 2021, T0-SF, and T5 mixtures are the most broadly beneficial. Additionally, they find Super-Natural Instructions has limited scaling benefits on Held-Out task performance, which they relate to its unique input format and instruction design. Notably, Chain-of-thought finetuning appears beneficial across all our evaluation settings, especially considering they contain far fewer tasks than Flan 2021, T0-SF or Natural Instructions.
We used these findings to significantly narrow the mixture weight search space, and relied on practitioner intuition from there. This strategy is simple but effective, as shown in Table 1, but leaves ample room for more sophisticated future work.
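A sketch of the leave-one-out procedure described above is shown below; finetune_and_evaluate stands in for the full T5X finetuning and MMLU evaluation pipeline and is not a real API, and the source names are illustrative:

```python
SOURCES = ["flan2021", "t0_sf", "super_natural_instructions",
           "chain_of_thought", "dialog", "program_synthesis"]

def rank_sources(mixture, finetune_and_evaluate):
    """Rank task sources by the held-out score drop when each is omitted."""
    baseline = finetune_and_evaluate(mixture)
    drops = {}
    for source in SOURCES:
        ablated = {name: weight for name, weight in mixture.items() if name != source}
        drops[source] = baseline - finetune_and_evaluate(ablated)
    # A larger drop means the omitted source contributed more to held-out performance.
    return sorted(drops.items(), key=lambda kv: kv[1], reverse=True)
```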
3.6 Discussion
OPT-IML (Iyer et al., 2022) presents the closest comparison to this work, including a similar collection of tasks, examples, and techniques. However, while the tasks they use are all publicly sourced, their collection, with its templates, processing, and example mixing, is not released, and as a result cannot easily be compared. Iyer et al. (2022) report that Flan-T5-XL (3B) and XXL (11B) outperform OPT-IML-Max 175B on both MMLU and BBH. As they discuss, these differences may arise from any combination of pre-training, model architecture, and instruction tuning. Model architecture and pretraining before instruction tuning can play a significant role (Wang et al., 2022a). But there are many other details in instruction tuning that may vary between Flan 2022 and OPT-IML. Likely candidates are: example templatization, how the mixed input prompting procedures are used at training, and task composition.
How significant is each of these differences? While OPT-IML contains more tasks than Flan 2022, we estimate approximately 94% (2067/2207) are also used in the Flan 2022 collection,5 and very few tasks in Flan 2022 are not contained in some format in OPT-IML. This suggests the overall difference in task diversity is not significant when using a shared definition of “task”. Task mixture rates also emphasize similar sources, including Flan 2021 (46% vs 20%), PromptSource/P3 (28% vs 45%), and Super-Natural Instructions (25% vs 25%), for Flan 2022 and OPT-IML respectively.6 OPT-IML’s other collections (Crossfit, ExMix, T5, U-SKG) are not weighted significantly: 4%, 2%, 2%, and 2% respectively.
We believe example templatization and the mixed prompt formats may pose the largest differences with OPT-IML’s instruction tuning. Our template repository was significantly updated from Flan 2021, adding variety not just in instructions, but also along other formatting dimensions. For instance, the templatization procedure varies where the instruction is placed (before or after few-shot prompts), the spacing and separators between few-shot and Chain-of-Thought prompts, and the formatting permutations of answer options (and their targets) for multiple-choice examples, which sometimes include and sometimes exclude answer options in the inputs or exemplars. While we do not have dedicated experiments comparing many iterations of development, we found these procedures dramatically augment input variety and yielded repeated performance improvements. Our example templatizing procedure is open sourced for inspection and future work.
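The sketch below illustrates the kinds of formatting permutations described above for a multiple-choice example; the separators, probabilities, and layout choices are our own illustration rather than the released templates:

```python
import random

def render_multiple_choice(instruction, question, options, exemplars, rng=None):
    """Render one multiple-choice input with randomized formatting choices."""
    rng = rng or random.Random(0)
    sep = rng.choice(["\n\n", "\n---\n"])       # separator between exemplars
    include_options = rng.random() < 0.5         # sometimes include answer options
    option_str = ""
    if include_options:
        option_str = "\nOptions:\n" + "\n".join(f"- {o}" for o in options)
    body = sep.join(list(exemplars) + [f"{question}{option_str}"])
    if rng.random() < 0.5:                       # instruction before or after exemplars
        return f"{instruction}\n{body}"
    return f"{body}\n{instruction}"
```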
4 Instruction Tuning Enhances Single-Task Finetuning
In applied settings, machine learning practitioners deploy NLP models finetuned (FT) specifically for a single target task, usually where finetuning data is already available. While prior work has shown the benefits of intermediate finetuning (Pruksachatkun et al., 2020; Vu et al., 2020) or multi-task finetuning (Aghajanyan et al., 2021; Aribandi et al., 2021) for downstream tasks, this has not been studied extensively for instruction-tuned models.
We evaluate Flan 2022 instruction tuning as an intermediate step before single target finetuning, to understand whether Flan-T5 would serve as a better starting checkpoint for applied practitioners. We evaluate three settings in
Figure 5: finetuning T5 directly on the target task as the conventional baseline (blue bars), using Flan-T5 without further finetuning (beige bars), and finetuning Flan-T5 further on the target task (red bars).
Pareto Improvements to Single Task Finetuning For both sets of Held-In and Held-Out tasks examined, finetuning Flan-T5 offers a Pareto improvement over finetuning T5 directly. In some instances, usually where finetuning data is limited for a task, Flan-T5 without further finetuning outperforms T5 with task finetuning.
Faster Convergence & Computational Benefits Using Flan-T5 as a starting checkpoint has an added benefit in training efficiency. As demonstrated in Figure 6, Flan-T5 converges much more quickly than T5 during single target finetuning, and peaks at higher accuracies. These convergence results suggest there are strong green-AI incentives for the NLP community to adopt instruction-tuned models, like Flan-T5, for single-task finetuning, rather than conventional non-instruction-tuned models. While instruction tuning is more computationally expensive than single-task finetuning, it is a one-time cost. In contrast, pretrained models that require extensive finetuning become more costly when aggregating over many millions of additional training steps (Wu et al., 2022; Bommasani et al., 2021). Instruction-tuned models offer a promising solution to significantly reduce the number of finetuning steps across a wide swathe of tasks, if they are adopted as a new standard starting point for single-task finetuning.
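For practitioners, starting from Flan-T5 instead of T5-LM is a small change. The sketch below uses the publicly released Hugging Face checkpoints rather than the paper's T5X setup; the training loop and hyperparameters are intentionally minimal and illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Swap in the instruction-tuned checkpoint as the starting point for finetuning.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(input_texts, target_texts):
    inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True)
    labels = tokenizer(target_texts, return_tensors="pt", padding=True, truncation=True).input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```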
5 Related Work
Large Language Models As the foundation of instruction tuning, the practice of pretraining one general-purpose language representation that is useful for multiple downstream tasks has a long tradition, going back at least to Mikolov et al. (2013) and Dai and Le (2015). In 2018, Peters et al. (2018) and Devlin et al. (2019) cemented the paradigm of pretraining a large model on a large unsupervised corpus, and the field of NLP quickly converged on these models, which substantially outperform the prior art of non-pretrained, task-specific LSTM models on all tasks. However, the dominant way to access the high-quality syntactic and semantic knowledge encoded in pretrained models was not to prompt them with instructions, but to train an additional task-specific linear layer that maps the model activations into numerical class labels. A year later, Radford et al. (2019), Raffel et al. (2020), and Lewis et al. (2020) popularized the notion that downstream tasks, and multiple tasks jointly, can be learned by directly using the pretrained LM head to generate the answers in natural language (rather than task-specific numerical class labels). The task-general nature of these generative models became the precursor to many multitask transfer learning studies (McCann et al., 2018; Khashabi et al., 2020; Ye et al., 2021; Vu et al., 2020), which in turn led to the first wave of instruction tuning as described in Section 2.
The continuing advancement in research on the pretraining corpora, architectures, and pretraining objectives of LMs also has a large impact on instruction tuning. As of 2022, decoder-only left-to-right causal Transformers dominate the market of models larger than 100B (Brown et al., 2020; Thoppilan et al., 2022; Rae et al., 2021; Chowdhery et al., 2022; Hoffmann et al., 2022), and all models of this size class with fully public model parameters are decoder-only (Wang and Komatsuzaki, 2021; Le Scao et al., 2022; Zhang et al., 2022), a decision often driven by better hardware and software framework support. However, Raffel et al. (2020), Lewis et al. (2020), and Tay et al. (2022a) have consistently found that left-to-right causal language modeling is a suboptimal objective, while Tay et al. (2022b) and Wang et al. (2022a) in particular showed that a mixture of non-sequential objectives is substantially better for downstream tasks with zero-shot and few-shot prompting. An additional factor that remains under-explored is the relationship between pretraining corpora, instruction tuning, and downstream abilities. Typically, public models are all trained on one of a few public corpora: C4 (Raffel et al., 2020), The Pile (Gao et al., 2020), or ROOTS (Laurençon et al., 2022).
Instruction Tuning In Section 2 we outline major developments in instruction tuning. Other important developments include the prospect of complementing or replacing few-shot in-context learning, currently the predominant method of evaluating pretrained and instruction-tuned models, with parameter-efficient tuning. Because standard finetuning of models larger than 100B requires a large number of accelerators with the right interconnects, which is often too expensive even for many industry labs, parameter-efficient tuning (a.k.a. continuous or soft “prompt tuning”) shows that updating only a small subset of model parameters can reach performance comparable to fully tuning all model parameters (Lester et al., 2021; Vu et al., 2022; Hu et al., 2021; see He et al., 2022 for a detailed analysis). Notably, Liu et al. (2022b) show that, due to the long sequence length of few-shot ICL and the need to repeatedly process the few-shot exemplars when evaluating every example, parameter-efficient tuning can be computationally cheaper and higher performing than in-context learning. Further, Liu et al. (2022b), Vu et al. (2022), Wei et al. (2021), and Singhal et al. (2022) collectively show that both single-task and multi-task parameter-efficient tuning can be productively combined with instruction tuning, either before or after regular full-model instruction tuning. This line of work makes it easy for other researchers to build on top of a general-domain instruction-tuned model and collect a custom instruction-tuning mixture for their use, e.g., with multiple modalities (Ahn et al., 2022; Huang et al., 2022; Xu et al., 2022) or special domains such as science and medicine (Lewkowycz et al., 2022; Singhal et al., 2022).
Problems Addressed by Instruction Tuning & Alignment Techniques Instruction tuning is part of a line of work designed to “align” language models with more useful objectives and human preferences. In the absence of such methods, language models are known to demonstrate toxic or harmful behaviour (Sheng et al., 2019; Liang et al., 2021; Wallace et al., 2019), generate non-factual information (Maynez et al., 2020; Longpre et al., 2021; Devaraj et al., 2022), and pose other challenges in deployment and evaluation (Zellers et al., 2019; McGuffie and Newhouse, 2020; Talat et al., 2022). Analyzing, evaluating, and mitigating these problems poses a promising direction for future work (Gao et al., 2022; Ganguli et al., 2022). Instruction tuning warrants greater investigation, as it has already demonstrated itself to be an encouraging remedy, reducing NLP bias metrics as shown in Chung et al. (2022).
6 Conclusions
The new Flan 2022 instruction tuning collection unifies the most popular prior public collections and their methods, while adding new templates and simple improvements like training with mixed prompt settings. The resulting collection outperforms Flan 2021, P3++, Super-Natural Instructions, and OPT-IML-Max 175B on Held-In QA, NLI, and Chain-of-Thought tasks, and on Held-Out MMLU and BBH, often by large margins. Results suggest this new collection serves as a more competitive starting point for researchers and practitioners interested both in generalizing to new instructions and in finetuning on a single new task.
Acknowledgements
We would like to thank Ed H Chi, Xinyun Chen, and Colin Raffel for their advice and feedback on the paper.
References
Armen Aghajanyan, Anchit Gupta, Akshat Shrivastava, Xilun Chen, Luke Zettlemoyer, and Sonal Gupta. Muppet: Massive multi-task representations with pre-finetuning. In EMNLP, 2021. URL https:// aclanthology.org/2021.emnlp-main.468.
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy Zeng. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. arXiv e-prints, art. arXiv:2204.01691, April 2022.
Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q Tran, Dara Bahri, Jianmo Ni, et al. Ext5: Towards extreme multi-task scaling for transfer learning. arXiv preprint arXiv:2111.10952, 2021.
Stephen Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-david, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Fries, Maged Al-shaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xiangru Tang, Dragomir Radev, Mike Tian-jian Jiang, and Alexander Rush. PromptSource: An integrated development environment and repository for natural language prompts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 93–104, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-demo.9. URL https://aclanthology.org/2022.acl-demo.9.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022b.
Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. The fifth pascal recognizing textual entailment challenge. In TAC, 2009.
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language mod-els are few-shot learners. NeurIPS, 2020. URL https://proceedings.neurips.cc/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, et al. PaLM: Scaling language modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022. URL https://arxiv.org/abs/2204.02311.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. URL https://arxiv.org/abs/2110.14168.
Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, vol-ume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/ 7137debd45ae4d0ab9aa953017286b20-Paper.pdf.
Ashwin Devaraj, William Sheffield, Byron Wallace, and Junyi Jessy Li. Evaluating factuality in text simplifi-cation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7331–7345, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.506. URL https://aclanthology.org/2022.acl-long.506.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL, 2019. URL https://aclanthology.org/N19-1423.
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Y Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, et al. Attributed text generation via post-hoc research and revision. arXiv preprint arXiv:2210.08726, 2022.
Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021.
Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022.
Prakhar Gupta, Cathy Jiao, Yi-Ting Yeh, Shikib Mehri, Maxine Eskenazi, and Jeffrey P Bigham. Improving zero and few-shot generalization in dialogue through instruction tuning. arXiv preprint arXiv:2205.12673, 2022.
Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=0RDcd5Axok.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. ICLR, 2020. URL https://openreview.net/ forum?id=d7KBjmI3GmQ.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689, 2022.
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. CoRR, abs/2106.09685, 2021. URL https: //arxiv.org/abs/2106.09685.
Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Cosmos qa: Machine reading com-prehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2391–2401, 2019.
Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models. In arXiv preprint arXiv:2207.05608, 2022.
Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, Xian Li, Brian O’Horo, Gabriel Pereyra, Jeff Wang, Christopher Dewan, Asli Celikyilmaz, Luke Zettlemoyer, and Ves Stoyanov. Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017, 2022. URL https: //arxiv.org/abs/2212.12017.
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577, 2019. URL https://aclanthology.org/D19-1259.
Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. Unifying question answering, text classification, and regression via span extraction. arXiv preprint arXiv:1904.09286, 2019.
Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. Unified QA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, 2020. URL https://aclanthology.org/2020.findings-emnlp. 171.
Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, et al. The bigscience roots corpus: A 1.6 tb composite multilingual dataset. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tun-ing. EMNLP, 2021. doi: 10.18653/v1/2021.emnlp-main.243. URL https://aclanthology.org/2021. emnlp-main.243.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.703. URL https://aclanthology.org/2020.acl-main.703.
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022. URL https://arxiv.org/abs/2206.14858.
Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. Towards understanding and mitigating social biases in language models. In ICML, 2021.
Alisa Liu, Swabha Swayamdipta, Noah A Smith, and Yejin Choi. Wanli: Worker and ai collaboration for natural language inference dataset creation. arXiv preprint arXiv:2201.05955, 2022a. URL https: //arxiv.org/abs/2201.05955.
Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning, 2022b. URL https://arxiv.org/abs/2205.05638.
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4487–4496, 2019.
Shayne Longpre, Yu Wang, and Chris Du Bois. How effective is task-agnostic data augmentation for pretrained transformers? In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4401–4411, 2020.
Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. Entity-based knowledge conflicts in question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7052–7063, 2021.
Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/ v1/2020.acl-main.173. URL https://aclanthology.org/2020.acl-main.173.
Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.
Kris McGuffie and Alex Newhouse. The radicalization risks of gpt-3 and advanced neural language models. arXiv preprint arXiv:2009.06807, 2020.
Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. A diverse corpus for evaluating and developing english math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975–984, 2020.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed represen-tations of words and phrases and their compositionality. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, vol-ume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper/2013/file/ 9aa42b31882ec039965f3c4923ce901b-Paper.pdf.
Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. MetaICL: Learning to learn in context. In NAACL, 2022. URL https://aclanthology.org/2022.naacl-main.201.
Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773, 2021.
Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786, 2022.
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial nli: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, 2020.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022. URL https://arxiv.org/abs/2203.02155.
Zarana Parekh, Jason Baldridge, Daniel Cer, Austin Waters, and Yinfei Yang. Crisscrossed captions: Extended intramodal and intermodal semantic similarity judgments for MS-COCO. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 2855–2870, 2021. URL https://aclanthology.org/2021.eacl-main.249.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, 2021.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. NAACL, 2018. URL https://aclanthology. org/N18-1202.
Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, and Samuel Bowman. Intermediate-task transfer learning with pretrained language models: When and why does it work? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5231–5247, 2020.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. URL https://d4mucfpksywv.cloudfront. net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67, 2020. URL https://arxiv.org/abs/1910.10683.
Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, 2018.
Abhilasha Ravichander, Matt Gardner, and Ana Marasović. Condaqa: A contrastive reading comprehension dataset for reasoning about negation. arXiv preprint arXiv:2211.00295, 2022. URL https://arxiv.org/ abs/2211.00295.
Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulian, Xavier Garcia, Jianmo Ni, Andrew Chen, Kathleen Kenealy, Jonathan H. Clark, Stephan Lee, Dan Garrette, James Lee-Thorp, Colin Raffel, Noam Shazeer, Marvin Ritter, Maarten Bosma, Alexandre Passos, Jeremy Maitin-Shepard, Noah Fiedel, Mark Omernick, Brennan Saeta, Ryan Sepassi, Alexander Spiridonov, Joshua Newlan, and Andrea Gesmundo. Scaling up models and data with t5x and seqio. arXiv preprint arXiv:2203.17189, 2022. URL https://arxiv.org/ abs/2203.17189.
Alexey Romanov and Chaitanya Shivade. Lessons from natural language inference in the clinical domain. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1586–1596, 2018. URL https://aclanthology.org/D18-1187.
Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. ICLR 2022, 2021. URL https://arxiv.org/abs/2110.08207.
Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3407–3412, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1339. URL https://aclanthology.org/D19-1339.
Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Nathaneal Scharli, Aakanksha Chowdhery, Philip Mansfield, Blaise Aguera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam, and Vivek Natarajan. Large language models encode clinical knowledge, 2022. URL https://arxiv.org/abs/2212.13138.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022. URL https://arxiv.org/abs/2206.04615.
Mirac Suzgun, Nathan Scales, Nathaneal Scharli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022. URL https://arxiv.org/abs/2210.09261.
Zeerak Talat, Aurélie Névéol, Stella Biderman, Miruna Clinciu, Manan Dey, Shayne Longpre, Alexandra Sasha Luccioni, Maraim Masoud, Margaret Mitchell, Dragomir Radev, et al. You reap what you sow: On the challenges of bias evaluation under multilingual settings. Challenges & Perspectives in Creating Large Language Models, page 26, 2022.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019.
Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, and Donald Metzler. Unifying language learning paradigms. arXiv preprint arXiv:2205.05131, 2022a. URL https://arxiv.org/abs/2205.05131.
Yi Tay, Jason Wei, Hyung Won Chung, David R. So, Siamak Shakeri, Xavier Garcia, Vinh Q. Tran, Huaixiu Steven Zheng, Jinfeng Rao, Denny Zhou, Donald Metzler, Neil Houlsby, Quoc V. Le, and Mostafa Dehghani. Transcending scaling laws with 0.1% extra compute. arXiv preprint, 2022b.
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022. URL https://arxiv.org/abs/2201.08239.
Tu Vu, Tong Wang, Tsendsuren Munkhdalai, Alessandro Sordoni, Adam Trischler, Andrew Mattarella-Micke, Subhransu Maji, and Mohit Iyyer. Exploring and predicting transferability across NLP tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7882–7926, 2020. URL https://aclanthology.org/2020.emnlp-main.635.
Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou’, and Daniel Cer. SPoT: Better frozen model adaptation through soft prompt transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 5039–5059, 2022. URL https://aclanthology.org/2022.acl-long.346.
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153–2162, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1221. URL https://aclanthology.org/D19-1221.
Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https: //github.com/kingoflolz/mesh-transformer-jax, May 2021.
Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, and Colin Raffel. What language model architecture and pretraining objective work best for zero-shot generalization? ICML, 2022a. URL https://arxiv.org/abs/2204.05832.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions, 2022b. URL https: //arxiv.org/abs/2212.10560.
Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv preprint arXiv:2204.07705, 2022c. URL https://arxiv.org/abs/2204.07705.
Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. ICLR 2022, 2021. URL https: //openreview.net/forum?id=gEZrGCozdqR.
Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga, Jinshi Huang, Charles Bai, et al. Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems, 4:795–813, 2022.
Zhiyang Xu, Ying Shen, and Lifu Huang. Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning, 2022. URL https://arxiv.org/abs/2212.10773.
Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren. Crossfit: A few-shot learning challenge for cross-task general-ization in NLP. In EMNLP, 2021. URL https://arxiv.org/abs/2104.08835.
Seonghyeon Ye, Doyoung Kim, Joel Jang, Joongbo Shin, and Minjoon Seo. Guess the instruction! making language models stronger zero-shot learners. arXiv preprint arXiv:2210.02969, 2022.
Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news. Advances in neural information processing systems, 32, 2019.
Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
Appendix
Table of Contents
A Experimental Details
A.1 Instruction Tuning
A.2 Single-Task Finetuning
A.3 Evaluation
B Input Inversion Details
A Experimental Details
A.1 Instruction Tuning
The Flan Collection experiments are assembled and run using T5X (Roberts et al., 2022). Our instruction tuning follows the same setup described in Chung et al. (2022). For few-shot and few-shot Chain-of-Thought prompts during finetuning, our templatizing procedure generates few-shot examples with 2, 3, or 5 exemplars. The experiments in this work use a slightly earlier version of the Flan 2022 collection than the one we are releasing, which has some minor improvements to the templates.
The mixture weights used to balance the various sources of data were informed by experiments in Section 3.5, along with the resulting practitioner intuition.
A.2 Single-Task Finetuning
For single-task finetuning, described in Section 4, our models are finetuned for 100,000 steps for all tasks. We use a constant learning rate of 0.001, a dropout probability of 0.1, and a batch size of 128 length-512 sequences. We save a checkpoint every 20 steps and report test performance on the model checkpoint corresponding to the highest validation performance. For tasks without a validation split, we hold out 1024 training examples for validation. For tasks without a test split, we hold out 1024 training examples for validation and report results on the original validation set. For PubmedQA, we do not use any of the unlabeled and artificially generated QA instances associated with the dataset. For CxC, we only consider the text-text portion of the dataset, following Vu et al. (2022). For tasks with fewer than 1K training examples, we report average results across 3 random seeds.
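A small sketch of this checkpoint-selection protocol is below; evaluate is a stand-in for the actual metric code, and the split logic assumes a simple list of training examples:

```python
def make_validation_split(train_examples, holdout=1024):
    """Hold out `holdout` examples for validation when no validation split exists."""
    return train_examples[holdout:], train_examples[:holdout]  # (train, validation)

def select_best_checkpoint(checkpoints, validation_set, test_set, evaluate):
    """Pick the checkpoint with the best validation score; report its test score."""
    best = max(checkpoints, key=lambda ckpt: evaluate(ckpt, validation_set))
    return best, evaluate(best, test_set)
```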
We also evaluate on certain metrics to account for label skew in some of the datasets, as shown in Table 3.
A.3 Evaluation
For Held-In evaluations we use the validation sets from 4 question answering (QA) tasks, BoolQ, ARC Easy, ARC Challenge, and AI2’s Middle School Science Exams, and 4 natural language inference (NLI) tasks, including ANLI R1, R2, R3, and RTE. These datasets are contained in the Flan 2022 finetuning collection and represent challenging benchmarks, often used to evaluate LLMs on QA and NLI. The Held-In score is the mean accuracy across these 8 tasks.
For the Chain-of-Thought (CoT) evaluation, we use the mean accuracy across 5 datasets which have been prepared with prompts that request step-by-step explanations in their target answers: GSM8K, StrategyQA, SVAMP, Asdiv, and CommonsenseQA.
For the Held-Out tasks, we use MMLU’s suite of 57 exams, and BBH’s suite of 23 tasks where PaLM performed worse than the average human annotators. MMLU tasks were removed from the Super-Natural Instructions part of the Flan 2022 collection at training, to ensure they were Held-Out.
B Input Inversion Details
For the input inversion experiments we note that Flan 2021, P3++, and Super-Natural Instructions already implicitly include tasks that have been inverted, e.g. question answering to question or context generation. Consequently, we choose to also create input inversions for the remaining datasets in the Flan 2022 collection, including for the Dialog, Program Synthesis, and Chain-of-Thought tasks.
As examples: for Dialog tasks, we write template instructions asking for the previous conversational history from the current dialog turn; for program synthesis we ask for the coding question which the code solves; and for Chain-of-Thought we include every permutation of the query-answer-explanation triple, where at least one of the three appears in the output. An illustration of the Chain-of-Thought input inversion permutations is shown in Figure 7.
These inversions are mixed in with the existing tasks at a rate of 30%, meaning for a Dialog task, 3 inverted examples will be generated for every 10 regular examples. We choose this rate for simplicity, approximately mirroring prior work, and leave the large space of exploration for future work.
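A minimal sketch of this mixing rate is shown below; invert_fn stands in for the task-specific inversion templates, and the 30% rate matches the description above:

```python
import random

def mix_with_inversions(examples, invert_fn, inversion_rate=0.3, seed=0):
    """Interleave inverted examples at roughly `inversion_rate` per regular example."""
    rng = random.Random(seed)
    mixed = []
    for example in examples:
        mixed.append(example)
        if rng.random() < inversion_rate:
            mixed.append(invert_fn(example))
    return mixed
```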