Jason Wei Najoung Kim Yi Tay Quoc V. Le

Google

Abstract

Scaling up language models has been empirically shown to improve performance and unlock emergent abilities. Conversely, observing worse performance as a function of scale (“inverse scaling”) would indicate that scaling encourages behaviors that are misaligned with human preferences. The Inverse Scaling Prize (McKenzie et al., 2022a) identiﬁed eleven such inverse scaling tasks, evaluated on models of up to 280B parameters and up to 500 zettaFLOPs of training compute.

This paper takes a closer look at these inverse scaling tasks. We evaluate models of up to 540B parameters, trained on ﬁve times more compute than those evaluated in the Inverse Scaling Prize. With this increased range of model sizes and training compute, only four out of the eleven tasks remain inverse scaling. Six out of the eleven tasks exhibit what we call “U-shaped scaling”—performance decreases up to a certain model size, and then increases again up to the largest model evaluated (the one remaining task displays positive scaling). U-shaped scaling suggests that the inverse scaling trend observed in McKenzie et al. (2022b,c) may not continue to hold for larger models, and adds further support to the claim that suﬃciently large models unlock emergent abilities.

Figure 1: Across ten tasks from the Inverse Scaling Prize (McKenzie et al., 2022a), PaLM on average exhibits *U-shaped scaling*, which means that performance first decreases and then increases again. Scaling can be viewed through the axis of either compute (zettaFLOPs for pre-training: left) or model size (# of parameters: right). All results use the exact prompts and evaluation format specified by McKenzie et al. (2022a). The y-axis denotes the average accuracy of the ten tasks that use accuracy as the metric, excluding Prompt Injection, which uses loss as the metric.

1 Introduction

Scaling up language models has been shown to improve model performance and unlock emergent abilities in a range of settings (Kaplan et al., 2020; Brown et al., 2020; Srivastava et al., 2022; Wei et al., 2022a, inter alia). However, are there any tasks for which model behavior gets worse as model scale increases? Such tasks have been referred to as inverse scaling tasks, and they could indicate that the models’ training data or optimization objectives are ﬂawed in some way (McKenzie et al., 2022a).

The Inverse Scaling Prize was created to identify these inverse scaling tasks for which larger language models show increasingly undesirable behavior, with winning submissions potentially receiving monetary awards from a $250k prize pool (McKenzie et al., 2022a). Submissions were scored based on a range of criteria including inverse scaling strength, task importance, novelty/surprisingness, task coverage, reproducibility, and inverse scaling generality across diﬀerent models.

The Inverse Scaling Prize received over eighty unique submissions, with eleven tasks awarded Third Prizes, the datasets for which have been publicly released (McKenzie et al., 2022a). Inverse scaling curves for the eleven tasks were shown on a range of language models with scales spanning several orders of magnitude in parameters, including Gopher (42M–280B; Rae et al., 2021), Chinchilla (400M–70B; Hoﬀmann et al., 2022), and an Anthropic internal model (13M–52B). The eleven tasks are shown in Figure 3.

Table 1: Scale of the largest model in each model family in the Inverse Scaling Prize compared to this paper.

In this paper, we take a closer look at the scaling behaviors for these eleven tasks. First, we evaluate PaLM models of up to 540B parameters (Chowdhery et al., 2022), trained on about five times more compute than the models evaluated in the Inverse Scaling Prize submissions (see Table 1). Under this setup, we find that six out of the eleven tasks exhibit what we call U-shaped scaling: performance first decreases up to a certain model scale, and then increases again for larger models. With one task demonstrating positive scaling (monotonically increasing performance) with PaLM, this brings the number of inverse scaling tasks down to four in the context of the additional scale provided in our experiments. This ﬁnding of U-shaped scaling is consistent with prior observations of U-shaped scaling on BIG-Bench tasks such as TruthfulQA (Lin et al., 2022), Persian Idioms, and Identify Math Theorems (Srivastava et al., 2022, see Appendix Figure 7). The implication of U-shaped scaling is that inverse scaling curves may not extrapolate to larger scales, since performance could either keep decreasing (true inverse scaling), or start increasing (U-shaped scaling).

We do not experimentally investigate how or why U-shaped scaling occurs, but we hypothesize that it can happen when a task contains a “distractor task”. Medium-sized models can perform the distractor task better than smaller models, which hurts performance in comparison to the smaller models. As the models scale further, the larger models can ignore the distractor task and perform the true task, which can be seen as an emergent ability that derives from scaling (Ganguli et al., 2022; Wei et al., 2022a).

The second part of this paper explores whether chain-of-thought (CoT) prompting (Wei et al., 2022b) changes the scaling trends on four tasks from the first round of the Inverse Scaling Prize. CoT prompting is a form of prompt engineering that encourages the model to decompose the task into intermediate steps. Our experiments show that with CoT prompting, none of the four tasks are inverse scaling. With CoT prompting, large models even achieve 100% accuracy on two tasks and seven out of eight sub-tasks in Redefine Math. These results suggest that even when a given task inverse scales with basic prompting, chain-of-thought can be used to mitigate unwanted scaling behavior.

Finally, we ﬁnd that providing 1-shot demonstrations changes three tasks from inverse scaling to U-shaped, resulting in ten out of eleven tasks exhibiting non-inverse scaling. Thus, minimal demonstrations also seem to be an eﬀective mitigation strategy for inverse scaling.

Figure 2: Scaling curves for the eleven Inverse Scaling Prize tasks. Prompt Injection (**Injection**) uses loss as the evaluation metric and is not included in the average. The only model that has been added in this paper is PaLM (Chowdhery et al., 2022). Results from other models are taken from McKenzie et al. (2022b,c).

Overall, the Inverse Scaling Prize has identified intriguing evaluation tasks for studying language model behavior with respect to scaling and prompting. We also note that the existence of U-shaped scaling does not mean that these tasks are solved, since several tasks show U-shaped scaling but the best performance still remains worse than chance. Future work could investigate how to achieve beyond-random performance on all inverse scaling tasks. Additionally, the four tasks that remain inverse scaling under the zero-shot setup would merit further scrutiny, even though CoT or few-shot demonstrations can turn them into U-shaped or positive scaling, considering that zero-shot is likely one of the most common scenarios in downstream user interactions.

2 U-shaped scaling

Figure 3: Prompts for the eleven inverse scaling tasks from McKenzie et al. (2022a). […] marks where few-shot exemplars are placed. Few-shot exemplars are relevant in the following scenarios: (1) when they are part of the original task (e.g., Hindsight Neglect), and (2) in our 1-shot experiments discussed in Section 5.

Setup. In this section, we evaluate PaLM models (Chowdhery et al., 2022) on all Inverse Scaling Prize tasks. We use the 8B, 62B, and 540B PaLM models presented in the original paper and also include a 1B model trained on 40B tokens, which is 0.2 zettaFLOPs of compute. PaLM-540B has about twice as many parameters as the largest model evaluated in the Inverse Scaling Prize (Gopher-280B), and used about five times as much compute, 2.5K zettaFLOPs, versus Chinchilla-70B, which used 560 zettaFLOPs. We follow the exact experimental setup from the Inverse Scaling Prize (McKenzie et al., 2022a), with the same prompts and scoring classification protocol, where all answer choices are scored and the option with the highest probability is chosen as the prediction.
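The scoring classification protocol described in the setup can be sketched roughly as follows. This is a minimal illustration, not the Inverse Scaling Prize evaluation code: `score_fn` stands in for a model call that returns the log-probability of an answer choice given the prompt, and `toy_score`, the example dictionary keys, and the function names are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def predict_choice(
    prompt: str,
    choices: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> int:
    """Score every answer choice and return the index of the best one."""
    scores = [score_fn(prompt, choice) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)


def accuracy(examples: List[Dict], score_fn: Callable[[str, str], float]) -> float:
    """Fraction of examples whose top-scoring choice is the labeled answer."""
    correct = sum(
        predict_choice(ex["prompt"], ex["choices"], score_fn) == ex["answer_index"]
        for ex in examples
    )
    return correct / len(examples)


def toy_score(prompt: str, choice: str) -> float:
    """Hypothetical stand-in for a model: shorter choices score higher."""
    return -float(len(choice))
```

In a real evaluation, `score_fn` would be replaced by the summed token log-probabilities that the language model assigns to each choice.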

Results. The results for PaLM on all eleven tasks are shown in Figure 2, with the average performance of PaLM highlighted in Figure 1 on the first page. We also plot the results for Anthropic, Gopher, and Chinchilla, as given in McKenzie et al. (2022b,c). In summary, only four out of eleven tasks remain inverse scaling once the PaLM 540B model is included. Six out of eleven tasks change from inverse scaling to U-shaped, and one task (Repetitive Algebra) shows positive scaling with PaLM. This broad emergence of U-shaped scaling demonstrates the difficulty of extrapolating inverse scaling curves to larger models.
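The three curve shapes discussed here (inverse, U-shaped, and positive scaling) can be made concrete with a small helper. This is only a sketch under stated assumptions: scores are ordered from smallest to largest model, higher is better (so a loss metric such as the one used for Prompt Injection would need to be negated), and the function name and the strict monotonicity rule are our own simplification rather than a definition from the paper. The numbers in the test values are made up for illustration and are not results from this paper.

```python
from typing import List


def classify_scaling(scores: List[float]) -> str:
    """Label a scaling curve given scores ordered by increasing model scale.

    Assumes higher scores are better. Returns "positive", "inverse",
    "U-shaped", or "other" for curves that fit none of the three shapes.
    """
    if len(scores) < 2:
        raise ValueError("need at least two model scales")
    diffs = [later - earlier for earlier, later in zip(scores, scores[1:])]
    if all(d >= 0 for d in diffs):
        return "positive"
    if all(d <= 0 for d in diffs):
        return "inverse"
    lowest = scores.index(min(scores))
    falls_first = all(d <= 0 for d in diffs[:lowest])
    rises_after = all(d >= 0 for d in diffs[lowest:])
    if 0 < lowest < len(scores) - 1 and falls_first and rises_after:
        return "U-shaped"
    return "other"
```

Note that a curve labeled "inverse" by this helper could still turn out to be U-shaped once a larger model is appended to the list.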

Potential explanation. A natural question for the U-shaped scaling results is, why does performance decrease and then increase again? One speculative hypothesis is the following. Each Inverse Scaling Prize task can be decomposed into two tasks: (1) the “true task” and (2) a “distractor task” where performing the distractor task well hurts performance on the true task. Small models cannot perform either task, and perform at around chance. Medium-sized models can perform the distractor task, which results in worse performance compared to smaller models. Large models are able to ignore the distractor task and perform the true task, which then leads back to increased performance and potentially solving the task. We describe potential distractor tasks for some of the inverse scaling tasks in Appendix B, Figure 6. Note that while it could be possible to measure model performance on the distractor task only, this would be an imperfect ablation since the distractor task and true task could have not only a competing but also a joint effect on performance. We leave further explanation of why U-shaped scaling occurs as future work.

Limitations. The broad emergence of U-shaped scaling across these tasks does not mean that the tasks from the Inverse Scaling Prize are solved. This is because for some tasks, even when PaLM 540B increases performance compared to PaLM 62B, the absolute performance is still at or worse than chance (e.g., Negation QA). Hence, there is an opportunity for further research to improve performance beyond the random baseline on these tasks. Furthermore, the tasks that remain inverse scaling after including larger models would merit further scrutiny (Pattern Matching Suppression, Into the Unknown, Redefine Math, Prompt Injection).

3 The eﬀect of chain-of-thought prompting on inverse and U-shaped scaling

We next explore how scaling behavior changes when using a diﬀerent type of prompting. Most of the Inverse Scaling Prize tasks use the prompting strategy that involves an instruction describing the task and some structured formatting (e.g., including “Q:” and “A:” as part of the prompt). Recent work has shown that chain-of-thought (CoT) prompting, which encourages the model to output intermediate steps before giving the ﬁnal answer, can improve performance by a large margin for multi-step reasoning tasks (Wei et al., 2022b; Kojima et al., 2022; Suzgun et al., 2022, inter alia). In other words, prompting without CoT is only a lower-bound of the capabilities of the model—we explore here whether CoT could potentially serve as a viable mitigation strategy for inverse scaling, using four tasks from Round 1 of the Inverse Scaling Prize.

For the experiments in this section only, we changed the prompt templates to follow the protocol of Wei et al. (2022b) and follow-up work. Because CoT prompting requires generating intermediate steps, we use free-form generation followed by exact string match, which requires minor prompt modifications to accommodate parsing the final answer. Specifically, all prompts are at least one-shot, answer options were provided in the input prompt, and the model was prompted to output “the answer is”. As an ablation, we also run no-CoT experiments in this new format, using the same prompts but without CoT. We plot the CoT and no-CoT results in this template alongside the results using the unmodified (original) prompts from McKenzie et al. (2022a), which do not use CoT.
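The parsing step implied by this protocol (free-form generation, then exact string match on whatever follows “the answer is”) might look roughly like the sketch below. The normalization choices here (taking the last occurrence of the marker, keeping only the rest of that line, and dropping a trailing period) are our assumptions for illustration; the exact rules used for the reported results are not specified in the text.

```python
import re
from typing import Optional

# Marker phrase the model is prompted to produce before its final answer.
_MARKER = re.compile(r"the answer is", re.IGNORECASE)


def extract_answer(generation: str) -> Optional[str]:
    """Return the text after the last "the answer is", or None if absent."""
    matches = list(_MARKER.finditer(generation))
    if not matches:
        return None
    tail = generation[matches[-1].end():].strip()
    if not tail:
        return ""
    # Keep only the remainder of the marker's line, minus a trailing period.
    return tail.splitlines()[0].strip().rstrip(".").strip()


def exact_match(generation: str, target: str) -> bool:
    """Exact string match between the parsed answer and the target."""
    predicted = extract_answer(generation)
    return predicted is not None and predicted == target.strip()
```

A generation that never produces the marker phrase is counted as incorrect under this sketch, which is one reason the prompts include at least one exemplar demonstrating the expected format.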

CoT prompting results for Negation QA, Hindsight Neglect, and Quote Repetition are shown in Figure 4, with the examples of CoT prompts also shown in the same ﬁgure. For Hindsight Neglect and Negation QA, CoT prompting changes the scaling curves to positive (monotonically increasing). For Quote Repetition, CoT prompting still has a U-shaped curve, though performance is noticeably better for 8B/62B models and also achieves a perfect solve rate at 540B. Note that for Negation QA, changing the prompt format alone without CoT changes the scaling curve from slightly inverse to U-shaped, potentially because adding exemplars helped the model learn to solve the task (also see Section 5).

CoT prompting results for the last task, Redefine Math, are shown in Figure 5. Since this task consists of eight sub-tasks, each with a different instruction, we also stratify performance by sub-task to explore whether the same scaling behavior holds across sub-tasks. In summary, CoT prompting exhibits positive scaling for all sub-tasks, achieving 100% solve rates at 62B and 540B models for seven out of eight sub-tasks. Illustrative examples of this trend are the “+ as digit” and “+ as random number” sub-tasks that show strong inverse scaling curves even up to PaLM-540B, but achieve perfect accuracy using CoT for all models. The one sub-task that is still not solved by CoT prompting requires the model to execute the modulo operation across multi-digit numbers (e.g., 876 mod 23) and does not achieve above-random performance, even with the 540B model.

Figure 4: Chain-of-thought (CoT) prompting changes Negation QA and Hindsight Neglect to positive scaling. Quote Repetition is U-shaped, even with CoT. CoT prompts are shown above. No-CoT prompts differ slightly from the No-CoT (unmodified) prompts used in McKenzie et al. (2022c) in that they are at least one-shot and say “the answer is”. For Quote Repetition, the unmodified No-CoT prompt was already in a suitable format.

In summary, all tasks and sub-tasks studied in this section exhibit either U-shaped scaling or positive scaling when using CoT prompting. This does not mean that the no-CoT prompting result is invalid; rather, it adds additional nuance by underscoring how the scaling curve for a task diﬀers based on what type of prompting is used. In other words, the same task can show an inverse scaling curve for one type of prompting and U-shaped scaling or positive scaling for another type of prompting.

4 Conclusions

This paper has two simple takeaways. First, inverse scaling can turn into U-shaped scaling when evaluated on models of sufficiently large scale, as demonstrated on six out of eleven Inverse Scaling Prize tasks. The prevalence of U-shaped scaling we identified in this paper shows that inverse scaling curves do not necessarily extrapolate to larger models. Second, when CoT prompting is applied, all four tasks from Round 1 of the Inverse Scaling Prize could be changed to exhibit either positive or U-shaped scaling. This means that the same task can have a different scaling curve based on what type of prompting is used. Together, the implication is that a combination of scaling and prompting techniques appears to be a reliable method for improving model performance. Overall, the inverse scaling tasks serve as a useful analytical tool for large language models with respect to scaling behavior and sensitivity to prompting.

Figure 5: The Redefine Math task becomes positive scaling when used with chain-of-thought (CoT) prompting. Additionally, every sub-task except one achieves 100% accuracy for 62B and 540B models when using CoT.

5 Addendum: Providing 1-shot demonstrations makes three more tasks U-shaped

The above results, where seven out of eleven tasks are U-shaped or increasing, are obtained with zero-shot prompts. To gauge the effect of demonstrations, we also ran experiments using prompts that include one-shot demonstrations, using data provided by the Inverse Scaling Prize organizers, and found that one-shot changes three more tasks from inverse scaling to U-shaped. With one-shot prompts, ten out of eleven tasks are non-inverse scaling (U-shaped or positive). Again we do not explore why in this paper, but we hypothesize that an exemplar may allow the model to learn input–output relationships that are helpful for avoiding the distractor task.
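As a rough illustration of what moving from a zero-shot to a one-shot prompt involves, the sketch below places exemplars between the instruction and the test question. The “Q:”/“A:” layout follows the structured formatting mentioned in Section 3, but the function name, the blank-line separators, and the sample strings are hypothetical and do not reproduce the actual task templates.

```python
from typing import Iterable, Tuple


def build_prompt(
    instruction: str,
    question: str,
    exemplars: Iterable[Tuple[str, str]] = (),
) -> str:
    """Assemble a prompt: instruction, optional exemplars, then the question.

    With no exemplars this yields a zero-shot prompt; with one
    (question, answer) pair it yields a one-shot prompt.
    """
    blocks = [instruction]
    blocks += [f"Q: {q}\nA: {a}" for q, a in exemplars]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Hypothetical usage: the same question in zero-shot and one-shot form.
zero_shot = build_prompt("Answer the question.", "What is 2 + 2?")
one_shot = build_prompt(
    "Answer the question.",
    "What is 2 + 2?",
    exemplars=[("What is 1 + 1?", "2")],
)
```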

Acknowledgements

Thanks Ethan Perez and Ian McKenzie for their help with sharing the Round 2 data in the fourth version of the report. Thanks Ethan Perez, Ian McKenzie, and Najoung Kim for help with the third version of the report. Thanks Ethan Perez for feedback that we incorporated into the second arXiv version of the report. Thanks Denny Zhou, Ed Chi, and Le Hou for feedback on the initial report. Finally, we really appreciate the spirit and organization of the Inverse Scaling Prize organizers—thank you!

References

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 2020. URL https://arxiv.org/abs/2005.14165.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, et al. PaLM: Scaling language modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022. URL https://arxiv.org/abs/2204.02311.

Deep Ganguli, Danny Hernandez, Liane Lovitt, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova Dassarma, Dawn Drain, Nelson Elhage, et al. Predictability and surprise in large generative models. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1747–1764, 2022. URL https://dl.acm.org/doi/abs/10.1145/3531146.3533229.

Jordan Hoﬀmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, et al. Training compute-optimal large language models. NeurIPS, 2022. URL https://arxiv.org/abs/2203.15556.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeﬀrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. URL https://arxiv.org/abs/2001.08361.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. NeurIPS, 2022. URL https://arxiv.org/abs/2205.11916.

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.229. URL https://aclanthology.org/2022.acl-long.229.

Ian McKenzie, Alexander Lyzhov, Alicia Parrish, Ameya Prabhu, Aaron Mueller, Najoung Kim, Sam Bowman, and Ethan Perez. The inverse scaling prize, 2022a. URL https://github.com/inverse-scaling/prize.

Ian McKenzie, Alexander Lyzhov, Alicia Parrish, Ameya Prabhu, Aaron Mueller, Najoung Kim, Sam Bowman, and Ethan Perez. Announcing the inverse scaling prize ($250k prize pool). Lesswrong, 2022b. URL https://www.lesswrong.com/posts/eqxqgFxymP8hXDTt5/announcing-the-inverse-scaling-prize-usd250k-prize-pool.

Ian McKenzie, Alexander Lyzhov, Alicia Parrish, Ameya Prabhu, Aaron Mueller, Najoung Kim, Sam Bowman, and Ethan Perez. Inverse scaling prize: Second round winners. Lesswrong, 2022c. URL https://www.lesswrong.com/posts/DARiTSTx5xDLQGrrz/inverse-scaling-prize-second-round-winners.

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoﬀmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, et al. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446, 2021. URL https://arxiv.org/abs/2112.11446.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022. URL https://arxiv.org/abs/2206.04615.

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022. URL https://arxiv.org/abs/2210.09261.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raﬀel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. TMLR, 2022a. URL https://openreview.net/forum?id=yzkSU5zdwD.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 2022b. URL https://arxiv.org/abs/2201.11903.

Appendix

A Full results

Table 2: Exact results for all Inverse Scaling Prize tasks used in this paper (eleven tasks including both Round 1 and 2). We use the exact templates and protocol from McKenzie et al. (2022a) for the zero-shot results.

A.1 Full results for CoT experiments

The full results used in the CoT experiments in this paper are shown below.

Table 3: Exact results for the CoT experiments. Results for models other than PaLM are from McKenzie et al. (2022c). The unmodiﬁed prompts use the exact templates and protocol from McKenzie et al. (2022a). The modiﬁed prompts follow the templates and protocol of Wei et al. (2022b), which uses at least one few-shot exemplar, includes the answer options as part of the input, and uses the phrase “the answer is” to facilitate parsing the ﬁnal answer after CoT.

B Distractor tasks

We show a speculative decomposition of tasks into the true task and a distractor task in Figure 6.

Figure 6: A possible hypothesis for why U-shaped scaling emerges. U-shaped scaling tasks consist of a true task and a distractor task. Medium-sized models are good enough to perform the distractor tasks, which hurts performance compared to smaller models that cannot perform the distractor task nor the true task. Larger models can ignore the distractor task and perform the true task, which leads to increased performance again.

C Prior examples of U-shaped scaling

Figure 7: Three examples of U-shaped scaling behavior from BIG-Bench (Srivastava et al., 2022). a: identify math theorems. b: persian idioms. c: truthful_qa. The above are screenshots from https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/.

D Model scale: parameters, data, and compute

As shown in Table 4, we computed training FLOPs following the protocol of Brown et al. (2020).

Table 4: Computation of training FLOPs for GPT-3, Anthropic, Gopher, Chinchilla, and PaLM.
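For readers who want to sanity-check the compute figures, the sketch below uses the common approximation of roughly 6 FLOPs per parameter per training token, which is how the protocol of Brown et al. (2020) is usually summarized. Treat this as an assumption: the exact accounting behind Table 4 may differ, so small discrepancies with the reported numbers are expected. The 40B-token figure for the 1B model comes from Section 2; the 780B-token figure for PaLM 540B is taken from Chowdhery et al. (2022) and is not stated in this paper.

```python
def training_zettaflops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in zettaFLOPs (1e21 FLOPs).

    Assumes about 6 FLOPs per parameter per token (forward plus backward
    pass), which ignores attention-specific and embedding costs.
    """
    return 6.0 * n_params * n_tokens / 1e21


# 1B-parameter model trained on 40B tokens (Section 2 reports ~0.2).
small_model = training_zettaflops(1e9, 40e9)

# PaLM 540B, assuming 780B training tokens (Section 2 reports ~2.5K).
palm_540b = training_zettaflops(540e9, 780e9)
```

Under these assumptions the two estimates come out to about 0.24 and about 2,527 zettaFLOPs, in line with the rounded values quoted in the setup.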

D.1 Correction

In the second version of the arXiv paper, it was reported that only two of the four first-round tasks were U-shaped. However, three of them were actually U-shaped. This error occurred because I (Jason) accidentally swapped the PaLM 62B numbers for Hindsight and NeQA. I realized the error when I reproduced those tasks for the third arXiv version.

Inverse scaling can become U-shaped
