Jason Wei Najoung Kim Yi Tay Quoc V. Le
Scaling up language models has been empirically shown to improve performance and unlock emergent abilities. Conversely, observing worse performance as a function of scale (“inverse scaling”) would indicate that scaling encourages behaviors that are misaligned with human preferences. The Inverse Scaling Prize (McKenzie et al., 2022a) identiﬁed eleven such inverse scaling tasks, evaluated on models of up to 280B parameters and up to 500 zettaFLOPs of training compute.
This paper takes a closer look at these inverse scaling tasks. We evaluate models of up to 540B parameters, trained on ﬁve times more compute than those evaluated in the Inverse Scaling Prize. With this increased range of model sizes and training compute, only four out of the eleven tasks remain inverse scaling. Six out of the eleven tasks exhibit what we call “U-shaped scaling”—performance decreases up to a certain model size, and then increases again up to the largest model evaluated (the one remaining task displays positive scaling). U-shaped scaling suggests that the inverse scaling trend observed in McKenzie et al. (2022b,c) may not continue to hold for larger models, and adds further support to the claim that suﬃciently large models unlock emergent abilities.
Scaling up language models has been shown to improve model performance and unlock emergent abilities in a range of settings (Kaplan et al., 2020; Brown et al., 2020; Srivastava et al., 2022; Wei et al., 2022a, inter alia). However, are there any tasks for which model behavior gets worse as model scale increases? Such tasks have been referred to as inverse scaling tasks, and they could indicate that the models’ training data or optimization objectives are ﬂawed in some way (McKenzie et al., 2022a).
The Inverse Scaling Prize was created to identify these inverse scaling tasks for which larger language models show increasingly undesirable behavior, with winning submissions potentially receiving monetary awards from a $250k prize pool (McKenzie et al., 2022a). Submissions were scored based on a range of criteria including inverse scaling strength, task importance, novelty/surprisingness, task coverage, reproducibility, and inverse scaling generality across diﬀerent models.
The Inverse Scaling Prize received over eighty unique submissions, with eleven tasks awarded Third Prizes, the datasets for which have been publicly released (McKenzie et al., 2022a). Inverse scaling curves for the eleven tasks were shown on a range of language models with scales spanning several orders of magnitude in parameters, including Gopher (42M–280B; Rae et al., 2021), Chinchilla (400M–70B; Hoﬀmann et al., 2022), and an Anthropic internal model (13M–52B). The eleven tasks are shown in Figure 3.
In this paper, we take a closer look at the scaling behaviors for these eleven tasks. First, we evaluate PaLM models of up to 540B parameters (Chowdhery et al., 2022), trained on about five times more compute than the models evaluated in the Inverse Scaling Prize submissions (see Table 1). Under this setup, we find that six out of the eleven tasks exhibit what we call U-shaped scaling: performance first decreases up to a certain model scale, and then increases again for larger models. With one task demonstrating positive scaling (monotonically increasing performance) with PaLM, this brings the number of inverse scaling tasks down to four in the context of the additional scale provided in our experiments. This ﬁnding of U-shaped scaling is consistent with prior observations of U-shaped scaling on BIG-Bench tasks such as TruthfulQA (Lin et al., 2022), Persian Idioms, and Identify Math Theorems (Srivastava et al., 2022, see Appendix Figure 7). The implication of U-shaped scaling is that inverse scaling curves may not extrapolate to larger scales, since performance could either keep decreasing (true inverse scaling), or start increasing (U-shaped scaling).
We do not experimentally investigate how or why U-shaped scaling occurs, but we hypothesize that it can happen when a task contains a “distractor task”. Medium-sized models can perform the distractor task better than smaller models, which hurts performance in comparison to the smaller models. As the models scale further, the larger models can ignore the distractor task and perform the true task, which can be seen as an emergent ability that derives from scaling (Ganguli et al., 2022; Wei et al., 2022a).
The second part of this paper explores whether chain-of-thought (CoT) prompting (Wei et al., 2022b) changes the scaling trends on four tasks from the first round of the Inverse Scaling Prize. CoT prompting is a form of prompt engineering that encourages the model to decompose the task into intermediate steps. Our experiments show that with CoT prompting, none of the four tasks are inverse scaling. With CoT prompting, large models even achieve 100% accuracy on two tasks and seven out of eight sub-tasks in Redefine Math. These results suggest that even when a given task inverse scales with basic prompting, chain-of-thought can be used to mitigate unwanted scaling behavior.
Finally, we ﬁnd that providing 1-shot demonstrations changes three tasks from inverse scaling to U-shaped, resulting in ten out of eleven tasks exhibiting non-inverse scaling. Thus, minimal demonstrations also seem to be an eﬀective mitigation strategy for inverse scaling.
Overall, the Inverse Scaling Prize has identified intriguing evaluation tasks for studying language model behavior with respect to scaling and prompting. We also note that the existence of U-shaped scaling does not mean that these tasks are solved, since several tasks show U-shaped scaling but the best performance still remains worse than chance. Future work could investigate how to achieve beyond-random performance on all inverse scaling tasks. Additionally, the four tasks that remain inverse scaling under the zero-shot setup merit further scrutiny, even though CoT or few-shot demonstrations can turn them into U-shaped or positive scaling, because zero-shot prompting is one of the most common scenarios in downstream user interactions.
2 U-shaped scaling
Setup. In this section, we evaluate PaLM models (Chowdhery et al., 2022) on all Inverse Scaling Prize tasks. We use 8B, 62B, and 540B PaLM models presented in the original paper and also include a 1B model trained on 40B tokens, which is 0.2 zettaFLOPs of compute. PaLM-540B has about twice as many parameters as the largest model evaluated in the Inverse Scaling Prize (Gopher-280B), and used about five times as much compute, 2.5K zettaFLOPs, versus Chinchilla-70B, which used 560 zettaFLOPs. We follow the exact experimental setup from the Inverse Scaling Prize (McKenzie et al., 2022a), with the same prompts and scoring classification protocol, where all answer choices are scored and the option with the highest probability is chosen as the prediction.
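The scoring classification protocol described above can be sketched as follows. This is a minimal illustration, not the evaluation code used for the experiments: the function names and the log-probability values are hypothetical, and in practice the scores would come from the language model being evaluated.

```python
from typing import Dict, List, Tuple


def predict(option_scores: Dict[str, float]) -> str:
    """Return the answer option with the highest score.

    `option_scores` maps each answer choice to the (log-)probability the
    model assigns to it. The values are assumed to be supplied by the
    caller; no model is queried here.
    """
    return max(option_scores, key=option_scores.get)


def accuracy(examples: List[Tuple[Dict[str, float], str]]) -> float:
    """Fraction of examples whose highest-scoring option is the gold answer."""
    correct = sum(predict(scores) == gold for scores, gold in examples)
    return correct / len(examples)


# Hypothetical log-probabilities for two two-choice examples.
examples = [
    ({" Yes": -0.4, " No": -1.1}, " Yes"),  # prediction matches gold
    ({" Yes": -0.2, " No": -1.7}, " No"),   # prediction differs from gold
]
print(accuracy(examples))
```

Because every answer choice is scored, the model never has to generate free-form text in this setup; the prediction is fully determined by the ranking of the options.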
Results. The results for PaLM on all eleven tasks are shown in Figure 2, with the average performance of PaLM highlighted in Figure 1 on the first page. We also plot the results for Anthropic, Gopher, and Chinchilla, as given in McKenzie et al. (2022b,c). In summary, only four out of eleven tasks remain inverse scaling once the PaLM 540B model is included. Six out of eleven tasks change from inverse scaling to U-shaped, and one task (Repetitive Algebra) shows positive scaling with PaLM. This broad emergence of U-shaped scaling demonstrates the difficulty of extrapolating inverse scaling curves to larger models.
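To make the three curve shapes concrete, the sketch below labels a sequence of accuracies ordered from the smallest to the largest model. The paper does not state a formal criterion, so the rule here is an assumption for illustration only: monotonically non-decreasing is "positive", monotonically non-increasing is "inverse", and a single fall followed by a rise is "U-shaped". The accuracy values in the example are made up and are not results from this paper.

```python
from typing import Sequence


def classify_scaling(accuracies: Sequence[float]) -> str:
    """Label a scaling curve given accuracies ordered by increasing model size.

    Illustrative rule (an assumption, not the paper's definition):
      - "positive": performance never decreases as scale grows
      - "inverse":  performance never increases as scale grows
      - "U-shaped": performance falls to a minimum, then rises afterwards
      - "other":    anything else (for example, several rises and falls)
    A completely flat curve is labeled "positive" by this rule.
    """
    if len(accuracies) < 2:
        raise ValueError("need at least two model sizes")
    diffs = [b - a for a, b in zip(accuracies, accuracies[1:])]
    if all(d >= 0 for d in diffs):
        return "positive"
    if all(d <= 0 for d in diffs):
        return "inverse"
    lowest = list(accuracies).index(min(accuracies))
    falls_then_rises = (
        all(d <= 0 for d in diffs[:lowest])
        and all(d >= 0 for d in diffs[lowest:])
    )
    return "U-shaped" if falls_then_rises else "other"


# Hypothetical accuracies for four model sizes (smallest to largest).
print(classify_scaling([0.50, 0.42, 0.35, 0.61]))
```

Note that the label depends on the largest model included: dropping the last value from the hypothetical curve above turns "U-shaped" into "inverse", which is the extrapolation problem described in this section.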
Potential explanation. A natural question for the U-shaped scaling results is, why does performance decrease and then increase again? One speculative hypothesis is the following. Each Inverse Scaling Prize task can be decomposed into two tasks: (1) the “true task” and (2) a “distractor task” where performing the distractor task well hurts performance on the true task. Small models cannot perform either task, and perform at around chance. Medium-sized models can perform the distractor task, which results in worse performance compared to smaller models. Large models are able to ignore the distractor task and perform the true task, which then leads back to increased performance and potentially solving the task. We describe potential distractor tasks for some of the inverse scaling tasks in Appendix B, Figure 6. Note that while it could be possible to measure model performance on the distractor task only, this would be an imperfect ablation since the distractor task and true task could not only have a competing but also a joint effect on performance. We leave further explanation of why U-shaped scaling occurs as future work.
Limitations. The broad emergence of U-shaped scaling across these tasks does not mean that the tasks from the Inverse Scaling Prize are solved. This is because for some tasks, even when PaLM 540B increases performance compared to PaLM 62B, the absolute performance is still at or worse than chance (e.g., Negation QA). Hence, there is an opportunity for further research to improve performance beyond the random baseline on these tasks. Furthermore, the tasks that remain inverse scaling after including larger models would merit further scrutiny (Pattern Matching Suppression, Into the Unknown, Redefine Math, Prompt Injection).
3 The eﬀect of chain-of-thought prompting on inverse and U-shaped scaling
We next explore how scaling behavior changes when using a diﬀerent type of prompting. Most of the Inverse Scaling Prize tasks use the prompting strategy that involves an instruction describing the task and some structured formatting (e.g., including “Q:” and “A:” as part of the prompt). Recent work has shown that chain-of-thought (CoT) prompting, which encourages the model to output intermediate steps before giving the ﬁnal answer, can improve performance by a large margin for multi-step reasoning tasks (Wei et al., 2022b; Kojima et al., 2022; Suzgun et al., 2022, inter alia). In other words, prompting without CoT is only a lower-bound of the capabilities of the model—we explore here whether CoT could potentially serve as a viable mitigation strategy for inverse scaling, using four tasks from Round 1 of the Inverse Scaling Prize.
For the experiments in this section only, we changed the prompt templates to follow the protocol of Wei et al. (2022b) and follow-up work. Because CoT prompting requires generating intermediate steps, we use free-form generation followed by exact string match, which requires minor prompt modifications to accommodate parsing the final answer. Specifically, all prompts are at least one shot, answer options were provided in the input prompt, and the model was prompted to output “the answer is”. As an ablation, we also use this new format for experiments on no-CoT as well, which are the same prompts but without CoT. We plot the CoT and no-CoT results in this template alongside the results using the unmodified (original) prompts from McKenzie et al. (2022a), which do not use CoT.
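The answer-parsing step can be sketched as below. This is a minimal illustration under stated assumptions, not the parsing code used in the experiments: the function names are hypothetical, the text after the last occurrence of "the answer is" on the same line is taken as the final answer, and a trailing period is stripped before the exact string match.

```python
from typing import Optional


def extract_answer(generation: str, marker: str = "the answer is") -> Optional[str]:
    """Return the text following the last occurrence of `marker`, or None.

    Assumption: the final answer is whatever follows the marker on the
    same line, with surrounding whitespace and a trailing period removed.
    """
    idx = generation.lower().rfind(marker)
    if idx == -1:
        return None  # the model never produced the answer phrase
    tail = generation[idx + len(marker):].strip()
    if not tail:
        return None
    return tail.splitlines()[0].rstrip(".").strip()


def exact_match(generation: str, gold: str) -> bool:
    """Score a free-form generation by exact string match on the parsed answer."""
    return extract_answer(generation) == gold


# Hypothetical chain-of-thought generation.
generation = "Here + is redefined as the first digit. So the answer is (B)."
print(extract_answer(generation))  # (B)
```

Generations that never contain the answer phrase are scored as incorrect under this rule, which is one reason the prompts need to elicit the phrase reliably.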
CoT prompting results for Negation QA, Hindsight Neglect, and Quote Repetition are shown in Figure 4, with the examples of CoT prompts also shown in the same ﬁgure. For Hindsight Neglect and Negation QA, CoT prompting changes the scaling curves to positive (monotonically increasing). For Quote Repetition, CoT prompting still has a U-shaped curve, though performance is noticeably better for 8B/62B models and also achieves a perfect solve rate at 540B. Note that for Negation QA, changing the prompt format alone without CoT changes the scaling curve from slightly inverse to U-shaped, potentially because adding exemplars helped the model learn to solve the task (also see Section 5).
CoT prompting results for the last task, Redefine Math, are shown in Figure 5. Since this task consists of eight sub-tasks, each with a different instruction, we also stratify performance by sub-task to explore whether the same scaling behavior holds across sub-tasks. In summary, CoT prompting exhibits positive scaling for all sub-tasks, achieving 100% solve rates at 62B and 540B models for seven out of eight sub-tasks. Illustrative examples of this trend are the “+ as digit” and “+ as random number” sub-tasks that show strong inverse scaling curves even up to PaLM-540B, but achieve perfect accuracy using CoT for all models. The one task that is still not solved by CoT prompting requires the model to execute the modulo operation across multi-digit numbers (e.g., 876 mod 23) and does not achieve above-random performance, even by the 540B model.
In summary, all tasks and sub-tasks studied in this section exhibit either U-shaped scaling or positive scaling when using CoT prompting. This does not mean that the no-CoT prompting result is invalid; rather, it adds additional nuance by underscoring how the scaling curve for a task diﬀers based on what type of prompting is used. In other words, the same task can show an inverse scaling curve for one type of prompting and U-shaped scaling or positive scaling for another type of prompting.
This paper has two simple takeaways. First, inverse scaling can turn into U-shaped scaling when evaluated on models of sufficiently large scale, as demonstrated on six out of eleven Inverse Scaling Prize tasks. The prevalence of U-shaped scaling we identified in this paper shows that inverse scaling curves do not necessarily extrapolate to larger models. Second, when CoT prompting is applied, all four tasks from Round 1 of the Inverse Scaling Prize could be changed to exhibit either positive or U-shaped scaling. This means that the same task can have a different scaling curve based on what type of prompting is used. Together, the implication is that a combination of scaling and prompting techniques appears to be a reliable method for improving model performance. Overall, the inverse scaling tasks serve as a useful analytical tool for large language models with respect to scaling behavior and sensitivity to prompting.
5 Addendum: Providing 1-shot demonstrations makes three more tasks U-shaped
The above results where seven out of eleven tasks are U-shaped or increasing are obtained with zero-shot prompts. To gauge the effect of demonstrations, we also ran experiments using prompts including one-shot demonstrations using data provided by the Inverse Scaling Prize organizers, and found that one-shot changes three more tasks from inverse scaling to U-shaped. With one-shot prompts, ten out of eleven tasks are non-inverse scaling (U-shaped or positive). Again we do not explore why in this paper, but we hypothesize that an exemplar may allow the model to learn input–output relationships that are helpful for avoiding the distractor task.
Thanks Ethan Perez and Ian McKenzie for their help with sharing the Round 2 data in the fourth version of the report. Thanks Ethan Perez, Ian McKenzie, and Najoung Kim for help with the third version of the report. Thanks Ethan Perez for feedback that we incorporated into the second arXiv version of the report. Thanks Denny Zhou, Ed Chi, and Le Hou for feedback on the initial report. Finally, we really appreciate the spirit and organization of the Inverse Scaling Prize organizers—thank you!
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 2020. URL https://arxiv.org/abs/2005.14165.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, et al. PaLM: Scaling language modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022. URL https://arxiv.org/abs/2204.02311.
Deep Ganguli, Danny Hernandez, Liane Lovitt, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova Dassarma, Dawn Drain, Nelson Elhage, et al. Predictability and surprise in large generative models. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1747–1764, 2022. URL https://dl.acm.org/doi/abs/10.1145/3531146.3533229.
Jordan Hoﬀmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, et al. Training compute-optimal large language models. NeurIPS, 2022. URL https://arxiv.org/abs/2203.15556.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeﬀrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. URL https://arxiv.org/abs/2001.08361.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. NeurIPS, 2022. URL https://arxiv.org/abs/2205.11916.
Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.229. URL https://aclanthology.org/2022.acl-long.229.
Ian McKenzie, Alexander Lyzhov, Alicia Parrish, Ameya Prabhu, Aaron Mueller, Najoung Kim, Sam Bowman, and Ethan Perez. The inverse scaling prize, 2022a. URL https://github.com/inverse-scaling/prize.
Ian McKenzie, Alexander Lyzhov, Alicia Parrish, Ameya Prabhu, Aaron Mueller, Najoung Kim, Sam Bowman, and Ethan Perez. Announcing the inverse scaling prize ($250k prize pool). Lesswrong, 2022b. URL https://www.lesswrong.com/posts/eqxqgFxymP8hXDTt5/announcing-the-inverse-scaling-prize-usd250k-prize-pool.
Ian McKenzie, Alexander Lyzhov, Alicia Parrish, Ameya Prabhu, Aaron Mueller, Najoung Kim, Sam Bowman, and Ethan Perez. Inverse scaling prize: Second round winners. Lesswrong, 2022c. URL https://www.lesswrong.com/posts/DARiTSTx5xDLQGrrz/inverse-scaling-prize-second-round-winners.
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoﬀmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, et al. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446, 2021. URL https://arxiv.org/abs/2112.11446.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022. URL https://arxiv.org/abs/2206.04615.
Mirac Suzgun, Nathan Scales, Nathaneal Scharli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022. URL https://arxiv.org/abs/2210.09261.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raﬀel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. TMLR, 2022a. URL https://openreview.net/forum?id=yzkSU5zdwD.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 2022b. URL https://arxiv.org/abs/2201.11903.
A Full results
A.1 Full results for CoT experiments
The full results used in the CoT experiments in this paper are shown below.
B Distractor tasks
We show a speculative decomposition of tasks into the true task and a distractor task in Figure 6.
C Prior examples of U-shaped scaling
D Model scale: parameters, data, and compute
As shown in Table 4, we computed training FLOPs following the protocol of Brown et al. (2020).
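As a rough check on the compute figures, the sketch below assumes the protocol amounts to the common approximation of about 6 FLOPs per parameter per training token (training FLOPs ≈ 6 × parameters × tokens). The function name is hypothetical, and only the 1B-parameter, 40B-token model is used as an example because its token count is stated in the text.

```python
ZETTA = 1e21  # one zettaFLOP is 10^21 floating-point operations


def training_zettaflops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in zettaFLOPs.

    Assumes roughly 6 FLOPs per parameter per training token (forward plus
    backward pass), which is an approximation and not an exact accounting.
    """
    return 6 * n_params * n_tokens / ZETTA


# The 1B-parameter model trained on 40B tokens described in Section 2.
print(round(training_zettaflops(1e9, 40e9), 2))  # 0.24
```

Under this assumption the 1B model comes out at about 0.24 zettaFLOPs, consistent with the 0.2 zettaFLOPs quoted in Section 2.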
In the second version of the arXiv paper, it was reported that only two of the four first-round tasks were U-shaped. However, three of them were actually U-shaped. This error was because I (Jason) accidentally swapped the PaLM 62B numbers for Hindsight and NeQA. I realized the error when I reproduced those tasks for the third arXiv version.