Aug 18, 2023
Haipeng Luo2∗ Qingfeng Sun1∗ Can Xu1† Pu Zhao1 Jianguang Lou1
Chongyang Tao1 Xiubo Geng1 Qingwei Lin1 Shifeng Chen2† Dongmei Zhang1
2Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Large language models (LLMs), such as GPT-4, have shown remarkable performance in natural language processing (NLP) tasks, including challenging mathematical reasoning. However, most existing open-source models are only pre-trained on large-scale internet data and without math-related optimization. In this paper, we present WizardMath, which enhances the mathematical reasoning abilities of Llama-2, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model. WizardMath surpasses all other open-source LLMs by a substantial margin. Furthermore, our model even outperforms ChatGPT-3.5, Claude Instant-1, PaLM-2 and Minerva on GSM8k, simultaneously surpasses Text-davinci-002, PaLM-1 and GPT-3 on MATH. More details and model weights are public at https://github.com/nlpxucan/WizardLM 3 and https://huggingface.co/WizardLM.
Recently, Large-scale language models (LLMs) have garnered significant attention and become the go-to approach for numerous natural language processing (NLP) tasks, including open domain conversation [1–4], coding [5–13] and math [14–19]. A conspicuous example is ChatGPT, developed by OpenAI. This model uses extensive pre-training on large-scale internet data and further fine-tuning with specific instruction data and methods. As a result, it achieves state-of-the-art zero-shot performance on various benchmarks. Subsequently, Anthropic, Google, and Meta also launched their competitive products one after another. Notably, Meta’s series of Llama [4, 20] models have sparked an open-source revolution and quickly narrowed the gap with those closed-source LLMs. This trend also gradually stimulates the releases of MPT8, Falcon , StarCoder , Alpaca , Vicuna , and WizardLM , etc. However, these open models still struggles with the scenarios which require complex multi-step quantitative reasoning, such as solving mathematical and science challenges [25–35].
Chain-of-thought (CoT)  proposes to design better prompts to generate step-by-step solutions, which can lead to improved performance. Self-Consistency  also achieves remarkable performance on many reasoning benchmarks, which generates several possible answers from the model and selects the correct one based on majority vote . In recent,  finds that process supervision with reinforcement learning significantly outperforms outcome supervision for solving challenging MATH problems.
Inspired by Evol-Instruct and Process-supervised Reinforcement Learning, this work aims to enhance the mathematical reasoning abilities of the SOTA open-source LLM, Llama-2 . As shown in the Figure 1, we propose a new method named Reinforcement Learning from Evol-Instruct Feedback (RLEIF), which could firstly generate diverse math instructions data by math-specific Evol-Instruct, then we train an instruction reward model (IRM) and a process-supervised reward model (PRM) [16, 36–41], the former indicates the quality of the evolved instruction and the later receives feedback for each step in the solution. The brand-new Evol-Instruct method includes two downward evolution and upward evolution progress to produce the grade school math and challenging math respectively. Initially, we re-generate, filter and finetune the original math instruction data from GSM8k  and MATH . Immediately, we train the Llama-2 models to obtain the reward models and our WizardMath.
We perform experiments on two mathematical reasoning benchmarks, namely GSM8k  and MATH , the results demonstrate that our WizardMath outperforms all other open-source LLMs, achieving state-of-the-art performance. Specifically, WizardMath observe a substantial improvement in pass@1 with an increase of +24.8 (81.6. vs. 56.8) on GSM8k, and +9.2 (22.7 vs. 13.5) on MATH. Notably, our model even also significantly surpasses OpenAI’s ChatGPT-3.55, Anthropic’s Claude Instant-1 , and Google’s PaLM-2  in terms of pass@1 on GSM8k.
The main contributions of this work are as following:
• We introduce WizardMath model, which enhances the mathematical reasoning abilities for open-source pretrained large language model Llama-2 .
• We propose a new method, Reinforcement Learning from Evol-Instruct Feedback (RLEIF), alongside Evol-Instruct and Reinforcement Learning, for improving LLM reasoning performance.
• WizardMath surpasses all other open-source LLMs by a substantial margin in terms of mathematical reasoning, including Llama-2 70B , Llama-1 65B , Falcon-40B , MPT-30B8, Baichuan-13B Chat9 and ChatGLM2 12B  on both GSM8k  and MATH .
• WizardMath significantly outperforms various main closed-source LLMs, such as ChatGPT5, GPT-3.5, Claude Instant , PaLM-2 , PaLM-1  and Minerva on GSM8k.
In this section, we elaborate on the details of our WizardMath. Following WizardLM and PRMs, we propose Reinforcement Learning from Evol-Instruct Feedback (RLEIF), which integrates the Evol-Instruct and reinforced process supervision method to evolve GSM8k and MATH, and fine-tune the pre-trained Llama-2 with the evolved data and reward models.
As shown in the Figure 1, our methods apply three steps:
1. Supervised fine-tuning.
2. Training instruction reward model, and process-supervised reward model.
3. Active Evol-Instruct, and PPO training.
2.1 Supervised fine-tuning
Following InstructGPT, we also firstly fine tune the base with supervised instruction-response pairs, which contains:
1. To make the parsing of each step easier, we few-shot re-generate 15k answers for GSM8k and MATH with an Alpha version of WizardLM 70B model to produce solutions in a step-by-step format, then find out those with a correct answer, and use this data to finetune base Llama model.
2. To enhance the model’s ability to adhere to the neural and diverse instructions, we also sample 1.5k open-domain conversations from WizardLM’s training data, then merge it with above math corpus as the final SFT training data.
2.2 Evol-Instruct principles for math
Motivated by the Evol-Instruct  method proposed by WiazrdLM and its effective application on WizardCoder , this work attempts to make math instructions with various complexities and diversity to enhance the pre-trained LLMs. Specifically, we adapt Evol-Instruct to a new paradigm including two evolution lines:
1. Downward evolution: It enhances instructions by making the questions easier. For example i): revising high difficulty questions to lower difficulty, or ii) producing a new and easier question with another different topic.
2. Upward evolution: Derived from original Evol-Instruct method, it deepens and generates new and harder questions by i) adding more constraints, ii) concretizing, iii) increasing reasoning.
2.3 Reinforcement Learning from Evol-Instruct Feedback (RLEIF)
Inspired by InstructGPT and PRMs, we train two reward models to predict the quality of the instructions and the correctness of each step in the answer respectively:
1. Instruction Reward Model (IRM): This model aims to judge the quality of the evolved instructions on three aspects: i) Definition, ii) Precision, and iii) Integrity. To produce the ranking list training data of IRM, for each instruction, we firstly use ChatGPT and Wizard-E 4 to generate 2~4 evolved instructions respectively. Then we leverage Wizard-E to rank the quality of those 4~8 instructions.
2. Process-supervised Reward Model (PRM): As there is no powerful open-source math reasoning LLMs before this work, there is no simple way to support highly precise process supervision without professional human-labelers and close-source ChatGPT. Therefore, we depend on ChatGPT to provide process supervision, and ask it to assess the correctness of each step in the solutions generated by our model.
3. PPO training. We evolve the original math (GSM8k + MATH) instructions by 8 turns, increasing the data size from 15k to 96k. We use IRM and PRM to generate the instruction reward(rI) and the answer reward(rA). Then apply a product as the final reward r = rI ·rA.
This section provides a comprehensive overview of the baseline models in our experiments. Subsequently, we mainly elucidate the performance metrics of our models on two prevalent mathematical benchmarks: GSM8k  and MATH .
Close-Source Models. Numerous technology companies have effectively created exceptionally proficient Large Language Models (LLMs) [3, 4, 7, 20, 44, 45, 47, 51–53], but have opted against
Table1: Resultsofpass@1(%)onGSM8kandMATH.Inthisstudy, to ensure equitable and cohesive evaluations, we report the socres of all models within the settings of greedy decoding and CoT . We report the improvement between WizardMath and baseline model with similar parameter size.
Table 2: Results of pass@1 (%) on MATH Subtopics with WizardMath 70B model.
making them publicly available, so they are referred to as close-source models. In our research, we extensively integrate a significant number of close-source models as the foundational benchmarks. Specifically, our baselines encompass the following models: (i) OpenAI’s GPT-3 , GPT-3.5, ChatGPT5, GPT-4 ; (ii) Google’s PaLM 2 , PaLM , and Minerva ; (iii) Anthropic’s Claude Instant , Claude 1.36, Claude 27, DeepMind’s Chinchilla .
Open-Source Models. Massive open-source LLMs [4, 20–23, 45, 52, 53] have been accessible to the AI community. Nonetheless, their performance consistently tends to significantly lag behind the close-source models. As part of our research, we incorporate a significant number of these open-source models as our baselines, which mainly contain the following: Llama 1  & Llama 2 , GAL , GPT-J , GPT-Neo , Vicuna , MPT8, Falcon, Baichuan9, ChatGLM , Qwen10 and RFT .
3.2 Evaluate Benchmarks
We mainly evaluate WizardMath on two benchmarks (GSM8k  and MATH ). The GSM8k  dataset contains approximately 7500 training data and 1319 test data, mainly on grade school level math problems, each of which consists of basic arithmetic operations (addition, subtraction, multiplication, and division), andgenerallyrequires2to8stepstosolve. The MATH dataset collects math problems from prestigious math competitions such as AMC 10, AMC 12, and AIME. It contains 7500 training data and 5,000 challenging test data in seven academic areas: Preal-gebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus. Furthermore, these problems are divided into five levels of difficulty, with ‘1’ denoting the relatively lower difficulty level and ‘5’ indicating the highest level.
3.3 Train and Evaluation prompt
The Llama 2  base serves as our foundation model.
We undertake the training of our WizardMath by employing the prompt from Alpaca :
We evaluate GSM8k  and MATH benchmarks  by employing the following CoT  prompt:
3.4 Evaluation on GSM8k and MATH
Notably, in the Figure 2 and Table 1, we cite the metrics of GPT-4 and GPT-3.5 from . The evaluation of the ChatGPT model’s scores are from . For the assessment of Claude Instant, Claude 1.3, and Claude 2, the scores are extracted from 7. The scores of PaLM 1, PaLM 2, and Minerva are garnered from [7, 15, 44]. Finally, the scores associated with Text-davinci-002, GPT-3 and GPT-2 are garnered from [15,43]. On the open-source models, most scores are retrieved from the paper of Llama 2  or their self-reports. Additionally, we evaluate the Baichuan-chat, Vicuna v1.3 by ourselves. In the Table 2, we show the detailed results of MATH subtopics with our WizardMath 70B model.
Comparing with the Close-Source Models. In Table 1, our WizardMath 70B slightly outperforms some close-source LLMs on GSM8k, including ChatGPT, Claude Instant and PaLM 2 540B. And as shown in Figure 2, our model is currently ranked in the top five on all models. Simultaneously, WizardMath 70B also surpasses the Text-davinci-002 on MATH. The detailed results are as follows:
1. WizardMath 13B outperforms PaLM 1 540B (63.9 vs 56.5), Minerva 540B (63.9 vs 58.8), and GPT-3.5 (63.9 vs 57.1) on GSM8k. Meanwhile,it surpasses PaLM 1 540B (14.0 vs. 8.8), GPT-3 175B (14.0 vs. 5.2) on MATH.
2. WizardMath 70B, our largest model, achieves the superior or comparable performance with Claude Instant (81.6 vs 80.9), ChatGPT (81.6 vs 80.8) and PaLM 2 (81.6 vs 80.7) on GSM8k. Concurrently, WizardMath 70B also exceeds Text-davinci-002 (22.7 vs. 19.1) by a margin of 3.6% on the MATH benchmarks.
Comparing with the Open-Source Models. The findings illustrated in the table 1 explicitly demonstrate that our WizardMath 70B, distinctly manifest a substantial performance advantage over all the open-source models across both the GSM8k and MATH benchmarks. The detailed results are as follows:
1. WizardMath 7B surpasses most open-source models with parameter counts ranging approximately from 7B to 40B, including MPT, Falcon, Baichuan-chat, Vicuna v1.3, ChatGLM 2, Qwen, Llama 1 and Llama 2 on the GSM8k and MATH benchmarks. Even though its parameter counts are significantly lower.
2. WizardMath 13B is significantly superior to Llama 1 65B (63.9 vs. 50.9) and Llama 2 70B (63.9 vs. 56.8) on GSM8k. Additionly, it substantially outperforms both Llama 1 65B (14.0 vs. 10.6) and Llama 2 70B (14.0 vs. 13.5) on MATH.
3. WizardMath 70B, our most extensive model, exemplifies a substantial advancement in performance, surpassing Llama 2 70B (81.6 vs. 56.8) by a significant margin of 24.8% on GSM8k. Concurrently, it also outperforms Llama 2 70B (22.7 vs. 13.5) by a margin of 9.2% on MATH.
3.5 Case Study
Appendix A shows some examples generated by our WizardMath. The examples demonstrate that our model consistently generates accurate response answers accompanied by clear explanations.
4 Related Work
Large Language Models. LLMs have achieved substantial advancements within the realm of Natural Language Processing (NLP), providing a valuable and task-agnostic foundation for widespread applications. These models typically encompass parameter counts reaching into the hundreds of billions, which are trained on extensive large-scale corpuses of textual data. The prominent instances entail OpenAI’s GPT3&4 [3, 51], Anthropic’s Claude7, Google’s PaLM [7, 44], Bard11, DeepMind’s Chinchilla , and Gopher . However none of them have been open-sourced so far, and some of them can only be exclusively accessible through APIs.
Recently, the AI landscape has borne witness to the emergence of numerous open-source LLMs, characterized by publicly accessible model codes and weight parameters. EleutherAI has contributed GPT-NeoX-20B  and GPT-J-6B . BigScience has introduced BLOOM . Similarly, Meta has made strides by releasing OPT , Llama 1 , Llama 2 , and GAL . Tsinghua University has unveiled GLM-130B and ChatGLM . TII has facilitated the release of Falcon . Additionally, LLMs such as Baichuan9 and Qwen10 have also surfaced. Presently, Llama assumes a pivotal role as the foundational model for supervised fine-tuning, ushering in the emergence of several extremely remarkable models, including Alpaca , Vicuna , Guanaco , WizardLM , and Orca , RFT  etc.
Large Language Models For Mathematical reasoning. It’s well known that complex reasoning problems are challenging for NLP models, which include mathematical reasoning [25–30], common-sense reasoning [58, 59], and logical reasoning . A substantial body of current research is centered around the intricate task reasoning of the Mathematical Word Problems(MWP) [30, 42, 60–64], which requires the ability to understand mathematical concepts, computation and multi-step reasoning [16– 19, 36, 40, 46]. Addtitionly, models are evaluated across different levels of MWP benchmarks on some mathematical reasoning datasets such as AddSub , MultiArith , SingleEQ , SVAMP , GSM8K , AQuA  and MATH .
To enhance the reasoning ability of LLMs,  proposed Chain-of-Thought Prompting, which attaches multiple reasoning steps before obtaining the answer for a question. By employing the simple few-shot reasoning strategy, LLMs are able to perform better in complex reasoning problems. Least-to-Most  prompting decomposes the problem into sub-problems that are then solved incrementally. Additionally each step has a more detailed reasoning process. Similarly, the Complex CoT  underscores the pivotal role of prompt complexity by strategically choosing the most intricate problems and their corresponding solutions to function as prompts. To alleviate the burden of manual efforts,  introduced Auto-CoT, an approach that automates the process of acquiring k samples through the application of clustering techniques on a provided dataset. With the objective of mitigating manual intervention,  proposed Zero-shot-CoT, which entails the straightforward practice of appending the phrase “Let’s think step by step” to each answer, eliciting the inference steps without examples. Moreover,  expanded upon this notion by suggesting the exploration of diverse inference paths throughout the reasoning process. Consequently, the ultimate outcome is determined through either the aggregation of answers using majority voting or by leveraging a validation mechanism, as posited by .  employs a straightforward approach for generating augmented samples, focusing on probing the correlation between LLMs and math reasoning ability.
Large Language Models For Reinforcement Learning. Nevertheless, even state-of-the-art models
frequently manifest logical errors and a range of illusions [70, 71]. These anomalies become especially challenging within domains necessitating multi-step reasoning, where a singular logical misstep maybe precipitate the unraveling of an entire solution. An effective strategy involves the training of reward models aimed at discriminating between favorable and unfavorable outputs . Early outcome-based approaches were mainly performed on algorithmic tasks [72–75].  demonstrated the significant benefits of reward models or validators, and  proposed a heuristic-based step-size-aware RM. [2, 77–79] proposed the use of reward models for a reinforcement learning pipeline. [20, 37–39, 42, 80–82] employed rejection sampling for searching to achieve alignment of LLMs with human preferences.
The differences between outcome-based and process-based reward modelling are further discussed by . Outcome-supervised reward models (ORMs) undergo training exclusively utilizing the ultimate outcomes derived from the model’s chain-of-thought process. Conversely, process-supervised reward models (PRMs) are designed to solicit feedback for each individual step within the chain-of-thought progression. In the domain of logical reasoning, ORMs frequently employ incorrect reasoning pathways yet yield the correct final answer [41, 83]. Notably, PRMs has been demonstrated to effectively alleviate this phenomenon of inconsistent behavior . [36, 84, 85] amassed an expansive corpus of process-based supervised signals through meticulous manual annotation, which verified that PRMs and supervision with manual annotation yielded more pronounced advantages for LLMs as compared to ORMs.
Large Language Models For Instruction Fine-Tuning. The initial endeavors in instruction-following training work primarily focused on enhancing the language model’s capacity for generalization across diverse tasks. This often involves the process of fine-tuning across substantially available Natural Language Processing datasets, and evaluates on the different NLP tasks. T5  undertake the earliest attempts to train a range of NLP tasks, including Question and Answer, Document Summarization, and Sentiment Classification, by employing a consistent prompt format across all the data. Subsequently, instruction fine-tuning work such as FLAN , ExT5 , T0 , UnifiedQA , ZeroPrompt , and FLAN-T5  emerged to adapt for a large number of downstream tasks. To address the challenge of misalignment between model outputs and human requirements, OpenAI manually annotates the instruction library to construct a diverse range of tasks. Simultaneously, Reinforcement Learning from Human Feedback technology is employed, which facilitate the rapid development of LLMs such as InstructGPT , ChatGPT5, GPT-4. To reduce manual involvement, self-instruct  improves instruction-following through self-generated instructions. Alpaca  used a dataset of 50k instructions generated from a limited (e.g., 175 samples) seed set of manually-written instructions. Vicuna  used 70k user-shared conversations with ChatGPT collected from ShareGPT.com. Meanwhile, WizardLM  introduces the evol-instruct approach, which seeks to refine the existing instruction data by enhancing both its complexity and diversity.
5 Conclusion and Future Work
This paper introduces WizardMath, a mathematics model fine-tuned with RLEIF. The experimental results demonstrate that WizardMath achieves SOTA performance surpassing all existing open-source LLMs on two widely recognized mathematical reasoning benchmarks: GSM8k and MATH. Furthermore, WizardMath exhibits superior performance compared to some of the largest close-source LLMs, including ChatGPT, GPT-3.5, Claude Instant, PaLM-2, PaLM-1 and Minerva on the GSM8k benchmark.
Future Work. Although our WizardMath achieves impressive mathematics performance, as de-picted in Figure 2, our model still falls significantly behind the SOTA LLM, GPT-4 and Claude-2. Therefore, future work will prioritize the enhancement of the RLEIF or better method to further augment the performance of our model.
Broader Impact. Similar to the other LLMs, our WizardMath could also generate unethical, harmful, or misleading information sometimes. Therefore, future research to address the ethical and societal implications is needed.
 Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
 Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In NeurIPS, 2022. OpenAI. Gpt-4 technical report, 2023.
 Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
 Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, FelipePetroskiSuch,DaveCummings,MatthiasPlappert,FotiosChantzis,ElizabethBarnes,ArielHerbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021.
 Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways, 2022.
 Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations, 2023.
 Yue Wang, Weishi Wang, Shafiq R. Joty, and Steven C. H. Hoi. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 8696–8708. Association for Computational Linguistics, 2021.
 Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D. Q. Bui, Junnan Li, and Steven C. H. Hoi. Codet5+: Open code large language models for code understanding and generation, 2023.
 Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x, 2023.
 Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.
 Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023.
 Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085, 2022.
 Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858, 2022.
 Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. Scaling rela-tionship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825, 2023.
 Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. Progressive-hint prompting improves reasoning in large language models. arXiv preprint arXiv:2304.09797, 2023.
 Shima Imani, Liang Du, and Harsh Shrivastava. Mathprompter: Mathematical reasoning using large language models. arXiv preprint arXiv:2303.05398, 2023.
 Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint arXiv:2305.04091, 2023.
 Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
 Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refined web dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github. com/tatsu-lab/stanford_alpaca, 2023.
 Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.
 Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
 Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang. A survey of deep learning for mathematical reasoning. arXiv preprint arXiv:2212.10535, 2022.
 Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier, and Julius Berner. Mathematical capabilities of chatgpt. arXiv preprint arXiv:2301.13867, 2023.
 Arindam Bhattacharya. A survey of question answering for math and science problem. arXiv preprint arXiv:1705.04530, 2017.
 Yan Wang, Xiaojiang Liu, and Shuming Shi. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 845–854, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.
 Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. ACL, 2017.
 Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. MAWPS: A math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1152–1157, San Diego, California, June 2016. Association for Computational Linguistics.
 Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.
 Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, 2022.
 Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022.
 Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
 Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. Complexity-based prompting for multi-step reasoning. arXiv preprint arXiv:2210.00720, 2022.
 Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
 Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302, 2023.
 Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023.
 Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
 JonathanUesato,NateKushman,RamanaKumar,FrancisSong,NoahSiegel,LisaWang,AntoniaCreswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
 Antonia Creswell, Murray Shanahan, and Irina Higgins. Selection-inference: Exploiting large language models for interpretable logical reasoning. arXiv preprint arXiv:2205.09712, 2022.
 Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
 Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
 Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
 AohanZeng,XiaoLiu,ZhengxiaoDu,ZihanWang,HanyuLai,MingDing,ZhuoyiYang,YifanXu,Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
 Xu Zhao, Yuxi Xie, Kenji Kawaguchi, Junxian He, and Qizhe Xie. Automatic model selection with large language models for reasoning. arXiv preprint arXiv:2305.14333, 2023.
 Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, KatieMillican,GeorgevandenDriessche,BogdanDamoc,AureliaGuy,SimonOsindero,KarenSimonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models. CoRR, abs/2203.15556, 2022.
 Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
 InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023.
 Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Rose Biderman. Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow. 2021.
 Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
 Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.
 Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
 Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al. Gpt-neox-20b: An open-source auto regressive language model. arXiv preprint arXiv:2204.06745, 2022.
 Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
 Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
 Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707, 2023.
 Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsense QA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the NorthAmericanChapteroftheAssociationforComputationalLinguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
 Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021.
 Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, 2021.
 Yihuai Lan, Lei Wang, Qiyuan Zhang, Yunshi Lan, Bing Tian Dai, Yan Wang, Dongxiang Zhang, and Ee-Peng Lim. Mwptoolkit: an open-source framework for deep learning-based math word problem solvers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 13188–13190, 2022.
 Zhanming Jie, Jierui Li, and Wei Lu. Learning to reason deductively: Math word problem solving as complex relation extraction. arXiv preprint arXiv:2203.10316, 2022.
 Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, and Songfang Huang. How well do large language models perform in arithmetic tasks? arXiv preprint arXiv:2304.02015, 2023.
 Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, and Tushar Khot. Chain-of-thought hub: A contin-uous effort to measure large language models’ reasoning performance. arXiv preprint arXiv:2305.17306, 2023.
 Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523–533, Doha, Qatar, October 2014. Association for Computational Linguistics.
 Subhro Roy and Dan Roth. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752, Lisbon, Portugal, September 2015. Association for Computational Linguistics.
 Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597, 2015.
 Denny Zhou, Nathanael Scharli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Huai hsin Chi. Least-to-most prompting enables complex reasoning in large language models. ArXiv, abs/2205.10625, 2022.
 Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5315–5333, Toronto, Canada, July 2023. Association for Computational Linguistics.
 Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
 Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661, 2020.
 Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
 Scott Reed and Nando De Freitas. Neural programmer-interpreters. arXiv preprint arXiv:1511.06279, 2015.
 Chengtao Li, Daniel Tarlow, Alexander L. Gaunt, Marc Brockschmidt, and Nate Kushman. Neural program lattices. In International Conference on Learning Representations, 2016.
 Jonathon Cai, Richard Shin, and Dawn Song. Making neural programming architectures generalize via recursion. arXiv preprint arXiv:1704.06611, 2017.
 Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. On the advance of making language models better reasoners. arXiv preprint arXiv:2206.02336, 2022.
 Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
 Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
 Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
 Eric Nichols, Leo Gao, and Randy Gomez. Collaborative storytelling with large-scale neural language models. In Proceedings of the 13th ACM SIGGRAPH Conference on Motion, Interaction and Games, pages 1–10, 2020.
 Jianhao Shen, Yichun Yin, Lin Li, Lifeng Shang, Xin Jiang, Ming Zhang, and Qun Liu. Generate & rank: A multi-task framework for math word problems. arXiv preprint arXiv:2109.03034, 2021.
 Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang. Preference ranking optimization for human alignment. arXiv preprint arXiv:2306.17492, 2023.
 Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
 XinyuZhu,JunjieWang,LinZhang,YuxiangZhang,YongfengHuang,RuyiGan,JiaxingZhang,andYujiu Yang. Solving math word problems via cooperative reasoning induced language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2023.
 Ansong Ni, Jeevana Priya Inala, Chenglong Wang, Alex Polozov, Christopher Meek, Dragomir Radev, and Jianfeng Gao. Learning math reasoning from self-sampled correct and partially-correct solutions. In The Eleventh International Conference on Learning Representations, 2022.
 Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020.
 Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023.
 Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q. Tran, Dara Bahri, Jianmo Ni, Jai Prakash Gupta, Kai Hui, Sebastian Ruder, and Donald Metzler. Ext5: Towards extreme multi-task scaling for transfer learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
 Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, An-toine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli,ThibaultFévry,JasonAlanFries,RyanTeehan,TevenLeScao,StellaBiderman,LeoGao,Thomas Wolf, and Alexander M. Rush. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
 Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. Unifiedqa: Crossing format boundaries with a single QA system. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 1896–1907. Association for Computational Linguistics, 2020.
 Hanwei Xu, Yujun Chen, Yulun Du, Nan Shao, Yanggang Wang, Haiyu Li, and Zhilin Yang. Zeroprompt: Scaling prompt-based pretraining to 1, 000 tasks improves zero-shot generalization. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 4235–4252. Association for Computational Linguistics, 2022.
 Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
 Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022.
A.1 GSM8k Case Study
Table 3: A comparison case on different scale size models
Table 4: A comparison case on different scale size models
Table 5: A comparison case on different scale size models
A.2 MATH Case Study
Table 6: A comparison case on different scale size models
Table 7: A comparison case on different scale size models
Table 8: A comparison case on different scale size models