
JUDGELM : FINE-TUNED LARGE LANGUAGE MODELS ARE SCALABLE JUDGES

October 26, 2023

Lianghui Zhu1,2 ∗               Xinggang Wang2                Xinlong Wang1

1 Beijing Academy of Artificial Intelligence       2 School of EIC, Huazhong University of Science & Technology

Code & Models: https://github.com/baaivision/JudgeLM

ABSTRACT

Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics cannot measure them comprehensively. To address this problem, we propose fine-tuning LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively on open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLM-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales, from 7B and 13B to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases in fine-tuning LLMs as judges and identify them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques including swap augmentation, reference support, and reference drop, which clearly enhance the judge’s performance. JudgeLM obtains state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. JudgeLM is also efficient: JudgeLM-7B needs only 3 minutes to judge 5K samples with 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, exceeding 90% and even surpassing human-to-human agreement. JudgeLM also demonstrates extended capabilities in judging single answers, multimodal models, multiple answers, and multi-turn chat.

1 INTRODUCTION

Recent advancements in large language models (LLMs) have fostered significant interest due to their remarkable performance in following instructions and their broad capabilities in dealing with open-ended scenarios. Building on open-source LLMs, including OPT (Zhang et al., 2022), Flan-T5 (Chung et al., 2022), LLaMA (Touvron et al., 2023a), and Pythia (Biderman et al., 2023), researchers have proposed numerous methods to align these models with human preferences through instruction fine-tuning. These aligned LLMs demonstrate enhanced abilities in comprehending human instructions and generating more coherent responses. Nonetheless, existing benchmarks (Hendrycks et al., 2020; Liang et al., 2022) and traditional metrics (Lin, 2004; Papineni et al., 2002; Zhang et al., 2019; Sellam et al., 2020; Yuan et al., 2021) do not adequately estimate the capabilities of LLMs in open-ended scenarios. Therefore, a new benchmark method that can evaluate LLMs comprehensively on open-ended tasks is needed.

Concurrent works are exploring various methods for evaluating the performance of LLMs. Arena-format methods (Zheng et al., 2023) leverage crowdsourced platforms to collect anonymized LLM competition results. While evaluations by humans are trustworthy, they are also time-consuming and financially demanding. Some approaches (Chiang et al., 2023) utilize GPT-4 as a judge. Nevertheless, these methods grapple with potential data exposure and volatile API model transitions, potentially compromising the judge’s reproducibility. PandaLM (Wang et al., 2023) attempts to fine-tune open-source LLMs for evaluating answers. However, limitations stemming from the model’s size, training-data quality, and inherent LLM biases undermine the effectiveness of such fine-tuned models in the role of a judge.

In this paper, we propose to evaluate LLMs through fine-tuned open-source LLMs that serve as scalable judges (JudgeLM), achieving satisfactory agreement with the teacher judge. Our methodology incorporates scalable judges as evaluators for open-ended tasks, coupled with a high-quality dataset conducive to both training and evaluating the judge models. Within our framework, we adapt open-source LLMs to serve as judges and analyze their scaling ability with respect to model size (from 7B to 33B) and volume of training data (from 3.5K to 100K samples). Our curated dataset comprises 105K seed questions, LLM answer pairs, and judgments from the teacher judge, GPT-4, as shown in Fig. 1a. Note that we generate two judgments for each seed task, one with and one without reference answers. The dataset is partitioned, with 100K seed questions allocated for training (×2 larger than PandaLM’s) and the remainder for validation (×29 larger than PandaLM’s).

(a) Data generation pipeline of our JudgeLM. We first collect 105K seed tasks as questions. Then, we extract answers from 11 LLMs and randomly sample a pair of answers from the answer set. Finally, we feed the tasks, the sampled answer pairs, and optionally the reference answers to GPT-4, which generates scores and detailed reasons as the teacher judge.

(b) An illustration of the JudgeLM’s fine-tuning and various functions. We use the generated judge samples to fine-tune LLMs as scalable judges. When fine-tuning LLMs as judges, we also propose swap augmentation, reference support, and reference drop to address position bias, knowledge bias, and format bias, respectively.

(c) An illustration of various functions of our JudgeLM.

Figure 1: An overview of our scalable JudgeLM including data generation, fine-tuning, and various functions.

Utilizing LLMs as judges inevitably introduces biases such as position bias (favoring answers in specific positions), knowledge bias (over-reliance on pre-trained knowledge), and format bias (optimal performance only under specific prompt formats), as shown in Figs. 8, 10, 12, and 13. We propose methods to address them. Moreover, our JudgeLM system offers extended capabilities, as shown in Fig. 1b, including grading single answers, judging multiple answers, judging multimodal models, and multi-turn chat.

In contrast to arena-format methodologies, our approach is rapid and low-cost. For instance, JudgeLM-7B requires only 8 A100 GPUs and can evaluate 5,000 response pairs in just 3 minutes. In comparison to closed-source LLM judges, JudgeLM ensures reproducibility and protects user privacy. Compared with concurrent open-source LLM judges, our system explores both the scaling ability and the biases of LLM fine-tuning. Furthermore, the dataset we introduce is the most diverse and high-quality of its kind, significantly benefiting subsequent research on judge models.

Our main contributions can be summarized as follows:

• We introduce a high-quality, large-scale dataset for judge models, enriched with diverse seed tasks, LLM-generated answers, and detailed judgments from GPT-4, laying the foundation for future research on LLM evaluation.

• We propose JudgeLM, a scalable language-model judge designed to evaluate LLMs in open-ended scenarios. It achieves an agreement exceeding 90%, surpassing human-to-human agreement. JudgeLM also has broad capabilities to handle extended tasks.

• We analyze the biases inherent in LLM judge fine-tuning and introduce a series of methods to address them. Our methods significantly improve the model’s consistency across different cases, making JudgeLM more reliable and flexible.

2 RELATED WORKS

2.1 INSTRUCTION FINE-TUNING OF LARGE LANGUAGE MODELS

With the development of large language models (LLMs), researchers have found that fine-tuning pre-trained LLMs such as GPT-3 (Brown et al., 2020), T5 (Raffel et al., 2020), OPT (Zhang et al., 2022), and PaLM (Chowdhery et al., 2022) enables them to follow human instructions and help with open-ended tasks. Instruction fine-tuned LLMs such as InstructGPT (Ouyang et al., 2022), ChatGPT (OpenAI, 2022), FLAN-T5 (Chung et al., 2022), FLAN-PaLM (Chung et al., 2022), OPT-IML (Iyer et al., 2022), and GPT-4 (OpenAI, 2023) exhibit stronger ability on zero-shot or few-shot tasks than their base models. After Meta released the powerful open-source LLMs LLaMA (Touvron et al., 2023a) and LLaMA2 (Touvron et al., 2023b), many instruction fine-tuning works based on LLaMA or LLaMA2 were proposed in the natural-language-generation and multimodal-generation domains, such as Alpaca, Vicuna (Chiang et al., 2023), OpenFlamingo (Awadalla et al., 2023), LLaMA-Adapter (Zhang et al., 2023), and Emu (Sun et al., 2023). Our JudgeLM also belongs to the LLaMA family and takes the Vicuna series as base models. JudgeLM follows the instruction fine-tuning manner to create LLM judges and models the judgment-generation task as “grading, judging, and reasoning”. We further collect a high-quality, large-scale dataset for research on judging the performance of LLMs.

2.2 EVALUATION OF LARGE LANGUAGE MODELS

As many open-source large language models (LLMs) and their fine-tuned variants are released and show remarkable performance on various tasks, evaluating the capabilities of LLMs has become a popular and challenging task. To address this problem, Chatbot Arena (Zheng et al., 2023) builds a crowdsourced platform that ranks LLMs through pairwise comparison and Elo rating. The crowdsourced way to evaluate LLMs yields more reliable results but faces high costs and low efficiency. Vicuna (Chiang et al., 2023) uses GPT-4 as a judge to select the better answer. Although the GPT-4-based method can judge LLMs like a human expert, API-based methods carry potential risks of data leakage and unstable performance. Zeno Build (Alex & Graham, 2023) proposes to evaluate LLMs on a customer-service dataset, but traditional metrics such as ChrF (Popovic, 2015) and BERTScore (Zhang et al., 2019) cannot fully evaluate the answers of LLMs in open-ended tasks. Besides, PandaLM (Wang et al., 2023) develops a judge model based on LLaMA (Touvron et al., 2023a) to compare answers produced by LLMs. When serving as a judge, PandaLM achieves an accuracy close to ChatGPT (OpenAI, 2022), but its 7B model size limits its performance. Our JudgeLM provides scalable judges from 7B to 33B parameters and achieves state-of-the-art performance on both the PandaLM benchmark and ours. Furthermore, researchers can run the proposed JudgeLM locally, which ensures reproducibility and data security.

3 DATASET

High-quality, large-scale datasets are crucial for effectively fine-tuning large language models (LLMs) to act as evaluative judges. However, concurrent datasets, such as the one by Wang et al. (2023), are limited in diversity and in the granularity of their judgment criteria. To address this, we introduce a novel dataset with a rich variety of seed tasks, comprehensive answers from modern LLMs, grades for the answers from the teacher judge, and detailed reasons for the judgments. Section 3.1 describes the data-generation process, while Section 3.2 details the methods adopted for training and evaluation using our dataset.

3.1 DATA GENERATION

Figure 2: The input and output of the proposed JudgeLM data sample. Traditional metrics, which compare answers against the ground truth, cannot judge answers accurately. In contrast, LLM judges can understand the questions and answers and give accurate scores and reasons.

The primary objective of our data generation is to create a large-scale and diversified dataset that maximizes the evaluative capabilities of judge models. We sample 105K instruction seed tasks from a large-scale set that contains Alpaca-GPT4 (Peng et al., 2023), Dolly-15K (Conover et al., 2023), GPT4All-LAION (Anand et al., 2023), and ShareGPT. To enhance the heterogeneity of the dataset, answers are collected from 11 leading open-source LLMs including, but not limited to, LLaMA (Touvron et al., 2023a), Alpaca, and Vicuna (Chiang et al., 2023). Following this, we combine the LLM-generated answers with the reference answer to create answer sets. Pairs are randomly selected from these sets, upon which fine-grained scores and detailed reasons are assigned by the advanced teacher model, GPT-4. To ensure robust and comprehensive judgments, we utilize detailed templates as shown in Fig. 3. Additionally, to allow the model to judge with reference answers, the reference-inclusive template in Fig. 4 is employed. This encourages the model to integrate external knowledge during the evaluative process.
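As a concrete illustration, the following Python sketch mirrors the pipeline of Fig. 1a. It is a minimal reconstruction under stated assumptions: the helper names (answers_by_model, call_gpt4) and the abbreviated prompt wording are ours, not the released code, which uses the full templates of Fig. 3 and Fig. 4.

import random

# A hypothetical, minimal judging prompt; the paper's actual templates
# (Fig. 3 and Fig. 4) are more detailed.
JUDGE_PROMPT = (
    "[Question]\n{question}\n"
    "[Answer 1]\n{answer1}\n"
    "[Answer 2]\n{answer2}\n"
    "Output two scores (1-10) on the first line, then detailed reasons."
)

def generate_judge_sample(task, answers_by_model, call_gpt4, reference=None):
    """Build one judge sample: pick a random answer pair from the answer
    set and ask the teacher (GPT-4) to grade it."""
    answer1, answer2 = random.sample(list(answers_by_model.values()), 2)
    prompt = JUDGE_PROMPT.format(question=task, answer1=answer1, answer2=answer2)
    if reference is not None:  # reference-inclusive variant (cf. Fig. 4)
        prompt = f"[Reference Answer]\n{reference}\n" + prompt
    judgment = call_gpt4(prompt)  # e.g. "8 6\nAnswer 1 is more accurate ..."
    return {"question": task, "answer1": answer1, "answer2": answer2,
            "reference": reference, "judgment": judgment}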

3.2 TRAINING AND EVALUATING

To better utilize our dataset to train and evaluate the judge models, we partition it into a training split and a validation split. The training set contains 100K judge samples, while the validation set has 5K. We then introduce the way we use this dataset to train and evaluate, respectively.

Training. The training process of JudgeLM adheres to the instruction fine-tuning paradigm. As illustrated in Fig. 2, the model is fed a question alongside a pair of answers and an optional reference answer, yielding outputs comprising scores and detailed reasons. It is important to note that a carefully crafted prompt template is essential to harness the full potential of JudgeLM’s instruction-following ability. Distinct input templates cater to scenarios with and without references, as depicted in Fig. 3 and Fig. 4, respectively.
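A minimal sketch of how a judge sample might be serialized into an instruction-tuning pair, assuming the dict layout from the generation sketch above; the field names and the abbreviated template wording are ours, and the real templates are those of Fig. 3 and Fig. 4.

def build_finetune_example(sample, with_reference):
    """Serialize one judge sample into a (prompt, target) pair for
    instruction fine-tuning. The target is the teacher's judgment:
    two scores on the first line followed by detailed reasons."""
    ref_block = (f"[Reference Answer]\n{sample['reference']}\n"
                 if with_reference else "")
    prompt = (f"{ref_block}[Question]\n{sample['question']}\n"
              f"[Answer 1]\n{sample['answer1']}\n"
              f"[Answer 2]\n{sample['answer2']}\n"
              "### Response:")
    return prompt, sample["judgment"]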

To further analyze the scaling ability of our JudgeLM, we fine-tune JudgeLM with sizes of 7B, 13B, and 33B parameters. The specific hyperparameters are enumerated in Table 7. As for the scaling analysis for dataset size, we also fine-tune JudgeLM on varying data scales including 3.5K, 10K, 30K, and 100K samples. JudgeLM demonstrates scaling ability both in terms of parameter size and data volume.

Evaluating. We model the judge’s output as “grading, judging, and reasoning”. The judge model first generates scores for the answer pair. The judge result then follows from three situations: “Answer 1 wins” if answer 1’s score is higher than answer 2’s, “Answer 2 wins” if answer 2’s score is higher, or “Tie” if the two scores are the same. Finally, the model generates detailed reasons if needed. The advantage of this modeling is that the judge model needs only a little time to grade and judge, and generates the time-consuming reasoning only when requested.
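The decision step reduces to a comparison of the two generated scores, as in this sketch; the judgelm_grade and judgelm_explain wrappers in the usage comments are assumed names for the two generation passes, not released APIs.

def verdict_from_scores(score1: float, score2: float) -> str:
    """Derive the judge result from the two grades (the cheap step)."""
    if score1 > score2:
        return "Answer 1 wins"
    if score2 > score1:
        return "Answer 2 wins"
    return "Tie"

# Usage sketch with assumed wrappers around a JudgeLM checkpoint:
# scores = judgelm_grade(question, answer1, answer2)     # fast: a few tokens
# result = verdict_from_scores(*scores)
# reasons = judgelm_explain(question, answer1, answer2)  # optional, slow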

For metrics, we employ both objective and reliability metrics to evaluate the judge models comprehensively. For the objective metrics, we compute the agreement, precision, recall, and F1-score between the model’s judge results and those of the teacher. This provides insight into the alignment of judge models with established references, such as GPT-4 or human experts. For the reliability metrics, we first compare the results before and after swapping the LLM answers, then calculate the consistency to measure the judge model’s reliability. Finally, we calculate metrics such as “bias toward 1st”, “bias toward 2nd”, and “delta bias” to gain insight into specific position biases and their variance.
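The reliability metrics can be computed from paired verdicts as in the sketch below. This is our reconstruction of the metric names above; the paper’s exact definitions may differ in detail.

def reliability_metrics(orig, swapped):
    """orig[i] is the verdict on sample i; swapped[i] is the verdict on the
    same sample with the two answers exchanged. A position-invariant judge
    should flip its verdict when the answers are swapped."""
    flip = {"Answer 1 wins": "Answer 2 wins",
            "Answer 2 wins": "Answer 1 wins",
            "Tie": "Tie"}
    n = len(orig)
    consistency = sum(flip[o] == s for o, s in zip(orig, swapped)) / n
    # A biased judge keeps picking the same *position* even after the swap.
    bias_1st = sum(o == s == "Answer 1 wins" for o, s in zip(orig, swapped)) / n
    bias_2nd = sum(o == s == "Answer 2 wins" for o, s in zip(orig, swapped)) / n
    return {"consistency": consistency,
            "bias toward 1st": bias_1st,
            "bias toward 2nd": bias_2nd,
            "delta bias": abs(bias_1st - bias_2nd)}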

4 INHERENT BIASES

In this paper, we also study the inherent biases that influence the reliability of fine-tuned LLM judges through reliability metrics and visualizations.

Position Bias. Position bias means that LLM judges prefer answers in a certain position; it widely exists in natural language processing tasks (Ko et al., 2020; Wang et al., 2018) and in human decision-making (Blunch, 1984; Raghubir & Valenzuela, 2006). The powerful LLMs ChatGPT and GPT-4 also face this challenge when working as judges (Wang et al., 2023; Zheng et al., 2023). As the qualitative and quantitative results in Fig. 8 and Table 5 show, JudgeLM also exhibits position bias and prefers the first answer when the positions of answers are swapped.

Knowledge Bias. Knowledge bias arises when the pre-trained data lacks knowledge needed for some seed tasks or contains possibly undesirable knowledge (Ko et al., 2020) that can degrade the generative capabilities of LLMs. Fig. 10 provides an example in which LLM judges cannot give correct judgments on open-ended tasks when they lack the relevant facts.

Format Bias. Researchers expect that a judge model can make judgments based on pre-trained knowledge when a reference is not available and can make judgments following the reference when it is available. However, our experiments reveal that judge models have a specific preference for their fine-tuning format. We refer to the situation where a judge fine-tuned without references is validated with references (and vice versa) as a mismatched format. As shown in Fig. 12, Fig. 13, and Table 6, the judge models perform badly under mismatched formats. We hypothesize that format bias arises because judge models overfit to the fixed fine-tuning template format.

5 METHODS

In evaluating LLM-generated answers for a seed question, the LLM judge aims to determine the superior answer from a pair of candidates. Motivated by recent methods (Touvron et al., 2023a; Chiang et al., 2023; Ouyang et al., 2022), we present JudgeLM, a scalable judge model, and address inherent biases in such models. Our methodology is depicted in Fig. 1b. The subsequent sections provide a detailed breakdown of our approach.

5.1 SWAP AUGMENTATION

MT-bench (Zheng et al., 2023) and PandaLM (Wang et al., 2023) alleviate position bias by judging twice, with the original and the reversed order, and regarding the result as a tie if the two judgments disagree. This kind of method doubles the evaluation time; it can be regarded as a compromise that does not fix the inherent position bias of LLMs.

Intuitively, swapping the positions of answers at the fine-tuning stage pushes the judge model to pay more attention to the contents of answers rather than their positions. Leveraging our structured judge data, we can easily swap the positions of answers to generate a new input sample. Correspondingly, we also swap the scores and answer indexes of the judgment from the teacher (i.e., GPT-4) to obtain the new ground truth. As shown in Fig. 15, the augmented judge sample keeps the same result but exchanges the positions of the answers. Overall, this is a simple but effective way to augment the training data and address position bias. A JudgeLM trained with swap augmentation gives a good judgment on the same judge sample, as shown in Fig. 9.
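Because the judge data is structured, the augmentation is a mechanical transformation, sketched below under the dict layout assumed in the Section 3.1 sketch (with the teacher’s two scores parsed out of the judgment text into a "scores" field):

def swap_augment(sample):
    """Create the position-swapped copy of a structured judge sample.
    The teacher's scores follow their answers, so the judgment keeps the
    same result while the answer positions are exchanged (cf. Fig. 15)."""
    score1, score2 = sample["scores"]
    return {**sample,
            "answer1": sample["answer2"],
            "answer2": sample["answer1"],
            "scores": (score2, score1)}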

5.2 REFERENCE SUPPORT

Introducing external knowledge is an intuitive way to compensate for gaps in the relevant pre-trained knowledge. To do so, we propose the reference support method, which teaches the model to judge with the help of reference answers. We collect reference answers for all judge samples and re-generate reference-guided judgments with GPT-4. Notably, GPT-4 gives different scores and judgments for most judge samples depending on whether references are provided, which indicates that the difference between pre-trained knowledge and reference answers greatly impacts judgments. As shown in Fig. 11, JudgeLM with reference support can avoid factual errors and give reliable judgments. Furthermore, reference support offers a simple way to inject judging preferences: a JudgeLM trained with reference support can flexibly use reference answers with different preferences for different scenarios and needs. As shown in Fig. 16, changing the reference answer requires no extra training and makes JudgeLM more flexible to different preferences.
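The preference-control property can be exercised at inference time alone, as in this hypothetical usage sketch; judge_with_reference is an assumed wrapper around a reference-support JudgeLM checkpoint, not a released API.

def preferred_answer(judge_with_reference, question, answer1, answer2, reference):
    """Return whichever answer the reference-guided judge scores higher."""
    score1, score2 = judge_with_reference(question, answer1, answer2, reference)
    return answer1 if score1 >= score2 else answer2

# Swapping in a different reference steers the judgment with no retraining
# (cf. Fig. 16): a concise reference may favor the terser answer, while a
# detailed reference may favor the more thorough one.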

5.3 REFERENCE DROP

To address format bias, we introduce a method named reference drop, in which we randomly drop a training sample with its reference and use the corresponding sample without the reference. As shown in Fig. 14, judge models trained with reference drop alleviate overfitting to the fine-tuning format and give fair judgments both with and without references. Furthermore, the reference drop method makes the judge model easier to use and decreases the cost of fitting it to different formats.
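Since Section 3.1 generates both a reference-guided and a reference-free judgment for each seed task, reference drop can be sketched as a random choice between the two paired variants; the drop ratio below is an assumed hyperparameter, not a value stated in the paper.

import random

def reference_drop(with_ref_sample, no_ref_sample, drop_prob=0.5):
    """Pick the no-reference variant of a paired judge sample with
    probability drop_prob, so fine-tuning sees both prompt formats and
    the judge does not overfit to either one."""
    return no_ref_sample if random.random() < drop_prob else with_ref_sample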

6 EXPERIMENTS

We study the performance of JudgeLM as follows: Section 6.1 presents the main results of JudgeLM compared with concurrent methods, Section 6.2 analyzes the scaling ability of JudgeLM in terms of both model size and data scale, and Section 6.3 presents detailed ablation studies of the proposed methods.

Table 1: Main results for our JudgeLM and concurrent methods on our val set, which uses GPT-4 annotation results as ground truth.

Table 2: JudgeLM zero-shot evaluation results on the PandaLM test set, which uses human annotation results as ground truth. “∗” means the results are reported in PandaLM (Wang et al., 2023).

6.1 MAIN RESULTS

Comparison on JudgeLM Benchmark. We first evaluate the proposed JudgeLM on our val set. As shown in Table 1, we give the quantitative results of GPT-3.5, PandaLM-7B, and our JudgeLM at three model sizes. Among them, GPT-3.5 is used in the form of APIs with the help of the templates in Fig. 3 and Fig. 4. PandaLM-7B is deployed with the released checkpoint and template. These two methods can be regarded as zero-shot methods because they are not fine-tuned on the JudgeLM dataset. Our JudgeLMs are fine-tuned with the proposed methods, i.e., swap augmentation, reference support, and reference drop, so they can handle situations with or without references simultaneously. It can be observed that our JudgeLM-7B outperforms PandaLM-7B in all metrics and even surpasses GPT-3.5. Furthermore, the proposed JudgeLM-33B exhibits the most powerful judging ability.

Comparison on PandaLM Benchmark. We also evaluate our JudgeLM zero-shot on the PandaLM test set. PandaLM uses the vote of three human annotators as ground truth. It requires swapping the positions of the two answers to perform inference twice, and conflicting evaluation results are modified to ‘Tie’. Following the protocol of this dataset, we present the zero-shot results of JudgeLM in Table 2. It can be observed that JudgeLM-7B outperforms GPT-3.5 and PandaLM-7B. Compared with GPT-4, JudgeLM-7B has lower accuracy but higher precision, recall, and F1-score. Furthermore, JudgeLM-33B achieves higher results than GPT-4, which demonstrates that a fine-tuned JudgeLM can outperform its teacher on this specific task.

Efficiency comparison. To further compare the efficiency of our JudgeLM and PandaLM, we conduct experiments on our val set to measure the time cost, using the same machine with 8 NVIDIA A100 (40G) GPUs. As shown in Table 3, we display the methods and model sizes in the first and second columns. The third column shows the GPUs needed for each judge model.

Table 3: Efficiency comparison for our JudgeLM and PandaLM on our val set. We use a machine with 8 Nvidia-A100 GPUs with 40G memory to evaluate their efficiency.

Table 4: Performance analysis for the scaling JudgeLM on our val set.

The models with 7B or 13B parameters run on 1 A100 GPU with 40G memory, while the 33B-parameter model needs 2 GPUs. The fourth column shows whether the methods can judge answers in parallel. PandaLM only runs on 1 GPU because it does not support parallel running, but all our JudgeLM models can run in parallel to fully utilize all 8 GPUs. The fifth column indicates whether judge reasons are generated at runtime. Our JudgeLM can choose whether to generate reasons, but PandaLM must generate reasons, which makes it time-consuming. The sixth column presents the total time cost, including the time for loading checkpoints, judging samples, and calculating metrics. It can be observed that our JudgeLM is much more flexible and efficient than PandaLM.

Compared to API-based judge methods, the proposed JudgeLM proves to be more cost-effective. For instance, using GPT-4 to evaluate the JudgeLM val set (without generating reasons) incurs a cost of approximately $60. In contrast, JudgeLM-7B accomplishes the same task at a cost of merely $0.5, just 0.8% of the expense of using GPT-4.

6.2 SCALING ANALYSIS OF JUDGELM

In this section, we analyze the scaling ability of the plain JudgeLM on our val set without references, as illustrated in Table 4. As we increase the model size and data scale, we observe that the metrics improve. This demonstrates that the proposed JudgeLM is scalable, reaching up to 90.06% agreement and 87.93% consistency with 33B parameters and 100K fine-tuning samples.

6.3 ABLATION STUDY

In this section, we present the ablation studies of the proposed methods. For all ablation studies, we use JudgeLM-7B as the base model and 3.5K data for fine-tuning. Based on this baseline, we analyze the improvements brought by swap augmentation, reference support, and reference drop.

Improvements of Swap augmentation. As shown in Table 5, swap augmentation brings comprehensive improvements to the baseline model. It improves consistency by 5.44%, which demonstrates

Table 5: Ablation study for the swap augmentation on our val set.

Table 6: Ablation study for the reference support and reference drop on our val set.

that swap augmentation can reduce the influence of position bias and push the judge to pay more attention to the contents of answers.

Improvements of Reference support. As shown in the rows of Table 6 with matching formats, JudgeLM fine-tuned with reference support exhibits superior performance on every metric. This demonstrates that the introduction of reference answers induces the judge to rely on external knowledge and addresses the limitations of pre-trained knowledge.

Improvements of Reference drop. As shown in Table 6, the baselines cannot reach satisfactory performance when facing mismatched formats. With the help of reference drop, JudgeLM can handle formats both with and without references and achieves higher agreement and consistency. This demonstrates that reference drop can address format bias and prevent JudgeLM from overfitting to a single format.

6.4 EXTENSIONS OF JUDGELM

Beyond judging answer pairs, our JudgeLM can also be applied to many other extended tasks, such as grading a single answer, judging and ranking multiple answers, evaluating multimodal models, and multi-turn chat about judgments.

Grading a single answer. Concurrent judge methods (Wang et al., 2023) usually judge a pair of answers to decide which one is better or whether they tie, but they lack the ability to evaluate a single answer. Thanks to our judging mode of scoring first and then deriving the judge result, our JudgeLM offers an alternative way to grade a single answer by slightly modifying the template, as shown in Fig. 5. By putting the reference answer in the first position and giving it a full grade as a prior, JudgeLM can give quantitative, fine-grained evaluations, as shown in Fig. 17.
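A hypothetical rendering of this trick follows; the template wording and the judgelm_generate wrapper are assumptions of ours, while the paper’s actual single-answer template is Fig. 5.

SINGLE_ANSWER_PROMPT = (
    "[Question]\n{question}\n"
    "[Answer 1 (reference answer, score: 10)]\n{reference}\n"
    "[Answer 2]\n{answer}\n"
    "Given that Answer 1 scores 10, grade Answer 2 on a 1-10 scale, "
    "then explain your grade."
)

def grade_single_answer(judgelm_generate, question, answer, reference):
    """Grade one answer by re-using the pairwise judge: the reference fills
    the Answer-1 slot with a pinned full score, so the second score becomes
    the single answer's grade. judgelm_generate is an assumed text-in/
    text-out wrapper around a JudgeLM checkpoint."""
    out = judgelm_generate(SINGLE_ANSWER_PROMPT.format(
        question=question, reference=reference, answer=answer))
    return float(out.split()[1])  # e.g. "10 7\n<reasons>" -> 7.0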

Judging multiple answers. To obtain the optimal ranking for N answers from different LLMs, other judge models need to call the model O(N²) times to fill the full comparison matrix, which is far less efficient. We resolve this limitation by extending our JudgeLM to process multiple answers at the same time, which only requires modifying the template as shown in Fig. 6. As shown in Fig. 18, JudgeLM can judge and rank multiple answers within the context limit of the LLM.
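A sketch of the single-call ranking, under an assumed multi-answer template (the actual one is Fig. 6) and the same assumed judgelm_generate wrapper as above:

def rank_answers(judgelm_generate, question, answers):
    """Rank N answers with one judge call instead of O(N^2) pairwise
    comparisons; all answers must fit in the model's context window."""
    blocks = "\n".join(f"[Answer {i + 1}]\n{a}" for i, a in enumerate(answers))
    out = judgelm_generate(
        f"[Question]\n{question}\n{blocks}\n"
        "Output one 1-10 score per answer on the first line, then reasons.")
    scores = [float(s) for s in out.splitlines()[0].split()]
    # Indices of the answers, best first.
    return sorted(range(len(answers)), key=lambda i: -scores[i])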

Judging multimodal models. Traditional multimodal evaluation requires the prediction to match the ground truth exactly. For open-ended questions, however, a human-like evaluator is needed to determine whether the prediction is close enough to the ground truth. Our JudgeLM provides a good practice for multimodal evaluation via the slightly modified template shown in Fig. 7. Thanks to its capacity to judge open-ended answers, our JudgeLM also performs well in judging multimodal models, as shown in Fig. 20. Furthermore, the proposed JudgeLM can easily be applied to modern open-ended multimodal benchmarks, replacing closed-source LLM judges like GPT-4. As shown in Fig. 21, Fig. 22, and Fig. 23, we utilize JudgeLM to evaluate Emu (Sun et al., 2023) on MM-Vet (Yu et al., 2023), obtaining specific scores and detailed reasons.

Multi-turn chat about judgments. It is worth noting that fine-tuning with judge samples does not compromise the multi-turn chat ability inherited from the base models. As illustrated in Fig. 19, our JudgeLM retains the capability to engage in meaningful dialogues with users, providing them with richer context, detailed information, additional examples, and specific details.

7 CONCLUSION

In this paper, we propose JudgeLM as scalable judges for evaluating LLMs in open-ended tasks efficiently, achieving state-of-the-art judge performance on two benchmarks. The three key biases in fine-tuning LLMs as judges are analyzed and addressed by the proposed methods. We hope our work can motivate more studies to explore the judge models for LLMs in open-ended tasks and build more powerful LLMs with guidance from judge models.

Limitations. Although the proposed JudgeLM achieves encouraging performance and efficiency, the cost of generating judge data limits further scaling of the judge dataset. We spent about $4,000 to provide 100K high-quality GPT-4-generated judge samples to the public. We expect to further improve the performance of judge models with the help of synthetic judge data.

REFERENCES

Cabrera Alex and Neubig Graham. Zeno chatbot report, 2023. URL https://github.com/zeno-ml/zeno-build/tree/main/examples/chatbot/report#zeno-chatbot-report. 3

Yuvanesh Anand, Zach Nussbaum, Brandon Duderstadt, Benjamin Schmidt, and Andriy Mulyar. Gpt4all: Training an assistant-style chatbot with large scale data distillation from gpt-3.5-turbo. https://github.com/nomic-ai/gpt4all, 2023. 4

Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023. 3

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397–2430. PMLR, 2023. 1

Niels J Blunch. Position bias in multiple-choice questions. Journal of Marketing Research, 21(2): 216–220, 1984. 5

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. 3

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023. 1, 3, 4, 6

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. 3

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022. 1, 3

Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023. URL https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm. 4

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020. 1

Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, et al. Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017, 2022. 3

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 13

Miyoung Ko, Jinhyuk Lee, Hyunjae Kim, Gangwoo Kim, and Jaewoo Kang. Look at the first sentence: Position bias in question answering. In 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, pp. 1109–1121. Association for Computational Linguistics (ACL), 2020. 5

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022. 1

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81, 2004. 1

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019. 13

OpenAI. Chatgpt, 2022. URL https://openai.com/blog/chatgpt/. 3

OpenAI. Gpt-4 technical report, 2023. 3

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 27730–27744, 2022. 3, 6

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318, 2002. 1

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023. 4

Maja Popovic. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pp. 392–395, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/W15-3049. URL https://aclanthology.org/W15-3049. 3

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020. 3

Priya Raghubir and Ana Valenzuela. Center-of-inattention: Position biases in decision-making. Organizational Behavior and Human Decision Processes, 99(1):66–80, 2006. 5

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021. 13

Thibault Sellam, Dipanjan Das, and Ankur Parikh. Bleurt: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7881–7892, 2020. 1

Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023. 3, 10

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Baptiste Roziere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. 1, 3, 4, 6

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b. 3

Xuanhui Wang, Nadav Golbandi, Michael Bendersky, Donald Metzler, and Marc Najork. Position bias estimation for unbiased learning to rank in personal search. In Proceedings of the eleventh ACM international conference on web search and data mining, pp. 610–618, 2018. 5

Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, et al. Pandalm: An automatic evaluation benchmark for llm instruction tuning optimization. arXiv preprint arXiv:2306.05087, 2023. 2, 3, 4, 5, 6, 7, 9

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023. 10

Weizhe Yuan, Graham Neubig, and Pengfei Liu. Bartscore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34:27263–27277, 2021. 1

Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023. 3

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022. 1, 3

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675, 2019. 1, 3

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023. 1, 3, 5, 6

A APPENDIX

A.1 FINE-TUNING SETTINGS

We list the hyper-parameters we used.

A.2 PROMPT TEMPLATES

We list all the prompt templates we used.

A.3 CASE STUDIES

We list several case studies.

Table 7: JudgeLM fine-tuning setting.

Figure 3: The template for judging answers without the reference.
Figure 4: The template for judging answers with the reference.
Figure 5: The template for grading a single answer. We set the reference answer in the position of Answer 1. Then, we set the score of the reference answer to 10. Last, the JudgeLM outputs the score of the single answer with such a prior.
Figure 6: The template for judging multiple answers.
Figure 7: The template for multimodal judging.
Figure 8: Bad judgment caused by position bias. The answer placed in the first position always gets a higher score, and the judge model generates reasons that attempt to make the scores appear reasonable.
Figure 9: Good judgment generated by the judge fine-tuned with swap augmentation. The judge can give judgments based on the content of answers rather than a certain position. The reason is convincing and reasonable.
Figure 10: Bad judgment caused by knowledge bias. This seed task lies outside the judge model’s pre-trained knowledge, so the judge model cannot judge it correctly and gives contradictory reasons in its judgment.
Figure 11: Good judgment generated by the judge model fine-tuned with reference support. Even though the judge model itself lacks related information, it can also give a reasonable judgment with the reference answer.
Figure 12: Bad judgment caused by format bias. For judging without reference, the judge model trained without reference is matched, so it performs well. However, the judge model trained with reference is mismatched, so it performs badly.
Figure 13: Bad judgment caused by format bias. For judging with reference, the judge model trained with reference is matched, so it performs well. However, the judge model trained without reference is mismatched, so it performs badly.
Figure 14: Good judgment generated by the judge model with reference drop, which addresses the preference for specific fine-tuning formats and gives fair judgments with or without reference.
Figure 15: An illustration of swap augmentation. We use swap augmentation to exchange the positions of answers, and our GPT-4-generated judgments can be modified correspondingly easily due to their structure.
Figure 16: An illustration of changing the reference answer to control model preference. When we change to a different reference answer, the model turns to prefer another answer.
Figure 17: An illustration of grading a single answer.
Figure 18: An illustration of judging multiple answers.
Figure 19: An illustration of multi-turn chat. Users can get more details, advice, examples, etc. by chatting with JudgeLM.
Figure 20: An illustration of multimodal judging. Our JudgeLM has the capacity to judge the VQA task without images.
Figure 21: An illustration of multimodal high-score grading on MM-Vet benchmark. The proposed JudgeLM can replace GPT-4 to grade multimodal answers.
Figure 22: An illustration of multimodal mid-score grading on MM-Vet benchmark. The proposed JudgeLM can replace GPT-4 to grade multimodal answers.
Figure 23: An illustration of multimodal low-score grading on MM-Vet benchmark. The proposed JudgeLM can replace GPT-4 to grade multimodal answers.