Nisan Stiennon∗ Long Ouyang∗ Jeff Wu∗ Daniel M. Ziegler∗ Ryan Lowe∗

Chelsea Voss∗ Alec Radford Dario Amodei Paul Christiano∗

OpenAI

Abstract

As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about: summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts [63] and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles [22], producing summaries nearly as good as the human reference without any news-specific fine-tuning.² We conduct extensive analyses to understand our human feedback dataset and fine-tuned models.³ We establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better summaries than optimizing ROUGE according to humans. We hope the evidence from our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.

1 Introduction

Large-scale language model pretraining has become increasingly prevalent for achieving high performance on a variety of natural language processing (NLP) tasks. When applying these models to a specific task, they are usually fine-tuned using supervised learning, often to maximize the log probability of a set of human demonstrations.

While this strategy has led to markedly improved performance, there is still a misalignment between this fine-tuning objective (maximizing the likelihood of human-written text) and what we care about (generating high-quality outputs as determined by humans). This misalignment has several causes: the maximum likelihood objective has no distinction between important errors (e.g. making up facts [41]) and unimportant errors (e.g. selecting the precise word from a set of synonyms); models are incentivized to place probability mass on all human demonstrations, including those that are low-quality; and distributional shift during sampling can degrade performance [56, 52]. Quality can often be improved significantly by non-uniform sampling strategies such as beam search [51], but these can lead to repetition and other undesirable artifacts [69, 23]. Optimizing for quality may be a principled approach to overcoming these problems.

Figure 1: Fraction of the time humans prefer our models’ summaries over the human-generated reference summaries on the TL;DR dataset.⁴ Since quality judgments involve an arbitrary decision about how to trade off summary length vs. coverage within the 24-48 token limit, we also provide length-controlled graphs in Appendix F; length differences explain about a third of the gap between feedback and supervised learning at 6.7B.

Our goal in this paper is to advance methods for training language models on objectives that more closely capture the behavior we care about. To make short-term progress towards this goal, we focus on abstractive English text summarization, as it has a long history in the NLP community [16, 8, 54, 59, 50], and is a subjective task where we believe it is difficult to quantify summary quality without human judgments. Indeed, existing automatic metrics for evaluating summary quality, such as ROUGE [39], have received criticism for poor correlation with human judgments [55, 45, 6, 33].

We follow the works of [3, 73], who fine-tune language models from human feedback using reward learning [35]. We first collect a dataset of human preferences between pairs of summaries, then train a reward model (RM) via supervised learning to predict the human-preferred summary. Finally, we train a policy via reinforcement learning (RL) to maximize the score given by the RM; the policy generates a token of text at each ‘time step’, and is updated using the PPO algorithm [58] based on the RM ‘reward’ given to the entire generated summary. We can then gather more human data using samples from the resulting policy, and repeat the process. We follow the works of [48, 4] and use large pretrained GPT-3 models with as many as 6.7 billion parameters.

Our main contributions are four-fold.

We show that training with human feedback significantly outperforms very strong baselines on English summarization. When applying our methods on a version of the Reddit TL;DR dataset [63], we train policies via human feedback that produce better summaries than much larger policies trained via supervised learning. Summaries from our human feedback models are preferred by our labelers to the original human demonstrations in the dataset (see Figure 1).
We show human feedback models generalize much better to new domains than supervised models. Our Reddit-trained human feedback models also generate high-quality summaries of news articles on the CNN/DailyMail (CNN/DM) dataset without any news-specific fine-tuning, almost matching the quality of the dataset’s reference summaries. We perform several checks to ensure that these human preferences reflect a real quality difference: we consistently monitor agreement rates amongst labelers and researchers, and find researcher-labeler agreement rates are nearly as high as researcher-researcher agreement rates (see Section C.2), and we verify models are not merely optimizing simple metrics like length or amount of copying (see Appendices F and G.7).
We conduct extensive empirical analyses of our policy and reward model. We examine the impact of model and data size (Figure 6), study performance as we continue to optimize a given reward model (Section 4.3), and analyze reward model performance using synthetic and human-written perturbations of summaries (Section 4.3). We confirm that our reward model outperforms other metrics such as ROUGE at predicting human preferences, and that optimizing our reward model directly results in better summaries than optimizing ROUGE according to humans (Section 4.4).
We publicly release our human feedback dataset for further research. The dataset contains 64,832 summary comparisons on the TL;DR dataset, as well as our evaluation data on both TL;DR (comparisons and Likert scores) and CNN/DM (Likert scores).

The methods we present in this paper are motivated in part by longer-term concerns about the misalignment of AI systems with what humans want them to do. When misaligned summarization models make up facts, their mistakes are fairly low-risk and easy to spot. However, as AI systems become more powerful and are given increasingly important tasks, the mistakes they make will likely become more subtle and safety-critical, making this an important area for further research.

2 Related work

Most directly related to our work is previous work using human feedback to train summarization models with RL [3, 73]. Bohm et al. [3] learn a reward function from a dataset of human ratings of 2.5k CNN/DM summaries, and train a policy whose summaries are preferred to a policy optimizing ROUGE. Our work is most similar to [73], who also train Transformer models [62] to optimize human feedback across a range of tasks, including summarization on the Reddit TL;DR and CNN/DM datasets. Unlike us, they train in an online manner and find the model highly extractive. They note that their labelers prefer extractive summaries and have low agreement rates with researchers. Compared to [73], we use significantly larger models, move to the batch setting for collecting human feedback, ensure high labeler-researcher agreement, and make some algorithmic modifications, such as separating the policy and value networks.

Human feedback has also been used as a reward to train models in other domains such as dialogue [25, 68, 21], translation [32, 1], semantic parsing [34], story generation [72], review generation [7], and evidence extraction [46]. Our reward modeling approach was developed in prior work on learning to rank [40], which has been applied to ranking search results using either explicit feedback [2, 18] or implicit feedback in the form of click-through data [29, 30]. In a related line of research, human feedback has been used to train agents in simulated environments [10, 24]. There is also a rich literature on using RL to optimize automatic metrics for NLP tasks, such as ROUGE for summarization [50, 65, 45, 15, 19], BLEU for translation [50, 66, 1, 43], and other domains [61, 27, 26]. Finally, there has been extensive research on modifying architectures [22, 59] and pre-training procedures [70, 36, 49, 60, 53, 14] for improving summarization performance.

3 Method and experiment details

3.1 High-level methodology

Our approach is similar to the one outlined in [73], adapted to the batch setting. We start with an initial policy that is fine-tuned via supervised learning on the desired dataset (in our case, the Reddit TL;DR summarization dataset). The process (illustrated in Figure 2) then consists of three steps that can be repeated iteratively.

Step 1: Collect samples from existing policies and send comparisons to humans. For each Reddit post, we sample summaries from several sources including the current policy, initial policy, original reference summaries and various baselines. We send a batch of pairs of summaries to our human evaluators, who are tasked with selecting the best summary of a given Reddit post.

Step 2: Learn a reward model from human comparisons. Given a post and a candidate summary, we train a reward model to predict the log odds that this summary is the better one, as judged by our labelers.

Step 3: Optimize a policy against the reward model. We treat the logit output of the reward model as a reward that we optimize using reinforcement learning, specifically with the PPO algorithm [58].

Figure 2: Diagram of our human feedback, reward model training, and policy training procedure.

We provide a more thorough description of our procedure, including details of the reward model and policy training and our quality control process, in the following sections. In practice, rather than precisely iterating this sequence of three steps, we updated our data collection and training procedures over the course of the project while accumulating labels (see Appendix C.6 for details).

3.2 Datasets and task

Datasets. We use the TL;DR summarization dataset [63], which contains ~3 million posts from reddit.com across a variety of topics (subreddits), as well as summaries of the posts written by the original poster (TL;DRs). We additionally filter this dataset (see Appendix A) to ensure quality, including using a whitelist of subreddits that are understandable to the general population. Crucially, we also filter to include only posts where the human-written summaries contain between 24 and 48 tokens, to minimize the potential effect of summary length on quality (see Section 4.1 and Appendix F). Our final filtered dataset contains 123,169 posts, and we hold out ~5% as a validation set. For the remainder of this paper, we refer to this dataset simply as TL;DR.

We chose the TL;DR dataset over the more commonly used CNN/DM dataset primarily because very strong performance can be attained on CNN/DM with simple extractive baselines. We find in Section 4.2 that our labelers prefer lead-3 over the CNN/DM reference summaries,⁵ and that the supervised T5 model [49] with low-temperature sampling already surpasses the reference summary quality, while copying extensively from the article. On the other hand, simple extractive baselines perform poorly on TL;DR in our human evaluations (see Appendix G.2). Instead of training on CNN/DM, we study the transfer performance of our human feedback models to CNN/DM after being trained to summarize Reddit posts.

Task. We define our ground-truth task as producing a model that generates summaries fewer than 48 tokens long that are as good as possible, according to our judgments. We judge summary quality by how faithfully the summary conveys the original post to a reader who can only read the summary and not the post (see Appendix C.5 for further discussion of criteria). Since we have limited capacity to do comparisons, we hire labelers to do the comparisons for us. We rely on detailed procedures to ensure high agreement between labelers and us on the task, which we describe in the next section.

Table 1: Example of post and samples on the TL;DR dataset, chosen to be particularly short. For random samples (along with posts), see Appendix H and our website.

3.3 Collecting human feedback

Previous work on fine-tuning language models from human feedback [73] reported “a mismatch between the notion of quality we wanted our model to learn, and what the humans labelers actually evaluated”, leading to model-generated summaries that were high-quality according to the labelers, but fairly low-quality according to the researchers.

Compared to [73], we implement two changes to improve human data quality. First, we transition entirely to the offline setting, where we alternate between sending large batches of comparison data⁶ to our human labelers and re-training our models on the cumulative collected data. Second, we maintain a hands-on relationship with labelers:⁷ we on-board them with detailed instructions, answer their questions in a shared chat room, and provide regular feedback on their performance. We train all labelers to ensure high agreement with our judgments, and continuously monitor labeler-researcher agreement over the course of the project. See Appendix C.1 and C.5 for details.

As a result of our procedure, we obtained high labeler-researcher agreement: on a subset of comparison tasks, labelers agree with researchers 77% ± 2% of the time, while researchers agree with each other 73% ± 4% of the time. We provide more analysis of our human data quality in Appendix C.2.

3.4 Models

All of our models are Transformer decoders [62] in the style of GPT-3 [47, 4]. We conduct our human feedback experiments on models with 1.3 billion (1.3B) and 6.7 billion (6.7B) parameters.

Pretrained models. Similarly to [12, 47], we start with models pretrained to autoregressively predict the next token in a large text corpus. As in [48, 4], we use these models as ‘zero-shot’ baselines by padding the context with examples of high-quality summaries from the dataset. We provide details on pretraining in Appendix B, and on our zero-shot procedure in Appendix B.2.

Supervised baselines. We next fine-tune these models via supervised learning to predict summaries from our filtered TL;DR dataset (see Appendix B for details). We use these supervised models to sample initial summaries for collecting comparisons, to initialize our policy and reward models, and as baselines for evaluation. In our final human evaluations, we use T=0 to sample from all models, as we found it performed better than higher temperatures or nucleus sampling (see Appendix B.1).
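To make the T=0 setting concrete, the following is a minimal sketch of temperature-based token selection. It is an illustration and not the sampling code used in our experiments; the function name `sample_token` and the convention of treating a temperature of exactly 0 as greedy (argmax) decoding are assumptions of the sketch.

```python
import math
import random


def sample_token(logits, temperature, rng=None):
    """Pick a token index from unnormalized logits.

    A temperature of 0 is treated as greedy decoding (argmax);
    otherwise we sample from softmax(logits / temperature).
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    scaled = [logit / temperature for logit in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

Lower temperatures concentrate probability on the highest-logit token, and T=0 removes sampling randomness entirely.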

To validate that our supervised models are indeed strong baselines for comparison, we run our supervised fine-tuning procedure with our 6.7B model on the CNN/DM dataset, and find that we achieve slightly better ROUGE scores than SOTA models [71] from mid-2019 (see Appendix G.4).

Reward models. To train our reward models, we start from a supervised baseline, as described above, then add a randomly initialized linear head that outputs a scalar value. We train this model to predict which summary y ∈ {y₀, y₁} is better as judged by a human, given a post x. If the summary preferred by the human is y_i, we can write the RM loss as:

loss(r_θ) = −E_{(x, y₀, y₁, i)∼D}[log(σ(r_θ(x, y_i) − r_θ(x, y_{1−i})))]

where r_θ(x, y) is the scalar output of the reward model for post x and summary y with parameters θ, and D is the dataset of human judgments. At the end of training, we normalize the reward model outputs such that the reference summaries from our dataset achieve a mean score of 0.
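As an illustrative sketch of the loss above (not our training code), the per-comparison term and the final normalization step can be written in plain Python. The function names are hypothetical, and the scalar scores stand in for the outputs r_θ(x, y) of an actual reward model.

```python
import math


def pairwise_rm_loss(r_preferred, r_other):
    """-log(sigmoid(r_preferred - r_other)) for a single comparison."""
    diff = r_preferred - r_other
    # -log(sigmoid(d)) == log(1 + exp(-d)); branch to avoid overflow in exp.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def mean_rm_loss(score_pairs):
    """Average the loss over (preferred_score, other_score) pairs."""
    return sum(pairwise_rm_loss(p, o) for p, o in score_pairs) / len(score_pairs)


def normalize_rewards(scores, reference_scores):
    """Shift scores so that the reference summaries have mean score 0."""
    offset = sum(reference_scores) / len(reference_scores)
    return [s - offset for s in scores]
```

When both summaries receive the same score, the loss equals log 2, and it shrinks toward 0 as the preferred summary's score exceeds the other's by a growing margin. Because the loss depends only on score differences, the constant shift in `normalize_rewards` leaves it unchanged.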

Human feedback policies. We want to use the reward model trained above to train a policy that generates higher-quality outputs as judged by humans. We primarily do this using reinforcement learning, by treating the output of the reward model as a reward for the entire summary that we maximize with the PPO algorithm [58], where each time step is a BPE token.⁸ We initialize our policy to be the model fine-tuned on Reddit TL;DR. Importantly, we include a term in the reward that penalizes the KL divergence between the learned RL policy π^RL with parameters φ and this original supervised model π^SFT, as previously done in [25]. The full reward R can be written as:

R(x, y) = r_θ(x, y) − β log[π^RL(y|x)/π^SFT(y|x)]

This KL term serves two purposes. First, it acts as an entropy bonus, encouraging the policy to explore and deterring it from collapsing to a single mode. Second, it ensures the policy doesn’t learn to produce outputs that are too different from those that the reward model has seen during training.
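The reward R(x, y) can be sketched as follows, assuming we already have the log probability of the full summary under each policy (the sum of per-token log probabilities). This is an illustration only; the function names and the β values in the usage note are hypothetical.

```python
def sequence_logprob(token_logprobs):
    """Log probability of a summary: the sum of its per-token log probabilities."""
    return sum(token_logprobs)


def kl_penalized_reward(rm_score, logp_rl, logp_sft, beta):
    """R(x, y) = r(x, y) - beta * log(pi_RL(y|x) / pi_SFT(y|x)).

    logp_rl and logp_sft are log pi_RL(y|x) and log pi_SFT(y|x).
    """
    return rm_score - beta * (logp_rl - logp_sft)
```

For example, if the RL policy assigns a summary log probability −8 while the supervised model assigns −10, then with β = 0.5 a reward model score of 1.0 is reduced to 0.0. A summary that both policies find equally likely incurs no penalty.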

For the PPO value function, we use a Transformer with completely separate parameters from the policy. This prevents updates to the value function from partially destroying the pretrained policy early in training (see ablation in Appendix G.1). We initialize the value function to the parameters of the reward model. In our experiments, the reward model, policy, and value function are the same size.

4 Results

4.1 Summarizing Reddit posts from human feedback

Policies trained with human feedback are preferred to much larger supervised policies. Our main results evaluating our human feedback policies on TL;DR are shown in Figure 1. We measure policy quality as the percentage of summaries generated by that policy that humans prefer over the reference summaries in the dataset. Our policies trained with human feedback significantly outperform our supervised baselines on this metric, with our 1.3B human feedback model significantly outperforming a supervised model 10× its size (61% versus 43% raw preference score against reference summaries). Our 6.7B model in turn significantly outperforms our 1.3B model, suggesting that training with human feedback also benefits from scale. Additionally, both of our human feedback models are judged by humans to be superior to the human demonstrations used in the dataset.

Controlling for summary length. When judging summary quality, summary length is a confounding factor. The target length of a summary is implicitly part of the summarization task; depending on the desired trade-off between conciseness and coverage, a shorter or longer summary might be better. Since our models learned to generate longer summaries, length could account for much of our quality improvements. We find that after controlling for length (Appendix F), the preference of our human feedback models vs. reference summaries drops by ~5%; even so, our 6.7B model summaries are still preferred to the reference summaries ~65% of the time.

How do our policies improve over the baselines? To better understand the quality of our models’ summaries compared to the reference summaries and those of our supervised baselines, we conduct an additional analysis where human labelers assess summary quality across four dimensions (or “axes”) using a 7-point Likert scale [38]. Labelers rated summaries for coverage (how much important information from the original post is covered), accuracy (to what degree the statements in the summary are stated in the post), coherence (how easy the summary is to read on its own), and overall quality.

Figure 4: Transfer results on CNN/DM. (a) Overall summary quality on CNN/DM as a function of model size. Full results across axes shown in Appendix G.2. (b) Overall scores vs. length for the 6.7B TL;DR supervised baseline, the 6.7B TL;DR human feedback model, and T5 fine-tuned on CNN/DM summaries. At similar summary lengths, our 6.7B TL;DR human feedback model nearly matches T5 despite never being trained to summarize news articles.

The results (Figure 3) indicate that our human feedback models outperform the supervised baselines across every dimension of quality, but particularly coverage. Although our human labelers had a high bar for giving perfect overall scores, summaries from our 6.7B PPO model achieve a 7/7 overall score 45% of the time (compared to 20% and 23% for the 6.7B supervised baseline and reference summaries, respectively).

4.2 Transfer to summarizing news articles

Our human feedback models can also generate excellent summaries of CNN/DM news articles without any further training (Figure 4). Our human feedback models significantly outperform models trained via supervised learning on TL;DR and models trained only on pretraining corpora. In fact, our 6.7B human feedback model performs almost as well as a 6.7B model that was fine-tuned on the CNN/DM reference summaries, despite generating much shorter summaries.

Figure 3: Evaluations of four axes of summary quality on the TL;DR dataset.

Since our human feedback models transferred to CNN/DM have little overlap in summary length distribution with models trained on CNN/DM, with about half as many tokens on average, they are difficult to compare directly. Thus our evaluations in Figure 4 use a 7-point Likert scale on four quality dimensions, as in Section 4.1 (see Appendix C.5 for labeler instructions). In Figure 4b we show the average overall score at different summary lengths, which suggests our human feedback models would perform even better if they generated longer summaries. Qualitatively, CNN/DM summaries from our human feedback models are consistently fluent and reasonable representations of the article; we show examples on our website and in Appendix H.

4.3 Understanding the reward model

What happens as we optimize the reward model? Optimizing against our reward model is supposed to make our policy align with human preferences. But the reward model isn’t a perfect representation of our labeler preferences, as it has limited capacity and only sees a small amount of comparison data from a relatively narrow distribution of summaries. While we can hope our reward model generalizes to summaries unseen during training, it’s unclear how much one can optimize against the reward model until it starts giving useless evaluations.

To answer this question, we created a range of policies optimized against an earlier version of our reward model, with varying degrees of optimization strength, and asked labelers to compare samples from them to the reference summaries. Figure 5 shows the results for PPO at a range of KL penalty coefficients (β). Under light optimization, the models improve (according to labelers). However, as we optimize further, true preferences fall off compared to the prediction, and eventually the reward model becomes anti-correlated with human preferences. Though this is clearly undesirable, we note that this over-optimization also happens with ROUGE (see [45] and Appendix G.3). Similar behavior has been observed in learned reward functions in the robotics domain [5].

Figure 5: Preference scores versus degree of reward model optimization. Optimizing against the reward model initially improves summaries, but eventually overfits, giving worse summaries. This figure uses an earlier version of our reward model (see rm3 in Appendix C.6). See Appendix H.2 for samples from the KL 250 model.

Figure 6: Reward model performance versus data size and model size. Doubling the amount of training data leads to a ~1.1% increase in reward model validation accuracy, whereas doubling the model size leads to a ~1.8% increase. The 6.7B model trained on all data begins approaching the accuracy of a single human.

How does reward modeling scale with increasing model and data size? We conduct an ablation to determine how data quantity and model size affect reward modeling performance. We train 7 reward models ranging from 160M to 13B parameters, on 8k to 64k human comparisons from our dataset. We find that doubling the training data amount leads to a ~1.1% increase in the reward model validation set accuracy, whereas doubling the model size leads to a ~1.8% increase (Figure 6).
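As a back-of-the-envelope illustration of these per-doubling figures, the sketch below counts doublings between two sizes and multiplies by the gain per doubling. It assumes the approximately log-linear trend holds across the whole range, which is a simplification and may not hold near the accuracy ceiling; the function names are hypothetical.

```python
import math


def doublings(start, end):
    """Number of times `start` must be doubled to reach `end`."""
    return math.log2(end / start)


def projected_gain(start, end, gain_per_doubling):
    """Accuracy gain in percentage points, assuming a constant gain per doubling."""
    return doublings(start, end) * gain_per_doubling
```

For instance, going from 8k to 64k comparisons is three doublings, so under this assumption `projected_gain(8_000, 64_000, 1.1)` gives roughly 3.3 percentage points of validation accuracy.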

What has the reward model learned? We probe our reward model by evaluating it on several validation sets. We show the full results in Appendix G.6, and highlight them here. We find that our reward models generalize to evaluating CNN/DM summaries (Appendix G.7), agreeing with labeler preferences 62.4% and 66.5% of the time (for our 1.3B and 6.7B models, respectively). Our 6.7B reward model nearly matches the inter-labeler agreement value of 66.9%.

We also find that our reward models are sensitive to small but semantically important details in the summary. We construct an additional validation set by having labelers make minimal edits to summaries to improve them. Our RMs prefer the edited summaries almost as often (79.4% for 1.3B and 82.8% for 6.7B) as a separate set of human evaluators (84.1%). Further, when comparing the reference summaries to perturbed summaries where the participants’ roles are reversed, our models reliably select the original summary (92.9% of the time for 1.3B, 97.2% for 6.7B). However, our RMs are biased towards longer summaries: our 6.7B RM prefers improving edits that make the summary shorter only 62.6% of the time (vs. 76.4% for humans).

4.4 Analyzing automatic metrics for summarization

Evaluation. We study how well various automatic metrics act as predictors for human preferences, and compare them to our RMs. Specifically, we examine ROUGE, summary length, amount of copying from the post,⁹ and log probability under our baseline supervised models. We present a full matrix of agreement rates between these metrics in Appendix G.7.
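The notion of agreement used here can be sketched as follows: a metric agrees with a labeler on a comparison when the summary it scores higher is the one the labeler preferred. This is a simplified illustration with a hypothetical function name and data layout; in particular, it counts a tie in metric scores as a preference for the second summary.

```python
def metric_agreement(comparisons):
    """Fraction of comparisons where the metric picks the human-preferred summary.

    comparisons: list of (score_a, score_b, human_prefers_a) tuples, where the
    scores come from any scalar metric (ROUGE, length, log probability, RM output).
    """
    agree = 0
    for score_a, score_b, human_prefers_a in comparisons:
        metric_prefers_a = score_a > score_b  # ties count as preferring b
        agree += int(metric_prefers_a == human_prefers_a)
    return agree / len(comparisons)
```

An agreement rate of 0.5 on balanced comparisons corresponds to chance, which is the baseline against which the rates in this section should be read.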

We find that our learned reward models consistently outperform other metrics, even on the CNN/DM dataset on which they were never trained. We also find that ROUGE fails to track sample quality as our models improve. While ROUGE has ~57% agreement with labelers when comparing samples from our supervised baseline models, this drops to ~50% for samples from our human feedback model.

Figure 7: Summary quality as a function of metric optimized and amount of optimization, using best-of-N rejection sampling. We evaluate ROUGE, our main reward models, and an earlier iteration of the 1.3B model trained on approximately 75% as much data (see Table 11 for details). ROUGE appears to peak both sooner and at a substantially lower preference rate than all reward models. Details in Appendix G.3.

Similarly, log probability agreement with humans drops to ≤50% on comparisons between samples from our human feedback models, while our RMs still perform above chance (62%). Scaling up the size of the supervised model does not reliably improve log probability’s agreement with labelers.

Optimization. In Figure 7, we show that optimizing ROUGE using a simple optimization scheme doesn’t consistently increase quality, as has been noted in [45]. Optimization against ROUGE peaks both sooner and at a substantially lower quality rate than optimization against our reward models.
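The best-of-N rejection sampling scheme referred to in Figure 7 can be sketched in a few lines: draw N candidate summaries, score each with the metric being optimized, and keep the highest-scoring one. The sketch below is illustrative only; `best_of_n` and the toy scorer in the usage note are hypothetical stand-ins for a real sampler and a real metric such as ROUGE or a reward model.

```python
def best_of_n(candidates, score_fn):
    """Return the candidate with the highest score under score_fn.

    candidates: N summaries sampled from a policy for the same post.
    score_fn:   any scalar metric (e.g. a reward model or ROUGE against a reference).
    """
    return max(candidates, key=score_fn)
```

For example, `best_of_n(["a", "bbb", "cc"], len)` returns `"bbb"` under the toy scorer `len`. Increasing N applies more optimization pressure toward the metric, so sweeping N gives a simple way to vary the amount of optimization.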

5 Discussion

Limitations. One limitation of our work is the time and cost required to produce our final models. Notably, fine-tuning our 6.7B model with RL required approximately 320 GPU-days. Our data collection procedure is also expensive compared to prior work: the training set took thousands of labeler hours and required significant researcher time to ensure quality. For this reason, we were unable to collect baselines such as an equivalent amount of high-quality human demonstrations for supervised baselines. See Appendix D for more discussion. We leave this ablation to future work. Nevertheless, we believe reward modeling is more likely to scale to tasks where it is extremely skill-intensive or time-consuming to provide good demonstrations.

Future directions. The methods in this paper could be applied to any task where humans can compare samples, including dialogue, machine translation, question answering, speech synthesis, and music generation. We expect this method to be particularly important for generating long samples, where the distributional shift and degeneracy of maximum likelihood samples can be problematic. It may be possible to improve sample efficiency by training to predict feedback across many tasks [42].

We are particularly interested in scaling human feedback to tasks where humans can’t easily evaluate the quality of model outputs. In this setting, it is particularly challenging to identify whether an ML system is aligned with the human designer’s intentions. One approach is to train ML systems to help humans perform the evaluation task quickly and accurately [9].

There is also a rich landscape of human feedback methods beyond binary comparisons that could be explored for training models [28, 17, 44, 64]. For example, we could solicit high-quality demonstrations from labelers, have labelers edit model outputs to make them better, or have labelers provide explanations for why they preferred one model output over another. All of this feedback could be leveraged as a signal to train more capable reward models and policies.

Broader impacts. The techniques we explore in this paper are generic techniques that could be used in a wide variety of machine learning applications, for any task where it is feasible for humans to evaluate the quality of model outputs. Thus, the potential implications are quite broad.

Our research is primarily motivated by the potential positive effects of aligning machine learning algorithms with the designer’s preferences. Many machine learning applications optimize simple metrics which are only rough proxies for what the designer intends. This can lead to problems, such as YouTube recommendations promoting click-bait [11]. In the short term, improving techniques for learning from and optimizing human preferences directly may enable these applications to be more aligned with human well-being.

In the long term, as machine learning systems become more capable it will likely become increasingly difficult to ensure that they are behaving safely: the mistakes they make might be more difficult to spot, and the consequences will be more severe. For instance, writing an inaccurate summary of a news article is both easy to notice (one simply has to read the original article) and has fairly low consequences. On the other hand, imitating human driving may be substantially less safe than driving to optimize human preferences. We believe that the techniques we explore in this paper are promising steps towards mitigating the risks from such capable systems, and better aligning them with what humans care about.

Unfortunately, our techniques also enable malicious actors to more easily train models that cause societal harm. For instance, one could use human feedback to fine-tune a language model to be more persuasive and manipulate humans’ beliefs, or to induce dependence of humans on the technology, or to generate large amounts of toxic or hurtful content intended to harm specific individuals. Avoiding these outcomes is a significant challenge for which there are few obvious solutions.

Large-scale models trained with human feedback could have significant impacts on many groups. Thus, it is important to be careful about how we define the ‘good’ model behavior that human labelers will reinforce. Deciding what makes a good summary is fairly straightforward, but doing this for tasks with more complex objectives, where different humans might disagree on the correct model behavior, will require significant care. In these cases, it is likely not appropriate to use researcher labels as the ‘gold standard’; rather, individuals from groups impacted by the technology should be included in the process to define ‘good’ behavior, and hired as labelers to reinforce this behavior in the model.

We chose to train on the Reddit TL;DR dataset because the summarization task is significantly more challenging than on CNN/DM. However, since the dataset consists of user-submitted posts with minimal moderation, the posts often contain content that is offensive or reflects harmful social biases. This means our models can generate biased or offensive summaries, as they have been trained to summarize such content. For this reason, we recommend that the potential harms of our models be thoroughly studied before deploying them in user-facing applications.

Finally, by improving the ability of machine learning algorithms to perform tasks that were previously only achievable by humans, we are increasing the likelihood of many jobs being automated, potentially leading to significant job loss. Without suitable policies targeted at mitigating the effects of large-scale unemployment, this could also lead to significant societal harm.

Acknowledgements

We’d like to thank Beth Barnes for help with labeler hiring and general encouragement; Geoffrey Irving for guidance on earlier iterations of the project and inspiring conversations; Ben Mann, Tom Brown, Nick Ryder, and Melanie Subbiah for training and evaluating our pretrained models; Chris Hesse, Eric Sigler, Benjamin Chess, Christopher Berner, Clemens Winter, Mateusz Litwin, and many others for supporting us through computing infrastructure improvements and maintenance; Scott Gray for writing fast GPU kernels; Arvind Neelakantan and Wojciech Kryscinski for discussions on how to present the work, experiment design, and what datasets to use; Shan Carter for help designing the main diagram; Douwe Kiela, Zach Lipton, and Alex Irpan for providing feedback on the paper; and Gretchen Krueger for co-writing the model card accompanying the paper.

Finally, we’d like to thank all of our contractors for providing the data that was essential for training the models in this paper, including: Emill Jayson Caypuno, Rachelle Froyalde, Cyra Denura, Alex Malek, Isik Agil, Reshmi Patel, William Yap, Natalie Silver, Erol Akbaba, Jennifer Brillo, Alexandra Uifalean, Morris Stuttard, Russell Bernandez, Tasmai Dave, Rachel Wallace, Jenny Fletcher, Jian Ouyang, Justin Dill, Maria Orzek, Megan Niffenegger, William Sells, Emily Mariner, Andrew Seely, Lychelle Ignacio, Jelena Ostojic, Nhan Tran, Purev Batdelgar, Valentina Kezic, Michelle Wilkerson, Kelly Guerrero, Heather Scott, Sarah Mulligan, Gabriel Ricafrente, Kara Bell, Gabriel Perez, and Alfred Lee.

References

[1] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.

[2] B. T. Bartell, G. W. Cottrell, and R. K. Belew. Automatic combination of multiple ranked retrieval systems. In SIGIR’94, pages 173–181. Springer, 1994.

[3] F. Böhm, Y. Gao, C. M. Meyer, O. Shapira, I. Dagan, and I. Gurevych. Better rewards yield better summaries: Learning to summarise without references. arXiv preprint arXiv:1909.01214, 2019.

[4] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. 2020.

[5] S. Cabi, S. Gómez Colmenarejo, A. Novikov, K. Konyushkova, S. Reed, R. Jeong, K. Zolna, Y. Aytar, D. Budden, M. Vecerik, et al. Scaling data-driven robotics with reward sketching and batch reinforcement learning. arXiv, pages arXiv–1909, 2019.

[6] A. T. Chaganty, S. Mussman, and P. Liang. The price of debiasing automatic metrics in natural language evaluation. arXiv preprint arXiv:1807.02202, 2018.

[7] W. S. Cho, P. Zhang, Y. Zhang, X. Li, M. Galley, C. Brockett, M. Wang, and J. Gao. Towards coherent and cohesive long-form text generation. arXiv preprint arXiv:1811.00511, 2018.

[8] S. Chopra, M. Auli, and A. M. Rush. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98, 2016.

[9] P. Christiano, B. Shlegeris, and D. Amodei. Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575, 2018.

[10] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017.

[11] P. Covington, J. Adams, and E. Sargin. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems, pages 191–198, 2016.

[12] A. M. Dai and Q. V. Le. Semi-supervised sequence learning. In Advances in neural information processing systems, pages 3079–3087, 2015.

[13] J. Dodge, G. Ilharco, R. Schwartz, A. Farhadi, H. Hajishirzi, and N. Smith. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305, 2020.

[14] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H.-W. Hon. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 2019.

[15] Y. Dong, Y. Shen, E. Crawford, H. van Hoof, and J. C. K. Cheung. Banditsum: Extractive summarization as a contextual bandit. arXiv preprint arXiv:1809.09672, 2018.

[16] B. Dorr, D. Zajic, and R. Schwartz. Hedge trimmer: A parse-and-trim approach to headline generation. In Proceedings of the HLT-NAACL 03 on Text summarization workshop-Volume 5, pages 1–8. Association for Computational Linguistics, 2003.

[17] S. Fidler et al. Teaching machines to describe images with natural language feedback. In Advances in Neural Information Processing Systems, pages 5068–5078, 2017.

[18] N. Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems (TOIS), 7(3):183–204, 1989.

[19] Y. Gao, C. M. Meyer, M. Mesgar, and I. Gurevych. Reward learning for efficient reinforcement learning in extractive document summarisation. arXiv preprint arXiv:1907.12894, 2019.

[20] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.

[21] B. Hancock, A. Bordes, P.-E. Mazare, and J. Weston. Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415, 2019.

[22] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. In Advances in neural information processing systems, pages 1693–1701, 2015.

[23] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.

[24] B. Ibarz, J. Leike, T. Pohlen, G. Irving, S. Legg, and D. Amodei. Reward learning from human preferences and demonstrations in atari. In Advances in neural information processing systems, pages 8011–8023, 2018.

[25] N. Jaques, A. Ghandeharioun, J. H. Shen, C. Ferguson, A. Lapedriza, N. Jones, S. Gu, and R. Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.

[26] N. Jaques, S. Gu, D. Bahdanau, J. M. Hernández-Lobato, R. E. Turner, and D. Eck. Sequence tutor: Conservative fine-tuning of sequence generation models with kl-control. In International Conference on Machine Learning, pages 1645–1654. PMLR, 2017.

[27] N. Jaques, S. Gu, R. E. Turner, and D. Eck. Tuning recurrent neural networks with reinforcement learning. 2017.

[28] H. J. Jeon, S. Milli, and A. D. Dragan. Reward-rational (implicit) choice: A unifying formalism for reward learning. arXiv preprint arXiv:2002.04833, 2020.

[29] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 133–142, 2002.

[30] T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay. Accurately interpreting clickthrough data as implicit feedback. In ACM SIGIR Forum, volume 51, pages 4–11. ACM New York, NY, USA, 2005.

[31] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[32] J. Kreutzer, S. Khadivi, E. Matusov, and S. Riezler. Can neural machine translation be improved with user feedback? arXiv preprint arXiv:1804.05958, 2018.

[33] W. Kryscinski, N. S. Keskar, B. McCann, C. Xiong, and R. Socher. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 540–551, 2019.

[34] C. Lawrence and S. Riezler. Improving a neural semantic parser by counterfactual learning from human bandit feedback. arXiv preprint arXiv:1805.01252, 2018.

[35] J. Leike, D. Krueger, T. Everitt, M. Martic, V. Maini, and S. Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018.

[36] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.

[37] M. Li, J. Weston, and S. Roller. Acute-eval: Improved dialogue evaluation with optimized questions and multi-turn comparisons. arXiv preprint arXiv:1909.03087, 2019.

[38] R. Likert. A technique for the measurement of attitudes. Archives of psychology, 1932.

[39] C.-Y. Lin and F. J. Och. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 605. Association for Computational Linguistics, 2004.

[40] T.-Y. Liu. Learning to rank for information retrieval. Springer Science & Business Media, 2011.

[41] J. Maynez, S. Narayan, B. Bohnet, and R. McDonald. On faithfulness and factuality in abstractive summarization, 2020.

[42] B. McCann, N. S. Keskar, C. Xiong, and R. Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.

[43] K. Nguyen, H. Daumé III, and J. Boyd-Graber. Reinforcement learning for bandit neural machine translation with simulated human feedback. arXiv preprint arXiv:1707.07402, 2017.

[44] T. Niu and M. Bansal. Polite dialogue generation without parallel data. Transactions of the Association for Computational Linguistics, 6:373–389, 2018.

[45] R. Paulus, C. Xiong, and R. Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017.

[46] E. Perez, S. Karamcheti, R. Fergus, J. Weston, D. Kiela, and K. Cho. Finding generalizable evidence by learning to convince q&a models. arXiv preprint arXiv:1909.05863, 2019.

[47] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.

[48] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

[49] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

[50] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.

[51] D. R. Reddy et al. Speech understanding systems: A summary of results of the five-year research effort. department of computer science, 1977.

[52] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011.

[53] S. Rothe, S. Narayan, and A. Severyn. Leveraging pre-trained checkpoints for sequence generation tasks. Transactions of the Association for Computational Linguistics, 2020.

[54] A. M. Rush, S. Chopra, and J. Weston. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685, 2015.

[55] N. Schluter. The limits of automatic summarisation according to rouge. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 41–45, 2017.

[56] F. Schmidt. Generalization in generation: A closer look at exposure bias. arXiv preprint arXiv:1910.00292, 2019.

[57] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.

[58] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[59] A. See, P. J. Liu, and C. D. Manning. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368, 2017.

[60] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu. Mass: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450, 2019.

[61] P. Tambwekar, M. Dhuliawala, A. Mehta, L. J. Martin, B. Harrison, and M. O. Riedl. Controllable neural story generation via reinforcement learning. arXiv preprint arXiv:1809.10736, 2018.

[62] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.

[63] M. Völske, M. Potthast, S. Syed, and B. Stein. Tl; dr: Mining reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59–63, 2017.

[64] S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho, and J. Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319, 2019.

[65] Y. Wu and B. Hu. Learning to extract coherent summary via deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[66] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[67] Y. Yan, W. Qi, Y. Gong, D. Liu, N. Duan, J. Chen, R. Zhang, and M. Zhou. Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training. arXiv preprint arXiv:2001.04063, 2020.

[68] S. Yi, R. Goel, C. Khatri, A. Cervone, T. Chung, B. Hedayatnia, A. Venkatesh, R. Gabriel, and D. Hakkani-Tur. Towards coherent and engaging spoken dialog response generation using automatic conversation evaluators. arXiv preprint arXiv:1904.13015, 2019.

[69] H. Zhang, D. Duckworth, D. Ippolito, and A. Neelakantan. Trading off diversity and quality in natural language generation. arXiv preprint arXiv:2004.10450, 2020.

[70] J. Zhang, Y. Zhao, M. Saleh, and P. J. Liu. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. arXiv preprint arXiv:1912.08777, 2019.

[71] Y. Zhang, D. Li, Y. Wang, Y. Fang, and W. Xiao. Abstract text summarization with a convolutional seq2seq model. Applied Sciences, 9(8):1665, 2019.

[72] W. Zhou and K. Xu. Learning to compare for better training and evaluation of open domain natural language generation models. arXiv preprint arXiv:2002.05058, 2020.

[73] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

Appendix

Table of Contents

A TL;DR dataset details
B Further model training details
  B.1 Hyperparameters
  B.2 Input format
C Human data collection details
  C.1 Process for ensuring high-quality human data
  C.2 Assessing human feedback quality
  C.3 Labeler demographics
  C.4 Labeler website
  C.5 Instructions for labelers
  C.6 Composition of the labeled dataset
  C.7 Example comparison tasks
D Choice of baselines
E CNN/DM lead-3 vs reference summaries
F Controlling for summary length
G Additional results
  G.1 Value function ablation
  G.2 Evaluating policies along axes of quality
  G.3 Studying best-of-N optimization
  G.4 ROUGE scores
  G.5 Bigram overlap statistics
  G.6 Reward model validation sets
  G.7 Measuring agreement between different evaluation metrics
H Samples
  H.1 Random samples
  H.2 Overoptimized samples

A TL;DR dataset details

Here, we discuss the pre-processing steps that we apply to the TL;DR dataset. We first remove all duplicate posts by checking the text body, finding that there are nearly 20,000 exact duplicates. We then re-parse the TL;DR carefully using a set of heuristics, and filter to use only top-level posts (rather than comments). We also filter out any post that is from a subreddit not in our ‘subreddit whitelist’ (see Table 2 for the distribution over subreddits), any post where the title starts with some variant of ‘Edit’ or ‘Update’,¹⁰ and posts that contain certain topics (such as graphic sex or suicide) using heuristics. Finally, to ensure the posts are short enough to fit into the context length of our models, we filter out any post whose body is longer than 512 tokens. This resulted in a set of 287,790 posts filtered by body but not summary, of which we hold out approximately 5% as a validation set. We used this set of posts for RL training since our RL procedure does not require reference summaries.
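The post-level filters described above can be sketched as follows. This is a minimal illustration, not the pipeline used for the paper: the `Post` record, the whitelist contents, and the whitespace-based `count_tokens` stand-in for byte-pair encoding are all assumptions, and the TL;DR re-parsing, top-level-post check, and topic heuristics are omitted.

```python
from dataclasses import dataclass


@dataclass
class Post:
    subreddit: str
    title: str
    body: str


# Hypothetical whitelist for illustration only.
SUBREDDIT_WHITELIST = {"relationships", "AskReddit"}
MAX_BODY_TOKENS = 512


def count_tokens(text):
    # Stand-in tokenizer (assumption): whitespace split instead of BPE.
    return len(text.split())


def filter_posts(posts):
    """Apply deduplication, whitelist, title, and length filters in order."""
    seen_bodies = set()
    kept = []
    for post in posts:
        # Remove exact duplicates by comparing the text body.
        if post.body in seen_bodies:
            continue
        seen_bodies.add(post.body)
        # Keep only posts from whitelisted subreddits.
        if post.subreddit not in SUBREDDIT_WHITELIST:
            continue
        # Drop titles starting with a variant of 'Edit' or 'Update'.
        if post.title.lower().startswith(("edit", "update")):
            continue
        # Drop posts whose body exceeds the length limit.
        if count_tokens(post.body) > MAX_BODY_TOKENS:
            continue
        kept.append(post)
    return kept
```

Recording the body before the other checks mirrors the order in the text, where duplicates are removed first and the remaining filters are applied afterwards.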

Table 2: Number of posts in the training set of our filtered Reddit TL;DR dataset by subreddit.

We next perform additional filtering on the parsed reference summaries that we use for training our supervised baselines. Specifically, we remove summaries where the TL;DR starts with variants of ‘Edit’, ‘Update’, or ‘P.S.’, we heuristically remove summaries with certain levels of profanity, and we remove summaries that are less than 24 tokens or more than 48 tokens. As discussed in Section 4.1, since our RL models tend to generate summaries on the upper end of the allowed length limit, this length filtering ensures that there is enough length overlap between the RL summaries and reference summaries for us to perform a length-controlled analysis. Additionally, we found that summaries shorter than 16 tokens were usually of low quality. We later verified that the summaries we filtered out were lower quality according to our reward model: more than 0.5 nats worse on average (i.e. they are predicted to be exp(0.5) ≈ 1.6 times less likely to be preferred). Our final TL;DR dataset contains 123,169 posts including summaries, again with about 5% held out as a validation set. We use 1913 of these validation articles for model selection during development; the evaluations in this paper exclude these articles.
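As a quick check of the arithmetic above, the sketch below converts a reward gap in nats into preference odds and a preference probability. It assumes the reward difference is read as the log-odds that one summary is preferred over the other, which is the interpretation the sentence above uses; the function names are illustrative.

```python
import math


def preference_odds(reward_gap_nats):
    """Odds in favour of the higher-reward summary, reading the gap as log-odds."""
    return math.exp(reward_gap_nats)


def preference_probability(reward_gap_nats):
    """Probability that the higher-reward summary is preferred (logistic of the gap)."""
    return 1.0 / (1.0 + math.exp(-reward_gap_nats))


odds = preference_odds(0.5)
prob = preference_probability(0.5)
```

Under this reading, a gap of 0.5 nats corresponds to odds of about 1.6 to 1, or roughly a 62% chance that the higher-reward summary is preferred.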

Table 2 shows that about two thirds of our TL;DR dataset consists of posts relating to relationships or relationship advice, which is a fairly specific domain. This raises potential concerns about the generality of our models, though their strong transfer performance on CNN/DM news articles suggests they are not unreasonably specialized to relationship advice.

Model size	n_layers	d_model	n_heads	Max LR	Max batch size
1.3B	24	2048	16	2e-4	512
3B	32	2560	32	1.6e-4	512
6.7B	32	4096	32	1.2e-4	512
13B	40	5120	40	1e-4	1024

Table 3: Hyperparameters for our models of various sizes.

Figure 8: The sweep we conducted for determining our sampling procedure, varying the temperature and the ‘top p’ value for nucleus sampling. While we didn’t do a large enough test to determine whether nucleus sampling is better or worse than moderate-temperature sampling, we found that very low temperature sampling is better than both on this task.

B Further model training details

B.1 Hyperparameters

All models follow the standard Transformer architecture, with 2048 learned position embeddings. All models are trained with fp16 activations and the Adam optimizer [31]. Nearly all supervised baselines, reward models, and reinforcement learning models are trained with fp32 weights; the exception is our TL;DR supervised baselines, which were trained with fp16 weights.¹¹ All models are trained with the same byte-pair encoding as in [48].

During pretraining, the models were trained to predict the next token on a large text corpus consisting of Commoncrawl, Webtext [48], books, and Wikipedia. Training lasts between 1-3 epochs on each, for a total of 200-300 billion tokens. Learning rate follows a cosine schedule, with a short warmup, decaying to 10% of the maximum value. The batch size ramped up throughout training to some maximum, with each input having 2048 tokens. Hyperparameters for each model are shown in Table 3.
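The learning rate schedule described above can be sketched as a function of the training step. The text specifies a cosine decay to 10% of the maximum with a short warmup; the linear warmup shape and the `warmup_steps` default below are assumptions made for illustration.

```python
import math


def pretraining_lr(step, total_steps, max_lr, warmup_steps=100, final_fraction=0.1):
    """Cosine learning rate schedule with warmup, decaying to final_fraction * max_lr."""
    if step < warmup_steps:
        # Linear warmup (assumed shape): reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = final_fraction * max_lr
    return min_lr + (max_lr - min_lr) * cosine
```

For example, with `max_lr=2e-4` (the 1.3B value in Table 3) the rate peaks at 2e-4 after warmup and ends at 2e-5.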

For supervised baselines, we initialize models from the pretrained models. We decay the learning rate with a cosine schedule, using an initial learning rate chosen from a log linear sweep of at least 7 values. This resulted in learning rates of 6.35e-5, 5.66e-5, 2.83e-5, and 2.83e-5 for our TL;DR models of size 1.3B, 3B, 6.7B, and 13B respectively, and a learning rate of 2.38e-5 for our CNN/DM 6.7B model. We use a batch size of 128, and run for a single epoch.

For reward modeling, we initialize to the supervised baseline, but with a reward head on top with weights initialized according to N(0, 1/(d_model + 1)) [20]. We train for one epoch, decaying the learning rate with a cosine schedule, using an initial learning rate chosen from a log linear sweep of at least 7 values. We also sweep over between 3 and 10 seeds, and choose the reward model that performs best on the development portion of the validation set, as we find that both the data iteration order and reward head initialization affect results [13]. For our main results, the 1.3B and 6.7B reward models had learning rates of 1.5e-5 and 5e-6, respectively. We use a batch size of 64, and run for a single epoch.
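A minimal sketch of the reward head initialization follows. The text does not say whether the second argument of N(0, 1/(d_model + 1)) is a variance or a standard deviation; this sketch assumes it is a variance. It also assumes a bias-free linear head that maps a single hidden state to a scalar, and the function names are hypothetical.

```python
import math
import random


def init_reward_head(d_model, seed=0):
    """Draw d_model head weights from a zero-mean normal with variance 1/(d_model + 1)."""
    rng = random.Random(seed)
    std = math.sqrt(1.0 / (d_model + 1))  # assumption: 1/(d_model + 1) is the variance
    return [rng.gauss(0.0, std) for _ in range(d_model)]


def scalar_reward(hidden_state, head_weights):
    """Project a hidden state of size d_model to a scalar reward."""
    return sum(h * w for h, w in zip(hidden_state, head_weights))


head = init_reward_head(2048)
```

Because the seed changes the drawn weights, sweeping over seeds as described above yields differently initialized heads.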

For PPO, we run with separate policy and value networks, initializing our policies to the supervised baseline, and our value functions to the reward model. We set γ = 1 and λ = 0.95 for the advantage estimation [57] and do 4 epochs of optimization for each batch of rollouts. We used a linear learning rate decay schedule, with initial learning rates of 1.5e-5 for the 1.3B model and 7e-6 for the 6.7B model, based on small amounts of experimentation and rough model size extrapolation. We used a KL coefficient of 0.05 for both of the main runs we report results for (except when we explicitly vary this value in the reward model optimization graphs). We use a batch size of 512 for the 1.3B model and 256 for the 6.7B model, and run for 1 million episodes.
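The advantage estimation with γ = 1 and λ = 0.95 can be illustrated with a small generalized advantage estimation routine in the style of [57]. This is a sketch over plain Python lists, assuming one episode at a time and a value of zero after the final step; it is not the training code used for the paper.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimates for one episode.

    rewards[t] and values[t] are per-step rewards and value predictions;
    the value after the last step is taken to be zero (assumption).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step temporal-difference residual.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With a reward of 1 on the final step only and constant value predictions of 0.5, `gae_advantages([0, 0, 1], [0.5, 0.5, 0.5])` gives approximately `[0.45125, 0.475, 0.5]`, showing how λ discounts credit assigned to earlier tokens.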

B.2 Input format

Our model always receives a byte-pair encoded string of a fixed size. When the input is too small, we pad from the beginning of the input with a padding token, and if the input is too long we truncate the post/article field at newlines to stay under the limit.
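The padding and truncation rule can be sketched as below. The representation (a post as a list of lines, each a list of token ids), the `PAD_ID` value, and the choice to drop lines from the end of the post are assumptions; the text only states that padding is added at the beginning and that truncation happens at newlines.

```python
PAD_ID = 0  # hypothetical padding token id


def fit_to_context(post_lines, suffix, context_len, pad_id=PAD_ID):
    """Return exactly context_len token ids.

    post_lines: list of lines from the post, each a list of token ids.
    suffix: token ids that must follow the post (e.g. the summary prompt).
    """
    kept = list(post_lines)
    # Truncate at newline boundaries: drop whole trailing lines until it fits.
    while kept and sum(len(line) for line in kept) + len(suffix) > context_len:
        kept.pop()
    tokens = [tok for line in kept for tok in line] + list(suffix)
    if len(tokens) > context_len:
        raise ValueError("suffix alone exceeds the context length")
    # Pad from the beginning of the input.
    return [pad_id] * (context_len - len(tokens)) + tokens
```

Left-padding keeps the end of the input, where generation starts, at a fixed position in the context.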

When sampling from models pretrained only on our pretrain mixture and not fine-tuned on TL;DR, we follow [48] and instead of padding with a padding token, we pad the beginning of the context with examples of posts/articles and high-quality summaries. We use as many examples as will fit in the token limit, with the examples formatted the same way as the main input. Table 4 documents the formats we used (with pythonic format strings).
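Filling the context with examples can be sketched as a greedy packing step. The function name, the caller-supplied `count_tokens` function, and the rule of stopping at the first example that does not fit are assumptions; the text only states that as many examples as fit are placed before the main input.

```python
def build_few_shot_context(examples, main_input, context_len, count_tokens):
    """Prepend formatted examples to main_input while they fit in the token limit."""
    budget = context_len - count_tokens(main_input)
    chosen = []
    for example in examples:
        cost = count_tokens(example)
        if cost > budget:
            break  # assumed rule: stop at the first example that does not fit
        chosen.append(example)
        budget -= cost
    return "".join(chosen) + main_input
```

Each example is expected to be already formatted the same way as the main input, so the model sees a consistent pattern ending in the summary it should complete.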

C Human data collection details

C.1 Process for ensuring high-quality human data

We first detail the procedures we use to ensure high-quality data. While these procedures became more rigorous over the course of the project, they generally involved the following steps.

Step 0: Understanding the task ourselves. To understand the task, we first do many summary comparisons ourselves. We also hire a small number of human labelers¹² to do comparisons, and discuss our disagreements. We then draft instructions for a larger set of human labelers.

Step 1: Labeler onboarding. Labelers are hired from Upwork, a freelancing platform, as well as two labeling services, Scale and Lionbridge. Labelers first complete a (paid) training process where they label summaries on a shared set of data. For some comparisons, labelers get immediate feedback about which summary was chosen by us, and why, to help them calibrate. We retain labelers that pass a minimum threshold for speed and agreement with us. To allow for a customizable labeler interface, we built our own website for data collection (see Appendix C.4).

Step 2: Collecting comparison data. Next, we have labelers evaluate a large batch of comparisons on our website, which generates the bulk of our data. Before comparing two summaries directly, we have labelers write their ‘naive interpretations’ of summaries without seeing the original post. We’ve found this helpful for evaluating summaries, as these interpretations surface points of ambiguity in the summary that might not have been detected if the summary were read after the original post. After doing naive interpretations, labelers do comparisons by assigning a value on a 9-point scale for how confident they are that summary A is better than summary B (or the converse).

Step 3: Providing labeler feedback. After collecting the comparison data, we can look at agreement rates between labelers. While most comparisons are only given to a single labeler, each labeler gets about 10-20% of their questions from a shared pool for calibration purposes. We can use these statistics as crude measures of quality, and also show cases of disagreement to workers to help them improve their labels.

Step 4: Researcher comparison calibrations. We occasionally also do the task ourselves, to measure agreement rates between each labeler and us. This is used for quality assessment (see C.2). We also calculate per-labeler “high confidence” thresholds, by finding the confidence value on the Likert scale for each labeler such that we expect labels above this threshold to agree with us 80% of the time on average. For the purposes of reward model selection, we filter the validation set to contain only these higher confidence labels. For the entire process, we keep a high communication bandwidth with labelers: we use a shared chat room for labelers to ask clarifying questions and discuss difficult comparisons amongst themselves, host office hours, and occasionally have one-on-one video calls with labelers to discuss points of disagreement.
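One way to compute a per-labeler “high confidence” threshold is sketched below. It assumes each label carries a confidence level (for example, the distance from the midpoint of the 9-point scale) and a flag for whether it agreed with the researcher label, and it picks the smallest confidence level at or above which agreement reaches the target. The exact procedure used for the paper is not specified in the text, so the data layout and selection rule are illustrative.

```python
def high_confidence_threshold(labels, target=0.8):
    """Find a labeler's confidence threshold.

    labels: list of (confidence, agrees_with_researcher) pairs for one labeler.
    Returns the smallest confidence c such that labels with confidence >= c
    agree with researchers at least `target` of the time, or None if no
    such level exists.
    """
    for c in sorted({conf for conf, _ in labels}):
        subset = [agrees for conf, agrees in labels if conf >= c]
        if sum(subset) / len(subset) >= target:
            return c
    return None
```

Filtering a validation set to labels at or above the returned threshold then keeps only the comparisons expected to match researcher judgment at the target rate.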

We keep good labelers throughout the lifetime of the project, while firing the lowest-performing workers.

C.2 Assessing human feedback quality

We assess labeler accuracy by comparing the labeler’s preferred summary with the summary we prefer (ignoring the confidence level). We exclude comparisons where either the labeler or researcher expresses indifference. This gives us an agreement rate, in theory ranging from 0% (perfect disagreement) to 100% (perfect agreement). For our 2-way comparisons, a random labeler would get 50% agreement.
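The agreement rate defined above can be computed as in this sketch, which assumes each choice is encoded as `"A"`, `"B"`, or `None` for indifference; the encoding and function name are illustrative.

```python
def agreement_rate(labeler_choices, researcher_choices):
    """Fraction of comparisons where labeler and researcher prefer the same summary.

    Comparisons where either side is indifferent (None) are excluded.
    """
    pairs = [
        (labeler, researcher)
        for labeler, researcher in zip(labeler_choices, researcher_choices)
        if labeler is not None and researcher is not None
    ]
    if not pairs:
        raise ValueError("no comparisons left after excluding indifference")
    return sum(labeler == researcher for labeler, researcher in pairs) / len(pairs)
```

Confidence levels are ignored here, matching the description above: only the direction of the preference counts.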

To obtain our main number comparing labeler-researcher to researcher-researcher agreement, we restrict ourselves to comparisons between summaries from our 1.3B supervised baseline, because this subset of the data has the most researcher-labeled data. On this subset, labelers agree with researchers 77% ± 2% of the time, while researchers agree with each other 73% ± 4% of the time. We believe substantial noise comes from comparisons being quite difficult and subjective.

In general, agreement rates range from about 65% for the least proficient labelers and most difficult comparisons (comparing two high-temperature samples from a single RL policy) to about 85% for the most proficient labelers and easiest comparisons (comparing a high-temperature sample from a supervised baseline to the reference summary). Averaging over all workers, weighted by their volume, gives us an estimated agreement rate of 73% ± 3% for our reward model training corpus.

Figure 9: (a) The website we made to collect data from labelers. (b) Naive interpretations of summaries on the website.

Labelers agree with each other 72% of the time in the training corpus. This suggests we could get more reliable labels by aggregating labels from multiple workers on the same comparison. Indeed, on the subset of the training data for which we have enough shared comparisons, taking the modal label from 3 labelers increases their agreement rate with researchers from 72% to 77%. However, we usually collect only one label per comparison, in order to maximize label throughput.
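Taking the modal label across several labelers can be sketched as a simple majority vote. The tie handling (returning `None`) is an assumption; with 3 labelers and two options, as in the text, ties cannot occur.

```python
from collections import Counter


def modal_label(labels):
    """Return the most common label, or None if the top count is tied."""
    counts = Counter(labels).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie between the two most common labels
    return counts[0][0]
```

For example, `modal_label(["A", "B", "A"])` returns `"A"`, so a single dissenting labeler no longer determines the label used for training or evaluation.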

On the evaluations for Figure 1, labelers agreed with researchers 73% ± 3% of the time, and labelers agreed with each other 73% ± 2% of the time.

Agreement rate between researchers ranged from about 65% on the most difficult comparisons (comparing two high-temperature samples from a single RL policy), to about 80% on the easiest comparisons (comparing a high-temperature sample from a supervised baseline to the human reference summary), to about 95% in cases where we discussed the comparisons with each other.

Overall we believe that quality is fairly high. Our attempts to filter data generally hurt reward model accuracy. For example, using the confidence thresholds mentioned above, we found that while lower-confidence labels were less useful than high-confidence labels for improving reward model accuracy, they were still better to include than to omit. Similarly, leaving out workers with poorer agreement rates did not help.

C.3 Labeler demographics

When training machine learning models with human feedback, the humans providing the feedback are essential in reinforcing the desired model behavior. If we are to scale human feedback to train models on more complex tasks, where humans might disagree about what the desired model behavior should be, it’s important for members of groups that will be impacted by the model to be included in the labeler population.

To provide more transparency into our labeler demographics, we provide results from a survey given to our labelers in Table 5. The survey was optional, anonymous, and it was made clear that the results would not affect hiring or firing decisions. We find that our labelers span a range of ethnicities, nationalities, ages, genders, and educational backgrounds, but are more likely to be White and American.

C.4 Labeler website

Since we hired and trained our own set of labelers, rather than using a crowdsourcing website such as Amazon Mechanical Turk, we built our own website to allow for a standardized, customizable user interface for all labelers. Each labeler created a separate profile, allowing us to assign different sets of comparisons to different labelers. The website contains different renderers for different kinds of questions, including naive interpretations, summary comparisons, and Likert evaluations along different axes, along with room for labelers to express concerns with the question or explanations for their decision. Screenshots from the website are shown in Figure 9. Data collected from the website can be easily ported into a central database containing all of our human data.

C.5 Instructions for labelers

Here we provide more detail on the specific instructions given to labelers for comparing summaries, and for doing Likert evaluations of summaries along axes of quality. We produced separate sets of instructions for evaluating Reddit posts, and for evaluating CNN/DM news articles. For Reddit instructions, we first describe Reddit in general and provide a table that translates Reddit-specific lingo into common parlance.

Instructions for comparing summaries. We show an excerpt of the instructions given to labelers for making comparisons in Table 6. In addition to these instructions, we provide an example labeled comparison between Reddit summaries, and also example naive interpretations for summaries.

Instructions for evaluating summaries along axes of quality. We provide a separate set of detailed instructions for labelers for the 7-point Likert evaluations. We first introduce each of the 4 axes of quality we consider, giving an overview of coherence, accuracy, coverage, and overall score (shown in Table 7). We also provide a brief rubric for giving scores of 1, 4, and 7, along with several Reddit summaries annotated with our own judgments of quality along each of these axes (with explanations).

Finally, we provide a FAQ section that answers common questions raised by the small initial set of labelers we assigned to this task.

For CNN/DM, we provide the same set of instructions, except we add some additional clarifications for how to judge news articles. We specifically ask labelers to place less emphasis on fluidity of sentences (because the reference summaries were originally written in bullet-point form, and we didn’t want labelers to penalize this), and to place less emphasis on the summary matching the intent of the article (which was important for Reddit summaries).

In terms of quality control, we conducted a smaller version of the quality control process described in Appendix C.1: we first labeled a small set of summaries ourselves along each axis to understand points of confusion, then we wrote the instructions document to provide to labelers, then we had a small number of labelers do a trial of the task to catch any remaining bugs or points of confusion, and finally we onboarded a larger set of labelers onto the task while remaining available to answer any questions.

C.6 Composition of the labeled dataset

Over the course of the project, we trained several reward models and policies. Each batch of summaries that we sent to the labelers was sampled from a variety of policies. We didn’t have a systematic plan for which policies to sample from; rather, we chose what seemed best at the time in the spirit of exploratory research. Every time we trained a reward model, we trained on all labels we had collected so far. Successive models also benefited from improved hyperparameters and dataset cleaning. Our results could likely be replicated with a simpler, more systematic approach.

In general, as we hire new labelers and as existing labelers perform the task more, it is possible that there is ‘labeler drift’, where the set of criteria used by labelers to evaluate summaries gradually shifts over time. This could lead to a regression in labeler-researcher agreement, or lead to some policies becoming more or less preferred over time. To help guard against this, in most batches we include comparisons between samples from our supervised baseline and reference summaries, and measure the frequency with which the workers prefer one over the other. If this number drifts over time, it’s an indication that our workers’ preferences are also changing. However, we generally found that this preference number stayed relatively constant, within noise.
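The drift check described above can be sketched as a per-batch preference rate on the control comparisons. This is only an illustration: the record format, the labels "sup" and "ref", and the helper names are assumptions made for this sketch, and the toy records are not data from the paper.

```python
from collections import defaultdict


def baseline_preference_by_batch(records):
    """Return {batch_id: fraction of control comparisons won by the baseline}.

    `records` is an iterable of (batch_id, preferred) pairs, where `preferred`
    is "sup" (supervised baseline sample) or "ref" (reference summary).
    These labels are hypothetical and chosen only for this sketch.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for batch_id, preferred in records:
        totals[batch_id] += 1
        if preferred == "sup":
            wins[batch_id] += 1
    return {batch: wins[batch] / totals[batch] for batch in totals}


def max_drift(rates):
    """Largest gap between the preference rates of any two batches."""
    values = list(rates.values())
    return max(values) - min(values)


# Toy control comparisons for two labeling batches (made-up data).
records = [
    ("batch1", "sup"), ("batch1", "ref"), ("batch1", "ref"), ("batch1", "ref"),
    ("batch2", "ref"), ("batch2", "sup"), ("batch2", "ref"), ("batch2", "ref"),
]
rates = baseline_preference_by_batch(records)
drift = max_drift(rates)
```

In practice the gap would be compared against the sampling noise of each batch before concluding that preferences have shifted.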

Table 8 lists the policies we trained by supervised finetuning on the TL;DR dataset, as well as the reward models, trained on successively larger datasets of human labels. Table 9 lists the RL policies.

Coherence
For this axis, answer the question “how coherent is the summary on its own?” A summary is coherent if, when read by itself, it’s easy to understand and free of English errors. A summary is not coherent if it’s difficult to understand what the summary is trying to say. Generally, it’s more important that the summary is understandable than it being free of grammar errors.

Rubric:
Score of 1: The summary is impossible to understand.
Score of 4: The summary has mistakes or confusing phrasing that make it a bit hard to understand.
Score of 7: The summary is perfectly clear.

Accuracy
For this axis, answer the question “does the factual information in the summary accurately match the post?” A summary is accurate if it doesn’t say things that aren’t in the article, it doesn’t mix up people, and generally is not misleading. If the summary says anything at all that is not mentioned in the post or contradicts something in the post, it should be given a maximum score of 5. (If you are confused about how to use ‘6’, see the FAQ!)

Rubric:
Score of 1: The summary is completely wrong, made up, or exactly contradicts what is written in the post.
Score of 4: The summary says at least one substantial thing that is not mentioned in the post, or that contradicts something in the post.
Score of 5: The summary says anything, no matter how small, that is not mentioned in the post, or that contradicts something in the post.
Score of 7: The summary has no incorrect statements or misleading implications.

Coverage
For this axis, answer the question “how well does the summary cover the important information in the post?” A summary has good coverage if it mentions the main information from the post that’s important to understand the situation described in the post. A summary has poor coverage if someone reading only the summary would be missing several important pieces of information about the situation in the post. A summary with good coverage should also match the purpose of the original post (e.g. to ask for advice).

Rubric:
Score of 1: The summary contains no information relevant to the post.
Score of 4: The summary is missing at least 1 important piece of information required to understand the situation.
Score of 7: The summary covers all of the important information required to understand the situation.

Overall quality
For this axis, answer the question “how good is the summary overall at representing the post?” This can encompass all of the above axes of quality, as well as others you feel are important. If it’s hard to find ways to make the summary better, give the summary a high score. If there are lots of different ways the summary can be made better, give the summary a low score.

Rubric:
Score of 1: The summary is terrible.
Score of 4: The summary is an okay representation of the post, but could be significantly improved.
Score of 7: The summary is an excellent representation of the post.

Table 7: Instructions given to labelers for evaluating summaries along four different axes of quality.

Table 8: Left: supervised baselines. sup4 and sup4_6b are the final supervised baselines used throughout the paper. Right: reward models. rm4 and rm4_6b are the final reward models used throughout the paper.

RL policy name	# Parameters	Objective	Initialization	KL coefficient	KL(ppo, sup)
sup3 ppo rm1	1.3B	rm1	sup3	0.35	1.8
sup4 ppo rm3 1	1.3B	rm3	sup4	0.10	3.8
sup4 ppo rm3 2	1.3B	rm3	sup4	0.07	9.4
sup4 ppo rm3 3	1.3B	rm3	sup4	0.05	19.0
sup4 ppo rm4	1.3B	rm4	sup4	0.05	18.0
sup4_6b ppo rm4_6b	6.7B	rm4_6b	sup4_6b	0.05	14.0

Table 9: PPO policies. sup4 ppo rm4 and sup4_6b ppo rm4_6b are the final policies used throughout the paper.

BoN policy name	Objective	Base policy	N	KL(BoN, sup)
sup2 bo8 rm1	rm1	sup2	8	1.2
sup3 bo8 rm2	rm2	sup3	8	1.2
sup3 bo63 rm2	rm2	sup3	63	3.2
sup4 bo8 rm3	rm3	sup4	8	1.2
sup4 bo64 rm3	rm3	sup4	64	3.2
sup4 bo128 rm3	rm3	sup4	128	3.9
sup4 bo256 rm3	rm3	sup4	256	4.5
sup4 bo512 rm3	rm3	sup4	512	5.2
sup4 bo128 rm3_6b	rm3_6b	sup4	128	3.9
sup4 bo256 rm3_6b	rm3_6b	sup4	256	4.5

Table 10: Best-of-N policies. KL divergence is computed analytically as KL(BoN, sup) = log N − (N−1)/N.

We also explored a simple alternative to reinforcement learning: Sample N summaries from a supervised baseline at temperature 0.7, score them with a reward model, and take the summary with the highest score. This best-of-N (BoN) procedure is effectively a mildly optimized policy requiring no training. These policies are named in Table 10, and samples from them form part of the training data.
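A minimal sketch of the best-of-N procedure and of the analytic KL expression from Table 10 follows. The sampler and reward function here are toy stand-ins (hypothetical, not the paper's models): a real setup would sample summaries from the supervised baseline at temperature 0.7 and score them with a trained reward model.

```python
import math
import random


def best_of_n(sample_fn, reward_fn, n, rng):
    """Draw n candidates and return the one with the highest reward."""
    candidates = [sample_fn(rng) for _ in range(n)]
    return max(candidates, key=reward_fn)


def bon_kl(n):
    """Analytic KL(BoN, sup) = log N - (N - 1) / N, in nats."""
    return math.log(n) - (n - 1) / n


# Toy stand-ins (hypothetical): a "summary" is a string of x's and the
# "reward" is simply its length.
def toy_sample(rng):
    return "x" * rng.randint(1, 20)


def toy_reward(summary):
    return len(summary)


rng = random.Random(0)
best = best_of_n(toy_sample, toy_reward, 8, rng)

# KL values for several N, for comparison against the last column of Table 10.
kl_by_n = {n: round(bon_kl(n), 1) for n in (8, 64, 128, 256, 512)}
```

Because only the argmax of the reward is used, any monotonic rescaling of the reward function selects the same summary.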

Table 11 lists the source policies for the training data for each reward model.

Table 11: Training data for reward models. “ref” refers to human reference summaries.

C.7 Example comparison tasks

To give a sense of the difficulty of the comparisons task, we provide example comparisons between two summaries generated by our 6.7B human feedback model. In Table 12 we show both a random comparison drawn from the TL;DR dataset, and a cherry-picked comparison (selected from 10 comparisons where labelers disagreed) to illustrate the trade-off between accuracy and coverage that can occur when labelers conduct evaluations.

Table 12: Top: Example of a random comparison task on the TL;DR dataset between two summaries from our 6.7B human feedback model. Comparison chosen randomly from the validation set. Bottom: An example of a difficult comparison task on the TL;DR dataset. Chosen by looking at comparisons between supervised baseline summaries with at least 4 labeler judgements and with at least 40% vote for each summary. Cherry-picked out of 10 to highlight an accuracy-coverage tradeoff. Summary A is inaccurate since the author does not explicitly say she is having doubts about trying on wedding dresses. Summary B is entirely accurate but does not capture the general essence of the post. In this case, 4 workers chose A and 3 workers chose B. For more comparisons, see our website.

D Choice of baselines

In testing our human feedback techniques, we collected a large amount of high-quality data from human labelers. In order to compare fairly against supervision-based techniques, we would have needed to spend a similar amount of labeler time collecting high quality demonstrations, and used those to fine-tune a model via supervised learning. Because this is prohibitively expensive, we do not provide such a baseline.

Prior work such as PEGASUS [70] has studied supervised methods on a dataset very similar to ours (the /r/tifu subset of TL;DR). However, they use much smaller models (500M parameters), and report that their model outputs are worse than the human reference summaries, according to human evaluations. Thus, due to our limited labeler budget for evaluation, we decided to use our own supervised and zero-shot models as baselines (after sanity-checking the ROUGE performance of our supervised models), as well as T5 [49].

T5 models [49] are pretrained and fine-tuned in a similar way to our supervised baselines, but they use an encoder-decoder architecture. We used T5 outputs which were obtained via beam search decoding, as described in [49]. We also carefully account for differences in tokenization between model outputs.¹³

E CNN/DM lead-3 vs reference summaries

On the CNN/DM dataset, our labelers significantly preferred lead-3 (a summary consisting of the first 3 sentences of the article) to reference summaries. In part this is due to longer summaries receiving higher coverage scores and lead-3 being 50% longer, as shown in Table 13.

Policy	Length (stdev)	Quality	Quality increase / 100 char.
ref	314 (119)	5.54	0.14
lead-3	475 (114)	6.23	0.34

Table 13: How length affects overall quality on CNN/DM for lead-3 and reference summaries.

However, if we use a linear regression (similar to the procedure in Appendix F) to predict what lead-3 performance would be if its average length were reduced to 314 characters, we still find a quality of 5.68, modestly higher than the reference summaries. Moreover, for lead-3 to even achieve parity with the reference summaries seems to call into question the need for abstractive summarization or sophisticated ML methods, since a simple extractive baseline can match a perfect imitation of the reference summaries.
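The length adjustment above can be reproduced arithmetically from Table 13, under the assumption (not stated explicitly in the text) that the linear model applies lead-3's own slope of 0.34 quality points per 100 characters. The function name is hypothetical.

```python
def length_adjusted_quality(quality, length, target_length, slope_per_100_chars):
    """Linearly extrapolate a quality score to a different average length."""
    return quality + slope_per_100_chars * (target_length - length) / 100.0


# Values from Table 13: lead-3 has quality 6.23 at 475 characters and the
# reference summaries average 314 characters.
adjusted = length_adjusted_quality(6.23, 475, 314, 0.34)
print(round(adjusted, 2))  # 5.68
```

The result, 6.23 − 0.34 × 1.61 ≈ 5.68, matches the figure quoted above and remains above the reference quality of 5.54.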

We wanted to understand labeler behavior on these comparisons, to ensure that it was not an error. To do this, we examined a sample of our labelers’ judgments ourselves. We found that in 20/143 cases labelers preferred lead-3 by 3 points or more, and that excluding these datapoints would raise the relative score of the reference summaries by about 0.5 points.¹⁴ We were surprised to see the reference summaries performing so poorly in a significant fraction of cases, so we looked at labelers’ explanations and confirmed they made sense.

We found that two features of the reference summaries explained most of their underperformance. First, 13 of these 20 summaries omitted one of the key points from the article: the highlights are often written for a reader who has already seen the title of the article, even though the titles are not included in the CNN/DM dataset. Second, 10 of these 20 summaries actually introduced new information not present in the original article. From the perspective of labelers this information is totally confabulated and so led to lower scores. A likely explanation for these errors is that the reference summaries are extracted from “highlights” on the news sites rather than being a straightforward summary of the article. These failures are common enough that they significantly impact the average quality of the reference summaries, and the effects seem to be large relative to quality differences between ML models.

Overall we believe that labeler judgments were reasonable in these cases, and that it is potentially problematic to treat the “highlights” in the CNN/DM dataset as reference summaries. You can view all of our labelers’ judgments on CNN/DM at our website.

Figure 10: (a) A length-controlled version of Figure 1, using the procedure described in Appendix F. Controlling for length reduces the relative preference of our human feedback models; however, they are still preferred to the reference summaries. (b) Plotting model quality for different summary lengths on the TL;DR dataset. Our 6.7B human feedback model outperforms both the 6.7B supervised baseline and the reference summaries (horizontal line at 0.5) across lengths.

F Controlling for summary length

As discussed in Section 4.1, the length of a summary is a confounding factor for evaluating summary quality; depending on the desired trade-off between conciseness and coverage, a shorter or longer summary might be better. Our models generate summaries that are longer than the reference summaries, as this led to higher labeler preference given the 24-48 token limit for our task. Here we describe the procedure we use to attempt to control for length.

To calculate a single length-controlled preference number, we train a logistic regression model to predict the human-preferred summary on our dataset of human comparisons. We provide this model with 2 features: the identity of each policy, and the log ratio of the summary lengths. To calculate the length-controlled preference value between two policies, we simply give each policy ID to our trained logistic regression model and set the log length ratio to zero (see Figure 10a). In Figure 10b we examine summary quality across a range of summary lengths on TL;DR. We find that our human feedback model outperforms the supervised baseline across all length values.
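One way to realize this length control is sketched below in plain Python. The encoding is an assumption made for illustration: each policy gets a learned score theta, the policy-identity feature enters as theta[A] − theta[B], and the model is fit by simple gradient ascent on the log-likelihood. The paper does not specify these details, and the comparison data here are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_length_controlled(comparisons, policies, steps=2000, lr=0.1):
    """Fit P(A preferred) = sigmoid(theta[A] - theta[B] + w * log(len_A / len_B)).

    `comparisons` holds tuples (policy_a, policy_b, len_a, len_b, a_preferred),
    with a_preferred equal to 1 or 0. Plain gradient ascent, no regularization.
    """
    theta = {p: 0.0 for p in policies}
    w = 0.0
    n = len(comparisons)
    for _ in range(steps):
        grad_theta = {p: 0.0 for p in policies}
        grad_w = 0.0
        for pa, pb, la, lb, y in comparisons:
            log_ratio = math.log(la / lb)
            err = y - sigmoid(theta[pa] - theta[pb] + w * log_ratio)
            grad_theta[pa] += err
            grad_theta[pb] -= err
            grad_w += err * log_ratio
        for p in policies:
            theta[p] += lr * grad_theta[p] / n
        w += lr * grad_w / n
    return theta, w


def length_controlled_preference(theta, policy_a, policy_b):
    """Predicted preference for A over B with the log length ratio set to zero."""
    return sigmoid(theta[policy_a] - theta[policy_b])


# Made-up comparisons between a human feedback policy ("hf") and a
# supervised baseline ("sup"); lengths are in tokens.
comparisons = [
    ("hf", "sup", 40, 30, 1),
    ("hf", "sup", 30, 30, 1),
    ("hf", "sup", 30, 40, 0),
    ("hf", "sup", 35, 30, 1),
    ("sup", "hf", 30, 30, 0),
    ("sup", "hf", 40, 30, 1),
    ("hf", "sup", 30, 30, 0),
    ("sup", "hf", 30, 35, 0),
]
theta, w = fit_length_controlled(comparisons, ["hf", "sup"])
pref = length_controlled_preference(theta, "hf", "sup")
```

Setting the log length ratio to zero asks the fitted model how often one policy would be preferred if both summaries had the same length, which is the quantity plotted in Figure 10a.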

For CNN/DM, we use a similar procedure as described above to control for length, except using a linear regression model to predict the Likert rating from 1-7. We show the expected quality increase for making summaries 100 characters longer in Table 14, which suggests our human feedback models would perform better if they generated longer summaries.

Table 14: How length affects overall quality on CNN/DM. We show average length and quality scores for various policies, and how much the summary quality increases on average per 100 added characters.

G Additional results

G.1 Value function ablation

In this section, we conduct an ablation comparing separate parameters for the value function and policy against a shared network, as done in [73]. The results, shown in Figure 11, clearly indicate that using separate networks outperforms using a shared network. On the other hand, having separate networks increases the memory requirements of running RL fine-tuning. Having separate networks also allows us to initialize the value function to be the learned reward model that is being optimized.

Figure 11: Comparing the reward obtained by optimizing with separate value function and policy parameters to shared parameters.

G.2 Evaluating policies along axes of quality

We show the full results of the evaluations of policies on a 7-point Likert scale along different axes of quality; for TL;DR this is shown in Figure 12, and for CNN/DM this is shown in Figure 13. It is evident that on both datasets coverage correlates strongly with overall score across models, and all models achieve a high coherence score.

G.3 Studying best-of-N optimization

A natural way to evaluate an automatic evaluation metric is to see the extent to which optimizing against it leads to high performance according to humans. One way to assess this is to use best-of-N as an (inefficient) optimization technique — this has the benefits of being simple and invariant to monotonic transformations. We report results for up to best-of-2048 on ROUGE and three of our reward models in Figure 7, using samples from the 1.3B supervised baseline. The results suggest that optimizing against ROUGE significantly under-performs optimizing against our reward models. The data also suggests ROUGE degrades with too much optimization much faster than our reward models.

With increasing N, the best-of-N policies get higher average reward. Similarly, by decreasing the KL coefficient β, the PPO policies get higher average reward. We found that at a given average reward, the best-of-N and PPO policies have similar quality as judged by human labelers (not shown). However, the PPO policy is farther from the supervised baseline than best-of-N is, as measured by the KL divergence.¹⁵

G.4 ROUGE scores

In Figure 14a and 14b, we show the ROUGE scores of our models on the TL;DR and CNN/DM datasets, respectively. We report results with T=0, consistent with our human evaluations. We found that temperature has an (often significant) impact on ROUGE score, and we did a thorough sweep to verify that the best temperature setting is T=0.

Figure 12: Evaluating TL;DR policies on a 7-point Likert scale along several axes of quality.

Model	ROUGE-1	ROUGE-2	ROUGE-L
ProphetNet [67]	44.20	21.17	41.30
T5 [49]	43.52	21.55	40.69
Our 6.7B supervised model	42.49	19.84	39.53
CNN-2sent-hieco-RBM [71]	42.04	19.77	39.42

Table 15: Comparing the ROUGE score of our 6.7B supervised model on CNN/DM to recent SOTA models from the literature. Without any summarization-specific engineering, our model achieves ROUGE scores better than SOTA models from mid-2019, indicating that it is a strong baseline for comparison.

On TL;DR, we find that our human feedback models obtain a slightly lower ROUGE score than the supervised models at T = 0, further indicating that ROUGE correlates poorly with human preferences. For supervised models, lowering temperature has a larger impact than increasing model size. Interestingly, at higher temperatures, our feedback models actually outperform supervised counterparts (not shown).

On CNN/DM, ROUGE agrees with our human evaluations that our human feedback models transfer better than our supervised models. However, unsurprisingly, supervised CNN/DM models still achieve much higher ROUGE. In Table 15, we show the ROUGE results on CNN/DM for our 6.7B supervised baseline and various models from the literature. We find that our model achieves ROUGE scores lower than those of T5 [49], but slightly higher than those of the CNN-2sent-hieco-RBM model from [71], which was SOTA for abstractive summarization on CNN/DM in mid-2019 according to the NLP-progress leaderboard.¹⁶

G.5 Bigram overlap statistics

In Table 16, we show the bigram overlap statistics for our models on the TL;DR and CNN/DM datasets as a proxy for how much the summaries copy from the post. As in Section 4.4, we compute the longest common subsequence of bigrams with the original Reddit post or news article, and divide by the number of bigrams in the summary. We find that models evaluated on CNN/DM (whether or not they were trained on CNN/DM) generally copy more than models evaluated on TL;DR. Further, our supervised and human feedback models copy less than our pretrained models.

Figure 14: ROUGE scores for our models on (a) the TL;DR dataset, and (b) the CNN/DM dataset.

Evaluated on TL;DR
Model	Model size	Bigram overlap %
GPT	1.3B	66.7%
GPT	3B	72.7%
GPT	6.7B	61.4%
GPT	13B	75.9%
Supervised (TL;DR)	1.3B	49.0%
Supervised (TL;DR)	3B	48.7%
Supervised (TL;DR)	6.7B	48.9%
Supervised (TL;DR)	13B	48.0%
Human feedback (TL;DR)	1.3B	53.3%
Human feedback (TL;DR)	6.7B	46.0%
Evaluated on CNN/DM
Model	Model size	Bigram overlap %
GPT	1.3B	76.3%
GPT	6.7B	76.2%
Supervised (TL;DR)	1.3B	59.5%
Supervised (TL;DR)	6.7B	56.9%
Human feedback (TL;DR)	1.3B	64.8%
Human feedback (TL;DR)	6.7B	51.2%
Supervised (CNN/DM)	1.3B	66.0%
T5	11B	68.8%
reference	—	36.8%

Table 16: Bigram overlap statistics on the TL;DR dataset (top) and the CNN/DM dataset (bottom). Models trained on CNN/DM copy significantly more than models trained on TL;DR.
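The copying measure used in Appendix G.5 (longest common subsequence of bigrams, divided by the number of summary bigrams) can be sketched as follows. Lower-casing and whitespace tokenization are assumptions made for this illustration, since the exact tokenization is not specified here, and the example texts are made up.

```python
def bigrams(tokens):
    """Return the list of consecutive token pairs."""
    return list(zip(tokens, tokens[1:]))


def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def bigram_overlap(summary, source):
    """LCS of bigrams between summary and source, divided by the number of
    bigrams in the summary. Tokenization here is an assumption."""
    summary_bigrams = bigrams(summary.lower().split())
    source_bigrams = bigrams(source.lower().split())
    if not summary_bigrams:
        return 0.0
    return lcs_length(summary_bigrams, source_bigrams) / len(summary_bigrams)


source = "my dog ate my homework and my teacher did not believe me"
summary = "my dog ate my homework but nobody believed me"
overlap = bigram_overlap(summary, source)
print(overlap)  # 0.5
```

Here the first four of the summary's eight bigrams appear in order in the source, giving an overlap of 50%.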

G.6 Reward model validation sets

In this section, we report results evaluating our reward models on various manually constructed validation sets, shown in Tables 17 and 18. Notably, we asked our labelers to produce a small dataset of edits, by having them make improvements to existing summaries (either reference summaries or supervised baseline summaries). Our 6.7B reward model prefers the improved summaries at a similar rate to humans (who do not know which summary has been edited).

Our reward models are also sensitive to sentence shuffling (whereas metrics like ROUGE are largely not), and are able to detect when the roles portrayed in the summary have been switched. On the other hand, our reward models sometimes exhibit preference for poor artificial summaries, such as the post title copied twice, or asking for advice at the end of the summary. In Table 19, we show examples where our model is sensitive to small, semantically meaningful changes in the summary.

G.7 Measuring agreement between different evaluation metrics

We are interested in understanding the relationship between different metrics for evaluating summaries. To do this, we compute agreement between various metrics, including automatic metrics and humans, for different subsets of the data for which we have human evaluations. To remove policy quality as a confounding variable, all of the summary comparisons are generated by the same policy at the same temperature value. In Table 20, we use samples from our 1.3B supervised model at T=0.7 on TL;DR; Table 21 has comparisons from our 6.7B supervised model at T=0.7 on TL;DR; Table 22 has comparisons from our 6.7B human feedback model at T=0.7 on TL;DR; and Table 23 has comparisons from our 6.7B supervised baseline trained on CNN/DM.

Our 6.7B reward model generally agrees with labelers as much as other labelers, although an ensemble of labelers does better. On the other hand, ROUGE generally has poor agreement, as does log probability under the supervised baselines, with simple heuristics like copying (longest common subsequence of bigrams with the article) and length often performing comparably.
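As an illustration of the agreement computation, the sketch below converts each metric's pairwise scores into a choice between summary A and summary B and counts how often two judges pick the same one. The helper names, the tie-breaking rule, and the toy numbers are assumptions for this sketch and do not come from the paper's tables.

```python
def choices_from_scores(score_pairs):
    """Turn (score_a, score_b) pairs into "A"/"B" choices.

    Ties go to "A"; this tie-breaking rule is an arbitrary assumption.
    """
    return ["A" if a >= b else "B" for a, b in score_pairs]


def agreement_rate(choices_x, choices_y):
    """Fraction of comparisons on which two judges pick the same summary."""
    if len(choices_x) != len(choices_y) or not choices_x:
        raise ValueError("need two non-empty choice lists of equal length")
    matches = sum(x == y for x, y in zip(choices_x, choices_y))
    return matches / len(choices_x)


# Toy data (made up): one labeler's choices, reward model scores, and
# summary lengths for the same four comparisons.
labeler = ["A", "B", "A", "A"]
reward_scores = [(0.9, 0.1), (0.2, 0.7), (0.3, 0.6), (1.5, -0.2)]
lengths = [(120, 100), (90, 110), (130, 95), (80, 100)]

rm_agreement = agreement_rate(labeler, choices_from_scores(reward_scores))
length_agreement = agreement_rate(labeler, choices_from_scores(lengths))
```

The same function applies to any pair of judges, whether two labelers, a labeler and ROUGE, or a labeler and a length heuristic.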

H Samples

H.1 Random samples

Here we provide non-cherry-picked samples and human evaluations for various models. In Tables 25–26 we show samples on the TL;DR dataset, and in Tables 27–28 we show samples on the CNN/DM dataset (where we truncate the article for brevity). See our website for more uncurated policy samples.

H.2 Overoptimized samples

We show examples of samples from a policy overoptimized to rm3. The summaries, while clearly long, low quality, and full of idiosyncrasies, do still reflect the rough gist of the post.

Footnotes

∗This was a joint project of the OpenAI Reflection team. Author order was randomized amongst {LO, JW, DZ, NS}; CV and RL were full-time contributors for most of the duration. PC is the team lead.

²Samples from all of our models can be viewed on our website.

³We provide inference code for our 1.3B models and baselines, as well as a model card and our human feedback dataset with over 64k summary comparisons, here.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

⁴Throughout the paper, error bars represent 1 standard error.

⁵We manually check this result in Appendix E and find we generally agree with labeler ratings.

⁶ Our decision to collect comparisons rather than Likert scores is supported by recent work, e.g. [37].

⁷We recruited labelers from a freelancing platform, Upwork, and two labeling services, Scale and Lionbridge.

⁸Note that the reward model only gives rewards for entire summaries, and not at intermediate time steps. In RL terminology, each episode terminates when the policy outputs the EOS token, and the discount factor γ = 1.

⁹We measure copying by computing the longest common subsequence of bigrams with the original Reddit post or news article, and dividing by the number of bigrams in the summary.

¹⁰These posts are usually follow-ups of previous posts that have been posted to Reddit, and require the context of the original post to fully understand.

¹¹This was for a historical reason – we found that fp32 weights improved RL performance and so used it for all our RL runs. This introduces a small discrepancy, since supervised runs trained in fp32 would have performed slightly better. Unfortunately, we forgot to address this in our human evaluations. However, the effect on the supervised loss corresponds to increasing model size by less than 20%, which is small compared to effect sizes that are present in this paper (as seen in Figure 1.)

¹²We pay labelers an hourly wage, regardless of the number of comparisons completed.

¹³Since tokenization affects capitalization and punctuation of the model outputs, we normalized all CNN/Daily Mail outputs from all models by lower-casing everything and then heuristically re-capitalizing. We verify that this normalization procedure produces identical results for reference summaries tokenized in different ways.

¹⁴The reference summaries were preferred to lead-3 by a similar margin in only 7/143 cases.

¹⁵We can use KL from the supervised baseline as a distance metric. Note that we can calculate the KL of a best-of-N policy analytically as log(n) − (n−1)/n.

¹⁶ http://nlpprogress.com/english/summarization.html

Learning to summarize from human feedback