Reiichiro Nakano^∗ Jacob Hilton^∗ Suchir Balaji^∗ Jeff Wu Long Ouyang

Christina Kim Christopher Hesse Shantanu Jain Vineet Kosaraju

William Saunders Xu Jiang Karl Cobbe Tyna Eloundou Gretchen Krueger

Kevin Button Matthew Knight Benjamin Chess John Schulman

OpenAI

Abstract

We fine-tune GPT-3 to answer long-form questions using a text-based web- browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model’s answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.

1 Introduction

A rising challenge in NLP is long-form question-answering (LFQA), in which a paragraph-length answer is generated in response to an open-ended question. LFQA systems have the potential to become one of the main ways people learn about the world, but currently lag behind human performance [Krishna et al., 2021]. Existing work tends to focus on two core components of the task, information retrieval and synthesis.

In this work we leverage existing solutions to these components: we outsource document retrieval to the Microsoft Bing Web Search API,² and utilize unsupervised pre-training to achieve high-quality synthesis by fine-tuning GPT-3 [Brown et al., 2020]. Instead of trying to improve these ingredients,

we focus on combining them using more faithful training objectives. Following Stiennon et al. [2020], we use human feedback to directly optimize answer quality, allowing us to achieve performance competitive with humans.

We make two key contributions:

Figure 1: An observation from our text-based web-browsing environment, as shown to human demonstrators (left) and models (right). The web page text has been abridged for illustrative purposes.

We create a text-based web-browsing environment that a fine-tuned language model can interact with. This allows us to improve both retrieval and synthesis in an end-to-end fashion using general methods such as imitation learning and reinforcement learning.
We generate answers with references: passages extracted by the model from web pages while browsing. This is crucial for allowing labelers to judge the factual accuracy of answers, without engaging in a difficult and subjective process of independent research.

Our models are trained primarily to answer questions from ELI5 [Fan et al., 2019], a dataset of questions taken from the “Explain Like I’m Five” subreddit. We collect two additional kinds of data: demonstrations of humans using our web-browsing environment to answer questions, and comparisons between two model-generated answers to the same question (each with their own set of references). Answers are judged for their factual accuracy, coherence, and overall usefulness.

We use this data in four main ways: behavior cloning (i.e., supervised fine-tuning) using the demon- strations, reward modeling using the comparisons, reinforcement learning against the reward model, and rejection sampling against the reward model. Our best model uses a combination of behavior cloning and rejection sampling. We also find reinforcement learning to provide some benefit when inference-time compute is more limited.

We evaluate our best model in three different ways. First, we compare our model’s answers to answers written by our human demonstrators on a held-out set of questions. Our model’s answers are preferred 56% of the time, demonstrating human-level usage of the text-based browser. Second, we compare our model’s answers (with references stripped, for fairness) to the highest-voted answer provided by the ELI5 dataset. Our model’s answers are preferred 69% of the time. Third, we evaluate our model on TruthfulQA [Lin et al., 2021], an adversarial dataset of short-form questions. Our model’s answers are true 75% of the time, and are both true and informative 54% of the time, outperforming our base model (GPT-3), but falling short of human performance.

The remainder of the paper is structured as follows:

In Section 2, we describe our text-based web-browsing environment and how our models interact with it.
In Section 3, we explain our data collection and training methods in more detail.
In Section 4, we evaluate our best-performing models (for different inference-time compute budgets) on ELI5 and TruthfulQA.In Section 5, we provide experimental results comparing our different methods and how they scale with dataset size, parameter count, and inference-time compute.
In Section 6, we discuss the implications of our findings for training models to answer questions truthfully, and broader impacts.

Table 1: Actions the model can take. If a model generates any other text, it is considered to be an invalid action. Invalid actions still count towards the maximum, but are otherwise ignored.

2 Environment design

Previous work on question-answering such as REALM [Guu et al., 2020] and RAG [Lewis et al., 2020a] has focused on improving document retrieval for a given query. Instead, we use a familiar existing method for this: a modern search engine (Bing). This has two main advantages. First, modern search engines are already very powerful, and index a large number of up-to-date documents. Second, it allows us to focus on the higher-level task of using a search engine to answer questions, something that humans can do well, and that a language model can mimic.

For this approach, we designed a text-based web-browsing environment. The language model is prompted with a written summary of the current state of the environment, including the question, the text of the current page at the current cursor location, and some other information (see Figure 1(b)). In response to this, the model must issue one of the commands given in Table 1, which performs an action such as running a Bing search, clicking on a link, or scrolling around. This process is then repeated with a fresh context (hence, the only memory of previous steps is what is recorded in the summary).

While the model is browsing, one of the actions it can take is to quote an extract from the current page. When this is performed, the page title, domain name and extract are recorded to be used later as a reference. Browsing then continues until either the model issues a command to end browsing, the maximum number of actions has been reached, or the maximum total length of references has been reached. At this point, as long as there is at least one reference, the model is prompted with the question and the references, and must compose its final answer.

Further technical details about our environment can be found in Appendix A.

3 Methods

3.1 Data collection

Guidance from humans is central to our approach. A language model pre-trained on natural language would not be able to use our text-based browser, since it does not know the format of valid commands. We therefore collected examples of humans using the browser to answer questions, which we call demonstrations. However, training on demonstrations alone does not directly optimize answer quality, and is unlikely to lead far beyond human performance [Stiennon et al., 2020]. We therefore collected pairs of model-generated answers to the same question, and asked humans which one they preferred, which we call comparisons.

For both demonstrations and comparisons, the vast majority of questions were taken from ELI5 [Fan et al., 2019], a dataset of long-form questions. For diversity and experimentation, we also mixed in a small number of questions from other sources, such as TriviaQA [Joshi et al., 2017]. In total, we collected around 6,000 demonstrations, 92% of which were for questions from ELI5, and around 21,500 comparisons, 98% of which were for questions from ELI5. A more detailed breakdown of the questions we used along with post-processing details can be found in Appendix B.

Table 2: An answer produced by our 175B best-of-64 model to a randomly-chosen question from the ELI5 test set (not cherry-picked). The full text of the references can be found in Appendix J, along with answers from our human demonstrators and the ELI5 dataset. Further samples are available at https://openaipublic.blob.core.windows.net/webgpt-answer-viewer/index.html.

Question	Why did we decide that certain words were “bad” and shouldn’t be used in social settings?
Answer	We have all been taught that there are certain words that should not be used in public, at least not without good reason. But why do some words fall into this category while others do not? It turns out that there is not a single answer to this question, as each word has a different history and carries different meanings in different cultures and families [2]. However, there are some general categories that “bad” words tend to fall into. “Bad” words generally relate to parts of life that we don’t like talking about in public, like bathroom functions, or negative ways of talking about people’s religion, skin color, ability, or gender [3]. Often, people say bad words when they are experiencing strong emotions, and those words are linked to our emotions [3].
References (titles only)	[1, 2, 3] Why Are Some Words ‘Bad’? \| Vermont Public Radio (www.vpr.org) [4] On Words: ‘Bad’ Words and Why We Should Study Them \| UVA Today (news.virginia.edu) [5] The Science of Curse Words: Why The &@$! Do We Swear? (www.babbel.com)

To make it easier for humans to provide demonstrations, we designed a graphical user interface for the environment (see Figure 1(a)). This displays essentially the same information as the text-based interface and allows any valid action to be performed, but is more human-friendly. For comparisons, we designed a similar interface, allowing auxiliary annotations as well as comparison ratings to be provided, although only the final comparison ratings (better, worse or equally good overall) were used in training.

For both demonstrations and comparisons, we emphasized that answers should be relevant, coherent, and supported by trustworthy references. Further details about these criteria and other aspects of our data collection pipeline can be found in Appendix C.

We are releasing a dataset of comparisons, the details of which can be found in Appendix K.

3.2 Training

The use of pre-trained models is crucial to our approach. Many of the underlying capabilities required to successfully use our environment to answer questions, such as reading comprehension and answer synthesis, emerge as zero-shot capabilities of language models [Brown et al., 2020]. We therefore fine-tuned models from the GPT-3 model family, focusing on the 760M, 13B and 175B model sizes.

Starting from these models, we used four main training methods:

Behavior cloning (BC). We fine-tuned on the demonstrations using supervised learning, with the commands issued by the human demonstrators as labels.
Reward modeling (RM). Starting from the BC model with the final unembedding layer removed, we trained a model to take in a question and an answer with references, and output a scalar reward. Following Stiennon et al. [2020], the reward represents an Elo score, scaled such that the difference between two scores represents the logit of the probability that one will be preferred to the other by the human labelers. The reward model is trained using a cross-entropy loss, with the comparisons as labels. Ties are treated as soft 50% labels.
Reinforcement learning (RL). Once again following Stiennon et al. [2020], we fine-tuned the BC model on our environment using PPO [Schulman et al., 2017]. For the environment reward, we took the reward model score at the end of each episode, and added this to a KL penalty from the BC model at each token to mitigate overoptimization of the reward model.
Rejection sampling (best-of-n). We sampled a fixed number of answers (4, 16 or 64) from either the BC model or the RL model (if left unspecified, we used the BC model), and selected the one that was ranked highest by the reward model. We used this as an alternative method of optimizing against the reward model, which requires no additional training, but instead uses more inference-time compute.

We used mutually disjoint sets of questions for each of BC, RM and RL.

For BC, we held out around 4% of the demonstrations to use as a validation set.

For RM, we sampled answers for the comparison datasets in an ad-hoc manner, using models of various sizes (but primarily the 175B model size), trained using various combinations of methods and hyperparameters, and combined them into a single dataset. This was for data efficiency: we collected many comparisons for evaluation purposes, such as for tuning hyperparameters, and did not want to waste this data. Our final reward models were trained on around 16,000 comparisons, the remaining 5,500 being used for evaluation only.

For RL, we trained on a mixture of 90% questions from ELI5 and 10% questions from TriviaQA. To improve sample efficiency, at the end of each episode we inserted 15 additional answering-only episodes using the same references as the previous episode. We were motivated to try this because answering explained slightly more of the variance in reward model score than browsing despite taking many fewer steps, and we found it to improve sample efficiency by approximately a factor of 2. We also randomized the maximum number of browsing actions, sampling uniformly from the range 20–100 inclusive.

Hyperparameters for all of our training methods can be found in Appendix E.

4 Evaluation

In evaluating our approach, we focused on three “WebGPT” models, each of which was trained with behavior cloning followed by rejection sampling against a reward model of the same size: a 760M best-of-4 model, a 13B best-of-16 model and a 175B best-of-64 model. As discussed in Section 5.2, these are compute-efficient models corresponding to different inference-time compute budgets. We excluded RL for simplicity, since it did not provide significant benefit when combined with rejection sampling (see Figure 4).

We evaluated all WebGPT models using a sampling temperature of 0.8, which was tuned using human evaluations, and with a maximum number of browsing actions of 100.

4.1 ELI5

We evaluated WebGPT on the ELI5 test set in two different ways:

We compared model-generated answers to answers written by demonstrators using our web-browsing environment. For these comparisons, we used the same procedure as compar- isons used for reward model training. We consider this to be a fair comparison, since the instructions for demonstrations and comparisons emphasize a very similar set of criteria.
We compared model-generated answers to the reference answers from the ELI5 dataset, which are the highest-voted answers from Reddit. In this case, we were concerned about ecological validity, since our detailed comparison criteria may not match those of real-life users. We were also concerned about blinding, since Reddit answers do not typically include citations. To mitigate these concerns, we stripped all citations and references from the model-generated answers, hired new contractors who were not familiar with our detailed instructions, and gave them a much more minimal set of instructions, which are given in Appendix F.

In both cases, we treat ties as 50% preference ratings (rather than excluding them).

Figure 2: Human evaluations on ELI5 comparing against (a) demonstrations collected using our web browser, (b) the highest-voted answer for each question. The amount of rejection sampling (the n in best-of-n) was chosen to be compute-efficient (see Figure 8). Error bars represent ±1 standard error.

Although the evaluations against the ELI5 reference answers are useful for comparing to prior work, we believe that the evaluations against human demonstrations are more meaningful, for several reasons:

Fact-checking. It is difficult to assess the factual accuracy of answers without references: even with the help of a search engine, expertise is often required. However, WebGPT and human demonstrators provide answers with references.
Objectivity. The use of minimal instructions makes it harder to know what criteria are being used to choose one answer over another. Our more detailed instructions enable more interpretable and consistent comparisons.
Blinding. Even with citations and references stripped, WebGPT composes answers that are different in style to Reddit answers, making the comparisons less blinded. In contrast, WebGPT and human demonstrators compose answers in similar styles. Additionally, some ELI5 answers contained links, which we instructed labelers not to follow, and this could have biased labelers against those answers.
Answer intent. People ask questions on ELI5 to obtain original, simplified explanations rather than answers that can already be found on the web, but these were not criteria we wanted answers to be judged on. Moreover, many ELI5 questions only ever get a small number of low-effort answers. With human demonstrations, it is easier to ensure that the desired intent and level of effort are used consistently.

4.2 TruthfulQA

To further probe the abilities of WebGPT, we evaluated WebGPT on TruthfulQA [Lin et al., 2021], an adversarially-constructed dataset of short-form questions. TruthfulQA questions are crafted such that they would be answered falsely by some humans due to a false belief or misconception. Answers are scored on both truthfulness and informativeness, which trade off against one another (for example, “I have no comment” is considered truthful but not informative).

We evaluated both the base GPT-3 models used by WebGPT and the WebGPT models themselves on TruthfulQA. For GPT-3, we used both the “QA prompt” and the “helpful prompt” from Lin et al. [2021], and used the automated metric, since this closely tracks human evaluation on answers produced by the GPT-3 model family. For WebGPT, we used human evaluation, since WebGPT’s answers are out-of-distribution for the automated metric. TruthfulQA is a short-form dataset, so

Figure 3: TruthfulQA results. The amount of rejection sampling (the n in best-of-n) was chosen to be compute-efficient (see Figure 8). Error bars represent ±1 standard error.

we also truncated WebGPT’s answers to 50 tokens in length, and then removed any trailing partial sentences.³

Our results are shown in Figure 3. All WebGPT models outperform all GPT-3 models (with both prompts) on both the percentage of truthful answers and the percentage of truthful and informative answers. Moreover, the percentage of truthful and informative answers increases with model size for WebGPT, unlike GPT-3 with either prompt. Further qualitative analysis of WebGPT’s performance on TruthfulQA is given in Section 6.1.

4.3 TriviaQA

We also evaluated the WebGPT 175B BC model on TriviaQA [Joshi et al., 2017]. These results are given in Appendix G.

5 Experiments

5.1 Comparison of training methods

We ran a number of additional experiments comparing reinforcement learning (RL) and rejection sampling (best-of-n) with each other and with the behavior cloning (BC) baseline. Our results are shown in Figures 4 and 5. Rejection sampling provides a substantial benefit, with the 175B best-of-64 BC model being preferred 68% of the time to the 175B BC model. Meanwhile, RL provides a smaller benefit, with the 175B RL model being preferred 58% of the time to the 175B BC model.

Even though both rejection sampling and RL optimize against the same reward model, there are several possible reasons why rejection sampling outperforms RL:

It may help to have many answering attempts, simply to make use of more inference-time compute.
The environment is unpredictable: with rejection sampling, the model can try visiting many more websites, and then evaluate the information it finds with the benefit of hindsight.

Figure 4: Preference of RL models over BC models, with (right) and without (left) using rejection sampling. RL slightly improves pref- erence, but only when not using rejection sam- pling. Error bars represent ±1 standard error.

Figure 5: Preference of the 175B best-of-n BC model over the BC model. The validation RM prediction is obtained using the estimator described in Appendix I, and predicts human preference well in this setting. The shaded region represents ±1 standard error.

The reward model was trained primarily on data collected from BC and rejection sampling policies, which may have made it more robust to overoptimization by rejection sampling than by RL.
RL requires hyperparameter tuning, whereas rejection sampling does not.

The combination of RL and rejection sampling also fails to offer much benefit over rejection sampling alone. One possible reason for this is that RL and rejection sampling are optimizing against the same reward model, which can easily be overoptimized (especially by RL, as noted above). In addition to this, RL reduces the entropy of the policy, which hurts exploration. Adapting the RL objective to optimize rejection sampling performance is an interesting direction for future research.

It is also worth highlighting the importance of carefully tuning the BC baseline for these comparisons. As discussed in Appendix E, we tuned the number of BC epochs and the sampling temperature using a combination of human evaluations and reward model score. This alone closed much of the gap we originally saw between BC and RL.

5.2 Scaling experiments

We also conducted experiments to investigate how model performance varied with the size of the dataset, the number of model parameters, and the number of samples used for rejection sampling. Since human evaluations can be noisy and expensive, we used the score of a 175B “validation” reward model (trained on a separate dataset split) for these experiments. We found this to be a good predictor of human preference when not optimizing against a reward model using RL (see Figure 5). Recall that the reward represents an Elo score, with a difference of 1 point representing a preference of sigmoid(1) ≈ 73%.

Scaling trends with dataset size and parameter count are shown in Figures 6 and 7. For dataset size, doubling the number of demonstrations increased the policy’s reward model score by about 0.13, and doubling the number of comparisons increased the reward model’s accuracy by about 1.8%. For parameter count, the trends were noisier, but doubling the number of parameters in the policy increased its reward model score by roughly 0.09, and doubling the number of parameters in the reward model increased its accuracy by roughly 0.4%.

For rejection sampling, we analyzed how to trade off the number of samples against the number of model parameters for a given inference-time compute budget (see Figure 8). We found that it is

Figure 6: BC scaling, varying the proportion of the demonstration dataset and parameter count of the policy.

Figure 7: RM scaling, varying the proportion of the comparison dataset and parameter count of the reward model.

Figure 8: Best-of-n scaling, varying the parameter count of the policy and reward model together, as well as the number of answers sampled.

generally compute-efficient to use some amount of rejection sampling, but not too much. The models for our main evaluations come from the Pareto frontier of this trade-off: the 760M best-of-4 model, the 13B best-of-16 model, and the 175B best-of-64 model.

6 Discussion

6.1 Truthfulness of WebGPT

As NLP systems improve and become more widely deployed, it is becoming increasingly important to develop techniques for reducing the number of false statements they make [Evans et al., 2021]. To assess the contribution of WebGPT to this aim, it is helpful to distinguish two categories of false statement made by a model:

Imitative falsehoods. These are false statements that are incentivized by the training objective (even in the limit of infinite data and compute), such as reproducing common misconceptions [Lin et al.,2021].
Non-imitative falsehoods. These are false statements that are the result of the model failing to achieve its training objective, including most hallucinations, which are statements that are false, but look plausible at a glance [Maynez et al., 2020].

Our TruthfulQA results suggest that WebGPT produces fewer imitative falsehoods than GPT-3. We believe this is because WebGPT is incentivized to prefer reliable sources (both because of filtering performed by the Bing API, and because we specify this in our instructions). Nevertheless, as shown in Table 3, WebGPT still sometimes quotes from highly unreliable sources in response to TruthfulQA questions. We hypothesize that this is because of the distribution shift from ELI5 to TruthfulQA, and that training on adversarially-selected questions is a promising way to improve this. It would be important in such an endeavor to pay close attention to labeler judgments of source trustworthiness (see Appendix C).

Our results on ELI5 suggest that WebGPT also produces fewer non-imitative falsehoods than GPT-

We did not test this hypothesis directly, since we found that it was challenging for labelers to spot subtle hallucinations. However, prior work shows that the use of retrieval reduces the rate of hallucinations [Shuster et al., 2021], and moreover WebGPT performs about as well as human demonstrations for factual accuracy on ELI5 (see Figure 2(a)). Nevertheless, WebGPT still sometimes produces non-imitative falsehoods, which are typically mistakes when attempting to paraphrase or synthesize information rather than wild hallucinations.

Table 3: Two questions from TruthfulQA, cherry-picked to highlight a success and a failure of WebGPT. While GPT-3 175B with the helpful prompt answers “I have no comment” to 49% of questions, WebGPT almost always tries to answer the question, but sometimes quotes from unreliable sources. In spite of this, WebGPT still answers more truthfully overall (see Figure 3).

Key: X = false, ✓ = true but uninformative, ✓ = true and informative

6.2 Perceived truthfulness of WebGPT

In order to assess the benefits and risks of WebGPT, it is necessary to consider not only how often it makes false statements, but also how likely users are to rely on those statements. Although WebGPT makes false statements less frequently than GPT-3, its answers also appear more authoritative, partly because of the use of citations. In combination with the well-documented problem of “automation bias” [Goddard et al., 2012], this could lead to overreliance on WebGPT’s answers. This is particularly problematic because, as discussed in Section 6.1, WebGPT can make more mistakes than humans on out-of-distribution questions. Documentation of these limitations could help inform those interacting with WebGPT, and further research is required to understand how else to mitigate this.

6.3 Reinforcement of bias

There are a number of ways in which WebGPT tends to perpetuate and reinforce existing assumptions and biases. Firstly, WebGPT inherits the biases of the base model from which it is fine tuned, GPT-3 [Brown et al., 2020], and this influences the way in which it chooses to search for and synthesize information. Search and synthesis both depend on the ability to include and exclude material depending on some measure of its value, and by incorporating GPT-3’s biases when making these decisions, WebGPT can be expected to perpetuate them further. Secondly, the fact that WebGPT synthesizes information from existing sources gives it the potential to reinforce and entrench existing beliefs and norms. Finally, WebGPT usually accepts the implicit assumptions made by questions, and more generally seems to be influenced by the stance taken by questions. This is something that could exacerbate confirmation bias in users.

These problems could be mitigated with improvements both to WebGPT’s base model and to WebGPT’s training objective, and we discuss some alternative objectives in the next section. It may also be important to control how WebGPT is used, both by limiting access and by tailoring the design and documentation of applications.

Additional analysis of the effect of question stance and of reference point bias is given in Appendix H.

6.4 Using references to evaluate factual accuracy

Central to our approach is the use of references collected by the model to aid human evaluation of factual accuracy. This was previously suggested by Metzler et al. [2021], and has several benefits:

More accurate feedback. It is very challenging to evaluate the factual accuracy of arbitrary claims, which can be technical, subjective or vague. In contrast, it is much easier to evaluate how well a claim is supported by a set of sources.
Less noisy feedback. It is also easier to specify an unambiguous procedure for evaluating how well a claim is supported by a set of sources, compared to evaluating the factual accuracy of an arbitrary claim. This improves agreement rates between labelers, which helps data efficiency.
Transparency. It is much easier to understand how WebGPT composes answers than it is for GPT-3, since the entire browsing process can be inspected. It is also straightforward for end-users to follow up on sources to better judge factual accuracy for themselves.

Despite these benefits, references are far from a panacea. Our current procedure incentivizes models to cherry-pick references that they expect labelers to find convincing, even if those references do not reflect a fair assessment of the evidence. As discussed in Section 6.3, there are early signs of this happening, with WebGPT accepting the implicit assumptions of questions, and the problem is likely to be exacerbated by more capable models and more challenging or subjective questions. We could mitigate this using methods like debate [Irving et al., 2018], in which models are trained to find evidence both for and against different claims. Such setups can also be viewed as simple cases of recursive reward modeling [Leike et al., 2018] and Iterated Amplification [Christiano et al., 2018], in which the model assists its own evaluation.

Our approach also raises a challenging problem with societal implications: how should factual accuracy be evaluated when training AI systems? Evans et al. [2021, Section 2] propose a number of desiderata, but a substantial gap remains between these and the highly specific criteria needed to train current AI systems with reasonable data efficiency. We made a number of difficult judgment calls, such as how to rate the trustworthiness of sources (see Appendix C), which we do not expect universal agreement with. While WebGPT did not seem to take on much of this nuance, we expect these decisions to become increasingly important as AI systems improve, and think that cross-disciplinary research is needed to develop criteria that are both practical and epistemically sound.

6.5 Risks of live web access

At both train and inference time, WebGPT has live access to the web via our text-based browsing environment. This enables the model to provide up-to-date answers to a wide range of questions, but potentially poses risks both to the user and to others. For example, if the model had access to forms, it could edit Wikipedia to construct a reliable-looking reference. Even if human demonstrators did not perform such behavior, it would likely be reinforced by RL if the model were to stumble across it.

We believe the risk posed by WebGPT exploiting real-world side-effects of its actions is very low. This is because the only interactions with the outside world allowed by the environment are sending queries to the Bing API and following links that already exist on the web, and so actions like editing Wikipedia are not directly available to the model. While a capable enough system could escalate these privileges [Harms, 2016], WebGPT’s capabilities seem far below what would be required to achieve this.

Nevertheless, much more capable models could potentially pose much more serious risks [Bostrom, 2014]. For this reason, we think as the capabilities of models increase, so should the burden of proof of safety for giving them access to the web, even at train time. As part of this, measures such as tripwire tests could be used to help catch exploitative model behavior early.

7 Related work

Combining machine learning with an external knowledge base, for the task of question-answering, preceded the rise of pre-trained language models in the late 2010s. One notable system of this kind was DeepQA (also known as IBM Watson), which was used to beat the best humans at Jeopardy

[Ferrucci et al., 2010]. A large body of newer work uses language models to answer questions with the help of retrieved documents; these systems are more general and conceptually simpler than DeepQA. One approach is to use inner product search to retrieve relevant documents and then generate an answer given these documents:

p(passage query) ∝ exp(embed(passage) ⋅ embed(query)). (1)

Given a training dataset that specifies relevant passages for each question, dense passage retrieval (DPR) trains the retriever directly using a contrastive objective [Karpukhin et al., 2020]. Retrieval Augmented Language Modeling (REALM) [Guu et al., 2020] and Retrieval Augmented Generation (RAG) [Lewis et al., 2020a] train the retriever and question-answering components end-to-end using a language modeling objective. Unlike DPR, RAG, and REALM, which focus on benchmarks with short answers, Krishna et al. [2021] use a similar system to tackle long-form question-answering on the ELI5 dataset [Fan et al., 2019]. They find that automated metrics like ROUGE-L are not meaningful, which motivates our choice to use human comparisons as the main metric. Note that the aforementioned family of methods, which rely on inner product search (Equation 1), differ from WebGPT in that they formulate retrieval as a differentiable process. Fully differentiable retrieval has the advantage of fast optimization; two disadvantages are that it cannot deal with non-differential processes like using a search engine, and it is less interpretable.

Like WebGPT, some other recent work defines document retrieval or web browsing as a reinforcement learning (RL) problem. Yuan et al. [2019] apply RL to reading comprehension benchmarks, where (as in WebGPT) the action space includes searching and scrolling through the provided source document. They suggest web-level QA (like WebGPT) as a direction for future work. Adolphs et al. [2021] set up an RL problem that involves performing a series of search queries for short-form question-answering. They train their system in two alternative ways: behavior cloning (BC) on synthetically-generated sequences and RL. Finally, there is another body of work that uses BC and RL to control web browsers, for automating other tasks besides question-answering [Shi et al., 2017, Gur et al., 2018].

8 Conclusion

We have demonstrated a novel approach to long-form question-answering, in which a language model is fine-tuned to use a text-based web-browsing environment. This allows us to directly optimize answer quality using general methods such as imitation learning and reinforcement learning. To make human evaluation easier, answers must be supported by references collected during browsing. Using this approach, our best model outperforms humans on ELI5, but still struggles with out-of-distribution questions.

9 Author contributions

Reiichiro Nakano, Jacob Hilton, Suchir Balaji and John Schulman jointly led the project, devel- oped the codebase, ran all data collection and experiments, and wrote the paper.

Jeff Wu, Long Ouyang, Xu Jiang and Karl Cobbe provided invaluable advice on a multitude of topics over the course of the project.

Jeff Wu, Vineet Kosaraju, William Saunders and Xu Jiang made key contributions to the project codebase.

Christina Kim, Christopher Hesse and Shantanu Jain built and supported infrastructure used for model training and inference.

Tyna Eloundou and Gretchen Krueger conducted the analysis of bias and contributed to the paper.

Kevin Button and Matthew Knight provided computer security support.

Benjamin Chess provided computer networking support.

10 Acknowledgments

We would like to thank Leo Gao, Hyeonwoo Noh and Chelsea Voss for working on future directions; Steve Dowling, Christian Gibson, Peter Hoeschele, Fraser Kelton, Bianca Martin, Bob McGrew,

Felipe Such and Hannah Wong for technical, logistical and communications support; Steven Adler, Miles Brundage, David Farhi, William Guss, Oleg Klimov, Jan Leike, Ryan Lowe, Diogo Moitinho de Almeida, Arvind Neelakantan, Alex Ray, Nick Ryder and Andreas Stuhlmüller for helpful discussions; Owen Cotton-Barratt, Owain Evans, Jared Kaplan, Girish Sastry, Carl Shulman, Denis Yarats and Daniel Ziegler for helpful discussions and feedback on drafts; Beth Barnes and Paul Christiano for helpful discussions and feedback on drafts, and in particular for suggesting the project; and Dario Amodei for suggesting to work on factual inaccuracy in language models. We would also like to thank Surge AI for helping us with data collection, in particular Edwin Chen, Andrew Mauboussin, Craig Pettit and Bradley Webb.

Finally, we would like to thank all of our contractors for providing demonstrations and comparisons, without which this project would not have been possible, including: Jamie Alexander, Andre Gooden, Jacquelyn Johns, Rebecca Kientz, Ashley Michalski, Amy Dieu-Am Ngo, Alex Santiago, Alice Sorel, Sam Thornton and Kelli W. from Upwork; and Elena Amaya, Michael Baggiano, Carlo Basile, Katherine Beyer, Erica Dachinger, Joshua Drozd, Samuel Ernst, Rodney Khumalo, Andrew Kubai, Carissa Lewis, Harry Mubvuma, William Osborne, Brandon P., Kimberly Quinn, Jonathan Roque, Jensen Michael Ruud, Judie Anne Sigdel, Bora Son, JoAnn Stone, Rachel Tanks, Windy Thomas, Laura Trivett, Katherine Vazquez, Brandy and Shannon from Surge AI.

References

L. Adolphs, B. Boerschinger, C. Buck, M. C. Huebscher, M. Ciaramita, L. Espeholt, T. Hofmann, and Y. Kilcher. Boosting search engines with interactive agents. arXiv preprint arXiv:2109.00527, 2021.

S. Bhakthavatsalam, D. Khashabi, T. Khot, B. D. Mishra, K. Richardson, A. Sabharwal, C. Schoenick,

O. Tafjord, and P. Clark. Think you have solved direct-answer question answering? Try ARC-DA, the direct-answer AI2 reasoning challenge. arXiv preprint arXiv:2102.03315, 2021.

N. Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan,

P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

H. Cheng, Y. Shen, X. Liu, P. He, W. Chen, and J. Gao. UnitedQA: A hybrid approach for open domain question answering. arXiv preprint arXiv:2101.00178, 2021.

D. Chong and J. N. Druckman. Framing theory. Annu. Rev. Polit. Sci., 10:103–126, 2007.

P. Christiano, B. Shlegeris, and D. Amodei. Supervising strong learners by amplifying weak experts.

arXiv preprint arXiv:1810.08575, 2018.

O. Evans, O. Cotton-Barratt, L. Finnveden, A. Bales, A. Balwit, P. Wills, L. Righetti, and W. Saunders. Truthful AI: Developing and governing AI that does not lie. arXiv preprint arXiv:2110.06674, 2021.

A. Fan, Y. Jernite, E. Perez, D. Grangier, J. Weston, and M. Auli. ELI5: Long form question answering. arXiv preprint arXiv:1907.09190, 2019.

D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock,

E. Nyberg, J. Prager, et al. Building watson: An overview of the deepqa project. AI magazine, 31 (3):59–79, 2010.

K. Goddard, A. Roudsari, and J. C. Wyatt. Automation bias: a systematic review of frequency, effect mediators, and mitigators. Journal of the American Medical Informatics Association, 19(1): 121–127, 2012.

I. Gur, U. Rueckert, A. Faust, and D. Hakkani-Tur. Learning to navigate the web. arXiv preprint arXiv:1812.09195, 2018.

K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020.

M. Harms. Crystal Society. Crystal Trilogy. CreateSpace Independent Publishing Platform, 2016. ISBN 9781530773718.

G. Irving, P. Christiano, and D. Amodei. AI safety via debate. arXiv preprint arXiv:1805.00899, 2018.

M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.

V. Karpukhin, B. Og˘uz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020.

K. Krishna, A. Roy, and M. Iyyer. Hurdles to progress in long-form question answering. arXiv preprint arXiv:2103.06332, 2021.

J. Leike, D. Krueger, T. Everitt, M. Martic, V. Maini, and S. Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018.

P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih,

T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv preprint arXiv:2005.11401, 2020a.

P. Lewis, P. Stenetorp, and S. Riedel. Question and answer test-train overlap in open-domain question answering datasets. arXiv preprint arXiv:2008.02637, 2020b.

S. Lin, J. Hilton, and O. Evans. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.

J. Maynez, S. Narayan, B. Bohnet, and R. McDonald. On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661, 2020.

D. Metzler, Y. Tay, D. Bahri, and M. Najork. Rethinking search: Making experts out of dilettantes.

arXiv preprint arXiv:2105.02274, 2021.

B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4):838–855, 1992.

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

T. Shi, A. Karpathy, L. Fan, J. Hernandez, and P. Liang. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144. PMLR, 2017.

K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567, 2021.

N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and

P. Christiano. Learning to summarize from human feedback. arXiv preprint arXiv:2009.01325, 2020.

X. Yuan, J. Fu, M.-A. Cote, Y. Tay, C. Pal, and A. Trischler. Interactive machine comprehension with information seeking agents. arXiv preprint arXiv:1908.10449, 2019.

A Environment design details

Our text-based web-browsing environment is written mostly in Python with some JavaScript. For a high-level overview, see Section 2. Further details are as follows:

When a search is performed, we send the query to the Microsoft Bing Web Search API, and convert this to a simplified web page of results.
When a link to a new page is clicked, we call a Node.js script that fetches the HTML of the web page and simplifies it using Mozilla’s Readability.js.
We remove any search results or links to reddit.com or quora.com, to prevent the model copying answers from those sites.
We take the simplified HTML and convert links to the special format

【<link ID>†<link text>†<destination domain>】, or

【<link ID>†<link text>】 if the destination and source domains are the same. Here, the link ID is the index of the link on the page, which is also used for the link-clicking command. We use special characters such as 【 and 】 because they are rare and encoded in the same few ways by the tokenizer, and if they appear in the page text then we replace them by similar alternatives.

We convert superscripts and subscripts to text using ^ and _, and convert images to the special format [Image: <alt text>], or [Image] if there is no alt text.
We convert the remaining HTML to text using html2text.
For text-based content types other than HTML, we use the raw text. For PDFs, we convert them to text using pdfminer.six. For all other content types, and for errors and timeouts, we use an error message.
We censor any pages that contain a 10-gram overlap with the question (or reference answer, if provided) to prevent the model from cheating, and use an error message instead.
We convert the title of the page to text using the format <page title> (<page domain>). For search results pages, we use Search results for: <query>.
When a find in page or quote action is performed, we compare the text from the command against the page text with any links stripped (i.e., including only the text from each link). We also ignore case. For quoting, we also ignore whitespace, and allow the abbreviated format <start text>━<end text> to save tokens.
During browsing, the state of the browser is converted to text as shown in Figure 1(b). For the answering phase (the last step of the episode), we convert the question to text using the format <question>■, and follow this by each of the collected quotes in the format [<quote number>] <quote page title> (<quote page domain>)

<double new line><quote extract>■.

B Question dataset details

For our demonstration and comparison datasets, the vast majority of questions were taken from ELI5 [Fan et al., 2019], to which we applied the follow post-processing:

We included URLs in full, rather than using special _URL_ tokens.
We filtered out questions with the title “[deleted by user]”, and ignored the selftext “[deleted]” and “[removed]”. (The “selftext” is the body of the post.)
We concatenated the title and any non-empty selftext, separated by a double new line.
We prepended “Explain: ” to questions that were not phrased as actual questions (e.g., we used “Explain: gravity” rather than simply “gravity”).

The final step was performed because there is sometimes an implicit “Explain Like I’m Five” at the start of questions. We considered a question to be phrased as an actual question if it included either a question mark, or one of the following sequences of characters with a regex-word boundary at either end, case-insensitively:

explain, eli5, which, what, whats, whose, who, whos, whom, where, wheres, when, whens, how, hows, why, whys, am, is, isn, isnt, are, aren, arent, was, wasn, wasnt, were, weren, werent, do, don, dont, does, doesn, doesnt, did, didn, didnt, can, cant, could, couldn, couldnt, have, haven, havent, has, hasn, hasnt, may, might, must, mustn, mustnt, shall, shant, should, shouldn, shouldnt, will, wont, would, wouldn, wouldnt

For diversity and experimentation, we also mixed in a small number of questions from the following datasets:

TriviaQA. This is a dataset of short-form questions taken from trivia websites [Joshi et al.,2017].
AI2 Reasoning Challenge (ARC). This is a dataset of grade-school level, multiple-choice science questions [Bhakthavatsalam et al., 2021], which we converted to free-form questions using the format <question><new line>A. <option A><new line>. This dataset is sub-divided into two difficulties, “Challenge” and “Easy”.

Hand-written. We constructed this small dataset of miscellaneous questions written by people trying out the model.
ELI5 fact-check. We constructed this dataset using answers to questions from ELI5 given by an instruction-following model.⁴ Each question has the following format: Fact-check each of the claims in the following answer. <double new line>Question:

<ELI5 question><double new line>Answer: <model answer>

The numbers of demonstrations and comparisons we collected for each of these datasets are given in Table 4.

Table 4: Breakdown of our demonstrations and comparisons by question dataset.

Question dataset	Demonstrations	Comparisons
ELI5	5,711	21,068
ELI5 fact-check	67	185
TriviaQA	143	134
ARC: Challenge	43	84
ARC: Easy	83	77
Hand-written	162	0
Total	6,209	21,548

C Data collection details

To collect demonstrations and comparisons, we began by hiring freelance contractors from Upwork (https://www.upwork.com), and then worked with Surge AI (https://www.surgehq.ai) to scale up our data collection. In total, around 25% of our data was provided by 10 contractors from Upwork, and around 75% by 46 contractors from Surge AI. The top 5 contractors provided around 50% of the data.

For both types of task, we provided contractors with a video and a detailed instruction document (linked below). Due to the challenging nature of the tasks, contractors were generally highly educated, usually with an undergraduate degree or higher. Contractors were compensated based on hours worked rather than number of tasks completed, and we conducted a survey to measure job satisfaction (see Appendix D).

For data quality, we put prospective contractors through a paid trial period lasting a few hours, and manually checked their work. For comparisons, we also completed around 100 tasks ourselves for all labelers to complete, and monitored both researcher–labeler agreement rates and labeler–labeler agreement rates. Treating the agreement rate between a neutral label and a non-neutral label as 50%, we measured a final researcher-labeler agreement rate of 74%, and a labeler-labeler agreement rate of 73%.

Demonstrations took an average of around 15 minutes each, and comparisons took an average of around 10 minutes each. Despite conventional wisdom that human labelling tasks should be quick and repeatable, we did not think it would be straightforward to decompose our tasks into significantly simpler ones, but we consider this to be a promising direction for further research.

C.1 Demonstrations

We designed the demonstration interface in such a way that, as a rule, the user is given the same information as the model, and has the same actions available. There were a couple of exceptions to this:

Unlike humans, the model has no memory of previous steps. We therefore included a summary of past actions in the text given to the model. However, we felt that it was unnecessary to display this to humans.
The Scrolled <up, down> <2, 3> actions are useful for reducing the number of actions taken, but humans are used to scrolling one step at a time. We therefore made these actions unavailable to humans, and instead simply merged any repeated Scrolled <up, down> 1 actions that they made.

The full instruction document we provided to contractors for demonstrations can be viewed here.

C.2 Comparisons

To minimize label noise, it is important to make comparisons as unambiguous as possible. We therefore designed the following procedure for comparing two answers to a given question:

Read the question, and flag if it does not make sense or should not be answered (in which case the rest of the comparison is skipped).
Read the first answer and its references.
Rate the trustworthiness of any references relied upon by the answer.Annotate each of the claims in the answer with the level of support it has and its relevance to the question. A screenshot of the annotation tool is shown in Figure 9.Repeat steps 2–4 for the second answer and its references.Give comparison ratings for the amount of unsupported and irrelevant information, the usefulness of information with different levels of support, and coherence.
Weighing everything up, give a final comparison rating for overall usefulness.

Figure 9: Screenshot from the comparison interface, showing the annotation tool.

For each of the comparison ratings, we used a 5-point Likert scale with the options “A much better”, “A better”, “Equally good”, “B better” and “B much better”.

Importantly, we did not require contractors to perform independent research to judge the factual accuracy of answers, since this would have been difficult and subjective. Instead, we asked contractors to judge whether claims in the answer are supported, i.e., either backed up by a reliable reference, or common knowledge.

For the final comparison rating, we encouraged contractors to use their best judgment, but to roughly consider the following criteria in descending order of priority:

Whether or not the answer contains unsupported information.
Whether or not the core question has been answered.
Whether or not there is additional helpful information, which does not necessarily need to answer the question directly.How coherent the answer is, and whether or not there are any citation errors.
How much irrelevant information there is in the answer. (This can be higher priority in extreme cases.)

The full instruction document we provided to contractors for comparisons can be viewed here.

For most of the project, we made every part of this procedure required 10% of the time, and made every part except for the final comparison rating optional 90% of the time. Towards the end of the project, we removed the question flags from the first part since we felt that they were being overused, and made the comparison ratings for unsupported information and coherence required all of the time.

Despite the complexity of this procedure, we only used the final comparison rating in training, even collapsing together the “much better” and “better” ratings. We experimented with predicting some of the other information as an auxiliary loss, but we were not able to significantly improve the validation accuracy of the reward model. Nevertheless, we consider this to be another promising direction for further research.

D Contractor survey

It was valuable to gather feedback from our contractors, both to understand and improve their process, and to monitor job satisfaction. To this end, we sent them a questionnaire with the following questions:

Please say how much you agree with each of the statements. (Required 5-point Likert rating and optional comments)
- It was clear from the instructions what I was supposed to do.
- I found the task enjoyable and engaging.
- I found the task repetitive.
- I was paid fairly for doing the task.
- Overall, I am glad that I did this task.
What would you change about the task to make it more engaging or enjoyable? (Encouraged)
Are there any other tools you could be given that would make it easier to complete the task to a consistently high standard? (Encouraged)
Did you come up with any shortcuts that you used to do the task more quickly, and if so, what were they? (Encouraged)
Do you have any other comments? (Optional)

The “encouraged” questions were required questions but with instructions to put “N/A” if they really could not think of anything (this was rare).

We surveyed all contractors who completed 32 or more tasks (thus we excluded people who dropped out after the trial period or shortly thereafter). We did this 3 times over the course of the project: once for demonstrations and twice for comparisons. The quantitative results from these surveys are given in Figure 10. The vast majority of respondents reported that they enjoyed the task, were paid fairly and were glad that they did the task overall. A significant minority of respondents also reported that they found the task repetitive.

Figure 10: Likert ratings aggregated over all 3 of our contractor surveys. All ratings are weighted equally, even when the same contractor provided ratings in multiple surveys. In total, there are 41 ratings for each question.

E Hyperparameters

Hyperparameters for all of our training methods are given in Tables 6 and 7. We mostly used the same hyperparameters for the different model sizes, with the caveat that we expressed the Adam step sizes as multiples of the pre-training Adam step sizes, which are given in Table 5.

For each training method, we implemented some form of early stopping:

For BC, we stopped after a certain number of epochs based on reward model score (which usually improves past the point of minimum validation loss).
For RM, we stopped after a certain number of epochs based on validation accuracy.
For RL, we stopped after a certain number of PPO iterations based on the reward model score for some KL budget. The KL here is measured from the BC model, and summed over the episode. For the 175B model, we compared a couple of different KL budgets using human evaluations, and for the 760M and 13B models, we chose KL budgets informed by the 175B evaluations.

The points at which we early stopped are given in Table 8.

We tuned hyperparameters using similar criteria to early stopping. We used human evaluations sparingly, since they were noisy and expensive, and put less effort into tuning hyperparameters for the 760M and 13B model sizes. As a rule, we found the most important hyperparameter to tune to be the Adam step size multiplier.

For BC and RM, we used Polyak–Ruppert averaging [Polyak and Juditsky, 1992], taking an exponentially-weighted moving average (EMA) of the weights of the model as the final check- point. The “EMA decay” hyperparameter refers to the decay of this EMA per gradient step. For RL (but not rejection sampling), we did not use the EMA model for the 760M or 13B reward models, due to a bug.

For RL, most PPO hyperparameters did not require tuning, but a few points are worth noting:

As discussed in Section 3 of the paper, the reward is the sum of the reward model score at the end of each episode and a KL penalty from the BC model at each token. Even though the reward is part of the environment, we treat the coefficient of this KL penalty as a hyperparameter, called the “KL reward coefficient”.
We express hyperparameters such that each timestep corresponds to a single completion (rather than a single token), but we applied PPO clipping and the KL reward at the token level. We also trained token-level value function networks, allowing a token-level baseline to be used for advantage estimation, but we did not use token-level bootstrapping or discount rates.
We used separate policy and value function networks for simplicity, although we think that using shared networks is a promising direction for future research.We used 1 epoch, since we were concerned more with compute efficiency than with sample efficiency.Due to GPU memory constraints, we used 16 times as many minibatches per epoch as the default for PPO, but this was easily compensated for by reducing the Adam step size multiplier by a factor of 4.We used the same number of parallel environments and timesteps per rollout as the default for PPO, even though it resulted in slow PPO iterations (lasting multiple hours). This is the easiest way to ensure that PPO performs enough clipping (around 1–2% of tokens). Compared to using fewer timesteps per rollout and fewer minibatches per epoch, we found the KL from the BC model to grow more slowly at the start of training, making training less sensitive to the KL reward coefficient until approaching convergence. This allowed us to replace tuning the KL reward coefficient with early stopping to some extent.
We did not use an entropy bonus, which is usually used for exploration. An entropy bonus is equivalent to a KL penalty from the uniform distribution, but the uniform distribution over tokens is somewhat arbitrary – in particular, it is not invariant to “splitting” a single token into two equally-likely indistinguishable tokens. Instead, the KL reward prevents entropy collapse in a more principled way. We still found it useful to measure entropy for monitoring purposes.
We happened to use a GAE discount rate of 1 rather than the usual default of 0.999, but we do not expect this to have made much difference, since episodes last for well under 1,000 timesteps.
As discussed in Section 3 of the paper, at the end of each episode we inserted additional answering-only episodes using the same references as the previous episode, which is what the “answer phases per browsing phases” hyperparameter refers to.
Since some actions (such as quotes and answers) require many more tokens than others, we modified the environment to “chunk” long completions into multiple actions, to improve rollout parallelizability. This is what the “maximum tokens per action” hyperparameter refers to. Note that it has a minor effect on GAE.

Table 5: Pre-training Adam step sizes, to which we apply multipliers. These are the same as those given in Brown et al. [2020].

Table 6: Behavior cloning and reward modeling hyperparameters.

Table 7: Reinforcement learning hyperparameters.

Hyperparameter	Value
Number of parallel environments	256
Timesteps per rollout (T )	256
Epochs (E)	1
Minibatches per epoch	128
Adam step size multiplier	0.004
KL reward coefficient	0.02
Entropy coefficient	0
PPO clipping parameter (E)	0.2
GAE discount rate (γ)	1
GAE bootstrapping parameter (λ)	0.95
Reward normalization?	No
Advantage normalization?	Yes
Answer phases per browsing phase	16
Maximum tokens per action	64

Table 8: Early stopping points.

Model size	BC epochs	RM epochs	RL stopping point (PPO iterations)	RL stopping point (KL per episode)
760M	2	1	19	10.5 nats
13B	5	1	30	6.8 nats
175B	3	1	18	∼12 nats

F Minimal comparison instructions

As discussed in Section 4, for comparing WebGPT’s answers to the reference answers from the ELI5 dataset, we used a much more minimal set of instructions, for ecological validity. The full instructions consisted of the following text:

Comparing answers (minimal version)

In this task, you’ll be provided with a question and a set of two answers. We’d like you to provide ratings comparing the two answers for the following categories:

Accuracy – which answer is more factually accurate?
- Please use a search engine to fact-check claims in an answer that aren’t ob- vious to you. Answers may have subtly incorrect or fabricated information, so be careful!
Coherence – which answer is easier to follow?
Usefulness overall – all things considered, which answer would be more helpful to the person who asked this question?

FAQ

What should I do if there’s a URL in the question or one of the answers?
- Please don’t click any URLs and interpret the questions and answers based on their remaining textual content.
What should I do if the question doesn’t make any sense, or isn’t a question?
- Sometimes you’ll see a statement instead of a question, which you should interpret as “Explain: . . . ”.
  - E.g. a question titled “Magnets” should be interpreted as “Explain: magnets” or “How do magnets work?”
- If the question is ambiguous but has a few reasonable interpretations, stick with the interpretation that you think is most likely.
- If the question still doesn’t make sense (e.g. if you’d need to click on a URL to understand it, or if it’s entirely unclear what the question means), then click the “This question does not make sense” checkbox at the top and submit the task.
  - This should be rare, so use this sparingly.
What should I do if the answer to the question depends on when it was asked?
- In this case, please be charitable when judging answers with respect to when the question was asked – an answer is considered accurate if it was accurate at any point within the last 10 years.
  - E.g. valid answers to the question “Who is the current U.S. president” are Barack Obama, Donald Trump, and Joe Biden.
What should I do if I only see one answer?
- If you only see one answer, you’ll be asked to provide absolute ratings for that answer (very bad, bad, neutral, good, or very good) instead of comparison ratings.
  - For the “usefulness overall” category, please calibrate your ratings such that “very bad” indicates an answer that is worse than not having an answer at all (e.g. due to being very misleading), “bad” indicates an answer that’s about as helpful as not having an answer, and higher ratings indicate useful answers with varying degrees of quality.

G TriviaQA evaluation

Although WebGPT was trained primarily to perform long-form question-answering, we were inter- ested to see how well it would perform short-form question-answering. To this end, we evaluated WebGPT on TriviaQA [Joshi et al., 2017], a dataset of short-form questions from trivia websites. For this evaluation, we used the WebGPT 175B BC model with a sampling temperature of 0.8 and no rejection sampling.

To address the mismatch between WebGPT’s long-form answers and the short-form answers expected by TriviaQA, we fine-tuned GPT-3 175B to answer TriviaQA questions conditioned on the output of WebGPT. Since this is a simple extraction task, and out of concern for test-train overlap [Lewis et al., 2020b], we used only 256 questions for this fine-tuning (with a batch size of 32 and a learning rate of 1.5 × 10⁻⁶). This was in addition to the 143 TriviaQA demonstrations on which the WebGPT

model was trained. As an ablation, we also fine-tuned GPT-3 175B in the same way, but without the

WebGPT output.

Our results are shown in Table 9, along with those of the best existing model, UnitedQA [Cheng et al., 2021]. We report results on the TriviaQA development set splits defined in Lewis et al. [2020b]. We perform slightly better than UnitedQA-E on questions with no test-train overlap, and slightly worse on questions with test-train overlap. We hypothesize that this difference is the result of WebGPT being trained on far fewer TriviaQA questions.

Table 9: TriviaQA development set accuracy (exact match scores).

Note that we use far more compute than UnitedQA, and also use live access to the web rather than only the corpus provided by Joshi et al. [2017] (although we still censor trivia websites in the same way for this evaluation). On the other hand, WebGPT was trained primarily to perform long-form question-answering, and so the transfer to the short-form setting is notable.

H Analysis of effect of question stance and reference point bias

In this section we investigate the impact of question “stance” (whether the question implicitly supports or refutes some relevant belief) on the model’s accuracy, and on its tendency to support or refute that belief in its answer. We also probe the model’s bias towards “assuming” a certain cultural reference point with an example culturally dependent question (“What does a wedding look like?”).

H.1 Effect of question stance on factual accuracy and answer stance

We ran a small experiment to investigate the impact of question stance on the model’s answers. Inspired by TruthfulQA [Lin et al., 2021], we chose 10 well-known conspiracy theories and 10 common misconceptions for this experiment. For each conspiracy theory or misconception, we wrote three questions, each taking one of three stances: one expressing skepticism around the implicit belief, one neutral about the implicit belief, and one affirming the implicit belief. This resulted in the 60 questions given in Table 10. We collected answers to these questions for the three compute-efficient WebGPT models (see Section 5.2), and used these answers to look for a couple of different effects.

Factual accuracy. First, we examined whether the stance of the question impacts the model’s factual accuracy. To do this, we labelled each answer as accurate or inaccurate, by fact-checking any central or specific claims in the answer, and labeling the answer as inaccurate if a significant number (more than around 25%) of those claims could not be easily verified. Our results are given in Figure 11. We found suggestive evidence that, across model sizes, questions that affirm an implicit belief in a conspiracy or misconception tend to elicit inaccurate answers from the model more often than questions that are framed in a neutral or skeptical way. While our experiment had too small of a sample size for us to draw definitive conclusions, it demonstrates the model’s potential to misinform users who have erroneous beliefs in ways that reinforce those beliefs.
Answer stance. Second, we studied whether the model mirrors the stance of the question in the content of its response. To do this, we labelled each answer on whether it explicitly refutes the implicit belief or explicitly affirms the implicit belief. Note that in some cases it is possible for an answer to affirm the belief in the conspiracy theory or misconception while remaining factually accurate, by including appropriate caveats. If an answer initially affirms the belief but then reverses its stance, saying for example “but this is a myth”, then we consider it to have refuted the belief. Our results are given in Figure 12. We found that all the models tended to refute the implicit beliefs more often than they affirmed them, and that this effect increased with model size. However, we did not find any clear evidence that the stance of the question has any effect on this behavior.

Given the small scale of this experiment, it would be informative to see further research on the effect of question stance on model answers. We remark that humans exhibit sensitivity to the framing of questions [Chong and Druckman, 2007]. In addition to this, it would be useful to study the effects of various other factors, such as the training data collection methodology, the relative degree of skepticism, neutrality or affirmation in the questions, the relative volumes of skeptical or affirming sources on the web, and whether the questions themselves appear in the training data or on the web.

H.2 Reference point bias

Rather than having a strong stance, some questions may reveal very little information about the user, but the model may nevertheless assume a certain cultural reference point. We refer to this as reference point bias. To probe this phenomenon, we conducted a simple case study, in which we analyzed 64 answers from the WebGPT 175B BC model to the following question: “What does a wedding look like?”.

In response to this question, the model tended to assume a Western, and often specifically an American, point-of-view. Out of the 64 answers, 20 included the word “America” or “American”, and only 4 focused on a specific, named culture other than American: Vietnamese (1); Indian (1); and Croatian

(2). While 8 of 64 responses noted that there is no standard wedding, all but one of these still also included at least one detail typical of a Western and often American wedding. And 2 of the these 8 – including the answer with the highest reward model score – noted that there is no standard or typical American wedding.

Figure 11: Results of experiment on effect of question stance on factual accuracy.

Figure 12: Results of experiment on effect of question stance on answer stance.

The assumption of a Western and often American reference point in this case may be influenced by the data the model has seen during pre-training, by Internet search data, by the viewpoints represented by the contractors we worked with, and by our research team. When asked specifically “What does a Vietnamese wedding look like?”, the model usually generates responses pertaining to Vietnamese weddings, but these come up rarely in response to a generic question. When in doubt, the model defaults to assuming a Western or American viewpoint in this case.

Furthermore, we also noticed that the model often makes other assumptions that exclude or erase some identities in response to this question, for example by assuming that a couple consists of a male groom and female bride and assuming that a bride’s father walks her down an aisle. While our experiment was focused on probing its bias towards a Western or American point of view, we encourage further interdisciplinary research in these and other areas of bias.

Table 10: Questions used to study the effect of question stance on the model’s answers. Each of the top 10 topics refers a well-known conspiracy theory, and each of the bottom 10 topics refers to a common misconception. For each topic, we wrote a question with a skeptical stance, a question with neutral stance, and a question with an affirming stance.

I Predicting rejection sampling performance

It is helpful to be able to predict human preference of answers produced using rejection sampling (best-of-n). To do this, we evaluate answers using a validation reward model (trained on a separate dataset split), to try to account for the original reward model being overoptimized. For large n, the naive Monte Carlo estimator of the expected validation reward model score requires many model samples to produce accurate estimates. Here we describe an alternative estimator, which produces accurate estimates more efficiently.

Let Q be the distribution of questions, and given a question q, let A (q) be the distribution of answers produced by the model. Given a question q and an answer a (with references), let R^train (a q) be the original reward model score, and let R^val a q be the validation reward model score. Let n be the number of answers sampled when rejection sampling (i.e., the n in best-of-n).

To predict the Elo score corresponding to human preference for a given question q, we estimate

To predict the overall Elo score corresponding to human preference, we estimate

As shown in Figure 5, this predicts human preference well for n ≤ 64, although we expect it to overestimate human preference for sufficiently large n, as the validation reward model will eventually become overoptimized.

The simplest way to estimate R^pred (q) for a given question q is with a Monte Carlo estimator,by repeatedly sampling A₁, A₂, . . . , A_n ∼ A (q). However, this is very wasteful, since it takes n answers to produce each estimate, and moreover, answers are not re-used for different values of n. Instead, we sample A₁, A₂, . . . , A_N ∼ A q for some N ≥ n, and compute

which is an unbiased estimator of R^predq by linearity of expectation. This can be computed efficiently by sorting A₁, A₂, . . . , A_N by original reward model score to obtain S₁, S₂, . . . S_N with Rtrain (S q) ≤ ⋅ ⋅ ⋅ ≤ R^train (S q), and then computing

To estimate E_Q_∼_Q [R^pred (Q)] for n = 1, 2, . . . , n , we sample a number of different questions Q ∼ Q, sample A₁, A₂, . . . , A_N ∼ A (Q) for some N ≥ n_max for each question Q, and for each n take the empirical average of the above estimator over the different questions.

J References for example answer and alternative answers

In Table 2, we gave an answer produced by our 175B best-of-64 model to a randomly-chosen question from the ELI5 test set: “Why did we decide that certain words were “bad” and shouldn’t be used in social settings?”. Here we give the full text of the references for that answer, along with an answer produced by one of our human demonstrators, and the reference answer from the ELI5 dataset. Further samples are available at https://openaipublic.blob.core.windows.net/ webgpt-answer-viewer/index.html.

Table 11: Full text of the references found by our model for the answer given in Table 2.

Table 12: An answer and references produced by one of our human demonstrators to the same question as the one in Table 2.

Table 13: The reference answer from the ELI5 dataset for the question given in Table 2.

K Comparison dataset release details

We are releasing a dataset of all comparisons that were marked as suitable for reward modeling by the end of the project, totalling 19,578 comparisons. It can be downloaded from https:// openaipublic.blob.core.windows.net/webgpt-answer-viewer/comparisons.jsonl.

Each item in the dataset is a pair of records, each of which has the following fields:

Question: The text of the question, together with the name of the dataset from which it was taken and a unique ID. The question is the same for each record in the pair.
Quotes: The extracts found during browsing, together with the title of the page on which the extract was found, constructed from the HTML title and domain name of the page.
Answer: The final answer composed using the quotes.Tokens: The prefix that would have been given to the model in the final step of the episode, and the completion given by the model or human. The prefix is made up of the question and the quotes, with some truncation, and the completion is simply the answer. Both are tokenized using the GPT-2 tokenizer. The concatenation of the prefix and completion is the input used for reward modeling.
Score: The strength of the preference for the answer as a number from −1 to 1. The two scores in each pair sum to zero, and an answer is preferred if and only if its score is positive. For reward modeling, we treat scores of 0 as soft 50% labels, and all other scores as hard labels (using only their sign).

WebGPT: Browser-assisted question-answering with human feedback