
Time Travel in LLMs: Tracing Data Contamination in Large Language Models

Shahriar Golchin, Mihai Surdeanu

Department of Computer Science

University of Arizona, Tucson, AZ, USA

{golchin,msurdeanu}@arizona.edu

Aug 16, 2023

Abstract

Data contamination, i.e., the presence of test data from downstream tasks in the training data of large language models (LLMs), is a potential major issue in understanding LLMs’ effectiveness on other tasks. We propose a straightforward yet effective method for identifying data contamination within LLMs. At its core, our approach starts by identifying potential contamination in individual instances that are drawn from a small random sample; using this information, our approach then assesses whether an entire dataset partition is contaminated. To estimate contamination of individual instances, we employ “guided instruction”: a prompt consisting of the dataset name, partition type, and the initial segment of a reference instance, asking the LLM to complete it. An instance is flagged as contaminated if the LLM’s output either exactly or closely matches the latter segment of the reference. To understand if an entire partition is contaminated, we propose two ideas. The first idea marks a dataset partition as contaminated if the average overlap score with the reference instances (as measured by ROUGE or BLEURT) is statistically significantly better with the guided instruction than with a general instruction that does not include the dataset and partition name. The second idea marks a dataset as contaminated if a classifier based on GPT-4 with in-context learning prompting flags multiple instances as contaminated. Our best method achieves an accuracy between 92% and 100% in detecting whether an LLM is contaminated with seven datasets, containing train and test/validation partitions, when contrasted with manual evaluation by a human expert. Further, our findings indicate that GPT-4 is contaminated with the AG News, WNLI, and XSum datasets.

1 Introduction

The rise of Transformer networks (Vaswani et al. 2017) has spurred the development of large language models (LLMs), marking a new epoch in Natural Language Processing (NLP). This shift has led to an extensive range of LLMs (Touvron et al. 2023a;b; Biderman et al. 2023; Köpf et al. 2023; Chung et al. 2022; Penedo et al. 2023, inter alia) which excel in various professional and academic benchmarks (Bang et al. 2023; Bubeck et al. 2023). Their superior performance is primarily attributed to the massive web data consumed by these billion/trillion-parameter LLMs during training. However, the impressive LLM performance observed on many downstream tasks (e.g., summarization, natural language inference, text classification) may be inflated due to data contamination, i.e., the presence of test data from these downstream tasks in the training data of LLMs. Guaranteeing lack of contamination is not trivial due to two potential sources of contamination: directly, from ingesting the official version of a dataset (easier to control), and indirectly, through duplicated data found somewhere on the web (nearly impossible to control). The potential of data contamination is especially relevant for closed models such as the GPT-3/3.5 family (Brown et al. 2020) and GPT-4 (OpenAI 2023; Bubeck et al. 2023), and, needless to say, raises questions on the validity of evaluations conducted so far (Chang et al. 2023; Zhu et al. 2023; Bordt and von Luxburg 2023; Ray 2023; Penedo et al. 2023).

To address this issue, we propose an inexpensive and robust approach to automatically detect data contamination for a given dataset partition. Intuitively, our method starts by identifying potential contamination in individual instances that are drawn from a small random sample of the corresponding dataset partition (we used samples of 10 instances in this work). Using this information, our approach then assesses if an entire dataset partition is contaminated.

More formally, to identify contamination of individual instances we employ a “guided instruction”: an LLM prompt that integrates distinct identifiers from the source dataset from which the reference instance originates. Such information includes the dataset name, its partition (e.g., train, test, or validation), and a randomly selected portion of the reference instance, complemented by its label when relevant. With these signals in the prompt, we instruct the LLM to finish the given partial instance. To understand if an entire dataset partition is contaminated, we propose two heuristics. The first heuristic states that a partition is likely to be contaminated if the average overlap score between generated and reference instances (as measured by ROUGE-L (Lin 2004) or BLEURT (Sellam, Das, and Parikh 2020)) observed with the guided instruction is statistically significantly larger than the one measured with a general instruction, which does not include the dataset and partition name. The second heuristic labels a partition as contaminated if a classifier based on GPT-4 with in-context learning (ICL; Brown et al. (2020)) marks at least one instance as an exact match with the reference or at least two instances as near-exact matches, where near-exact matches indicate instances that exhibit considerable semantic and lexical alignment with the reference instance.

The primary contributions of this paper are as follows:

(1) We propose a novel data contamination detection method for LLMs that is inexpensive and robust. As indicated above, our method combines a “guided instruction” to complete instances randomly drawn from the investigated dataset and several heuristics to generalize from instance- to partition-level contamination decisions.

(2) We evaluate our proposed method in 28 distinct scenarios, using two state-of-the-art LLMs (GPT-4 and GPT-3.5) and seven different datasets employed in classification, summarization, and natural language inference (NLI) tasks. For each dataset, we investigate potential data contamination in both the train and test splits (or validation, if the test partition is not publicly available with its instance labels). Our evaluation indicates that our best method is the one that combines the guided instruction to complete individual instances with the GPT-4 ICL classifier. This method obtains an accuracy that ranges between 92% and 100% for the two language models when compared with dataset contamination labels produced by a human domain expert.

(3) Our analysis indicates that GPT-4 showed evidence of contamination within the test segments of the AG News (Zhang, Zhao, and LeCun 2015), WNLI (Wang et al. 2018), and XSum (Narayan, Cohen, and Lapata 2018) datasets. This finding supports the observation that data contamination is a serious issue that must be considered in downstream evaluations using LLMs.

2 Related Work

Despite its importance, the topic of data contamination is not as thoroughly examined as its closely related field, data memorization (Carlini et al. 2023; Kandpal, Wallace, and Raffel 2022; Carlini et al. 2021; Razeghi et al. 2022). Among the limited investigations focusing specifically on data contamination in LLMs, we find notable examples in Radford et al. (2019) and Brown et al. (2020) on GPT-2 and GPT-3, respectively. They used high-order n-grams (e.g., 13-gram) to detect overlapping content between the training data and the evaluated dataset. Most research subsequent to Brown et al. (2020) adopted similar methods for detecting data contamination (Touvron et al. 2023b; Du et al. 2022; Chowdhery et al. 2022; Wei et al. 2021). However, the scope of existing research has been predominantly confined to model providers, and it encounters specific limitations, particularly when applied to closed-source LLMs. These limitations primarily involve the need for access to pretraining data (Brown et al. 2020; Du et al. 2022; Wei et al. 2021), the requirement for substantial computational resources (Touvron et al. 2023b), or the need for extensive manual labor (Chowdhery et al. 2022). Our approach aims to overcome these hurdles, enabling the assessment of data contamination in scenarios where the training data is either not openly accessible or when significant computational hardware is not available despite having access to the training data.

The work of Sainz et al. (2023) represents the most pertinent effort to tackle data contamination in LLMs. This effort prompted ChatGPT (also known as GPT-3.5-turbo) to generate the initial instances from the different splits of a dataset. The underlying assumption here is that if an LLM can reproduce these instances, it must have been trained using that particular split. However, our research shows that this prompt-sensitive method can often prove unreliable and subject to failure due to the sparsity introduced by the request to reproduce the initial instances of the dataset split. Throughout this paper, we refer to this approach as “Did-ChatGPT-Cheat?,” taking inspiration from the title of the referenced blog post.

In this paper, we propose a novel technique for detecting data contamination in LLMs. In a nutshell, our approach instructs the LLM to produce the second segment of a reference instance when its initial part is presented in the input prompt. We subsequently assess the output for exact or nearly exact matches. To extrapolate contamination across the entire dataset partition, we use evaluation measures such as ROUGE, BLEURT, GPT-4 evaluation, and human judgment. When tested on seven datasets using GPT-4 and GPT-3.5, our method reveals which dataset splits were leaked to each model.

3 Approach

In our approach, we operate under a core assumption that we lack direct access to the training data of the LLMs. Given this premise, our detection strategy for data contamination is anchored by two pivotal insights. First, we examine individual instances within a dataset partition to spot contamination at the instance level. Second, given that LLMs are trained on grand-scale data, detecting contaminated instances can act as a signal of broader contamination. As a result, the associated partition can be labeled as being leaked to the LLM’s training data.

To discern contamination at the instance level, we focus on replicating instances by the LLM. In this context, exact replicas of instances serve as red flags for contamination in the corresponding partition, confirming these instances were part of the LLM’s training data. However, due to the inherent probabilistic behavior of LLMs, achieving perfect replicas is not always possible. Nevertheless, instances that are closely replicated have a twofold function: while they can offer insightful indications of potential contamination, the fact that many datasets draw from web-based sources implies that partial replicas can also arise by happenstance. This overlap introduces uncertainty in drawing definitive conclusions about the underlying partition. Thus, it is essential to check for consistent and significant signs of contamination within the partition.

In the following sections, we first elaborate on our method and the necessary components to compel the LLM into reproducing dataset instances. We then delve into the procedure for evaluating the contamination status of an entire partition based on these instances.

3.1 Detecting Instance-level Contamination

Our procedure identifies contamination at the instance level in four steps. First, we steer the LLM towards the (potentially contaminated) dataset using a guided instruction that integrates the dataset name and the partition of interest. Next, to detect potential data contamination, we provide an initial random segment of an instance and its label if it is available. The LLM is then instructed to complete it (see below for the instructions used). Lastly, we use a series of methods to assess how closely the LLM’s output, under guidance, matches the subsequent portion of the reference instance. An example of a guided versus a general instruction is depicted in Figure 1.

(1) Guided Instruction—A Means to Navigate the LLM’s Domain. By employing instruction tuning on top of causal language modeling (CLM; Vaswani et al. (2017); Radford et al. (2018)), LLMs can be guided by human directives (Wei et al. 2022; Sanh et al. 2022; Chung et al. 2022). This serves as a tool for controlling the LLM’s domain using natural language. Thus, we form a guided instruction that incorporates the dataset and split names in the prompt, thereby directing the LLM towards the underlying dataset. A comprehensive list of all the instructions used in this study for different tasks/datasets can be found in Table 3 in Appendix A.
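As a rough illustration of this prompt structure, the sketch below assembles a guided and a general instruction for a paired-instance (NLI-style) example. The exact wording is given in Table 3 (Appendix A), so the template strings here are illustrative assumptions rather than the verbatim instructions.

```python
# Illustrative prompt templates for guided vs. general instructions.
# NOTE: the verbatim instructions appear in Table 3 (Appendix A); these
# strings are assumptions used only to show the structure of the prompts.

GUIDED_TEMPLATE = (
    "You are provided with Sentence 1 from the {split_name} split of the "
    "{dataset_name} dataset. Finish Sentence 2 as it appears in the dataset.\n"
    "Sentence 1: {input}\n"
    "Label: {label}\n"
    "Sentence 2:"
)

GENERAL_TEMPLATE = (
    "Finish Sentence 2 based on Sentence 1, such that the following label "
    "describes the relationship between the two sentences.\n"
    "Sentence 1: {input}\n"
    "Label: {label}\n"
    "Sentence 2:"
)

def build_prompts(dataset_name: str, split_name: str, first_part: str, label: str):
    """Return (guided, general) prompts for one partially shown instance."""
    guided = GUIDED_TEMPLATE.format(
        dataset_name=dataset_name, split_name=split_name,
        input=first_part, label=label,
    )
    general = GENERAL_TEMPLATE.format(input=first_part, label=label)
    return guided, general
```

The only difference between the two prompts is the presence of the dataset and split identifiers, which is what later allows the comparison in Algorithm 1.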

(2) Unraveling Data History with Causal Language Modeling. Primarily, data contamination occurs during the CLM phase since it constitutes the largest part of training in LLMs and utilizes web data. Without instruction tuning, an LLM only attempts to complete an input prompt based on data seen during the CLM training phase (Ouyang et al. 2022). Notable models that exhibit this behavior include GPT-2, GPT-3, and the LLaMA models (Touvron et al. 2023a;b). We, therefore, employ the next-token prediction mechanism to trace data history. In particular, we feed the model the initial segment of a random dataset instance from a particular split, prompting it to finish the instance. For labeled instances, we integrate the corresponding labels in the input prompt. This reflects that if an instance was ingested during the LLM’s training phase, its label was ingested too.

For paired-instance datasets, we present the model with the initial sentence and its corresponding label. In the case of single-instance datasets, instances with multiple sentences are arbitrarily cut at the end of a complete sentence, whereas for instances containing a single (long) sentence, a random sentence fragment is eliminated. Finally, the LLM is tasked with finishing the provided initial part. Figure 1 shows this process for a paired-instance dataset.
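As a rough sketch of this truncation step (the exact splitting heuristics are not pinned down beyond the description above, so the sentence splitter and cut points below are assumptions):

```python
import random
import re

def split_instance(text: str, rng: random.Random):
    """Split one single-instance example into (shown_part, hidden_reference).

    Multi-sentence instances are cut at the end of a randomly chosen complete
    sentence; single-sentence instances lose a random trailing fragment.
    Assumes reasonably long instances; the regex splitter is a simplification.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    if len(sentences) > 1:
        cut = rng.randint(1, len(sentences) - 1)       # keep at least one sentence
        return " ".join(sentences[:cut]), " ".join(sentences[cut:])
    words = text.split()
    cut = rng.randint(max(1, len(words) // 2), max(1, len(words) - 1))
    return " ".join(words[:cut]), " ".join(words[cut:])  # drop a trailing fragment
```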

Therefore, once the LLM is prompted with the guided instruction, its output is expected to mirror the subsequent segment of the reference instance, resulting in either an exact, near-exact, or inexact match. The criteria we use to categorize these matches are detailed in step 4.

(3) General Instruction—An Alternate Facet of Causal Language Modeling. We also formulate a general instruction to measure the impact of the guidance given in the guided instruction. This general instruction only requests the completion of the partial input instance without specifying the dataset or its partition. As a result, when using this general instruction, the generated sequence solely relies on the CLM training phase, akin to autoregressive models without instruction tuning. This enables us to establish a baseline for randomly generated replicas and assess how much the guided instruction influences the LLM-generated part of the input instance. We assess this influence in terms of overlap, semantics, and structural similarity with the reference instance. This analysis is crucial because even when the output of the LLM does not perfectly match the reference instance, it still enables us to detect potential signs of contamination.

(4) Measuring Instance-level Contamination.

BLEURT & ROUGE: To quantify the overlap between a generated and a reference instance, we employ ROUGE (Lin 2004) and BLEURT (Sellam, Das, and Parikh 2020) scores. While ROUGE assesses lexical similarity, BLEURT gauges the semantic relevance and fluency of the resulting sequence with respect to the reference instance.
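A minimal sketch of this scoring step is shown below; it uses the rouge-score package for ROUGE-L and the Hugging Face evaluate wrapper for BLEURT. Loading the BLEURT-20 checkpoint by name this way is an assumption of the sketch, not necessarily how the scores were computed for the paper.

```python
from rouge_score import rouge_scorer   # pip install rouge-score
import evaluate                         # pip install evaluate

# ROUGE-L: lexical overlap between the generated and reference continuations.
_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

# BLEURT: learned metric for semantic similarity and fluency.
# Passing "BLEURT-20" as the config name here is an assumption of this sketch.
_bleurt = evaluate.load("bleurt", "BLEURT-20")

def overlap_scores(generated: str, reference: str) -> dict:
    """Return ROUGE-L F1 and BLEURT for one generated/reference pair."""
    rouge_l = _rouge.score(reference, generated)["rougeL"].fmeasure
    bleurt = _bleurt.compute(predictions=[generated], references=[reference])["scores"][0]
    return {"rougeL": rouge_l, "bleurt": bleurt}
```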

GPT-4 Evaluation: While both BLEURT and ROUGE quantify the overlap between the generated and reference instances, they fall short of pinpointing near-exact matches. To bridge this gap, we adopt ICL prompting (Brown et al. 2020) to dictate the detection of exact/near-exact matches based on human judgment. Specifically, this method entails presenting a few representative examples of exact and near-exact matches—sourced from human evaluation—within the input prompt and then proceeding to assess all other instances. We choose GPT-4 for this task as it requires no specialized prompting technique (Bubeck et al. 2023), enhancing the reliability of its results. A visual representation of the ICL prompt used in our study can be seen in Figure 2 (Appendix C). Also, detailed examples, including their ROUGE and BLEURT scores accompanied by both human and GPT-4 evaluations, are provided in Table 4 (Appendix D).
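This ICL evaluation step could be wired up roughly as follows. The few-shot examples and their wording are placeholders rather than the actual prompt from Figure 2, and the call uses the current OpenAI Python client; treat the whole block as a sketch under those assumptions.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Few-shot examples of exact / near-exact / inexact matches, pre-labeled by a human.
# These examples are placeholders, not the ones shown in Figure 2.
FEW_SHOT = """Example 1:
Reference: The company reported a sharp rise in quarterly profits.
Candidate: The company reported a sharp rise in quarterly profits.
Label: exact match

Example 2:
Reference: The storm forced hundreds of residents to leave their homes.
Candidate: The storm forced hundreds of people to evacuate their homes.
Label: near-exact match

Example 3:
Reference: Scientists discovered a new species of frog in the rainforest.
Candidate: The weather was sunny for most of the weekend.
Label: inexact match
"""

def classify_match(reference: str, candidate: str, model: str = "gpt-4-0613") -> str:
    """Ask GPT-4 to label a candidate as exact / near-exact / inexact match."""
    prompt = (
        f"{FEW_SHOT}\nExample 4:\nReference: {reference}\n"
        f"Candidate: {candidate}\nLabel:"
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judgments
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```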

3.2 Detecting Partition-level Contamination

To generalize from instance-level contamination scores to partition-level discrete decisions (i.e., the partition is/is not contaminated), we take advantage of two observations:

Idea 1: A dataset is likely to be contaminated if the average overlap score (as measured by ROUGE or BLEURT) observed with the guided instruction is significantly larger than the one measured with the general instruction. The motivation behind this idea is that, since the only difference between the two instructions is that the guided instruction contains the dataset and partition names, the improvement can only be explained through contamination.

Idea 2: A dataset is likely to be contaminated if GPT-4 detects at least one exact match or at least two near-exact matches. The intuition behind this idea is that even a small contaminated part of the sample of instances is likely to be indicative of a larger dataset leak.

We propose two algorithms, each implementing one of these ideas.

Algorithm 1: A dataset partition is labeled as contaminated if the overlap score (as provided by BLEURT or ROUGE) with the guided instruction on a sample of k = 10 instances is significantly better than the score with the general instruction under a non-parametric bootstrap resampling test.

Figure 1: An example of a guided (left) and general (right) instruction employed for a paired-instance dataset. In this example, using GPT-4, the guided instruction results in an exact match, whereas the general instruction does not.

The advantage of this algorithm is that it is non-parametric, i.e., we do not need to decide on an arbitrary threshold on the ROUGE or BLEURT scores to indicate contamination. However, its drawback is that even a significant increase in overlap may still come from generated instances that a human would not consider an exact or near-exact match. Algorithm 2 addresses this limitation.

Algorithm 2: A dataset partition is labeled as contaminated if GPT-4 labels at least one instance as an exact match or at least two instances as near-exact matches in a sample of k = 10 instances. All the instances in this sample are generated using the guided instruction.
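Algorithm 2’s partition-level decision is a simple counting rule over the instance-level labels; a minimal sketch, assuming the labels come from a classifier like the one sketched earlier:

```python
def partition_contaminated(instance_labels: list[str]) -> bool:
    """Algorithm 2: flag a partition as contaminated if the sample of k = 10
    guided completions contains >= 1 exact match or >= 2 near-exact matches."""
    exact = sum(label == "exact match" for label in instance_labels)
    near = sum(label == "near-exact match" for label in instance_labels)
    return exact >= 1 or near >= 2
```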

We evaluate both these algorithms next.

4    Experimental Setup

Data: Our study employs seven datasets derived from various tasks, namely classification, summarization, and NLI. These datasets are IMDB (Maas et al. 2011), AG News (Zhang, Zhao, and LeCun 2015), Yelp Full Reviews (Zhang, Zhao, and LeCun 2015), SAMSum (Gliwa et al. 2019), XSum (Narayan, Cohen, and Lapata 2018), WNLI (Wang et al. 2018), and RTE (Wang et al. 2019). To ensure a comprehensive experimental setup, all our experiments are carried out on both the training and test/validation splits of the aforesaid datasets. We make use of the publicly available divisions, working with the training and test splits for each. However, for the last two datasets, only the validation splits were publicly accessible with their labels. Considering our research’s emphasis on pinpointing data contamination with minimal dataset instances, the resource constraints, and our intention to facilitate the replication of this approach by other researchers, we randomly chose 10 instances from each split for our experiments.
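A sketch of this sampling step using the Hugging Face datasets library is shown below; the hub identifiers are the commonly used ones for these corpora and are an assumption of the sketch rather than something specified in the paper.

```python
import random
from typing import Optional
from datasets import load_dataset  # pip install datasets

def sample_instances(hub_id: str, split: str, k: int = 10, seed: int = 0,
                     config: Optional[str] = None):
    """Randomly draw k instances from one split of a dataset."""
    ds = load_dataset(hub_id, config, split=split) if config else load_dataset(hub_id, split=split)
    rng = random.Random(seed)
    return [ds[i] for i in rng.sample(range(len(ds)), k)]

# e.g., 10 instances from the AG News test split and the WNLI validation split
ag_news_sample = sample_instances("ag_news", split="test")
wnli_sample = sample_instances("glue", split="validation", config="wnli")
```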

Setting: We use snapshots of GPT-3.5 and GPT-4 from June 13, 2023—specifically gpt-3.5-turbo-0613 and gpt-4-0613—both accessed via the OpenAI API, as our foundational LLMs. To obtain deterministic results, we set the temperature to zero and capped the maximum completion length at 500 tokens. In contrast, the comparative method uses the chat user interface (UI), which we also leveraged when conducting the experiments for that method. Specifically, we used the UI versions of GPT-4 and GPT-3.5 that were released on July 20, 2023.
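Under these settings, generating a guided or general completion reduces to a single chat-completion call; a minimal sketch with the decoding parameters described above, assuming the prompt comes from a template such as the one sketched in Section 3.1:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def complete_instance(prompt: str, model: str = "gpt-4-0613") -> str:
    """Generate the continuation of a partial instance with deterministic decoding."""
    response = client.chat.completions.create(
        model=model,          # or "gpt-3.5-turbo-0613"
        temperature=0,        # deterministic results
        max_tokens=500,       # cap on the completion length
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```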

Human Evaluation: We undertake a human evaluation, led by a domain expert, to characterize contamination by identifying both exact matches and near-exact matches of individual instances. The term “exact matches” is self-explanatory; “near-exact matches” are completions by the LLM that, while not identical, show considerable overlap and maintain significant semantic and structural similarity to the reference instance. To generalize from individual instances to entire partitions, the human expert applied the following rule: a partition is flagged as contaminated if the instance-based evaluation identified at least one exact match or at least two near-exact matches.

Evaluation Metrics: In our analysis, the computation of the BLEURT score varies based on the structure of the dataset/instance, as this metric hinges on the fluency and quality of the generated sequence. For single-instance datasets, where individual instances are randomly cut off mid-sentence and then completed by the LLM, we join the model-produced continuation to the severed reference instance and then calculate the BLEURT score. Conversely, for instances from paired-instance and multi-sentence single-instance datasets, the BLEURT score is computed solely for the newly produced sequence. We highlight that our BLEURT score computations use the most recent checkpoint provided, i.e., BLEURT-20 (Sellam, Das, and Parikh 2020; Pu et al. 2021; Devlin et al. 2019). On the other hand, regardless of the dataset/instance type, the ROUGE score calculation exclusively pertains to the portions of the text finished by the LLM. This is due to the score’s dependency on statistical attributes rather than semantic consistency.
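The dataset-dependent choice of BLEURT inputs described above can be captured by a small helper; the boolean flag below is an assumption used only to illustrate the two cases, not part of the original setup.

```python
def bleurt_inputs(shown_part: str, generated: str, reference_rest: str,
                  cut_mid_sentence: bool):
    """Pick (candidate, reference) strings for BLEURT depending on instance type.

    For instances cut mid-sentence, the generated continuation is joined to the
    shown fragment and compared against the full reference sentence; otherwise
    only the newly generated sequence is scored against the reference continuation.
    """
    if cut_mid_sentence:
        candidate = f"{shown_part} {generated}".strip()
        reference = f"{shown_part} {reference_rest}".strip()
    else:
        candidate = generated
        reference = reference_rest
    return candidate, reference
```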

Comparative Framework: The most relevant approach to detect data contamination in LLMs is Did-ChatGPT-Cheat? (Sainz et al. 2023), against which we contrast our proposed method. Unlike our method, which employs a binary scale to determine dataset contamination, the comparison approach includes a “suspicious” category. This designation is invoked when the LLM, upon being asked to generate the first instances of a specific partition in the dataset, outputs characteristic attributes such as data format, IDs, or other dataset-specific details instead of the actual initial instances. If the model, on the other hand, fails to produce these characteristics, it is deemed “not contaminated.”

Table 1: Overall accuracy at detecting contamination across 14 dataset partitions based on human judgments. The two proposed algorithms are evaluated for GPT-4 (top half) and GPT-3.5 (bottom half).

5 Results and Discussion

Table 1 lists the overall accuracy of the proposed algorithms in 28 distinct settings: two language models (GPT-4 and GPT-3.5) × 14 dataset partitions coming from seven datasets. Table 2 provides a detailed breakdown of each algorithm per dataset partition and language model. Within this table, partition-level contamination is represented in the following ways: a single tick (✓) signifies the presence of at least one exact match, whereas a double tick (✓✓) indicates the detection of at least two near-exact matches. In contrast, the cross sign (×) denotes that neither of the aforementioned conditions was met. In the context of the Did-ChatGPT-Cheat? method, this cross sign indicates that the model’s output does not contain any specific information about the initial instances of the dataset partition upon the request to generate them. For the same method, the question mark (?) highlights partitions that are deemed suspicious. We draw the following observations from these experiments:

(1) Comparing the output of the guided instruction against that produced by the general instruction (Algorithm 1) proves to be an effective strategy for identifying data contamination in LLMs. The effectiveness of this strategy is highlighted by its ability to pinpoint either exact or near-exact matches in half of the evaluated settings (7 out of 14) when using GPT-4 as the underlying model. Moreover, in cases where the partition remains uncontaminated, the consistency between average BLEURT and ROUGE-L scores—drawn from both general and guided instructions—substantiates our assertion. This consistency suggests that establishing a baseline using random matches from the general instruction is valid. Additionally, these outcomes further confirm that guided instruction effectively steers the LLM’s domain towards the respective dataset, achieving notably higher average BLEURT and ROUGE-L scores when the dataset partition exhibits contamination.

Further, we note that in the context of GPT-4, Algorithm 1 using the BLEURT score successfully labels data contamination in 11/14 settings, whereas ROUGE-L does so in 13/14 settings. However, when applied to GPT-3.5, Algorithm 1 struggles to differentiate between contaminated and clean partitions: BLEURT does so in 9/14 cases, while ROUGE-L manages in only 7/14 scenarios.

(2) Using Algorithm 2 and assessments with GPT-4 through ICL prompting, we find that this approach aligns closely with human evaluations. Specifically, in experiments run on GPT-4 and GPT-3.5, the match rates are 14/14 and 13/14, respectively. These accuracies are higher than any produced by Algorithm 1 and maintain consistency across the two language models.

(3) Upon assessing the results of our comparative approach, Did-ChatGPT-Cheat?, we discover that this method invariably labels partitions as “suspicious” for all scenarios involving GPT-4. This label provides no insight into the potential contamination of a split; hence, no division received the correct classification based on human evaluation (0/14). For GPT-3.5, based on the defined criteria for uncontaminated partitions, this method detects 11/13 settings correctly. This result supports our observation that detecting contamination at the instance level (before generalizing to the partition level) is a more robust strategy.

(4) Last but not least, the human evaluation reveals that the train and test/validation splits of both the AG News and WNLI datasets were included in GPT-4’s training data. However, for IMDB and RTE, only the training portions were incorporated, while for XSum, only the test split was leaked. For GPT-3.5, the only data exposure was the test partition of the XSum dataset. This finding confirms that, despite their creators’ efforts, today’s LLMs have ingested NLP datasets. We hope that this observation will inform the design of better scientific experiments in the NLP space in the future.

6    Conclusion

We proposed a methodology to detect data contamination in LLMs, assuming the lack of access to their training data. Our approach began by pinpointing data contamination at the instance level. This was achieved by prompting the LLM to produce a replica of the second segment of a dataset instance given its initial random segment, the dataset name, and the partition type, a process we called “guided instruction.” From here, we adopted a set of rules to generalize from instance-level to broader partition-level contamination. This involved leveraging statistically significant differences in BLEURT and/or ROUGE-L scores between guided and general instructions, as well as evaluations from GPT-4 using in-context learning prompting.

Table 2: An assessment of our proposed method in contrast to the comparative method, with the evaluations rooted in BLEURT and ROUGE-L scores, along with the decisions from GPT-4 and human judgments. The evaluations are performed on 10 instances randomly drawn from each split of a particular dataset, with GPT-4 (top half) and GPT-3.5 (bottom half) serving as the LLMs under investigation. Asterisks indicate statistically significant differences between the guided and general instructions.

Our evaluation spanned 28 different settings, covering seven datasets along with their respective train and test/validation partitions for both GPT-4 and GPT-3.5. Our findings indicated that while the replication technique via guided instruction is notably effective, the most accurate approach, and the one most closely aligned with human judgments for detecting data contamination, was the few-shot in-context learning prompt with GPT-4, which integrates a few exemplary instances from the human assessment into the input prompt. This method succeeded in pinpointing data contamination in 14/14 scenarios for GPT-4 and 13/14 for GPT-3.5.

A    A Complete List of All Guided and General Instructions

Table 3 presents a thorough collection of all the guided and general instructions employed throughout our study. These instructions utilize placeholders, each of which is replaced with pertinent information as follows: {dataset name} is substituted with the name of the dataset; {split name} is replaced with the title of the split; {input} stands in for the preliminary part of the dataset instance, cut randomly at the tail, or, in the case of NLI-based datasets, the whole initial sentence; and {label} is exchanged with the label corresponding to the incomplete input instance.

Table 3: A comprehensive list of all guided and general instructions used in our experiments.

B Statistical Analysis: Bootstrap Resampling

We examine the statistical significance of results stemming from guided versus general instructions. A bootstrap resampling technique, involving 10,000 samples in the resampling process, is employed for this investigation (Efron 1979; Efron and Tibshirani 1993; Efron 2003). We concentrate on the alternative hypothesis that guided instructions produce outcomes closer to the reference instances than those generated from general instructions, as evaluated by fluency, quality, and similarity. The performance metrics utilized here are the BLEURT and ROUGE-L scores. We regard the ROUGE-L and BLEURT scores as statistically significant if the p-values are ≤ 0.05. We highlight the statistically significant results by marking them with an asterisk in Table 2.
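A minimal sketch of this one-sided paired bootstrap test is given below; resampling per-instance score differences under a mean-zero null is one standard reading of the description above, not necessarily the exact implementation used for the paper.

```python
import random

def bootstrap_p_value(guided_scores, general_scores, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap test.

    H1: the mean score under guided instructions exceeds the mean score under
    general instructions. Returns the estimated p-value.
    """
    assert len(guided_scores) == len(general_scores)
    rng = random.Random(seed)
    diffs = [g - b for g, b in zip(guided_scores, general_scores)]
    observed = sum(diffs) / len(diffs)
    # Center the differences at zero to simulate the null of no improvement.
    centered = [d - observed for d in diffs]
    exceed = 0
    for _ in range(n_resamples):
        sample = [rng.choice(centered) for _ in diffs]
        if sum(sample) / len(sample) >= observed:
            exceed += 1
    return exceed / n_resamples

# Example usage (hypothetical score lists for one partition):
# contaminated = bootstrap_p_value(guided_rougeL, general_rougeL) <= 0.05
```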

C Few-shot In-context Learning Prompt

Figure 2 showcases the few-shot ICL prompt employed to evaluate the model-generated candidate against the reference text using GPT-4. Within this prompt, we present GPT-4 with one exact match and four exemplary instances of near-exact matches, all pre-labeled by human evaluation. These examples guide GPT-4 in discerning the difference between near-exact and inexact matches, in line with human assessment.

D    Illustrations of Exact, Near-exact, and Inexact Matches

Displayed in Table 4 are examples of exact, near-exact, and inexact replicas of the reference instance when guided instruction and GPT-4 are used. This table also includes computed metrics such as ROUGE-L, BLEURT, and results from human and GPT-4 evaluations. In addition, Table 5 showcases comparative outcomes for the same examples using general instruction.

E Detailed Description of Datasets

IMDB Movie Reviews Dataset. The IMDB Movie Reviews dataset is a balanced corpus of 50,000 movie reviews used for sentiment analysis tasks. It is split evenly into 25,000 training and 25,000 testing reviews, each further balanced for positive and negative sentiments. In this dataset, positive reviews are identified by a score of 7 or more out of 10, while negative reviews are denoted by a score of 4 or below out of 10.

AG News Dataset. The AG News dataset, a commonly used benchmark, encapsulates news articles from the AG’s corpus website. It is neatly divided into four categorical classes, namely world, sports, business, and science/technology. The dataset contains 496,835 categorized news articles from 2,000 news sources. For each class, the AG News dataset furnishes 30,000 training and 1,900 test samples.

Yelp Dataset. The dataset is sourced from the Yelp Dataset Challenge conducted in 2015, containing a massive number of 1,569,264 samples, all of which include review texts. This dataset is the foundation for two distinct classification tasks. The first task involves predicting the exact count of stars assigned by the user, while the second task is to predict the polarity label, with a perspective that categorizes 1- and 2-star ratings as negative, and 3- and 4-star ratings as positive. For the full-scale star rating prediction, the dataset includes 130,000 training samples and 10,000 testing samples for each star category. Similarly, the polarity-based dataset comprises 280,000 training samples along with 19,000 test samples, distributed among each polarity category.

Recognizing Textual Entailment (RTE) Dataset. The Recognizing Textual Entailment (RTE) dataset originates from a succession of annual textual entailment challenges. These datasets were combined by the authors of the benchmark using data from four different editions: RTE1 (Dagan, Glickman, and Magnini 2005), RTE2 (Haim et al. 2006), RTE3 (Giampiccolo et al. 2007), and RTE5 (Bentivogli et al. 2009). The examples within these datasets were primarily formulated using text from news and Wikipedia sources. To maintain consistency, all these datasets were adapted into a two-class split. For those datasets that initially consisted of three classes, the categories of “neutral” and “contradiction” were combined to form a single class termed “not entailment”. The combined RTE dataset has 2,490 examples for training, 277 examples for validation, and 3,000 examples for testing.

Winograd Natural Language Inference (WNLI) Dataset. The WNLI (Winograd Natural Language Inference) dataset is a benchmark for natural language understanding tasks, particularly for evaluating coreference resolution and pronoun disambiguation in context. The dataset is derived from the original Winograd Schema Challenge (Levesque, Davis, and Morgenstern 2012) and contains sentence pairs where a pronoun needs to be resolved by determining whether it refers to the same entity as in the preceding sentence. While the dataset has a training set balanced between two classes, the test set is imbalanced, with 635 training examples, 146 testing examples, and 71 validation examples.

SAMSum Dataset. The SAMSum dataset, compiled by the Samsung R&D Institute in Poland, comprises around 16,000 English messenger-style conversations with summaries. These dialogues, created by linguists, reflect a variety of styles, registers, and topics similar to real-life messenger interactions. Each conversation is annotated with a third-person summary and categorized based on the number of utterances, ranging from 3 to 30. The dataset primarily consists of two-person dialogues.

Extreme Summarization (XSum) Dataset. The Extreme Summarization (XSum) dataset serves as an evaluation dataset for abstractive single-document summarization systems. Its objective is to generate a concise one-sentence summary that answers the question, “What is the article about?”. The dataset comprises 226,711 news articles, each accompanied by a one-sentence summary. These articles were collected from BBC articles spanning the years 2010 to 2017 and cover a wide range of domains, including news, politics, sports, weather, business, technology, science, health, family, education, entertainment, and arts. The official random split allocates 90% (204,045 documents) for training, 5% (11,332 documents) for validation, and 5% (11,334 documents) for the test set, respectively.

References

Bang, Y.; Cahyawijaya, S.; Lee, N.; Dai, W.; Su, D.; Wilie, B.; Lovenia, H.; Ji, Z.; Yu, T.; Chung, W.; Do, Q. V.; Xu, Y.; and Fung, P. 2023. A Multitask, Multilingual, Multi-modal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. arXiv:2302.04023.

Table 4: Examples of exact, near-exact, and inexact matches along with their respective BLEURT and ROUGE-L scores, and judgments from GPT-4 ICL and human evaluations. These examples are generated by GPT-4 as the underlying language model.

Table 5: Completions generated by GPT-4 under general instruction for examples shown in Table 4.

Figure 2: A display of the few-shot ICL prompt utilized for instance-level data contamination detection using GPT-4. In this illustration, examples 1 through 4 remain constant within the prompt, while example 5 is updated with a new input reference and candidate for evaluation, depending on whether there is an exact, nearly exact, or inexact match.

Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; and Lin, H., eds., Advances in Neural Information Processing Systems, volume 33, 1877–1901. Curran Associates, Inc.

Bubeck, S.; Chandrasekaran, V.; Eldan, R.; Gehrke, J.; Horvitz, E.; Kamar, E.; Lee, P.; Lee, Y. T.; Li, Y.; Lundberg, S.; Nori, H.; Palangi, H.; Ribeiro, M. T.; and Zhang, Y. 2023. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv:2303.12712.

Carlini, N.; Ippolito, D.; Jagielski, M.; Lee, K.; Tramer, F.; and Zhang, C. 2023. Quantifying Memorization Across Neural Language Models. arXiv:2202.07646.

Carlini, N.; Tramer, F.; Wallace, E.; Jagielski, M.; Herbert-Voss, A.; Lee, K.; Roberts, A.; Brown, T.; Song, D.; Erlingsson, U.; Oprea, A.; and Raffel, C. 2021. Extracting Training Data from Large Language Models. arXiv:2012.07805.

Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Zhu, K.; Chen, H.; Yang, L.; Yi, X.; Wang, C.; Wang, Y.; Ye, W.; Zhang, Y.; Chang, Y.; Yu, P. S.; Yang, Q.; and Xie, X. 2023. A Survey on Evaluation of Large Language Models. arXiv:2307.03109.

Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H. W.; Sutton, C.; Gehrmann, S.; et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

Chung, H. W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; Webson, A.; Gu, S. S.; Dai, Z.; Suzgun, M.; Chen, X.; Chowdhery, A.; Castro-Ros, A.; Pellat, M.; Robinson, K.; Valter, D.; Narang, S.; Mishra, G.; Yu, A.; Zhao, V.; Huang, Y.; Dai, A.; Yu, H.; Petrov, S.; Chi, E. H.; Dean, J.; Devlin, J.; Roberts, A.; Zhou, D.; Le, Q. V.; and Wei, J. 2022. Scaling Instruction-Finetuned Language Models. arXiv:2210.11416.

Dagan, I.; Glickman, O.; and Magnini, B. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, 177–190. Springer.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.

Du, N.; Huang, Y.; Dai, A. M.; Tong, S.; Lepikhin, D.; Xu, Y.; Krikun, M.; Zhou, Y.; Yu, A. W.; Firat, O.; et al. 2022. GLaM: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, 5547–5569. PMLR.

Efron, B. 1979. Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics, 7(1): 1 – 26.

Efron, B. 2003. Second Thoughts on the Bootstrap. Statistical Science, 18(2): 135 – 140.

Efron, B.; and Tibshirani, R. J. 1993. An Introduction to the Bootstrap. Number 57 in Monographs on Statistics and Applied Probability. Boca Raton, Florida, USA: Chapman & Hall/CRC.

Giampiccolo, D.; Magnini, B.; Dagan, I.; and Dolan, B. 2007. The Third PASCAL Recognizing Textual Entailment Challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, 1–9. Prague: Association for Computational Linguistics.

Gliwa, B.; Mochol, I.; Biesek, M.; and Wawer, A. 2019. SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, 70–79. Hong Kong, China: Association for Computational Linguistics.

Haim, R. B.; Dagan, I.; Dolan, B.; Ferro, L.; Giampiccolo, D.; Magnini, B.; and Szpektor, I. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, volume 7, 785–794.

Kandpal, N.; Wallace, E.; and Raffel, C. 2022. Deduplicating Training Data Mitigates Privacy Risks in Language Models. arXiv:2202.06539.

Köpf, A.; Kilcher, Y.; von Rütte, D.; Anagnostidis, S.; Tam, Z.-R.; Stevens, K.; Barhoum, A.; Duc, N. M.; Stanley, O.; Nagyfi, R.; ES, S.; Suri, S.; Glushkov, D.; Dantuluri, A.; Maguire, A.; Schuhmann, C.; Nguyen, H.; and Mattick, A. 2023. OpenAssistant Conversations – Democratizing Large Language Model Alignment. arXiv:2304.07327.

Levesque, H.; Davis, E.; and Morgenstern, L. 2012. The winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning.

Lin, C.-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, 74–81. Barcelona, Spain: Association for Computational Linguistics.

Maas, A. L.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y.; and Potts, C. 2011. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 142–150. Portland, Oregon, USA: Association for Computational Linguistics.

Narayan, S.; Cohen, S. B.; and Lapata, M. 2018. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. ArXiv, abs/1808.08745.

OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774.

Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C. L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; Schulman, J.; Hilton, J.; Kelton, F.; Miller, L.; Simens, M.; Askell, A.; Welinder, P.; Christiano, P.; Leike, J.; and Lowe, R. 2022. Training language models to follow instructions with human feedback. arXiv:2203.02155.

Penedo, G.; Malartic, Q.; Hesslow, D.; Cojocaru, R.; Cappelli, A.; Alobeidli, H.; Pannier, B.; Almazrouei, E.; and Launay, J. 2023. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.

Pu, A.; Chung, H. W.; Parikh, A. P.; Gehrmann, S.; and Sellam, T. 2021. Learning compact metrics for MT. In Proceedings of EMNLP.

Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I.; et al. 2018. Improving language understanding by generative pre-training.

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I.; et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8): 9.

Ray, P. P. 2023. ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems.

Razeghi, Y.; Logan IV, R. L.; Gardner, M.; and Singh, S. 2022. Impact of Pretraining Term Frequencies on Few-Shot Reasoning. arXiv:2202.07206.

Sainz, O.; Campos, J. A.; García-Ferrero, I.; Etxaniz, J.; and Agirre, E. 2023. Did ChatGPT Cheat on Your Test? https://hitz-zentroa.github.io/lm-contamination/blog/. Accessed: 2023-07-06.

Sanh, V.; Webson, A.; Raffel, C.; Bach, S. H.; Sutawika, L.; Alyafeai, Z.; Chaffin, A.; Stiegler, A.; Scao, T. L.; Raja, A.; Dey, M.; Bari, M. S.; Xu, C.; Thakker, U.; Sharma, S. S.; Szczechla, E.; Kim, T.; Chhablani, G.; Nayak, N.; Datta, D.; Chang, J.; Jiang, M. T.-J.; Wang, H.; Manica, M.; Shen, S.; Yong, Z. X.; Pandey, H.; Bawden, R.; Wang, T.; Neeraj, T.; Rozen, J.; Sharma, A.; Santilli, A.; Fevry, T.; Fries, J. A.; Teehan, R.; Bers, T.; Biderman, S.; Gao, L.; Wolf, T.; and Rush, A. M. 2022. Multitask Prompted Training Enables Zero-Shot Task Generalization. arXiv:2110.08207.

Sellam, T.; Das, D.; and Parikh, A. 2020. BLEURT: Learning Robust Metrics for Text Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7881–7892. Online: Association for Computational Linguistics.

Tjong Kim Sang, E. F.; and De Meulder, F. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 142–147.

Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; Rodriguez, A.; Joulin, A.; Grave, E.; and Lample, G. 2023a. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.

Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; Bikel, D.; Blecher, L.; Ferrer, C. C.; Chen, M.; Cucurull, G.; Esiobu, D.; Fernandes, J.; Fu, J.; Fu, W.; Fuller, B.; Gao, C.; Goswami, V.; Goyal, N.; Hartshorn, A.; Hosseini, S.; Hou, R.; Inan, H.; Kardas, M.; Kerkez, V.; Khabsa, M.; Kloumann, I.; Korenev, A.; Koura, P. S.; Lachaux, M.-A.; Lavril, T.; Lee, J.; Liskovich, D.; Lu, Y.; Mao, Y.; Martinet, X.; Mihaylov, T.; Mishra, P.; Molybog, I.; Nie, Y.; Poulton, A.; Reizenstein, J.; Rungta, R.; Saladi, K.; Schelten, A.; Silva, R.; Smith, E. M.; Subramanian, R.; Tan, X. E.; Tang, B.; Taylor, R.; Williams, A.; Kuan, J. X.; Xu, P.; Yan, Z.; Zarov, I.; Zhang, Y.; Fan, A.; Kambadur, M.; Narang, S.; Rodriguez, A.; Stojnic, R.; Edunov, S.; and Scialom, T. 2023b. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention Is All You Need. arXiv:1706.03762.

Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2019. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. arXiv preprint arXiv:1905.00537.

Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 353–355. Brussels, Belgium: Association for Computational Linguistics.

Wei, J.; Bosma, M.; Zhao, V. Y.; Guu, K.; Yu, A. W.; Lester, B.; Du, N.; Dai, A. M.; and Le, Q. V. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.

Wei, J.; Bosma, M.; Zhao, V. Y.; Guu, K.; Yu, A. W.; Lester, B.; Du, N.; Dai, A. M.; and Le, Q. V. 2022. Finetuned Language Models Are Zero-Shot Learners. arXiv:2109.01652.

Weischedel, R.; Palmer, M.; Marcus, M.; Hovy, E.; Pradhan, S.; Ramshaw, L.; Xue, N.; Taylor, A.; Kaufman, J.; Franchini, M.; et al. 2013. OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA, 23: 170.

Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; Davison, J.; Shleifer, S.; von Platen, P.; Ma, C.; Jernite, Y.; Plu, J.; Xu, C.; Scao, T. L.; Gugger, S.; Drame, M.; Lhoest, Q.; and Rush, A. M. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45. Online: Association for Computational Linguistics.

Zhang, X.; Zhao, J. J.; and LeCun, Y. 2015. Character-level Convolutional Networks for Text Classification. In NIPS.

Zhu, Y.; Zhang, P.; ul Haq, E.; Hui, P.; and Tyson, G. 2023. Can ChatGPT Reproduce Human-Generated Labels? A Study of Social Computing Tasks. ArXiv, abs/2304.10145.