A key factor behind the recent progress in natural language processing is the development of large-scale text corpora used to train increasingly large language models. These datasets have grown from just a few gigabytes to hundreds of gigabytes over the past few years (Chelba et al., 2013; Xue et al., 2020; Graff et al., 2003; Brown et al., 2020). Because it is so expensive to perform manual review on nearly-terabyte-scale datasets, they are lower quality than smaller, more curated datasets. These data issues have implications far beyond metrics like perplexity or validation loss, as learned models reflect the biases present in their training data (Bender et al., 2021; Wallace et al., 2019; Sheng et al., 2020). As a result, quantitatively and qualitatively understanding the datasets themselves is a research challenge in its own right (Dodge et al., 2021).
We show that one particular type of bias, duplicated training examples, is pervasive: many of the sequences in several common NLP datasets are repeated multiple times. While naive deduplication is straightforward (and the datasets we consider already perform some naive form of deduplication), performing thorough deduplication at scale is computationally challenging and requires sophisticated techniques.
We propose two scalable techniques to detect and remove duplicated training data. Exact substring matching identifies verbatim strings that are repeated. This allows us to identify cases where only part of a training example is duplicated (§4.1). Approximate full document matching uses hash-based techniques (Broder, 1997) to identify pairs of documents with high $n$-gram overlap (§4.2).
We identify four distinct advantages to training on datasets that have been thoroughly deduplicated.
Over 1% of tokens emitted unprompted from a model trained on standard datasets (e.g., C4) are part of a memorized sequence (see §6.2), even though the 1.5 billion parameter model is much smaller than the 350GB dataset it was trained on. Deduplicating the training dataset reduces the rate at which the model emits memorized training data.
Train-test overlap is common in non-deduplicated datasets. For example, we find a 61-word sequence in C4 (Raffel et al., 2020) that is repeated verbatim in both the training dataset and the validation set. (The sequence reads: “by combining fantastic ideas, interesting arrangements, and follow the current trends in the field of that make you more inspired and give artistic touches. We’d be honored if you can apply some or all of these design in your wedding. believe me, brilliant ideas would be perfect if it can be applied in real and make the people around you amazed!”) This train-test set overlap not only causes researchers to over-estimate model accuracy, but also biases model selection towards models and hyperparameters that intentionally overfit their training datasets.
Training models on deduplicated datasets is more efficient. Processing a dataset with our framework requires only a CPU-only, linear-time algorithm. Because the deduplicated datasets are smaller, training on them directly reduces the training cost in terms of time, dollars, and environmental impact, even when the deduplication runtime itself is included (Bender et al., 2021; Strubell et al., 2019; Patterson et al., 2021).
Deduplicating training data does not hurt perplexity: models trained on deduplicated datasets have no worse perplexity than baseline models trained on the original datasets, and in some cases deduplication reduces perplexity. Further, because recent LMs are typically limited to training for just a few epochs (Radford et al., 2019; Raffel et al., 2020), by training on higher quality data the models can reach higher accuracy faster.
To summarize, data deduplication offers significant advantages and no observed disadvantages. In the remainder of this paper we present our text deduplication framework in §4, and study the extent of duplicate content in common NLP datasets (e.g., C4, Wiki-40B, and LM1B) in §5. We then examine the impact of deduplication on test perplexity (§6.1) and on the frequency of emitting memorized content (§6.2). Finally, we analyze to what extent perplexity on existing, released models is skewed as a result of overlap between the train and test/validation splits (§6.3).
2 Related Work
Large language model datasets.
While we believe our results are independent of model architecture, we perform our analysis on Transformer-based decoder-only language models (Vaswani et al., 2017) trained for open-ended text generation. These current state-of-the-art models are trained on internet text. For example, the GPT-2 family of models (Radford et al., 2019) is trained on WebText, a dataset of web documents highly ranked on Reddit; however, this dataset was not made available publicly. A common dataset starting point is CommonCrawl, an index of public webpages. Models trained on CommonCrawl include GPT-3 (Brown et al., 2020), with the addition of book datasets; GROVER (Zellers et al., 2019), on a restricted subset filtered to news domains called RealNews; and T5 (Raffel et al., 2020), on a cleaned version of CommonCrawl called C4. Other models are trained on more curated Internet sources. For example, Guo et al. (2020) used high quality processed Wikipedia text from 40 different languages to train monolingual 141.4M parameter language models. Non-English models necessarily use different datasets; Zeng et al. (2021), for instance, introduced PanGu-α, a family of models with up to 200B parameters that were trained on a non-public corpus of cleaned and filtered Chinese-language documents from CommonCrawl and other sources. Since many of these datasets are not public, we deduplicate three that are: Wiki-40B, C4, and RealNews, as well as the One Billion Word Language Model Benchmark (Chelba et al., 2013), a smaller dataset commonly used for evaluation.
Contamination of downstream tasks.
When models are trained on datasets constructed by crawling the Internet, it is possible the model will train on the test set of downstream target tasks. For example, Radford et al. (2019, §4) performed a post-hoc analysis to identify 8-gram overlaps between GPT-2’s training set and datasets used for evaluation, and Dodge et al. (2021) analyzed C4 and found that up to 14.4% of test examples for various standard tasks were found verbatim (normalizing for capitalization and punctuation) in the dataset. A more proactive approach removes contaminated data. Trinh and Le (2018, Appendix B) removed documents from their CommonCrawl-based train set that overlapped substantially with the commonsense reasoning tasks used for evaluation. GPT-3 (Brown et al., 2020, §5) did the reverse and removed downstream evaluation examples from their training data by conservatively filtering out any train set examples with a 13-gram overlap with any evaluation example. Some of the tasks were flagged as potentially contaminated.
In our research, we do not focus on the impact of duplicate text in pretrained models on downstream benchmark tasks; instead we address how duplicate text in the LM training and validation sets impacts model perplexity and the extent to which generated text includes memorized content.
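The $n$-gram overlap filtering used in the contamination studies above can be illustrated with a minimal sketch. It assumes whitespace tokenization, and the names `ngrams` and `filter_contaminated` are hypothetical; it is not the implementation used by any of the cited works.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def filter_contaminated(train_docs: Iterable[str],
                        eval_docs: Iterable[str],
                        n: int = 13) -> List[str]:
    """Keep only training documents that share no n-gram with any eval document.

    Whitespace tokenization is an assumption made for illustration.
    """
    eval_grams: Set[Tuple[str, ...]] = set()
    for doc in eval_docs:
        eval_grams |= ngrams(doc.split(), n)
    return [doc for doc in train_docs
            if ngrams(doc.split(), n).isdisjoint(eval_grams)]
```

A smaller `n` makes the filter more aggressive: documents shorter than `n` tokens produce no $n$-grams and are therefore never flagged.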
Memorizing Train Sets.
The risks of data memorization, for example the ability to extract sensitive data such as valid phone numbers and IRC usernames, are highlighted by Carlini et al. (2020). While their paper identifies 604 samples that GPT-2 emitted from its training set, we show that over 1% of the data most models emit is memorized training data. In computer vision, memorization of training data has been studied from various angles for both discriminative and generative models (e.g., Arpit et al., 2017; Webster et al., 2019; Feldman and Zhang, 2020; Stephenson et al., 2021; Teterwak et al., 2021).
Duplicate text in training data.
The Book Corpus (Zhu et al., 2015), which was used to train popular models such as BERT, has a substantial amount of exact-duplicate documents according to Bandy and Vincent (2021). Allamanis (2019) shows that duplicate examples in code datasets cause worsened performance on code understanding tasks.
3 Language Modeling Datasets
We analyze the presence of duplicate text in four datasets of varying sizes that have been used for training natural language generation systems, producing general-purpose pre-trained models, and for language model benchmarking. While this paper restricts itself to English datasets, we expect that non-English datasets suffer from similar issues and could likewise benefit from de-duplication.
Wiki-40B
consists of multi-lingual cleaned Wikipedia text (Guo et al., 2020). We take the English portion, which contains 2.9M Wikipedia pages with an average length of 768 BPE tokens. The dataset creators do not indicate any deduplication was performed aside from removing redirect-pages (e.g., “sunflower” to “Helianthus”).
One-Billion Word benchmark (LM1B)
contains 30M sentences of news commentary (Chelba et al., 2013). Unlike the other datasets we analyze, LM1B’s examples are one sentence long rather than multi-sentence documents. The average example length is 32 BPE tokens. While this dataset is extremely standard for benchmarking language models, Radford et al. (2019, Sec 4) note it has 13.2% overlap of the test set with the train set.
Colossal Cleaned Common Crawl (C4)
is made up of 360M web documents, with an average length of 486 BPE tokens (Raffel et al., 2020). C4 was introduced as a pre-training dataset for T5, a set of encoder-decoder models which have been widely used in fine-tuned downstream tasks. The dataset was previously deduplicated in a more sophisticated process than the prior two datasets. Each paragraph was hashed and paragraphs resulting in hash collisions were removed. This was followed by a pass that removed placeholder text, code, and prohibited words. See Dodge et al. (2021) for a detailed breakdown of the source text in C4.
RealNews
is a subset of the Common Crawl consisting of articles from news domains (Zellers et al., 2019). It contains 31M documents with average length 793 BPE tokens. RealNews was de-duplicated by inserting a hash of the first 100 characters of each document into a Bloom filter and then excluding any example whose hash matched an example already added to the dataset. Like C4, examples with duplicate URLs were excluded.
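To make the prefix-based scheme described for RealNews concrete, here is a minimal sketch. The `BloomFilter` class, its bit-array size, its number of hash functions, and the function name `dedup_by_prefix` are all illustrative assumptions; the original pipeline's parameters are not given in the text.

```python
import hashlib
from typing import Iterable, List


class BloomFilter:
    """A tiny Bloom filter; the sizes and hash count are illustrative assumptions."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str) -> List[int]:
        # Derive several bit positions by salting a SHA-256 hash of the item.
        return [
            int.from_bytes(
                hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()[:8],
                "big") % self.num_bits
            for seed in range(self.num_hashes)
        ]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def dedup_by_prefix(docs: Iterable[str], prefix_len: int = 100) -> List[str]:
    """Keep the first document seen for each distinct prefix of `prefix_len` chars."""
    seen = BloomFilter()
    kept = []
    for doc in docs:
        key = doc[:prefix_len]
        if key in seen:
            continue  # treated as a duplicate (Bloom filters allow false positives)
        seen.add(key)
        kept.append(doc)
    return kept
```

A scheme of this kind only catches documents whose opening characters agree exactly, so a copy of an article with, say, a photo caption prepended would pass through unchanged.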
4 Methods for Identifying Duplicates
The simplest technique to find duplicate examples would be to perform exact string matching between all example pairs, but as we will show, this is insufficient. We introduce two complementary methods for performing deduplication. First, using a suffix array (Manber and Myers, 1993), we remove duplicate substrings from the dataset if they occur verbatim in more than one example. Second, we use MinHash (Broder, 1997), an efficient algorithm for estimating the $n$-gram similarity between all pairs of examples in a corpus, to remove entire examples from the dataset if they have high $n$-gram overlap with any other example.
We consider a dataset $D = \{x_i\}_{i=1}^{N}$ as a collection of examples $x_i$. Each of these examples is itself a sequence of tokens: $x_i = [x_i^1, x_i^2, \ldots, x_i^{s_i}]$.
4.1 Exact Substring Duplication
Due to the diversity of possibilities in human language, it is rare for the same idea to be expressed identically in multiple documents unless one expression is derived from the other, or both are quoting from a shared source. This observation motivates deduplicating exact substrings. We call our approach ExactSubstr. When two examples $x_i$ and $x_j$ share a sufficiently long substring (that is, a substring for which $x_i^{a..a+k} = x_j^{b..b+k}$), that substring is removed from one of them. Based on statistical analyses (§4.1.3), we select $k = 50$ tokens as the minimum matching substring length. A breakdown of the computation needed for this approach can be found in Appendix B.
4.1.1 Suffix Arrays
This exact-substring-matching criterion, while conceptually simple, is computationally prohibitive with naive (quadratic) all-pair matching. To solve this problem, we concatenate all the examples of the entire dataset into a giant sequence $\mathcal{S}$, and construct a suffix array $\mathcal{A}$ of $\mathcal{S}$. A suffix array (Manber and Myers, 1993) is a representation of a suffix tree (Weiner, 1973) that can be constructed in time linear in $\|\mathcal{S}\|$ (Kärkkäinen and Sanders, 2003) and allows for efficient computation of many substring queries; in particular, it allows us to identify duplicated training examples in linear time. Suffix arrays have been used widely in NLP for applications such as efficient TF-IDF computation (Yamamoto and Church, 2001) and document clustering (Chim and Deng, 2007).
The suffix array $\mathcal{A}$ for a sequence $\mathcal{S}$ is a lexicographically ordered list of all suffixes contained in the sequence. Formally,
$$\mathcal{A}(\mathcal{S}) = \operatorname{arg\,sort}\ \text{all\_suffixes}(\mathcal{S}).$$
For example, the suffixes of the sequence “banana” are (“banana”, “anana”, “nana”, “ana”, “na”, “a”) and so the suffix array is the sequence (6 4 2 1 5 3).
Suffix arrays are often preferable to suffix trees because, while asymptotically less efficient for some types of queries, they are ten to a hundred times more memory efficient (Manber and Myers, 1993), requiring just 8 bytes per input token.
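The “banana” example can be reproduced with a naive construction. This is a sketch for illustration only: it sorts the suffixes directly, which is far slower than the linear-time algorithms cited above, and the function name `suffix_array` is hypothetical.

```python
from typing import List, Sequence


def suffix_array(seq: Sequence) -> List[int]:
    """Return the 1-indexed start positions of the suffixes of `seq` in sorted order.

    Naive construction that materializes and sorts every suffix; it is meant
    for toy inputs, not for dataset-scale sequences.
    """
    return sorted(range(1, len(seq) + 1), key=lambda i: seq[i - 1:])


print(suffix_array("banana"))  # [6, 4, 2, 1, 5, 3]
```

The same function works on a list of token ids, since Python compares lists lexicographically.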
4.1.2 Parallel Substring Matching
After constructing $\mathcal{A}$, it is straightforward to identify duplicated training examples. Suppose that the sequence $s$ was repeated exactly twice in the training dataset $\mathcal{S}$ at positions $i$ and $j$, that is, $\mathcal{S}_{i..i+|s|} = \mathcal{S}_{j..j+|s|}$. Then the indices $i, j$ will occur adjacent to each other in the suffix array $\mathcal{A}$.
Finding all repeated sequences is therefore a matter of linearly scanning the suffix array from beginning to end and looking for sequences that share a common prefix of at least some threshold length. Any satisfying sequences are recorded. This algorithm is embarrassingly parallel, and so we can efficiently process the dataset.
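The linear scan over adjacent suffix-array entries can be sketched as follows. This is a single-threaded toy version under stated assumptions: the suffix array is built naively, positions are 0-indexed, and the names `common_prefix_len` and `repeated_spans` are hypothetical. A parallel version could split the suffix array into contiguous chunks and scan each chunk independently.

```python
from typing import List, Sequence, Tuple


def common_prefix_len(a: Sequence, b: Sequence) -> int:
    """Length of the longest common prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def repeated_spans(seq: Sequence, k: int) -> List[Tuple[int, int]]:
    """Return sorted 0-indexed position pairs whose suffixes share >= k tokens.

    Only suffixes that are adjacent in the suffix array are compared, so a
    sequence repeated m times yields m - 1 pairs rather than every pair.
    """
    # Naive suffix array construction; acceptable for a toy input only.
    order = sorted(range(len(seq)), key=lambda i: seq[i:])
    pairs = []
    for i, j in zip(order, order[1:]):
        if common_prefix_len(seq[i:], seq[j:]) >= k:
            pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)
```

For the token list `"the cat sat . the cat ran .".split()` and `k=2`, the only reported pair is `(0, 4)`, the two occurrences of “the cat”.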
4.1.3 Setting a threshold of duplicates
The final question that remains to be answered is how long a substring match must be before we count it as a duplicate. In Figure 1, we plot the frequency of substring matches within the four datasets we will consider. For each substring of length $k$, we compute the probability that there exists another sequence of length $k$ identical to this one; formally:
$$m(k) = \Pr_{i \in [\|\mathcal{S}\|]}\left[\exists\, j \neq i : \mathcal{S}_{i..i+k} = \mathcal{S}_{j..j+k}\right].$$
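The quantity described above, the probability that a length-$k$ substring has an identical copy at another position, can be estimated on a small token sequence by counting windows. This sketch is an assumption-laden illustration: it stores every window in memory, which would not scale to the datasets studied here, and the name `duplicate_fraction` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def duplicate_fraction(seq: Sequence, k: int) -> float:
    """Fraction of length-k windows of `seq` that also occur at another position."""
    windows = [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    if not windows:
        return 0.0
    counts = Counter(windows)
    return sum(1 for w in windows if counts[w] > 1) / len(windows)
```

Evaluating this function over a range of `k` values gives the kind of curve from which a threshold can be read off.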
We choose 50 tokens as the threshold to be conservative: the “bend in the knee” occurs at a shorter length, and manual inspection of length-25 matches found no false positives. We then doubled this value to have an exceptionally large margin for error.
Table 1. Example near-duplicate pairs from each dataset.
|Dataset||Example||Near-duplicate example|
|Wiki-40B||\n_START_ARTICLE_\nHum Award for Most Impactful Character \n_START_SECTION_\nWinners and nominees\n_START_PARAGRAPH_\nIn the list below, winners are listed first in the colored row, followed by the other nominees. […]||\n_START_ARTICLE_\nHum Award for Best Actor in a Negative Role \n_START_SECTION_\nWinners and nominees\n_START_PARAGRAPH_\nIn the list below, winners are listed first in the colored row, followed by the other nominees. […]|
|LM1B||I left for California in 1979 and tracked Cleveland ’s changes on trips back to visit my sisters .||I left for California in 1979 , and tracked Cleveland ’s changes on trips back to visit my sisters .|
|RealNews||KUALA LUMPUR (Reuters) - Roads in Southeast Asia have been getting a little louder lately as motorcycle makers, an aspiring middle class and easy bank credit come together to breed a new genus of motorcyclists – the big-bike rider. […]||A visitor looks at a Triumph motorcycle on display at the Indonesian International Motor Show in Jakarta September 19, 2014. REUTERS/Darren Whiteside\n KUALA LUMPUR (Reuters) - Roads in Southeast Asia have been getting a little […] big-bike rider. […]|
|C4||Affordable and convenient holiday flights take off from your departure country, "Canada". From May 2019 to October 2019, Condor flights to your dream destination will be roughly 6 a week! Book your Halifax (YHZ) - Basel (BSL) flight now, and look forward to your "Switzerland" destination!||Affordable and convenient holiday flights take off from your departure country, "USA". From April 2019 to October 2019, Condor flights to your dream destination will be roughly 7 a week! Book your Maui Kahului (OGG) - Dubrovnik (DBV) flight now, and look forward to your "Croatia" destination!|
4.2 Approximate Matching with MinHash
We also perform approximate deduplication based on matching entire examples. This method, which we call NearDup, is a good complement to the exact substring matching, especially for web crawl text, as it handles the very common case of documents being identical except for interspersed templated fields (such as the last row of Table 1).
MinHash (Broder, 1997) is an approximate matching algorithm widely used in large-scale deduplication tasks (Versley and Panchenko, 2012; Gabriel et al., 2018; Gyawali et al., 2020), including to deduplicate the training set for a large Chinese-language LM (Zeng et al., 2021). Given two documents $x_i$ and $x_j$, the main idea is to represent each document by its respective set of $n$-grams $d_i$ and $d_j$. We can then use hash functions to quickly approximate the Jaccard Index (Jaccard, 1912):
$$\mathrm{Jaccard}(d_i, d_j) = \frac{|d_i \cap d_j|}{|d_i \cup d_j|}$$
If the Jaccard Index between $d_i$ and $d_j$ is sufficiently high, it is likely that the documents are approximate matches of each other. To efficiently approximate the Jaccard index, MinHash constructs document signatures by sorting each of the $n$-grams via a hash function, and then keeping only the $k$ smallest hashed $n$-grams. There are multiple ways to construct estimators of the Jaccard index from these kinds of signatures (Cohen, 2016).
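The signature construction just described (hash every $n$-gram, keep the smallest hashes) can be sketched with a bottom-$k$ estimator, one of the several possible estimators. The hash function (SHA-1 truncated to 64 bits), the whitespace tokenization in the usage note, and all function names are assumptions made for illustration, not necessarily what the implementation in this paper does.

```python
import hashlib
from typing import List, Set, Tuple


def ngram_set(tokens: List[str], n: int = 5) -> Set[Tuple[str, ...]]:
    """Set of contiguous n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Exact Jaccard index; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(gram: Tuple[str, ...]) -> int:
    # 64-bit hash of an n-gram (the choice of SHA-1 is an assumption).
    data = "\x1f".join(gram).encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def signature(grams: Set[Tuple[str, ...]], k: int) -> List[int]:
    """Bottom-k signature: the k smallest hash values, in ascending order."""
    return sorted(_hash(g) for g in grams)[:k]


def estimate_jaccard(sig_a: List[int], sig_b: List[int], k: int) -> float:
    """Estimate the Jaccard index from two bottom-k signatures."""
    merged = sorted(set(sig_a) | set(sig_b))[:k]  # bottom-k of the union
    if not merged:
        return 1.0
    shared = set(sig_a) & set(sig_b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

When `k` is at least the number of distinct $n$-grams in both documents, the estimate coincides with the exact Jaccard index (barring hash collisions); smaller `k` trades accuracy for a fixed-size signature.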
In our implementation, we use 5-grams and a signature of size 9,000. The probability that two documents are considered a potential match is
$$\Pr\left(d_i, d_j \mid \mathrm{Jaccard}(d_i, d_j) = s_{i,j}\right) = 1 - \left(1 - s_{i,j}^{b}\right)^{r}$$
where $b$ and $r$ are user-settable parameters to control the strength of the filter. See Appendix A for more details.
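To see how two such parameters shape the filter, the sketch below assumes the standard banded locality-sensitive hashing form $1 - (1 - s^{b})^{r}$, with hypothetical parameter names `b` and `r`. The values used in the usage note, `b=20` and `r=450`, are illustrative assumptions chosen only so that their product equals the signature size of 9,000.

```python
def match_probability(s: float, b: int, r: int) -> float:
    """Probability 1 - (1 - s**b)**r that two documents become a candidate pair.

    `s` is the Jaccard index of the two documents; `b` and `r` are the
    (assumed) filter-strength parameters.
    """
    return 1.0 - (1.0 - s ** b) ** r
```

Under these assumed values, `match_probability(0.8, 20, 450)` is above 0.9 while `match_probability(0.5, 20, 450)` is below 0.01, which shows the sharp S-shaped threshold that makes this kind of filter useful.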
For each pair of documents identified as a potential match, more computationally expensive similarity metrics can be employed as a subsequent filtering step. In particular, we identify two documents as duplicates if they are matched by the MinHash algorithm and their edit similarity is greater than 0.8. The edit similarity between token sequences $x_i$ and $x_j$ is defined as:
$$\mathrm{EditSim}(x_i, x_j) = 1 - \frac{\mathrm{EditDistance}(x_i, x_j)}{\max(|x_i|, |x_j|)}$$
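A common way to define edit similarity is one minus the edit distance normalized by the length of the longer sequence; the sketch below assumes that definition and uses Levenshtein distance (insertions, deletions, substitutions) over tokens. The function names are hypothetical.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences, using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: Sequence, b: Sequence) -> float:
    """1 - edit_distance / max length; two empty sequences count as identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

For the LM1B pair in Table 1, which differs by a single comma, whitespace tokenization gives sequences of 19 and 20 tokens with edit distance 1, so the similarity under this definition is 0.95, above the 0.8 threshold.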
5 Deduplication Results
We deduplicate each of the four datasets with both of our two techniques. When text was duplicated across multiple data splits, we prioritized keeping a copy in the test or validation set and removing it from the train set.
[Table 2. Columns: % train examples with dup in train; % train examples with dup in valid; % valid with dup in train.]
[Table 3. Columns: % train tokens with dup in train; % train tokens with dup in valid; % valid with dup in train.]
5.1 Amount of Text Removed
With NearDup, we found that the web-scrape datasets contain between 3.04% (on C4) and 13.63% (on RealNews) near duplicates (Table 2). Near-duplicate text is much less common in Wiki-40B, forming only 0.39% of the train set. (Most duplicates we saw were automatically generated pages, such as the outcomes of sports games.) This shows the strength of manual curation for creating high-quality datasets. In C4, the majority (1.8M) of near-duplicate clusters consisted of just a single pair of examples that matched against each other, but there were 280 clusters with over 5,000 examples in them (Figure 2), including one cluster of size 250,933.
On average with ExactSubstr, we remove more total content than with NearDup (despite ExactSubstr not removing any examples outright); C4 is one example. The exception is LM1B, where ExactSubstr removes less data than NearDup. On investigation, we find this is due to the fact that LM1B documents are significantly shorter: many of them are under 50 tokens, and so are not even candidates for potential matches even if the entire sequence matched verbatim. We find that both NearDup and ExactSubstr remove similar content: many of the training examples that NearDup removes from C4 have at least one verbatim length-50 match found by ExactSubstr.
5.2 Properties of Duplicated Text
While the authors of both RealNews and C4 explicitly attempted deduplication during dataset construction, the methods were insufficient to capture the more subtle types of duplicate text commonly found on the internet. In C4 and Wiki-40B, we qualitatively observe that much of the text identified as near-duplicated is computer-generated. The text is identical except for the names of places, businesses, products, dates, and so on. Because these examples frequently differ by just a few words at a time, deduplication strategies relying on exact string matching would fail to identify a match. Example duplicate pairs from each dataset can be found in Table 1 (more examples in the Appendix).
For RealNews and LM1B, which are both derived from news sites, we observe that many near-duplicates occur because the same news article appears on multiple news sites with slightly different formatting. For example, in LM1B, there is one example that starts “MINEOLA , N.Y. - New York officials say […]” and another that starts “( AP ) - New York officials say […]”. The two examples are otherwise identical.
5.3 Train / Test Set Leakage
Both deduplication methods identify overlap between the train set and the validation set (Table 2). For example, 4.6% of the C4 validation set and 14.4% of the RealNews validation set examples had an approximate duplicate in their respective training sets. Such duplication is problematic since it could cause evaluation metrics to be unfairly inflated for models that are better at memorizing their train sets. We evaluate the effect of this leakage on publicly released models in §6.3.
6 Impact on Trained Models
We trained 1.5B parameter “XL”, decoder-only, Transformer-based language models similar to GPT-2, on C4-Original, C4-NearDup, and C4-ExactSubstr, respectively. We use the T5 codebase and model architecture from Raffel et al. (2020), and each model was trained for about two epochs on its respective dataset. To better understand the amount of variance in the perplexities of trained models, we also trained three different random seeds of the 110M parameter “base” model for each of the above three datasets, for a total of nine base-sized models.
For all experiments, we used a Byte Pair Encoding (BPE) vocabulary trained on C4-NearDup with a budget of 50K tokens, which resulted in a vocabulary the same size as GPT-2’s. We trained with a maximum sequence length of 512 tokens (for longer documents, we randomly extracted subsequences of this length). Further training details can be found in Appendix C.
6.1 Model Perplexity
We computed the perplexity of our trained models on the validation sets of LM1B and Wiki-40B, and on subsets of the C4 validation set (Figure 3). For the base size, we observe that all models have similar perplexity on the original C4 validation set and on validation set examples that were identified as unique (no near-duplicate in either train or validation). However, both models trained on deduplicated data have significantly higher perplexity on validation set examples that have duplicates in the training set than the model trained on the original C4. ExactSubstr deduplication results in higher perplexity than NearDup deduplication. These trends hold true for the XL sized model as well. While this may suggest that ExactSubstr deduplication results in the models that are least overfit on the train set, note that the two techniques use separate duplicate thresholds, and a different choice of thresholds could change the results.
When evaluating on the validation sets of LM1B and Wiki-40B, we found that models trained on NearDup-deduplicated C4 consistently achieved the lowest perplexity (for LM1B eval with base models, see Appendix Figure 7). ExactSubstr deduplication decreases the perplexity of the XL model by almost 3 points on Wiki-40B, which is much larger than the variation of about 1 point of perplexity we observed in the base models. This is despite seeing fewer tokens of training data overall.
Lastly, we note all our XL models achieved <35 perplexity on LM1B, which is less than the 42.16 perplexity reported for the 1.5B GPT-2 using a vocabulary the same size as ours.
6.2 Generated Text
Data duplication has the effect of biasing the trained LM towards particular types of examples. This can contribute to a lower diversity of generations, and increased likelihood that the generated content is copied from the training data (Carlini et al., 2020). For our generation experiments, we use top-$k$ random sampling and experiment with prompted and unprompted generation.
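Top-$k$ random sampling restricts each decoding step to the $k$ highest-scoring tokens and samples among them in proportion to their softmax probabilities. The sketch below shows a single step under stated assumptions: `logits` is a plain list of scores, the function name `top_k_sample` is hypothetical, and the value of `k` is left to the caller.

```python
import math
import random
from typing import List, Optional


def top_k_sample(logits: List[float], k: int,
                 rng: Optional[random.Random] = None) -> int:
    """Sample a token index from the k highest-scoring entries of `logits`."""
    rng = rng or random.Random()
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    # Unnormalized softmax weights, shifted by the peak for numerical stability.
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]
```

With `k=1` this reduces to greedy decoding; larger `k` admits more diverse continuations.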
[Table 4. Columns: Model; 1 Epoch; 2 Epochs.]
We first evaluate memorization tendencies in the case where the model is asked to generate text without any prompt sequence. We generate 100,000 samples, each up to 512 tokens in length (examples provided in the Appendix). For each generated token, we say the token is memorized if it is part of a 50-token substring that is exactly contained in the training data. On XL-Original, over 1% of the generated tokens belong to memorized sub-sequences (see Table 4). This is more memorization than XL-ExactSubstr or XL-NearDup. Some example subsequences that were copied verbatim from the train set can be found in Table 8 in the Appendix.
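The token-level memorization criterion can be sketched directly from its definition: a generated token counts as memorized if it lies inside some length-$k$ window that appears verbatim in the training data. The version below holds every training window in a Python set, which is an assumption made for clarity and would not scale to a full dataset; the function names are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def window_set(train_docs: Sequence[Sequence], k: int) -> Set[Tuple]:
    """All length-k token windows that occur in any training document."""
    return {tuple(doc[i:i + k]) for doc in train_docs
            for i in range(len(doc) - k + 1)}


def memorized_mask(generated: Sequence, train_windows: Set[Tuple],
                   k: int) -> List[bool]:
    """mask[t] is True if token t lies in a length-k window found in training."""
    mask = [False] * len(generated)
    for i in range(len(generated) - k + 1):
        if tuple(generated[i:i + k]) in train_windows:
            for j in range(i, i + k):
                mask[j] = True
    return mask
```

The fraction of memorized tokens in a sample is then `sum(mask) / len(mask)`; with `k=50` this mirrors the 50-token criterion used in the text.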
In most real use cases, language model generation is controlled by providing a prompt for the model to continue. We experiment with four possible prompt sources: training examples identified by ExactSubstr as having near-duplicates in the train set (train dup), training examples identified as unique (train unique), validation set examples with a near-duplicate in the train set (valid in train), and validation set examples identified as unique across all splits (valid unique). We select the first 32 tokens of each example as the prompt, which means we can evaluate the fraction of generations which are near-duplicates with the ground-truth continuation for the prompt (Figure 4). When the prompt comes from duplicate examples in the train set, XL-Original reproduces the ground-truth continuation over 40% of the time. XL-ExactSubstr and XL-NearDup still copy the ground truth more often when the prompt comes from a duplicate example than when the prompt comes from a unique example, suggesting that more stringent deduplication may be necessary to remove memorization tendencies entirely.
6.3 Impact on Existing Models
Train-test leakage does not just impact models trained on C4. In Table 5, we show that whether or not an evaluation example has a near-duplicate in the train set has a significant impact on model perplexity for two standard models: Transformer-XL (Dai et al., 2019), which was trained on LM1B, and GROVER (Zellers et al., 2019), which was trained on RealNews. For Transformer-XL, the perplexity halves on examples identified as near-duplicates. For GROVER, the difference in perplexities is present in both the 124M and 1.5B parameter models but is not quite as stark as for Transformer-XL.
Existing models also suffer from the problem of generating text from their train sets. We find that some of the tokens in the official release of 25k GROVER-Mega outputs (gs://grover-models/generation_examples/generator=mega~dataset=p0.90.jsonl) are part of verbatim matches in RealNews of at least length 50. Likewise, more than 5% of the tokens in ~200k sequences outputted by GPT-Neo 1.3B (Black et al., 2021) are part of 50-token matches of its training data, the Pile (Gao et al., 2020).
The focus of this paper is on the datasets used to train language models. While recent work focused on documenting the potential harms that could arise from problematic datasets (Bender and Friedman, 2018; Gebru et al., 2020), less work has been done to quantitatively analyze properties of real language modelling datasets, as Dodge et al. (2021) have done for C4. Our paper provides analysis on one particular axis, that of data duplication.
Our experiments measured what could be quantified: the amount of duplicate content in common datasets, the effect of deduplication on trained model perplexity, and the reduction of memorized content in trained models through deduplication. We do not focus on the nature of the data being removed by deduplication or memorized by LMs.
Privacy is an important subject for future work, as memorization could have significant privacy consequences. We use the following interpretation of privacy: if a model reveals information about examples in its training data beyond what is revealed about examples not in its training data, this is a privacy violation (Shokri et al., 2017). (Another interpretation of privacy focuses on the sensitivity of the data involved, as when a model is trained on and able to reproduce personal identifiers or other forms of "private data." Our definition is more expansive.) Training on standard datasets that have not yet been deduplicated results in models that are particularly sensitive to examples that happened to be repeated multiple times, and this has negative privacy implications. For instance, it could violate a person’s expectations of privacy if their publicly available personal data appeared in a different, surprising context. In addition, downstream applications of LMs, such as the game AI Dungeon (https://play.aidungeon.io/), in most cases should not output memorized content like adverts for real-world products.
We stress that in our experiments, we do not distinguish between undesired memorized text (such as phone numbers), innocuous memorized text (common phrases), and text we may want to be memorized (such as a quote by a public figure), and instead treat all instances of the LM generating text that closely matches the training set as problematic. While we qualitatively observed that much of the identified memorized content was relatively innocuous, a more systematic study of the risks associated with the detected memorization was beyond the scope of this work.
We also do not investigate the negative consequences of deduplication. Some language tasks explicitly require memorization, like document retrieval or closed-book question answering. Also, text that gives attribution is often duplicated across documents, so removing duplicate substrings could correspond to removing just the attribution, which could result in models that learn the content without its attached attribution. Deduplication is also not sufficient to remove privacy-sensitive data like bank passwords and medical records which should never be used in training data.
Ultimately, whether memorization is a desired property of a language model, or else risky and unwanted, depends both on the nature of the text that has been memorized and on the downstream applications of the trained model. However, because the trend has been towards creating datasets and models that are application-agnostic, we encourage researchers to think carefully about the limitations of the data they have collected and how the model’s intended usage constrains what should be part of the training set. Developing techniques to memorize or forget specific sequences depending on the end application is a promising research direction.
We encourage future language model research to perform dataset deduplication, either by training on the deduplicated datasets we release, using the deduplication tools we release, or following our approach to deduplicate datasets with new tools.
The exact technique used to perform deduplication is less important than doing stringent deduplication in the first place. On the whole, deduplication does not harm, and sometimes improves, model perplexity, despite the fact that the deduplicated datasets are smaller, and thus, faster to train on. It is especially important that there are no duplicates between the training and testing sets, because overlap here explicitly encourages selecting models that memorize the training data. Lastly, deduplication helps to reduce the privacy concerns around language models memorizing their training data.
We are grateful to the many researchers whose technical help, feedback, and discussions shaped this project: Jacob Austin, Samy Bengio, Olivier Bousquet, James Bradbury, Fernando Diaz, Mark Diaz, Noah Fiedel, Jonathan Frankle, David Grangier, Stefanie Karp, David Mimno, Gaurav Mishra, Michael Mozer, Sharan Narang, Alex Passos, Adam Roberts, Hanie Sedghi, Jascha Sohl-dickstein, David So, Florian Tramer, and Yun William Yu. We are also grateful to the Google Brain women who have given us continuous support.
Each of the authors on this paper significantly contributed to the final results.
Katherine trained the models used in the paper, built and ran the eval and text generation pipelines, and contributed significantly to writing, analysis, and project organization and management.
Daphne ran the approximate matching data deduplication pipelines, extracted prompts and evaluation datasets, ran eval pipelines, and contributed significantly to planning, writing, and analysis.
Andrew wrote the code to perform deduplication with approximate matching, helped evaluate energy expenditure, and helped with analysis.
Chiyuan helped generate plots and contributed to project scoping, writing, and data analysis.
Chris offered mentorship and guidance throughout the project and contributed to writing.
Doug offered mentorship and guidance throughout the project and contributed to writing.
Nicholas wrote the suffix array implementation, ran all ExactSubstr deduplication experiments, contributed significantly to planning, writing, and analysis, as well as scoping the project.
- The adverse effects of code duplication in machine learning models of code. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153. Cited by: §2.
- A closer look at memorization in deep networks. In International Conference on Machine Learning, pp. 233–242. Cited by: §2.
- Addressing "documentation debt" in machine learning research: a retrospective datasheet for bookcorpus. Cited by: §2.
- Data statements for natural language processing: toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics 6, pp. 587–604. Cited by: §7.
- On the dangers of stochastic parrots: can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, New York, NY, USA, pp. 610–623. Cited by: item 3, §1.
- GPT-Neo: large scale autoregressive language modeling with mesh-tensorflow. Cited by: §6.3.
- On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pp. 21–29. Cited by: §1, §4.2, §4.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems 33. Cited by: §1, §2, §2.
- Extracting training data from large language models. Cited by: §2, §6.2.
- One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005. Cited by: §1, §2, §3.
- A new suffix tree similarity measure for document clustering. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, New York, NY, USA, pp. 121–130. Cited by: §4.1.1.
- Min-hash sketches: a brief survey. Cited by: §4.2.
- Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860. Cited by: §6.3.
- Documenting the english colossal clean crawled corpus. arXiv preprint arXiv:2104.08758. Cited by: §2.
- Documenting the english colossal clean crawled corpus. Cited by: §1, §3, §7.
- What neural networks memorize and why: discovering the long tail via influence estimation. In Advances in Neural Information Processing Systems. Cited by: §2.
- Identifying and characterizing highly similar notes in big clinical note datasets. Journal of Biomedical Informatics 82, pp. 63–69. Cited by: §4.2.
- The Pile: an 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027. Cited by: §6.3.
- Datasheets for datasets. Cited by: §7.
- English gigaword. Linguistic Data Consortium, Philadelphia 4 (1), pp. 34. Cited by: §1.
- Wiki-40b: multilingual language model dataset. In LREC 2020. Cited by: §2, §3.
- Deduplication of scholarly documents using locality sensitive hashing and word embeddings. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 901–910. Cited by: §4.2.
- The distribution of the flora in the alpine zone. New phytologist 11 (2), pp. 37–50. Cited by: §4.2.
- Simple linear work suffix array construction. In International colloquium on automata, languages, and programming, pp. 943–955. Cited by: §4.1.1.
- Space efficient linear time construction of suffix arrays. In Annual Symposium on Combinatorial Pattern Matching, pp. 200–210. Cited by: Appendix B.
- Connected components at scale via local contractions. Cited by: §4.2.
- Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22 (5), pp. 935–948. Cited by: §4.1.1, §4.1.1, §4.
- Linear suffix array construction by almost pure induced-sorting. In 2009 data compression conference, pp. 193–202. Cited by: Appendix B.
- Carbon emissions and large neural network training. Cited by: Table 6, Appendix D, item 3.
- Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: item 4, §2, §2, §3.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67. Cited by: item 2, item 4, §2, §3, §6.
- Adafactor: adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pp. 4596–4604. Cited by: Appendix C.
- Towards controllable biases in language generation. arXiv preprint arXiv:2005.00268. Cited by: §1.
- Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18. Cited by: §7.
- On the geometry of generalization and memorization in deep neural networks. In International Conference on Learning Representations. Cited by: §2.
- Energy and policy considerations for deep learning in nlp. Cited by: item 3.
- Understanding invariance via feedforward inversion of discriminatively trained classifiers. In International Conference on Machine Learning, pp. 10225–10235. Cited by: §2.
- A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847. Cited by: §2.
- Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §2.
- Not just bigger: towards better-quality web corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7), pp. 44–52. Cited by: §4.2.
- Universal adversarial triggers for attacking and analyzing nlp. arXiv preprint arXiv:1908.07125. Cited by: §1.
- Detecting overfitting of deep generative networks via latent recovery. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11265–11274. Cited by: §2.
- Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory (swat 1973), pp. 1–11. Cited by: §4.1.1.
- MT5: a massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934. Cited by: §1.
- Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics 27 (1), pp. 1–30. Cited by: §4.1.1.
- Defending against neural fake news. arXiv preprint arXiv:1905.12616. Cited by: §2, §3, §6.3.
- PanGu-α: large-scale autoregressive pretrained chinese language models with auto-parallel computation. arXiv preprint arXiv:2104.12369. Cited by: §2, §4.2.
- Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pp. 19–27. Cited by: §2.
Appendix A Further Details on NearDup
For our MinHash-based deduplication method, documents are first space tokenized, then each consecutive 5-gram is hashed using tabulation hashing. The set of these hashes is the signature for the document. For each element in a document’s signature, the element is hashed using $k$ other hash functions. The minimum hashed element for each of the $k$ hash functions is stored. These minimum hashes are then partitioned into $r$ buckets, with $b$ hashes per bucket. These $b$ hashes are augmented into a single value, then if two documents have the same value in at least one bucket, they’ll be marked as a potential match. The probability that two documents are considered a potential match is equal to
$$\Pr(d_i, d_j \mid \mathrm{Jaccard}(d_i, d_j) = s_{i,j}) = 1 - \left(1 - s_{i,j}^{\,b}\right)^{r}$$
where $s_{i,j}$ is the Jaccard index between the two documents $d_i$ and $d_j$. For document pairs that were identified as potential matches, we computed their actual Jaccard index, and if that was above 0.8, we computed their edit similarity. Document pairs with edit similarity higher than 0.8 were identified as duplicates. After some experimentation, we chose $b$ and $r$ (and therefore $k = br$) so as to make sure a collision at the desired Jaccard index threshold of 0.8 had a high probability of occurring.
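The signature, bucketing, and collision-probability steps above can be sketched in Python. This is a minimal illustration, not the production implementation: it hashes the 5-grams directly with salted BLAKE2b instead of tabulation hashing, and the names `shingles`, `minhash_signature`, `bucket_keys`, `is_potential_match`, and `collision_probability`, as well as the small parameter values in the usage lines, are assumptions made for the example.

```python
import hashlib


def collision_probability(s: float, b: int, r: int) -> float:
    """Chance that two documents with Jaccard index s agree on at least one of r buckets of b hashes."""
    return 1.0 - (1.0 - s ** b) ** r


def shingles(text: str, n: int = 5) -> set:
    """Space-tokenize the text and return its set of consecutive n-grams."""
    tokens = text.split(" ")
    return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}


def minhash_signature(grams: set, k: int) -> list:
    """For each of k salted hash functions, keep the minimum hash over all n-grams."""
    signature = []
    for i in range(k):
        salt = i.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def bucket_keys(signature: list, b: int) -> list:
    """Partition the k minimum hashes into r = k // b buckets, each combined into one value."""
    return [tuple(signature[i:i + b]) for i in range(0, len(signature), b)]


def is_potential_match(sig_a: list, sig_b: list, b: int) -> bool:
    """Two documents are a potential match if any bucket has the same value in both."""
    return any(x == y for x, y in zip(bucket_keys(sig_a, b), bucket_keys(sig_b, b)))


# Illustrative parameters only (assumed for this sketch, not the values used in the paper).
b, r = 4, 8
doc = "the quick brown fox jumps over the lazy dog near the river bank"
sig = minhash_signature(shingles(doc), k=b * r)
print(is_potential_match(sig, sig, b), collision_probability(0.8, b, r))
```

Raising $r$ makes a collision more likely at a given Jaccard index, while raising $b$ makes it less likely, which is the trade-off the parameter choice above controls.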
We also tested an alternative configuration, filtering to document pairs with Jaccard index of at least 0.9 and edit similarity of at least 0.9, with a different setting of the hashing parameters. Figure 5 shows the histogram of Jaccard similarities and edit similarities for all document pairs which collided in min-hash space, for our chosen configuration (blue) and for the alternative configuration (orange). This allows us to verify the choice of threshold: if there are few comparisons around the chosen threshold, then we have likely captured the majority of actual near duplicates above that threshold. To verify that yourself, look at the left-hand tails of the distributions. Since the distributions for both 0.8 and 0.9 begin to vanish at the same point (in spite of the fact that the two configurations are optimized for accuracy around different thresholds), we feel comfortable saying that we’re capturing the majority of actual near duplicates.
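The two-stage check applied to potential matches (Jaccard index first, then edit similarity) can be sketched as follows. This is a hedged illustration: it assumes edit similarity is one minus the token-level edit distance divided by the length of the longer document, and the function names and the `threshold` parameter are hypothetical.

```python
def five_grams(text: str, n: int = 5) -> set:
    """Space-tokenize the text and return its set of consecutive n-grams."""
    tokens = text.split(" ")
    return {tuple(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def edit_distance(x: list, y: list) -> int:
    """Token-level Levenshtein distance via dynamic programming, O(len(x) * len(y))."""
    prev = list(range(len(y) + 1))
    for i, xi in enumerate(x, start=1):
        curr = [i]
        for j, yj in enumerate(y, start=1):
            cost = 0 if xi == yj else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(x: list, y: list) -> float:
    """Assumed definition: 1 - edit_distance / length of the longer sequence."""
    longest = max(len(x), len(y))
    return 1.0 if longest == 0 else 1.0 - edit_distance(x, y) / longest


def is_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    """Cheap Jaccard test first; the quadratic edit similarity only runs on survivors."""
    if jaccard(five_grams(doc_a), five_grams(doc_b)) <= threshold:
        return False
    return edit_similarity(doc_a.split(" "), doc_b.split(" ")) > threshold
```

Passing `threshold=0.9` corresponds to the stricter alternative configuration; ordering the tests this way keeps the expensive edit-distance computation off pairs that already fail the Jaccard check.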
Let $N$ be the number of documents and $T$ be the maximal number of tokens in a document. Edit similarity has a worst-case complexity of $O(T^2)$ per document pair. The overall worst-case complexity is the sum of two terms: the first is the complexity of grouping by the signatures, and the second represents the pathological worst case of all documents falling into the same buckets.
The highly distributed NearDup implementation we employed is one used for large-scale production tasks at Google. On the English C4 dataset, the algorithm consumed approximately 41.5 kWh of energy. Note that our choices of hashing parameters were designed to produce very high recall, and with different parameters, the algorithm could be made much more energy efficient while producing similar results.
Appendix B Further Details on ExactSubstr
Parallel linear time construction.
We build a parallelized linear time suffix array algorithm. As a building block, we make black-box use of the SA-IS algorithm for constructing a suffix array in linear time Nong et al. (2009); Ko and Aluru (2003). Unfortunately, this algorithm is not easily parallelized directly, so we introduce a simple divide and conquer approach to parallelizing the array construction.
We build our implementation in Rust and extend an existing suffix array library (https://github.com/BurntSushi/suffix) with three modifications. The first two are straightforward implementation differences: we modify the code to allow datasets larger than GB, and we remove the requirement that strings parse as valid UTF-8 sequences in favor of raw byte sequences. Our third change is more significant: we re-implement the algorithm so that we can stream the suffix array itself off disk.
Parallel partial suffix array construction.
Our divide and conquer suffix array construction algorithm starts by partitioning the dataset into $K$ different “splits”, with SA-IS run independently on each split in parallel. For a dataset of length $N$, this algorithm still requires $O(N)$ work but runs in $O(N/K)$ wall-clock time. This gives us $K$ separate suffix arrays $A_1, \dots, A_K$.
Given two suffix arrays $A_1$ and $A_2$ for two sequences $S_1$ and $S_2$, it is not completely trivial to construct a single suffix array $A$ for $S = S_1 \,\|\, S_2$ because of the boundary conditions. Instead of building the arrays on $S_1$ and $S_2$ directly, we let $S_1' = S_1 \,\|\, S_2[0{:}M]$ for some $M$ greater than the longest substring match. Then we build the arrays on $S_1'$ and $S_2$. To merge the arrays together we can remove the items from the first array after index $|S_1|$ and merge-sort insert them into the second.
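The overlap trick for two splits can be sketched in Python. This is an illustration under stated assumptions: a quadratic sort stands in for SA-IS, and the names `naive_suffix_array`, `partial_suffix_arrays`, and `overlap` are hypothetical. The point is that padding the first split with the start of the second makes its suffixes sort exactly as they would in the concatenated data, provided the padding is longer than the longest repeated substring.

```python
def naive_suffix_array(data: bytes) -> list:
    """Quadratic stand-in for SA-IS: suffix start positions in lexicographic order."""
    return sorted(range(len(data)), key=lambda i: data[i:])


def partial_suffix_arrays(s1: bytes, s2: bytes, overlap: int):
    """Build one array per split, padding s1 with the first `overlap` bytes of s2."""
    padded = s1 + s2[:overlap]
    # Entries that start inside the padding belong to the second split, so drop them.
    a1 = [i for i in naive_suffix_array(padded) if i < len(s1)]
    # Shift the second array so it holds positions in the concatenation s1 + s2.
    a2 = [len(s1) + i for i in naive_suffix_array(s2)]
    return a1, a2


s1, s2 = b"banana_", b"bandana"
# No substring of length 4 repeats in s1 + s2, so an overlap of 4 is long enough.
a1, a2 = partial_suffix_arrays(s1, s2, overlap=4)
print(a1, a2)
```

Both partial arrays are now ordered consistently with the suffixes of the full data, so a plain merge is enough to combine them.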
Parallel merge of partial suffix arrays.
We now merge these separate arrays together into a single suffix array $A$. Consider the simpler case of two partial suffix arrays $B$ and $C$ that we would like to merge together. We can achieve this by letting $i = 0$ index $B$ and $j = 0$ index $C$. Each iteration of the algorithm then pushes $B_i$ into $A$ if the suffix of the dataset starting at $B_i$ is smaller than the suffix starting at $C_j$, and pushes $C_j$ otherwise, advancing the corresponding index and repeating until $i$ and $j$ reach the ends of $B$ and $C$. To generalize to $K$ splits, we need only replace the single comparison above with a min-heap requiring $O(\log K)$ work on each iteration.
Observe that in the general case this algorithm is $O(Nm\log K)$, where $N$ is the length of the dataset, $m$ is the average length of a prefix match, and $K$ is the number of splits. It is therefore incorrect to call this algorithm linear time in the general case, although for our datasets it is. Because the length of the longest match is bounded above by the length of the longest sequence, as long as the size of the dataset is independent of the length of the longest sequence in the dataset, this algorithm remains efficient.
Again, we can parallelize this operation among $L$ simultaneous jobs (in practice we set $L$ to the number of threads on our machine). In the two-array case, each job processes a contiguous range of indices $i$ into $B$, choosing the matching bounds of $j$ by binary searching into $C$ so that the suffixes of $C$ it covers sort between the suffixes at the two ends of that range. The case of more than two arrays is identical except that we repeat this over all $K$ partial suffix arrays.
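The two-array merge and its heap-based generalization can be sketched in Python as below. This is a single-threaded illustration, not the parallel Rust implementation: `sorted_positions` is a hypothetical stand-in for one partial suffix array, and `merge_k` relies on the standard library's `heapq.merge`, which maintains the min-heap internally.

```python
import heapq


def sorted_positions(data: bytes, start: int, end: int) -> list:
    """Stand-in for a partial suffix array: positions in [start, end) ordered by suffix."""
    return sorted(range(start, end), key=lambda pos: data[pos:])


def merge_two(data: bytes, b: list, c: list) -> list:
    """Two-pointer merge: repeatedly push whichever head points at the smaller suffix."""
    merged, i, j = [], 0, 0
    while i < len(b) and j < len(c):
        if data[b[i]:] < data[c[j]:]:
            merged.append(b[i])
            i += 1
        else:
            merged.append(c[j])
            j += 1
    merged.extend(b[i:])  # at most one of these two tails is non-empty
    merged.extend(c[j:])
    return merged


def merge_k(data: bytes, partials: list) -> list:
    """K-way merge: a min-heap over the K heads replaces the single comparison."""
    return list(heapq.merge(*partials, key=lambda pos: data[pos:]))


data = b"banana_bandana_cabana"
partials = [sorted_positions(data, s, min(s + 7, len(data))) for s in range(0, len(data), 7)]
suffix_array = merge_k(data, partials)
print(suffix_array)
```

Each comparison here inspects the suffixes themselves, which is why the cost depends on the average match length $m$ rather than being constant per step.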
We run our algorithm on a single VM on the cloud with cores and GB of memory. Our algorithm is efficient, for example processing the Wiki-40B training set ( million examples containing GB of text) in minutes wall-clock time ( CPU-hours of work). The GB C4 dataset takes under 12 hours (wall-clock) to build a suffix array; although we are still memory constrained and so this corresponds to CPU-hours. Once the suffix array has been constructed, it takes under an hour to deduplicate the C4 dataset.
Note that this algorithm still requires that the dataset itself fits in memory (so that we can efficiently index in arbitrary positions), but we do not need to fit the entire suffix array into memory. This is fortunate since our suffix array requires an space overhead. For example, the suffix array for the GB C4 is TB.
Compared to the cost of training a language model on this dataset, the additional work required to deduplicate the training dataset is negligible.
Appendix C Further Details on Model Training
Each model was trained for about two epochs. Since both C4-Original and C4-ExactSubstr contain approximately 365M examples, we performed 152K steps with a batch size of 4800 (or approximately 2 epochs). Because C4-NearDup contains approximately 350M examples, we performed 146K steps (again approximately 2 epochs). On a 128-core TPU v3 pod slice, XL models trained on C4-Original and C4-ExactSubstr took approximately 131 hours (5.5 days) to train, while the XL model trained on C4-NearDup took approximately 126 hours to train. Like T5, models were trained with the Adafactor optimizer (Shazeer and Stern, 2018). A constant learning rate of 0.01 was used for the base models and 0.001 for the XL models.
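The step counts above can be checked with a few lines of Python. This is a small worked calculation that uses the rounded figures from the text and assumes the batch size of 4800 applies to all three datasets; the helper name `epochs` is hypothetical.

```python
def epochs(steps: int, batch_size: int, num_examples: int) -> float:
    """Number of passes over the dataset implied by a step count and batch size."""
    return steps * batch_size / num_examples


BATCH_SIZE = 4800  # assumed to be the same for every run

original_epochs = epochs(152_000, BATCH_SIZE, 365_000_000)  # C4-Original and C4-ExactSubstr
neardup_epochs = epochs(146_000, BATCH_SIZE, 350_000_000)   # C4-NearDup

print(f"{original_epochs:.3f} {neardup_epochs:.3f}")
```

Both values come out within about 0.2% of two epochs, so the smaller step count for C4-NearDup reflects the smaller dataset rather than a shorter training schedule.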
The 1.5B parameter XL models had 24 layers, each with 32 attention heads. The model embedding size was 2,048, the feed forward layers had a hidden size of 5,120, and the key/value dimension size for the attention heads was 64. The 110M parameter base models had 12 layers, each with 12 attention heads. The model embedding size was 768, the feed forward layers had a hidden size of 2,048, and the key/value dimension size for the attention heads was 64.
Appendix D Energy Consumption
| ||T5 11B||XL-Original, XL-ExactSubstr||XL-NearDup||Base-Original, Base-ExactSubstr||Total Inference|
|TPU v3 cores||512||128||128||64||64|
|Training time (days)||20||5.47||5.26||3||0.80|
We trained for approximately 131 hours or 5.5 days on a 128-core TPU v3. The approximate deduplicated dataset is 3.9% smaller than the original dataset and trains in 63 hours/epoch, saving us around 5 hours of compute time for the two epochs. The XL-Original model was trained in North America, whereas XL-ExactSubstr and XL-NearDup were trained in Taiwan. We used data from Patterson et al. (2021) to estimate the amount of energy used in training these models by computing the amount of /hour/core and multiplying by our usage (see Table 6 for how we computed these values). For simplicity, we use the figures for Taiwanese datacenters as our estimate. We estimate training 2 epochs of XL-Original and XL-ExactSubstr uses . XL-NearDup is trained for fewer steps and we estimate uses . Training each base model was approximately 3 days on a 64-core TPU v3 pod slice which uses an estimated .
In addition to model training, evaluation and inference were performed on 64-core TPU v3 pod slices. Generating 100,000 sequences from the XL models takes approximately 0.64 hours. We generated 100,000 sequences for each of five types of prompts for two checkpoints of the model for a total of 1M sequences per model. This took approximately 19.2 hours. We estimate generating 3M sequences uses .
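The inference-time figures above follow from simple arithmetic, sketched here in Python. The variable names are hypothetical, and the count of three XL models is inferred from the 3M-sequence total rather than stated directly in the paragraph.

```python
HOURS_PER_100K = 0.64      # time to generate 100,000 sequences from one XL model
SEQUENCES_PER_RUN = 100_000
PROMPT_TYPES = 5
CHECKPOINTS = 2
NUM_MODELS = 3             # assumed: XL-Original, XL-ExactSubstr, XL-NearDup

runs_per_model = PROMPT_TYPES * CHECKPOINTS
sequences_per_model = runs_per_model * SEQUENCES_PER_RUN
total_sequences = sequences_per_model * NUM_MODELS
total_hours = HOURS_PER_100K * runs_per_model * NUM_MODELS

print(sequences_per_model, total_sequences, round(total_hours, 1))
```

Ten runs per model give the 1M sequences per model, and three models give the 3M sequences and roughly 19.2 hours quoted in the text.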
|Due to high demand, we have yet to critique this request. That said, we assure that the review will be produced in due time by our dilligent and unwavering staff in a professional manner. This site is highly regarded amongst its peers in terms of speed and reliability, so feel free to check us out!||Due to a heavy overflow, we have not been able to critique this request. That said, we assure that the review will be produced in due time by our dilligent and unshakable staff in a professional manner. This site is highly regarded amongst its peers in terms of efficiency and reliability, so feel free to visit!|
|Need Pop Tacos parking? You can reserve parking near Pop Tacos with SpotHero. Find low rates without parking coupons by booking a guaranteed spot online. Avoid circling, getting ticketed or running out to feed your meter. Search our parking map, compare parking rates and reserve a discounted parking spot today. Happy parking, and enjoy your meal at Pop Tacos!||Il Sole parking. Reserve parking near Il Sole in NYC.\nYou can reserve parking near Il Sole with SpotHero. Find low rates without parking coupons by booking a guaranteed spot online. Avoid circling, getting ticketed or running out to feed your meter. Search our parking map, compare parking rates and reserve a discounted parking spot today. Happy parking, and enjoy your meal at Il Sole!|
|This item was available on Vinyl 7" but is now sold out on all formats, sorry. Take a look at what else we have in by Jumbo, check out some related artists, head over to our new releases or knock yourself out reading our latest music news & album reviews.\n2nd single edn of 550.||This item was available on CD but is now sold out on all formats, sorry. Take a look at what else we have in by Sirconical, Misty Dixon, Various, check out some related artists, head over to our new releases or knock yourself out reading our latest music news & album reviews.\nTwisted Nerve comp mini album.|
|Here is all the information you need about "No One Killed Jessica" on American Netflix. Details include the date it was added to Netflix in the USA, any known expiry dates and new episodes/seasons, the ratings and cast etc. So scroll down for more information or share the link on social media to let your friends know what you’re watching.||Here is all the information you need about "A Land Imagined" on Netflix in the UK. Details include the date it was added to UK Netflix, any known expiry dates and new episodes/seasons, the ratings and cast etc. So scroll down for more information or share the link on social media to let your friends know what you’re watching.|
|8 + 8 = Solve this simple math problem and enter the result. E.g. for 1+3, enter 4.||Math question * 7 + 1 = Solve this simple math problem and enter the result. E.g. for 1+3, enter 4.|
|Long Island College Hospital is committed to providing outstanding patient care in the Brooklyn, NY area, but before you commit to Long Island College Hospital for a Endometrial Ablation make sure you compare and shop other medical facilities. It may save you hundreds (in some cases thousands) of dollars. View a Endometrial Ablation cost comparison for Brooklyn and Request a Free Quote before you make a decision.||Morristown Memorial Hospital is committed to providing outstanding patient care in the Morristown, NJ area, but before you commit to Morristown Memorial Hospital for a Breast Ultrasound make sure you compare and shop other medical facilities. It may save you hundreds (in some cases thousands) of dollars. View a Breast Ultrasound cost comparison for Morristown and Request a Free Quote before you make a decision.|
|Text||Freq in C4|
|HD wallpaper. This wallpaper was upload at April 19, 2019 upload by admin in.You can download it in your computer by clicking resolution image in Download by size:. Don’t forget to rate and comment if you interest with this wallpaper.||40,340|
to the address posted below. Include our failure information form,a packing slip with your Company name, contact person, and Email address or phone number. Upon receipt of your repair, we\’ll inspect it and then contact you with a quote or evaluation notice. Normal turn around for repair is 5 to 7 business days, with "Rush Repair" available.
|is a great place to begin your search. Whether you are a first-time home buyer or you are already familiar with the home buying process, you can be assured that you have the best tools and the perfect agent available to help with your||5,358|
|pics at these awesome group starting P letter. Desktop wallpapers were first introduced way back in the 1980s and have gained immense popularity since then. It is possible to come across more than 80 million sites on the web offering some sort of wallpaper.||848|
|flowers will let them know you’re thinking of them and wishing them well. Cheerful yellow flowers bring their own sunshine and will get right to work on lifting spirits, and a colorful vase will bring loads of smiles to friends and visitors! Get Well flower arrangements from||479|
|our premier 24 hour emergency* plumbing and heating solutions. We realise that when your heating fails or pipes and drains leak it can cause havoc with your routine and even cause damage to your property. When a plumbing problem occurs that requires an immediate response we provide qualified local plumbers throughout||56|
|is to remove all images that violate copyrights. Please contact us to request that images be removed or to assign proper credit. The images displayed on this site may be used for Free or educational purposes only. If you would like to use any of the images displayed on this site for any other purpose, please obtain permission from the owner. www.||48|
|list of fishing locations, providing interactive maps that show each location’s GPS coordinates, nearby facilities (like restaurants, gas stations, marinas and fishing shops), their current and forecasted weather and, if available, their water conditions.\nFind any of the 8||5|
|. Dyer, Ph.D., is an internationally renowned author and speaker in the field of self-development. He’s the author of 30 books, has created many audio programs and videos, and has appeared on thousands of television and radio shows.||5|
|Generated Text||Freq in C4|
, you’ll need to be knowledgeable to make the very best decisions. We will make sure you know what can be expected. We take the surprises from the picture by giving accurate and thorough information. You can start by talking about your task with our client service staff when you dial 888-353-1299. We’ll address all of your questions and arrange the initial meeting. We work closely with you through the whole project, and our team can show up promptly and prepared.
|then Waterside Lodge are well equipped for the task. Our fully equipped family sized lodges offer a comfortable luxurious stay for a fantastic price, giving you beautiful views of the lakes and the surrounding countryside. Offering luxurious self-catering holidays in our fully featured Scandinavian holiday lodges. Perfectly located to explore the beaches, coastline. All of our lodges are sized for 6 people and are furnished to the highest standards to ensure you have a stay like no other. At Waterside Lodge the stay itself is only half of the package, Waterside lodge is situated closely to the Heritage Coast which makes our lodges the perfect stay for anyone wanting to get away and have a relaxing countryside break from the city. Whilst you stay with us be sure to take advantage of all the activities Waterside Lodge has to offer. Such as the use of our on-site fishing lakes for the keen fisherman, free internet access, outside relaxation areas, comfortable lounges and much more.||571|
|you are only looking to find rent to own homes in your city or are open to exploring all kinds of rent to own home listings, our database does it all. One of the best aspects of iRentToOwn.com is that, besides options to rent to buy a house, it has numerous other categories of home sale options. These include bank foreclosure homes, pre-foreclosure homes, short sales, HUD/government foreclosures, auction homes and owner-financing/FSBO (For Sale By Owner) homes. With help from the convenient search features offered by our site, shoppers are able to find their ideal lease to own home, real estate company, and more in South||51|
|, IL employs journeyman as licensed to work by themselves, without direct supervision, installing wiring, outlets and fixtures. Our journeyman also does service work, troubleshooting when a breaker fails or a light stops working. Our journeyman does not offer permits that must be issued by our master. Our journeyman follows our master’s plans and directions. Our journeyman’s responsibilities will vary based on the work that needs to be done. Our journeymen are skilled with residential, commercial and industrial installations and repairs.ust work from six years as an apprentice, under direct supervision of our master, and pass a journeyman test. This person also must have some classroom education on the National Electrical Code and fundamental electricity in a technical school a program affiliated with the National Joint Apprenticeship Training Council. Journeyman training combines hands-on work with education on basic electricity.||6|
|combustion process of a petrol engine is never perfect. Dangerous gases, such as nitrogen oxide, carbon monoxide and hydrocarbons will arise and it is the job of the catalytic converter to reduce these to safer emissions. These cat converters can fail by becoming clogged, or if the engine has bad exhaust valves or the plugs fail, causing unburned fuel to overheat the converter. Mettam’s Mufflers can resolve these issues with your Karr||5|
,ANDREW Find the ancestral town: Many a researcher is stuck behind records that say, BIRTHPLACE: IRELAND without saying where in Ireland, or whatever other country. Remember that your immigrant ancestor’s siblings probably were born in the same ancestral town, so check all of their records, too. Around 1900, the Roman Catholic churches reported marriages to the churches where the persons were baptised, and before the wedding, they would require a baptismal certificate from that church, without marriage notations, to make sure that the persons were not already married, ordained, or whatever, and were free to marry. Do check the Catholic records especially for ex loco and the home town. If your ancestor’s sister had a daughter who generated a marriage or death record saying, MOTHER’S BIRTHPLACE: and the exact town, then you know where to start searching for records that will confirm it is your ancestor’s home town. BEWARE: Just because you find a family with the same names does not mean they are the same family, as they could very well be an unrelated family from a different town in the same ancestral country. The webmaster has learned this. One clue was that one family was still having babies in Potenza city, Italy while the other was having babies in Colorado, U.S.A.
|will not want to search for Power Washing companies in Wyoming on an extensive basis. The service personnel will be at your doorsteps through online or phone booking. The power wash solutions offered by us are matchless and you can compare with others in Winfield, IL. The power wash services offered by us are very economical. Gutter brightener will be applied which will be followed by cleaning through double scrub. The cleaning will be done by using a soft bristle brush. The bond and contaminants will be released in an effortless manner.||1|
Z3 Plus are valid in all major cities of India like Delhi, Gurgaon, Noida, Mumbai, Chennai, Bangalore, Hyderabad, Kolkata, Pune, Ahmedabad, Coimbatore, Lucknow, Trichy, Madurai, Trivandrum, Mysore, Jaipur, Chandigarh, Pondicherry, Bhopal, Patna, Bhubaneswar, Amritsar, Cochin, Allahabad, Srinagar, New Delhi, Surat, Ludhiana, Navi Mumbai, Ghaziabad, Bengaluru, Indore, Nagpur, Thane, Agra, Meerut, Ranchi. The delivery feasibility and charges may be varying, hence for them please check with the particular seller or store.
|RealNews Url||# Total||Frac Dups||C4 Url||# Total||Frac Dups|
Appendix E More Results
Table 7 shows several examples of pairs of documents in C4 whose edit similarity is close to our chosen edit similarity threshold of 0.8. Table 8 shows substrings which were identified by ExactSubstr as being in C4 more than once. Table 9 shows several examples of unprompted generations which were identified as memorized.
Distribution of memorization.
Figure 6 shows the distribution in memorization amount over all generated sequences when using four types of prompting: train examples with duplicates in train, train examples without any duplicates, validation examples with duplicates in train, and validation examples without any duplicates.
URLs with many duplicates.
Table 10 shows the URLs that had the largest proportion of examples identified by NearDup as near-duplicates. For C4, these tend to be websites that sell many similar products and thus have a large amount of templated text. For RealNews, content aggregators seem especially common.
NearDup cluster sizes.
Table 12 gives the size in BPE tokens and in examples of each dataset before and after deduplication. Because most datasets were already deduplicated of exact matches during their creation, ExactSubstr deduplication does not actually remove any examples.
Perplexity on LM1B.
|Duplicate Train Prompts||35.88%||34.34%||3.34%||3.15%||5.71%||4.67%|
|Unique Train Prompt||0.42%||0.41%||0.42%||0.41%||0.22%||0.23%|
|Duplicate Test Prompt||16.27%||15.32%||1.61%||1.52%||0.34%||0.25%|
|Unique Test Prompt||0.25%||0.22%||0.21%||0.23%||0.03%||0.08%|
|Final train set size in tokens||Final train set size in examples|