CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data

11/01/2019 · by Guillaume Wenzek, et al.

Pre-training text representations has led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora, as long as their quality is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), which deduplicates documents and identifies their language. We augment this pipeline with a filtering step that selects documents close to high-quality corpora like Wikipedia.




1 Introduction

Pre-trained text representations have brought significant performance gains on many natural language processing tasks [15]. Since the introduction of Transformers [18] and BERT [3], we have seen a steady improvement in the quality of these pre-trained models, driven mainly by increasing the size of the pre-training corpora [16, 19, 10]. Nonetheless, size alone does not guarantee better models: the quality of the data must also be preserved, which has led to the use of ad-hoc datasets created by concatenating existing high-quality data sources like Wikipedia. Unfortunately, such datasets cannot be replicated as easily for low-resource languages, as many have much smaller curated datasets such as Wikipedia.

In this paper, we present a data collection pipeline that allows gathering massive monolingual corpora of high quality in a variety of languages, including many low-resource ones. The principles of our pipeline are general, and we show the results of its application to data collected by the Common Crawl project. Common Crawl is a massive non-curated dataset of webpages in many languages, mixed together in temporal snapshots of the web. Our pipeline performs standard document deduplication and language identification similar to Grave et al. [4], but differs in two ways: first, we preserve the document-level structure to allow for the training of paragraph-level representations like BERT [3]; second, we add an optional monolingual filtering step that selects documents close to high-quality sources, like Wikipedia. This is achieved by training a language model on the targeted sources and using the perplexity as a scoring function for documents. Our pipeline can be applied to any number of Common Crawl snapshots and takes 8.5 hours per snapshot on 5000 CPU cores. For example, the dataset obtained by pre-processing the February 2019 snapshot is composed of 1.5 billion documents in 174 languages. There are 700 million filtered documents in English alone, corresponding to 532 billion tokens: 160 times more data than was used in Devlin et al. [3].

This paper is organized as follows: we first present the Common Crawl corpora, followed by our overall pipeline to filter high-quality documents from them. We then describe additional tools that can be used to tailor the filtering to a targeted corpus. Finally, we give in-depth statistics about the dataset obtained from pre-processing a single Common Crawl snapshot. The pipeline and the tools are publicly available.

2 Related work

Preprocessing of massive datasets for training text representations was first developed in the context of word embeddings, such as word2vec [13], GloVe [14] and fastText [12]. In particular, our pipeline follows the fastText pipeline of Grave et al. [4], where Common Crawl is split into monolingual datasets using a language identifier based on fastText [6].

Common Crawl has been used in the context of language modeling to evaluate n-gram statistics [1]. More recently, Baevski et al. (2019) pre-trained a BERT-like model on Common Crawl as preprocessed in Grave et al. [4]. In general, progress in sentence representations has come with increasing the size of the pre-training corpora [19, 11, 17]. In particular, and concurrently to our work, Raffel et al. [17] used a large-scale dataset based on Common Crawl to train text representations. Existing work using web-based datasets has relied on English-specific preprocessing, such as keeping only URLs shared on Reddit or using hand-crafted filtering rules. As opposed to these approaches, our pipeline can easily be applied to many languages other than English. Closest to this work, Ortiz Suárez et al. (2019) improved the pipeline of Grave et al. [4], showing that large monolingual corpora can be extracted from Common Crawl rapidly even with limited resources. Our work follows a similar pipeline, with an additional step to select high-quality documents.

Figure 1: We show the whole pipeline for downloading and processing one snapshot of Common Crawl. First, we download all the WET files and compute the paragraph hashes, which we group and save into binary files. Then we process every document of the WET files independently: we deduplicate paragraphs using the binary files, perform language identification, and compute a language model perplexity score. Finally, we regroup the documents into JSON files by language and perplexity score. The steps of the pipeline indicated with dashed arrows are parallelizable.

3 Methodology

Every month, Common Crawl releases a snapshot of the web obtained by randomly exploring and sampling URLs. Each webpage is made available in different formats: raw (WARC), UTF-8 text (WET), and meta-data (WAT). There is little content overlap between monthly snapshots, and the complete archive consists of petabytes of data collected over years of web crawling. The webpages are crawled from the whole web without restriction: they come in many different languages, and the quality of the text varies greatly. Common Crawl thus represents a rich resource for monolingual data covering a large variety of domains, yet poses challenges due to the large quantity of noisy text.

Here we describe the methodology used to fetch, deduplicate and filter the Common Crawl data. We focus on preprocessing the text (WET) format of the Common Crawl snapshots. Our pre-processing pipeline consists of several steps that we describe in this section; an overview is illustrated in Figure 1.

3.1 Preprocessing

Each snapshot contains between 20 and 30TB of uncompressed plain text, corresponding to approximately 3 billion web pages (for instance, the Feb. 2019 snapshot contains 24TB of data). We download and process each snapshot independently. For each snapshot, we regroup WET files into shards of 5GB each, which amounts to 1600 shards for the Feb. 2019 crawl. Each shard is saved as a JSON file where one entry corresponds to one web page.
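The regrouping step above can be sketched as a greedy grouping of WET files by cumulative size. This is an illustrative sketch, not the pipeline's actual code; the shard size constant comes from the benchmarking section, which states shards of 5GB were used.

```python
SHARD_SIZE = 5 * 1024**3  # ~5GB per shard, as stated in the benchmarking section

def make_shards(files, shard_size=SHARD_SIZE):
    """Group (name, size_in_bytes) pairs into shards of roughly `shard_size` bytes,
    keeping the listing order."""
    shards, current, current_size = [], [], 0
    for name, size in files:
        # Start a new shard when adding this file would overflow the current one.
        if current and current_size + size > shard_size:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)
    return shards
```

Each resulting shard can then be downloaded and processed by an independent worker, which is what makes the later steps easy to distribute.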

3.2 Deduplication

The first step of our pipeline consists in removing duplicated paragraphs across the different web pages of a snapshot, as duplicates account for a large fraction of the text. We first normalize each paragraph by lower-casing all characters, replacing digits with a placeholder, and removing all Unicode punctuation and accent marks.

Deduplication is then done in two independent steps. First, for every shard, we compute a hash code for each paragraph and save the codes into a binary file, using the first 64 bits of the SHA-1 digest of the normalized paragraph as the key. Then, we deduplicate every shard by comparing it against a subset of, or all of, the binary files.
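The normalization and hashing just described can be sketched in Python. The choice of "0" as the digit placeholder is an assumption, since the paper does not specify the character used:

```python
import hashlib
import re
import unicodedata

DIGIT_RE = re.compile(r"\d")

def normalize(paragraph: str) -> str:
    """Lowercase, replace digits with a placeholder, and strip Unicode
    punctuation and accent marks."""
    text = paragraph.lower()
    text = DIGIT_RE.sub("0", text)  # "0" as placeholder is an assumption
    # Decompose accented characters, then drop combining marks and punctuation.
    text = unicodedata.normalize("NFD", text)
    return "".join(
        c for c in text
        if not unicodedata.combining(c)
        and not unicodedata.category(c).startswith("P")
    )

def paragraph_key(paragraph: str) -> bytes:
    """First 64 bits (8 bytes) of the SHA-1 digest of the normalized paragraph."""
    return hashlib.sha1(normalize(paragraph).encode("utf-8")).digest()[:8]
```

With this normalization, paragraphs differing only in case, digits, punctuation or accents collapse to the same 8-byte key, which is what makes the per-shard binary files compact.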

The impact of this choice is discussed in Section 4. These steps are independent for each shard and can thus be distributed. In addition to removing web copies, this step gets rid of a lot of boilerplate such as navigation menus, cookie warnings and contact information. In particular, it removes a significant amount of English content from webpages in other languages, which makes the language identification in the next step of our pipeline more robust.

Figure 2: Number of tokens per language for the Feb. 2019 snapshot after deduplication. The histogram uses a logarithmic scale.

3.3 Language identification

The second step of our pipeline consists in splitting the data per language. Following Grave et al. [4], we use the language classifier from fastText [7, 4]. The fastText language identifier was trained on Wikipedia, Tatoeba and SETimes. It uses character n-grams as features and a hierarchical softmax. It supports 176 languages, outputs a score for each of them in the [0, 1] range, and processes thousands of documents per second on a single CPU core. For every web page we compute the most probable language and the corresponding classifier score. If this score is higher than 0.5, we classify the document in the corresponding language; otherwise, the language is not clearly identified and we discard the corresponding page.
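The thresholding logic above is simple; here is a minimal sketch, with the fastText call itself left as a commented stub since it requires the pretrained lid.176.bin model file:

```python
# Real usage would load the public fastText LID model, e.g.:
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#   predict = lambda text: (model.predict(text)[0][0].replace("__label__", ""),
#                           float(model.predict(text)[1][0]))

def classify_page(text, predict, threshold=0.5):
    """Return the most probable language if its score clears the threshold,
    otherwise None (the page is discarded).

    `predict` maps a page's text to a (language, score) pair."""
    lang, score = predict(text)
    return lang if score > threshold else None
```

Decoupling the classifier behind `predict` keeps the pipeline logic testable without the model file, and makes it easy to swap in a different language identifier.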

3.4 LM filtering

At this step of the pipeline, there are still documents with low-quality content. A way to filter out these samples is to compute a score measuring the similarity of a web page to a targeted domain such as Wikipedia. In this paper, we propose to use the perplexity of a language model trained on the targeted domain as the quality score.

More precisely, for each language, we train a SentencePiece tokenizer [8] and a language model on data from the targeted domain. We use an n-gram Kneser-Ney model as implemented in the KenLM library [5] because of its efficiency in processing large quantities of data. Then, we tokenize each page in our dataset with our SentencePiece tokenizer and compute the perplexity of each paragraph using our language model. The lower the perplexity, the closer the data is to the targeted domain. At the end of this step, each language is split into three parts of equal size, head, middle and tail, according to the perplexity score. In Section 5 we show perplexity distributions for one snapshot of Common Crawl.
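The scoring itself follows directly from the definition of perplexity. KenLM's Python bindings return a total log10 probability via `Model.score`, so the per-token perplexity of a paragraph can be sketched as below (model file names are hypothetical, and the exact token-count convention, such as including the end-of-sentence token, is an implementation detail):

```python
def perplexity(log10_prob_total, n_tokens):
    """Perplexity from a total log10 probability over n_tokens tokens:
    PP = 10 ** (-logP / N). Lower means closer to the training domain."""
    return 10.0 ** (-log10_prob_total / n_tokens)

# Hypothetical usage with KenLM and SentencePiece (models not included here):
#   import kenlm, sentencepiece as spm
#   sp = spm.SentencePieceProcessor(model_file="en.sp.model")
#   lm = kenlm.Model("en.arpa.bin")
#   tokens = sp.encode("some paragraph of text", out_type=str)
#   pp = perplexity(lm.score(" ".join(tokens)), len(tokens) + 1)  # +1 for </s>
```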

We have trained SentencePiece tokenizers and Kneser-Ney language models on Wikipedia for the 48 languages for which we provide a LM. We make these models publicly available in the repository. We also provide code to train SentencePiece and Kneser-Ney language models and to compute the tercile thresholds, if the user wants to use other data to filter Common Crawl.

3.5 Reproducing results without the pipeline

Reconstructing the dataset by running our pipeline requires a lot of resources and time. Together with the release of the pipeline, we therefore provide a tool to efficiently reproduce the results of this work. This tool takes as input a file containing URLs of webpages and reconstructs the final output of our pipeline from it.

4 Ablation study

In this section, we discuss the impact of several design choices in our pipeline on the resulting datasets.

4.1 Order of LID and deduplication steps

Contrary to [4], we have chosen to deduplicate the data before language identification, because a lot of English boilerplate, such as cookie warnings, is present in pages of other languages. A significant amount of this noisy data is removed by deduplication, which allows for better language identification. This is particularly important for some low-resource languages. In Figure 3 we report the relative increase in the number of documents when doing "deduplication then LID" instead of "LID then deduplication". We observe that many documents in low-resource languages were misclassified before deduplication (generally as English), or discarded because no language could be identified.

Figure 3: Impact of doing "deduplication then LID" rather than "LID then deduplication". The y-axis shows the per-language ratio of the number of documents between the two methods. The x-axis is the number of documents found for each language using LID scores obtained after deduplication. Low-resource languages benefit the most from doing "deduplication then LID". Statistics estimated on 1% of the Feb. 2019 snapshot.

4.2 Impact of the amount of deduplication

For deduplication, we can compare paragraph hashes shard by shard, across N shards, or across the whole snapshot (1600 shards). The higher N, the more documents are removed and the more RAM the algorithm uses. We show in Figure 4 the amount of data remaining (as a percentage of the number of characters) for one shard of the Feb. 2019 snapshot after deduplication across 1, 2, 5, 10, 20, 50 and 100 shards. After deduplication across 1 shard, 42% of characters remain, versus 28% across 100 shards. Loading hashes from 50 shards represents 1.5B unique hashes, making up 13.5GB on disk. Using a memory-efficient data structure, we can fit those into 40GB of RAM. In Figure 5 we show how RAM usage increases as more hashes are loaded into memory. We found 50 shards to be a reasonable trade-off and therefore run the deduplication on blocks of 50 shards.
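The paper does not specify the memory-efficient structure used to hold the 1.5B hashes; one plausible approach, sketched here as an assumption rather than the authors' implementation, is a sorted flat array of 64-bit integers, which costs 8 bytes per hash (about 12GB for 1.5B hashes) plus binary search for membership, far less overhead than a Python set:

```python
from array import array
from bisect import bisect_left

def build_index(hashes):
    """Store 64-bit paragraph hashes in a sorted flat array ("Q" = uint64):
    8 bytes per hash, no per-element object overhead."""
    return array("Q", sorted(hashes))

def contains(idx, h):
    """Membership test by binary search over the sorted array."""
    i = bisect_left(idx, h)
    return i < len(idx) and idx[i] == h
```

The trade-off is that the index is immutable once built, which matches this pipeline: hashes are computed in a first pass and only queried during deduplication.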

Figure 4: Amount of data remaining after deduplication against different fractions of the dataset. These statistics are computed on one shard.
Figure 5: RAM usage when loading hashes from different fractions of the dataset. Computed on one shard.

4.3 Benchmarking

The pipeline is massively parallelizable, but still has to run in two steps because the deduplication requires comparing paragraph hashes across billions of documents. In our case we chose shards of 5GB as the smallest unit of parallelization; one dump is divided into 1600 such shards. Computing the hashes of paragraphs is done on one CPU core per shard while the files are being downloaded, so hashing and downloading overlap.

In one pass, the next step removes duplicates and performs language identification, SentencePiece tokenization, language modeling and splitting by language. Each shard creates 3 files for each of the top 48 languages for which we have a LM, and one file for each other language where we don't have a LM. Each of these processing steps requires a significant amount of RAM, but the memory can be shared across processes since it is read-only. This step is significantly longer than the previous one. We allocate 17 processes per shard: a master process responsible for downloading the data, distributing the raw documents to the 16 workers, and writing the results to disk. Removing the duplicated paragraphs is computationally less expensive than the following steps, but is done on all the data, as opposed to the next steps which are applied only to the deduplicated data; the language identifier, SentencePiece tokenization and the LM account for the remaining CPU time.

Finally, we regroup the files produced at the previous steps into chunks of about 5GB. This can be run in parallel for each output file, and since gzip archives can be concatenated without being decompressed first, it is very fast and runs in a matter of minutes. The total processing time is about 9 hours for one snapshot using 5000 CPU cores.
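The final regrouping step relies on the fact that a concatenation of gzip members is itself a valid gzip stream, so chunks can be merged with raw byte copies and no recompression. A minimal sketch (file names hypothetical):

```python
import shutil

def concat_gzip(inputs, output):
    """Concatenate gzip files byte-for-byte. Gzip readers accept
    multi-member streams, so no decompression is needed."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as f:
                shutil.copyfileobj(f, out)
```

Because this only copies bytes, throughput is bounded by disk I/O rather than CPU, which is why the regrouping finishes in minutes.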

5 Metrics about the resulting dataset

In this section, we report statistics corresponding to the corpus obtained after applying our pipeline on the Feb. 2019 snapshot of Common Crawl.

5.1 Statistics per language

After preprocessing, we obtain several terabytes of compressed documents in 174 languages. In the appendix table, we give the size of each monolingual corpus for the languages with a sufficient number of documents. We also compute the number of tokens and sentences in each language, and report them in Figure 2. The tokens were obtained using the SentencePiece tokenizer from our preprocessing pipeline; the sentences were split using Moses. The three largest languages are English (en) with 532B tokens, Russian (ru) with 101B tokens and Chinese (zh) with 92B tokens. We obtain 11 languages with more than 10B tokens, and 27 languages with more than 1B tokens. In terms of documents, the three largest languages are English (en) with 706M documents, Russian (ru) with 167M and German (de) with 105M. There are 12 languages with more than 10M documents and 29 languages with more than 1M documents. Common Crawl is also a good source for lower-resource languages: for example, Afrikaans (af), Gujarati (gu), Khmer (km) and Burmese (my) contain respectively 160MB, 190MB, 154MB and 440MB of data, whereas Wikipedia contains 103MB, 88MB, 71MB and 153MB of data for these languages. More resources are also available through the 60 dumps of Common Crawl. These numbers could probably be improved further by increasing the recall of the LID model for low-resource languages.

Figure 6: Number of documents per language for the Feb. 2019 snapshot after deduplication. The histogram uses a logarithmic scale and displays statistics for 25 languages only. All statistics are available in the appendix table.

5.2 Statistics from the language model

Figure 7: Histogram of language model perplexities for the Feb. 2019 Common Crawl snapshot. The two histograms correspond to English, which is the largest dataset, and Gujarati which is a low-resource language. Vertical lines correspond to perplexity thresholds applied to split the corpus in head/middle/tail.

We found that perplexity was a relatively good proxy for quality: journalistic and well-written content ends up in the head of our dataset. Some documents consisting largely of keyword lists pass through deduplication and LID but receive a high perplexity, and some documents, despite being valid text, end up in the tail because their vocabulary is very different from Wikipedia's. This includes blog comments with spoken-like text, or very specialized forums with specific jargon. We decided not to remove content based on the LM score, because some of it could be useful for specific applications.

Some languages have a very spiked distribution of perplexities while others are more spread out. We postulate that this is due more to the variance in the sizes of the Wikipedia corpora used for training the LMs than to some languages having less high-quality content. We therefore use different perplexity thresholds for each language, picked so as to split the corpus into 3 parts of equal size. In Figure 7 we show the perplexity distribution for two languages, English and Gujarati, using their respective LMs. The English LM was trained on 534MB of text while the Gujarati LM was trained on only 12MB.
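Picking per-language thresholds that split the corpus into three equal parts reduces to taking the 1/3 and 2/3 quantiles of the perplexity scores; a minimal sketch:

```python
def tercile_thresholds(perplexities):
    """Cut points that split documents into head/middle/tail of equal size:
    docs below the first threshold go to the head, above the second to the tail."""
    s = sorted(perplexities)
    n = len(s)
    return s[n // 3], s[2 * n // 3]
```

Because the thresholds are computed per language, a spiked distribution simply yields two cut points close together; the head/middle/tail split stays balanced either way.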

5.3 Training models on this dataset

We assess the quality of the resulting dataset by learning unsupervised word and sentence representations through fastText and BERT models. For fastText, we train 300-dimensional word embeddings on the head, middle and tail subsets of the English and Polish CommonCrawl corpora, sorted by document perplexity. We evaluate these on standard semantic and syntactic analogy datasets  [13]. We observe in Table 1 a steady increase in performance as we go from the tail to the head of the dataset, confirming the positive impact of our filtering method based on document perplexity.

            English               Polish
       Total   Sem    Syn    Total   Sem    Syn
head    77.9   81.2   75.3    65.3   66.5   64.1
mid.    74.2   79.0   70.4    62.8   62.7   63.0
tail    62.0   68.1   57.3    59.9   59.8   60.1
Table 1: Impact of corpus quality on the quality of fastText word embeddings. We evaluate on semantic and syntactic similarity datasets.

We also train BERT models on the English (en), Russian (ru), Chinese (zh) and Urdu (ur) languages, using either the Wikipedia corpora or our new CommonCrawl datasets. For these languages, we use respectively 16G, 5G, 1.1G and 106M of raw Wikipedia data (full datasets), and we cap the head CommonCrawl data to 21G, 21G, 17G and 2.2G for English, Russian, Chinese and Urdu. That is, we consider roughly the same amount of data for English, but increase the amount of data for Russian, Chinese and Urdu. We train a BERT-BASE architecture [3] on each of these corpora, without next sentence prediction (NSP), as in [9]. For better comparison, we early-stop all our models after two days of training on 16 Volta32 GPUs, and use the exact same number of steps for each model. We evaluate each model on the XNLI [2] corpus by using the training data in each language. Results presented in Table 2 indicate that BERT-BASE models trained on CommonCrawl outperform identical models trained on Wikipedia by 3.3% on average. With the same amount of data for English, the BERT-BASE model trained on our corpus outperforms the one trained on Wikipedia. For low-resource languages like Urdu (ur), the Wikipedia dataset being too small, the model pretrained on Wikipedia obtains performance similar to a randomly initialized model. Using our corpus instead, we obtain a 7-point improvement in accuracy, which demonstrates how our filtered corpus can enable language model pretraining for low-resource languages.

        en     ru     zh     ur    avg
Wiki   82.8   73.3   77.0   57.3   72.6
CC     85.0   76.4   77.9   64.3   75.9
Table 2: XNLI dev accuracy for English, Russian, Chinese and Urdu (avg: average over the four languages) for BERT-BASE models trained either on Wikipedia or CommonCrawl. The additional data provided by our pipeline alleviates the lack of resources in most languages and enables representation learning for low-resource languages such as Urdu.

6 Conclusion

In this paper, we presented a pipeline to create curated monolingual corpora in a variety of languages. We preprocess Common Crawl by following the pipeline of [4], with the differences that we preserve the structure of documents and filter the data based on their distance to Wikipedia. This improves the quality of the resulting dataset and allows for the training of multilingual text-level representations like XLM [9].



  • [1] C. Buck, K. Heafield, and B. Van Ooyen (2014) N-gram counts and language models from the Common Crawl. In LREC, Vol. 2, pp. 4. Cited by: §2.
  • [2] A. Conneau, R. Rinott, G. Lample, A. Williams, S. R. Bowman, H. Schwenk, and V. Stoyanov (2018) XNLI: evaluating cross-lingual sentence representations. In Proc. EMNLP, Cited by: §5.3.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §5.3.
  • [4] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov (2018) Learning word vectors for 157 languages. arXiv preprint arXiv:1802.06893. Cited by: §3.3, §4.1, §6, CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data.
  • [5] K. Heafield (2011) KenLM: faster and smaller language model queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, Cited by: §3.4.
  • [6] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov (2016) Compressing text classification models. arXiv preprint arXiv:1612.03651. Cited by: §2.
  • [7] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2016) Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759. Cited by: §3.3.
  • [8] T. Kudo (2018) Subword regularization: improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959. Cited by: §3.4.
  • [9] G. Lample and A. Conneau (2019) Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291. Cited by: §5.3, §6.
  • [10] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §1.
  • [11] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §2.
  • [12] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin (2017) Advances in pre-training distributed word representations. arXiv preprint arXiv:1712.09405. Cited by: §2, CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data.
  • [13] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Adv. NIPS, Cited by: §2, §5.3.
  • [14] J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proc. EMNLP, Cited by: §2.
  • [15] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §1.
  • [16] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8). Cited by: §1.
  • [17] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: §2.
  • [18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Adv. NIPS, Cited by: §1.
  • [19] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §1, §2.