CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data

04/28/2023
by Michał Turski, et al.

In recent years, the field of document understanding has progressed substantially, in large part thanks to language models pretrained on large collections of documents. However, the pretraining corpora used in document understanding are single-domain, monolingual, or nonpublic. Our goal in this paper is to propose an efficient pipeline for creating a large-scale, diverse, multilingual corpus of PDF files from across the Internet using Common Crawl, as PDFs are the most canonical document type considered in document understanding. We extensively analysed every step of the pipeline and propose a solution that trades off data quality against processing time. We also share the CCpdf corpus in the form of an index of PDF files along with a script for downloading them, which produces a collection useful for language model pretraining. The dataset and tools published with this paper offer researchers the opportunity to develop even better multilingual language models.
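The abstract describes releasing an index of PDF files plus a download script. As a rough illustration of how such an index-driven download might work, here is a minimal sketch; the index format (tab-separated URL and language code), the function names, and the filtering step are all hypothetical assumptions, not the paper's published schema or tooling.

```python
# Hypothetical sketch of consuming a PDF-URL index like the one CCpdf
# releases. The tab-separated "URL<TAB>language" format is an assumption.
import urllib.request
from pathlib import Path


def parse_index(lines, languages=None):
    """Yield PDF URLs from index lines, optionally filtered by language."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        url, lang = line.split("\t")
        if languages is None or lang in languages:
            yield url


def download_pdfs(urls, out_dir="pdfs"):
    """Download each URL into out_dir, skipping files that fail to fetch."""
    Path(out_dir).mkdir(exist_ok=True)
    for i, url in enumerate(urls):
        try:
            urllib.request.urlretrieve(url, Path(out_dir) / f"{i:06d}.pdf")
        except OSError:
            continue  # unreachable or removed file: skip and keep going


# Example: keep only English-tagged entries from a tiny in-memory index.
index = ["https://example.com/a.pdf\ten", "https://example.com/b.pdf\tde"]
print(list(parse_index(index, languages={"en"})))
# → ['https://example.com/a.pdf']
```

Filtering at the index level (before any download) is the cheap step; actual fetching is left to a separate, failure-tolerant pass, since web-crawl URLs frequently go stale.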

