ClueWeb22: 10 Billion Web Documents with Visual and Semantic Information

11/29/2022
by Arnold Overwijk, et al.

ClueWeb22, the newest iteration of the ClueWeb line of datasets, provides 10 billion web pages accompanied by rich information. Its design was influenced by the need for a high-quality, large-scale web corpus to support a range of academic and industry research, for example in information systems, retrieval-augmented AI systems, and model pretraining. Compared with earlier ClueWeb corpora, the ClueWeb22 corpus is larger, more varied, of higher quality, and aligned with the document distributions in commercial web search. Besides raw HTML, ClueWeb22 includes rich information about the web pages provided by industry-standard document understanding systems, including the visual representation of pages rendered by a web browser, parsed HTML structure information from a neural network parser, and pre-processed, cleaned document text to lower the barrier to entry. Many of these signals have been widely used in industry but are available to the research community for the first time at this scale.

