The Web Is Your Oyster – Knowledge-Intensive NLP against a Very Large Web Corpus

12/18/2021
by   Aleksandra Piktus, et al.

In order to address the increasing demands of real-world applications, research on knowledge-intensive NLP (KI-NLP) should advance by capturing the challenges of a truly open-domain environment: web-scale knowledge, lack of structure, inconsistent quality, and noise. To this end, we propose a new setup for evaluating existing KI-NLP tasks in which we generalize the background corpus to a universal web snapshot. We repurpose KILT, a standard KI-NLP benchmark initially developed for Wikipedia, and ask systems to use a subset of CCNet - the Sphere corpus - as a knowledge source. In contrast to Wikipedia, Sphere is orders of magnitude larger and better reflects the full diversity of knowledge on the Internet. We find that despite potential gaps in coverage, challenges of scale, lack of structure and lower quality, retrieval from Sphere enables a state-of-the-art retrieve-and-read system to match and even outperform Wikipedia-based models on several KILT tasks - even if we aggressively filter content that looks like Wikipedia. We also observe that while a single dense passage index over Wikipedia can outperform a sparse BM25 version, on Sphere this is not yet possible. To facilitate further research into this area, and minimise the community's reliance on proprietary black box search engines, we will share our indices, evaluation metrics and infrastructure.
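The abstract contrasts sparse BM25 retrieval with a dense passage index. As a rough, self-contained illustration of the sparse side only (not the paper's actual retrieval stack, which operates over billions of Sphere passages), the sketch below scores tokenized documents against a query with the standard BM25 formula; the parameter defaults k1=1.5 and b=0.75 are common choices, not values taken from the paper.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document (a list of tokens) against the query tokens with BM25.

    Illustrative only: real systems build an inverted index rather than
    rescanning every document per query.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequencies within this document
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            # Smoothed inverse document frequency of the query term.
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency component with length normalization.
            denom = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / denom
        scores.append(s)
    return scores
```

A document sharing more (and rarer) terms with the query scores higher; a document sharing none scores zero. Dense retrieval instead embeds queries and passages into a shared vector space and ranks by similarity, which is what the abstract reports as harder to scale to Sphere.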

Related research

11/06/2020 – Corpora Compared: The Case of the Swedish Gigaword Wikipedia Corpora
In this work, we show that the difference in performance of embeddings f...

10/14/2022 – A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing
Foundational Hebrew NLP tasks such as segmentation, tagging and parsing,...

12/26/2018 – DBpedia NIF: Open, Large-Scale and Multilingual Knowledge Extraction Corpus
In the past decade, the DBpedia community has put significant amount of ...

03/07/2016 – A matter of words: NLP for quality evaluation of Wikipedia medical articles
Automatic quality evaluation of Web information is a task with many fiel...

04/02/2019 – The Tower of Babel Meets Web 2.0: User-Generated Content and its Applications in a Multilingual Context
This study explores language's fragmenting effect on user-generated cont...

04/10/2023 – WebBrain: Learning to Generate Factually Correct Articles for Queries by Grounding on Large Web Corpus
In this paper, we introduce a new NLP task – generating short factual ar...

12/21/2018 – Wikipedia Text Reuse: Within and Without
We study text reuse related to Wikipedia at scale by compiling the first...
