WebIE: Faithful and Robust Information Extraction on the Web

05/23/2023
by   Chenxi Whitehouse, et al.

Extracting structured and grounded fact triples from raw text is a fundamental task in Information Extraction (IE). Existing IE datasets are typically collected from Wikipedia articles, using hyperlinks to link entities to the Wikidata knowledge base. However, models trained only on Wikipedia have limitations when applied to web domains, which often contain noisy text or text that does not carry any factual information. We present WebIE, the first large-scale, entity-linked closed IE dataset, consisting of 1.6M sentences automatically collected from the English Common Crawl corpus. WebIE also includes negative examples, i.e. sentences without fact triples, to better reflect the data on the web. We annotate 25K triples from WebIE through crowdsourcing and introduce mWebIE, a translation of the annotated set into four other languages: French, Spanish, Portuguese, and Hindi. We evaluate the in-domain, out-of-domain, and zero-shot cross-lingual performance of generative IE models and find that models trained on WebIE show better generalisability. We also propose three training strategies that use entity linking as an auxiliary task. Our experiments show that adding entity-linking objectives improves the faithfulness of our generative IE models.
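To make the terminology concrete, the following is a minimal sketch of how an entity-linked fact triple and a negative example might be represented. The class names, field names, and sentences are assumptions made for illustration and do not reflect WebIE's actual schema or data; the identifiers are ordinary Wikidata IDs (Q90 for Paris, P17 for country, Q142 for France).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Triple:
    """A fact triple whose parts are grounded in Wikidata identifiers."""
    subject_id: str   # Wikidata entity ID (Q...)
    relation_id: str  # Wikidata property ID (P...)
    object_id: str    # Wikidata entity ID (Q...)


@dataclass
class Example:
    """One sentence with zero or more triples (hypothetical layout)."""
    sentence: str
    triples: List[Triple] = field(default_factory=list)

    @property
    def is_negative(self) -> bool:
        # A negative example is a sentence with no fact triples.
        return not self.triples


# Illustrative sentences, not taken from the dataset.
positive = Example(
    sentence="Paris is a city in France.",
    triples=[Triple("Q90", "P17", "Q142")],  # Paris, country, France
)
negative = Example(sentence="Click here to read more!")

for example in (positive, negative):
    label = "negative" if example.is_negative else "positive"
    print(f"{label}: {example.sentence}")
```

Under this layout, a model that returns an empty list of triples for the second sentence is behaving correctly, which is the behaviour the negative examples are meant to encourage.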


