CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web

04/12/2018
by Colin Lockard, et al.

The web contains countless semi-structured websites, which can be a rich source of information for populating knowledge bases. Existing methods for extracting relations from the DOM trees of semi-structured webpages can achieve high precision and recall only when manual annotations for each website are available. Although there have been efforts to learn extractors from automatically generated labels, these methods are not sufficiently robust to succeed in settings with complex schemas and information-rich websites. In this paper, we present a new method for automatic extraction from semi-structured websites based on distant supervision. We automatically generate training labels by aligning an existing knowledge base with a web page and leveraging the unique structural characteristics of semi-structured websites. We then train a classifier on these potentially noisy and incomplete labels to predict new relation instances. Our method can compete with annotation-based techniques in the literature in terms of extraction quality. A large-scale experiment on over 400,000 pages from dozens of multilingual long-tail websites harvested 1.25 million facts at a precision of 90%.
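The label-generation step described above, aligning a knowledge base with a page, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy knowledge base, the exact-match normalization, the topic-selection heuristic, and the names `KB`, `find_topic`, and `label_nodes` are all assumptions made for illustration. A page is represented here simply as a list of (XPath, text) pairs for its DOM text nodes.

```python
from collections import Counter

# Toy knowledge base (hypothetical data): subject -> list of (predicate, object).
KB = {
    "Do the Right Thing": [
        ("directed_by", "Spike Lee"),
        ("release_year", "1989"),
    ],
    "Crooklyn": [
        ("directed_by", "Spike Lee"),
        ("release_year", "1994"),
    ],
}


def normalize(text):
    """Lowercase and collapse whitespace so simple string matching is less brittle."""
    return " ".join(text.lower().split())


def find_topic(nodes, kb):
    """Guess the page's topic entity.

    Assumed heuristic: among KB subjects whose name appears on the page,
    pick the one with the most KB objects that also appear on the page.
    """
    page_texts = {normalize(text) for _, text in nodes}
    scores = Counter()
    for subject, facts in kb.items():
        if normalize(subject) not in page_texts:
            continue
        scores[subject] = sum(normalize(obj) in page_texts for _, obj in facts)
    if not scores:
        return None
    return scores.most_common(1)[0][0]


def label_nodes(nodes, kb):
    """Return (topic, labels), where labels maps an XPath to a predicate.

    A node is labeled when its text equals the object of a KB fact about
    the topic. Such labels can be noisy and incomplete.
    """
    topic = find_topic(nodes, kb)
    if topic is None:
        return None, {}
    labels = {}
    for xpath, text in nodes:
        for predicate, obj in kb[topic]:
            if normalize(text) == normalize(obj):
                labels[xpath] = predicate
    return topic, labels


# A hypothetical film detail page, flattened to (xpath, text) pairs.
page = [
    ("/html/body/h1", "Do the Right Thing"),
    ("/html/body/div[1]/span[2]", "Spike Lee"),
    ("/html/body/div[2]/span[2]", "1989"),
    ("/html/body/div[3]/a", "Crooklyn"),
]

topic, labels = label_nodes(page, KB)
print(topic)
print(labels)
```

In a pipeline of the kind the abstract outlines, labels like these would serve as training data for a classifier over DOM nodes. That training step, and any use of page structure to filter spurious matches, is omitted from this sketch.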

