Biographical: A Semi-Supervised Relation Extraction Dataset

05/02/2022
by Alistair Plum, et al.

Extracting biographical information from online documents is a popular research topic in the information extraction (IE) community. Various natural language processing (NLP) techniques, such as text classification, text summarisation and relation extraction (RE), are commonly used to achieve this. Among these techniques, RE is the most common, since its output can be used directly to build biographical knowledge graphs. RE is usually framed as a supervised machine learning (ML) problem, where ML models are trained on annotated datasets. However, few annotated datasets exist for RE, since the annotation process can be costly and time-consuming. To address this, we developed Biographical, the first semi-supervised dataset for RE. The dataset, which is aimed at digital humanities (DH) and historical research, is compiled automatically by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), we match information with relatively high precision in order to compile annotated relation pairs for ten different relations that are important in the DH domain. Furthermore, we demonstrate the effectiveness of the dataset by training a state-of-the-art neural model to classify relation pairs, and we evaluate it on a manually annotated gold standard set. Biographical is primarily aimed at training neural models for RE within the domain of digital humanities and history, but, as we discuss at the end of this paper, it can be useful for other purposes as well.
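The alignment idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Fact` class, the `align` function, the relation labels and the toy inputs are all hypothetical, and plain substring matching stands in for the NER step the abstract mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One structured record, e.g. from a knowledge base (hypothetical schema)."""
    subject: str   # person the article is about
    relation: str  # illustrative relation label, e.g. "birthplace"
    value: str     # object of the relation as a surface string


def align(sentences, facts):
    """Return (sentence, subject, value, relation) tuples for each match.

    A sentence is labelled with a fact's relation when both the subject
    and the value occur in it. Substring matching is an assumption made
    for brevity; a real system would rely on NER and entity linking.
    """
    examples = []
    for sentence in sentences:
        lowered = sentence.lower()
        for fact in facts:
            if fact.subject.lower() in lowered and fact.value.lower() in lowered:
                examples.append((sentence, fact.subject, fact.value, fact.relation))
    return examples


# Toy inputs for illustration only.
sentences = [
    "Ada Lovelace was born in London in 1815.",
    "She is often regarded as an early computer programmer.",
]
facts = [
    Fact("Ada Lovelace", "birthplace", "London"),
    Fact("Ada Lovelace", "occupation", "mathematician"),
]

pairs = align(sentences, facts)
for sentence, subject, value, relation in pairs:
    print(f"{relation}: ({subject}, {value}) <- {sentence}")
```

Only the first sentence mentions both the subject and a known value, so it is the only one that receives a label. The second sentence illustrates a limit of this naive matching: it refers to the person by pronoun and therefore produces no pair.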


research
09/13/2019

Towards Open-Domain Named Entity Recognition via Neural Correction Models

Named Entity Recognition (NER) plays an important role in a wide range o...
research
06/23/2020

NLPContributions: An Annotation Scheme for Machine Reading of Scholarly Contributions in Natural Language Processing Literature

We describe an annotation initiative to capture the scholarly contributi...
research
04/14/2022

FREDA: Flexible Relation Extraction Data Annotation

To effectively train accurate Relation Extraction models, sufficient and...
research
04/16/2015

Towards a relation extraction framework for cyber-security concepts

In order to assist security analysts in obtaining information pertaining...
research
08/21/2017

Scientific Information Extraction with Semi-supervised Neural Tagging

This paper addresses the problem of extracting keyphrases from scientifi...
research
08/18/2020

An Annotated Corpus of Webtables for Information Extraction Tasks

Information Extraction is a well-researched area of Natural Language Pro...
research
05/15/2020

A Scientific Information Extraction Dataset for Nature Inspired Engineering

Nature has inspired various ground-breaking technological developments i...
