Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia

12/14/2022
by Diego Alves, et al.

With the ever-growing popularity of the field of NLP, the demand for datasets in low-resourced languages follows suit. Following a previously established framework, we present the UNER dataset, a multilingual, hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. A post-processing step then significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.
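The three-step procedure summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the type lookup `TOY_DBPEDIA_TYPES`, the mapping table `TOY_CLASS_TO_LABEL`, and the label strings are hypothetical placeholders, not the actual UNER tagset or real DBpedia query results. The only conventions assumed from outside the abstract are the `[[Target|anchor]]` wikitext link syntax and the `http://dbpedia.org/resource/` URI prefix.

```python
import re

# Matches [[Target]], [[Target|anchor]] and [[Target#Section|anchor]] in wikitext.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]+))?\]\]")

# Hypothetical in-memory stand-in for a DBpedia type lookup.
TOY_DBPEDIA_TYPES = {
    "http://dbpedia.org/resource/Zagreb": {"dbo:City", "dbo:Place"},
    "http://dbpedia.org/resource/Nikola_Tesla": {"dbo:Scientist", "dbo:Person"},
}

# Hypothetical class-to-label table, ordered from most to least specific.
# The label strings are illustrative only, not the real UNER hierarchy.
TOY_CLASS_TO_LABEL = [
    ("dbo:City", "LOC-City"),
    ("dbo:Place", "LOC"),
    ("dbo:Person", "PER"),
]


def extract_links(wikitext):
    """Step 1: return (anchor_text, target_title) pairs for each wiki link."""
    pairs = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        anchor = (match.group(2) or target).strip()
        pairs.append((anchor, target))
    return pairs


def to_dbpedia_uri(title):
    """Step 2: derive a DBpedia resource URI from a Wikipedia page title."""
    return "http://dbpedia.org/resource/" + title.replace(" ", "_")


def label_for(classes):
    """Step 3: pick the first (most specific) label whose class is present."""
    for cls, label in TOY_CLASS_TO_LABEL:
        if cls in classes:
            return label
    return None  # entity stays unlabeled when no class is mapped


def annotate(wikitext):
    """Run the three steps and return (anchor, uri, label) triples."""
    triples = []
    for anchor, title in extract_links(wikitext):
        uri = to_dbpedia_uri(title)
        classes = TOY_DBPEDIA_TYPES.get(uri, set())
        triples.append((anchor, uri, label_for(classes)))
    return triples
```

Ordering the mapping table from specific to general is one simple way to respect a label hierarchy when an entity carries several DBpedia classes; the paper's own mapping and its post-processing step may work differently.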

research
11/22/2021

Namesakes: Ambiguously Named Entities from Wikipedia and News

We present Namesakes, a dataset of ambiguously named entities obtained f...
research
10/23/2020

UNER: Universal Named-Entity Recognition Framework

We introduce the Universal Named-Entity Recognition (UNER) framework, a 4...
research
12/14/2022

Building and Evaluating Universal Named-Entity Recognition English corpus

This article presents the application of the Universal Named Entity fram...
research
03/02/2017

DAWT: Densely Annotated Wikipedia Texts across multiple languages

In this work, we open up the DAWT dataset - Densely Annotated Wikipedia ...
research
03/17/2017

Global Entity Ranking Across Multiple Languages

We present work on building a global long-tailed ranking of entities acr...
research
09/14/2019

Multi-class Multilingual Classification of Wikipedia Articles Using Extended Named Entity Tag Set

Wikipedia is a great source of general world knowledge which can guide N...
research
04/27/2021

Named Entity Recognition and Linking Augmented with Large-Scale Structured Data

In this paper we describe our submissions to the 2nd and 3rd SlavNER Sha...
