DaMuEL: A Large Multilingual Dataset for Entity Linking

06/15/2023
by David Kubeša, et al.

We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. The Wikidata QID is used as a persistent, language-agnostic identifier, enabling the combination of the knowledge base with language-specific texts and information for each entity. Wikipedia documents deliberately annotate only a single mention for every entity present; we further automatically detect all mentions of named entities linked from each document. The dataset contains 27.9M named entities in the knowledge base and 12.3G tokens from Wikipedia texts. The dataset is published under the CC BY-SA license at https://hdl.handle.net/11234/1-5047.
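
The way the QID ties the two components together can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record shapes (`KBEntity`, `LangEntry`), the helper `join_by_qid`, and the sample values are hypothetical and do not reflect DaMuEL's actual file format, which is not described here.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple

# Named entity types listed in the abstract.
NE_TYPES = {"PER", "ORG", "LOC", "EVENT", "BRAND", "WORK_OF_ART", "MANUFACTURED"}


@dataclass
class KBEntity:
    """Language-agnostic knowledge base record (hypothetical shape)."""
    qid: str
    ne_type: Optional[str] = None
    claims: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class LangEntry:
    """Language-specific Wikidata text for one entity (hypothetical shape)."""
    qid: str
    label: str
    aliases: List[str] = field(default_factory=list)
    description: str = ""


def join_by_qid(
    kb: Iterable[KBEntity], lang_entries: Iterable[LangEntry]
) -> Dict[str, Tuple[KBEntity, LangEntry]]:
    """Pair each language-specific entry with its KB record via the shared QID."""
    kb_index = {entity.qid: entity for entity in kb}
    return {
        entry.qid: (kb_index[entry.qid], entry)
        for entry in lang_entries
        if entry.qid in kb_index  # skip entries with no KB counterpart
    }


# Illustrative sample values only, not taken from the dataset.
kb = [KBEntity(qid="Q1085", ne_type="LOC", claims={"P17": ["Q213"]})]
cs_entries = [
    LangEntry(qid="Q1085", label="Praha", description="hlavní město Česka"),
    LangEntry(qid="Q999999999", label="bez záznamu"),  # no matching KB record
]

merged = join_by_qid(kb, cs_entries)
entity, text = merged["Q1085"]
print(entity.ne_type, text.label)
```

Because the join key is language-agnostic, the same knowledge base index could be reused for every language's entries without duplicating the entity-level information.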

