DAWT: Densely Annotated Wikipedia Texts across multiple languages

03/02/2017
by   Nemanja Spasojevic, et al.

In this work, we release the DAWT dataset: Densely Annotated Wikipedia Texts across multiple languages. The annotations include labeled text mentions mapped to entities (represented by their Freebase machine ids), as well as the type of each entity. The dataset contains a total of 13.6M articles, 5.0B tokens, and 13.8M mention-entity co-occurrences. DAWT contains 4.8 times more anchor-text-to-entity links than were originally present in the Wikipedia markup. It spans several languages, including English, Spanish, Italian, German, French and Arabic. We also present the methodology used to generate the dataset, which enriches the Wikipedia markup in order to increase the number of links. In addition to the main dataset, we release several derived datasets, including mention-entity co-occurrence counts and entity embeddings, as well as mappings between Freebase ids and Wikidata item ids. We also discuss two applications of these datasets, and we hope that releasing them will prove useful to the Natural Language Processing and Information Retrieval communities and will facilitate multilingual research.


