DAWT: Densely Annotated Wikipedia Texts across multiple languages

by Nemanja Spasojevic, et al.

In this work, we open up DAWT, a dataset of Densely Annotated Wikipedia Texts across multiple languages. The annotations include labeled text mentions that map to entities (represented by their Freebase machine ids), as well as the type of each entity. The dataset contains a total of 13.6M articles, 5.0B tokens, and 13.8M mention entity co-occurrences. DAWT contains 4.8 times more anchor text to entity links than are originally present in the Wikipedia markup. Moreover, it spans several languages, including English, Spanish, Italian, German, French and Arabic. We also present the methodology used to generate the dataset, which enriches the Wikipedia markup in order to increase the number of links. In addition to the main dataset, we open up several derived datasets, including mention entity co-occurrence counts and entity embeddings, as well as mappings between Freebase ids and Wikidata item ids. We also discuss two applications of these datasets, and we hope that opening them up will prove useful to the Natural Language Processing and Information Retrieval communities and will facilitate multi-lingual research.






Code Repositories


Datasets opened by Lithium Technologies | Klout
