ParaNames: A Massively Multilingual Entity Name Corpus

02/28/2022
by Jonne Sälevä, et al.

This preprint describes work in progress on ParaNames, a multilingual parallel name resource consisting of names for approximately 14 million entities. The included names span over 400 languages, and almost all entities are mapped to standardized entity types (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best possible quality. ParaNames is useful for multilingual language processing, both for defining name translation and transliteration tasks and as supplementary data for tasks such as named entity recognition and linking. Our resource is released on GitHub (https://github.com/bltlab/paranames) under a Creative Commons license (CC BY 4.0).
