Towards a Broad Coverage Named Entity Resource: A Data-Efficient Approach for Many Diverse Languages

01/28/2022
by Silvia Severini, et al.

Parallel corpora are ideal for extracting a multilingual named entity (MNE) resource, i.e., a dataset of names translated into multiple languages. Prior work on extracting MNE datasets from parallel corpora required resources such as large monolingual corpora or word aligners that are unavailable or perform poorly for underresourced languages. We present CLC-BN, a new method for creating an MNE resource, and apply it to the Parallel Bible Corpus, a corpus of more than 1000 languages. CLC-BN learns a neural transliteration model from parallel-corpus statistics, without requiring any other bilingual resources, word aligners, or seed data. Experimental results show that CLC-BN clearly outperforms prior work. We release an MNE resource for 1340 languages and demonstrate its effectiveness in two downstream tasks: knowledge graph augmentation and bilingual lexicon induction.
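
To make the phrase "parallel-corpus statistics" concrete, here is a minimal Python sketch that scores word pairs by how often they co-occur in aligned verses. It is not the CLC-BN method itself, whose details the abstract does not give. The Dice coefficient, the function names `dice_scores` and `best_match`, and the toy verses are all assumptions chosen for illustration only.

```python
from collections import Counter
from itertools import product


def dice_scores(src_verses, tgt_verses):
    """Score source/target word pairs by Dice coefficient over aligned verses.

    Assumption: verses are aligned one-to-one and whitespace-tokenized.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in zip(src_verses, tgt_verses):
        # Count each word at most once per verse.
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }


def best_match(word, scores):
    """Return the target word with the highest score for `word`, or None."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None


# Toy verse-aligned data, invented for illustration (not taken from the
# Parallel Bible Corpus).
english = ["Moses went up", "Aaron spoke to Moses", "Aaron went down"]
german = ["Mose stieg hinauf", "Aaron redete mit Mose", "Aaron stieg hinab"]

scores = dice_scores(english, german)
print(best_match("Moses", scores))
print(best_match("Aaron", scores))
```

Because names tend to recur in the same verses across translations, even a simple association score like this can surface candidate name pairs without a word aligner or seed dictionary. A real system would need to handle tokenization, morphology, and noise far more carefully.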

research
08/28/2018

Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations

Much work in Natural Language Processing (NLP) has been for resource-ric...
research
02/28/2022

ParaNames: A Massively Multilingual Entity Name Corpus

This preprint describes work in progress on ParaNames, a multilingual pa...
research
07/26/2018

Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora

This study improves the performance of neural named entity recognition b...
research
03/15/2022

Does Corpus Quality Really Matter for Low-Resource Languages?

The vast majority of non-English corpora are derived from automatically ...
research
11/21/2022

L3Cube-HindBERT and DevBERT: Pre-Trained BERT Transformer models for Devanagari based Hindi and Marathi Languages

The monolingual Hindi BERT models currently available on the model hub d...
research
04/28/2017

Past, Present, Future: A Computational Investigation of the Typology of Tense in 1000 Languages

We present SuperPivot, an analysis method for low-resource languages tha...
research
06/11/2018

Learning Multilingual Topics from Incomparable Corpus

Multilingual topic models enable crosslingual tasks by extracting consis...
