Building Low-Resource NER Models Using Non-Speaker Annotation

06/17/2020
by Tatiana Tsygankova, et al.

In low-resource natural language processing (NLP), the key problem is a lack of training data in the target language. Cross-lingual methods have had notable success in addressing this concern, but in certain common circumstances, such as insufficient pre-training corpora or languages far from the source language, their performance suffers. In this work, we propose an alternative approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations, provided by annotators with no prior experience in the target language. We recruit 30 participants to annotate unfamiliar languages in a carefully controlled annotation experiment, using Indonesian, Russian, and Hindi as target languages. Our results show that using non-speaker annotators produces results that approach or match the performance of fluent speakers. NS results are also consistently on par with or better than those of cross-lingual methods built on modern contextual representations, and have the potential to outperform them further with additional annotation effort. We conclude with observations of common annotation practices and recommendations for maximizing non-speaker annotator performance.
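To ground the setting, the sketch below shows one way such a low-resource model could be trained: fine-tuning a multilingual contextual encoder for NER on an annotated sentence, regardless of whether the annotation came from a fluent speaker or a non-speaker. This is a minimal illustration, not the paper's implementation; the model name (xlm-roberta-base), the label set, and the example sentence and its tags are all assumptions for demonstration.

```python
# Minimal sketch (not the paper's code): fine-tuning a multilingual
# contextual encoder for NER on one annotated sentence. Model name,
# label set, and the example sentence/tags are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LABELS)
)

# A hypothetical NS-annotated Indonesian sentence, BIO-tagged per word.
words = ["Joko", "Widodo", "mengunjungi", "Jakarta", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]

# Align word-level tags to subword tokens: label each word's first
# subword and mask the rest (and special tokens) with -100 so the
# loss ignores them.
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
labels, prev = [], None
for word_id in enc.word_ids():
    if word_id is None or word_id == prev:
        labels.append(-100)
    else:
        labels.append(LABELS.index(tags[word_id]))
    prev = word_id

# One illustrative gradient step on the annotated example.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**enc, labels=torch.tensor([labels])).loss
loss.backward()
optimizer.step()
```

In the cross-lingual baselines the paper compares against, the same kind of encoder would instead be fine-tuned on annotated data in a high-resource source language and applied zero-shot to the target language.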

Related research

- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation (10/13/2022)
- Neural Cross-Lingual Named Entity Recognition with Minimal Resources (08/29/2018)
- Cross-lingual alignments of ELMo contextual embeddings (06/30/2021)
- A Little Annotation does a Lot of Good: A Study in Bootstrapping Low-resource Named Entity Recognizers (08/23/2019)
- How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages (05/30/2021)
- CalibreNet: Calibration Networks for Multilingual Sequence Labeling (11/11/2020)
- CLAR: A Cross-Lingual Argument Regularizer for Semantic Role Labeling (11/09/2020)
