Feature-Dependent Confusion Matrices for Low-Resource NER Labeling with Noisy Labels

10/14/2019
by Lukas Lange, et al.

In low-resource settings, the performance of supervised labeling models can be improved with automatically annotated or distantly supervised data, which is cheap to create but often noisy. Previous work has shown that significant improvements can be achieved by injecting information about the confusion between clean and noisy labels in this additional training data into classifier training. However, for noise estimation, these approaches either do not take the input features (in our case, word embeddings) into account, or they need to learn the noise model from scratch, which can be difficult in a low-resource setting. We propose to cluster the training data using the input features and then compute a separate confusion matrix for each cluster. To the best of our knowledge, our approach is the first to leverage feature-dependent noise modeling with pre-initialized confusion matrices. We evaluate on low-resource named entity recognition settings in several languages, showing that our methods improve upon other confusion-matrix-based methods by up to 9
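The core idea of clustering by input features and then estimating one confusion matrix per cluster can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that a small set of tokens carries both a clean and a noisy label, uses a plain k-means as a stand-in for whatever clustering the paper uses, and adds additive smoothing to avoid empty rows. The function names `kmeans` and `cluster_confusion_matrices`, the `smoothing` parameter, and the toy data are all hypothetical.

```python
import numpy as np


def kmeans(features, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per row of `features`."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(features), size=n_clusters, replace=False)
    centroids = features[start].astype(float)  # fancy indexing yields a copy
    for _ in range(n_iter):
        # Squared Euclidean distance of every point to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assignment = dists.argmin(axis=1)
        for k in range(n_clusters):
            members = features[assignment == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def cluster_confusion_matrices(clean, noisy, assignment, n_clusters, n_labels,
                               smoothing=1.0):
    """Estimate p(noisy = j | clean = i, cluster = k) as a (k, i, j) array.

    `smoothing` is an assumption of this sketch (additive smoothing),
    not something stated in the abstract.
    """
    counts = np.full((n_clusters, n_labels, n_labels), smoothing, dtype=float)
    # Accumulate one count per (cluster, clean label, noisy label) triple.
    np.add.at(counts, (assignment, clean, noisy), 1.0)
    return counts / counts.sum(axis=2, keepdims=True)


# Toy data (hypothetical): 2-d "embeddings" forming two well-separated groups.
features = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                     [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1]])
clean = np.array([0, 1, 0, 1, 0, 1, 0, 1])
# In the first group the noisy labels agree with the clean ones;
# in the second group label 1 is always mislabeled as 0.
noisy = np.array([0, 1, 0, 1, 0, 0, 0, 0])

assignment = kmeans(features, n_clusters=2)
matrices = cluster_confusion_matrices(clean, noisy, assignment,
                                      n_clusters=2, n_labels=2)
print(matrices)
```

In this toy setup, a single global confusion matrix would average over both groups, whereas the per-cluster matrices keep the noise pattern of each group separate. Matrices estimated this way could then serve as the initialization of a noise layer, which is what "pre-initialized" in the abstract suggests, though the details of that step are not given here.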

research
07/02/2018

Training a Neural Network in a Low-Resource Setting on Automatically Annotated Noisy Data

Manually labeled corpora are expensive to create and often not available...
research
03/18/2020

Distant Supervision and Noisy Label Learning for Low Resource Named Entity Recognition: A Study on Hausa and Yorùbá

The lack of labeled training data has limited the development of natural...
research
03/28/2019

Handling Noisy Labels for Robustly Learning from Self-Training Data for Low-Resource Sequence Labeling

In this paper, we address the problem of effectively self-training neura...
research
09/04/2021

Data Augmentation for Cross-Domain Named Entity Recognition

Current work in named entity recognition (NER) shows that data augmentat...
research
06/22/2020

Dirichlet-Smoothed Word Embeddings for Low-Resource Settings

Nowadays, classical count-based word embeddings using positive pointwise...
research
08/26/2019

Low-Resource Name Tagging Learned with Weakly Labeled Data

Name tagging in low-resource languages or domains suffers from inadequat...
research
09/26/2019

On the Importance of Subword Information for Morphological Tasks in Truly Low-Resource Languages

Recent work has validated the importance of subword information for word...
