On the Importance of Subword Information for Morphological Tasks in Truly Low-Resource Languages

09/26/2019
by   Yi Zhu, et al.

Recent work has validated the importance of subword information for word representation learning. Since subwords increase parameter sharing in neural models, their value should be even more pronounced in low-data regimes. In this work, we therefore provide a comprehensive analysis of the usefulness of subwords for word representation learning in truly low-resource scenarios and for three representative morphological tasks: fine-grained entity typing, morphological tagging, and named entity recognition. We conduct a systematic study that spans several dimensions of comparison: 1) the type of data scarcity, which can stem from a lack of task-specific training data, from a lack of the unannotated data required to train word embeddings, or both; 2) language type, by working with a sample of 16 typologically diverse languages, including some truly low-resource ones (e.g., Rusyn, Buryat, and Zulu); 3) the choice of subword-informed word representation method. Our main results show that subword-informed models are universally useful across all language types, with large gains over subword-agnostic embeddings. They also suggest that the effective use of subwords largely depends on the language (type) and the task at hand, as well as on the amount of data available for training the embeddings and the task-based models, with sufficient in-task data being the more critical requirement.
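The premise that character-level subwords let morphologically related word forms share parameters can be illustrated with a minimal sketch of fastText-style n-gram composition, one family of subword-informed methods of the kind the study compares. The class name, vector dimensionality, bucket count, and the example Zulu forms below are illustrative assumptions rather than details from the paper, and real subword vectors would be trained (e.g., with a skip-gram objective) rather than randomly initialised.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers, fastText-style."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, min(n_max, len(wrapped)) + 1)
            for i in range(len(wrapped) - n + 1)]

class SubwordEmbeddings:
    """Compose a word vector as the sum of its character n-gram vectors
    (the dedicated whole-word vector of fastText is omitted for brevity)."""

    def __init__(self, dim=100, n_buckets=100_000, seed=0):
        self.dim = dim
        self.n_buckets = n_buckets
        self.rng = np.random.default_rng(seed)
        # One vector per hash bucket, created lazily. In a trained model these
        # would be learned from unannotated text; random values stand in here.
        self.bucket_vectors = {}

    def _vector(self, bucket):
        if bucket not in self.bucket_vectors:
            self.bucket_vectors[bucket] = self.rng.normal(scale=0.1, size=self.dim)
        return self.bucket_vectors[bucket]

    def embed(self, word):
        # Hashing n-grams into a fixed number of buckets means rare and unseen
        # words still reuse parameters tied to familiar surface patterns.
        buckets = [hash(g) % self.n_buckets for g in char_ngrams(word)]
        return sum((self._vector(b) for b in buckets), np.zeros(self.dim))

emb = SubwordEmbeddings()
# Morphologically related forms (here Zulu "umuntu"/"abantu", 'person'/'people')
# share n-grams such as "ntu>", hence shared parameters even with little data.
v_sg, v_pl = emb.embed("umuntu"), emb.embed("abantu")
print(v_sg.shape, float(v_sg @ v_pl))  # (100,) and a nonzero dot product
```

A subword-agnostic model would assign "umuntu" and "abantu" entirely separate parameters, which is precisely what becomes costly when both the embedding training data and the task-specific training data are scarce.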


