A Grounded Unsupervised Universal Part-of-Speech Tagger for Low-Resource Languages

04/10/2019
by   Ronald Cardenas, et al.
4

Unsupervised part of speech (POS) tagging is often framed as a clustering problem, but practical taggers need to ground their clusters as well. Grounding generally requires reference labeled data, a luxury a low-resource language might not have. In this work, we describe an approach for low-resource unsupervised POS tagging that yields fully grounded output and requires no labeled training data. We find the classic method of Brown et al. (1992) clusters well in our use case and employ a decipherment-based approach to grounding. This approach presumes a sequence of cluster IDs is a `ciphertext' and seeks a POS tag-to-cluster ID mapping that will reveal the POS sequence. We show intrinsically that, despite the difficulty of the task, we obtain reasonable performance across a variety of languages. We also show extrinsically that incorporating our POS tagger into a name tagger leads to state-of-the-art tagging performance in Sinhalese and Kinyarwanda, two languages with nearly no labeled POS data available. We further demonstrate our tagger's utility by incorporating it into a true `zero-resource' variant of the Malopa (Ammar et al., 2016) dependency parser model that removes the current reliance on multilingual resources and gold POS tags for new languages. Experiments show that including our tagger makes up much of the accuracy lost when gold POS tags are unavailable.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/16/2021

Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages

We show that unsupervised sequence-segmentation performance can be trans...
research
10/18/2022

Graph-Based Multilingual Label Propagation for Low-Resource Part-of-Speech Tagging

Part-of-Speech (POS) tagging is an important component of the NLP pipeli...
research
07/14/2022

Learning to translate by learning to communicate

We formulate and test a technique to use Emergent Communication (EC) wit...
research
06/08/2021

A Falta de Pan, Buenas Son Tortas: The Efficacy of Predicted UPOS Tags for Low Resource UD Parsing

We evaluate the efficacy of predicted UPOS tags as input features for de...
research
08/26/2019

Low-Resource Name Tagging Learned with Weakly Labeled Data

Name tagging in low-resource languages or domains suffers from inadequat...
research
11/12/2020

Exploiting Cross-Dialectal Gold Syntax for Low-Resource Historical Languages: Towards a Generic Parser for Pre-Modern Slavic

This paper explores the possibility of improving the performance of spec...
research
05/05/2022

Quantifying Language Variation Acoustically with Few Resources

Deep acoustic models represent linguistic information based on massive a...

Please sign up or login with your details

Forgot password? Click here to reset