Part-of-Speech Tagging on an Endangered Language: a Parallel Griko-Italian Resource

06/11/2018
by Antonis Anastasopoulos, et al.

Most work on part-of-speech (POS) tagging is focused on high-resource languages, or examines low-resource and active learning settings through simulated studies. We evaluate POS tagging techniques on an actual endangered language, Griko. We present a resource that contains 114 narratives in Griko, along with sentence-level translations in Italian, and provides gold annotations for the test set. Based on a previously collected small corpus, we investigate several traditional methods, as well as methods that take advantage of monolingual data or project cross-lingual POS tags. We show that the combination of a semi-supervised method with cross-lingual transfer is more appropriate for this extremely challenging setting, with the best tagger achieving an accuracy of 72.9%. With an applied active learning scheme, which we use to collect sentence-level annotations over the test set, we achieve improvements of more than 21 percentage points.
