Language Identification for Austronesian Languages

06/09/2022
by Jonathan Dunn, et al.

This paper provides language identification models for low- and under-resourced languages in the Pacific region, with a focus on Austronesian languages for which such models were previously unavailable. Accurate language identification is an important part of developing language resources. The approach combines 29 Austronesian languages with 171 non-Austronesian languages to create an evaluation set drawn from eight data sources. After evaluating six approaches to language identification, we find that a classifier based on skip-gram embeddings performs significantly better than the alternative methods. We then systematically increase the number of non-Austronesian languages in the model, up to a total of 800 languages, to evaluate whether a larger language inventory leads to less precise predictions for the Austronesian languages of interest. This evaluation finds that increasing the inventory of non-Austronesian languages has only a minimal impact on accuracy. Further experiments adapt these language identification models for code-switching detection, achieving high accuracy across all 29 languages.
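To make the task concrete, the following is a minimal sketch of text-based language identification. It is not the paper's skip-gram embedding classifier: it substitutes a much simpler character-trigram naive Bayes baseline with add-one smoothing and a uniform prior over languages. The class name `NgramLanguageIdentifier`, the helper `char_ngrams`, and the toy sentences in `TOY_SAMPLES` are all hypothetical and chosen only for illustration. The labels `mri` (Māori, an Austronesian language) and `eng` (English) are ISO 639-3 codes.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a space-padded, lowercased string."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageIdentifier:
    """Multinomial naive Bayes over character n-grams (illustrative baseline only)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.totals = Counter()             # language -> number of n-grams seen
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, samples):
        """Count n-grams for each (text, language) training pair."""
        for text, lang in samples:
            grams = char_ngrams(text, self.n)
            self.counts[lang].update(grams)
            self.totals[lang] += len(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        """Return the language with the highest smoothed log-likelihood."""
        grams = char_ngrams(text, self.n)
        vocab_size = len(self.vocab)
        best_lang, best_score = None, -math.inf
        for lang in self.counts:
            denom = self.totals[lang] + vocab_size
            # Add-one smoothing keeps unseen n-grams from zeroing the score.
            score = sum(math.log((self.counts[lang][g] + 1) / denom) for g in grams)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


# Toy training data, assumed for illustration; real systems use large corpora.
TOY_SAMPLES = [
    ("kia ora koutou katoa", "mri"),
    ("ko te reo te mauri o te mana māori", "mri"),
    ("the quick brown fox jumps over the lazy dog", "eng"),
    ("language identification is an important task", "eng"),
]

if __name__ == "__main__":
    model = NgramLanguageIdentifier().fit(TOY_SAMPLES)
    for sentence in ["te reo o te whenua", "the dog is over there"]:
        print(sentence, "->", model.predict(sentence))
```

A baseline of this kind scores every language in the inventory for each input, which is why the question raised in the abstract matters: adding hundreds of non-Austronesian languages adds competitors that could draw predictions away from the languages of interest.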


research
11/18/2019

Short Text Language Identification for Under Resourced Languages

The paper presents a hierarchical naive Bayesian and lexicon based class...
research
12/11/2020

Discriminating Between Similar Nordic Languages

Automatic language identification is a challenging problem. Discriminati...
research
04/13/2018

Automatic Language Identification System for Hindi and Magahi

Language identification has become a prerequisite for all kinds of autom...
research
06/17/2023

Multilingual Multiword Expression Identification Using Lateral Inhibition and Domain Adaptation

Correctly identifying multiword expressions (MWEs) is an important task ...
research
11/29/2018

Tuplemax Loss for Language Identification

In many scenarios of a language identification task, the user will speci...
research
05/30/2017

A Low Dimensionality Representation for Language Variety Identification

Language variety identification aims at labelling texts in a native lang...
research
11/05/2021

Developing Successful Shared Tasks on Offensive Language Identification for Dravidian Languages

With the fast growth of mobile computing and Web technologies, offensive...
