Improved Text Language Identification for the South African Languages

11/01/2017
by Bernardt Duvenhage, et al.

Virtual assistants and text chatbots have recently been gaining popularity. Because text-based chat messages are short, the language identification systems of these bots might have only 15 or 20 characters on which to base a prediction. Accurate text language identification is nevertheless important, especially in the early stages of many multilingual natural language processing pipelines. This paper investigates the use of a naive Bayes classifier to predict the language family that a piece of text belongs to, combined with a lexicon-based classifier to distinguish the specific South African language that the text is written in. This approach leads to a 31% reduction in the language detection error. In the spirit of reproducible research, the training and testing datasets as well as the code are published on GitHub. Hopefully they will be useful for creating a text language identification shared task for South African languages.
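The two-stage idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' published code: the character-trigram features, the add-one smoothing, the class and function names (`NaiveBayes`, `lexicon_predict`, `identify`), the family grouping, and the tiny training samples and word lists are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the character n-grams of a space-padded, lower-cased string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayes:
    """Multinomial naive Bayes over character n-grams, add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter()               # label -> number of samples
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, samples):
        for text, label in samples:
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label in self.docs:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.docs[label] / total_docs)
            denom = sum(self.counts[label].values()) + len(self.vocab)
            for gram in char_ngrams(text, self.n):
                score += math.log((self.counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def lexicon_predict(text, lexicons):
    """Pick the language whose word list matches the most words in the text."""
    words = text.lower().split()
    hits = {lang: sum(w in lex for w in words) for lang, lex in lexicons.items()}
    return max(hits, key=hits.get)  # ties go to the first language listed


def identify(text, family_model, family_lexicons):
    """Stage 1: predict the family. Stage 2: lexicon vote inside that family."""
    family = family_model.predict(text)
    return family, lexicon_predict(text, family_lexicons[family])


# Hypothetical toy data, far too small for real use.
FAMILY_SAMPLES = [
    ("ngiyabonga kakhulu", "nguni"),
    ("sawubona unjani", "nguni"),
    ("enkosi kakhulu", "nguni"),
    ("molo unjani", "nguni"),
    ("baie dankie", "germanic"),
    ("goeie more", "germanic"),
    ("thank you very much", "germanic"),
    ("good morning", "germanic"),
]
LEXICONS = {
    "nguni": {"zul": {"ngiyabonga", "sawubona"}, "xho": {"enkosi", "molo"}},
    "germanic": {
        "afr": {"baie", "dankie", "goeie"},
        "eng": {"thank", "you", "good", "morning"},
    },
}

model = NaiveBayes().fit(FAMILY_SAMPLES)
print(identify("dankie", model, LEXICONS))
```

Splitting the decision this way lets the statistical model handle the coarse, easier family distinction from very few characters, while the word lists separate closely related languages that share most of their character statistics.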


research
11/18/2019

Short Text Language Identification for Under Resourced Languages

The paper presents a hierarchical naive Bayesian and lexicon based class...
research
01/12/2017

LanideNN: Multilingual Language Identification on Character Window

In language identification, a common first step in natural language proc...
research
10/09/2018

A Fast, Compact, Accurate Model for Language Identification of Codemixed Text

We address fine-grained multilingual language identification: providing ...
research
06/17/2023

Multilingual Multiword Expression Identification Using Lateral Inhibition and Domain Adaptation

Correctly identifying multiword expressions (MWEs) is an important task ...
research
03/09/2021

Comparing Approaches to Dravidian Language Identification

This paper describes the submissions by team HWR to the Dravidian Langua...
research
10/15/2019

Language Identification on Massive Datasets of Short Message using an Attention Mechanism CNN

Language Identification (LID) is a challenging task, especially when the...
research
02/11/2021

A reproduction of Apple's bi-directional LSTM models for language identification in short strings

Language Identification is the task of identifying a document's language...
