Short Text Language Identification for Under Resourced Languages

11/18/2019
by Bernardt Duvenhage, et al.

The paper presents a hierarchical naive Bayesian and lexicon-based classifier for short-text language identification (LID) that is useful for under-resourced languages. The algorithm is evaluated on short pieces of text in the 11 official South African languages, some of which are similar to one another. It is compared to recent approaches using test sets from previous work on South African languages as well as the datasets of the Discriminating between Similar Languages (DSL) shared tasks. Remaining research opportunities and pressing concerns in evaluating and comparing LID approaches are also discussed.
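To make the idea of a hierarchical naive Bayesian and lexicon-based classifier more concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a two-stage design: a character n-gram naive Bayes model makes a first guess, and if that guess falls in a group of similar languages, a word lexicon lookup picks the language within the group. The names `NaiveBayesLID`, `identify`, `TOY_SAMPLES`, `TOY_GROUPS` and `TOY_LEXICONS` are hypothetical, and the tiny training set and lexicons are illustrative toy data only.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_max=3):
    """Return all character n-grams (n = 1..n_max) of the padded, lowercased text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(padded) - n + 1)]


class NaiveBayesLID:
    """Multinomial naive Bayes over character n-grams with add-one smoothing."""

    def fit(self, samples):
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.docs = Counter()               # language -> number of training texts
        for text, lang in samples:
            self.counts[lang].update(char_ngrams(text))
            self.docs[lang] += 1
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        grams = char_ngrams(text)
        best_lang, best_score = None, -math.inf
        for lang, counts in self.counts.items():
            # The extra +1 reserves probability mass for unseen n-grams.
            denom = sum(counts.values()) + len(self.vocab) + 1
            score = math.log(self.docs[lang] / total_docs)
            score += sum(math.log((counts[g] + 1) / denom) for g in grams)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


def identify(text, nb, groups, lexicons):
    """Stage 1: naive Bayes guess. Stage 2: lexicon vote within a similar-language group."""
    first = nb.predict(text)
    group = next((g for g in groups if first in g), None)
    if group is None:
        return first
    words = text.lower().split()
    hits = {lang: sum(w in lexicons.get(lang, set()) for w in words)
            for lang in group}
    best = max(hits.values())
    winners = [lang for lang, h in hits.items() if h == best]
    # Fall back to the naive Bayes guess when the lexicons give no clear winner.
    return winners[0] if best > 0 and len(winners) == 1 else first


# Illustrative toy data (assumption): far too small for real use.
TOY_SAMPLES = [
    ("dankie baie dankie", "afr"),
    ("thank you very much", "eng"),
    ("ngiyabonga kakhulu", "zul"),
    ("enkosi kakhulu", "xho"),
]
TOY_GROUPS = [{"zul", "xho"}]  # assumed group of similar languages
TOY_LEXICONS = {"zul": {"ngiyabonga", "sawubona"}, "xho": {"enkosi", "molo"}}

if __name__ == "__main__":
    model = NaiveBayesLID().fit(TOY_SAMPLES)
    print(identify("dankie", model, TOY_GROUPS, TOY_LEXICONS))
```

The lexicon stage only overrides the first-stage guess when exactly one language in the group has the most word hits. That is one plausible way to combine the two signals and is a design choice of this sketch.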


research
11/01/2017

Improved Text Language Identification for the South African Languages

Virtual assistants and text chatbots have recently been gaining populari...
research
06/09/2022

Language Identification for Austronesian Languages

This paper provides language identification models for low- and under-re...
research
01/13/2017

LIDE: Language Identification from Text Documents

The increase in the use of microblogging came along with the rapid growt...
research
09/30/2016

Discriminating Similar Languages: Evaluations and Explorations

We present an analysis of the performance of machine learning classifier...
research
07/15/2015

Language discrimination and clustering via a neural network approach

We classify twenty-one Indo-European languages starting from written tex...
research
02/11/2021

A reproduction of Apple's bi-directional LSTM models for language identification in short strings

Language Identification is the task of identifying a document's language...
research
03/09/2021

Comparing Approaches to Dravidian Language Identification

This paper describes the submissions by team HWR to the Dravidian Langua...
