LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages

05/23/2023
by Milind Agarwal, et al.

Knowing the language of an input text/audio is a necessary first step for using almost every natural language processing (NLP) tool such as taggers, parsers, or translation systems. Language identification is a well-studied problem, sometimes even considered solved; in reality, most of the world's 7000 languages are not supported by current systems. This lack of representation affects large-scale data mining efforts and further exacerbates data shortage for low-resource languages. We take a step towards tackling the data bottleneck by compiling a corpus of over 50K parallel children's stories in 350+ languages and dialects, and the computation bottleneck by building lightweight hierarchical models for language identification. Our data can serve as benchmark data for language identification of short texts and for understudied translation directions such as those between Indian or African languages. Our proposed method, Hierarchical LIMIT, uses limited computation to expand coverage into excluded languages while maintaining prediction quality.
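The hierarchical idea can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on an assumption the abstract does not state: that a broad root classifier makes the first guess and a smaller specialist re-decides only when that guess falls in a group of easily confused languages. The names `NgramClassifier`, `HierarchicalLID`, and `char_ngrams` are hypothetical, and the character n-gram naive Bayes model is a stand-in chosen only because it is lightweight.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramClassifier:
    """Tiny multinomial naive Bayes over character n-grams (add-one smoothing)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.vocab = set()

    def fit(self, samples):
        """samples: iterable of (text, label) pairs."""
        for text, label in samples:
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        vocab_size = len(self.vocab)

        def score(label):
            label_counts = self.counts[label]
            total = sum(label_counts.values())
            return sum(
                math.log((label_counts[g] + 1) / (total + vocab_size))
                for g in grams
            )

        return max(self.counts, key=score)


class HierarchicalLID:
    """Root classifier plus specialists for groups of confusable labels (assumed design)."""

    def __init__(self, root, groups):
        self.root = root
        self.groups = groups  # list of (set_of_labels, classifier) pairs

    def predict(self, text):
        label = self.root.predict(text)
        for members, specialist in self.groups:
            if label in members:
                # Hand the decision to the model trained only on this group.
                return specialist.predict(text)
        return label
```

In this sketch each specialist only has to separate a handful of labels, so it can stay small, and adding coverage for a new confusable group does not require retraining the root model.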


