Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages

05/25/2023
by Yash Madhani, et al.

We create publicly available language identification (LID) datasets and models for all 22 languages listed in the Indian constitution, covering both native-script and romanized text. First, we create Bhasha-Abhijnaanam, a language identification test set for native-script and romanized text that spans all 22 Indic languages. We also train IndicLID, a language identifier for all of these languages in both native and romanized script. For native-script text, IndicLID has better language coverage than existing LIDs and is competitive with or better than them. IndicLID is the first LID for romanized text in Indian languages. Two major challenges for romanized-text LID are the lack of training data and low LID performance when languages are similar. We provide simple and effective solutions to these problems. In general, there has been limited work on romanized text in any language, and our findings are relevant to other languages that need romanized language identification. Our models are publicly available at https://github.com/AI4Bharat/IndicLID under open-source licenses. Our training and test sets are also publicly available at https://huggingface.co/datasets/ai4bharat/Bhasha-Abhijnaanam under open-source licenses.
