Aksharantar: Towards building open transliteration tools for the next billion users

05/06/2022
by Yash Madhani, et al.

We introduce Aksharantar, the largest publicly available transliteration dataset for 21 Indic languages, containing 26 million transliteration pairs. We build this dataset by mining transliteration pairs from large monolingual and parallel corpora, and by collecting transliterations from human annotators to ensure diversity of words and representation of low-resource languages. We also introduce a new, large, diverse testset for Indic language transliteration, containing 103k word pairs spanning 19 languages, that enables fine-grained analysis of transliteration models. We train the IndicXlit model on the Aksharantar training set. IndicXlit is a single transformer-based multilingual transliteration model for Roman to Indic script conversion that supports 21 Indic languages. It achieves state-of-the-art results on the Dakshina testset and establishes strong baselines on the Aksharantar testset released along with this work. All the datasets and models are publicly available at https://indicnlp.ai4bharat.org/aksharantar. We hope the availability of these large-scale, open resources will spur innovation for Indic language transliteration and downstream applications.
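To make the idea of fine-grained, per-language analysis concrete, here is a minimal sketch of scoring a transliteration system by top-1 exact-match accuracy for each language. The abstract does not specify the evaluation metric or the data layout, so everything here is an assumption for illustration: the `(lang, roman_word, native_word)` triple format, the function name `top1_accuracy_by_language`, and the toy predictor are hypothetical and are not the Aksharantar release format or the IndicXlit API.

```python
from collections import defaultdict


def top1_accuracy_by_language(examples, predict):
    """Return {language: fraction of words whose top prediction is an exact match}.

    examples: iterable of (lang, roman_word, native_word) triples
              (hypothetical layout, assumed for this sketch).
    predict:  callable (lang, roman_word) -> predicted native-script string.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, roman, native in examples:
        total[lang] += 1
        if predict(lang, roman) == native:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


# Toy word pairs, made up for illustration only.
toy_pairs = [
    ("hi", "namaste", "नमस्ते"),
    ("hi", "bharat", "भारत"),
    ("ta", "vanakkam", "வணக்கம்"),
]

# Stand-in "model": a lookup table that deliberately misses one Hindi word.
toy_lookup = {
    (lang, roman): native
    for lang, roman, native in toy_pairs
    if roman != "bharat"
}


def toy_predict(lang, roman):
    # Returns an empty string for words the stand-in model does not know.
    return toy_lookup.get((lang, roman), "")


scores = top1_accuracy_by_language(toy_pairs, toy_predict)
print(scores)
```

With a real model, `toy_predict` would be replaced by a call into that model. Reporting one score per language, rather than a single pooled number, keeps low-resource languages from being hidden behind high-resource ones.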


