Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

07/02/2020
by Brian Roark, et al.

This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text.

Keywords: romanization, transliteration, South Asian languages
