
Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

by Brian Roark, et al.

This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text.

Keywords: romanization, transliteration, South Asian languages




The WiLI benchmark dataset for written language identification

This paper describes the WiLI-2018 benchmark dataset for monolingual wri...

Around the world in 60 words: A generative vocabulary test for online research

Conducting experiments with diverse participants in their native languag...

Analysis Tool for UNL-Based Knowledge Representation

The fundamental issue in knowledge representation is to provide a precis...

AutoMeTS: The Autocomplete for Medical Text Simplification

The goal of text simplification (TS) is to transform difficult text into...

Vector Space Model as Cognitive Space for Text Classification

In this era of digitization, knowing the user's sociolect aspects have b...

Unsupervised Separation of Transliterable and Native Words for Malayalam

Differentiating intrinsic language words from transliterable words is a ...

Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech

This study investigates whether the phonological features derived from t...