Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

07/02/2020
by Brian Roark, et al.

This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text.

Keywords: romanization, transliteration, South Asian languages
