Mining Large-Scale Low-Resource Pronunciation Data From Wikipedia

01/27/2021
by Tania Chakraborty, et al.

Pronunciation modeling is a key task for building speech technology in new languages, and while solid grapheme-to-phoneme (G2P) mapping systems exist, their language coverage leaves room for improvement. The information needed to build G2P models for many more languages is readily available on Wikipedia, but it is stored in disparate formats. We report on a system we built to mine a pronunciation data set in 819 languages from loosely structured tables within Wikipedia. The data includes phoneme inventories and, for 63 low-resource languages, the G2P mapping as well. For 54 of these languages, G2P mappings are not otherwise easy to find online. We converted the information from Wikipedia into a structured, machine-readable TSV format and make the resulting data set publicly available so that it can be improved further and used in a variety of applications involving low-resource languages.
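
To illustrate how a machine-readable G2P table of this kind might be consumed, the following is a minimal sketch, not the authors' code. The column names `grapheme` and `phoneme`, the sample rows, and the helper names `load_g2p` and `transcribe` are all assumptions made for illustration; the released data set's actual layout may differ. The sketch loads grapheme/phoneme pairs from TSV text and applies them with a simple greedy longest-match rule.

```python
import csv
import io

# Hypothetical sample rows; the real data set's columns and values may differ.
SAMPLE_TSV = "grapheme\tphoneme\nsh\tʃ\ns\ts\nh\th\na\ta\n"


def load_g2p(tsv_text):
    """Read grapheme/phoneme pairs from TSV text into a dict."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["grapheme"]: row["phoneme"] for row in reader}


def transcribe(word, mapping):
    """Greedy longest-match lookup; unknown characters pass through unchanged."""
    longest = max(map(len, mapping), default=1)
    phonemes = []
    i = 0
    while i < len(word):
        # Try the longest candidate grapheme first, then shorter ones.
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in mapping:
                phonemes.append(mapping[chunk])
                i += size
                break
        else:
            # No grapheme matched at this position: keep the raw character.
            phonemes.append(word[i])
            i += 1
    return phonemes


mapping = load_g2p(SAMPLE_TSV)
print(transcribe("sasha", mapping))
```

Greedy longest-match is only one simple way to apply such a table; it ignores context-dependent rules, which a real G2P model for a given language would likely need.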


