A Corpus for Large-Scale Phonetic Typology

by   Elizabeth Salesky, et al.

A major hurdle in data-driven research on typology is having sufficient data in many languages to draw meaningful conclusions. We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology, with aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants. Access to such data can greatly facilitate investigation of phonetic typology at a large scale and across many languages. However, it is non-trivial and computationally intensive to obtain such alignments for hundreds of languages, many of which have few to no resources presently available. We describe the methodology to create our corpus, discuss caveats with current methods and their impact on the utility of this data, and illustrate possible research directions through a series of case studies on the 48 highest-quality readings. Our corpus and scripts are publicly available for non-commercial use at https://voxclamantisproject.github.io.


page 13

page 14

page 15

page 16

page 19

page 20

page 21


CJaFr-v3 : A Freely Available Filtered Japanese-French Aligned Corpus

We present a free Japanese-French parallel corpus. It includes 15M align...

VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

We introduce VoxPopuli, a large-scale multilingual corpus providing 100K...

GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors

The lack of large-scale datasets has been a major hindrance to the devel...

Shamela: A Large-Scale Historical Arabic Corpus

Arabic is a widely-spoken language with a rich and long history spanning...

Essay-BR: a Brazilian Corpus of Essays

Automatic Essay Scoring (AES) is defined as the computer technology that...

Toward More Meaningful Resources for Lower-resourced Languages

In this position paper, we describe our perspective on how meaningful re...

Large-scale diversity estimation through surname origin inference

The study of surnames as both linguistic and geographical markers of the...