A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments

10/10/2017
by P. Godard, et al.

Most speech and language technologies are trained with massive amounts of speech and text data. However, most of the world's languages have neither such resources nor a stable orthography. Systems built under these almost-zero-resource conditions are promising not only for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists (semi-)automatically analyze and annotate audio recordings of endangered and unwritten languages. Example tasks are automatic phoneme discovery and lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It is made up of 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard graphemic form close to the language's phonology. We describe how the data were collected, cleaned, and processed, and we illustrate their use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation.


