A small Griko-Italian speech translation corpus

07/27/2018
by Marcely Zanon Boito, et al.

This paper presents an extension to a very low-resource parallel corpus collected in an endangered language, Griko, making it useful for computational research. The corpus consists of 330 utterances (about 20 minutes of speech) that have been transcribed and translated into Italian, with annotations for word-level speech-to-transcription and speech-to-translation alignments. The corpus also includes morphosyntactic tags and word-level glosses. Pseudo-phones were also generated by applying an automatic unit discovery method. We detail how the corpus was collected, cleaned and processed, and we illustrate its use on zero-resource tasks by presenting baseline results for speech-to-translation alignment and unsupervised word discovery. The dataset is available online to encourage replicability and diversity in computational language documentation experiments.
