Swiss Parliaments Corpus, an Automatically Aligned Swiss German Speech to Standard German Text Corpus

10/06/2020
by Michel Plüss, et al.

We present a forced sentence alignment procedure for Swiss German speech and Standard German text. Given an audio recording and the corresponding unaligned transcript, it creates a speech-to-text corpus fully automatically. Compared to a manual alignment, it achieves a mean intersection over union (IoU) of 0.8401 with a sentence recall of 0.9491. Applying our IoU estimate filter improves the mean IoU to 0.9271, at the cost of a lower sentence recall of 0.4881. Using this procedure, we created the Swiss Parliaments Corpus, an automatically aligned Swiss German speech to Standard German text corpus. 65% of the raw data could be transformed to sentence-level audio-text pairs, resulting in 293 hours of training data. We have made the corpus freely available for download.
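To make the two reported metrics concrete, the following is a minimal sketch of how a mean IoU and a sentence recall could be computed when comparing automatic sentence boundaries with manual ones. The abstract does not give the exact definitions, so everything here is an assumption: sentences are represented as (start, end) times in seconds, IoU is the standard interval overlap ratio, recall is taken as the share of manually aligned sentences for which the automatic procedure returned an alignment, and the names `interval_iou` and `evaluate` are hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds), assumed representation


def interval_iou(a: Interval, b: Interval) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def evaluate(
    manual: List[Interval],
    automatic: List[Optional[Interval]],
) -> Tuple[float, float]:
    """Return (mean IoU over aligned sentences, sentence recall).

    `automatic[i]` is None when sentence i was not aligned or was
    discarded by a filter (assumed convention for this sketch).
    """
    scores = [
        interval_iou(m, a)
        for m, a in zip(manual, automatic)
        if a is not None
    ]
    mean_iou = sum(scores) / len(scores) if scores else 0.0
    recall = len(scores) / len(manual) if manual else 0.0
    return mean_iou, recall
```

Under these assumptions, a filter that discards sentences with a low estimated IoU turns more entries of `automatic` into `None`, which raises the mean IoU of what remains while lowering recall, matching the trade-off described above.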


