FT Speech: Danish Parliament Speech Corpus

05/25/2020
by Andreas Kirkedal, et al.

This paper introduces FT Speech, a new speech corpus created from the recorded meetings of the Danish Parliament, otherwise known as the Folketing (FT). The corpus contains over 1,800 hours of transcribed speech from a total of 434 speakers. It is significantly larger in duration, vocabulary, and amount of spontaneous speech than the existing public speech corpora for Danish, which are largely limited to read-aloud and dictation data. We outline design considerations, including the preprocessing methods and the alignment procedure. To evaluate the quality of the corpus, we train automatic speech recognition (ASR) systems on the new resource and compare them to systems trained on the Danish part of Språkbanken, the largest public ASR corpus for Danish to date. Our baseline system achieves a word error rate (WER) of 14.01% on the new corpus. Combining FT Speech with in-domain language data gives results comparable to those of models trained specifically on Språkbanken, showing that FT Speech transfers well to this data set. Interestingly, the opposite does not hold: models trained on Språkbanken do not transfer well to FT Speech. This shows that FT Speech is a valuable resource for promoting research on Danish ASR with more spontaneous speech.
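
The WER reported above is a word error rate. As a minimal sketch of how such a figure is conventionally computed (the standard edit-distance definition, not necessarily the authors' exact scoring setup, which may also apply text normalization), the number of word substitutions, deletions, and insertions needed to turn the reference transcript into the recognizer output is divided by the number of reference words. The function name `word_error_rate` and the example sentences are hypothetical and used only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent between a reference and a hypothesis transcript.

    Assumption: words are separated by whitespace and compared exactly;
    real scoring pipelines usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One of five reference words is dropped by the hypothetical recognizer.
    print(word_error_rate("det er en god dag", "det er god dag"))
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.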

research
12/07/2020

MLS: A Large-Scale Multilingual Dataset for Speech Research

This paper introduces Multilingual LibriSpeech (MLS) dataset, a large mu...
research
10/24/2022

Investigating the effect of domain selection on automatic speech recognition performance: a case study on Bangladeshi Bangla

The performance of data-driven natural language processing systems is co...
research
04/06/2021

EasyCall corpus: a dysarthric speech dataset

This paper introduces a new dysarthric speech command dataset in Italian...
research
09/30/2019

DiPCo – Dinner Party Corpus

We present a speech data corpus that simulates a "dinner party" scenario...
research
06/15/2021

RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis

This paper introduces RyanSpeech, a new speech corpus for research on au...
research
05/09/2021

English Accent Accuracy Analysis in a State-of-the-Art Automatic Speech Recognition System

Nowadays, research in speech technologies has gotten a lot out thanks to...
research
11/22/2021

Human-Machine Interaction Speech Corpus from the ROBIN project

This paper introduces a new Romanian speech corpus from the ROBIN projec...
