Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset

03/06/2022
by Rita Frieske, et al.

Automatic speech recognition (ASR) for low-resource languages improves the access of linguistic minorities to the technological advantages provided by artificial intelligence (AI). In this paper, we address the problem of data scarcity for Hong Kong Cantonese by creating a new Cantonese dataset. Our dataset, the Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong. It spans the philosophy, politics, education, culture, lifestyle, and family domains, covering a wide range of topics. We also review all existing Cantonese datasets and analyze them according to their speech type, data source, total size, and availability. We further conduct experiments with the Fairseq S2T Transformer, a state-of-the-art ASR model, on the biggest existing dataset, Common Voice zh-HK, and on our proposed MDCC; the results show the effectiveness of our dataset. In addition, we create a powerful and robust Cantonese ASR model by applying multi-dataset learning to MDCC and Common Voice zh-HK.
