HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation

06/20/2023
by Cihan Xiao, et al.

We introduce HK-LegiCoST, a new three-way parallel corpus of Cantonese-English translations, containing 600+ hours of Cantonese audio, its standard traditional Chinese transcript, and English translation, segmented and aligned at the sentence level. We describe the notable challenges in corpus preparation: segmentation, alignment of long audio recordings, and sentence-level alignment with non-verbatim transcripts. Such transcripts make the corpus suitable for speech translation research when there are significant differences between the spoken and written forms of the source language. Due to its large size, we are able to demonstrate competitive speech translation baselines on HK-LegiCoST and extend them to promising cross-corpus results on the FLEURS Cantonese subset. These results deliver insights into speech recognition and translation research in languages for which non-verbatim or “noisy” transcription is common due to various factors, including vernacular and dialectal speech.
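The sentence-level alignment challenge described in the abstract is the corpus's central technical hurdle: because the official transcripts are non-verbatim, aligned pairs must be recovered by fuzzy matching rather than exact string comparison. Below is a minimal, self-contained sketch of one such approach, greedily aligning ASR hypotheses for audio segments to transcript sentences by string similarity. It is an illustration under assumed inputs, not the paper's actual pipeline; the function names (similarity, align_hypotheses) and the window and threshold parameters are hypothetical.

```python
# Hypothetical sketch: fuzzy, monotonic alignment of noisy ASR hypotheses
# to non-verbatim transcript sentences. Not the HK-LegiCoST pipeline.
from difflib import SequenceMatcher


def similarity(hyp: str, ref: str) -> float:
    """Character-level similarity in [0, 1]; tolerant of small wording differences."""
    return SequenceMatcher(None, hyp, ref).ratio()


def align_hypotheses(hyps, refs, min_sim=0.5, window=5):
    """Greedily align ASR hypotheses to transcript sentences, left to right.

    Because transcripts are non-verbatim, a match is accepted when the
    similarity clears a threshold rather than requiring exact agreement.
    The forward search window keeps the alignment monotonic (respecting
    the temporal order of the recording) while skipping unmatched sentences.
    """
    pairs, j = [], 0
    for i, hyp in enumerate(hyps):
        best, best_j = 0.0, None
        for k in range(j, min(j + window, len(refs))):
            score = similarity(hyp, refs[k])
            if score > best:
                best, best_j = score, k
        if best_j is not None and best >= min_sim:
            pairs.append((i, best_j, best))
            j = best_j + 1  # advance past the matched sentence
    return pairs


if __name__ == "__main__":
    # Toy example: hypotheses differ in wording from the official transcript.
    hyps = ["the meeting will now resume", "members please be seated"]
    refs = ["The meeting now resumes.", "Members, please take your seats."]
    for i, j, score in align_hypotheses(hyps, refs):
        print(f"hyp {i} <-> ref {j}  (sim={score:.2f})")
```

The greedy monotonic search runs in linear time and respects the temporal order of the recording; a production pipeline would more likely use dynamic programming over the full document, but the thresholded-similarity idea is the same.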
