MuST-Cinema: a Speech-to-Subtitles corpus

02/25/2020
by   Alina Karakanta, et al.

The growing need to localise audiovisual content into multiple languages through subtitles calls for automatic solutions that support human subtitling. Neural Machine Translation (NMT) can contribute to the automatisation of subtitling, facilitating the work of human subtitlers and reducing turn-around times and related costs. NMT requires high-quality, large, task-specific training data. The existing subtitling corpora, however, lack both alignments to the source-language audio and important information about subtitle breaks. This is a significant limitation for developing efficient automatic approaches to subtitling, since the length and form of a subtitle depend directly on the duration of the utterance. In this work, we present MuST-Cinema, a multilingual speech translation corpus built from TED subtitles. The corpus consists of (audio, transcription, translation) triplets. Subtitle breaks are preserved by inserting special symbols. We show that the corpus can be used to build models that efficiently segment sentences into subtitles, and we propose a method for annotating existing subtitling corpora with subtitle breaks that conform to the length constraint.
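To make the idea of break symbols under a length constraint concrete, the following is a minimal sketch, not the method proposed in the paper. It greedily packs words into lines and marks the breaks with symbols. The marker strings `<eol>` and `<eob>`, the function name `segment`, and the defaults of 42 characters per line and 2 lines per subtitle are all assumptions chosen for illustration; the abstract does not specify them.

```python
# Hypothetical markers, assumed for illustration only.
EOL = "<eol>"  # line break inside a subtitle block
EOB = "<eob>"  # end of a subtitle block


def segment(sentence, max_chars=42, max_lines=2):
    """Greedily pack words into lines of at most max_chars characters
    and insert break symbols between them.

    A single word longer than max_chars is kept whole on its own line,
    so that line may exceed the limit.
    """
    lines, current = [], ""
    for word in sentence.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_chars:
            # Adding the word would overflow: close the current line.
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)

    # Close a block after every max_lines lines and after the last line;
    # every other line boundary is a line break within the block.
    out = []
    for i, line in enumerate(lines, start=1):
        out.append(line)
        is_last = i == len(lines)
        out.append(EOB if is_last or i % max_lines == 0 else EOL)
    return " ".join(out)


if __name__ == "__main__":
    print(segment("Subtitle breaks are preserved by inserting special "
                  "symbols into the transcription and the translation."))
```

A purely length-based rule like this ignores syntax and the timing of the utterance, which is the kind of information a model trained on break-annotated data would be expected to learn instead.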


