SHAS: Approaching optimal Segmentation for End-to-End Speech Translation

02/09/2022
by Ioannis Tsiamas, et al.

Speech translation models cannot directly process long audio, such as TED talks, which has to be split into shorter segments. Speech translation datasets provide manual segmentations of the audio, but these are not available in real-world scenarios, and existing segmentation methods usually reduce translation quality significantly at inference time. To bridge the gap between the manual segmentation used in training and the automatic one used at inference, we propose Supervised Hybrid Audio Segmentation (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented speech corpus. First, we train a classifier to identify the frames that are included in a segmentation, using speech representations from a pre-trained wav2vec 2.0. The optimal splitting points are then found by a probabilistic Divide-and-Conquer algorithm that progressively splits at the frame of lowest probability until all segments are below a pre-specified length. Experiments on MuST-C and mTEDx show that the translation of the segments produced by our method approaches the quality of the manual segmentation on 5 language pairs. Namely, SHAS retains 95-98% of the manual segmentation's BLEU score, compared to 87-93% for existing methods. Our method is additionally generalizable to different domains and achieves high zero-shot performance in unseen languages.
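The splitting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `split_segments`, the half-open `(start, end)` frame ranges, and the choice to drop the cut frame itself are assumptions made for illustration, and the per-frame probabilities are taken as given rather than produced by a wav2vec 2.0 classifier.

```python
from typing import List, Tuple


def split_segments(probs: List[float], max_len: int) -> List[Tuple[int, int]]:
    """Split frames into half-open (start, end) ranges of at most max_len frames.

    probs[i] is a hypothetical per-frame probability that frame i belongs
    inside a segment. Any range that is too long is cut at its
    lowest-probability interior frame; the cut frame itself is dropped
    (an assumption of this sketch).
    """
    if max_len < 2:
        raise ValueError("max_len must be at least 2")
    pending = [(0, len(probs))]
    done: List[Tuple[int, int]] = []
    while pending:
        start, end = pending.pop()
        if end - start <= max_len:
            if end > start:
                done.append((start, end))
            continue
        # Search only interior frames so both halves stay non-empty.
        cut = min(range(start + 1, end - 1), key=lambda i: probs[i])
        pending.append((start, cut))
        pending.append((cut + 1, end))
    return sorted(done)


if __name__ == "__main__":
    frame_probs = [0.9, 0.9, 0.1, 0.9, 0.9, 0.9, 0.2, 0.9, 0.9]
    print(split_segments(frame_probs, max_len=3))
```

With these toy probabilities, the first cut lands on frame 2 and the second on frame 6, so every remaining range is at most three frames long. A practical version would likely also need a minimum segment length and trimming of low-probability edges, which this sketch leaves out.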


research
04/23/2021

Beyond Voice Activity Detection: Hybrid Audio Segmentation for Direct Speech Translation

The audio segmentation mismatch between training data and those seen at ...
research
03/29/2022

Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation

Speech segmentation, which splits long speech into short segments, is es...
research
09/21/2022

Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora

Speech translation for subtitling (SubST) is the task of automatically t...
research
12/13/2017

A Multimodal Corpus of Expert Gaze and Behavior during Phonetic Segmentation Tasks

Phonetic segmentation is the process of splitting speech into distinct p...
research
12/19/2022

SegAugment: Maximizing the Utility of Speech Translation Data with Segmentation-based Augmentations

Data scarcity is one of the main issues with the end-to-end approach for...
research
06/23/2021

Dealing with training and test segmentation mismatch: FBK@IWSLT2021

This paper describes FBK's system submission to the IWSLT 2021 Offline S...
research
08/05/2020

Contextualized Translation of Automatically Segmented Speech

Direct speech-to-text translation (ST) models are usually trained on cor...
