Adapting Offline Speech Translation Models for Streaming with Future-Aware Distillation and Inference

03/14/2023
by Biao Fu, et al.

A popular approach to streaming speech translation is to employ a single offline model with a wait-k policy to support different latency requirements, which is simpler than training multiple online models for different latency constraints. However, using a model trained on complete utterances for streaming inference with partial input creates a mismatch: we demonstrate that speech representations extracted at the end of a streaming input differ significantly from those extracted from a complete utterance. To address this issue, we propose Future-Aware Streaming Translation (FAST), a new approach that adapts an offline ST model to streaming input. FAST includes a Future-Aware Inference (FAI) strategy that incorporates future context through a trainable masked embedding, and a Future-Aware Distillation (FAD) framework that transfers future context from an approximation of the full speech to the streaming input. Our experiments on the MuST-C En-De, En-Es, and En-Fr benchmarks show that FAST achieves better trade-offs between translation quality and latency than strong baselines. Extensive analyses suggest that our methods effectively alleviate the aforementioned mismatch between offline training and online inference.
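The abstract does not spell out how FAI and FAD are implemented, so the following is only a minimal PyTorch sketch of the two ideas as described above: appending trainable masked embeddings to a streaming prefix so an offline-trained encoder sees a stand-in for the missing future context (FAI), and distilling encoder states computed from (an approximation of) the full utterance into the states computed from the prefix (FAD). All module names, dimensions, the number of placeholder embeddings, and the MSE objective are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; names, shapes, and losses are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FutureAwareFrontend(nn.Module):
    """FAI (sketch): append trainable masked embeddings to a streaming prefix
    so that an offline-trained encoder receives a placeholder for the
    future context it expects from complete utterances."""

    def __init__(self, feature_dim: int, num_future: int = 8):
        super().__init__()
        # A single shared trainable embedding, repeated num_future times (assumption).
        self.mask_embedding = nn.Parameter(torch.randn(1, 1, feature_dim) * 0.02)
        self.num_future = num_future

    def forward(self, prefix: torch.Tensor) -> torch.Tensor:
        # prefix: (batch, frames_so_far, feature_dim)
        future = self.mask_embedding.expand(prefix.size(0), self.num_future, -1)
        # Encoder input = [observed frames ; trainable future placeholders].
        return torch.cat([prefix, future], dim=1)


def distillation_loss(student_states: torch.Tensor,
                      teacher_states: torch.Tensor) -> torch.Tensor:
    """FAD (sketch): pull the encoder states computed from the streaming prefix
    toward the states the teacher computed from the full utterance. An MSE
    objective is one plausible choice; the paper may use a different one."""
    prefix_len = student_states.size(1)
    return F.mse_loss(student_states, teacher_states[:, :prefix_len].detach())


if __name__ == "__main__":
    frontend = FutureAwareFrontend(feature_dim=80, num_future=8)
    prefix = torch.randn(2, 120, 80)        # a streaming prefix of 120 frames
    extended = frontend(prefix)             # (2, 128, 80): prefix + placeholders
    student = torch.randn(2, 120, 256)      # encoder states from the prefix
    teacher = torch.randn(2, 300, 256)      # encoder states from the full utterance
    print(extended.shape, distillation_loss(student, teacher).item())
```

In such a setup, the frontend would be trained while the offline ST model is adapted, and at inference time the extended prefix is fed to the encoder while a wait-k policy decides when the decoder emits each target token; how the paper combines these pieces in detail is described in the full text.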
