UniST: Unified End-to-end Model for Streaming and Non-streaming Speech Translation

09/15/2021 · Qianqian Dong et al. · ByteDance Inc.

This paper presents a unified end-to-end framework for both streaming and non-streaming speech translation. While the training recipes for non-streaming speech translation are mature, the recipes for streaming speech translation are yet to be built. In this work, we focus on developing a unified model (UniST) which supports streaming and non-streaming ST from the perspective of fundamental components, including the training objective, the attention mechanism, and the decoding policy. Experiments on the most popular speech-to-text translation benchmark dataset, MuST-C, show that UniST achieves significant improvement for non-streaming ST and a better-learned trade-off between BLEU score and latency metrics for streaming ST, compared with end-to-end baselines and cascaded models. We will make our code and evaluation tools publicly available.


1 Introduction

Speech translation (ST) aims at translating source-language speech into target-language text, and is widely useful in scenarios such as conference speeches, business meetings, cross-border customer service, and overseas travel. There are two kinds of application scenarios: non-streaming translation and streaming translation. Non-streaming models can listen to the complete utterance at once and then generate the translation afterwards, while streaming models need to balance latency and quality and generate translations based on the partial utterance, as shown in Figure 1.

Recently, end-to-end approaches have achieved great progress in non-streaming ST. Ansari et al. (2020) have shown that an end-to-end model can achieve even better performance than its cascaded competitors. Training recipes including pre-training Weiss et al. (2017); Bérard et al. (2018); Livescu and Goldwater (2019); Bansal et al. (2019); Alinejad and Sarkar (2020); Stoian et al. (2020), multi-task learning Weiss et al. (2017); Bérard et al. (2018); Liu et al. (2019, 2020a, 2020b); Dong et al. (2021b), and data augmentation Jia et al. (2019); Pino et al. (2019) play an important role in this success. However, end-to-end streaming ST is still not fully explored. Traditional streaming speech translation is usually built by cascading a streaming speech recognition module with a streaming machine translation module Oda et al. (2014); Dalvi et al. (2018). Most previous work focuses on simultaneous text translation Gu et al. (2017b). Ma et al. (2019a) propose a wait-k strategy based on the prefix-to-prefix framework, which has become one of the mainstream approaches to simultaneous text translation. For end-to-end streaming ST, Ma et al. (2020b); Ren et al. (2020); Ma et al. (2021b) bring the methodology of streaming machine translation into streaming speech translation and formalize the task; these are the first studies of simultaneous speech translation in an end-to-end manner.

Figure 1: An illustration of streaming speech-to-text translation. ST models listen to audio in the source language and generate tokens in the target language.

To complete the picture of end-to-end ST, we propose an end-to-end model that unifies both streaming and non-streaming ST (UniST), with a strong focus on the streaming approach. More specifically, we highlight our innovations and findings as follows: a) we propose UniST, a new model serving speech translation in both streaming and non-streaming scenarios; b) we equip UniST with an acoustic encoder, a semantic encoder, and a translation decoder, where the acoustic encoder improves performance by benefiting from the knowledge of a pre-trained model, and the semantic encoder softly aligns the information between audio and transcriptions to locate acoustic boundaries; c) we jointly train ASR and ST, as well as streaming and non-streaming ST, in one model in a multi-task manner; d) we introduce an adaptive decision policy for streaming inference, which adaptively aligns morphemes in the waveform with the transcribed words and improves streaming ST in both quality and latency; e) we validate UniST on real-world data, where our model achieves state-of-the-art results on the popular benchmark dataset MuST-C for both non-streaming and streaming speech translation.

2 Related Work

For speech translation, there are two main application scenarios: offline speech-to-text translation and simultaneous speech-to-text translation.

Non-streaming Speech-to-text Translation  Offline speech-to-text translation generates the translation taking the whole utterance as source input. Bérard et al. (2016) gave the first proof of the potential of end-to-end speech-to-text translation without using intermediate transcription. Pre-training-based methods Weiss et al. (2017); Bérard et al. (2018); Livescu and Goldwater (2019); Bansal et al. (2019); Alinejad and Sarkar (2020); Stoian et al. (2020); Dong et al. (2021a) can effectively use better-performing pre-trained models as initialization to speed up the convergence of the speech translation model. Multi-task learning Weiss et al. (2017); Bérard et al. (2018); Liu et al. (2020a) optimizes the model parameters more fully and improves the speech translation model with the aid of auxiliary tasks. Knowledge distillation has been proved efficient for learning from pre-trained models Liu et al. (2019, 2020b); Dong et al. (2021b). Indurthi et al. (2020) apply meta-learning methods to speech translation tasks. Similarly, Kano et al. (2017); Wang et al. (2020b) introduce curriculum learning with courses of increasing difficulty. To overcome data scarcity, Jia et al. (2019); Pino et al. (2019) augment data with pseudo-label generation, and Bahar et al. (2019); Di Gangi et al. (2019b); McCarthy et al. (2020) introduce noise-based feature enhancement. Zhang et al. (2020a) propose adaptive feature selection to eliminate uninformative features and improve performance.

Streaming Speech-to-text Translation  Streaming speech-to-text translation, also known as simultaneous speech translation, aims to translate audio into another language before the utterance is finished, according to certain strategies Fügen et al. (2007). Traditional streaming speech translation is usually built by cascading a streaming speech recognition module and a streaming machine translation module Oda et al. (2014); Dalvi et al. (2018): the speech recognition system continuously segments and transcribes the incoming audio, and the machine translation system continuously translates the text segments produced upstream. Most previous work focuses on simultaneous text translation. Gu et al. (2017b) learn an agent that decides when to read or write. Ma et al. (2019a) propose a wait-k strategy based on the prefix-to-prefix framework, which synchronizes the output after reading a fixed number of history tokens. Many follow-up works propose improvements based on adaptive wait-k Zheng et al. (2019); Zhang et al. (2020b); Zhang and Zhang (2020) and efficient decoding Elbayad et al. (2020); Zheng et al. (2020a). Monotonic attention methods Arivazhagan et al. (2019); Ma et al. (2019b); Schneider and Waibel (2020) have been proposed to model the monotonic alignment between input and output. Arivazhagan et al. (2020a, b) propose a re-translation strategy that allows the model to modify the decoding history to improve the performance of streaming translation. Research on streaming end-to-end speech translation is still relatively scarce. Ma et al. (2020b) propose SimulST, which applies the wait-k method from streaming machine translation Ma et al. (2019a) to streaming speech translation. Ren et al. (2020) propose SimulSpeech, together with two knowledge-distillation methods to guide the training of the streaming model and connectionist temporal classification (CTC) decoding to segment the audio stream in real time. Zheng et al. (2020b) apply self-adaptive training to reduce latency in simultaneous ST by accommodating different source speech rates.

3 Proposed Method: UniST

3.1 Overview

Figure 2: Overview of the proposed UniST. UniST consists of a masked acoustic model (MAM), a continuous integrate-and-fire (CIF) module, and standard Transformer blocks. The MAM extracts features from the raw audio waveform. CIF learns a soft, monotonic alignment over the features extracted by the MAM and outputs accumulated acoustic vectors as the input to the downstream Transformer blocks.

The detailed framework of our method is shown in Figure 2. Specifically, the end-to-end model takes the original audio features or raw waveform as input and generates the target text.

Problem Formulation  The speech translation corpus usually contains speech-transcription-translation triples $(\mathbf{x}, \mathbf{y}^{s}, \mathbf{y}^{t})$. Specifically, $\mathbf{x} = (x_1, \dots, x_{T_x})$ is a sequence of acoustic features, and $\mathbf{y}^{s}$ and $\mathbf{y}^{t}$ represent the corresponding transcription in the source language and the translation in the target language, respectively. Usually, the acoustic feature sequence $\mathbf{x}$ is much longer than the text sequences $\mathbf{y}^{s}$ and $\mathbf{y}^{t}$, as the sampling rate of audio is typically above 10,000 Hz and each word syllable (about 300 ms) is recorded by thousands of sampling points.

Non-streaming Translation  A non-streaming end-to-end speech translation system predicts the translated sentence auto-regressively, conditioned on the whole acoustic input $\mathbf{x}$:

$$p(\mathbf{y}^{t} \mid \mathbf{x}; \theta) = \prod_{i=1}^{T_t} p(y^{t}_i \mid \mathbf{y}^{t}_{<i}, \mathbf{x}; \theta), \qquad (1)$$

where $\theta$ denotes the parameters of the learned ST model and $T_t$ is the sentence length.

Streaming Translation  The online scenario requires the ST model to translate instantly as the speech audio streams in. The streaming translation model decodes a valid translated prefix given only the first $j$ acoustic features $\mathbf{x}_{\le j}$:

$$\hat{\mathbf{y}}^{t}_{\le \tau} = \operatorname*{arg\,max}_{\mathbf{y}^{t}_{\le \tau}} \; p(\mathbf{y}^{t}_{\le \tau} \mid \mathbf{x}_{\le j}; \theta), \qquad (2)$$

where $\tau$ is the maximum number of words that can be translated from the incomplete audio $\mathbf{x}_{\le j}$.

Note that in our research scenario, we require that in streaming ST the already-translated part of the sentence shall not be modified afterwards, which is similar to simultaneous machine translation Ma et al. (2019a).

3.2 Model Structure

UniST consists of an acoustic encoder, a semantic encoder, and a translation decoder. Among them, the acoustic encoder is augmented with an adaptive boundary detector implemented by the monotonic alignment mechanism.

3.2.1 Acoustic Encoder

Conventional feature extractors (log Mel-filterbank, FBANK) suffer reduced performance with insufficient training data San et al. (2021), which is especially the case in speech-to-text translation tasks. FBANK also leads to potential information loss and may corrupt long-term correlations Pardede et al. (2019).

To tackle such problems, we apply recently proposed self-supervised acoustic representations (SSAR) with pre-training Chen et al. (2020); Baevski et al. (2020) as the feature extractor for UniST instead of FBANK. SSAR learns speech representations in a self-supervised fashion, which alleviates the shortage of ST corpora, since SSAR requires only a large amount of unlabeled speech, which is easy to obtain. In this paper, we take Wav2Vec2 Baevski et al. (2020) as an example.
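As a concrete illustration of using a pre-trained SSAR model as the feature extractor, the sketch below loads a wav2vec 2.0 base checkpoint via torchaudio's bundled pipeline; this is an assumed stand-in for the fairseq checkpoint used in the paper, not the authors' code.

```python
# Minimal sketch: wav2vec 2.0 as the acoustic feature extractor (assumed stand-in
# for the fairseq wav2vec_small.pt checkpoint used in the paper).
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE        # self-supervised, LibriSpeech-pretrained
acoustic_encoder = bundle.get_model().eval()

waveform, sample_rate = torchaudio.load("utterance.wav")   # assumes a mono file
if sample_rate != bundle.sample_rate:                       # wav2vec 2.0 expects 16 kHz
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    # `features` is a list of frame-level representations, one entry per encoder layer;
    # the last entry plays the role of the acoustic vectors fed to the CIF module.
    features, _ = acoustic_encoder.extract_features(waveform)
hidden_states = features[-1]                                # shape: (batch, frames, dim)
```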

3.2.2 Monotonic Alignment Mechanism

Since audio sequences are usually much longer than text sequences, it is computationally demanding for a conventional encoder-decoder speech-to-text model to apply the global attention mechanism, and such a high computational cost cannot meet the requirements of streaming translation scenarios.

Therefore, this work introduces a monotonic alignment mechanism, continuous integrate-and-fire (CIF) Dong and Xu (2020), to relieve the drawbacks of existing end-to-end models.

Inspired by the integrate-and-fire (IF) model Lapique (1907); Abbott (1999), CIF is a differentiable version of IF that uses continuous functions to simulate the spike-and-fire process so as to support back-propagation. Specifically, an IF neuron forwardly integrates the stimulation from the streaming input signal and updates its membrane potential accordingly. Once the accumulated potential reaches a preset threshold, the neuron fires a spike and resets its accumulated potential. In speech-to-text models, CIF can use the spikes to locate acoustic boundaries. We use the simplified implementation of Yi et al. (2021), which introduces no additional parameters except for a fully connected layer. Specifically, the last dimension of the hidden states is used to calculate the firing weight $\alpha$, while the remaining dimensions are used to accumulate information. The formalized operations in CIF are:

$$\alpha_i = \sigma(h_{i,d}), \qquad (3)$$
$$\tilde{h}_i = h_{i,1:d-1}, \qquad (4)$$
$$\hat{S} = \sum_{i=1}^{T} \alpha_i, \qquad (5)$$
$$c_j = \sum_{i \in S_j} \alpha_i \, \tilde{h}_i, \qquad (6)$$

where $\mathbf{h} = (h_1, \dots, h_T)$ represents the acoustic vectors of length $T$ (with hidden dimension $d$), $\mathbf{c} = (c_1, \dots, c_L)$ represents the accumulated acoustic vectors of length $L$, $\hat{S}$ represents the predicted decoding length, and $S_j$ represents the $j$-th segment of $\boldsymbol{\alpha}$ in which the sum of $\alpha$ exceeds 1.0. An L2-norm quantity loss is introduced to supervise CIF to generate the correct number of final modeling units:

$$\mathcal{L}_{\text{qua}} = \big\lVert \hat{S} - T_s \big\rVert_2, \qquad (7)$$

where $T_s$ is the length of the transcription.

During inference, an extra rounding operation is applied to $\hat{S}$ to simulate the integer unit count used in training. Based on the matched sequence length, the accumulated acoustic vectors are mapped back to the model dimension by a randomly initialized fully connected layer.
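The following is a minimal, unbatched sketch of the simplified CIF step described above (firing weight from the last hidden dimension, accumulation until the threshold of 1.0 is crossed); the exact scaling and tail handling of the authors' implementation may differ.

```python
# Sketch of continuous integrate-and-fire (CIF) over one utterance (assumptions,
# not the authors' implementation).
import torch

def cif(hidden, threshold=1.0):
    """hidden: (frames, dim). Returns one integrated vector per fired boundary."""
    alphas = torch.sigmoid(hidden[:, -1])        # firing weight from the last dimension
    content = hidden[:, :-1]                     # information-carrying dimensions
    fired, acc_w = [], 0.0
    acc_v = torch.zeros(content.size(1))
    for a, h in zip(alphas, content):
        a = float(a)
        if acc_w + a < threshold:                # integrate: boundary not reached yet
            acc_w += a
            acc_v = acc_v + a * h
        else:                                    # fire: split the weight at the boundary
            r = threshold - acc_w                # portion used to finish the current unit
            fired.append(acc_v + r * h)
            acc_w = a - r                        # leftover weight starts the next unit
            acc_v = acc_w * h
    return torch.stack(fired) if fired else content.new_zeros(0, content.size(1))

# The predicted decoding length sum(alphas) is the quantity supervised by the L2 loss.
```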

3.3 Multi-task Joint Training with Transformer

UniST uses a Transformer Vaswani et al. (2017) with stacked encoder and decoder layers to jointly fulfill the ST and ASR tasks. The encoder takes the contracted acoustic vectors $\mathbf{c}$ from the CIF layer as input and aims to extract the semantic features of the input audio. The parameters of the Transformer are shared between the two speech-to-text tasks, i.e., ASR and ST.

To distinguish the tasks, we add special task indicators at the beginning of the text as the BOS token for decoding. For example, if the audio input for "Thank you ." is in English, then for ASR we use [en] as the BOS and decode $\mathbf{y}^{s}$ = "[en] Thank you .", and we add [De] at the start of the German translation, so $\mathbf{y}^{t}$ is "[De] Danke .".

The cross-entropy losses of ST and ASR are defined in Equations (8) and (9), respectively:

$$\mathcal{L}_{\text{ST}} = -\sum_{i=1}^{T_t} \log p(y^{t}_i \mid \mathbf{y}^{t}_{<i}, \mathbf{x}; \theta), \qquad (8)$$
$$\mathcal{L}_{\text{ASR}} = -\sum_{i=1}^{T_s} \log p(y^{s}_i \mid \mathbf{y}^{s}_{<i}, \mathbf{x}; \theta), \qquad (9)$$

where the decoder probability $p(\cdot)$ is calculated by the final softmax layer based on the output of the decoder.

The overall objective function is the weighted sum of all the aforementioned losses:

$$\mathcal{L} = \lambda_{\text{ST}} \, \mathcal{L}_{\text{ST}} + \lambda_{\text{ASR}} \, \mathcal{L}_{\text{ASR}} + \lambda_{\text{qua}} \, \mathcal{L}_{\text{qua}}. \qquad (10)$$

In the following experimental sections, $\lambda_{\text{qua}}$ is set to 0.05, and $\lambda_{\text{ST}}$ and $\lambda_{\text{ASR}}$ are set to 1 by default.
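As an illustration of this objective, the following sketch combines the three losses with the weights above; the model interface (`model(...)`, `model.cif_alphas`, `model.pad_id`) and the batch fields are hypothetical placeholders, not the released implementation.

```python
# Sketch of the multi-task objective: ST and ASR share one Transformer, distinguished
# only by the task-indicator BOS token, and the CIF quantity loss is added with a
# small weight (assumed interfaces, not the authors' code).
import torch
import torch.nn.functional as F

LAMBDA_ST, LAMBDA_ASR, LAMBDA_QUA = 1.0, 1.0, 0.05   # default weights from the paper

def multitask_loss(model, batch):
    # ST branch: decode e.g. "[De] Danke ." from the audio
    st_logits = model(batch["audio"], prev_tokens=batch["tgt_in"], task_bos="[De]")
    loss_st = F.cross_entropy(st_logits.transpose(1, 2), batch["tgt_out"],
                              ignore_index=model.pad_id)
    # ASR branch: decode e.g. "[en] Thank you ." from the same audio, shared parameters
    asr_logits = model(batch["audio"], prev_tokens=batch["src_in"], task_bos="[en]")
    loss_asr = F.cross_entropy(asr_logits.transpose(1, 2), batch["src_out"],
                               ignore_index=model.pad_id)
    # CIF quantity loss: predicted length (sum of alphas) vs. transcription length
    loss_qua = torch.norm(model.cif_alphas.sum(dim=-1) - batch["src_len"].float(), p=2)
    return LAMBDA_ST * loss_st + LAMBDA_ASR * loss_asr + LAMBDA_QUA * loss_qua
```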

3.4 Inference Strategies

As UniST supports both the offline and online speech translation scenarios introduced in Section 3.1, we here describe how the trained UniST is applied to these two scenarios.

3.4.1 Non-streaming Translation

For non-streaming speech translation, UniST decodes the translated sentence in an auto-regressive manner. The translation process is similar to that of other offline ST models: UniST translates the source waveform with the MAM via a Seq2Seq framework. UniST first extracts audio features from the raw audio and aligns them into integrated embeddings via the CIF module. The integrated embeddings then serve as the input embeddings for the Transformer blocks, which decode the translation auto-regressively.

3.4.2 Streaming Translation

Prefix Decision  Previous work on online ST uses the wait-k policy for simultaneous decoding Ma et al. (2021a, 2020b). The policy originates from simultaneous machine translation Ma et al. (2019a): the model first waits for k source tokens and then translates target tokens concurrently with the incoming source stream (i.e., the translated sentence is k tokens behind the source). As the source modality is audio rather than text in the speech-to-text scenario, researchers generally specify a certain time span as the "token" for the wait-k policy.

Adaptive Decision  The main drawback of the Prefix Decision strategy is that the length of the time span is fixed, while the speaking rate of each speaker and the duration of each morpheme differ. A fixed time stride cannot guarantee that a proper length of waveform is read in each time the model generates a new token: when the time stride is small, the information obtained by each waveform read is not adequate, resulting in defective generation; on the contrary, when the time stride is too large, the latency becomes a performance bottleneck.

Input: the waveform sequence x, the trained UniST model M with its CIF module, wait lagging k
Output: the translated sentence ŷ
1  initialization: the read waveform segment s ← ∅, the output sentence ŷ ← ∅;
2  while the last token of ŷ is not EndOfSentence do
3      calculate the CIF integrated embedding c ← CIF(s);
4      if the waveform x is finished then
           /* the waveform is finished: write new token */
5          ŷ ← ŷ ⊕ M.decode(c, ŷ);
6          M.decoder.update();
7      else if |c| − |ŷ| < k then
           /* read waveform */
8          s ← s ⊕ read(x);
9          M.encoder.update();
10     else
           /* write new token */
11         ŷ ← ŷ ⊕ M.decode(c, ŷ);
12         M.decoder.update();
13     end if
14 end while
15 return ŷ;
Algorithm 1: Incremental Encoding-Decoding

We hence propose a new simultaneous translation strategy for online ST, namely the Incremental Encoding-Decoding strategy. Our policy adaptively decides when to write a new token according to the length of the CIF integrated embedding (i.e., $\mathbf{c}$ in Eq. (6)). Since CIF accumulates the acoustic information monotonically, the model can estimate the acoustic boundary of each morpheme in the audio, and we use the integrated feature as a basis to tell whether the information carried by the read waveform segment is sufficient.

Hence, we revise the drawback of previous work that used a fixed-length time span as the source "token" for the wait-k policy, and propose our new decoding policy in Algorithm 1. During online ST, the model must decide at every step whether to read new audio frames or to translate a new word, known as the READ/WRITE decision. We denote by $\mathbf{s}$ the audio sub-sequence that the model has READ from the source and by $\hat{\mathbf{y}}$ the sentence prefix that has already been generated. We modify the wait-k policy for UniST so that it makes the READ/WRITE decision according to the length difference between the CIF integrated embedding $\mathbf{c}$ and the generated prefix $\hat{\mathbf{y}}$: when $\hat{\mathbf{y}}$ is at least $k$ words behind $\mathbf{c}$, UniST generates a new token and updates the decoder recursively; otherwise, the model waits, reads more of the audio stream, and updates the encoder states.
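A minimal sketch of this READ/WRITE loop is given below. It mirrors Algorithm 1 but assumes hypothetical helper methods (`count_cif_units`, `update_encoder`, `decode_next`, `update_decoder`) rather than the actual UniST interfaces.

```python
# Sketch of the incremental encoding-decoding policy: READ more audio while the CIF
# boundary count is fewer than k units ahead of the generated prefix, else WRITE.
EOS = "</s>"

def incremental_decode(model, audio_stream, k=3, chunk_ms=280, max_len=200):
    read_audio, hypothesis = [], []
    while (not hypothesis or hypothesis[-1] != EOS) and len(hypothesis) < max_len:
        # the number of fired CIF units estimates how many source "words" were heard
        num_units = model.count_cif_units(read_audio)       # assumed helper
        source_finished = audio_stream.finished()
        if not source_finished and num_units - len(hypothesis) < k:
            read_audio.append(audio_stream.read(chunk_ms))  # READ
            model.update_encoder(read_audio)
        else:
            token = model.decode_next(hypothesis)           # WRITE
            hypothesis.append(token)
            model.update_decoder(token)
    return hypothesis
```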

4 Experiments

4.1 Dataset and Preprocessing

Dataset  MuST-C (https://ict.fbk.eu/must-c/) (Di Gangi et al., 2019a) is a multilingual speech translation corpus with triplet data sources: source audio, transcripts, and text translations. To the best of our knowledge, MuST-C is currently the largest speech translation dataset available. It includes data from English TED talks with transcripts and translations auto-aligned at the sentence level. We mainly conduct experiments on the English-German and English-French language pairs, and we use the dev and tst-COMMON sets as our development and test data, respectively.

Preprocessing  For speech input, the 16-bit raw wave sequences are normalized by a factor of $2^{15}$ to the range of $[-1, 1)$. For text input, all texts (ST transcripts and translations) of each translation pair are pre-processed in the same way. Texts are case-sensitive; punctuation is kept, split from words, and normalized; non-print punctuation is removed. The sentences are then tokenized with the Moses tokenizer (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl). Samples whose number of source or target tokens exceeds 250, or whose ratio of source and target text lengths falls outside a predefined range, are filtered out. For subword modeling, we use a unigram SentencePiece model Kudo and Richardson (2018) with a dictionary size of 10,000. For each translation direction, the SentencePiece model is learned on all text data from the ST corpora.
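A small sketch of the corresponding preprocessing steps is shown below (16-bit PCM normalization and unigram SentencePiece training with a 10k vocabulary); the file names and corpus layout are hypothetical.

```python
# Sketch of the preprocessing described above (assumptions, not the released pipeline).
import numpy as np
import sentencepiece as spm

def normalize_pcm16(samples: np.ndarray) -> np.ndarray:
    """Scale 16-bit integer samples into [-1, 1) as floats."""
    return samples.astype(np.float32) / 32768.0      # 2**15

# train one unigram SentencePiece model per translation direction on all ST text data
spm.SentencePieceTrainer.train(
    input="train.en,train.de",       # hypothetical file names
    model_prefix="unigram10k",
    vocab_size=10000,
    model_type="unigram",
)
sp = spm.SentencePieceProcessor(model_file="unigram10k.model")
print(sp.encode("Danke .", out_type=str))            # subword pieces for a sample sentence
```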

System (streaming settings with increasing latency budget → offline)
ASR (WER) 51.9 43.9 42.1 40.6 39.6 39.0 38.7 16.25
MT (BLEU) 17.11 19.68 22.80 24.93 26.44 27.20 27.83 31.28
Cascaded ST (BLEU) 9.72 11.24 12.74 13.92 14.59 15.22 15.51 17.60
Table 1: Results of cascaded systems on the MuST-C tst-COMMON test set. In the last column, the streaming model degrades into an offline model without the beam-search decoding strategy. The ASR model is based on SpeechTransformer Dong et al. (2018), and the MT model is based on Transformer Vaswani et al. (2017). Since the delays of a cascaded system involving speech recognition and text translation modules are more complicated, latency metrics are not reported here. Note that the cascaded ST consists of a streaming ASR (wait-440ms, segment-40ms) system and a streaming MT system.

4.2 Model and Experimental Configuration

Model Configuration  For audio input, the Wav2Vec2 module follows the base configuration of Baevski et al. (2020) (https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt), with parameters self-supervised pre-trained on LibriSpeech audio data only. The subsequent shared Transformer module has a hidden dimension of 768 and 4 attention heads. The encoder has 8 layers and the decoder has 6 layers.

Experimental Configuration  We use the Adam optimizer with 4k warm-up updates and an inverse square root learning rate schedule. We set the maximum training batch to 3.2 million waveform audio tokens. We average the 10 consecutive checkpoints around the one with the best dev loss and adopt a beam size of 5. We implement our models in Fairseq Ott et al. (2019).
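For concreteness, the inverse square root schedule with a 4k-step warm-up can be sketched as follows; the peak learning rate here is an assumed placeholder, not the paper's reported setting.

```python
# Sketch of an inverse square root learning rate schedule with linear warm-up.
def inverse_sqrt_lr(step: int, peak_lr: float = 1e-4, warmup: int = 4000) -> float:
    """Linear warm-up to peak_lr over `warmup` steps, then decay ∝ 1/sqrt(step)."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (warmup ** 0.5) / (step ** 0.5)

# e.g. the LR reaches peak_lr at step 4000 and decays to peak_lr / 2 by step 16000
```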

4.3 Evaluation

For offline translation, the model's performance is mainly evaluated with quality metrics, while for streaming translation the ST model is evaluated with latency-quality trade-off curves.

Quality Metrics  The translation accuracy is quantified with detokenized BLEU (Papineni et al., 2002) using sacreBLEU (https://github.com/mjpost/sacrebleu).

Latency Metrics  Existing simultaneous translation works mainly focus on the latency evaluation of text translation and have proposed computation-unaware metrics such as Average Proportion (AP) Cho and Esipova (2016), Average Lagging (AL) Ma et al. (2019a), Consecutive Wait length (CW) Gu et al. (2017a), and Differentiable Average Lagging (DAL) Cherry and Foster (2019). Ma et al. (2020a) extend these latency metrics from text translation to speech translation, including AL, AP and DAL. The latency of streaming UniST is evaluated with AL, DAL and AP using the SimulEval toolkit (https://github.com/facebookresearch/SimulEval) Ma et al. (2020a).
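For reference, the text-based Average Lagging of Ma et al. (2019a) takes the form below; SimulEval adapts it to speech by measuring the source in time (milliseconds) rather than tokens, so this is an illustrative sketch rather than the exact toolkit definition:

$$\mathrm{AL} = \frac{1}{\tau} \sum_{i=1}^{\tau} \left[ g(i) - \frac{i-1}{\gamma} \right], \qquad \gamma = \frac{|\mathbf{y}|}{|\mathbf{x}|}, \qquad \tau = \min \{\, i \mid g(i) = |\mathbf{x}| \,\},$$

where $g(i)$ is the number of source units that have been read when the $i$-th target token is emitted.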

4.4 Experimental Results

4.4.1 Cascaded System

We build a streaming cascaded system as a baseline by cascading a streaming speech recognition model with a text translation model. Note that the transcription generated by the ASR system in the streaming cascaded system is likewise not correctable afterwards. The results are shown in Table 1. The error-accumulation problem of the cascaded system persists in the streaming setting.

4.4.2 Non-streaming Speech-to-text Translation

We compare the performance of our method with existing models on offline translation tasks. The results are shown in Table 2. For the English-German language pair, joint training with the auxiliary ASR task gives a performance gain of 0.8 BLEU for UniST, and CIF brings an additional 1.5 BLEU gain to our method. The results show consistent improvements on the English-French language pair. This verifies the clear advantage of the monotonic soft attention mechanism of CIF in extracting acoustic representations.

Model MuST-C EN-X
EN-DE EN-FR
SpeechTransformer  Dong et al. (2018) 21.7 31.6
Transformer ST FairSeq  Wang et al. (2020a) 22.7 32.9
Transformer ST Espnet  Inaguma et al. (2020) 22.9 32.8
Adaptive Feature Selection  Zhang et al. (2020a) 22.4 31.6
Wav2Vec2 + Transformer  Han et al. (2021) 22.3 34.3
STAST Liu et al. (2020b) 23.1 -
W-Transf Ye et al. (2021) 23.6 34.6
Dual-Decoder Transformer  Le et al. (2020) 23.6 33.5
UniST 21.9 33.8
UniST-Joint 22.7 34.4
UniST-CIF 24.9 34.9
Table 2: Results of non-streaming ST models on the MuST-C tst-COMMON test set. The SpeechTransformer row reports results from Dong et al. (2018) reimplemented by ourselves. "UniST-Joint" means UniST equipped with multi-task joint learning with the ASR task. "UniST-CIF" means "UniST-Joint" augmented with the monotonic alignment module in the acoustic encoder. Note that some previous works use external datasets or adopt the multilingual setting, and thus cannot be compared with UniST directly.

4.4.3 Streaming Speech-to-text Translation

Figure 3: The translation quality against the latency metrics (DAL, AP and AL) on the tst-COMMON set of the MuST-C En-De dataset. The decoding strategy here is the prefix decision. The wait lagging k in SimulEval is set to 5 by default. Points on the curve correspond to strides of 120ms, 200ms, 360ms, 400ms, 440ms, 600ms, 800ms, 1000ms and 40000ms, respectively.

We compare with the existing SimulST Ma et al. (2020b) on streaming speech translation tasks. SimulST introduces the wait-k training strategy of simultaneous text translation into simultaneous speech translation. The results are shown in Figure 3. UniST is significantly better than the baseline system on all three latency metrics. At the same time, our method can be trained in parallel without the help of the wait-k strategy, which observably improves training efficiency. Compared with the results of the cascaded system in Table 1, UniST also has clear advantages in terms of quality metrics.

Figure 4: The translation quality against the latency metrics (DAL, AP and AL) on the tst-COMMON set of the MuST-C En-De dataset. The decoding strategy here is the prefix decision. Points on the curve correspond to wait lagging values k in SimulEval of 5, 7, 9, 10, 15 and 20, respectively.

5 Analysis

5.1 Effects of Prefix Decision

For the prefix decision decoding strategy, the setting of the stride parameter is very important. In Figure 4, we compare the influence of different strides on the prefix decision strategy. Increasing the stride within a certain range has a positive impact on the latency-BLEU trade-off, but the model also tends to fall into the region of larger latency.

5.2 Effects of Adaptive Decision

Figure 5: The translation quality against the latency metric (DAL) on the tst-COMMON set of the MuST-C En-De dataset. Prefix decision is tested with a stride size of 400ms, and adaptive decision is tested with a stride size of 280ms.
En (Source) If you have something to give , give it now .
De (target) Wenn Sie etwas zu geben haben , geben Sie es jetzt .
ASR If you have something to give and give it now .
Cascades Wenn Sie etwas zu geben haben und es jetzt geben .
UniST Wenn Sie etwas geben , geben Sie es jetzt .
Table 3: An example from the test set of the MuST-C En-De dataset. "ASR" means a streaming ASR system with 440ms waiting latency. "Cascades" means a streaming pipeline containing ASR (wait-440ms) and NMT (wait-3). "UniST" represents UniST with the prefix decision and wait-3 strategy.

We have proposed an adaptive decision strategy in Section 3.4.2. To focus on the latency metrics, we select the first 100 samples longer than 10s from the tst-COMMON set to build a test subset, and compare the performance of the adaptive decision and the prefix decision. The results are shown in Figure 5. They show that for a longer audio stream, a greater latency cost is required to obtain a comparable quality. At the same time, the average performance of the streaming model on longer audio is worse than on shorter audio, as shown in Figure 3. Compared with the fixed-stride decoding method, the adaptive decoding method achieves a better balance between latency and quality. Through observation, we find that the adaptive decoding method can ignore silent segments; for example, after predicting a punctuation mark, it keeps reading to accumulate enough source semantic information. In addition, the adaptive decoding method could further reduce latency by setting the number of write operations, once sufficient information has been accumulated, according to the length ratio between source and target sentences for different language pairs, which requires further exploration.

5.3 Case Study

In Table 3, we show an example of simultaneous decoding for the cascaded system and the end-to-end system. The cascaded system suffers from error accumulation and delay accumulation, while the end-to-end model has inherent advantages in this respect. In this example, our method can attend to the speaker's prosody in the original audio input, such as pauses, and thus accurately predicts the punctuation in the target-language text.

6 Conclusion

We propose UniST, a novel and unified training framework for joint offline and online speech-to-text translation. UniST consists of a masked acoustic model, a continuous integrate-and-fire module, and standard Transformer encoder-decoder blocks. Compared with conventional speech-to-text models, UniST has the following advantages: a) the masked acoustic model extracts features from the raw waveform, avoiding the latency caused by audio segmentation, and improves model performance via pre-training; b) the continuous integrate-and-fire module aligns the transcription length with the integrated embedding, based on which we propose an incremental encoding-decoding policy to adaptively decrease latency in online speech translation; c) UniST supports joint training with ASR and multilingual data, which enhances model performance and offers strong scalability.

Experiments on the MuST-C dataset validate the effectiveness of UniST over several state-of-the-art baselines. The results show that UniST rivals or surpasses strong offline translation baselines while significantly improving quality and reducing latency in the online scenario. Furthermore, we inspect the effects of our decoding policy via extensive experiments and case studies.

References

  • L. F. Abbott (1999) Lapicque’s introduction of the integrate-and-fire model neuron (1907). Brain research bulletin 50 (5-6), pp. 303–304. Cited by: §3.2.2.
  • A. Alinejad and A. Sarkar (2020) Effectively pretraining a speech translation decoder with machine translation data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 8014–8020. Cited by: §1, §2.
  • E. Ansari, A. Axelrod, N. Bach, O. Bojar, R. Cattoni, F. Dalvi, N. Durrani, M. Federico, C. Federmann, J. Gu, F. Huang, K. Knight, X. Ma, A. Nagesh, M. Negri, J. Niehues, J. Pino, E. Salesky, X. Shi, S. Stüker, M. Turchi, A. H. Waibel, and C. Wang (2020) FINDINGS OF THE IWSLT 2020 EVALUATION CAMPAIGN. In Proceedings of the 17th International Conference on Spoken Language Translation, IWSLT 2020, pp. 1–34. Cited by: §1.
  • N. Arivazhagan, C. Cherry, W. Macherey, C. Chiu, S. Yavuz, R. Pang, W. Li, and C. Raffel (2019) Monotonic infinite lookback attention for simultaneous machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1313–1323. Cited by: §2.
  • N. Arivazhagan, C. Cherry, W. Macherey, and G. Foster (2020a) Re-translation versus streaming for simultaneous translation. In Proceedings of the 17th International Conference on Spoken Language Translation, pp. 220–227. Cited by: §2.
  • N. Arivazhagan, C. Cherry, I. Te, W. Macherey, P. Baljekar, and G. Foster (2020b) Re-translation strategies for long form, simultaneous, spoken language translation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7919–7923. Cited by: §2.
  • A. Baevski, Y. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems, Vol. 33. Cited by: §3.2.1, §4.2.
  • P. Bahar, A. Zeyer, R. Schlüter, and H. Ney (2019) On using specaugment for end-to-end speech translation. arXiv preprint arXiv:1911.08876. Cited by: §2.
  • S. Bansal, H. Kamper, K. Livescu, A. Lopez, and S. Goldwater (2019) Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 58–68. Cited by: §1, §2.
  • A. Bérard, L. Besacier, A. C. Kocabiyikoglu, and O. Pietquin (2018) End-to-end automatic speech translation of audiobooks. In ICASSP, pp. 6224–6228. Cited by: §1, §2.
  • A. Bérard, O. Pietquin, C. Servan, and L. Besacier (2016) Listen and translate: a proof of concept for end-to-end speech-to-text translation. arXiv preprint arXiv:1612.01744. Cited by: §2.
  • J. Chen, M. Ma, R. Zheng, and L. Huang (2020) MAM: masked acoustic modeling for end-to-end speech-to-text translation. arXiv preprint arXiv:2010.11445. Cited by: §3.2.1.
  • C. Cherry and G. Foster (2019) Thinking slow about latency evaluation for simultaneous machine translation. arXiv preprint arXiv:1906.00048. Cited by: §4.3.
  • K. Cho and M. Esipova (2016) Can neural machine translation do simultaneous translation? arXiv preprint arXiv:1606.02012. Cited by: §4.3.
  • F. Dalvi, N. Durrani, H. Sajjad, and S. Vogel (2018) Incremental decoding and training methods for simultaneous translation in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 493–499. Cited by: §1, §2.
  • M. A. Di Gangi, R. Cattoni, L. Bentivogli, M. Negri, and M. Turchi (2019a) Must-c: a multilingual speech translation corpus. In 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2012–2017. Cited by: §4.1.
  • M. A. Di Gangi, M. Negri, V. N. Nguyen, T. Amirhossein, and M. Turchi (2019b) Data augmentation for end-to-end speech translation: fbk@ iwslt’19. In 16th International Workshop on Spoken Language Translation 2019, Cited by: §2.
  • L. Dong and B. Xu (2020) CIF: continuous integrate-and-fire for end-to-end speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6079–6083. Cited by: §3.2.2.
  • L. Dong, S. Xu, and B. Xu (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In ICASSP, pp. 5884–5888. Cited by: Table 1, Table 2.
  • Q. Dong, M. Wang, H. Zhou, S. Xu, B. Xu, and L. Li (2021a) Consecutive decoding for speech-to-text translation. In The Thirty-fifth AAAI Conference on Artificial Intelligence, AAAI. Cited by: §2.
  • Q. Dong, R. Ye, M. Wang, H. Zhou, S. Xu, B. Xu, and L. Li (2021b) “Listen, understand and translate”: triple supervision decouples end-to-end speech-to-text translation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 12749–12759. Cited by: §1, §2.
  • M. Elbayad, L. Besacier, and J. Verbeek (2020) Efficient wait-k models for simultaneous machine translation. arXiv preprint arXiv:2005.08595. Cited by: §2.
  • C. Fügen, A. Waibel, and M. Kolss (2007) Simultaneous translation of lectures and speeches. Machine translation 21 (4), pp. 209–252. Cited by: §2.
  • J. Gu, G. Neubig, K. Cho, and V. O. Li (2017a) Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 1053–1062. Cited by: §4.3.
  • J. Gu, G. Neubig, K. Cho, and V. O. Li (2017b) Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 1053–1062. Cited by: §1, §2.
  • C. Han, M. Wang, H. Ji, and L. Li (2021) Learning shared semantic space for speech-to-text translation. External Links: 2105.03095 Cited by: Table 2.
  • H. Inaguma, S. Kiyono, K. Duh, S. Karita, N. Yalta, T. Hayashi, and S. Watanabe (2020) ESPnet-st: all-in-one speech translation toolkit. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 302–311. Cited by: Table 2.
  • S. Indurthi, H. Han, N. K. Lakumarapu, B. Lee, I. Chung, S. Kim, and C. Kim (2020) End-end speech-to-text translation with modality agnostic meta-learning. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7904–7908. Cited by: §2.
  • Y. Jia, M. Johnson, W. Macherey, R. J. Weiss, Y. Cao, C. Chiu, N. Ari, S. Laurenzo, and Y. Wu (2019) Leveraging weakly supervised data to improve end-to-end speech-to-text translation. In ICASSP, pp. 7180–7184. Cited by: §1, §2.
  • T. Kano, S. Sakti, and S. Nakamura (2017) Structured-based curriculum learning for end-to-end english-japanese speech translation. Proc. Interspeech 2017, pp. 2630–2634. Cited by: §2.
  • T. Kudo and J. Richardson (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71. Cited by: §4.1.
  • L. Lapique (1907) Recherches quantitatives sur l’excitation electrique des nerfs traitee comme une polarization.. Journal of Physiology and Pathololgy 9, pp. 620–635. Cited by: §3.2.2.
  • H. Le, J. Pino, C. Wang, J. Gu, D. Schwab, and L. Besacier (2020) Dual-decoder transformer for joint automatic speech recognition and multilingual speech translation. pp. 3520–3533. Cited by: Table 2.
  • Y. Liu, H. Xiong, Z. He, J. Zhang, H. Wu, H. Wang, and C. Zong (2019) End-to-end speech translation with knowledge distillation. arXiv preprint arXiv:1904.08075. Cited by: §1, §2.
  • Y. Liu, J. Zhang, H. Xiong, L. Zhou, Z. He, H. Wu, H. Wang, and C. Zong (2020a) Synchronous speech recognition and speech-to-text translation with interactive decoding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 8417–8424. Cited by: §1, §2.
  • Y. Liu, J. Zhu, J. Zhang, and C. Zong (2020b) Bridging the modality gap for speech-to-text translation. arXiv preprint arXiv:2010.14920. Cited by: §1, §2, Table 2.
  • S. B. H. K. K. Livescu and A. L. S. Goldwater (2019) Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. pp. 58–68. Cited by: §1, §2.
  • M. Ma, L. Huang, H. Xiong, K. Liu, C. Zhang, Z. He, H. Liu, X. Li, and H. Wang (2019a) Stacl: simultaneous translation with integrated anticipation and controllable latency. pp. 3025–3036. Cited by: §1, §2, §3.1, §3.4.2, §4.3.
  • X. Ma, M. J. Dousti, C. Wang, J. Gu, and J. Pino (2020a) SIMULEVAL: an evaluation toolkit for simultaneous translation. pp. 144–150. Cited by: §4.3.
  • X. Ma, J. Pino, J. Cross, L. Puzon, and J. Gu (2019b) Monotonic multihead attention. arXiv preprint arXiv:1909.12406. Cited by: §2.
  • X. Ma, J. Pino, and P. Koehn (2020b) SimulMT to simulst: adapting simultaneous text translation to end-to-end simultaneous speech translation. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pp. 582–587. Cited by: §1, §2, §3.4.2, §4.4.3.
  • X. Ma, Y. Wang, M. J. Dousti, P. Koehn, and J. Pino (2021a) Streaming simultaneous speech translation with augmented memory transformer. pp. 7523–7527. Cited by: §3.4.2.
  • X. Ma, Y. Wang, M. J. Dousti, P. Koehn, and J. Pino (2021b) Streaming simultaneous speech translation with augmented memory transformer. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7523–7527. Cited by: §1.
  • A. D. McCarthy, L. Puzon, and J. Pino (2020) SkinAugment: auto-encoding speaker conversions for automatic speech translation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7924–7928. Cited by: §2.
  • Y. Oda, G. Neubig, S. Sakti, T. Toda, and S. Nakamura (2014) Optimizing segmentation strategies for simultaneous speech translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 551–556. Cited by: §1, §2.
  • M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In NAACL-HLT (Demonstrations), Cited by: §4.2.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th ACL, pp. 311–318. Cited by: §4.3.
  • H. F. Pardede, V. Zilvan, D. Krisnandi, A. Heryana, and R. B. S. Kusumo (2019) Generalized filter-bank features for robust speech recognition against reverberation. In 2019 International Conference on Computer, Control, Informatics and its Applications (IC3INA), pp. 19–24. Cited by: §3.2.1.
  • J. Pino, L. Puzon, J. Gu, X. Ma, A. D. McCarthy, and D. Gopinath (2019) Harnessing indirect training data for end-to-end automatic speech translation: tricks of the trade. arXiv preprint arXiv:1909.06515. Cited by: §1, §2.
  • Y. Ren, J. Liu, X. Tan, C. Zhang, Q. Tao, Z. Zhao, and T. Liu (2020) SimulSpeech: end-to-end simultaneous speech to text translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3787–3796. Cited by: §1, §2.
  • N. San, M. Bartelds, M. Browne, L. Clifford, F. Gibson, J. Mansfield, D. Nash, J. Simpson, M. Turpin, M. Vollmer, et al. (2021) Leveraging neural representations for facilitating access to untranscribed speech from endangered languages. arXiv preprint arXiv:2103.14583. Cited by: §3.2.1.
  • F. Schneider and A. Waibel (2020) Towards stream translation: adaptive computation time for simultaneous machine translation. In Proceedings of the 17th International Conference on Spoken Language Translation, pp. 228–236. Cited by: §2.
  • M. C. Stoian, S. Bansal, and S. Goldwater (2020) Analyzing asr pretraining for low-resource speech-to-text translation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7909–7913. Cited by: §1, §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §3.3, Table 1.
  • C. Wang, Y. Tang, X. Ma, A. Wu, D. Okhonko, and J. Pino (2020a) Fairseq s2t: fast speech-to-text modeling with fairseq. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: System Demonstrations, pp. 33–39. Cited by: Table 2.
  • C. Wang, Y. Wu, S. Liu, M. Zhou, and Z. Yang (2020b) Curriculum pre-training for end-to-end speech translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3728–3738. Cited by: §2.
  • R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen (2017) Sequence-to-sequence models can directly translate foreign speech. Proc. Interspeech 2017, pp. 2625–2629. Cited by: §1, §2.
  • R. Ye, M. Wang, and L. Li (2021) End-to-end speech translation via cross-modal progressive training. arXiv preprint arXiv:2104.10380. Cited by: Table 2.
  • C. Yi, S. Zhou, and B. Xu (2021) Efficiently fusing pretrained acoustic and linguistic encoders for low-resource speech recognition. IEEE Signal Processing Letters. Cited by: §3.2.2.
  • B. Zhang, I. Titov, B. Haddow, and R. Sennrich (2020a) Adaptive feature selection for end-to-end speech translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 2533–2544. Cited by: §2, Table 2.
  • R. Zhang, C. Zhang, Z. He, H. Wu, and H. Wang (2020b) Learning adaptive segmentation policy for simultaneous translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2280–2289. Cited by: §2.
  • R. Zhang and C. Zhang (2020) Dynamic sentence boundary detection for simultaneous translation. In Proceedings of the First Workshop on Automatic Simultaneous Translation, pp. 1–9. Cited by: §2.
  • B. Zheng, R. Zheng, M. Ma, and L. Huang (2019) Simpler and faster learning of adaptive policies for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1349–1354. Cited by: §2.
  • R. Zheng, M. Ma, B. Zheng, K. Liu, and L. Huang (2020a) Opportunistic decoding with timely correction for simultaneous translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 437–442. Cited by: §2.
  • R. Zheng, M. Ma, B. Zheng, K. Liu, J. Yuan, K. Church, and L. Huang (2020b) Fluent and low-latency simultaneous speech-to-speech translation with self-adaptive training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 3928–3937. Cited by: §2.