Speech-to-text translation (hereinafter referred to as speech translation) aims to translate speech in a source language into text in a target language, helping people who speak different languages communicate efficiently. The traditional approach is a pipeline system composed of an automatic speech recognition (ASR) model and a text machine translation (MT) model. In this approach, the two models are independently trained and tuned, leading to problems of time delay, parameter redundancy, and error propagation. In contrast, the end-to-end ST model has the potential to alleviate these problems. Recent works have emerged rapidly and shown promising performance [20, 6, 5, 11, 17].
Despite these advantages, it is notoriously difficult to implement an end-to-end ST model that does not use transcriptions as an intermediate representation, and its performance is generally limited. Previous studies resort to pretraining or multi-task learning to improve translation quality. They either apply an encoder pretrained on ASR data, or train jointly with ASR to obtain a better acoustic model, or with MT to acquire a better language model [20, 2, 6]. However, what is shared between the tasks in these methods is only module parameters; the tasks cannot utilize information from each other's outputs. To alleviate this flaw, several studies propose analogous two-stage models [12, 1, 17], in which the decoder in the first stage performs recognition and generates hidden states with which the second decoder conducts translation. Although translation quality can be improved with the additional information from the first decoder, the second decoder needs to wait until the complete transcription is recognized, greatly limiting the efficiency of training and inference. In addition, this design only lets the translation process utilize information from the recognition process, not the other way around.
However, we find that the generation processes of ASR and ST can help each other: (1) generating the speech translation becomes easier with additional information from the transcribed words rather than from the speech signal alone, and (2) the translated words can also assist the recognition process. In the example shown in Figure 1, the input is a complete speech utterance in English, and the outputs of the two tasks can interact with each other. When translating the Chinese word “yiqie” (meaning “everything”) in the ST task, the already transcribed word “everything” provides additional context. For the ASR task, the translated word “xianzai” can likewise help to recognize “now”. Therefore, if the generation of the two tasks can interact, the quality of both transcription and translation can be improved.
To this end, we propose a novel interactive learning model which performs speech recognition and speech translation synchronously and interactively. Compared with the traditional multi-task learning model, which shares part of the parameters but treats the tasks separately, the tasks in our approach can exchange information with each other. With an interactive attention sub-layer, the translation decoder in our model predicts the next word with the transcribed words as auxiliary information, and vice versa for the recognition decoder. Therefore, at each step, word prediction in each task relies not only on its own previously generated outputs, but also on the outputs of the other task. Furthermore, we introduce a wait-k policy in which the generation process of speech translation is always k steps behind speech recognition, so that the translation decoder can attend to more transcribed words. We conduct extensive experiments to verify the effectiveness of our proposed approaches on new TED English-to-German/French/Chinese/Japanese speech translation corpora.
Our main contributions are summarized as follows:
- We propose an interactive learning model which conducts speech recognition and speech translation interactively, enhancing the quality of both tasks.
- Different from the traditional multi-task learning model, which generates transcriptions or translations separately, our method simultaneously generates both transcriptions and translations in one model.
- Experiments on four language pairs demonstrate that our model outperforms strong baselines, including the pipeline system, the pretrained end-to-end ST model, the traditional multi-task learning model, and the two-stage model.
2 Related Work
Speech translation has traditionally been approached through a pipeline system which consists of an ASR model and a text MT model [18, 4, 7, 8, 19]. Recent works have shown the feasibility of collapsing the cascade system into an end-to-end model. The idea was first conjectured in 1999, with the presumption that end-to-end speech translation would become possible with the development of memory, computation speed, and representation methods. It was not until 2016 that Berard et al. (2016) realized the first pure end-to-end model without using any source transcriptions. Considering its notorious difficulty, the performance of the end-to-end ST model is generally limited, and a variety of approaches have been proposed to improve translation quality. Some applied multi-task learning to train speech translation jointly with ASR [20, 2, 6]. Others attempted to pretrain the ST model with extra ASR data to improve the acoustic model, or with target sentences to improve the language model [5, 11]. Liu et al. (2019) proposed to use a text MT model as a teacher to instruct the ST model through knowledge distillation.
An intuition is that speech translation becomes easier if the model has access to the transcription as an intermediate. Therefore, several researchers proposed two-stage models in which the first decoder recognizes transcriptions and the second decoder performs translation using the hidden states of the first stage. Kano et al. (2017) first proposed the basic two-stage model and used a pretraining strategy for the individual sub-models. Anastasopoulos and Chiang (2018) employed a triangle model on low-resource speech translation. Sperber et al. (2019) further applied an attention-passing mechanism which can integrate auxiliary data and improve model robustness. However, the second decoder needs to wait until the complete transcription is recognized, which greatly affects training and inference efficiency. Besides, it can only utilize transcriptions to improve translation quality, while leaving the recognition task unimproved. As shown in Figure 1, the outputs of recognition and translation are complementary and can benefit each other. Therefore, it is reasonable to improve the quality of both tasks through interactive learning.
Zhou et al. (2019) proposed a synchronous bidirectional inference model in which left-to-right and right-to-left inference is performed in parallel. The two decoding directions can help each other and make full use of the target-side history and future information during translation. Zhang et al. (2019) further applied this inference model to other sequence generation tasks, such as summarization, obtaining significant improvements as well. However, these works operate on the same task with outputs in different directions. The work most closely related to ours is Wang et al. (2019), who synchronously performed multilingual translation within a beam. In our work, we have two different tasks and aim to implement speech recognition and speech translation in one model synchronously.
Considering that the Transformer is now the state-of-the-art model in the MT field, and also shows superior performance in the ASR field [9, 21], we adopt the Transformer as the core structure. However, our proposed approach can be applied to any encoder-decoder architecture.
The Transformer follows the typical encoder-decoder architecture. The encoder first maps the input sequence into a sequence of continuous representations, from which the decoder generates the output sequence one word at a time. In the Transformer, the encoder includes $N$ layers, and each layer is composed of two sub-layers: the self-attention sub-layer and the feed-forward sub-layer. The decoder also consists of $N$ layers, and each layer has three sub-layers. The first one is the masked self-attention sub-layer, which adds masks to prevent present positions from attending to future positions during training. The second is the encoder-decoder attention sub-layer, followed by the feed-forward sub-layer. Residual connection and layer normalization are employed around each sub-layer in both the encoder and the decoder.
The calculation process of the three attention sub-layers can be formalized into the same formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ denote the query, key, and value respectively, and $d_k$ is the dimension of the key. The feed-forward sub-layer is then applied to yield the output of a whole layer, and a softmax function is employed to predict the final output.
It is worth noting that for the self-attention sub-layer, the query, key, and value are the hidden representations from the same layer. For the encoder-decoder attention sub-layer, the query is the hidden representation from the masked self-attention sub-layer in the decoder, while the key and value come from the top layer of the encoder.
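As a minimal numpy sketch of the attention formula above (the function and variable names here are our own, not the paper's):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

# Toy example: 2 query positions, 3 key/value positions, d_k = 4.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
out = attention(Q, K, V)
assert out.shape == (2, 4)  # one value-dimensional output per query position
```

In the self-attention case Q, K, and V all come from the same hidden states; in the encoder-decoder case Q comes from the decoder and K, V from the encoder output.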
4 Our Approach
In this section, we propose a novel framework that implements interactive learning for speech recognition and speech translation during training and inference, as shown in Figure 2. Before describing this framework in detail, we first introduce how the Transformer model is applied to the ASR, MT, and ST tasks.
4.1 ASR, MT, and ST Tasks
Speech recognition, text machine translation, and speech translation can all adopt the Transformer model, while different tasks have different input and output sequences. Specifically,
- For the ASR task, the input sequence is a sequence of speech features. The speech features are first converted from the raw speech signal by applying log-Mel filterbanks with mean and variance normalization. Frame stacking and downsampling are then used to reduce the input length, similar to Sak et al. (2015). The output sequence is the corresponding transcription in the source language.
- For the MT task, the input sequence is the transcription in the source language and the output sequence is the corresponding translation in the target language.
- For the end-to-end ST task, the input sequence is the same as in the ASR task and the output sequence is the corresponding translation in the target language.
In addition to the end-to-end model, the ST task can also be implemented as a pipeline, where the speech utterance is first transcribed by an ASR model and then passed to an MT model. Another method is the multi-task learning model, where the ASR model and ST model share an encoder and are trained jointly.
4.2 Interactive Learning Model
In traditional multi-task learning, different tasks are trained independently with shared parameters. However, as discussed in Section 1, the output of one task is complementary to that of the other and can assist prediction. Therefore, it is reasonable to improve the performance of both tasks by interactively exchanging information between them. Besides, traditional multi-task learning can only perform one task during inference, while sometimes the transcription and translation are required at the same time. To solve these problems, we propose an interactive learning model in which the two tasks not only interactively learn from each other but also generate predictions synchronously.
The main model structure is shown in Figure 2. First, the speech signal is processed into an acoustic feature sequence and projected by a linear transformation layer, which converts its dimension to the hidden size. Then, the encoder embeds the sequence into a high-level acoustic representation. Two decoders are applied for the different tasks: one performs speech recognition and the other speech translation.
To make the two decoders interactively learn from each other, we replace the self-attention sub-layer in the standard Transformer decoder with our proposed interactive attention sub-layer. As shown in Figure 3, the interactive attention sub-layer is composed of a self-attention sub-layer and a cross-attention sub-layer. The former uses the hidden representation $H_1$ of task 1 as the query, key, and value to learn a higher representation $H_{\mathrm{self}}$, while the latter uses $H_1$ as the query and the hidden representation $H_2$ of task 2 as the key and value to integrate the representation of the other task. All hidden representations are extracted from the same layer. This can be calculated as:

$$H_{\mathrm{self}} = \mathrm{Attention}(H_1, H_1, H_1), \qquad H_{\mathrm{cross}} = \mathrm{Attention}(H_1, H_2, H_2)$$
Then the output of the self-attention sub-layer, $H_{\mathrm{self}}$, and that of the cross-attention sub-layer, $H_{\mathrm{cross}}$, are integrated by a fusion function to obtain the final representation. We use linear interpolation as the fusion function:

$$\tilde{H} = (1 - \lambda)\, H_{\mathrm{self}} + \lambda\, H_{\mathrm{cross}}$$

where $\lambda$ is a hyper-parameter controlling how much information from the other task should be taken into consideration. Both decoders thus obtain a combined representation which contains information from the outputs of the two tasks.
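The interactive attention sub-layer can be sketched in a few lines of numpy, assuming linear-interpolation fusion; the value lam=0.3 below is purely illustrative, and the helper names are our own:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention.
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def interactive_attention(H1, H2, lam=0.3):
    """Interactive attention sub-layer (sketch): task 1's states H1 attend to
    themselves and, via cross-attention, to the other decoder's states H2;
    the two results are fused by linear interpolation weighted by lam."""
    H_self = attention(H1, H1, H1)    # self-attention over task 1's own states
    H_cross = attention(H1, H2, H2)   # cross-attention into task 2's states
    return (1 - lam) * H_self + lam * H_cross

rng = np.random.default_rng(1)
H_asr, H_st = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
fused = interactive_attention(H_st, H_asr, lam=0.3)
assert fused.shape == (5, 8)
```

With lam = 0, the sub-layer reduces to plain self-attention, i.e. the traditional multi-task model without information exchange.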
4.3 Training and Inference
Since our approach performs the ASR task and the ST task in one model, the two tasks can be optimized at the same time. We additionally prepend two special task labels to the transcriptions and translations to indicate whether the generation process is recognition or translation. Given a set of training data $D = \{(x, y, z)\}$, where $x$ is the sequence of speech features, $y$ is the source transcription, and $z$ is the corresponding target translation, the objective is to maximize the log-likelihood over both the transcription and the translation:

$$\mathcal{L}(\theta) = \sum_{(x, y, z) \in D} \big( \log P(y \mid x, z; \theta) + \log P(z \mid x, y; \theta) \big)$$
With the interactive attention sub-layer, the recognition decoder and the translation decoder can utilize information from both themselves and each other. Specifically, at time step $t$, the recognition decoder and the translation decoder have each generated their first $t-1$ words; the $t$-th word of the translation can then be predicted based on the $t-1$ already generated translation words and the $t-1$ already transcribed words. The same holds for the generation process of the speech recognition task. Therefore, the prediction probabilities of the transcription and the translation can be formalized as:

$$P(y \mid x, z; \theta) = \prod_{t} P(y_t \mid y_{<t}, z_{<t}, x; \theta), \qquad P(z \mid x, y; \theta) = \prod_{t} P(z_t \mid z_{<t}, y_{<t}, x; \theta)$$
The inference process is similar to training. We run a beam search algorithm for the two tasks: two beams are maintained, one per task, and hypotheses are expanded respectively. The outputs of the two tasks are generated in parallel, with the interactive attention sub-layer implementing information exchange between the two decoders. At each step, the words with the highest probabilities are selected and added to the corresponding hypotheses. The inference process terminates when both tasks reach the end of their sentences. In this way, the hypotheses of the speech recognition task and the speech translation task are generated synchronously.
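The synchronous decoding loop can be illustrated with a greedy (beam size 1) toy sketch. The step functions below are hypothetical stand-ins for the two decoders; in the real model each step would score vocabulary items conditioned on the acoustic encoding and both prefixes:

```python
def synchronous_decode(asr_step, st_step, max_len=20, eos="</s>"):
    """Greedy sketch of synchronous decoding: at each step, each task
    predicts its next token conditioned on both tasks' prefixes."""
    asr, st = [], []
    for _ in range(max_len):
        if not (asr and asr[-1] == eos):
            asr.append(asr_step(tuple(asr), tuple(st)))
        if not (st and st[-1] == eos):
            st.append(st_step(tuple(st), tuple(asr)))
        if asr and st and asr[-1] == eos and st[-1] == eos:
            break  # both tasks reached end of sentence
    return asr, st

# Hypothetical deterministic "decoders" for illustration only.
transcript = ["everything", "changes", "now", "</s>"]
translation = ["yiqie", "gaibian", "xianzai", "</s>"]
asr_step = lambda asr, st: transcript[len(asr)]
st_step = lambda st, asr: translation[len(st)]

asr_out, st_out = synchronous_decode(asr_step, st_step)
assert asr_out == transcript and st_out == translation
```

Note that within one step the translation prediction may already see the transcription word produced at the same step, mirroring the example in Figure 1.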
4.4 Wait-k Policy
Considering that speech translation is more difficult than speech recognition, it would be helpful if the translation decoder could access more information at each step. Therefore, we introduce a wait-k policy, in which the translation decoder does not begin generating until the first k source words have been transcribed by the recognition decoder. That is, the generation of the translation is always k words behind the generation of the transcription. For example, if k = 2, the first translation word is predicted based on the acoustic representation of the encoder together with the first two transcription words; the second translation word can then use the acoustic representation, the first three transcription words, and the first predicted translation word, and so on. Ma et al. (2019) applied a wait-k policy in simultaneous translation, where the translation decoder is always k words behind the incoming source stream. Different from them, the decoders in our work have access to the complete source speech utterance, and the wait-k policy is applied only to the translation decoder. During training, we prepend a special label before the start of the translation to indicate that its generation is k steps behind recognition.
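The amount of transcription visible to the translation decoder under wait-k can be written down directly (a sketch using 1-indexed steps; the function name is ours):

```python
def visible_transcription_len(t, k, src_len):
    """With the wait-k policy, when predicting the t-th translation word
    (1-indexed), the translation decoder may condition on the transcription
    prefix produced so far: the first min(t - 1 + k, src_len) words."""
    return min(t - 1 + k, src_len)

# k = 2: the 1st translation word sees 2 transcribed words, the 2nd sees 3, ...
assert visible_transcription_len(1, 2, 10) == 2
assert visible_transcription_len(2, 2, 10) == 3
assert visible_transcription_len(9, 2, 10) == 10  # capped at sentence length
```

Larger k gives the translation decoder more transcribed context per step, at the cost of the recognition decoder seeing less of the translation.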
5.1 Dataset

Prior studies usually conduct experiments on Fisher and Callhome, a corpus of telephone conversations which includes English transcriptions and Spanish (Es) translations. However, the ASR word error rate (WER) of this corpus is fairly high (the ASR outputs provided by Post et al. (2013) have a WER of over 40%), due to the spontaneous speaking style and challenging acoustics. Therefore, we construct a new speech translation corpus collected from TED talks, which are a popular data resource in both the speech recognition and machine translation fields.
To build this corpus, we first crawl the raw data (including video, subtitles, and timestamps) from the TED website (https://www.ted.com). The audio of each talk is extracted from the video and saved in wav format. The subtitles of each talk usually contain an English manual transcription and translations in more than one language. Here, we only collect the subtitles which contain an English transcription together with translations in German, French, Chinese, and Japanese (briefly, De/Fr/Zh/Ja). Adjacent subtitles and timestamps in the English transcriptions are combined according to strong punctuation, such as periods and question marks. Each audio file is then segmented into small utterances based on the combined timestamps. This process guarantees that each speech utterance contains complete semantic information, which is important for translation. Translations in the different languages are also combined based on the timestamps to align with the speech utterances (Di Gangi et al. (2019) built similar corpora; however, theirs do not include the En-Zh and En-Ja language pairs, and they used a different segmentation method).
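The subtitle-merging step can be sketched as follows. This is an illustrative helper, not the authors' actual preprocessing script; it merges adjacent timed subtitle pieces until one ends with strong punctuation:

```python
import re

STRONG_PUNCT = re.compile(r'[.?!]$')

def merge_subtitles(subs):
    """Merge adjacent (text, start, end) subtitle pieces until a piece ends
    with strong punctuation, so each merged utterance is a full sentence."""
    merged, buf, start = [], [], None
    for text, s, e in subs:
        if start is None:
            start = s
        buf.append(text)
        if STRONG_PUNCT.search(text.strip()):
            merged.append((" ".join(buf), start, e))
            buf, start = [], None
    if buf:  # trailing piece without strong punctuation
        merged.append((" ".join(buf), start, subs[-1][2]))
    return merged

subs = [("So I want to talk", 0.0, 1.2),
        ("about education.", 1.2, 2.0),
        ("Is that OK?", 2.0, 3.1)]
assert merge_subtitles(subs) == [
    ("So I want to talk about education.", 0.0, 2.0),
    ("Is that OK?", 2.0, 3.1)]
```

The merged timestamps would then be used to cut the audio into semantically complete utterances.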
Finally, we obtain 235K/299K/299K/273K triplets for the En-De/Fr/Zh/Ja language pairs respectively, each consisting of a speech utterance, a manual transcription, and a translation. Development and test sets are split according to the partition in IWSLT: we use tst2014 as the development (Dev) set and tst2015 as the test set. The remaining data are used as the training set. The dataset is available at http://www.nlpr.ia.ac.cn/cip/dataset.htm.
5.2 Model Settings
The speech features are 80-dimensional log-Mel filterbanks extracted with a step size of 10ms and a window size of 25ms, processed with mean subtraction and variance normalization. The features are stacked with 3 frames to the left and downsampled to a 30ms frame rate. We remove punctuation, lowercase, and tokenize the English transcriptions using scripts from Moses (https://www.statmt.org/moses/). We also lowercase and tokenize the translations in German and French. Chinese sentences are segmented by Jieba (https://github.com/fxsjy/jieba) and Japanese sentences by Mecab (http://taku910.github.io/mecab). For En-De and En-Fr, parallel sentences are encoded using the BPE method with a shared vocabulary of 30K tokens. For En-Zh and En-Ja, we encode source transcriptions and target translations separately, with the vocabulary limited to the most frequent 30K tokens. ASR performance is evaluated with WER computed on lowercased, tokenized manual transcriptions without punctuation. For text translation and speech translation, we report case-insensitive tokenized BLEU for the De/Fr language pairs and character-level BLEU for Zh/Ja.
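The frame stacking and downsampling described above can be sketched in numpy. This is our own reading of the setup (stack 3 frames to the left, keep every 3rd frame, so 10ms/80-dim features become 30ms/320-dim ones); the exact ordering inside the stacked vector is an assumption:

```python
import numpy as np

def stack_and_downsample(feats, left=3, skip=3):
    """Stack each frame with `left` frames to its left (zero-padded at the
    start) and keep every `skip`-th frame, reducing the sequence length."""
    T, d = feats.shape
    padded = np.concatenate([np.zeros((left, d)), feats], axis=0)
    # Row t of `stacked` holds frames t-left .. t, concatenated feature-wise.
    stacked = np.concatenate(
        [padded[i:i + T] for i in range(left + 1)], axis=1)  # (T, (left+1)*d)
    return stacked[::skip]

feats = np.random.default_rng(2).normal(size=(100, 80))  # 1s of 10ms frames
out = stack_and_downsample(feats)
assert out.shape == (34, 320)  # ceil(100/3) frames at 30ms, 4x80 dims
```

The sequence length fed to the encoder shrinks by a factor of 3, which matters because speech inputs are far longer than their text outputs.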
All models are implemented based on the Transformer. We use the transformer_base configuration of Vaswani et al. (2017), which contains a 6-layer encoder and a 6-layer decoder with 512-dimensional hidden sizes. We train our models with the Adam optimizer on 2 NVIDIA V100 GPUs. For inference, we perform beam search.
We compare the proposed method with the following baseline models:
- Pipeline system: The ASR and MT models are independently trained, and the outputs of the ASR model are then taken as inputs to the MT model.
- Pretrained ST model: The encoder of the end-to-end ST model is first initialized by training on ASR data, and the model is then finetuned on speech translation data.
- Multi-task learning model: The ASR model and the ST model are jointly trained with the encoder parameters shared.
- Two-stage model: This model contains two stages, where the outputs of the first stage are transcriptions and those of the second are translations. We re-implement the basic model based on the Transformer, following Sperber et al. (2019); the first-stage model is also initialized by training on ASR data.
Table 1 shows the main results of speech recognition and speech translation on the En-De/Fr/Zh/Ja TED corpora. The BLEU scores in the first row are the translation results of the text MT model when the clean manual transcriptions are given as input; this can be seen as an upper bound for the speech translation task. We set the hyper-parameters λ and k of the interactive learning model to the best values found in Sections 5.5 and 5.6.
We first analyze the En-De and En-Fr language pairs. From the first two rows, we can see that translation quality drops dramatically when the output of the ASR model, rather than the clean transcription, is fed to the MT model. This indicates that the text MT model is very sensitive to recognition errors, which is one of the main problems of the pipeline system. The pretrained end-to-end ST model outperforms the pipeline system by 0.99 BLEU points on the En-Fr direction, but shows no superiority on En-De. We argue that the end-to-end model's advantage of less error propagation may be larger on more similar language pairs, such as En-Fr or En-Es. This is consistent with Weiss et al. (2017), who conducted experiments on En-Es and found that end-to-end ST performs better than the pipeline system. Compared with the end-to-end model, the multi-task learning model obtains some improvements: 2.01 and 0.98 BLEU points for En-De and En-Fr, respectively. However, with information exchange, our proposed interactive learning model significantly outperforms the multi-task learning model on both speech recognition and speech translation quality, demonstrating the effectiveness of the interactive attention mechanism. Although our method does not outperform the two-stage model on the En-Fr speech translation task, it achieves a better ASR result. The underlying reason is that the two-stage model is designed to optimize translation quality with the information of the complete transcription while ignoring recognition; it can thus improve translation quality but leaves recognition unimproved.
It is even more difficult to implement end-to-end speech translation on dissimilar language pairs, such as En-Zh and En-Ja, because such models are required to learn not only the alignments between source frames and translation words, but also word reordering over long distances. Therefore, in our experiments, most of the end-to-end models are inferior to the pipeline system. However, the proposed interactive learning model significantly outperforms the end-to-end ST model, the traditional multi-task learning model, and the two-stage model, approaching or slightly surpassing the pipeline system.
5.5 Effect of the Hyper-parameter λ
We investigate how much information from the two tasks should be taken into consideration in the interactive attention sub-layer. Table 2 reports WER and BLEU scores under different values of λ on En-Zh. If λ = 0, the model degrades to the traditional multi-task learning model, which does not utilize any information from the other task. As shown in the table, as λ increases, both recognition quality and translation quality improve through information interaction, and at an intermediate value of λ our interactive learning model achieves the best performance on the speech translation task. However, λ cannot be too large, otherwise the two tasks may interfere with each other and hurt their own performance. We therefore use the best value of λ found here for all experiments.
5.6 Effect of k in the Wait-k Policy
We then investigate the effect of the word latency k in the wait-k policy on the En-Zh language pair. As shown in Table 3, the speech translation quality in BLEU improves as the word latency k increases, indicating that speech translation becomes easier when more source information from the same modality is given. However, as k increases further, it hurts the performance of the speech recognition task. If k is as large as the whole sentence, the model degrades to the analogous two-stage model: the speech translation task can then use information from the complete transcribed sentence, while the speech recognition task cannot utilize any information from the translation. The interactive learning model performs best at an intermediate value of k.
5.7 Parameters and Speeds
The parameter sizes of different models are shown in Table 4. The pipeline system needs a separate ASR model and MT model, so its parameter count is doubled. The two-stage model has about 1.5 times as many parameters since it has two different decoders in its two stages. In the multi-task learning model and the interactive learning model, parameters are shared between the tasks, so they have the same number of parameters as the end-to-end model. Table 4 also shows the training and inference speed of different models on the En-Zh test set. The training speed of the interactive learning model is 4.23 steps per second, which is comparable with the end-to-end model and much faster than the two-stage model. During inference, the average decoding speed of the interactive learning model is 11.98 utterances per second. Although it is slower than the end-to-end model and the multi-task learning model, it generates transcriptions paired with translations in one model synchronously. While the two-stage model can also generate a transcription and a translation in a single model, its cascaded implementation is even slower than the pipeline system.
5.8 Case Study
We show a case study in Figure 4. In the pipeline system, the ASR model first recognizes the speech utterance as “brainstormed on solutions to the best child is facing their city”. Since it wrongly recognizes “the biggest challenges” as “the best child is”, the text MT model then translates the incorrect phrase, so the result is far from the reference. It is even more difficult for the end-to-end ST model to generate a correct translation, and its output is totally wrong: the model may have interpreted the speech of “brainstorm” as “buhrstone”, which has a similar pronunciation, and it omits the translation of “the biggest”. Although the multi-task learning model has an enhanced acoustic encoder, without the transcription as guidance it repeatedly attends to the speech of “storm” and translates it twice. As for the two-stage model, it erroneously recognizes “the biggest” as “the best” in the first stage, based on which the second decoder also gives a wrong translation. Compared with the above approaches, our model generates the correct transcription and translation through the interactive attention mechanism, matching the reference best.
6 Conclusion and Future Work
In this paper, we propose an interactive learning model to conduct speech recognition and speech translation interactively and simultaneously. The generation process of recognition and translation in this model can utilize not only its own already generated outputs, but also the outputs generated by the other task. We further present a wait-k policy which can improve speech translation quality. Experimental results on different language pairs demonstrate the effectiveness of our model. In the future, we plan to design a streaming encoder and take a step toward end-to-end simultaneous interpretation.
The research work described in this paper has been supported by the National Key Research and Development Program of China under Grant No. 2016QY02D0303, the Natural Science Foundation of China under Grants No. U1836221 and 61673380, and Beijing Municipal Science and Technology Project No. Z181100008918017. This work has also been supported by the Beijing Advanced Innovation Center for Language Resources.
-  (2018) Tied multitask learning for neural speech translation. In Proceedings of NAACL, pp. 82–91. Cited by: §1.
-  (2018) Leveraging translations for speech transcription in low-resource settings. In Proceedings of Interspeech, pp. 1279–1283. Cited by: §1, §2.
-  (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §4.2.
-  (2015) Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR . Cited by: §2.
-  (2018) Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. In Proceedings of NAACL, pp. 58–68. Cited by: §1, §1, §2.
-  (2018) End-to-end automatic speech translation of audiobooks. In Proceedings of ICASSP, pp. 6224–6228. Cited by: §1, §1, §2.
-  (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In Proceedings of ICASSP, pp. 4960–4964. Cited by: §2.
-  (2017) State-of-the-art speech recognition with sequence-to-sequence models. In Proceedings of ICASSP. Cited by: §2.
-  (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In Proceedings of ICASSP, pp. 5884–5888. Cited by: §3.
-  (2016) Deep residual learning for image recognition. In Proceedings of CVPR, pp. 770–778. Cited by: §4.2.
-  (2019) Leveraging weakly supervised data to improve end-to-end speech-to-text translation. In Proceedings of ICASSP, pp. 7180–7184. Cited by: §1, §2.
-  (2017) Structured-based curriculum learning for end-to-end english-japanese speech translation. In Proceedings of Interspeech, pp. 2630–2634. Cited by: §1.
-  (2015) Adam: a method for stochastic optimization. In Proceedings of ICLR. Cited by: §5.2.
-  (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL, pp. 311–318. Cited by: §5.2.
-  (2013) Improved speech-to-text translation with the fisher and callhome spanish–english speech translation corpus. In Proceedings of IWSLT. Cited by: §5.1.
-  (2016) Neural machine translation of rare words with subword units. In Proceedings of ACL, pp. 1715–1725. Cited by: §5.2.
-  (2019) Attention-passing models for robust and data-efficient end-to-end speech translation. Transactions of ACL 7, pp. 313–325. Cited by: §1, §1.
-  (2014) Sequence to sequence learning with neural networks. In Proceedings of NIPS, pp. 3104–3112. Cited by: §2.
-  (2017) Attention is all you need. In Proceedings of NIPS, pp. 5998–6008. Cited by: §2, §3.
-  (2017) Sequence-to-sequence models can directly translate foreign speech. In Proceedings of Interspeech, pp. 2625–2629. Cited by: §1, §1, §2.
-  (2018) Syllable-based sequence-to-sequence speech recognition with the transformer in mandarin chinese. In Proceedings of Interspeech, pp. 791–795. Cited by: §3.