Speech translation (ST) aims to convert speech signals into text in another language. Conventionally, it is formulated as a two-step cascaded task, automatic speech recognition (ASR) followed by text-based machine translation (MT) [Ney1999ST, Matusov2005ST, Post2013ST]. Such cascaded systems typically suffer from the following issues. First, errors in ASR may propagate to MT. Second, since the intermediate representation is text, cascaded systems cannot fully leverage speech information (e.g., prosody) for translation. Finally, the MT module cannot start until the ASR module has (partially) finished, resulting in long inference latency. Recently, end-to-end (E2E) ST (i.e., direct ST), which directly maps audio features to text, has become increasingly popular [vila2018end, sperber2020speech]. In [Berard2016ST], the authors propose to use attention-based E2E encoder-decoder (AED) models [chan2015listen, wang21t_interspeech] on a small French-English synthetic corpus. In [weiss2017sequence], a similar model structure is applied to the Fisher Callhome Spanish-English task and outperforms the cascaded method on the Fisher test set. AED-based models were also used in [Berard2018ST] for a large-scale E2E ST task. However, AED models usually operate in an offline mode and cannot start decoding until the full utterance is observed.
E2E ST and ASR are similar in that they are both sequence-to-sequence mappings. Many model architectures can thus be shared, especially between ASR and ST with monotonic alignments [raffel2017online]. To enable more effective communication between users, streaming (i.e., simultaneous) models are under active investigation in both areas. Monotonic chunkwise attention (MoChA) [chiu2018monotonic] has been used in both MT and ASR. The MT version was extended to monotonic infinite lookback attention (MILk) [arivazhagan2019monotonic] and monotonic multi-head attention [ma2019monotonic, ma2021streaming], while the ASR version was improved by multitask learning [miao2019online] and minimum-latency training strategies [inaguma2020minimum]. Another streaming model architecture is the neural transducer [prabhavalkar2017asr, sainath2020asr, li2020asr, saon2021asr], which outperforms MoChA and has emerged as the state-of-the-art (SOTA) streaming E2E model in ASR [li2021recent], but has been less investigated in ST. Recently, Liu et al. proposed the cross attention augmented transducer (CAAT) for ST [liu2021caat]. It uses Transformers in the joint network to combine the encoder and prediction network outputs. Due to the use of Transformers and multi-step decisions for memory footprint reduction, the latency of CAAT is large. In addition, training a CAAT requires complicated regularization terms and extensive hyper-parameter tuning.
In this paper, to leverage the success of the SOTA streaming technology in ASR, we propose to use neural transducers, specifically a low-latency and low-computational-cost Transformer transducer (TT) [xiechen2021tt], for streaming E2E ST. To improve the representation fusion ability of the joint networks in TT, we propose attention pooling. In addition, we extend TT to multilingual ST. The TT models are trained on a 50K-hour pseudo-labeled ST data set, which is generated by feeding the reference texts of an ASR corpus to an MT model [liu2019, jia2019st, gaido2020end]. We do not use any human-labeled paired ST data in this work. Experimental results on the Microsoft speech language translation (MSLT) corpus [federmannmicrosoft] demonstrate that the proposed method not only achieves good bilingual evaluation understudy (BLEU) scores but also significantly reduces inference latency. The remainder of this paper is organized as follows. We describe the proposed methods and model structures in Section 2. The experimental setup and evaluation results are presented in Sections 3 and 4, respectively. We conclude the paper in Section 5.
2 System Description
2.1 From Wait-k to Neural Transducers
Figure 1 shows the decoding graphs of the commonly adopted wait-$k$ algorithm [ma2019stacl] for streaming ST, where $t$ and $u$ denote the time steps of the encoder output and the output labels (i.e., read and write operations), respectively. As indicated by the names, wait-3 in Figure 1(a) waits for 3 read operations before it starts writing, whereas wait-$\infty$ in Figure 1(b) can access the whole sentence.
Instead of using hard-coded wait steps and a fixed read-write policy as in wait-$k$, neural transducers, whose decoding graphs are depicted in Figure 2, make read and write decisions in a data-driven fashion. During training, a neural transducer considers all possible alignments between the encoder output and the labels. At test time, it generates the most likely paths adaptively based on the input features. As shown in Figure 2, if there is no significant word reordering, the neural transducer may follow the orange path or a different green path. If there is significant word reordering at the end of the utterance, it can use the blue decoding path corresponding to wait-$\infty$.
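To make the data-driven read-write behavior concrete, the sketch below (our own illustration in Python/PyTorch, not code from the cited works) performs greedy transducer decoding: emitting the blank symbol acts as a read that advances one encoder frame, while emitting a non-blank token acts as a write.

```python
import torch

def greedy_transducer_decode(enc_out, predict, joint, blank_id, max_symbols=10):
    """Greedy decoding over a neural-transducer lattice.

    enc_out: (T, D) encoder output for one utterance.
    predict: callable mapping the last non-blank token id to a prediction state
             (a real prediction network is recurrent over the full history;
             this simplification only looks at the last token).
    joint:   callable mapping (encoder frame, prediction state) to (V,) logits.
    Emitting blank advances t (a "read"); a non-blank token is a "write".
    """
    hyp = []                         # written (non-blank) tokens
    pred_state = predict(None)       # state before any non-blank output
    t = 0
    while t < enc_out.size(0):
        emitted = 0
        while emitted < max_symbols:             # cap writes per frame
            logits = joint(enc_out[t], pred_state)
            k = int(torch.argmax(logits))
            if k == blank_id:                    # "read": move to next frame
                break
            hyp.append(k)                        # "write": emit a token
            pred_state = predict(k)              # update prediction network
            emitted += 1
        t += 1
    return hyp

# Toy usage with random projections standing in for trained networks.
V, D = 8, 16
emb = torch.nn.Embedding(V, D)
proj = torch.nn.Linear(2 * D, V)
predict = lambda y: torch.zeros(D) if y is None else emb(torch.tensor(y))
joint = lambda h, g: proj(torch.cat([h, g]))
print(greedy_transducer_decode(torch.randn(5, D), predict, joint, blank_id=0))
```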
The model structure of neural transducers is shown in Figure 3. It has three components: an encoder network, a prediction network, and a joint network. The encoder takes $d_x$-dimensional audio features $\mathbf{x}_t$ as input and generates $d_h$-dimensional hidden representations $\mathbf{h}_t^{enc}$. The prediction network uses the embedding of the non-blank output token $y_{u-1}$ at step $u-1$ and predicts the hidden representation $\mathbf{h}_u^{pre}$ for step $u$. As for the joint network, it combines $\mathbf{h}_t^{enc}$ and $\mathbf{h}_u^{pre}$ into a tensor whose element at $t$ and $u$ is denoted by the vector $\mathbf{z}_{t,u}$. After a softmax operation, the model generates the probability $P(\hat{y}_{t,u} \mid \mathbf{x}_{1:t}, y_{1:u-1})$ over $\mathcal{Y} \cup \{\varnothing\}$, where $\mathcal{Y}$ is the vocabulary list and $\varnothing$ denotes the blank output (i.e., outputting nothing). Note that, for notational simplicity, we ignore the batch size and use the same time resolution for $\mathbf{x}_t$ and $\mathbf{h}_t^{enc}$. The encoder and prediction networks are typically LSTMs in RNN-T models [graves2012asr] and Transformers in TT models [facebook2019TT, zhang2020tt].
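The following toy snippet, with shapes and stand-in projections of our own choosing, illustrates how the encoder and prediction network outputs are combined into the probability lattice over the vocabulary plus blank.

```python
import torch

# Toy shapes for one utterance: T encoder frames, U target tokens, vocabulary
# of V symbols where index 0 plays the role of the blank output.
T, U, d, V = 6, 3, 8, 10
h_enc = torch.randn(T, d)          # encoder output, one vector per frame t
h_pre = torch.randn(U + 1, d)      # prediction network output, one per step u
w_out = torch.randn(d, V)          # stand-in for the joint/output projection

# Combine every (t, u) pair, then softmax over the vocabulary plus blank.
z = h_enc.unsqueeze(1) + h_pre.unsqueeze(0)        # (T, U+1, d)
log_probs = torch.log_softmax(z @ w_out, dim=-1)   # (T, U+1, V)

# Training marginalizes these log-probs over all monotonic alignments
# (e.g., with a transducer loss); decoding walks one path through the grid.
print(log_probs.shape)
```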
2.2 Streaming TT Model
We apply TT in this work since it usually obtains better performance than RNN-T in ASR tasks [xiechen2021tt, facebook2019TT, zhang2020tt]. Each Transformer block in the encoder is constructed from a multi-head self-attention layer followed by a feedforward layer. In order for TT to work in a streaming mode with low latency and low computational cost, we apply the attention mask proposed in [xiechen2021tt]. The attention mask can be the same for different layers. At each layer, we divide the input into chunks along the time axis with a fixed chunk size. At time step $t$, the frame can only attend to the frames inside its own chunk and a fixed number of left chunks. Figure 4 shows an example of the receptive field of a three-layer Transformer model at an output position $t$, given the chunk size and the number of left chunks. Note that since the features cannot access frames beyond their own chunk, as shown by the right-most blue circle in the first input layer, the number of look-ahead frames is bounded by the chunk size. Moreover, the left receptive field increases linearly with the number of layers, enabling the model to use long history information for better performance.
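A minimal sketch of such a chunk-based attention mask is shown below; the function name and the boolean convention (True means attention is allowed) are our own, and the exact mask construction in [xiechen2021tt] may differ in details.

```python
import torch

def chunk_streaming_mask(num_frames, chunk_size, num_left_chunks):
    """Boolean self-attention mask for chunk-wise streaming (True = may attend).

    Frame t may attend to all frames in its own chunk and in up to
    `num_left_chunks` chunks to the left, so the look-ahead never crosses
    the right boundary of its own chunk.
    """
    chunk_idx = torch.arange(num_frames) // chunk_size   # chunk id per frame
    q = chunk_idx.unsqueeze(1)                            # query chunk ids (rows)
    k = chunk_idx.unsqueeze(0)                            # key chunk ids (cols)
    return (k <= q) & (k >= q - num_left_chunks)

# Example: 8 frames, chunk size 4, one left chunk.
print(chunk_streaming_mask(8, 4, 1).int())
```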
2.3 Attention Pooling for Joint Networks
The joint network in a conventional neural transducer combines the output representations of the encoder and the prediction network with simple linear layers:
$$\mathbf{z}_{t,u} = \mathbf{W}_{out}\,\phi\!\left(\mathbf{W}_{enc}\mathbf{h}_t^{enc} + \mathbf{W}_{pre}\mathbf{h}_u^{pre}\right), \quad (1)$$
where the two sources of output are multiplied with $\mathbf{W}_{enc}$ and $\mathbf{W}_{pre}$, respectively, to map the feature vectors to $d_z$ dimensions. $\phi$ denotes a non-linear function, typically $\tanh$ or ReLU. Finally, the feature vector is converted to the output dimension using $\mathbf{W}_{out}$.
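A minimal PyTorch sketch of the joint network in Equation (1) is given below; the module and dimension names, and the choice of tanh, are our own.

```python
import torch
import torch.nn as nn

class LinearJointNetwork(nn.Module):
    """Conventional transducer joint network, as in Equation (1):
    z_{t,u} = W_out * phi(W_enc h^enc_t + W_pre h^pre_u)."""

    def __init__(self, d_enc, d_pre, d_joint, vocab_size):
        super().__init__()
        self.w_enc = nn.Linear(d_enc, d_joint)
        self.w_pre = nn.Linear(d_pre, d_joint)
        self.w_out = nn.Linear(d_joint, vocab_size)

    def forward(self, h_enc, h_pre):
        # h_enc: (B, T, d_enc), h_pre: (B, U+1, d_pre)
        enc = self.w_enc(h_enc).unsqueeze(2)       # (B, T, 1, d_joint)
        pre = self.w_pre(h_pre).unsqueeze(1)       # (B, 1, U+1, d_joint)
        return self.w_out(torch.tanh(enc + pre))   # (B, T, U+1, vocab)
```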
A recent study in ASR shows that the representation fusion ability of such joint networks can be improved by a bilinear pooling approach [zhang2022improving]. In this paper, we propose attention pooling, which adapts the pooling weights according to the input using an attention-like weighting mechanism. Different from ASR, ST needs to consider not only the current output probability but also whether writing a non-blank token at a future step would be better. The adaptive attention weights in attention pooling may act as an additional type of feature that helps ST models make more appropriate decisions. Note that we keep the time and space complexity of attention pooling linear so that it consumes fewer computational resources during inference. The proposed attention pooling is defined by Equations (2) to (5), where $\mathbf{p}_{t,u}$ is the pooling term at time steps $t$ and $u$, a learned weight vector maps the one-dimensional attention feature to $d_z$ dimensions, the attention weights denote the contributions of the encoder and the prediction network to the pooling term, $\odot$ denotes the Hadamard product, two additional linear layers are used to calculate the attention weights, and $\otimes$ denotes the tensor-dot operation.
We also design a stronger qkv attention pooling method that uses separate weights for query, key, and value. It can be expressed by replacing Equations (4) and (5) with Equations (6) and (7), respectively, where $\mathbf{W}_q^{enc}$, $\mathbf{W}_k^{enc}$, and $\mathbf{W}_v^{enc}$ are the linear layers for query, key, and value of the encoder features, and $\mathbf{W}_q^{pre}$, $\mathbf{W}_k^{pre}$, and $\mathbf{W}_v^{pre}$ are the corresponding weights for the prediction network. Note that we use the Hadamard product and the tensor-dot operation to avoid quadratic time and space complexity.
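Since Equations (2) to (7) are not reproduced here, the following sketch is only a loose illustration of the idea rather than the exact formulation: a cheap per-(t, u) attention score computed from the two projected representations is mapped to the joint dimension and added as a pooling term, using only Hadamard products and tensor-dots so that no quadratic (d x d) term appears. All names and design details below are assumptions.

```python
import torch
import torch.nn as nn

class AttentionPoolingJoint(nn.Module):
    """Illustrative (not the paper's exact) attention-pooling joint network.

    A scalar score per (t, u), obtained by summing the Hadamard product of two
    extra projections, is expanded to d_joint and added to the pre-activation
    of Equation (1). The extra cost is linear in the feature dimension."""

    def __init__(self, d_enc, d_pre, d_joint, vocab_size):
        super().__init__()
        self.w_enc = nn.Linear(d_enc, d_joint)
        self.w_pre = nn.Linear(d_pre, d_joint)
        self.w_out = nn.Linear(d_joint, vocab_size)
        self.w_att_enc = nn.Linear(d_enc, d_joint)    # attention-weight projections
        self.w_att_pre = nn.Linear(d_pre, d_joint)
        self.w_p = nn.Linear(1, d_joint, bias=False)  # maps scalar score to d_joint

    def forward(self, h_enc, h_pre):
        enc = self.w_enc(h_enc).unsqueeze(2)          # (B, T, 1, d_joint)
        pre = self.w_pre(h_pre).unsqueeze(1)          # (B, 1, U+1, d_joint)
        # Scalar score per (t, u): tensor-dot of the two projections, i.e. the
        # sum over features of their Hadamard product (broadcast over the grid).
        score = (self.w_att_enc(h_enc).unsqueeze(2)
                 * self.w_att_pre(h_pre).unsqueeze(1)).sum(-1, keepdim=True)
        pool = self.w_p(torch.sigmoid(score))         # (B, T, U+1, d_joint)
        return self.w_out(torch.tanh(enc + pre + pool))
```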
2.4 Multilingual ST with TT
ST models supporting a single language pair, such as English-Chinese (EN-ZH), are often referred to as bilingual ST. It is inefficient to build a separate bilingual ST model for every language pair in the world. In addition, running multiple bilingual ST models simultaneously requires a lot of memory and computation resources. In this work, we propose to apply TT to multilingual ST by sharing the encoder and using separate prediction and joint networks for different target languages. Since the encoder (64M parameters in our experiments) is much larger than the joint and prediction networks (24M combined), the size of such a multilingual ST model is comparable to that of a bilingual model.
Figure 5 shows the one-to-many multilingual ST model using TT. The shared encoder output is fed to multiple prediction and joint networks. To train the multilingual ST model, we alternate the training data between batches, e.g., one batch uses EN-ZH data and the next uses English-German (EN-DE) data (i.e., a 50%-50% data mixing ratio).
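A structural sketch of the one-to-many model is shown below, with a shared encoder and per-language prediction and joint networks; the module names are illustrative.

```python
import torch.nn as nn

class MultilingualTransducer(nn.Module):
    """One-to-many ST: a shared encoder with per-language prediction and
    joint networks (module names and wiring are illustrative)."""

    def __init__(self, encoder, pred_nets: dict, joint_nets: dict):
        super().__init__()
        self.encoder = encoder                      # shared across languages
        self.pred_nets = nn.ModuleDict(pred_nets)   # e.g. {"zh": ..., "de": ...}
        self.joint_nets = nn.ModuleDict(joint_nets)

    def forward(self, feats, tokens, lang):
        h_enc = self.encoder(feats)                 # computed once, shared
        h_pre = self.pred_nets[lang](tokens)        # language-specific branch
        return self.joint_nets[lang](h_enc, h_pre)
```

During training, batches simply alternate between language pairs, so each batch updates the shared encoder and one language-specific branch.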
3 Experimental Setup
We use 50 thousand (K) hours of Microsoft internal ASR data as the training set. All the data are anonymized with personally identifiable information removed. The original transcriptions are in English, and we use the Microsoft cognitive translation service to translate them into Chinese and German. We do not use any human-labeled paired ST data. For evaluation, we use the publicly available MSLT_v1.0 for EN-DE and MSLT_v1.1 for EN-ZH [federmannmicrosoft]. We are not allowed to use other public data sets due to license restrictions.
3.1 Cascaded Method
We use the cascaded method as the baseline for our experiments. The ASR module is a streaming TT model described in Section 2.2. It is trained using the above 50K-hour English audio and the corresponding English transcriptions. The encoder consists of 18 Transformer blocks, each containing 320 hidden nodes, 8 attention heads, and 2048 feedforward nodes. As for the prediction network, we use 2 LSTM layers. Each LSTM layer has 1024 hidden nodes and the embedding dimension is also 1024. The joint network is a simple feedforward layer containing 512 nodes. For English transcriptions, the vocabulary size is 4K. The model has 88M parameters in total. The input to the model is 80-dimension log-Mel filter-bank features with 25ms windows and 10ms shift, extracted from mixed-band training data [li2012improving]. Before the input is fed to the Transformer blocks, it is filtered and down-sampled by a factor of 4 (i.e., one encoder frame every 40ms) using two convolutional layers. The chunk size of the streaming mask for the Transformer blocks is 4. The output of the ASR module is used by a non-streaming text-based MT module to get the translation results. The MT model consists of a 6-layer Transformer encoder and a 2-layer RNN decoder. Each Transformer layer has a feedforward network of size 2048 and 8 attention heads. The embedding size is 512. The MT model has 67M parameters.
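For convenience, the hyper-parameters of the ASR module described above are collected in the following snippet; the key names are ours.

```python
# Summary of the streaming TT ASR module (values from the text above).
asr_tt_config = {
    "encoder": {"blocks": 18, "hidden": 320, "heads": 8, "ffn": 2048,
                "chunk_size": 4, "subsampling": 4},
    "prediction": {"lstm_layers": 2, "hidden": 1024, "embedding": 1024},
    "joint": {"hidden": 512},
    "vocab_size": 4000,                                # 4K English tokens
    "features": {"dim": 80, "window_ms": 25, "shift_ms": 10},
}
```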
3.2 Streaming E2E ST Models
The streaming E2E ST models are trained using the same 50K-hour English audio, but with the corresponding translated labels generated by the MT model. The ST models have the same architecture as that of the ASR module in the above cascaded system, except that the output dimensions are changed according to the vocabulary sizes. The vocabulary size of EN-ZH is 11K, whereas that of EN-DE is 4K. In addition to chunk size 4, which has a 160ms look-ahead and is denoted as TT-160ms, we conduct experiments using a chunk size of 80, which has a 3.2s look-ahead and is thus denoted as TT-3.2s.
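The look-ahead values follow directly from the frame shift and the subsampling factor, as the short calculation below shows.

```python
# Each encoder step covers 40 ms: 10 ms frame shift x 4-fold subsampling.
frame_shift_ms, subsampling = 10, 4
for chunk_size in (4, 80):
    lookahead_ms = chunk_size * subsampling * frame_shift_ms
    print(f"chunk size {chunk_size}: {lookahead_ms} ms look-ahead")
# chunk size 4 -> 160 ms (TT-160ms); chunk size 80 -> 3200 ms (TT-3.2s)
```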
3.3 ASR Encoder Initialization and ASR Multi-Task Learning
In addition to pseudo-labeling, we investigate ASR encoder initialization and ASR multi-task learning for ST. For ASR encoder initialization, we use the ASR module from the cascaded method to initialize the encoder and randomly initialize the prediction and joint networks. Then we fine-tune the whole model for the EN-ZH task. As for ASR multi-task learning, we adopt the multilingual ST model described in Section 2.4, but with English and Chinese as the output languages.
3.4 Attention Pooling for Joint Networks
Each new weight matrix introduced by attention pooling is implemented as a linear layer. The open neural network exchange (ONNX) conversion procedures are modified accordingly.
3.5 ONNX Conversion and Model Compression
After obtaining the checkpoints of the E2E ST models, we convert them to the ONNX format, compress each component, and evaluate the compressed models. The weights in the encoders are compressed to uint8, and the RNN and feedforward layers in the prediction and joint networks are compressed using the neural-network unified preprocessing heterogeneous architecture (NUPHAR).
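As an illustration only, the snippet below shows a generic ONNX export followed by uint8 dynamic weight quantization with standard PyTorch and ONNX Runtime tooling; the NUPHAR-based compression of the prediction and joint networks used in our pipeline is not shown, and the file names and the stand-in module are hypothetical.

```python
import torch
import torch.nn as nn
from onnxruntime.quantization import quantize_dynamic, QuantType

# Stand-in module; in the actual pipeline this would be the TT encoder checkpoint.
model = nn.Sequential(nn.Linear(80, 320), nn.ReLU(), nn.Linear(320, 320))
dummy = torch.randn(1, 10, 80)

# 1) Export the PyTorch checkpoint to ONNX.
torch.onnx.export(model, dummy, "encoder.onnx",
                  input_names=["feats"], output_names=["enc_out"],
                  dynamic_axes={"feats": {1: "time"}, "enc_out": {1: "time"}})

# 2) Compress the exported weights to uint8 (dynamic quantization).
quantize_dynamic("encoder.onnx", "encoder.uint8.onnx",
                 weight_type=QuantType.QUInt8)
```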
3.6 Latency Measurement
We use average proportion (AP), average lagging (AL), and differentiable average lagging (DAL) proposed in [ma2020simuleval] to measure the inference latencies of our ST systems. Note that different from [ma2020simuleval], our results are generated after the models are converted to ONNX format and compressed.
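For intuition, a simplified average-lagging computation in the spirit of [ma2020simuleval] is sketched below; the reported numbers are produced by SimulEval itself, and AP and DAL are not shown. The function and its arguments are our own simplification.

```python
def average_lagging(delays_ms, source_duration_ms, num_target_tokens):
    """Simplified average lagging (AL) for speech input: delays_ms[i] is the
    amount of source audio (in ms) consumed when the i-th target token is
    emitted; tokens emitted after the full source is read stop contributing."""
    if not delays_ms:
        return 0.0
    rate = source_duration_ms / num_target_tokens         # ideal ms per token
    tau = next((i + 1 for i, d in enumerate(delays_ms) if d >= source_duration_ms),
               len(delays_ms))
    return sum(delays_ms[i] - i * rate for i in range(tau)) / tau

# Toy example: 3 s of audio, 5 target tokens.
print(average_lagging([800, 800, 1500, 2200, 3000], 3000, 5))
```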
4 Evaluation Results
4.1 Transformer Transducer for Speech Translation
The first parts of Tables 1 and 2 contain the BLEU scores and latencies of the cascaded ST models and the TT-based E2E ST models. First, comparing TT-3.2s and TT-160ms, TT-3.2s outperforms TT-160ms in BLEU scores on both EN-ZH and EN-DE, at the cost of a significant latency increase. The differences between TT-3.2s and TT-160ms in BLEU scores are small: 0.7 on EN-ZH and 1.3 on EN-DE, indicating that TT can maintain good translation quality when working in streaming mode. The AL of TT-3.2s is about 2.2s, shorter than the look-ahead time. The reason is that, except for the first output token, TT-3.2s does not have to wait for the whole 3.2s to generate an output. In contrast, the AL of TT-160ms is 841ms, longer than 160ms. This shows that TT-160ms requires multiple frames to handle word reordering. Second, comparing the BLEU scores of the cascaded models and TT-160ms, we observe that on EN-ZH, there is still a gap. However, on the EN-DE test set, TT-160ms outperforms the cascaded model. Note that TT-160ms is a streaming model with a very small latency, whereas the cascaded model is a non-streaming model and is trained using additional text-to-text MT data. We also conduct experiments on a non-streaming AED E2E model, but its performance is not as good as TT-160ms. Finally, we mainly use pseudo-labeling to deal with the data scarcity problem in this study. To exploit the 50K-hour ASR training data, we investigate ASR encoder initialization and ASR multi-task learning as described in Section 3.3. As shown in Table 1, these two methods do not help in our experiments, possibly because our ST models are trained with a large amount of training data.
4.2 Attention Pooling for Joint Networks
Table 1 contains the comparison between TT-160ms and different pooling methods for joint networks. Bilinear pooling does not improve the performance of TT-160ms in this study. The reason may be that it lacks the ability to adapt the pooling weights according to the input, which is important in ST since the output of the prediction network is in a different language and is not monotonic w.r.t. the audio features. The attention pooling methods proposed in this paper show consistent BLEU score improvements over TT-160ms. The latencies are also very close to those of TT-160ms. Note that each input frame is 10ms and the encoder has a subsampling factor of 4. The attention pooling methods are thus at most 1-2 steps slower at the encoder output, and the slightly higher latency of simple attention pooling over qkv attention pooling is negligible. Since the simple attention pooling method obtains a larger BLEU score improvement per additional parameter, which is calculated as the BLEU score improvement divided by the number of additional parameters, we use it for the evaluation on EN-DE. As shown in Table 2, attention pooling obtains a consistent BLEU score improvement over TT-160ms.
4.3 Multilingual ST with TT
The last lines of Table 1 and Table 2 correspond to the EN-ZH output and the EN-DE output of the TT-based streaming E2E multilingual ST model. Note that although the results are shown in different tables, they are generated simultaneously. The BLEU scores of multilingual ST are slightly worse than those of the bilingual TT-160ms models; the differences are 0.1 for EN-ZH and 0.2 for EN-DE, respectively. In addition to good BLEU scores, multilingual ST greatly reduces the model size and computation burden since it shares a single encoder across multiple languages.
Selected rows from Table 1 (EN-ZH) and Table 2 (EN-DE):

| Model | BLEU | AP | AL (ms) | DAL (ms) |
| ASR encoder init | 34.7 | 0.61 | 841 | 834 |
| ASR multi-task learning | 34.7 | 0.61 | 841 | 834 |
| qkv attention (+3.9M) | 35.3 | 0.62 | 875 | 877 |
| multilingual EN-ZH output | 34.8 | 0.61 | 841 | 834 |
| multilingual EN-DE output | 29.2 | 0.61 | 828 | 828 |
5 Conclusions
We propose neural transducers for large-scale streaming E2E ST. To improve the performance of TT for ST, we propose attention pooling for joint networks. Moreover, we extend TT to multilingual ST by sharing the encoder. Experimental results on the EN-ZH and EN-DE test sets of MSLT show that the proposed TT-based streaming E2E ST models achieve high-quality translation performance with low inference latency. More specifically, the proposed streaming E2E ST system outperforms a non-streaming cascaded system on EN-DE.
Acknowledgements
We would like to thank Drs. Long Zhou, Yu Wu, and Shujie Liu at Microsoft Research Asia for valuable suggestions.