Spoken Language Translation (SLT) refers to the task of translating a spoken utterance in a source language into text in a target language. SLT systems are typically categorized into two types: cascade systems and End-to-End systems. Cascade SLT systems in their popular form comprise an automatic speech recognition (ASR) system followed by a machine translation (MT) system. The ASR system generates the text sequence for the spoken utterance, and the MT system translates the generated text sequence into the target language. The speech recognition task has a monotonic alignment between the spoken sequence and the label sequence, but this property is not present in MT and SLT tasks. An ASR system relies more on the local context to transcribe the speech, while an MT system needs the wider context of a word to translate it. In a cascade approach, the ASR and MT systems can be trained separately, and improvements in either ASR or MT performance carry over easily to the SLT model. An End-to-End SLT system is a single model that directly maps the spoken utterance to text in the target language. End-to-End SLT models have to learn the complex mapping between the spoken sequence and the label sequence in the target language, which involves word splitting, merging and reordering. End-to-End SLT systems could result in simpler models to train, have low latency and allow direct optimization of the required task. Using the source language text while optimizing the End-to-End SLT model could improve the performance.
Presently, cascade SLT systems perform better than End-to-End SLT systems. Popular End-to-End SLT systems are sequence-to-sequence models with attention. With the recent advancements in these models, the performance gap between cascade and End-to-End SLT systems is decreasing. However, the cascade SLT system suffers from a large performance gap when the MT model is evaluated on ASR hypotheses rather than on the oracle text. This gap is due to errors in the ASR hypotheses, which the MT system is not able to handle. In this study, we explore approaches to reduce this performance gap by creating an End-to-End differentiable pipeline between the ASR and MT systems. The SLT systems are trained with the ASR objective as an auxiliary loss, and the two networks are connected through neural hidden representations. With this training, the model has a differentiable path between the input (speech) and the final label sequence, and also utilizes the ASR objective for better SLT performance. During inference, the two models are connected through N-best hypotheses, i.e., the N-best hypotheses from the ASR model are translated by the MT system in parallel. The likelihood of every N-best translation is conditioned on the likelihood of the corresponding ASR hypothesis, as described in Equation 7.
Presently, most state-of-the-art MT systems use the transformer model [Transformer]. The transformer is a recurrence-free encoder-decoder model. It employs self-attention to model the dependencies within a sequence and cross-attention to model the correlations across sequences. With the use of causality masks to prevent the model from peeking into the future, the transformer can be parallelized during training. The model has shown the capability to model long-term contexts. The transformer has been adapted for training ASR systems [dong2018_speech_transformer, ESPNET_tranformer] and for developing End-to-End SLT systems [SPMT_tranformer_diag_loss, ESPNET_rnn_tranformer]. In this study, we use transformer models for training ASR, MT and SLT systems, and the performances of End-to-End and cascade SLT models are analyzed. The SLT systems in this study are developed with two different granularities, i.e., character and BPE units. The performances of these models are analyzed on English-to-Portuguese translation tasks.
Recent Developments in SLT systems
Some of the recent SLT models submitted to IWSLT-2019 are described below:
The cascade SLT system with a pipeline of ASR, punctuation model and MT had the best performance in IWSLT 2019 [KIT_IWSLT2019]. The ASR used in that work is an ensemble of an LSTM-based encoder-decoder model and a big transformer model (>150M parameters) trained with stochastic depth [Transformer_stocastic_depth]. The MT model used is a multi-lingual model (>200M parameters) trained for two languages (German and Portuguese). End-to-End SLT systems are trained using LSTM encoder-decoder models with characters as target units in [ON_TRAC_IWSLT2019]. Transformer models have been adapted for training End-to-End SLT systems [di2019adapting]: a convolutional front-end sub-samples the acoustic sequence by a factor of 4, and the models are optimized with an additional distance penalty so that the attention focuses more on the local context. Augmentation methods such as SpecAugment [Spec_Aug] and the use of back-translated synthetic text have improved the performance [FBK_iwslt2019]. The performance of End-to-End SLT systems has been improved by initializing the parameters of the model from independently trained ASR and MT systems [inaguma2018jhu]. Block multi-task learning (BMTL) has been used for training MT systems [CMU_IWSLT2019]. BMTL is an encoder-decoder network in which the encoder is shared by decoders that generate tokens at multiple granularities, such as characters and sub-words. Multi-Engine Machine Translation (MEMT) is a machine translation system that uses inputs in multiple granularities and produces word-based translations. The use of pre-trained language models in encoder-decoder models has been explored for developing MT models in [LIG_IWSLT2019_using_bert]. Multi-modal SLT systems were studied in [Multimodel_MT_IMP_LONDON], where an LSTM-based encoder-decoder ASR is followed by a multi-modal transformer model. Text and video features have been used for developing multi-modal machine translation systems.
All the experiments in this paper are conducted using the HOW2 data-set [how2_dataset_paper]. The data-set comprises train, dev and test sets, which consist of 185K, 2305 and 2022 sentences, amounting to 298, 3 and 4 hours of speech data respectively. All the sentences in the data-set have parallel Portuguese translations. All the models in this work are trained using the train set, and early stopping is done using the dev set.
Pre-processing and Feature Extraction:
All the text is lower-cased and the punctuation symbols are removed. A SentencePiece model is trained to obtain a sub-word vocabulary of 5000 tokens for both the English and the Portuguese texts. The audio is extracted from the videos and a 16 kHz signal is used in these experiments. 40-dimensional Mel filter-bank features are extracted using the Kaldi toolkit with a 25 ms window size and a 10 ms frame shift. The cepstral mean and variance of the features are normalized per video. The performances of all the translation systems are measured using Sacre-BLEU.
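The per-video mean and variance normalization mentioned above can be illustrated with a minimal NumPy sketch. It assumes that the filter-bank frames of one video have already been stacked into a single matrix; the function name `cmvn_per_video` and the random input are hypothetical and are not part of the Kaldi pipeline used in this work.

```python
import numpy as np

def cmvn_per_video(features, eps=1e-8):
    """Normalize a (num_frames, num_bins) matrix to zero mean and unit variance per bin.

    `features` is assumed to hold all frames belonging to one video.
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)

# Hypothetical input: 500 frames of 40-dimensional filter-bank features.
rng = np.random.default_rng(0)
fbank = rng.normal(loc=3.0, scale=2.0, size=(500, 40))
normalized = cmvn_per_video(fbank)
```

After normalization every filter-bank dimension has approximately zero mean and unit variance within the video, which removes channel and recording-level offsets before the features reach the model.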
3 Transformer Model
ASR systems used in this work are trained using transformer models. The transformer is an encoder-decoder model that models temporal dependencies using a self-attention mechanism. A typical encoder-decoder model has three blocks, i.e., an encoder, a decoder and an attention module. The encoder models the sequential dependencies in the input and produces a hidden state sequence H. The decoder is conditioned on the past predicted label sequence $y_1, \ldots, y_{l-1}$ and the context vector $c_l$ to produce the present label $y_l$. The context vector is produced by summarizing the sub-sequence of H that is relevant for producing the correct label. The attention mechanism provides a way to peek into the input sequence and the predicted output sequence while generating a new label.
Let $X=[x_1, x_2, x_3, \ldots, x_T]$ be the input acoustic features and $H=[h_1, h_2, h_3, \ldots, h_{T'}]$ be the hidden representation. The hidden representation H is used by the decoder to produce the label sequence $Y=[y_1, y_2, y_3, \ldots, y_L]$.
The self-attention mechanism produces the output sequence as a weighted sum of the input sequence, where the weights are derived from the similarity between each query and key vector. The self-attention mechanism can be described by the equation below:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (1)$$
where Q, K and V are the query, key and value respectively. Here the query and key have the same dimensionality $d_k$, and the key and value have the same sequence length. The scalar $1/\sqrt{d_k}$ is a normalizing factor that prevents the softmax from entering regions with small gradients.
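As an illustration of the scaled dot-product attention described above, the following NumPy sketch computes the attention output for a single head, with an optional boolean mask. The function names and the toy input are assumptions made for this example and do not correspond to the implementation used in the experiments.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v); mask: (T_q, T_k), True = may attend."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        # Masked positions get a large negative score, so their weight becomes zero.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Toy self-attention over 5 positions with a causal (lower-triangular) mask.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
causal_mask = np.tril(np.ones((5, 5), dtype=bool))
output, weights = scaled_dot_product_attention(x, x, x, mask=causal_mask)
```

With the causal mask, each row of `weights` is non-zero only for the current and earlier positions, which is how a decoder is prevented from looking at future labels during parallel training.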
The self-attention mechanism does not have an explicit notion of position in the sequence. The information about the position in the sequence is conveyed by fixed trigonometric position embeddings. The position embeddings used in this work are described below:
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad (2)$$
In the above equation, $i$ is the index of the dimension and $pos$ represents the position in the sequence. The encoding scheme generates a vector for each position, and these vectors are added to the input feature sequence.
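The fixed trigonometric position embeddings can be sketched as follows. This assumes the standard sinusoidal scheme of the transformer with an even model dimension; the function name and the chosen sizes are illustrative only.

```python
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of fixed position encodings (d_model must be even)."""
    pos = np.arange(max_len)[:, None]        # positions 0 .. max_len-1
    i = np.arange(d_model // 2)[None, :]     # index of each sine/cosine pair
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions
    pe[:, 1::2] = np.cos(angles)             # odd dimensions
    return pe

pe = sinusoidal_position_encoding(max_len=100, d_model=256)
# The encoding is added element-wise to a feature sequence of the same shape.
features = np.zeros((100, 256))
features_with_position = features + pe
```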
The transformer employs multi-head attention to learn the dependencies within the same sequence and across different sequences. Multi-head attention computes scaled dot-product attention multiple times in parallel. Before computing the attention, the queries, keys and values are passed through linear projections. After the scaled dot-product attention is computed in each head, the outputs of all the heads are concatenated and passed through a linear layer with dimension $d_{model}$:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \ldots, head_h)\,W^{O} \qquad (3)$$
where each head is given by
$$head_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \qquad (4)$$
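A position-wise feed-forward network consists of two linear layers with a rectified linear unit (ReLU) activation, as shown in Equation 5.
$$\mathrm{FFN}(x) = \max(0,\ xW_1 + b_1)\,W_2 + b_2 \qquad (5)$$
where $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$, and $b_1$, $b_2$ are the biases for the weights $W_1$ and $W_2$ respectively. Here $d_{ff}$ is the dimensionality of the position-wise feed-forward network. The blocks described above are connected as shown in Figure 1.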
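A minimal NumPy sketch of the position-wise feed-forward network is given below. The dimensions (256 and 2048) and the random initialization are assumptions chosen for illustration; the point of the example is that the same two linear layers are applied independently at every position.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """Apply two linear layers with a ReLU in between to every position of x."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # (T, d_ff)
    return hidden @ W2 + b2                 # (T, d_model)

# Assumed sizes for illustration.
d_model, d_ff, seq_len = 256, 2048, 10
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
y = position_wise_ffn(x, W1, b1, W2, b2)
# Processing one position on its own gives the same result as processing the sequence.
y_single = position_wise_ffn(x[3:4], W1, b1, W2, b2)
```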
While training with speech inputs (i.e., for the ASR and End-to-End SLT models), a convolutional front-end is used. The front-end module contains two convolution layers with a kernel size of 3 and a stride of 2. The convolution layers have 64 output channels and use ReLU activations. These layers reduce the sequence length of the speech features and hence the memory requirements of the transformer. The positional encoding is added to the features after the convolution layers. While training with text inputs (i.e., for MT systems), an embedding layer is used as the input layer to convert the discrete symbol sequence into a sequence of continuous embeddings.
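The reduction in sequence length obtained from the two stride-2 convolution layers can be checked with a short calculation. The sketch below assumes convolutions without padding, which the text does not specify, and the function names are hypothetical.

```python
def conv_output_length(length, kernel_size=3, stride=2, padding=0):
    """Output length of a 1-D convolution along the time axis (padding=0 is an assumption)."""
    return (length + 2 * padding - kernel_size) // stride + 1

def frontend_output_length(num_frames, num_layers=2):
    """Sequence length after the stacked convolution layers of the front-end."""
    for _ in range(num_layers):
        num_frames = conv_output_length(num_frames)
    return num_frames

# A 10-second utterance at a 10 ms frame shift has about 1000 frames.
reduced = frontend_output_length(1000)  # 1000 -> 499 -> 249
```

Under these assumptions the front-end shortens the sequence by roughly a factor of 4. Since self-attention cost grows quadratically with sequence length, this corresponds to roughly a 16-fold reduction in the size of each attention matrix.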
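The model is optimized using the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.99$. The learning rate schedule used is defined below:
$$lr = k \cdot d_{model}^{-0.5} \cdot \min\left(step^{-0.5},\ step \cdot warmup\_steps^{-1.5}\right) \qquad (6)$$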
The models are trained with 25000 warmup steps and the value of k is set to 1. A label smoothing of 0.1 is used for training the models. Both the residual and attention dropout values are set to 0.1.
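The effect of the warmup steps and the scale k can be illustrated with a small sketch of the warmup-based schedule commonly used for transformers. The functional form and the default `d_model=256` are assumptions for this example; only the 25000 warmup steps and k = 1 are taken from the text.

```python
def noam_learning_rate(step, d_model=256, warmup_steps=25000, k=1.0):
    """Learning rate that rises linearly during warmup and then decays as step**-0.5."""
    step = max(step, 1)  # avoid division by zero at step 0
    return k * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = noam_learning_rate(25000)        # maximum, reached at the end of warmup
halfway = noam_learning_rate(12500)     # half of the peak during linear warmup
later = noam_learning_rate(100000)      # half of the peak again after decay
```

With these values the learning rate peaks at about 3.95e-4 after 25000 updates. A longer warmup lowers the peak and delays it, which tends to stabilize the early updates of deep transformer encoders.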
4 Comparing the Performance of Cascade and End-to-End SLT Systems
4.1 Transformer ASR systems
ASR systems are trained with characters or BPE units as target units. Both models are trained with 12 encoder layers and 6 decoder layers for 150 epochs, and each batch comprises 7000 target units. The models from the last ten epochs are averaged and the averaged weights are used during decoding. ASR hypotheses are decoded with a beam size of 10. The start-of-sequence (<SOS>) and end-of-sequence (<EOS>) markers are modeled by additional tokens. The data-set has some very long sequences with 400-500 characters, and decoding these sentences increases the decoding time. To reduce the decoding time, the vectorized beam search described in [vectorized_beam_search] has been used. Decoding with a larger beam size has been counter-productive, because a prefix ending in a noisy EOS token can have a higher log-likelihood than the actual sequence in the beam. To address this, the threshold mechanism described in [selftraining_EOS_threshold_FAIR] has been used, i.e., the EOS is considered only if the probability of EOS is $\gamma$ times that of the top candidate. The value of $\gamma$ is set to 1, which is obtained by tuning on the dev set. Length penalties of 1 and 0.8 are used for the character and BPE models respectively. No language model has been used in this work. The performance of both models is presented in Table 1.
|Dev set(WER)||Test set(WER)|
Row 3 of Table 1 is the performance of the LSTM-based sequence-to-sequence model presented in [how2_dataset_paper]. Rows 4 and 5 are the performances of the ASR systems developed with transformer models with characters and BPE (5K) units as target units. The transformer model with BPE units as targets has performed better. Row 2 is the ASR system trained with TDNN-LFMMI using Kaldi recipes. This ASR system is not used in any of the SLT systems; it is presented only to compare the performances of the ASR systems.
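The EOS threshold used during ASR decoding in section 4.1 can be sketched as a simple check applied at every decoding step. The sketch assumes that the comparison is made on log-probabilities against the best non-EOS candidate; the function name, the symbol `gamma` and the toy distribution are assumptions for illustration rather than the exact rule of [selftraining_EOS_threshold_FAIR].

```python
import numpy as np

def eos_allowed(log_probs, eos_index, gamma=1.0):
    """Return True if EOS may be proposed for a prefix at this decoding step.

    EOS is accepted only when its log-probability is at least gamma times
    the log-probability of the best non-EOS token (assumed form).
    """
    non_eos = np.delete(log_probs, eos_index)
    return bool(log_probs[eos_index] >= gamma * non_eos.max())

# Toy next-token distribution over three tokens.
log_probs = np.log(np.array([0.6, 0.3, 0.1]))
weak_eos = eos_allowed(log_probs, eos_index=2)    # EOS has probability 0.1
strong_eos = eos_allowed(log_probs, eos_index=0)  # EOS has probability 0.6
```

With `gamma=1.0` a prefix can only be finished when EOS is the most likely token, which keeps low-probability, prematurely terminated hypotheses out of the beam.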
4.2 Transformer Machine Translation Systems
Transformer models have been used for training Machine Translation (MT) systems. MT systems are trained with three different input-output granularities: characters-characters, characters-BPE and BPE-BPE. The architecture of the networks is as follows: the model has 6 encoder layers and 6 decoder layers, and the dimensionalities $d_{model}$ and $d_{ff}$ are 512 and 1024 respectively. The models are trained with 8 parallel heads for 150 epochs, and each batch comprises 7000 target units. The models from the last ten epochs are averaged and the averaged weights are used for decoding. The beam size and length penalties for decoding are tuned on the development set. The optimal beam size for all the granularities is 5, but the length penalties vary: for characters-characters, characters-BPE and BPE-BPE the length penalties are 1.2, 1 and 1.2 respectively. The tokens predicted by the model are converted back to text and the Sacre-BLEU score is computed. The EOS threshold described in section 4.1 has been used. The performances of the trained MT systems are presented in Table 2.
|Dev Sacre-BLEU||Test Sacre-BLEU|
From Table 2, it can be observed that the character-character and BPE-BPE models have comparable performances. The character-based models take longer to train and decode than the BPE models. The models with character inputs and BPE outputs have performed better than the other two systems.
4.3 Cascade SLT models
A cascaded SLT system comprises an ASR system followed by an MT system. Two different cascade systems have been built, i.e., ASR systems with characters or BPE units as output units along with the corresponding MT model. The performance of the cascaded SLT systems is presented in Table 3. The ASR and MT models described in sections 4.1 and 4.2 are used in the cascaded pipeline. The ASR hypothesis is decoded, and the decoded hypothesis is used by the MT model to produce the translation.
| Architecture (Input-Output) Granularity | 1-best (Dev set) | 1-best (Test set) | n-best (Dev set) | n-best (Test set) | ranked n-best (Dev set) | ranked n-best (Test set) |
From Table 3, a large performance degradation can be observed between the cascade SLT systems and the MT systems described in section 4.2. Because the ASR hypothesis is used as the input text, the performance of the MT systems degrades by more than 10 BLEU. The 1-best columns give the performance of the standard cascade SLT system, and the n-best columns give the performance of SLT systems that use the n-best ASR hypotheses. The ranked n-best columns give the performance of SLT systems that use n-best hypotheses ranked by their log-likelihood scores. The per-sentence log-probability of each n-best ASR hypothesis is used as the ranking score, and the prefixes that correspond to an ASR hypothesis are initialized with the corresponding n-best ranking score. The general pipeline search is described below:
$$P(Y \mid X) = \sum_{S} P(Y \mid S)\,P(S \mid X) \qquad (7)$$
In the above equation, $X$ is the source speech, $Y$ is the target token sequence and $S$ is the source token sequence. Using the N-best hypotheses without ranking scores from the ASR has degraded the performance, as can be seen in the n-best columns, whereas using the N-best hypotheses with ranking scores has improved the performance. The training procedure described in section 5 optimizes the above objective function by considering only the one-best hypothesis.
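The difference between unranked and ranked N-best decoding can be sketched as follows. The sketch assumes that each ASR hypothesis carries a sentence-level log-probability and that the MT system returns scored translations for a given source text; `rank_translations`, `toy_translate` and all scores are hypothetical values invented for illustration.

```python
def rank_translations(asr_nbest, translate, use_asr_score=True):
    """Pick the best translation over all ASR hypotheses.

    asr_nbest: list of (source_text, asr_log_prob) pairs.
    translate: callable mapping source_text to a list of (target_text, mt_log_prob).
    With use_asr_score=True every translation starts from the log-probability
    of the ASR hypothesis it was generated from.
    """
    candidates = []
    for source_text, asr_log_prob in asr_nbest:
        for target_text, mt_log_prob in translate(source_text):
            score = mt_log_prob + (asr_log_prob if use_asr_score else 0.0)
            candidates.append((score, target_text, source_text))
    return max(candidates, key=lambda c: c[0])

# Toy example with made-up scores.
asr_nbest = [("i like tea", -0.2), ("i like tee", -1.5)]
toy_mt_table = {
    "i like tea": [("eu gosto de chá", -0.9)],
    "i like tee": [("eu gosto de tee", -0.4)],
}

def toy_translate(source_text):
    return toy_mt_table[source_text]

ranked = rank_translations(asr_nbest, toy_translate, use_asr_score=True)
unranked = rank_translations(asr_nbest, toy_translate, use_asr_score=False)
```

In this toy case the unranked search selects the translation of the unlikely ASR hypothesis because the MT model happens to score it higher, while the ranked search keeps the translation of the more reliable hypothesis. This mirrors the degradation seen without ranking scores.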
4.4 End-to-End SLT models
End-to-End SLT systems are trained using transformer models. The architecture of the model is as follows: the model has 12 encoder layers and 6 decoder layers, with $d_{model}$ and $d_{ff}$ set to 256 and 2048 respectively, and 4 heads. The models are trained to predict Portuguese characters or BPE units. The models are trained for 150 epochs and each batch comprises 7000 target units. The models from the last ten epochs are averaged and the averaged weights are used for decoding. The hypotheses are decoded with a beam size of 10 and a length penalty of 1. The beam search and EOS threshold described in section 4.1 are used. The performances of the End-to-End SLT models are tabulated in Table 4.
|Architecture (Target unit) (No.of parameters)||Dev set||Test set|
From Table 4, it can be observed that the models with BPE target units have performed better than the models with character target units. The performance of these models is worse than that of the cascaded SLT models. To increase the model capacity, additional layers have been added to the encoder and decoder; the architecture of each model is given in column 1. Row 8 of Table 4 is a large transformer model with 20 encoder layers, 10 decoder layers and $d_{model}$ of 512, whereas the rest of the models are trained with $d_{model}$ of 256.
4.5 Augmented Training for Cascaded SLT
To reduce the mismatch between the ASR hypotheses and the oracle text, the training corpus for the MT systems is augmented with ASR hypotheses. The ASR hypotheses for the training data are obtained using the ASR model described in section 4.1. The performance of these models is presented in Table 5.
| | Oracle text (Dev set) | Oracle text (Test set) | ASR hypothesis (Dev set) | ASR hypothesis (Test set) |
Rows 1, 2 and 3 are the cascaded SLT systems with different input/output granularities. Columns 2 and 3 are the performances of the cascaded SLT systems evaluated with oracle text, and columns 4 and 5 are the performances of the cascaded SLT systems evaluated with ASR hypotheses. From Table 5, augmented training has not significantly improved the performance of the systems on ASR hypotheses and has reduced their performance when evaluated with oracle text. The models trained from scratch with the augmented data and the models initially trained on clean data and later fine-tuned on augmented data have given similar performance.
4.6 Using Averaged Embeddings
If the model can rely on information from the context of an embedding in addition to the individual embedding itself, it could be more robust to noisy ASR hypotheses. To enforce this in the model, we randomly select 10% of the embeddings in the input sequence and average each of them with another randomly selected embedding from the embedding matrix. This operation creates uncertainty at the present embedding, so that the model has to rely more on the context to retrieve information about the present label. This operation is applied at both the encoder and the decoder, and the results are reported in Table 6.
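| | Encoder averaging, oracle text (Dev set) | Encoder averaging, oracle text (Test set) | Encoder averaging, ASR hypothesis (Dev set) | Encoder averaging, ASR hypothesis (Test set) | Decoder averaging, oracle text (Dev set) | Decoder averaging, oracle text (Test set) | Decoder averaging, ASR hypothesis (Dev set) | Decoder averaging, ASR hypothesis (Test set) |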
In Table 6, columns 2-5 are the performances of MT systems trained by averaging the encoder embeddings; columns 2-3 and 4-5 are the performances when evaluated using oracle text and ASR hypotheses respectively. Columns 6-9 are the MT systems trained by averaging the decoder embeddings; these systems are evaluated using oracle text and ASR hypotheses, and the results are presented in columns 6-7 and 8-9 respectively. From Table 6, it can be observed that averaging the encoder embeddings has degraded the performance of the models, while averaging the embeddings at the decoder has improved the performance of most of the models. Although there are some improvements, the resulting systems are not significantly better than the cascaded model.
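The embedding-averaging operation of section 4.6 can be sketched in NumPy as below. The equal weighting (0.5/0.5), the function name and the toy sizes are assumptions made for this example; the text only states that 10% of the embeddings are averaged with randomly selected embeddings from the embedding matrix.

```python
import numpy as np

def average_with_random_embeddings(token_ids, embedding_matrix, rate=0.1, rng=None):
    """Embed a token sequence and blur a random subset of positions.

    A fraction `rate` of the positions is selected at random, and the embedding at
    each selected position is averaged with a randomly drawn row of the matrix.
    """
    rng = np.random.default_rng() if rng is None else rng
    embedded = embedding_matrix[token_ids].copy()          # (seq_len, dim)
    seq_len = len(token_ids)
    num_noisy = int(round(rate * seq_len))
    positions = rng.choice(seq_len, size=num_noisy, replace=False)
    random_ids = rng.integers(0, embedding_matrix.shape[0], size=num_noisy)
    embedded[positions] = 0.5 * (embedded[positions] + embedding_matrix[random_ids])
    return embedded, positions

# Toy setup: vocabulary of 5000 units, 64-dimensional embeddings, 50 input tokens.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(5000, 64))
token_ids = rng.integers(0, 5000, size=50)
noisy, positions = average_with_random_embeddings(token_ids, embedding_matrix, rng=rng)
```

Only the selected positions are modified; the remaining embeddings are passed through unchanged, so the model still sees mostly clean input while being pushed to use the surrounding context at the blurred positions.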
5 Multi-Task Training of SLT systems with ASR objective as an Auxiliary Loss
From the above sections, it can be observed that the performance of the pipeline-based systems is higher than that of the End-to-End approaches. Although the pipeline-based approach performs better, it can also be observed that the performance gap of the MT systems between evaluation with oracle text and evaluation with ASR hypotheses is more than 10 BLEU. To reduce this performance gap, we have trained a model with an End-to-End differentiable pipeline between the spoken sequence and the target token sequence. This architecture uses the ASR objective as an auxiliary loss. The ASR and MT models described in the above sections are connected, and the hidden representation from the ASR decoder is used as the input to train the MT model. The block diagram describing the model is presented in Figure 2. This model does not make any discrete decisions from the ASR model, and the model is optimized for the final objective.
As shown in the block diagram, the neural hidden representation from the decoder of the ASR model is used as the input to the MT model. The models are optimized with a multi-task loss function. The ASR model in the pipeline receives two gradients, i.e., from the ASR objective and from the MT objective, while the MT model in the pipeline receives gradients with respect to the MT objective only. The models are trained for 150 epochs, and the weights are averaged and used for decoding. The model has two decoders in the pipeline: the ASR decoder produces the n-best hypotheses with a beam size of 10, and the neural hidden representation from each of the n-best outputs is considered as a separate input to the MT model. The MT model produces the n-best hypotheses for each input with a beam size of 5. All the hypotheses produced by the MT model are combined, and the best hypothesis from the MT model is used as the output hypothesis. The performances of the proposed SLT systems are tabulated in Table 7. The block diagram describing the proposed architecture is presented in Figure 3.
|Dev set||Test set||Dev set||Test set|
From Table 7, it can be observed that the performance of this model is better than that of the standard pipeline systems. The proposed model improves the BLEU score by 4-5 compared to the End-to-End systems. From Table 7, it can also be observed that the systems in which both the ASR and MT components are trained with BPE units perform better than the systems trained using ASR with characters and MT with BPE units. This difference in performance can be attributed to the correspondence between the tokens of the ASR and MT objectives. As the data-set has parallel text for training, there could be a correspondence between the top 5K BPE units in English and the Portuguese tokens, so optimizing with the joint loss has optimized the ASR model with BPE targets better than the model optimized with character targets. This can also be observed from columns 3-4 of Table 7, which give the WERs obtained from the jointly trained model. With joint training, the ASR model with BPE target units has better performance than the ASR model with character units, and this has led to better inference of the bottleneck representations and the MT target sequence. To improve this performance further, the ASR and MT models could be ensembled with separately trained ASR and MT models. The ensembling and joint decoding are presented in section 6.
6 SLT Systems with Multi-Task Training and Joint Decoding
From Table 7, it can be observed that the performance of the ASR model in the joint training is not on par with the performance in subsection 4.1. To improve this performance, the ASR model is ensembled with the ASR model trained in subsection 4.1. During inference, the softmax distributions from both models are computed for each prefix, the distributions are averaged, and the averaged distribution is used for the beam search. The proposed model also produces characters or BPE units as outputs along with the neural hidden representations. Along with the neural hidden representations, the characters or BPE units can also be used for translation, and the MT models described in subsection 4.2 can be used to generate the translations. Both models are ensembled and the performances are tabulated in Table 8. The block diagram for the joint decoding is presented in Figure 3. The ASR model is decoded with a beam size of 10, and every hypothesis is then decoded by an MT model with a beam size of 5.
|Dev set||Test set||Dev set||Test set||Dev set||Test set||Dev set||Test set|
In Table 8, columns 2-3 are the performances of the jointly trained model. Columns 4-5 and 6-7 are the performances of SLT systems with ASR ensembling and MT ensembling respectively. Columns 8-9 are the performances of SLT systems with both ASR and MT ensembling. From Table 8, it can be observed that joint decoding with both models has improved the performance over either model alone.
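The ensembling step used in the joint decoding, in which the softmax distributions of two models are averaged for each prefix, can be sketched as follows. The function name and the toy distributions are hypothetical; equal weights for the two models are assumed.

```python
import numpy as np

def ensemble_log_probs(prob_joint, prob_standalone):
    """Average two next-token distributions and return log-probabilities for beam search."""
    averaged = 0.5 * (prob_joint + prob_standalone)
    return np.log(averaged)

# Toy next-token distributions over a three-token vocabulary for the same prefix.
prob_joint = np.array([0.5, 0.4, 0.1])       # jointly trained model
prob_standalone = np.array([0.2, 0.7, 0.1])  # independently trained model
log_probs = ensemble_log_probs(prob_joint, prob_standalone)
best_token = int(np.argmax(log_probs))
```

Here the jointly trained model alone would prefer token 0, but the averaged distribution prefers token 1. Averaging in the probability domain keeps the result a valid distribution, so its log-probabilities can be accumulated in the beam exactly like those of a single model.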
7 Using Pre-trained ASR & MT models for Multi-Task Training of SLT systems
The SLT model described in section 5 is initialized using the pre-trained ASR and MT models. The performance of these models is presented in Table 9. The pre-trained models are connected as described in the block diagram in Figure 2. The two models are connected with an additional linear layer or with self-attention layers. The models are trained either by fine-tuning only the additional layers or by fine-tuning the whole network. The models are trained for 100 epochs and decoded with the same hyper-parameters as mentioned in section 5.
|Transformer-(characters-BPE) pre-training+ self-attn layers||42.74||42.87||42.49||42.33||42.43||42.33||41.95||41.7|
|Transformer-(char-BPE) fine-tuning+ self-attn layers||42.65||42.93||42.82||42.82||43.81||43.22||43.36||42.98|
In Table 9, rows 3 and 8 are the performances of systems trained from scratch. Rows 4, 5 and 9, 10 are the performances of the models obtained by fine-tuning only the inserted layers. Rows 6, 7 and 11, 12 are the performances obtained by fine-tuning the whole pre-trained network along with the inserted layers. From columns 2 and 3 of Table 9, it can be observed that initializing the weights of the joint model from separately trained ASR and MT models has shown improvements when the outputs of the ASR are characters, i.e., rows 4-7 have better performances than row 3, but rows 9-11 have a slight degradation compared to row 8. This difference in performance can be attributed to the fact that the representations produced by the network for characters are more generic to the task than those for BPE units. This can also be seen in the sharp drop in the performance of the stand-alone system in row 9; the performance is recovered when the discrete BPE units are decoded and translated through the MT model. The joint models trained from scratch have shown larger improvements when ensembled with other independent ASR and MT models, as in rows 3 and 8. The joint models with weights initialized from separately trained ASR and MT models, as in rows 4-7 and 9-12, did not improve much from ensembling, as the ensembled models are similar. Ensembling with MT models has shown larger improvements in BLEU score than ensembling with ASR models. This improvement could be attributed to the fact that the MT model in the joint model uses the continuous representations from the ASR module, while the ensembled MT model uses discrete representations; the two modules use different modalities, and the MT model is the last stage in the whole pipeline. Using the n-best hypotheses and doing a coupled search between ASR and MT has given better performance than using the 1-best hypothesis. Training the model with the proposed approach has given the best BLEU scores of 47.33 and 46.9 on the dev and test sets of the HOW2 data-set.
This is on par with the best performing system [KIT_IWSLT2019] in IWSLT-2019 on the How2 data-set. The total number of parameters of the model (ASR+MT) is around 70M, which is far fewer than the (>350M) parameters of the models in [KIT_IWSLT2019].
8 Conclusion & Future Scope
The performance of End-to-End SLT models has been improving, but pipeline-based approaches still perform better than End-to-End approaches. A large performance gap can be observed when machine translation models are evaluated with ASR hypotheses instead of the oracle text transcriptions. The proposed systems aim to reduce this gap by training models with an end-to-end differentiable pipeline between the ASR and MT models. Using the ASR objective as an auxiliary loss while optimizing the models has improved the performance of the joint SLT system. The performance gains are higher when both the ASR and MT models are trained with BPE units as targets. As all the models are transformer models, they could be replaced with non-autoregressive models, which could reduce the decoding latency. A mechanism that forces the transformer to retrieve information from the context in the presence of a less confident embedding could help reduce the performance gap of MT models between oracle text and noisy ASR hypotheses. While training with noisy transcripts, using a sentence-wise confidence metric and scaling down the training penalty according to this metric could improve the performance of MT on noisy transcriptions. Using additional unpaired data for training the ASR and MT models could improve the performance of the ensemble.