End-to-end (E2E) speech translation (ST), a paradigm that directly maps audio to text in a foreign language, has been gaining popularity recently duong-etal-2016-attentional; berard2016listen; bansal2018low; di2019adapting; wang2019bridging. Based on the attentional encoder-decoder framework DBLP:journals/corr/BahdanauCB14, it optimizes model parameters under direct translation supervision. This end-to-end paradigm avoids the error propagation that is inherent in cascade models, where an automatic speech recognition (ASR) model and a machine translation (MT) model are chained together. Nonetheless, previous work still reports that E2E ST delivers inferior performance compared to cascade methods niehues_j_2019_3532822.
We study one reason for the difficulty of training E2E ST models, namely the uneven spread of information in the speech signal, as visualized in Figure 1, and the consequent difficulty of extracting informative features. Features corresponding to uninformative signals, such as pauses or noise, increase the input length and bring in unmanageable noise for ST. This increases the difficulty of learning Zhang2019trainable; na2019adaptive and reduces translation performance.
In this paper, we propose adaptive feature selection (AFS) for ST to explicitly eliminate uninformative features. Figure 2 shows the overall architecture. We employ a pretrained ASR encoder to induce contextual speech features, followed by an ST encoder bridging the gap between the speech and translation modalities. AFS is inserted in between them to select a subset of features for ST encoding (see the red rectangles in Figure 1). To ensure that the selected features are well-aligned to transcriptions, we pretrain AFS on ASR. AFS estimates the informativeness of each feature through a parameterized gate, and encourages the dropping of features (pushing the gate to 0) that contribute little to ASR. An underlying assumption is that features irrelevant for ASR are also unimportant for ST.
We base AFS on L0Drop zhang2020sparsifying, a sparsity-inducing method for encoder-decoder models, and extend it to sparsify speech features. The acoustic input of speech signals involves two dimensions: temporal and feature, where the latter describes the spectrum extracted from time frames. Accordingly, we adapt L0Drop to sparsify encoder states along the temporal and feature dimensions using different gating networks. In contrast to zhang2020sparsifying, who focus on efficiency and report a trade-off between sparsity and quality for MT and summarization, we find that sparsity also improves translation quality for ST.
We conduct extensive experiments with the Transformer NIPS2017_7181_attention on the LibriSpeech En-Fr and MuST-C speech translation tasks, covering 8 different language pairs. Results show that AFS only retains about 16% of the temporal speech features, revealing heavy redundancy in speech encodings and yielding a decoding speedup of 1.4×. AFS eases model convergence, and improves translation quality by 1.3–1.6 BLEU, surpassing several strong baselines. Specifically, without data augmentation, AFS narrows the performance gap to the cascade approach, and outperforms it on LibriSpeech En-Fr by 0.29 BLEU, reaching 18.56 BLEU. We compare against fixed-rate feature selection and a simple CNN, confirming that adaptive feature selection offers better translation quality.
Our work demonstrates that E2E ST suffers from redundant speech features, with sparsification bringing significant performance improvements. The E2E ST task offers new opportunities for follow-up research in sparse models to deliver performance gains, apart from enhancing efficiency and/or interpretability.
2 Background: L0Drop
L0Drop provides a selective mechanism for encoder-decoder models which encourages removing uninformative encoder outputs via a sparsity-inducing objective zhang2020sparsifying. Given a source sequence $x$, L0Drop assigns each encoded source state $h_i$ a scalar gate $g_i \in [0, 1]$ as follows:

$g_i = \min\left(1, \max\left(0, s_i(\zeta - \gamma) + \gamma\right)\right), \quad s_i = \sigma\left(\left(\log u_i - \log(1 - u_i) + \log \alpha_i\right)/\beta\right), \quad u_i \sim \mathcal{U}(0, 1)$
$\beta$, $\gamma < 0$ and $\zeta > 1$ are hyperparameters of the hard concrete distribution (HardConcrete) louizos2017learning.
Note that the hyperparameter $\alpha_i$ is crucial to HardConcrete as it directly governs its shape. We associate $\alpha_i$ with the state $h_i$ through a gating network:

$\log \alpha_i = h_i^\top w$

Thus, L0Drop can schedule HardConcrete via $\alpha_i$ to put more probability mass at either 0 (i.e. $g_i = 0$) or 1 (i.e. $g_i = 1$). $w$ is a trainable parameter. Intuitively, L0Drop controls the openness of the gate $g_i$ via $\alpha_i$ so as to determine whether to remove ($g_i = 0$) or retain ($g_i = 1$) the state $h_i$.
L0Drop enforces sparsity by pushing the probability mass of HardConcrete towards 0, according to the following penalty term:

$\mathcal{L}_0(x) = \sum_{i=1}^{|x|} \left(1 - P(g_i = 0)\right) = \sum_{i=1}^{|x|} \sigma\left(\log \alpha_i - \beta \log \frac{-\gamma}{\zeta}\right) \quad (4)$
By sampling with reparameterization kingma2013auto, L0Drop is fully differentiable and optimized with an upper bound on the objective: $\mathcal{L} = \mathcal{L}_{MLE} + \lambda \mathcal{L}_0(x)$, where $\lambda$ is a hyperparameter affecting the degree of sparsity – a larger $\lambda$ enforces more gates near 0 – and $\mathcal{L}_{MLE}$ denotes the maximum likelihood loss. An estimate of the expected value of $g_i$ is used during inference. zhang2020sparsifying applied L0Drop to prune encoder outputs for MT and summarization tasks; we adapt it to E2E ST. Sparse stochastic gates and relaxations were also used by bastings-etal-2019-interpretable to construct interpretable classifiers, i.e. models that can reveal which tokens they rely on when making a prediction.
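The gating mechanism above can be sketched in a few lines. The following is a minimal NumPy illustration of HardConcrete sampling and the expected-L0 penalty; the hyperparameter values (β, γ, ζ) are common defaults from louizos2017learning, not necessarily the exact ones used in this work.

```python
import numpy as np

# Minimal NumPy sketch of the HardConcrete gate behind L0Drop
# (Louizos et al., 2017). beta/gamma/zeta are common defaults,
# not necessarily the exact values used in this paper.
BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1

def hard_concrete_sample(log_alpha, rng):
    """Sample gates g in [0, 1] with the reparameterization trick (training)."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(log_alpha))
    s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log1p(-u) + log_alpha) / BETA))
    return np.clip(s * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)  # stretch, then clip

def open_probability(log_alpha):
    """P(g > 0), the quantity summed by the sparsity penalty L0."""
    return 1.0 / (1.0 + np.exp(-(log_alpha - BETA * np.log(-GAMMA / ZETA))))

rng = np.random.default_rng(0)
log_alpha = np.array([-4.0, 0.0, 4.0])       # small alpha pushes the gate to 0
gates = hard_concrete_sample(log_alpha, rng)
penalty = open_probability(log_alpha).sum()  # added to the loss, scaled by lambda
```

Driving `log_alpha` low both closes the gate and shrinks the penalty term, which is exactly the pressure that the sparsity objective exerts.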
3 Adaptive Feature Selection
One difficulty with applying encoder-decoder models to E2E ST is deciding how to encode speech signals. In contrast to text, where word boundaries can be easily identified, the spectral features of speech are continuous and vary remarkably across different speakers for the same transcript. In addition, redundant information, like pauses in between neighbouring words, can be of arbitrary duration at any position, as shown in Figure 1, while contributing little to translation. This increases the burden on, and occupies the capacity of, the ST encoder, leading to inferior performance duong-etal-2016-attentional; berard2016listen. Rather than developing complex encoder architectures, we resort to feature selection to explicitly clear out those uninformative speech features.
Figure 2 gives an overview of our model. We use a pretrained and frozen ASR encoder to extract contextual speech features, and collect the informative ones via AFS before transmission to the ST encoder. AFS drops pauses, noise and other uninformative features, and retains features that are relevant for ASR. We speculate that these retained features are also the most relevant for ST, and that the sparser representation simplifies the learning problem for ST, for example the learning of attention strength between encoder states and target-language (sub)words. Given a training tuple (audio, source transcription, translation), denoted as $(a, x, y)$ respectively (note that our model only requires pair-wise training corpora: $(a, x)$ for ASR, and $(a, y)$ for ST), we outline the overall framework below, consisting of three steps:
E2E ST with AFS:
1. Train the ASR model with its objective and model architecture (Eq. 5-6) until convergence.
2. Finetune the ASR model with AFS (Eq. 7-8) for a fixed number of steps.
3. Train the ST model with the pretrained and frozen ASR and AFS submodules (Eq. 9-10) until convergence.
We handle both ASR and ST as sequence-to-sequence problems with encoder-decoder models. We use $enc(\cdot)$ and $dec(\cdot)$ to denote the corresponding encoder and decoder, respectively. $AFS(\cdot)$ denotes the AFS approach; the ASR encoder and the AFS module are frozen during ST training. Note that our framework puts no constraint on the architecture of the encoder and decoder in any task, although we adopt the multi-head dot-product attention network NIPS2017_7181_attention for our experiments.
The ASR model (Eq. 6) directly maps an audio input to its transcription. To improve speech encoding, we apply a logarithmic penalty on attention to enforce short-range dependency di2019adapting and use trainable positional embeddings with a maximum length of 2048. Apart from the maximum likelihood loss, we augment the training objective with the connectionist temporal classification (CTC) loss Graves06connectionisttemporal as in Eq. 5. The CTC loss is applied to the encoder outputs, guiding them to align with their corresponding transcription (sub)words and improving the encoder's robustness karita2019transformerasr. We set the CTC loss weight following previous work karita2019transformerasr; wang2020curriculum.
This stage aims at using AFS to dynamically pick out the subset of ASR encoder outputs that are most relevant for ASR performance (see the red rectangles in Figure 1). We follow zhang2020sparsifying and place AFS in between the ASR encoder and decoder during finetuning (Eq. 8). We exclude the CTC loss from the training objective (Eq. 7) to relax the alignment constraint and increase the flexibility of feature adaptation. We use L0Drop for AFS in two ways.
AFS^t: The direct application of L0Drop to the ASR encoder outputs results in AFS^t, sparsifying the encodings $h_i$ along the temporal dimension $t$:

$\hat{h}_i = g^t_i \cdot h_i, \quad g^t_i = \text{HardConcrete}(\alpha^t_i) \quad (11)$

where $\alpha^t_i$ is a positive scalar produced by a simple linear gating layer, and $w^t$ is a trainable parameter of dimension $d$. $g^t_i$ is the temporal gate. The sparsity penalty of AFS^t follows Eq. 4:

$\mathcal{L}^t_0(a) = \sum_{i} \left(1 - P(g^t_i = 0)\right) \quad (12)$
AFS^{t,f}: In contrast to text processing, speech processing often extracts a spectrum from overlapping time frames to form the acoustic input, analogous to word embeddings. As each encoded speech feature contains temporal information, it is reasonable to extend AFS^t to AFS^{t,f}, which adds sparsification along the feature dimension $f$:

$\hat{h}_i = g^t_i \cdot \left(g^f \odot h_i\right), \quad g^f = \text{HardConcrete}(\alpha^f)$

where $g^f$ estimates the weight of each feature, governed by an input-independent gating model with trainable parameter $w^f$ (other candidate gating models, like a linear mapping over mean-pooled encoder outputs, delivered worse performance in our preliminary experiments). $g^f$ is the feature gate; note that it is shared across all time steps, and $\odot$ denotes element-wise multiplication. AFS^{t,f} reuses the $g^t$-relevant submodules in Eq. 11, and extends the sparsity penalty in Eq. 12 as follows:

$\mathcal{L}^{t,f}_0(a) = \mathcal{L}^t_0(a) + \sum_{j=1}^{d} \left(1 - P(g^f_j = 0)\right) \quad (13)$
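At inference time, both gates reduce to deterministic weights. Below is a hedged NumPy sketch of AFS^{t,f}: one scalar temporal gate per encoder state plus a single input-independent feature gate shared across time. The deterministic expected-gate estimate and the parameter names (w_t, b_t, w_f) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Sketch of AFS^{t,f} at inference: one temporal gate per time step and a
# single feature gate shared across all steps. The expected-gate estimate
# and the parameter names (w_t, b_t, w_f) are illustrative assumptions.
GAMMA, ZETA = -0.1, 1.1

def expected_gate(log_alpha):
    s = 1.0 / (1.0 + np.exp(-np.asarray(log_alpha, dtype=float)))
    return np.clip(s * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

def afs_tf(H, w_t, b_t, w_f):
    """H: (T, d) ASR encoder outputs. Returns gated states and temporal gates."""
    g_t = expected_gate(H @ w_t + b_t)             # (T,): temporal gates
    g_f = expected_gate(w_f)                       # (d,): shared feature gate
    return g_t[:, None] * (H * g_f[None, :]), g_t  # apply both gates

T, d = 6, 4
rng = np.random.default_rng(1)
H = rng.normal(size=(T, d))
H_gated, g_t = afs_tf(H, rng.normal(size=d), -3.0, rng.normal(size=d))
```

States whose temporal gate is 0 are dropped entirely before ST encoding, while the feature gate merely reweights dimensions, matching the observation later that the feature gate induces no sparsity.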
E2E ST Training
We treat the pretrained ASR and AFS models as a speech feature extractor, and freeze them during ST training. We gather the speech features emitted by the ASR encoder whose temporal gates are open ($g^t_i > 0$), and pass them, analogously to word embeddings, to the ST encoder. We employ sinusoidal positional encoding to distinguish features at different positions. Except for the input to the ST encoder, our E2E ST model follows the standard encoder-decoder translation model (Eq. 10) and is optimized with the maximum likelihood loss alone, as in Eq. 9. Intuitively, AFS bridges the gap between ASR output and MT input by selecting transcript-aligned speech features.
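The hand-off from the frozen ASR encoder to the ST encoder can be sketched as follows: keep only states whose temporal gate is open, compact them, and add fresh sinusoidal positions. This is a minimal sketch assuming the standard sinusoidal encoding of NIPS2017_7181_attention.

```python
import numpy as np

# Sketch of the ASR-to-ST hand-off: drop pruned time steps, compact the
# sequence, and add sinusoidal positions so the ST encoder sees a dense input.
def sinusoidal_pe(T, d):
    pos = np.arange(T)[:, None]
    dim = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (dim // 2)) / d)
    return np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))

def select_for_st(H, g_t):
    """H: (T, d) gated encoder states; g_t: (T,) temporal gates."""
    kept = H[g_t > 0]                            # remove closed-gate steps
    return kept + sinusoidal_pe(len(kept), H.shape[1])

H = np.zeros((5, 8))
g_t = np.array([0.0, 0.9, 0.0, 0.4, 0.0])        # 3 of 5 steps pruned here
st_input = select_for_st(H, g_t)                 # shape (2, 8)
```

Because positions are recomputed on the compacted sequence, the ST encoder never sees the gaps left by pruned frames.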
Datasets and Preprocessing
We experiment with two benchmarks: the Augmented LibriSpeech dataset (LibriSpeech En-Fr) kocabiyikoglu-etal-2018-augmenting and the multilingual MuST-C dataset di-gangi-etal-2019-must. LibriSpeech En-Fr was collected by aligning French e-books with English utterances of LibriSpeech, further augmented with French translations offered by Google Translate. We use the 100-hour clean training set, comprising 47K utterances, to train ASR models, and double that size for ST models by concatenating the Google translations. We report results on the test set (2048 utterances) using models selected on the dev set (1071 utterances). MuST-C is built from English TED talks, covering 8 translation directions: English to German (De), Spanish (Es), French (Fr), Italian (It), Dutch (Nl), Portuguese (Pt), Romanian (Ro) and Russian (Ru). We train ASR and ST models on the given training set, containing 452 hours with 252K utterances on average for each translation pair. We adopt the given dev set for model selection and report results on the common test set, whose size ranges from 2502 (Es) to 2641 (De) utterances.
For all datasets, we extract 40-dimensional log-Mel filterbanks with a step size of 10ms and a window size of 25ms as the acoustic features. We expand these features with their first- and second-order derivatives, and stabilize them using mean subtraction and variance normalization. We stack the features of three consecutive frames without overlap, resulting in the final 360-dimensional acoustic input. For transcriptions and translations, we tokenize and truecase all text using the Moses scripts koehn-etal-2007-moses. We train subword models sennrich-etal-2016-neural on each dataset with a joint vocabulary size of 16K to handle rare words, and share the model for ASR, MT and ST. We train all models without removing punctuation.
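The feature pipeline above can be sketched as follows. The delta computation shown (simple frame-to-frame differences) is an illustrative simplification of the usual regression-based deltas.

```python
import numpy as np

# Sketch of the acoustic input pipeline: 40-dim log-Mel features, first- and
# second-order deltas (-> 120 dims), then stacking 3 consecutive frames
# without overlap (-> 360 dims). The difference-based deltas are an
# illustrative simplification of the usual regression formula.
def add_deltas(fbank):
    delta = np.diff(fbank, axis=0, prepend=fbank[:1])
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])
    return np.concatenate([fbank, delta, delta2], axis=1)    # (T, 120)

def stack_frames(feats, k=3):
    T = (len(feats) // k) * k          # drop the trailing remainder frames
    return feats[:T].reshape(T // k, -1)

fbank = np.random.default_rng(2).normal(size=(100, 40))      # (frames, 40)
acoustic_input = stack_frames(add_deltas(fbank))             # (33, 360)
```

Stacking shortens the sequence by a factor of 3 while widening each input vector, which already reduces the temporal load on the ASR encoder before AFS does any pruning.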
Model Settings and Baselines
We adopt the Transformer architecture NIPS2017_7181_attention for all tasks, including ASR (Eq. 6), AFS finetuning (Eq. 8) and ST (Eq. 10). The encoder and decoder consist of 6 identical layers each, every layer including a self-attention sublayer, a cross-attention sublayer (decoder only) and a feedforward sublayer. We employ the base setting for experiments: hidden size 512, 8 attention heads and feedforward size 2048. We schedule the learning rate via Adam kingma2014adam, paired with 4K warmup steps. We apply dropout to attention weights and residual connections with rates of 0.1 and 0.2 respectively, and also add label smoothing of 0.1 to counter overfitting. We train all models for a maximum of 30K steps with a minibatch size of around 25K target subwords. We average the last 5 checkpoints for evaluation. We use beam search for decoding, and set the beam size and length penalty to 4 and 0.6, respectively. We set the HardConcrete hyperparameters for AFS following louizos2017learning, and finetune AFS for a fixed number of additional steps. We evaluate translation quality with tokenized case-sensitive BLEU papineni-etal-2002-bleu, and report WER for ASR performance without punctuation.
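The warmup schedule can be written compactly. The inverse-square-root decay below is the schedule commonly paired with Adam for Transformer training NIPS2017_7181_attention; whether exactly this variant is used here is an assumption.

```python
# Inverse-square-root learning-rate schedule with linear warmup, as commonly
# paired with Adam in Transformer training; the exact variant used in the
# paper is an assumption.
def lr(step, d_model=512, warmup=4000):
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# LR grows linearly until the warmup step, then decays as 1/sqrt(step).
peak = lr(4000)
```

The peak learning rate is reached exactly at the warmup step, after which both branches of the `min` coincide and the decay branch takes over.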
We compare our models with four baselines:
- ST: A vanilla Transformer-based E2E ST model with 6 encoder and 6 decoder layers. The logarithmic attention penalty di2019adapting is used to improve the encoder.
- ST + ASR-PT:
We perform the ASR pretraining (ASR-PT) for E2E ST. This is the same model as ours (Figure 2) but without AFS finetuning.
- Cascade: We first transcribe the speech input using an ASR model, and then pass the result to an MT model. We also use the logarithmic attention penalty di2019adapting for the ASR encoder.
- ST + Fixed Rate:
Instead of dynamically selecting features, we replace AFS with subsampling at a fixed rate $n$: we extract the speech encodings at every $n$-th position.
Besides, we offer another baseline, ST + CNN, for comparison on MuST-C En-De: we replace the fixed-rate subsampling with a one-layer 1D depthwise-separable convolution with an output dimension of 512, a kernel size of 5 over the temporal dimension and a stride of 6. In this way, the ASR encoder features are compressed to around 1/6 of their original number, a ratio similar to the fixed-rate subsampling.
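The fixed-rate baseline in miniature: keeping one encoder state in every $n$ yields a fixed sparsity rate of $(n-1)/n$, about 83.3% for $n = 6$.

```python
import numpy as np

# The fixed-rate subsampling baseline in miniature: keep one encoder state
# in every n, giving a fixed sparsity rate of (n-1)/n, e.g. ~83.3% for n=6.
def fixed_rate_subsample(H, n=6):
    """H: (T, d) encoder states -> roughly T/n states, context-independent."""
    return H[::n]

H = np.zeros((60, 512))
subsampled = fixed_rate_subsample(H)       # (10, 512)
sparsity = 1.0 - len(subsampled) / len(H)  # 5/6, ~0.833
```

Unlike AFS, this selection is blind to content: it keeps states at fixed intervals whether they carry speech or silence.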
4.1 Results on MuST-C En-De
Figure 3: The feature gate acts as a weighting mechanism without dropping any neurons, i.e. a feature sparsity rate of 0%. By contrast, speech features are highly redundant along the temporal dimension, easily inducing a sparsity rate of 85%.
We perform a thorough study on MuST-C En-De. With AFS, the first question concerns its feasibility. We start by analyzing the degree of sparsity in speech features (i.e. the sparsity rate) yielded by AFS, considering both the temporal sparsity rate and the feature sparsity rate. To obtain different rates, we vary the hyperparameter $\lambda$ in Eq. 7 over a range of values with a step size of 0.1.
Results in Figure 3 show that large amounts of encoded speech features can be easily pruned away, revealing heavy inner-speech redundancy. Both AFS^t and AFS^{t,f} drop around 60% of the temporal features at $\lambda = 0.1$, and this number increases to around 85% as $\lambda$ grows (Figure 3(b)), remarkably surpassing the sparsity rate reported by zhang2020sparsifying on text summarization. In contrast to the rich temporal sparsification, we obtain a feature sparsity rate of 0%, regardless of $\lambda$'s value, although increasing $\lambda$ decreases the feature gate values (Figure 3(a)). This suggests that selecting neurons along the feature dimension is harder. Rather than filtering neurons, the feature gate acts more like a weighting mechanism on them. In the rest of the paper, we use sparsity rate to refer to the temporal sparsity rate.
We continue to explore the impact of varied sparsity rates on ASR and ST performance. Figure 4 shows their correlation. We observe that AFS slightly degrades ASR accuracy (Figure 4(a)), but still retains 95% of the accuracy on average; AFS^{t,f} often performs better than AFS^t at similar sparsity rates. The fact that around 15% of the speech features suffice to support 95% of the ASR accuracy proves the informativeness of these selected features. These findings echo zhang2020sparsifying, who observe a trade-off between sparsity and quality.
However, when AFS is applied to ST, we find consistent improvements in translation quality, as shown in Figure 4(b). Translation quality on the development set peaks at 22.17 BLEU, achieved by AFS^{t,f} at a sparsity rate of 85.5%. We use the setting corresponding to a sparsity rate of around 85% for all other experiments, since AFS^t and AFS^{t,f} reach their optimum at this point.
Table 1: Test results on MuST-C En-De; temporal sparsity rates in brackets.

| Model | BLEU | Speedup |
| --- | --- | --- |
| ST + ASR-PT | 20.67 | 1.00× |
| ST + CNN | 20.64 | 1.31× |
| ST + Fixed Rate (n=6) | 21.14 (83.3%) | 1.42× |
| ST + Fixed Rate (n=7) | 20.87 (85.7%) | 1.43× |
| ST + AFS^t | 21.57 (84.4%) | 1.38× |
| ST + AFS^{t,f} | 22.38 (85.1%) | 1.37× |
We summarize the test results in Table 1, where we set $n = 6$ or $n = 7$ for ST + Fixed Rate, corresponding to a sparsity rate of around 85% as suggested by the above analysis. Our vanilla ST model yields a BLEU score of 17.44; pretraining on ASR further lifts the performance to 20.67, significantly outperforming the results of di2019adapting by 3.37 BLEU. This also confirms the importance of speech encoder pretraining di2019adapting; stoian2020analyzing; wang2020curriculum. We treat ST with ASR-PT as our real baseline. We observe improved translation quality with fixed-rate subsampling, +0.47 BLEU at $n = 6$. Subsampling offers a chance to bypass noisy speech signals, and reducing the number of source states makes learning the translation alignment easier, but deciding on the optimal sampling rate is tough. Results in Figure 5 reveal that fixed-rate subsampling deteriorates ST performance at suboptimal rates. Replacing fixed-rate subsampling with our one-layer CNN also fails to improve over the baseline, although the CNN offers more flexibility in feature manipulation. In contrast to fixed-rate subsampling, the proposed AFS is data-driven, shifting the decision burden to the data and model themselves. As a result, AFS^t and AFS^{t,f} surpass ASR-PT by 0.90 BLEU and 1.71 BLEU, respectively, substantially narrowing the performance gap to the cascade baseline (-0.14 BLEU).
We also observe improved decoding speed: AFS runs 1.37× faster than ASR-PT. Compared to fixed-rate subsampling, AFS is slightly slower, which we ascribe to the overhead introduced by the gating module. Surprisingly, Table 1 shows that the vanilla ST model runs slower than ASR-PT (0.87×) while the cascade model is slightly faster (1.06×). By digging into the beam search algorithm, we discover that ASR pretraining shortens beam decoding: ASR-PT requires fewer decoding steps than vanilla ST on average. The speedup brought by cascading is due to the smaller English vocabulary compared to the German vocabulary when processing audio inputs.
4.2 Why (Adaptive) Feature Selection?
Apart from the benefits in translation quality, we study other potential effects of (adaptive) feature selection. We begin by inspecting the training curves. Figure 6 shows that ASR pretraining improves model convergence, and feature selection makes training more stable. Compared to the other models, the curve of ST with AFS is much smoother, suggesting a better regularization effect.
We then investigate the effect of training data size, with results shown in Figure 7. Overall, we do not observe higher data efficiency from feature selection in low-resource settings. Instead, our results suggest that feature selection delivers larger performance improvements when more training data is available. With respect to data efficiency, ASR pretraining appears more important (Figure 7, left) bansal-etal-2019-pre; stoian2020analyzing. Compared to AFS, fixed-rate subsampling suffers more from small-scale training: it yields worse performance than ASR-PT when the training data is small, highlighting the better generalization of AFS.
In addition to model performance, we also look into the ST model itself, focusing on the cross-attention weights. Figure 8 visualizes the attention weight distribution; ST models with feature selection noticeably shift the distribution towards larger weights. This suggests that each ST encoder output exerts greater influence on the translation. By removing redundant and noisy speech features, feature selection eases the learning of the ST encoder, and also strengthens its connection to the ST decoder. This helps bridge the modality gap between speech and text translation. Although fixed-rate subsampling also yields a distribution shift similar to AFS, its inferior ST performance corroborates the better quality of adaptively selected features.
Table 2: Test results on MuST-C for all 8 language pairs.

| | De | Es | Fr | It | Nl | Pt | Ro | Ru |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Transformer + ASR-PT | 21.77 | 26.41 | 31.56 | 21.46 | 25.22 | 26.84 | 20.53 | 14.31 |
| ST + ASR-PT | 20.67 | 25.96 | 32.24 | 20.84 | 23.27 | 24.83 | 19.94 | 13.96 |
| ST + AFS^t | 21.57 | 26.78 | 33.34 | 23.08 | 24.68 | 26.13 | 21.73 | 15.10 |
| ST + AFS^{t,f} | 22.38 | 27.04 | 33.43 | 23.35 | 25.05 | 26.55 | 21.87 | 14.92 |
| SacreBLEU: ST + AFS^t | 21.6 | 26.6 | 31.5 | 22.6 | 24.6 | 25.9 | 20.8 | 14.9 |
| SacreBLEU: ST + AFS^{t,f} | 22.4 | 26.9 | 31.6 | 23.0 | 24.9 | 26.3 | 21.0 | 14.7 |
| Temporal sparsity rate: ST + AFS^t | 84.4% | 84.5% | 83.2% | 84.9% | 84.4% | 84.4% | 84.7% | 84.2% |
| Temporal sparsity rate: ST + AFS^{t,f} | 85.1% | 84.5% | 84.7% | 84.9% | 83.5% | 85.1% | 84.8% | 84.7% |
| Speedup: ST + AFS^t | 1.38× | 1.35× | 1.50× | 1.34× | 1.54× | 1.43× | 1.59× | 1.31× |
| Speedup: ST + AFS^{t,f} | 1.37× | 1.34× | 1.50× | 1.39× | 1.42× | 1.26× | 1.46× | 1.37× |
AFS vs. Fixed Rate
We compare these two approaches by analyzing the number of retained features with respect to word duration and temporal position. Results in Figure 9(a) show that the underlying pattern behind the two methods is similar: words of longer duration correspond to more speech features. However, when it comes to temporal position, Figure 9(b) illustrates their difference: fixed-rate subsampling is context-independent, periodically picking up features, while AFS bases feature selection on context information. The curve of AFS is smoother, indicating that the features kept by AFS are more uniformly distributed across positions, ensuring the features' informativeness.
AFS^t vs. AFS^{t,f}
Their only difference lies in the feature gate $g^f$. We visualize this gate in Figure 10. Although the gate induces no sparsification, it gives AFS^{t,f} the capability of adjusting the weight of each neuron. In other words, AFS^{t,f} has more freedom in manipulating speech features.
Table 3: Test results on LibriSpeech En-Fr.

| | BLEU |
| --- | --- |
| ST + ASR-PT | 17.05 |
| ST + AFS^t | 18.33 |
| ST + AFS^{t,f} | 18.56 |
| SacreBLEU: ST + AFS^t | 16.9 |
| SacreBLEU: ST + AFS^{t,f} | 17.2 |
| Temporal sparsity rate: ST + AFS^t | 84.7% |
| Temporal sparsity rate: ST + AFS^{t,f} | 83.5% |
| Speedup: ST + AFS^t | 1.84× |
| Speedup: ST + AFS^{t,f} | 1.78× |
4.3 Results on MuST-C and LibriSpeech
Table 2 and Table 3 list the results on MuST-C and LibriSpeech En-Fr, respectively. Across all tasks, AFS^t/AFS^{t,f} substantially outperform ASR-PT, by 1.34/1.60 average BLEU, pruning out 84.5% of the temporal speech features on average and yielding an average decoding speedup of 1.45×. Our model narrows the gap to the cascade model to -0.8 average BLEU, and AFS surpasses the cascade on LibriSpeech En-Fr, without using knowledge distillation liu2019end or data augmentation wang2020curriculum. Comparability to previous work is limited due to possible differences in tokenization and letter case. To ease future cross-paper comparison, we provide SacreBLEU post-2018-call (signature: BLEU+c.mixed+#.1+s.exp+tok.13a+version.1.3.6) for our models.
5 Related Work
Pioneering studies on ST used a cascade of separately trained ASR and MT systems ney1999speech. Despite its simplicity, this approach inevitably suffers from error propagation: mistakes made by the ASR model degrade the quality of the downstream MT. Research in this direction often focuses on strategies capable of mitigating the mismatch between ASR output and MT input, such as representing ASR outputs with lattices saleem2004using; mathias2006statistical; zhang-etal-2019-lattice; beck-etal-2019-neural, injecting synthetic ASR errors for robust MT tsvetkov-etal-2014-augmenting; cheng-etal-2018-towards and differentiable cascade modeling kano2017structured; anastasopoulos-chiang-2018-tied; sperber-etal-2019-attention.
In contrast to cascading, another option is to perform direct speech-to-text translation. duong-etal-2016-attentional and berard2016listen employ the attentional encoder-decoder model DBLP:journals/corr/BahdanauCB14 for E2E ST without accessing any intermediate transcriptions. E2E ST opens the way to bridging the modality gap directly, but it is data-hungry, sample-inefficient and often underperforms cascade models especially in low-resource settings bansal2018low. This led researchers to explore solutions ranging from efficient neural architecture design karita2019transformerasr; di2019adapting; sung2019towards to extra training signal incorporation, including multi-task learning weiss2017sequence; liu2019synchronous, submodule pretraining bansal-etal-2019-pre; stoian2020analyzing; wang2020curriculum, knowledge distillation liu2019end, meta-learning indurthi2019data and data augmentation kocabiyikoglu-etal-2018-augmenting; jia2019leveraging; pino2019harnessing. Our work focuses on E2E ST, but we investigate feature selection which has rarely been studied before.
Speech Feature Selection
Encoding speech signals is challenging, as the acoustic input is lengthy, noisy and redundant. To ease model learning, previous work often selected features via downsampling techniques, such as convolutional modeling di2019adapting and fixed-rate subsampling lu2015study. Recently, Zhang2019trainable and na2019adaptive proposed dynamic subsampling for ASR, which learns to skip uninformative features during recurrent encoding. Unfortunately, their methods are deeply embedded into recurrent networks and hard to adapt to other architectures like the Transformer NIPS2017_7181_attention. salesky-etal-2019-exploring explored phoneme-level representations for E2E ST, which reduce speech features temporally by 80% and obtain significant performance improvements, but this requires non-trivial phoneme recognition and alignment.
Instead, we resort to sparsification techniques, which have recently achieved great success in NLP correia-etal-2019-adaptively; child2019generating; zhang2020sparsifying. In particular, we employ L0Drop zhang2020sparsifying for AFS to dynamically retain informative speech features; it is fully differentiable and independent of the concrete encoder/decoder architectures. We extend L0Drop by handling both temporal and feature dimensions with different gating networks, and apply it to E2E ST.
6 Conclusion and Future Work
In this paper, we propose adaptive feature selection for E2E ST to handle redundant and noisy speech signals. We insert AFS between a pretrained, frozen ASR encoder and the ST encoder to filter out uninformative features that contribute little to ASR. We base AFS on L0Drop zhang2020sparsifying, and extend it to model both temporal and feature dimensions. Results show that AFS improves translation quality and accelerates decoding by ~1.4×, with an average temporal sparsity rate of ~84%. AFS successfully narrows, or even closes, the performance gap compared to cascading models.
While most previous work on sparsity in NLP demonstrates its benefits from efficiency and/or interpretability perspectives zhang2020sparsifying, we show that sparsification in our scenario – E2E ST – leads to substantial performance gains.
In the future, we will work on adapting AFS to simultaneous speech translation.
Acknowledgments

We would like to thank Shucong Zhang for his great support in building our ASR baselines. IT acknowledges support from the European Research Council (ERC Starting Grant 678254) and the Dutch National Science Foundation (NWO VIDI 639.022.518). This work has received funding from the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No 825460 (ELITR). Rico Sennrich acknowledges support from the Swiss National Science Foundation (MUTAMUR; no. 176727).