Adaptive Feature Selection for End-to-End Speech Translation

10/16/2020 ∙ by Biao Zhang, et al. ∙ Universität Zürich

Information in speech signals is not evenly distributed, making it an additional challenge for end-to-end (E2E) speech translation (ST) to learn to focus on informative features. In this paper, we propose adaptive feature selection (AFS) for encoder-decoder based E2E ST. We first pre-train an ASR encoder and apply AFS to dynamically estimate the importance of each encoded speech feature to ASR. An ST encoder, stacked on top of the ASR encoder, then receives the filtered features from the (frozen) ASR encoder. We take L0Drop (Zhang et al., 2020) as the backbone for AFS, and adapt it to sparsify speech features with respect to both temporal and feature dimensions. Results on LibriSpeech En-Fr and MuST-C benchmarks show that AFS facilitates learning of ST by pruning out about 84% of temporal features, yielding an average translation gain of about 1.3-1.6 BLEU and a decoding speedup of about 1.4x. In particular, AFS reduces the performance gap compared to the cascade baseline, and outperforms it on LibriSpeech En-Fr with a BLEU score of 18.56 (without data augmentation).




1 Introduction

End-to-end (E2E) speech translation (ST), a paradigm that directly maps audio to a foreign text, has been gaining popularity recently duong-etal-2016-attentional; berard2016listen; bansal2018low; di2019adapting; wang2019bridging. Based on the attentional encoder-decoder framework DBLP:journals/corr/BahdanauCB14, it optimizes model parameters under direct translation supervision. This end-to-end paradigm avoids the problem of error propagation that is inherent in cascade models where an automatic speech recognition (ASR) model and a machine translation (MT) model are chained together. Nonetheless, previous work still reports that E2E ST delivers inferior performance compared to cascade methods.


Figure 1: Example illustrating our motivation. We plot the amplitude and frequency spectrum of an audio segment (top), paired with its time-aligned words and phonemes (bottom). Information inside an audio stream is not uniformly distributed. We propose to dynamically capture speech features corresponding to informative signals (red rectangles) to improve ST.

We study one reason for the difficulty of training E2E ST models, namely the uneven spread of information in the speech signal, as visualized in Figure 1, and the consequent difficulty of extracting informative features. Features corresponding to uninformative signals, such as pauses or noise, increase the input length and bring in unmanageable noise for ST. This increases the difficulty of learning Zhang2019trainable; na2019adaptive and reduces translation performance.

In this paper, we propose adaptive feature selection (AFS) for ST to explicitly eliminate uninformative features. Figure 2 shows the overall architecture. We employ a pretrained ASR encoder to induce contextual speech features, followed by an ST encoder bridging the gap between speech and translation modalities. AFS is inserted in-between them to select a subset of features for ST encoding (see red rectangles in Figure 1). To ensure that the selected features are well-aligned to transcriptions, we pretrain AFS on ASR. AFS estimates the informativeness of each feature through a parameterized gate, and encourages the dropping of features (pushing the gate to 0) that contribute little to ASR. An underlying assumption is that features irrelevant for ASR are also unimportant for ST.

Figure 2: Overview of our E2E ST model. AFS is inserted between the ST encoder (blue) and a pretrained ASR encoder (gray) to filter speech features for translation. We pretrain AFS jointly with ASR and freeze it during ST training.

We base AFS on L0Drop zhang2020sparsifying, a sparsity-inducing method for encoder-decoder models, and extend it to sparsify speech features. The acoustic input of speech signals involves two dimensions: temporal and feature, where the latter describes the spectrum extracted from time frames. Accordingly, we adapt L0Drop to sparsify encoder states along the temporal and feature dimensions, using a different gating network for each. In contrast to zhang2020sparsifying, who focus on efficiency and report a trade-off between sparsity and quality for MT and summarization, we find that sparsity also improves translation quality for ST.

We conduct extensive experiments with Transformer NIPS2017_7181_attention on LibriSpeech En-Fr and MuST-C speech translation tasks, covering 8 different language pairs. Results show that AFS only retains about 16% of temporal speech features, revealing heavy redundancy in speech encodings and yielding a decoding speedup of about 1.4×. AFS eases model convergence, and improves the translation quality by 1.3–1.6 BLEU, surpassing several strong baselines. Specifically, without data augmentation, AFS narrows the performance gap against the cascade approach, and outperforms it on LibriSpeech En-Fr by 0.29 BLEU, reaching 18.56. We compare against fixed-rate feature selection and a simple CNN, confirming that our adaptive feature selection offers better translation quality.

Our work demonstrates that E2E ST suffers from redundant speech features, with sparsification bringing significant performance improvements. The E2E ST task offers new opportunities for follow-up research in sparse models to deliver performance gains, apart from enhancing efficiency and/or interpretability.

2 Background: L0Drop

L0Drop provides a selective mechanism for encoder-decoder models which encourages removing uninformative encoder outputs via a sparsity-inducing objective zhang2020sparsifying. Given a source sequence $X = \{x_1, x_2, \ldots, x_n\}$, L0Drop assigns each encoded source state $x_i \in \mathbb{R}^{d}$ with a scalar gate $g_i \in [0, 1]$ as follows:

$$\text{L0Drop}(x_i) = g_i x_i, \tag{1}$$

$$g_i \sim \text{HardConcrete}(\alpha_i, \beta, \epsilon), \tag{2}$$

where $\alpha_i, \beta, \epsilon$ are hyperparameters of the hard concrete distribution (HardConcrete) louizos2017learning.

Note that the hyperparameter $\alpha_i$ is crucial to HardConcrete as it directly governs its shape. We associate $\alpha_i$ with $x_i$ through a gating network:

$$\log \alpha_i = x_i^{T} \cdot w. \tag{3}$$

Thus, L0Drop can schedule HardConcrete via $\alpha_i$ to put more probability mass at either $0$ (i.e. $\alpha_i \to 0$) or $1$ (i.e. $\alpha_i \to \infty$). $w \in \mathbb{R}^{d}$ is a trainable parameter. Intuitively, L0Drop controls the openness of gate $g_i$ via $\alpha_i$ so as to determine whether to remove ($g_i = 0$) or retain ($g_i = 1$) the state $x_i$.

L0Drop enforces sparsity by pushing the probability mass of HardConcrete towards $0$, according to the following penalty term:

$$\mathcal{L}_0(X) = \sum_{i=1}^{n} \left(1 - p(g_i = 0 \mid \alpha_i, \beta, \epsilon)\right). \tag{4}$$

By sampling $g_i$ with reparameterization kingma2013auto, L0Drop is fully differentiable and optimized with an upper bound on the objective: $\mathcal{L}_{\text{MLE}} + \lambda \mathcal{L}_0(X)$, where $\lambda$ is a hyperparameter affecting the degree of sparsity (a larger $\lambda$ enforces more gates near 0) and $\mathcal{L}_{\text{MLE}}$ denotes the maximum likelihood loss. An estimation of the expected value of $g_i$ is used during inference. zhang2020sparsifying applied L0Drop to prune encoder outputs for MT and summarization tasks; we adapt it to E2E ST. Sparse stochastic gates and relaxations were also used by bastings-etal-2019-interpretable to construct interpretable classifiers, i.e. models that can reveal which tokens they rely on when making a prediction.
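The gating mechanism described in this section can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: it assumes the standard hard concrete construction of Louizos et al. with the stretch interval written as $(-\epsilon, 1 + \epsilon)$, and the default values `beta=2/3` and `eps=0.1`, the function names and the toy shapes are all placeholders chosen for the example.

```python
import numpy as np

def sample_hard_concrete(log_alpha, beta=2.0 / 3.0, eps=0.1, rng=None):
    """Draw gates in [0, 1] via the reparameterization trick (assumed form)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(log_alpha))
    # Binary-concrete sample in (0, 1), shaped by log_alpha and temperature beta.
    s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1.0 - u) + log_alpha) / beta))
    # Stretch to (-eps, 1 + eps), then clip so that exact 0 and 1 get probability mass.
    return np.clip(s * (1.0 + 2.0 * eps) - eps, 0.0, 1.0)

def l0_penalty(log_alpha, beta=2.0 / 3.0, eps=0.1):
    """Sum over gates of 1 - p(g = 0), i.e. the expected number of open gates."""
    shift = beta * np.log(eps / (1.0 + eps))
    p_nonzero = 1.0 / (1.0 + np.exp(-(log_alpha - shift)))
    return float(np.sum(p_nonzero))

# Toy example: n = 5 encoder states of dimension d = 4 (random placeholders).
rng = np.random.default_rng(0)
states = rng.normal(size=(5, 4))
w = rng.normal(size=4)              # parameter of the linear gating network
log_alpha = states @ w              # one log-alpha per encoder state
gates = sample_hard_concrete(log_alpha, rng=rng)
gated_states = gates[:, None] * states
penalty = l0_penalty(log_alpha)
```

During training, `penalty` would be scaled by the sparsity weight and added to the likelihood loss; a larger weight pushes more `log_alpha` values down and thus closes more gates.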

3 Adaptive Feature Selection

One difficulty with applying encoder-decoder models to E2E ST is deciding how to encode speech signals. In contrast to text where word boundaries can be easily identified, the spectrum features of speech are continuous, varying remarkably across different speakers for the same transcript. In addition, redundant information, like pauses in-between neighbouring words, can be of arbitrary duration at any position as shown in Figure 1, while contributing little to translation. This increases the burden and occupies the capacity of ST encoder, leading to inferior performance duong-etal-2016-attentional; berard2016listen. Rather than developing complex encoder architectures, we resort to feature selection to explicitly clear out those uninformative speech features.

Figure 2 gives an overview of our model. We use a pretrained and frozen ASR encoder to extract contextual speech features, and collect the informative ones from them via AFS before transmission to the ST encoder. AFS drops pauses, noise and other uninformative features and retains features that are relevant for ASR. We speculate that these retained features are also the most relevant for ST, and that the sparser representation simplifies the learning problem for ST, for example the learning of attention strength between encoder states and target language (sub)words. Given a training tuple (audio, source transcription, translation), denoted as $(X, Y, Z)$ respectively (note that our model only requires pair-wise training corpora, $(X, Y)$ for ASR and $(X, Z)$ for ST), we outline the overall framework below, including three steps:

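E2E ST with AFS

  1. Train ASR model with the following objective and model architecture until convergence:

     $$\mathcal{L}^{\text{ASR}} = \eta \mathcal{L}_{\text{MLE}}(Y|X) + \gamma \mathcal{L}_{\text{CTC}}(Y|X), \tag{5}$$

     $$M^{\text{ASR}} = D^{\text{ASR}}\left(Y, E^{\text{ASR}}(X)\right). \tag{6}$$

  2. Finetune ASR model with AFS for $m$ steps:

     $$\mathcal{L}^{\text{AFS}} = \mathcal{L}_{\text{MLE}}(Y|X) + \lambda \mathcal{L}_0(X), \tag{7}$$

     $$M^{\text{AFS}} = D^{\text{ASR}}\left(Y, F\left(E^{\text{ASR}}(X)\right)\right). \tag{8}$$

  3. Train ST model with pretrained and frozen ASR and AFS submodules until convergence:

     $$\mathcal{L}^{\text{ST}} = \mathcal{L}_{\text{MLE}}(Z|X), \tag{9}$$

     $$M^{\text{ST}} = D^{\text{ST}}\left(Z, E^{\text{ST}}\left(\overline{FE}^{\text{ASR}}(X)\right)\right). \tag{10}$$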

We handle both ASR and ST as sequence-to-sequence problems with encoder-decoder models. We use $E^{*}(\cdot)$ and $D^{*}(\cdot, \cdot)$ to denote the corresponding encoder and decoder respectively. $F(\cdot)$ denotes the AFS approach, and $\overline{FE}$ means freezing the ASR encoder and the AFS module during training. Note that our framework puts no constraint on the architecture of the encoder and decoder in any task, although we adopt the multi-head dot-product attention network NIPS2017_7181_attention for our experiments.

ASR Pretraining

The ASR model $M^{\text{ASR}}$ (Eq. 6) directly maps an audio input to its transcription. To improve speech encoding, we apply a logarithmic penalty on attention to enforce short-range dependency di2019adapting and use trainable positional embeddings with a maximum length of 2048. Apart from $\mathcal{L}_{\text{MLE}}$, we augment the training objective with the connectionist temporal classification (Graves06connectionisttemporal, CTC) loss $\mathcal{L}_{\text{CTC}}$ as in Eq. 5. Note $\eta = 1 - \gamma$. The CTC loss is applied to the encoder outputs, guiding them to align with their corresponding transcription (sub)words and improving the encoder’s robustness karita2019transformerasr. We set $\gamma$ following previous work karita2019transformerasr; wang2020curriculum.

AFS Finetuning

This stage aims at using AFS to dynamically pick out the subset of ASR encoder outputs that are most relevant for ASR performance (see red rectangles in Figure 1). We follow zhang2020sparsifying and place AFS in-between the ASR encoder and decoder during finetuning (see $F(\cdot)$ in $M^{\text{AFS}}$, Eq. 8). We exclude the CTC loss in the training objective (Eq. 7) to relax the alignment constraint and increase the flexibility of feature adaptation. We use L0Drop for AFS in two ways.

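AFS$^{t}$ The direct application of L0Drop on the ASR encoder results in AFS$^{t}$, sparsifying encodings along the temporal dimension $\{x_i\}_{i=1}^{n}$:

$$F^{t}(x_i) = \text{AFS}^{t}(x_i) = g^{t}_i x_i, \quad \text{with } \log \alpha^{t}_i = x_i^{T} \cdot w^{t}, \quad g^{t}_i \sim \text{HardConcrete}(\alpha^{t}_i, \beta, \epsilon), \tag{11}$$

where $\alpha^{t}_i$ is a positive scalar powered by a simple linear gating layer, and $w^{t} \in \mathbb{R}^{d}$ is a trainable parameter of dimension $d$. $g^{t}$ is the temporal gate. The sparsity penalty of AFS$^{t}$ follows Eq. 4:

$$\mathcal{L}^{t}_0(X) = \sum_{i=1}^{n} \left(1 - p(g^{t}_i = 0 \mid \alpha^{t}_i, \beta, \epsilon)\right). \tag{12}$$

AFS$^{t,f}$ In contrast to text processing, speech processing often extracts the spectrum from overlapping time frames to form the acoustic input, similar to the word embedding. As each encoded speech feature contains temporal information, it is reasonable to extend AFS$^{t}$ to AFS$^{t,f}$, including sparsification along the feature dimension $\{x_{i,j}\}_{j=1}^{d}$:

$$F^{t,f}(x_i) = \text{AFS}^{t,f}(x_i) = g^{t}_i x_i \odot g^{f}, \quad \text{with } \log \alpha^{f} = w^{f}, \quad g^{f}_j \sim \text{HardConcrete}(\alpha^{f}_j, \beta, \epsilon), \tag{13}$$

where $\alpha^{f} \in \mathbb{R}^{d}$ estimates the weights of each feature, dominated by an input-independent gating model with trainable parameter $w^{f} \in \mathbb{R}^{d}$ (other candidate gating models, like linear mapping upon mean-pooled encoder outputs, delivered worse performance in our preliminary experiments). $g^{f}$ is the feature gate. Note that $g^{f}$ is shared for all time steps. $\odot$ denotes element-wise multiplication. AFS$^{t,f}$ reuses $g^{t}$-relevant submodules in Eq. 11, and extends the sparsity penalty in Eq. 12 as follows:

$$\mathcal{L}^{t,f}_0(X) = \mathcal{L}^{t}_0(X) + \sum_{j=1}^{d} \left(1 - p(g^{f}_j = 0 \mid \alpha^{f}_j, \beta, \epsilon)\right). \tag{14}$$

We perform the finetuning by replacing ($F$, $\mathcal{L}_0$) in Eq. (7-8) with either AFS$^{t}$ ($F^{t}$, $\mathcal{L}^{t}_0$) or AFS$^{t,f}$ ($F^{t,f}$, $\mathcal{L}^{t,f}_0$) for $m$ extra steps. We compare these two variants in our experiments.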
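To make the two variants concrete, the sketch below applies a temporal gate per encoder state and, optionally, a feature gate shared across time steps, then keeps only the states whose temporal gate is non-zero. It is an illustration under stated assumptions, not the authors' code: the deterministic gate estimate is the clipped, stretched sigmoid commonly used with hard concrete gates at test time, and `afs_select`, `w_t`, `w_f` and `eps=0.1` are hypothetical names and values.

```python
import numpy as np

def expected_gate(log_alpha, eps=0.1):
    """Deterministic gate estimate: stretched sigmoid clipped to [0, 1] (assumed form)."""
    s = 1.0 / (1.0 + np.exp(-np.asarray(log_alpha, dtype=float)))
    return np.clip(s * (1.0 + 2.0 * eps) - eps, 0.0, 1.0)

def afs_select(states, w_t, w_f=None, eps=0.1):
    """Gate encoder states of shape (n, d) and drop those with a closed temporal gate.

    w_t: (d,) weights of the linear temporal gating layer.
    w_f: optional (d,) input-independent log-alpha of the feature gate.
    """
    g_t = expected_gate(states @ w_t, eps)            # one gate per time step
    gated = states * g_t[:, None]
    if w_f is not None:
        gated = gated * expected_gate(w_f, eps)[None, :]  # same feature gate at every step
    keep = g_t > 0.0
    temporal_sparsity = 1.0 - keep.mean()
    return gated[keep], keep, temporal_sparsity

# Toy input: three states whose first coordinate drives the temporal gate.
states = np.array([[5.0, 0.0], [-5.0, 0.0], [0.0, 0.0]])
w_t = np.array([1.0, 0.0])
selected, keep, sparsity = afs_select(states, w_t)
```

In this toy case the second state receives a gate of exactly 0 and is removed, so one third of the temporal positions are pruned.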

E2E ST Training

We treat the pretrained ASR and AFS model as a speech feature extractor, and freeze them during ST training. We gather the speech features emitted by the ASR encoder that correspond to $g^{t}_i > 0$, and pass them to the ST encoder in the same way as word embeddings. We employ sinusoidal positional encoding to distinguish features at different positions. Except for the input to the ST encoder, our E2E ST follows the standard encoder-decoder translation model ($M^{\text{ST}}$ in Eq. 10) and is optimized with $\mathcal{L}_{\text{MLE}}$ alone as in Eq. 9. Intuitively, AFS bridges the gap between ASR output and MT input by selecting transcript-aligned speech features.
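As a rough illustration of how selected features could be prepared for the ST encoder, the sketch below adds the standard sinusoidal positional encoding of the Transformer to a matrix of retained features. It assumes, without confirmation from the text, that positions are counted over the retained features rather than over the original frames; `sinusoidal_positions` and the toy shapes are placeholders.

```python
import numpy as np

def sinusoidal_positions(length, dim):
    """Standard Transformer sinusoidal encoding of shape (length, dim); dim must be even."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    pos = np.arange(length)[:, None]
    idx = np.arange(0, dim, 2)[None, :]
    angle = pos / np.power(10000.0, idx / dim)
    enc = np.zeros((length, dim))
    enc[:, 0::2] = np.sin(angle)   # even channels
    enc[:, 1::2] = np.cos(angle)   # odd channels
    return enc

# Placeholder for features kept by AFS: 3 retained states of dimension 8.
selected = np.ones((3, 8))
st_encoder_input = selected + sinusoidal_positions(*selected.shape)
```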

4 Experiments

Datasets and Preprocessing

We experiment with two benchmarks: the Augmented LibriSpeech dataset (LibriSpeech En-Fr) kocabiyikoglu-etal-2018-augmenting and the multilingual MuST-C dataset (MuST-C) di-gangi-etal-2019-must. LibriSpeech En-Fr is collected by aligning e-books in French with English utterances of LibriSpeech, further augmented with French translations offered by Google Translate. We use the 100 hours clean training set for training, including 47K utterances to train ASR models and double the size for ST models after concatenation with the Google translations. We report results on the test set (2048 utterances) using models selected on the dev set (1071 utterances). MuST-C is built from English TED talks, covering 8 translation directions: English to German (De), Spanish (Es), French (Fr), Italian (It), Dutch (Nl), Portuguese (Pt), Romanian (Ro) and Russian (Ru). We train ASR and ST models on the given training set, containing 452 hours with 252K utterances on average for each translation pair. We adopt the given dev set for model selection and report results on the common test set, whose size ranges from 2502 (Es) to 2641 (De) utterances.

For all datasets, we extract 40-dimensional log-Mel filterbanks with a step size of 10ms and window size of 25ms as the acoustic features. We expand these features with their first and second-order derivatives, and stabilize them using mean subtraction and variance normalization. We stack the features corresponding to three consecutive frames without overlapping to the left, resulting in the final 360-dimensional acoustic input. For transcriptions and translations, we tokenize and truecase all the text using Moses scripts koehn-etal-2007-moses. We train subword models sennrich-etal-2016-neural on each dataset with a joint vocabulary size of 16K to handle rare words, and share the model for ASR, MT and ST. We train all models without removing punctuation.
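The dimensionality of the acoustic input follows from simple arithmetic: 40 filterbank channels plus first and second-order derivatives give 120 values per frame, and stacking three frames gives 360. The sketch below shows one way to do the non-overlapping stacking; padding the tail with zeros when the number of frames is not a multiple of three is an assumption of this example, and `stack_frames` is a hypothetical helper.

```python
import numpy as np

def stack_frames(features, k=3):
    """Concatenate every k consecutive frames (no overlap) along the feature axis."""
    n, d = features.shape
    pad = (-n) % k
    if pad:
        # Assumption: zero-pad the tail so the frame count is divisible by k.
        features = np.concatenate([features, np.zeros((pad, d), dtype=features.dtype)])
    return features.reshape(-1, k * d)

# 100 frames of 40 filterbanks + deltas + delta-deltas = 120 dimensions per frame.
frames = np.zeros((100, 120))
stacked = stack_frames(frames)   # 34 stacked frames of 360 dimensions each
```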

Model Settings and Baselines

We adopt the Transformer architecture NIPS2017_7181_attention for all tasks, including $M^{\text{ASR}}$ (Eq. 6), $M^{\text{AFS}}$ (Eq. 8) and $M^{\text{ST}}$ (Eq. 10). The encoder and decoder consist of 6 identical layers, each including a self-attention sublayer, a cross-attention sublayer (decoder alone) and a feedforward sublayer. We employ the base setting for experiments: hidden size $d = 512$, 8 attention heads and feedforward size 2048. We schedule the learning rate via Adam kingma2014adam, paired with a warmup step of 4K. We apply dropout to attention weights and residual connections with a rate of 0.1 and 0.2 respectively, and also add label smoothing of 0.1 to handle overfitting. We train all models for a maximum of 30K steps with a minibatch size of around 25K target subwords. We average the last 5 checkpoints for evaluation. We use beam search for decoding, and set the beam size and length penalty to 4 and 0.6, respectively. We set the HardConcrete hyperparameters $\beta$ and $\epsilon$ for AFS following louizos2017learning, and finetune AFS for an additional $m$ steps. We evaluate translation quality with tokenized case-sensitive BLEU papineni-etal-2002-bleu, and report WER for ASR performance without punctuation.
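Checkpoint averaging, as used for evaluation above, amounts to taking the parameter-wise mean over the saved models. The snippet is a minimal sketch that represents each checkpoint as a plain dictionary of NumPy arrays; real toolkits store checkpoints in their own formats, and `average_checkpoints` is a hypothetical helper rather than any toolkit's API.

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Return the parameter-wise mean of a list of {name: array} checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0) for name in names}

# Toy example: five "checkpoints" of a model with a single 2x2 weight matrix.
last_five = [{"w": np.full((2, 2), float(step))} for step in range(1, 6)]
averaged = average_checkpoints(last_five)
```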

We compare our models with four baselines:

ST:

A vanilla Transformer-based E2E ST model of 6 encoder and decoder layers. Logarithmic attention penalty di2019adapting is used to improve the encoder.

ST + ASR-PT:

We perform the ASR pretraining (ASR-PT) for E2E ST. This is the same model as ours (Figure 2) but without AFS finetuning.

Cascade:

We first transcribe the speech input using an ASR model, and then pass the results on to an MT model. We also use the logarithmic attention penalty di2019adapting for the ASR encoder.

ST + Fixed Rate:

Instead of dynamically selecting features, we replace AFS with subsampling at a fixed rate: we extract the speech encodings after every $N$ positions.

Besides, we offer another baseline, ST + CNN, for comparison on MuST-C En-De: we replace the fixed-rate subsampling with a one-layer 1D depth-separable convolution, where the output dimension is set to 512, the kernel size over the temporal dimension is set to 5 and the stride is set to 6. In this way, the ASR encoder features will be compressed to around 1/6 of their original number, a similar ratio to the fixed-rate subsampling.
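The compression ratios of these two baselines can be checked with a little arithmetic. The sketch below is illustrative only: it assumes that fixed-rate subsampling keeps one encoding out of every `n` positions (starting from the first), and that the convolution uses no padding, so its output length is `floor((T - kernel) / stride) + 1`; both helper names are hypothetical.

```python
def subsample(states, n):
    """Keep one encoding out of every n positions (assumed to start at position 0)."""
    return states[::n]

def fixed_rate_sparsity(n):
    """Fraction of positions dropped when keeping one in every n."""
    return (n - 1) / n

def conv_output_length(length, kernel=5, stride=6):
    """Output length of a 1D convolution without padding (an assumption here)."""
    return (length - kernel) // stride + 1

kept = subsample(list(range(12)), 6)          # positions 0 and 6 survive
rate_6 = round(100 * fixed_rate_sparsity(6), 1)
rate_7 = round(100 * fixed_rate_sparsity(7), 1)
cnn_len = conv_output_length(600)             # about one sixth of 600 inputs
```

With these definitions, keeping one in six or one in seven positions corresponds to sparsity rates of 83.3% and 85.7%, matching the rates listed for the fixed-rate rows of Table 1.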

4.1 Results on MuST-C En-De

(a) Feature Gate Value
(b) Temporal Sparsity Rate
Figure 3: Feature gate value and temporal sparsity rate as a function of $\lambda$ on MuST-C En-De dev set. Larger $\lambda$ decreases the gate value of $g^{f}$ but without dropping any neurons, i.e. feature sparsity rate 0%. By contrast, speech features are of high redundancy along the temporal dimension, easily inducing a high temporal sparsity rate.

We perform a thorough study on MuST-C En-De. With AFS, the first question is its feasibility. We start by analyzing the degree of sparsity in speech features (i.e. sparsity rate) yielded by AFS, focusing on the temporal sparsity rate (the fraction of temporal gates $g^{t}_i$ equal to 0) and the feature sparsity rate (the fraction of feature gates $g^{f}_j$ equal to 0). To obtain different rates, we vary the hyperparameter $\lambda$ in Eq. 7 over a range of values with a step size of 0.1.

Results in Figure 3 show that large amounts of encoded speech features can be easily pruned out, revealing heavy inner-speech redundancy. Both AFS$^{t}$ and AFS$^{t,f}$ drop 60% of temporal features with $\lambda$ of 0.1, and this number increases further for larger $\lambda$ (Figure 3(b)), remarkably surpassing the sparsity rate reported by zhang2020sparsifying on text summarization. In contrast to rich temporal sparsification, we get a feature sparsity rate of 0, regardless of $\lambda$’s value, although increasing $\lambda$ decreases $g^{f}$ (Figure 3(a)). This suggests that selecting neurons from the feature dimension is harder. Rather than filtering neurons, the feature gate $g^{f}$ acts more like a weighting mechanism on them. In the rest of the paper, we use sparsity rate for the temporal sparsity rate.

(a) ASR
(b) ST
Figure 4: ASR (WER) and ST (BLEU) performance as a function of temporal sparsity rate on MuST-C En-De dev set. Pruning out 85% temporal speech features largely improves translation quality and retains 95% ASR accuracy.

We continue to explore the impact of varied sparsity rates on the ASR and ST performance. Figure 4 shows their correlation. We observe that AFS slightly degrades ASR accuracy (Figure 4(a)), but still retains 95% accuracy on average; AFS$^{t,f}$ often performs better than AFS$^{t}$ at a similar sparsity rate. The fact that only about 15% of the speech features successfully support 95% ASR accuracy demonstrates the informativeness of these selected features. These findings echo zhang2020sparsifying, who observe a trade-off between sparsity and quality.

However, when AFS is applied to ST, we find consistent improvements in translation quality (BLEU), shown in Figure 4(b). Translation quality on the development set peaks at 22.17 BLEU achieved by AFS with a sparsity rate of 85.5%. We set $\lambda$ to the value corresponding to a sparsity rate of about 85% for all other experiments, since AFS$^{t}$ and AFS$^{t,f}$ reach their optimal result at this point.

Model                        BLEU     Sparsity rate   Speedup
MT                           29.69    -               -
Cascade                      22.52    -               1.06
ST                           17.44    -               0.87
ST + ASR-PT                  20.67    -               1.00
ST + CNN                     20.64    -               1.31
ST + Fixed Rate ($N = 6$)    21.14    83.3%           1.42
ST + Fixed Rate ($N = 7$)    20.87    85.7%           1.43
ST + AFS$^{t}$               21.57    84.4%           1.38
ST + AFS$^{t,f}$             22.38    85.1%           1.37
Table 1: BLEU and speedup on MuST-C En-De test set. We evaluate the speedup on GeForce GTX 1080 Ti with a decoding batch size of 16, and report average results over 3 runs. The sparsity rate is listed where applicable.
Figure 5: Impact of $N$ in fixed-rate subsampling on ST performance on MuST-C En-De test set. Sparsity rate: $(N-1)/N$. This subsampling underperforms AFS, and degrades the ST performance at suboptimal rates.

We summarize the test results in Table 1, where we set $N = 6$ or $N = 7$ for ST + Fixed Rate, giving a sparsity rate of around 85% as suggested by our analysis above. Our vanilla ST model yields a BLEU score of 17.44; pretraining on ASR further enhances the performance to 20.67, significantly outperforming the results of di2019adapting by 3.37 BLEU. This also suggests the importance of speech encoder pretraining di2019adapting; stoian2020analyzing; wang2020curriculum. We treat ST with ASR-PT as our real baseline. We observe improved translation quality with fixed-rate subsampling, +0.47 BLEU at $N = 6$. Subsampling offers a chance to bypass noisy speech signals, and reducing the number of source states makes learning translation alignment easier, but deciding the optimal sampling rate is tough. Results in Figure 5 reveal that fixed-rate subsampling deteriorates ST performance with suboptimal rates. Replacing fixed-rate subsampling with our one-layer CNN also fails to improve over the baseline, although CNN offers more flexibility in feature manipulation. In contrast to fixed-rate subsampling, the proposed AFS is data-driven, shifting the decision burden to the data and model themselves. As a result, AFS$^{t}$ and AFS$^{t,f}$ surpass ASR-PT by 0.9 BLEU and 1.71 BLEU, respectively, substantially narrowing the performance gap compared to the cascade baseline (-0.14 BLEU).

We also observe improved decoding speed: AFS$^{t,f}$ runs 1.37× faster than ASR-PT. Compared to the fixed-rate subsampling, AFS is slightly slower, which we ascribe to the overhead introduced by the gating module. Surprisingly, Table 1 shows that the vanilla ST runs slower than ASR-PT (0.87×) while the cascade model is slightly faster (1.06×). By digging into the beam search algorithm, we discover that ASR pretraining reduces the average number of steps in beam decoding relative to vanilla ST. The speedup brought by cascading is due to the smaller English vocabulary size compared to the German vocabulary when processing audio inputs.

Figure 6: ST training curves (MuST-C En-De dev set). ASR pretraining significantly accelerates model convergence, and feature selection further stabilizes and improves training.

4.2 Why (Adaptive) Feature Selection?

Apart from the benefits in translation quality, we go deeper to study other potential impacts of (adaptive) feature selection. We begin with inspecting training curves. Figure 6 shows that ASR pretraining improves model convergence; feature selection makes training more stable. Compared to other models, the curve of ST with AFS is much smoother, suggesting its better regularization effect.

Figure 7: BLEU as a function of training data size on MuST-C En-De. We split the original training data into five non-overlapping subsets, and train different models with accumulated subsets. Results are reported on the test set. Note that we perform ASR pretraining on the original dataset.

We then investigate the effect of training data size, and show the results in Figure 7. Overall, we do not observe higher data efficiency from feature selection in low-resource settings. Instead, our results suggest that feature selection delivers larger performance improvements when more training data is available. With respect to data efficiency, ASR pretraining seems to be more important (Figure 7, left) bansal-etal-2019-pre; stoian2020analyzing. Compared to AFS, the fixed-rate subsampling suffers more from small-scale training: it yields worse performance than ASR-PT at small data sizes, highlighting the better generalization of AFS.

Figure 8: Histogram of the cross-attention weights received per ST encoder output on MuST-C En-De test set. For each instance, we collect attention weights averaged over different heads and decoder layers following zhang2020sparsifying. Larger weight indicates stronger impact of the encoder output on translation. Feature selection biases the distribution towards larger weights.

In addition to model performance, we also look into the ST model itself, and focus on the cross-attention weights. Figure 8 visualizes the attention value distribution, where ST models with feature selection noticeably shift the distribution towards larger weights. This suggests that each ST encoder output exerts greater influence on the translation. By removing redundant and noisy speech features, feature selection eases the learning of the ST encoder, and also enhances its connection strength with the ST decoder. This helps bridge the modality gap between speech and text translation. Although fixed-rate subsampling also delivers a distribution shift similar to AFS, its inferior ST performance compared to AFS corroborates the better quality of adaptively selected features.

(a) Duration Analysis
(b) Position Analysis
Figure 9: The number of selected features vs. word duration (left) and position (right) on MuST-C En-De test set. For word duration, we align the audio and its transcription with the Montreal Forced Aligner McAuliffe2017, and collect each word’s duration and its corresponding number of retained features. For position, we uniformly split each input into 50 pieces, and count the average number of retained features in each piece.
Metric                   Model                    De      Es      Fr      It      Nl      Pt      Ro      Ru
BLEU                     di2019adapting           17.30   20.80   26.90   16.80   18.80   20.10   16.50   10.50
BLEU                     Transformer + ASR-PT*    21.77   26.41   31.56   21.46   25.22   26.84   20.53   14.31
BLEU                     ST                       17.44   23.85   28.43   19.54   21.23   22.55   17.66   12.10
BLEU                     ST + ASR-PT              20.67   25.96   32.24   20.84   23.27   24.83   19.94   13.96
BLEU                     Cascade                  22.52   27.92   34.53   24.02   26.74   27.57   22.61   16.13
BLEU                     ST + AFS$^{t}$           21.57   26.78   33.34   23.08   24.68   26.13   21.73   15.10
BLEU                     ST + AFS$^{t,f}$         22.38   27.04   33.43   23.35   25.05   26.55   21.87   14.92
SacreBLEU                ST + AFS$^{t}$           21.6    26.6    31.5    22.6    24.6    25.9    20.8    14.9
SacreBLEU                ST + AFS$^{t,f}$         22.4    26.9    31.6    23.0    24.9    26.3    21.0    14.7
Temporal Sparsity Rate   ST + AFS$^{t}$           84.4%   84.5%   83.2%   84.9%   84.4%   84.4%   84.7%   84.2%
Temporal Sparsity Rate   ST + AFS$^{t,f}$         85.1%   84.5%   84.7%   84.9%   83.5%   85.1%   84.8%   84.7%
Speedup                  ST + AFS$^{t}$           1.38    1.35    1.50    1.34    1.54    1.43    1.59    1.31
Speedup                  ST + AFS$^{t,f}$         1.37    1.34    1.50    1.39    1.42    1.26    1.46    1.37
Table 2: Performance over 8 languages on MuST-C dataset. *: results reported by the ESPNet toolkit watanabe2018espnet, where the hyperparameters of beam search are tuned for each dataset.

AFS vs. Fixed Rate

We compare these two approaches by analyzing the number of retained features with respect to word duration and temporal position. Results in Figure 9(a) show that the underlying pattern behind these two methods is similar: words with longer duration correspond to more speech features. However, when it comes to temporal position, Figure 9(b) illustrates their difference: fixed-rate subsampling is context-independent, periodically picking up features, while AFS decides feature selection based on context information. The curve of AFS is smoother, indicating that features kept by AFS are more uniformly distributed across different positions, ensuring the features’ informativeness.
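The position analysis can be reproduced in spirit with a few lines of NumPy: split the keep/drop mask of one input into 50 near-equal pieces and count the retained features in each piece. This is a sketch of that counting step only; `retained_per_piece` is a hypothetical helper, the mask below is synthetic (a fixed-rate pattern keeping every sixth position), and averaging the counts over a test set is left out.

```python
import numpy as np

def retained_per_piece(keep_mask, pieces=50):
    """Count retained features in each of `pieces` near-equal temporal segments."""
    mask = np.asarray(keep_mask, dtype=bool)
    return np.array([chunk.sum() for chunk in np.array_split(mask, pieces)])

# Synthetic mask over 100 positions that keeps every sixth position.
mask = np.zeros(100, dtype=bool)
mask[::6] = True
counts = retained_per_piece(mask)
```

For a fixed-rate mask the counts follow a strictly periodic pattern, which is the behaviour the comparison above contrasts with the context-dependent selection of AFS.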

Figure 10: Illustration of the feature gate $g^{f}$.

AFS$^{t}$ vs. AFS$^{t,f}$

Their only difference lies in the feature gate $g^{f}$. We visualize this gate in Figure 10. Although this gate induces no sparsification, it offers AFS$^{t,f}$ the capability of adjusting the weight of each neuron. In other words, AFS$^{t,f}$ has more freedom in manipulating speech features.

Metric                   Model                 En-Fr
BLEU                     berard2018end         13.40
BLEU                     watanabe2018espnet    16.68
BLEU                     liu2019end            17.02
BLEU                     wang2019bridging      17.05
BLEU                     wang2020curriculum    17.66
BLEU                     ST                    14.32
BLEU                     ST + ASR-PT           17.05
BLEU                     Cascade               18.27
BLEU                     ST + AFS$^{t}$        18.33
BLEU                     ST + AFS$^{t,f}$      18.56
SacreBLEU                ST + AFS$^{t}$        16.9
SacreBLEU                ST + AFS$^{t,f}$      17.2
Temporal Sparsity Rate   ST + AFS$^{t}$        84.7%
Temporal Sparsity Rate   ST + AFS$^{t,f}$      83.5%
Speedup                  ST + AFS$^{t}$        1.84
Speedup                  ST + AFS$^{t,f}$      1.78
Table 3: Performance on LibriSpeech En-Fr.

4.3 Results on MuST-C and LibriSpeech

Table 2 and Table 3 list the results on MuST-C and LibriSpeech En-Fr, respectively. Over all tasks, AFS$^{t}$/AFS$^{t,f}$ substantially outperforms ASR-PT by 1.34/1.60 average BLEU, pruning out 84.5% of temporal speech features on average and yielding an average decoding speedup of 1.45×. Our model narrows the gap against the cascade model to -0.8 average BLEU, and AFS surpasses Cascade on LibriSpeech En-Fr without using knowledge distillation (KD) liu2019end or data augmentation wang2020curriculum. Comparability to previous work is limited due to possible differences in tokenization and letter case. To ease future cross-paper comparison, we provide SacreBLEU post-2018-call (signature: BLEU+c.mixed+#.1+s.exp+tok.13a+version.1.3.6) for our models.

5 Related Work

Speech Translation

Pioneering studies on ST used a cascade of separately trained ASR and MT systems ney1999speech. Despite its simplicity, this approach inevitably suffers from mistakes made by ASR models, and is error prone. Research in this direction often focuses on strategies capable of mitigating the mismatch between ASR output and MT input, such as representing ASR outputs with lattices saleem2004using; mathias2006statistical; zhang-etal-2019-lattice; beck-etal-2019-neural, injecting synthetic ASR errors for robust MT tsvetkov-etal-2014-augmenting; cheng-etal-2018-towards and differentiable cascade modeling kano2017structured; anastasopoulos-chiang-2018-tied; sperber-etal-2019-attention.

In contrast to cascading, another option is to perform direct speech-to-text translation. duong-etal-2016-attentional and berard2016listen employ the attentional encoder-decoder model DBLP:journals/corr/BahdanauCB14 for E2E ST without accessing any intermediate transcriptions. E2E ST opens the way to bridging the modality gap directly, but it is data-hungry, sample-inefficient and often underperforms cascade models especially in low-resource settings bansal2018low. This led researchers to explore solutions ranging from efficient neural architecture design karita2019transformerasr; di2019adapting; sung2019towards to extra training signal incorporation, including multi-task learning weiss2017sequence; liu2019synchronous, submodule pretraining bansal-etal-2019-pre; stoian2020analyzing; wang2020curriculum, knowledge distillation liu2019end, meta-learning indurthi2019data and data augmentation kocabiyikoglu-etal-2018-augmenting; jia2019leveraging; pino2019harnessing. Our work focuses on E2E ST, but we investigate feature selection which has rarely been studied before.

Speech Feature Selection

Encoding speech signals is challenging as acoustic input is lengthy, noisy and redundant. To ease model learning, previous work often selected features via downsampling techniques, such as convolutional modeling di2019adapting and fixed-rate subsampling lu2015study. Recently, Zhang2019trainable and na2019adaptive proposed dynamic subsampling for ASR which learns to skip uninformative features during recurrent encoding. Unfortunately, their methods are deeply embedded into recurrent networks and are hard to adapt to other architectures like Transformer NIPS2017_7181_attention. Recently, salesky-etal-2019-exploring have explored phoneme-level representations for E2E ST, which reduce speech features temporally by 80% and obtain significant performance improvement, but this requires non-trivial phoneme recognition and alignment.

Instead, we resort to sparsification techniques which have achieved great success in NLP tasks recently correia-etal-2019-adaptively; child2019generating; zhang2020sparsifying. In particular, we employ L0Drop zhang2020sparsifying for AFS to dynamically retain informative speech features, which is fully differentiable and independent of concrete encoder/decoder architectures. We extend L0Drop by handling both temporal and feature dimensions with different gating networks, and apply it to E2E ST.

6 Conclusion and Future Work

In this paper, we propose adaptive feature selection for E2E ST to handle redundant and noisy speech signals. We insert AFS in-between the ST encoder and a pretrained, frozen ASR encoder to filter out uninformative features contributing little to ASR. We base AFS on L0Drop zhang2020sparsifying, and extend it to modeling both temporal and feature dimensions. Results show that AFS improves translation quality and accelerates decoding by about 1.4× with an average temporal sparsity rate of about 84%. AFS successfully narrows or even closes the performance gap compared to cascading models.

While most previous work on sparsity in NLP demonstrates its benefits from efficiency and/or interpretability perspectives zhang2020sparsifying, we show that sparsification in our scenario – E2E ST – leads to substantial performance gains.

In the future, we will work on adapting AFS to simultaneous speech translation.


We would like to thank Shucong Zhang for his great support on building our ASR baselines. IT acknowledges support of the European Research Council (ERC Starting grant 678254) and the Dutch National Science Foundation (NWO VIDI 639.022.518). This work has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No 825460 (ELITR). Rico Sennrich acknowledges support of the Swiss National Science Foundation (MUTAMUR; no. 176727).