It is now evident that neural sequence-to-sequence models [sutskever2014sequence] are capable of directly transcribing or translating speech in an end-to-end approach. A single neural model that directly maps speech inputs to text outputs eliminates the individual components of non-end-to-end or cascaded approaches, while yielding competitive performance [sperber2019attention, nguyen2019improving]. The hybrid approach for speech recognition and the cascaded approach for speech translation may still give the best accuracy in many conditions, but as neural architectures continue to develop, the gap is closing [niehues_j_2019_3525578].
The Transformer [vaswani2017attention] is a popular architecture choice which has achieved state-of-the-art performance for many sequence learning tasks, particularly machine translation [ng-etal-2019-facebook]. When applied to speech recognition and direct speech translation, this architecture also stands out as the highest performing option for several datasets [sperber2018self, pham2019transformer, di2019adapting].
The disadvantage of the Transformer is that its core function, self-attention, has no inherent mechanism to model sequential positions. The original work [vaswani2017attention] added position information to the word embeddings via a trigonometric position encoding. Specifically, each element in the sequence is assigned an absolute position with a corresponding encoding (a vector similar to the embeddings of discrete variables, but not updated during training). A recent adaptation to speech recognition [pham2019transformer] showed that the base model, extended in depth, is already sufficient for competitive performance compared to other architectures. (This is the closest speech adaptation that does not change or introduce additional layers, e.g. LSTM [hochreiter1997long] or TDNN [waibel1989tdnn].)
However, this absolute position scheme is far from ideal for acoustic modeling. First, text sequences may have a stricter correlation with position; for example, in English the “Five Ws” words often appear at the beginning of sentences, whereas the absolute positions of phones in speech signals and utterances may vary much more. Second, speech sequences are often much longer (in terms of frames) than their transcript character sequences, which can be exacerbated by surrounding noise or silence. Figure 1 illustrates a case in which the speech (at frame 500) lies between stretches of applause, which changes the absolute positions but should not affect the resulting transcript. Ideally, we want positional information to be time-shift invariant.
Recently, relative positional encoding has become popular as a consistent improvement to self-attention. Originally proposed by [shaw-etal-2018-self] to replace absolute positions with the relative positions between states in self-attention, this method was later adapted to language modeling [dai-etal-2019-transformer], where it allows models to capture very long dependencies across paragraphs.
In this work, we bring the advantages of relative position encoding to the Deep Transformer [pham2019transformer] for both speech recognition (ASR) and direct speech translation (ST). The resulting novel model maintains the trigonometric position encodings to better scale with longer speech sequences, and is able to model bidirectional positions as well. On speech recognition, we show that this model consistently improves the Transformer on the standard English Switchboard and Fisher benchmarks (on both 300h and 2000h conditions), and, to the best of our knowledge, is the best published end-to-end model without augmentation on these datasets. More impressively, for speech translation, a single model is able to improve the previous best on the MuST-C benchmark [mustc19] by BLEU points. When extending to the IWSLT speech translation task, which is very challenging because it requires generating audio segmentations, we find that the relative model scales much better with segmentation quality than its absolute counterpart, and can challenge a very strong cascaded model, which has the advantage of additional model parameters, an intermediate re-segmentation component, and more data.
2 Model Description
A speech-to-text model for either automatic speech recognition or direct translation transforms a source speech input, given as a sequence of frames, into a target text sequence of tokens. The encoder transforms the speech inputs into hidden representations. The decoder first generates a language-model-style hidden representation given the previous inputs, then uses the attention mechanism [bahdanau2014neural] to extract the relevant context from the encoder states; the two are then combined to generate the output distribution.
The Transformer [vaswani2017attention] uses attention as the main network component to learn encoder and decoder hidden representations. Given three sequences of vectorized states, the queries $Q$, keys $K$ and values $V$, attention computes an energy function between each query and each key. These energy terms are then normalized with a softmax function and used to take the weighted average of the values. The energy function can be modeled with a neural network [luong2015effective], or can be as simple as a projected dot-product between two vectors (with three additional weight matrices), computed as a parallelized matrix multiplication in Equation 6.
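For reference, one common way to write this projected dot-product attention is sketched below. It assumes the scaled formulation of [vaswani2017attention] with row-vector states; the projection matrices $W_Q$, $W_K$, $W_V$ and the key dimension $d_k$ are notation assumed here for illustration.

```latex
\mathrm{Attention}(Q, K, V) =
  \mathrm{softmax}\!\left( \frac{(Q W_Q)\,(K W_K)^{\top}}{\sqrt{d_k}} \right) V W_V
```

The energy between query $i$ and key $j$ is the $(i, j)$ entry of the matrix inside the softmax. Whether or not the scaling factor $1/\sqrt{d_k}$ is included does not change how positions enter the computation.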
[vaswani2017attention] also improved the attention above through the concept of multi-head attention (MAH), which splits the projected queries, keys and values into several heads. The same dot-product operation is applied to each query, key and value head, and the final result is the concatenation of the per-head outputs.
The Transformer encoder and decoder are constructed from stacked layers with identical components. Each encoder layer has one self-attention (MAH) sub-layer, followed by a position-wise feed-forward neural network with ReLU activation. (It is a sub-layer from the top-down perspective of the whole network, but as a neural network in itself it has two hidden layers of its own.)
Each decoder layer is quite similar to its encoder counterpart, with a self-attention sub-layer connecting the decoder states and a feed-forward network. An additional encoder-decoder attention layer sits in between to extract context vectors from the top encoder states. Furthermore, the Transformer uses residual connections to carry information from the bottom layers (e.g. the input embeddings) to the top layers. Layer normalization [ba2016layer] plays a supporting role, keeping the norms of the outputs in check when applied after each residual connection.
2.2 Relative Position Encoding in Transformer
Equation 6 suggests that attention is position-invariant, i.e. if the key and value states change their order, the output remains the same. To alleviate this problem for this content-based model, positional information within the input sequence is represented in a manner similar to word embeddings. The positions are treated as discrete variables and then mapped to embeddings, either through a look-up table with learnable parameters [sukhbaatar2015end] or with fixed encodings in a trigonometric form:
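A sketch of this fixed form, assuming the standard sinusoidal encodings of [vaswani2017attention], is given here. The symbols $pos$ (position), $k$ (dimension index) and $d$ (model size) are notation assumed for illustration.

```latex
\begin{aligned}
PE(pos, 2k)   &= \sin\!\left( pos / 10000^{2k/d} \right) \\
PE(pos, 2k+1) &= \cos\!\left( pos / 10000^{2k/d} \right)
\end{aligned}
```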
When applied to speech, this encoding is added to the speech input features [pham2019transformer]. The periodic property of the encodings allows the model to generalize to unseen input lengths. Following the factorization in [dai-etal-2019-transformer], we can rewrite the energy function in Equation 6 for self-attention between two encoder hidden states at positions $i$ and $j$, decomposing it into four terms:
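Under the factorization of [dai-etal-2019-transformer], this decomposition can be sketched as follows. The notation is assumed here: $x_i$ is the content (row) vector at position $i$, $p_i$ its absolute position encoding, and $W_Q$, $W_K$ the query and key projections; the scaling factor is omitted.

```latex
\begin{aligned}
e_{ij} &= (x_i + p_i)\, W_Q W_K^{\top} (x_j + p_j)^{\top} \\
       &= \underbrace{x_i W_Q W_K^{\top} x_j^{\top}}_{A}
        + \underbrace{x_i W_Q W_K^{\top} p_j^{\top}}_{B}
        + \underbrace{p_i W_Q W_K^{\top} x_j^{\top}}_{C}
        + \underbrace{p_i W_Q W_K^{\top} p_j^{\top}}_{D}
\end{aligned}
```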
Equation 8 gives us an interpretation of the function: term A is a purely content-based comparison between two hidden states (i.e. speech feature comparison), while term D gives a bias between two absolute positions. The remaining terms, B and C, capture the interaction between content and position.
The extension proposed by [shaw-etal-2018-self] and later by [dai-etal-2019-transformer] changes terms B, C and D so that only relative positions are taken into account:
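Assuming the formulation of [dai-etal-2019-transformer], the relative energy can be sketched as below. All symbol names are assumptions made for illustration: $x_i$ is the content (row) vector at position $i$, $r_{i-j}$ is the encoding of the relative distance, $u$ and $v$ are learned bias vectors, and $W_R$ is a separate projection for positions.

```latex
\begin{aligned}
e^{\mathrm{rel}}_{ij} &=
    \underbrace{x_i W_Q W_K^{\top} x_j^{\top}}_{A}
  + \underbrace{x_i W_Q W_R^{\top} r_{i-j}^{\top}}_{B}
  + \underbrace{u\, W_K^{\top} x_j^{\top}}_{C}
  + \underbrace{v\, W_R^{\top} r_{i-j}^{\top}}_{D}
\end{aligned}
```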
The new term B computes the relevance between the input query and the relative distance between positions $i$ and $j$. Term C introduces an additional bias towards the content of the key state at position $j$, while term D represents a bias towards the global distance. Terms B and D also use an additional linear projection, so that positions and embeddings are projected differently.
With this relative position scheme, when the two inputs at positions $i$ and $j$ are shifted together (for example, because of extra noise or silence in the utterance), the energy function stays the same (for the first layer of the network). Moreover, the global and local bias terms can capture certain inductive biases in the data, for example the average length of silence or applause.
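The shift invariance can be checked numerically with a minimal sketch. It compares a position-only energy that depends on the distance $i - j$ with one that depends on the absolute key position. The helper names, the 8-dimensional encoding and the fixed `GLOBAL_BIAS` vector are hypothetical stand-ins for illustration, not the trained model.

```python
import math


def sinusoid(pos, dim=8):
    """Sinusoidal encoding of a (possibly negative) integer position."""
    enc = []
    for k in range(dim // 2):
        angle = pos / (10000 ** (2 * k / dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical fixed vector standing in for a learned global bias.
GLOBAL_BIAS = [0.3, -0.1, 0.2, 0.5, -0.4, 0.1, 0.0, 0.25]


def relative_position_term(i, j):
    """Position-only energy that sees just the distance i - j."""
    return dot(GLOBAL_BIAS, sinusoid(i - j))


def absolute_position_term(i, j):
    """Position-only energy built from the absolute key position j."""
    return dot(GLOBAL_BIAS, sinusoid(j))


shift = 500  # e.g. 500 extra frames of applause before the speech
rel_same = math.isclose(relative_position_term(10, 7),
                        relative_position_term(10 + shift, 7 + shift))
abs_same = math.isclose(absolute_position_term(10, 7),
                        absolute_position_term(10 + shift, 7 + shift))
print(rel_same, abs_same)
```

The relative term is unchanged by the shift because the distance $i - j$ is unchanged, whereas the absolute term moves with the key position.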
2.3 Adaptation to speech inputs
For relative position encodings with speech inputs, should we use learnable embeddings or fixed encodings to represent the distance between positions? The latter has the clear advantage that it is already periodic, and given that speech inputs can be thousands of frames long, the former would necessarily require a cut-off [shaw-etal-2018-self] to handle longer input sequences. These reasons make sinusoidal encodings a logical choice.
Importantly, the relative position scheme above was proposed for autoregressive language models, in which attention runs in only one direction. In speech encoders, each state can attend in both the left and right directions, so we propose to use a positive distance when the key is to the left of the query ($j < i$) and a negative distance otherwise. As a result, the encodings for distances $k$ and $-k$ share the same cosine terms, while their sine terms have opposite signs, which gives the model a hint to assign different biases to different directions. Implementation-wise, terms B and D can be computed efficiently with a minimal number of matrix operations. For each query, it is necessary to compute $2T-1$ terms, one for each distance $k$ with $-(T-1) \le k \le T-1$ (for a sequence with $T$ states, the distance from one state to another always lies in that range). In contrast, [dai-etal-2019-transformer] only needs to compute $T$ terms, as it has only one direction. This is followed by the shifting trick [dai-etal-2019-transformer] to obtain the required energy terms.
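The bookkeeping for bidirectional distances can be sketched as follows. This is a minimal illustration under assumed names (`relative_encodings`, `gather_relative`), and it uses plain indexing to select the right entry per query-key pair, which has the same effect as the shifting trick but is not the padded-reshape implementation itself.

```python
import math


def sinusoid(distance, dim=4):
    """Sinusoidal encoding of a signed distance, laid out as [sin, cos, sin, cos]."""
    enc = []
    for k in range(dim // 2):
        angle = distance / (10000 ** (2 * k / dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc


def relative_encodings(num_states, dim=4):
    """Encodings for all 2T-1 distances, ordered from T-1 down to -(T-1)."""
    return [sinusoid(d, dim) for d in range(num_states - 1, -num_states, -1)]


def gather_relative(scores_per_distance, num_states):
    """Pick, for query i and key j, the score stored for distance i - j.

    scores_per_distance[i][c] holds the score of query i for the c-th distance
    in the ordering above, so distance i - j lives in column (T - 1) - (i - j).
    """
    last = num_states - 1
    return [[scores_per_distance[i][last - (i - j)] for j in range(num_states)]
            for i in range(num_states)]


T = 3
distances = list(range(T - 1, -T, -1))  # [2, 1, 0, -1, -2]
# Toy scores: each query scores a distance d as d itself, so the result is readable.
toy_scores = [[float(d) for d in distances] for _ in range(T)]
energies = gather_relative(toy_scores, T)
print(energies)  # entry [i][j] equals i - j
```

Keys to the left of the query receive positive distances and keys to the right receive negative ones, so the two directions are distinguished only by the sign of the sine components.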
ASR. For ASR tasks, our experiments were conducted on the standard English Switchboard and Fisher data under both benchmark conditions: 300 hours and 2000 hours of training data. Our reported test results are for the Hub5 test set with its two subsets, Switchboard and CallHome. Target transcripts are segmented with byte-pair encoding [sennrich2016bpeacl] using k merges.
SLT. We split our SLT task into two subtasks. Many SLT datasets require an auto-segmentation component to split the audio into sentence-like segments (this is commonly seen in IWSLT evaluation campaigns [niehues_j_2019_3525578]). For end-to-end models, this step is crucial due to the lack of incremental decoding and the higher GPU memory requirements. The recent MuST-C [mustc19] corpus contains segmentations for both the training and test sets, requiring no extra segmentation component, so we use its English-German pair as our first experimental benchmark. We further carry out experiments on the IWSLT 2019 evaluation campaign data, a superset of MuST-C, where segmentation is not given; here we can compare the effects of variable-quality segmentation on different end-to-end models, and also compare these models to highly competitive tuned cascades. We use the MuST-C validation data for both tasks.
Our baselines for all experiments use the Deep Stochastic Transformer [pham2019transformer]. We use the relative encoding scheme above for both encoder and decoder to yield relative Transformers.
For ASR, both our baseline Transformer and relative Transformer have encoder and decoder layers with the model size and the feed-forward networks have the hidden layer size of . Dropout is applied with the same mask across time steps [gal2016theoretically] with and also directly at the discrete decoder inputs with . All models are trained for at most steps and the reported model parameters are the average of the 10 checkpoints with lowest perplexities on the cross-validation data.
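The checkpoint averaging step described above can be sketched as follows. This is a minimal illustration assuming checkpoints are stored as dictionaries of flat parameter lists; the function names and the toy values are hypothetical.

```python
def average_checkpoints(checkpoints):
    """Average parameters element-wise over a list of {name: [floats]} checkpoints."""
    names = checkpoints[0].keys()
    count = len(checkpoints)
    averaged = {}
    for name in names:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / count for values in columns]
    return averaged


def select_best(checkpoints_with_ppl, keep=10):
    """Keep the `keep` checkpoints with the lowest validation perplexity."""
    ranked = sorted(checkpoints_with_ppl, key=lambda item: item[0])
    return [params for _, params in ranked[:keep]]


# Toy example: three (perplexity, parameters) pairs; keep the best two.
history = [
    (12.0, {"w": [1.0, 2.0]}),
    (10.0, {"w": [3.0, 4.0]}),
    (11.0, {"w": [5.0, 6.0]}),
]
final_params = average_checkpoints(select_best(history, keep=2))
print(final_params)
```

In the setting described in the text, `keep` would be 10 and the perplexities would come from the cross-validation data.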
For SLT, the models and the training process are identical to ASR, with the exception that we use encoder layers (the SLT data sequences are longer and thus need more memory). Following the curriculum learning intuition that SLT models benefit from pre-training the speech encoder with ASR [bansal2018pre], we first pre-train the model for ASR with the parallel English transcripts from MuST-C, and then fine-tune the encoder weights and re-initialize the decoder for SLT. This approach enabled us to consistently train our SLT models without divergence (which may happen when the learning rate is too aggressive or the half-precision GPU mode is used).
For all models, the batch size is set so that the model fits on a single GPU (Titan V and Titan RTX, with 12 and 24 GB respectively), and gradients are accumulated so that an update is made every target tokens. We used the same learning rate schedule as the Transformer translation model [vaswani2017attention] with warmup steps for the Adam [kingma2014adam] optimizer.
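The learning rate schedule of [vaswani2017attention] can be sketched as below: a linear warm-up followed by inverse square-root decay. The model size and warm-up values used in the example are hypothetical placeholders chosen only for illustration, not the settings of these experiments.

```python
def transformer_lr(step, model_size, warmup_steps, scale=1.0):
    """Linear warm-up followed by inverse square-root decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return scale * model_size ** -0.5 * min(step ** -0.5,
                                            step * warmup_steps ** -1.5)


# Hypothetical values for illustration only.
MODEL_SIZE = 512
WARMUP = 4000

peak = transformer_lr(WARMUP, MODEL_SIZE, WARMUP)
early = transformer_lr(1, MODEL_SIZE, WARMUP)
late = transformer_lr(2 * WARMUP, MODEL_SIZE, WARMUP)
print(early < peak and late < peak)
```

The rate peaks at the end of warm-up, which is where the two branches of the `min` meet.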
3.3 Speech Recognition Results
We present ASR results on the Switchboard-300 benchmark in Table 1. It is important to clarify that spectral augmentation (dubbed SpecAugment) is a recently proposed augmentation method that tremendously improved the regularization ability of seq2seq models for speech recognition [park2019specaugment]. To better demonstrate the effect of relative attention, we conduct experiments with and without augmentation.
Compared to the Deep Stochastic model [pham2019transformer], using relative attention is able to reduce our WER from to and to on SWB and CH, without any augmentation. Compared to other works under this condition, our results are second to none among the published end2end models, and can rival the LFMMI hybrid model [povey2016purely] that has an external language model utilizing extra monolingual data.
With spectral augmentation, the improvement from relative attention is still noticeable, further reducing WER from to , and to from the baseline (the relative gain on CallHome is kept at ). This is second only to [park2019specaugment], the state-of-the-art on this benchmark at and ; however their models use an aggressively regularized training regime on multiple TPUs for 20 days. Other end-to-end models [zeyer2019comparison, nguyen2019improving] using single GPUs showed similar behavior to ours with SpecAugment. Finally, with additional speed augmentation, relative attention is still additive, with further gains of 0.3 and 0.7 compared to our strong baseline.
| Model | SWB WER (%) | CH WER (%) |
| [saon2017english] Hybrid w/ BiLSTM | 7.7 | 13.9 |
| [han2017capio] Dense TDNN-LSTM | 6.1 | 11.0 |
| Deep Transformer (Ours) | 6.5 | 11.9 |
| Deep Relative Transformer (Ours) | 6.2 | 11.4 |
The experiments on the larger dataset with 2000h follow the above results for 300h, continuing to show positive effects from relative position encodings. The error rates on SWB and CH decrease from 6.5 and 11.9 to 6.2 and 11.4 (Table 2). Our best model is significantly better than previously published CTC [audhkhasi2018building] and LSTM-based [nguyen2019improving] models, and approaches the heavily tuned hybrid system [han2017capio] with dense TDNN-LSTM. It is likely possible to reach better error rates with the help of ensembled models, further data augmentation, and language models. Our experiments here, however, show that the novel relative model is consistently better than the baseline, regardless of the data size and augmentation conditions.
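To put the 2000h numbers in relative terms, the reduction can be computed directly from the table rows for the Deep Transformer (6.5 / 11.9) and the Deep Relative Transformer (6.2 / 11.4). The helper below is a small illustrative calculation, not part of the experimental pipeline.

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent."""
    return 100.0 * (baseline - improved) / baseline


swb = relative_reduction(6.5, 6.2)
ch = relative_reduction(11.9, 11.4)
print(f"SWB: {swb:.1f}%  CH: {ch:.1f}%")  # SWB: 4.6%  CH: 4.2%
```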
3.4 Speech Translation Results
Our first SLT models were trained only on the MuST-C training data, and the results are reported on the COMMON test set (MuST-C is a multilingual dataset, and this test set consists of the utterances shared between the languages). Note that in this test set, we are provided with the segmentation of each utterance, each of which has a corresponding translation. For each utterance, we can directly translate with the end-to-end model, and the final score can be obtained using standard BLEU scorers such as SacreBLEU [post-2018-call], because the output and the reference are already sentence-aligned in a standardized way.
As shown in Table 3, our Deep Transformer baseline achieves an impressive BLEU score compared to the ST-Transformer [di2019adapting], which is a Transformer model specifically adapted for speech translation. However, using relative position information makes self-attention more robust and effective still, as our BLEU score increases to .
To maximize the performance of an end-to-end speech translation model, we also add the Speech-Translation TED corpus (available from the evaluation campaign at https://sites.google.com/view/iwslt-evaluation-2019/speech-translation) and follow the method from [di2019adapting] to add synthetic data for speech translation, where a cascaded system is used to generate translations for the TEDLIUM-3 data [hernandez2018ted]. Our cascade system is built following the procedure of the winning system in the 2019 IWSLT ST evaluation campaign [pham2019iwslt].
With these additional corpora, we observe a considerable boost in translation performance (similarly observed in [di2019adapting]). More importantly, the relative model further enlarges the performance gap between two models to now BLEU points. We hypothesize that the model is able to more effectively use the additional data, with data patterns more easily captured when the model considers relative rather than absolute distance between speech features. More concretely, each training corpus has a different segmentation method, which leads to large variation in spoken patterns, which is difficult to capture using absolute position encodings.
To verify our hypothesis, we compare these two models and the cascaded system on the TEDTalk test sets without a provided segmentation. These talks are available as long audio files and require an external audio segmentation step to make translation feasible. It is important to note that the cascaded model has a separate text re-segmentation component [cho2017nmt], which takes the ASR output and reorganizes it into logical sentences; this is a considerable advantage over the end-to-end models. We experimented with several audio segmentation methods and observed that the cascade is less affected by the segmentation quality than the end-to-end models.
The results in Table 4 compare two different segmentation methods, LIUM [rouvier2013open] and VAD [Charles2013], on four different test sets. The relative Transformer consistently outperforms the Transformer, regardless of segmentation. Moreover, comparing the segmenters, the relative model makes more effective use of higher segmentation quality, yielding a larger BLEU difference. While the base Transformer only increases up to BLEU with better segmentation, this figure becomes up to BLEU points for the relative counterpart. In the end, the cascade model still shows that heavily tuned separate components, together with an explicit text segmentation module, are an advantage over end-to-end models, but this gap is closing with more efficient architectures.
| Model | BLEU |
| +Additional Data [di2019data] | 23.0 |
| Deep Transformer (w/ SpecAugment) | 24.2 |
| Deep Relative Transformer (w/ SpecAugment) | 25.2 |
Speech recognition and translation with end-to-end models have become active research areas. In this work, we adapted the relative position encoding scheme to speech Transformers for these two tasks. We showed that, given the properties of acoustic modeling, the resulting novel network provides consistent and significant improvements across different tasks and data conditions. Audio segmentation nevertheless remains a barrier to end-to-end speech translation; we look forward to future neural solutions.