This model directly learns the mapping from input acoustic signals to output transcriptions without decomposing the problem into separate modules such as lexicon modeling, acoustic modeling and language modeling as in the conventional hybrid architecture. While this kind of E2E approach significantly simplifies the speech recognition pipeline, its weakness is that it is difficult to tune the strength of each component. One particular problem from our observations is that the attention-based E2E model tends to make grammatical errors, which indicates that the language modeling power of the model is weak, possibly due to the small amount of training data or the mismatch between the training and evaluation data. However, due to the joint modeling approach in the attention model, it is unclear how to strengthen the language modeling power, i.e., attribute more weight to the previous output tokens in the decoder, or the acoustic modeling power, i.e., attribute more weight to the context vector from the encoder.
While an external language model may be used to mitigate the weak language modeling power of an attention-based E2E model, by either re-scoring the hypotheses or through shallow or deep fusion , the improvements are usually limited, and it incurs additional computational cost. Inspired by SpecAugment  and BERT , we propose a semantic mask approach to improve the language modeling power of the attention-based E2E model, which at the same time improves the generalization capacity of the model as well. Like SpecAugment, this approach masks out part of the acoustic features during model training. However, instead of using a random mask as in SpecAugment, our approach masks out the whole patch of features corresponding to an output token during training, e.g., a word or a word-piece. The motivation is to encourage the model to fill in the missing token (or correct the semantic error) based on the contextual information with less acoustic evidence; consequently, the model may have a stronger language modeling power and be more robust to acoustic distortions.
In principle, our approach is applicable to the attention-based E2E framework with any type of neural network encoder. To constrain our research scope, we focus on the transformer architecture , which was originally proposed for neural machine translation. Recently, it has been shown that the transformer model can achieve competitive or even higher recognition accuracy compared with the recurrent neural network (RNN) based E2E model for speech recognition. Compared with RNNs, the transformer model can capture long-term correlations with a constant number of sequential operations, instead of the many steps of back-propagation through time (BPTT) required by RNNs. We evaluate our transformer model with semantic masking on the LibriSpeech and TedLium datasets. We show that semantic masking achieves a significant word error rate (WER) reduction on top of SpecAugment, and we report the lowest WERs on the test sets of the LibriSpeech corpus with an E2E model.
2 Related Work
As mentioned above, our approach is closely related to SpecAugment , which applies a random mask to the acoustic features to regularize an E2E model. However, our masking approach is more structured, in the sense that we mask the acoustic signals corresponding to a particular output token. Besides the benefit in terms of model regularization, our approach also encourages the model to reconstruct the missing token based on the contextual information, which improves the power of the implicit language model in the decoder. Our masking approach, which operates at the output-token level, is also similar to the approach used in BERT , with the key difference that our approach works in the acoustic space.
In terms of the model structure, the transformer-based E2E model has been investigated for both the attention-based framework as well as RNN-T based models . Our model structure generally follows , with the minor difference that we use a deeper CNN before the self-attention blocks. We use a joint CTC/attention loss to train our model, following .
3 Semantic Masking
3.1 Masking Strategy
Our masking approach requires alignment information in order to perform the token-wise masking shown in Figure 1. There are multiple speech recognition toolkits available to generate such alignments. In this work, we used the Montreal Forced Aligner (https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) trained on the training data to perform forced alignment between the acoustic signals and the transcriptions, obtaining word-level timing information. During model training, we randomly select a percentage of the tokens and mask the corresponding speech segments in each iteration. Following , we randomly sample 15% of the tokens and set the masked pieces to the mean value of the whole utterance.
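As an illustration, the token-wise masking described above can be sketched in a few lines. This is a minimal sketch, not the paper's code: the function name, the `(token, start_frame, end_frame)` alignment format and the NumPy-based implementation are our own assumptions.

```python
import numpy as np

def semantic_mask(features, alignments, mask_ratio=0.15, rng=None):
    """Mask the feature frames of randomly selected tokens.

    features   : (T, D) array of acoustic features for one utterance.
    alignments : list of (token, start_frame, end_frame) tuples produced
                 by a forced aligner (end index exclusive).
    Masked frames are set to the per-dimension mean of the whole
    utterance, as described in the text.
    """
    rng = rng or np.random.default_rng()
    masked = features.copy()
    utterance_mean = features.mean(axis=0)      # per-dimension mean
    for _token, start, end in alignments:
        if rng.random() < mask_ratio:           # sample ~15% of tokens
            masked[start:end] = utterance_mean
    return masked
```

In a real training pipeline this would be applied on the fly to each utterance in a batch, so a token that is masked in one epoch may be visible in the next.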
It should be noted that the semantic masking strategy is easy to combine with the previous SpecAugment masking strategy; therefore, we also adopt time warping, frequency masking and time masking in our masking strategy.
3.2 Why Does the Semantic Mask Work?
SpecAugment  is similar to our method, since both mask the spectrum for E2E model training. However, the intuitions behind the two methods are different. SpecAugment randomly masks the spectrum in order to add noise to the source input, making the E2E ASR problem harder and preventing over-fitting in a large E2E model.
In contrast, our method aims to force the decoder to learn a better language model. If the speech features of a few words are masked, the E2E model has to predict the masked tokens based on other signals, such as the tokens that have already been generated or the unmasked speech features. In this way, we may alleviate the over-fitting issue of generating a word by only considering its corresponding speech features while ignoring other useful context. We believe our method is more effective when the input is noisy, because a model may generate correct tokens without considering previously generated tokens in a noise-free setting, but it has to consider other signals when the inputs are noisy, which is confirmed in our experiments.
4 Model Architecture

Following , we add convolution layers before the Transformer blocks and discard the widely used positional encoding component. According to our preliminary experiments, the convolution layers slightly improve the performance of the E2E model. In the following, we describe the CNN layers and the Transformer block respectively.
4.1 CNN Layer
We represent the input signals as a sequence of log-Mel filter bank features X = (x_1, ..., x_T), where each x_t is an 83-dimensional vector. Since the spectrum sequence is much longer than the text, we use a VGG-like convolution block  with layer normalization and a max-pooling function. The specific architecture is shown in Figure 2. We hope the convolution block is able to learn local relationships within a small context as well as relative positional information. According to our experiments, this architecture outperforms the convolutional 2D subsampling method . We also use a 1D-CNN in the decoder to extract local features, replacing the position embedding. (Experiment results show that the encoder CNN is more powerful than the decoder CNN.)
4.2 Transformer Block
Our Transformer architecture is implemented as in , depicted in Figure 3. The transformer module consumes the outputs of the CNN and extracts features with a self-attention mechanism. Suppose that Q, K and V are the inputs of a transformer block; its outputs are calculated by the following equation

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,

where d_k is the dimension of the feature vectors. To attend to information from multiple representation subspaces, multi-head attention is used, which is formulated as

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

and h is the number of attention heads. Moreover, residual connections, feed-forward layers and layer normalization  are indispensable parts of the Transformer, and their combination is shown in Figure 3.
4.3 ASR Training and Decoding
Following previous work , we employ a multi-task learning strategy to train the E2E model. Formally, both the E2E decoder and the CTC module predict the frame-wise distribution of Y given the corresponding source X, denoted as P_s2s(Y|X) and P_ctc(Y|X) respectively. We train our model with a weighted average of the two negative log-likelihoods,

L = -α log P_s2s(Y|X) - (1 - α) log P_ctc(Y|X),

where α is set to 0.7 in our experiments.
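The weighted average of the two negative log-likelihoods described above is a one-liner; a sketch follows, where the helper name and the convention that α weights the attention (S2S) branch are our assumptions.

```python
def joint_ctc_attention_loss(log_p_s2s, log_p_ctc, alpha=0.7):
    """Multi-task loss: weighted sum of the attention (S2S) and CTC
    negative log-likelihoods, with alpha = 0.7 as in the text."""
    return -(alpha * log_p_s2s + (1.0 - alpha) * log_p_ctc)
```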
We combine the E2E model score P_s2s, the CTC score P_ctc and an RNN language model score P_rnnlm in the decoding process, which is formulated as

Y* = argmax_Y { λ log P_s2s(Y|X) + (1 - λ) log P_ctc(Y|X) + γ log P_rnnlm(Y) },

where λ and γ are tuned on the development set. Following , we rescore the beam outputs based on another Transformer-based language model and a sentence length penalty, where P_translm(Y) denotes the sentence generative probability given by the Transformer language model.
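The two-pass scoring can be sketched as follows. The function names and the placeholder weight values are our assumptions; in the paper, the interpolation weights are tuned on the development set.

```python
def beam_search_score(log_p_s2s, log_p_ctc, log_p_rnnlm, lam=0.5, gamma=0.3):
    """First pass: interpolate the S2S and CTC scores and add a weighted
    RNN language-model score for each beam hypothesis."""
    return lam * log_p_s2s + (1.0 - lam) * log_p_ctc + gamma * log_p_rnnlm

def rescore(first_pass_score, log_p_translm, n_words, theta=0.5, beta=0.1):
    """Second pass: add a weighted Transformer LM score and a sentence
    length reward to each n-best hypothesis, then re-rank."""
    return first_pass_score + theta * log_p_translm + beta * n_words
```

The length reward counteracts the bias of log-probability scores toward short hypotheses.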
5 Experiments

In this section, we describe our experiments on LibriSpeech  and TedLium2 . We compare our results with state-of-the-art hybrid and E2E systems. We implemented our approach based on ESPnet , and the specific settings on the two datasets are the same as in , except for the decoding setting: we use a beam size of 20 in our experiments.
| Model | dev-clean | dev-other | test-clean | test-other |
| --- | --- | --- | --- | --- |
| RWTH (E2E)  | 2.9 | 8.4 | 2.8 | 9.3 |
| ESPNET Transformer  | 2.2 | 5.6 | 2.6 | 5.7 |
| Wav2letter Transformers  | 2.11 | 5.25 | 2.30 | 5.64 |
| + Rescore  | 2.17 | 4.67 | 2.31 | 5.18 |
| + LM Fusion | 2.40 | 6.02 | 2.66 | 6.15 |
| Model with SpecAugment | 3.33 | 9.05 | 3.57 | 9.00 |
| + LM Fusion | 2.20 | 5.73 | 2.39 | 5.94 |
| Model with Semantic Mask | 2.93 | 7.75 | 3.04 | 7.43 |
| + LM Fusion | 2.09 | 5.31 | 2.32 | 5.55 |
| + Speed Perturbation | 2.10 | 5.30 | 2.35 | 5.35 |
| RWTH (HMM)  | 1.9 | 4.5 | 2.3 | 5.0 |
| Wang et al.  | - | - | 2.26 | 4.85 |
| Multi-stream self-attention  | 1.8 | 5.8 | 2.2 | 5.7 |
5.1 LibriSpeech 960h
We represent the input signals as a sequence of 80-dim log-Mel filter bank features concatenated with 3-dim pitch features . SentencePiece is employed as the tokenizer, and the vocabulary size is 5000. The hyper-parameters of the Transformer and SpecAugment follow  for a fair comparison. We use the Adam algorithm to update the model with a linear learning-rate warmup; after the warmup steps, the learning rate decreases proportionally to the inverse square root of the step number. We train our model on 4 P40 GPUs, which takes approximately 5 days to converge. We also apply speed perturbation by changing the audio speed to 0.9, 1.0 and 1.1. Following , we average the last 5 checkpoints as the final model. Unlike  and , we use the same checkpoint for the test-clean and test-other datasets.
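The warmup-then-inverse-square-root schedule can be written compactly. This is a sketch; the actual warmup step count is elided in the text, so it is left as a parameter here.

```python
import math

def inv_sqrt_lr(step, warmup_steps, peak_lr):
    """Linear warmup for `warmup_steps` updates, then decay proportional
    to the inverse square root of the step number."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps         # linear warmup
    return peak_lr * math.sqrt(warmup_steps / step)  # 1/sqrt(step) decay
```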
The RNN language model is the released LSTM language model provided by ESPnet (https://github.com/espnet/espnet/tree/master/egs/librispeech/asr1). The Transformer language model for rescoring is trained on the LibriSpeech language model corpus with the GPT-2 base setting (308M parameters). We use the code of NVIDIA Megatron-LM (https://github.com/NVIDIA/Megatron-LM) to train the Transformer language model.
We evaluate our model in different settings. The baseline Transformer denotes the model with position embedding. The comparison between the baseline Transformer and our architecture (Model with SpecAugment) indicates the improvement attributable to the architecture. Model with Semantic Mask applies the semantic mask strategy on top of SpecAugment; it outperforms Model with SpecAugment by a large margin in the setting without external language model fusion, demonstrating that our masking strategy helps the E2E model to learn a better language model. The gap becomes smaller when a language model fusion component is added, which further confirms our motivation in Section 1. Speed perturbation does not help model performance on the clean dataset, but it is effective on the test-other dataset. Rescoring is beneficial to both the test-clean and test-other datasets.
As far as we know, our model is the best E2E ASR system on the LibriSpeech test sets: it achieves a comparable result to the wav2letter Transformer on the test-clean dataset and a better result on the test-other dataset, even though our model (75M parameters) is much smaller than the wav2letter Transformer (210M parameters). The reason might be that our semantic masking is more suitable for noisy settings, where the input features are less reliable and the model has to predict the next token relying on previously generated tokens and the whole context of the input. Our model is built upon the code base of ESPnet and achieves relative gains due to the better architecture and masking strategy. Compared with hybrid methods, our model obtains similar performance on the test-clean set, but is still worse than the best hybrid model on the test-other set.
We also analyze the performance of different masking strategies, shown in Table 2, where all models are shallow-fused with the RNN language model. SpecAugment provides roughly 30% relative gains on the test-clean and test-other datasets. Comparing the second and third lines, we find that word masking is more effective on the test-other dataset. The last row indicates that word masking is complementary to random masking on the time axis.
| Model | WER |
| --- | --- |
| Kaldi (Chain + TDNN + Large LM) | 9.0 |
| + LM Fusion | 8.1 |
| + LM Fusion | 7.7 |
To verify the generalization of the semantic mask, we further conduct experiments on the TedLium2  dataset, which is extracted from TED talks. The corpus consists of 207 hours of speech data with 90k transcripts. For a fair comparison, we use the same data-preprocessing method, Transformer architecture and hyperparameter settings as in . Our acoustic features are 80-dim log-Mel filter bank and 3-dim pitch features, normalized by the mean and standard deviation of the training set. Utterances with more than 3000 frames or more than 400 characters are discarded. The vocabulary size is set to 1000.
The experiment results are listed in Table 3, showing a similar trend to the results on the LibriSpeech dataset. The semantic mask is complementary to SpecAugment, enabling better sequence-to-sequence language model training in an E2E model and resulting in a 4.5% relative gain. The experiment demonstrates the effectiveness of the semantic mask on a different and smaller dataset.
6 Conclusion

This paper presents a semantic mask method for E2E speech recognition, which trains the model to better consider the whole audio context for disambiguation. Moreover, we elaborate a new architecture for the E2E model, achieving state-of-the-art performance on the LibriSpeech test sets within the scope of E2E models.
-  Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
-  Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, 2015, pp. 5206–5210.
-  Shubham Toshniwal, Anjuli Kannan, Chung-Cheng Chiu, Yonghui Wu, Tara N. Sainath, and Karen Livescu, “A comparison of techniques for language model integration in encoder-decoder speech recognition,” in 2018 IEEE Spoken Language Technology Workshop, SLT 2018, Athens, Greece, December 18-21, 2018, 2018, pp. 369–375.
-  Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” CoRR, vol. abs/1904.08779, 2019.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, 2017, pp. 5998–6008.
-  Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, and Wangyou Zhang, “A comparative study on transformer vs RNN in speech applications,” CoRR, vol. abs/1909.06317, 2019.
-  Alex Graves, “Sequence transduction with recurrent neural networks,” CoRR, vol. abs/1211.3711, 2012.
-  Abdelrahman Mohamed, Dmytro Okhonko, and Luke Zettlemoyer, “Transformers with convolutional context for ASR,” CoRR, vol. abs/1904.11660, 2019.
-  Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in , 2016, pp. 770–778.
-  Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, “Layer normalization,” CoRR, vol. abs/1607.06450, 2016.
-  Gabriel Synnaeve, Qiantong Xu, Jacob Kahn, Edouard Grave, Tatiana Likhomanenko, Vineel Pratap, Anuroop Sriram, Vitaliy Liptchinsky, and Ronan Collobert, “End-to-end asr: from supervised to semi-supervised learning with modern architectures,” arXiv preprint arXiv:1911.08460, 2019.
-  Anthony Rousseau, Paul Deléglise, and Yannick Estève, “TED-LIUM: an automatic speech recognition dedicated corpus,” in Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, May 23-25, 2012, 2012, pp. 125–129.
-  Christoph Lüscher, Eugen Beck, Kazuki Irie, Markus Kitza, Wilfried Michel, Albert Zeyer, Ralf Schlüter, and Hermann Ney, “Rwth asr systems for librispeech: Hybrid vs attention,” Interspeech, Graz, Austria, pp. 231–235, 2019.
-  Yongqiang Wang, Abdelrahman Mohamed, Duc Le, Chunxi Liu, Alex Xiao, Jay Mahadeokar, Hongzhao Huang, Andros Tjandra, Xiaohui Zhang, Frank Zhang, et al., “Transformer-based acoustic modeling for hybrid speech recognition,” arXiv preprint arXiv:1910.09799, 2019.
-  Kyu J. Han, Ramon Prieto, Kaixing Wu, and Tao Ma, “State-of-the-art speech recognition using multi-stream self-attention with dilated 1d convolutions,” CoRR, vol. abs/1910.00716, 2019.
-  Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur, “A pitch extraction algorithm tuned for automatic speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, 2014, pp. 2494–2498.
-  Anthony Rousseau, Paul Deléglise, and Yannick Estève, “Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland, May 26-31, 2014, 2014, pp. 3935–3939.