which outperformed conventional hybrid systems relying on Hidden Markov Models [4, 17]. In contrast to prior work, which required training several independent components (acoustic, pronunciation, and language models) and was considerably more complex, E2E models are faster and easier to implement, train, and deploy.
To enhance speech recognition accuracy, ASR models are often augmented with independently trained language models that re-score the list of n-best hypotheses. The use of an external language model induces a natural trade-off between model speed and accuracy. While simple N-gram language models (e.g., KenLM) are extremely fast, they cannot achieve the same level of performance as heavier and more powerful neural language models, such as Transformers [12, 7, 1].
Language model re-scoring effectively expands the search space of speech recognition candidates; however, it can barely help when the ground truth word was assigned a low score by an erroneous ASR model. Traditional left-to-right language models are also prone to error accumulation: if some word at the beginning of the decoded speech is misrecognized, it will affect the scores of all succeeding words by providing them with incorrect context. To address these problems, we propose to train a conditional language model that corrects the errors made by the system, operating similarly to neural machine translation (NMT) [22, 6] by “translating” corrupted ASR output into the correct language.
There is a plethora of prior work on correcting ASR system output, and we refer the reader to [11] for a detailed overview. Most closely related to our work, [15] propose to train a spelling correction model based on an RNN with attention [2] to correct the output of the Listen, Attend and Spell (LAS) model. In contrast to this work, our model is based on the Transformer architecture [23] and does not require a complementary text-to-speech model for training data generation.
Transformers used for NMT are usually trained on millions of parallel sentences and tend to overfit easily if the data is scarce, as it is in our case. To solve this problem, we propose two complementary regularization techniques. First, we augment training data with the perturbed outputs of several ASR models trained on K-fold partitions of the training dataset. Second, we initialize both encoder and decoder with the weights of pre-trained BERT [8].
We evaluate the proposed approach on the LibriSpeech dataset [20] and use Jasper DR 10x5 [12] as our baseline ASR module. Our correction model, when applied to the greedy output of Jasper ASR, outperforms both the baseline and re-scoring with a 6-gram KenLM language model and almost matches the performance of re-scoring with the more powerful Transformer-XL language model.
2.1 Speech recognition baseline model
As our baseline ASR model we use Jasper [12], a deep convolutional E2E model. Jasper takes as input mel-filter bank features calculated from ms windows with a ms overlap and maps them to a probability distribution over characters per frame. The model is trained with Connectionist Temporal Classification (CTC) loss. In particular, we build on Jasper-DR-10x5, which consists of 10 blocks of 5 sub-blocks (1-D convolution, batch norm, ReLU, dropout), where the output of each block is added to the inputs of all following blocks, similar to DenseNet.
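To make the per-frame character output more concrete, the following is a minimal sketch of greedy CTC decoding: take the most probable symbol in each frame, collapse consecutive repeats, and drop the blank symbol. The toy alphabet, the blank symbol `_`, and the one-hot frame probabilities are assumptions made for illustration; they are not Jasper's actual vocabulary or interface.

```python
from typing import List, Sequence

BLANK = "_"  # assumed blank symbol; real systems reserve a dedicated index
ALPHABET = [BLANK, " ", "a", "c", "t"]  # toy alphabet, for illustration only


def ctc_greedy_decode(frame_probs: Sequence[Sequence[float]],
                      alphabet: List[str] = ALPHABET) -> str:
    """Pick the argmax symbol per frame, collapse repeats, remove blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    chars = []
    prev = None
    for idx in best:
        # A symbol is emitted only when it differs from the previous frame
        # and is not the blank.
        if idx != prev and alphabet[idx] != BLANK:
            chars.append(alphabet[idx])
        prev = idx
    return "".join(chars)


def one_hot(symbol: str) -> List[float]:
    """Toy per-frame distribution that puts all mass on one symbol."""
    return [1.0 if s == symbol else 0.0 for s in ALPHABET]


frames = [one_hot(s) for s in "cc_aa_t_"]
decoded = ctc_greedy_decode(frames)
```

Because repeats are collapsed before blanks are removed, a blank between two identical symbols is what allows a doubled letter to survive decoding.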
2.2 Language models used for re-scoring
A language model (LM) estimates the joint probability of a text corpus $(x_1, \dots, x_T)$ by factorizing it with the chain rule, $P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})$, and sequentially modeling each conditional term in the product. To simplify modeling, it is often assumed that the context size (the number of preceding words) necessary to predict each word in the corpus is limited to $N-1$: $P(x_t \mid x_1, \dots, x_{t-1}) \approx P(x_t \mid x_{t-N+1}, \dots, x_{t-1})$. This approximation is commonly referred to as an N-gram LM.
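As an illustration of the N-gram approximation, the sketch below builds a count-based N-gram LM and scores a sentence as a sum of conditional log-probabilities. It uses add-one smoothing purely for brevity, which is an assumption of this example; KenLM itself relies on modified Kneser-Ney smoothing. The class name and toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import List


class NGramLM:
    """Count-based N-gram LM with add-one smoothing (illustrative only)."""

    def __init__(self, n: int, corpus: List[List[str]]):
        self.n = n
        self.context_counts = Counter()
        self.ngram_counts = Counter()
        self.vocab = {"</s>"}
        for sentence in corpus:
            # Pad with n-1 start symbols so every word has a full context.
            tokens = ["<s>"] * (n - 1) + sentence + ["</s>"]
            self.vocab.update(sentence)
            for i in range(n - 1, len(tokens)):
                context = tuple(tokens[i - n + 1:i])
                self.context_counts[context] += 1
                self.ngram_counts[context + (tokens[i],)] += 1

    def log_prob(self, sentence: List[str]) -> float:
        """Sum of log P(x_t | previous n-1 words) over the sentence."""
        tokens = ["<s>"] * (self.n - 1) + sentence + ["</s>"]
        total = 0.0
        for i in range(self.n - 1, len(tokens)):
            context = tuple(tokens[i - self.n + 1:i])
            num = self.ngram_counts[context + (tokens[i],)] + 1
            den = self.context_counts[context] + len(self.vocab)
            total += math.log(num / den)
        return total


corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["a", "dog", "sat"]]
lm = NGramLM(n=2, corpus=corpus)
seen = lm.log_prob(["the", "cat", "sat"])
unseen = lm.log_prob(["sat", "the", "cat"])
```

A word order that occurs in the training corpus receives a higher score than a permutation of the same words, which is the signal used during re-scoring.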
Following the original Jasper paper [12], we consider two LMs: 6-gram KenLM [16] and Transformer-XL [7] with the full sentence as the context. For generation, a beam search with the width is used where each hypothesis is evaluated with shallow fusion of acoustic and language models:
It is worth noting that Transformer-XL does not replace KenLM but complements it. Beam search is governed by the joint score of the ASR model and KenLM, and the resulting beams are additionally re-scored with Transformer-XL in a single forward pass. Using Transformer-XL instead of KenLM throughout the beam search would be too expensive because of its much slower inference.
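The re-scoring step can be sketched as re-ranking an n-best list by a fused score. The form used below (acoustic log-probability plus a weighted LM log-probability plus a word-count bonus) is a common way to write shallow fusion and is an assumption of this example, as are the weights `alpha` and `beta`, the toy LM table, and all scores; it is not claimed to match the exact scoring function used in the experiments.

```python
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, float]  # (text, acoustic log-probability)


def rescore(nbest: List[Hypothesis],
            lm_log_prob: Callable[[str], float],
            alpha: float = 0.5,
            beta: float = 0.0) -> List[Tuple[str, float]]:
    """Re-rank an n-best list by a fused acoustic + LM score.

    score = log p_AM + alpha * log p_LM + beta * (number of words);
    alpha and beta are assumed tuning weights.
    """
    scored = []
    for text, am_score in nbest:
        fused = am_score + alpha * lm_log_prob(text) + beta * len(text.split())
        scored.append((text, fused))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy stand-in for an external LM: a lookup table of made-up log-probabilities.
TOY_LM = {"the cat sad": -9.0, "the cat sat": -6.0}


def toy_lm(text: str) -> float:
    return TOY_LM.get(text, -20.0)


nbest = [("the cat sad", -3.0), ("the cat sat", -4.0)]
ranked = rescore(nbest, toy_lm, alpha=0.5)
best_text = ranked[0][0]
```

With `alpha=0` the ranking falls back to the acoustic scores alone, so the weight controls how strongly the LM can override the acoustic model.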
2.3 ASR correction model
The proposed model (Figure 1) has a Transformer encoder-decoder architecture [23] commonly used for neural machine translation. We denote the number of layers in encoder and decoder as , the hidden size as , and the number of self-attention heads as . Similar to prior work [23, 8], the fully-connected inner-layer dimensionality is set to . Dropout with probability is applied after the embedding layer, after each sub-layer before the residual connection, and after multi-head dot-product attention.
We consider two options for initializing the model weights: random initialization and using the weights of pre-trained BERT [8]. Since BERT has the same architecture as the Transformer encoder, its parameters can be used directly for encoder initialization. To initialize the decoder, which has an additional encoder-decoder attention block in each layer, we duplicate the parameters of the corresponding self-attention block and use them for that block.
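The decoder initialization can be sketched on a toy parameter dictionary: every pre-trained tensor is copied, and the encoder-decoder attention block receives a duplicate of the self-attention parameters. The key names (`self_attn.*`, `cross_attn.*`, `ffn.inner`) and the flat-list weights are hypothetical and do not correspond to any particular library's checkpoint format.

```python
import copy
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter name -> flattened weights (toy)


def init_decoder_layer_from_bert(bert_layer: Params) -> Params:
    """Build decoder-layer parameters from one pre-trained encoder layer.

    All pre-trained tensors are copied as-is; the encoder-decoder attention
    block, which BERT lacks, gets a duplicate of the self-attention weights.
    """
    decoder_layer = copy.deepcopy(bert_layer)
    for name, weights in bert_layer.items():
        if name.startswith("self_attn."):
            cross_name = "cross_attn." + name[len("self_attn."):]
            # Deep copy so the two blocks can diverge during fine-tuning.
            decoder_layer[cross_name] = copy.deepcopy(weights)
    return decoder_layer


bert_layer = {
    "self_attn.query": [0.1, 0.2],
    "self_attn.key": [0.3, 0.4],
    "self_attn.value": [0.5, 0.6],
    "ffn.inner": [0.7, 0.8],
}
decoder_layer = init_decoder_layer_from_bert(bert_layer)
```

Copying rather than sharing the weights means the self-attention and encoder-decoder attention blocks start from the same values but are trained independently.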
We conduct our experiments on the LibriSpeech [20] benchmark. The LibriSpeech training dataset consists of three parts (train-clean-100, train-clean-360, and train-other-500), which together provide hours of transcribed speech or around K training sentences. For evaluation, LibriSpeech provides development datasets (dev-clean and dev-other) and test datasets (test-clean and test-other). We found that even baseline models made only a few mistakes on dev-clean, so we selected the checkpoint with the lowest WER on dev-other for evaluation.
To generate training data for our Transformer ASR correction model, we split all training data into folds and trained a separate Jasper model for each fold in a cross-validation manner: each model was trained on all the other folds and used to generate greedy ASR predictions for the one held-out fold. Then, we concatenated all resulting folds and used the Jasper greedy predictions as the source side of our parallel corpus, with ground truth transcripts as the target side.
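The cross-validation procedure above can be sketched as follows. The `train` callable stands in for training a full ASR model and is hypothetical, as are the helper names and the toy "model", which memorises its training set so that held-out usage is easy to see.

```python
from typing import Callable, List, Sequence, Tuple

Utterance = Tuple[str, str]  # (audio identifier, reference transcript)


def make_folds(data: Sequence[Utterance], k: int) -> List[List[Utterance]]:
    """Split data into k folds of near-equal size (round-robin)."""
    return [list(data[i::k]) for i in range(k)]


def build_parallel_corpus(
    data: Sequence[Utterance],
    k: int,
    train: Callable[[List[Utterance]], Callable[[str], str]],
) -> List[Tuple[str, str]]:
    """Return (ASR hypothesis, reference) pairs for every utterance.

    Each utterance is transcribed by a model that never saw it in training.
    """
    folds = make_folds(data, k)
    pairs = []
    for held_out in range(k):
        train_set = [u for i, fold in enumerate(folds)
                     if i != held_out for u in fold]
        transcribe = train(train_set)  # hypothetical: maps audio id -> text
        for audio_id, reference in folds[held_out]:
            pairs.append((transcribe(audio_id), reference))
    return pairs


def toy_train(train_set: List[Utterance]) -> Callable[[str], str]:
    """Toy stand-in for ASR training: memorise the training transcripts."""
    memory = dict(train_set)
    return lambda audio_id: memory.get(audio_id, "<unk>")


data = [(f"utt{i}", f"text {i}") for i in range(10)]
corpus = build_parallel_corpus(data, k=5, train=toy_train)
```

In the toy run every source side is `<unk>`, which confirms that no utterance is transcribed by a model that was trained on it; this is what makes the resulting errors representative of test-time behaviour.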
However, training the Transformer on the resulting K training sentences did not considerably improve upon Jasper greedy because of extreme overfitting. To augment our training dataset, we used two techniques:
- We took a pre-trained Jasper model and enabled dropout during inference on the training data. This procedure was repeated multiple times with different random seeds.
- We perturbed the training data with Cutout [9] by randomly cutting small rectangles out of the spectrogram, which essentially drops complete phonemes or words and mel frequency channels.
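The second technique can be illustrated with a small sketch that zeroes random rectangles in a spectrogram stored as a list of mel channels. The number and maximum size of the rectangles, the fill value of zero, and the function name are assumptions of this example rather than the settings used in the experiments.

```python
import random
from typing import List

Spectrogram = List[List[float]]  # [mel channel][time frame]


def cutout(spec: Spectrogram, num_rects: int, max_freq: int, max_time: int,
           rng: random.Random) -> Spectrogram:
    """Return a copy of spec with num_rects random rectangles set to zero."""
    out = [row[:] for row in spec]
    n_freq, n_time = len(out), len(out[0])
    for _ in range(num_rects):
        # Sample the rectangle size, then a position that keeps it in bounds.
        h = rng.randint(1, min(max_freq, n_freq))
        w = rng.randint(1, min(max_time, n_time))
        f0 = rng.randint(0, n_freq - h)
        t0 = rng.randint(0, n_time - w)
        for f in range(f0, f0 + h):
            for t in range(t0, t0 + w):
                out[f][t] = 0.0
    return out


rng = random.Random(0)
spec = [[1.0] * 20 for _ in range(8)]
augmented = cutout(spec, num_rects=3, max_freq=4, max_time=5, rng=rng)
masked = sum(v == 0.0 for row in augmented for v in row)
```

Rectangles that span many frames remove whole phonemes or words, while rectangles that span many channels remove frequency bands, so one operation yields both kinds of perturbation.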
After the augmentation, deduplication, and removal of sentence pairs with WER greater than , we ended up with approximately M of training examples.
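The clean-up step can be sketched as a word-level edit distance followed by a filter. The threshold `max_wer` is a placeholder parameter (the value used in the experiments is not reproduced here), and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # reference word missing
                            curr[j - 1] + 1,      # extra hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)


def clean_pairs(pairs: Iterable[Tuple[str, str]],
                max_wer: float) -> List[Tuple[str, str]]:
    """Drop exact duplicate pairs and pairs whose WER exceeds max_wer."""
    seen = set()
    kept = []
    for hyp, ref in pairs:
        if (hyp, ref) in seen or word_error_rate(hyp, ref) > max_wer:
            continue
        seen.add((hyp, ref))
        kept.append((hyp, ref))
    return kept


pairs = [
    ("the cat sat", "the cat sat"),
    ("the cat sat", "the cat sat"),  # exact duplicate
    ("the bat sat", "the cat sat"),  # one substitution out of three words
    ("a b c", "the cat sat"),        # every word wrong
]
kept = clean_pairs(pairs, max_wer=0.5)
```

Removing very high-WER pairs keeps the correction model from learning to rewrite a hypothesis that carries almost no information about the reference.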
The ablation study of the proposed data augmentation techniques is presented in Table 1. In our experiments, adding sentences generated with both dropout and Cutout enabled was much more effective; thus, we use the resulting dataset for training in all subsequent experiments. We also experimented with using top-k beams obtained with beam search but found that the resulting sentences lacked diversity, often differing in only a few characters.
All models used in our experiments are Transformers with parameters (). We train them with the NovoGrad [13] optimizer () for a maximum of K steps with polynomial learning rate decay on batches of K source and target tokens. For our vocabulary we adopted the K WordPieces [24] used in BERT [8] so that we can straightforwardly transfer its pre-trained weights. Following [23], we also used label smoothing of for regularization. Each model was trained on a single DGX-1 machine with NVIDIA V100 GPUs. Models were implemented in PyTorch within the NeMo toolkit (https://github.com/nvidia/nemo).
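For reference, polynomial learning rate decay can be written as a short function. The base learning rate, step budget, decay power, and final learning rate below are placeholders chosen for the example; the values used in the experiments are not reproduced here.

```python
def polynomial_decay(step: int, max_steps: int, base_lr: float,
                     min_lr: float = 0.0, power: float = 2.0) -> float:
    """Learning rate after `step` updates under polynomial decay."""
    # Progress is clamped so the rate stays at min_lr after max_steps.
    progress = min(step, max_steps) / max_steps
    return (base_lr - min_lr) * (1.0 - progress) ** power + min_lr


schedule = [polynomial_decay(s, max_steps=100, base_lr=0.01)
            for s in (0, 50, 100)]
```

With `power=1.0` the schedule reduces to linear decay; larger powers lower the learning rate faster early in training.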
Next, we experiment with various architectures and initialization schemes. Specifically, we either initialize all weights randomly (rand) from or transfer weights from the pre-trained bert-base-uncased model (BERT). Table 2 depicts the performance of different configurations.
[Table 2: performance of different configurations; surviving row labels: Jasper + 6-gram, Jasper + TXL]
Models with a randomly initialized encoder improve upon the results of Jasper greedy on the “other” portions of the evaluation datasets; however, their correction harms the performance on the “clean” portions. Adding a BERT-initialized decoder achieves slightly better results, but it still lags behind the baseline Jasper with LM re-scoring.
| Model | Example 1 | Example 2 | Example 3 |
|---|---|---|---|
| Ground truth | pierre looked at him in surprise | i’ve gained fifteen pounds and | and how about little avice caro |
| Greedy | pure locat e ham in a surprise | afgain fifteen pounds and | and hawbout little ov his carrow |
| + 6-gram | pure locate him in surprise | again fifteen pounds and | and about little of his care |
| + TXL | pure locate him in surprise | again fifteen pounds and | and but little of his care |
| Ours | pierre looked at him in surprise | i’ve gained fifteen pounds and | and how about little of his care |

| Model | Transcript |
|---|---|
| Ground truth | one day the traitor fled with a teapot and a basketful of cold victuals |
| Greedy | one day the trade of fled with he teappot and a basketful of cold victures |
| + 6-gram | one day the trade of fled with the tea pot and a basketful of cold pictures |
| + TXL | one day the trade of fled with the tea pot and a basketful of cold victores |
| Ours | one day the trader fled with a teapot and a basketful of cold victuals |
Models with a BERT-initialized encoder are strictly better than both Jasper greedy and models with a randomly initialized encoder. They outperform Jasper with 6-gram KenLM on the “other” portions of the evaluation datasets and approach the accuracy of Jasper with Transformer-XL re-scoring. The best performance is achieved by the model with both encoder and decoder initialized with pre-trained BERT.
Interestingly, our ASR correction model considerably improves performance on the much noisier “other” evaluation datasets and only moderately improves the results on “clean”. This can be explained by a slight domain mismatch between our training data and the “clean” evaluation datasets. Our data was collected with models that achieve around WER on average (or even higher, if dropout is on during generation) and does not contain many “clean” examples, which usually have much lower WER. Figure 2 shows that the distribution of WER in the training data is indeed much closer to dev-other than to dev-clean.
3.4 Analysis of corrected examples
To conduct a qualitative analysis of ASR corrections produced by our best model with BERT-initialized encoder and decoder, we examined the examples on which it successfully corrects the output of Jasper greedy.
Table 3, which depicts the excerpts with the largest difference in WER between greedy and corrected ASR outputs, reveals an interesting pattern. Both greedy decoding and re-scoring with external LMs make a mistake at the very beginning of the utterance. If the beginning is poorly decoded by the acoustic model and has little or no context for the LM, the scores used for evaluating beams in Equation 1 are simply unreliable. The mistakenly generated context might also negatively affect the succeeding left-to-right LM scores, leading to even more errors (Table 4). Our model, on the other hand, successfully corrects the ASR output by leveraging the bidirectional context provided by the corrupted yet complete decoded excerpt.
In this work, we investigated the use of a Transformer-based encoder-decoder architecture for the correction of ASR system output. The proposed ASR output correction technique is capable of “translating” the erroneous output of the acoustic model into grammatically and semantically correct text.
The proposed approach enhances the acoustic model accuracy by a margin comparable to shallow fusion and re-scoring with external language models. Analysis of corrected examples demonstrated that our model works in scenarios where the scores produced by both acoustic and external language models are not reliable.
To overcome the problem of extreme overfitting on the relatively small training dataset, we proposed several data augmentation and model initialization techniques, namely enabling dropout and Cutout [9] during acoustic model inference and initializing both encoder and decoder with the parameters of pre-trained BERT [8]. We also performed a series of ablation studies showing that both data augmentation and model initialization have a significant impact on model performance.
References
[1] (2019) Adaptive input representations for neural language modeling. ICLR.
[2] (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[3] (2016) End-to-end attention-based large vocabulary speech recognition. In ICASSP.
[4] (1991) Global optimization of a neural network-hidden Markov model hybrid. In IJCNN-91-Seattle, Vol. 2, pp. 789–794.
[5] (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In ICASSP.
[6] (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP.
[7] (2019) Transformer-XL: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
[8] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. NAACL.
[9] (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
[10] (2018) Understanding back-translation at scale. EMNLP.
[11] (2018) Automatic speech recognition errors detection and correction: a review. Procedia Computer Science.
[12] (2019) Jasper: an end-to-end convolutional neural acoustic model. Interspeech.
[13] (2019) Stochastic gradient methods with layer-wise adaptive moments for training of deep networks. arXiv preprint arXiv:1905.11286.
[14] (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML.
[15] (2019) A spelling correction model for end-to-end speech recognition. In ICASSP, pp. 5651–5655.
[16] (2011) KenLM: faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pp. 187–197.
[17] (2012) Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine 29.
[18] (2017) Densely connected convolutional networks. pp. 4700–4708.
[19] (2019) NeMo: a toolkit for building AI applications using neural modules. arXiv preprint arXiv:1909.09577.
[20] (2015) LibriSpeech: an ASR corpus based on public domain audio books. In ICASSP, pp. 5206–5210.
[21] (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.
[22] (2014) Sequence to sequence learning with neural networks. NIPS.
[23] (2017) Attention is all you need. In NIPS.
[24] (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
[25] (2019) Bridging the gap between training and inference for neural machine translation. ACL.
[26] (2017) Very deep convolutional networks for end-to-end speech recognition. In ICASSP, pp. 4845–4849.