1 Introduction and related work
End-to-end speech recognition models are simpler to implement and to train than bootstrapped systems, though one might imagine that every additional training step in systems requiring forced alignment leaves some performance aside. In practice, the best results on common benchmarks are still dominated by classical ASR models. We set out to study end-to-end systems on LibriSpeech and, without any algorithmic contribution, see whether they can be made to perform as well as more complex training pipelines. Two factors make this comparison challenging: acoustic models are harder to optimize properly with CTC or Seq2Seq criteria than with, e.g., cross-entropy, and regularization techniques are more readily available for classic pipelines. Our best acoustic models nonetheless reach 5.18% WER on test-other, showing that end-to-end models can compete with traditional pipelines.
As in other domains, self- and semi-supervised learning in ASR, where a pretrained network generates and trains on its own labels, yields improvements [32]. In end-to-end ASR, pseudo-labeling and self-training can be quite effective, and their effectiveness improves further when more data is available [19]. In this setting, we train a model on LibriSpeech, then use that model in conjunction with a language model to generate pseudo-labels. We show that with this training scheme we achieve 2.30% and 5.29% WER on test-clean and test-other, respectively, without an external language model or decoding procedure, and reach 2.03% and 4.11% WER with language models (decoding and beam rescoring).
Deep neural networks were reintroduced in ASR with HMMs [18], and many state-of-the-art models still rely on forced alignment today [12, 22, 20]. Nonetheless, there are increasingly many competitive end-to-end results, from models trained with CTC [10, 1], ASG [7], LF-MMI [11], sequence-to-sequence (Seq2Seq) [3, 4], or transducer [27, 16] criteria, and with differentiable decoding [5]. Listen, Attend and Spell [3] is a family of end-to-end models based on biLSTMs, which achieved state-of-the-art results when properly regularized through data augmentation [25]; consequently, we use SpecAugment in all of our experiments. Seq2Seq models are not limited to RNNs: for instance, Time-Depth Separable convolutions exhibited strong results recently [14]. Our best models are transformer-based, as in [22, 20], which gave good results in Seq2Seq even without external language models [23].
2 Acoustic Models
In this section, we present our three families of acoustic models (AMs). All AMs are token-based, outputting 10k word pieces [30], although this is just a choice for simplicity in this comparative study, not the result of a limitation. Similarly, all AMs take 80-channel log-mel filterbanks as input, with STFTs computed on 25 ms Hamming windows strided by 10 ms, except for TDS models, which use a 30 ms window.
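As an illustration of the input features described above, here is a minimal NumPy sketch of 80-channel log-mel filterbanks with a 25 ms Hamming window and a 10 ms stride. The 16 kHz sample rate matches LibriSpeech; the FFT size of 512, the HTK-style mel formula, the triangular filters and the `1e-10` floor are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters with centers equally spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_features(wave, sample_rate=16000, win_ms=25, hop_ms=10,
                     n_mels=80, n_fft=512):
    """Return an array of shape (n_frames, n_mels)."""
    win = int(sample_rate * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)   # 160 samples at 16 kHz
    n_frames = 1 + (len(wave) - win) // hop
    window = np.hamming(win)
    frames = np.stack([wave[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel + 1e-10)  # small floor avoids log(0)

# One second of synthetic audio stands in for a real utterance.
wave = np.random.default_rng(0).standard_normal(16000)
feats = log_mel_features(wave)
```

Passing `win_ms=30` gives the window length used for the TDS models.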
2.1 ResNet Acoustic Models
ResNets were first introduced in the context of computer vision [15], and then successfully applied to speech recognition [34, 29, 21]. ResNets are composed of several blocks of convolutions (in our case only 1D convolutions), with skip connections to reduce the issue of vanishing gradients in deep neural networks. In total, there are 42 convolutional layers in our architecture, each with a kernel size of 3. More precisely, our ResNets are fed 80 log-mel filterbanks, which are first mapped to a 1024-dimensional embedding space by one convolutional layer (with stride 2). Then 12 blocks of three 1D convolutions follow. Each convolutional layer is followed by ReLU, dropout and layer norm. The number of hidden units increases with the depth of the network, with 8 blocks of 1024 units (dropout 0.15), followed by 2 blocks of 1536 units (dropout 0.20) and 2 blocks of 2048 units (dropout 0.25). Specific convolution layers are inserted between ResNet blocks when the hidden representation size is increased, and 3 fully connected layers (2048 units) finalize the network. Our architecture performs heavy pooling (16 frames in total, 160 ms): in addition to the first strided convolutional layer, 3 pooling layers (stride 2) are distributed across the depth of the network (after blocks 3, 7 and 10).
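The pooling arithmetic above can be checked with a few lines of Python. This is only a sketch of the computation: the helper name `total_stride` is hypothetical, and the 10 ms frame shift comes from the feature extraction described in Section 2.

```python
def total_stride(strides, frame_shift_ms=10):
    """Multiply per-layer strides; return (frames, milliseconds)."""
    total = 1
    for s in strides:
        total *= s
    return total, total * frame_shift_ms

# One strided convolution (stride 2) plus three pooling layers (stride 2 each).
resnet_frames, resnet_ms = total_stride([2, 2, 2, 2])
print(resnet_frames, resnet_ms)
```

Each output frame of the ResNet therefore summarizes 16 input frames, that is 160 ms of audio.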
2.2 Time-Depth Separable Convolution Acoustic Models
On the encoder side, we extend the TDS block designed in [14] by increasing the number of channels in the feature maps emitted between the two linear layers, so as to increase the model capacity. The increase factor is set to 3. Given that we use the same word pieces as [14], we keep the same three sub-sampling layers, each with stride 2, for an optimal context size. The number of TDS blocks after each sub-sampling layer is increased to (5, 6, 10), with (10, 14, 18) channels. Kernel sizes are 21×1 in all the convolutions. The total number of parameters in the encoder thus reaches 200M. On the decoder side, we apply either CTC or self-attention with an RNN. For self-attention, we use $R$ groups of $L$-layer GRUs with 512 hidden units, together with the same efficient key-value attention [14, 31]:

$$S_t^r = V \, \mathrm{softmax}\left(\frac{1}{\sqrt{d}} K^\top Q_t^{r-1}\right), \qquad (1)$$

where $[K, V]$ is the activation from the encoder with 512 dimensions in total, $d$ is the key dimension, and

$$Q_t^r = g\left(Q_{t-1}^{r}, Q_t^{r-1}\right) + S_t^r$$

is the query vector at time $t$ in round $r$, generated by the GRU $g(\cdot)$. The initial $Q_t^0$ is a 512-dimension token embedding, and the final $Q_t^R$ is linearly projected to output classes for token classification. In our experiments, $R$ and $L$ are both set to 3. For CTC-trained models, the output of the encoder is directly linearly projected to output classes. We use dropout in all TDS blocks and GRUs to prevent overfitting.
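To make the key-value attention concrete, the following NumPy sketch computes one attention summary for a single query. It assumes standard scaled dot-product attention in the style of [31]; splitting the 512-dimensional encoder output into two 256-dimensional halves for keys and values, and using a 256-dimensional query, are assumptions chosen so the shapes line up in this toy example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def key_value_attention(keys, values, query):
    """keys, values: (T, d); query: (d,). Returns (summary, weights)."""
    d = keys.shape[1]
    weights = softmax(keys @ query / np.sqrt(d))  # one weight per encoder frame
    return weights @ values, weights

rng = np.random.default_rng(0)
encoder_out = rng.standard_normal((50, 512))     # 50 frames, 512 dims in total
keys, values = np.split(encoder_out, 2, axis=1)  # hypothetical 256/256 split
query = rng.standard_normal(256)                 # stands in for a GRU state
summary, weights = key_value_attention(keys, values, query)
```

In the decoder described above, such a summary would be recomputed at every output step and every round, with the query produced by the GRU.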
2.3 Transformer-based Acoustic Models
Our transformer-based acoustic models have a small front-end: 3 layers of 1D convolutions of kernel width 3, each followed by a GLU activation [8] and max-pooling over 2 frames. Thus, the output of this front-end strides by 8 frames (80 ms). The rest of the AM is a stack of 24 or 36 Transformer layers (or blocks) with 4 attention heads, each containing a feedforward network (FFN) with one hidden layer and the ReLU non-linearity. More precisely, given a sequence of $T$ vectors of dimension $d$, the input is represented by the matrix $H^0 \in \mathbb{R}^{d \times T}$; following exactly [31], we have:

$$Z^i = \mathrm{Norm}\left(\mathrm{SelfAttention}(H^{i-1}) + H^{i-1}\right), \qquad H^i = \mathrm{Norm}\left(\mathrm{FFN}(Z^i) + Z^i\right),$$

where $Z$ is the output of the self-attention layer, with a skip connection, and $H$ is the output of the FFN layer, with a skip connection. As is standard, our norm is layer norm, and self-attention is defined as in Eq. 1, but with $K = W_K H$, $Q = W_Q H$, and $V = W_V H$. For CTC-trained models, the output of the encoder $H^{L_e}$ (with $L_e$ the number of encoder layers) is followed by a linear layer to the output classes. For Seq2Seq models, we have an additional decoder, which is a stack of 6 Transformers with encoding dimension 256 and 4 attention heads. The probability distribution of the transcription is factorized as:

$$p(y_1, \ldots, y_n) = \prod_{i=1}^{n} p\left(y_i \mid y_0, \ldots, y_{i-1}, H^{L_e}\right),$$

where $y_0$ is a special symbol indicating the beginning of the transcription. For all layers (encoder and decoder, when present), we use dropout on the self-attention, and we also use layer drop [9], dropping entire layers at the FFN level.
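Layer drop at the FFN level can be sketched as follows. This is one plausible reading of the sentence above, not the authors' implementation: during training, the FFN sub-layer of a block is skipped with some probability, so the block reduces to its self-attention sub-layer. The toy sub-layers and the function name `block_forward` are hypothetical.

```python
import numpy as np

def block_forward(h, attn_sublayer, ffn_sublayer, p_layer_drop, training, rng):
    """Apply one Transformer block, randomly skipping the FFN sub-layer."""
    z = attn_sublayer(h)  # self-attention + skip connection + norm
    if training and rng.random() < p_layer_drop:
        return z          # FFN sub-layer dropped for this forward pass
    return ffn_sublayer(z)  # FFN + skip connection + norm

# Toy stand-ins for the two sub-layers, just to make the control flow visible.
attn = lambda h: h + 1.0
ffn = lambda z: z * 2.0
rng = np.random.default_rng(0)
h = np.zeros(4)

always_dropped = block_forward(h, attn, ffn, 1.0, True, rng)
never_dropped = block_forward(h, attn, ffn, 0.0, True, rng)
at_inference = block_forward(h, attn, ffn, 1.0, False, rng)
```

At inference time all sub-layers are applied, which is why `at_inference` matches `never_dropped`.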
3.1 Technical Details
We work with the standard split for LibriSpeech, and the standard text data for LM training. We use the wav2letter++ toolkit [28] (https://github.com/facebookresearch/wav2letter) to train our models, and both the recipes and the pre-trained models will be released in it.
All hyperparameters, including model architecture, are cross-validated on dev-clean and dev-other. We use plain SGD with momentum to train ResNet and TDS models, and Adagrad to train Transformers. For Transformers, we linearly warm up the learning rate over 32k to 64k updates. We start with a learning rate of 0.03, and halve it every 40 epochs after the first 150. For ResNets, the learning rate follows a cosine schedule, starting at 0.4 and ending at 0 after 500 epochs. For TDS models, the learning rate is initialized to 0.06 and 0.3 for Seq2Seq and CTC respectively, and is halved every 150 epochs. The momentum values for TDS and ResNet models are set to 0.5 and 0.6, respectively. We set the batch size per GPU to 4 for ResNet and TDS models, and to 8 for Transformers. For Seq2Seq, we do 3 epochs of attention window pretraining, and use 99% teacher forcing (1% output sampling). We also use 20% dropout in the decoder for TDS (10% dropout and 10% layer drop in the decoder for Transformers), together with 5% label smoothing, 1% random sampling and 1% word piece sampling. TDS models are trained on 64 GPUs for 600 epochs in 7 days, while ResNets and Transformers are trained on average for 3 days on 32 GPUs, or 64 GPUs for our biggest models (Transformers). We also use SpecAugment [25], specifically with the LD policy, in all types of models.
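The three learning-rate schedules can be written down as small Python functions. This is a sketch under stated assumptions: the warm-up is taken to scale the base rate linearly from 0, the first halving for Transformers is assumed to happen 40 epochs after epoch 150, and the function names are hypothetical.

```python
import math

def transformer_lr(update, epoch, base_lr=0.03, warmup_updates=32000,
                   hold_epochs=150, halve_every=40):
    """Linear warm-up over updates, then halving every 40 epochs after epoch 150."""
    warmup = min(1.0, update / warmup_updates)
    n_halvings = max(0, (epoch - hold_epochs) // halve_every)
    return base_lr * warmup * 0.5 ** n_halvings

def resnet_lr(epoch, base_lr=0.4, total_epochs=500):
    """Cosine schedule from base_lr down to 0 after total_epochs."""
    progress = min(epoch, total_epochs) / total_epochs
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def tds_lr(epoch, base_lr, halve_every=150):
    """Step schedule: halve every 150 epochs (base_lr 0.06 for Seq2Seq, 0.3 for CTC)."""
    return base_lr * 0.5 ** (epoch // halve_every)

for epoch in (0, 150, 300, 450):
    print(epoch, transformer_lr(64000, epoch), resnet_lr(epoch), tds_lr(epoch, 0.06))
```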
3.2 Decoding and Language Modeling
Decoding consists in combining the output of the acoustic model with a language model (LM). We perform single-pass beam search decoding with an LM, and then use another LM for beam rescoring. We implemented lexicon-based and lexicon-free beam search with either n-grams or convolutional language models (GCNN [8]), and rescore with either GCNN or Transformer language models [2]. Lexicon-based decoding is used with CTC models (see Table 2), either with a 4-gram word-level LM or a word-level GCNN. Lexicon-free decoding is used with Seq2Seq (S2S) models, either with a 6-gram word-piece (WP) LM or a word-piece-based GCNN. The decoder takes as input the emissions of the acoustic model, a trie over the LM vocabulary, and an LM. We tune the language model weight $\alpha$ and the word insertion penalty $\beta$ on the validation sets (dev-clean and dev-other). The final score we want to maximize is

$$\log P_{AM}(\hat{y} \mid x) + \alpha \log P_{LM}(\hat{y}) + \beta |\hat{y}|,$$

where $x$ is the input audio, $\hat{y}$ is a candidate transcription and $|\hat{y}|$ is its length.
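The role of the two tuned parameters can be illustrated with a short Python sketch. It assumes the usual log-linear combination of the acoustic score, a weighted LM score and a per-word insertion term; the hypotheses and their scores are made-up numbers, and `beam_score` is a hypothetical helper, not a wav2letter++ function.

```python
def beam_score(am_logprob, lm_logprob, n_words, lm_weight, word_penalty):
    """Log-linear combination of AM score, weighted LM score and word insertion term."""
    return am_logprob + lm_weight * lm_logprob + word_penalty * n_words

# (transcription, AM log-probability, LM log-probability); illustrative values only.
hypotheses = [
    ("a b", -10.0, -4.0),
    ("a b c", -9.5, -7.0),
]

def best_hypothesis(hyps, lm_weight, word_penalty):
    return max(
        hyps,
        key=lambda h: beam_score(h[1], h[2], len(h[0].split()), lm_weight, word_penalty),
    )[0]

with_lm = best_hypothesis(hypotheses, lm_weight=0.5, word_penalty=0.2)
without_lm = best_hypothesis(hypotheses, lm_weight=0.0, word_penalty=0.2)
print(with_lm, "|", without_lm)
```

With the LM weight set to 0 the acoustically better but less fluent hypothesis wins, which is why both parameters are tuned on the validation sets.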
To stabilize the Seq2Seq beam search, we use the hard attention limit and the end-of-sentence threshold from [14]. To improve the decoding efficiency, we incorporate a thresholding technique and strategies from prior work, including 1) hypothesis merging, 2) score caching and 3) batching the LM forwarding. Further, in CTC decoding, only the blank token is proposed if its posterior probability is larger than 0.95.
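The blank-threshold rule for CTC decoding can be sketched as a candidate-proposal function. The blank index of 0 and the optional `top_k` cut-off are assumptions for this example; only the 0.95 threshold comes from the text.

```python
import numpy as np

def candidate_tokens(frame_posteriors, blank_idx=0, blank_threshold=0.95, top_k=None):
    """Tokens the beam search expands at one frame.

    If the blank posterior exceeds the threshold, only blank is proposed,
    which skips the (expensive) expansion over the rest of the vocabulary.
    """
    if frame_posteriors[blank_idx] > blank_threshold:
        return [blank_idx]
    order = np.argsort(-frame_posteriors)  # most probable tokens first
    if top_k is not None:
        order = order[:top_k]
    return [int(i) for i in order]

confident_blank = candidate_tokens(np.array([0.97, 0.02, 0.01]))
ambiguous_frame = candidate_tokens(np.array([0.50, 0.30, 0.20]), top_k=2)
```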
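We also use both the word-level GCNN and the Transformer LM to rescore the set of final hypotheses in the beam. The N-best candidates are selected by the first-pass decoding with n-grams or GCNN. They are then rescored with

$$\log P_{AM}(\hat{y} \mid x) + \alpha_1 \log P_{LM_1}(\hat{y}) + \alpha_2 \log P_{LM_2}(\hat{y}) + \beta |\hat{y}|,$$

where $P_{LM_1}$ and $P_{LM_2}$ represent the GCNN LM and the Transformer LM respectively, $\alpha_1$, $\alpha_2$ and $\beta$ are tuned weights, and $|\hat{y}|$ is the transcription length in characters.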
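N-best rescoring with two LMs can be sketched as below. The combination is assumed to be a weighted sum of the acoustic score, the two LM log-probabilities and a length term in characters; counting spaces as characters, the weights, and all scores are illustrative assumptions.

```python
def rescore(nbest, w_gcnn, w_transformer, w_length):
    """Re-rank an N-best list; each entry is (text, am, gcnn_lm, transformer_lm) log-scores."""
    def score(entry):
        text, am, gcnn, transformer = entry
        return am + w_gcnn * gcnn + w_transformer * transformer + w_length * len(text)
    return sorted(nbest, key=score, reverse=True)

# Made-up first-pass candidates with made-up scores.
nbest = [
    ("hello word", -4.8, -15.0, -14.0),
    ("hello world", -5.0, -12.0, -10.0),
]
reranked = rescore(nbest, w_gcnn=0.3, w_transformer=0.4, w_length=0.1)
print(reranked[0][0])
```

Here the candidate with the slightly worse acoustic score moves to the top because both LMs prefer it.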
All the LMs are trained on the standard LibriSpeech LM corpus, using the KenLM toolkit [17] for n-gram LMs and fairseq [24] for the GCNN and Transformer LMs. The model architectures of the two neural LMs are shared across words and word pieces. We use the GCNN-14B from [8] as our ConvLM, while the Transformer LM is the same as the one trained on Google Billion Words in [2]. Word-level perplexities of all the LM variants can be found in Table 1.
Table 2: WER (%) on the LibriSpeech dev and test sets.

| AM | AM units | LM | LM units | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|---|---|---|
| Fully convolutional | letter | GCNN | word | 3.1 | 9.9 | 3.3 | 10.5 |
| Conv. Transformers | 5k WP | - | - | 4.8 | 12.7 | 4.7 | 12.9 |
| TDS Convs. | 10k WP | GCNN | - | 5.0 | 14.5 | 5.4 | 15.6 |
| Decoding | 10k WP | GCNN | 10k WP | 3.0 | 8.9 | 3.3 | 9.8 |
| LAS | 16k WP | - | - | | | 2.8 | 6.8 |
| Decoding | 16k WP | RNN | 16k WP | | | 2.5 | 5.8 |
| biLSTM + attn. | 10k BPE | - | - | 4.3 | 12.9 | 4.4 | 13.5 |
| + Transformer decoding | 10k BPE | Transformer | 10k BPE | 2.6 | 8.4 | 2.8 | 9.3 |
| HMM/biLSTM | 12k CDp | 4gram+LSTM | word | 2.2 | 5.1 | 2.6 | 5.5 |
| + Transformer rescoring | 12k CDp | + Transformer | word | 1.9 | 4.5 | 2.3 | 5.0 |
| Conv. Transformers | 6k triphones | 3gram, rescored + TDNN + LSTM | word | 1.8 | 5.8 | 2.2 | 5.7 |
| Conv. Transformers | chenones | 4gram | word | | | 2.60 | 5.59 |
| + Transformer rescoring | chenones | Transformer | word | | | 2.26 | 4.85 |
| ResNet (306M) | 10k WP | - | - | 3.96 | 10.12 | 4.00 | 10.02 |
| ResNet LibriVox | 10k WP | - | - | 3.11 | 7.72 | 3.23 | 8.19 |
| TDS (200M) | 10k WP | - | - | 4.34 | 11.04 | 4.52 | 11.16 |
| TDS LibriVox | 10k WP | - | - | 3.14 | 7.86 | 3.30 | 8.46 |
| Conv. Transformers (322M) | 10k WP | - | - | 2.98 | 7.36 | 3.18 | 7.49 |
| + Rescoring | 10k WP | GCNN + Transf. | word | 2.20 | 5.05 | 2.51 | 5.56 |
| + Rescoring | 10k WP | GCNN + Transf. | word | 2.18 | 4.97 | 2.49 | 5.53 |
| Conv. Transformers LibriVox | 10k WP | - | - | 2.66 | 6.44 | 3.03 | 6.83 |
| + Rescoring | 10k WP | GCNN + Transf. | word | 2.21 | 4.82 | 2.61 | 5.33 |
| + Rescoring | 10k WP | GCNN + Transf. | word | 2.16 | 4.67 | 2.63 | 5.27 |
| TDS (190M) | 10k WP | - | - | 3.15 | 8.19 | 3.56 | 8.40 |
| Decoding | 10k WP | 6gram | 10k WP | 2.72 | 6.97 | 3.18 | 7.08 |
| Decoding | 10k WP | GCNN | 10k WP | 2.50 | 6.21 | 2.91 | 6.31 |
| Conv. Transformers (266M) | 10k WP | - | - | 2.56 | 6.65 | 3.05 | 7.01 |
| Decoding | 10k WP | 6gram | 10k WP | 2.28 | 5.88 | 2.58 | 6.15 |
| + Rescoring | 10k WP | GCNN + Transf. | word | 2.48 | 5.11 | 2.50 | 5.51 |
| Decoding | 10k WP | GCNN | 10k WP | 2.11 | 5.25 | 2.30 | 5.64 |
| + Rescoring | 10k WP | GCNN + Transf. | word | 2.17 | 4.67 | 2.31 | 5.18 |
| Conv. Transformers LibriVox | 10k WP | - | - | 1.96 | 4.83 | 2.30 | 5.29 |
| Decoding | 10k WP | 6gram | 10k WP | 1.90 | 4.35 | 2.30 | 4.84 |
| + Rescoring | 10k WP | GCNN + Transf. | word | 1.73 | 3.59 | 2.01 | 4.22 |
| Decoding | 10k WP | GCNN | 10k WP | 1.79 | 3.86 | 2.18 | 4.42 |
| + Rescoring | 10k WP | GCNN + Transf. | word | 1.70 | 3.52 | 2.03 | 4.11 |
3.3 LibriSpeech Results
All our results for LibriSpeech are listed in Table 2. We present results under three scenarios: 1) without any decoding or external LM, 2) with one-pass decoding only, and 3) with decoding followed by beam rescoring. The decoding beam sizes for the Transformer models are usually 500 and 50 for Seq2Seq and CTC respectively, while they are 1000 and 100 for TDS and ResNet models. We establish a strong baseline with the simple ResNet models and improve the TDS models significantly over past results [14]. These convolution-based models, even when trained with CTC, perform as well as or better than biLSTM-based ones [22]. Our best acoustic models are Transformer-based, and reach 7.01% WER on test-other without any decoding, and 5.18% WER with decoding and rescoring. Their accuracy is further boosted by rescoring with Transformer and GCNN LMs. This shows that end-to-end training can perform as well as or better than traditional pipelines.
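For readers less familiar with the metric reported throughout, the following sketch computes word error rate as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. It is a generic textbook implementation, not the scoring tool used for the results above, and it assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein distance over words, normalized by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```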
3.4 LibriVox Results
LibriVox (https://librivox.org) is a large collection of freely available audiobooks. We select 65K hours of read speech from English book listings, then filter and preprocess it to remove readings of duplicate text and corrupted audio. We then run voice activity detection (VAD) on the resulting collection of audio with a CTC model trained on LibriSpeech, and segment the result into chunks no longer than 36 seconds; the resulting audio corpus contains 55.8K hours of read speech.
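The segmentation step can be pictured with a small greedy chunker. This is a hypothetical sketch: it assumes VAD has already produced sorted, non-overlapping speech segments as (start, end) pairs in seconds, merges neighbours while the chunk stays within 36 seconds, and cuts any single segment longer than that. The paper does not describe its exact procedure.

```python
def chunk_segments(segments, max_len=36.0):
    """Merge sorted (start, end) speech segments into chunks of at most max_len seconds."""
    chunks = []
    for start, end in segments:
        # Cut segments that are too long on their own.
        while end - start > max_len:
            chunks.append((start, start + max_len))
            start += max_len
        if chunks and end - chunks[-1][0] <= max_len:
            chunks[-1] = (chunks[-1][0], end)  # extend the current chunk
        else:
            chunks.append((start, end))        # start a new chunk
    return chunks

vad_segments = [(0, 10), (12, 20), (21, 40), (45, 50), (60, 140)]
chunks = chunk_segments(vad_segments)
print(chunks)
```

Note that a merged chunk in this sketch also spans the short silences between the segments it contains.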
We then generate pseudo-labels for this audio using the recipe described in [19]. To generate the pseudo-labels, we use a Transformer AM trained on LibriSpeech with CTC loss that achieves 6.38% WER on dev-other when decoded with a 4-gram word LM, the same model as listed in Table 2. We pseudo-label all audio using this AM and 4-gram decoding with a smaller beam size to increase decoding speed; these parameters still achieve results within 0.01% absolute WER of decoding with the optimal parameters on the same AM without LM rescoring.
Treating these labels as ground truth, we train acoustic models on a combination of the 960 hours of labeled audio from LibriSpeech and the pseudo-labeled audio from LibriVox, where batches are uniformly sampled (without weighting) from both the LibriSpeech and LibriVox datasets. Transformer AMs with both CTC and Seq2Seq losses were trained for 2-4 days on this combined dataset, achieving WERs of 4.83% on dev-other and 1.96% on dev-clean without decoding or use of a language model, which are to our knowledge the best reported results for this setting. Results with decoding and rescoring are shown in Table 2, and are state-of-the-art on LibriSpeech in the semi-supervised setting with unlabeled audio.
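One way to read "uniformly sampled (without weighting)" is that batches from both corpora are pooled and shuffled together, so each corpus contributes in proportion to its size. The sketch below implements that reading only; it is an assumption, and the batch identifiers and function name are hypothetical.

```python
import random

def mixed_batch_stream(labeled_batches, pseudo_batches, seed=0):
    """Pool batches from both corpora and shuffle them uniformly for one epoch."""
    pool = [("librispeech", b) for b in labeled_batches]
    pool += [("librivox", b) for b in pseudo_batches]
    random.Random(seed).shuffle(pool)  # no per-corpus re-weighting
    return pool

labeled = ["ls0", "ls1"]
pseudo = ["lv0", "lv1", "lv2"]
epoch = mixed_batch_stream(labeled, pseudo)
for source, batch in epoch:
    print(source, batch)
```

Under this reading the much larger pseudo-labeled set dominates each epoch, since no up-weighting of the labeled data is applied.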
We presented state-of-the-art results on LibriSpeech with end-to-end methods. While allowing for lexicon-free decoding, our 10k word pieces limit the amount of striding we can use in our models, and could be replaced by AMs that output words with an arbitrary lexicon [6]. We produced a simple pipeline that does not require many steps of training. In light of our semi-supervised result without decoding, we think Seq2Seq, transducers, and differentiable decoding are viable methods to get end-to-end state-of-the-art results, without external language models, through semi-supervised learning.
References
[1] (2016) Deep speech 2: end-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pp. 173–182.
[2] (2019) Adaptive input representations for neural language modeling. In International Conference on Learning Representations.
[3] (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In ICASSP.
[4] (2018) State-of-the-art speech recognition with sequence-to-sequence models. In ICASSP.
[5] (2019) A fully differentiable beam search decoder. In ICML, pp. 1341–1350.
[6] (2019) Word-level speech recognition with a dynamic lexicon. arXiv preprint arXiv:1906.04323.
[7] (2016) Wav2letter: an end-to-end convnet-based speech recognition system. arXiv preprint arXiv:1609.03193.
[8] (2017) Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pp. 933–941.
[9] (2019) Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556.
[10] (2014) Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML'14, pp. II–1764–II–1772.
[11] (2018) End-to-end speech recognition using lattice-free MMI. In Interspeech, pp. 12–16.
[12] (2017) The CAPIO 2017 conversational speech recognition system.
[13] (2019) State-of-the-art speech recognition using multi-stream self-attention with dilated 1D convolutions.
[14] (2019) Sequence-to-sequence speech recognition with time-depth separable convolutions. In Interspeech 2019.
[15] (2016) Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR).
[16] (2019) Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[17] (2011) KenLM: faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pp. 187–197.
[18] (2012) Deep neural networks for acoustic modeling in speech recognition. Signal Processing Magazine.
[19] (2019) Self-training for end-to-end speech recognition. arXiv preprint arXiv:1909.09116.
[20] (2019) A comparative study on Transformer vs RNN in speech applications.
[21] (2019) Jasper: an end-to-end convolutional neural acoustic model. In Interspeech.
[22] (2019) RWTH ASR systems for LibriSpeech: hybrid vs attention. In Interspeech 2019.
[23] (2019) Transformers with convolutional context for ASR.
[24] (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
[25] (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. In Interspeech 2019.
[26] (2018) Fully neural network based speech recognition on mobile and embedded devices. In Advances in Neural Information Processing Systems, pp. 10620–10630.
[27] (2017) A comparison of sequence-to-sequence models for speech recognition. In Interspeech, pp. 939–943.
[28] (2018) Wav2letter++: the fastest open-source speech recognition system. arXiv preprint arXiv:1812.07625.
[29] (2017) English conversational telephone speech recognition by humans and machines. In Interspeech, pp. 132–136.
[30] (2012) Japanese and Korean voice search. In International Conference on Acoustics, Speech and Signal Processing, pp. 5149–5152.
[31] (2017) Attention is all you need. In Adv. NIPS.
[32] (2017) Semi-supervised DNN training with word selection for ASR. In Interspeech, pp. 3687–3691.
[33] (2019) Transformer-based acoustic modeling for hybrid speech recognition.
[34] (2017) The Microsoft 2016 conversational speech recognition system. In ICASSP.
[35] (2018) Fully convolutional speech recognition. CoRR abs/1812.06864.