End-to-end automatic speech recognition (ASR) systems combine the functionality of acoustic, pronunciation, and language modelling components into a single neural network. Early approaches to end-to-end ASR employ CTC [8, 9]; however, these models require rescoring with an external language model (LM) to obtain good performance. The RNN encoder-decoder [6, 28] equipped with attention, originally proposed for machine translation, is an effective approach for end-to-end ASR [3, 5]. These systems see less of a performance drop in the no-LM setting.
More recently, the Transformer encoder-decoder architecture has been applied to ASR [7, 20, 15]. Transformer training is parallelizable across time, leading to faster training than recurrent models. This makes Transformers especially amenable to the large audio corpora encountered in speech recognition. Furthermore, Transformers are powerful autoregressive models [26, 23], and have achieved reasonable ASR results without incurring the storage and computational overhead associated with using LMs during inference.
Although current end-to-end technology has seen significant improvements in accuracy, the computational requirements, in terms of both time and space, for performing inference with such models remain prohibitive for edge devices. Thus, there has been increased interest in reducing model sizes to enable on-device computation. The model compression literature explores many techniques to tackle the problem, including quantization, pruning [19, 11], and knowledge distillation [16, 17]. One all-neural, end-to-end solution is based on RNN-T: its authors make several runtime optimizations to inference and perform post-training quantization, allowing the model to be successfully deployed to edge devices.
In this contribution, we turn our focus to refining the Transformer architecture so as to enable its use on edge devices. The absence of recurrent connections in Transformers provides a significant advantage in terms of speeding up computation, and therefore quantizing a Transformer-based ASR system would be an important step towards on-device ASR. We report findings on direct improvements to the model through removing components which do not significantly affect performance, and finally reduce the numerical precision of model weights and activations. Specifically, we reduced the dimensionality of the inner representations throughout the model, removed the convolutional layers employed prior to the decoder's layers (as in Fig. 1), and finally applied 8-bit quantization to the model's weights and activations. As verified in terms of recognition performance, our results on the Librispeech dataset support the claim that one can recover the original performance even after greatly reducing the model's computational requirements.
The remainder of this work is organized as follows: Section 2 gives an overview of Transformer-based ASR, and Section 3 describes the details of our quantization scheme. Section 4 describes our experiments with the Librispeech dataset. Section 5 is a discussion of our results. Connection to prior work is presented in Section 6. Finally, we draw conclusions and describe future directions in Section 7.
2 Transformer networks for ASR
Casting ASR as a sequence-to-sequence task, the Transformer encoder takes as input a sequence of frame-level acoustic features, and maps it to a sequence of high-level representations. The decoder generates a transcription one token at a time. Each choice of output token is conditioned on the hidden states and previously generated tokens through attention mechanisms. The typical choice for acoustic features are frame-level log-Mel filterbank coefficients. The target transcripts are represented by word-level tokens or sub-word units such as characters or units produced through byte pair encoding.
2.1 Transformer architecture
The encoder and decoder of the Transformer are stacks of Transformer layers. The layers of the encoder iteratively refine the representation of the input sequence with a combination of multi-head self-attention and frame-level affine transformations. Specifically, the inputs to each layer are projected into keys $K$, queries $Q$, and values $V$. Scaled dot-product attention is then used to compute a weighted sum of values for each query vector:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the dimension of the keys. We obtain multi-head attention by performing this computation $h$ times independently with different sets of projections, and concatenating:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V),$$

where the $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are learned linear transformations. The self-attention operation allows frames to gather context from all timesteps and build an informative sequence of high-level features. The outputs of multi-head attention go through a 2-layer position-wise feed-forward network with hidden size $d_{\mathrm{ff}}$.
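The attention computations above can be sketched as follows. This is a minimal NumPy illustration, not the toolkit's implementation; the toy dimensions (4 frames, model size 8, 2 heads) are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T_q, T_k) attention weights
    return softmax(scores, axis=-1) @ V  # weighted sum of values

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    # Project inputs, split into heads, attend per head, concatenate, project.
    d_model = X.shape[-1]
    d_h = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = [scaled_dot_product_attention(Q[:, i*d_h:(i+1)*d_h],
                                          K[:, i*d_h:(i+1)*d_h],
                                          V[:, i*d_h:(i+1)*d_h])
             for i in range(n_heads)]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy example: 4 frames, model dimension 8, 2 heads.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
W_q, W_k, W_v, W_o = (rng.standard_normal((8, 8)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads=2)
print(out.shape)  # (4, 8): one refined representation per input frame
```

In a self-attention layer all three projections come from the same sequence, as here; the decoder's cross-attention instead projects queries from decoder states and keys/values from encoder outputs.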
On the decoder side, each layer performs two rounds of multi-head attention: the first is self-attention over the representations of previously emitted tokens, and the second attends over the output of the final layer of the encoder (the queries are the previous decoder layer's outputs, while the keys and values are the encoder outputs). The output of the final decoder layer for a given token is used to predict the following token. Other components of the architecture, such as sinusoidal positional encodings, residual connections, and layer normalization, are described in the original Transformer paper.
2.2 Convolutional layers
Following prior work, we apply frequency-time 2-dimensional convolution blocks followed by max pooling to our audio features prior to feeding them into the encoder, as seen in Fig. 1. This yields significant savings in computation: the resulting input length is considerably reduced, and the computation required by the self-attention layers scales quadratically with the sequence length.
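To make the quadratic scaling concrete, the following back-of-the-envelope sketch compares the cost of forming the attention score matrix before and after temporal reduction. The frame rate, key dimension, and 4x reduction factor are illustrative assumptions, not measurements from our model.

```python
def attention_score_ops(seq_len, d_k):
    # Multiply-adds needed to form the T x T score matrix Q K^T.
    return seq_len * seq_len * d_k

# Hypothetical numbers: a 10 s utterance at one frame per 10 ms, d_k = 64.
T, d_k = 1000, 64
full = attention_score_ops(T, d_k)
reduced = attention_score_ops(T // 4, d_k)  # 4x temporal reduction from conv + pooling
print(full / reduced)  # 16.0: a 4x shorter sequence cuts the score computation 16x
```

The savings compound across every self-attention layer in the encoder, which is why the convolutional front-end pays for itself.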
Moreover, it has been shown that temporal convolutions are effective in modeling time dependencies, and serve to encode ordering into the learned high-level representations of the input signal. Based on these observations, prior work proposes to replace the sinusoidal positional encodings in the Transformer with convolutions, employing 2D convolutions over spectrogram inputs and 1D causal convolutions over word embeddings in the decoder (pictured in Fig. 1).
3 Model Compression
A simple approach to reducing computational requirements is to lower the numerical precision of the weights and activations in the model. It has been shown that stochastic uniform quantization is an unbiased estimator of its input, and that quantizing the weights of a network is equivalent to adding Gaussian noise over the parameters, which can induce a regularization effect and help avoid overfitting. Quantization has several advantages: 1) computation is performed in fixed-point precision, which can be done more efficiently on hardware; 2) with 8-bit quantization, the model can be compressed to a quarter of its original size; 3) in several architectures, memory access dominates power consumption, and moving 8-bit data is four times more efficient than moving 32-bit floating-point data. All three factors contribute to faster inference, with 2-3x speed-ups observed, and further improvements are possible with optimized low-precision vector arithmetic.
3.1 Quantization scheme
For our experiments, we apply the quantization scheme introduced in prior work on integer-arithmetic-only inference: we use a uniform quantization function which maps real values (weights and activations) in the range $[x_{\min}, x_{\max}]$ to $k$-bit signed integers:

$$x_q = \mathrm{round}\left(\frac{x}{s}\right), \qquad (5)$$

with the scale $s = \frac{x_{\max} - x_{\min}}{2^k - 1}$. In the case that $x$ is not in the range $[x_{\min}, x_{\max}]$, we first apply the clamp operator:

$$\mathrm{clamp}(x; x_{\min}, x_{\max}) = \min(\max(x, x_{\min}), x_{\max}). \qquad (6)$$

The de-quantization function is given by:

$$\hat{x} = s \cdot x_q, \qquad (7)$$

where $x_q$ refers to the quantized integer value corresponding to the real value $x$.
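In scalar form, the quantize/clamp/de-quantize operations can be sketched as follows. This is a minimal illustration of one standard uniform scheme consistent with the description above; the exact scale and zero-point conventions of a production implementation may differ.

```python
def quantize(x, x_min, x_max, k=8):
    # Clamp to [x_min, x_max], then map onto a uniform integer grid
    # with step size s = (x_max - x_min) / (2**k - 1).
    s = (x_max - x_min) / (2 ** k - 1)
    x = min(max(x, x_min), x_max)  # clamp operator
    return round(x / s)

def dequantize(x_q, x_min, x_max, k=8):
    # Map the integer code back to a real value on the same grid.
    s = (x_max - x_min) / (2 ** k - 1)
    return x_q * s

# Round-trip error is bounded by half a quantization step.
x = 0.7
x_hat = dequantize(quantize(x, -1.0, 1.0), -1.0, 1.0)
assert abs(x - x_hat) <= (2.0 / 255) / 2
```

With $k = 8$ and a clamping range of $[-1, 1]$, the step size is $2/255 \approx 0.0078$, so any in-range value is recovered to within about $0.004$.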
During training, forward propagation simulates the effects of quantized inference by using the de-quantized values of both weights and activations in the floating-point operations of the forward pass. We apply the quantization operation and the de-quantization operation according to eq. 5 and eq. 7 respectively at each layer. The clamping ranges are computed differently for weights and activations. For a weight matrix $W$, we set $x_{\min}$ and $x_{\max}$ to the minimum and maximum entries of $W$ respectively. For activations, the clamping range depends on $x$, the input to the layer: we keep track of $x_{\min}$ and $x_{\max}$ for each mini-batch during training, and aggregate them using an exponential moving average. Quantization of activations starts only after a fixed number of steps (3000); this ensures that the network has reached a more stable stage and that the estimated ranges do not exclude a significant fraction of values. We quantize to 8-bit precision in our experiments.
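The activation-range bookkeeping described above can be sketched as follows. The class itself is a hypothetical helper, and the momentum value 0.9 is a placeholder assumption, since the smoothing parameter's value is not recoverable from the text.

```python
class ActivationRangeTracker:
    # Tracks activation clamping ranges with an exponential moving average
    # over per-mini-batch minima and maxima. momentum=0.9 is a placeholder
    # assumption, not the paper's setting.
    def __init__(self, momentum=0.9, warmup_steps=3000):
        self.momentum = momentum
        self.warmup_steps = warmup_steps
        self.step = 0
        self.x_min = None
        self.x_max = None

    def update(self, batch):
        b_min, b_max = min(batch), max(batch)
        if self.x_min is None:  # first batch initializes the range
            self.x_min, self.x_max = b_min, b_max
        else:
            m = self.momentum
            self.x_min = m * self.x_min + (1 - m) * b_min
            self.x_max = m * self.x_max + (1 - m) * b_max
        self.step += 1

    def quantization_active(self):
        # Activations are only quantized after the warm-up period,
        # once the estimated ranges have stabilized.
        return self.step >= self.warmup_steps

tracker = ActivationRangeTracker()
tracker.update([0.0, 1.0])
tracker.update([0.0, 2.0])  # x_max becomes 0.9 * 1.0 + 0.1 * 2.0
```

The warm-up guard mirrors the 3000-step delay described above: until then, the tracker only observes ranges and the layer runs in full precision.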
3.2 Quantization choices
We quantize all matrix multiplication operations, i.e., both the inputs and the weights of the matrix multiplications. For other operations, such as addition, quantization does not lead to computational gains during inference, so we leave them unquantized. Specifically, we quantize all weights and activations, excluding the biases; the biases are instead summed with the INT32 outputs of the matrix multiplications. In the multi-head attention module, we quantize the inputs, the softmax layer (including its numerator, denominator, and division) and the scaled dot product's output. In the position-wise feed-forward network, we quantize the weights, its output, and the output of the ReLUs. The weights of the layer norms, the division operation, and the outputs of layer norm are also quantized.
4 Experiments

We use the open-source sequence modelling toolkit fairseq. We conduct our experiments on LibriSpeech 960h, and follow the same setup as Conv-Context: the input features are 80-dimensional log-Mel filterbanks extracted from 25 ms windows every 10 ms, and the output tokens come from a 5K subword vocabulary created with the sentencepiece "unigram" model. For fair comparison, we also optimize with AdaDelta.
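As a quick sanity check on the front-end configuration, the frame count for an utterance under the 25 ms / 10 ms framing can be computed as follows (the 16 kHz sampling rate is an assumption based on the LibriSpeech corpus, and the helper is illustrative):

```python
def num_frames(duration_s, win_ms=25, hop_ms=10, sr=16000):
    # Number of analysis windows that fit in the utterance:
    # one window at the start, then one more per hop.
    samples = int(duration_s * sr)
    win = win_ms * sr // 1000  # 400 samples per 25 ms window
    hop = hop_ms * sr // 1000  # 160 samples per 10 ms hop
    return 1 + max(0, samples - win) // hop

print(num_frames(10.0))  # 998 frames for a 10 s utterance
```

At roughly 100 frames per second before any convolutional subsampling, these are the sequence lengths the encoder's self-attention must handle.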
4.1 Comparison of Transformer variants
We perform preliminary experiments comparing full-precision Transformer variants and choose one to quantize. We start from Conv-Context, which proposes to replace the sinusoidal positional encodings in the encoder and decoder with 2D convolutions (over audio features) and 1D convolutions (over previous token embeddings) respectively. Motivated by recent results in Transformer-based speech recognition and language modelling, we allocate our parameter budget towards depth over width, and retrain their model under the configuration of Transformer Base, namely: 6 encoder/decoder layers, $d_{\mathrm{model}} = 512$, 8 heads, and $d_{\mathrm{ff}} = 2048$. We obtain a satisfactory trade-off between model size and performance (Table 1), and adopt this configuration for the remainder of this work.
Next, we propose removing the 1D convolutional layers on the decoder side, based on previous work demonstrating that the autoregressive Transformer training setup provides enough of a positional signal for Transformer decoders to reconstruct ordering in their deeper layers. We observe that removing these layers does not affect performance, and it reduces our parameter count from 52M to 51M. Finally, we add positional encodings on top of this configuration and see, counter-intuitively, that performance degrades. These results are shown in Table 2.
WER (%) results of different hyperparameter configurations for Conv-Context. The first 2 rows are taken directly from the original Conv-Context paper.
| Model | 1D Conv | Pos. enc. | Params | dev (clean / other) | test (clean / other) |
|---|---|---|---|---|---|
| + Pos. enc. | ✗ | ✓ | — | 6.0 / 14.6 | 6.0 / 14.5 |
For quantization, we restrict our attention to our proposed simplified Transformer (no decoder-side convolutions or positional encodings), since it performs well and is the least complex of the Transformer variants. We compare the results of quantization-aware training to the full-precision model, as well as to the results of post-training quantization. In post-training quantization, we start from the averaged full-precision model, keep the weights fixed, and compute the clamping range for our activations over 1k training steps. To report the checkpoint-averaged result of quantization-aware training, we average the weights of quantization-aware training checkpoints, initialize our activation ranges with checkpoint averages, and adjust them over 1k training steps. In both cases, no additional updates are made to the weights.
Our results are summarised in Table 3. Our quantized models perform comparably to the full-precision model, and represent reasonable trade-offs in accuracy for model size and inference time. The last row of the table represents a 10x compression over the 138M-parameter baseline with no loss in performance. Our quantization-aware training scheme did not result in significant gains over post-training quantization.
5 Discussion

5.1 Representing positional information
The 3 Transformer variants explored in this work differ in how they present token-positional information to the decoder. We study their behaviour to get a better understanding of why our proposed simplified model performs well.
We remark that sinusoidal positional encodings hurt performance because of the longer sequences encountered at test time. It has been observed that decoder-side positional encodings do worse than 1D convolutions (and, from our results, worse than nothing at all). This performance drop stems from under-generation: on dev-clean, our proposed model's WER increases after adding positional encodings, with the deletion rate increasing accordingly. The plot in Fig. 2 shows that this can be attributed to the inability of sinusoidal positional encodings to generalize to lengths longer than those encountered in the training set.
Examining the same plot, we also notice utterances with large deletion counts in the outputs of the models without sinusoidal positional encodings; an example is shown in Fig. 3. In other words, our models without sinusoidal positional encodings exhibit skipping.
We hypothesize that the issue lies in the translation-invariance of the decoder inputs along the time axis: repeated n-grams confuse the decoder into losing its place in the input audio. Cross-attention visualizations between the inputs to the final decoder layer and the encoder outputs (left column of Fig. 4) support this hypothesis. We remark that being able to handle repetition is crucial for transcribing spontaneous speech. Imposing constraints on the attention matrices or expanding the context of relative positional information are possible approaches for addressing this problem.
Finally, we affirm the hypothesis, proposed in prior work on Transformer language modelling, that the Transformer with no positional encodings reconstructs ordering in its deeper layers. The second column of Fig. 4 shows visualizations of cross-attention as we go up the decoder stack.
| Reference | This second part is divided into two, for in the first I speak of her as regards the nobleness of her soul relating some of her virtues proceeding from her soul. In the second I speak of her as regards the nobleness of her body narrating some of her beauties here love saith concerning her. |
| Conv-Context | The second part has divided into two for in the first I speak of her as regards the nobleness of her soul relating some of her virtues proceeding from her soul. In the second I speak of her as regards the nobleness of her body narrating some of her beauties here love saith concerning her. |
5.2 Training the Transformer
We observe no significant gain with quantization-aware training. Furthermore, it increases training time by more than 4x owing to the expansion of our computational graph. We note that in post-training quantization, the 1k steps used to fine-tune the activation clamping ranges are essential: without this step, the system output is degenerate.
In our experiments, we found that training with large batch sizes (80k audio frames) was necessary for convergence. Similar optimization behaviour was observed across all experiments: a plateau in frame-level accuracy, followed by a jump to 80% within a single epoch. This jump was not observed when training with smaller batch sizes.
6 Relation to Prior Work
Transformers for speech recognition.
Several studies have focused on adapting Transformer networks for end-to-end speech recognition. In particular, [20, 7] present models augmenting Transformers with convolutions. A large comparative study focuses on refining the training process, and shows that Transformer-based end-to-end ASR is highly competitive with state-of-the-art methods over 15 datasets. These studies focus only on recognition performance, and do not consider the trade-offs required for edge deployment.
Compression with knowledge distillation. Prior work proposes a knowledge distillation strategy applied to Transformer ASR to recover the performance of a larger model with fewer parameters. Distilled models still operate in 32-bit floating point, and do not take advantage of the faster, more energy-efficient hardware available when working in 8-bit fixed point. Additionally, we believe this line of work is orthogonal to ours, and that the two methods can be combined for further improvement.
Transformer quantization. Quantization strategies for the Transformer have been proposed in the context of machine translation, including a quantization scheme that improves upon the original full-precision performance.
Necessity of positional encodings. For language modelling, better perplexity scores have been achieved without positional encodings, with the argument that the autoregressive setup used to train the Transformer decoder provides a sufficient positional signal.
7 Conclusions

In this paper, we proposed a compact Transformer-based end-to-end ASR system, fully quantized to enable edge deployment. The proposed compact version has a smaller hidden size and no decoder-side convolutions or positional encodings. We then fully quantize it to 8-bit fixed point. Compared to the 138M-parameter baseline we started from, we achieve more than 10x compression with no loss in performance. The final model also takes advantage of efficient hardware for fast inference. Our training strategy and model configurations are not highly tuned; future work includes exploring additional training strategies and incorporating text data, so as to bring highly performant, single-pass, end-to-end ASR to edge devices.
We would like to thank our colleagues Ella Charlaix, Eyyüb Sari, and Gabriele Prato for their valuable insights and discussions.
-  (2016) Deep speech 2: end-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pp. 173–182. Cited by: §2.2.
-  (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1.
-  (2016) End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4945–4949. Cited by: §1.
-  (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. Cited by: §2.2.
-  (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. Cited by: §1.
-  (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1724–1734. Cited by: §1.
-  (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884–5888. Cited by: §1, §2.2, §6.
-  (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376. Cited by: §1.
-  (2014) Towards end-to-end speech recognition with recurrent neural networks. In International conference on machine learning, pp. 1764–1772. Cited by: §1.
-  (2012) Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711. Cited by: §1.
-  (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §1.
-  (2018) Streaming end-to-end speech recognition for mobile devices. Cited by: §1.
-  (2019) Language modeling with deep transformers. arXiv preprint arXiv:1905.04226. Cited by: §4.1, §4.1, §5.1, §6.
-  (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. Cited by: §1, §3.1, §3.
-  (2019) A comparative study on transformer vs rnn in speech applications. arXiv preprint arXiv:1909.06317. Cited by: §1, §4.1, §6.
-  (2019) Knowledge distillation using output errors for self-attention end-to-end models. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6181–6185. Cited by: §1, §6.
-  (2016) Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947. Cited by: §1.
-  (2018) Sentencepiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226. Cited by: §4.
-  (1990) Optimal brain damage. In Advances in neural information processing systems, pp. 598–605. Cited by: §1.
-  (2019) Transformers with convolutional context for asr. arXiv preprint arXiv:1904.11660. Cited by: Figure 1, §1, §2.2, §2.2, §4.1, Table 1, §4, §5.1, §6.
-  (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 48–53. Cited by: §4.
-  (2015) Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §1, §4.
-  (2018) Image transformer. In International Conference on Machine Learning, pp. 4052–4061. Cited by: §1.
-  (2018) Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668. Cited by: §3.
-  (2019) Fully quantized transformer for improved translation. arXiv preprint arXiv:1910.10485. Cited by: §6.
-  (2019) Language models are unsupervised multitask learners. Cited by: §1.
-  (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725. Cited by: §2.
-  (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1.
-  (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.1, §4.1.
-  (2012) ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §4.