1 Introduction
End-to-end automatic speech recognition (ASR) systems combine the functionality of acoustic, pronunciation, and language modelling components into a single neural network. Early approaches to end-to-end ASR employ CTC [8, 9]; however, these models require rescoring with an external language model (LM) to obtain good performance [3]. The RNN encoder-decoder [6, 28] equipped with attention [2], originally proposed for machine translation, is an effective approach for end-to-end ASR [3, 5]. These systems see less of a performance drop in the no-LM setting [3].
More recently, the Transformer [29] encoder-decoder architecture has been applied to ASR [7, 20, 15]. Transformer training is parallelizable across time, leading to faster training than recurrent models [29]. This makes them especially amenable to the large audio corpora encountered in speech recognition. Furthermore, Transformers are powerful autoregressive models [26, 23], and have achieved reasonable ASR results without incurring the storage and computational overhead associated with using LMs during inference [20].

Although current end-to-end technology has seen significant improvements in accuracy, the time and space requirements for performing inference with such models remain prohibitive for edge devices. Thus, there has been increased interest in reducing model sizes to enable on-device computation. The model compression literature explores many techniques to tackle this problem, including quantization [14], pruning [19, 11], and knowledge distillation [16, 17]. An all-neural, end-to-end solution based on RNN-T [10] is presented in [12]. The authors make several runtime optimizations to inference and perform post-training quantization, allowing the model to be successfully deployed to edge devices.
In this contribution, we turn our focus to refining the Transformer architecture so as to enable its use on edge devices. The absence of recurrent connections in Transformers provides a significant advantage in terms of speeding up computation; quantizing a Transformer-based ASR system would therefore be an important step towards on-device ASR. We report findings on direct improvements to the model obtained by removing components which do not significantly affect performance, and finally by reducing the numerical precision of the model's weights and activations. Specifically, we reduce the dimensionality of the inner representations throughout the model, remove the convolutional layers employed prior to the decoder's layers (as in Fig. 1), and finally apply 8-bit quantization to the model's weights and activations. Our results on the LibriSpeech dataset [22] support the claim that one can recover the original recognition performance even after greatly reducing the model's computational requirements.
The remainder of this work is organized as follows: Section 2 gives an overview of Transformer-based ASR, and Section 3 describes the details of our quantization scheme. Section 4 describes our experiments on the LibriSpeech dataset. Section 5 is a discussion of our results. Connections to prior work are presented in Section 6. Finally, we draw conclusions and describe future directions in Section 7.
2 Transformer networks for ASR
Casting ASR as a sequence-to-sequence task, the Transformer encoder takes as input a sequence of frame-level acoustic features and maps it to a sequence of high-level representations. The decoder generates a transcription one token at a time. Each choice of output token is conditioned on the encoder's hidden states and previously generated tokens through attention mechanisms. The typical choice for acoustic features are frame-level log-Mel filterbank coefficients. The target transcripts are represented by word-level tokens or sub-word units, such as characters or units produced through byte pair encoding [27].
2.1 Transformer architecture
The encoder and decoder of the Transformer are stacks of Transformer layers. The layers of the encoder iteratively refine the representation of the input sequence with a combination of multi-head self-attention and frame-level affine transformations. Specifically, the inputs to each layer are projected into keys $K$, queries $Q$, and values $V$. Scaled dot-product attention is then used to compute a weighted sum of values for each query vector:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \qquad (1)$$

where $d_k$ is the dimension of the keys. We obtain multi-head attention by performing this computation $h$ times independently with different sets of projections, and concatenating:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O \qquad (2)$$

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \qquad (3)$$

The $W_i^Q, W_i^K \in \mathbb{R}^{d_\text{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_\text{model} \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d_\text{model}}$ are learned linear transformations. We use $d_k = d_v = d_\text{model}/h$. The self-attention operation allows frames to gather context from all timesteps and build an informative sequence of high-level features. The outputs of multi-head attention go through a 2-layer position-wise feed-forward network with hidden size $d_\text{ff}$:

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2 \qquad (4)$$
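To make the computation concrete, eqs. (1)–(3) can be sketched in a few lines of numpy. The dimensions below are toy values chosen for illustration, not the configuration used in this work:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Eq. (1): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (T_q, T_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # (T_q, d_v)

def multi_head_attention(X, params):
    """Eqs. (2)-(3): independent attention heads, concatenated
    and mixed by the output projection W_O."""
    heads = [scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
             for W_q, W_k, W_v in params["heads"]]
    return np.concatenate(heads, axis=-1) @ params["W_O"]

# Toy dimensions: d_model = 16, h = 4 heads, d_k = d_v = 4.
rng = np.random.default_rng(0)
d_model, h, d_k = 16, 4, 4
X = rng.standard_normal((10, d_model))                   # 10 input frames
params = {
    "heads": [tuple(rng.standard_normal((d_model, d_k)) for _ in range(3))
              for _ in range(h)],
    "W_O": rng.standard_normal((h * d_k, d_model)),
}
out = multi_head_attention(X, params)
assert out.shape == (10, d_model)
```

In self-attention all three projections are applied to the same sequence `X`, so every frame can attend to every other frame in a single matrix operation.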
On the decoder side, each layer performs two rounds of multi-head attention: the first is self-attention over the representations of previously emitted tokens ($Q$, $K$, and $V$ all come from the previous decoder layer's outputs), and the second attends over the output of the final layer of the encoder ($Q$ comes from the previous decoder layer's outputs, while $K$ and $V$ come from the encoder outputs). The output of the final decoder layer for each token is used to predict the following token. Other components of the architecture, such as sinusoidal positional encodings, residual connections, and layer normalization, are described in [29].

2.2 Convolutional layers
Following previous work [7, 20, 1], we apply frequency-time 2-dimensional convolution blocks followed by max pooling to our audio features prior to feeding them into the encoder, as seen in Fig. 1. This yields significant savings in computation: the length of the input is considerably reduced, and the computation required by the self-attention layers scales quadratically with respect to the sequence length.

Moreover, it has been shown that temporal convolutions are effective in modeling time dependencies [4], and they serve to encode ordering into learned high-level representations of the input signal. Based on these observations, [20] proposes to replace the sinusoidal positional encodings in the Transformer with convolutions, employing 2D convolutions over spectrogram inputs and 1D causal convolutions over word embeddings in the decoder (pictured in Fig. 1).
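The computational saving from this downsampling follows from the quadratic scaling; a quick back-of-the-envelope calculation (the sequence length below is illustrative only):

```python
# Self-attention cost grows as O(L^2 * d) in the sequence length L.
# Two 2x max-pooling stages along time shrink L by a factor of 4,
# so the attention cost drops by roughly a factor of 16.
def attention_cost(L, d_model):
    return L * L * d_model      # dot products per attention map

L, d_model = 1000, 512          # e.g. ~10 s of audio at a 10 ms hop
pooled_L = L // 4               # after two 2x max-pooling stages
ratio = attention_cost(L, d_model) / attention_cost(pooled_L, d_model)
assert ratio == 16.0
```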
3 Model Compression
A simple approach to reducing computational requirements is to reduce the precision of the weights and activations in the model. It is shown in [24] that stochastic uniform quantization is an unbiased estimator of its input, and that quantizing the weights of a network is equivalent to adding Gaussian noise over the parameters, which can induce a regularization effect and help avoid overfitting. Quantization has several advantages: 1) computation is performed in fixed-point precision, which can be done more efficiently on hardware; 2) with 8-bit quantization, the model can be compressed to as little as a quarter of its original size; 3) in several architectures, memory access dominates power consumption, and moving 8-bit data is four times more efficient than moving 32-bit floating-point data. All three factors contribute to faster inference, with 2-3x speedups reported [14], and further improvements are possible with optimized low-precision vector arithmetic.

3.1 Quantization scheme
For our experiments, we apply the quantization scheme introduced in [14]: we use a uniform quantization function which maps real values (weights and activations) in the range $[x_{min}, x_{max}]$ to $b$-bit signed integers:

$$Q(x) = \left\lfloor \frac{x}{s} \right\rceil \qquad (5)$$

with step size $s = (x_{max} - x_{min}) / (2^b - 1)$. In the case that $x$ is not in the range $[x_{min}, x_{max}]$, we first apply the clamp operator:

$$\mathrm{clamp}(x;\, x_{min}, x_{max}) = \min(\max(x, x_{min}),\, x_{max}) \qquad (6)$$

The dequantization function is given by:

$$D(x^q) = s \cdot x^q \qquad (7)$$

where $x^q$ refers to the quantized integer value corresponding to the real value $x$.
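A minimal sketch of the quantization operations in eqs. (5)–(7), together with the exponential-moving-average range tracking used for activations later in this section. Function names and the smoothing value are our own placeholders, not from the paper:

```python
import numpy as np

def clamp(x, x_min, x_max):
    """Eq. (6): restrict x to the clamping range."""
    return np.minimum(np.maximum(x, x_min), x_max)

def quantize(x, x_min, x_max, b=8):
    """Eq. (5): uniform quantization of reals in [x_min, x_max] to
    b-bit integers, step size s = (x_max - x_min) / (2^b - 1)."""
    s = (x_max - x_min) / (2 ** b - 1)
    x_q = np.round(clamp(x, x_min, x_max) / s).astype(np.int32)
    return x_q, s

def dequantize(x_q, s):
    """Eq. (7): map an integer code back to a real value."""
    return s * x_q

def ema_update(running, batch_stat, momentum=0.9):
    """Exponential moving average used to aggregate per-minibatch
    activation min/max (the smoothing value is a placeholder)."""
    return momentum * running + (1 - momentum) * batch_stat

# Quantizing a toy weight matrix: the clamping range is [min(W), max(W)].
W = np.array([-0.51, 0.0, 0.27, 0.98])
W_q, s = quantize(W, W.min(), W.max())
W_hat = dequantize(W_q, s)
assert np.all(np.abs(W_hat - W) <= s / 2)   # error within half a step
```

The final assertion shows why 8-bit quantization is often benign: the round-trip error of any in-range value is bounded by half the step size $s$.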
During training, forward propagation simulates the effects of quantized inference by using the dequantized values of both weights and activations in the forward-pass floating-point arithmetic operations: we apply the quantization operation of eq. (5) and the dequantization operation of eq. (7) at each layer. The clamping ranges are computed differently for weights and activations. For a weight matrix $W$, we set $x_{min}$ and $x_{max}$ to $\min(W)$ and $\max(W)$ respectively. For activations, the clamping range depends on $x$, the input to the layer; we calculate it by keeping track of $\min(x)$ and $\max(x)$ for each minibatch during training and aggregating them with an exponential moving average. Quantization of activations starts only after a fixed number of steps (3000); this ensures that the network has reached a more stable stage and that the estimated ranges do not exclude a significant fraction of values. We quantize to 8-bit precision in our experiments.

3.2 Quantization choices
We quantize all matrix multiplication operations, i.e., both the inputs and the weights of the matrix multiplications. For other operations, such as addition, quantization does not lead to computational gains during inference, so we do not quantize them. Specifically, we quantize all weights and activations, excluding the biases; the biases are summed with the INT32 output of the matrix multiplications. In the multi-head attention module, we quantize the inputs ($Q$, $K$, $V$), the softmax layer (including its numerator, denominator, and division), and the scaled dot-product's output. In the position-wise feed-forward network, we quantize the weights, its output, and the output of the ReLUs. The weights of the layer norms, the division operation, and the outputs of layer norm are also quantized.

4 Experiments
We use the open-source sequence modelling toolkit fairseq [21]. We conduct our experiments on LibriSpeech 960h [22] and follow the same setup as [20]: the input features are 80-dimensional log-Mel filterbanks extracted from 25 ms windows every 10 ms, and the output tokens come from a 5K sub-word vocabulary created with the sentencepiece [18] "unigram" model. For fair comparison, we also optimize with AdaDelta [30] with learning rate 1.0 and gradient clipping at 10.0, run for 80 epochs, and average the checkpoints saved over the last 30 epochs. The dropout rate was set to 0.15.
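For intuition on input sizes, the number of feature frames produced by this 25 ms / 10 ms setup can be estimated as follows (a sketch; exact edge handling varies by toolkit):

```python
# Each frame is one 80-dim log-Mel vector computed over a 25 ms
# window; windows are hopped every 10 ms, so ~100 frames/second.
def num_frames(T_ms, win_ms=25, hop_ms=10):
    return 1 + (T_ms - win_ms) // hop_ms

assert num_frames(10_000) == 998   # a 10 s utterance -> ~1000 frames
```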
4.1 Comparison of Transformer variants
We perform preliminary experiments comparing full-precision Transformer variants and choose one to quantize. We start from ConvContext [20], which proposes to replace sinusoidal positional encodings in the encoder and decoder with 2D convolutions (over audio features) and 1D convolutions (over previous token embeddings) respectively. Motivated by recent results in Transformer-based speech recognition [15] and language modelling [13], we allocate our parameter budget towards depth over width, and retrain their model under the configuration of Transformer Base [29], namely: 6 encoder/decoder layers, $d_\text{model} = 512$, 8 heads, and $d_\text{ff} = 2048$. We obtain a satisfactory tradeoff between model size and performance (Table 1), and adopt this configuration for the remainder of this work.
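The effect of halving the width can be approximated with a rough parameter count. This sketch ignores biases, layer norms, embeddings, and the convolutional front-end, and the wide configuration's hyperparameters are our own assumptions:

```python
# Rough per-layer parameter counts for a Transformer:
# 4*d^2 for an attention block, 2*d*d_ff for the feed-forward block;
# decoder layers carry an extra (cross-)attention block.
def encoder_layer_params(d, d_ff):
    return 4 * d * d + 2 * d * d_ff

def decoder_layer_params(d, d_ff):
    return 8 * d * d + 2 * d * d_ff

def model_params(d, d_ff, n_enc, n_dec):
    return (n_enc * encoder_layer_params(d, d_ff)
            + n_dec * decoder_layer_params(d, d_ff))

base = model_params(512, 2048, 6, 6)    # Transformer Base shape
wide = model_params(1024, 4096, 6, 6)   # assumed wide configuration
assert wide == 4 * base                  # doubling both dims quadruples size
```

Doubling both $d_\text{model}$ and $d_\text{ff}$ quadruples every term, so halving the width recovers roughly a 4x reduction in layer parameters, consistent with the gap between the wide and 52M rows in Table 1.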
Next, we propose removing the 1D convolutional layers on the decoder side, based on previous work [13] demonstrating that the autoregressive Transformer training setup provides enough of a positional signal for Transformer decoders to reconstruct order in the deeper layers. We observe that removing these layers does not affect our performance, and it reduces our parameter count from 52M to 51M. Finally, we add positional encodings on top of this configuration and see, counterintuitively, that our performance degrades. These results are shown in Table 2.
| Model | d_model | Enc layers | Dec layers | Params | dev clean | dev other | test clean | test other |
|---|---|---|---|---|---|---|---|---|
| ConvContext | 1024 | 16 | 6 | 315M | 4.8 | 12.7 | 4.7 | 12.9 |
| ConvContext | 1024 | 6 | 6 | 138M | 5.6 | 14.5 | 5.7 | 15.3 |
| ConvContext | 512 | 6 | 6 | 52M | 5.3 | 14.9 | 5.7 | 14.8 |

Table 1: WER (%) results of different hyperparameter configurations for ConvContext. The first 2 rows are taken directly from [20].

| Model | 1D Conv | Pos. enc. | Params | dev clean | dev other | test clean | test other |
|---|---|---|---|---|---|---|---|
| ConvContext | ✓ | ✗ | 52M | 5.3 | 14.9 | 5.7 | 14.8 |
| Proposed | ✗ | ✗ | 51M | 5.6 | 14.2 | 5.5 | 14.8 |
| + Pos. enc. | ✗ | ✓ | 51M | 6.0 | 14.6 | 6.0 | 14.5 |

Table 2: WER (%) results of removing the decoder-side 1D convolutions and of adding sinusoidal positional encodings.
4.2 Quantization
For quantization, we restrict our attention to our proposed simplified Transformer (no decoder-side convolutions or positional encodings), since it performs well and is the least complex of the Transformer variants. We compare the results of quantization-aware training to the full-precision model, as well as to the results of post-training quantization. In post-training quantization, we start from the averaged full-precision model, keep the weights fixed, and compute the clamping ranges for our activations over 1k training steps. To report the checkpoint-averaged result of quantization-aware training, we average the weights of quantization-aware training checkpoints, initialize our activation ranges with checkpoint averages, and adjust them over 1k training steps. In both cases, no additional updates are made to the weights.
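Checkpoint averaging itself is straightforward: the saved weights of the last several checkpoints are averaged parameter-wise. A minimal sketch, with checkpoints represented as dicts of numpy arrays:

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Parameter-wise mean over a list of checkpoints, where each
    checkpoint maps parameter names to weight arrays."""
    return {k: sum(c[k] for c in checkpoints) / len(checkpoints)
            for k in checkpoints[0]}

ckpts = [{"w": np.array([1.0, 2.0])},
         {"w": np.array([3.0, 4.0])},
         {"w": np.array([5.0, 6.0])}]
avg = average_checkpoints(ckpts)
assert np.allclose(avg["w"], [3.0, 4.0])
```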
Our results are summarised in Table 3. Our quantized models perform comparably to the full-precision model, and represent reasonable tradeoffs in accuracy for model size and inference time. The last row of the table represents a result of 10x compression over the 138M-parameter baseline with no loss in performance. Our quantization-aware training scheme did not result in significant gains over post-training quantization.
| Model | Fully quantized | dev clean | dev other | test clean | test other |
|---|---|---|---|---|---|
| Full-precision | ✗ | 5.6 | 14.2 | 5.5 | 14.8 |
| Post-training quant | ✓ | 5.6 | 14.6 | 5.6 | 15.1 |
| Quant-aware training | ✓ | 5.4 | 14.5 | 5.5 | 15.2 |

Table 3: WER (%) results of the full-precision model and its quantized counterparts.
5 Discussion
5.1 Representing positional information
The 3 Transformer variants explored in this work differ in how they present token-positional information to the decoder. We study their behaviour to better understand why our proposed simplified model performs well.
We remark that sinusoidal positional encodings hurt performance because of longer sequences at test time. It has been observed that decoder-side positional encodings do worse than 1D convolutions [20] (and, from our results, worse than nothing at all). This performance drop stems from under-generation: on dev-clean, our proposed model's WER increases after adding positional encodings, driven by a higher deletion rate. Our plot in Fig. 2 shows that this can be attributed to the inability of sinusoidal positional encodings to generalize to lengths longer than those encountered in the training set.
Examining the same plot, we notice utterances with large deletion counts in the outputs of the models without sinusoidal positional encodings. An example is shown in Fig. 3: our models without sinusoidal positional encodings exhibit skipping.
We hypothesize that the issue lies in the time-axis translation-invariance of the decoder inputs: repeated n-grams confuse the decoder into losing its place in the input audio. Cross-attention visualizations between the inputs to the final decoder layer and the encoder outputs (left column of Fig. 4) support this hypothesis. We remark that being able to handle repetition is crucial for transcribing spontaneous speech. Imposing constraints on attention matrices or expanding the relative positional information context are possible approaches for addressing this problem.

Finally, we affirm the hypothesis proposed in [13] that the Transformer with no positional encodings reconstructs ordering in its deeper layers. The second column of Fig. 4 shows visualizations of cross-attention as we go up the decoder stack.
Reference: This second part is divided into two, for in the first I speak of her as regards the nobleness of her soul relating some of her virtues proceeding from her soul. In the second I speak of her as regards the nobleness of her body narrating some of her beauties here love saith concerning her.

ConvContext: The second part has divided into two for in the first I speak of her as regards the nobleness of her soul relating some of her virtues proceeding from her soul. In the second I speak of her as regards the nobleness of her body narrating some of her beauties here love saith concerning her.
5.2 Training the Transformer
We observe no significant gain with quantization-aware training. Furthermore, it increases training time by more than 4x due to the expansion of the computational graph. We note that in post-training quantization, the 1k steps used to fine-tune the activation clamping ranges are very important: without this step, the system output is degenerate.
In our experiments, we found that training with large batch sizes (80k audio frames) was necessary for convergence. Similar optimization behaviour was observed across all experiments: a plateau in frame-level accuracy, followed by a jump to 80% within a single epoch. This jump was not observed when training with smaller batch sizes.
6 Relation to Prior Work
Transformers for speech recognition.
Several studies have focused on adapting Transformer networks for end-to-end speech recognition. In particular, [20, 7] present models augmenting Transformers with convolutions. [15] focuses on refining the training process, and shows that Transformer-based end-to-end ASR is highly competitive with state-of-the-art methods across 15 datasets. These studies focus only on performance, and do not consider the tradeoffs required for edge deployment.

Compression with knowledge distillation. [16] proposes a knowledge distillation strategy applied to Transformer ASR to recover the performance of a larger model with fewer parameters. Distilled models still work in 32-bit floating point, and do not take advantage of the faster, more energy-efficient hardware available when working with 8-bit fixed point. Additionally, we believe this line of work is orthogonal to ours, and the two methods can be combined for further improvement.
Transformer quantization. Quantization strategies for the Transformer have been proposed in the context of machine translation. [25] proposes a quantization scheme that allows them to improve upon the original full-precision performance.
Necessity of positional encodings. For language modelling, [13] achieves better perplexity scores without positional encodings, and argues that the autoregressive setup used to train the Transformer decoder provides a sufficient positional signal.
7 Conclusion
In this paper, we proposed a compact Transformer-based end-to-end ASR system, fully quantized to enable edge deployment. The proposed compact version has a smaller hidden size and no decoder-side convolutions or positional encodings. We then fully quantize it to 8-bit fixed point. Compared to the 138M-parameter baseline we started from, we achieve more than 10x compression with no loss in performance. The final model can also take advantage of efficient fixed-point hardware for fast inference. Our training strategy and model configurations are not highly tuned; future work includes exploring additional training strategies and incorporating text data, so as to bring highly performant, single-pass, end-to-end ASR to edge devices.
8 Acknowledgements
We would like to thank our colleagues Ella Charlaix, Eyyüb Sari, and Gabriele Prato for their valuable insights and discussions.
References

[1] (2016) Deep Speech 2: end-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pp. 173–182.
[2] (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[3] (2016) End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4945–4949.
[4] (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
[5] (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964.
[6] (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1724–1734.
[7] (2018) Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884–5888.
[8] (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376.
[9] (2014) Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pp. 1764–1772.
[10] (2012) Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.
[11] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
[12] (2018) Streaming end-to-end speech recognition for mobile devices.
[13] (2019) Language modeling with deep Transformers. arXiv preprint arXiv:1905.04226.
[14] (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713.
[15] (2019) A comparative study on Transformer vs RNN in speech applications. arXiv preprint arXiv:1909.06317.
[16] (2019) Knowledge distillation using output errors for self-attention end-to-end models. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6181–6185.
[17] (2016) Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.
[18] (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
[19] (1990) Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605.
[20] (2019) Transformers with convolutional context for ASR. arXiv preprint arXiv:1904.11660.
[21] (2019) fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 48–53.
[22] (2015) LibriSpeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210.
[23] (2018) Image Transformer. In International Conference on Machine Learning, pp. 4052–4061.
[24] (2018) Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668.
[25] (2019) Fully quantized Transformer for improved translation. arXiv preprint arXiv:1910.10485.
[26] Language models are unsupervised multitask learners.
[27] (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725.
[28] (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
[29] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
[30] (2012) ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.