Deep Neural Networks (DNNs) learn multiple levels of representation of data in order to model complex relationships among them. The conventional Acoustic Model (AM) used in an ASR framework is trained with neural network architectures such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) and Time-Delay Neural Networks (TDNNs) with the Kaldi toolkit. Such models often have millions of parameters, making them impractical to use on embedded devices such as the Raspberry Pi. To embed an ASR system on such devices, the footprint of the ASR system needs to be significantly reduced. One simple solution is to train a model with fewer parameters. However, reducing the model size usually degrades the performance of the system.
Previous research has shown several possible alternative approaches. Quantizing the model parameters from floating point values to integers is one popular approach. Quantization methods have been studied for CNN architectures in image classification and other computer vision problems. Results show that quantizing such models reduces the model size significantly without any impact on the performance. Another approach is teacher-student training: a larger model is first trained and optimized for performance, and its output is then used to train a smaller model. Alternatively, parameter reduction can be integrated as a part of training. In this paper, we study the effect of quantizing the parameters of an AM used in ASR, with a focus on deploying it on embedded devices with low computational resources (especially memory). We present the impact on the performance of the ASR system when the AM is quantized from float32 to int8 or int16. The results of the quantization process are then compared to other parameter reduction techniques used for automatic speech recognition models. We believe that the results obtained from our study have not been presented in the literature, and can be of interest to researchers experimenting with interfacing the Kaldi and Pytorch tools for ASR tasks.
The rest of the paper is organized as follows. Section 2 briefly describes the current techniques used in parameter reduction of a model. This is followed by an overview of the quantization techniques and their application to AM training with the Kaldi toolkit (Section 3). Section 4 presents the experiments and the results. Finally, the conclusion is provided in Section 5.
2 Related work
Speech recognition can be considered as a sequence-to-sequence mapping problem in which a sequence of sounds is converted to a sequence of meaningful linguistic units (e.g. phones, syllables, words). In order to better distinguish different classes of sounds, it is useful to train with positive and negative examples. Hence, sequential discriminative criteria such as Maximum Mutual Information (MMI) and state-level Minimum Bayes Risk (sMBR) can be applied. The former is now commonly known as lattice-free MMI (LF-MMI, or the chain model). This method can also be used without any Cross-Entropy (CE) initialization, leading to a lower computation cost.
State-level sequence-discriminative training of DNNs starts from a set of alignments and lattices that are generated by decoding the training data with a Language Model (LM). For each training condition, the alignments and lattices are generated using the corresponding DNN trained with CE. The cross-entropy trained models are also used as the starting point for the sequence-discriminative training. Whereas sMBR training uses a word-level language model, LF-MMI training uses a phone-level LM. This simplification enables LF-MMI training to use GPU clusters, and LF-MMI is considered the state-of-the-art training criterion for the AM of an ASR system. Hence, our experiments consider AMs with the TDNN architecture.
Parameter reduction is a process that removes parameters (or entire layers) of the neural network while avoiding the loss of information required for its decision process. This process can be applied to already trained neural networks or implemented during training. Several different approaches can be considered:
-  Teacher-student approach to reduce the number of layers in the student neural network.
-  Reduce the size of the layers used in training the neural network through matrix factorization.
-  Reduce the hidden layer dimension (e.g. from 1024 to 512 in each layer of the neural network).
-  Reduce the number of hidden layers used in the network.
-  Quantization of model parameters (e.g. from 32 bit floating point precision to 16 bit floating point precision).
Singular Value Decomposition (SVD) is one of the most popular methods. It can be applied to trained models to factorize the learned weight matrix as a product of two much smaller factors. SVD then discards the smaller singular values, followed by fine-tuning of the network parameters to obtain a parameter-reduced model.
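This restructuring step can be sketched as follows (an illustrative NumPy example; the function name is ours and the subsequent fine-tuning is omitted):

```python
import numpy as np

def svd_compress(W, rank):
    """Factorize W (m x n) as A @ B, keeping only the `rank` largest
    singular values; A is (m x rank) and B is (rank x n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # absorb singular values into the left factor
    B = Vt[:rank]
    return A, B
```

Storing A and B takes rank * (m + n) values instead of m * n, a net reduction whenever rank < m*n / (m + n).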
Another approach to enforce parameter reduction while training a neural network AM is to apply low-rank factorized layers. In semi-orthogonal factorization, the parameter matrix M is factorized as a product of two matrices A and B:

M = A B,

where A is a semi-orthogonal matrix and B has a smaller "interior" (i.e. rank) than that of M. This technique enables training a smaller network from scratch instead of using a pre-trained network for parameter reduction. The LF-MMI training also provides a stable training procedure for semi-orthogonalized matrices.
While semi-orthogonal matrices have been studied with TDNN-F (a variant of TDNN with residual connections), it has not been compared with other model reduction techniques. In our experiments, we present the comparison with respect to varying the number of layers.
3 Quantization

A popular technique to reduce the size of a model is quantization. This approach is widely applied in computer vision problems and is supported by many deep learning frameworks such as Pytorch and TensorFlow. However, applying quantization to an AM trained with the LF-MMI criterion using the Kaldi toolkit is not straightforward. The following subsections explain the standard quantization process for DNNs and how it is applied to the AMs.
3.1 Overview of quantization process
Quantization is the process of mapping a set of real valued inputs to a set of discrete valued outputs. Commonly used quantization types are 16 bits and 8 bits. Quantizing model parameters typically involves decreasing the number of bits used to represent the parameters. Prior to this process the model is typically trained with IEEE float32 or float64. If the original model uses a float32 representation, the model size can be reduced by a factor of 2 (with 16 bit quantization) or by a factor of 4 (with 8 bit quantization).
In addition to the quantization types, there are different quantization modes such as symmetric and asymmetric quantization. A real valued variable in the range [x_min, x_max] is quantized to a range [q_min, q_max]. In symmetric quantization, the range corresponds to [-(2^(b-1) - 1), 2^(b-1) - 1]. In asymmetric quantization the quantization range is [-2^(b-1), 2^(b-1) - 1]. In the aforementioned intervals, b = 16 for 16 bit quantization and b = 8 for 8 bit quantization.
A real value x can be expressed as an integer x_q given a scale s and zero-point z:

x = s * (x_q - z)

In the above equation, the scale s specifies the step size required to map the floating point values to integers, and the integer zero-point z represents the floating point zero.
Given the minimum x_min and maximum x_max of a vector and the range [q_min, q_max] of the quantization scheme, the scale and zero-point are computed as:

s = (x_max - x_min) / (q_max - q_min)
z = q_min - round(x_min / s)

For 8 bit integer quantization the values never reach -128, hence we use q_min = -127 and q_max = 127.
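The scale and zero-point computation above can be sketched in a few lines (an illustrative NumPy sketch of the symmetric and asymmetric schemes; the function names are ours and this is not the code used in our experiments):

```python
import numpy as np

def quant_params(x, bits=8, symmetric=True):
    """Compute (scale, zero_point, q_min, q_max) for a float vector x."""
    if symmetric:
        # Symmetric: range [-(2^(b-1) - 1), 2^(b-1) - 1], zero-point fixed
        # at 0, so -128 is never used for 8 bit quantization.
        q_max = 2 ** (bits - 1) - 1
        q_min = -q_max
        scale = np.max(np.abs(x)) / q_max
        zero_point = 0
    else:
        # Asymmetric: range [-2^(b-1), 2^(b-1) - 1], zero-point maps float 0.
        q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        scale = (np.max(x) - np.min(x)) / (q_max - q_min)
        zero_point = q_min - int(round(np.min(x) / scale))
    return scale, zero_point, q_min, q_max

def quantize(x, scale, zero_point, q_min, q_max):
    """x_q = round(x / s) + z, clipped to the quantization range."""
    return np.clip(np.round(x / scale) + zero_point, q_min, q_max).astype(np.int32)

def dequantize(x_q, scale, zero_point):
    """x = s * (x_q - z)."""
    return scale * (x_q.astype(np.float32) - zero_point)
```

Round-tripping a tensor through quantize/dequantize bounds the per-element error by one quantization step, which is the noise that fine-tuning later tries to compensate for.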
3.2 Quantization application
We implement the quantization algorithms in Pytorch, as it provides better support than Kaldi for the int8, uint8 and int16 types. The aim of our work is to port models trained in Kaldi so that they are functional on embedded systems. There already exist tools such as Pykaldi that help users load Kaldi acoustic models for inference in Pytorch. However, they do not allow access to the model parameters by default. To support this work, we implemented a C++ wrapper that provides access to the model parameters and input MFCC features as Pytorch tensors. The wrapper also allows us to write the models and ark (archive) files back to the Kaldi format.
Once the model is loaded as tensors, several options exist: we can quantize only the weights of the model, or quantize both the weights and the activations.
3.2.1 Quantization of weights only
Weight-only quantization is an approach in which only the weights of the neural network model are quantized. This approach is useful when only the model size needs to be reduced and the inference is carried out in floating-point precision. In our experiments, the weights are quantized in Pytorch and the inference is carried out in Kaldi.
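A per-tensor variant of this scheme can be sketched as follows (an illustrative NumPy sketch, not the exact implementation used in our experiments; the helper names are ours). The integer weights and one scale per tensor are stored; the weights are de-quantized back to float32 before floating-point inference:

```python
import numpy as np

def quantize_weights(state_dict, bits=8):
    """Per-tensor symmetric weight-only quantization.
    Returns, for each tensor, its integer weights and a single scale.
    Assumes no tensor is all zeros (the scale would be 0)."""
    q_max = 2 ** (bits - 1) - 1
    quantized = {}
    for name, w in state_dict.items():
        scale = np.max(np.abs(w)) / q_max
        q = np.clip(np.round(w / scale), -q_max, q_max).astype(np.int8)
        quantized[name] = (q, scale)
    return quantized

def dequantize_weights(quantized):
    """Recover float32 weights so inference can stay in floating point."""
    return {name: scale * q.astype(np.float32)
            for name, (q, scale) in quantized.items()}
```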
3.2.2 Quantization of weights and activations
In order to reduce the model from 32 bit precision to 8 bit precision, both the weights and the activations must be quantized. Activations are quantized with the use of a calibration set to estimate their dynamic range. Our network architecture consists of a TDNN layer followed by ReLU and Batchnorm layers. In our experiments, we quantize only the weights and input activations of the TDNN layer as depicted in Figure 1 (i.e., the integer arithmetic is applied only to the 1D convolution). Floating point operations are used in the ReLU and Batchnorm layers in order to simplify the implementation, as the main focus of this paper is to study the impact of quantization on the AM weights and activations. The conventional word-recognition lattices are then generated by a Kaldi decoder (i.e. performance in Kaldi) using the Pytorch-generated likelihoods.
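A simplified sketch of this hybrid scheme follows (illustrative NumPy code with a plain matrix product standing in for the 1D convolution; the function names are ours). The activation range is estimated on a calibration set, the product is accumulated in integer arithmetic, and the output is de-quantized so that the following ReLU and Batchnorm can remain in floating point:

```python
import numpy as np

def calibrate_activation_range(batches):
    """Estimate the dynamic range of a layer's input activations
    from a calibration set (here simply min/max over all batches)."""
    lo = min(float(b.min()) for b in batches)
    hi = max(float(b.max()) for b in batches)
    return lo, hi

def integer_linear(x, w_q, w_scale, a_scale, a_zero):
    """Quantize the float input x, multiply by int8 weights w_q with an
    integer accumulator, and de-quantize the result back to float."""
    x_q = np.clip(np.round(x / a_scale) + a_zero, -127, 127).astype(np.int32)
    acc = (x_q - a_zero) @ w_q.astype(np.int32).T  # integer arithmetic only
    return acc.astype(np.float32) * (a_scale * w_scale)
```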
| Model | Quantization (bits) | WQ | AQ | Params | Size | WER % |
|---|---|---|---|---|---|---|
| TDNN - fine tuned | 8 | Yes | Yes | 7.9M | 0.25x | 18.5 |
| Model | Quantization (bits) | WQ | AQ | Params | Size | WER % |
|---|---|---|---|---|---|---|
| TDNN - fine tuned | 8 | Yes | Yes | 15.4M | 0.25x | 11.28 |
3.2.3 Post quantization fine-tuning
Quantization reduces the precision of the model, which means noise is added when the weights are quantized. In order to reduce the level of this noise, a fine-tuning process is carried out. In this experiment, the quantized weights are first de-quantized and saved. This model is then loaded back into Kaldi and trained for 2 further epochs with a low learning rate. The quantize-and-fine-tune process is carried out for three iterations, with the assumption that the final quantized model converges towards the baseline TDNN model.
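The iterative procedure can be sketched as follows (illustrative NumPy code; `finetune_fn` is a placeholder for the Kaldi training step, which is outside the scope of this sketch):

```python
import numpy as np

def quantize_dequantize(w, bits=8):
    """Round-trip a weight tensor through symmetric quantization,
    i.e. inject the quantization noise while staying in float."""
    q_max = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / q_max
    return scale * np.clip(np.round(w / scale), -q_max, q_max)

def post_quantization_finetune(weights, finetune_fn, iterations=3):
    """Alternate quantize/de-quantize with fine-tuning for a fixed
    number of iterations."""
    for _ in range(iterations):
        weights = {n: quantize_dequantize(w) for n, w in weights.items()}
        weights = finetune_fn(weights)  # stand-in for Kaldi training
    return weights
```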
| Model | No. of layers | Params | WER |
|---|---|---|---|
4 Experiments

All our experiments on reducing the parameters of TDNN-based acoustic models are trained with the Kaldi toolkit (i.e. the nnet3 model architecture). The AMs are trained with the LF-MMI training framework, considered to produce state-of-the-art performance for hybrid ASR systems. In this paper, we consider not only conventional triphone systems but also a monophone-based system. In the former case, the output layer consists of senones obtained from clustering of context-dependent phones. In the latter case, the output layer consists of only monophone outputs, which can be considered as yet another approach to reduce the computational complexity of ASR systems. The triphone-based AM uses position-dependent phones, giving a total of 346 phones including the silence and noise phones. The monophone-based AM uses position-independent phones, comprising 41 phones. The triphone-based AM produces 5984 output states, while the monophone-based AM produces 41.
The AMs are trained using conventional high-resolution MFCC features with speed-perturbed data. We did not include i-vectors. The TDNN and TDNN-F models use 7 layers with a hidden layer dimension of 625.
In this study we also train the TDNN-F model by increasing the number of layers until it reaches the number of parameters of the baseline TDNN (7M params). Table 3 shows that by using twice as many layers in TDNN-F as in TDNN, the same number of parameters is retained with an improved performance. The results presented in this table are rescored with a large LM trained on Librispeech.
The AMs are trained on 960h of Librispeech data. The LMs are also trained on Librispeech and are available for download from the web. Librispeech is a corpus of approximately 1000 hours of 16 kHz read English speech from the LibriVox project. The LibriVox project is responsible for the creation of approximately 8000 public domain audio books, the majority of which are in English. Most of the recordings are based on texts from Project Gutenberg, also in the public domain.
The quantization is performed in Pytorch. Quantization experiments are carried out for 16 bit and 8 bit integers in symmetric mode. As discussed in Section 3, the model and the features from Kaldi are loaded as Pytorch tensors with the help of the C++ wrapper.
The word recognition performance for all experiments is evaluated on the Librispeech test-clean set. The quantization experiments use a small LM, while the comparison of varying the number of TDNN-F AM layers uses a large LM.
4.1 Parameter reduction experiments
We compare floating-point vs. integer arithmetic inference for TDNN model with different quantization types (16-bit and 8-bit integer) and different quantization schemes, as discussed in Section 3.2. We also compare the quantization technique with the low-rank matrix factorization technique used during the training of the model.
Table 1 shows that weight-only quantization reduces the model size by 50% without a significant impact on the performance of the monophone-based AM. Quantizing both weights and activations reduces the model size further but increases the WER compared to weight-only quantization. Table 2 shows that quantizing both weights and activations outperforms weight-only quantization in the triphone system. In both monophone and triphone systems, post-quantization fine-tuning does not show any impact. The TDNN-F model reduces the model size by 40% with a loss in recognition performance of 2.7% (absolute) compared to the baseline TDNN. However, compared to the 8-bit and 16-bit quantized models, the loss in WER of TDNN-F is negligible (10.7% WER for the quantized model vs 10.8% for TDNN-F).
4.2 Quantization error
The quantization error is the norm between the weights and their de-quantized version. Table 4 shows the error for the monophone and triphone-based AMs with respect to int8 and int16 quantization. The high variation of the error in the triphone system is due to its large number of outputs.
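Assuming the Euclidean norm, the error reported in Table 4 corresponds to the following computation (illustrative NumPy sketch; the function name is ours):

```python
import numpy as np

def quantization_error(w, bits=8):
    """Norm between a weight tensor and its de-quantized version
    under symmetric integer quantization."""
    q_max = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / q_max
    w_hat = scale * np.clip(np.round(w / scale), -q_max, q_max)
    return float(np.linalg.norm(w - w_hat))
```

As expected, the error shrinks as the number of bits grows, since the quantization step is proportional to 2^-(b-1).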
5 Conclusion

We presented a study of the effect of quantizing acoustic model parameters in ASR. The experimental results reveal that parameter quantization can reduce the model size significantly while preserving a reasonable word recognition performance. TDNN-F models provide better performance than TDNN models when a higher number of layers is used. Quantization of the acoustic models can be further explored by fusing the TDNN, ReLU and Batchnorm layers. Since fine-tuning did not bring any significant improvements in our experiments, our future work will consider an implementation of quantization-aware training.
The quantization experiments are conducted in Pytorch, while the acoustic models are developed using the popular Kaldi toolkit. The implemented C++ wrappers, which allow interfacing with the parameters of Kaldi-based DNN acoustic models in Pytorch, will be offered to other researchers through a Github project.
This work was supported by the CTI Project “SHAPED: Speech Hybrid Analytics Platform for consumer and Enterprise Devices”. We wish to acknowledge Arash Salarian for providing us with valuable insights and suggestions regarding quantization. The work was also partially supported by the ATCO2 project, funded by the European Union under CleanSky EC-H2020 framework.
References

-  LeCun, Yann, and Yoshua Bengio. “Convolutional networks for images, speech, and time series.” The handbook of brain theory and neural networks 3361.10 (1995): 1995.
-  Graves, Alex, Abdel-rahman Mohamed, and Geoffrey Hinton. “Speech recognition with deep recurrent neural networks.” IEEE international conference on acoustics, speech and signal processing. IEEE, 2013.
-  Peddinti, Vijayaditya, Daniel Povey, and Sanjeev Khudanpur. “A time delay neural network architecture for efficient modeling of long temporal contexts.” Sixteenth Annual Conference of the International Speech Communication Association. 2015.
-  Daniel Povey, Arnab Ghoshal et al. “The Kaldi Speech Recognition Toolkit” IEEE 2011 workshop on automatic speech recognition and understanding. No. CONF. IEEE Signal Processing Society, 2011.
-  Benoit Jacob, Skirmantas Kligys et al. “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
-  Veselý, Karel, et al. “Sequence-discriminative training of deep neural networks.” Interspeech. Vol. 2013. 2013.
-  Wong, Jeremy HM, and Mark John Gales, “Sequence student-teacher training of deep neural networks.” 2016.
-  Francis Keith, William Hartmann, Man-hung Siu, Jeff Ma, Owen Kimball “Optimising Multilingual Knowledge Transfer For Time-Delay Neural Networks with Low Rank Factorization,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
-  Raghuraman Krishnamoorthi “Quantizing deep convolutional networks for efficient inference: A whitepaper” arXiv preprint arXiv:1806.08342 (2018).
-  Jian Xue, Jinyu Li, and Yifan Gong “Restructuring of Deep Neural Network Acoustic Models with Singular Value Decomposition” Interspeech. 2013.
-  Dan Povey et al. “Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks” Interspeech. 2018
-  Adam Paszke, Sam Gross et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library” Advances in Neural Information Processing Systems. 2019.
-  Martin Abadi, Paul Barham et al. “TensorFlow: A system for large-scale machine learning” 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016.
-  Dogan Can, Victor R. Martinez, Pavlos Papadopoulos, Shrikanth S. Narayanan “PYKALDI: A Python Wrapper for KALDI” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
-  Vassil Panayotov, Guoguo Chen, Daniel Povey, Sanjeev Khudanpur “Librispeech: An ASR Corpus Based On Public Domain Audio Books” 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015.