Quantization of Acoustic Model Parameters in Automatic Speech Recognition Framework

06/16/2020
by   Amrutha Prasad, et al.
Idiap Research Institute

Robust automatic speech recognition (ASR) systems exploit state-of-the-art deep neural network (DNN) based acoustic models (AM) trained with the Lattice Free-Maximum Mutual Information (LF-MMI) criterion and n-gram language models. These systems are quite large and require significant parameter reduction to operate on embedded devices. The impact of parameter quantization on the overall word recognition performance is studied in this paper. The following three approaches are presented: (i) an AM trained in the Kaldi framework with the conventional factorized TDNN (TDNN-F) architecture; (ii) the TDNN built in Kaldi loaded into the Pytorch toolkit using a C++ wrapper, with the weights and activation parameters quantized and the inference performed in Pytorch; (iii) post-quantization training for fine-tuning. Results obtained on the standard Librispeech setup provide an interesting overview of recognition accuracy with respect to the applied quantization scheme.


1 Introduction

Deep Neural Networks (DNNs) learn multiple levels of data representation in order to model complex relationships in the data. Conventional Acoustic Models (AM) used in an ASR framework are trained with neural network architectures such as Convolutional Neural Networks (CNN) [1], Recurrent Neural Networks (RNN) [2], and Time-Delay Neural Networks (TDNN) [3] with the Kaldi [4] toolkit. Such models often have millions of parameters, making them impractical to use on embedded devices such as the Raspberry Pi. To embed an ASR system on such devices, its footprint needs to be significantly reduced. One simple solution is to train a model with fewer parameters. However, reducing the model size usually decreases the performance of the system.

Previous research has shown several possible alternative approaches. Quantizing the model parameters from floating point values to integers is one popular approach. In [5], quantization methods are studied for CNN architectures in image classification and other computer vision problems. Results show that quantizing such models reduces the model size significantly without any impact on the performance. Another approach is teacher-student training: a larger model optimized for performance is trained first, and its output is then used to train a smaller model. Alternatively, in models such as TDNN-F [11], parameter reduction is integrated as part of the training. In this paper, we study the effect of quantizing the parameters of an AM used in ASR, with a focus on deploying it on embedded devices with low computational resources (especially memory). We present the impact on the performance of the ASR system when the AM is quantized from float32 to int8 or int16. The results of the quantization process are then compared to other parameter reduction techniques for automatic speech recognition models. We believe that the results obtained from our study have not been presented in the literature before, and can be of interest for researchers experimenting with interfacing the Kaldi and Pytorch [12] tools for ASR tasks.

The rest of the paper is organized as follows. Section 2 briefly describes current techniques used for parameter reduction of a model. This is followed by an overview of the quantization techniques and their application to AM training with the Kaldi toolkit (Section 3). Section 4 presents the experiments and the results. Finally, conclusions are provided in Section 5.

2 Related work

Speech recognition can be considered a sequence-to-sequence mapping problem in which a sequence of sounds is converted to a sequence of meaningful linguistic units (e.g. phones, syllables, words, etc.). In order to better distinguish different classes of sounds, it is useful to train with positive and negative examples. Hence, sequence-discriminative criteria such as Maximum Mutual Information (MMI) and state-level Minimum Bayes Risk (sMBR) can be applied. The former is now commonly known as lattice-free MMI (LF-MMI, or the chain model) [6]. This method can also be used without any Cross-Entropy (CE) initialization, leading to lower computation cost.

State-level sequence-discriminative training of DNNs starts from a set of alignments and lattices generated by decoding the training data with a Language Model (LM). For each training condition, the alignments and lattices are generated using the corresponding DNN trained with CE [6]. The cross-entropy trained models are also used as the starting point for sequence-discriminative training. Whereas sMBR training uses a word-level language model, LF-MMI training uses a phone-level LM. This simplification enables LF-MMI training to run on GPU clusters, and LF-MMI is considered the state-of-the-art AM training criterion for an ASR system. Hence our experiments consider AMs with the TDNN architecture.

Parameter reduction is a process that removes parts of the neural network while avoiding the loss of information the network needs for its decision process. It can be applied to already trained neural networks or implemented during training. Several different approaches can be considered:

  • Teacher-student approach to reduce the number of layers in the student neural network [7].

  • Reduce the size of the layers used in training the neural network through matrix factorization [8].

  • Reduce the hidden layer dimension (e.g. from 1024 to 512 in each layer of the neural network).

  • Reduce the number of hidden layers used in the network.

  • Quantization of model parameters (e.g. from 32 bit floating precision to 16 bit floating precision) [5][9].

Singular Value Decomposition (SVD) is one of the most popular methods; it can be applied to trained models to factorize a learned weight matrix as a product of two much smaller factors. SVD discards the smaller singular values, and the network parameters are then fine-tuned to obtain a parameter-reduced model [10].
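As a rough illustration of this idea (a sketch, not the exact procedure of [10]), the snippet below factorizes a single weight matrix with Pytorch and keeps only the r largest singular values; the matrix size and the retained rank are hypothetical choices for the example.

```python
import torch

def svd_restructure(W: torch.Tensor, r: int):
    """Factorize W (m x n) into two smaller matrices whose product approximates W,
    keeping only the r largest singular values."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    sqrt_S = torch.sqrt(S[:r])        # split the kept singular values between the factors
    A = U[:, :r] * sqrt_S             # m x r
    B = sqrt_S[:, None] * Vh[:r]      # r x n
    return A, B                       # A @ B holds m*r + r*n parameters instead of m*n

# Example: a 625 x 625 layer approximated with rank 128.
W = torch.randn(625, 625)
A, B = svd_restructure(W, r=128)
print((W - A @ B).norm() / W.norm())  # relative approximation error
```

After such a restructuring, the factorized layers are fine-tuned as described above to recover the lost accuracy.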

Another approach to enforce parameter reduction while training a neural network AM is to apply low-rank factorized layers [11]. In semi-orthogonal factorization, the parameter matrix M is factorized as a product of two matrices, M = AB, where A is a semi-orthogonal matrix and B has a smaller "interior" (i.e. rank) than M. This technique enables training a smaller network from scratch instead of using a pre-trained network for parameter reduction. The LF-MMI training also provides a stable training procedure for semi-orthogonal matrices.
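For intuition only, the sketch below (our illustration, not the actual Kaldi update rule from [11]) builds such a factorized layer by projecting the first factor onto the nearest semi-orthogonal matrix via SVD; the layer dimensions and the bottleneck size of 160 are assumptions made for the example.

```python
import torch

def nearest_semi_orthogonal(A: torch.Tensor) -> torch.Tensor:
    """Project A (m x r, m >= r) onto the nearest matrix satisfying A.T @ A = I."""
    U, _, Vh = torch.linalg.svd(A, full_matrices=False)
    return U @ Vh  # keep the singular vectors, set all singular values to 1

# A factorized layer M = A @ B with a bottleneck ("interior") dimension r = 160.
out_dim, in_dim, r = 625, 625, 160
A = nearest_semi_orthogonal(torch.randn(out_dim, r))      # semi-orthogonal factor
B = torch.randn(r, in_dim)                                 # low-rank bottleneck factor
M = A @ B                                                  # stands in for a full 625 x 625 matrix
print(torch.allclose(A.T @ A, torch.eye(r), atol=1e-4))    # True
```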

While semi-orthogonal factorization has been studied with TDNN-F (a variant of TDNN with residual connections), it has not been compared with other model reduction techniques. In our experiments, we present such a comparison while varying the number of layers.

3 Quantization

A popular technique to reduce the size of a model is quantization. This approach is widely applied in computer vision problems and is supported by many deep learning frameworks such as Pytorch [12] and TensorFlow [13]. However, applying quantization to an AM trained with the LF-MMI criterion using the Kaldi toolkit is not straightforward. The following subsections explain the standard quantization process for DNNs and how it is applied to our AMs.

3.1 Overview of quantization process

Quantization is a process of mapping a set of real valued inputs to a set of discrete valued outputs. Commonly used quantization types are 16 bit and 8 bit integers. Quantizing model parameters typically means decreasing the number of bits used to represent them; prior to this process the model is usually trained with IEEE float32 or float64. If the original model uses the float32 representation, its size can be reduced by a factor of 2 (with 16 bit quantization) or by a factor of 4 (with 8 bit quantization).

In addition to the quantization types, there are different quantization modes, namely symmetric and asymmetric quantization. A real valued variable in the range [x_min, x_max] is quantized to a range [q_min, q_max]. In symmetric quantization, the range corresponds to [-(2^(b-1) - 1), 2^(b-1) - 1], whereas in asymmetric quantization the range is [-2^(b-1), 2^(b-1) - 1]. In these intervals, b = 16 for 16 bit quantization and b = 8 for 8 bit quantization.

A real value x can be expressed as an integer x_q, given a scale and a zero_point [5]:

x_q = round(x / scale) + zero_point    (1)

In the above equation, the scale specifies the step size required to map the floating point values to integers, and the integer zero_point represents the floating point zero [5].

Given the minimum x_min and maximum x_max of a vector x, and the range [q_min, q_max] of the quantization scheme, the scale and zero_point are computed as follows [9]:

scale = (x_max - x_min) / (q_max - q_min)    (2)

zero_point = q_min - round(x_min / scale)    (3)

As mentioned in [5], for 8 bit integer quantization the values never reach -128, and hence we use q_min = -127 and q_max = 127.
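The following minimal Python sketch (with our own helper names, and assuming the [-127, 127] range discussed above as the default) puts equations (1)-(3) together to quantize and de-quantize a tensor:

```python
import torch

def compute_qparams(x: torch.Tensor, q_min: int = -127, q_max: int = 127):
    """Scale and zero-point from the tensor's min/max, as in eqs. (2) and (3)."""
    x_min, x_max = x.min().item(), x.max().item()
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = int(round(q_min - x_min / scale))
    return scale, zero_point

def quantize(x: torch.Tensor, scale: float, zero_point: int,
             q_min: int = -127, q_max: int = 127, dtype=torch.int8) -> torch.Tensor:
    """Map real values to integers as in eq. (1), clamped to the target range."""
    q = torch.round(x / scale) + zero_point
    return q.clamp(q_min, q_max).to(dtype)

def dequantize(q: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    """Map integers back to (approximate) real values."""
    return scale * (q.float() - zero_point)

w = torch.randn(625, 625)                 # a hypothetical weight matrix
scale, zp = compute_qparams(w)
w_hat = dequantize(quantize(w, scale, zp), scale, zp)
print((w - w_hat).norm())                 # residual introduced by quantization
```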

3.2 Quantization application

We implement the quantization algorithms in Pytorch, as it provides better support than Kaldi for the int8, uint8 and int16 types. The aim of our work is to port models trained in Kaldi so that they can run on embedded systems. Tools such as Pykaldi [14] already help users load Kaldi acoustic models for inference in Pytorch. However, they do not provide access to the model parameters by default. To support this work, we implemented a C++ wrapper that exposes the model parameters and input MFCC features as Pytorch tensors. The wrapper also allows us to write the models and ark (archive) files back to the Kaldi format.

Once the model is loaded as tensors, there are two options: we can quantize only the weights of the model, or quantize both the weights and the activations.

3.2.1 Quantization of weights only

Weight-only quantization is an approach in which only the weights of the neural network model are quantized. This approach is useful when only the model size needs to be reduced and the inference is carried out in floating-point precision. In our experiments, the weights are quantized in Pytorch and the inference is carried out in Kaldi.
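A minimal sketch of this scheme under our assumptions is shown below. It reuses the compute_qparams / quantize / dequantize helpers from the Section 3.1 sketch, and the state-dict layout is hypothetical; the weights are quantized and immediately de-quantized so that the model can be written back to Kaldi and decoded in floating point.

```python
import torch

def quantize_weights_only(state: dict, bits: int = 8) -> dict:
    """Quantize every weight tensor to `bits` bits, then de-quantize it so the
    model stays in a float32 layout for Kaldi inference (uses the helpers above)."""
    q_max = 2 ** (bits - 1) - 1                    # 127 for int8, 32767 for int16
    dtype = torch.int8 if bits == 8 else torch.int16
    out = {}
    for name, w in state.items():
        scale, zp = compute_qparams(w, q_min=-q_max, q_max=q_max)
        w_q = quantize(w, scale, zp, q_min=-q_max, q_max=q_max, dtype=dtype)
        out[name] = dequantize(w_q, scale, zp)
    return out

# Hypothetical usage: `kaldi_state` holds the TDNN weight tensors exposed by the
# C++ wrapper; the de-quantized copy is written back to Kaldi for decoding.
# kaldi_state_8bit = quantize_weights_only(kaldi_state, bits=8)
```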

3.2.2 Quantization of weights and activations

In order to reduce the model from 32 bit precision to 8 bit precision, both the weights and the activations must be quantized. Activations are quantized with the use of a calibration set to estimate their dynamic range. Our network architecture consists of TDNN layers, each followed by ReLU and BatchNorm layers. In our experiments, we quantize only the weights and the input activations to the TDNN layer, as depicted in Figure 1 (i.e., the integer arithmetic is applied only to the 1D convolution). Floating point operations are used in the ReLU and BatchNorm layers in order to simplify the implementation, as the main focus of this paper is to study the impact of quantization on AM weights and activations. The conventional word-recognition lattices are then generated by a Kaldi decoder (i.e. decoding is performed in Kaldi) using the Pytorch-generated likelihoods.
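The sketch below mirrors the data flow of Figure 1 under our assumptions: activations and weights are quantized symmetrically (zero-point 0), the 1D convolution runs on the integer-valued tensors, and the accumulator is mapped back to floating point before the bias is added. In the paper the activation range comes from a calibration set; here it is taken from the input itself purely for illustration, and the integer values are kept in float32 tensors because Pytorch's conv1d does not run on int8 CPU tensors.

```python
import torch
import torch.nn.functional as F

def quantized_conv1d(x: torch.Tensor, weight: torch.Tensor,
                     bias: torch.Tensor) -> torch.Tensor:
    """Simulated integer-arithmetic 1D convolution (cf. Figure 1)."""
    x_scale = x.abs().max().item() / 127          # symmetric scales, zero-point = 0
    w_scale = weight.abs().max().item() / 127
    x_q = torch.round(x / x_scale).clamp(-127, 127)
    w_q = torch.round(weight / w_scale).clamp(-127, 127)
    acc = F.conv1d(x_q, w_q)                      # integer-valued accumulator
    return acc * (x_scale * w_scale) + bias[None, :, None]  # back to float, add bias

# Hypothetical shapes: (batch, feat_dim, time) input, 625 output channels.
x = torch.randn(1, 40, 100)
w = torch.randn(625, 40, 3)
b = torch.randn(625)
y = quantized_conv1d(x, w, b)
```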

Figure 1: Block diagram of integer arithmetic inference with quantization of weights and activations. Input activations and weights are represented as 8-bit integers according to equation (1). The 1D convolution involves integer inputs and a 32-bit integer accumulator. The output of the convolution is mapped back to floating point and the bias is added.
Model Quantization (bits) WQ AQ Params Size WER %
Baseline TDNN - No No 7.9M 1x 8.1
TDNN 16 Yes No 7.9M 0.5x 10.7
TDNN 8 Yes No 7.9M 0.25x 10.7
TDNN 8 Yes Yes 7.9M 0.25x 17.3
TDNN - fine tuned 8 Yes Yes 7.9M 0.25x 18.5
TDNN-F - No No 3.14M 0.4x 10.8
Table 1: Comparing parameter reduction techniques for monophone-based TDNN acoustic model: Quantization (bits), Weight Quantization (WQ), Activation Quantization (AQ), Number of parameters, model size and Word-Error Rate (WER) [in %] when a small LM was used for decoding.
Model Quantization (bits) WQ AQ Params Size WER %
Baseline TDNN - No No 15.4M 1x 6.32
TDNN 16 Yes No 15.4M 0.5x 10.01
TDNN 8 Yes No 15.4M 0.25x 11.44
TDNN 8 Yes Yes 15.4M 0.25x 11.21
TDNN - fine tuned 8 Yes Yes 15.4M 0.25x 11.28
TDNN-F - No No 6.2M 0.4x 8.3
Table 2: Comparing parameter reduction techniques for triphone-based TDNN acoustic model: Quantization (bits), Weight Quantization (WQ), Activation Quantization (AQ), Number of parameters, model size and word error rate (WER) [in %] when a small LM was used for decoding.

3.2.3 Post quantization fine-tuning

Quantization reduces the precision of the model, which means that noise is added to the weights when they are quantized. In order to reduce the effect of this noise, a fine-tuning step is carried out. In this experiment, the quantized weights are first de-quantized and saved. This model is then loaded back into Kaldi and trained further for 2 epochs with a low learning rate. The process of quantizing and fine-tuning is carried out for three iterations, with the assumption that the final model, once quantized, converges to the baseline TDNN model.
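An illustrative outline of this loop is sketched below; the Kaldi-side steps (load_kaldi_model_as_tensors, save_tensors_as_kaldi_model, kaldi_finetune) and the learning rate value are hypothetical placeholders, not real APIs.

```python
import torch

def quantize_dequantize(state: dict, bits: int = 8) -> dict:
    """Inject quantization noise: symmetric quantize then de-quantize each weight."""
    q_max = 2 ** (bits - 1) - 1
    out = {}
    for name, w in state.items():
        scale = w.abs().max().item() / q_max
        out[name] = torch.round(w / scale).clamp(-q_max, q_max) * scale
    return out

# Hypothetical placeholders for the Kaldi side of the loop described above.
state = load_kaldi_model_as_tensors("final.mdl")                 # via the C++ wrapper
for it in range(3):
    state = quantize_dequantize(state, bits=8)
    save_tensors_as_kaldi_model(state, f"dequantized_{it}.mdl")
    state = kaldi_finetune(f"dequantized_{it}.mdl", epochs=2,
                           lr=1e-5)                              # low learning rate (value illustrative)
```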

Model No. of layers Params WER
TDNN 7 3.14M 5.4
TDNN-F 7 3.14M 7.3
TDNN-F 10 4.35M 5.7
TDNN-F 17 7.1M 5.2
Table 3: Comparison of TDNN-F model with varying number of layers in the monophone setup when a large LM was used for decoding.
Quantization (bits) Monophone Triphone
16 12.9 2.5
8 13.5 22.7
Table 4: Comparison of quantization error for the monophone and triphone-based TDNN acoustic models.

4 Experiments

All TDNN-based acoustic models used in our parameter reduction experiments are trained with the Kaldi toolkit (i.e. the nnet3 model architecture). The AMs are trained with the LF-MMI framework, which is considered to produce state-of-the-art performance for hybrid ASR systems. In this paper, we consider not only conventional triphone systems but also a monophone-based system. In the former case, the output layer consists of senones obtained from clustering context-dependent phones. In the latter case, the output layer consists of only monophone outputs, which can be considered yet another way to reduce the computational complexity of ASR systems. The triphone-based AM uses position-dependent phones, which results in a total of 346 phones including the silence and noise phones. The monophone-based AM uses position-independent phones, which comprise 41 phones. The triphone-based AM produces 5984 output states, while the monophone-based AM produces 41.

The AMs are trained on speed-perturbed data with conventional high-resolution MFCC features. We did not include i-vectors. The TDNN and TDNN-F models use 7 layers with a hidden layer dimension of 625.

In this study we also train TDNN-F models with an increasing number of layers until the parameter count reaches that of the baseline TDNN (7M params). Table 3 shows that with more than twice as many layers as the TDNN, the TDNN-F matches its number of parameters with improved performance. The results presented in this table are rescored with a large LM trained on Librispeech.

The AMs are trained on 960 hours of Librispeech [15] data. The LMs are also trained on Librispeech and are available for download from the web. Librispeech is a corpus of approximately 1000 hours of 16 kHz read English speech from the LibriVox project. The LibriVox project is responsible for the creation of approximately 8000 public domain audio books, the majority of which are in English. Most of the recordings are based on texts from Project Gutenberg, also in the public domain.

The quantization is performed in Pytorch. Quantization experiments are carried out for 16 bit and 8 bit integers in symmetric mode. As discussed in Section 3, the model and the features from Kaldi are loaded as Pytorch tensors with the help of the C++ wrapper.

Word recognition performance for all experiments is evaluated on the Librispeech test-clean set. The quantization experiments use a small LM, while the comparison across different numbers of TDNN-F layers uses a large LM.

4.1 Parameter reduction experiments

We compare floating-point vs. integer arithmetic inference for TDNN model with different quantization types (16-bit and 8-bit integer) and different quantization schemes, as discussed in Section 3.2. We also compare the quantization technique with the low-rank matrix factorization technique used during the training of the model.

Table 1 shows that weight-only quantization reduces the model size by 50% without a significant impact on the performance of the monophone-based AM. Quantizing both weights and activations reduces the model size further but increases the WER compared to weight-only quantization. Table 2 shows that quantizing both weights and activations slightly outperforms weight-only quantization in the triphone system. In both the monophone and triphone systems, post-quantization fine-tuning does not show any improvement. The TDNN-F model reduces the model size to 40% of the baseline with a loss in recognition performance of 2.7% (absolute) compared to the baseline TDNN. However, compared to the 8-bit and 16-bit quantized models, the loss in WER of TDNN-F is negligible (10.7% WER for the quantized model vs 10.8% WER for TDNN-F).

4.2 Quantization error

The quantization error is the norm between the weights and their de-quantized version. Table 4 shows this error for the monophone and triphone-based AMs for int8 and int16 quantization. The high variation of the error in the triphone system is due to its large number of outputs.
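For reference, an error of this kind can be computed per tensor as in the short sketch below (our own helper, using symmetric per-tensor quantization; the exact computation behind Table 4 may differ):

```python
import torch

def quantization_error(w: torch.Tensor, bits: int) -> float:
    """L2 norm between the weights and their quantize/de-quantize round trip."""
    q_max = 2 ** (bits - 1) - 1
    scale = w.abs().max().item() / q_max
    w_hat = torch.round(w / scale).clamp(-q_max, q_max) * scale
    return (w - w_hat).norm().item()

w = torch.randn(625, 625)  # hypothetical weight matrix
print(quantization_error(w, 16), quantization_error(w, 8))
```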

5 Conclusions

We presented a study of the effect of quantizing the acoustic model parameters in ASR. The experimental results reveal that parameter quantization can reduce the model size significantly while preserving a reasonable word recognition performance. TDNN-F models provide better performance than TDNN models when a larger number of layers is used. Quantization of the acoustic models can be further explored by fusing the TDNN, ReLU and BatchNorm layers. Since fine-tuning did not bring any significant improvements in our experiments, our future work will consider an implementation of quantization-aware training.

The quantization experiments are conducted in Pytorch, while the acoustic models are built with the popular Kaldi toolkit. The implemented C++ wrapper, which allows interfacing the parameters of Kaldi-based DNN acoustic models with Pytorch, will be offered to other researchers through a Github project.

6 Acknowledgements

This work was supported by the CTI Project “SHAPED: Speech Hybrid Analytics Platform for consumer and Enterprise Devices”. We wish to acknowledge Arash Salarian for providing us with valuable insights and suggestions regarding quantization. The work was also partially supported by the ATCO2 project, funded by the European Union under CleanSky EC-H2020 framework.

References

  • [1] LeCun, Yann, and Yoshua Bengio. “Convolutional networks for images, speech, and time series.” The handbook of brain theory and neural networks 3361.10 (1995): 1995.
  • [2] Graves, Alex, Abdel-rahman Mohamed, and Geoffrey Hinton. “Speech recognition with deep recurrent neural networks.” IEEE international conference on acoustics, speech and signal processing. IEEE, 2013.
  • [3] Peddinti, Vijayaditya, Daniel Povey, and Sanjeev Khudanpur. “A time delay neural network architecture for efficient modeling of long temporal contexts.” Sixteenth Annual Conference of the International Speech Communication Association. 2015.
  • [4] Daniel Povey, Arnab Ghoshal et al. “The Kaldi Speech Recognition Toolkit” IEEE 2011 workshop on automatic speech recognition and understanding. No. CONF. IEEE Signal Processing Society, 2011.
  • [5] Benoit Jacob, Skirmantas Kligys et al. “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
  • [6] Veselý, Karel, et al. “Sequence-discriminative training of deep neural networks.” Interspeech. Vol. 2013. 2013.
  • [7] Wong, Jeremy HM, and Mark John Gales, “Sequence student-teacher training of deep neural networks.” 2016.
  • [8] Francis Keith, William Hartmann, Man-hung Siu, Jeff Ma, Owen Kimball “Optimising Multilingual Knowledge Transfer For Time-Delay Neural Networks with Low Rank Factorization,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
  • [9] Raghuraman Krishnamoorthi “Quantizing deep convolutional networks for efficient inference: A whitepaper” arXiv preprint arXiv:1806.08342 (2018).
  • [10] Jian Xue, Jinyu Li, and Yifan Gong. “Restructuring of Deep Neural Network Acoustic Models with Singular Value Decomposition.” Interspeech. 2013.
  • [11] Dan Povey et al. “Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks” Interspeech. 2018
  • [12] Adam Paszke, Sam Gross et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library” Advances in Neural Information Processing Systems. 2019.
  • [13] Martin Abadi, Paul Barham et al. “TensorFlow: A System for Large-Scale Machine Learning.” 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016.
  • [14] Dogan Can, Victor R. Martinez, Pavlos Papadopoulos, Shrikanth S. Narayanan “PYKALDI: A Python Wrapper for KALDI” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
  • [15] Vassil Panayotov, Guoguo Chen, Daniel Povey, Sanjeev Khudanpur “Librispeech: An ASR Corpus Based On Public Domain Audio Books” 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015.