Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers

11/29/2021
by   Junhao Xu, et al.
Microsoft

The high memory consumption and computational costs of recurrent neural network language models (RNNLMs) limit their wider application on resource constrained devices. In recent years, neural network quantization techniques that are capable of producing extremely low-bit compression, for example, binarized RNNLMs, have been gaining increasing research interest. Directly training quantized neural networks is difficult. By formulating quantized RNNLMs training as an optimization problem, this paper presents a novel method to train quantized RNNLMs from scratch using alternating direction methods of multipliers (ADMM). This method can also flexibly adjust the trade-off between the compression rate and model performance using tied low-bit quantization tables. Experiments on two tasks, Penn Treebank (PTB) and Switchboard (SWBD), suggest the proposed ADMM quantization achieved a model size compression factor of up to 31 times over the full precision baseline RNNLMs. A 5 times faster convergence in model training over the baseline binarized RNNLM quantization was also obtained. Index Terms: Language models, Recurrent neural networks, Quantization, Alternating direction methods of multipliers.


1 Introduction

RNNLMs are widely used in state-of-the-art speech recognition systems. Their high memory consumption and computational costs limit their wider application on resource constrained devices. In order to address this issue for RNNLMs, and for deep learning in general, a wide range of deep model compression approaches have been proposed, including teacher-student based transfer learning [hinton2015distilling] [chebotar2016distilling] [huang2018knowledge], low rank matrix factorization [sainath2013low] [jaderberg2014speeding] [lebedev2014speeding] [tai2015convolutional] [sindhwani2015structured], and sparse weight matrices [liu2015sparse] [han2015learning] [wen2016learning]. In addition, a highly efficient family of compression techniques based on deep neural network quantization that are capable of producing extremely low bit representations [courbariaux2014training] [courbariaux2015binaryconnect] [courbariaux2016binarized], for example, binarized RNNLMs [liu2018binarized], is gaining increasing research interest.

Earlier forms of deep neural network (DNN) quantization methods compress well-trained full precision models off-line [gong2014compressing] [chen2015compressing]. In [chen2015compressing], a hash function was used to randomly group the connection weight parameters into several shared values. Weights in convolutional layers were quantized in [gong2014compressing]. In order to reduce the inconsistency in the error cost function between the full precision model training and the subsequent quantization stages, later research aimed at directly training a low bit neural network from scratch [soudry2014expectation] [courbariaux2016binarized] [liu2018binarized].

The key challenge in these approaches is that gradient descent methods and the back-propagation (BP) algorithm cannot be directly applied in quantized model training when the weights are restricted to discrete values. To this end, there have been two solutions to this problem in the machine learning community [soudry2014expectation] [courbariaux2016binarized]. A Bayesian approach was proposed in [soudry2014expectation] to allow a posterior distribution over discrete weight parameters to be estimated, with the subsequent model quantization using samples drawn from that distribution. In [courbariaux2016binarized], low precision binarized parameters were first used in the forward pass to compute the error loss before full precision parameters were used in the backward pass to propagate the gradients. It was further suggested in [liu2018binarized] for RNNLMs that extra partially quantized linear layers containing binary weight matrices, full precision biases and additional scaling parameters need to be added to mitigate the performance degradation due to model compression. A compression ratio of 11.3 was reported in [liu2018binarized] on PTB and SWBD data without performance loss.

In this paper, by formulating quantized RNNLMs training as an optimization problem, a novel method based on alternating direction methods of multipliers (ADMM) [boyd2011distributed] [leng2018extremely] is proposed. Two sets of parameters, a full precision network and the optimal quantization table, are considered in a decomposed dual ascent scheme and optimized in an alternating fashion iteratively using an augmented Lagrangian. This algorithm draws strength from both the decomposability of dual ascent schemes and the stable convergence of multiplier methods. In order to account for the detailed parameter distributions at a layer or node level within the network, locally shared quantization tables trained using ADMM are also proposed to allow fine-grained and flexible adjustment over the trade-off between model compression rate and performance.

The main contributions of this paper are summarized as below. First, to the best of our knowledge, this paper is the first work to introduce ADMM based RNNLMs quantization for speech recognition tasks. Previous research on low bit quantization [courbariaux2016binarized] of RNNLMs [liu2018binarized] focused on using different parameter precisions in the error forwarding and gradient backward propagation stages. The earlier use of alternating methods in off-line quantization of well trained DNNs [xu2018alternating] also does not allow low bit quantized models to be directly trained from scratch, as considered in this paper. Second, the previous application of ADMM based DNN quantization was restricted to computer vision tasks [leng2018extremely]. In addition, a globally tied quantization table was applied to all parameters in the network, thus providing limited flexibility to account for the detailed local parameter distributions considered in this paper. We evaluate the performance of the proposed ADMM RNNLMs quantization method on two tasks targeting primarily speech recognition applications, Penn Treebank and Switchboard, in comparison against the baseline binarized RNNLMs quantization in terms of the trade-off between model compression factor, perplexity and speech recognition error rate.

The rest of the paper is organized as follows. RNNLMs are reviewed in Section 2. A general neural network quantization scheme is described in Section 3. Section 4 presents our ADMM based RNNLMs quantization in detail. Experiments and results are shown in Section 5. Finally, conclusions and future work are discussed in Section 6.

2 Recurrent Neural Network LMs

The recurrent neural network language models (RNNLMs) considered in this paper compute the word probability by

P(w_t | w_1, ..., w_{t-1}) \approx P(w_t | h_{t-1})    (1)

where h_{t-1} \in \mathbb{R}^D is the hidden state that attempts to encode the history information into a D-dimensional vector representation, where D is the number of hidden nodes.

In RNNLMs, a word w_t is represented by a |V|-dimensional one-hot vector x_t, where |V| is the vocabulary size. To process sparse data, the one-hot vector is first projected into a d-dimensional (d \ll |V|) continuous space [bengio2003neural], where d is considered as the embedding size:

e_t = W^e x_t    (2)

where W^e \in \mathbb{R}^{d \times |V|} is a projection matrix to be trained. After the word embedding layer, the hidden state is calculated recursively through a gating function h_t = g(h_{t-1}, e_t), which is a vector function that controls the amount of information inherited from h_{t-1} in the current "memory" state h_t. Currently, long short-term memory (LSTM) [hochreiter1997lstm] RNNLMs [sundermeyer2012lstm] define the state-of-the-art performance.

In order to solve the problem of vanishing gradients, LSTM introduces another recursively computed variable c_t, a memory cell, which aims to preserve the historical information over a longer time window. At time t four gates are computed – the forget gate f_t, the input gate i_t, the cell gate g_t and the output gate o_t:

f_t = \sigma(W_f e_t + U_f h_{t-1} + b_f)    (3)
i_t = \sigma(W_i e_t + U_i h_{t-1} + b_i)    (4)
g_t = \tanh(W_g e_t + U_g h_{t-1} + b_g)    (5)
o_t = \sigma(W_o e_t + U_o h_{t-1} + b_o)    (6)

where \sigma(v) = 1 / (1 + e^{-v}) is the sigmoid function applied element-wise for any v. With the four gating outputs, we update

c_t = f_t \odot c_{t-1} + i_t \odot g_t    (7)
h_t = o_t \odot \tanh(c_t)    (8)

where \odot is the Hadamard product.
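To make the model described above concrete, the following is a minimal PyTorch sketch of an LSTM RNNLM (word embedding layer, a single LSTM recurrent layer, and an output projection). The 200-node hidden layer matches the setup stated later in this paper; the 200-dimensional embedding size, class and variable names are illustrative assumptions, not taken from any released implementation.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Minimal LSTM RNNLM: word embedding -> single LSTM layer -> output projection."""

    def __init__(self, vocab_size, embed_size=200, hidden_size=200):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)            # Eq. (2)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)   # Eqs. (3)-(8)
        self.output = nn.Linear(hidden_size, vocab_size)                 # next-word scores, Eq. (1)

    def forward(self, word_ids, state=None):
        emb = self.embedding(word_ids)          # (batch, time, embed_size)
        hidden, state = self.lstm(emb, state)   # hidden states h_t for every time step
        return self.output(hidden), state       # unnormalised next-word scores

# usage: next-word cross-entropy loss over a toy batch of word-id sequences
model = LSTMLanguageModel(vocab_size=10000)
inputs = torch.randint(0, 10000, (10, 35))      # batch of 10 sentences, 35 words each
targets = torch.randint(0, 10000, (10, 35))
logits, _ = model(inputs)
loss = nn.functional.cross_entropy(logits.reshape(-1, 10000), targets.reshape(-1))
```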

3 Neural Network Quantization

The standard n-bit quantization problem for neural networks considers, for any full precision weight parameter \theta, finding its closest discrete approximation from the following quantization table

Q_n = \{0, \pm 1, \pm 2, ..., \pm 2^{n-1}\}    (9)

as

\hat{\theta} = \arg\min_{q \in Q_n} |\theta - q|    (10)

Further simplification of the quantization table of Equation (9) leads to either the binarized \{-1, +1\} [rastegari2016xnor] or the ternary valued \{-1, 0, +1\} [li2016ternary] quantization.

It is assumed in the above quantization that a single, globally shared quantization table is applied to all weight parameters. In order to account for the detailed local parameter distributional properties, and more importantly to flexibly adjust the trade-off between model compression ratio and performance degradation, the following more general form of quantization can be used for each parameter \theta^{(l)} within any weight cluster, for example, all weight parameters of the same layer:

\hat{\theta}^{(l)} = \arg\min_{q \in Q^{(l)}} |\theta^{(l)} - q|    (11)

The locally shared quantization table is given by

Q^{(l)} = \alpha^{(l)} \{0, \pm 1, \pm 2, ..., \pm 2^{n-1}\}    (12)

where \alpha^{(l)} is a scaling factor applied to the original discrete quantization table and shared locally among clusters of weight parameters. The tying of quantization tables may be flexibly performed at either node level, layer level, or in the extreme case at the level of individual parameters (equivalent to no quantization being applied).

Intuitively, the larger the quantization table used, the smaller the compression rate after quantization and the smaller the expected performance degradation. A projection from the original full precision parameters to the quantized low precision values needs to be found using Equation (11) during the training phase.
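As an illustration of the projection in Equations (11) and (12), the following minimal PyTorch sketch maps each weight in a cluster (for example a layer) to the nearest entry of a locally shared table \alpha {0, \pm 1, ..., \pm 2^{n-1}}. The function names and the initial choice of \alpha are assumptions made for the example, not part of the paper's implementation.

```python
import torch

def quantization_table(n_bits, alpha=1.0):
    """Build the shared table alpha * {0, +/-1, ..., +/-2^(n-1)} of Equation (12)."""
    levels = [0.0] + [float(s * 2 ** k) for k in range(n_bits) for s in (1, -1)]
    return alpha * torch.tensor(sorted(levels))

def project_to_table(weights, table):
    """Map each weight to its nearest table entry, i.e. the projection of Equation (11)."""
    dist = (weights.reshape(-1, 1) - table.reshape(1, -1)).abs()   # (num_weights, num_levels)
    nearest = dist.argmin(dim=1)                                   # index of the closest level
    return table[nearest].reshape(weights.shape)

# usage: 2-bit quantization of one layer's weight matrix with a locally shared scale
w = torch.randn(200, 200)
alpha = w.abs().mean()   # illustrative initial scale; the ADMM training below re-estimates it
q = project_to_table(w, quantization_table(n_bits=2, alpha=alpha))
```

Node level tying would simply maintain a separate \alpha per node (row of the weight matrix) and apply the same projection row by row.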

4 RNNLMs Quantization Using ADMM

Alternating direction methods of multipliers (ADMM) is a powerful optimization technique. It decomposes a dual ascent problem into alternating updates over two sets of variables. In the context of the RNNLMs quantization problem considered here, these refer to the full precision model weight update and the discrete quantization table estimation. In addition to the standard Lagrangian term taking the form of a dot product between the multiplier variable \lambda and the quantization error \theta - Q(\theta), it is also useful for alternating direction methods to introduce an additional quadratic penalty term to form an augmented Lagrangian [boyd2011distributed], which improves the robustness and convergence speed of the algorithm. The overall Lagrange function is formulated as:

L_\rho(\theta, Q, \lambda) = F(\theta) + \lambda^\top (\theta - Q(\theta)) + (\rho / 2) \| \theta - Q(\theta) \|_2^2    (13)

where F(\theta) is the cross-entropy loss of the neural network and \theta are the network parameters. Q(\theta) represents the quantization of the parameters calculated from the projection of Equation (11). \rho is the penalty parameter and \lambda denotes the Lagrangian multiplier.

Further rearranging Equation (13), absorbing the multiplier into the quadratic term (the scaled multiplier \lambda / \rho is written again as \lambda), leads to the following loss function:

L_\rho(\theta, Q, \lambda) = F(\theta) + (\rho / 2) \| \theta - Q(\theta) + \lambda \|_2^2 - (\rho / 2) \| \lambda \|_2^2    (14)

When performed at the k-th iteration, the algorithm includes three stages. For simplicity, we assume a globally shared quantization table with a single scaling factor to be learned. The following iterative updates can be extended to the case where multiple shared quantization tables and associated scaling factors in Equation (12) are used.

1. Full precision weight update

The following update is used for the full precision weight parameters \theta:

\theta^{k+1} = \arg\min_{\theta} L_\rho(\theta, Q^k, \lambda^k)    (15)

where Q^k and \lambda^k are the quantized weights and the error variable at the k-th iteration. The gradient of the loss function in Equation (14) w.r.t. \theta is calculated as follows:

\partial L_\rho / \partial \theta = \partial F / \partial \theta + \rho (\theta - Q^k + \lambda^k)    (16)

It is found in practice that the quadratic term of the augmented Lagrangian of Equation (14) can dominate the loss function computation and lead to a poor local optimum. One solution to this problem is to perform the gradient calculation one step ahead to improve convergence. This is referred to as the extra-gradient method [Korpelevi1976An]:

\tilde{\theta} = \theta - \eta_1 \nabla_\theta L_\rho(\theta), \quad \theta \leftarrow \theta - \eta_2 \nabla_\theta L_\rho(\tilde{\theta})    (17)

where \tilde{\theta} represents a temporary variable storing the intermediate parameters, and \eta_1 and \eta_2 are separate learning rates.
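The extra-gradient update of Equations (15)-(17) can be sketched as follows for a single weight tensor; ce_grad_fn is assumed to return the cross-entropy gradient \partial F / \partial \theta (for example from a PyTorch backward pass), and all names are illustrative.

```python
def extra_gradient_step(theta, quantized, lam, rho, ce_grad_fn, eta1=1.0, eta2=1.0):
    """One extra-gradient update of the full precision weights (Eqs. (15)-(17))."""
    def lagrangian_grad(t):
        # Eq. (16): dF/dtheta + rho * (theta - Q^k + lambda^k)
        return ce_grad_fn(t) + rho * (t - quantized + lam)

    theta_tilde = theta - eta1 * lagrangian_grad(theta)    # look-ahead step
    return theta - eta2 * lagrangian_grad(theta_tilde)     # update using the look-ahead gradient
```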

2. Quantization variables update

The discrete quantization variables and the shared scaling factor can be solved by minimizing the following:

\{\alpha^{k+1}, Q^{k+1}\} = \arg\min_{\alpha, Q} \| \theta^{k+1} + \lambda^k - \alpha Q \|_2^2    (18)

where Q^{k+1} is calculated, given the current scaling factor \alpha, by projecting each scaled parameter onto the nearest level of the base table in Equation (9):

Q^{k+1} = \Pi_{Q_n}\big( (\theta^{k+1} + \lambda^k) / \alpha \big)    (19)

The scaling factor is then updated, given Q^{k+1}, as

\alpha^{k+1} = \big( (\theta^{k+1} + \lambda^k)^\top Q^{k+1} \big) / \big( (Q^{k+1})^\top Q^{k+1} \big)    (20)

\alpha and Q are updated iteratively in an alternating way until convergence is reached.
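A minimal sketch of this alternating update, reusing quantization_table() and project_to_table() from the earlier example; the fixed number of inner iterations is an assumption made for the illustration.

```python
import torch

def alternating_quantize(v, n_bits, num_iters=10):
    """Alternately estimate the shared scale alpha and discrete levels q (Eqs. (18)-(20)).

    v is the flattened vector theta^{k+1} + lambda^k for one tied weight cluster.
    """
    base = quantization_table(n_bits)     # base table {0, +/-1, ..., +/-2^(n-1)} of Eq. (9)
    alpha = torch.tensor(1.0)             # scaling factors are initialized to one in the paper
    for _ in range(num_iters):
        q = project_to_table(v / alpha, base)          # Eq. (19): nearest base level
        alpha = (v @ q) / (q @ q).clamp(min=1e-8)      # Eq. (20): least-squares refit of alpha
    return alpha, q
```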

3. Error update

The Lagrange multiplier variable \lambda, now encoding the accumulated quantization errors computed at each iteration, is updated as

\lambda^{k+1} = \lambda^k + \theta^{k+1} - Q(\theta^{k+1})    (21)

In all experiments of this paper the scaling factors are initialized to one. The above ADMM quantization algorithm can be executed iteratively until convergence, measured in terms of validation data entropy, is reached. Alternatively, a fixed number of iterations, for example 50, was used throughout this paper. The best performing model was then selected over all the intermediate quantized models obtained at each iteration.
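Putting the three stages together, the sketch below shows one ADMM training iteration for a single tied weight cluster, reusing extra_gradient_step() and alternating_quantize() from the earlier examples. It illustrates the procedure described in this section under the stated assumptions rather than reproducing the authors' implementation.

```python
def admm_iteration(theta, quantized, lam, n_bits, rho, ce_grad_fn):
    """One ADMM iteration: theta, quantized and lam enter as theta^k, Q(theta^k) and lambda^k."""
    # Stage 1: full precision weight update against Q^k and lambda^k (Eqs. (15)-(17))
    theta = extra_gradient_step(theta, quantized, lam, rho, ce_grad_fn)

    # Stage 2: re-estimate the shared scaling factor and discrete levels (Eqs. (18)-(20))
    alpha, q = alternating_quantize((theta + lam).flatten(), n_bits)
    quantized = (alpha * q).reshape(theta.shape)

    # Stage 3: accumulate the quantization error into the multiplier (Eq. (21))
    lam = lam + theta - quantized
    return theta, quantized, lam
```

In a full system this iteration would wrap one pass of mini-batch SGD over the RNNLM cross-entropy loss, maintain a separate (\alpha, Q) pair for every tied cluster (node or layer), and be repeated for the fixed budget of around 50 iterations used in the experiments below.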

5 Experiments

In this section, we evaluate the performance of quantized RNNLMs in terms of the trade-off between the compression ratio and the perplexity (PPL) measure, combined with the word error rate (WER) obtained in automatic speech recognition (ASR) tasks. All the models are implemented using the Python GPU computing library PyTorch [paszke2017automatic]. For all RNNLMs, the recurrent layer is set to be a single LSTM layer with 200 hidden nodes.

In all models, parameters are updated in mini-batch mode (10 sentences per batch) using the standard stochastic gradient descent (SGD) algorithm with an initial learning rate of 1, optionally within the ADMM based quantization of Section 4. In our experiments, all RNNLMs were also further interpolated with 4-gram LMs [emami2007empirical] [park2010improved] [le2012structured]. The weight of the 4-gram LM is determined using the EM algorithm on a validation set.
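For reference, the interpolation takes the standard linear form P(w|h) = \lambda P_rnn(w|h) + (1 - \lambda) P_4g(w|h), and \lambda can be estimated on held-out data with a simple EM procedure over the per-word probabilities of the two LMs. The sketch below is a generic two-component mixture weight estimator with illustrative names, not the specific toolkit used in the paper.

```python
def estimate_interpolation_weight(p_rnn, p_ng, num_iters=20):
    """EM re-estimation of lambda in lambda * P_rnn + (1 - lambda) * P_ng.

    p_rnn and p_ng are per-word probabilities (both > 0) assigned to the same
    validation word sequence by the RNNLM and the 4-gram LM respectively.
    """
    lam = 0.5
    for _ in range(num_iters):
        # E-step: posterior probability that each word came from the RNNLM component
        post = [lam * pr / (lam * pr + (1.0 - lam) * pn) for pr, pn in zip(p_rnn, p_ng)]
        # M-step: the new weight is the average posterior over the validation words
        lam = sum(post) / len(post)
    return lam
```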

5.1 Experiments on Penn Treebank Corpus

We first analyze the performance of the ADMM based quantization, the binarized LSTM language model (BLLM) [liu2018binarized] with and without the additional linear layer, and the standard full precision RNNLM on the Penn Treebank (PTB) corpus, using its standard vocabulary and training, development and test partitions. The PPL results are shown in Table 1.

RNNLMs                               Tying   Quan. Set   Model Size (MB)   Comp. Ratio   PPL
STD (10 ep.)                         -       -           16.53             1             114.4
Bin (50 ep.) [liu2018binarized]      -                   0.52              31.8          131.6
Bin (400 ep.) [liu2018binarized]     -                   0.52              31.8          126.3
Bin+Lin (50 ep.) [liu2018binarized]  -                   0.53              31.2          121.5
Bin+Lin (400 ep.) [liu2018binarized] -                   0.53              31.2          116.2
ADMM (50 ep.)                        NoTie   -           16.53             1             114.6
ADMM (50 ep.)                        Node                0.65              25.4          117.2
ADMM (50 ep.)                        Node                1.32              12.5          119.3
ADMM (50 ep.)                        Node                1.32              12.5          118.4
ADMM (50 ep.)                        Node                2.01              8.2           115.9
ADMM (50 ep.)                        Layer               0.52              31.8          121.8
ADMM (50 ep.)                        Layer               1.06              15.6          123.1
ADMM (50 ep.)                        Layer               1.06              15.6          122.7
ADMM (50 ep.)                        Layer               1.57              10.5          120.4

Table 1: Performance and compression ratio of quantized LSTM RNNLMs on the PTB corpus: full precision baseline with no quantization (STD), binarized models with or without partially quantized linear layers (Bin+Lin or Bin) trained using 50 or 400 epochs with hidden size 200, and ADMM quantized models with layer, node or no tying of quantization tables of varying #bits.
RNNLMs             Tying   Quan. Set   Model Size (MB)   Comp. Ratio   PPL     WER            WER+4gram
                                                                               swbd   callhm  swbd   callhm
STD (8 ep.)        -       -           48.11             1             89.3    11.4   23.9    11.3   23.2
Bin (50 ep.)       -                   1.51              31.8          128.1   14.1   29.0    13.5   26.5
Bin (250 ep.)      -                   1.51              31.8          123.4   13.6   27.8    12.7   25.7
Bin+Lin (50 ep.)   -                   1.52              31.6          103.7   12.9   25.8    11.9   24.9
Bin+Lin (250 ep.)  -                   1.52              31.6          96.7    12.2   25.2    11.9   24.5
ADMM (50 ep.)      NoTie   -           48.11             1             89.9    11.4   23.9    11.3   23.2
ADMM (50 ep.)      Node                1.76              27.3          98.2    12.1   25.0    11.9   24.2
ADMM (50 ep.)      Node                3.53              13.6          97.9    12.1   25.1    11.8   24.4
ADMM (50 ep.)      Node                3.53              13.6          98.3    12.2   25.1    11.9   24.4
ADMM (50 ep.)      Node                5.30              9.1           95.6    11.9   24.9    11.7   24.0
ADMM (50 ep.)      Layer               1.51              31.8          100.3   12.2   25.3    11.9   24.6
ADMM (50 ep.)      Layer               3.10              15.5          101.1   12.8   25.3    12.4   24.7
ADMM (50 ep.)      Layer               3.10              15.5          102.4   12.9   25.5    12.5   24.9
ADMM (50 ep.)      Layer               4.61              10.4          99.5    12.6   25.1    12.3   24.4

Table 2: Perplexity (PPL) and word error rate (WER) performance and compression ratio of quantized LSTM RNNLMs on the SWBD dataset: full precision baseline with no quantization (STD), binarized models with or without partially quantized linear layers (Bin+Lin or Bin) trained using 50 or 250 epochs, and ADMM quantized models with layer, node or no tying of quantization tables of varying #bits.

The performance of various quantized RNNLMs using binarization and ADMM optimization is presented in Table 1, which includes the full precision baseline with no quantization (STD), binarized models with or without partially quantized linear layers (Bin+Lin or Bin) trained using 50 or 400 epochs, and ADMM quantized models with layer, node or no tying of quantization tables of varying #bits.

There are several trends that can be found in Table 1. First, the baseline binarization can obtain a model compression factor of up to approximately 31 times. In order to achieve its best perplexity of 116.2 (line 5 in Table 1), it requires both the additional partially quantized linear layers and 400 epochs to reach convergence in training. Even then, there is a small perplexity increase of 1.8 against the full precision baseline system. Second, the ADMM based quantization provides much faster training, requiring 50 epochs to converge for all ADMM quantized models, which is nearly 4 times faster in convergence than the Bin+Lin binarized quantization baseline. Finally, the use of locally shared quantization tables in the ADMM systems allows flexible adjustment of the trade-off between model performance and compression ratio. To achieve the largest compression ratio of 31.8, the layer level tied binarized ADMM quantization system (line 11 in Table 1) should be used. On the other hand, the best performance is obtained using node level tying (line 10 in Table 1), giving a perplexity of 115.9.

5.2 Experiments on Conversational Telephone Speech

To further evaluate the performance of the proposed ADMM based quantization of RNNLMs for speech recognition, we also used the Switchboard (SWBD) conversational telephone dataset. The SWBD system has 300 hours of conversational telephone speech from Switchboard I for acoustic modeling, and 3.6M words of acoustic transcription with a 30k word lexicon for language modeling. The acoustic model is a minimum phone error (MPE) trained hybrid DNN, the details of which can be found in [liu2018limited]. The baseline 4-gram language model was used to generate the n-best lists. The Hub5'00 data set with Switchboard (swbd) and CallHome (callhm) subsets was used in evaluation. Perplexity and n-best list rescoring word error rate performance of various quantized RNNLMs using binarization and ADMM optimization are shown in Table 2.

A similar set of experiments as in Table 1 was then conducted on the SWBD data, and similar trends can be found. First, the ADMM based quantization converges 3 times faster than the binarized RNNLM (Bin+Lin) (line 5 in Table 2). As can be seen from Figure 1, our ADMM based quantization system only needs about 50 epochs (23 min per epoch) to reach convergence, while the baseline binarized RNNLMs (Bin and Bin+Lin) take as many as about 250 epochs (10 min per epoch). Second, the flexibility of ADMM based quantization in adjusting the trade-off between the compression ratio and model performance is clearly shown again in Table 2. The largest compression ratio of 31.8 was obtained using layer level tying and the binary quantization set (line 11 in Table 2). The lowest word error rate of 24.0 on the callhm data, together with a compression ratio of 9.1, was achieved by the node level tying quantization (line 10 in Table 2).

Figure 1: Convergence speed comparison on SWBD between the baseline full precision model, binarized RNNLMs with and without additional linear layers, and the ADMM system with layer level tying and a binary quantization set (line 11 in Table 2).

6 Conclusions

This paper investigates alternating direction methods of multipliers (ADMM) based optimization to directly train low-bit quantized RNNLMs from scratch. Experimental results conducted on multiple tasks suggest the proposed technique can achieve faster convergence than the baseline binarized RNNLMs quantization, while producing comparable model compression ratios. Future research will investigate the application of ADMM based quantization techniques to more advanced forms of neural language models and acoustic models for speech recognition.

7 Acknowledgement

This research is supported by Hong Kong Research Grants Council General Research Fund No.14200218 and Shun Hing Institute of Advanced Engineering Project No.MMT-p1-19.

References