Toward Computation and Memory Efficient Neural Network Acoustic Models with Binary Weights and Activations

Neural network acoustic models have significantly advanced state of the art speech recognition over the past few years. However, they are usually computationally expensive due to the large number of matrix-vector multiplications and nonlinearity operations. Neural network models also require significant amounts of memory for inference because of the large model size. For these two reasons, it is challenging to deploy neural network based speech recognizers on resource-constrained platforms such as embedded devices. This paper investigates the use of binary weights and activations for computation and memory efficient neural network acoustic models. Compared to real-valued weight matrices, binary weights require much fewer bits for storage, thereby cutting down the memory footprint. Furthermore, with binary weights or activations, the matrix-vector multiplications are turned into addition and subtraction operations, which are computationally much faster and more energy efficient for hardware platforms. In this paper, we study the applications of binary weights and activations for neural network acoustic modeling, reporting encouraging results on the WSJ and AMI corpora.


1 Introduction

Neural networks have been shown to be extremely powerful in a wide range of machine learning tasks, as evidenced by recent significant progress in areas such as speech recognition [1, 2], machine translation [3, 4] and image recognition [5]. However, computations in neural networks are usually much more expensive than in prior approaches, as they involve a large number of matrix-vector multiplications followed by nonlinear activation functions. Furthermore, model training and inference with neural networks also require a significant amount of memory due to the large model size, since in modern neural network models each layer may have thousands of hidden units. As a result, neural network models are usually trained on GPUs, with significant speedups via parallelization, and the models are usually deployed in the cloud for inference to address the memory issue.

Computation and memory efficient neural networks have been an appealing research topic for both deep learning and application researchers, as they enable local deep learning applications such as speech and image recognition on embedded devices without access to the cloud. This problem has been addressed by many researchers in different ways. A large fraction of the prior work aims at training a small model – in terms of the number of model parameters – that can approach the accuracy of a larger model. In this work, we are inspired by [6, 7] to investigate using binary weights and activations to replace the real-valued weights and activations in neural networks. The motivation is that, compared to real-valued weights, binary weights require significantly fewer bits for storage, thereby cutting down the memory footprint. From a computational perspective, binary weights or activations turn the matrix-vector multiplications into additions and subtractions, which are much faster and more energy efficient in hardware. With both binary weights and activations, the computation becomes even simpler and faster, as the multiplications reduce to XOR operations, which can be implemented very efficiently in hardware.

Compared to the pilot studies of this idea for image classification on relatively small datasets [6, 7] (i.e., MNIST, CIFAR-10 and SVHN), in this paper we investigate neural networks with binary weights and activations in the context of large vocabulary speech recognition. In particular, we focus on feedforward neural networks, as they are simpler in terms of training algorithms and computationally cheaper, allowing fast experimental turnaround. Our training algorithms are slightly different from [6, 7] due to the differences in our model and the task itself, as detailed in Section 2. Our study is mainly based on the WSJ1 corpus, with some additional experiments carried out on the AMI database.

1.1 Related Work

In both speech recognition and deep learning in general, there have been a number of approaches to building neural networks with a small memory footprint and low computational cost. One approach is teacher-student training, also known as model compression [8] or knowledge distillation [9], where a large and computationally expensive teacher model (or an ensemble of models) is used to predict soft targets for training a smaller student model. As discussed in [9], the soft targets provided by the teacher encode the generalization power of the teacher model, and a student model trained using these labels has been observed to perform better than the same model trained with hard labels [10, 11]. Some successful examples of using this approach for speech recognition are [12, 13, 14, 15].

Motivated by the argument that neural networks with dense connections are over-parameterized, another line of work replaces the full-rank weight matrices in neural networks with products of structured low-rank matrices. Particular examples include the Toeplitz-like structured transforms studied in [16], and the discrete cosine transform (DCT) used in [17] to approximate the weight matrices in neural networks. With these structured transforms, the number of trainable parameters is significantly smaller, thereby reducing the amount of memory required for model inference. However, the computational cost and energy consumption may not be reduced by these approaches. Yet another approach is to directly train a thinner and deeper network, with highway [18] or residual connections [19] to overcome the optimization issue [20]. The resulting model is much more compact yet still accurate; for example, it can achieve comparable recognition accuracy with around 10% of the model parameters of a regular model on the AMI speech recognition corpus.

The binary weight and activation approach in this paper differs from previous works in that it does not aim at cutting down the number of model parameters to save memory and computation, but at reducing the number of bits needed to store the weights and activations, and at turning the multiplications into additions and subtractions to save computation. This approach is therefore complementary to the prior ones, and combinations with those approaches are possible, but they are not studied in this work.

2 Binary Neural Network

The key building block in neural networks is the linear matrix-vector multiplication followed by a nonlinear activation function:

z^l = W^l h^{l-1} + b^l,   (1)
h^l = f(z^l),   (2)

where z^l and h^l are the activation vectors before and after the nonlinear function f(·), and W^l and b^l are the weight matrix and bias vector of the l-th layer. Most neural networks use real-valued weights and activations to preserve high precision. In this work, we explore the use of binary values for the weights and activations for acoustic modeling.
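
As a concrete illustration of why binary weights cheapen Eq. (1), the following NumPy sketch (with made-up layer sizes; all names are ours, not from the paper's implementation) evaluates a single layer with a {-1, +1} weight matrix and verifies that the matrix-vector product reduces to additions and subtractions of the input activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes, for illustration only.
n_in, n_out = 8, 4
h_prev = rng.standard_normal(n_in)                   # h^{l-1}: previous layer's activations
W_b = np.sign(rng.standard_normal((n_out, n_in)))    # binary weight matrix in {-1, +1}
b = rng.standard_normal(n_out)                       # bias vector b^l

# Eq. (1)-(2): z^l = W^l h^{l-1} + b^l,  h^l = f(z^l)
z = W_b @ h_prev + b
h = np.tanh(z)                                       # f: any nonlinearity; tanh used as an example

# With binary weights, the matrix-vector product is just signed sums: each output
# unit adds the inputs where W_b = +1 and subtracts those where W_b = -1.
z_add_sub = np.array([h_prev[w == 1].sum() - h_prev[w == -1].sum() for w in W_b]) + b
assert np.allclose(z, z_add_sub)
```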

2.1 Binary Weights

In most of our work, we consider the binary pair {-1, +1} instead of {0, 1}. While the two are almost equivalent in terms of hardware implementation, neural networks with binary weights in {-1, +1} may have larger expressive power because of the subtraction operation corresponding to -1. In terms of model training, there are two ways to binarize the weight elements, as discussed in [6]: stochastic and deterministic approaches. The stochastic approach sets a weight to +1 or -1 according to a probability distribution, e.g., P(W_b = +1) = σ(W), where σ(·) is the Sigmoid function that maps the real-valued weight W to (0, 1). The deterministic approach simply sets each weight to its sign. As the deterministic approach is simpler for model training and inference, we focus on it in this paper. To be specific, the binarization function is the Sign function (also known as the Hard Tanh function)

W_b = Sign(W) = +1 if W ≥ 0, and -1 otherwise,   (3)

where W_b is the binary weight and W is the real-valued weight. Training the neural network with binary weights using stochastic gradient descent (SGD) is straightforward, as shown in Algorithm 1, following [6]. Note that in the algorithm we use g_x to denote the gradient of the loss with respect to x. Since z^l denotes the activation vector before the nonlinearity, the gradient through the nonlinear function can be written as g_{z^l} = g_{h^l} ⊙ f'(z^l) (line 9). There are a few subtle points in this algorithm. Firstly, the gradients are not binary but always real-valued (lines 10-12). Secondly, the binary weights are only used for the forward and backward propagations (lines 3 and 10), and we always update the real-valued weights (line 17). The idea is to accumulate the gradient updates over multiple mini-batches. If we directly updated the binary weights, a gradient would have no effect in training unless it were large enough to flip the sign of the corresponding weight. For real-valued weights, however, the updates accumulate, and the sign of a weight may be flipped after seeing a few more mini-batches.

Since the binary weights are used for the forward and backward propagations while the real-valued weights are updated, there is an obvious mismatch between the gradient accumulation and the model update. Such a mismatch can cause optimization instability or even divergence. In order to narrow the gap, we applied two optimization tricks in our experiments. The first is clipping the real-valued weights after each update to prevent them from drifting far away from [-1, +1], as used in [6], i.e.,

W ← clip(W, -1, +1).   (4)

The second trick is to optionally set each weight W_{ij} to its sign with probability |W_{ij}| (lines 19-21), where |W_{ij}| is the absolute value of W_{ij}. Note that after clipping, |W_{ij}| ≤ 1. This is similar to the stochastic binarization approach mentioned before. The idea is that when |W_{ij}| is close to 1, we set the weight to its sign with high probability to bridge the gap between the gradient accumulation and the model update. Such a weight then has less opportunity to change its sign, and can be considered locked for a few mini-batches. A weight that is closer to 0 has a smaller probability of being set to its sign, so it remains active in training. This approach may have an effect similar to Dropout [21], preventing co-adaptation of the weights. Note that the process is stochastic, and a non-active weight may become active again after a few mini-batches. We refer to this approach as semi-stochastic binarization in this work.
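
These two tricks can be summarized in a few lines of NumPy. This is a minimal sketch under our own naming, not the Kaldi-level implementation used later in the paper.

```python
import numpy as np

def clip_weights(W):
    """Eq. (4): keep the real-valued weights inside [-1, 1] after each update."""
    return np.clip(W, -1.0, 1.0)

def semi_stochastic_binarize(W, rng):
    """Optional step (Algorithm 1, lines 19-21): snap each weight to its sign
    with probability |W_ij| (after clipping, |W_ij| <= 1)."""
    W = clip_weights(W)
    snap = rng.random(W.shape) < np.abs(W)
    return np.where(snap, np.where(W >= 0, 1.0, -1.0), W)

# Toy usage on a random real-valued weight matrix.
rng = np.random.default_rng(0)
W_real = rng.uniform(-2.0, 2.0, size=(4, 8))
W_real = clip_weights(W_real)
W_real = semi_stochastic_binarize(W_real, rng)
```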

Figure 1: The blue line represents the Sign function, which is not differentiable. We use the Identity function (the orange line) to approximate it for back-propagation. When the input is outside the range [-δ, δ], the approximation error is considered too large, and the corresponding gradient is set to zero to disable the update.
Require: a minibatch of inputs and targets, the loss of this minibatch C, and the learning rate η. h^0 corresponds to the input feature vector.
1: function Forward-Backward-Propagation
2:     for l = 1 to L do                                // Forward prop
3:         W_b^l ← Sign(W^l)                            // Binarize the weight matrix
4:         z^l ← W_b^l h^{l-1} + b^l
5:         h^l ← f^l(z^l)                               // f^l: nonlinear function for the l-th layer
6:     end for
7:     g_{h^L} ← ∂C/∂h^L                                // Gradients from the loss function
8:     for l = L to 1 do                                // Backward prop
9:         g_{z^l} ← g_{h^l} ⊙ f^l'(z^l)                // Gradients through f^l
10:        g_{h^{l-1}} ← (W_b^l)^T g_{z^l}
11:        g_{W_b^l} ← g_{z^l} (h^{l-1})^T
12:        g_{b^l} ← g_{z^l}
13:    end for
14: end function
15: function Update
16:    for l = 1 to L do
17:        W^l ← clip(W^l − η g_{W_b^l}, −1, 1)         // Update the real-valued weights, then clip (Eq. 4)
18:        b^l ← b^l − η g_{b^l}
19:        for each element W_{ij}^l of W^l do          // Optional: semi-stochastic binarization
20:            W_{ij}^l ← Sign(W_{ij}^l) with probability |W_{ij}^l|
21:        end for
22:    end for
23: end function
Algorithm 1 Forward-backward propagation for feedforward neural networks with binary weights for all hidden layers.
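
For readers who prefer runnable code, here is a compact NumPy sketch of one training step mirroring Algorithm 1. It assumes a single training example, the same nonlinearity at every layer, and our own function and argument names (g_top_fn, f, f_prime); it is an illustration under those assumptions, not the Kaldi implementation used in the experiments.

```python
import numpy as np

def sign(x):
    """Eq. (3): deterministic binarization to {-1, +1} (zero mapped to +1)."""
    return np.where(x >= 0, 1.0, -1.0)

def train_step_binary_weights(W, b, h0, g_top_fn, f, f_prime, lr, rng=None):
    """One SGD step in the spirit of Algorithm 1: forward/backward with the
    binarized weights, update of the real-valued weights with clipping (Eq. 4),
    and the optional semi-stochastic binarization step.

    W, b       : lists of real-valued weight matrices / bias vectors, one per layer
    h0         : input feature vector
    g_top_fn   : callable returning dC/dh^L given the top-layer activation
    f, f_prime : elementwise nonlinearity and its derivative
    """
    L = len(W)
    Wb = [sign(Wl) for Wl in W]                      # line 3: binarize the weights
    h, z = [h0], []
    for l in range(L):                               # forward prop with binary weights
        z.append(Wb[l] @ h[-1] + b[l])
        h.append(f(z[-1]))
    g_h = g_top_fn(h[-1])                            # gradient from the loss function
    for l in reversed(range(L)):                     # backward prop
        g_z = g_h * f_prime(z[l])                    # line 9: gradient through f
        g_W = np.outer(g_z, h[l])                    # real-valued gradients (lines 10-12)
        g_b = g_z
        g_h = Wb[l].T @ g_z                          # line 10: uses the binary weights
        W[l] = np.clip(W[l] - lr * g_W, -1.0, 1.0)   # line 17 + Eq. (4)
        b[l] = b[l] - lr * g_b
        if rng is not None:                          # lines 19-21: optional step
            snap = rng.random(W[l].shape) < np.abs(W[l])
            W[l] = np.where(snap, sign(W[l]), W[l])
    return W, b

# Toy usage with tanh hidden layers and a squared-error-style top gradient.
rng = np.random.default_rng(0)
W = [rng.uniform(-1, 1, (16, 8)), rng.uniform(-1, 1, (4, 16))]
b = [np.zeros(16), np.zeros(4)]
W, b = train_step_binary_weights(
    W, b, rng.standard_normal(8),
    g_top_fn=lambda hL: hL - np.eye(4)[0],
    f=np.tanh, f_prime=lambda zz: 1 - np.tanh(zz) ** 2,
    lr=0.001, rng=rng)
```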

2.2 Binary Activations

To binarize the activations, we use the same Sign function as in Eq. (3), so in this case the Sign function is the nonlinear function f in Eq. (2) for the binarization layer. Unlike the case of binary weights, however, using the Sign function as the nonlinear activation function breaks the backpropagation algorithm, as it is not differentiable. To address this problem, we use a differentiable function to approximate the Sign function during backpropagation. In practice, we find that the Identity function works well, though other options may exist. The advantage of using the Identity function for the approximation is that it passes the gradient through unchanged during backpropagation, which saves computation. However, this is an obviously biased estimate, and as Figure 1 shows, the approximation error grows with the magnitude of the input. In our experiments, training the neural network with binary activations did not converge when this approximation was applied directly. To reduce the approximation error, we apply a mask to the gradient: when the absolute value of an element of z^l is above a threshold δ, where δ > 0, we set the corresponding gradient to zero, meaning that the approximation error is unacceptable for that element and the gradient is too noisy. In this case, the gradient through the binarization layer becomes

g_{z^l} = g_{h^l} ⊙ m^l,   (5)

where g_{z^l} is the gradient with respect to the input of the binarization layer, g_{h^l} is the gradient with respect to its output, and ⊙ denotes elementwise multiplication. The mask m^l is computed as

m_i^l = 1 if |z_i^l| ≤ δ, and 0 otherwise.   (6)

In our experiments, only a small fraction of the hidden units in z^l fall below the threshold, so the gradient matrix g_{z^l} is very sparse. We may be able to take advantage of the sparsity of g_{z^l} to speed up training; however, this was not investigated in this paper. Courbariaux et al. [7] used the same approach for the approximation, but the authors did not provide a clear explanation of the motivation behind it, and δ was hard-coded to 1. The forward and backward propagation with binary activations is summarized in Algorithm 2.
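
A minimal NumPy sketch of this masked straight-through estimator (Eqs. 5-6); the function names are ours and chosen only for illustration.

```python
import numpy as np

def binary_activation_forward(z):
    """Forward pass: Sign nonlinearity, outputs in {-1, +1}."""
    return np.where(z >= 0, 1.0, -1.0)

def binary_activation_backward(g_h, z, delta):
    """Eqs. (5)-(6): identity (straight-through) approximation of the Sign
    gradient, masked to zero wherever |z| exceeds the threshold delta."""
    mask = (np.abs(z) <= delta).astype(g_h.dtype)
    return g_h * mask          # elementwise product with the mask

# Toy check: with delta = 1, only units with |z| <= 1 pass gradient back.
z = np.array([-2.5, -0.4, 0.1, 1.7])
g_h = np.ones_like(z)
print(binary_activation_backward(g_h, z, delta=1.0))   # -> [0. 1. 1. 0.]
```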

Require: a minibatch of inputs and targets, the loss of this minibatch C, the learning rate η and the threshold δ. h^0 corresponds to the input feature vector.
1: function Forward-Backward-Propagation
2:     for l = 1 to L do                                // Forward prop
3:         z^l ← W^l h^{l-1} + b^l
4:         if l < L then
5:             h^l ← Sign(z^l)                          // Binary activation function
6:         else
7:             h^L ← Softmax(z^L)                       // Softmax function
8:         end if
9:     end for
10:    g_{h^L} ← ∂C/∂h^L                                // Gradients from the loss function
11:    for l = L to 1 do                                // Backward prop
12:        if l < L then
13:            g_{z^l} ← g_{h^l} ⊙ m^l                  // Gradient through the binarization layer (Eqs. 5-6)
14:        else
15:            g_{z^L} ← J_Softmax(z^L)^T g_{h^L}       // Gradient through the Softmax layer
16:        end if
17:        g_{h^{l-1}} ← (W^l)^T g_{z^l}
18:        g_{W^l} ← g_{z^l} (h^{l-1})^T
19:        g_{b^l} ← g_{z^l}
20:    end for
21: end function
22: function Update                                      // Standard parameter update
23:    for l = 1 to L do
24:        W^l ← W^l − η g_{W^l}
25:        b^l ← b^l − η g_{b^l}
26:    end for
27: end function
Algorithm 2 Forward-Backward propagation for feedforward neural networks with binary activations for all hidden layers
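
The sketch below strings the masked estimator into a full network pass in the spirit of Algorithm 2, with a Softmax output layer trained with cross entropy (for which dC/dz^L is the standard h^L − y). The single-example setting and the names are our own simplifications, not the paper's implementation.

```python
import numpy as np

def sign(x):
    return np.where(x >= 0, 1.0, -1.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step_binary_activations(W, b, h0, target, lr, delta):
    """One step in the spirit of Algorithm 2: Sign activations for all hidden
    layers, Softmax output, identity gradient approximation masked by |z| <= delta.
    `target` is a one-hot vector; cross-entropy loss is assumed."""
    L = len(W)
    h, z = [h0], []
    for l in range(L):                            # forward prop
        z.append(W[l] @ h[-1] + b[l])
        h.append(softmax(z[-1]) if l == L - 1 else sign(z[-1]))
    g_z = h[-1] - target                          # Softmax + cross entropy: dC/dz^L
    for l in reversed(range(L)):                  # backward prop
        if l < L - 1:                             # hidden layers: Eqs. (5)-(6)
            g_z = g_h * (np.abs(z[l]) <= delta)
        g_W = np.outer(g_z, h[l])
        g_b = g_z
        g_h = W[l].T @ g_z                        # gradient passed to the layer below
        W[l] = W[l] - lr * g_W                    # standard parameter update
        b[l] = b[l] - lr * g_b
    return W, b
```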

2.3 Binary Neural Networks

Neural networks with both binary weights and activations are referred to as binary neural networks in this paper. Binary neural networks can further reduce the computational cost, as the matrix-vector multiplications in this case reduce to XOR operations, which can be computed very quickly in hardware. Training binary neural networks can be done by combining Algorithms 1 and 2. However, in our experiments we observed that the gradients of the weights can easily explode, resulting in divergence in training. We address this problem by clipping the norm of the gradients, following the practice in training recurrent neural networks [22]:

g ← g · τ / ||g||  if ||g|| > τ,   (7)

where ||g|| denotes the L2 norm of the gradient g, and τ is the threshold. Note that a fully binary neural network is unlikely to work, and using binary weights for the Softmax layer is extremely harmful in our experience. In this work, binary neural networks only have binary weights and activations in the intermediate hidden layers.
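
Gradient norm clipping as in Eq. (7) is only a few lines of NumPy; this sketch uses our own function name and, as an example, the value τ = 15 mentioned later for the binary neural network experiments.

```python
import numpy as np

def clip_grad_norm(g, tau):
    """Eq. (7): rescale the gradient so its L2 norm does not exceed tau."""
    norm = np.linalg.norm(g)
    if norm > tau:
        g = g * (tau / norm)
    return g

# Example: tau = 15 was used for the binary neural network experiments.
g = np.random.default_rng(0).standard_normal((1024, 1024))
g = clip_grad_norm(g, tau=15.0)
```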

3 Experiments and Results

Most of our experiments were performed on the WSJ1 corpus, which has around 80 hours of training data; some additional experiments on the AMI meeting speech transcription task are detailed in Section 3.3. In our experiments, we did not measure the computational cost, as efficient computation with binary weights and activations relies on hardware implementations, and standard CUDA kernels for GPU computation do not have efficient ways to handle multiplications by binary values. For the experiments on WSJ1, we used 40-dimensional log-mel filterbanks with first-order delta coefficients as features, which were then spliced with a context of 11 frames. For acoustic modeling, we used feedforward neural networks with 6 hidden layers and 3307 units in the Softmax layer, which is the number of tied hidden Markov model triphone states. For the baseline models, we used Sigmoid activations for the hidden layers. Following the Kaldi recipe [23], we used the expanded dictionary and a trigram language model for decoding. All of our models were trained using SGD with exponential learning rate decay, and we used the cross-entropy training criterion for all our systems. The algorithms for training neural networks with binary weights and activations were implemented within the Kaldi toolkit.
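
For concreteness, the following sketch shows one way to perform the 11-frame splicing of the 80-dimensional features (40 log-mel filterbanks plus first-order deltas); the padding choice and the function name are our own assumptions, not necessarily the exact Kaldi pipeline.

```python
import numpy as np

def splice(frames, context=5):
    """Splice each frame with +/- `context` neighbouring frames (11 frames in
    total for context=5), repeating the boundary frames at the utterance edges."""
    T, d = frames.shape
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1) for t in range(T)])

# 40-dim log-mel filterbanks with first-order deltas -> 80 dims per frame,
# spliced over 11 frames -> 880-dim network input.
utt = np.random.default_rng(0).standard_normal((300, 80))
X = splice(utt)
print(X.shape)   # (300, 880)
```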

Figure 2: Convergence curves of boosted training for neural networks with binary weights on the WSJ1 dataset; p = 0 corresponds to the system without boosted training.
Model                  Input layer   Softmax layer   dev93   eval92
Baseline               –             –               6.8     3.8
Binary weights ()      fixed         fixed           7.7     4.8
Binary weights ()      fixed         fixed           8.0     4.5
Binary weights ()      fixed         fixed           8.0     4.4
Binary weights ()      real          real            10.4    6.7
Binary weights ()      binary        fixed           12.0    7.3
Binary weights ()      binary        binary          19.0    12.0
Table 1: WERs (%) of binary weight networks on WSJ1. The Input and Softmax columns indicate how the weights of the input and Softmax layers are treated (fixed at their initial values, updated as real values, or binarized). The number of hidden units is 1024 for all experiments in this table.

Figure 3: Convergence curves of neural networks with binary activations on the WSJ1 dataset.
ID   Model   Update layer: Input   Update layer: Softmax   WER: dev93   WER: eval92
1 () 1 8.2 4.4
2 () 2 7.8 4.8
3 () 3 8.0 4.5
4 () 1 9.1 5.3
5 () 1 9.6 5.8
6 () 1 8.1 4.8
7 () 1 11.3 7.0
8 () 1 10.7 6.7
9 () 1 20.4 14.5
10 () 1 12.1 7.0
11 () 1 20.5 12.7
Table 2: WERs (%) of networks using binary activations. The Model column specifies, for each layer, whether the Sigmoid activation, the binary activation, or the Softmax operation is used; the Update layer columns indicate whether the weights of the input and Softmax layers are updated.

3.1 Results with binary weights

In our experiments, training neural networks with binary weights from random initialization usually did not converge, or converged to very poor models. We addressed this problem by initializing our models from a well-trained neural network with real-valued weights. This approach worked well and was used in all our experiments. We trained the baseline neural network with an initial learning rate of 0.008, following the nnet1 recipe in Kaldi. We then reduced the initial learning rate to 0.001 when running Algorithm 1 to train the neural network with binary weights; this initial learning rate was found to be a good tradeoff between convergence speed and model accuracy. Table 1 shows the word error rates (WERs). Here, we explored several settings in which the weights of the input layer and the Softmax layer were binary, real-valued, or fixed at their initial values. As Table 1 shows, fixing the weights of the input and Softmax layers at their initial values and only updating the binary weights of the intermediate hidden layers achieves the best results, which are around 1% absolute worse than our baseline. We also did experiments updating those real-valued weights jointly with the binary weights using the same learning rate, but obtained much worse results. The reason may be that the gradients of the real-valued and binary weights are in very different ranges, and updating them with the same learning rate is not appropriate. Adaptive learning rate approaches such as Adam [24] and Adagrad [25] may work better in this case, but they are not investigated in this work. In order to have a complete picture, we also tried using binary weights for the input layer and the Softmax layer. As expected, we obtained much lower accuracy, confirming that reducing the resolution of the input features and of the activations fed to the Softmax classifier is harmful for classification.

We then studied the semi-stochastic binarization approach in Algorithm 1 (lines 19-21). Applying this step very frequently is harmful to the SGD optimization, as it can counteract the SGD updates. In our experiments, we therefore set a probability p to control the frequency of this operation. More precisely, after each SGD update, we draw a sample from a uniform distribution between 0 and 1, and if its value is smaller than p, the semi-stochastic binarization is applied. A larger p therefore means more frequent application, and vice versa. Figure 2 shows the convergence curves of training with and without this approach, suggesting that semi-stochastic binarization can speed up convergence. However, as Table 1 shows, we did not achieve consistent improvements on the dev93 and eval92 sets. Note that we used the dev93 set to choose the language model score for the eval92 set, and from our observations the results on the development and evaluation sets are usually not well aligned, suggesting that there may be a slight mismatch between the two. The semi-stochastic binarization approach is revisited on the AMI dataset in Section 3.3.

Model Size dev93 eval92
Baseline 1024 6.8 3.8
Baseline 2048 6.5 3.5
Binary weights 1024 7.7 4.8
Binary weights 1024 Not Converged
Binary activations 1024 8.2 4.4
Binary activations 1024 7.2 4.1
Binary neural network 1024 15.6 10.7
Binary activations 2048 7.3 4.4
Binary weights 2048 7.5 4.4
Binary neural network 2048 Not Converged
Table 3: WERs (%) of neural networks with binary weights and activations on the WSJ1 dataset. We set for the system with binary weights, and for the system with binary activations.

3.2 Results with binary activations

Following the previous experiments, we also initialized the binary activation networks from the well-trained real-valued neural network, and again set the initial learning rate to 0.001. In the first set of experiments, the activation functions of the first and last hidden layers were fixed to Sigmoid activations, and only those of the hidden layers in between were replaced by binary activations. We first studied the impact of the hyper-parameter δ on model training. As mentioned before, a smaller δ corresponds to sparser gradients, while a larger δ indicates a larger approximation error. In our experiments, setting δ between 1 and 3 did not make a big difference in terms of WER on either evaluation set. As Figure 3 shows, an intermediate value of δ gives a good tradeoff between convergence speed and model generalization ability. With larger values of δ, however, the model training did not converge in our experiments due to the large approximation error.

We also looked at updating the weights of the input and Softmax layers in this case. As Table 2 shows, keeping both layers fixed still works best. Again, this may be due to the fact that the gradients from Sigmoid and binary activations are in different ranges. In the future, we shall revisit this problem with adaptive learning rate approaches. We then investigated using binary activations for the first hidden layer (rows 6-8) and the last hidden layer (rows 9-11). Surprisingly, when the weights of both the input layer and the Softmax layer are fixed, using binary activations for the first hidden layer achieves comparable accuracy in our experiments. However, using binary activations for the last hidden layer degraded the accuracy markedly, which is expected, as the resolution of the features fed to the Softmax layer is very low in this case.

Table 3 shows results for networks with a larger number of hidden units, as well as for binary neural networks with both binary weights and activations. For all the experiments in this table, the weights of the input and Softmax layers were fixed, and the first and last hidden layers used Sigmoid activations. Using a larger number of hidden units works slightly better for both the binary weight and the binary activation systems. For the binary neural network system, we applied the gradient clipping approach explained in Section 2.3 to prevent divergence in training, and set the threshold τ to 15. However, we only managed to train the network with 1024 hidden units, and it achieved much worse accuracy. Training fully binary neural networks remains a challenge in our study.

We also ran experiments comparing the {0, 1} and {-1, +1} binarization pairs, as shown in Table 3. With Sigmoid activations, using {0, 1} for the binary weights can cause training divergence, as the elements of the activation vector are always positive. Using {0, 1} for the binary activations, however, achieved a lower WER. The reason may be that the network was initialized from Sigmoid activations, and the {0, 1} binarization is much closer to the Sigmoid than {-1, +1}; in fact, the {0, 1} binarization function can be viewed as a hard version of the Sigmoid. Using {-1, +1} for binary activations may work better with networks initialized from Tanh activations, and that will be investigated in our future work.

Model dev eval
Baseline 26.1 27.5
Binary weights () 30.3 32.7
Binary weights () 30.0 32.2
Binary weights () 29.6 31.7
Binary weights () 29.6 31.9
Binary activations () 30.1 32.5
Binary activations () 29.9 32.3
Binary activations () 30.2 32.4
Binary activations () 29.8 32.0
Binary activations () 27.5 29.5
Binary activations () 28.0 30.2
Binary activations () 29.8 32.2
Binary neural network Not Converged
Table 4: WERs (%) of neural network with binary weights and activations on the AMI dataset. The number of hidden units is 2048, and denotes binarization.

3.3 Results on the AMI dataset

As mentioned before, we did not observe consistent trends on the development and evaluation sets of WSJ1, possibly due to a certain mismatch between the two, which hindered us from drawing strong conclusions. To gain further insight into the techniques we have explored, we performed some experiments on the AMI corpus, focusing on the IHM (individual headset microphone) condition. It also has around 80 hours of training data, but the dev and eval sets are much larger (over 8 hours). Again, we built our baseline following the Kaldi recipe. We used MFCC features followed by feature-space MLLR transformation, and a trigram language model for decoding. The neural network models have 6 hidden layers, and the Softmax layer has 3972 units. As in the WSJ1 experiments, we initialized the models with binary weights and binary activations from the baseline model. The initial learning rate was 0.008 for the baseline system, and 0.001 for the binary weight and binary activation systems.

The experimental results are shown in Table 4. For the binary weight systems, we revisited the semi-stochastic binarization approach. While the convergence curves were similar to Figure 2 in this case (not shown in this paper), we obtained small but consistent improvements on both the dev and eval sets; at the best setting of p, the improvement is around 1% absolute on the eval set, as shown in Table 4. Since the model was initialized from a Sigmoid network, the binary activation system with {0, 1} binarization worked much better than its counterpart with {-1, +1} for small values of δ. Again, we ran experiments tuning the threshold δ for the binary activation systems. Unlike the experiments on WSJ1, the models are relatively tolerant to changes of δ with {-1, +1} binarization, and we only observed divergence for very large δ. This may be because we used different features in these experiments, causing differences in the distributions of the pre-activations z^l. However, this is not the case for the {0, 1} binarization, as the system degrades rapidly when δ increases. The reason may be that the {0, 1} binarization function is not symmetric, and the approximation error of the Identity function is significant for inputs with large magnitude when they are negative. Again, we failed to train binary neural networks with 2048 hidden units due to divergence in training.

4 Conclusion

Neural networks with binary weights and activations are appealing for deploying deep learning applications on embedded devices. In this paper, we investigated such networks for acoustic modeling. In particular, we presented practical algorithms for training neural networks with binary weights and activations, and discussed optimization techniques to handle training divergence. On both the WSJ1 and AMI datasets, we achieved encouraging recognition WERs compared to the baseline models. However, this study is still at an early stage, and there is much room left to explore. For example, we only considered feedforward neural networks in this work, leaving other architectures such as convolutional and recurrent neural networks as open problems. Training networks with both binary weights and activations is still challenging according to our results, and more work is needed to address the optimization challenge in this case.

5 Acknowledgements

We thank the NVIDIA Corporation for the donation of a Titan X GPU used in this work, and Karen Livescu for proofreading and comments that have improved the manuscript.

References