Deep Neural Networks have been the main driving force behind recent advancement in machine learning, notably in computer vision applications. While deep learning has become the standard approach for many tasks, performing inference on low power and constrained memory hardware is still challenging. This is especially challenging in autonomous driving in electric vehicles where high precision and high throughput constraints are added on top of the low power requirements.
that require an order of magnitude less memory and no multiplications, leading to significantly faster inference on dedicated hardware. The problem arises when trying to backpropagate errors as the weights are discrete. One heuristic suggested inCourbariaux et al. (2015) is to use stochastic weights , sample binary weights according to for the forward pass and gradient computation and then update the stochastic weights instead. Another idea, used by Hubara et al. (2016) and Rastegari et al. (2016)
, is to apply a “straight-through” estimator. While these ideas were able to produce good results, even on reasonably large networks such as ResNet-18 (He et al., 2016), there is still a large gap in prediction accuracy between the full-precision network and the discrete networks.
In this paper, we attempt to train neural networks with discrete weights using a more principled approach. Instead of trying to find a good “derivative” to a non-continuous function, we show how we can find a good smooth approximation and use its derivative to train the network. This is based on the simple observation that if at layer we have stochastic (independent) weights , then the pre-activations
are approximately Gaussian according to the (Lyapunov) central limit theorem (CLT). This allows us to model the pre-activation using a smooth distribution and use the reparameterization trick(Kingma & Welling, 2014) to compute derivatives. The idea of modeling the distribution of pre-activation, instead of the distribution of weights, was used in Kingma et al. (2015) for Gaussian weight distributions where it was called the local reparameterization trick. We show here that with small modifications it can be used to train discrete weights and not just continuous Gaussian distributions.
The top histogram in each subfigure shows the pre-activation of a random neuron,, which is calculated in a regular feed-forward setting when explicitly sampling the weights. The bottom shows samples from the approximated pre-activation using Lyapunov CLT. (a) refers to the first hidden layer whereas (b) refers to the last. We can see the approximation is very close to the actual pre-activation when sampling weights and performing standard feed-forward. In addition, we see it even holds for the first hidden layer, where the number of elements is not large (in this example, 27 elements for a convolution).
We experimented with both binary and ternary weights, and while the results on some datasets are quite similar, ternary weights were considerably easier to train. From a modeling perspective restricting weights to binary values forces each neuron to affect all neurons in the subsequent layer making it hard to learn representations that need to capture several independent features.
In this work, we present a novel and simple method for training neural networks with discrete weights. We show experimentally that we can train binary and ternary networks to achieve state-of-the-art results on several datasets, including ResNet-18 on ImageNet, compared to previously proposed binary or ternary training algorithms. On MNIST and CIFAR-10 we can also almost match the performance of the original full-precision network using discrete weights.
2 Related Work
The closest work to ours, conceptually, is expectation-backpropagation (Soudry et al., 2014) in which the authors use the CLT approximation to train a Bayesian mean-field posterior. On a practical level, however, our training algorithm is entirely different: their forward pass is deterministic while ours is stochastic. In addition, we show results on much larger networks. Another close line of work is Wang & Manning (2013) where the authors used the CLT to approximate the dropout noise and speed-up training, but do not try to learn discrete weights.
Recently there is great interest in optimizing discrete distributions by using continuous approximations such as the Gumbel-softmax (Maddison et al., 2017; Jang et al., 2017), combined with the reparameterization trick. However, since we are looking at the pre-activation distribution, we use the simpler Gaussian approximation as can be seen in Fig 1. In appendix A we compare our method to the Gumbel-softmax relaxation and show that our method leads to much better training.
One approach for discrete weight training suggested in Courbariaux et al. (2015) is to use stochastic weights, sample binary weights and use it for the forward pass and gradient computation, computing gradients as if it was a deterministic full precision network, and then update the stochastic weights. The current leading works on training binary or ternary networks, Rastegari et al. (2016); Li et al. (2016); Hubara et al. (2016), are based on a different approach. They discretize during the forward pass, and back-propagate through this non-continuous discretization using the “straight-through” estimator for the gradient. We note that Rastegari et al. (2016); Hubara et al. (2016); Soudry et al. (2014)binarize the pre-activation as well.
Another similar work based on the “straight-through” estimator is Zhu et al. (2017). The authors proposed training a ternary quantization network, where the discrete weights are not fixed to but are learned. This method achieves very good performance, and can save memory compared to a full precision network. However, the use of learned discrete weights requires using multiplications which adds computational overhead on dedicated hardware compared to binary or ternary networks.
An alternative approach to reducing memory requirements and speeding up deep networks utilized weight compression. Louizos et al. (2014); Han et al. (2016) showed that deep networks can be highly compressed without significant loss of precision. However, the speedup capabilities of binary or ternary weight networks, especially on dedicated hardware, is much higher.
3 Our method
We will now describe in detail our algorithm, starting with the necessary background. We use a stochastic network model in which each weight is sampled independently from a multinomial distribution . We use to mark the set of all parameters and to mark the distribution over
. Given a loss functionand a neural network parametrized by our goal is to minimize
The standard approach for minimizing with discrete distributions is by using the log-derivative trick (Williams, 1992):
For continuous distributions Kingma & Welling (2014) suggested the reparameterization trick - instead of optimizing for we parametrize where is drawn from a known fixed distribution (usually Gaussian) and optimize for . We can sample and use the Monte-Carlo approximation:
In further work by Kingma et al. (2015)
, it was noticed that if we are trying to learn Bayesian networks, sampling weights and running the model with different weights is quite inefficient on GPUs. They observed that if the weight matrix111This is done per-layer. To simplify notation we ignore the layer index is sampled from independent Gaussians , then the pre-activations are distributed according to
This allows us to sample pre-activations instead of weights, which they called the local reparameterization trick, reducing run time considerably.
3.2 Discrete local reparameterization
The work by Kingma et al. (2015) focused on Gaussian weight distributions, but we will show how the local reparameterization allows us to optimize networks with discrete weights as well. Our main observation is that while eq. 4 holds true for Gaussians, from the (Lyapunov) central limit theorem we get that should still be approximated well by the same Gaussian distribution . The main difference is that and are the mean and variance of a multinomial distribution, not a Gaussian one. Once the discrete distribution has been approximated by a smooth one, we can compute the gradient of this smooth approximation and update accordingly.
This leads to a simple algorithm for training networks with discrete weights - let be the parameters of the multinomial distribution over . At the forward pass, given input vector , we first compute the weights means and variances . We then compute the mean and variance of , and . Finally we sample and return , where stands for the element-wise multiplication. We summarize this in Algorithm 1.
During the backwards phase, given we can compute
and similarly compute to backpropagate. We can then optimize using any first order optimization method. In our experiments we used Adam (Kingma & Ba, 2015).
4 Implementation Details
In section 3 we presented our main algorithmic approach. In this section, we discuss the finer implementation details needed to achieve state-of-the-art performance. We present the numerical experiments in section 5.
For convenience, we represent our ternary weights using two parameters, and , such that and . For binary networks we simply fix .
In a regular DNN setting, we usually either initialize the weights with a random initializer, e.g Xavier initializer (Glorot & Bengio, 2010) or set the initial values of the weights to some pretrained value from a classification network or another similar task. Here we are interested in initializing distributions over discrete weights from pretrained continuous deterministic weights, which we shall denote as . We normalized the pretrained weights beforehand to be mostly in the range, by dividing the weights in each layer by
, the standard deviation of the weights in layer.
We first explain our initialization in the ternary setting. Our aim is threefold: first, we aim to have our mean value as close as possible to the original full precision weight. Second, we want to have low variance. Last, we do not want our initial distributions to be too deterministic, in order not to start at a bad local minima.
If we just try to minimize the variance while keeping the mean at then the optimal solution would have
and the rest of the probability is ator , depending on the sign of . This is not desirable, as our initialized distribution in such case could be too deterministic and always assign zero probability to one of the weights. We therefore modify it and use the following initialization:
We initialize the probability of to be
are hyperparameters (set to 0.05 and 0.95 respectively in our experiments). Using the initialization proposed in eq.6, gets the maximum value when , and decays linearly to when . Next, we set so that the initialized mean of the discrete weight is equal to the original full precision weight:
In order for the expression in eq. 7 to be equal to , we need to set:
For a binary setting, we simply set .
4.2 Increasing Entropy
During some of our experiments with ternary networks, we found that many weight distributions converged to a deterministic value, and this led to suboptimal performance. This is problematic for two main reasons: First, this could lead to vanishing gradients due to saturation. Second, due to the small number of effective random variables, our main assumption - that the pre-activation distribution can be approximated well by a Gaussian distribution - is violated. An example of such case can be seen in fig.1(a).
This can be solved easily by adding regularization on the distribution parameters (before sigmoid, and ) to penalize very low entropy distributions. We emphasize that this regularization term was used mainly to help during training, not to reduce over-fitting. From here on, we shall denote this regularization hyper-parameter as probability decay, to distinguish it from the weight decay used on the last full-precision layer. We note that in practice, only an extremely small probability decay is needed, but it had a surprisingly large effect on the overall performance. An example of such a case can be seen in fig. 1(b). A closer examination of the average entropy throughout the training and the effect of the proposed regularization, can be found in appendix B.
4.3 Reducing Entropy
In the binary setting, we experienced the opposite in some experiments. Since we do not have a zero weight, we found that many of the weight probabilities learned were stuck at mid-values around 0.5 so that the expected value of that weight would remain around zero. That is not a desirable property, since we aim to get probabilities closer to 0 or 1 when sampling the final weights for evaluation. For this case, we add a beta density regularizer on our probabilities, with . From here on we shall denote this regularization hyper-parameter as beta parameter. It is important to note that we are using as our regularizer and not (with ) as might be more standard. This is because has a much stronger pull towards zero entropy. In general we note that unlike ternary weights, binary weights require more careful fine-tuning in order to perform well.
For evaluation, we sample discrete weights according to our trained probabilities and perform the standard inference with the sampled weights. We can sample the weights multiple times and pick those that performed best on the validation set. We note that since our trained distributions have low entropy, the test accuracy does not change significantly between samples, leading to only a minor improvement in overall performance.
We conducted extensive experiments on the MNIST, CIFAR-10 and ImageNet (ILSVRC2012) benchmarks. We present results in both binary and ternary settings. In the binary setting, we compare our results with BinaryConnect (Courbariaux et al., 2015), BNN (Hubara et al., 2016)
, BWN and XNOR-Net(Rastegari et al., 2016). In the ternary setting, we compare our results with ternary weight networks (Li et al., 2016).
In previous works with binary or ternary networks, the first and last layer weights remained in full precision. In our experiments, the first layer is binary (or ternary) as well (with an exception of the binary ResNet-18), leading to even larger memory and energy savings. As previous works we keep the final layer in full precision as well.
MNIST is an image classification benchmark dataset, containing 60K training images and 10K test images from 10 classes of digits . Images are in gray-scale. For this dataset, we do not use any sort of preprocessing or augmentation. Following Li et al. (2016), we use the following architecture:
is a fully connected layer. We adopt Batch Normalization(Ioffe & Szegedy, 2015) into the architecture after every convolution layer. The loss is minimized with Adam. We use dropout (Srivastava et al., 2014) with a drop rate of . Weight decay parameter (on the last full precision layer) is set to
. We use a batch size of 256, initial learning rate is 0.01 and is divided by 10 after 100 epochs of training. For thebinary setting, beta parameter is set to . For the ternary setting, probability decay parameter is set to . We report the test error rate after 190 training epochs. The results are presented in Table 1.
|Binary Weight Network|
|LR-net, binary (Ours)||0.53%||6.82%|
|Ternary Weight Network|
|LR-net, ternary (Ours)||0.50%||6.74%|
|Full precision reference||0.52%||6.63%|
CIFAR-10 is an image classification benchmark dataset (Krizhevsky, 2009), containing 50K training images and 10K test images from 10 classes. Images are
in RGB space. Each image is preprocessed by subtracting its mean and dividing by its standard deviation. During training, we pad each side of the image with 4 pixels, and a randomcrop is sampled from the padded image. Images are randomly flipped horizontally. At test time, we evaluate the single original image without any padding or multiple cropping. As in Li et al. (2016), we use the VGG inspired architecture:
Where is a ReLU convolution layer, is a max-pooling layer, and is a fully connected layer. It is worth noting that Courbariaux et al. (2015) and Hubara et al. (2016) use a very similar architecture with two differences, namely adopting an extra fully connected layer and using an L2-SVM output layer, instead of softmax as in our experiments. We adopt Batch Normalization into the architecture after every convolution layer. The loss is minimized with Adam. We use dropout with a drop rate of . Weight decay parameter is set to . We use a batch size of 256, initial learning rate is 0.01 and is divided by 10 after 170 epochs of training. For the binary setting, beta parameter is set to . For the ternary setting, probability decay parameter is set to . We report the test error rate after 290 training epochs. The results are presented in Table 1.
5.3 ImageNet11footnotetext: MNIST and CIFAR-10 results for BWN are taken from Li et al. (2016) where they are marked as BPWNs
|ImageNet (top-5)||ImageNet (top-1)|
|LR-net, binary (Ours)|
Ternary Weight Network
|LR-net, ternary (Ours)||15.2%||36.5%|
|Full precision reference|
ImageNet 2012 (ILSVRC2012) is a large scale image classification dataset (Deng et al., 2009), consisting of 1.28 million training images, and 50K validation images from 1000 classes. We adopt the proposed ResNet-18 (He et al., 2016), as in Rastegari et al. (2016) and Li et al. (2016). Each image is preprocessed by subtracting the mean pixel of the whole training set and dividing by the standard deviation. Images are resized so that the shorter side is set to 256. During training, the network is fed with a random crop and images are randomly flipped horizontally. At test time, the center crop is taken. We evaluate only with the single center crop.
The loss is minimized with Adam. Weight decay parameter is set to , we use a batch size of 256 and initial learning rate is 0.01. For the binary setting, we found that the beta regularizer is not needed and got the best results when beta parameter is set to . Learning rate is divided by 10 after 50 and 60 epochs and we report the test error rate after 65 training epochs. For the ternary setting, probability decay parameter is set to . For this setting, learning rate is divided by 10 after 30 and 44 epochs and we report the test error rate after 55 training epochs. The results are presented in Table 2.
5.4 Kernel Visualization
We visualize the first 25 kernels from the first layer of the network trained on MNIST. Fig. 2(a), 2(b) and 2(c) shows these kernels for the binary, ternary and full precision networks respectively. We can see that our suggested initialization from full precision networks takes advantage of those trained weights and preserves the structure of the kernels.
5.5 Binary vs Ternary
In our experiments we evaluate binary and ternary versions of similar networks on similar datasets. When trying to compare the two, it is important to note that the binary results on MNIST and CIFAR-10 are a bit misleading. While the final accuracies are similar, the ternary network work well from the start while the binary networks required more work and careful parameter tuning to achieve similar performance. On harder tasks, like ImageNet classification, the performance gap is significant even when the ternary network discretizes the first layer while the binary did not.
We believe that ternary networks are a much more reasonable model as they do not force each neuron to affect all next-layer neurons, and allow us to model low-mean low-variance stochastic weights. The difference in computation time is small as ternary weights do not require multiplications, giving what we consider a worthwhile trade-off between performance and runtime.
In this work we presented a simple, novel and effective algorithm to train neural networks with discrete weights. We showed results on various image classification datasets and reached state-of-the-art results in both the binary and ternary settings on most datasets, paving the way for easier and more efficient training and inference of efficient low-power consuming neural networks.
Moreover, comparing binary and ternary networks we advocate further research into ternary weights as a much more reasonable model than binary weights, with a modest computation and memory overhead. Further work into sparse ternary networks might help reduce, or even eliminate, this overhead compared to binary networks.
- Courbariaux et al. (2015) Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, NIPS, 2015.
Deng et al. (2009)
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei.
ImageNet: A Large-Scale Hierarchical Image Database.
Conference on Computer Vision and Pattern Recognition, CVPR, 2009.
Glorot & Bengio (2010)
Xavier Glorot and Yoshua Bengio.
Understanding the difficulty of training deep feedforward neural
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS, 2010.
- Han et al. (2016) Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In The International Conference on Learning Representations, ICLR, 2016.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition, CVPR, 2016.
- Hubara et al. (2016) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, NIPS. 2016.
- Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, ICML, 2015.
- Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In The International Conference on Learning Representations, ICLR, 2017.
- Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In The International Conference on Learning Representations, ICLR, 2015.
- Kingma & Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In The International Conference on Learning Representations, ICLR, 2014.
- Kingma et al. (2015) Diederik P. Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, NIPS, 2015.
- Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. In Tech Report, 2009.
- Li et al. (2016) Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. Advances in Neural Information Processing Systems, NIPS Workshop, 2016.
- Louizos et al. (2014) Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, NIPS, 2014.
Maddison et al. (2017)
Chris J. Maddison, Andriy Mnih, and Yee Whye Teh.
The concrete distribution: A continuous relaxation of discrete random variables.In The International Conference on Learning Representations, ICLR, 2017.
Rastegari et al. (2016)
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi.
Xnor-net: Imagenet classification using binary convolutional neural networks.European Conference on Computer Vision, ECCV, 2016.
- Soudry et al. (2014) Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In Advances in Neural Information Processing Systems, NIPS, 2014.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
- Wang & Manning (2013) Sida I. Wang and Christopher D. Manning. Fast dropout training. In International Conference on Machine Learning, ICML, 2013.
Ronald J Williams.
Simple statistical gradient-following algorithms for connectionist reinforcement learning.Machine Learning, 1992.
- Zhu et al. (2017) Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. The International Conference on Learning Representations, ICLR, 2017.
Appendix A Comparison to Gumbel-softmax
We compare our continuous relaxation to a standard continous relaxation using Gumbel-softmax approximation (Jang et al., 2017). We ran both experiments on CIFAR-10 using the exact same parameters and initialization we use in Sec. 5, for 10 epochs. For the Gumbel-softmax we use a temperature . We also tried other parameters in the range and did not find anything that worked noticeably better. We also did not use annealing of the temperature as this is only 10 epochs. Comparing the results in Fig. 4, it is clear that our proposed method works much better then simply using Gumbel-softmax on the weights.
The reason our method works better is twofold. First, we enjoy the lower variance of the local reparamatrization trick since there is one shared source of randomness per pre-activation. Second, the Gumbel-softmax has a trade-off: when the temperature goes down the approximation gets better but the activation saturates and the gradients vanish so it is harder to train. We do not have this problem as we get good approximation with non-vanishing gradients.
Appendix B Average Entropy
In order to ensure the correctness of our proposed regularization in a ternary setting, we compare the average entropy of the learned probabilities throughout the training, with and without the probability decay term. We examine the average entropy during the training procedure of a network trained on CIFAR-10 in a ternary setting. For the experiment with probability decay, we set the probability decay parameter to , as in Sec. 5.
Fig. 4(a) shows the average entropy of the third convolutional layer, during the later parts of the training. We can see that the proposed regularization causes a slight increase in entropy, which is what we aimed to achieve. Note that the ’kink’ in the graph is due to lowering the learning rate after 170 epochs. For some of the layers this regularization has a much more significant effect. Fig. 4(b) shows the average entropy throughout the training for the last convolutional layer. For this layer, the entropy at the end of the training is higher than the entropy in initialization. Looking at the probabilities themselves helps better explain this behavior.
Fig 5(a) shows a histogram of the initialized probability of this layer. Fig 5(b) shows a histogram of these learned probabilities at the end of the training. It can be seen that for this layer there were many weights that were assigned a very high (or very low) probability value for . However, the probability decay regularization pushed the learned values to be less deterministic, causing the increase in entropy observed in Fig. 4(b).