1 Introduction
Deep Neural Networks (DNNs) are currently used for a wide variety of machine learning tasks, including computer vision, natural language processing, and others. Convolutional Neural Networks (CNNs) are especially suitable for computer vision applications, including object recognition, classification, detection, image segmentation, etc. Smart mobile devices equipped with high-resolution cameras have opened the door to potential mobile computer vision applications. However, a typical CNN has millions of parameters and performs billions of arithmetic operations for each inference; e.g., AlexNet [1] has 61M parameters (249MB of memory with floating-point weights) and performs 1.5B high-precision operations to classify one image. The enormous demand for memory storage and computation power hinders the deployment of DNNs/CNNs on resource-constrained embedded devices with limited memory and computation power. Adapting DNNs/CNNs for deployment on embedded devices is an active research topic.
A number of authors have proposed techniques for reducing the size of DNNs/CNNs, including model pruning [6], i.e., removing edges with small weight magnitudes, and weight compression, i.e., reducing the precision of weights by using a small number of bits per edge weight and/or activation. These two approaches are orthogonal and can be applied separately or together. We focus on weight compression in this paper. A low bitwidth translates to a small memory footprint and high arithmetic efficiency. Courbariaux et al. presented BinaryConnect [3] for training a DNN with binary weights +1 or -1, and BinaryNet [4] for training a DNN with both binary weights and binary activations. Both BinaryConnect and BinaryNet can achieve good performance on small datasets such as MNIST, CIFAR-10, and SVHN, but perform worse than their full-precision counterparts by a wide margin on large-scale datasets like ImageNet. Rastegari et al.
[2] presented Binary Weight Networks and XNOR-Net, two efficient approximations to standard CNNs, which are shown to outperform BinaryConnect and BinaryNet by large margins on ImageNet. In Binary Weight Networks, the convolution filters are approximated with binary values {-1, +1}; in XNOR-Net, both the filters and the inputs to convolutional layers are binary. However, there is still a significant performance gap between these network models and their full-precision counterparts. Also, binarizing both activations and weights generally leads to dramatic performance degradation compared to binarizing weights only.
To strike a balance between model compression rate and model capacity, Li et al. [5] presented Ternary Weight Networks with weights constrained to {-1, 0, +1}, each weight encoded with two bits. Compared with DNN models with binary weights, Ternary Weight Networks can achieve better performance due to the increased weight precision. However, Ternary Weight Networks make use of only three values {-1, 0, +1} out of the four possible values that can be encoded with two bits.
In this paper, we propose Two-Bit Networks (TBNs) to further explore the trade-off between model compression rate and model capacity. (We focus on CNNs in this paper, although our techniques can be adapted to general DNNs, including Recurrent Neural Networks.) We constrain the weights to four values {-2, -1, +1, +2}, which can be encoded with two bits. Compared with existing weight compression methods, TBNs make more efficient use of the weight bitwidth to achieve higher model capacity and better performance. Arithmetic operations can be implemented with additions, subtractions, and shifts, which are efficient and hardware-friendly.
We propose a training algorithm for DNNs based on Stochastic Gradient Descent. During each iteration, a set of real-valued weights is discretized into two-bit values, which are used by the following forward pass and backward pass. Then, the real-valued weights are updated with gradients computed by the backward pass. During inference, only the two-bit weights are used. Experimental results show that our method achieves better performance on ImageNet than other weight compression methods.
2 Two-Bit Networks
In a CNN with $L$ layers, each layer $l \in \{1, \dots, L\}$ performs a convolution $*$ between its input tensor $I \in \mathbb{R}^{c \times w_{in} \times h_{in}}$ and each of its convolution filters $W \in \mathbb{R}^{c \times w \times h}$, where $(c, w_{in}, h_{in})$ and $(c, w, h)$ represent the shapes of the input tensor and the filter, respectively, including channels, width, and height. Let $n = c \cdot w \cdot h$ denote the number of elements in $W$, and $W_i$ denote the $i$-th element of $W$, with $i \in \{1, \dots, n\}$. For brevity, we drop the indexes $l$ and $i$ when they are unnecessary. Each real-valued convolution filter $W$ is approximated with a two-bit filter $B \in \{-2, -1, +1, +2\}^{c \times w \times h}$ and a scaling factor $\alpha > 0$, so that $W \approx \alpha B$. A convolution operation can be approximated by:
$$I * W \approx (I \oplus B)\,\alpha \quad (1)$$
where $\oplus$ denotes a convolution operation without multiplication.
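Since every weight magnitude is 1 or 2, each term of the multiplication-free convolution reduces to a sign flip plus an optional doubling (a left shift for integer inputs). The following is a minimal illustrative sketch for a single dot product, not the paper's implementation:

```python
def twobit_dot(weights, inputs):
    """Dot product where every weight is in {-2, -1, +1, +2}, using only
    additions, subtractions, and doublings (no general multiplications)."""
    acc = 0
    for w, x in zip(weights, inputs):
        term = x if abs(w) == 1 else x + x  # x + x == x << 1 for integers
        acc += term if w > 0 else -term
    return acc

print(twobit_dot([2, -1, 1, -2], [3, 5, 7, 2]))  # 2*3 - 5 + 7 - 2*2 = 4
```

The per-filter scaling factor $\alpha$ is then applied once to the accumulated result, so only one true multiplication is needed per output element.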
Ideally, the TBN should mimic its full-precision counterpart closely, provided that the quantization error between $W$ and its approximation $\alpha B$ is minimized. We seek to minimize the L2-norm of the quantization error for each convolution filter:
$$\alpha^*, B^* = \operatorname*{argmin}_{\alpha, B} J(\alpha, B) = \operatorname*{argmin}_{\alpha, B} \|W - \alpha B\|^2 \quad (2)$$
The optimization can be divided into two steps. First, the real-valued weights are discretized to find the two-bit weights $B$. Then, the optimal scaling factor $\alpha$ is found to minimize the quantization error, given the two-bit weights. For simplicity, we adopt deterministic discretization with a threshold $\Delta > 0$:
$$B_i = \begin{cases} +2 & \text{if } W_i > \Delta \\ +1 & \text{if } 0 \le W_i \le \Delta \\ -1 & \text{if } -\Delta \le W_i < 0 \\ -2 & \text{if } W_i < -\Delta \end{cases} \quad (3)$$
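As an illustration, a threshold-based deterministic discretization can be sketched as follows; the threshold value 0.5 is purely an assumption for the example, not a value prescribed by the text:

```python
def discretize(w, delta=0.5):
    """Map a real-valued weight to {-2, -1, +1, +2}.
    Weights with magnitude above the threshold delta get magnitude 2;
    the rest (including exactly zero) get magnitude 1, keeping the sign."""
    if w >= 0:
        return 2 if w > delta else 1
    return -2 if w < -delta else -1

print([discretize(w) for w in [0.9, 0.3, -0.1, -0.7]])  # [2, 1, -1, -2]
```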
Substituting the two-bit weights (3) into the expression for the quantization error, the expression can be simplified into:
$$J(\alpha) = \alpha^2 (n_1 + 4 n_2) - 2\alpha \Big( \sum_{i \in I_1} |W_i| + 2 \sum_{i \in I_2} |W_i| \Big) + c \quad (4)$$
where $I_1 = \{i : |B_i| = 1\}$, $I_2 = \{i : |B_i| = 2\}$, $n_1 = |I_1|$, $n_2 = |I_2|$, $c = \sum_{i=1}^{n} W_i^2$ is a constant independent of $\alpha$, and $|W_i|$ denotes the magnitude of $W_i$.
Taking the derivative of $J(\alpha)$ w.r.t. $\alpha$ and setting it to zero, we obtain the optimal scaling factor:
$$\alpha^* = \frac{\sum_{i \in I_1} |W_i| + 2 \sum_{i \in I_2} |W_i|}{n_1 + 4 n_2} \quad (5)$$
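The optimal scaling factor is the standard least-squares solution for fitting $W \approx \alpha B$ with $B$ fixed: weights discretized to magnitude 2 count twice in the numerator and four times in the denominator. A small illustrative sketch (not the authors' code), using a hypothetical filter and an assumed two-bit discretization of it:

```python
def optimal_alpha(W, B):
    """Closed-form least-squares scaling factor for W ~ alpha * B."""
    num = sum(abs(w) * abs(b) for w, b in zip(W, B))  # sum over I1 of |W_i| + 2 * sum over I2
    den = sum(b * b for b in B)                       # n1 + 4 * n2
    return num / den

W = [0.9, 0.3, -0.1, -0.7]   # hypothetical real-valued filter
B = [2, 1, -1, -2]           # its assumed two-bit discretization
alpha = optimal_alpha(W, B)
print(round(alpha, 6))  # 0.36
```

One can check that perturbing alpha in either direction only increases the quantization error, consistent with the derivative argument above.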
3 Training Two-Bit Networks
We describe details of the training algorithm for Two-Bit Networks based on Stochastic Gradient Descent (SGD). Algorithm 1 shows the pseudocode for each training iteration. In order to keep track of the tiny weight updates of each iteration, we adopt a trick similar to [3] and maintain a set of real-valued convolution filters throughout the training process. First, approximate filters are computed from the real-valued filters for all convolutional layers (Lines 3-9). Note that fully-connected layers can be treated as convolutional layers [7]. The real-valued weights are discretized into two-bit weights (Line 5), and an optimal scaling factor is computed for each filter (Line 6). Then, a forward pass is run on the network inputs (Line 10), followed by a backward pass, which backpropagates errors through the network to compute gradients w.r.t. the approximate filters (Line 11). Unlike conventional CNNs, the forward and backward passes use the approximate filters instead of the real-valued filters. Finally, the real-valued filters are updated with the gradients (Line 12). During inference, only the two-bit filters and the optimal scaling factors are used.
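The iteration above can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions (a caller-supplied `grad_fn` stands in for the forward/backward passes, and the discretization threshold 0.5 is an assumption of the example); it is not the authors' Torch7 code:

```python
import numpy as np

def discretize_filter(W, delta=0.5):
    """Discretize a real-valued filter to {-2, -1, +1, +2} and compute the
    least-squares scaling factor alpha for W ~ alpha * B."""
    B = np.where(np.abs(W) > delta, 2.0, 1.0) * np.sign(W)
    B[B == 0] = 1.0  # sign(0) == 0; map it to +1 so B stays in {-2,-1,1,2}
    alpha = (np.abs(W) * np.abs(B)).sum() / (B * B).sum()
    return B, alpha

def train_step(real_filters, grad_fn, lr=0.1):
    # 1) compute approximate filters alpha * B from the real-valued filters
    approx = [alpha * B for B, alpha in (discretize_filter(W) for W in real_filters)]
    # 2) forward + backward passes use the approximate filters
    grads = grad_fn(approx)
    # 3) the gradients update the *real-valued* filters, preserving tiny updates
    return [W - lr * g for W, g in zip(real_filters, grads)]
```

The key design point, as in BinaryConnect [3], is step 3: gradients flow through the discretized filters but accumulate in the real-valued ones, so updates smaller than the quantization step are not lost.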
4 Experiments
We use the well-known ImageNet dataset (ILSVRC2012) to evaluate the performance of Two-Bit Networks. ImageNet is a computer vision benchmark dataset with a large number of labeled images, divided into a training set and a validation set. The training set consists of 1.2M images from 1K categories, including animals, plants, and other common items; the validation set contains 50K images.
We use Deep Residual Networks (DRNs) [8] as the CNN architecture in our experiments, which have achieved state-of-the-art performance on ImageNet. Compared with other CNN architectures, DRNs can have a large number of convolutional layers (from 18 to 152), and a shortcut connection that performs linear projection exists alongside each group of two consecutive convolutional layers, in order to reformulate the layers as learning functions with reference to the layer inputs. For simplicity, we adopt ResNet-18, which has 18 convolutional layers and is the smallest model presented in their paper.
The experiments are conducted with Torch7 [9] on an NVIDIA Titan X. At training time, images are randomly cropped to 224×224 windows. We run the training algorithm for 58 epochs with a batch size of 256. We use SGD with a momentum of 0.9 to update parameters and batch normalization [10] to speed up convergence. The weight decay is set to 0.0001. The learning rate starts at 0.1 and is divided by 10 at epochs 30, 40, and 50. At inference time, we use the 224×224 center crops for forward propagation.
We compare our method with state-of-the-art weight compression methods, including Ternary Weight Networks [5], Binary Weight Networks, and XNOR-Net [2]. Fig. 1 shows performance results (classification accuracy) on ImageNet. Our method outperforms the other weight compression methods, with a top-5 accuracy of 84.5% and a top-1 accuracy of 62.6%. We attribute the improved performance to the increased model capacity due to more efficient use of the two-bit representation for weights.
Fig. 2 shows the memory size requirement of Two-Bit Networks compared to a double-precision floating-point representation for three different architectures (AlexNet, ResNet-18, and VGG-19). The dramatic reduction in memory size makes Two-Bit Networks suitable for deployment on embedded devices.
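As a back-of-the-envelope check of the reduction (not the exact accounting behind Fig. 2), counting only the weights and using the 61M AlexNet parameter figure from the introduction:

```python
def model_megabytes(num_params, bits_per_weight):
    """Weight storage in MiB for a given parameter count and bitwidth."""
    return num_params * bits_per_weight / 8 / 2**20

alexnet_params = 61_000_000                   # parameter count from the introduction
full = model_megabytes(alexnet_params, 64)    # double-precision floating point
twobit = model_megabytes(alexnet_params, 2)   # two bits per weight
print(f"{full:.0f} MB -> {twobit:.1f} MB ({full / twobit:.0f}x smaller)")
```

Two bits versus 64 bits gives a 32x reduction in weight storage, which is the order of magnitude of savings the figure illustrates (per-filter scaling factors add a small overhead on top of this).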
5 Conclusion
We have presented Two-Bit Networks for model compression of CNNs, which achieve a good trade-off between model size and performance compared with state-of-the-art weight compression methods. Compared to recent work on binary weights and/or activations, our method achieves higher model capacity and better performance with a slightly larger memory size requirement.
This work is partially supported by NSFC Grant #61672454. The Titan X GPU used for this research was donated by the NVIDIA Corporation.
Wenjia Meng, Zonghua Gu, Ming Zhang and Zhaohui Wu (College of Computer Science, Zhejiang University, Hangzhou, China, 310027)
Email: zgu@zju.edu.cn
References
 [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
 [2] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks," CoRR, vol. abs/1603.0, 2016.
 [3] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations," in Advances in Neural Information Processing Systems 28, 2015, pp. 3123-3131.
 [4] M. Courbariaux and Y. Bengio, "Binary Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1," CoRR, vol. abs/1602.0, 2016.
 [5] F. Li and B. Liu, "Ternary Weight Networks," CoRR, vol. abs/1605.04711, 2016.
 [6] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both Weights and Connections for Efficient Neural Network," in Advances in Neural Information Processing Systems 28, 2015, pp. 1135-1143.
 [7] J. Long, E. Shelhamer, and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
 [8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," CoRR, vol. abs/1512.03385, 2015.
 [9] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: A Matlab-like Environment for Machine Learning," in BigLearn, NIPS Workshop, 2011, no. EPFL-CONF-192376.
 [10] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 448-456.