Deep convolutional neural networks (DCNNs) have shown significant ability to learn powerful feature representations directly from image pixels. However, their success has come with requirements for large amounts of memory and computational power, and the practical applications of most DCNNs are limited on smaller embedded platforms and in mobile applications. In light of this, substantial research efforts are being invested in saving bandwidth and computational power by pruning and compressing redundant parameters generated by convolution kernels [Boureau, Ponce, and LeCun2010, Zhou et al.2017].
One promising approach is to compress the representations by approximating the floating-point weights of the convolution kernels with binary values [Rastegari et al.2016, Courbariaux et al.2016, Courbariaux, Bengio, and David2015]. Recently, Local Binary Convolution (LBC) layers were introduced in [Xu, Boddeti, and Savvides2016] to approximate the non-linearly activated responses of standard convolutional layers. In [Courbariaux, Bengio, and David2015], a BinaryConnect scheme that uses real-valued weights as a key reference is exploited for the binarization process. In [Courbariaux et al.2016, Rastegari et al.2016], XNOR-Net is presented, where both the kernel weights and the inputs to the convolution are approximated with binary values, thus allowing an efficient implementation of the convolutional operations. In [Zhou et al.2016, Lin, Zhao, and Pan2017], DoReFa-Net exploits bit convolution kernels with low-bitwidth parameter gradients to accelerate both training and inference, while ABC-Net [Lin, Zhao, and Pan2017] adopts multiple binary weights and activations to approximate full-precision weights so that the degradation in prediction accuracy can be alleviated. More recently, a simple fixed scaling method incorporated in a 1-bit convolutional layer was employed to binarize CNNs, obtaining close-to-baseline results with minimal changes [McDonnell2018]. Modulated convolutional networks (MCNs) are presented in [Wang et al.2018] to binarize only the kernels, which achieves better results than the baselines.
While greatly reducing storage requirements, these BNNs generally suffer significant accuracy degradation compared to networks using full-precision kernels. This is primarily due to the following two reasons. (1) The binarization of CNNs is essentially a discrete optimization problem, but this has long been neglected in previous work. Discrete optimization methods can often provide strong guarantees about the quality of the solutions they find and lead to much better performance in practice [Felzenszwalb and Zabih2007, Kim et al.2017, Laude et al.2018]. (2) The loss caused by the binarization of CNNs has not been well studied.
In this paper, we propose a new discrete back propagation via projection (DBPP) algorithm to efficiently build our projection convolutional neural networks (PCNNs) and obtain highly accurate yet robust BNNs. In the theoretical framework, we derive, for the first time, a projection loss by exploiting the capacity of discrete optimization in our DBPP algorithm for model compression. The projection loss has two advantages: on the one hand, it can be jointly learned with the conventional cross-entropy loss in the same back propagation pipeline; on the other hand, it can enrich diversity and thus improve the modeling capacity of PCNNs. As shown in Fig. 1, we develop a generic projection convolution layer that can be easily used in existing convolutional networks, where both the quantized kernels and the projection are jointly optimized in an end-to-end manner. Because the projection matrices are used only for optimization and not for inference, the learning architecture is compact and portable; the PCNN model can be highly compressed and efficient, and it outperforms all other state-of-the-art BNNs. The contributions of this paper include:
(1) A new discrete back propagation via projection (DBPP) algorithm is proposed to build BNNs in an end-to-end manner. By exploiting multiple projections, we learn a set of diverse quantized kernels that compress the full-precision models more effectively.
(2) A projection loss is theoretically achieved in DBPP, and we develop a generic projection convolutional layer to efficiently binarize existing convolutional networks, such as VGGs and Resnets.
(3) PCNNs achieve the best classification performance compared to other state-of-the-art BNNs on the ImageNet and CIFAR datasets.
Existing CNN compression work generally follows three pathways: quantized neural networks (QNNs) [Li et al.2017, He, Zhang, and Sun2017, Zhou et al.2018, Wu et al.2018, Han, Mao, and J. Dally2016], sparse connections [Denton et al.2014, Tai et al.2015], and the design of new CNN architectures [G. Howard et al.2017, Zhang et al.2017, N. Iandola et al.2017]. Recent research on QNNs has considerably reduced the memory requirements and computational complexity of DCNNs by using codebook-based network pruning [Li et al.2017], Huffman encoding [Han, Mao, and J. Dally2016], hash functions [Chen et al.2015], low-bitwidth weights and activations [Hubara et al.2016], and, more generally, low-bitwidth gradients [Zhou et al.2016] and errors [Wu et al.2018]. BinaryNet, based on BinaryConnect, is proposed to train DCNNs with binary weights, where the activations are triggered at run-time while the parameters are computed at training time [Courbariaux et al.2016]. XNOR-Net is introduced to approximate the convolution operation using primarily binary operations; it reconstructs the unbinarized kernels using binary kernels with a single scaling factor [Rastegari et al.2016].
|: full-precision kernel||: projection matrix||: set of||: discrete set|
|: quantized kernel||: duplicated||: trade-off scaler for||: discrete value in|
|: kernel index||: projection index||: layer index||: iteration index|
|: number of kernels||: total projection number||: number of layers||: plane index|
Our target is similar to those binarization methods that all attempt to reduce memory consumption and replace most arithmetic operations with binary operations. However, the way we quantize the kernels is different in two respects. First, we use an efficient discrete optimization based on projection, in which the continuous values tend to converge to a set of nearest discrete values within our framework. Second, multiple projections are introduced to bring diversity into BNNs and further improve the performance.
Discrete Back Propagation via Projection
Discrete optimization is one of the most active topics in mathematics and is widely used to solve computer vision problems [Kim et al.2017, Laude et al.2018]. In this paper, we propose a new discrete back propagation algorithm, where a projection function is exploited to binarize or quantize the input variables in a unified framework. Thanks to the flexible projection scheme in use, we obtain diverse binarized models with higher performance than previous ones. In Table 1, we describe the main notation used in the following sections.
In our work, we define the quantization of the input variable as a projection onto a set,
where each element , satisfies the constraint , and is the discrete value of the input variable. Then we define the projection of onto as:
which indicates that the projection aims to find the nearest discrete value for each continuous value .
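The nearest-value projection described above can be illustrated with a minimal sketch. The function names `project` and `project_all`, the scalar treatment of each entry, and the tie-breaking rule are assumptions made for illustration and are not taken from the paper.

```python
def project(x, omega):
    """Return the element of the discrete set omega closest to the scalar x.

    Ties are broken in favour of the element listed first (an arbitrary
    choice made for this sketch).
    """
    return min(omega, key=lambda a: abs(x - a))


def project_all(xs, omega):
    """Project every continuous value in xs onto omega independently."""
    return [project(x, omega) for x in xs]


if __name__ == "__main__":
    omega = [-1.0, 1.0]  # a two-element (binary) discrete set
    print(project_all([0.3, -0.7, 2.5, -0.1], omega))
    # prints [1.0, -1.0, 1.0, -1.0]
```

With a two-element set the projection reduces to a sign-like binarization; a larger set gives a multi-level quantizer under the same rule.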
whose gradient exists, we minimize it based on the discrete optimization method. Conventionally, the discrete optimization problem is solved by searching for an optimal set of discrete values with respect to the minimization of a loss function. We propose that in theth iteration, based on the projection in Eq. 2, is quantized to as:
which is used to define our optimization problem as:
where is a projection matrix (since the kernel parameters are represented as a matrix, is also represented as a matrix), and denotes the Hadamard product. The new minimization problem in (4) is hard to solve using back propagation (e.g., deep learning paradigm) due to the new constraint. To solve the problem within the back propagation framework, we define our update rule as:
where the superscript is dropped from , is the gradient of with respect to , and is the learning rate. The quantization process , i.e., , is equivalent to finding the projection of onto as:
where is the total number of projection matrices. The new added part (right) shown in (7) is a quadratic function, and is referred to as the projection loss.
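As a rough illustration of a quadratic projection loss summed over several projection matrices, the sketch below assumes the form (lam / 2) * sum_j ||x_hat_j - w_j ∘ x||^2, where x_hat_j is the nearest-value projection of w_j ∘ x. This form, the trade-off weight `lam`, and all function names are assumptions for illustration; the exact expression in (7) is not reproduced here and may contain additional terms.

```python
def project(x, omega):
    """Nearest element of the discrete set omega to the scalar x."""
    return min(omega, key=lambda a: abs(x - a))


def hadamard(w, x):
    """Element-wise (Hadamard) product of two equal-length vectors."""
    return [wi * xi for wi, xi in zip(w, x)]


def projection_loss(x, projections, omega, lam):
    """Assumed quadratic projection loss over all projection vectors.

    For each projection w_j, the projected variable w_j * x is quantized
    onto omega and the squared quantization error is accumulated.
    """
    total = 0.0
    for w in projections:
        wx = hadamard(w, x)
        x_hat = [project(v, omega) for v in wx]
        total += sum((q - v) ** 2 for q, v in zip(x_hat, wx))
    return 0.5 * lam * total


if __name__ == "__main__":
    x = [0.5, -1.5]
    projections = [[1.0, 1.0], [2.0, 0.5]]  # two hypothetical projections
    print(projection_loss(x, projections, [-1.0, 1.0], lam=1.0))
```

Under this assumed form the loss vanishes exactly when every projected value already lies in the discrete set, which is the sense in which it pulls continuous values toward their nearest discrete values.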
Projection Convolutional Neural Networks
Projection convolutional neural networks (PCNNs), shown in Fig. 1, work by taking advantage of DBPP for model quantization. To this end, we reformulate our projection loss shown in (7) into the deep learning paradigm as:
where , denotes the th kernel tensor of the th convolutional layer in the th iteration. is the quantized kernel of via projection as:
where is a tensor, calculated by duplicating a learned projection matrix along the channels, which thus fits the dimension of . is the gradient at calculated based on , i.e., . The iteration index is omitted in the following for simplicity.
In PCNNs, both the cross-entropy loss and projection loss are used to build the total loss as:
Forward Propagation based on Projection Convolution Layer
For each full-precision kernel , the corresponding quantized kernels are concatenated to construct the kernel which actually participates in convolution operation.
where denotes the concatenation operation on tensors.
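The construction of several quantized kernels from one full-precision kernel can be sketched as follows. For brevity the sketch works on a single 2D kernel plane and represents the concatenation as a Python list; the names `quantize_kernel` and `build_quantized_stack` are hypothetical, and the duplication of the projection matrix along channels is omitted.

```python
def project(x, omega):
    """Nearest element of the discrete set omega to the scalar x."""
    return min(omega, key=lambda a: abs(x - a))


def quantize_kernel(kernel, proj, omega):
    """Quantize one 2D kernel plane: Hadamard product with the projection
    matrix, followed by element-wise projection onto omega."""
    return [
        [project(p * c, omega) for p, c in zip(proj_row, kernel_row)]
        for proj_row, kernel_row in zip(proj, kernel)
    ]


def build_quantized_stack(kernel, projections, omega):
    """Return one quantized kernel per projection matrix, in order.

    The returned list plays the role of the concatenated kernel that
    takes part in the convolution."""
    return [quantize_kernel(kernel, p, omega) for p in projections]


if __name__ == "__main__":
    kernel = [[0.2, -0.4], [0.1, 0.3]]
    projections = [
        [[1.0, 1.0], [1.0, 1.0]],    # identity-like projection
        [[-1.0, 1.0], [1.0, -1.0]],  # a second, different projection
    ]
    for q in build_quantized_stack(kernel, projections, [-1.0, 1.0]):
        print(q)
```

Different projection matrices yield different binary kernels from the same full-precision kernel, which is the source of the diversity discussed in the text.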
In PCNNs, the projection convolution is implemented based on and to calculate the feature map of the next layer:
where is the traditional 2D convolution. Although our convolutional kernels are 3D-shaped tensors, we design the following strategy to fit the traditional 2D convolution:
where denotes the convolutional operation. is the th channel of the th feature map in the th convolutional layer and denotes the th feature map in the th convolutional layer.
It should be emphasized that we can utilize multiple projections to enrich the diversity of convolutional kernels , though a single projection already achieves much better performance. This is due to the DBPP in use, which clearly differs from [Lin, Zhao, and Pan2017], which is based on a single quantization scheme. Within our convolutional scheme, there is no dimension disagreement between feature maps and kernels in two successive layers. Thus, we can replace the traditional convolutional layers with ours to modify widely-used networks, such as VGGs and Resnets. At inference time, we only store the set of quantized kernels instead of the full-precision ones; that is, the projection matrices are not used for inference, which reduces storage.
According to Eq. 10, what should be learned and updated are the full-precision kernels and projection matrix () using the update equations described below.
We define as the gradient of the full-precision kernel , and have:
where is the learning rate for the convolutional kernels.
Likewise, the gradient of the projection parameter consists of the following two parts:
where is the learning rate for . We further have:
where indicates the th plane of the tensor along the channels. This shows that the proposed algorithm is trainable in an end-to-end manner, and we summarize the training procedure in Alg. 1. In implementation, we use the mean of in the forward process, but keep the original in the backward propagation.
Note that in PCNNs for BNNs, we set =2 and =. Two binarization processes are used in PCNNs. The first one is the kernel binarization, which is done based on the projection onto , whose elements are calculated based on the mean of absolute values of all the full-precision kernels per layer [Rastegari et al.2016] as:
where is the total number of kernels.
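The kernel binarization step can be sketched as below, assuming the two-element set {-a, +a} with a equal to the mean absolute value of a layer's full-precision weights. Averaging over every individual weight, the function names, and sending zero to +a are assumptions of this sketch; the paper's exact normalization is not recoverable from the text above.

```python
def layer_scale(kernels):
    """Mean absolute value over every weight of every kernel in a layer."""
    magnitudes = [abs(v) for kernel in kernels for row in kernel for v in row]
    return sum(magnitudes) / len(magnitudes)


def binarize_layer(kernels):
    """Project each weight onto {-a, +a}, with a shared per-layer scale a.

    For a symmetric two-element set the nearest element is decided by the
    sign of the weight; zero is mapped to +a by convention here.
    """
    a = layer_scale(kernels)
    binary = [
        [[a if v >= 0 else -a for v in row] for row in kernel]
        for kernel in kernels
    ]
    return a, binary


if __name__ == "__main__":
    layer = [[[0.5, -1.5]], [[1.0, -1.0]]]  # two tiny 1x2 kernels
    scale, binary = binarize_layer(layer)
    print(scale, binary)
```

Only the sign pattern and the single scale `a` need to be kept per layer, which is what makes the stored model 1-bit per weight.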
PCNNs are evaluated on the object classification task with CIFAR10/100 [Krizhevsky, Nair, and Hinton2014] and ILSVRC12 ImageNet datasets [Deng et al.2009]. Our DBPP algorithm can be applied to any DCNNs. For fair comparison with other state-of-the-art BNNs, we use Wide-Resnet (WRN) [Zagoruyko and Komodakis2016], Resnet18 [He et al.2016] and VGG16 [Simonyan and Zisserman2014] as our full-precision backbone networks, to build our PCNNs by replacing their full-precision convolution layers with our projection convolution.
Datasets and Implementation details
CIFAR 10/100 [Krizhevsky, Nair, and Hinton2014] are natural image classification datasets containing a training set of 50K and a testing set of 10K color images across the 10/100 classes. For CIFAR10/100 and parameter study, we employ WRNs to evaluate our PCNNs and report the accuracies. Unlike CIFAR 10/100, ILSVRC12 ImageNet object classification dataset [Deng et al.2009] is more challenging due to its large scale and greater diversity. There are 1000 classes and 1.2 million training images and 50k validation images in it. For comparison of our method to the state-of-the-art on the ImageNet dataset, we adopt Resnet18 and VGG16 to validate the superiority and effectiveness of PCNNs.
WRN is a network structure similar to Resnet with a depth factor to control the feature map depth dimension expansion through 3 stages, within which the dimensions remain unchanged. For simplicity we fix the depth factor to . Each WRN has a parameter which indicates the channel dimension of the first stage, and we vary it between 16 and 64, leading to two network structures, 16-16-32-64 and 64-64-128-256. Other settings, including training details, are the same as in [Zagoruyko and Komodakis2016], except that we add dropout layers with a ratio of 0.3 only to the 64-64-128-256 structure to prevent overfitting. WRN-22 indicates a network with 22 convolutional layers, and similarly for WRN-40.
Resnet18 and VGG16:
For these two networks, we simply replace the convolutional layers with the projection convolution layer and keep other components unchanged. To train the two networks, we choose SGD as the optimization algorithm with a momentum of 0.9 and a weight decay . The initial learning rate for is 0.01 and for and other parameters the initial learning rates are 0.1, with a degradation of 10% for every 20 epochs before it reaches the maximum epoch of 60.
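The optimizer settings above can be sketched as follows. The sketch assumes that the learning-rate "degradation" means multiplying the rate by 0.1 every 20 epochs, and it uses a common SGD-with-momentum update with L2 weight decay; both are interpretations rather than details confirmed by the text, and the function names are hypothetical.

```python
def step_lr(initial_lr, epoch, step=20, factor=0.1):
    """Learning rate after a step schedule: scaled by `factor` every `step`
    epochs (assumed interpretation of the schedule in the text)."""
    return initial_lr * factor ** (epoch // step)


def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9, weight_decay=0.0):
    """One scalar SGD update with momentum and L2 weight decay.

    Returns the updated weight and the updated velocity buffer.
    """
    g = grad + weight_decay * w
    velocity = momentum * velocity + g
    return w - lr * velocity, velocity


if __name__ == "__main__":
    for epoch in (0, 19, 20, 40, 59):
        print(epoch, step_lr(0.1, epoch))
    w, v = 1.0, 0.0
    w, v = sgd_momentum_step(w, grad=0.5, velocity=v, lr=step_lr(0.1, 0))
    print(w, v)
```

With a maximum of 60 epochs this schedule yields three learning-rate plateaus per parameter group.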
As mentioned above, the proposed projection loss has the ability to control the process of quantization, similar to clustering. We compute the distributions of the full-precision kernels and visualize the results in Figs. 2 and 3. The hyper-parameter is designed to balance the projection loss and the cross-entropy loss. We vary it from to and finally set it to in Fig. 2, where the variance becomes larger as decreases. When =0, only one cluster is obtained, where the kernel weights are distributed tightly around the threshold=0. This could result in instability during binarization, because a small amount of noise may cause a positive weight to become negative and vice versa.
We also show in Fig. 3 how the distribution evolves under the projection loss during training. A natural question is: do we always need a large ? For a discrete optimization problem, the answer is no, and the experiment in Table 2 verifies this, i.e., both the projection loss and the cross-entropy loss should be considered at the same time with a good balance. For example, when is set to , the accuracy is higher than those with other values. Thus, we fix to in the following experiments.
For PCNN-22 in Table 2, the PCNNs model is trained for 200 epochs and then used to conduct inference. In Fig. 4, we plot the training and testing loss with =0 and =1e-4, respectively. It clearly shows that PCNNs with =1e-4 (blue curves) converge faster than PCNNs with =0 (yellow curves) when the epoch number .
In Fig. 5, we visualize four channels of the binary kernels in the first row, feature maps produced by in the second row, and corresponding feature maps after binarization in the third row when =4, which illustrates the diversity of kernels and feature maps in PCNNs. Thus, the multiple projection functions can capture diverse information and yield high performance with the compressed models.
|Model||#Params||CIFAR-10 (%)||CIFAR-100 (%)|
|PCNN (=16, =1)||0.27M||89.17||62.66|
|PCNN (=16, =2)||0.54M||91.27||66.86|
|PCNN (=16, =4)||1.07M||92.79||70.09|
|PCNN (=64, =1)||4.29M||94.31||76.93|
|PCNN (=64, =4)||17.16M||95.39||78.13|
Results on CIFAR-10/100 datasets
We first compare our PCNNs with the original WRNs with initial channel dimension =16 and 64 in Table 3. We compare the performance of different models with a similar number of parameters. Thanks to the multiple projections in use (=4), our results on both datasets are comparable with those of the full-precision networks. Then, we compare our results with other state-of-the-art methods such as BinaryConnect [Courbariaux, Bengio, and David2015], BNN [Courbariaux et al.2016], LBCNN [Xu, Boddeti, and Savvides2016], BWN, XNOR-Net [Rastegari et al.2016] and the model in [McDonnell2018]. At least a 1.5% accuracy improvement is gained with our PCNNs, and in most cases the margins are large, which indicates that DBPP is effective for model compression. As shown in Table 3, increasing can also boost performance, and the flexible projection scheme exhibits better modeling ability than the others through greater feature diversity. However, for small models like PCNN (=16), increasing appears more significant for avoiding accuracy degradation, whereas for large models like PCNN (=64) it yields a relatively small improvement. We therefore suggest using a small =1 to avoid additional computation in large models, such as the PCNNs based on Resnet18 and VGG16, which can still outperform the others.
Results on ILSVRC12 ImageNet classification dataset
|PCNN||1||32||63.5, 66.1||85.1, 86.7|
For the ImageNet dataset, we employ two data augmentation techniques sequentially: 1) randomly cropping patches of from the original image, and 2) horizontally flipping the extracted patches during training. During testing, the Top-1 and Top-5 accuracies on the validation set are measured with a single center crop. We modify the architecture of Resnet18 following [Liu et al.2018] with additional PReLU [He et al.2015], and the final results of our PCNNs are finetuned from pretrained models with only the kernel weights binarized, halving the learning rate during training.
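The augmentation pipeline described above (random crop followed by horizontal flip in training, a single center crop in testing) can be sketched on plain nested lists. The crop size is left as a parameter because it is not stated in the text above, and the flip probability of 0.5 is an assumption; all function names are hypothetical.

```python
import random


def random_crop(img, size, rng):
    """Cut a size x size patch at a uniformly random location."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - size)
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in img[top:top + size]]


def horizontal_flip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def center_crop(img, size):
    """Cut the central size x size patch (used at test time)."""
    h, w = len(img), len(img[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]


def augment(img, size, rng, flip_prob=0.5):
    """Training-time augmentation: random crop, then an optional flip."""
    patch = random_crop(img, size, rng)
    if rng.random() < flip_prob:
        patch = horizontal_flip(patch)
    return patch


if __name__ == "__main__":
    image = [[4 * r + c for c in range(4)] for r in range(4)]  # toy 4x4 image
    print(augment(image, 2, random.Random(0)))
    print(center_crop(image, 2))
```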
In Table 4, we compare our PCNNs with several other state-of-the-art models. The first part of the comparison is based on Resnet18 with 69.3% Top-1 accuracy on the full-precision model. Although BWN [Rastegari et al.2016] and DoReFa-Net [Zhou et al.2016] achieve Top-1 accuracy with degradation of less than 10%, it should be noted that they apply full-precision and 4-bit activation respectively. With both of the weights and activations binarized, the BNN model in [Courbariaux et al.2016], ABC-Net [Lin, Zhao, and Pan2017] and XNOR-Net [Rastegari et al.2016] fail to maintain the accuracy and are inferior to our PCNN. For example, compared with the result of XNOR-Net, PCNN increases the Top-1 accuracy by 6.1%. For a fair comparison, we set the activation of our PCNNs to full-precision and vary from 1 to 2, and these two results consistently outperform BWN. We also compare our method with the full-precision VGG16 model, and the accuracy drop is tolerable if only weights are binarized (see the last two rows). Note that our DBPP algorithm still works very well on the large dataset, particularly when =1, which further validates the significance of our method.
In short, we achieve new state-of-the-art performance among BNNs and come much closer to the full-precision models in extensive experiments, which clearly validates the superiority of DBPP for building BNNs.
Memory Usage and Efficiency Analysis
|Model||Memory usage||Memory saving||Speedup|
Memory usage is analyzed by comparing our approach with the state-of-the-art XNOR-Net [Rastegari et al.2016] and the corresponding full-precision network. The memory usage is computed as the summation of bits times the number of full-precision kernels and 1 bit times the number of the binary kernels in the networks. As shown in Table 5, our proposed PCNNs, along with XNOR-Net, reduce the memory usage by times compared with the full-precision Resnet18. Note that when is set to , the parameter amount of our model is almost the same as that of XNOR-Net. The reason is that the projection parameters are used only during training to enrich the diversity in PCNNs and are not used during inference. For efficiency analysis, if all of the operands of the convolutions are binary, then the convolutions can be estimated by XNOR and bitcounting operations [Courbariaux et al.2016], which gains speedup in CPUs [Rastegari et al.2016]. For , the memory usage and computation cost for convolution are linear in .
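The XNOR-and-bitcount evaluation of a binary dot product, together with the memory accounting described above, can be sketched as follows. The bit-packing convention (bit set for +1), the 32-bit width assumed for full-precision values, and the function names are assumptions of this sketch rather than details from the paper.

```python
def pack_bits(signs):
    """Pack a list of +1/-1 values into an integer; bit i is set when
    signs[i] is +1."""
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1, +1} vectors from their packed bits.

    XNOR marks positions where the signs agree; each agreement contributes
    +1 and each disagreement -1, so the dot product is 2 * agreements - n.
    """
    mask = (1 << n) - 1
    agreements = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * agreements - n


def memory_bits(n_full_precision, n_binary, full_bits=32):
    """Total storage in bits: full_bits per full-precision value plus
    1 bit per binary value (full_bits=32 is an assumed default)."""
    return full_bits * n_full_precision + n_binary


if __name__ == "__main__":
    a = [1, -1, 1, 1]
    b = [1, 1, -1, 1]
    print(xnor_dot(pack_bits(a), pack_bits(b), len(a)))  # prints 0
    print(memory_bits(n_full_precision=10, n_binary=100))  # prints 420
```

The packed form replaces n multiply-accumulate operations with one XOR, one NOT, and one population count per machine word, which is where the CPU speedup comes from.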
Conclusion and future work
We have proposed an efficient discrete back propagation via projection (DBPP) algorithm to obtain our projection convolutional neural networks (PCNNs), which can significantly reduce the storage requirement for computationally limited devices. PCNNs have been shown to obtain much better performance than other state-of-the-art BNNs on the ImageNet and CIFAR datasets. As a general convolutional layer, the projection convolution can also be used in other deep models and different tasks, which will be explored in our future work.
- [Boureau, Ponce, and LeCun2010] Boureau, Y.-L.; Ponce, J.; and LeCun, Y. 2010. A theoretical analysis of feature pooling in visual recognition. In ICML, 111–118.
- [Chen et al.2015] Chen, W.; Wilson, J.; Tyree, S.; Weinberger, K.; and Chen, Y. 2015. Compressing neural networks with the hashing trick. In ICML, 2285–2294.
- [Courbariaux, Bengio, and David2015] Courbariaux, M.; Bengio, Y.; and David, J.-P. 2015. Binaryconnect: Training deep neural networks with binary weights during propagations. In NIPS, 3123–3131.
- [Courbariaux et al.2016] Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; and Bengio, Y. 2016. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
- [Deng et al.2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In CVPR, 248–255.
- [Denton et al.2014] Denton, E. L.; Zaremba, W.; Bruna, J.; LeCun, Y.; and Fergus, R. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 1269–1277.
- [Felzenszwalb and Zabih2007] Felzenszwalb, P., and Zabih, R. 2007. Discrete optimization algorithms in computer vision. Tutorial at CVPR.
- [G. Howard et al.2017] G. Howard, A.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; and Adam, H. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. In CVPR.
- [Han, Mao, and J. Dally2016] Han, S.; Mao, H.; and J. Dally, W. 2016. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR.
- [He et al.2015] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 1026–1034.
- [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
- [He, Zhang, and Sun2017] He, Y.; Zhang, X.; and Sun, J. 2017. Channel pruning for accelerating very deep neural networks. In International Conference on Computer Vision (ICCV), volume 2.
- [Hubara et al.2016] Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; and Bengio, Y. 2016. Quantized neural networks: Training neural networks with low precision weights and activations. JMLR 18(187):1–30.
- [Kim et al.2017] Kim, S.; Min, D.; Lin, S.; and Sohn, K. 2017. Dctm: Discrete-continuous transformation matching for semantic flow. In ICCV, 4529–4538.
- [Krizhevsky, Nair, and Hinton2014] Krizhevsky, A.; Nair, V.; and Hinton, G. 2014. The cifar-10 dataset. online: http://www.cs.toronto.edu/~kriz/cifar.html.
- [Laude et al.2018] Laude, E.; Lange, J.-H.; Schüpfer, J.; Domokos, C.; Laura, L.-T.; Schmidt, F. R.; Andres, B.; and Cremers, D. 2018. Discrete-continuous admm for transductive inference in higher-order mrfs. In CVPR, 4539–4548.
- [Li et al.2017] Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; and Peter Graf, H. 2017. Pruning filters for efficient convnets. In ICLR.
- [Lin, Zhao, and Pan2017] Lin, X.; Zhao, C.; and Pan, W. 2017. Towards accurate binary convolutional neural network. In NIPS, 345–353.
- [Liu et al.2018] Liu, Z.; Wu, B.; Luo, W.; Yang, X.; Liu, W.; and Cheng, K.-T. 2018. Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In ECCV, 747–763.
- [McDonnell2018] McDonnell, M. D. 2018. Training wide residual networks for deployment using a single bit for each weight. In ICLR.
- [N. Iandola et al.2017] N. Iandola, F.; Han, S.; W. Moskewicz, M.; Ashraf, K.; J. Dally, W.; and Keutzer, K. 2017. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and 0.5mb model size. In ICLR.
- [Rastegari et al.2016] Rastegari, M.; Ordonez, V.; Redmon, J.; and Farhadi, A. 2016. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 525–542.
- [Simonyan and Zisserman2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- [Tai et al.2015] Tai, C.; Xiao, T.; Zhang, Y.; Wang, X.; and E, W. 2015. Convolutional neural networks with low-rank regularization. In Computer Science.
- [Wang et al.2018] Wang, X.; Zhang, B.; Li, C.; Ji, R.; Han, J.; Cao, X.; and Liu, J. 2018. Modulated convolutional networks. In CVPR, 840–848.
- [Wu et al.2018] Wu, S.; Li, G.; Chen, F.; and Shi, L. 2018. Training and inference with integers in deep neural networks. In ICLR.
- [Xu, Boddeti, and Savvides2016] Xu, J. F.; Boddeti, V. N.; and Savvides, M. 2016. Local binary convolutional neural networks. In CVPR, 19–28.
- [Zagoruyko and Komodakis2016] Zagoruyko, S., and Komodakis, N. 2016. Wide residual networks. arXiv preprint arXiv:1605.07146.
- [Zhang et al.2017] Zhang, X.; Zhou, X.; Lin, M.; and Sun, J. 2017. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083.
- [Zhou et al.2016] Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; and Zou, Y. 2016. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.
- [Zhou et al.2017] Zhou, Y.; Ye, Q.; Qiu, Q.; and Jiao, J. 2017. Oriented response networks. In CVPR, 4961–4970.
- [Zhou et al.2018] Zhou, A.; Yao, A.; Wang, K.; and Chen, Y. 2018. Explicit loss-error-aware quantization for low-bit deep neural networks. In CVPR, 9426–9435.