Widening and Squeezing: Towards Accurate and Efficient QNNs

by Chuanjian Liu, et al.
HUAWEI Technologies Co., Ltd.

Quantization neural networks (QNNs) are very attractive to industry because of their extremely cheap calculation and storage overhead, but their performance is still worse than that of networks with full-precision parameters. Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques. However, our experiments show that the representation capability of quantized features is far weaker than that of full-precision features. We address this problem by projecting the features of the original full-precision networks to high-dimensional quantized features. Simultaneously, redundant quantized features are eliminated in order to avoid unrestricted growth of dimensions for some datasets. In this way, a compact quantization neural network with sufficient representation ability is established. Experimental results on benchmark datasets demonstrate that the proposed method is able to establish QNNs with far fewer parameters and calculations but almost the same performance as full-precision baseline models, e.g. 29.9% top-1 error for binary ResNet-18 on the ImageNet ILSVRC 2012 dataset.



1 Introduction

Deep neural networks, especially convolutional neural networks, achieve state-of-the-art performance in various computer vision applications, such as image classification [23, 40, 41, 12, 25], object detection [6, 35, 30], and semantic segmentation [33, 1, 11]. Applications embedded on mobile devices can benefit from the advantages of low latency, better privacy and offline operation. However, deploying deep models on resource-constrained mobile devices is challenging due to their high memory and computation cost. Motivated by this demand, researchers have proposed many model compression and acceleration methods to improve the applicability of the learned deep models, e.g., pruning [9, 10, 7, 8, 29], quantization [43, 19, 39], decomposition [21, 24] and lightweight structures [18, 47, 20].

One of the most widely used model compression approaches is quantization. To reduce the complexity of deep CNNs, a number of recent works have proposed quantizing the weights, the activations, or both. Among them, binary neural networks, whose weights and activations are constrained to $+1$ or $-1$, have many advantages. In contrast to non-binary networks, binary networks need less disk memory and replace most arithmetic operations with the bit-wise operations XNOR and POPCOUNT, which is power-efficient and drastically reduces memory size and memory accesses at run-time.
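To make the XNOR and POPCOUNT replacement concrete, the following minimal sketch computes the dot product of two $\pm 1$ vectors with bit-wise operations only. It is an illustration under stated assumptions, not the paper's implementation: the helper names `pack_bits` and `binary_dot` and the bit encoding ($+1$ as bit 1, $-1$ as bit 0) are chosen here for the example.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an int (assumed encoding: +1 -> 1, -1 -> 0)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask   # a bit is 1 where the two signs agree
    matches = bin(xnor).count("1")     # POPCOUNT
    return 2 * matches - n             # agreements minus disagreements


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]

reference = sum(x * y for x, y in zip(a, b))          # ordinary multiply-accumulate
fast = binary_dot(pack_bits(a), pack_bits(b), len(a))  # bit-wise version
print(reference, fast)
```

Both computations give the same value, which is why a binary convolution can be evaluated without multiplications.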

BinaryNet [3] introduced BNNs with binary weights and activations at run-time and showed how to compute the parameter gradients at train-time. Its accuracy drops significantly compared with full-precision networks on ImageNet [38]. XNOR-Net [34] used the real-valued version of the weights and activations as a key reference for the binarization process. The classification accuracy of a Binary-Weight-Network version of AlexNet is the same as that of the full-precision AlexNet, but the performance of XNOR-Net drops considerably. BinaryConnect [2] trained a DNN with binary weights during the forward and backward propagations. It achieved state-of-the-art results on small datasets, but experiments [34] show that this method is not very successful on large-scale datasets. BNN+ [4] proposed a new regularization function that encourages the training weights to stay around binary values in order to reduce this accuracy gap. [50] leverages ensemble methods to improve the performance of BNNs with limited efficiency cost.

The aforementioned methods have made tremendous efforts to increase the performance of binary neural networks by adjusting their architectures (e.g. changing the order of ReLU and batch normalization [46]) and by adding more regularization [4]. In fact, most algorithms for learning binary neural networks can be regarded as a binary feature embedding task in which the dimensions of the binary features and of the original high-bit features are exactly the same. The representation ability of features in the binary space is necessarily lower than that of features in the high-bit space if there is no obvious redundancy in the original features. We are therefore motivated to increase the number of binary filters to a suitable value in order to obtain binary features with the same effectiveness.

To this end, we first provide two experiments to show the representation ability of binary features. Firstly, to discover the intrinsic representations of deep features via binary feature embedding, the features of a given full-precision neural network are projected into a high-dimensional binary space using an orthogonal transformation that retains their pair-wise Euclidean distances. Redundancy in the original features is then identified through a learned selection mask. Based on the obtained compact binary features, we re-configure the neural network with an acceptable increase in its number of filters. Experiments on benchmarks demonstrate that binary neural networks established using the proposed method achieve performance similar to that of full-precision baseline models with significantly lower memory usage and OPs. Secondly, we show that widened binary networks are more accurate than deepened binary networks. We then propose the quantization methods and the new network design architecture.

To summarize, our main contributions are as follows:

  • We analyze the feature transformation from full-precision to low-bit representations, and the optimization results show that the transformation process is effective. We then use another experiment to show that a widened network benefits more than a deepened network.

  • A quantization method is proposed. In addition, network pruning is used to search for an efficient and accurate quantized network architecture, and knowledge distillation is used to improve the performance of the quantized network.

  • Experiments on benchmark classification and detection datasets verify the effectiveness of the proposed method.

In the rest of the paper, we first revisit related work in this area and describe the preliminaries and motivation. Next, we introduce our approach for learning accurate and efficient quantized networks. Then the experiments are presented and analyzed to show the effectiveness of our method. Finally, we conclude the paper.

2 Related Works

This paper addresses the limited representation ability of quantized networks by using more features. Previous studies have focused on designing new network architectures or innovative quantization functions, or on finding the best distribution of quantized values. In this section, we revisit the existing methods for establishing compact models, including network quantization, network pruning and knowledge distillation.

2.1 Network Quantization

BinaryConnect [2] directly optimizes the loss of the network with the weights $W$ replaced by $\mathrm{sign}(W)$, and it approximates the sign function with the "hard tanh" function in the backward process to avoid the zero-gradient problem. The binary weight network (BWN) [3] adds scale factors for the weights during binarization. The ternary weight network (TWN) [26] introduces ternary weights and achieves improved performance. XNOR-Net [34] proposed adding a real-valued scaling factor to each output channel of a binary convolution. Trained ternary quantization (TTQ) [49] proposes learning both ternary values and scaled gradients for 32-bit weights. DoReFa-Net [48] proposes quantizing 32-bit weights, activations and gradients using different bit widths. Gradients are approximated by a customized form based on the mean of the absolute values of the full-precision weights. In [7], pruning, quantization and Huffman coding are used to compress the model. Bi-Real Net [31] connects the real activations (after the 1-bit convolution and/or BatchNorm layer, before the sign function) to the activations of the consecutive block through an identity shortcut to enhance representational capability. ABC-Net [28] and the binary ensemble method [50] use more convolution operations per layer to improve accuracy. Although these works have made great progress, the performance of low-bit quantized networks, especially 1-bit neural networks, is still much worse than that of their full-precision counterparts.

2.2 Network Pruning

Network pruning is an effective technique to compress and accelerate CNNs, and thus allows us to deploy efficient networks on hardware devices with limited storage and computation resources. Structured pruning methods [14, 27, 16, 8] target the pruning of convolutional filters or whole layers, and thus the pruned networks can be easily developed and applied. For example, Liu et al. [32] leveraged a regularization on the scale factors to select channels. He et al. [15] utilized a geometric median-based criterion to cut out unimportant filters. In this paper, we utilize the network pruning technique to slim the widened QNNs to achieve lower memory and computational cost.

2.3 Knowledge Distillation

Knowledge distillation is one of the most popular solutions for model compression. The idea is to improve the performance of a small model with transferred soft targets provided by a large model. Hinton et al. [17] proposed the knowledge distillation approach to compress the knowledge of a large and computationally expensive model into a single computationally efficient neural network. Since then, knowledge distillation has been widely adopted and many methods have been proposed. For example, Romero et al. [36] proposed FitNet, which extracts the feature maps of an intermediate layer as well as the final output to teach the student network. After that, Zagoruyko et al. [45] defined Attention Transfer based on attention maps to improve the performance of the student network.

In this paper, we combine pruning and distillation with quantized networks. Network pruning is used to find an efficient quantized network, and knowledge distillation is then used to improve the accuracy of the compact quantized network.

3 Preliminaries

Firstly, binary embedding of high-dimensional data requires long codes to preserve the discriminative power of the input space, so we try to find the lower bound of the binary embedding for the feature maps in one layer of a CNN. Secondly, we show that widened binary networks perform better than deepened binary networks.

3.1 Binary Feature Embedding

Our goal is to find the minimum dimensionality (i.e. the number of filters) of binary neural networks for preserving the performance of full-precision neural networks.

For an arbitrary convolutional layer in a pre-trained deep neural network, the convolution operation for a given instance can be formulated as

$$Y = X^\top F + b, \tag{1}$$

where $X \in \mathbb{R}^{d \times H'W'}$ is the input data after converting to a matrix according to the filter size and parameters of this layer (i.e. the original images or the activations of the previous layer), $F \in \mathbb{R}^{d \times n}$ stacks the $n$ convolution filters, and $Y \in \mathbb{R}^{H'W' \times n}$ is the output feature maps. $W'$ and $H'$ are the width and height of the output feature maps, respectively, $n$ is the output channel number, and $b$ is the bias term, which is often omitted for simplicity.

For the neural network binarization problem, we denote the approximated binary feature maps as $B = \mathrm{sign}(\phi(Y)) \in \{-1, +1\}^{H'W' \times m}$, where $m$ is the number of filters in the binarized layer, and $\phi(\cdot)$ could be either a linear [13] or a non-linear transformation [37, 42]. Commonly, we can utilize a linear transformation to accomplish this as suggested in [44, 13], i.e. $\phi(Y) = YP$, where $P \in \mathbb{R}^{n \times m}$. Thus, the binarization of the feature maps can be formulated as

$$\min_{B, P} \; \|B - YP\|_F^2, \quad \text{s.t. } B \in \{-1, +1\}^{H'W' \times m}, \tag{2}$$

where $B$ is the binarized feature maps and $\|\cdot\|_F$ is the Frobenius norm for matrices. Note that the number of filters $m$ in the binary network could be either larger or smaller than $n$, which is discussed in the following.

The above function only forces the features (or activations) in the given convolutional layer to be binary; it does not inherit the functionality of features learned on massive training data. Therefore, we propose to retain the relationship between the features of every two samples, which is commonly the most important characteristic in visual recognition tasks such as image classification, detection, and segmentation, i.e.

$$\min_{B, P} \; \|B - YP\|_F^2, \quad \text{s.t. } D(YP) = D(Y), \tag{3}$$

where $D(\cdot)$ calculates the Euclidean distances between the features of all samples in the training set. Since the number of samples in the training set is usually very large (e.g., 1M in ImageNet [38]), $D(Y)$ is an extremely large matrix, e.g. $10^6 \times 10^6$, which cannot be efficiently optimized. Fortunately, if $P$ is an orthogonal matrix when $P$ is a square matrix, or $PP^\top = I$ otherwise, the Euclidean distances between the features of any two samples are completely preserved. Given two features $y_1$ and $y_2$ generated using the original network, we have

$$\|y_1 P - y_2 P\|_2^2 = (y_1 - y_2)\, PP^\top (y_1 - y_2)^\top = \|y_1 - y_2\|_2^2. \tag{4}$$

Therefore, we reformulate Fcn. 3 as

$$\min_{B, P} \; \|B - YP\|_F^2 + \lambda \|PP^\top - I\|_F^2, \tag{5}$$

where $I$ is an identity matrix, and $\lambda$ is a hyper-parameter for balancing the two terms in the above function. We can then utilize Fcn. 5 to binarize the given full-precision network while maintaining its performance.
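The distance-preservation argument can be checked numerically. The sketch below is only an illustration: the $2 \times 4$ projection `P` (with orthonormal rows, so that $PP^\top = I$), the two feature vectors, and all helper names are hypothetical values chosen for the example and do not come from the paper.

```python
import math


def project(y, P):
    """Multiply a row vector y (length n) by a matrix P (n x m)."""
    m = len(P[0])
    return [sum(y[i] * P[i][j] for i in range(len(y))) for j in range(m)]


def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def sign(v):
    """Element-wise sign, mapping zero to +1."""
    return [1 if x >= 0 else -1 for x in v]


# Hypothetical projection from n = 2 to m = 4 dimensions with orthonormal rows.
P = [[0.5, 0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5, -0.5]]

y1 = [3.0, -1.0]
y2 = [0.5, 2.0]

d_before = dist(y1, y2)                          # distance in the original space
d_after = dist(project(y1, P), project(y2, P))   # distance after projection
codes = (sign(project(y1, P)), sign(project(y2, P)))  # binary codes of length 4
print(d_before, d_after, codes)
```

The two distances coincide, while the binary codes are obtained afterwards by taking signs of the projected features, which is the step where information can be lost and why a longer code may be needed.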

Further, we want to find the lower bound of the binary embedding for the feature maps, i.e. the column sparsity of the binary representation $B$. We therefore introduce a mask $M \in \mathbb{R}^{m}$ that is used to select features of $B$, where $M \circ B$ denotes the multiplication of the $i$-th element of $M$ with the $i$-th column of $B$. The $\ell_1$-norm of the mask can be used to find the lower bound. Finally, the following function, Fcn. 6, is used to get the optimal binary embedding of the features:

$$\min_{B, P, M} \; \|M \circ B - YP\|_F^2 + \lambda \|PP^\top - I\|_F^2 + \mu \|M\|_1. \tag{6}$$

Alternating optimization is used to solve Fcn. 6:

Solve $B$. Since the elements of the binary features are independent, $B$ can be simply obtained by

$$B = \mathrm{sign}(YP), \tag{7}$$

where $\mathrm{sign}(\cdot)$ outputs the signs of the input data.

Solve $M$. For fixed binary variables $B$ and projection $P$, the optimization of $M$ can be formulated as

$$\min_{M} \; \|M \circ B - YP\|_F^2 + \mu \|M\|_1, \tag{8}$$

which aims to eliminate the columns with larger reconstruction errors.

Solve $P$. For the solved mask $M$ and binary variables $B$, the loss function for optimizing the projection matrix $P$ can be written as

$$\min_{P} \; \|M \circ B - YP\|_F^2 + \lambda \|PP^\top - I\|_F^2. \tag{9}$$
We use VGG-small on CIFAR-10 and only optimize the final convolutional features; the results are in Table 1. Mini-batch SGD is used to solve for the mask $M$ and the projection $P$. To fully exploit the representation ability, we initially set the binary dimension $m$ to a multiple of the original dimension $n$. In Table 1, the optimized numbers of channels are shown in the third row. More features are needed in the low-level layers and fewer features are needed in the high-level layers. We then retrain the optimized binary networks to obtain the reported accuracy.

Methods   2    3    4    5    6    acc
baseline  128  256  256  512  512  0.9394
binary    410  332  614  420  25   0.9244

Table 1: Results of binary embedding. The numbers in the first row are the layer indices of VGG-small.

Although this optimization provides one way to find the number of binary features, it is hard to optimize, and all convolutional features have to be optimized layer by layer. A simpler but efficient method is therefore desirable.

3.2 Widen or Deepen?

We ran experiments on binary networks of different widths and depths on CIFAR-10. The full-precision ResNet-20 is chosen as the baseline. Binary ResNet-20 models at several width multipliers serve as the widened networks, and binary ResNet-32, ResNet-56 and ResNet-110 are chosen as the networks of different depths.

Figure 1: An illustration of the widened binary networks and deepened binary networks.

In Figure 1, a widened binary ResNet-20 with fewer parameters performs better than the binary ResNet-110. With a larger width multiplier, the binary ResNet-20 achieves almost the same accuracy as the full-precision ResNet-20, and still wider models are more accurate again because each layer has more channels. We conjecture that the reason is that the representation ability of binary features is lower than that of full-precision features, so more binary features are needed to improve performance.

4 Approach

According to the above experiments, we want to use as few quantized features as possible to obtain a high-accuracy model. We therefore apply the network slimming method to widened quantized networks to search for efficient architectures.

4.1 Implementation of Quantization Layer

In this paper, weights and activations are quantized separately. Without loss of generality, let $W$ be the real-valued weights and $A$ the real-valued activations; $W_q$ denotes the quantized weights, except that binary weights are written $W_b$, and $A_q$ denotes the quantized activations. For binary weights, the common binarization method of XNOR-Net is used. The optimal estimate for $W_b$ is $W_b = \alpha\, \mathrm{sign}(W)$ with $\alpha = \mathrm{mean}(|W|)$, where $\mathrm{sign}(\cdot)$ returns the sign of its input and $\alpha$ is the mean absolute value of the weights. The implementation uses the Straight-Through Estimator (STE) to back-propagate the gradients.
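The XNOR-Net style binarization with a scaling factor can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the scaling factor is taken as the mean absolute weight as in XNOR-Net, and the clipped form of the straight-through estimator (gradient passed only where $|w| \le 1$) is one common variant assumed here.

```python
def binarize_weights(weights):
    """Return (alpha, signs): alpha is the mean absolute weight, signs are in {-1, +1}."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def ste_grad(weights, grad_out, clip=1.0):
    """Assumed clipped STE: pass the incoming gradient where |w| <= clip, else zero."""
    return [g if abs(w) <= clip else 0.0 for w, g in zip(weights, grad_out)]


w = [0.5, -1.5, 0.25, -0.75]
alpha, signs = binarize_weights(w)
w_b = [alpha * s for s in signs]             # binary weights used in the forward pass
grad_w = ste_grad(w, [1.0, 1.0, 1.0, 1.0])   # gradient routed to the real-valued weights
print(alpha, w_b, grad_w)
```

Only the signs and one scalar per filter need to be stored after training; the real-valued weights are kept during training solely so that the gradient has something to update.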

The $k$-bit ($k > 1$) quantization of the weights consists of three steps:

  • the weight values are projected to a bounded interval;

  • the projected values are rounded to a discrete set of levels determined by $k$;

  • the quantized values are projected to the final weight range.

In the rounding step, we also use the STE for gradient back-propagation.

For $k$-bit quantization of the activations, we first clamp the values of $A$ to a bounded range and then round them to obtain the quantized values $A_q$, where the number of levels is determined by $k$. Back-propagation also makes use of the STE.
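The three-step weight quantization and the clamp-then-round activation quantization can be sketched as below. The concrete choices are assumptions made for illustration, following the DoReFa-Net style scheme cited in Section 2.1: a tanh-based projection of the weights to $[0, 1]$, rounding to $2^k$ uniform levels, a final affine map to $[-1, 1]$, and clamping activations to $[0, 1]$. The exact intervals and formulas used in this paper are not recoverable from the text, and all function names are hypothetical.

```python
import math


def quantize_unit(x, k):
    """Round x in [0, 1] to one of 2**k uniformly spaced levels."""
    n = 2 ** k - 1
    return round(x * n) / n


def quantize_weights(weights, k):
    """Assumed DoReFa-style k-bit weight quantization (requires a non-zero weight)."""
    t = [math.tanh(w) for w in weights]
    t_max = max(abs(v) for v in t)
    unit = [v / (2 * t_max) + 0.5 for v in t]   # step 1: project to [0, 1]
    q = [quantize_unit(u, k) for u in unit]     # step 2: round to discrete levels
    return [2 * v - 1 for v in q]               # step 3: project to [-1, 1]


def quantize_activations(acts, k):
    """Clamp activations to [0, 1] (assumed range), then round to 2**k levels."""
    return [quantize_unit(min(max(a, 0.0), 1.0), k) for a in acts]


w_q = quantize_weights([0.8, -0.2, 0.1, -0.8], k=2)
a_q = quantize_activations([-0.3, 0.4, 0.9, 1.7], k=2)
print(w_q, a_q)
```

With $k = 2$ the weights land on the four levels $-1, -1/3, 1/3, 1$. During training the rounding would be bypassed in the backward pass by the STE; that part is omitted here because it needs an automatic differentiation framework.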

Figure 2: Typical block of quantized networks.

The typical block is shown in Figure 2. Generally, the first layer and the last layer are not quantized, because quantizing the input images loses too much information and quantizing the last layer degrades the output metric. During training, the features after the batch normalization layer and the kernel weights are quantized separately and then passed to the convolutional layer. After training, the binary weights are saved. At inference, only the features need to be re-quantized to compute the result.

4.2 Network Architecture Design

The preliminary experiments illustrate that the representation ability of binary features is limited and that the best solution is to widen the binary network. This raises a question: by how much should we widen the network?

Network pruning is widely used for reducing the heavy inference cost of deep models in low-resource settings. The pruned architecture itself has both low FLOPs and high accuracy. Besides, pruning method can be seen as an architecture search paradigm which is used to find the best and efficient architecture.

To take advantage of network pruning, we first widen the network enough to obtain better accuracy than the standard network. Then we use the network slimming [32] method to prune the network and obtain efficient models. In particular, network slimming imposes a sparsity penalty on the scale factors in the batch normalization layers. The training objective during pruning is

$$L = L_0 + \rho \sum_{\gamma \in \Gamma} |\gamma|, \tag{10}$$

where $L_0$ is the original loss function for the specific task, e.g. cross entropy loss for a classification task and mean square error (MSE) loss for a regression task, $\Gamma$ is the set of batch normalization scale factors, and $\rho$ is the sparsity regularization hyper-parameter. During training, insignificant channels are automatically identified and are pruned afterwards, yielding thin and compact models with comparable accuracy. We then retrain or fine-tune the compact model to obtain high accuracy.
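A minimal sketch of the slimming objective and the subsequent channel selection is given below. The scale factors, the task loss value, the regularization factor and the threshold of 0.01 are made-up numbers for illustration, and the function names are hypothetical; the sketch only shows the L1 penalty on batch normalization scale factors and the thresholding described above.

```python
def slimming_loss(task_loss, bn_scales, rho):
    """Task loss plus an L1 sparsity penalty on batch-norm scale factors."""
    return task_loss + rho * sum(abs(g) for g in bn_scales)


def select_channels(bn_scales, threshold):
    """Indices of channels whose |scale| exceeds the pruning threshold."""
    return [i for i, g in enumerate(bn_scales) if abs(g) > threshold]


# Hypothetical scale factors of one batch-norm layer after sparse training.
scales = [0.9, 0.004, -0.6, 0.0002, 0.3]

loss = slimming_loss(task_loss=1.25, bn_scales=scales, rho=1e-4)
kept = select_channels(scales, threshold=0.01)   # threshold value is an assumption
pruned_ratio = 1 - len(kept) / len(scales)
print(loss, kept, pruned_ratio)
```

A larger regularization factor pushes more scale factors toward zero, so more channels fall below the threshold; this is the trade-off between pruned ratio and accuracy reported in the slimming tables.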

Knowledge distillation is a model compression method in which a small model is trained to mimic a pre-trained, larger model (or an ensemble of models). In distillation, knowledge is transferred from the teacher model to the student by minimizing a loss function in which the target is the distribution of class probabilities predicted by the teacher model. Denote the pre-softmax outputs (i.e. logits) of the teacher model and the student model as $z_t$ and $z_s$ respectively, and the softmax predictions as $p_t$ and $p_s$. The knowledge distillation loss can be formulated as:

$$L_{KD} = \mathcal{H}(y, p_s) + \beta\, \mathcal{H}(\tilde{p}_t, \tilde{p}_s), \tag{11}$$

where $\mathcal{H}$ is the cross entropy loss, $y$ is the ground-truth one-hot label vector and $\beta$ is the trade-off hyper-parameter that balances these two terms. Moreover, $\tilde{p}_t$ and $\tilde{p}_s$ are the softened predictions of the teacher model and the student model:

$$\tilde{p}_t = \mathrm{softmax}(z_t / T), \qquad \tilde{p}_s = \mathrm{softmax}(z_s / T), \tag{12}$$

where $T$ is the temperature parameter. We take advantage of knowledge distillation to improve the accuracy of the compact models. The teacher model can be either the full-precision model or a widened binary network.
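The distillation loss can be sketched for a single sample as follows. The additive weighting (hard-label cross entropy plus a balance factor times the softened cross entropy) is an assumption made for this example, since the exact form of the paper's loss is not recoverable from the text; the logits, the function names and the parameter name `balance` are likewise hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, pred):
    """Cross entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)


def kd_loss(student_logits, teacher_logits, label, temperature, balance):
    """Assumed form: CE(label, student) + balance * CE(soft teacher, soft student)."""
    hard = cross_entropy(label, softmax(student_logits))
    soft = cross_entropy(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return hard + balance * soft


student = [2.0, 0.0, 0.0]
teacher = [3.0, 1.0, 0.0]
label = [1.0, 0.0, 0.0]

hard_only = kd_loss(student, teacher, label, temperature=3.0, balance=0.0)
with_kd = kd_loss(student, teacher, label, temperature=3.0, balance=0.2)
print(hard_only, with_kd)
```

A higher temperature flattens both distributions, so the student also receives signal from the relative scores the teacher assigns to the wrong classes.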

5 Experiments

In this section, we conduct experiments to validate the effectiveness of the proposed quantization method on several benchmark image classification datasets and one detection task. The experimental results are analyzed to help understand the benefits of the proposed approach.

5.1 Datasets and Settings

To verify the effectiveness of the proposed quantization method, we conduct experiments on several benchmark visual datasets, including CIFAR-10 [22], CIFAR-100 [22], the ImageNet ILSVRC 2012 dataset [38], and the PASCAL VOC0712 object detection benchmark [5]. The CIFAR-10 dataset, which is used for analyzing the properties of the proposed method, consists of 60,000 $32 \times 32$ colour images in 10 classes, with 50,000 training images and 10,000 test images. The CIFAR-100 dataset has the same number of images but 100 classes. A common data augmentation scheme including random crop and mirroring is adopted. ImageNet is a large-scale image dataset which contains over 1.2M training images and 50k validation images belonging to 1,000 classes. The common data preprocessing strategy including random crop and flip is applied during training. We also conduct object detection experiments on the PASCAL VOC0712 dataset. Following common practice, we train models on the trainval set (about 16k images) and evaluate on the VOC07 test split with 4,952 images.

We experiment with one or several width multipliers for each $k$-bit quantized network. With a width multiplier of 1, the quantized network has the same width as the standard network. For all baseline experiments, a fixed weight decay is used. For CIFAR-10 and CIFAR-100, ResNet-20 [12] is selected as the baseline network structure. ResNet-18 [12] and VGG16 [40] are used to test the performance on ImageNet. The SSD [30] detection model with VGG16 [40] as backbone is used to verify the performance on the detection task.

5.2 CIFAR-10

We experiment with 1-bit and 4-bit quantized networks. In Table 2, the baseline (32-bit) top-1 accuracy is 92.19. For the 1-bit binary network, the accuracy increases with the network width. At width 4, the accuracy of the binary network (92.98) surpasses the baseline. For the 4-bit quantized network, the accuracy already surpasses the baseline at width 2 (93.01). This result shows that more features are needed when the quantized network uses fewer bits.

Bits  Width  Params (M)  Top-1 acc. (%)
32    1      0.27        92.19
1     1      0.27        84.14
1     2      1.07        90.34
1     3      2.41        91.98
1     4      4.28        92.98
1     8      17.12       94.22
4     1      0.27        90.23
4     2      1.07        93.01
4     4      4.28        94.39

Table 2: Results of different quantization bits and widths on CIFAR-10.
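The parameter counts in Table 2 can be related to the width multiplier with a short calculation. The sketch below rests on two simplifying assumptions that are not stated in the paper: that the parameter count grows with the square of the width (input and output channels of each convolution both scale), and that storage is simply parameters times bits, ignoring the unquantized first and last layers and the batch normalization parameters.

```python
BASE_PARAMS_M = 0.27   # full-precision ResNet-20 parameters in millions, from Table 2


def widened_params(width):
    """Rough estimate: convolution parameters grow with the square of the width."""
    return BASE_PARAMS_M * width ** 2


def storage_mbit(params_m, bits):
    """Approximate weight storage in megabits (parameters in millions times bits)."""
    return params_m * bits


reported = {1: 0.27, 2: 1.07, 3: 2.41, 4: 4.28, 8: 17.12}   # Table 2, in millions
estimates = {w: widened_params(w) for w in reported}

baseline_storage = storage_mbit(0.27, 32)    # 32-bit baseline, about 8.64 Mbit
binary_x3_storage = storage_mbit(2.41, 1)    # 1-bit network at width 3, about 2.41 Mbit
print(estimates, baseline_storage, binary_x3_storage)
```

Under these assumptions the quadratic estimate stays within about one percent of the reported counts, and the width-3 binary network needs less weight storage than the 32-bit baseline even though it has roughly nine times as many parameters.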

We then want to use as few quantized features as possible while keeping high accuracy. Network pruning is a widely used and effective approach for obtaining lightweight yet accurate networks. Besides, network pruning can be seen as a network architecture search method for finding efficient networks. In this paper, we use the network slimming method [32] to obtain small but accurate quantized networks. Figure 3 illustrates the slimming results for ResNet-20 under different batch norm scale regularization factors. In these experiments, we use a binary network with a fixed width multiplier and a fixed slimming threshold. From Table 3 and Figure 3, we find that as the batch norm scale regularization factor increases, more channels are pruned and the accuracy becomes lower. Each residual block contains two convolutional layers. More channels are pruned in the first convolutional layer, because the shortcut connection prevents too much information from being discarded. More shortcut connections can improve the performance of quantized networks.

ResNet has shortcut connections that are identity mappings. For each residual block, the input layer has the same number of channels as the output layer. However, the pruned results usually do not have this property. In our implementation of the pruned residual block, we use the number of channels of the output layer as the reference value and pad or shrink the channels of the input layer so that both have the same number of channels. Because the pruned channels of the input layer and the output layer differ, the reconstructed pruned quantized network can no longer use the pre-trained weights. We therefore retrain all pruned quantized ResNets from scratch to obtain the final accuracy. The accuracy after pruning is the result obtained when all pruned channels are set to zero.

Figure 3: Network slimming results with different batch norm scale regularization on CIFAR-10.

Factor  Acc. before (%)  Pruned (%)  Params (M)  Acc. after (%)  Acc. retrained (%)
0.0001  91.98            17.48       2.98        90.54           92.71
0.0003  90.76            32.56       1.93        89.04           91.94
0.0005  89.37            43.93       1.29        54.76           90.93
-       88.35            48.58       1.07        66.41           90.42
0.0007  87.98            53.38       0.89        49.52           89.85
0.0008  87.19            55.89       0.78        53.95           89.41
0.001   85.18            66.50       0.49        29.91           87.78

Table 3: Results of different batch norm scale regularization factors. "Factor" is the value of the scale regularization factor; "Acc. before" and "Acc. after" are the accuracies before and after pruning. "Pruned" is the ratio of pruned channels, and "Params" is the number of parameters in the pruned network. "Acc. retrained" is the accuracy obtained when the pruned network is retrained from scratch.

Since one of the pruned results in Table 3 has the same number of parameters (1.07M) as the width-2 network, we try to improve its accuracy with knowledge distillation. The full-precision network (92.19 accuracy) and a binary network are chosen as the teacher model, respectively. As in Hinton et al. [17], the outputs of the fully connected layer are used for distillation. The results are shown in Table 4. With knowledge distillation, the accuracy of the pruned binary network improves from 90.42 to as much as 91.39, which is close to the full-precision 92.19. The binary teacher yields higher accuracy than the full-precision teacher.

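Teacher  Temperature  Balance  Acc. (%)
full     3            0.2      90.72
full     5            0.3      91.0
full     10           0.2      91.12
binary   3            0.2      90.91
binary   5            0.3      91.22
binary   10           0.2      91.39

Table 4: Results of knowledge distillation for CIFAR-10. "full" means the teacher is the full-precision network and "binary" means the teacher is a binary network. "Temperature" is the distillation temperature and "Balance" is the balance term between knowledge distillation and cross entropy.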

In conclusion, by using a moderately larger number of parameters, the performance of binary ResNet-20 comes very close to that of full-precision ResNet-20.

5.3 CIFAR-100

1-bit and 4-bit quantized ResNet-20 networks are tested. In Table 5, the baseline (32-bit) top-1 accuracy is 69.78. For the 1-bit binary network, the accuracy increases with the network width. At width 4, the accuracy of the binary network (70.45) surpasses the baseline. For the 4-bit quantized network, the accuracy already surpasses the baseline at width 2 (70.25). This result shows that more features are needed when the quantized network uses fewer bits.

Bits  Width  Params (M)  Top-1 acc. (%)
32    1      0.28        69.78
1     1      0.28        50.44
1     2      1.08        62.62
1     3      2.43        67.61
1     4      4.31        70.45
1     8      17.17       74.68
4     1      0.28        63.35
4     2      1.08        70.25
4     4      4.31        73.85

Table 5: Results of different quantization bits and widths on CIFAR-100.

We then use the network slimming method to prune the widened networks. Table 6 and Figure 4 show the pruned networks and their accuracy. In contrast to CIFAR-10, CIFAR-100 contains more categories, so more parameters and features are needed in the pruned network. The same pruning threshold is used in all experiments. Table 6 lists, for each batch norm scale regularization factor, the top-1 accuracy of the pruned binary ResNet-20 together with its parameter count, which is lower than that of the unpruned widened binary ResNet-20.

Figure 4: Network slimming results with different batch norm scale regularization on CIFAR-100.

Factor  Acc. before (%)  Pruned (%)  Params (M)  Acc. after (%)  Acc. retrained (%)
0.0001  69.12            10.32       3.54        66.57           69.54
0.0002  67.64            17.11       2.98        59.63           69.36
0.0003  67.36            20.97       2.73        61.9            68.46
0.0004  65.99            26.31       2.37        61.24           68.28
0.0005  64.87            31.36       2.03        57.66           67.98
0.001   57.07            50.84       0.99        24.62           62.07

Table 6: Results of different batch norm scale regularization factors on CIFAR-100. "Factor" is the value of the scale regularization factor; "Acc. before" and "Acc. after" are the accuracies before and after pruning. "Pruned" is the ratio of pruned channels, and "Params" is the number of parameters in the pruned network. "Acc. retrained" is the accuracy obtained when the pruned network is retrained from scratch.

Knowledge distillation is used to further improve the performance of the pruned model. A binary ResNet-20 is chosen as the teacher, with fixed temperature and balance settings. After distillation, the top-1 accuracy is higher than the full-precision baseline accuracy of 69.78. Besides, we fine-tune the retrained binary ResNet-20 by freezing all trainable parameters except those of the batch norm layers, which improves the accuracy further.

5.4 ImageNet

For ImageNet, we only ran experiments on the binary ResNet-18 and VGG16 networks. Firstly, the results of the widened networks are shown in Table 7 and Table 8. As the width increases, the batch size becomes smaller due to limited GPU memory. For ResNet-18, the accuracy of the width-5 binary network (71.08) is higher than that of the full-precision network (70.79). For VGG16, the accuracy of the width-4 binary network (71.01) is comparable to that of the full-precision network (71.41). These experiments also verify that more quantized features help to improve the performance of quantized networks.

Bits  Width  Batch size  Top-1 acc. (%)  Top-5 acc. (%)
32    1      1024        70.79           89.5
1     1      1024        52.6            76.84
1     2      1024        63.73           85.3
1     3      1024        68.07           87.92
1     4      1024        69.74           89.05
1     5      512         71.08           89.74

Table 7: Results of different width ResNet-18 on ImageNet.

Bits  Width  Batch size  Top-1 acc. (%)  Top-5 acc. (%)
32    1      1024        71.41           90.47
1     1      1024        65.99           86.57
1     2      512         69.85           89.33
1     4      256         71.01           90.02

Table 8: Results of different width VGG16 on ImageNet.

To obtain an efficient and accurate quantized network, network slimming and knowledge distillation are applied to ResNet-18. In these experiments, a fixed width multiplier is used. Table 9 and Figure 5 present the results of network slimming. For the chosen batch norm scale regularization factor, the number of remaining channels is a multiple of that of the baseline. The full-precision ResNet-18 with 70.79 top-1 accuracy (Table 7) is chosen as the teacher, with fixed temperature and balance settings. After distillation, the top-1 accuracy is comparable to the full-precision baseline accuracy.

Figure 5: Network slimming results with different batch norm scale regularization on ImageNet.

Factor  Acc. before (%)  Pruned (%)  Acc. after (%)  Acc. retrained (%)
0.0001  68.34            17.95       67.55           68.32
0.0002  65.27            33.06       60.84           69.19
0.0005  65.10            33.44       58.93           69.18

Table 9: Results of different batch norm scale regularization factors on ImageNet. "Factor" is the value of the scale regularization factor; "Acc. before" and "Acc. after" are the accuracies before and after pruning. "Pruned" is the ratio of pruned channels, and "Acc. retrained" is the accuracy obtained when the pruned network is retrained from scratch.

5.5 Object detection

To verify the generalization of our method, we further conduct experiments on the object detection task. SSD [30] is a one-stage object detection method that is widely used due to its efficiency and high accuracy. We use 1-bit neural networks to replace the backbone, the extra convolutional layers and the detection head. The results are shown in Table 10, where the mean Average Precision (mAP) is adopted as the evaluation metric. The baseline mAP is obtained by SSD with a full-precision VGG16 backbone. Methods A and C replace only the full-precision VGG16 backbone with binary VGG16 backbones of width 1 and 2, respectively. For methods B and D, we replace all full-precision convolutional layers with binary convolutional layers except the first input convolutional layer and the last classification and localization layers. From Table 10, even when only the binary backbone of method A is used, we obtain 68.74 mAP. Additionally binarizing the head lowers the mAP only slightly (from 68.74 to 67.44 at width 1 and from 72.79 to 71.75 at width 2). SSD with the width-2 binary VGG16 backbone obtains 72.79 mAP. These experiments show that more quantized features also help to improve performance on the object detection task.

Method                        Width   mAP (%)
Baseline FP32 SSD             -       76.81
A: Binary backbone            1       68.74
B: Binary backbone & head     1       67.44
C: Binary backbone            2       72.79
D: Binary backbone & head     2       71.75
Table 10: Results of different quantization methods on the PASCAL VOC07 test set.
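The gaps implied by Table 10 can be made explicit with a short calculation. The sketch below only re-uses the mAP values listed in the table; the variable names are illustrative.

```python
# mAP (%) values copied from Table 10.
map_results = {
    "baseline": 76.81,
    "A": 68.74,  # binary backbone, width 1
    "B": 67.44,  # binary backbone & head, width 1
    "C": 72.79,  # binary backbone, width 2
    "D": 71.75,  # binary backbone & head, width 2
}

# Drop of each binary variant relative to the full-precision baseline.
drop_vs_baseline = {
    name: round(map_results["baseline"] - value, 2)
    for name, value in map_results.items() if name != "baseline"
}

# Gain from doubling the width, for the same set of binarized layers.
gain_from_widening = {
    "backbone only (A -> C)": round(map_results["C"] - map_results["A"], 2),
    "backbone & head (B -> D)": round(map_results["D"] - map_results["B"], 2),
}

# Cost of additionally binarizing the head, at a fixed width.
cost_of_binary_head = {
    "width 1 (A -> B)": round(map_results["A"] - map_results["B"], 2),
    "width 2 (C -> D)": round(map_results["C"] - map_results["D"], 2),
}
```

Doubling the width recovers roughly 4 points of mAP in both settings, while additionally binarizing the head costs a little over 1 point.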

6 Conclusion

To improve the performance of quantized networks, this paper presents an efficient method for designing and training quantization networks. Quantizing full-precision features causes critical information loss and low accuracy. To address this, we apply network pruning to widened networks to search for an efficient and accurate quantized model. In addition, knowledge distillation is used to further improve performance. Experiments on benchmark models and datasets verify the effectiveness of the proposed method, which achieves performance comparable to that of full-precision models. Moreover, the proposed method can be combined with recent ideas such as Bi-Real Net and new quantized activation functions and values to obtain more effective and accurate quantized networks.


  • [1] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2014) Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062. Cited by: §1.
  • [2] M. Courbariaux, Y. Bengio, and J. David (2015) BinaryConnect: training deep neural networks with binary weights during propagations. In NIPS, Cited by: §1, §2.1.
  • [3] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830. Cited by: §1, §2.1.
  • [4] S. Darabi, M. Belbahri, M. Courbariaux, and V. P. Nia (2018) BNN+: improved binary network training. arXiv preprint arXiv:1812.11800. Cited by: §1, §1.
  • [5] M. Everingham and J. Winn (2011) The pascal visual object classes challenge 2012 (voc2012) development kit. Pattern Analysis, Statistical Modelling and Computational Learning, Tech. Rep. Cited by: §5.1.
  • [6] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, Cited by: §1.
  • [7] S. Han, H. Mao, and W. J. Dally (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR, Cited by: §1, §2.1.
  • [8] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In NIPS, Cited by: §1, §2.2.
  • [9] S. J. Hanson and L. Pratt (1989) Comparing biases for minimal network construction with back-propagation. In NIPS, Cited by: §1.
  • [10] B. Hassibi and D. G. Stork (1993) Second order derivatives for network pruning: optimal brain surgeon. In NIPS, Cited by: §1.
  • [11] K. He, G. Gkioxari, P. Dollar, and R. Girshick (2017) Mask r-cnn. In ICCV, Cited by: §1.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1, §5.1.
  • [13] X. He and P. Niyogi (2004) Locality preserving projections. In Advances in neural information processing systems, pp. 153–160. Cited by: §3.1.
  • [14] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang (2018) Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866. Cited by: §2.2.
  • [15] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4340–4349. Cited by: §2.2.
  • [16] Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397. Cited by: §2.2.
  • [17] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.3, §5.2.
  • [18] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1.
  • [19] K. Hwang and W. Sung (2014) Fixed-point feedforward deep neural network design using weights +1, 0, and -1. In IEEE Workshop on Signal Processing Systems, Cited by: §1.
  • [20] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360. Cited by: §1.
  • [21] Y. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin (2016) Compression of deep convolutional neural networks for fast and low power mobile applications. In ICLR, Cited by: §1.
  • [22] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §5.1.
  • [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In NIPS, Cited by: §1.
  • [24] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky (2015) Speeding-up convolutional neural networks using fine-tuned cp-decomposition. In ICLR, Cited by: §1.
  • [25] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1.
  • [26] F. Li, B. Zhang, and B. Liu (2016) Ternary weight networks. arXiv preprint arXiv:1605.04711. Cited by: §2.1.
  • [27] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710. Cited by: §2.2.
  • [28] X. Lin, C. Zhao, and W. Pan (2017) Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pp. 345–353. Cited by: §2.1.
  • [29] C. Liu, Y. Wang, K. Han, C. Xu, and C. Xu (2019) Learning instance-wise sparsity for accelerating deep models. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 3001–3007. Cited by: §1.
  • [30] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In ECCV, Cited by: §1, §5.1, §5.5.
  • [31] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K. Cheng (2018) Bi-real net: enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 722–737. Cited by: §2.1.
  • [32] Z. Liu, J. Li, B. Shen, G. Huang, S. Yan, and C. Zhang (2017) Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744. Cited by: §2.2, §4.2, §5.2.
  • [33] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §1.
  • [34] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) XNOR-net: imagenet classification using binary convolutional neural networks. In ECCV, Cited by: §1, §2.1.
  • [35] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NIPS, Cited by: §1.
  • [36] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2014) Fitnets: hints for thin deep nets. arXiv preprint arXiv:1412.6550. Cited by: §2.3.
  • [37] S. T. Roweis and L. K. Saul (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), pp. 2323–2326. Cited by: §3.1.
  • [38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. IJCV 115 (3), pp. 211–252. Cited by: §1, §3.1, §5.1.
  • [39] M. Shen, K. Han, C. Xu, and Y. Wang (2019) Searching for accurate binary neural architectures. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §1.
  • [40] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. ICLR. Cited by: §1, §5.1.
  • [41] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In CVPR, Cited by: §1.
  • [42] J. B. Tenenbaum, V. De Silva, and J. C. Langford (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), pp. 2319–2323. Cited by: §3.1.
  • [43] V. Vanhoucke, A. Senior, and M. Z. Mao (2011) Improving the speed of neural networks on cpus. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS, Cited by: §1.
  • [44] F. Yu, S. Kumar, Y. Gong, and S. Chang (2014) Circulant binary embedding. In International Conference on Machine Learning, pp. 946–954. Cited by: §3.1.
  • [45] S. Zagoruyko and N. Komodakis (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928. Cited by: §2.3.
  • [46] D. Zhang, J. Yang, D. Ye, and G. Hua (2018) Lq-nets: learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 365–382. Cited by: §1.
  • [47] X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In CVPR, Cited by: §1.
  • [48] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou (2016) DoReFa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §2.1.
  • [49] C. Zhu, S. Han, H. Mao, and W. J. Dally (2016) Trained ternary quantization. arXiv preprint arXiv:1612.01064. Cited by: §2.1.
  • [50] S. Zhu, X. Dong, and H. Su (2019) Binary ensemble neural network: more bits per network or more networks per bit?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4923–4932. Cited by: §1, §2.1.