Accurate and Compact Convolutional Neural Networks with Trained Binarization

09/25/2019
by   Zhe Xu, et al.

Although convolutional neural networks (CNNs) are now widely used in various computer vision applications, their heavy demands on parameter storage and computation make deployment on mobile and embedded devices difficult. Recently, binary convolutional neural networks have been explored to help alleviate this issue by quantizing both weights and activations with only a single bit. However, there can be a noticeable accuracy degradation compared with full-precision models. In this paper, we propose an improved training approach towards compact binary CNNs with higher accuracy. Trainable scaling factors for both weights and activations are introduced to increase the value range. These scaling factors are trained jointly with the other parameters via backpropagation. Besides, a specific training algorithm is developed, including a tight approximation of the derivative of the discontinuous binarization function and L_2 regularization acting on the weight scaling factors. With these improvements, the binary CNN achieves 92.3% accuracy on CIFAR-10 with the VGG-Small network. On ImageNet, our method also obtains 46.1% top-1 accuracy with AlexNet and 54.2% top-1 accuracy with Resnet-18, surpassing previous works.



1 Introduction

Convolutional neural networks (CNNs) have achieved great success in a wide range of real-world applications such as image classification [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton, Simonyan and Zisserman(2015), He et al.(2016)He, Zhang, Ren, and Sun], object detection [Girshick et al.(2014)Girshick, Donahue, Darrell, and Malik, Redmon et al.(2016)Redmon, Divvala, Girshick, and Farhadi, Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, Reed, Fu, and Berg], visual relation detection [Zhang et al.(2017a)Zhang, Kyaw, Chang, and Chua, Zhang et al.(2017b)Zhang, Kyaw, Yu, and Chang] and image style transfer [Gatys et al.(2016)Gatys, Ecker, and Bethge, Chen et al.(2017)Chen, Yuan, Liao, Yu, and Hua] in recent years. However, these powerful CNNs also come with large model sizes and high computational complexity. For example, VGG-16 [Simonyan and Zisserman(2015)] requires over 500 MB of memory for parameter storage and 31 GFLOPs for a single image inference. Such high resource demands make deployment on mobile and embedded devices difficult.

To alleviate this issue, several kinds of approaches have been developed to compress networks and reduce computational cost. The first approach is to design compact network architectures with similar performance. For example, SqueezeNet [Iandola et al.(2016)Iandola, Han, Moskewicz, Ashraf, Dally, and Keutzer] utilized 1x1 convolution layers to squeeze the output channel number before computationally expensive 3x3 convolution operations. More recently, MobileNet [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam] and ShuffleNet [Zhang et al.(2018)Zhang, Zhou, Lin, and Sun] introduced efficient depth-wise and group convolutions to replace normal, complex convolution operations. The second approach is to reduce or compress network parameters. [Lebedev et al.(2015)Lebedev, Ganin, Rakhuba, Oseledets, and Lempitsky] and [Novikov et al.(2015)Novikov, Podoprikhin, Osokin, and Vetrov] compressed network weights using tensor decomposition. Connection pruning was employed in [Han et al.(2015)Han, Pool, Tran, and Dally] to reduce parameters by up to 13 times for AlexNet and VGG-16. The third category is to quantize network parameters, as presented in [Zhuang et al.(2018)Zhuang, Shen, Tan, Liu, and Reid, Dong et al.(2017)Dong, Ni, Li, Chen, Zhu, and Su, Lin et al.(2016)Lin, Courbariaux, Memisevic, and Bengio, Zhu et al.(2017)Zhu, Han, Mao, and Dally, Zhou et al.(2017)Zhou, Yao, Guo, Xu, and Chen, Hou and Kwok(2018)]. Network quantization can reduce memory requirements efficiently because parameter values are represented with fewer bits. At the same time, it can alleviate the computational cost issue since floating-point calculations are converted into fixed-point calculations that require fewer computation resources.

Furthermore, as an extreme case of parameter quantization, binary convolutional neural networks quantize both weights and activations with 1 bit [Hubara et al.(2016)Hubara, Courbariaux, Soudry, El-Yaniv, and Bengio, Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi]. They have attracted great research interest because binary CNNs can reduce network size by 32 times compared with full precision and replace multiplications with bitwise logic operations. As a result, binary CNNs are suitable for accelerator implementations on hardware platforms such as FPGAs [Umuroglu et al.(2017)Umuroglu, Fraser, Gambardella, Blott, Leong, Jahre, and Vissers]. However, network binarization may decrease accuracy noticeably due to the extremely limited precision. It remains a challenge and needs further exploration towards better inference performance.

In this paper, an improved training approach for binary CNNs is proposed which is easy to implement on hardware platforms. Our approach includes three novel ideas:

  1. Trainable scaling factors are introduced for weight and activation binarization. Previous works such as XNOR-Net [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] set the mean value of the weights as the scaling factor; this minimizes the quantization error but cannot ensure the best inference accuracy. Instead, we employ trainable scaling factors for both weights and activations and update them jointly with the other network parameters.

  2. Derivative approximation is discussed for binary network training. Since the derivative of the binarization function is like an impulse function, it is not suitable for backpropagation. We propose to use a higher-order approximation for weight binarization and a long-tailed approximation for activation binarization as a trade-off between tight approximation and smooth backpropagation.

  3. The regularization term acts on the weight scaling factors directly. In our approach, the weight scaling factors represent the actual binary filters, so the regularization should be modified accordingly for better generalization capability.

The proposed binary network approach achieves better inference accuracy and faster convergence speed. The rest of this paper is organized as follows. Section 2 reviews previous related works. The proposed binary network training approach is introduced in detail in Section 3. Experimental results are provided in Section 4. Finally, Section 5 gives the conclusion.

2 Related Work

Previous works [Zhuang et al.(2018)Zhuang, Shen, Tan, Liu, and Reid, Dong et al.(2017)Dong, Ni, Li, Chen, Zhu, and Su, Lin et al.(2016)Lin, Courbariaux, Memisevic, and Bengio, Zhu et al.(2017)Zhu, Han, Mao, and Dally, Zhou et al.(2017)Zhou, Yao, Guo, Xu, and Chen, Hou and Kwok(2018)] have already demonstrated that quantization can greatly reduce memory resources and computational complexity for various CNN structures. One extreme case of quantization is to constrain real values with only 1 bit, i.e., binarization. It can be further divided into two subcategories: one binarizes only the weights and leaves the activations full-precision or quantized; the other binarizes both weight and activation values.

Weight-only binarization methods: Courbariaux et al. [Courbariaux et al.(2015)Courbariaux, Bengio, and David] first proposed to train networks constraining weights to only two possible values, -1 and +1. They introduced a training method, called BinaryConnect, to deal with binary weights in both forward and backward propagation and obtained promising results on small datasets. In [Cai et al.(2017)Cai, He, Sun, and Vasconcelos], Cai et al. binarized weights and proposed a half-wave Gaussian quantizer for activations. The proposed quantizer exploited the statistics of activations and was efficient for low-bit quantization. Later, Wang et al. [Wang et al.(2018)Wang, Hu, Zhang, Zhang, Liu, and Cheng] further extended the idea of [Cai et al.(2017)Cai, He, Sun, and Vasconcelos] with sparse quantization. Besides, they proposed a two-step quantization framework: code learning and transformation function learning. In the code learning step, weights were kept full-precision and activations were quantized based on a Gaussian distribution and a sparsity constraint. The weight binarization was then solved as a non-linear least-squares regression problem in the second step. In [Hu et al.(2018)Hu, Wang, and Cheng], Hu et al. proposed to transform the training of binary weight networks into a hashing problem. This hashing problem was solved with an alternating optimization algorithm, and the binary weight network was then fine-tuned to improve accuracy.

Complete binarization methods: Although weight-only binarization methods already save considerable memory and reduce multiplications, further binarization of the activations converts arithmetic into bit-wise logic operations, enabling fast inference on embedded devices. To the best of our knowledge, [Hubara et al.(2016)Hubara, Courbariaux, Soudry, El-Yaniv, and Bengio] was the first work to binarize both weights and activations to -1 and +1. The work obtained a 32-times compression ratio on weights and 7-times faster inference speed with results comparable to BinaryConnect on small datasets like CIFAR-10 and SVHN. However, later results showed that this training method was not suitable for large datasets, exhibiting obvious accuracy degradation. Later, Rastegari et al. [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] proposed XNOR-Net to improve the inference performance on large-scale datasets. It achieved a better trade-off between compression ratio and accuracy, in which scaling factors for both weights and activations were used to minimize the quantization errors. DoReFa-Net [Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou] proposed by Zhou et al. inherited the idea of XNOR-Net and provided a complete solution for low-bit quantization and binarization of weights, activations and gradients.

In [Tang et al.(2017)Tang, Hua, and Wang], Tang et al. explored several strategies to train binary neural networks. The strategies included setting a small learning rate for binary neural network training, introducing scaling factors for the weights, using the PReLU activation function, and utilizing a regularizer to constrain weights close to +1 or -1. They showed that the binary neural networks achieved accuracy similar to XNOR-Net with a simpler training procedure. In [Lin et al.(2017)Lin, Zhao, and Pan], Lin et al. proposed ABC-Net towards an accurate binary convolutional neural network. Unlike other approaches, they proposed to use linear combinations of multiple binary weights and binary activations for a better approximation of the full-precision weights and activations. With adequate bases, the structure could finally achieve results close to full-precision networks. In the recent work [Liu et al.(2018)Liu, Wu, Luo, Yang, Liu, and Cheng], Liu et al. proposed Bi-Real Net, in which the real activations are added to the binary activations through a shortcut connection to increase the representational capability. Moreover, they provided a tight approximation of the sign function and proposed to pre-train full-precision neural networks as the initialization for binary network training.

3 Proposed Binary Network Training

Training an accurate binary CNN has two major challenges: one is the extremely limited value range of binary data; the other is the difficult backpropagation caused by the derivative of the binarization function. In this section, the proposed binary CNN training approach is introduced to address these two issues.

3.1 Binarization with Trainable Scaling Factors

We first briefly review the binarization operation in one convolution layer, which is shown in Figure 1. As binary weights are difficult to update with gradient-based optimization methods due to the huge gap between different values, it is common to keep full-precision weights during the training process. The binary weights are then obtained from the real values via the binarization function. The input of the convolution operation is actually the binarized activation output of the previous layer, so the binary convolution is represented as

$Y = A_b \circledast W_b, \qquad (1)$

where $A_b$ denotes the binary input activations, $W_b$ the binary weights and $\circledast$ the convolution operation.

Since the input and weights are both binary, the convolution operation can be implemented with bitwise logic operations to get integer results, similar to [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi]. After batch normalization, the integer results become real values within a certain range, and they are binarized in the activation step to generate the binary output feature map for the next convolution layer.

Figure 1: Forward process in binary convolution layer.

We use a sign function and a unit step function for weight and activation binarization, respectively. The two functions are

$\mathrm{sign}(x) = \begin{cases} +1, & x \ge 0 \\ -1, & x < 0 \end{cases}, \qquad \mathrm{step}(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} \qquad (2)$

However, $\mathrm{sign}(x)$ and $\mathrm{step}(x)$ highly restrict the results to only three possible values $\{-1, 0, +1\}$. To increase the value range, [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] and [Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou] proposed to use the average of the absolute values of each output channel as the scaling factor for the weights. This minimizes the binarization error but does not ensure optimal inference performance. Instead, we propose to set trainable scaling factors directly; these scaling factors are updated through backpropagation as part of the network parameters.

Given the real-value weight filter $W_{ij}$, where $i$ and $j$ stand for the output and input channels respectively, we set a scaling factor $\alpha_i$. Then the weight binarization is represented as

$W_b^{ij} = \alpha_i \,\mathrm{sign}(W_{ij}), \qquad (3)$

where $W_b^{ij}$ stands for the binary weight at output channel $i$. Importing the scaling factor $\alpha_i$ enables the binary weights to have different value magnitudes in each output channel.

Similarly, we set a scaling factor $\beta$ for the binary activation

$A_b = \beta \,\mathrm{step}(A - \tau). \qquad (4)$

We set a threshold $\tau$ as a shift parameter trying to extract more information from the original full-precision activation $A$. By importing the parameter $\tau$, only important values are activated and other small values are set to zero. It should be noted that the shift parameter $\tau$ is also updated during network training. $\beta$ is identical for a single activation layer to make it compatible with the hardware implementation, which will be discussed in Section 3.4.
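To make the binarization scheme concrete, the following is a minimal PyTorch-style sketch of a binary convolution layer with trainable scaling factors. The class and parameter names (BinaryConv2d, alpha, beta, tau) are ours, not from the paper, and the backward pass here uses a plain straight-through shortcut; the paper's derivative approximations are sketched in Section 3.2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryConv2d(nn.Module):
    """Sketch of a binary convolution layer with trainable scaling factors.

    alpha: one weight scaling factor per output channel (cf. Eq. 3).
    beta, tau: layer-wise activation scale and threshold (cf. Eq. 4).
    """
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        self.alpha = nn.Parameter(torch.ones(out_ch))   # per-output-channel weight scale
        self.beta = nn.Parameter(torch.ones(1))         # layer-wise activation scale
        self.tau = nn.Parameter(torch.zeros(1))         # trainable activation threshold
        self.stride, self.padding = stride, padding

    def forward(self, x):
        # Activation binarization: beta * step(x - tau). The straight-through trick keeps
        # the forward value binary while letting gradients reach x, tau and beta; the
        # long-tailed estimator of Section 3.2 would replace the identity backward used here.
        u = x - self.tau
        a_hard = (u > 0).to(x.dtype)
        a_bin = self.beta * (u + (a_hard - u).detach())

        # Weight binarization: alpha_i * sign(W_i), again with a straight-through backward.
        w_sign = torch.sign(self.weight)
        w_bin = self.alpha.view(-1, 1, 1, 1) * (self.weight + (w_sign - self.weight).detach())

        return F.conv2d(a_bin, w_bin, stride=self.stride, padding=self.padding)
```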

3.2 Training Algorithm

Figure 2: Different approximations for the derivatives of $\mathrm{sign}(x)$ and $\mathrm{step}(x)$.

As we discussed before, $\mathrm{sign}(x)$ and $\mathrm{step}(x)$ are used to binarize weights and activations, respectively. They are illustrated in Figure 2(a) and 2(b). To update the network parameters, their derivatives are required. However, as we can see in Figure 2(c), their derivative is like an impulse function whose value is zero almost everywhere and infinite at zero. Thus it cannot be applied directly for backpropagation and parameter updating. In [Hubara et al.(2016)Hubara, Courbariaux, Soudry, El-Yaniv, and Bengio, Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi, Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou], a clip function was employed to approximate the binarization functions. The derivative of the clip function, based on the straight-through estimator (STE) [Bengio et al.(2013)Bengio, Léonard, and Courville], is represented in Eq. 5 and illustrated in Figure 2(d):

$\dfrac{\partial\,\mathrm{clip}(x)}{\partial x} = \begin{cases} 1, & |x| \le 1 \\ 0, & \text{otherwise} \end{cases} \qquad (5)$

Instead of Eq. 5, [Liu et al.(2018)Liu, Wu, Luo, Yang, Liu, and Cheng] utilized a piecewise polynomial function as a tighter approximation. In this paper, we further develop this higher-order estimator approach. For $\mathrm{sign}(x)$ used in weight binarization, we give a derivative approximation whose active range is between $-1$ and $1$, illustrated in Figure 2(e). It is a piecewise linear function:

$\dfrac{\partial\,\mathrm{sign}(x)}{\partial x} \approx \begin{cases} 2 - 2|x|, & |x| \le 1 \\ 0, & \text{otherwise} \end{cases} \qquad (6)$

For the derivative of $\mathrm{step}(x)$, an overtight approximation may not be an optimal choice because it will affect the backpropagation to shallow layers far away from the network output. With an overtight approximation, the gradient value will be near zero over a large scope, resulting in gradient vanishing. To alleviate this problem, we use a long-tailed higher-order estimator whose active range is between $-1$ and $1$, shown in Figure 2(f). It is a piecewise function with a tight approximation near zero and a small constant value over a wider range:

$\dfrac{\partial\,\mathrm{step}(x)}{\partial x} \approx \begin{cases} 2 - 4|x|, & |x| \le 0.4 \\ 0.4, & 0.4 < |x| \le 1 \\ 0, & \text{otherwise} \end{cases} \qquad (7)$

Based on Eq. 6 and 7, the backpropagation can be performed smoothly during binary network training. It should be noted that there certainly exist other estimators as a trade-off between tight approximation and smooth backpropagation, such as a Swish-like function [Elfwing et al.(2018)Elfwing, Uchibe, and Doya]. In this section, we provide a simple yet efficient approximation and leave other options as our future work.
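As an illustration of how such estimators slot into backpropagation, here is a hedged PyTorch sketch of custom autograd functions: SignApprox uses the piecewise-linear estimator in the form of Eq. 6, and StepApprox a long-tailed estimator in the spirit of Eq. 7; the tail constants (0.4 over |x| <= 1) are illustrative choices rather than the paper's exact values.

```python
import torch

class SignApprox(torch.autograd.Function):
    """sign() in the forward pass; piecewise-linear derivative approximation (cf. Eq. 6)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Triangular estimator: 2 - 2|x| inside [-1, 1], zero outside.
        grad_approx = torch.clamp(2.0 - 2.0 * x.abs(), min=0.0)
        return grad_out * grad_approx

class StepApprox(torch.autograd.Function):
    """Unit step in the forward pass; long-tailed derivative approximation (cf. Eq. 7).
    Tight near zero, small constant tail over a wider range (constants are illustrative)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x > 0).to(x.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        a = x.abs()
        tight = 2.0 - 4.0 * a           # steep part near zero
        tail = torch.full_like(x, 0.4)  # flat tail keeps gradients alive in shallow layers
        grad_approx = torch.where(a <= 0.4, tight, tail)
        grad_approx = torch.where(a <= 1.0, grad_approx, torch.zeros_like(x))
        return grad_out * grad_approx

# Usage in the BinaryConv2d sketch above:
#   w_hard = SignApprox.apply(self.weight)
#   a_hard = StepApprox.apply(x - self.tau)
```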

3.3 Regularization on Scaling Factors

Since deep CNNs usually have a tremendous number of parameters, a good regularization term is necessary during the training process for robust generalization capability. L_2 regularization, also called ridge regression, is widely used in full-precision networks. It uses the squared magnitude of the weights as a penalty term to help avoid over-fitting.

In our binary network training approach, the weight scaling factors $\alpha$ stand for the actual binary filters used for feature extraction. Thus, the L_2 regularization term is modified to restrict the scaling factors accordingly. The total loss function is then represented as

$L(\alpha, \theta) = L_{task}(\alpha, \theta) + \lambda \sum_{l} \big\lVert \alpha^{(l)} \big\rVert_2^2, \qquad (8)$

where $L_{task}$ is the task-related loss such as the cross-entropy loss, $\alpha^{(l)}$ denotes the weight scaling factors in which the superscript $l$ stands for different layers, and $\theta$ stands for the other parameters adopted in the CNN structure. The second term is the new L_2 regularization with weight decay parameter $\lambda$. The regularization term tends to decrease the magnitude of the scaling factors $\alpha$.
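A minimal sketch of this loss in PyTorch, assuming the scaling factors are parameters whose names contain "alpha" as in the BinaryConv2d sketch above; the weight decay value is only a placeholder, and the optimizer's built-in weight decay would be disabled so the penalty touches the scaling factors alone.

```python
import torch

def total_loss(task_loss, model, weight_decay=1e-4):
    """Sketch of Eq. 8: add an L2 penalty only on the weight scaling factors.

    Assumes scaling factors are registered as parameters whose names contain 'alpha';
    the default weight_decay value is illustrative, not the paper's setting.
    """
    reg = torch.zeros((), device=task_loss.device)
    for name, p in model.named_parameters():
        if "alpha" in name:                 # restrict the penalty to weight scaling factors
            reg = reg + p.pow(2).sum()
    return task_loss + weight_decay * reg
```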

3.4 Compatibility with Hardware Implementation

In this section we show that our binary network can be easily implemented on hardware platforms via bitwise operations. To simplify the discussion, we take a 3x3 convolution as an example and let the number of input channels be 1. $W_i$ is the convolution kernel, where $i$ indicates the output channel; the input of the convolution is the binary activation of the previous layer. The 3x3 convolution operation is

$z_i = \sum_{k=1}^{9} \big(\beta\, b_{a,k}\big)\big(\alpha_i\, b_{w,k}\big) = \alpha_i \beta \sum_{k=1}^{9} b_{a,k}\, b_{w,k}, \qquad (9)$

where $\alpha_i$ and $\beta$ are the trained scaling factors for weights and activations respectively, and $b_{w,k} \in \{-1,+1\}$ and $b_{a,k} \in \{0,1\}$ are the binary weight and activation values. For the same output channel, $\alpha_i \beta$ is a constant, so it can be integrated into the following batch normalization.

$b_{a,k}$ and $b_{w,k}$ are both binary values, so the summation $\sum_k b_{a,k}\, b_{w,k}$ can be implemented with bitwise operations. For each bit, the result of the basic multiplication $b_{a,k} \times b_{w,k}$ is one of the three values $\{-1, 0, +1\}$. If we let the binary "1" stand for the value $+1$ in $b_{w,k}$, the truth table of the binary multiplication is shown in Table 1, in which $P$ means the result is $+1$ and $N$ means the result is $-1$. It is easy to see that

$P_k = b_{a,k} \wedge b_{w,k}, \qquad N_k = b_{a,k} \wedge \overline{b_{w,k}}. \qquad (10)$

With the popcount operation counting the number of "1"s in a binary sequence, we can get the binary convolution

$\sum_{k=1}^{9} b_{a,k}\, b_{w,k} = \mathrm{popcount}(P) - \mathrm{popcount}(N). \qquad (11)$

Figure 3 shows the hardware architecture of the binary convolution based on Eq. 11. As we can see, for a 3x3 convolution with 9 multiplications and 8 additions, the binary calculation only requires 2 popcount operations and 1 subtraction. A short software check of this bit-level scheme is sketched after Figure 3.

b_a  b_w  P  N
0    0    0  0
0    1    0  0
1    0    0  1
1    1    1  0
Table 1: Truth table of the binary multiplication $b_{a,k} \times b_{w,k}$ ($P$: result is $+1$; $N$: result is $-1$).
Figure 3: Hardware architecture of the binary convolution.
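The bitwise scheme of Eqs. 10 and 11 can be checked with a few lines of plain Python; the encoding below (activations as a 0/1 bit mask, weights as a bit mask where 1 stands for +1) and the helper name binary_dot are ours, intended only as a software illustration of the hardware datapath.

```python
def binary_dot(a_bits, w_bits, n):
    """Dot product of a 0/1 activation vector and a +1/-1 weight vector using
    only AND and popcount, as in Eq. 10-11.

    a_bits: integer whose k-th bit is a_k in {0, 1}
    w_bits: integer whose k-th bit is 1 when w_k = +1 and 0 when w_k = -1
    n: number of elements (9 for a 3x3 kernel with one input channel)
    """
    mask = (1 << n) - 1
    p = a_bits & w_bits            # bits where a_k = 1 and w_k = +1  -> product +1
    q = a_bits & (~w_bits & mask)  # bits where a_k = 1 and w_k = -1  -> product -1
    return bin(p).count("1") - bin(q).count("1")

# Check against ordinary arithmetic for a 3x3 kernel.
a = [1, 0, 1, 1, 0, 0, 1, 1, 0]           # binary activations in {0, 1}
w = [+1, -1, -1, +1, +1, -1, +1, -1, +1]  # binary weights in {-1, +1}
a_bits = sum(b << k for k, b in enumerate(a))
w_bits = sum((1 if v > 0 else 0) << k for k, v in enumerate(w))
assert binary_dot(a_bits, w_bits, 9) == sum(x * y for x, y in zip(a, w))
```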

4 Experimental Results

In this section, the performance of our proposed method is evaluated on the image classification task. Performance on other tasks such as object detection and pose estimation will be investigated in future work. Two image classification datasets, CIFAR-10 [Krizhevsky and Hinton(2009)] and ImageNet ILSVRC12 [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, et al.], are used for evaluation. The CIFAR-10 dataset consists of 50,000 training images of size 32x32 belonging to 10 classes and 10,000 testing images. The large ImageNet ILSVRC12 consists of 1000 classes with about 1.2 million images for training and 50,000 images for validation. We build the same VGG-Small network as [Courbariaux et al.(2015)Courbariaux, Bengio, and David, Hubara et al.(2016)Hubara, Courbariaux, Soudry, El-Yaniv, and Bengio, Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] for evaluation on the CIFAR-10 dataset. For ImageNet, we implement the AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] and Resnet-18 [He et al.(2016)He, Zhang, Ren, and Sun] networks. All networks are trained from random initialization without pre-training. Following previous works [Hubara et al.(2016)Hubara, Courbariaux, Soudry, El-Yaniv, and Bengio, Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi, Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou], we binarize all the convolution and fully-connected layers except the first and the last layer. For regularization, the weight decay parameter $\lambda$ is set to a small constant. [Hubara et al.(2016)Hubara, Courbariaux, Soudry, El-Yaniv, and Bengio] and [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] pointed out that ADAM converges faster and usually performs better for binary inputs. Thus we use ADAM optimization for parameter updating, with separate initial learning rates for CIFAR-10 and ImageNet.

4.1 Performance on CIFAR-10

Table 2 presents the results of the VGG-Small network on the CIFAR-10 dataset. The second column, Bit-width, denotes the quantization bits for weights and activations. The third column shows the accuracy results. Two binary network approaches, BNN [Hubara et al.(2016)Hubara, Courbariaux, Soudry, El-Yaniv, and Bengio] and XNOR-Net [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi], and one low-precision quantization approach, HWGQ [Cai et al.(2017)Cai, He, Sun, and Vasconcelos], are selected for comparison. The result of the full-precision network model is also presented as a reference. The proposed training approach achieves 92.3% accuracy on CIFAR-10, exceeding BNN and XNOR-Net by 2.4% and 2.5%, respectively. Moreover, our result with a binary network is very close to that of HWGQ [Cai et al.(2017)Cai, He, Sun, and Vasconcelos] with 2-bit activation quantization.

Method Bit-width (W/A) Accuracy
Ours 1/1 92.3%
BNN [Hubara et al.(2016)Hubara, Courbariaux, Soudry, El-Yaniv, and Bengio] 1/1 89.9%
XNOR-Net [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] 1/1 89.8%
HWGQ [Cai et al.(2017)Cai, He, Sun, and Vasconcelos] 1/2 92.5%
Full-Precision 32/32 93.6%
Table 2: Results Comparison of VGG-Small on CIFAR-10.
(a) Results of our approach versus BNN [Hubara et al.(2016)Hubara, Courbariaux, Soudry, El-Yaniv, and Bengio].
(b) Results with and without regularization.
Figure 4: Training curves comparison of VGG-Small on CIFAR-10.

To further evaluate the proposed method, we show the training curves in Figure 4. First, in Figure 4(a), the training curves of BNN [Hubara et al.(2016)Hubara, Courbariaux, Soudry, El-Yaniv, and Bengio] and our approach are compared. Although they reach similar training accuracy, our approach converges faster and improves the validation accuracy considerably. Moreover, our validation accuracy curve is more stable than BNN's due to the regularization. The effectiveness of the regularization is then validated in Figure 4(b). The red curve is the result with our new regularization on the scaling factors. It fluctuates less, and the regularization term helps to improve the accuracy from 92.0% to 92.3%, indicating better generalization.

4.2 Evaluations and Comparisons on ImageNet

On the large ImageNet ILSVRC12 dataset, we test the accuracy of the proposed approach with the AlexNet and Resnet-18 networks. The experimental results are presented in Table 3. We select four existing methods for comparison, including XNOR-Net [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi], BinaryNet [Tang et al.(2017)Tang, Hua, and Wang], ABC-Net [Lin et al.(2017)Lin, Zhao, and Pan] and DoReFa-Net [Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou]. It should be noted that the Resnet-18 accuracy results of BinaryNet [Tang et al.(2017)Tang, Hua, and Wang] are quoted from [Lin et al.(2017)Lin, Zhao, and Pan]. Some results of ABC-Net [Lin et al.(2017)Lin, Zhao, and Pan] and DoReFa-Net [Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou] are not provided, so we leave them blank. Similarly, the results of the full-precision networks are provided in Table 3 as a reference.

For AlexNet, our approach achieves 46.1% top-1 accuracy and 70.9% top-5 accuracy. It is the best result among the five binary network solutions and surpasses the other works by up to 4.9% and 5.3%, respectively. For Resnet-18, our method obtains 54.2% top-1 accuracy and 77.9% top-5 accuracy, improving the performance by up to 12.0% and 10.8% compared with the other works. It is also shown that the proposed approach succeeds in reducing the accuracy gap between full-precision and binary networks to about 10%. This indicates that our approach can effectively improve the inference performance of binary convolutional neural networks.

Method Bit-width (W/A) AlexNet Top-1 AlexNet Top-5 Resnet-18 Top-1 Resnet-18 Top-5
Ours 1/1 46.1% 70.9% 54.2% 77.9%
XNOR-Net [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] 1/1 44.2% 69.2% 51.2% 73.2%
BinaryNet [Tang et al.(2017)Tang, Hua, and Wang] 1/1 41.2% 65.6% 42.2% 67.1%
ABC-Net [Lin et al.(2017)Lin, Zhao, and Pan] 1/1 - - 42.7% 67.6%
DoReFa-Net [Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou] 1/1 43.6% - - -
Full-Precision 32/32 56.6% 80.2% 69.6% 89.2%
Table 3: Results Comparisons of AlexNet and Resnet-18 on ImageNet.

4.3 Analysis of Network Model Size

Network Full Model Size Binary Model Size Compression Ratio
VGG-Small 53.52MB 1.75MB 30.6
AlexNet 237.99MB 22.77MB 10.5
Resnet-18 49.83MB 3.51MB 14.2
Table 4: Network Model Size Comparison.

Ideally, a binary neural network should achieve a 32x compression ratio compared with the full-precision model. However, in our approach the first and the last layer are excluded from the binarization operation, following [Hubara et al.(2016)Hubara, Courbariaux, Soudry, El-Yaniv, and Bengio, Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi, Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou]. Besides, the batch normalization parameters and scaling factors are kept in full precision for better network representation capability. These choices reduce the actual network compression ratio. Table 4 shows the actual parameter size comparison of three network structures. For a typical network, binarization can reduce the model size by over 10 times. It can also be observed that, with a better network structure, the binary network achieves a higher compression ratio.
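The gap to the ideal 32x ratio can be reproduced with a rough back-of-the-envelope estimate like the sketch below, which keeps the first and last layers, batch-normalization parameters and scaling factors in 32-bit; the toy layer shapes are illustrative and not the actual VGG-Small, AlexNet or Resnet-18 configurations.

```python
def model_size_bytes(layers, binarize=True):
    """Rough size estimate showing why the ratio stays below 32x: the first and last
    layers, BN parameters and scaling factors remain 32-bit.

    layers: list of dicts with 'params' (weight count), 'out_ch' and an optional 'bn' flag.
    """
    total_bits = 0
    n = len(layers)
    for idx, layer in enumerate(layers):
        first_or_last = idx == 0 or idx == n - 1
        if binarize and not first_or_last:
            total_bits += layer["params"]           # 1 bit per binarized weight
            total_bits += 32 * layer["out_ch"]      # full-precision weight scaling factors
            total_bits += 32 * 2                    # activation scale beta and threshold tau
        else:
            total_bits += 32 * layer["params"]      # full-precision weights
        if layer.get("bn", False):
            total_bits += 32 * 4 * layer["out_ch"]  # BN mean, variance, gamma, beta
    return total_bits / 8

# Illustrative three-layer toy network (not a real architecture from the paper).
toy = [
    {"params": 3 * 64 * 3 * 3, "out_ch": 64, "bn": True},
    {"params": 64 * 128 * 3 * 3, "out_ch": 128, "bn": True},
    {"params": 128 * 10, "out_ch": 10, "bn": False},
]
full = model_size_bytes(toy, binarize=False)
binary = model_size_bytes(toy, binarize=True)
print(f"compression ratio: {full / binary:.1f}x")
```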

5 Conclusion

This paper proposes an approach to train binary CNNs with higher inference accuracy based on three ideas. First, trainable scaling factors for both weights and activations are employed to extend the value range of the binary numbers. Second, a higher-order estimator and a long-tailed estimator for the derivative of the binarization function are proposed to balance tight approximation and efficient backpropagation. Finally, the L_2 regularization is applied directly to the weight scaling factors for better generalization capability. The approach effectively improves network accuracy and is suitable for hardware implementation.

References

  • [Bengio et al.(2013)Bengio, Léonard, and Courville] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • [Cai et al.(2017)Cai, He, Sun, and Vasconcelos] Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. Deep learning with low precision by half-wave gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5918–5926, 2017.
  • [Chen et al.(2017)Chen, Yuan, Liao, Yu, and Hua] Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua. Stylebank: An explicit representation for neural image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1897–1906, 2017.
  • [Courbariaux et al.(2015)Courbariaux, Bengio, and David] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131. 2015.
  • [Dong et al.(2017)Dong, Ni, Li, Chen, Zhu, and Su] Yinpeng Dong, Renkun Ni, Jianguo Li, Yurong Chen, Jun Zhu, and Hang Su. Learning accurate low-bit deep neural networks with stochastic quantization. In British machine vision conference (BMVC), 2017.
  • [Elfwing et al.(2018)Elfwing, Uchibe, and Doya] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018.
  • [Gatys et al.(2016)Gatys, Ecker, and Bethge] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2414–2423, 2016.
  • [Girshick et al.(2014)Girshick, Donahue, Darrell, and Malik] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 580–587, 2014.
  • [Han et al.(2015)Han, Pool, Tran, and Dally] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143. 2015.
  • [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 770–778, 2016.
  • [Hou and Kwok(2018)] Lu Hou and James T Kwok. Loss-aware weight quantization of deep networks. In International conference on learning representations (ICLR), 2018.
  • [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [Hu et al.(2018)Hu, Wang, and Cheng] Qinghao Hu, Peisong Wang, and Jian Cheng. From hashing to cnns: Training binary weight networks via hashing. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018.
  • [Hubara et al.(2016)Hubara, Courbariaux, Soudry, El-Yaniv, and Bengio] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in neural information processing systems, pages 4107–4115. 2016.
  • [Iandola et al.(2016)Iandola, Han, Moskewicz, Ashraf, Dally, and Keutzer] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
  • [Krizhevsky and Hinton(2009)] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105. 2012.
  • [Lebedev et al.(2015)Lebedev, Ganin, Rakhuba, Oseledets, and Lempitsky] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. In International conference on learning representations (ICLR), 2015.
  • [Lin et al.(2017)Lin, Zhao, and Pan] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pages 345–353. 2017.
  • [Lin et al.(2016)Lin, Courbariaux, Memisevic, and Bengio] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. In International conference on learning representations (ICLR), 2016.
  • [Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, Reed, Fu, and Berg] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision (ECCV), pages 21–37. Springer, 2016.
  • [Liu et al.(2018)Liu, Wu, Luo, Yang, Liu, and Cheng] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In Proceedings of the European Conference on Computer Vision (ECCV), pages 722–737, 2018.
  • [Novikov et al.(2015)Novikov, Podoprikhin, Osokin, and Vetrov] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In Advances in neural information processing systems, pages 442–450. 2015.
  • [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision (ECCV), pages 525–542. Springer, 2016.
  • [Redmon et al.(2016)Redmon, Divvala, Girshick, and Farhadi] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 779–788, 2016.
  • [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, et al.] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  • [Simonyan and Zisserman(2015)] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International conference on learning representations (ICLR), 2015.
  • [Tang et al.(2017)Tang, Hua, and Wang] Wei Tang, Gang Hua, and Liang Wang. How to train a compact binary neural network with high accuracy? In Thirty-First AAAI Conference on Artificial Intelligence (AAAI), 2017.
  • [Umuroglu et al.(2017)Umuroglu, Fraser, Gambardella, Blott, Leong, Jahre, and Vissers] Yaman Umuroglu, Nicholas J Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. Finn: A framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pages 65–74. ACM, 2017.
  • [Wang et al.(2018)Wang, Hu, Zhang, Zhang, Liu, and Cheng] Peisong Wang, Qinghao Hu, Yifan Zhang, Chunjie Zhang, Yang Liu, and Jian Cheng. Two-step quantization for low-bit neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4376–4384, 2018.
  • [Zhang et al.(2017a)Zhang, Kyaw, Chang, and Chua] Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, and Tat-Seng Chua. Visual translation embedding network for visual relation detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 5532–5540, 2017a.
  • [Zhang et al.(2017b)Zhang, Kyaw, Yu, and Chang] Hanwang Zhang, Zawlin Kyaw, Jinyang Yu, and Shih-Fu Chang. Ppr-fcn: weakly supervised visual relation detection via parallel pairwise r-fcn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4233–4241, 2017b.
  • [Zhang et al.(2018)Zhang, Zhou, Lin, and Sun] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6848–6856, 2018.
  • [Zhou et al.(2017)Zhou, Yao, Guo, Xu, and Chen] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. In International Conference on Learning Representations (ICLR), 2017.
  • [Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
  • [Zhu et al.(2017)Zhu, Han, Mao, and Dally] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. In International conference on learning representations (ICLR), 2017.
  • [Zhuang et al.(2018)Zhuang, Shen, Tan, Liu, and Reid] Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, and Ian Reid. Towards effective low-bitwidth convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7920–7928, 2018.