1 Introduction
Convolutional neural networks (CNNs) have achieved great success in a wide range of real-world applications such as image classification [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton, Simonyan and Zisserman(2015), He et al.(2016)He, Zhang, Ren, and Sun], object detection [Girshick et al.(2014)Girshick, Donahue, Darrell, and Malik, Redmon et al.(2016)Redmon, Divvala, Girshick, and Farhadi, Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, Reed, Fu, and Berg], visual relation detection [Zhang et al.(2017a)Zhang, Kyaw, Chang, and Chua, Zhang et al.(2017b)Zhang, Kyaw, Yu, and Chang] and image style transfer [Gatys et al.(2016)Gatys, Ecker, and Bethge, Chen et al.(2017)Chen, Yuan, Liao, Yu, and Hua] in recent years. However, powerful CNNs come with large model sizes and high computational complexity. For example, VGG16 [Simonyan and Zisserman(2015)] requires over 500MB of memory for parameter storage and 31 GFLOPs for a single image inference. This high resource demand makes CNNs difficult to deploy on mobile and embedded devices.
To alleviate this issue, several kinds of approaches have been developed to compress networks and reduce computational cost. The first approach is to design compact network architectures with similar performance. For example, SqueezeNet [Iandola et al.(2016)Iandola, Han, Moskewicz, Ashraf, Dally, and Keutzer] utilized a 1x1 convolution layer to squeeze the output channel number before the computationally expensive 3x3 convolution operations. More recently, MobileNet [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam] and ShuffleNet [Zhang et al.(2018)Zhang, Zhou, Lin, and Sun] introduced efficient depthwise and group convolutions to replace the normal, more complex convolution operation. The second approach is to reduce or compress network parameters. [Lebedev et al.(2015)Lebedev, Ganin, Rakhuba, Oseledets, and Lempitsky] and [Novikov et al.(2015)Novikov, Podoprikhin, Osokin, and Vetrov] compressed network weights using tensor decomposition. Connection pruning was employed in [Han et al.(2015)Han, Pool, Tran, and Dally] to reduce the parameters of AlexNet and VGG16 by up to 13 times. The third category quantizes network parameters, as presented in [Zhuang et al.(2018)Zhuang, Shen, Tan, Liu, and Reid, Dong et al.(2017)Dong, Ni, Li, Chen, Zhu, and Su, Lin et al.(2016)Lin, Courbariaux, Memisevic, and Bengio, Zhu et al.(2017)Zhu, Han, Mao, and Dally, Zhou et al.(2017)Zhou, Yao, Guo, Xu, and Chen, Hou and Kwok(2018)]. Network quantization reduces the memory requirement efficiently because parameter values are represented with fewer bits. At the same time, it alleviates the computational cost, since floating-point calculations are converted into fixed-point calculations that need fewer computation resources. Furthermore, as an extreme case of parameter quantization, binary convolutional neural networks quantize both weights and activations with 1 bit [Hubara et al.(2016)Hubara, Courbariaux, Soudry, ElYaniv, and Bengio, Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi]. They have attracted significant research interest because binary CNNs reduce the network size by 32 times compared with full precision and replace multiplications with bitwise logic operations. As a result, binary CNNs are suitable for accelerator implementations on hardware platforms such as FPGAs [Umuroglu et al.(2017)Umuroglu, Fraser, Gambardella, Blott, Leong, Jahre, and Vissers]. However, network binarization may decrease accuracy noticeably due to the extremely limited precision. Achieving better inference performance therefore remains a challenge that needs further exploration.
In this paper, an improved training approach for binary CNNs is proposed which is easy to implement on hardware platforms. Our approach includes three novel ideas:

Trainable scaling factors are introduced for weight and activation binarization. Previous works such as XNOR-Net [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] set the mean of the absolute weight values as the scaling factor, which minimizes the quantization error but does not ensure the best inference accuracy. Instead, we employ trainable scaling factors for both weights and activations and update them jointly with the other network parameters.
Derivative approximation is discussed for binary network training. Since the derivative of the binarization function resembles an impulse function, it is not suitable for backpropagation. We propose a higher-order approximation for weight binarization and a long-tailed approximation for activation binarization as a trade-off between tight approximation and smooth backpropagation.

The regularization term acts on the weight scaling factors directly. Since in our approach the weight scaling factors represent the magnitudes of the actual binary filters, the regularization term should be modified accordingly for better generalization capability.
The proposed binary network approach achieves better inference accuracy and faster convergence speed. The rest of this paper is organized as follows. Section 2 reviews previous related works. The proposed binary network training approach is introduced in detail in Section 3. Experimental results are provided in Section 4. Finally, Section 5 gives the conclusion.
2 Related Work
Previous works [Zhuang et al.(2018)Zhuang, Shen, Tan, Liu, and Reid, Dong et al.(2017)Dong, Ni, Li, Chen, Zhu, and Su, Lin et al.(2016)Lin, Courbariaux, Memisevic, and Bengio, Zhu et al.(2017)Zhu, Han, Mao, and Dally, Zhou et al.(2017)Zhou, Yao, Guo, Xu, and Chen, Hou and Kwok(2018)] have demonstrated that quantization can greatly reduce the memory requirements and computational complexity of various CNN structures. One extreme case of quantization is to constrain real values with only 1 bit, i.e., binarization. It can be further divided into two subcategories: one binarizes only the weights and leaves the activations full-precision or quantized, while the other binarizes both the weight and activation values.
Weight-only binarization methods: Courbariaux et al. [Courbariaux et al.(2015)Courbariaux, Bengio, and David] first proposed to train networks with weights constrained to only two possible values, -1 and +1. They introduced a training method, called BinaryConnect, to deal with binary weights in both forward and backward propagation and obtained promising results on small datasets. In [Cai et al.(2017)Cai, He, Sun, and Vasconcelos], Cai et al. binarized the weights and proposed a half-wave Gaussian quantizer for the activations. The proposed quantizer exploited the statistics of the activations and was efficient for low-bit quantization. Later, Wang et al. [Wang et al.(2018)Wang, Hu, Zhang, Zhang, Liu, and Cheng] further extended the idea of [Cai et al.(2017)Cai, He, Sun, and Vasconcelos] with sparse quantization. Besides, they proposed a two-step quantization framework: code learning and transformation function learning. In the code-learning step, the weights were kept at full precision and the activations were quantized based on a Gaussian distribution and a sparsity constraint; the weight binarization was then solved as a non-linear least-squares regression problem in the second step. In [Hu et al.(2018)Hu, Wang, and Cheng], Hu et al. proposed to cast the training of binary weight networks as a hashing problem. This hashing problem was solved with an alternating optimization algorithm, and the binary weight network was then fine-tuned to improve accuracy.

Complete binarization methods: Although weight-only binarization methods already save much memory and remove multiplications, further binarizing the activations converts arithmetic into bitwise logic operations, enabling fast inference on embedded devices. To the best of our knowledge, [Hubara et al.(2016)Hubara, Courbariaux, Soudry, ElYaniv, and Bengio] was the first work to binarize both weights and activations to -1 and +1. The work obtained a 32x compression ratio on weights and 7x faster inference, with results comparable to BinaryConnect on small datasets like CIFAR-10 and SVHN. However, later results showed that this training method was not suitable for large datasets, with obvious accuracy degradation. Later, Rastegari et al. [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] proposed XNOR-Net to improve the inference performance on large-scale datasets. It achieved a better trade-off between compression ratio and accuracy by using scaling factors for both weights and activations to minimize the quantization errors. DoReFa-Net [Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou], proposed by Zhou et al., inherited the idea of XNOR-Net and provided a complete solution for low-bit quantization and binarization of weights, activations and gradients.
In [Tang et al.(2017)Tang, Hua, and Wang], Tang et al. explored several strategies to train binary neural networks, including setting a small learning rate for binary network training, introducing scaling factors for the weights via the PReLU activation function, and utilizing a regularizer to constrain the weights close to +1 or -1. They showed that these binary neural networks achieved accuracy similar to XNOR-Net with a simpler training procedure. In [Lin et al.(2017)Lin, Zhao, and Pan], Lin et al. proposed ABC-Net towards an accurate binary convolutional neural network. Unlike other approaches, they used a linear combination of multiple binary weight bases and binary activations to better approximate the full-precision weights and activations. With adequate bases, the structure could achieve results close to full-precision networks. In the recent work [Liu et al.(2018)Liu, Wu, Luo, Yang, Liu, and Cheng], Liu et al. proposed Bi-Real Net, in which the real activations are added to the binary activations through a shortcut connection to increase the representational capability. Moreover, they provided a tight approximation of the sign function and proposed to pretrain full-precision neural networks as the initialization for binary network training.
3 Proposed Binary Network Training
Training an accurate binary CNN has two major challenges: one is the extremely limited value range of binary data, and the other is the difficult backpropagation in the training procedure, caused by the derivative of the binarization function. In this section, the proposed binary CNN training approach is introduced to address these two issues.
3.1 Binarization with Trainable Scaling Factors
We first briefly review the binarization operation in one convolution layer, which is shown in Figure 1. As binary weights are difficult to update with gradient-based optimization methods due to the huge gap between the two values, it is common to keep full-precision weights during the training process. The binary weights are then obtained from the real values via the binarization function. The input of the convolution operation is the binarized activation output of the previous layer, so the binary convolution is represented as
$Y = W^{b} \ast A^{b} \qquad (1)$

where $\ast$ denotes the convolution operation, $W^{b}$ the binary weights and $A^{b}$ the binary input activations.
Since the input and weights are both binary, the convolution operation can be implemented with bitwise logic operations to get integer results, similar to [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi]. After batch normalization, the integer results become real values within a certain range, and they are binarized in the activation step to generate the binary output feature map for the next convolution layer.

We use a sign function for weight binarization and a unit step function for activation binarization, defined as
$\mathrm{sign}(x) = \begin{cases} +1 & x \ge 0 \\ -1 & x < 0 \end{cases}, \qquad \mathrm{step}(x) = \begin{cases} 1 & x \ge 0 \\ 0 & x < 0 \end{cases} \qquad (2)$
However, $\mathrm{sign}$ and $\mathrm{step}$ highly restrict the convolution results to only three possible values $\{-1, 0, +1\}$. To increase the value range, [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] and [Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou] proposed to use the average of the absolute weight values of each output channel as the scaling factor for the weights. This minimizes the binarization error but does not ensure optimal inference performance. Instead, we propose to set trainable scaling factors directly; these scaling factors are updated through backpropagation as part of the network parameters.
Given the real-valued weight filter $W_{i,j}$, where $i$ and $j$ stand for the output and input channel respectively, we set a scaling factor $\alpha_i$ for each output channel. The weight binarization is then represented as
$W^{b}_{i,j} = \alpha_i \cdot \mathrm{sign}(W_{i,j}) \qquad (3)$
$W^{b}_{i,j}$ stands for the binary weight at output channel $i$. Introducing the scaling factor enables the binary weights to have a different value magnitude in each output channel.
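As a concrete illustration of Eq. 3, a minimal NumPy sketch of the forward weight binarization might look as follows; the function name, array shapes and sample values are our own illustration, not from the paper:

```python
import numpy as np

def binarize_weights(w_real, alpha):
    """Binarize weights per output channel: W_b[i] = alpha[i] * sign(W[i]).

    w_real: real-valued weights, shape (out_ch, in_ch, kh, kw)
    alpha:  trainable scaling factors, one per output channel, shape (out_ch,)
    """
    signs = np.where(w_real >= 0, 1.0, -1.0)       # sign(), mapping 0 to +1
    return alpha[:, None, None, None] * signs      # broadcast scale per channel

w = np.array([[[[0.3, -0.7]]], [[[-0.1, 0.9]]]])   # shape (2, 1, 1, 2)
alpha = np.array([0.5, 2.0])
wb = binarize_weights(w, alpha)
```

In a real training loop, both `w_real` and `alpha` would receive gradients, since the scaling factors are updated jointly with the other network parameters.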
Similarly, we set a scaling factor $\beta$ for the binary activation:
$A^{b} = \beta \cdot \mathrm{step}(A - \tau) \qquad (4)$
We set a threshold $\tau$ as a shift parameter to extract more information from the original full-precision activation $A$. With the parameter $\tau$, only important values are activated while small values are set to zero. It should be noted that the shift parameter $\tau$ is also updated during network training. $\beta$ is identical within a single activation layer to keep it compatible with the hardware implementation, which will be discussed in Section 3.4.
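The activation binarization of Eq. 4 can likewise be sketched in a few lines of NumPy; the function name and the sample values of the scale and threshold are illustrative assumptions:

```python
import numpy as np

def binarize_activations(a_real, beta, tau):
    """Binarize activations: A_b = beta * step(a - tau).

    Only values at or above the trainable threshold tau are activated
    (to the value beta); everything else becomes exactly zero.
    """
    return beta * (a_real >= tau).astype(a_real.dtype)

a = np.array([-0.5, 0.1, 0.4, 1.2])
out = binarize_activations(a, beta=0.8, tau=0.3)   # -> [0.0, 0.0, 0.8, 0.8]
```

Because the output takes only the two values 0 and `beta`, a single bit per activation suffices for storage, with the shared `beta` factored out per layer.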
3.2 Training Algorithm
As discussed before, $\mathrm{sign}$ and $\mathrm{step}$ are used to binarize weights and activations, respectively. They are illustrated in Figures 2(a) and 2(b). To update the network parameters, their derivatives are required. However, as shown in Figure 2(c), their derivatives resemble an impulse function whose value is zero almost everywhere and infinite at zero, so they cannot be applied directly for backpropagation and parameter updating. In [Hubara et al.(2016)Hubara, Courbariaux, Soudry, ElYaniv, and Bengio, Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi, Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou], a clip function $\mathrm{clip}(x) = \max(-1, \min(1, x))$ was employed to approximate the binarization functions. The derivative of $\mathrm{clip}$, based on the straight-through estimator (STE) [Bengio et al.(2013)Bengio, Léonard, and Courville], is represented in Eq. 5 and illustrated in Figure 2(d).

$\dfrac{\partial\, \mathrm{clip}(x)}{\partial x} = \begin{cases} 1 & |x| \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad (5)$
Instead of Eq. 5, [Liu et al.(2018)Liu, Wu, Luo, Yang, Liu, and Cheng] utilized a piecewise polynomial function as a tighter approximation. In this paper, we further develop this higher-order estimator approach. For the sign function used in weight binarization, we use a derivative approximation whose active range is $[-1, 1]$, illustrated in Figure 2(e). It is a piecewise linear function:

$\dfrac{\partial\, \mathrm{sign}(x)}{\partial x} \approx \begin{cases} 2 - 2|x| & |x| \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad (6)$
For the derivative of the step function, an over-tight approximation may not be an optimal choice because it affects backpropagation to shallow layers far away from the network output: with an over-tight approximation, the gradient is near zero over a large scope, resulting in gradient vanishing. To alleviate this problem, we use a long-tailed higher-order estimator whose active range is wider, shown in Figure 2(f). It is a piecewise function with a tight approximation near zero and a small constant value over a wider range.
(7) 
Based on Eqs. 6 and 7, backpropagation can be performed smoothly during binary network training. It should be noted that other estimators certainly exist as trade-offs between tight approximation and smooth backpropagation, such as the Swish-like function [Elfwing et al.(2018)Elfwing, Uchibe, and Doya]. In this paper, we provide a simple yet efficient approximation and leave other options as future work.
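To make the backward pass concrete, the sketch below implements the clip-based STE of Eq. 5 alongside two surrogate derivatives in the spirit of Figures 2(e) and 2(f); the breakpoints, heights and tail width used here are illustrative assumptions, not the paper's exact constants:

```python
import numpy as np

def ste_clip_grad(x):
    """Eq. 5: straight-through estimator -- gradient 1 inside [-1, 1], else 0."""
    return (np.abs(x) <= 1.0).astype(np.float64)

def triangular_grad(x):
    """A piecewise-linear surrogate for the sign derivative with active
    range [-1, 1]: tallest at zero, falling linearly to zero at +/-1."""
    return np.maximum(0.0, 2.0 - 2.0 * np.abs(x))

def long_tailed_grad(x, tail=0.2, tail_range=2.0):
    """A long-tailed surrogate for the step derivative (illustrative):
    a tight peak near zero plus a small constant tail over a wider range."""
    core = np.maximum(0.0, 1.0 - 2.0 * np.abs(x))
    in_tail = (np.abs(x) > 0.5) & (np.abs(x) <= tail_range)
    return core + tail * in_tail

x = np.array([-2.5, -0.8, 0.0, 0.8, 2.5])
g_ste = ste_clip_grad(x)        # -> [0., 1., 1., 1., 0.]
```

During training, each surrogate multiplies the incoming gradient element-wise in place of the true (impulse-like) derivative of the binarization function.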
3.3 Regularization on Scaling Factors
Since deep CNNs usually have a tremendous parameter set, a good regularization term is necessary during the training process for robust generalization capability. $L_2$ regularization, also called ridge regression, is widely used in full-precision networks; it uses the squared magnitude of the weights as a penalty term to help avoid overfitting. In our binary network training approach, the weight scaling factors $\alpha$ stand for the magnitudes of the actual binary filters used for feature extraction. Thus, the $L_2$ regularization term is modified to restrict the scaling factors accordingly. The total loss function is then represented as
$L(\Theta, \alpha) = L_{\mathrm{task}}(\Theta, \alpha) + \lambda \sum_{l} \big\lVert \alpha^{l} \big\rVert_{2}^{2} \qquad (8)$
where $L_{\mathrm{task}}$ is the task-related loss such as the cross-entropy loss, $\alpha^{l}$ denotes the weight scaling factors in which the superscript $l$ stands for different layers, and $\Theta$ stands for the other parameters adopted in the CNN structure. The second term is the new $L_2$ regularization with weight decay parameter $\lambda$. The regularization term tends to decrease the magnitude of the scaling factors $\alpha$.
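The modified loss of Eq. 8 is straightforward to compute; in this sketch the scaling-factor arrays and the weight-decay value are placeholders of our own:

```python
import numpy as np

def total_loss(task_loss, alphas, weight_decay):
    """Eq. 8 sketch: task loss plus an L2 penalty applied directly to the
    weight scaling factors alpha^l of each binarized layer, rather than to
    the (binary) weights themselves."""
    reg = sum(float(np.sum(a ** 2)) for a in alphas)
    return task_loss + weight_decay * reg

alphas = [np.array([0.5, 2.0]), np.array([1.0])]   # alpha^1, alpha^2
loss = total_loss(task_loss=2.3, alphas=alphas, weight_decay=0.01)
```

Penalizing the scaling factors instead of the raw weights makes sense here because the raw weights only contribute their signs to the binary model.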
3.4 Compatibility with Hardware Implementation
In this section we show that our binary network can be easily implemented on hardware platforms via bitwise operations. To simplify the discussion, we take a 3x3 convolution as an example and let the input channel number be 1. $W_i$ is the convolution kernel, where $i$ indicates the output channel, and the input of the convolution is the binary activation $A^{b}$ of the previous layer. The 3x3 convolution operation is
$Y_i = \alpha_i \beta \sum_{k=1}^{9} \mathrm{sign}(W_{i,k}) \cdot \mathrm{step}(A_k - \tau) \qquad (9)$
where $\alpha_i$ and $\beta$ are the trained scaling factors for weights and activations respectively. For the same output channel, $\alpha_i \beta$ is a constant, so it can be integrated into the following batch normalization.
$\mathrm{sign}(W_{i,k})$ and $\mathrm{step}(A_k - \tau)$ are both binary valued and thus can be implemented with bitwise operations. At each position, the result of the basic multiplication is one of three values $\{-\alpha_i\beta, 0, +\alpha_i\beta\}$. If we let the binary bit "0" stand for the value $-1$ in $\mathrm{sign}(W_{i,k})$, the truth table of the binary multiplication is shown in Table 1, in which $p = 1$ means the result is $+\alpha_i\beta$ and $n = 1$ means the result is $-\alpha_i\beta$. It is easy to see that
$p = a \wedge w, \qquad n = a \wedge \neg w \qquad (10)$

where $a$ and $w$ denote the activation bit and the weight bit, respectively.
With a popcount operation that counts the number of "1"s in a binary sequence, we can get the binary convolution result as
$Y_i = \alpha_i \beta \left( \mathrm{popcount}(p) - \mathrm{popcount}(n) \right) \qquad (11)$
Figure 3 shows the hardware architecture of the binary convolution based on Eq. 11. As we can see, instead of the 9 multiplications and 8 additions of a full-precision 3x3 convolution, the binary calculation only requires 2 bitwise logic operations and 1 subtraction.
a  w  p  n
0  0  0  0
0  1  0  0
1  0  0  1
1  1  1  0
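Under the encoding of Table 1 (activation bit 1 meaning the activation fired, weight bit 1 meaning $+1$ and 0 meaning $-1$), the bitwise computation of Eqs. 10-11 can be checked with a few lines of Python; packing the nine window positions into one integer is our own illustration:

```python
def binary_conv_3x3(a_bits, w_bits, alpha, beta):
    """Eqs. 10-11 for one 3x3 window, with the 9 positions packed into
    the low 9 bits of an integer."""
    mask = 0x1FF                        # keep only the 9 window bits
    p = a_bits & w_bits                 # positions contributing +alpha*beta
    n = a_bits & (~w_bits & mask)       # positions contributing -alpha*beta
    return alpha * beta * (bin(p).count("1") - bin(n).count("1"))

# All 9 activations on, all 9 weights positive: result = 9 * alpha * beta.
r = binary_conv_3x3(0b111111111, 0b111111111, alpha=0.5, beta=2.0)  # -> 9.0
```

On hardware, the two AND gates, the popcounts and the subtraction replace all nine multiply-accumulate operations of the full-precision window.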
4 Experimental Results
In this section, the performance of our proposed method is evaluated on the image classification task. Performance on other tasks such as object detection and pose estimation will be examined in future work. Two image classification datasets, CIFAR-10 [Krizhevsky and Hinton(2009)] and ImageNet ILSVRC12 [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, et al.], are used for evaluation. The CIFAR-10 dataset consists of 50,000 training images of size 32x32 belonging to 10 classes, and 10,000 testing images. The large ImageNet ILSVRC12 dataset consists of 1000 classes, with about 1.2 million images for training and 50,000 images for validation. We build the same VGG-Small network as [Courbariaux et al.(2015)Courbariaux, Bengio, and David, Hubara et al.(2016)Hubara, Courbariaux, Soudry, ElYaniv, and Bengio, Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] for evaluation on CIFAR-10. For ImageNet, we implement the AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] and ResNet-18 [He et al.(2016)He, Zhang, Ren, and Sun] networks. All networks are trained from random initialization without pretraining. Following previous works [Hubara et al.(2016)Hubara, Courbariaux, Soudry, ElYaniv, and Bengio, Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi, Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou], we binarize all the convolution and fully-connected layers except the first and the last layer. For regularization, the weight decay parameter $\lambda$ is fixed throughout training. [Hubara et al.(2016)Hubara, Courbariaux, Soudry, ElYaniv, and Bengio] and [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] pointed out that ADAM converges faster and usually performs better for binary inputs, so we use ADAM optimization for parameter updating, with initial learning rates chosen separately for CIFAR-10 and ImageNet.
4.1 Performance on CIFAR-10
Table 2 presents the results of the VGG-Small network on the CIFAR-10 dataset. The second column, Bitwidth, denotes the quantization bits for weights and activations; the third column shows the accuracy results. Two binary network approaches, BNN [Hubara et al.(2016)Hubara, Courbariaux, Soudry, ElYaniv, and Bengio] and XNOR-Net [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi], and one low-precision quantization approach, HWGQ [Cai et al.(2017)Cai, He, Sun, and Vasconcelos], are selected for comparison. The result of the full-precision network model is also presented as a reference. The proposed training approach achieves 92.3% accuracy on CIFAR-10, exceeding BNN and XNOR-Net by 2.4% and 2.5%, respectively. Moreover, our binary-network result is very close to that of HWGQ [Cai et al.(2017)Cai, He, Sun, and Vasconcelos] with 2-bit activation quantization.
Method  Bitwidth (W/A)  Accuracy 

Ours  1/1  92.3% 
BNN [Hubara et al.(2016)Hubara, Courbariaux, Soudry, ElYaniv, and Bengio]  1/1  89.9% 
XNOR-Net [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi]  1/1  89.8% 
HWGQ [Cai et al.(2017)Cai, He, Sun, and Vasconcelos]  1/2  92.5% 
Full-Precision  32/32  93.6% 
To further evaluate the proposed method, we show the training curves in Figure 4. First, in Figure 4(a), the training curves of BNN [Hubara et al.(2016)Hubara, Courbariaux, Soudry, ElYaniv, and Bengio] and our approach are compared. Although the two have similar training accuracy, our approach converges faster and improves validation accuracy considerably. Moreover, our validation accuracy curve is more stable than that of BNN due to the regularization. The effectiveness of the regularization is further validated in Figure 4(b): the red curve is the result with our new regularization. It fluctuates less, and the regularization term helps improve the accuracy from 92.0% to 92.3%, indicating better generalization.
4.2 Evaluations and Comparisons on ImageNet
On the large ImageNet ILSVRC12 dataset, we test the accuracy of the proposed approach for the AlexNet and ResNet-18 networks. The experimental results are presented in Table 3. We select four existing methods for comparison: XNOR-Net [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi], BinaryNet [Tang et al.(2017)Tang, Hua, and Wang], ABC-Net [Lin et al.(2017)Lin, Zhao, and Pan] and DoReFa-Net [Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou]. It should be noted that the ResNet-18 accuracy results of BinaryNet [Tang et al.(2017)Tang, Hua, and Wang] are quoted from [Lin et al.(2017)Lin, Zhao, and Pan]. Some results of ABC-Net [Lin et al.(2017)Lin, Zhao, and Pan] and DoReFa-Net [Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou] are not provided, so we leave them blank. Similarly, the results of the full-precision networks are provided in Table 3 as a reference.
For AlexNet, our approach achieves 46.1% top-1 accuracy and 70.9% top-5 accuracy. This is the best result among the five binary network solutions, surpassing the other works by up to 4.9% and 5.3%, respectively. For ResNet-18, our method obtains 54.2% top-1 accuracy and 77.9% top-5 accuracy, improving the performance by up to 12.0% and 10.8% compared with the other works. The proposed approach also succeeds in reducing the accuracy gap between full-precision and binary networks to about 10%. These results indicate that our approach can effectively improve the inference performance of binary convolutional neural networks.
Method  Bitwidth (W/A)  AlexNet Accuracy  ResNet-18 Accuracy  
Top-1  Top-5  Top-1  Top-5  
Ours  1/1  46.1%  70.9%  54.2%  77.9% 
XNOR-Net [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi]  1/1  44.2%  69.2%  51.2%  73.2% 
BinaryNet [Tang et al.(2017)Tang, Hua, and Wang]  1/1  41.2%  65.6%  42.2%  67.1% 
ABC-Net [Lin et al.(2017)Lin, Zhao, and Pan]  1/1      42.7%  67.6% 
DoReFa-Net [Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou]  1/1  43.6%       
Full-Precision  32/32  56.6%  80.2%  69.6%  89.2% 
4.3 Analysis of Network Model Size
Network  Full Model Size  Binary Model Size  Compression Ratio 
VGG-Small  53.52MB  1.75MB  30.6 
AlexNet  237.99MB  22.77MB  10.5 
ResNet-18  49.83MB  3.51MB  14.2 
Ideally, a binary neural network achieves a 32x compression ratio compared with the full-precision model. But in our approach, the first and the last layer are excluded from binarization following [Hubara et al.(2016)Hubara, Courbariaux, Soudry, ElYaniv, and Bengio, Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi, Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou]. Besides, the batch normalization parameters and scaling factors are kept in full precision for better network representation capability. These factors reduce the actual compression ratio. Table 4 shows the actual parameter size comparison for three network structures. For a typical network, binarization reduces the model size by over 10 times. It can also be observed that with a better network structure, the binary network achieves a higher model compression ratio.
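The effect of keeping some layers at full precision on the overall ratio can be sketched with simple bit accounting; the parameter counts below are placeholders, not the actual ones behind Table 4:

```python
def compression_ratio(binarized_params, fp_params):
    """Model-size ratio when binarized layers cost 1 bit per weight and the
    remaining parameters (first/last layer, batch norm, scaling factors)
    stay at 32 bits."""
    full_bits = 32 * (binarized_params + fp_params)
    mixed_bits = 1 * binarized_params + 32 * fp_params
    return full_bits / mixed_bits

ideal = compression_ratio(10_000_000, 0)         # -> 32.0 (everything binary)
actual = compression_ratio(10_000_000, 500_000)  # kept layers pull it down
```

Even a small fraction of full-precision parameters noticeably lowers the achievable ratio, which matches the gap between the ideal 32x and the measured ratios in Table 4.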
5 Conclusion
This paper proposes an approach to train binary CNNs with higher inference accuracy based on three ideas. First, trainable scaling factors for both weights and activations are employed to provide different value ranges for the binary numbers. Second, a higher-order estimator and a long-tailed estimator for the derivative of the binarization function are proposed to balance tight approximation against efficient backpropagation. Finally, regularization is performed directly on the weight scaling factors for better generalization capability. The approach effectively improves network accuracy and is suitable for hardware implementation.
References
 [Bengio et al.(2013)Bengio, Léonard, and Courville] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

 [Cai et al.(2017)Cai, He, Sun, and Vasconcelos] Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. Deep learning with low precision by half-wave gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5918–5926, 2017.
 [Chen et al.(2017)Chen, Yuan, Liao, Yu, and Hua] Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua. Stylebank: An explicit representation for neural image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1897–1906, 2017.
 [Courbariaux et al.(2015)Courbariaux, Bengio, and David] Matthieu Courbariaux, Yoshua Bengio, and JeanPierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131. 2015.
 [Dong et al.(2017)Dong, Ni, Li, Chen, Zhu, and Su] Yinpeng Dong, Renkun Ni, Jianguo Li, Yurong Chen, Jun Zhu, and Hang Su. Learning accurate lowbit deep neural networks with stochastic quantization. In British machine vision conference (BMVC), 2017.

 [Elfwing et al.(2018)Elfwing, Uchibe, and Doya] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018.
 [Gatys et al.(2016)Gatys, Ecker, and Bethge] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2414–2423, 2016.
 [Girshick et al.(2014)Girshick, Donahue, Darrell, and Malik] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 580–587, 2014.
 [Han et al.(2015)Han, Pool, Tran, and Dally] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143. 2015.
 [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 770–778, 2016.
 [Hou and Kwok(2018)] Lu Hou and James T Kwok. Lossaware weight quantization of deep networks. In International conference on learning representations (ICLR), 2018.
 [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

 [Hu et al.(2018)Hu, Wang, and Cheng] Qinghao Hu, Peisong Wang, and Jian Cheng. From hashing to cnns: Training binary weight networks via hashing. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018.
 [Hubara et al.(2016)Hubara, Courbariaux, Soudry, ElYaniv, and Bengio] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in neural information processing systems, pages 4107–4115. 2016.
 [Iandola et al.(2016)Iandola, Han, Moskewicz, Ashraf, Dally, and Keutzer] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
 [Krizhevsky and Hinton(2009)] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105. 2012.
 [Lebedev et al.(2015)Lebedev, Ganin, Rakhuba, Oseledets, and Lempitsky] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. In International conference on learning representations (ICLR), 2015.
 [Lin et al.(2017)Lin, Zhao, and Pan] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pages 345–353. 2017.
 [Lin et al.(2016)Lin, Courbariaux, Memisevic, and Bengio] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. In International conference on learning representations (ICLR), 2016.
 [Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, Reed, Fu, and Berg] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, ChengYang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision (ECCV), pages 21–37. Springer, 2016.
 [Liu et al.(2018)Liu, Wu, Luo, Yang, Liu, and Cheng] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and KwangTing Cheng. Bireal net: Enhancing the performance of 1bit cnns with improved representational capability and advanced training algorithm. In Proceedings of the European Conference on Computer Vision (ECCV), pages 722–737, 2018.
 [Novikov et al.(2015)Novikov, Podoprikhin, Osokin, and Vetrov] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In Advances in neural information processing systems, pages 442–450. 2015.
 [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnornet: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision (ECCV), pages 525–542. Springer, 2016.
 [Redmon et al.(2016)Redmon, Divvala, Girshick, and Farhadi] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, realtime object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 779–788, 2016.
 [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, et al.] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
 [Simonyan and Zisserman(2015)] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. In International conference on learning representations (ICLR), 2015.
 [Tang et al.(2017)Tang, Hua, and Wang] Wei Tang, Gang Hua, and Liang Wang. How to train a compact binary neural network with high accuracy? In ThirtyFirst AAAI Conference on Artificial Intelligence (AAAI), 2017.
 [Umuroglu et al.(2017)Umuroglu, Fraser, Gambardella, Blott, Leong, Jahre, and Vissers] Yaman Umuroglu, Nicholas J Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. Finn: A framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays (FPGA), pages 65–74. ACM, 2017.
 [Wang et al.(2018)Wang, Hu, Zhang, Zhang, Liu, and Cheng] Peisong Wang, Qinghao Hu, Yifan Zhang, Chunjie Zhang, Yang Liu, and Jian Cheng. Twostep quantization for lowbit neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4376–4384, 2018.
 [Zhang et al.(2017a)Zhang, Kyaw, Chang, and Chua] Hanwang Zhang, Zawlin Kyaw, ShihFu Chang, and TatSeng Chua. Visual translation embedding network for visual relation detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 5532–5540, 2017a.
 [Zhang et al.(2017b)Zhang, Kyaw, Yu, and Chang] Hanwang Zhang, Zawlin Kyaw, Jinyang Yu, and ShihFu Chang. Pprfcn: weakly supervised visual relation detection via parallel pairwise rfcn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4233–4241, 2017b.
 [Zhang et al.(2018)Zhang, Zhou, Lin, and Sun] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6848–6856, 2018.
 [Zhou et al.(2017)Zhou, Yao, Guo, Xu, and Chen] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless cnns with lowprecision weights. In International Conference on Learning Representations (ICLR), 2017.
 [Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefanet: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
 [Zhu et al.(2017)Zhu, Han, Mao, and Dally] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. In International conference on learning representations (ICLR), 2017.
 [Zhuang et al.(2018)Zhuang, Shen, Tan, Liu, and Reid] Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, and Ian Reid. Towards effective lowbitwidth convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7920–7928, 2018.