Introduction
Deep neural networks (DNNs), especially deep convolutional neural networks (CNNs), have been well demonstrated in a wide variety of computer vision applications such as image classification [12], object detection [23] and re-identification [10]. Traditional CNNs usually carry massive parameters and high computational complexity for the sake of accuracy. Consequently, most of these advanced deep CNN models require expensive storage and computing resources, and cannot be easily deployed on portable devices such as mobile phones and cameras. In recent years, a number of approaches have been proposed to learn portable deep neural networks, including multi-bit quantization [2, 4, 14], pruning [11, 30], low-rank decomposition [31], hashing [5], lightweight architecture design [35], and knowledge distillation [13, 32]. Among them, quantization-based methods [2, 14] represent the weights and activations in the network with very low precision, and thus yield far more compact DNN models than their floating-point counterparts. In the extreme case where both the weights and activations are quantized (i.e. binarized) into one-bit values, the conventional convolution operations can be efficiently realized via bitwise operations [22, 16]. The resulting decrease in storage and acceleration in inference are therefore appealing to the community. Since the introduction of binary neural networks [16], many works have sought to address the performance drop and improve the expressive ability of binarized networks, e.g. Bi-Real Net [21], or have taken more bits into account, e.g. DoReFa-Net [36] and LQ-Net [33].
Although much progress has been made on binarizing DNNs, existing quantization methods still cause a significantly large accuracy drop compared with full-precision models. This observation indicates that the information propagated through the full-precision network is largely lost when using the binary representation. On the one side, the weights act as filters that extract features layer by layer, and they are expected to be diverse for powerful feature representation. Biased weight binarization such as the plain sign function usually ignores the variations among the floating-point weights, and thus leads to severe information loss. For instance, if all the binary weights degenerate into +1 or -1, the convolutional or fully-connected layers degenerate into an identity map, which inevitably reduces the representation learning capability of the binary network and makes it hard to train in practice. On the other side, the activations convey the information used for network prediction; binarizing them into 1 bit causes large deviations from the original full-precision feature maps, and thus greatly increases the information loss during the layer-wise propagation.
To address this problem and maximally preserve the information, in this paper we propose the Balanced Binary Network with Gated Residual (BBG for short), which learns balanced binary weights by maximizing their information entropy, and reduces the binarization loss of activations with a gated residual path. Figure 1 illustrates the whole structure of our BBG. In BBG, we find that simply removing the mean of the floating-point weights roughly eliminates the redundancy among binary weights and attains the maximum entropy. To accomplish this in a deep network, we reparameterize the weights and devise a linear transformation to replace the standard one, which can be easily implemented and learned. To compensate for the information loss when binarizing activations, the gated residual path employs a lightweight operation that uses the channel attention information of the floating-point activations to reconstruct the information flow across layers. Our approach is fully compatible with bitwise operations, retaining the benefit of rapid inference for a wide spectrum of deep network architectures. We evaluate our BBG method on image classification and object detection benchmarks, i.e. the CIFAR-10, CIFAR-100, ImageNet and Pascal VOC datasets; the experimental results show that it performs remarkably well across various network architectures, outperforming the state of the art in terms of memory consumption, inference speed and accuracy.
Related Work
To reduce the space and computational complexities of deep neural networks simultaneously, a number of quantization and binarization methods have been proposed for processing both weights and activations in neural networks.
BinaryNet [16] first proposed binary neural networks, for which the performance drop is not obvious for shallow networks on small datasets. XNOR-Net [22] adopts scale factors in each layer, solving for them by minimizing the quantization error of the output matrix. TBN [29] enhances the expressive ability of neural networks with ternary activations. ABC-Net [18] uses more binary weight and activation bases to increase accuracy, although the compression and speedup ratios are reduced accordingly. LQ-Net [33] further proposes a low bit-width quantization function learned by minimizing the quantization error, which achieves comparable results on the ImageNet benchmark at the cost of increased memory overhead.
In contrast to conventional model compression methods with sophisticated pipelines, such as pruning [11, 30] and matrix decomposition [31], neural network binarization can be efficiently implemented through logic and counting operations. However, the performance of existing binarization methods is still lower than that of the original baseline models due to inaccurate quantization procedures, which decrease the network's expressive ability and make it hard to train. Bi-Real Net [21] therefore proposes additional shortcuts and a different approximation of the sign function in the backward pass. PCNN [9] employs a projection matrix to help with network training. CircConv [19] rotates the binary weights three times, computes feature maps with the four binary weights, and merges them together. BENN [37] ensembles multiple standard binary networks to improve performance. AutoBNN [25] employs a genetic algorithm to search for binary neural network architectures. Most of these works have complicated training pipelines, and some increase the FLOPs to achieve better results.
The Proposed Method
In this section, we first give the preliminaries on binary neural networks, and then present the technical details of our BBG method with respect to the weights and activations.
Binary Neural Networks
Consider $w \in \mathbb{R}^n$ and $a \in \mathbb{R}^n$ as the floating-point weight and activation of a fully-connected layer in the neural network, respectively, where $n$ is the dimension of the weight and activations. Quantizing the weight and activation reduces the memory cost and accelerates the computation in the inference phase. Denote by $b_w$ and $b_a$ the binary bases for $w$ and $a$, respectively. The output feature $z$ can be obtained by

$$z = b_w^\top b_a, \tag{1}$$
where the inner product can be calculated efficiently with the bitwise operations AND and Bitcount. To deal with the zero-gradient problem of the sign function in the backward propagation, we adopt the straight-through estimator (STE) [1] to estimate the gradient. The sign function, which ignores the distribution of the floating-point weights, generates biased binary weights and makes the binary network hard to train.
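As a concrete illustration, this sign binarization with the STE can be sketched as follows (a minimal NumPy sketch; the function names are ours, not from the paper, and `sign` here follows the convention that non-positive values map to -1):

```python
import numpy as np

def sign_binarize(w):
    """Forward: map floating-point weights to {-1, +1} with the sign function
    (+1 for strictly positive values, -1 otherwise)."""
    return np.where(w > 0, 1.0, -1.0)

def ste_grad(grad_out):
    """Backward: the straight-through estimator (STE) passes the gradient
    through the non-differentiable sign function unchanged."""
    return grad_out

w = np.array([0.3, -0.7, 0.0, 1.2])
b = sign_binarize(w)           # -> [ 1., -1., -1.,  1.]
g = ste_grad(np.ones_like(w))  # gradient w.r.t. w equals gradient w.r.t. b
```

Note that if `w` were biased (e.g. mostly positive), almost all binary weights would collapse to +1, which is exactly the failure mode discussed above.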
Maximizing Entropy with Balanced Binary Weights
For training a deep neural network on a dataset with $N$ samples $\{(x_i, y_i)\}_{i=1}^N$, the goal is to minimize the loss $\mathcal{L}$ between the true label $y_i$ and the predicted label $f(x_i; W)$, obtaining parameters $W$ that fit the training data and generalize well to test data:

$$\min_{W} \sum_{i=1}^{N} \mathcal{L}\big(y_i, f(x_i; W)\big). \tag{2}$$
In binary neural networks, the most important trainable parameters are the non-differentiable discrete binary weights of the fully-connected and convolutional layers. In quantized convolutional layers, how the binary weights are generated is crucial to the network's performance, and the discrete nature of binary weights makes the network troublesome to train.
To this end, we aim to preserve the information of the weights in the binarization process by a differentiable and time-saving method. The value of each binarized weight obeys a Bernoulli distribution which takes the value $+1$ with probability $p$ and the value $-1$ with probability $1-p$, where $p \in [0, 1]$. For each kernel in the network, the information contained in its binary weights can be measured by the entropy:

$$H(p) = -p \log p - (1-p) \log (1-p). \tag{3}$$
To preserve the information of the binary weights, the training of a binary neural network should thus maximize the entropy of the binary weights. Directly optimizing the above regularizer is hard because the probability $p$ is difficult to estimate. Instead, we approximate the optimal solution by forcing the expectation of the binary weights to be zero, i.e. $\mathbb{E}[b_w] = 0$, which corresponds to $p = 1/2$ and hence the maximum entropy. Therefore, the binarized network should be optimized by jointly minimizing the task-related objective, e.g. the cross-entropy loss for classification, and the information loss through each layer (i.e. maximizing the entropy):

$$\min_{W} \sum_{i=1}^{N} \mathcal{L}\big(y_i, f(x_i; W)\big) - \lambda \sum_{l} H\big(p^{(l)}\big), \tag{4}$$

where $l$ indexes the binarized layers and $\lambda$ balances the two terms.
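The claim that zero-mean binary weights maximize the entropy can be checked numerically; a small sketch (the function name `bernoulli_entropy` is ours) confirms that $H(p)$ peaks at $p = 0.5$, i.e. $\mathbb{E}[b_w] = 2p - 1 = 0$:

```python
import numpy as np

def bernoulli_entropy(p):
    """Entropy H(p) = -p log p - (1-p) log(1-p) of a +1/-1 weight
    that takes +1 with probability p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0) at the endpoints
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

ps = np.linspace(0.01, 0.99, 99)           # grid of probabilities
best = ps[np.argmax(bernoulli_entropy(ps))]
# the maximum lies at p = 0.5, where E[b_w] = 2p - 1 = 0
```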
Hence, we first center the real-valued weights and then quantize them into binary codes. More specifically, we use a proxy parameter $\tilde{w}$ to obtain $w$ and then quantize $w$ into binary codes $b_w$. In the convolutional layer, we express the weight in terms of the proxy parameter as:

$$w = \tilde{w} - \operatorname{mean}(\tilde{w}). \tag{5}$$
After this balanced weight normalization, we can perform binary quantization on the floating-point weight $w$. The forward and backward passes of the binary weights are:

Forward: $\quad b_w = \beta \cdot \operatorname{sign}(w), \quad \beta = \operatorname{mean}(|w|), \tag{6}$

Backward: $\quad \dfrac{\partial \ell}{\partial w} = \dfrac{\partial \ell}{\partial b_w},$

where $\operatorname{sign}(\cdot)$ outputs $+1$ for positive numbers and $-1$ otherwise, and $\operatorname{mean}(|\cdot|)$ calculates the mean of the absolute values.
In our balanced weight quantization, parameter updating operates on the proxy parameter $\tilde{w}$, so the gradient signal can back-propagate through the normalization process. Throughout, $\tilde{w}$ and $w$ are floating-point: we update $\tilde{w}$ during training, and only $b_w$ is needed at inference. Since the balanced normalization from $\tilde{w}$ to $w$ is differentiable, the proxy parameter can be updated easily; SGD or alternatives such as ADAM can be applied directly to the optimization with respect to $\tilde{w}$.
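The centering and scaling steps above can be sketched in NumPy as follows (a minimal illustration of Eqs. (5)-(6) as we read them; `balanced_binarize` is a hypothetical name):

```python
import numpy as np

def balanced_binarize(w_tilde):
    """Balanced weight binarization sketch: center the proxy weights
    (Eq. 5), then binarize with a mean-absolute scale (Eq. 6)."""
    w = w_tilde - w_tilde.mean()            # remove the mean -> balanced signs
    beta = np.abs(w).mean()                 # scaling factor beta = mean(|w|)
    return beta * np.where(w > 0, 1.0, -1.0)

w_tilde = np.array([0.9, 0.5, 0.1, -0.3])  # biased toward positive values
b = balanced_binarize(w_tilde)
# after centering (mean 0.3) the signs split evenly: [+, +, -, -]
```

Without the centering step, all four weights above would binarize to the same sign, collapsing the filter toward an identity map.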
Reconstructing Information Flow with Gated Residual
For the binarization of activations, we first clip the value range of the activations into $[0, 1]$ and then use a round function to binarize them, which amounts to:

$$b_a = \operatorname{round}\big(\operatorname{clip}(a, 0, 1)\big), \quad \operatorname{clip}(a, 0, 1) = \min(\max(a, 0), 1). \tag{7}$$
Therefore, the forward and backward passes for binary activations are:

Forward: $\quad b_a = \operatorname{round}\big(\operatorname{clip}(a, 0, 1)\big), \tag{8}$

Backward: $\quad \dfrac{\partial \ell}{\partial a} = \dfrac{\partial \ell}{\partial b_a} \cdot \mathbf{1}_{[0,1]}(a),$

where $\mathbf{1}_{[0,1]}(a)$ is 1 for elements of $a$ lying in the range $[0, 1]$ and 0 otherwise. The whole binarization process of a fully-connected layer is summarized in Algorithm 1.
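The activation binarization with its clipped STE gradient can be sketched as (a minimal NumPy sketch; function names are ours):

```python
import numpy as np

def binarize_act_forward(a):
    """Forward: clip activations to [0, 1], then round to {0, 1}."""
    return np.round(np.clip(a, 0.0, 1.0))

def binarize_act_backward(a, grad_out):
    """Backward: pass the gradient only where a lies in [0, 1]
    (clipped straight-through estimator)."""
    mask = ((a >= 0.0) & (a <= 1.0)).astype(a.dtype)
    return grad_out * mask

a = np.array([-0.5, 0.2, 0.7, 1.5])
b = binarize_act_forward(a)                    # -> [0., 0., 1., 1.]
g = binarize_act_backward(a, np.ones_like(a))  # -> [0., 1., 1., 0.]
```

Activations outside $[0, 1]$ receive no gradient, which mirrors the saturation behavior of the clip function.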
Binarizing activations causes a much larger loss of precision than binarizing weights. Moreover, the activation values of all channels are usually quantized to 0 or 1 without considering the differences among channels, and the quantization error caused by the binary layers accumulates layer by layer. To address this problem, we propose a new module named gated residual that reconstructs the information flow in a channel-wise manner during the forward propagation. Our gated residual employs the floating-point activations to reduce the quantization error and recalibrate the features.
In our gated residual module, we design a new layer named the gated layer, whose gate weights $g \in \mathbb{R}^{C}$ learn the channel attention information of the floating-point input feature map $x \in \mathbb{R}^{C \times H \times W}$ of a binary convolutional layer ($C$, $H$, $W$ denote channels, height and width, respectively). The operation on the $c$-th channel of the input feature map is defined as:

$$\tilde{x}_c = g_c \cdot x_c. \tag{9}$$
Based on the gated residual, the output feature map can be recalibrated, enhancing the representation power of the activations. The operation of the gated module can be written as:

$$y = F(x) + g \odot x, \tag{10}$$

where $F(\cdot)$ denotes the operations in the main path, i.e. activation binarization, balanced convolution and BatchNorm, and $\odot$ is the channel-wise product.
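A minimal sketch of the gated module's forward pass, with a stand-in for the binary main path (names and shapes are our illustrative assumptions):

```python
import numpy as np

def gated_module(x, g, main_path):
    """Gated residual forward: binary main path F(x) plus the floating-point
    input scaled per channel by the gate weights g.
    x has shape (C, H, W); g has shape (C,)."""
    return main_path(x) + g[:, None, None] * x

C, H, W = 2, 2, 2
x = np.arange(C * H * W, dtype=float).reshape(C, H, W)
g = np.ones(C)                   # gates initialized to 1, like an identity shortcut
F = lambda t: np.zeros_like(t)   # stand-in for binarize -> balanced conv -> BatchNorm
y = gated_module(x, g, F)        # with F = 0 and g = 1, y reduces to x
```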
With the gated residual layer, the overall structure of the gated module is shown in Figure 2(d). Similarly to Bi-Real Net [21], we construct our balanced module with the gated residual by adding a shortcut to each layer rather than one shortcut per basic block. We initialize the weights of the gated residual to 1, matching the identity shortcut of the vanilla ResNet. In Figure 3, we visualize the weight values of a gated residual layer: they peak around 1 but vary over a wide range, indicating different impacts on different channels. This shows that the gated residual can also help distinguish the channels and suppress the less useful ones.
Besides reconstructing the information flow in the forward pass, this path also acts as an auxiliary gradient path for the activations in the backward propagation. The STE used in the backward pass to approximate discrete binary functions usually results in a severe gradient mismatch problem. Fortunately, with learnable gate weights, our gated residual can also reconstruct the gradient:

$$\frac{\partial \ell}{\partial x} = \frac{\partial \ell}{\partial y} \cdot \left( \frac{\partial F(x)}{\partial x} + g \right). \tag{11}$$
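A numerical sketch of this auxiliary gradient path (scalar case, illustrative values of our own choosing):

```python
import numpy as np

# With y = F(x) + g * x, the total gradient is dL/dx = dL/dy * (dF/dx + g).
# Even in the worst case where the STE gradient through the binary main
# path vanishes (dF/dx = 0), the learnable gate g still carries a nonzero
# gradient back to the floating-point activations.
g = 0.8          # a learned gate weight (illustrative value)
dF_dx = 0.0      # binary main path passes no gradient here
grad_y = 1.0     # upstream gradient dL/dy
grad_x = grad_y * (dF_dx + g)   # nonzero thanks to the gated path
```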
Given the computational and memory constraints inherent to binary networks, the additional operations introduced by a new module need to be as small as possible. The structure of HighwayNet [27] requires full-precision weights as large as those in the convolutional layer, which is unacceptable. The SE module of SENet [15] corrects the output after the convolutional layer, but by then the information has already been corrupted by the binary convolution; we argue that it is better to make use of the unquantized information. Moreover, our gated layer adds only one multiplication per feature map element, still far fewer FLOPs than the SE module; as the number of channels increases and the SE reduction ratio decreases, the FLOPs required by the SE module grow much larger. In comparison with Bi-Real Net, its channel-wise scalars are exactly as numerous as the weights in our gated layer, but they are computed as the mean of the absolute floating-point weights: they are not independent parameters and can be absorbed into the following BatchNorm layer. Our learnable gated residual layers, in contrast, help reconstruct the activation and gradient information, enhancing the network's representation ability.
Experiments
In this section, to verify the effectiveness of our proposed BBG-Net, we conduct experiments on both image classification and object detection tasks over several benchmark datasets, and compare BBG-Net with other state-of-the-art methods.
Datasets and Implementation Details
Datasets.
In our experiments, we adopt three common image classification datasets: CIFAR-10/100 [17] and ILSVRC-2012 ImageNet [24]. The CIFAR-10/100 dataset consists of 60,000 color images in 10/100 classes, split into 50,000 training and 10,000 validation images. ImageNet is a large-scale dataset with 1.2 million training images and 50k validation images belonging to 1,000 classes. For evaluation, we employ the widely used classification metrics, Top-1 and Top-5 validation accuracy, and use the letters W and A to stand for weights and activations, respectively. We also evaluate our method on object detection with the standard Pascal VOC benchmarks, VOC2007 [7] and VOC2012 [8]. Following previous work, we train our models on the combination of the VOC2007 trainval and VOC2012 trainval sets and test on the VOC2007 test set. The standard mAP metric is used for comparison.
Network Architectures and Setup.
We conduct our experiments on the popular and powerful network architectures VGG-Small [4], ResNet [12] and the Single Shot Detector (SSD) [20]. For fair comparison with existing methods, we choose VGG-Small, ResNet-20, ResNet-18 and ResNet-34 as the baseline models on the CIFAR-10/100 and ImageNet datasets, and we verify SSD with different backbones, i.e. VGG16 [26] and ResNet-34, on the Pascal VOC datasets. As for hyper-parameters, we mostly follow the setup of the original papers, and all our models are trained from scratch.
On the CIFAR-10/100 datasets, padding downsample layers are used for the shortcuts of ResNet-20 instead of learnable convolutional layers. We train for 400 epochs with a learning rate starting at 0.1 and decayed by 0.1 at epochs 200, 300 and 375. For VGG-Small, we train for 400 epochs with a learning rate starting at 0.02 and decayed by 0.1 at epochs 80, 160 and 300. We apply SGD with momentum 0.9 as the optimizer.
On the ImageNet dataset, we train for 120 epochs with a learning rate starting at 0.001 and reduced by 0.1 at epochs 70, 90 and 110, using ADAM as the optimizer. Note that, like most binary neural networks, we do not quantize the first and last layers, and we also do not quantize the downsample layers, as suggested by previous work [3]. The accuracy drop is far too large if these layers are quantized; moreover, most binary networks such as Bi-Real Net [21] do not binarize them either.
On the Pascal VOC datasets, we binarize all convolutional layers in the backbone except the first convolutional layer and the downsample layers for ResNet-34. For VGG16, we binarize all convolutional layers in the backbone except the first and last ones, and we only add the gated residual to convolutional layers with the same input and output channel counts to avoid additional computation. BatchNorm layers are used in the backbone, and the learning rate starts at 0.05.
FLOPs and Memory Analysis.
Binary neural networks are made up of two kinds of layers: floating-point layers and binary layers. When calculating FLOPs, the operation count of a binary layer is divided by 64, following Bi-Real Net [21]; this is the potential acceleration ratio of binary layers given suitable hardware support. Similarly, when measuring memory, the parameter count of a binary layer is counted at 1 bit per weight rather than 32, since storing one floating-point weight takes 32 bits whereas a binary weight takes only 1 bit. In the comparisons across different networks and in the network width and resolution experiments, we report FLOPs and memory to show our advantages over other state-of-the-art results.
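The accounting above can be sketched as follows (a hypothetical helper; the 1/64 FLOP and 1-bit-per-weight factors are the Bi-Real Net convention assumed here):

```python
def conv_cost(c_in, c_out, k, h, w, binary):
    """Rough FLOPs/memory accounting for a k x k convolution producing an
    h x w output map: binary multiply-adds count as 1/64 of a floating-point
    FLOP, and binary weights need 1 bit instead of 32."""
    macs = c_in * c_out * k * k * h * w       # multiply-accumulate operations
    flops = macs / 64 if binary else macs     # binary ops via AND + Bitcount
    bits = c_in * c_out * k * k * (1 if binary else 32)  # weight storage
    return flops, bits

fp = conv_cost(64, 64, 3, 32, 32, binary=False)
bn = conv_cost(64, 64, 3, 32, 32, binary=True)
# the binary layer needs 1/64 of the FLOPs and 1/32 of the weight storage
```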
Table 1: Ablation of balanced weights and the gated residual, ResNet-20 on CIFAR-10.

| W/A   | Weight   | Residual | Acc. (%) |
|-------|----------|----------|----------|
| 32/32 | FP       | FP       | 92.1     |
| 1/1   | Vanilla  | Identity | 84.13    |
| 1/1   | Balanced | Identity | 84.71    |
| 1/1   | Vanilla  | Gated    | 84.89    |
| 1/1   | Balanced | Gated    | 85.34    |
Table 2: Comparison of binarization methods, VGG-Small on CIFAR-10.

| Method     | Bit-Width (W/A) | Acc. (%) |
|------------|-----------------|----------|
| FP         | 32/32           | 93.2     |
| QNN        | 1/1             | 89.9     |
| XNOR-Net   | 1/1             | 89.8     |
| DoReFa-Net | 1/1             | 90.2     |
| BBG        | 1/1             | 91.8     |
Ablation Study
We now study how our balanced weight quantization and gated residual module affect the network's performance. In Table 1, we report the results of ResNet-20 on CIFAR-10 with and without balanced quantization or the gated residual. Comparing the first two binary rows, the network with balanced quantization obtains 0.6% higher accuracy than the one without it. From the whole table, we can observe that either the balanced weights or the gated residual alone brings an accuracy improvement, and together they work even better, with a 1.2% improvement. This shows that our proposed method faithfully helps pursue a highly accurate binary network.
Table 3: Comparison of different modules and kernel stages, ResNet-20 on CIFAR-10/100.

| Method        | Kernel Stage | CIFAR-10 | CIFAR-100 |
|---------------|--------------|----------|-----------|
| FP            | 16-32-64     | 92.1     | 68.1      |
| Vanilla       | 16-32-64     | 84.71    | 53.37     |
| Vanilla Gated | 16-32-64     | 84.96    | 55.24     |
| Bireal        | 16-32-64     | 85.54    | 55.07     |
| Gated         | 16-32-64     | 85.34    | 55.62     |
| Vanilla       | 32-64-128    | 90.22    | 65.06     |
| Vanilla Gated | 32-64-128    | 90.71    | 66.15     |
| Bireal        | 32-64-128    | 90.27    | 65.6      |
| Gated         | 32-64-128    | 90.68    | 66.47     |
| Vanilla       | 48-96-192    | 92.01    | 68.66     |
| Vanilla Gated | 48-96-192    | 92.31    | 69.11     |
| Bireal        | 48-96-192    | 91.78    | 68.5      |
| Gated         | 48-96-192    | 92.46    | 69.38     |
Comparison with the StateoftheArt
CIFAR-10/100 Datasets.
For VGG-Small on the CIFAR-10 dataset, we compare our quantization method with BNN [6], XNOR-Net [22] and DoReFa-Net [36]; we do not add the gated residual here. As shown in Table 2, our proposed binarization method consistently obtains higher accuracy than the other three methods, with over 2% improvement, and it incurs less than a 2% accuracy drop compared with the full-precision network.
For ResNet-20 on the CIFAR-10/100 datasets, we further compare the four kinds of modules in Figure 2 that can be used to construct the ResNet structure. In this experiment, the binarization methods for weights and activations are fixed to our proposed ones; we only examine the influence of the different construction modules. Vanilla is the basic block originally used in full-precision ResNet networks. The kernel stage of standard ResNet-20 is 16-32-64, where the three numbers denote the number of channels in each stage. Inspired by the recent EfficientNet [28], we employ network width expansion to pursue higher accuracy of binary networks, expanding all channel counts beyond those of the standard ResNet-20.
The results of the different modules with different kernel stages are shown in Table 3. On CIFAR-10, the accuracy improvement brought by the gated modules is less than 1%, while on the more challenging CIFAR-100 it adds up to over 2%. Across the experiments, Vanilla Gated and Gated show superiority over the Bireal and Vanilla modules. Especially as the network grows wider, the Bireal module even performs worse than Vanilla, while Vanilla Gated and Gated consistently perform better. With the kernel stage of 48-96-192, our binary network matches the accuracy of the full-precision network on both CIFAR-10 and CIFAR-100. On CIFAR-100, as the kernel stage increases, Gated consistently performs best among the four modules. Therefore, we use the Gated module in the following experiments on the ImageNet and Pascal VOC datasets.
ResNet-18 on ImageNet Dataset.
Table 4: Comparison with state-of-the-art binarization methods, ResNet-18 on ImageNet.

| Method      | FLOPs (M) | Memory (Mbit) | Top-1 | Top-5 |
|-------------|-----------|---------------|-------|-------|
| FP          | 1820      | 374           | 69.3  | 89.2  |
| BNN         | 149       | 278           | 42.2  | 67.1  |
| ABC-Net     | 149       | 278           | 42.7  | 67.6  |
| XNOR-Net    | 169       | 332           | 51.2  | 73.2  |
| DoReFa-Net  | 149       | 278           | 52.5  | 76.9  |
| Bi-Real Net | 169       | 332           | 56.4  | 79.5  |
| PCNN        | 169       | 332           | 57.3  | 80.0  |
| ResNetE     | 169       | 332           | 58.1  | 80.6  |
| BBG         | 169       | 332           | 59.4  | 81.3  |
The experimental results for ResNet-18 are shown in Table 4. We compare our method with many recent binarization methods, including BNN [6], ABC-Net [18], XNOR-Net [22], DoReFa-Net [36] and PCNN [9], and the results further reveal the stability of our proposed BBG-Net on larger datasets. Compared with Bi-Real Net, we still obtain a 2.6% improvement, and in terms of Top-1 accuracy our method significantly outperforms all the other methods, with a 1.3% gain over the state-of-the-art ResNetE [3]. In comparison with the full-precision baseline, the Top-1 accuracy of our method is about 10% lower, and the Top-5 accuracy declines by less than 8%. Note that BNN [6], ABC-Net and DoReFa-Net binarize the downsample layers as well as all convolutional layers except the first and last, whereas the remaining methods, including ours, do not binarize the downsample layers. In the FLOPs and memory comparison, our method is only slightly more expensive than DoReFa-Net, while our accuracy is higher by about 7%.
We also improve the accuracy of binary networks by exploring network width and resolution in a simple but effective way. In the width experiments, we expand all channels of the original ResNet-18 by fixed multipliers. As shown in Figure 4, compared with the full-precision network, our widened models achieve almost the same accuracy while requiring only a third of the FLOPs. We also compare with ABC-Net, which uses multiple binary bases to represent weights and activations, e.g. 5 binary bases for the weights and 3 for the activations; our models consistently outperform ABC-Net by a large margin. In the resolution exploration, we employ a simple strategy: we remove the max pooling layer after the first convolution and reduce the feature map height and width to 1 with global average pooling before the fully-connected layer, which enlarges the feature maps of all hidden layers compared with the original ones. Compared to CircConv [19], which employs four times the binary weights, we obtain better results with a 1.6% accuracy improvement at slightly higher FLOPs but the same memory. Our higher-resolution model's accuracy is 2.3% higher than that of DenseNet28 [3] at only a fraction of its FLOPs. BENN [37] ensembles 6 standard binary ResNet-18 models, which takes nearly three times the FLOPs of our higher-resolution model while its accuracy is 2% lower.
ResNet-34 on ImageNet Dataset.
Table 5: Comparison with state-of-the-art binarization methods, ResNet-34 on ImageNet.

| Method      | FLOPs (M) | Memory (Mbit) | Top-1 | Top-5 |
|-------------|-----------|---------------|-------|-------|
| FP          | 3673      | 697           | 73.3  | 91.3  |
| TBN         | 235       | 433           | 58.2  | 81.0  |
| Bi-Real Net | 199       | 433           | 62.2  | 83.9  |
| DenseNet37  | 294       | 420           | 62.5  | 83.9  |
| BBG         | 199       | 433           | 62.6  | 84.1  |
We also conduct experiments on ResNet-34 to verify the advantages of our proposed method as the network grows deeper. Our method achieves much higher accuracy than TBN, with a 4.4% gain; note that TBN employs binary weights and ternary activations, whereas we adopt binary weights and binary activations, so in computing FLOPs a layer with ternary activations and binary weights costs more than a fully binary layer. We also perform better than Bi-Real Net (0.4% improvement) and slightly better than DenseNet37 [3]. Note that DenseNet37 consumes more running memory than ResNet-34 due to its dense connections and requires more FLOPs than our model, which makes it harder to speed up on actual hardware and to deploy on edge devices. These results demonstrate the effectiveness of our method in deeper networks.
Table 6: Detection results of binarized SSD on Pascal VOC.

| Method | Backbone   | Resolution | FLOPs (M) | mAP (%) |
|--------|------------|------------|-----------|---------|
| FP     | VGG16      | 300        | 29986     | 74.3    |
| FP     | ResNet-34  | 300        | 6850      | 75.5    |
| XNOR   | ResNet-34  | 300        | 362       | 55.1    |
| TBN    | ResNet-34  | 300        | 464       | 59.5    |
| BDN    | DenseNet37 | 512        | 1530      | 66.4    |
| BDN    | DenseNet45 | 512        | 1960      | 68.2    |
| BBG    | ResNet-34  | 300        | 362       | 62.8    |
| BBG    | VGG16      | 300        | 1062      | 68.5    |
SSD on Pascal VOC Dataset.
For the object detection task, we compare our method with XNOR-Net [22], TBN [29] and BDN [3], the latter including DenseNet37 and DenseNet45 backbones. With ResNet-34 as the backbone, we outperform XNOR and TBN by 6.9% and 2.5%, respectively, which shows that our 1-bit solution preserves stronger feature representations and maintains better generalization than ternary neural networks. As for VGG16, we need only half the FLOPs of binary DenseNet45 to achieve slightly higher results, and our model outperforms DenseNet37 by 2% with one third fewer FLOPs. The accuracy of our VGG16 model is only 5.8% lower than its full-precision counterpart. This significant performance gain further demonstrates that our method better preserves the information propagated through the network and helps extract the most discriminative features for the detection task.
Table 7: Inference time on a Raspberry Pi 3B.

| Model     | Full-Precision | DoReFa | BBG |
|-----------|----------------|--------|-----|
| time (ms) | 1457           | 249    | 251 |
Deploying Efficiency
Finally, we implement our method on a mobile device using the daBNN framework [34]. The device is a Raspberry Pi 3B with a 1.2 GHz 64-bit quad-core ARM Cortex-A53. As shown in Table 7, we also implement DoReFa, which binarizes the downsample layers while we do not; the resulting difference in inference time is negligible. Our proposed method runs much faster than the full-precision counterpart, and this acceleration ratio can be further improved by better engineering or customized hardware such as FPGAs.
Conclusion
Binarization methods for building portable neural networks are urgently required so that networks with massive parameters and complex architectures can be deployed efficiently. In this work, we proposed a novel Balanced Binary Neural Network with Gated Residual, namely BBG-Net. The new framework introduces a linear balanced transformation that maximizes the information entropy of the weights in deep neural networks and can be easily implemented in any deep architecture. In addition, our gated residual reconstructs the information flow and recalibrates the feature maps from the binary layers. Experiments on benchmark datasets and architectures demonstrate that the proposed balanced weight quantization and gated residual learn binary neural networks with higher performance but lower memory and computation consumption than state-of-the-art methods.
References
 [1] (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint.
 [2] (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, pp. 2704–2713.
 [3] (2019) Back to simplicity: how to train accurate BNNs from scratch? arXiv preprint arXiv:1906.08637.
 [4] (2017) Deep learning with low precision by half-wave Gaussian quantization. In CVPR, pp. 5918–5926.
 [5] (2015) Compressing neural networks with the hashing trick. In ICML, pp. 2285–2294.
 [6] (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
 [7] (2007) The PASCAL visual object classes challenge 2007 (VOC2007) results.
 [8] (2012) The PASCAL visual object classes challenge 2012 (VOC2012) results.
 [9] (2019) Projection convolutional neural networks for 1-bit CNNs via discrete back propagation. In AAAI, Vol. 33, pp. 8344–8351.
 [10] (2019) Attribute aware pooling for pedestrian attribute recognition. arXiv preprint arXiv:1907.11837.
 [11] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
 [12] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
 [13] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
 [14] (2018) Loss-aware weight quantization of deep networks. In ICLR.
 [15] (2018) Squeeze-and-excitation networks. In CVPR, pp. 7132–7141.
 [16] (2016) Binarized neural networks. In NIPS, pp. 4107–4115.
 [17] (2009) Learning multiple layers of features from tiny images. Technical report.
 [18] (2017) Towards accurate binary convolutional neural network. In NIPS, pp. 345–353.
 [19] (2019) Circulant binary convolutional networks: enhancing the performance of 1-bit DCNNs with circulant back propagation. In CVPR, pp. 2691–2699.
 [20] (2016) SSD: single shot multibox detector. In ECCV, pp. 21–37.
 [21] (2018) Bi-Real Net: enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In ECCV.
 [22] (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, pp. 525–542.
 [23] (2015) Faster R-CNN: towards real-time object detection with region proposal networks.
 [24] (2015) ImageNet large scale visual recognition challenge. IJCV, pp. 211–252.
 [25] (2019) Searching for accurate binary neural architectures. arXiv preprint arXiv:1909.07378.
 [26] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
 [27] (2015) Highway networks. arXiv preprint arXiv:1505.00387.
 [28] (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
 [29] (2018) TBN: convolutional neural network with ternary inputs and binary weights. In ECCV, pp. 315–332.
 [30] (2017) Towards evolutionary compression. arXiv preprint arXiv:1707.08005.
 [31] (2017) On compressing deep models by low rank and sparse decomposition. pp. 67–76.
 [32] (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928.
 [33] (2018) LQ-Nets: learned quantization for highly accurate and compact deep neural networks. In ECCV, pp. 365–382.
 [34] (2019) daBNN: a super fast inference framework for binary neural networks on ARM devices. arXiv preprint arXiv:1908.05858.
 [35] (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In CVPR, pp. 6848–6856.
 [36] (2016) DoReFa-Net: training low bit-width convolutional neural networks with low bit-width gradients. arXiv preprint arXiv:1606.06160.
 [37] (2019) Binary ensemble neural network: more bits per network or more networks per bit? In CVPR, pp. 4923–4932.