Balanced Binary Neural Networks with Gated Residual

09/26/2019 ∙ by Mingzhu Shen, et al. ∙ Beihang University ∙ The University of Sydney ∙ State Key Laboratory of Computer Science ∙ Peking University

Binary neural networks have attracted considerable attention in recent years. However, how to preserve the accuracy of binarized networks remains a critical issue, mainly due to the information loss stemming from biased binarization. In this paper, we attempt to maintain the information propagated in the forward process and propose Balanced Binary Neural Networks with Gated Residual (BBG for short). First, a balanced weight binarization is introduced to maximize the information entropy of the binary weights, so that the informative binary weights can capture more of the information contained in the activations. Second, for binary activations, a gated residual is further appended to compensate for their information loss during the forward process, with only a slight overhead. Both techniques can be wrapped as a generic network module that supports various network architectures for different tasks, including classification and detection. We evaluate our BBG on image classification tasks over CIFAR-10/100 and ImageNet and on the detection task over Pascal VOC. The experimental results show that BBG-Net performs remarkably well across various network architectures such as VGG, ResNet and SSD, with superior performance over state-of-the-art methods in terms of memory consumption, inference speed and accuracy.

Introduction

Deep neural networks (DNNs), especially deep convolutional neural networks (CNNs), have been well demonstrated in a wide variety of computer vision applications such as image classification [12], object detection [23] and re-identification [10]. Traditional CNNs usually have massive parameters and high computational complexity for the sake of accuracy. Consequently, most of these advanced deep CNN models require expensive storage and computing resources, and cannot be easily deployed on portable devices such as mobile phones, cameras, etc.

In recent years, a number of approaches have been proposed to learn portable deep neural networks, including multi-bit quantization [2, 4, 14], pruning [11, 30], low-rank decomposition [31], hashing [5], lightweight architecture design [35], and knowledge distillation [13, 32]. Among them, quantization based methods [2, 14] represent the weights and activations in the network with very low precision, and thus can yield much more compact DNN models than their floating-point counterparts. In the extreme case where both the weights and activations are quantized (i.e. binarized) into one-bit values, the conventional convolution operations can be efficiently implemented via bitwise operations [22, 16]. The resulting savings in storage and acceleration of inference are therefore appealing to the community. Since the introduction of binary neural networks [16], many works have addressed the performance drop and improved the expression ability of binarized networks, e.g. Bireal-Net [21], or by using more bits, as in DoReFa-Net [36] and LQ-Net [33].

Figure 1: The diagram of the proposed method for learning binary neural networks by exploiting balanced weight binarization and a gated residual to compensate for information loss.

Although much progress has been made on binarizing DNNs, the existing quantization methods still cause a significantly large accuracy drop compared with the full-precision models. This observation indicates that the information propagated through the full-precision network is largely lost when using the binary representation. On the one hand, the weights act as filters that extract features layer by layer and are expected to be diverse for powerful feature representation. A biased weight binarization such as the plain sign function usually ignores the variations among the floating-point weights, and thus leads to severe information loss. For instance, if all the binary weights degenerate into 1 or -1, the convolutional or fully-connected layers degenerate into an identity map, which inevitably reduces the representation learning capability of the binary network and makes it hard to train in practice. On the other hand, the activations convey the information used for network prediction, and binarizing them into 1-bit causes large deviations from the original full-precision feature maps, greatly increasing the information loss during the layer-wise propagation.

To address this problem and maximally preserve the information, in this paper we propose the Balanced Binary Network with Gated Residual (BBG for short), which learns balanced binary weights by maximizing their information entropy, and reduces the binarization loss of activations with a gated residual path. Figure 1 shows the whole structure of our BBG. In BBG, we find that by simply removing the mean of the floating-point weights, our method can roughly eliminate the redundancy among binary weights and reach the maximum entropy. To accomplish this goal in a deep network, we re-parameterize the weights and devise a linear transformation to replace the standard one, which can be easily implemented and learned. To compensate for the information loss when binarizing activations, the gated residual path employs a lightweight operation that uses the channel attention information of the floating-point activations to reconstruct the information flow across layers. Our approach is fully compatible with bitwise operations, offering the benefit of fast inference for a wide spectrum of deep network architectures. We evaluate our BBG method on image classification and object detection benchmarks, i.e. the CIFAR-10, CIFAR-100, ImageNet and Pascal VOC datasets, and the experimental results show that it performs remarkably well across various network architectures, outperforming state-of-the-art methods in terms of memory consumption, inference speed and accuracy.

Related Work

To reduce the space and computational complexities of deep neural networks simultaneously, a number of quantization and binarization methods have been proposed for processing both weights and activations in neural networks.

BinaryNet [16] first proposes binary neural networks, for which the performance drop is not obvious for shallow networks on small datasets. XNOR-Net [22] adopts scale factors in each layer and solves for them by minimizing the quantization error of the output matrix. TBN [29] enhances the expression ability of neural networks with ternary activations. ABC-Net [18] proposes to use more binary weight and activation bases to increase accuracy, although the compression and speed-up ratios are reduced accordingly. LQ-Net [33] further proposes a low bit-width quantization function learned by minimizing the quantization error, which achieves comparable results on the ImageNet benchmark at the cost of increased memory overhead.

In contrast to conventional model compression methods with sophisticated pipelines, such as pruning [11, 30] and matrix decomposition [31], neural network binarization can be efficiently implemented through logic and counting operations. However, the performance of existing neural network binarization methods is still lower than that of the original baseline models due to inaccurate quantization procedures, which decrease the network expression ability and make the network hard to train. Therefore, Bireal-Net [21] proposes additional shortcuts and a different approximation of the sign function in the backward pass. PCNN [9] employs a projection matrix to help with the network training. CircConv [19] rotates the binary weights three times, computes feature maps with the four binary weights and merges them together. BENN [37] ensembles multiple standard binary networks to improve performance. AutoBNN [25] employs a genetic algorithm to search for binary neural network architectures. Most of these works have complicated training pipelines, and some of them increase the FLOPs to achieve better results.

The Proposed Method

In this section, we first give the preliminaries about binary neural networks, and then present the technical details of our BBG method with respect to the weights and activations.

Binary Neural Networks

Consider $\mathbf{w} \in \mathbb{R}^{n}$ and $\mathbf{a} \in \mathbb{R}^{n}$ as the floating-point weights and activations of a fully-connected layer in the neural network, respectively, where $n$ is the dimension of the weights and activations. Quantizing the weights and activations is beneficial for reducing the memory cost and accelerating the computation in the inference phase. Denote $\mathbf{b}_w$ and $\mathbf{b}_a$ as the binary bases for $\mathbf{w}$ and $\mathbf{a}$, respectively. The output feature can be obtained by

$z = \mathbf{b}_w^{\top} \mathbf{b}_a$ (1)

where the inner product of the vectors can be calculated efficiently with the bit-wise operations AND and Bitcount. To deal with the zero-gradient problem of the sign function in the backward propagation, we adopt the straight-through estimator (STE) [1] to estimate the gradient. A sign function that ignores the distribution of the floating-point weights generates biased binary weights and makes the binary network hard to train.
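
To make the bit-wise evaluation of Eq. (1) concrete, the following minimal Python sketch (our own illustration, not the authors' implementation) packs a {-1,+1} weight vector and a {0,1} activation vector into integers and computes their inner product using only AND and bit counting; the bit encoding (bit = 1 for w = +1 or a = 1) is an assumption made for the example.

```python
def binary_dot(w_bits: int, a_bits: int) -> int:
    """Inner product of a {-1,+1} weight vector and a {0,1} activation vector,
    both packed into integers (bit = 1 encodes w = +1 / a = 1), via AND + bit-count."""
    pos = bin(w_bits & a_bits).count("1")   # positions where w = +1 and a = 1
    act = bin(a_bits).count("1")            # positions where a = 1
    return 2 * pos - act                    # (+1)*pos + (-1)*(act - pos)

# Example: w = [+1, -1, +1, +1] -> 0b1011 and a = [1, 1, 0, 1] -> 0b1101
# (same bit order for both); binary_dot(0b1011, 0b1101) == 1.
```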

Maximizing Entropy with Balanced Binary Weights

For training a deep neural network on a dataset with $N$ samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, the goal is to minimize the loss between the true labels $y_i$ and the predicted labels $\hat{y}_i$, obtaining parameters $\mathbf{W}$ that fit the training data and generalize well to test data:

$\min_{\mathbf{W}} \; \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}\big(y_i, \hat{y}_i\big)$ (2)

In binary neural networks, the most important training parameters are the non-differentiable, discrete binary weights of the fully-connected and convolutional layers. In quantized convolutional layers, how the binary weights are generated is crucial to the network's performance, and their discrete nature makes the network troublesome to train.

To this end, we aim to preserve the information of the weights in the binarization process with differentiable and time-saving methods. The value of each binarized weight $b$ obeys a Bernoulli distribution: it takes the value $+1$ with probability $p$ and the value $-1$ with probability $1-p$, where $p \in [0, 1]$. For each kernel in the neural network, the information contained in its binary weights can be measured by the entropy:

$H(b) = -p\log p - (1-p)\log(1-p)$ (3)

To preserve the information of the binary weights, the training of binary neural networks should thus maximize the entropy of the binary weights. Directly optimizing the above regularization is hard because the probability $p$ is difficult to estimate. Instead, we approximate the optimal solution by making the expectation of the binary weights zero, i.e. $\mathbb{E}[b] = 0$. Therefore, the binarized network should be optimized by jointly minimizing the task-related objective function, e.g. the cross-entropy function for classification, and the information loss incurred in each layer (i.e. maximizing the entropy):

$\min_{\mathbf{W}} \; \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}\big(y_i, \hat{y}_i\big) - \lambda \sum_{l} H\big(b^{(l)}\big)$ (4)
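
As a brief clarifying note (a standard property of the Bernoulli entropy, added here for completeness), the zero-mean surrogate is exactly the maximizer of Eq. (3):

$\dfrac{\partial H}{\partial p} = \log\dfrac{1-p}{p} = 0 \;\Rightarrow\; p = \tfrac{1}{2}$, and $\mathbb{E}[b] = (+1)\,p + (-1)(1-p) = 2p - 1 = 0 \;\Leftrightarrow\; p = \tfrac{1}{2}$,

so driving each kernel's binary weights towards zero mean maximizes their entropy.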
1: Require: the input activation $\mathbf{a}$, the pre-activation $z$, and the parameters to be learned: $\tilde{\mathbf{w}}$.
2: Feed-Forward
3:  Compute the binary input data (Eqs. 7-8):
4:   $\mathbf{b}_a = \mathrm{round}\big(\mathrm{clip}(\mathbf{a}, 0, 1)\big)$;
5:  Compute the balanced binary weight (Eqs. 5-6):
6:   $\hat{\mathbf{w}} = \tilde{\mathbf{w}} - \mathbb{E}(\tilde{\mathbf{w}})$;
7:   $\mathbf{b}_w = \mathbb{E}(|\hat{\mathbf{w}}|)\cdot \mathrm{sign}(\hat{\mathbf{w}})$;
8:  Calculate the output neuron: $z = \mathbf{b}_w^{\top}\mathbf{b}_a$.
9: Back-Propagation
10:  Calculate the gradients w.r.t. $\mathbf{a}$ and $\tilde{\mathbf{w}}$ with the STE:
11:   $\frac{\partial \mathcal{L}}{\partial \mathbf{a}} = \frac{\partial \mathcal{L}}{\partial \mathbf{b}_a}\cdot \mathbf{1}_{\mathbf{a}\in[0,1]}$;
12:   $\frac{\partial \mathcal{L}}{\partial \hat{\mathbf{w}}} = \frac{\partial \mathcal{L}}{\partial \mathbf{b}_w}$;
13:   $\frac{\partial \mathcal{L}}{\partial \tilde{\mathbf{w}}} = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{w}}}\cdot\frac{\partial \hat{\mathbf{w}}}{\partial \tilde{\mathbf{w}}}$.
14: Parameters Update
15:  Update $\tilde{\mathbf{w}}$: $\tilde{\mathbf{w}} \leftarrow \tilde{\mathbf{w}} - \eta\,\frac{\partial \mathcal{L}}{\partial \tilde{\mathbf{w}}}$, where $\eta$ is the learning rate.
Algorithm 1: Feed-Forward and Back-Propagation procedures for learning each filter in the proposed BBG-Nets.

Hence, we first center the real-valued weights and then quantize them into binary codes. More specifically, we use a proxy parameter $\tilde{\mathbf{w}}$ to obtain the balanced weight $\hat{\mathbf{w}}$, which is then quantized into the binary codes $\mathbf{b}_w$. In the convolutional layer, we express the weight in terms of the proxy parameter using:

$\hat{\mathbf{w}} = \tilde{\mathbf{w}} - \mathbb{E}(\tilde{\mathbf{w}})$ (5)

After this balanced weight normalization, we can perform binary quantization on the floating-point weight $\hat{\mathbf{w}}$. The forward and backward passes for the binary weights are as follows:

Forward: $\mathbf{b}_w = \mathbb{E}(|\hat{\mathbf{w}}|)\cdot \mathrm{sign}(\hat{\mathbf{w}})$ (6)
Backward: $\dfrac{\partial \mathcal{L}}{\partial \hat{\mathbf{w}}} = \dfrac{\partial \mathcal{L}}{\partial \mathbf{b}_w}$

where $\mathrm{sign}(\cdot)$ is the sign function that outputs $+1$ for positive numbers and $-1$ otherwise, and $\mathbb{E}(|\cdot|)$ calculates the mean of the absolute values.

In our balanced weight quantization, the parameter update is performed on the proxy parameters $\tilde{\mathbf{w}}$, and thus the gradient signal can back-propagate through the normalization process. In the whole process, $\tilde{\mathbf{w}}$ and $\hat{\mathbf{w}}$ are floating-point; we update $\tilde{\mathbf{w}}$ during training, and only $\mathbf{b}_w$ is needed at inference. Since the balanced normalization from $\tilde{\mathbf{w}}$ to $\hat{\mathbf{w}}$ is differentiable, we can update the proxy parameter easily. SGD or alternative techniques such as ADAM can be directly applied to the optimization with respect to the proxy parameter $\tilde{\mathbf{w}}$.
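
As a concrete illustration, the following PyTorch-style sketch gives one possible reading of Eqs. (5)-(6): a proxy weight is centered, binarized with a sign function scaled by the mean absolute value, and trained with a straight-through estimator. The class names and the use of a single scalar scale are our own assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BalancedSignSTE(torch.autograd.Function):
    """Eq. (6): forward b_w = E(|w_hat|) * sign(w_hat); the backward pass sends
    the gradient straight through to the centered weight w_hat (STE)."""

    @staticmethod
    def forward(ctx, w_hat):
        scale = w_hat.abs().mean()          # a per-filter scale is an equally plausible reading
        return scale * torch.sign(w_hat)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                  # straight-through estimator

class BalancedBinaryConv2d(nn.Module):
    """Binary convolution whose proxy weights are centered (Eq. 5) before binarization."""

    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        # proxy floating-point parameter, updated by SGD/Adam during training
        self.w_tilde = nn.Parameter(
            torch.randn(out_ch, in_ch, kernel_size, kernel_size) * 0.01)
        self.stride, self.padding = stride, padding

    def forward(self, x):
        w_hat = self.w_tilde - self.w_tilde.mean()   # Eq. (5): remove the mean
        b_w = BalancedSignSTE.apply(w_hat)           # Eq. (6): scaled sign with STE
        return F.conv2d(x, b_w, stride=self.stride, padding=self.padding)
```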

Reconstructing Information Flow with Gated Residual

For the binarization of activations, we first clip the value range of the activations to $[0, 1]$ and then use a round function to binarize them, which amounts to:

$\mathbf{b}_a = \mathrm{round}\big(\mathrm{clip}(\mathbf{a}, 0, 1)\big)$ (7)

Therefore, the forward and backward passes for the binary activations are as follows:

Forward: $\mathbf{b}_a = \mathrm{round}\big(\mathrm{clip}(\mathbf{a}, 0, 1)\big)$ (8)
Backward: $\dfrac{\partial \mathcal{L}}{\partial \mathbf{a}} = \dfrac{\partial \mathcal{L}}{\partial \mathbf{b}_a}\cdot \mathbf{1}_{\mathbf{a}\in[0,1]}$

where $\mathbf{1}_{\mathbf{a}\in[0,1]}$ equals 1 for elements of $\mathbf{a}$ in the range $[0,1]$ and 0 otherwise. The whole binarization process of a fully-connected layer is summarized in Algorithm 1.
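
Under the same reading, Eqs. (7)-(8) can be implemented with a second straight-through function; again this is a sketch of our interpretation rather than the authors' code.

```python
import torch

class BinaryActivationSTE(torch.autograd.Function):
    """Eqs. (7)-(8): forward b_a = round(clip(a, 0, 1)); the backward pass keeps
    the gradient only where the input lies inside the clipping range [0, 1]."""

    @staticmethod
    def forward(ctx, a):
        ctx.save_for_backward(a)
        return torch.clamp(a, 0.0, 1.0).round()

    @staticmethod
    def backward(ctx, grad_output):
        (a,) = ctx.saved_tensors
        inside = ((a >= 0.0) & (a <= 1.0)).to(grad_output.dtype)  # indicator 1_{a in [0,1]}
        return grad_output * inside
```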

Binarizing activations causes a much larger loss of precision than binarizing weights. Moreover, the activation values of different channels are usually all quantized to 0 or 1 without considering the differences among channels, and the quantization error introduced by the binary layers accumulates layer by layer. To address this problem, we further propose a new module named the gated residual to reconstruct the information flow in a channel-wise manner during the forward propagation. Our gated residual employs the floating-point activations to reduce the quantization error and recalibrate the features.

Figure 2: The four different kinds of modules utilized to construct the ResNet networks.

In our gated residual module, we propose a newly designed layer named the gated layer, with gate weights $\boldsymbol{\gamma} \in \mathbb{R}^{C}$ that learn the channel attention information of the floating-point input feature map $\mathbf{x} \in \mathbb{R}^{C\times H\times W}$ of a binary convolution layer ($C$, $H$ and $W$ denote channels, height and width, respectively). The operation on the $c$-th channel of the input feature map is defined as follows:

$\mathrm{Gated}(\mathbf{x})_c = \gamma_c \cdot \mathbf{x}_c$ (9)

Based on the gated residual, the output feature map can be recalibrated, enhancing the representation power of the activations. The operation of the gated module can be written in the following form:

$\mathbf{y} = \mathcal{F}(\mathbf{x}) + \boldsymbol{\gamma} \odot \mathbf{x}$ (10)

where $\mathcal{F}(\cdot)$ denotes the operation in the main path, consisting of activation binarization, balanced binary convolution and BatchNorm, and $\odot$ denotes channel-wise multiplication.

With the gated residual layer, the overall structure of the gated module is shown in Figure 2(d). As in Bireal-Net [21], we construct our balanced module with a gated residual per layer, i.e. we add a shortcut to each layer rather than one shortcut per basic block. We initialize the gate weights of the gated residual to 1, which is the same as the identity shortcut of the vanilla ResNet. In Figure 3, we visualize the weight values in a gated residual layer: they peak around 1 but vary over a large range, indicating different impacts on different channels. This shows that the gated residual can also help distinguish the channels and suppress the less useful ones.

Figure 3: Visualization of the gated weights in the last binarized convolution of ResNet-18.

Besides reconstructing the information flow in the forward process, this path also acts as an auxiliary gradient path for the activations in the backward propagation. The STE used in the backward pass to approximate the discrete binary function usually results in a severe gradient mismatch problem. Fortunately, with learnable gate weights, our gated residual can also reconstruct the gradient in the following way:

$\dfrac{\partial \mathcal{L}}{\partial \mathbf{x}_c} = \dfrac{\partial \mathcal{L}}{\partial \mathbf{y}_c}\left(\dfrac{\partial \mathcal{F}(\mathbf{x})_c}{\partial \mathbf{x}_c} + \gamma_c\right)$ (11)

Given the computational and memory constraints inherent to binary networks, the additional operations introduced by a new module need to be as small as possible. The structure of HighwayNet [27] requires full-precision gating weights as large as the weights of the convolutional layer, which is unacceptable. Compared to SENet [15], the SE module corrects the output after the convolution layer, but by then the information has already been corrupted by the binary convolution; we argue that it is better to make use of the unquantized information. Moreover, our additional FLOPs are still much smaller than those of the SE module, and when the number of channels increases and the reduction ratio decreases, the FLOPs required by the SE module grow much faster. In comparison with Bireal-Net, the number of channel-wise scalars in Bireal-Net is exactly the same as the number of weights in our gated layer. However, those scalars are computed as the mean of the absolute floating-point weights; they are not independent parameters and can be absorbed into the following BatchNorm layer, whereas our learnable gated residual layers help reconstruct the activation and gradient information and thereby enhance the representation ability of the network.
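
Putting the pieces together, a gated residual block following Eqs. (9)-(10) could look like the sketch below, reusing the two straight-through sketches given earlier; the module and parameter names are hypothetical, and the gate weights are initialized to 1 as described above.

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """y = F(x) + gamma * x (channel-wise, Eq. 10): a binary main path
    (binarized activation -> balanced binary conv -> BatchNorm) plus a
    per-channel learnable gate on the floating-point input."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = BalancedBinaryConv2d(channels, channels, kernel_size,
                                         padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(channels)
        # one gate weight per channel, initialized to 1 like an identity shortcut
        self.gamma = nn.Parameter(torch.ones(channels))

    def forward(self, x):
        b_a = BinaryActivationSTE.apply(x)        # Eqs. (7)-(8)
        main = self.bn(self.conv(b_a))            # balanced binary conv + BatchNorm
        gate = self.gamma.view(1, -1, 1, 1) * x   # Eq. (9): channel-wise gating of the FP input
        return main + gate                        # Eq. (10)
```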

Experiments

In this section, to verify the effectiveness of our proposed BBG-Net, we conduct experiments on both image classification and object detection tasks over several benchmark datasets, and compare BBG-Net with other state-of-the-art methods.

Datasets and Implementation Details

Datasets.

In our experiments, we adopt three common image classification datasets: CIFAR-10/100 [17] and ILSVRC-2012 ImageNet [24]. CIFAR-10/100 consists of 60,000 color images in 10/100 classes, split into 50,000 training and 10,000 validation images. ImageNet is a large-scale dataset with 1.2 million training images and 50k validation images belonging to 1,000 classes. For evaluation, we employ the widely used metrics for the classification task, Top-1 and Top-5 validation accuracy, and use the letters W and A to denote weights and activations, respectively. We also evaluate our proposed method on the object detection task with the standard detection benchmarks Pascal VOC 2007 [7] and VOC 2012 [8]. Following previous work, we train our models on the combination of the VOC2007 trainval and VOC2012 trainval sets and test on the VOC2007 test set. The standard mAP metric is used for comparison.

Network Architectures and Setup.

We conduct our experiments on the popular and powerful network architectures VGG-Small [4], ResNet [12] and the Single Shot Detector (SSD) [20]. For fair comparison with existing methods, we respectively choose VGG-Small, ResNet-20, ResNet-18 and ResNet-34 as the baseline models on the CIFAR-10/100 and ImageNet datasets, and we evaluate SSD with different backbones, i.e. VGG-16 [26] and ResNet-34, on the Pascal VOC datasets. As for hyper-parameters, we mostly follow the setups in the original papers, and all our models are trained from scratch.

On the CIFAR-10/100 datasets, padding shortcuts are used for downsampling instead of learnable convolutional layers in ResNet-20. We train for 400 epochs with a learning rate starting at 0.1 and decayed by 0.1 at epochs 200, 300 and 375. For VGG-Small, we train for 400 epochs with a learning rate starting at 0.02 and decayed by 0.1 at epochs 80, 160 and 300. We apply SGD with momentum 0.9 as our optimization algorithm.

On the ImageNet dataset, we train for 120 epochs with a learning rate starting at 0.001 and reduced by 0.1 at epochs 70, 90 and 110, and we apply ADAM as our optimization algorithm. Note that, as in most binary neural networks, we do not quantize the first and last layers, and we also do not quantize the down-sample layers, as suggested by previous work [3]. The accuracy drop is far too large if these layers are quantized; moreover, most binary networks such as Bireal-Net [21] do not binarize these layers either.

On the Pascal VOC datasets, we binarize all convolutional layers in the backbone except the first convolutional layer and the down-sample layers in ResNet-34. In VGG-16, we binarize all convolutional layers in the backbone except the first and last layers, and we only add the gated residual to convolutional layers with the same number of input and output channels to avoid additional computation. BatchNorm layers are used in the backbone, and the learning rate starts at 0.05.

FLOPs and Memory Analysis.

Binary neural networks are made up of two kinds of layers, floating-point layers and binary layers. In the calculation of FLOPs, the FLOPs of binary layers are divided by 64, following Bireal-Net [21]; this corresponds to the achievable acceleration ratio of binary layers with the support of suitable hardware. Similarly, the number of weights in floating-point layers is multiplied by 32, as we need 32 bits to store a single floating-point weight but only 1 bit to store a binary weight. In the comparisons across different networks and in the network width and resolution experiments, we report FLOPs and memory to show our advantages over other state-of-the-art results.
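
To make this counting convention explicit, the small helper below (our own illustration with hypothetical function names, not part of the paper) applies the 1/64 FLOPs rule for binary layers and the 1-bit versus 32-bit rule for weight storage.

```python
def conv_flops(c_in, c_out, k, h_out, w_out, binary):
    """Multiply-accumulate count of a conv layer; binary layers are divided by 64,
    following the Bireal-Net convention adopted in this paper."""
    macs = c_in * c_out * k * k * h_out * w_out
    return macs / 64 if binary else macs

def conv_memory_bits(c_in, c_out, k, binary):
    """Weight storage in bits: 1 bit per binary weight, 32 bits per floating-point weight."""
    n_weights = c_in * c_out * k * k
    return n_weights * (1 if binary else 32)

# Example: a binarized 3x3, 64->64 conv with a 56x56 output map costs about
# 1.8 MFLOPs (vs. ~115.6 MFLOPs in full precision) and 36,864 bits of weight storage.
```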

W/A Weight Residual Acc.(%)
32/32 FP FP 92.1
1/1 Vanilla Identity 84.13
1/1 Balanced Identity 84.71
1/1 Vanilla Gated 84.89
1/1 Balanced Gated 85.34
Table 1: Performance comparison of our proposed methods with vanilla binary networks on ResNet-20 networks on CIFAR-10 validation set.
Method Bit-Width(W/A) Acc.(%)
FP 32/32 93.2
QNN 1/1 89.9
XNOR-Net 1/1 89.8
DoReFa-Net 1/1 90.2
BBG 1/1 91.8
Table 2: Performance comparison with binarization methods on VGG-Small models on CIFAR-10 validation set.

Ablation Study

We now study how our balanced weight quantization and gated residual module affect the network's performance. In Table 1, we report the results of ResNet-20 on CIFAR-10 with and without balanced quantization or the gated residual. Comparing the first two binary rows, the network with balanced quantization obtains 0.6% higher accuracy than that without it. From the whole table, we can observe that both balanced weights and the gated residual bring accuracy improvements, and together they work even better, with a 1.2% improvement. This shows that our proposed method indeed helps pursue a highly accurate binary network.

Method Kernel Stage CIFAR-10 CIFAR-100
FP 16-32-64 92.1 68.1
Vanilla 16-32-64 84.71 53.37
Vanilla Gated 16-32-64 84.96 55.24
Bireal 16-32-64 85.54 55.07
Gated 16-32-64 85.34 55.62
Vanilla 32-64-128 90.22 65.06
Vanilla Gated 32-64-128 90.71 66.15
Bireal 32-64-128 90.27 65.6
Gated 32-64-128 90.68 66.47
Vanilla 48-96-192 92.01 68.66
Vanilla Gated 48-96-192 92.31 69.11
Bireal 48-96-192 91.78 68.5
Gated 48-96-192 92.46 69.38
Table 3: Performance comparison of 4 different modules on ResNet-20 models on CIFAR-10/100 validation set.

Comparison with the State-of-the-Art

CIFAR-10/100 Dataset.

For VGG-Small on the CIFAR-10 dataset, we compare our quantization method with BNN [6], XNOR-Net [22] and DoReFa-Net [36]; we do not add the gated residual here. As shown in Table 2, our proposed binarization method consistently obtains higher accuracy than the other three methods, with about 2% improvement, and it brings less than a 2% accuracy drop compared with the full-precision network.

For ResNet-20 on the CIFAR-10/100 datasets, we further compare the four different kinds of modules in Figure 2 that can be used to construct the ResNet structure. In this experiment, the binarization methods for weights and activations are fixed to our proposed ones, so we only examine the influence of the different construction modules. Vanilla is the basic block originally used in full-precision ResNet networks. The kernel stage of the standard ResNet-20 is 16-32-64, where the three numbers denote the number of channels in each stage. Inspired by the recent EfficientNet [28], we employ network width expansion to pursue higher accuracy of binary networks, expanding all channel numbers to 2x or 3x those of the standard ResNet-20.

The results of the different modules with different kernel stages are shown in Table 3. On CIFAR-10, the accuracy improvement brought by the gated modules is less than 1%, while on the more challenging CIFAR-100 it adds up to over 2%. Across the experiments, we can conclude that Vanilla Gated and Gated show superiority over the Bireal and Vanilla modules. Especially when the network grows wider, the Bireal module even performs worse than Vanilla, while Vanilla Gated and Gated consistently perform better. With the kernel stage of 48-96-192, our binary network matches the accuracy of the full-precision network on both CIFAR-10 and CIFAR-100. On CIFAR-100, as the kernel stage increases, Gated consistently performs the best among the four modules. We therefore use the Gated module in the following experiments on the ImageNet and Pascal VOC datasets.

ResNet-18 on ImageNet Dataset.

Method FLOPs(M) Memory(Mbit) Top-1 Top-5
FP 1820 374 69.3 89.2
BNN 149 278 42.2 67.1
ABC-Net 149 278 42.7 67.6
XNOR-Net 169 332 51.2 73.2
DoReFa-Net 149 278 52.5 76.9
Bireal-Net 169 332 56.4 79.5
PCNN 169 332 57.3 80
ResNetE 169 332 58.1 80.6
BBG 169 332 59.4 81.3
Table 4: Performance comparison of binarized ResNet-18 on ImageNet classification dataset.
Figure 4: Performance comparison with state-of-the-art methods and our network width and resolution exploration experiments.

The experimental results for ResNet-18 are shown in Table 4. We compare our method with many recent binarization methods, including BNN [6], ABC-Net [18], XNOR-Net [22], DoReFa-Net [36] and PCNN [9], and the results further reveal the stability of our proposed BBG-Net on larger datasets. Compared with Bi-Real Net, we still obtain a 2.6% improvement, and in terms of Top-1 accuracy our method significantly outperforms all the other methods, with a 1.3% gain over the state-of-the-art ResNetE [3]. In comparison with the full-precision baseline, the Top-1 accuracy of our method is reduced by about 10%, and the Top-5 accuracy declines by less than 8%. Note that BNN [6], ABC-Net and DoReFa-Net binarize the downsample layers as well as all convolutional layers except the first and last ones, whereas all the remaining methods, including ours, do not binarize the downsample layers. In the FLOPs and memory comparison, our method is only slightly more expensive than DoReFa-Net, while our accuracy is higher by about 7%.

We also improve the accuracy of binary networks by exploring network width and resolution in a simple but effective way. In the network width experiments, we expand all channels of the original ResNet-18 by fixed ratios. As shown in Figure 4, compared with the full-precision network, our widened models achieve almost the same accuracy while requiring only about a third of the FLOPs. We also compare with ABC-Net, which utilizes multiple binary bases to represent the weights and activations (e.g. 5 binary bases for the weights and 3 for the activations); our models consistently outperform the multi-base ABC-Net variants by a large margin. In the resolution exploration, we employ a simple strategy: we remove the max pooling layer after the first convolution, and reduce the feature map height and width to 1 with global average pooling before the fully-connected layer, which gives all hidden layers feature maps with twice the spatial resolution of the original ones. Compared to CircConv [19], which employs four times the binary weights, we obtain better results with a 1.6% accuracy improvement at slightly higher FLOPs but the same memory. Our higher-resolution model is 2.3% more accurate than DenseNet-28 [3] while requiring only a fraction of its FLOPs. BENN [37] ensembles 6 standard binary ResNet-18 models, which costs nearly three times the FLOPs of our higher-resolution model while its accuracy is 2% lower.

ResNet-34 on ImageNet Dataset.

Method FLOPs(M) Memory(Mbit) Top-1 Top-5
FP 3673 697 73.3 91.3
TBN 235 433 58.2 81.0
Bireal-Net 199 433 62.2 83.9
DenseNet-37 294 420 62.5 83.9
BBG 199 433 62.6 84.1
Table 5: Performance comparison of binarized ResNet-34 network on ImageNet classification dataset.

We also conduct experiments on ResNet-34 to verify the advantages of our proposed method when the network grows deeper. Our method achieves much higher accuracy than TBN, with a 4.4% gain, even though TBN employs binary weights and ternary activations while we use binary weights and binary activations; when computing FLOPs, a layer with ternary activations and binary weights costs more than a layer with binary activations and binary weights. We also perform better than Bireal-Net (0.4% improvement) and slightly better than DenseNet-37 [3]. Note that DenseNet-37 consumes more running memory than ResNet-34 due to its dense connections, and its FLOPs are about 1.5 times ours, which makes it more difficult to speed up on actual hardware and to deploy on edge devices. These results demonstrate the effectiveness of our proposed method in deeper networks.

Method Backbone Resolution FLOPs mAP(%)
FP VGG-16 300 29986 74.3
FP ResNet-34 300 6850 75.5
XNOR ResNet-34 300 362 55.1
TBN ResNet-34 300 464 59.5
BDN DenseNet-37 512 1530 66.4
BDN DenseNet-45 512 1960 68.2
BBG ResNet-34 300 362 62.8
BBG VGG-16 300 1062 68.5
Table 6: Performance of binarized SSD on Pascal VOC dataset.

SSD on Pascal VOC Dataset.

For the object detection task, we compare our method with XNOR-Net [22], TBN [29] and BDN [3], which includes DenseNet-37 and DenseNet-45. With ResNet-34 as the backbone, we outperform XNOR and TBN by 6.9% and 2.5%, respectively. This shows that our 1-bit solution preserves stronger feature representations and maintains better generalization ability than ternary neural networks. As for VGG-16, we only need about half the FLOPs of the binary DenseNet-45 to achieve slightly higher results, and our model outperforms DenseNet-37 by 2% with about one third fewer FLOPs. The accuracy of our VGG-16 model is only 5.8% below its full-precision counterpart. The significant performance gain further demonstrates that our method better preserves the information propagated through the network and helps extract the most discriminative features for the detection task.

Model Full-Precision DoReFa BBG
time(ms) 1457 249 251
Table 7: Comparison of inference time of full-precision ResNet-18 and binary ResNet-18.

Deploying Efficiency

Finally, we deploy our method on a mobile device using the daBNN framework [34]. The device is a Raspberry Pi 3B with a 1.2GHz 64-bit quad-core ARM Cortex-A53. As shown in Table 7, we implement DoReFa, which binarizes the downsample layers while we do not; the resulting difference in inference time is only 2 ms, which is negligible. Our proposed method runs about 5.8x faster than its full-precision counterpart, and this acceleration ratio can be further improved with better engineering or customized hardware such as FPGAs.

Conclusion

Binarization methods for establishing portable neural networks are urgently required so that networks with massive parameters and complex architectures can be deployed efficiently. In this work, we proposed a novel Balanced Binary Neural Network with Gated Residual, namely BBG-Net. The new framework introduces a linear balanced transformation for maximizing the information entropy of the weights, which can be easily implemented in any deep neural architecture. In addition, our gated residual compensates for the information loss and recalibrates the feature maps produced by the binary layers. Experiments on benchmark datasets and architectures demonstrate the effectiveness of the proposed balanced weight quantization and gated residual for learning binary neural networks with higher performance but lower memory and computation consumption than state-of-the-art methods.

References

  • [1] Y. Bengio, N. Leonard, and A. C. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint.
  • [2] J. Benoit, K. Skirmantas, C. Bo, Z. Menglong, T. Matthew, H. Andrew, A. Hartwig, and K. Dmitry (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, pp. 2704–2713.
  • [3] J. Bethge, H. Yang, M. Bornstein, and C. Meinel (2019) Back to simplicity: how to train accurate BNNs from scratch? arXiv preprint arXiv:1906.08637.
  • [4] Z. Cai, X. He, J. Sun, and N. Vasconcelos (2017) Deep learning with low precision by half-wave gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5918–5926.
  • [5] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen (2015) Compressing neural networks with the hashing trick. In ICML, pp. 2285–2294.
  • [6] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
  • [7] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2007) The PASCAL visual object classes challenge 2007 (VOC2007) results.
  • [8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2012) The PASCAL visual object classes challenge 2012 (VOC2012) results.
  • [9] J. Gu, C. Li, B. Zhang, J. Han, X. Cao, J. Liu, and D. Doermann (2019) Projection convolutional neural networks for 1-bit CNNs via discrete back propagation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8344–8351.
  • [10] K. Han, Y. Wang, H. Shu, C. Liu, C. Xu, and C. Xu (2019) Attribute aware pooling for pedestrian attribute recognition. arXiv preprint arXiv:1907.11837.
  • [11] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
  • [13] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • [14] L. Hou and J. T. Kwok (2018) Loss-aware weight quantization of deep networks. In ICLR.
  • [15] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
  • [16] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks. In NIPS, pp. 4107–4115.
  • [17] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report.
  • [18] X. Lin, C. Zhao, and W. Pan (2017) Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pp. 345–353.
  • [19] C. Liu, W. Ding, X. Xia, B. Zhang, J. Gu, J. Liu, R. Ji, and D. Doermann (2019) Circulant binary convolutional networks: enhancing the performance of 1-bit DCNNs with circulant back propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2691–2699.
  • [20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
  • [21] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K. Cheng (2018) Bi-real net: enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In ECCV.
  • [22] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) XNOR-net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542.
  • [23] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks.
  • [24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, et al. (2015) ImageNet large scale visual recognition challenge. IJCV, pp. 211–252.
  • [25] M. Shen, K. Han, C. Xu, and W. Yunhe (2019) Searching for accurate binary neural architectures. arXiv preprint arXiv:1909.07378.
  • [26] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [27] R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Highway networks. arXiv preprint arXiv:1505.00387.
  • [28] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
  • [29] D. Wan, F. Shen, L. Liu, F. Zhu, J. Qin, L. Shao, and H. Tao Shen (2018) TBN: convolutional neural network with ternary inputs and binary weights. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 315–332.
  • [30] Y. Wang, C. Xu, J. Qiu, C. Xu, and D. Tao (2017) Towards evolutional compression. arXiv preprint arXiv:1707.08005.
  • [31] X. Yu, T. Liu, X. Wang, and D. Tao (2017) On compressing deep models by low rank and sparse decomposition. pp. 67–76.
  • [32] S. Zagoruyko and N. Komodakis (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928.
  • [33] D. Zhang, J. Yang, D. Ye, and G. Hua (2018) LQ-nets: learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 365–382.
  • [34] J. Zhang, Y. Pan, T. Yao, H. Zhao, and T. Mei (2019) daBNN: a super fast inference framework for binary neural networks on ARM devices. arXiv preprint arXiv:1908.05858.
  • [35] X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856.
  • [36] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou (2016) DoReFa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.
  • [37] S. Zhu, X. Dong, and H. Su (2019) Binary ensemble neural network: more bits per network or more networks per bit? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4923–4932.