dabnn is an accelerated binary neural networks inference framework for mobile platform
Binary neural networks have attracted considerable attention in recent years. However, mainly due to the information loss stemming from biased binarization, preserving the accuracy of these networks remains a critical issue. In this paper, we attempt to maintain the information propagated in the forward process and propose Balanced Binary Neural Networks with Gated Residual (BBG for short). First, a weight-balanced binarization is introduced to maximize the information entropy of the binary weights, so that the informative binary weights can capture more of the information contained in the activations. Second, for binary activations, a gated residual is appended to compensate for their information loss during the forward process, with only a slight overhead. Both techniques can be wrapped as a generic network module that supports various network architectures for different tasks, including classification and detection. We evaluate BBG on image classification over CIFAR-10/100 and ImageNet and on object detection over Pascal VOC. The experimental results show that BBG-Net performs remarkably well across network architectures such as VGG, ResNet and SSD, outperforming state-of-the-art methods in terms of memory consumption, inference speed and accuracy.
In recent years, a number of approaches have been proposed to learn portable deep neural networks, including multi-bit quantization [2, 4, 14], pruning [11, 30], low-rank decomposition, hashing, lightweight architecture design, and knowledge distillation [13, 32]. Among them, quantization-based methods [2, 14] represent the weights and activations in the network with very low precision, and thus can yield much more compact DNN models than their floating-point counterparts. In the extreme case where both the weights and activations are quantized (i.e. binarized) into one-bit values, the conventional convolution operations can be efficiently implemented via bitwise operations [22, 16]. The resulting reduction in storage and acceleration of inference are therefore appealing to the community. Since binary neural networks were first proposed, many works have addressed the performance drop and improved the expressive ability of binarized networks, e.g. Bireal-Net, and DoReFa-Net and LQ-Net, which take more bits into account.
Although much progress has been made on binarizing DNNs, existing quantization methods still cause a significant accuracy drop compared with full-precision models. This observation indicates that the information propagated through the full-precision network is largely lost when using the binary representation. On the one hand, the weights act as filters that extract features layer by layer, and they are expected to be diverse for powerful feature representation. A biased weight binarization such as the sign function usually ignores the variations among the floating-point weights, and thus leads to severe information loss. For instance, if all the binary weights degenerate to 1 (or all to -1), the convolutional or fully-connected layers degenerate into an identity map, which inevitably reduces the representation learning capability of the binary network and makes it hard to train in practice. On the other hand, the activations convey the information used for network prediction, while binarizing them to 1 bit causes large deviations from the original full-precision feature maps, and thus greatly increases the information loss during layer-wise propagation.
To address this problem and maximally preserve the information, in this paper we propose the Balanced Binary Network with Gated Residual (BBG for short), which learns balanced binary weights by maximizing their information entropy, and reduces the binarization loss for activations through a gated residual path. Figure 1 shows the whole structure of BBG. In BBG, we find that by simply removing the mean of the floating-point weights, our method can roughly eliminate the redundancy among the binary weights and approach the maximum entropy. To accomplish this in a deep network, we re-parameterize the weights and devise a linear transformation to replace the standard one, which can be easily implemented and learned. To compensate for the information loss when binarizing activations, the gated residual path employs a lightweight operation that uses the channel attention information of the floating-point activations to reconstruct the information flow across layers. Our approach is fully compatible with bitwise operations, and therefore retains the benefit of fast inference for a wide spectrum of deep network architectures. We evaluate BBG on image classification and object detection benchmarks, i.e. CIFAR-10, CIFAR-100, ImageNet and Pascal VOC, and the experimental results show that it performs remarkably well across various network architectures, outperforming the state of the art in terms of memory consumption, inference speed and accuracy.
To reduce the space and computational complexities of deep neural networks simultaneously, a number of quantization and binarization methods have been proposed for processing both weights and activations in neural networks.
BinaryNet first proposes binary neural networks, and its performance drop is not obvious for shallow networks on small datasets. XNOR-Net adopts scale factors in each layer and solves for them by minimizing the quantization error of the output matrix. TBN enhances the expressive ability of neural networks with ternary activations. ABC-Net proposes to use more binary weight and activation bases to increase the accuracy, while the compression and speed-up ratios are reduced accordingly. LQ-Net further proposes a low bit-width quantization function obtained by minimizing the quantization error, which achieves comparable results on the ImageNet benchmark at the cost of increased memory overhead.
In contrast to conventional model compression methods with sophisticated pipelines, such as pruning [11, 30] and matrix decomposition, neural network binarization can be efficiently implemented through logic and counting operations. However, the performance of existing binarization methods is still lower than that of the original baseline models because of inaccurate quantization procedures, which decrease the expressive ability of the network and make it hard to train. To address this, Bireal-Net proposes using an additional shortcut and a different approximation of the sign function in the backward pass. PCNN employs a projection matrix to help with network training. CircConv rotates the binary weights three times, computes feature maps with the four resulting binary weights, and merges them together. BENN ensembles multiple standard binary networks to improve performance. AutoBNN employs a genetic algorithm to search for binary neural network architectures. Most of these works have complicated training pipelines, and some of them increase the FLOPs to achieve better results.
In this section, we first give the preliminaries about binary neural networks, then present the technical details about our BBG method, with respect to the weights and activations.
Consider $w \in \mathbb{R}^n$ and $a \in \mathbb{R}^n$ as the floating-point weight and activation of a fully-connected layer in the neural network, respectively, where $n$ is the dimension of the weight and activation. Quantizing the weight and activation reduces the memory cost and accelerates computation in the inference phase. Denote $\hat{w}$ and $\hat{a}$ as the binary bases for $w$ and $a$, respectively. The output feature $z$ can be obtained by
$$z = \hat{w}^{\top} \hat{a},$$
where the inner product of the vectors can be calculated efficiently with the bit-wise operations AND and Bitcount. To deal with the zero-gradient problem of the sign function in backward propagation, we adopt the STE method to estimate the gradient. A sign function that ignores the distribution of the floating-point weights generates biased binary weights, which makes the binary network hard to train.
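The AND and Bitcount formulation can be illustrated with a minimal Python sketch. It assumes weights in {-1, +1} (as produced by a sign function) and activations in {0, 1} (as produced by clipping and rounding); the bit encoding and the helper names `pack_bits` and `binary_dot` are hypothetical and chosen only for illustration.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into one Python int (bit i holds bits[i])."""
    word = 0
    for i, b in enumerate(bits):
        if b:
            word |= 1 << i
    return word


def binary_dot(w_signs, a_bits):
    """Inner product of w in {-1,+1}^n and a in {0,1}^n using AND + bitcount."""
    # Assumed encoding: a +1 weight is stored as bit 1, a -1 weight as bit 0.
    w_word = pack_bits([1 if s > 0 else 0 for s in w_signs])
    a_word = pack_bits(a_bits)
    # Positions where the weight is +1 and the activation is 1.
    plus = bin(w_word & a_word).count("1")
    # All active positions; those not counted in `plus` carry weight -1.
    active = bin(a_word).count("1")
    return plus - (active - plus)


w = [1, -1, -1, 1, 1, -1]
a = [1, 1, 0, 1, 0, 1]
# The bitwise result agrees with the ordinary floating-point style dot product.
assert binary_dot(w, a) == sum(wi * ai for wi, ai in zip(w, a))
```

On real hardware the packed words would be fixed-width registers and the bitcount a single instruction; Python integers are used here only to keep the sketch self-contained.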
For training deep neural networks on a training dataset with $N$ samples $\{(x_i, y_i)\}_{i=1}^{N}$, the goal is to minimize the loss between the true label $y_i$ and the predicted label $\hat{y}_i$, and to obtain training parameters $W$ that fit the original training data and generalize well to test data:
$$\min_{W} \; \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(y_i, \hat{y}_i).$$
In Binary Neural Networks, the most important training parameter is the non-differentiable discrete binary weights of fully-connected layers and convolution layers. In quantized convolution layers, how to generate a binary weight is crucial to improve network’s performance. This discrete property of binary weights brings troublesome problems to the training of the network.
To this end, we aim to preserve the information of the weights in the binarization process using differentiable and time-saving methods. The value of each binarized weight $\hat{w}$ obeys a Bernoulli distribution, taking the value $1$ with probability $p$ and the value $-1$ with probability $1-p$, where $0 \le p \le 1$. For each kernel in the neural network, the information contained in its binary weights can be measured by the entropy:
$$H(\hat{w}) = -p \ln p - (1-p) \ln (1-p).$$
To preserve the information of the binary weights, the training of binary neural networks should therefore maximize the entropy of the binary weights. Directly optimizing this regularization is hard because the probability $p$ is difficult to estimate. Instead, we approximate the optimal solution by making the expectation of the binary weights zero, i.e. $\mathbb{E}(\hat{w}) = 0$, which corresponds to $p = 1/2$ and hence to the maximum of the entropy. Therefore, the binarized network should be optimized by jointly minimizing the task-related objective function, e.g. the cross-entropy function for a classification task, and the information loss passed through each layer (i.e. maximizing the entropy):
$$\min \; \mathcal{L}(\hat{w}, \hat{a}) \quad \text{s.t.} \quad \mathbb{E}(\hat{w}) = 0.$$
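The link between a zero-mean binary weight and maximum entropy can be checked numerically. The following is a minimal sketch, assuming a weight that equals +1 with probability p and -1 otherwise; the function names are hypothetical.

```python
import math


def bernoulli_entropy(p):
    """Entropy (in nats) of a binary weight that is +1 with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a constant weight carries no information
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def expected_value(p):
    """Expectation of a weight that is +1 with probability p and -1 otherwise."""
    return p * 1.0 + (1.0 - p) * -1.0  # equals 2p - 1


# Entropy peaks exactly where the expectation crosses zero (p = 0.5).
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p={p:.1f}  E[w]={expected_value(p):+.1f}  H={bernoulli_entropy(p):.4f}")
```

Because the expectation 2p - 1 vanishes only at p = 0.5, constraining the mean of the binary weights to zero is a convenient surrogate for maximizing the entropy without estimating p directly.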
Hence, we first center the real-valued weights and then quantize them into binary codes. More specifically, we use a proxy parameter $v$ to obtain $w$ and then quantize it into binary codes $\hat{w}$. In the convolutional layer, we express the weight $w$ in terms of the proxy parameter $v$ using:
$$w = v - \frac{1}{n} \sum_{i=1}^{n} v_i.$$
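After balanced weight normalization, we can perform binary quantization on the floating-point weight $w$. Subsequently, the forward pass and backward pass of the binary weights are as follows:
$$\text{Forward:} \quad \hat{w} = \mathrm{sign}(w) \cdot \mathbb{E}(|w|),$$
$$\text{Backward:} \quad \frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial \hat{w}},$$
where $\mathrm{sign}(\cdot)$ is the sign function that outputs $+1$ for positive numbers and $-1$ otherwise, and $\mathbb{E}(|\cdot|)$ calculates the mean of the absolute values.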
In our balanced weight quantization, the parameter update is carried out on the proxy parameters $v$, and thus the gradient signal can back-propagate through the normalization process. Throughout the process, $v$ and $w$ are floating-point; we update $v$ during training, and only $\hat{w}$ is needed at inference. Since the balanced normalization from $v$ to $w$ is differentiable, the proxy parameter $v$ can be updated easily. SGD or alternative techniques such as ADAM can be directly applied to the optimization with respect to the proxy parameter $v$.
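The balanced weight quantization described above can be sketched in plain Python for a single kernel. This is an illustrative sketch under stated assumptions, not the authors' implementation: the centering is assumed to be a simple mean subtraction, the straight-through estimator is assumed to pass the gradient unchanged through the sign and scaling step, and the function names are hypothetical.

```python
def balanced_binarize(v):
    """Center the proxy parameter v, then binarize with a mean-absolute scale."""
    n = len(v)
    mean_v = sum(v) / n
    w = [vi - mean_v for vi in v]          # centered floating-point weight
    scale = sum(abs(wi) for wi in w) / n   # mean of absolute values
    # sign(w) * scale, with sign giving +1 for positive values and -1 otherwise
    w_hat = [scale if wi > 0 else -scale for wi in w]
    return w, w_hat


def ste_grad_to_proxy(grad_w_hat):
    """Back-propagate a gradient on w_hat to the proxy parameter v.

    Assumes the STE, dL/dw = dL/dw_hat. Since w_i = v_i - mean(v),
    the chain rule gives dL/dv_j = g_j - mean(g).
    """
    n = len(grad_w_hat)
    mean_g = sum(grad_w_hat) / n
    return [g - mean_g for g in grad_w_hat]


# A kernel whose raw weights are mostly positive: sign(v) alone would be biased.
v = [0.9, 0.4, 0.7, 0.2, -0.1, 0.5, 0.3, 0.8]
w, w_hat = balanced_binarize(v)
print("positive before centering:", sum(1 for x in v if x > 0))
print("positive after centering: ", sum(1 for x in w_hat if x > 0))
```

Mean subtraction balances the signs exactly only when the mean and median of the weights coincide; for skewed weight distributions the split is approximate, which matches the description of the method as an approximation to the maximum-entropy solution.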
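In the binarization of activations, we first clip the value range of the activations into $[0, 1]$ and then use a round function to binarize them, which is equivalent to the following equation:
$$\hat{a} = \mathrm{round}(\mathrm{clip}(a, 0, 1)).$$
Therefore, the forward pass and backward pass for binary activations are as follows:
$$\text{Forward:} \quad \hat{a} = \mathrm{round}(\mathrm{clip}(a, 0, 1)),$$
$$\text{Backward:} \quad \frac{\partial \mathcal{L}}{\partial a} = \frac{\partial \mathcal{L}}{\partial \hat{a}} \cdot \mathbf{1}_{0 \le a \le 1},$$
where $\mathbf{1}_{0 \le a \le 1}$ is 1 for each element of $a$ that lies in the range $[0, 1]$ and 0 otherwise. The whole binarization process of a fully-connected layer is summarized in Algorithm 1.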
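A minimal Python sketch of the activation binarization and its straight-through backward pass follows. The tie-breaking rule at exactly 0.5 is not specified in the text; rounding 0.5 up to 1 is an assumption made here, and the function names are hypothetical.

```python
def binarize_activation(a):
    """Forward pass: clip each activation to [0, 1], then round to 0 or 1."""
    clipped = [min(max(x, 0.0), 1.0) for x in a]
    # Assumption: a value of exactly 0.5 rounds up to 1.
    return [1 if x >= 0.5 else 0 for x in clipped]


def ste_activation_grad(a, grad_a_hat):
    """Backward pass: pass the gradient only where a lies inside [0, 1]."""
    return [g if 0.0 <= x <= 1.0 else 0.0 for x, g in zip(a, grad_a_hat)]


a = [-0.3, 0.2, 0.6, 1.7]
print(binarize_activation(a))
print(ste_activation_grad(a, [1.0, 2.0, 3.0, 4.0]))
```

The mask in the backward pass zeroes the gradient for activations that were saturated by the clip, so only activations inside the clipping range receive updates.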
Binarizing activations results in a much larger loss of precision than binarizing weights. Moreover, the activation values of all channels are usually quantized to 0 or 1 without considering the differences among the channels, and the quantization error caused by binary layers accumulates layer by layer. To address this problem, we further propose a new module named gated residual, which reconstructs the information flow in a channel-wise manner during forward propagation. Our gated residual employs the floating-point activations to reduce the quantization error and recalibrate the features.
In our gated residual module, we propose a newly designed layer named the gated layer, with gate weights $g \in \mathbb{R}^{C}$ that learn the channel attention information of the floating-point input feature map $a \in \mathbb{R}^{C \times H \times W}$ of a binary convolution layer ($C$, $H$ and $W$ denote the channels, height and width, respectively). The operation on the $i$-th channel of the input feature map is defined as follows:
$$G(a_i) = g_i \, a_i.$$
Based on the gated residual, the output feature map can be recalibrated, enhancing the representation power of the activations. The operation in the gated module can be written in the following form:
$$y = \mathcal{F}(a) + G(a),$$
where $\mathcal{F}$ denotes the operation in the main path, comprising activation binarization, balanced convolution and BatchNorm.
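The gated module can be sketched as a per-channel scaling of the floating-point input added to the output of the main path. In this sketch the feature map is a nested list indexed as `[channel][row][column]`, `main_path` is a hypothetical stand-in for the binarize, convolve and BatchNorm sequence, and the input and output shapes are assumed to be identical.

```python
def gated_residual(a, gate, main_path):
    """Return main_path(a) plus the channel-wise gated floating-point input.

    a:         feature map as nested lists, indexed [channel][row][column]
    gate:      one learnable scalar per channel
    main_path: callable returning a feature map with the same shape as a
    """
    main_out = main_path(a)
    out = []
    for c, channel in enumerate(a):
        out.append([
            [main_out[c][h][w] + gate[c] * value for w, value in enumerate(row)]
            for h, row in enumerate(channel)
        ])
    return out


def zero_main_path(x):
    """Placeholder main path that outputs zeros, to isolate the gated shortcut."""
    return [[[0.0 for _ in row] for row in channel] for channel in x]


feature_map = [[[1.0, 2.0]], [[3.0, 4.0]]]   # 2 channels, 1 x 2 spatial size
gates = [1.0, 0.5]                            # gate of 1.0 acts as an identity shortcut
print(gated_residual(feature_map, gates, zero_main_path))
```

With every gate set to 1 the shortcut reduces to the identity shortcut of a vanilla ResNet, while smaller or larger gates rescale the contribution of individual channels.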
With the gated residual layer, the overall structure of the gated module is shown in Figure 2(d). We construct our balanced module with gated residual in the same way as Bireal-Net: we add a shortcut to each layer rather than one shortcut per basic block. We initialize the weights of the gated residual to 1, which makes it identical to the identity shortcut of a vanilla ResNet. In Figure 3, we visualize the weight values in a gated residual layer; they peak around 1 but vary over a large range, indicating different impacts on different channels. This suggests that the gated residual can also help distinguish the channels and suppress the less useful ones.
Besides reconstructing the information flow in the forward process, this path also acts as an auxiliary gradient path for the activations in backward propagation. Usually, the STE used in the backward pass to approximate discrete binary functions results in a severe gradient mismatch problem. Fortunately, with learnable gate weights, our gated residual can also reconstruct the gradient in the following way:
$$\frac{\partial \mathcal{L}}{\partial a} = \frac{\partial \mathcal{L}}{\partial y} \left( \frac{\partial \mathcal{F}(a)}{\partial a} + g \right).$$
Given the computational complexity and memory constraints inherent to binary networks, the additional operations introduced by a new module need to be as small as possible. The structure of HighwayNet requires full-precision weights as large as the weights in the convolutional layer, which is unacceptable. The SE module of SENet is a correction applied to the output after the convolution layer, but by then the information has already been corrupted by the binary convolution layer; we argue that it is better to make use of the unquantized information. Moreover, the FLOPs of our gated layer are much smaller than those of the SE module, and as the number of channels increases and the reduction ratio decreases, the FLOPs required by the SE module grow further. In comparison with Bireal-Net, the number of channel-wise scalars in Bireal-Net is exactly the same as the number of weights in the gated layer. However, those scalars are computed as the mean of the absolute floating-point weights; they are not independent parameters and can be absorbed by the following BatchNorm layer. In contrast, our learnable gated residual layers help reconstruct the activation and gradient information, subsequently enhancing the representation ability of the network.
In this section, to verify the effectiveness of the proposed BBG-Net, we conduct experiments on both image classification and object detection tasks over several benchmark datasets, and compare BBG-Net with other state-of-the-art methods.
In our experiments, we adopt three common image classification datasets: CIFAR-10/100 and ILSVRC-2012 ImageNet. Each CIFAR-10/100 dataset consists of 60,000 color images with 10/100 classes, split into 50,000 training and 10,000 validation images. ImageNet is a large-scale dataset with 1.2 million training images and 50k validation images belonging to 1,000 classes. For evaluation, we employ the widely used metrics for the classification task, Top-1 and Top-5 validation accuracy, and use the letters W and A to stand for weights and activations, respectively. We also evaluate our proposed method on the object detection task with a standard detection benchmark, the Pascal VOC datasets VOC2007 and VOC2012. Following previous work, we train our models on the combination of the VOC2007 trainval and VOC2012 trainval sets and test on the VOC2007 test set. The standard mAP metric is used for comparison.
We conduct our experiments on the popular and powerful network architectures VGG-Small, ResNet and Single Shot Detector (SSD). For a fair comparison with existing methods, we choose VGG-Small, ResNet-20, ResNet-18 and ResNet-34 as the baseline models on the CIFAR-10/100 and ImageNet datasets. We also verify SSD with different backbones, i.e. VGG-16 and ResNet-34, on the Pascal VOC datasets. As for hyper-parameters, we mostly follow the setup in the original papers, and all our models are trained from scratch.
On the CIFAR-10/100 datasets, padding downsample layers are used for the shortcut instead of learnable convolutional layers for ResNet-20. We train for 400 epochs with a learning rate that starts at 0.1 and decays by a factor of 0.1 at epochs 200, 300 and 375. For VGG-Small, we train for 400 epochs with a learning rate that starts at 0.02 and decays by a factor of 0.1 at epochs 80, 160 and 300. We apply SGD with a momentum of 0.9 as our optimization algorithm.
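The step-decay schedules quoted above can be written as a small helper. This is a sketch rather than the authors' training code; it assumes zero-based epoch indices and that the decay takes effect from the milestone epoch onward, and the name `step_lr` is hypothetical.

```python
def step_lr(epoch, base_lr, milestones, gamma=0.1):
    """Learning rate after multiplying by gamma at every milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Schedules as described in the text for the CIFAR experiments.
RESNET20_SCHEDULE = {"base_lr": 0.1, "milestones": [200, 300, 375]}
VGG_SMALL_SCHEDULE = {"base_lr": 0.02, "milestones": [80, 160, 300]}

for epoch in (0, 199, 200, 300, 375, 399):
    lr = step_lr(epoch, **RESNET20_SCHEDULE)
    print(f"ResNet-20 epoch {epoch:3d}: lr = {lr:.6f}")
```

Deep learning frameworks usually ship an equivalent multi-step scheduler, so a helper like this is mainly useful for checking the intended learning rate at a given epoch.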
On the ImageNet dataset, we train for 120 epochs with a learning rate that starts at 0.001 and is reduced by a factor of 0.1 at epochs 70, 90 and 110. We apply ADAM as our optimization algorithm. Note that, like most binary neural networks, we do not quantize the first and last layers, and we also do not quantize the down-sample layers, as suggested by many previous works. The accuracy drop is far too large if these layers are quantized; moreover, most binary networks, such as Bireal-Net, do not binarize these layers either.
On Pascal VOC datasets, we binarize all convolutional layers in the backbone except the first convolutional layer and down-sample layers in ResNet-34. In VGG-16, we binarize all convolution layers in the backbone except the first and last layer and we only add gated residual to those convolution layers with the same input and output channels to avoid additional computation. BatchNorm layers are used in the backbone and the learning rate starts at 0.05.
Binary neural networks are made up of two kinds of layers, floating-point layers and binary layers. In the calculation of FLOPs, the operation count of a binary layer is divided by 64, following Bireal-Net. This could be the actual acceleration ratio of binary layers with the support of suitable hardware design. Similarly, the number of weights in a floating-point layer should be multiplied by 32, as 32 bits are needed to store a single floating-point weight while only 1 bit is needed to store a binary weight. In the comparisons with different networks and in the network width and resolution experiments, we calculate FLOPs and memory to show our advantages over other state-of-the-art results.
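The accounting rule above can be sketched as two small functions. The divisor of 64 for binary operations is an assumption taken from the Bireal-Net convention, 32 bits per floating-point weight follows from the text, and the layer sizes in the example are made-up numbers used purely for illustration, not figures from the experiments.

```python
BINARY_OP_DIVISOR = 64   # assumption: Bireal-Net style discount for binary operations
FLOAT_WEIGHT_BITS = 32   # bits needed to store one floating-point weight


def effective_flops(float_flops, binary_ops):
    """Total cost: floating-point FLOPs plus discounted binary operations."""
    return float_flops + binary_ops / BINARY_OP_DIVISOR


def model_size_bits(float_weights, binary_weights):
    """Storage in bits: 32 per floating-point weight, 1 per binary weight."""
    return float_weights * FLOAT_WEIGHT_BITS + binary_weights


# Hypothetical network: a few full-precision layers plus many binary layers.
flops = effective_flops(float_flops=1.0e8, binary_ops=1.6e9)
size_mbit = model_size_bits(float_weights=5.0e5, binary_weights=1.0e7) / 1e6
print(f"effective FLOPs: {flops:.3e}")
print(f"model size: {size_mbit:.1f} Mbit")
```

Under this accounting the floating-point first, last and down-sample layers can dominate both totals even though they hold a small share of the raw operations and weights.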
We now study how our balanced weight quantization and gated residual module affect the network's performance. In Table 1, we report the results of ResNet-20 on CIFAR-10, with and without balanced quantization or gated residual. Comparing the first two rows, the network with balanced quantization obtains 0.6% higher accuracy than the one without it. From the whole table, we observe that balanced weights and gated residual each bring an accuracy improvement, and together they work even better, with a 1.2% accuracy improvement. This shows that both components of our proposed method contribute to a more accurate binary network.
For VGG-Small on the CIFAR-10 dataset, we compare our quantization method with BNN, XNOR-Net and DoReFa-Net. We do not add the gated residual here. As shown in Table 2, our proposed binarization method consistently obtains higher accuracy than the other three methods, with over 2% accuracy improvement, and incurs an accuracy drop of less than 2% compared with the full-precision network.
For ResNet-20 on the CIFAR-10/100 datasets, we further compare the four kinds of modules in Figure 2 that are used to construct the ResNet structure. In this experiment, the binarization methods for weights and activations are fixed to our proposed methods, so that only the influence of the different construction modules is measured. Vanilla is the basic block originally used in full-precision ResNet networks. The kernel stage of the standard ResNet-20 is 16-32-64, where the three numbers are the numbers of channels in each stage. Inspired by the recent EfficientNet, we employ network width expansion to pursue higher accuracy for binary networks, expanding all channel counts to a multiple of those in the standard ResNet-20 (for example, the kernel stage 48-96-192 is three times the standard width).
The results of the different modules with different kernel stages are shown in Table 3. On CIFAR-10, the accuracy improvement brought by the gated modules (Vanilla Gated or Gated) is less than 1%, while on the more challenging CIFAR-100 dataset it adds up to over 2%. Across all experiments, we conclude that Vanilla Gated and Gated are superior to the Bireal and Vanilla modules. In particular, when the network grows wider, the Bireal module performs even worse than Vanilla, while Vanilla Gated and Gated consistently perform better. With the kernel stage of 48-96-192, our binary network matches the accuracy of the full-precision network on both CIFAR-10 and CIFAR-100. On CIFAR-100, as the kernel stage increases, Gated consistently performs best among the four modules. Therefore, we use the Gated module in the following experiments on the ImageNet and Pascal VOC datasets.
The experimental results of ResNet-18 are shown in Table 4. We compare our method with many recent binarization methods, including BNN, ABC-Net, XNOR-Net, DoReFa-Net and PCNN, and the results further show the stability of the proposed BBG-Net on larger datasets. Compared with Bi-Real Net, we still obtain a 2.6% improvement. In terms of Top-1 accuracy, our method outperforms all the other methods, with a 1.3% performance gain over the state-of-the-art ResNetE. In comparison with the full-precision baseline, the Top-1 classification accuracy of our method is reduced by about 10%, and the Top-5 classification accuracy declines by less than 8%. Note that BNN, ABC-Net and DoReFa-Net binarize the downsample layers to 1 bit, along with all the other convolutional layers except the first and last layers, whereas all the remaining methods, including ours, do not binarize the downsample layers. In the FLOPs and memory comparison, our method is only slightly higher than DoReFa-Net, while our accuracy is higher by up to about 7%.
We also improve the accuracy of binary networks by exploring network width and resolution in a simple but effective way. In the network width experiments, we expand all channels of the original ResNet-18 by constant width multipliers. In Figure 4, compared with the full-precision networks, our width-expanded models achieve almost the same accuracy while needing only a third of the FLOPs of the full-precision networks. We also compare with ABC-Net, which uses multiple binary bases to represent weights and activations, for example 5 binary bases for weights and 3 bases for activations. In the comparison with ABC-Net using multiple bases, our methods consistently perform better by a large margin. For the resolution exploration, we employ a simple strategy: we remove the max pooling layer after the first convolution, and then reduce the feature map height and width to 1 with global average pooling before the fully-connected layer, which gives all hidden layers larger feature maps than the original ones. Compared to CircConv, which employs four times as many binary weights, we obtain better results with a 1.6% accuracy improvement, with slightly higher FLOPs but the same memory. Our resolution model is 2.3% more accurate than DenseNet-28 while using only a fraction of its FLOPs. BENN ensembles 6 standard binary ResNet-18 networks, which requires nearly three times the FLOPs of our resolution model, while its accuracy is 2% lower.
We also conduct experiments on ResNet-34 to verify the advantages of our proposed methods when the network grows deeper. Our method achieves much higher accuracy than TBN, with a 4.4% performance gain, even though TBN employs binary weights and ternary activations while we adopt binary weights and binary activations. When computing the FLOPs of TBN, a layer with ternary activations and binary weights is counted as costing more than a layer with binary activations and binary weights. We also perform better than Bireal-Net (0.4% improvement) and slightly better than DenseNet-37. Note that DenseNet-37 consumes more running memory than ResNet-34 due to its dense connections, and its FLOPs are higher than ours, which makes it difficult to accelerate on actual hardware and to deploy on edge devices. The results demonstrate the effectiveness of our proposed methods in deeper networks.
For the object detection task, we compare our method with XNOR-Net, TBN, and BDN, the last of which includes DenseNet-37 and DenseNet-45. With ResNet-34 as the backbone, we outperform XNOR-Net and TBN by 6.9% and 2.5%, respectively. This indicates that our 1-bit solution can preserve a stronger feature representation and maintain better generalization ability than ternary neural networks. As for VGG-16, we need only half of the FLOPs of binary DenseNet-45 to achieve slightly higher results, and our model outperforms DenseNet-37 by 2% with one third fewer FLOPs. The accuracy of our VGG-16 is only 5.8% lower than that of its full-precision counterpart. The significant performance gain further demonstrates that our method can better preserve the information propagated in the network and help extract the most discriminative features for the detection task.
Finally, we implement our method on mobile devices using a framework named daBNN. The mobile device we use is a Raspberry Pi 3B, which has a 1.2 GHz 64-bit quad-core ARM Cortex-A53. As shown in Table 7, we also implement DoReFa-Net, which binarizes the downsample layers while our method does not; the resulting difference in inference time is small enough to be ignored. Our proposed method runs faster than its full-precision counterparts. This acceleration ratio can be further improved by better design or by customized hardware such as FPGAs.
Binarization methods for building portable neural networks are urgently required so that networks with massive parameters and complex architectures can be deployed efficiently. In this work, we proposed a novel Balanced Binary Neural Network with Gated Residual, namely BBG-Net. The new framework introduces a linear balanced transformation module that maximizes the information entropy of the weights in deep neural networks and can be easily implemented in any deep neural architecture. In addition, our gated residual compensates for the information loss and recalibrates the feature maps of the binary layers. Experiments conducted on benchmark datasets and architectures demonstrate the effectiveness of the proposed balanced weight quantization and gated residual for learning binary neural networks with higher performance but lower memory and computation consumption than state-of-the-art methods.