1 Introduction
With the rapid evolution of mobile devices from simple tools for voice and data transfer to multifunctional intelligent gadgets performing multiple deep learning tasks, such as image classification, object detection or natural language processing, there is a substantial need for fast, power-efficient and robust neural networks. One promising approach to this problem is reducing the bit-width of the weights, which in the extreme case leads to the emergence of binary networks, where the commonly used floating-point weights are replaced with binary ones. Such a reduction results in an up to 32 times smaller network size and a more efficient inference on CPUs and specialized hardware [18, 19]. However, binarization also leads to a substantial accuracy drop caused by the reduced information capacity of the convolution filters. The latter can be measured by the Shannon information entropy $H$, according to the basic principles of information theory.
In this paper, we propose a novel solution to the problem of the restricted performance of binary networks: controlling and stabilizing their information capacity. The proposed approach ensures the best representation of information in the convolutional filters, which leads to a higher overall prediction accuracy.
Our main contributions are as follows:


For the first time, we propose the concept of information capacity regularization in convolutional neural networks.

A new regularization technique is developed: Shannon entropy based information loss penalty for binary convolutional neural networks.

We show that maintaining a high value of information entropy in the convolutional filters during the training process leads to a higher prediction accuracy of binary neural networks.
The rest of the paper is organized as follows. Section 2 gives an overview of the related works. Section 3 describes the proposed approach in detail, and Section 4
provides the experimental results obtained on several common machine learning problems. Section
5 concludes this work.

2 Background
In this section, we review the existing techniques related to binary neural networks and the application of entropy-based approaches in the field of machine learning.
2.1 Binary Neural Networks
Training and inference with deep neural networks usually consume a large amount of computational power, which makes them hard to deploy on mobile devices. Recent network binarization approaches [5, 6, 16, 17, 36, 48] offer an efficient solution for these tasks by using binary weights and fast bitwise operations that radically accelerate the computations and reduce the size of the network by up to 32 times, however, at the cost of lower prediction accuracy. According to the theoretical analysis [36], the binarization of weights and quantization of activation functions can bring a 2 to 58 times speedup. Convolutions with binarized weights require only addition and subtraction operations (without multiplications), which results in a 2-times reduction of computation time, while the additional binarization of activations and the usage of XNOR and POPCOUNT operations can provide up to a 58-times speedup [6].
In works [5, 6, 17], the authors proposed constraining weights and activations to binary values and using efficient XNOR and POPCOUNT operations instead of matrix multiplications. Built on a similar idea, XNOR-Net [36] introduces a channel-wise scaling factor to improve the approximation of full-precision weights and stores the weights between layers in real-valued variables. Replacing the channel-wise scaling factor with one scalar for all filters of each layer and focusing on different bit-widths of weights and activations is the main idea of DoReFa-Net [48], which shows the effectiveness of a binary network with 4-bit activations. The quantization of activations in the binarized U-Net [43] with varied bit-width reduces the memory consumption by up to 4 times without compromising the performance. An ambitious attempt to use only binary weights is taken in ABC-Net [30], which substantially reduces the drop in top-1 accuracy on the ImageNet dataset. By providing 3 to 5 binary weight bases to approximate full-precision weights, ABC-Net demonstrates an increased information capacity, as well as an enlarged size and complexity.
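To illustrate why the XNOR and POPCOUNT operations mentioned above replace multiply-accumulate arithmetic, the sketch below (our simplified illustration, not taken from any of the cited implementations) packs ±1 values into the bits of an integer and computes their inner product with bitwise operations only:

```python
def pack(values):
    # Encode a list of +1/-1 values into the bits of an integer (+1 -> 1, -1 -> 0).
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits

def binary_dot(a, b, n):
    # Dot product of two n-element {-1, +1} vectors packed with pack():
    # XNOR marks the positions where the signs match, popcount counts them,
    # and matches - mismatches = 2 * matches - n.
    xnor = ~(a ^ b) & ((1 << n) - 1)
    matches = bin(xnor).count("1")
    return 2 * matches - n

a = pack([+1, -1, +1, +1])
b = pack([+1, +1, -1, +1])
print(binary_dot(a, b, 4))  # (+1)(+1) + (-1)(+1) + (+1)(-1) + (+1)(+1) = 0
```

On real hardware a single XNOR/POPCOUNT pair processes 32 or 64 weight products at once, which is the source of the speedups reported in [6, 36].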
While the previously proposed approaches achieve a significantly reduced computational load and a smaller network size due to the restricted weight bit-width, this also leads to a substantial drop in prediction accuracy. Finding a way to accurately train binary neural networks still remains an unsolved task. One natural way to approach this problem is to measure and control the quantity of information in the binary network, which can be done, for example, with the classical measure of information quantity: Shannon information entropy.
2.2 Entropy-Based Learning Methods
Shannon information entropy $H$ is defined as the expected amount of information in a random variable [38], and together with its successors is widely used as a measure of information value in different fields, including chemistry, medicine [9, 21, 24], robotics and machine learning. In deep belief networks, the maximum entropy learning algorithm provides better generalization capability than the maximum likelihood learning approach, ensuring a less biased distribution and a predictive model robust to overfitting [23]. Shannon entropy inspired the creation of the cross-entropy loss function [44] that is commonly used for training various networks performing data classification. Another successor is the Kullback-Leibler divergence [27], which is employed for training deep belief networks [29], where maximization of the entropy of the model parameters ensures a more efficient generalization. The maximum entropy principle is also used for maximization of the expected reward in reinforcement learning [33, 39, 45], to define the probability of a trajectory and to choose a near-optimal expert policy [2, 46].
According to the principle of maximum entropy, when dealing with estimation problems given incomplete knowledge, better generalization is achieved when the input data distribution is as uniform and unbiased as possible [22]. Thus, increasing the entropy of the input data by adding random noise improves the robustness to overfitting on small data sets [1, 22]. Multi-sensor data fusion with entropy maximization also ensures an effective image reconstruction [40]. The maximum entropy principle makes it possible to extract the learned features from the input data and provides a better prediction accuracy for deep neural networks [7]. A high information quantity, in the form of the highest homogeneity of samples in child nodes, is maintained by the entropy-based information gain ratio and Gini impurity measures in such non-parametric supervised learning methods as decision tree [35] and decision stream [20]. Entropy maximization methods are used to train generative networks for texture synthesis [31] and for reconstruction of macroeconomic models [11]. It was shown that penalizing the entropy of the network's output distribution improves exploration in reinforcement learning, acts as a strong regularizer in supervised learning [34], and improves medical image segmentation [14]. In this paper, we make an attempt to go further and propose a new regularization approach for controlling the information entropy of the weight distribution in the convolutional filters of binary networks.
3 Information Loss Penalty
In this section, we introduce an approach for controlling the information capacity of binary neural networks with a Shannon entropy based loss penalty. First, we describe the calculation of the entropy for binary convolutional filters, and then present the information loss penalty for networks with binary weights backed by real-valued variables.
The sum of the absolute values of the weights $S$ in the convolutional filter and the difference $D$ between the quantity of positive ($n_+$) and negative ($n_-$) values of the weight tensor $W$ are defined for every convolutional filter as:

	$S = \sum_{i=1}^{n} |w_i|$,   (1)

	$D = n_+ - n_- = \sum_{i=1}^{n} w_i$,   (2)

where $n$ denotes the size of tensor $W$ and $w_i \in \{-1, +1\}$ are its binary weights. These values are used to calculate the probability density functions for the positive and negative weights in the filter according to:

	$p_+ = \frac{S + D}{2S}$,   (3)

	$p_- = \frac{S - D}{2S}$.   (4)
The Shannon information entropy of the binary weight distribution in a convolution filter is then computed using the above density functions $p_+$ and $p_-$ as:

	$H(W) = -p_+ \log_2 p_+ - p_- \log_2 p_-$.   (5)
The mean information entropy for all convolutional filters in the network with binary weights can be obtained with:

	$\bar{H} = \frac{1}{N} \sum_{k=1}^{N} H(W_k)$,   (6)

where $N$ denotes the total number of filters, and $W_k$ is the tensor with binary weights corresponding to filter $k$. The time complexity of computing $\bar{H}$ for conventional deep neural network architectures is linear in the total number of network weights.
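A direct transcription of Eqs. 1-6 for a set of binary filters might look as follows (a NumPy sketch of ours; the function and variable names are illustrative, not taken from the paper's implementation):

```python
import numpy as np

def filter_entropy(w):
    """Shannon entropy (Eq. 5) of the +/-1 weight distribution in one filter."""
    w = np.asarray(w, dtype=np.float64)
    s = np.abs(w).sum()                # Eq. 1: sum of absolute weight values
    d = w.sum()                        # Eq. 2: equals n_+ - n_- for +/-1 weights
    p_pos = (s + d) / (2.0 * s)        # Eq. 3
    p_neg = (s - d) / (2.0 * s)        # Eq. 4
    h = 0.0
    for p in (p_pos, p_neg):
        if p > 0.0:
            h -= p * np.log2(p)        # Eq. 5 (0 * log 0 treated as 0)
    return h

def mean_entropy(filters):
    """Mean entropy over all filters (Eq. 6); linear in the number of weights."""
    return sum(filter_entropy(w) for w in filters) / len(filters)

balanced = [+1, -1, +1, -1]            # n_+ = n_-  ->  H = 1 (maximum)
skewed = [+1, +1, +1, -1]              # biased distribution  ->  H < 1
print(filter_entropy(balanced), filter_entropy(skewed))
```

A perfectly balanced filter reaches the maximum entropy of 1 bit, while any skew toward one sign lowers it, which is exactly the quantity the penalty below controls.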
Following [6], to binarize a floating-point network we constrain the weights to $+1$ and $-1$, which is advantageous from a hardware perspective, and in order to transform the real-valued variables into discrete values, we use the deterministic binarization function:

	$w_b = \mathrm{sign}(w) = \begin{cases} +1, & \text{if } w \ge 0, \\ -1, & \text{otherwise}, \end{cases}$   (7)

where the calculated binary weights $w_b$ are used in both the forward and backward passes of the network computations, while the original real-valued weights $w$ are utilized in the update phase of the training procedure.
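The training scheme around Eq. 7 can be sketched as follows (our illustrative code; the clipping of the real-valued weights to [-1, 1] follows common practice in binary-network training and is an assumption here, not a detail stated in this paper):

```python
import numpy as np

def binarize(w):
    # Eq. 7: deterministic sign binarization, mapping w >= 0 to +1 and w < 0 to -1
    return np.where(np.asarray(w) >= 0.0, 1.0, -1.0)

def train_step(real_w, grad_fn, lr=0.01):
    # The forward and backward passes operate on the binary weights...
    w_b = binarize(real_w)
    grad = grad_fn(w_b)
    # ...while the update is applied to the real-valued copy, which
    # accumulates small gradient steps between sign flips.
    return np.clip(real_w - lr * grad, -1.0, 1.0)

w = np.array([0.3, -0.7])
print(binarize(w))  # [ 1. -1.]
```

Keeping the real-valued copy is what makes gradient descent possible at all: a single step almost never flips a sign, but many accumulated steps can.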
The previously introduced information entropy (Eq. 6) was defined only for binary weights, while the network optimization is performed on floating-point variables. Though the binary weights $w_b$ can be substituted with $\mathrm{sign}(w)$, in this case we would get a non-differentiable loss function. To avoid this problem and provide a differentiable entropy estimate $\tilde{H}$, we propose approximating the $\mathrm{sign}(w)$ function with:

	$\tilde{w} = \tanh(\beta w)$,   (8)

where the value of $\beta$ was selected based on the experimental results. Using this approximation of the binary weights $\tilde{w}$, we can estimate the information loss as the absolute difference between the expected information entropy $H_e$ and the actual one $\tilde{H}$:

	$L_H = |H_e - \tilde{H}|$,   (9)

where $H_e$ provides the control over the network information capacity, and its largest value for the binary distribution ($H_e = 1$) corresponds to the maximum information capacity. The resulting information loss penalty is added to the overall loss function in a conventional way:

	$L = L_0 + \alpha L_H$,   (10)

where $L_0$ is the original task loss and $\alpha$ is a scaling coefficient, and the combined loss $L$ is used in the backpropagation algorithm for network training.
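Putting Eqs. 8-10 together, the differentiable penalty for a single filter can be sketched as below (illustrative NumPy code of ours; the values beta = 5 and alpha = 1e-4 are placeholders for the experimentally selected constants, not values taken from the paper):

```python
import numpy as np

def information_loss(real_w, h_e=0.97, beta=5.0, alpha=1e-4):
    """Differentiable information loss penalty (Eqs. 8-10) for one filter."""
    w = np.tanh(beta * np.asarray(real_w, dtype=np.float64))  # Eq. 8
    s = np.abs(w).sum()                                       # Eq. 1 on soft weights
    d = w.sum()                                               # Eq. 2 on soft weights
    p = np.array([s + d, s - d]) / (2.0 * s)                  # Eqs. 3-4
    p = np.clip(p, 1e-12, 1.0)                                # avoid log(0)
    h = -(p * np.log2(p)).sum()                               # Eq. 5
    return alpha * abs(h_e - h)                               # Eq. 9, scaled as in Eq. 10

# A sign-balanced filter reaches H close to 1, so its penalty is
# approximately alpha * |0.97 - 1|; an all-positive filter is penalized more.
print(information_loss([0.8, -0.8, 0.5, -0.5]))
```

Because every operation here (tanh, sums, log) is differentiable, the same expression written in an autograd framework lets the penalty gradient flow back to the real-valued weights.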
4 Empirical Verification
To evaluate the performance of the information loss penalty and to explore its properties, we conducted several experiments with binary networks on common machine learning problems. Below we examine the impact of the information loss penalty on the network's performance and information capacity, and compare the obtained accuracy values (mean and 95% confidence interval) with the results of the standard binary and full-precision networks.
4.1 Datasets


CIFAR-10 and CIFAR-100 image classification datasets [25]: 10 and 100 classes, respectively, 32×32 px images, 50K training and 10K validation samples (https://www.cs.toronto.edu/~kriz/cifar.html).

SVHN, a real-world digit image classification dataset [32]: 10 classes, 32×32 px images, 73K training and 26K validation samples (http://ufldl.stanford.edu/housenumbers).

ImageNet ILSVRC-12 classification dataset [37]: 1000 classes, 224×224 px images, 1.28M training and 50K validation samples (http://www.imagenet.org).
4.2 Training Process
All experiments were performed on Nvidia GK110B GPUs for both full-precision and binary networks. The models were trained to minimize the cross-entropy loss function, additionally augmented with the information loss penalty (Eq. 9) in the corresponding experiments.
Following [15], all networks were trained using stochastic gradient descent. On the SVHN and CIFAR datasets, the networks were trained for 40 and 300 epochs, respectively, using batch sizes of 64 and 128 for full-precision and binary networks. The initial learning rate was set to 0.1 and was divided by 10 at 50% and 75% of the total number of training epochs. On the ImageNet dataset, the images were resized to 224×224 px, and the models were trained for 90 epochs with a batch size of 256. The learning rate was initially set to 0.1 and was decreased by a factor of 10 at epochs 30 and 60. A weight decay [10] and a Nesterov momentum of 0.9 [41] were used without damping. We utilized the DoReFa-Net binarization technique for weight quantization [48], adapting its implementation (https://github.com/zzzxxxttt/Pytorch_DoReFaNet) for our tasks and keeping the first and last layers full-precision [4, 16, 48]. A 4-bit uniform quantization is applied to the activations [28], starting from the outputs of the first layer.
To ensure a fair comparison between full-precision networks, binary networks and binary networks with the information loss penalty, we used identical experimental settings in all three cases, including data preprocessing and optimization settings. For the experiments, we used five publicly available network architectures: for ImageNet classification, the PyTorch (https://pytorch.org; PyTorch v. 1.0, CUDA v. 10.0) implementations of DenseNet-121 [15], ResNet-18 [12], Inception v3 [42] and AlexNet [26] from the TorchVision package; for the other datasets, pre-activation ResNet-18 (https://github.com/kuangliu/pytorch-cifar) [13]. The auxiliary classifier in the binary Inception network was kept full-precision. Binarization leads to a 32× compression of the hidden layers and a commensurate reduction of the model memory footprint (Table 1).

Table 1. Number of parameters and memory footprint of the evaluated models.

Model                     | Output size | # parameters | Footprint, MB (full-precision) | Footprint, MB (binary)
Pre-activation ResNet-18  | 10          | 11.2M        | 44.7                           | 1.4
Pre-activation ResNet-18  | 100         | 11.2M        | 44.9                           | 1.6
DenseNet-121              | 1000        | 8.0M         | 32.4                           | 1.3
ResNet-18                 | 1000        | 11.7M        | 46.8                           | 3.6
Inception-v3              | 1000        | 27.2M        | 108.9                          | 11.6
AlexNet                   | 1000        | 61.1M        | 244.4                          | 23.5

The first and the last layer are full-precision.
During the training process, using the additional information loss penalty resulted in a computational overhead of 6.3 ± 0.8%.
4.3 Classification Results
Table 2. Top-1 classification accuracy, % (mean ± 95% confidence interval where available). The SVHN and CIFAR columns report the pre-activation ResNet-18; the ImageNet 2012 columns report the listed architectures.

Network ($H_e$)        | SVHN         | CIFAR-10     | CIFAR-100    | DenseNet-121 | ResNet-18 | Inception-v3 | AlexNet
Full-precision (—)     | 96.45 ± 0.11 | 95.23 ± 0.20 | 76.78 ± 0.19 | 76.4 [15]    | 69.3 [36] | 78.8 [42]    | 56.6 [36]
Binary (—)             | 96.10 ± 0.08 | 93.43 ± 0.19 | 73.07 ± 0.18 | 65.12        | 59.26     | 72.61        | 53.36
Binary ($H_e$ = 0.97)  | 96.27 ± 0.07 | 93.82 ± 0.17 | 73.48 ± 0.16 | 67.14        | 61.36     | 73.83        | 54.07
Binary ($H_e$ = 1.00)  | 96.14 ± 0.08 | 93.51 ± 0.19 | 73.20 ± 0.17 | —            | —         | —            | —
First, we conducted a series of preliminary experiments aimed at determining the optimal values of the penalty weight and the expected entropy $H_e$, where the accuracy of the pre-activation ResNet-18 binary network was assessed on the validation subsets of the SVHN, CIFAR-10 and CIFAR-100 datasets. In the first set of experiments, the expected information entropy was kept constant and equal to its maximum value ($H_e = 1$), while the weight of the penalty term was varied in steps of one order of magnitude to find its optimal value.
In the second set of experiments, we used the obtained penalty weight but varied the target entropy $H_e$ between 0.8 and 1 with a step size of 0.01. The results demonstrated that the optimal value of $H_e$ is 0.97, and in the subsequent experiments we used these values of the penalty weight and $H_e$ to obtain the final results. We also trained the pre-activation ResNet-18 network with the highest target entropy ($H_e = 1$) to analyze its performance in this case.
Table 2 summarizes the classification results obtained on the SVHN, CIFAR-10, CIFAR-100, and ImageNet 2012 datasets with full-precision networks and binary networks trained with common parameters and with the additional information loss penalty. On the SVHN and CIFAR datasets, the networks with binary weights show an accuracy drop of 0.35–3.71% compared to their full-precision versions, which corresponds to the results of [8, 17, 48]. Augmenting the target loss with the information loss penalty and aiming at the highest possible entropy value ($H_e = 1$) leads only to a slight accuracy improvement of 0.04–0.13% in the case of the pre-activation ResNet model. Lowering the target entropy to $H_e = 0.97$ resulted in larger accuracy improvements: 0.17%, 0.39% and 0.41% for pre-activation ResNet on the SVHN, CIFAR-10 and CIFAR-100 datasets, respectively. This effect can be explained as follows: while the largest entropy theoretically leads to the highest information capacity, it also forces the network to use the highest variety of weights, therefore preventing it from learning different modifications of the same filter targeted at different patterns, which can be crucial in many classification tasks.
On the ImageNet classification problem, our PyTorch implementations of the binary ResNet-18 and AlexNet models provided the same state-of-the-art accuracy (Table 2) as the DoReFa-Nets [48] with 4-bit activations: 59.2% and 53.0%, respectively. Our vanilla binary versions with full-precision external layers outperform the 4-bit DenseNet-121 (4.1MB) and the 6-bit Inception v3 (20.4MB) with 8-bit activations, which show accuracies of 63.0% and 72.5%, respectively [47]. Compared to the DenseNet-45 (7.4MB) with binary weights/activations and an accuracy of 63.7% [3], we provide a more accurate, yet 5.7× smaller, DenseNet-121.
By adding the information loss penalty with the obtained optimal values of the penalty weight and $H_e = 0.97$, the top-1 classification accuracy was improved considerably, to 67.14%, 61.36%, 73.83% and 54.07% with the DenseNet, ResNet, Inception and AlexNet models, respectively. While this performance still leaves room for improvement compared to the full-precision models, it shows a significant advantage over the previously proposed attempts to improve the accuracy of binarized networks with a fixed architecture and 4-bit activations on the ImageNet dataset.
Figure 1 demonstrates that the information loss penalty does not significantly alter the training curve, though it consistently shows a higher level of accuracy during the entire training. The reason for this is the stabilization of the information entropy in the binary network, reflected by a similar distribution of the real-valued and binary weights (Fig. 2) in the network's filters. As one can see, a higher level of $H_e$ corresponds not only to a lower skewness, but also to smaller absolute values of the weights, which shows a similarity between the information loss penalty and classical regularization methods such as the $L_1$ and $L_2$ norms.

5 Conclusion
In this paper, we report on the first attempt to control the information capacity of deep binary networks with a new regularization technique, the information loss penalty. By maintaining an appropriate level of information entropy in every convolutional filter and providing a stable information capacity of the entire binary network, the proposed algorithm offers a better generalization ability and improves the accuracy in classification tasks. The main parameter of the algorithm, the expected information entropy $H_e$, provides an opportunity to determine the optimal information capacity level and force the network to maintain this level throughout the entire learning process. By controlling the information entropy of the binary network, we outperformed the existing state-of-the-art binary solutions using 4-bit activation functions and got closer to the prediction accuracy of their 32-bit counterparts on the SVHN, CIFAR, and ImageNet machine learning datasets.
Acknowledgement
We would like to thank Dr. Alexander Nikolaevich Filippov from the Russian Research Center of Huawei Technologies for insightful discussions.
References
 [1] A. Golan, G. Judge, and D. Miller. Maximum entropy econometrics: robust estimation with limited data. 1996.
 [2] J. Audiffren, M. Valko, A. Lazaric, and M. Ghavamzadeh. Maximum entropy semisupervised inverse reinforcement learning. In IJCAI, 2015.
 [3] J. Bethge, H. Yang, M. Bornstein, and C. Meinel. BinaryDenseNet: developing an architecture for binary neural networks. In ICCV workshop, 2019.

 [4] A. Bulat and G. Tzimiropoulos. Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In ICCV, 2017.
 [5] M. Courbariaux, Y. Bengio, and J.P. David. BinaryConnect: training deep neural networks with binary weights during propagations. In NIPS, 2015.
 [6] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830, 2016.
 [7] A. Finnegan and J.S. Song. Maximum entropy methods for extracting the learned features of deep neural networks. PLoS Comput. Biol., 13(10), 2017.
 [8] J. Fromm, S. Patel, and M. Philipose. Heterogeneous bitwidth binarization in convolutional neural network. In NIPS, 2018.
 [9] I. Gerasimov and D. Ignatov. Effect of lower body negative pressure on the heart rate variability. Human Physiology. 31(4):421–424, 2005.
 [10] S. Gross and M. Wilber. Training and investigating residual nets, 2016.
 [11] A. Hazan. A maximum entropy network reconstruction of macroeconomic models. Physica A: Statistical Mechanics and its Applications, 519, pp. 117, 2019.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [13] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
 [14] X. Hu, J. Tao, Z. Ye, B. Qiu, and J. Xu. An improved wavelet neural network medical image segmentation algorithm with combined maximum entropy. AIP, 2018.
 [15] G. Huang, Z. Liu, L. Van Der Maaten, and K.Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
 [16] I. Hubara, M. Courbariaux, D. Soudry, R. ElYaniv, and Y. Bengio. Binarized neural networks. In NIPS, 2016.
 [17] I. Hubara, M. Courbariaux, D. Soudry, R. ElYaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. J. Mach. Learn. Res., 18(187), 2018.
 [18] A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. Van Gool. AI benchmark: running deep neural networks on android smartphones. In ECCV, 2018.
 [19] A. Ignatov, R. Timofte, A. Kulik, S. Yang, K. Wang, F. Baum, M. Wu, L. Xu, and L. Van Gool. AI benchmark: all about deep learning on smartphones in 2019. In IEEE/CVF ICCVW, 2019.
 [20] D. Ignatov and A. Ignatov. Decision stream: cultivating deep decision trees. In IEEE ICTAI, 2017.
 [21] D.Yu. Ignatov. Functional heterogeneity of human neutrophils and their role in peripheral blood leukocyte quantity regulation (PhD). Donetsk National Medical University, 2012.
 [22] N. Japkowicz and M. Shah. Evaluating learning algorithms: a classification perspective. 2011.
 [23] H. Jing and Y. Tsao. Sparse maximum entropy deep belief nets. In IJCNN, 2013.

 [24] V. Kazakov, I. Kuznetsov, I. Gerasimov, and D. Ignatov. An informational approach to the analysis of the low-frequency neuronal impulse activity in the rostral hypothalamus. Neurophysiology, 33(4):235–241, 2001.
 [25] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech report, 2009.
 [26] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 [27] S. Kullback and R.A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86, 1951.
 [28] D. Lin, S. Talathi, and S. Annapureddy. Fixed point quantization of deep convolutional networks. In ICML, 2016.
 [29] P. Lin, S.W. Fu, S.S. Wang, Y.H. Lai, and Y. Tsao. Maximum entropy learning with deep belief networks. Entropy, 18:251, 2016.
 [30] X. Lin, C. Zhao, and W. Pan. Towards accurate binary convolutional neural network. In NIPS, 2017.
 [31] G. LoaizaGanem, Y. Gao, and J.P. Cunningham. Maximum entropy flow networks. ICLR, 2017.
 [32] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A.Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, 2011.
 [33] M. Norouzi, S. Bengio, Z. Chen, N. Jaitly, M. Schuster, Y. Wu, and D. Schuurmans. Reward augmented maximum likelihood for neural structured prediction. In NIPS, 2016.
 [34] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
 [35] J.R. Quinlan. C4.5: Programs for machine learning. 1993.
 [36] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNORNet: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.

 [37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Zh. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [38] C.E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 1948.
 [39] M. Shen and J.P. How. Active perception in adversarial scenarios using maximum entropy deep reinforcement learning. arXiv preprint arXiv:1902.05644, 2019.
 [40] Y.V. Shkvarko, J.A. Lopez, and S.R. Santos. Maximum entropy neural networks for feature enhanced imaging with collaborative microwave multisensor data fusion. In IEEE LAMC, 2016.
 [41] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
 [42] Ch. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
 [43] Z. Tang, X. Peng, K. Li, and D.N. Metaxas. Towards efficient U-Nets: a coupled and quantized approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
 [44] M.D. Richard and R.P. Lippmann. Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation, 3(4):461–483, 1991.
 [45] R. Xiong, J. Cao, and Q. Yu. Reinforcement learning-based real-time power management for hybrid energy storage system in the plug-in hybrid electric vehicle. Applied Energy, 211:538–548, 2018.
 [46] B.D. Ziebart, A.L. Maas, J.A. Bagnell, and A.K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.
 [47] R. Zhao, Y. Hu, J. Dotzel, C. De Sa, and Z. Zhang. Improving neural network quantization without retraining using outlier channel splitting. arXiv preprint arXiv:1901.09504, 2019.
 [48] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. DoReFaNet: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.