With the rapid evolution of mobile devices from simple tools for voice and data transfer into multifunctional intelligent gadgets performing deep learning tasks such as image classification, object detection, and natural language processing, there is a substantial need for fast, power-efficient, and robust neural networks. One promising approach to this problem is reducing the bitwidth of weights, which in the extreme case leads to binary networks, where the commonly used floating-point weights are replaced with binary ones. Such a reduction results in up to 32 times smaller network size and more efficient inference on CPUs and specialized hardware [18, 19]. However, binarization also leads to a substantial accuracy drop caused by the reduced information capacity of the convolutional filters. The latter can be measured by the Shannon information entropy (H) according to the basic principles of information theory.
In this paper, we propose a novel solution to the restricted performance of binary networks: controlling and stabilizing their information capacity. The proposed approach ensures a better representation of information in the convolutional filters, which leads to higher overall prediction accuracy.
Our main contributions are as follows:
-  For the first time, we propose the concept of information capacity regularization in convolutional neural networks.
-  A new regularization technique is developed: a Shannon entropy based information loss penalty for binary convolutional neural networks.
-  We show that maintaining an upper-level value of information entropy in the convolutional filters during the training process leads to higher prediction accuracy of binary neural networks.
2 Related Work
In this section, we review existing techniques related to binary neural networks and the application of entropy-based approaches in machine learning.
2.1 Binary Neural Networks
Training and inference with deep neural networks usually consume a large amount of computational power, which makes them hard to deploy on mobile devices. Recent network binarization approaches [5, 6, 16, 17, 36, 48] offer an efficient solution by using binary weights and fast bitwise operations that radically accelerate computations and reduce the size of the network by up to 32 times, however at the cost of lower prediction accuracy. According to the theoretical analysis, binarization of weights and quantization of activation functions can bring a 2 to 58 times speedup. Convolutions with binarized weights require only addition and subtraction operations (no multiplications), which results in a 2 times reduction of computation time, while additional binarization of activations and the use of XNOR and POPCOUNT operations can provide up to a 58 times speedup.
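The source of this speedup can be illustrated with a minimal pure-Python sketch (ours, not code from the cited works): for vectors with entries in {-1, +1} packed as bit masks, a dot product reduces to one XNOR and one population count.

```python
def encode(values):
    """Pack a list of {-1, +1} values into an integer bit mask (+1 -> bit 1)."""
    mask = 0
    for i, v in enumerate(values):
        if v == +1:
            mask |= 1 << i
    return mask

def binary_dot(w_bits, x_bits, n):
    """Dot product of two {-1, +1} vectors of length n via XNOR + POPCOUNT.

    Matching positions (both +1 or both -1) contribute +1, mismatches -1,
    so dot = matches - mismatches = 2 * popcount(xnor) - n.
    """
    xnor = ~(w_bits ^ x_bits) & ((1 << n) - 1)  # XNOR, masked to n bits
    return 2 * bin(xnor).count("1") - n

w = [+1, -1, -1, +1]
x = [+1, +1, -1, -1]
assert binary_dot(encode(w), encode(x), 4) == sum(a * b for a, b in zip(w, x))
```

On real hardware, the popcount maps to a single instruction, which is where the large speedup over floating-point multiply-accumulate comes from.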
In [5, 6, 17], the authors proposed constraining weights and activations to binary values and using efficient XNOR and POPCOUNT operations instead of matrix multiplications. Built on a similar idea, XNOR-Net introduces a channel-wise scaling factor to improve the approximation of full-precision weights and stores weights between layers in real-valued variables. Replacing the channel-wise scaling factor with a single scalar for all filters of each layer and focusing on different bitwidths of weights and activations is the main idea of DoReFa-Net, which shows the effectiveness of a binary network with 4-bit activations. The quantization of activations in binarized U-Net with varied bitwidth reduces memory consumption by up to 4 times without compromising performance. An ambitious attempt to use only binary weights is made in ABC-Net, which substantially reduces the drop in top-1 accuracy on the ImageNet dataset. By providing 3 to 5 binary weight bases to approximate full-precision weights, ABC-Net demonstrates an increased information capacity, but also an enlarged size and complexity.
While the previously proposed approaches achieve a significantly reduced computational load and smaller network size due to the restricted weight bitwidth, this also leads to a substantial drop in prediction accuracy. Finding a way to accurately train binary neural networks still remains an unsolved task. One natural way to approach this problem is to measure and control the quantity of information in the binary network, which can be done, for example, with a classical measure of information quantity: the Shannon information entropy.
2.2 Entropy-Based Learning Methods
Shannon information entropy (H) is defined as the expected amount of information in a random variable and is widely used, together with its successors, as a measure of information value in different fields, including chemistry, medicine [9, 21, 24], robotics, and machine learning. In deep belief networks, the maximum entropy learning algorithm provides better generalization capability than the maximum likelihood learning approach, ensuring a less biased distribution and a predictive model robust to over-fitting. Shannon entropy inspired the creation of the cross-entropy loss function that is commonly used for training various networks performing data classification. Another successor is the Kullback-Leibler divergence, which is employed for training deep belief networks, where maximization of the model parameter entropy ensures more efficient generalization. The maximum entropy principle is also used for maximization of the expected reward in reinforcement learning [33, 39, 45], to define the probability of a trajectory, and to choose a near-optimal expert policy [2, 46].
According to the principle of maximum entropy, better generalization is achieved when, in estimation problems with incomplete knowledge, the input data distribution is as uniform and unbiased as possible. Thus, increasing the entropy of the input data by adding random noise improves robustness to over-fitting on small data sets [1, 22]. Multi-sensor data fusion with entropy maximization also enables effective image reconstruction. The maximum entropy principle makes it possible to extract the learned features from the input data and provides better prediction accuracy for deep neural networks. High information quantity, in the form of the highest homogeneity of samples in child nodes, is maintained by the entropy-based information gain ratio and Gini impurity measures in such nonparametric supervised learning methods as decision trees and decision streams.
Entropy maximization methods are used to train generative networks for texture synthesis and for the reconstruction of macroeconomic models. It has been shown that penalizing the entropy of the network's output distribution improves exploration in reinforcement learning, acts as a strong regularizer in supervised learning, and improves medical image segmentation. In this paper, we go further and propose a new regularization approach for controlling the information entropy of the weight distribution in the convolutional filters of binary networks.
3 Information Loss Penalty
In this section, we introduce an approach for controlling the information capacity of binary neural networks with a Shannon entropy based loss penalty. First, we describe the calculation of the entropy for binary convolutional filters, and then present the information loss penalty for networks with binary weights backed by real-valued variables.
The sum of the absolute values of the weights and the difference between the quantities of positive (N_+) and negative (N_-) values are defined for every convolutional filter with weight tensor W as:

S = \sum_{i=1}^{n} |w_i|, \qquad D = N_+ - N_- = \sum_{i=1}^{n} w_i,

where n denotes the size of tensor W. These values are used to calculate the probability density functions for positive and negative weights in the filter according to:

p_+ = \frac{S + D}{2S}, \qquad p_- = \frac{S - D}{2S}.

The Shannon information entropy of the binary weight distribution in a convolutional filter is then computed using the above density functions p_+ and p_- as:

H(W) = -\,p_+ \log_2 p_+ - p_- \log_2 p_-.

The mean information entropy for all convolutional filters in a network with binary weights can be obtained with:

H_{mean} = \frac{1}{K} \sum_{k=1}^{K} H(W_k^b),

where K denotes the total number of filters, and W_k^b is the tensor with binary weights corresponding to filter k. Computing H_{mean} for conventional deep neural network architectures takes time linear in the total number of weights, since every weight is visited once.
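As an illustration (a pure-Python sketch of the quantities just defined, not the authors' code), the per-filter entropy and the network-wide mean can be computed as follows:

```python
import math

def filter_entropy(weights):
    """Shannon entropy of a filter's binary weight distribution.

    S is the sum of absolute weight values, D the difference between the
    numbers of positive and negative weights, giving the probabilities
    p+ = (S + D) / (2S) and p- = (S - D) / (2S).
    """
    S = sum(abs(w) for w in weights)
    D = sum(weights)  # equals N+ - N- for weights in {-1, +1}
    entropy = 0.0
    for p in ((S + D) / (2 * S), (S - D) / (2 * S)):
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy

def mean_entropy(filters):
    """Mean entropy over all K filters: one pass over every weight."""
    return sum(filter_entropy(f) for f in filters) / len(filters)

balanced = [+1, -1, +1, -1]   # p+ = p- = 0.5  ->  H = 1 (maximum)
skewed = [+1, +1, +1, -1]     # p+ = 0.75      ->  H < 1
assert filter_entropy(balanced) == 1.0
assert filter_entropy(skewed) < filter_entropy(balanced)
```

The maximum H = 1 is reached exactly when a filter contains as many positive as negative weights, which is the intuition behind targeting a high entropy level during training.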
Following prior work, to binarize a floating-point network we constrain the weights to +1 and -1, which is advantageous from a hardware perspective. To transform the real-valued variables into discrete values, we use the deterministic binarization function:

w^b = \mathrm{sign}(w) = \begin{cases} +1, & \text{if } w \ge 0, \\ -1, & \text{otherwise}, \end{cases}

where the calculated binary weights w^b are used in both the forward and backward passes of the network computations, while the original real-valued weights are utilized in the update phase of the training procedure.
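This forward/update split can be illustrated with a minimal pure-Python sketch (ours, with a hand-written SGD step standing in for the real optimizer and hypothetical gradient values):

```python
def binarize(w_real):
    """Deterministic binarization: +1 for w >= 0, -1 otherwise."""
    return [1.0 if w >= 0 else -1.0 for w in w_real]

def update_real_weights(w_real, grads, lr=0.1):
    """Update phase: gradients computed with the binary weights are applied
    to the stored real-valued weights, not to the binary ones."""
    return [w - lr * g for w, g in zip(w_real, grads)]

w_real = [0.30, -0.70, 0.05, -0.20]
w_bin = binarize(w_real)  # used in the forward and backward passes
w_real = update_real_weights(w_real, grads=[0.5, -0.5, 0.9, -0.5])
# Small updates leave most binary weights unchanged; a sign flip occurs
# only when a real-valued weight crosses zero (here, the third one).
assert binarize(w_real) == [1.0, -1.0, -1.0, -1.0]
```

Keeping the latent real-valued weights is what lets many small gradient steps accumulate before a binary weight flips sign.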
The previously introduced information entropy (Eq. 6) was defined only for binary weights, while the network optimization is performed on the floating-point variables. Although the binary weights could be substituted with the output of the sign function, this would make the loss function non-differentiable. To avoid this problem and keep the loss differentiable, we approximate the binarization function with a smooth surrogate:

\tilde{w}^b = \tanh(\alpha w),

where the value of the steepness parameter \alpha was chosen based on the experimental results. Using this approximation of the binary weights, we can estimate the information loss as the absolute difference between the expected information entropy H_e and the actual one \tilde{H}_{mean}:

L_I = \left| H_e - \tilde{H}_{mean} \right|,

where H_e provides control over the network information capacity, and its largest value for a binary distribution (H_e = 1) corresponds to the maximum information capacity. The resulting information loss penalty is added to the overall loss function in the conventional way:

L = L_0 + \gamma L_I,

where L_0 is the original task loss and \gamma is the penalty weight; L_I is then used in the back-propagation algorithm for network training.
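Putting the pieces together, the penalized objective can be sketched in pure Python (our illustration; the tanh surrogate steepness alpha = 5.0, the penalty weight, and the example values are assumptions, not the paper's exact choices):

```python
import math

def soft_entropy(w_real, alpha=5.0):
    """Differentiable filter entropy: replace sign(w) with tanh(alpha * w),
    then apply the same S/D entropy formula as for binary weights.
    NOTE: alpha = 5.0 is an assumed value for illustration."""
    w_soft = [math.tanh(alpha * w) for w in w_real]
    S = sum(abs(w) for w in w_soft)
    D = sum(w_soft)
    entropy = 0.0
    for p in ((S + D) / (2 * S), (S - D) / (2 * S)):
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy

def total_loss(task_loss, filters_real, target_entropy, gamma):
    """Overall objective: task loss plus gamma * |H_e - mean soft entropy|."""
    h_mean = sum(soft_entropy(f) for f in filters_real) / len(filters_real)
    return task_loss + gamma * abs(target_entropy - h_mean)

# A balanced filter already matches the target H_e = 1, so no penalty is added;
# a skewed filter has entropy below 1 and therefore incurs a penalty.
filters = [[0.5, -0.5, 1.2, -1.2]]
assert abs(total_loss(0.25, filters, target_entropy=1.0, gamma=0.1) - 0.25) < 1e-9
```

Because tanh is smooth, the penalty contributes well-defined gradients with respect to the real-valued weights during back-propagation.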
4 Empirical Verification
To evaluate the performance of the information loss penalty and to explore its properties, we conducted several experiments with binary networks on common machine learning problems. Below, we examine the impact of the information loss penalty on the network's performance and information capacity, and compare the obtained accuracy values (mean and 95% confidence interval) with the results of standard binary and full-precision networks.
4.2 Training Process
All experiments were performed on Nvidia GK110B GPUs for both full-precision and binary networks. The models were trained to minimize the cross-entropy loss function, additionally augmented with the information loss penalty (Eq. 9) in the corresponding experiments. All networks were trained using stochastic gradient descent. On the SVHN and CIFAR datasets, the networks were trained for 40 and 300 epochs, respectively, using batch sizes of 64 and 128 for full-precision and binary networks. The initial learning rate was set to 0.1 and was divided by 10 at 50% and 75% of the total number of training epochs. On the ImageNet dataset, the images were resized to 224x224 px, and the models were trained for 90 epochs with a batch size of 256. The learning rate was initially set to 0.1 and was decreased by 10 times at epochs 30 and 60. Weight decay and a Nesterov momentum of 0.9 were used without damping. We utilized the DoReFa-Net binarization technique for weight quantization, adapting its implementation (https://github.com/zzzxxxttt/Pytorch_DoReFaNet) for our tasks and keeping the first and last layers full-precision [4, 16, 48]. A 4-bit uniform quantization is applied to activations starting from the outputs of the first layer.
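For reference, the learning-rate decay described above can be expressed as a small helper (our sketch, not the training code used in the experiments):

```python
def learning_rate(epoch, total_epochs, base_lr=0.1):
    """SVHN/CIFAR schedule: divide the learning rate by 10 once 50% and
    again once 75% of the training epochs have passed."""
    lr = base_lr
    if epoch >= total_epochs // 2:
        lr /= 10
    if epoch >= 3 * total_epochs // 4:
        lr /= 10
    return lr

def imagenet_learning_rate(epoch, base_lr=0.1):
    """ImageNet schedule: decrease the rate by 10x at epochs 30 and 60."""
    return base_lr / 10 ** sum(epoch >= m for m in (30, 60))
```

In PyTorch this corresponds to a milestone-based step decay applied on top of plain SGD.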
To ensure a fair comparison between full-precision networks, binary networks, and binary networks with the information loss penalty, we used identical experimental settings in all three cases, including data preprocessing and optimization settings. For the experiments, we used five publicly available network architectures: for ImageNet classification, the PyTorch (v. 1.0, CUDA v. 10.0; https://pytorch.org) implementations of DenseNet-121, ResNet-18, Inception v3, and AlexNet from the TorchVision package; for the other datasets, pre-activation ResNet-18 (https://github.com/kuangliu/pytorch-cifar). The auxiliary classifier in the binary Inception network was kept full-precision. Binarization leads to a 32x compression of the hidden layers and a commensurate reduction of the model memory footprint (Table 1).
|Model||Output size||# parameters||Memory footprint, MB|
The first and the last layers are full-precision.
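The footprint figures of this kind follow from simple bit accounting: 1 bit per binarized hidden-layer weight, 32 bits per weight in the full-precision first and last layers. A sketch (ours; the layer sizes below are hypothetical, not those of Table 1):

```python
def footprint_mb(param_counts, binary=True):
    """Approximate model size in MB: hidden layers at 1 bit per weight when
    binarized, first and last layers always at 32-bit full precision."""
    total_bits = 0
    for i, n in enumerate(param_counts):
        external = i in (0, len(param_counts) - 1)
        total_bits += n * (32 if external or not binary else 1)
    return total_bits / 8 / 2**20

layers = [9_408, 2_359_296, 4_718_592, 512_000]  # hypothetical layer sizes
assert footprint_mb(layers, binary=True) < footprint_mb(layers, binary=False)
```

The hidden layers shrink by 32x, so the remaining footprint of a binarized model is dominated by the full-precision external layers.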
During the training process, using the additional information loss penalty resulted in a computational overhead of 6.3 ± 0.8%.
4.3 Classification Results
|Full-precision||—||96.45 ± 0.11||95.23 ± 0.20||76.78 ± 0.19||76.4||69.3||78.8||56.6|
|Binary||—||96.10 ± 0.08||93.43 ± 0.19||73.07 ± 0.18||65.12||59.26||72.61||53.36|
|Binary + penalty||1.00||96.14 ± 0.08||93.51 ± 0.19||73.20 ± 0.17||—||—||—||—|
First, we conducted a series of preliminary experiments aimed at determining the optimal values of the penalty weight and the expected information entropy, where the accuracy of the pre-activation ResNet-18 binary network was assessed on the validation subsets of the SVHN, CIFAR-10, and CIFAR-100 datasets. In the first set of experiments, the value of the expected information entropy was kept constant and equal to its maximum value, while the penalty weight was varied over several orders of magnitude with an exponent step size of n = 1 to determine its optimal value.
In the second set of experiments, we fixed the obtained penalty weight and varied the target entropy between 0.8 and 1 with a step size of 0.01. The results demonstrated that the optimal value of the target entropy is 0.97, and in the subsequent experiments we used the obtained values of both parameters to produce the final results. We also trained the pre-activation ResNet-18 network with the highest target entropy to analyze its performance in this case.
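This two-stage search can be sketched generically (ours; `evaluate`, the candidate grids, and the peak locations below are hypothetical stand-ins for the actual training runs):

```python
def two_stage_search(evaluate, gamma_grid, entropy_grid, h_max=1.0):
    """Stage 1: pick the penalty weight with the target entropy fixed at its
    maximum. Stage 2: pick the target entropy with that weight fixed."""
    best_gamma = max(gamma_grid, key=lambda g: evaluate(g, h_max))
    best_entropy = max(entropy_grid, key=lambda h: evaluate(best_gamma, h))
    return best_gamma, best_entropy

# Toy stand-in for "train the network and return validation accuracy",
# with an assumed peak at (0.01, 0.97) purely for illustration.
evaluate = lambda gamma, h_e: -(gamma - 0.01) ** 2 - (h_e - 0.97) ** 2
gammas = [10 ** -n for n in range(7)]                        # exponent step n = 1
entropies = [round(0.80 + 0.01 * i, 2) for i in range(21)]   # 0.80 .. 1.00
assert two_stage_search(evaluate, gammas, entropies) == (0.01, 0.97)
```

Splitting the search into two one-dimensional sweeps keeps the number of training runs linear in the grid sizes rather than quadratic.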
Table 2 summarizes the classification results obtained on the SVHN, CIFAR-10, CIFAR-100, and ImageNet 2012 datasets with full-precision networks and binary networks trained with common parameters and with the additional information loss penalty. On the SVHN and CIFAR datasets, the networks with binary weights show an accuracy drop of 0.35-3% compared to their full-precision versions, which corresponds to the results of [8, 17, 48]. Augmenting the target loss with the information loss penalty while aiming at the highest possible entropy value leads only to a slight accuracy improvement of 0.04-0.13% in the case of the pre-activation ResNet model. Lowering the target entropy to 0.97 resulted in larger accuracy improvements: 0.17%, 0.39%, and 0.41% for pre-activation ResNet on the SVHN, CIFAR-10, and CIFAR-100 datasets, respectively. This effect can be explained as follows: while the largest entropy theoretically yields the highest information capacity, it also forces the network to use the highest variety of weights, thereby preventing it from learning different modifications of the same filter targeted at different patterns, which can be crucial in many classification tasks.
On the ImageNet classification problem, our PyTorch implementations of the binary ResNet-18 and AlexNet models provided the same state-of-the-art accuracy (Table 2) as the DoReFa-Nets with 4-bit activations: 59.2% and 53.0%, respectively. Our vanilla binary versions with full-precision external layers outperform 4-bit DenseNet-121 (4.1 MB) and 6-bit Inception v3 (20.4 MB) with 8-bit activations, which show accuracies of 63.0% and 72.5%, respectively. Compared to DenseNet-45 (7.4 MB) with binary weights and activations and an accuracy of 63.7%, we provide a more precise yet 5.7x smaller DenseNet-121.
By adding the information loss penalty with the obtained optimal parameter values, the top-1 classification accuracy was improved to 67.14%, 61.36%, 73.83%, and 54.07% with the DenseNet, ResNet, Inception, and AlexNet models, respectively. While this performance still leaves room for improvement compared to full-precision models, it shows a significant advantage over previously proposed attempts to improve the accuracy of binarized networks with a fixed architecture and 4-bit activations on the ImageNet dataset.
Figure 1 demonstrates that the information loss penalty does not significantly alter the training curve, though it consistently yields a higher accuracy throughout training. The reason for this is the stabilization of the information entropy in the binary network, reflected by the similar distributions of the real-valued and binary weights (Fig. 2) in the network's filters. As one can see, a higher level of information entropy corresponds not only to a lower skewness, but also to smaller absolute weight values, which shows a similarity between the information loss penalty and classical regularization methods such as the L1 and L2 norms.
5 Conclusion
In this paper, we report the first attempt to control the information capacity of deep binary networks with a new regularization technique: the information loss penalty. By maintaining an appropriate level of information entropy in every convolutional filter and providing a stable information capacity for the entire binary network, the proposed algorithm improves generalization ability and classification accuracy. The main parameter of the algorithm, the expected information entropy, makes it possible to determine the optimal information capacity level and force the network to maintain this level throughout the entire learning process. By controlling the information entropy of the binary network, we outperformed the existing state-of-the-art binary solutions using 4-bit activation functions and got closer to the prediction accuracy of their 32-bit counterparts on the SVHN, CIFAR, and ImageNet datasets.
Acknowledgements
We would like to thank Dr. Alexander Nikolaevich Filippov from the Russian Research Center of Huawei Technologies for insightful discussions.
References
-  A. Golan, G. Judge, and D. Miller. Maximum entropy econometrics: robust estimation with limited data. 1996.
-  J. Audiffren, M. Valko, A. Lazaric, and M. Ghavamzadeh. Maximum entropy semi-supervised inverse reinforcement learning. In IJCAI, 2015.
-  J. Bethge, H. Yang, M. Bornstein, and C. Meinel. BinaryDenseNet: developing an architecture for binary neural networks. In ICCV workshop, 2019.
-  A. Bulat and G. Tzimiropoulos. Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In ICCV, 2017.
-  M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: training deep neural networks with binary weights during propagations. In NIPS, 2015.
-  M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
-  A. Finnegan and J.S. Song. Maximum entropy methods for extracting the learned features of deep neural networks. PLoS Comput. Biol., 13(10), 2017.
-  J. Fromm, S. Patel, and M. Philipose. Heterogeneous bitwidth binarization in convolutional neural network. In NIPS, 2018.
-  I. Gerasimov and D. Ignatov. Effect of lower body negative pressure on the heart rate variability. Human Physiology. 31(4):421–424, 2005.
-  S. Gross and M. Wilber. Training and investigating residual nets, 2016.
-  A. Hazan. A maximum entropy network reconstruction of macroeconomic models. Physica A: Statistical Mechanics and its Applications, 519, pp. 1-17, 2019.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
-  X. Hu, J. Tao, Z. Ye, B. Qiu, and J. Xu. An improved wavelet neural network medical image segmentation algorithm with combined maximum entropy. AIP, 2018.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K.Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
-  I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In NIPS, 2016.
-  I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. J. Mach. Learn. Res., 18(187), 2018.
-  A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. Van Gool. AI benchmark: running deep neural networks on android smartphones. In ECCV, 2018.
-  A. Ignatov, R. Timofte, A. Kulik, S. Yang, K. Wang, F. Baum, M. Wu, L. Xu, and L. Van Gool. AI benchmark: all about deep learning on smartphones in 2019. In IEEE/CVF ICCVW, 2019.
-  D. Ignatov and A. Ignatov. Decision stream: cultivating deep decision trees. In IEEE ICTAI, 2017.
-  D.Yu. Ignatov. Functional heterogeneity of human neutrophils and their role in peripheral blood leukocyte quantity regulation (PhD). Donetsk National Medical University, 2012.
-  N. Japkowicz and M. Shah. Evaluating learning algorithms: a classification perspective. 2011.
-  H. Jing and Y. Tsao. Sparse maximum entropy deep belief nets. In IJCNN, 2013.
-  V. Kazakov, I. Kuznetsov, I. Gerasimov, and D. Ignatov. An informational approach to the analysis of the low-frequency neuronal impulse activity in the rostral hypothalamus. Neurophysiology, 33(4):235-241, 2001.
-  A. Krizhevsky. Learning multiple layers of features from tiny images. Tech Report, 2009.
-  A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
-  S. Kullback and R.A. Leibler. On information and sufficiency. Annals of Mathematical Statistics. 22(1):79-86, 1951.
-  D. Lin, S. Talathi, and S. Annapureddy. Fixed point quantization of deep convolutional networks. In ICML, 2016.
-  P. Lin, S.-W. Fu, S.-S. Wang, Y.-H. Lai, and Y. Tsao. Maximum entropy learning with deep belief networks. Entropy, 18:251, 2016.
-  X. Lin, C. Zhao, and W. Pan. Towards accurate binary convolutional neural network. In NIPS, 2017.
-  G. Loaiza-Ganem, Y. Gao, and J.P. Cunningham. Maximum entropy flow networks. ICLR, 2017.
-  Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A.Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, 2011.
-  M. Norouzi, S. Bengio, Z. Chen, N. Jaitly, M. Schuster, Y. Wu, and D. Schuurmans. Reward augmented maximum likelihood for neural structured prediction. In NIPS, 2016.
-  G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
-  J.R. Quinlan. C4.5: Programs for machine learning. 1993.
-  M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Zh. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
-  C.E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423, 1948.
-  M. Shen and J.P. How. Active perception in adversarial scenarios using maximum entropy deep reinforcement learning. arXiv preprint arXiv:1902.05644, 2019.
-  Y.V. Shkvarko, J.A. Lopez, and S.R. Santos. Maximum entropy neural networks for feature enhanced imaging with collaborative microwave multi-sensor data fusion. In IEEE LAMC, 2016.
-  I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
-  Ch. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
-  Z. Tang, X. Peng, K. Li, and D.N. Metaxas. Towards efficient U-Nets: a coupled and quantized approach. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
-  M.D. Richard and R.P. Lippmann. Neural network classifiers estimate Bayesian a posteriori probabilities. Neural computation, 3(4):461-483, 1991.
-  R. Xiong, J. Cao, and Q. Yu. Reinforcement learning-based real-time power management for hybrid energy storage system in the plug-in hybrid electric vehicle. Applied Energy, 211:538-548, 2018.
-  B.D. Ziebart, A.L. Maas, J.A. Bagnell, and A.K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.
-  R. Zhao, Y. Hu, J. Dotzel, C. De Sa, and Z. Zhang. Improving neural network quantization without retraining using outlier channel splitting. arXiv preprint arXiv:1901.09504, 2019.
-  S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.