1 Introduction
Over the past few years, Artificial Intelligence (AI) utilizing Deep Neural Networks (DNNs), especially Convolutional Neural Networks (CNNs), has shown great potential on specific tasks in computer vision, including but not limited to classification krizhevsky2012imagenet; VeryDeepConvolutional; 7298594; wang2019dynamic; 9027877, detection DBLP:journals/corr/GirshickDDM13; DBLP:journals/corr/Girshick15; DBLP:journals/corr/abs190402701; NIPS2015_5638; Li_2019_CVPR, and segmentation Everingham:2010:PVO:1747084.1747104; Zhuang_2019_CVPR. However, deep CNNs usually require a large number of parameters and high computational complexity to reach high accuracy. A great deal of memory and computing power is therefore required to run accurate CNNs, which significantly limits their deployment on lightweight devices such as low-power chips and embedded devices. Fortunately, Binary Neural Networks (BNNs) can achieve efficient inference and a small memory footprint by utilizing high-performance instructions, including XNOR, Bitcount, and Shift, that most low-power devices support Dong_2019_ICCV; Morozov_2019_ICCV; Ajanthan_2019_ICCV; Jung_2019_CVPR; Yang_2019_CVPR; Wang_2019_CVPR; Cao_2019_CVPR; Nagel_2019_ICCV; qin2020bipointnet. Despite this huge speed advantage, existing binary neural networks still suffer a large drop in accuracy compared with their full-precision counterparts DBLP:journals/ijcv/LiuDHZLGD21; DBLP:journals/ijcv/LiuLWYLC20; DBLP:journals/ijcv/SongHGXHS20; DBLP:journals/ijcv/DongNLCSZ19. The reasons for the accuracy drop mainly lie in two aspects.
On the one hand, the limited representation capability and discreteness of binarized parameters lead to significant information loss in the forward propagation. When 32-bit parameters are binarized to 1-bit, the diversity of the neural network model drops sharply, which has been shown to be a key factor in the accuracy drop of BNNs diverse. To increase diversity, some works introduce additional operations. For example, ABC-Net ABCNet utilizes multiple binary bases for more representation levels, and WRPN mishra2018wrpn devises wider networks for more parameters. The Bi-Real Net proposed in Liu_2018_ECCV adds a full-precision shortcut to the binarized activations to improve feature diversity, which also greatly improves BNNs. However, due to speed and memory limits, any extra floating-point calculation or parameter increase greatly harms practical deployment on edge hardware such as Raspberry Pi and BeagleBone kruger2014benchmarking. Therefore, it remains a great challenge for BNNs to achieve high accuracy while staying deployable on lightweight devices.
On the other hand, accurate gradients supply correct information for network optimization in the backward propagation. During the training of BNNs, however, discrete binarization inevitably causes inaccurate gradients and hence wrong optimization directions. To deal with the discreteness, different approximations of binarization for the backward propagation have been proposed DBLP:conf/cvpr/CaiHSV17; Liu_2018_ECCV; BNN+; selfBN; ImprovedTraining, which mainly fall into two categories: improving the updating capability, and reducing the mismatching area between the sign function and its approximation. However, the difference between the early and later training stages is usually ignored. In fact, a powerful updating capability is highly required at the beginning of training, while small gradient errors become more important at the end. Moreover, some works extremely decrease the gap between the sign function and the estimator in a certain period of training, while our study shows that ensuring a suitable portion of parameters can be updated throughout the whole training process is better for BNN optimization. Specifically, when the estimator of a BNN closely approximates the sign function, the gradient error between them is small, but the gradient values in the BNN are almost all zero and the BNN can hardly be updated, which is called "saturation" Regularizeactdistribution. Therefore, methods devoted to extremely decreasing the gradient error may seriously harm the parameter updating capability.
To address the above issues, we study network binarization from the information flow perspective and propose a novel Distribution-sensitive Information Retention Network (DIRNet) (see the overview in Fig. 1). We train BNNs with high accuracy by retaining information during both forward and backward propagation: (1) DIRNet introduces a novel binarization approach named Information Maximized Binarization (IMB) in the forward propagation, which balances and standardizes the weight distribution before binarizing. With IMB, we minimize the information loss in the forward propagation by maximizing the information entropy of the quantized parameters and minimizing the quantization error. Moreover, IMB is conducted offline and thus brings no extra time cost during inference. (2) The Distribution-sensitive Two-stage Estimator (DTE) is devised to compute gradients in the backward propagation, minimizing multiple types of information loss by approximating the sign function. The shape change of the DTE is distribution-sensitive, which yields accurate gradients and, more importantly, ensures that enough parameters are always updated throughout the training process.
Note that we extend our prior conference publication IRNet, which mainly concentrates on a binary neural network method. This paper further studies the information loss of BNNs from both mathematical and empirical perspectives to understand the forward and backward propagation of BNNs more deeply. Existing works lack analysis and comprehension of the information loss in binarization, and manual or fixed strategies are usually applied to BNNs while significant information loss still exists. Therefore, compared with the conference version, this manuscript comprehensively studies the information loss problem in binarization, presents new distribution-sensitive improvements to the BNN, and compares the proposed method with more SOTA methods on more architectures. Specifically, first, we present a more in-depth analysis of the information loss in the forward and backward propagation in Sec. 4. For the forward propagation, we provide a mathematical study of the effect of binarization errors at the global level, which further clarifies the motivation of the error minimization in our IMB. For the backward propagation, we show that the changes of the weight distribution during BNN training may limit the updating capability of BNNs with soft estimators. Second, we propose a novel DIRNet with the distribution-sensitive estimator DTE, which improves the backward propagation process. Instead of changing the shape of the estimator with a fixed strategy as in IRNet IRNet, DIRNet adjusts the shape of the estimator according to the distribution of weights/activations in the backward propagation to retain the information of accurate gradients and the updating capability of the BNN. Third, we add detailed ablation experiments in Sec. 5 to verify the effectiveness of the techniques in DIRNet (Table 4), and also evaluate the impact of binarization errors (Table 2), the clipping interval of DTE (Table 3), and the parameter information entropy (Fig. 6).
Fourth, we compare the proposed DIRNet with more SOTA binarization approaches in Table 6 (BONN gu2019bayesian, Si-BNN wang2020sparsity, PCNN DBLP:journals/corr/abs181112755, Real-to-Bin martinez2020training, MeliusNet bethge2021meliusnet, and ReActNet liu2020reactnet) and evaluate it on compact networks (EfficientNet tan2019efficientnet, MobileNet mobilenet, and DARTS liu2018darts). The results show that our DIRNet is versatile and effective, and improves the performance of these structures. Moreover, we discuss more of the latest related work on network compression and quantization he2021generative; wang2020towards; phan2020binarizing; chen2020binarized; DBLP:journals/ijcv/LiuLWYLC20; Liu_2019_CVPR; DBLP:journals/ijcv/LiuDHZLGD21 to reflect the characteristics and advantages of our DIRNet.
This work provides a novel and practical view to explain how BNNs work. In addition to its strong capability of retaining information in the forward and backward propagation, DIRNet has excellent versatility: it can be extended to various BNN architectures and trained via the standard training pipeline. We test DIRNet on classification tasks with the CIFAR-10 and ImageNet datasets. The results indicate that DIRNet performs extremely well on a variety of structures, including ResNet-20, VGG-Small, ResNet-18/34, EfficientNet, MobileNet, and DARTS, greatly exceeding other binarization approaches. To validate the performance of DIRNet on low-power devices, we implement it on Raspberry Pi, where it achieves outstanding efficiency.
In summary, our main contributions are listed as follows:

We propose the simple yet efficient Distribution-sensitive Information Retention Network (DIRNet), which improves BNNs by retaining information during the training process. Compared with existing fixed-strategy estimators, the DTE estimator of DIRNet ensures sufficient updating capability and improves the accuracy of BNNs.

We measure the amount of information in binarized parameters by information entropy and present an in-depth analysis of the effects of information loss and binarization error in BNNs.

We investigate both forward and backward processes of binary networks from the unified information perspective, which provides new insight into the mechanism of network binarization.

Experiments demonstrate that our method significantly outperforms other state-of-the-art (SOTA) methods in both accuracy and practicality on mainstream and compact architectures. We further show that the DTE in the proposed DIRNet stably improves performance.

We implement 1-bit BNNs and evaluate their speed on real-world ARM devices; the results show that our DIRNet achieves outstanding efficiency.
The rest of this paper is organized as follows. Section 2 gives a brief review of related model binarization methods and low-power devices. Section 3 describes the preliminaries of binary neural networks. Section 4 describes the proposed approach, its formulation, implementation, and discussion in detail. Section 5 provides the experiments conducted for this work, model analysis, and comparisons with other SOTA methods. Section 6 concludes the study.
2 Related Work
Recently, resource-limited embedded devices have attracted researchers in artificial intelligence with their low power consumption, tiny size, and high practicality, which significantly promotes the application of AI technology. However, SOTA neural network models rely on massive parameters and large model sizes to achieve good performance on different tasks, which causes significant computation and resource consumption. To compress and accelerate deep CNNs, many approaches have been proposed, which can be classified into five categories: transferred/compact convolutional filters shufflenet; yu2017on; DBLP:journals/pami/WangXXT19; quantization/binarization DBLP:conf/eccv/HuLWZC18; DBLP:conf/nips/ChenWP19; Wu2020Rotation; zhu2019unified; knowledge distillation chen2018darkrank; zagoruyko2017paying; DBLP:journals/pr/DingCH19; pruning han2016deep; he2017channel; ge2017Compressing; and low-rank factorization lebedev2015speeding; jaderberg2014speeding; lebedev2016fast; DBLP:journals/pr/WenZXYH18. Compared with other compression methods, model binarization can significantly reduce memory consumption. By extremely compressing the bit-width of parameters in neural networks, the convolution filters in binary neural networks achieve large memory savings. Model binarization also makes the compressed model fully compatible with the XNOR-Bitcount operation to achieve great acceleration, and these operations can achieve substantial speedup in theory xnornet. Besides, model binarization changes the architecture less than other model compression methods, which makes it easier to implement on resource-limited devices and attracts attention from researchers. By simply binarizing full-precision parameters, including weights and activations, we can achieve obvious inference acceleration and memory savings.
DBLP:journals/corr/CourbariauxB16 proposed a binarized neural network that simply binarizes weights and activations to +1 or -1, which compresses the parameters and accelerates CNNs with efficient bitwise operations. However, the binarization operation in this work caused a significant accuracy drop. Afterwards, many binarization approaches were designed to close the gap between BNNs and full-precision CNNs. XNOR-Net xnornet is one of the most classic model binarization methods; it pointed out that using a floating-point scalar for each binary filter achieves significant performance improvement. It therefore proposed a deterministic binarization method that reduces the quantization error of the output matrix by applying 32-bit scalars in each layer, at the cost of more resource-consuming floating-point multiplications and additions. TWN DBLP:journals/corr/LiL16 and TTQ DBLP:journals/corr/ZhuHMD16 utilized more quantization points to improve the representation capability of quantized neural networks. Unfortunately, bitwise operations cannot be used in these methods to accelerate the network, and the memory consumption also increases. ABC-Net ABCNet showed that approximating weights and activations with multiple binary bases can greatly improve the accuracy of BNNs, while unavoidably decreasing the compression and acceleration ratios. HWGQ DBLP:conf/cvpr/CaiHSV17 considered the quantization error from the perspective of the activation function. LQ-Nets LQNet applied a large number of learnable full-precision parameters to get better performance while increasing memory usage. martinez2020training obtained strong BNNs with a multi-step training pipeline and a well-designed objective function. Some binarization methods are devoted to reducing the gradient error caused by approximating the binarization (sign) function with a well-designed estimator in the backward propagation. BNN+ BNN+ proposed an estimator to reduce this gap and further studied various estimators to find a better solution. DSQ DSQ and IRNet IRNet creatively applied soft estimators that gradually change their shape to optimize the network. Bi-Real Net Liu_2018_ECCV introduced a novel BNN-friendly architecture with the Bi-Real shortcut to improve accuracy, and ReActNet liu2020reactnet further improved the architecture and training steps to achieve better BNN performance.
Though some progress has been made on model binarization, existing binarization approaches still cause a serious decrease in accuracy compared with 32-bit models. First, since existing works do not effectively measure and retain the information in BNNs, massive information loss remains a severe problem in the BNN training process. Second, existing methods focus only on minimizing the gradient error and seriously neglect the updating capability of network parameters; the trade-off between updating capability and accurate representation should be taken into account when designing estimators. Additionally, existing methods proposed to increase the accuracy of BNNs usually incur extra floating-point multiplications or additions. We thus propose DIRNet to retain the information during the training process of BNNs. Further, it eliminates the resource-consuming floating-point operations in the convolutional layers.
3 Preliminaries
The main operation in a layer of DNNs in the forward propagation can be expressed as

$$\mathbf{z} = \mathbf{w} \otimes \mathbf{a}, \qquad (1)$$

where $\otimes$ indicates the inner product operation, and $\mathbf{w}$ and $\mathbf{a}$ represent the weight tensor and the input activation tensor, respectively; $\mathbf{a}$ is the output of the previous layer. However, the large number of floating-point multiplications greatly consumes memory and computing resources, which heavily limits the application of CNNs on embedded devices. Previous work has shown that bitwise operations, including XNOR, Bitcount, and Shift, can greatly accelerate the inference of CNNs on low-power devices xnornet. Therefore, to compress and accelerate deep CNNs, binary neural networks binarize the 32-bit weights and/or activations to 1-bit. In most cases, binarization can be expressed as
$$Q_x(\mathbf{x}) = \alpha \, \mathbf{B}_x, \qquad (2)$$

where $\mathbf{x}$ indicates the 32-bit weights $\mathbf{w}$ or 32-bit activations $\mathbf{a}$, and $\mathbf{B}_x$ represents the binary weights $\mathbf{B}_w$ or binary activations $\mathbf{B}_a$. $\alpha$ represents the scalars, including $\alpha_w$ for binary weights and $\alpha_a$ for binary activations. Usually, the sign function is used to calculate $\mathbf{B}_x$:

$$\mathbf{B}_x = \operatorname{sign}(\mathbf{x}) = \begin{cases} +1, & \text{if } \mathbf{x} \ge 0 \\ -1, & \text{otherwise.} \end{cases} \qquad (3)$$
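Eqs. (2)-(3) can be sketched in a few lines of numpy. This is an illustrative example only, not the paper's IMB: the scalar here is the mean absolute value, a common choice (as in XNOR-Net) that minimizes the L2 quantization error for a fixed sign pattern.

```python
import numpy as np

def binarize(x):
    """Sketch of Eqs. (2)-(3): Q_x(x) = alpha * B_x with B_x = sign(x)."""
    B = np.where(x >= 0, 1.0, -1.0)   # sign(x), taking sign(0) = +1
    alpha = np.abs(x).mean()          # scalar minimizing ||x - alpha*B_x||^2
    return alpha, B

w = np.array([0.4, -1.2, 0.05, -0.3])
alpha, B = binarize(w)
```

Each 32-bit weight is thus stored as a single sign bit, with one shared scalar per tensor (or per channel, in many implementations).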
With the binary weights and activations, the tensor multiplication can be approximated by

$$\mathbf{z} \approx \alpha_w \alpha_a \,(\mathbf{B}_w \odot \mathbf{B}_a), \qquad (4)$$
where $\odot$ indicates the bitwise inner product of tensors, implemented by the bitwise operations XNOR and Bitcount. In addition, since the Shift operation is more hardware-friendly, some works even replace the multiplication in the inference process of BNNs with Shift, such as the Shift-based batch normalization DBLP:journals/corr/CourbariauxB16, which further accelerates BNN inference on hardware. However, the derivative of the sign function is zero almost everywhere, which is obviously incompatible with the backward propagation, since the exact gradients for activations and/or weights before discretization would be zero. Therefore, many works adopt the Straight-Through Estimator (STE) bengio2013estimating in gradient propagation, specifically the Identity or Clip function.
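To make the XNOR-Bitcount inner product of Eq. (4) concrete, the following sketch packs $\{-1,+1\}$ vectors into bitfields and recovers the exact floating-point dot product. Packing into a Python int is purely illustrative; real kernels pack signs into machine words.

```python
import numpy as np

def xnor_popcount_dot(b1, b2):
    """Inner product of two {-1,+1} vectors via XNOR + Bitcount.

    Encoding +1 -> bit 1 and -1 -> bit 0, matching positions contribute +1
    and mismatches -1, so the dot product equals 2*popcount(XNOR(x, y)) - n.
    """
    n = len(b1)
    x = sum(1 << i for i, v in enumerate(b1) if v > 0)  # pack signs into bits
    y = sum(1 << i for i, v in enumerate(b2) if v > 0)
    xnor = ~(x ^ y) & ((1 << n) - 1)                    # XNOR, masked to n bits
    return 2 * bin(xnor).count("1") - n                 # Bitcount

b1, b2 = [1, -1, 1, 1], [1, 1, -1, 1]
assert xnor_popcount_dot(b1, b2) == int(np.dot(b1, b2))
```

This equivalence is why a binary convolution reduces to bit operations: 64 multiply-accumulates collapse into one XNOR and one popcount on a 64-bit word.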
4 Distribution-sensitive Information Retention Network
In this paper, we show that severe information loss during training hinders the high accuracy of BNNs. Specifically, the information loss is mostly caused by the sign function in the forward propagation and by the approximation of gradients in the backward propagation, and it greatly limits the performance of BNNs. To address this problem, we propose a novel network, the Distribution-sensitive Information Retention Network (DIRNet), which retains information during training and delivers excellent performance for BNNs. Moreover, all convolution operations in DIRNet are replaced by hardware-friendly bitwise operations.
4.1 Information Maximized Binarization in the Forward Propagation
In the forward propagation, a BNN usually suffers from both information entropy decrease and quantization error, which causes information loss of weights and activations. To retain the information and minimize this loss, we propose Information Maximized Binarization (IMB), which jointly considers both the information loss and the quantization error.
4.1.1 Information Loss in the Forward Propagation
Due to the discretization of parameters by the binarization operation, there is a large numerical difference between the full-precision and binarized parameters, causing significant information loss. To make the representations of the binarized network closer to its full-precision counterpart, the binarization error of the BNN should be minimized. Consider the computation of a multivariate function $f(\mathbf{x})$, where $\mathbf{x} = [x_1, x_2, \dots, x_n]^\top$ denotes the full-precision variable vector. When the function $f$ represents a neural network, $\mathbf{x}$ represents the 32-bit parameters (weights $\mathbf{w}$ / activations $\mathbf{a}$). The global error caused by quantizing $\mathbf{x}$ can be expressed as

$$e(\mathbf{x}) = f(Q(\mathbf{x})) - f(\mathbf{x}), \qquad (5)$$

where $Q(\mathbf{x})$ indicates the variable vector quantized from $\mathbf{x}$. When the probability distribution of $\mathbf{x}$ is known, the error distribution and the moments of the error can be computed. For example, the minimization of the expected absolute error can be presented as

$$\min \mathbb{E}\left[\,|e(\mathbf{x})|\,\right] = \min \int \left| f(Q(\mathbf{x})) - f(\mathbf{x}) \right| p(\mathbf{x})\, d\mathbf{x}, \qquad (6)$$

where $\mathbb{E}$ denotes the expectation operator and $p(\mathbf{x})$ denotes the probability density function of $\mathbf{x}$. In general, $f$ can be any linear or nonlinear function of its arguments, and an analytical evaluation of this multidimensional integral can be very difficult. In prior work 93812; 35496, a simplifying assumption is made where $f(Q(\mathbf{x}))$ is approximated by its first-order Taylor series expansion

$$f(Q(\mathbf{x})) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \left( Q(\mathbf{x}) - \mathbf{x} \right). \qquad (7)$$
For a certain value of $\mathbf{x}$, the gradient $\nabla f(\mathbf{x})$ is constant and nonzero. Therefore, minimizing the global error can be approximated as minimizing the quantization error between the quantized (binarized) vector and its full-precision counterpart. The optimization problem in Eq. (6) can thus be simplified as

$$\min_{Q} \epsilon = \left\| \mathbf{x} - Q(\mathbf{x}) \right\|^2, \qquad (8)$$

where $\epsilon$ is the quantization error of the quantized parameters.
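The objective of Eq. (8) can be verified numerically. In the sketch below (illustrative only; the Gaussian weights are a stand-in), we fix the binary pattern to the signs and sweep a 32-bit scalar, confirming that the quantization error is minimized at the closed-form solution $\alpha^* = \operatorname{mean}(|\mathbf{w}|)$ used by XNOR-Net-style quantizers.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=1000)           # stand-in full-precision weights
B = np.sign(w)

# Sweep the scalar and measure the quantization error ||w - a*B||^2.
alphas = np.linspace(0.1, 2.0, 200)
errors = [np.sum((w - a * B) ** 2) for a in alphas]
alpha_best = alphas[int(np.argmin(errors))]

# The grid minimum lands near the closed-form solution alpha* = mean(|w|).
assert abs(alpha_best - np.abs(w).mean()) < 0.02
```

As the surrounding text argues, minimizing this error alone is not sufficient for accurate BNNs; it says nothing about how much information the binary pattern itself retains.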
Many studies, such as xnornet; DBLP:journals/corr/abs170808687; LQNet, focus on binarized neural networks and optimize the quantizer by minimizing the quantization error. Their objective functions (Eq. (8)) typically suppose that quantized models should strictly follow the pattern of full-precision models, which is not always enough, especially when the parameters are quantized to extremely low bit-widths. For binary models, the parameters are restricted to two values, which limits the representation capability of the parameters and makes the information carried by neurons vulnerable and easy to lose. Besides, the solution space of binary models is also quite different from that of full-precision models. Without retaining information during training, it is insufficient and difficult to ensure a highly accurate binarized network only by minimizing the quantization error.
Therefore, our study is fundamentally derived from the perspective of information retention. We state the precise definition of information in BNNs and then make a series of mathematical analyses of how to maximize it. For a random variable $b$ obeying a Bernoulli distribution, each element in $\mathbf{B}_x$ can be viewed as a sample of $b$. The information entropy of $Q_x(\mathbf{x})$ in Eq. (2) can be calculated by

$$\mathcal{H}\left(Q_x(\mathbf{x})\right) = \mathcal{H}(\mathbf{B}_x) = -p \ln p - (1 - p) \ln (1 - p), \qquad (9)$$

where $p$ denotes the probability of an element being $+1$, and $p \in (0, 1)$. By maximizing the information entropy in Eq. (9), the binarized parameters carry the maximum amount of information, so that the information in the full-precision counterpart is retained.

4.1.2 Information Retention via Information Maximized Binarization
To retain the information and minimize the loss in the forward propagation, our Information Maximized Binarization (IMB) jointly considers the information loss and the quantization error. First, we balance the weights of the BNN to maximize the information of weights and activations. Under the Bernoulli distribution assumption and the symmetry assumption of $\mathbf{x}$, when $p = 0.5$ in Eq. (9), the information entropy of the quantized values takes its maximum value, which means the binarized values should be evenly distributed. However, it is non-trivial to make the weights of BNNs approach that uniform distribution only through backward propagation.
Fortunately, we find that simply redistributing the full-precision counterpart of the binarized weights can maximize the information entropy of binarized weights and activations simultaneously. Our IMB balances the weights to have the zero-mean attribute by subtracting the mean of the full-precision weights. Moreover, we further standardize the balanced weights to mitigate the negative effect of the weight magnitude. The standardized balanced weights $\hat{\mathbf{w}}_{\mathrm{std}}$ are obtained through the standardization and balance operations as follows:

$$\hat{\mathbf{w}}_{\mathrm{std}} = \frac{\mathbf{w} - \mu(\mathbf{w})}{\sigma(\mathbf{w})}, \qquad (10)$$

where $\mu(\cdot)$ and $\sigma(\cdot)$ denote the mean and standard deviation, respectively. $\hat{\mathbf{w}}_{\mathrm{std}}$ has two characteristics: (1) $\mu(\hat{\mathbf{w}}_{\mathrm{std}}) = 0$, which maximizes the information entropy of the obtained binary weights; (2) $\sigma(\hat{\mathbf{w}}_{\mathrm{std}}) = 1$, which makes the full-precision weights involved in binarization more dispersed. Therefore, compared with directly using the balance operation, the standardized balance operation makes the weights update steadily and thus makes the binary weights more stable during training. Since the value of $\mathbf{B}_w$ depends on the sign of $\hat{\mathbf{w}}_{\mathrm{std}}$ and the distribution of $\hat{\mathbf{w}}_{\mathrm{std}}$ is almost symmetric SimultaneouslyOptimizingWeight; ACIQ, the balance operation maximizes the information entropy of the quantized weights on the whole. And when IMB is used for weights, the information flow of activations in the network can also be maintained. Supposing the quantized activations $\mathbf{B}_a$ have mean $\mathbb{E}[\mathbf{B}_a]$ and are independent of the weights, the mean of the output $\mathbf{z}$ can be calculated by

$$\mathbb{E}[\mathbf{z}] = \mathbb{E}\left[ \mathbf{B}_w \odot \mathbf{B}_a \right] = \mathbb{E}[\mathbf{B}_w] \otimes \mathbb{E}[\mathbf{B}_a]. \qquad (11)$$

Since the IMB for weights is applied in each layer, we have $\mathbb{E}[\mathbf{B}_w] = \mathbf{0}$, and the mean of the output is zero. Therefore, the information entropy of activations in each layer can be maximized, which means that the information in the activations can be retained.
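The entropy argument of Eq. (9) can be checked directly. The following sketch (with a deliberately biased Gaussian as a hypothetical weight tensor) shows that zero-mean balancing pushes the fraction of $+1$ bits toward $p = 0.5$ and the entropy toward its maximum $\ln 2$.

```python
import numpy as np

def binary_entropy(B):
    """Information entropy (Eq. (9)) of a {-1,+1} tensor under a
    Bernoulli model, where p is the fraction of +1 entries."""
    p = np.mean(B > 0)
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log(p) - (1 - p) * np.log(1 - p))

rng = np.random.default_rng(0)
w = rng.normal(loc=0.8, scale=1.0, size=10000)   # biased stand-in weights

H_raw = binary_entropy(np.sign(w))               # entropy without balancing
H_bal = binary_entropy(np.sign(w - w.mean()))    # entropy after zero-mean balancing

assert H_bal > H_raw                  # balancing retains more information
assert abs(H_bal - np.log(2)) < 0.01  # close to the maximum ln 2
```

Because balancing acts on the full-precision weights before the sign operation, it changes which bits are produced but adds nothing to the inference graph.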
Then, to further minimize the quantization error defined in Eq. (8) and avoid the extra expensive floating-point calculations caused by the 32-bit scalars in previous binarization methods, IMB introduces an integer shift-based scalar to expand the representation capability of the binary weights. The optimal shift-based scalar can be solved by

$$\mathbf{B}_w^{*}, s^{*} = \arg\min_{\mathbf{B}_w,\, s} \left\| \hat{\mathbf{w}}_{\mathrm{std}} - \mathbf{B}_w \ll\gg s \right\|^2, \quad s \in \mathbb{N}, \qquad (12)$$

where $\ll\gg$ stands for left or right Bit-shift. $\mathbf{B}_w$ is calculated by $\operatorname{sign}(\hat{\mathbf{w}}_{\mathrm{std}})$; thus $s^{*}$ can be solved as

$$s^{*} = \operatorname{round}\left( \log_2 \frac{\left\| \hat{\mathbf{w}}_{\mathrm{std}} \right\|_1}{n} \right), \qquad (13)$$

where $n$ is the number of weights.
Therefore, our IMB for the forward propagation can be presented as

$$Q_w(\mathbf{w}) = \mathbf{B}_w \ll\gg s^{*}, \quad \mathbf{B}_w = \operatorname{sign}\left( \hat{\mathbf{w}}_{\mathrm{std}} \right). \qquad (14)$$

The main operations in DIRNet can then be expressed as

$$\mathbf{z} = \left( \mathbf{B}_w \odot \mathbf{B}_a \right) \ll\gg s^{*}. \qquad (15)$$
As shown in Fig. 2, the parameters quantized by IMB have the maximum information entropy under the Bernoulli distribution. We call our binarization method "Information Maximized Binarization" because the parameters are balanced before the binarization operations to retain information.
Note that IMB serves as an implicit rectifier that reshapes the data distribution before binarization. In the literature, a few studies have also realized the positive effect of this on the performance of BNNs and adopted empirical settings to redistribute parameters xnornet; Regularizeactdistribution. For example, Regularizeactdistribution identified the specific degeneration problem of binarization and solved it with a specially designed additional regularization loss. Different from these works, we directly take the information perspective to rethink the impact of the parameter distribution before binarization, and provide the optimal solution by maximizing the information entropy. In this framework, IMB accomplishes the distribution adjustment by simply balancing and standardizing the weights before binarization. This means our method can be easily and widely applied to various neural network architectures and directly plugged into the standard training pipeline with very limited extra computation cost. Moreover, since the convolution operations in our DIRNet are thoroughly replaced by bitwise operations, including XNOR, Bitcount, and Shift, the implementation of DIRNet can achieve extremely high inference acceleration on edge devices.
4.2 Distribution-sensitive Two-stage Estimator in the Backward Propagation
In the backward propagation, affected simultaneously by the limited update range of the estimator and the gradient approximation error, the gradient of a BNN suffers from information loss. To retain the information originating from the loss function in the backward propagation, we propose a progressive Distribution-sensitive Two-stage Estimator (DTE) to obtain the approximation of gradients.
4.2.1 Information Loss in the Backward Propagation
Due to the discretization caused by binarization, approximating the gradients is inevitable in the backward propagation. Since the impact of quantization cannot be accurately modeled by the approximation, a large loss of information occurs. The approximation can be formulated as

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial Q_x(\mathbf{x})} \frac{\partial Q_x(\mathbf{x})}{\partial \mathbf{x}} \approx \frac{\partial \mathcal{L}}{\partial Q_x(\mathbf{x})}\, g'(\mathbf{x}), \qquad (16)$$

where $\mathcal{L}$ indicates the loss function, $g(\mathbf{x})$ represents the approximation of the sign function, and $g'(\mathbf{x})$ denotes the derivative of $g(\mathbf{x})$. In previous work, there are two commonly used approximation practices, Identity and Clip:

$$\text{Identity: } g(x) = x, \qquad \text{Clip: } g(x) = \operatorname{clip}(x, -1, 1). \qquad (17)$$
The Identity function completely ignores the effect of binarization and directly passes the gradient information of the output values to the input values. As shown in the shaded area of Fig. 4(a), the gradient error is huge and accumulates across layers during the backward propagation. To avoid unstable training, rather than simply ignoring the error as the Identity approximation does, it is necessary to design a better estimator that retains accurate gradient information.
The Clip function considers the clipping attribute of binarization, which means only the gradients of values inside the clipping interval $[-1, 1]$ are passed through in the backward propagation. As shown in Fig. 4(b), for parameters outside $[-1, 1]$, the gradients are clamped to zero, which means that once a value jumps outside the clipping interval, it will never be updated again. This feature greatly limits the updating capability of the backward propagation; thereby the Clip approximation makes optimization more difficult and harms the accuracy of models. A strong updating capability is essential for the training of BNNs, especially at the beginning of the training process.
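The Clip-style straight-through backward pass (Eq. (17)) amounts to a single mask, as in this minimal sketch with hypothetical values:

```python
import numpy as np

def clip_ste_backward(x, grad_out):
    """Backward pass of the Clip approximation (Eq. (17)): the incoming
    gradient passes through only where |x| <= 1; parameters outside the
    clipping interval receive zero gradient and stop updating."""
    return grad_out * (np.abs(x) <= 1.0)

x = np.array([-2.0, -0.5, 0.3, 1.7])
g = clip_ste_backward(x, np.ones_like(x))
assert g.tolist() == [0.0, 1.0, 1.0, 0.0]
```

The two outer parameters here illustrate the problem discussed next: once outside $[-1, 1]$, their gradients are identically zero for the rest of training.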
Existing estimators are designed to obtain gradients close to the derivative of the sign function while retaining the updating capability of the BNN, so most of them have an updatable interval, e.g., $[-1, 1]$ for the Clip function. However, we observe an interesting trend in the distribution of weights during training. As Fig. 3 shows, the number of weights close to 0 continuously decreases during training, which occurs in most BNNs with various estimators (such as the Identity and Clip approximations). This phenomenon causes more weights to fall outside the updatable interval and brings great challenges to the design of estimators. For BNNs with the Clip approximation, the phenomenon pushes more weights out of $[-1, 1]$, where they can no longer be updated, which seriously limits the updating capability. Soft approximation functions designed to reduce the gradient error are also affected by this problem, since they reduce the updatable interval of the parameters as well. For example, in the later stage of the estimator in IRNet IRNet, the update range of the estimator continues to shrink to reduce the information loss caused by gradient errors. At the end of this stage, less than 3% of the weights can be updated (Fig. 7). In other words, the BNN has almost lost its updating capability at this point.
The Identity function causes a gradient error between the sign function in the forward binarization and the gradients in the backward propagation, while the Clip and soft approximation functions leave part of the parameters outside the updatable interval. Our method makes a trade-off that takes advantage of both types of gradient approximation while avoiding their drawbacks.
4.2.2 Information Retention via Distributionsensitive Twostage Estimator
To strike this balance and obtain an optimal approximation of the gradients in the backward propagation, we propose the Distribution-sensitive Two-stage Estimator (DTE):

$$g(x) = k \tanh(t x), \qquad (18)$$

where $g(x)$ represents the derivable approximate substitute for the forward sign function in the backward propagation, and $X$ denotes the random variable sampled from the full-precision parameter $\mathbf{x}$. The $t$ and $k$ are distribution-sensitive variables, which change along with the training process to restrict the shape of the approximate function:

$$t = T_{\min} \, 10^{\frac{i}{N} \log \frac{T_{\max}}{T_{\min}}}, \quad k = \max\left( \frac{1}{t}, 1 \right), \quad \text{s.t. } F\!\left(\tfrac{1}{t}\right) - F\!\left(-\tfrac{1}{t}\right) \ge \delta, \qquad (19)$$

where $i$ denotes the current epoch and $N$ is the number of total epochs, and $F$ is the probability mass function of $X$, which reflects the distribution of the element values in the parameter $\mathbf{x}$. $\delta$ indicates the lower limit for the percentage of parameters with high updating capability: the constraint means that the parameters in the range $[-1/t, 1/t]$ account for at least $\delta$ of the total. $\delta$ is set empirically, taking both the updating capability and accurate gradients into account. $T_{\min}$ and $T_{\max}$ are $10^{-1}$ and $10^{1}$, respectively.
To retain the information originating from the loss function in the backward propagation, the DTE adopts a progressive, distribution-sensitive two-stage method to approximate the gradients.

Stage 1: Retain the updating capability of the backward propagation. We keep the derivative value of the gradient estimation function near one and gradually reduce the clipping value from a larger number to one. At the start of this stage, the shape of the DTE depends on the weight distribution of each layer, which ensures that all parameters can be fully updated. The DTE adaptively changes the clipping value during this stage to obtain more accurate gradients. The derivative of the DTE in the first stage is:

$$g'(x) = k t \left( 1 - \tanh^2 (t x) \right), \quad t \in [T_{\min}, 1],\; k = \frac{1}{t}. \qquad (20)$$
Applying this method, our estimation function evolves from an identity-like shape to a clip-like approximation, which ensures strong updating capability at the beginning of the training process and alleviates the later loss of updating capability.
Stage 2: Keep the balance between accurate gradients and strong updating capability. In this stage, we keep the clipping value at one and gradually push the derivative curve towards the shape of the step function, while ensuring that enough parameters are still updated. The shape of DTE changes according to the parameter distribution, and the derivative around 0 is continuously increased to obtain more accurate gradients, until there would no longer be enough parameters left to update. The derivative of DTE in the second stage is:
(21) 
Benefiting from the proposed method, our estimation function evolves from a clip-like approximation towards the sign function, which ensures consistency between the forward and backward propagation.
Fig. 4(c) shows how the shape of DTE changes in each stage. DTE updates all parameters in the first stage, and further improves the accuracy of gradients in the second stage. Based on this two-stage estimation, DTE reduces the gap between the forward binarization function and the backward approximation function. Meanwhile, the shape of DTE is adaptively adjusted according to the parameter distribution so that a certain proportion of parameters can be updated in each iteration. In this way, all parameters can be reasonably updated.
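As a concrete sketch of the two stages, a tanh-based surrogate derivative (in the spirit of IR-Net's EDE; the exact form of DTE in Eqs. (18)-(21) is not reproduced here) illustrates how the two coefficients trade updating capability against gradient accuracy. Here `t` and `k` are hypothetical stand-ins for the distribution-sensitive variables:

```python
import math

def dte_grad(x, t, k):
    """Surrogate for d/dx sign(x): derivative of g(x) = k * tanh(t * x).
    t controls steepness (small in stage 1, large in stage 2);
    k rescales the curve so its peak stays near the clipping value."""
    return k * t * (1.0 - math.tanh(t * x) ** 2)

# Stage 1: small t, derivative close to 1 over a wide interval
# (identity/clip-like), so weights far from zero stay updatable.
stage1_center = dte_grad(0.0, t=0.5, k=2.0)   # peak value k*t = 1
stage1_tail = dte_grad(2.0, t=0.5, k=2.0)     # still clearly non-zero

# Stage 2: large t, derivative concentrates near 0 (approaching the
# step-function shape), giving more accurate gradients around zero.
stage2_center = dte_grad(0.0, t=10.0, k=0.1)
stage2_tail = dte_grad(2.0, t=10.0, k=0.1)    # nearly vanishes
```

In this reading, stage 1 keeps the tails of the derivative alive so every weight can move, and stage 2 sharpens the curve towards the sign function while the limit on updatable parameters caps how steep it is allowed to become.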
4.3 Analysis and Discussions
The training process of our DIRNet is summarized in Algorithm 1. In this section, we will analyze DIRNet from different aspects.
4.3.1 Complexity Analysis
Since IMB and DTE are applied only during the training process, DIRNet requires no extra operations for binarizing activations at inference time. Moreover, with the novel shift-based scalars in IMB, the computation cost is reduced compared with existing solutions that use 32-bit scalars (e.g., XNOR-Net and LQ-Net), as shown in Table 1. We further test the real deployment speed on hardware and showcase the results in Sec. 5.3.
Method  Float Operations
XNOR-Net  
LQ-Net  
Ours  0
Here c_out, c_in, k_w, k_h, w_out, and h_out denote the number of output channels, input channels, kernel width, kernel height, output width, and output height, respectively. The bitwise operations mainly consist of XNOR, Bitcount, and Shift.
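To see why the shift-based scalar removes the float operations counted above, note that a 32-bit scaling factor (e.g. the XNOR-Net-style mean of absolute weights) can be rounded to the nearest power of two, turning the per-channel multiplication into a Shift instruction. The rounding scheme below is a plausible sketch, not the exact IMB formulation:

```python
import math

def float_scalar(w):
    """XNOR-Net-style 32-bit scaling factor: mean of absolute weights."""
    return sum(abs(v) for v in w) / len(w)

def shift_scalar(w):
    """Round the scaling factor to the nearest power of two so that
    multiplying by it becomes a bit shift on integer hardware."""
    s = round(math.log2(float_scalar(w)))  # integer shift amount
    return 2.0 ** s, s

w = [0.3, -0.25, 0.2, -0.25]
alpha = float_scalar(w)        # float scalar: 0.25
pow2, shift = shift_scalar(w)  # power-of-two scalar 0.25, i.e. shift by -2
```

Multiplying by `pow2` then costs one shift per output instead of a floating-point multiply, which is what the "0 float operations" row for our method reflects.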
4.3.2 Stabilize Training
In IMB, weight standardization is introduced to stabilize training by avoiding drastic changes of binarized weights. Fig. 5 shows the distribution of weights without standardization, which is clearly more concentrated around 0. This means the signs of most weights flip easily during optimization, which directly causes unstable training of binary neural networks. By redistributing the data, weight standardization implicitly sets up a bridge between the forward IMB and the backward DTE, contributing to more stable training of binary neural networks. Moreover, the proposed DTE also stabilizes training by both ensuring the updating capability of the network and preventing the estimator from becoming too steep, thus avoiding excessively enlarged gradients.
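A minimal sketch of the standardization step described above, assuming the usual zero-mean, unit-variance transformation per layer (the exact IMB formulation with its balance operation is defined earlier in the paper):

```python
import statistics

def standardize(w, eps=1e-5):
    """Redistribute weights to zero mean and (near) unit variance before
    binarization, spreading mass away from 0 so that small gradient
    updates are less likely to flip the signs of many weights at once."""
    mu = statistics.fmean(w)
    sigma = statistics.pstdev(w)
    return [(v - mu) / (sigma + eps) for v in w]

w = [0.04, -0.02, 0.01, -0.05, 0.02]   # tightly clustered around 0
ws = standardize(w)                    # same signs, much wider spread
```

After this transformation the binarized weights sign(w) are unchanged, but each weight sits much farther from the sign boundary, which is the stabilizing effect the paragraph describes.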
5 Experiments
We perform image classification on two benchmark datasets, CIFAR-10 CIFAR and ImageNet (ILSVRC12) Deng2009ImageNet, to evaluate our DIRNet and compare it with other recent SOTA methods.
DIRNet:
We implement our DIRNet in PyTorch, owing to its flexibility and powerful automatic differentiation mechanism. To build a binarized model, we simply replace the convolutional layers of the original models with binary convolutional layers binarized by our method.
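As an illustration of that replacement, a minimal binary convolution in PyTorch might look as follows. This is a bare-bones sketch with a plain sign/straight-through pair, omitting IMB's standardization and shift-based scalars as well as DTE's training-time schedule:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    """sign() in the forward pass; clipped straight-through gradient
    in the backward pass (gradient zeroed where |x| > 1)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1.0).float()

class BinaryConv2d(nn.Conv2d):
    """Drop-in replacement for nn.Conv2d that binarizes both weights
    and inputs before the convolution."""
    def forward(self, x):
        bw = BinarizeSTE.apply(self.weight)
        bx = BinarizeSTE.apply(x)
        return F.conv2d(bx, bw, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

conv = BinaryConv2d(3, 8, kernel_size=3, padding=1, bias=False)
out = conv(torch.randn(1, 3, 32, 32))
```

Swapping `nn.Conv2d` for such a layer (plus the IMB/DTE machinery) is the only structural change the binarized model needs.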
Network Structures: We evaluate DIRNet on mainstream and compact CNN structures, including VGG-Small LQNet, ResNet-18, and ResNet-20 on CIFAR-10, and ResNet-18, ResNet-34 he2016deep, MobileNetV1 mobilenet, EfficientNet-B0 tan2019efficientnet, and DARTS liu2018darts on ImageNet. To verify the versatility of our method, we also evaluate DIRNet on networks (such as ResNet and MobileNet) with both the normal structure and the ReActNet liu2020reactnet structure; the latter is specifically designed for binary networks and enjoys better accuracy. We binarize all convolutional and fully connected layers except the first and the last, and keep the 1x1 convolutions in EfficientNet and DARTS at full precision. As for the activation layers,
Hardtanh is chosen as the activation function rather than ReLU.
Hyperparameters and other setups: We train DIRNet from scratch (i.e., from random initialization) without any pre-trained models. To evaluate DIRNet on different CNN structures, we mostly follow the original hyperparameter settings and training steps of the respective papers xnornet; LQNet; Liu_2018_ECCV; IRNet; liu2020reactnet. Specifically, for experiments on CIFAR-10, we train the models for up to 400 epochs; the learning rate starts at 1e-1 and decays to 0 during training following the Cosine Annealing schedule loshchilov2016sgdr. For experiments on ImageNet, we train the models for up to 62500 iterations; the learning rate starts at 1e-2 and is divided by 10 at 18750, 37500, and 56250 iterations. Following the original papers, a weight decay of 1e-4 and a batch size of 128 are adopted in all experiments, and SGD with a momentum of 0.9 is applied as the optimizer.
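For reference, the CIFAR-10 schedule described above (learning rate from 1e-1 to 0 over 400 epochs via cosine annealing) can be written directly:

```python
import math

def cosine_lr(epoch, total_epochs=400, lr_max=0.1, lr_min=0.0):
    """Cosine Annealing (Loshchilov & Hutter, SGDR): the learning rate
    follows half a cosine from lr_max at epoch 0 down to lr_min at the
    final epoch."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos)
```

At epoch 0 this yields 0.1, at the midpoint 0.05, and at epoch 400 it reaches 0, matching the stated schedule.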
5.1 Ablation Study
In this section, we evaluate the performance and effects of our proposed IMB and DTE on BNN.
5.1.1 Effect of IMB
Our proposed IMB adjusts the distribution of weights to maximize the information entropy of both binary weights and binary activations in the network. Owing to the balance operation before binarization, the binary weight parameters of each layer in DIRNet have maximal information entropy. For binary activations, which are affected by the binary weights, the maximization of information entropy is also guaranteed.
To illustrate the information retention capability of IMB, Fig. 6 shows the information entropy reduction of each layer's binary activations in a network quantized by IMB and in a vanilla binary neural network, respectively. As the figure shows, vanilla binarization causes a large decrease in the information entropy of binary activations, and the information loss appears to accumulate across layers in the forward propagation. In contrast, in IMB-quantized networks, the information entropy of each layer's activations stays close to the maximal information entropy under the Bernoulli distribution. Thus IMB achieves information retention of the binary activations in each layer.
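The entropy bound referenced above is just that of a Bernoulli variable; a quick computation shows why a balanced sign distribution carries the most information:

```python
import math

def bernoulli_entropy(p):
    """Information entropy (bits) of a binary activation that is +1 with
    probability p and -1 with probability 1 - p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

# Entropy peaks at the balanced distribution p = 0.5 that IMB targets,
# and collapses to 0 when all activations share one sign.
h_balanced = bernoulli_entropy(0.5)   # 1 bit, the maximum
h_skewed = bernoulli_entropy(0.9)
h_dead = bernoulli_entropy(1.0)       # no information left
```

This is the per-element ceiling against which Fig. 6 measures the entropy reduction of each layer.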
We further evaluate the impact of information on DIRNet in detail. The information in DIRNet is defined by Eq. (9) and can be adjusted by changing the mean value of activations. Fig. 6 presents the relationship between this information and the final accuracy. The information entropy of binarized activations is determined by the percentage of full-precision counterparts that are less than 0. The results show that information entropy is almost positively correlated with network accuracy, and when the information entropy is maximized, the BNN achieves the highest accuracy, which verifies the effectiveness of IMB. Therefore, information entropy is an important indicator of the amount of information a BNN holds, and we can improve BNN performance by maximizing the entropy.
In addition, we analyze the impact of IMB on binarization error. In Table 2, we compare binarized networks that apply different approaches to quantize weights (activations are binarized directly for fairness), including vanilla binarization, XNOR binarization, and our IMB. XNOR binarization uses 32-bit scalars, while our IMB uses integer shift-based scalars. Compared with vanilla binarization, BNNs quantized by XNOR and IMB have much smaller binarization error owing to the use of scalars. Our IMB further eliminates all floating-point scalars (1.3e3 of them) and the related floating-point computation compared with XNOR binarization, while the binarization error increases by only 5% (0.1e4) and the quantized network enjoys better accuracy (84.9%). These results show that IMB strikes a better balance between inference speedup and binarization error minimization.
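The binarization error being compared can be illustrated with the usual quantization residual ||W - s·sign(W)||²; the exact metric behind Table 2 is not spelled out here, so this is an assumed form:

```python
def binarization_error(w, scalar=1.0):
    """Quantization error ||W - s * sign(W)||^2 for a scaling factor s
    (s = 1 corresponds to vanilla binarization without scalars)."""
    return sum((v - scalar * (1.0 if v >= 0 else -1.0)) ** 2 for v in w)

w = [0.9, -1.1, 0.5, -0.7]
err_vanilla = binarization_error(w)                 # s = 1, no scalar
alpha = sum(abs(v) for v in w) / len(w)             # XNOR-style float scalar
err_scaled = binarization_error(w, scalar=alpha)    # strictly smaller
```

Any reasonable scalar (float or power-of-two) shrinks this residual relative to the vanilla case, which is the trend the table reports.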
Method  Bit-width (W/A)  Binarization Error  Float Scalars  Accuracy (%)
Full-Precision  32/32  –  –  91.7
Binary  1/1  5.1e4  0  83.8
XNOR  1/1  1.9e4  1.3e3  84.8
IMB (Ours)  1/1  2.0e4  0  84.9
5.1.2 Effect of DTE
Firstly, we discuss the setting of the limit parameter in DTE, which determines the degree of updating capability that DTE maintains. As mentioned in Section 4.2, we control the value of this parameter to ensure that at least that fraction of parameters is updated throughout training. However, if the value is set too large, the gradients become inaccurate because the gap between the estimator and the sign function is huge. Table 3 shows the accuracy of DIRNet under different settings, based on the ResNet-20 architecture and the CIFAR-10 dataset. The results show that accuracy increases as the value decreases over a considerable range (approximately 100% to 20%), and that models which properly ensure the updating capability perform better than those that do not control the lower limit at all (the limit set to 0%). Compared with an estimator that simply minimizes the gradient error, appropriately raising the minimum updating capability of the binarized model and keeping enough parameters updating during training is more helpful for BNN performance. Therefore, in our experiments, we empirically set the value to 10% to achieve a good tradeoff between accurate gradients and updating capability.
Limit (%)  100  90  80  70  60  50  40  30  20  10  0
Accuracy (%)  84.2  84.4  84.7  84.6  85.1  85.7  85.9  86.2  86.7  86.8  86.5
Then, to illustrate the effect and necessity of DTE, Fig. 7 shows the distribution of weight parameters at different epochs of training. The figures in the first row show the data distributions, and those in the second and third rows show the corresponding derivative curves of the existing EDE and our DTE, respectively. Among the derivative curves, the blue ones represent the derivative of DTE and the yellow ones the derivative of STE (with clipping). Clearly, in the first stage of DTE (epoch 10 to epoch 200 in Fig. 7), much of the data lies beyond the clipping range, so the estimator should have a larger effective updating range to ensure the updating capability of the BNN. In addition, the peakedness of the weight distribution is high at the beginning of training, with a large amount of data clustered near zero. DTE keeps the derivative curve flat at this stage to prevent the derivative value near zero from becoming too large, thereby avoiding severely unstable training. As binarization is introduced into training, the weights gradually redistribute in the later stages. We can therefore slowly increase the derivative value and approach the standard sign function to avoid gradient mismatch. The visualized results show that our DTE approximation for backward propagation is consistent with the real data distribution, which is critical to improving network accuracy. Moreover, compared with the existing EDE, the improvement of DTE lies in maintaining the network's updating capability throughout the training process, especially in the later stages. As shown in Fig. 7, at the 400th epoch the derivative of EDE is almost identical to the sign function and only a tiny fraction of weights can be updated, while the proposed DTE ensures that at least the specified fraction (10% by default) of weights can still be updated.
5.1.3 Ablation Performance
We further evaluate the performance of the different parts of DIRNet using the ResNet-20 architecture on the CIFAR-10 dataset, which helps in understanding how DIRNet works in practice. Table 4 presents the accuracy of networks with different settings. As shown in Table 4, both IMB and DTE improve accuracy. For IMB, both the standardization of weight data and the shift-based scalars play important roles: a BNN with only the bit-shift scalars or only the weight standardization is 0.5% and 0.8% more accurate than the naive BNN, respectively, while the full IMB achieves a 1.1% gain. For DTE, the key to improving BNN performance is the cooperation of its two stages. A BNN that applies only stage 1 of DTE performs even worse than the naive BNN. The results of BNNs applying only stage 2 show that an estimator which always retains a certain updating capability is more effective than one that does not control the lower limit. This phenomenon supports the motivation of DTE, namely ensuring a minimum updating capability of the estimator during the training process. The full DTE brings a 1.7% gain to the BNN. Moreover, the improvements of IMB and DTE are complementary and can be superimposed, hence we can train binary neural networks with high accuracy using both methods together.
Method  Bit-width (W/A)  Accuracy (%)
Full-Precision  32/32  91.7
Binary  1/1  83.8
IMB (w/o weight standardization)  1/1  84.3
IMB (w/o shift-based scalars)  1/1  84.6
IMB  1/1  84.9
DTE (stage 1 only)  1/1  83.6
DTE (stage 2 only, w/o lower limit)  1/1  84.9
DTE (stage 2 only, with lower limit)  1/1  85.1
DTE  1/1  85.5
DIRNet (IMB & DTE)  1/1  86.8
5.2 Comparison with SOTA methods
We perform a comprehensive evaluation of DIRNet by comparing it with existing SOTA methods.
5.2.1 CIFAR-10
Table 5 lists the performance of different methods on the CIFAR-10 dataset, where we compare DIRNet with these methods on various widely used architectures, such as ResNet-18 ResNet18project, ResNet-20 ResNet20project, and VGG-Small. We compare against RAD Regularizeactdistribution and IR-Net IRNet on ResNet-18; DoReFa-Net dorefa, LQ-Net LQNet, DSQ DSQ, and IR-Net IRNet on ResNet-20; and BNN hubara2016binarized, LAB LossAwareBNN, RAD Regularizeactdistribution, XNOR-Net xnornet, and IR-Net IRNet on VGG-Small.
In all cases in the table, our proposed DIRNet achieves the highest accuracy. Moreover, on the ResNet architectures, DIRNet shows a significant improvement over existing SOTA methods when using 1-bit weights and 1-bit activations (1W/1A). For example, with the 1W/1A bit-width setting, the accuracy of our method is 2.6% higher than DSQ on ResNet-20, and the gap to the full-precision counterpart is reduced to 5.0%. Compared with IR-Net, DIRNet performs better because it further ensures that enough parameters can be updated during training, outperforming IR-Net by roughly 0.3% across different backbones and bit-width settings. Moreover, we ran each set of experiments on CIFAR-10 with five different random seeds and recorded the mean and standard deviation of the results. Across all network architectures, the standard deviation over random seeds is less than 0.13%, and as low as 0.09% on ResNet-18 and VGG-Small, which is much lower than the improvement over IR-Net on these architectures (at least 0.3%). These results show that the improvement of DIRNet is robust and stably improves network performance under various settings.
Topology  Method  Bit-width (W/A)  Accuracy (%)
ResNet-18  Full-Precision  32/32  93.0
  RAD  1/1  90.5
  IR-Net  1/1  91.5
  DIRNet (ours)  1/1  91.7±0.09
ResNet-20  Full-Precision  32/32  91.7
  DoReFa  1/1  79.3
  DSQ  1/1  84.1
  IR-Net  1/1  86.5
  DIRNet (ours)  1/1  86.8±0.13
  Full-Precision  32/32  91.7
  DoReFa  1/32  90.0
  LQ-Net  1/32  90.1
  DSQ  1/32  90.2
  IR-Net  1/32  90.8
  DIRNet (ours)  1/32  91.0±0.12
VGG-Small  Full-Precision  32/32  91.7
  LAB  1/1  87.7
  XNOR  1/1  89.8
  BNN  1/1  89.9
  RAD  1/1  90.0
  IR-Net  1/1  90.4
  DIRNet (ours)  1/1  90.7±0.09
5.2.2 ImageNet
We study the performance of DIRNet with ResNet-18, ResNet-34, MobileNetV1, DARTS, and EfficientNet-B0 structures on the large-scale ImageNet dataset. Table 6 lists the comparison with several SOTA quantization methods, including BWN xnornet, HWGQ DBLP:journals/corr/abs170808687, TWN DBLP:journals/corr/LiL16, LQ-Net LQNet, DoReFa-Net dorefa, ABC-Net ABCNet, Bi-Real Liu_2018_ECCV, XNOR++ XNOR++, BWHN DBLP:journals/corr/abs180202733, SQ-BWN and SQ-TWN Dong2017Learning, PCNN DBLP:journals/corr/abs181112755, BONN gu2019bayesian, SiBNN wang2020sparsity, Real-to-Bin martinez2020training, MeliusNet bethge2021meliusnet, and ReActNet liu2020reactnet.
As shown in Table 6, when only weights are quantized to 1 bit on ResNet-18, DIRNet greatly exceeds most other methods, and even outperforms TWN with 2-bit weights by a notable 4.8%. Meanwhile, DIRNet outperforms IR-Net by 0.3% Top-1 and 0.4% Top-5 accuracy on ResNet-34 under the 1W/32A setting. Moreover, under the 1W/1A setting, DIRNet also surpasses the SOTA binarization methods: its Top-1 accuracy is higher than that of ReActNet (66.1% vs. 65.9% on ResNet-18) and SiBNN (67.5% vs. 63.3% on ResNet-34). These results show that DIRNet is more competitive than existing binarization methods.
We further implemented DIRNet on more compact CNN structures, including DARTS, EfficientNet, and MobileNet, and compared it with other SOTA binarization methods. Table 6 shows that DIRNet outperforms both the vanilla BNN and Bi-Real Net on the DARTS and EfficientNet-B0 structures without any additional computational overhead or training steps. Under the 1W/1A setting, DIRNet surpasses Bi-Real by 0.7% Top-1 and 0.8% Top-5 accuracy on DARTS, and by nearly 1% Top-1 and 0.4% Top-5 on EfficientNet-B0; in both cases DIRNet outperforms BNN by an obvious margin. On the MobileNet structure, DIRNet also performs well and surpasses the SOTA methods: under the 1W/1A setting, it loses only 2.4% Top-1 accuracy compared with the full-precision counterpart, which is much better than other binarization methods. These experiments on compact networks show that our binarization scheme is versatile and competitive across various structures.
Topology  Method  Bit-width (W/A)  Top-1 (%)  Top-5 (%)
ResNet-18  Full-Precision  32/32  69.6  89.2
  ABC-Net  1/1  42.7  67.6
  XNOR  1/1  51.2  73.2
  BNN+  1/1  53.0  72.6
  DoReFa  1/2  53.4  –
  Bi-Real  1/1  56.4  79.5
  XNOR++  1/1  57.1  79.9
  PCNN  1/1  57.3  80.0
  IR-Net  1/1  58.1  80.0
  BONN  1/1  58.3  81.6
  SiBNN  1/1  59.7  81.8
  Real-to-Bin  1/1  65.4  86.2
  ReActNet  1/1  65.9  –
  DIRNet (ours)²  1/1  66.1  86.4
  Full-Precision  32/32  69.6  89.2
  SQ-BWN  1/32  58.4  81.6
  BWN  1/32  60.8  83.0
  HWGQ  1/32  61.3  83.2
  TWN  2/32  61.8  84.2
  SQ-TWN  2/32  63.8  85.7
  BWHN  1/32  64.3  85.9
  IR-Net  1/32  66.5  86.8
  DIRNet (ours)¹  1/32  66.6  86.8
ResNet-34  Full-Precision  32/32  73.3  91.3
  ABC-Net  1/1  52.4  76.5
  Bi-Real  1/1  62.2  83.9
  IR-Net  1/1  62.9  84.1
  SiBNN  1/1  63.3  84.4
  DIRNet (ours)²  1/1  67.5  88.2
  Full-Precision  32/32  73.3  91.3
  IR-Net  1/32  70.4  89.5
  DIRNet (ours)¹  1/32  70.7  89.9
DARTS  Full-Precision  32/32  73.3  91.3
  BNN  1/1  52.2  76.6
  Bi-Real  1/1  61.5  83.8
  DIRNet (ours)  1/1  62.2  84.6
EfficientNet  Full-Precision  32/32  76.2  92.7
  BNN  1/1  52.7  76.5
  Bi-Real  1/1  58.7  81.3
  DIRNet (ours)  1/1  59.6  81.7
MobileNet  Full-Precision  32/32  72.4  –
  BNN  1/1  60.9  –
  MeliusNet22  1/1  63.6  84.7
  MeliusNet29  1/1  65.8  86.2
  MeliusNet42  1/1  69.2  88.3
  ReActNet  1/1  69.5  –
  DIRNet (ours)²  1/1  69.7  88.5
¹ Results of networks with normal structure.
² Results of networks with ReActNet structure liu2020reactnet.
5.3 Deployment Efficiency on Raspberry Pi 3B
To further evaluate the efficiency of DIRNet when deployed on real-world mobile devices, we implemented it on a Raspberry Pi 3B, which has a 1.2 GHz 64-bit quad-core ARM Cortex-A53, and tested the actual running speed. We use the SIMD instruction SSHL of ARM NEON to make the inference framework daBNN zhang2019dabnn compatible with DIRNet.
Method  Bit-width (W/A)  Size (MB)  Time (ms)
Full-Precision  32/32  46.77  1418.94
NCNN  8/8  –  935.51
DSQ  2/2  –  551.22
DIRNet (w/o scalars)  1/1  4.20  252.16
DIRNet (ours)  1/1  4.21  261.98
We must point out that, so far, very few studies have reported the inference speed of their models deployed on real-world devices, which is one of the most important criteria for evaluating quantized models, especially with 1-bit binarization. As shown in Table 7, we compare DIRNet with existing high-performance inference implementations, including NCNN ncnn and DSQ DSQ. The inference speed of DIRNet is clearly much faster, since all floating-point operations in the convolutional layers are replaced by bitwise operations such as XNOR, Bitcount, and Bit-shift. The model size of DIRNet is also greatly reduced, and the shift-based scalars in DIRNet bring almost no extra time consumption or memory footprint compared with the vanilla binarization method without scalars.
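The bitwise pipeline exploited by daBNN can be mimicked in a few lines on packed integers; real deployments map the three steps to SIMD instructions on ARM NEON, so this is purely illustrative:

```python
def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two {-1, +1} vectors packed as n-bit integers
    (bit 1 encodes +1, bit 0 encodes -1):
        dot = 2 * popcount(XNOR(a, b)) - n
    XNOR marks the positions where the two vectors agree."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask
    return 2 * bin(xnor).count("1") - n

# a = [+1, +1, -1, +1] and b = [+1, -1, +1, +1] packed low-bit-first:
# they agree in two of four positions, so the dot product is 0.
dot = xnor_popcount_dot(0b1011, 0b1101, 4)
```

Because one machine word packs 32 or 64 such elements, each XNOR plus Bitcount replaces dozens of floating-point multiply-accumulates, which is where the speedups in Table 7 come from.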
6 Conclusion
In this paper, we introduce a novel DIRNet that retains information during the forward and backward propagation of binary neural networks. DIRNet is mainly composed of two practical techniques: IMB, which ensures diversity in the forward propagation, and DTE, which reduces gradient errors in the backward propagation. From the perspective of information entropy, IMB performs a simple but effective transformation on weights that minimizes the information loss of both weights and activations at the same time, with no additional operations on activations. In this way, we maintain the diversity of binary neural networks as much as possible without compromising efficiency. Additionally, the well-designed gradient estimator DTE reduces the information error of gradients in the backward propagation; owing to its strong updating capability and accurate gradients, DTE outperforms STE by a large margin. Extensive experiments show that DIRNet consistently outperforms existing SOTA BNNs.
7 Acknowledgement
This work was supported by the National Natural Science Foundation of China (62022009, 61872021), Beijing Nova Program of Science and Technology (Z191100001119050), and State Key Lab of Software Development Environment (SKLSDE2020ZX06).