Over the past few years, Artificial Intelligence (AI) built on Deep Neural Networks (DNNs), especially Convolutional Neural Networks (CNNs), has shown great potential in computer vision tasks, including but not limited to classification krizhevsky2012imagenet; VeryDeepConvolutional; 7298594; wang2019dynamic; 9027877, detection DBLP:journals/corr/GirshickDDM13; DBLP:journals/corr/Girshick15; DBLP:journals/corr/abs-1904-02701; NIPS2015_5638; Li_2019_CVPR, and segmentation Everingham:2010:PVO:1747084.1747104; Zhuang_2019_CVPR. However, deep CNNs usually need a large number of parameters and high computational complexity to reach high accuracy. Running highly accurate CNNs therefore requires a great deal of memory and computing power, which significantly limits their deployment on lightweight devices such as low-power chips and embedded devices.
Fortunately, Binary Neural Networks (BNNs) can achieve efficient inference and small memory usage utilizing the high-performance instructions including XNOR, Bitcount, and Shift that most low-power devices support Dong_2019_ICCV; Morozov_2019_ICCV; Ajanthan_2019_ICCV; Jung_2019_CVPR; Yang_2019_CVPR; Wang_2019_CVPR; Cao_2019_CVPR; Nagel_2019_ICCV; qin2020bipointnet. Despite the huge speed advantage, existing binary neural networks still suffer a large drop in accuracy compared with their full-precision counterparts DBLP:journals/ijcv/LiuDHZLGD21; DBLP:journals/ijcv/LiuLWYLC20; DBLP:journals/ijcv/SongHGXHS20; DBLP:journals/ijcv/DongNLCSZ19. The reasons for the accuracy drop mainly lie in two aspects.
On the one hand, the limited representation capability and discreteness of binarized parameters lead to significant information loss in the forward propagation. When 32-bit parameters are binarized to 1-bit, the diversity of the neural network model drops sharply, which has been shown to be the key factor in the accuracy drop of BNNs diverse. To increase diversity, some works introduce additional operations. For example, ABC-Net ABCNet utilizes multiple binary bases for more representation levels, and WRPN mishra2018wrpn devises wider networks for more parameters. Bi-Real Net Liu_2018_ECCV adds a full-precision shortcut to the binarized activations to improve feature diversity, which also greatly improves BNNs. However, because of speed and memory limits, any extra floating-point calculation or parameter increase greatly harms practical deployment on edge hardware such as Raspberry Pi and BeagleBone kruger2014benchmarking. Therefore, it remains a great challenge for BNNs to achieve high accuracy while staying deployable on lightweight devices.
On the other hand, accurate gradients supply correct information for network optimization in the backward propagation. During the training of BNNs, however, discrete binarization inevitably causes inaccurate gradients and thus a wrong optimization direction. To deal with this discreteness, different approximations of binarization for the backward propagation have been proposed DBLP:conf/cvpr/CaiHSV17; Liu_2018_ECCV; BNN+; selfBN; ImprovedTraining. They can be mainly categorized into improving the updating capability and reducing the mismatching area between the sign function and its approximation. However, the difference between the early and later training stages is usually ignored. In fact, strong updating capability is highly required at the beginning of training, while small gradient errors become more important at the end of training. Moreover, some works shrink the gap between the sign function and the estimator to an extreme in a certain period of training, whereas our study shows that ensuring that suitable parameters can be updated throughout training is better for BNN optimization. Specifically, when the estimator of a BNN approximates the sign function too closely, the gradient error between them is small, but the gradient values in the BNN are almost all zeros and the BNN can hardly be updated, which is called "saturation" Regularize-act-distribution. Therefore, methods devoted to extremely small gradient error may seriously overlook the harm to the parameter updating capability.
In order to address the above-mentioned issues, we study network binarization from the information flow perspective and propose a novel Distribution-sensitive Information Retention Network (DIR-Net) (see the overview in Fig. 1). We train BNNs with high accuracy by retaining the information during both forward and backward propagation: (1) DIR-Net introduces a novel binarization approach named Information Maximized Binarization (IMB) in the forward propagation, which balances and standardizes the weight distribution before binarizing. With IMB, we minimize the information loss in the forward propagation by maximizing the information entropy of the quantized parameters and minimizing the quantization error. Besides, IMB is conducted offline and thus brings no time cost during inference. (2) The Distribution-sensitive Two-stage Estimator (DTE) is devised to compute gradients in the backward propagation, which minimizes the multi-type information loss by approximating the sign function. The shape change of the DTE is distribution-sensitive, which yields accurate gradients and, more importantly, ensures that enough parameters are always updated throughout the training process.
Note that this paper extends our prior conference publication IRNet, which mainly concentrates on a binary neural network method. This paper further studies the information loss of BNNs from both mathematical and empirical perspectives to understand the forward and backward propagation of BNNs more deeply. Existing works lack analysis of the information loss in binarization, and manual or fixed strategies are usually applied to BNNs, yet significant information loss still exists. Therefore, compared with the conference version, this manuscript comprehensively studies the information loss problem in binarization, presents new distribution-sensitive improvements to BNNs, and compares the proposed method with more SOTA methods on more architectures. Specifically, first, we present a more in-depth analysis of the information loss in the forward and backward propagation in Sec. 4. For the forward propagation, we provide a mathematical study of the effect of binarization errors at the global level, which further clarifies the motivation for the error minimization in our IMB. For the backward propagation, we show that the changes of weight distribution during BNN training may limit the updating capability of BNNs with soft estimators. Second, we propose a novel DIR-Net with the distribution-sensitive estimator DTE, which improves the backward propagation process. Instead of changing the shape of the estimator with a fixed strategy as in IR-Net IRNet, DIR-Net adjusts the shape of the estimator according to the distribution of weights/activations in the backward propagation to retain both the information of accurate gradients and the updating capability of BNNs. Third, we add detailed ablation experiments in Sec. 5 to verify the effectiveness of the techniques in DIR-Net on BNNs (Table 4), and also evaluate the impact of binarization errors (Table 2), the clipping interval of DTE (setting in Table 3), and parameter information entropy (Fig. 6). Fourth, we compare the proposed DIR-Net with more SOTA binarization approaches in Table 6 (BONN gu2019bayesian, Si-BNN wang2020sparsity, PCNN DBLP:journals/corr/abs-1811-12755, Real-to-Bin martinez2020training, MeliusNet bethge2021meliusnet, and ReActNet liu2020reactnet) and evaluate it on compact networks (EfficientNet tan2019efficientnet, MobileNet mobilenet, and DARTS liu2018darts). The results show that our DIR-Net is versatile and effective, and can improve the performance on these structures. Moreover, we also add and discuss more recent related work on network compression and quantization he2021generative; wang2020towards; phan2020binarizing; chen2020binarized; DBLP:journals/ijcv/LiuLWYLC20; Liu_2019_CVPR; DBLP:journals/ijcv/LiuDHZLGD21 to reflect the characteristics and advantages of our DIR-Net.
This work provides a novel and practical view to explain how BNNs work. In addition to the strong capability to retain the information in the forward and backward propagation, DIR-Net has excellent versatility to be extended to various architectures of BNNs and can be trained via the standard training pipeline. We tested our DIR-Net on classification tasks with CIFAR-10 and ImageNet datasets. The results indicate that our DIR-Net performs extremely well in a variety of structures including ResNet-20, VGG-Small, ResNet-18/34, EfficientNet, MobileNet, and DARTS, exceeding other binarization approaches greatly. To validate the performance of DIR-Net on low-power devices, we implement it on Raspberry Pi and it achieves outstanding efficiency.
In summary, our main contributions are listed as follows:
We propose the simple yet efficient Distribution-sensitive Information Retention Network (DIR-Net), which improves BNNs by retaining information during the training process. Compared with existing fixed-strategy estimators, the estimator of DIR-Net (DTE) ensures enough updating capability and improves the accuracy of BNNs.
We measure the amount of information for binarized parameters by information entropy and present an in-depth analysis about the effects of information loss and binarization error in BNNs.
We investigate both forward and backward processes of binary networks from the unified information perspective, which provides new insight into the mechanism of network binarization.
Experiments demonstrate that our method significantly outperforms other state-of-the-art (SOTA) methods in both accuracy and practicality on mainstream and compact architectures. We further show that the DTE in the proposed DIR-Net stably improves the performance.
We implement 1-bit BNNs and evaluate their speed on real-world ARM devices, and the results show that our DIR-Net achieves outstanding efficiency.
The rest of this paper is organized as follows. Section 2 gives a brief review of related model binarization methods and low-power devices. Section 3 describes the preliminaries of binary neural networks. Section 4 describes the proposed approach, formulation, implementation, and discussion in detail. Section 5 provides the experiments conducted for this work, model analysis, and experimental comparisons with other SOTA methods. In Section 6, we conclude the study.
2 Related Work
Recently, resource-limited embedded devices have attracted researchers in artificial intelligence because of their low power consumption, tiny size, and high practicality, which significantly promotes the application of artificial intelligence technology. However, SOTA neural network models rely on massive parameters and large sizes to achieve good performance in different tasks, which also causes complex computation and great resource consumption. To compress and accelerate deep CNNs, many approaches have been proposed, which can be classified into five categories: transferred/compact convolutional filters shufflenet; yu2017on; DBLP:journals/pami/WangXXT19; quantization/binarization DBLP:conf/eccv/HuLWZC18; DBLP:conf/nips/ChenWP19; Wu2020Rotation; zhu2019unified; knowledge distillation chen2018darkrank; zagoruyko2017paying; DBLP:journals/pr/DingCH19; pruning han2016deep; he2017channel; ge2017Compressing; and low-rank factorization lebedev2015speeding; jaderberg2014speeding; lebedev2016fast; DBLP:journals/pr/WenZXYH18.
Compared with other compression methods, model binarization can significantly reduce memory consumption. By compressing the bit-width of parameters from 32 bits to 1 bit, the convolution filters in binary neural networks can achieve $32\times$ memory saving. Model binarization also makes the compressed model fully compatible with the XNOR-Bitcount operation to achieve great acceleration, and these operations can even achieve $58\times$ speedup in theory xnornet. Besides, model binarization changes the architecture less than other model compression methods, which makes it easier to implement on resource-limited devices and attracts attention from researchers. By simply binarizing full-precision parameters, including weights and activations, we can achieve obvious inference acceleration and memory saving.
DBLP:journals/corr/CourbariauxB16 proposed a binarized neural network by simply binarizing the weights and activations to +1 or -1, which compressed the parameters and accelerated CNNs through efficient bitwise operations. However, the binarization operation in this work caused a significant accuracy drop. After this work, many binarization approaches were designed to decrease the gap between BNNs and full-precision CNNs. XNOR-Net xnornet is one of the most classic model binarization methods; it pointed out that using floating-point scalars for each binary filter can achieve significant performance improvement. It therefore proposed a deterministic binarization method that reduces the quantization error of the output matrix by applying 32-bit scalars in each layer, at the cost of more resource-consuming floating-point multiplication and addition. TWN DBLP:journals/corr/LiL16 and TTQ DBLP:journals/corr/ZhuHMD16 utilized more quantization points to improve the representation capability of quantized neural networks. Unfortunately, bitwise operations cannot be used to accelerate the network in these methods, and the memory consumption also increases. ABC-Net ABCNet showed that approximating weights and activations with multiple binary bases can greatly improve the accuracy of BNNs, while it unavoidably decreases the compression and acceleration ratios. HWGQ DBLP:conf/cvpr/CaiHSV17 considered the quantization error from the perspective of the activation function. LQ-Nets LQ-Net applied a large number of learnable full-precision parameters to get better performance while increasing the memory usage. martinez2020training obtained strong BNNs with a multi-step training pipeline and a well-designed objective function in the training process. Some binarization methods are devoted to reducing the gradient error caused by approximating the binarization (sign) function with a well-designed estimator in the backward propagation. BNN+ BNN+ also proposed an estimator to reduce this gap and further studied various estimators to find a better solution. DSQ DSQ and IR-Net IRNet applied soft estimators that gradually change their shape to optimize the network. Bi-Real Liu_2018_ECCV introduced a novel BNN-friendly architecture with the Bi-Real shortcut to improve accuracy, and ReActNet liu2020reactnet further improved the architecture and training steps and achieved better BNN performance.
Though some progress has been made on model binarization, existing binarization approaches still cause a serious decrease in accuracy compared with 32-bit models. First, since existing works do not effectively measure and retain the information in BNNs, massive information loss remains a severe problem in the BNN training process. Second, existing methods focus only on minimizing the gradient error and seriously neglect the updating capability of network parameters. The trade-off between updating capability and accurate representation is something researchers should take into account when designing estimators. Additionally, existing methods proposed to increase the accuracy of BNNs usually incur extra floating-point multiplication or addition. We thus propose DIR-Net to retain the information during the training process of BNNs. Further, it eliminates the resource-consuming floating-point operations in the convolutional layer.
3 Preliminaries
The main operation in a layer of DNNs in the forward propagation can be expressed as
$$z = \mathbf{w} \otimes \mathbf{a}, \tag{1}$$
where $\otimes$ indicates the inner product operation, and $\mathbf{w}$ and $\mathbf{a}$ represent the weight tensors and the input activation tensors, respectively. $\mathbf{a}$ is the output of the previous layer. However, a large number of floating-point multiplications greatly consume memory and computing resources, which heavily limits the applications of CNNs on embedded devices.
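Previous work has shown that bitwise operations, including XNOR, Bitcount, and Shift, can greatly accelerate the inference of CNNs on low-power devices xnornet. Therefore, in order to compress and accelerate deep CNNs, binary neural networks binarize the 32-bit weights and/or activations to 1-bit. In most cases, binarization can be expressed as
$$Q_x(\mathbf{x}) = \alpha \mathbf{B_x}, \tag{2}$$
where $\mathbf{x}$ indicates 32-bit weights $\mathbf{w}$ or 32-bit activations $\mathbf{a}$, and $\mathbf{B_x}$ represents binary weights $\mathbf{B_w}$ or binary activations $\mathbf{B_a}$. $\alpha$ represents scalars, including $\alpha_w$ for binary weights and $\alpha_a$ for binary activations. Usually, the $\mathrm{sign}$ function is used to calculate $\mathbf{B_x}$:
$$\mathbf{B_x} = \mathrm{sign}(\mathbf{x}) = \begin{cases} +1, & \text{if } x \ge 0 \\ -1, & \text{otherwise.} \end{cases} \tag{3}$$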
With the binary weights and activations, the tensor multiplication operations can be approximated by
$$Q_w(\mathbf{w}) \otimes Q_a(\mathbf{a}) = \alpha_w \alpha_a (\mathbf{B_w} \odot \mathbf{B_a}), \tag{4}$$
where $\odot$ indicates the bitwise inner product operation of tensors implemented by the bitwise operations XNOR and Bitcount. In addition, since the Shift operation is more hardware-friendly, some works even replace the multiplication in the inference process of BNNs with Shift, such as the Shift-based batch normalization DBLP:journals/corr/CourbariauxB16, which further accelerates the inference of BNNs on hardware.
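The bitwise inner product can be illustrated with a small sketch. It assumes a simple encoding that is not specified in the text (bit 1 for +1, bit 0 for -1), and the helper names `pack_bits` and `xnor_bitcount_dot` are hypothetical. Under that assumption, the inner product of two length-n vectors equals twice the number of matching bits minus n.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an int; bit i is 1 when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("inputs must be +1 or -1")
        if v == 1:
            word |= 1 << i
    return word


def xnor_bitcount_dot(w_bits, a_bits, n):
    """Inner product of two length-n {+1, -1} vectors computed from packed bits."""
    mask = (1 << n) - 1
    agree = ~(w_bits ^ a_bits) & mask   # XNOR: 1 wherever the two signs match
    matches = bin(agree).count("1")     # Bitcount (population count)
    return 2 * matches - n              # matches minus mismatches


w = [1, -1, -1, 1, 1, -1, 1, 1]
a = [1, 1, -1, -1, 1, -1, -1, 1]
reference = sum(wi * ai for wi, ai in zip(w, a))   # ordinary inner product
result = xnor_bitcount_dot(pack_bits(w), pack_bits(a), len(w))
assert result == reference
```

On real hardware the packed words would be fixed-width registers and the population count a single instruction; the Python integers here only mimic that behaviour.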
However, the derivative of the sign function is zero almost everywhere, which is obviously incompatible with the backward propagation, since exact gradients for activations and/or weights before the discretization would be zero. Therefore, many works adopt the Straight-Through Estimator (STE) bengio2013estimating in gradient propagation, which is specifically the Identity or Hardtanh function.
4 Distribution-sensitive Information Retention Network
In this paper, we argue that severe information loss during training hinders the high accuracy of BNNs. To be exact, information loss is mostly caused by the sign function in the forward propagation and the approximation of gradients in the backward propagation, and it greatly limits the performance of BNNs. To address this problem, we propose a novel network, the Distribution-sensitive Information Retention Network (DIR-Net), which retains information during training and delivers excellent performance for BNNs. Besides, all convolution operations in DIR-Net are replaced by hardware-friendly bitwise operations.
4.1 Information Maximized Binarization in the Forward Propagation
In the forward propagation, the BNN usually suffers from both a decrease in information entropy and quantization error, which together cause information loss in weights and activations. To retain the information and minimize the loss in the forward propagation, we propose Information Maximized Binarization (IMB) that jointly considers both information loss and quantization error.
4.1.1 Information Loss in the Forward Propagation
Because the binarization operation discretizes the parameters, the full-precision and binarized parameters differ greatly in value, which causes significant information loss. In order to make the representations of the binarized network closer to the full-precision counterparts, the binarization error of the BNN should be minimized. Consider the computation in a multivariate function $f(\mathbf{x})$, where $\mathbf{x}$ denotes the variable vector with full precision. When the function $f$ represents a neural network, $\mathbf{x}$ represents the 32-bit parameters (weights $\mathbf{w}$ / activations $\mathbf{a}$). The global error $e$ caused by quantizing $\mathbf{x}$ can be expressed as
$$e = f(\mathbf{x}) - f(Q_x(\mathbf{x})), \tag{5}$$
where $Q_x(\mathbf{x})$ indicates the variable vector quantized from $\mathbf{x}$. When the probability distribution of $\mathbf{x}$ is known, the error distribution and the moments of the error can be computed. For example, the minimization of the expected absolute error can be presented as
$$\min \mathbb{E}[|e|] = \min \int \left| f(\mathbf{x}) - f(Q_x(\mathbf{x})) \right| p(\mathbf{x}) \, d\mathbf{x}, \tag{6}$$
where $\mathbb{E}[\cdot]$ denotes the expectation operator and $p(\mathbf{x})$ denotes the probability density function of $\mathbf{x}$. In general, $f(\mathbf{x})$ can be any linear or nonlinear function of its arguments, and an analytical evaluation of this multidimensional integral can be very difficult. In prior work 93812; 35496, a simplifying assumption is made where the quantity $f(Q_x(\mathbf{x}))$ is approximated by its first-order Taylor series expansion
$$f(Q_x(\mathbf{x})) \approx f(\mathbf{x}) + \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \left( Q_x(\mathbf{x})_i - x_i \right). \tag{7}$$
For a certain value of $\mathbf{x}$, the derivative $\partial f / \partial x_i$ is constant and non-zero. Therefore, minimizing the global error can be approximated as minimizing the quantization error between the quantized (binarized) vectors and the full-precision counterparts. The optimization problem in Eq. (6) can be simplified as
$$\min J(Q_x(\mathbf{x})) = \left\| \mathbf{x} - Q_x(\mathbf{x}) \right\|^2, \tag{8}$$
where $J(Q_x(\mathbf{x}))$ is the quantization error of the quantized parameters.
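The step from the global error to the quantization error can be spelled out a little further. The following is only a sketch, assuming the network is written as a differentiable function $f$ of the full-precision vector $\mathbf{x}$ and that $Q_x(\mathbf{x})$ denotes its quantized version (these symbol names are illustrative choices). Combining the first-order expansion with the Cauchy-Schwarz inequality gives

```latex
\begin{aligned}
|e| &= \left| f(\mathbf{x}) - f(Q_x(\mathbf{x})) \right|
     \approx \left| \nabla f(\mathbf{x})^{\top} \bigl( Q_x(\mathbf{x}) - \mathbf{x} \bigr) \right| \\
    &\le \left\| \nabla f(\mathbf{x}) \right\|_2 \, \left\| Q_x(\mathbf{x}) - \mathbf{x} \right\|_2 .
\end{aligned}
```

So, for a fixed $\mathbf{x}$ with a bounded gradient, shrinking the quantization error shrinks an upper bound on the global error. The bound holds only to first order, so it says nothing about cases where the quantization step is large relative to the curvature of $f$.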
Many studies, such as xnornet; DBLP:journals/corr/abs-1708-08687; LQ-Net, focus on binarized neural networks and optimize the quantizer by minimizing the quantization error. Their objective functions (typically Eq. (8)) suppose that quantized models should just strictly follow the pattern of full-precision models, which is not always enough, especially when the parameters are quantized to an extremely low bit-width. For binary models, the parameters are restricted to two values, which limits the representation capability of parameters and makes the information carried by neurons vulnerable and easy to lose. Besides, the solution space of binary models is also quite different from that of full-precision models. Without retaining the information during training, minimizing the quantization error alone is insufficient to ensure a highly accurate binarized network.
Therefore, our study is derived from the perspective of information retention. We state the precise definition of information in BNNs and then analyze mathematically how to maximize it. For a random variable $b \in \{-1, +1\}$ obeying the Bernoulli distribution, each element in $\mathbf{B_x}$ can be viewed as a sample of $b$. The information entropy of $Q_x(\mathbf{x})$ in Eq. (2) can be calculated by
$$\mathcal{H}(Q_x(\mathbf{x})) = \mathcal{H}(\mathbf{B_x}) = -p \ln(p) - (1-p) \ln(1-p), \tag{9}$$
where $p$ and $1-p$ denote the probabilities of an element taking the values $+1$ and $-1$, respectively, and $p \in (0, 1)$. By maximizing the information entropy in Eq. (9), we make the binarized parameters carry the maximum amount of information, so that the information in the full-precision counterpart is retained.
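The entropy of a binarized tensor is easy to evaluate numerically. The sketch below assumes the natural logarithm and the convention sign(0) = +1; the function names `bernoulli_entropy` and `binarized_entropy` are hypothetical. It shows that a balanced tensor reaches the maximum value ln 2, while a skewed one carries less information.

```python
import math


def bernoulli_entropy(p):
    """Entropy in nats of a binary variable that equals +1 with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0  # a constant variable carries no information
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def binarized_entropy(values):
    """Entropy of sign(values), estimating p as the fraction of non-negative entries."""
    p = sum(1 for v in values if v >= 0) / len(values)
    return bernoulli_entropy(p)


balanced = [-0.7, 0.2, -0.1, 0.9]   # two negative, two non-negative entries
skewed = [0.3, 0.2, 0.1, -0.9]      # three of four entries binarize to +1
h_balanced = binarized_entropy(balanced)
h_skewed = binarized_entropy(skewed)
assert h_skewed < h_balanced
```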
4.1.2 Information Retention via Information Maximized Binarization
To retain the information and minimize the loss in the forward propagation, we propose Information Maximized Binarization (IMB), which jointly considers both information loss and quantization error. First, we balance the weights of the BNN to maximize the information of weights and activations. Under the Bernoulli distribution assumption and the symmetry assumption of $\mathbf{x}$, when $p = 0.5$ in Eq. (9), the information entropy of the quantized values $Q_x(\mathbf{x})$ takes its maximum value, which means the binarized values should be evenly distributed. However, it is non-trivial to make the weights of BNNs close to that uniform distribution through backward propagation alone.
Fortunately, we find that simply redistributing the full-precision counterpart of the binarized weights can maximize the information entropy of binarized weights and activations simultaneously. Our IMB balances the weights to have zero mean by subtracting the mean of the full-precision weights. Moreover, we further standardize the balanced weights to mitigate the negative effect of weight magnitude. The standardized balanced weights $\hat{\mathbf{w}}_{std}$ are obtained through the standardization and balance operations as follows:
$$\hat{\mathbf{w}}_{std} = \frac{\hat{\mathbf{w}}}{\sigma(\hat{\mathbf{w}})}, \quad \hat{\mathbf{w}} = \mathbf{w} - \mu(\mathbf{w}), \tag{10}$$
where $\mu(\cdot)$ and $\sigma(\cdot)$ denote the mean and standard deviation, respectively. $\hat{\mathbf{w}}_{std}$ has two characteristics: (1) $\mu(\hat{\mathbf{w}}_{std}) = 0$, which maximizes the information entropy of the obtained binary weights; (2) $\sigma(\hat{\mathbf{w}}_{std}) = 1$, which makes the full-precision weights involved in binarization more dispersed. Therefore, compared with using the balance operation alone, the standardized balance operation makes the weights in the network update steadily, and thus makes the binary weights more stable during training.
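Since the value of $Q_w(\hat{\mathbf{w}}_{std})$ depends on the sign of $\hat{\mathbf{w}}_{std}$ and the distribution of $\mathbf{w}$ is almost symmetric Simultaneously-Optimizing-Weight; ACIQ, the balance operation can maximize the information entropy of the quantized $Q_w(\hat{\mathbf{w}}_{std})$ on the whole. When IMB is used for weights, the information flow of activations in the network can also be maintained. Supposing the quantized activations $Q_a(\mathbf{a})$ have mean $\mathbb{E}[Q_a(\mathbf{a})] = \mu \mathbf{1}$, the mean of $z$ can be calculated by
$$\mathbb{E}[z] = Q_w(\hat{\mathbf{w}}_{std}) \otimes \mathbb{E}[Q_a(\mathbf{a})] = Q_w(\hat{\mathbf{w}}_{std}) \otimes \mu \mathbf{1}. \tag{11}$$
Since the IMB for weights is applied in each layer, we have $Q_w(\hat{\mathbf{w}}_{std}) \otimes \mathbf{1} = 0$, and the mean of the output is zero. Therefore, the information entropy of activations in each layer can be maximized, which means that the information in activations can be retained.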
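The effect of balancing and standardizing can be checked on synthetic data. The sketch below assumes the population standard deviation (dividing by the number of elements) and uses a hypothetical Gaussian weight sample with a positive offset; the function names are illustrative. Before balancing, nearly all weights binarize to +1; afterwards the two signs are close to evenly split.

```python
import math
import random


def balance_and_standardize(weights):
    """Subtract the mean, then divide by the (population) standard deviation."""
    mean = sum(weights) / len(weights)
    centered = [w - mean for w in weights]
    std = math.sqrt(sum(c * c for c in centered) / len(centered))
    if std == 0.0:
        raise ValueError("weights are constant; standard deviation is zero")
    return [c / std for c in centered]


def positive_fraction(values):
    """Fraction of entries that binarize to +1 under sign(x) with sign(0) = +1."""
    return sum(1 for v in values if v >= 0) / len(values)


rng = random.Random(0)                                # fixed seed for reproducibility
raw = [rng.gauss(0.3, 0.05) for _ in range(1000)]     # weights biased toward positive values
std_w = balance_and_standardize(raw)

frac_before = positive_fraction(raw)      # close to 1: binary weights carry little information
frac_after = positive_fraction(std_w)     # close to 0.5: entropy of binary weights near its maximum
```

Because this operates only on the latent full-precision weights, it adds no work at inference time once the binary weights have been computed.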
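Then, to further minimize the quantization error defined in Eq. (8) and avoid the extra expensive floating-point calculations caused by 32-bit scalars in previous binarization methods, IMB introduces an integer shift-based scalar $s$ to expand the representation capability of binary weights. The optimal shift-based scalar can be solved by
$$\mathbf{B_w}^*, s^* = \arg\min_{\mathbf{B_w}, s} \left\| \hat{\mathbf{w}}_{std} - \mathbf{B_w} \ll\gg s \right\|^2 \quad \text{s.t. } s \in \mathbb{N}, \tag{12}$$
where $\ll\gg$ stands for left or right Bit-shift. $\mathbf{B_w}^*$ is calculated by $\mathrm{sign}(\hat{\mathbf{w}}_{std})$, thus $s^*$ can be solved as
$$s^* = \mathrm{round}\left( \log_2 \left( \|\hat{\mathbf{w}}_{std}\|_1 / n \right) \right), \tag{13}$$
where $n$ is the number of elements in $\hat{\mathbf{w}}_{std}$. Therefore, our IMB for the forward propagation can be presented as below:
$$Q_w(\hat{\mathbf{w}}_{std}) = \mathbf{B_w} \ll\gg s = \mathrm{sign}(\hat{\mathbf{w}}_{std}) \ll\gg s, \quad Q_a(\mathbf{a}) = \mathbf{B_a} = \mathrm{sign}(\mathbf{a}). \tag{14}$$
The main operations in DIR-Net can be expressed as
$$z = (\mathbf{B_w} \odot \mathbf{B_a}) \ll\gg s. \tag{15}$$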
As shown in Fig. 2, the parameters quantized by IMB have the maximum information entropy under the Bernoulli distribution. We call our binarization method "Information Maximized Binarization" because the parameters are balanced before the binarization operations to retain information.
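A shift-based scalar can be sketched as a power-of-two scale applied by bit shifts. The closed form used below, the rounded base-2 logarithm of the mean absolute weight, is an assumption taken from the earlier IR-Net formulation rather than something stated explicitly here, and the helper names `shift_scalar`, `imb_quantize`, and `apply_shift` are hypothetical. Negative values of the exponent are treated as right shifts.

```python
import math


def shift_scalar(std_weights):
    """Integer exponent s such that 2**s approximates the mean absolute weight."""
    mean_abs = sum(abs(w) for w in std_weights) / len(std_weights)
    if mean_abs == 0.0:
        raise ValueError("all weights are zero; log2 is undefined")
    # Note: Python's round() resolves exact .5 ties to the nearest even integer.
    return round(math.log2(mean_abs))


def imb_quantize(std_weights):
    """Return sign(w) scaled by 2**s, together with the exponent s."""
    s = shift_scalar(std_weights)
    scale = 2.0 ** s
    return [(1.0 if w >= 0 else -1.0) * scale for w in std_weights], s


def apply_shift(accumulator, s):
    """Scale an integer accumulator by 2**s with shifts instead of a multiplication."""
    return accumulator << s if s >= 0 else accumulator >> (-s)


quantized, s = imb_quantize([0.5, -0.5, 0.5, -0.5])
# Here the mean absolute value is exactly 0.5, so s is -1 and every weight becomes +/-0.5.
scaled = apply_shift(6, s)   # the integer result of an XNOR-Bitcount sum, shifted right once
```

The right shift on a negative Python integer rounds toward negative infinity, which mirrors an arithmetic shift on two's-complement hardware.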
Note that IMB serves as an implicit rectifier that reshapes the data distribution before binarization. In the literature, a few studies also noticed this positive effect on the performance of BNNs and adopted empirical settings to redistribute parameters xnornet; Regularize-act-distribution. For example, Regularize-act-distribution identified the specific degeneration problem of binarization and solved it using a specially designed additional regularization loss. Different from these works, we straightforwardly take the information perspective to rethink the impact of parameter distribution before binarization and provide the optimal solution by maximizing the information entropy. In this framework, IMB accomplishes the distribution adjustment by simply balancing and standardizing the weights before binarization. This means that our method can be easily and widely applied to various neural network architectures and be directly plugged into the standard training pipeline with very limited extra computation cost. Moreover, since the convolution operations in our DIR-Net are thoroughly replaced by bitwise operations, including XNOR, Bitcount, and Shift, the implementation of DIR-Net can achieve extremely high inference acceleration on edge devices.
4.2 Distribution-sensitive Two-stage Estimator in the Backward Propagation
In the backward propagation, affected by the limited update range of the estimator and the gradient approximation error simultaneously, the gradient of the BNN suffers from information loss. In order to retain the information originated from the loss function in the backward propagation, we propose a progressive Distribution-sensitive Two-stage Estimator (DTE) to obtain the approximation of gradients.
4.2.1 Information Loss in the Backward Propagation
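Due to the discretization caused by binarization, the approximation of gradients is inevitable in the backward propagation. Since the impact of quantization cannot be accurately modeled by the approximation, a huge loss of information occurs. The approximation can be formulated as
$$\frac{\partial \mathcal{L}(\mathbf{w})}{\partial \mathbf{w}} = \frac{\partial \mathcal{L}(\mathbf{w})}{\partial Q_w(\mathbf{w})} \cdot \frac{\partial Q_w(\mathbf{w})}{\partial \mathbf{w}} \approx \frac{\partial \mathcal{L}(\mathbf{w})}{\partial Q_w(\mathbf{w})} \cdot g'(\mathbf{w}), \tag{16}$$
where $\mathcal{L}(\mathbf{w})$ indicates the loss function, $g(\mathbf{w})$ represents the approximation of the $\mathrm{sign}$ function, and $g'(\mathbf{w})$ denotes the derivative of $g(\mathbf{w})$. In previous work, there are two commonly used approximation practices:
$$\mathrm{Identity}: y = x \quad \text{or} \quad \mathrm{Clip}: y = \mathrm{Hardtanh}(x). \tag{17}$$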
The Identity function completely ignores the effect of binarization and directly passes the gradient information of output values to input values. As shown in the shaded area of Fig. 4(a), the gradient error is huge and accumulates through layers during the backward propagation. To avoid unstable training, rather than ignoring the error caused by Identity, it is necessary to design a better estimator that retains accurate gradient information.
The Clip function considers the clipping attribute of binarization, which means that only the gradient information inside the clipping interval ($[-1, 1]$) can be passed through the backward propagation. As shown in Fig. 4(b), for parameters outside $[-1, 1]$ the gradients are clamped to zero, which means that once a value jumps outside the clipping interval, it will not be updated anymore. This feature greatly limits the updating capability of the backward propagation, so the Clip approximation makes optimization more difficult and harms the accuracy of models. Strong updating capability is essential for the training of BNNs, especially at the beginning of the training process.
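The difference between the two approximations shows up directly in the gradients they pass back. The sketch below uses plain Python scalars; the function names and the sample weights are hypothetical, and the clipping bound of 1 follows the interval discussed above. The identity rule passes every upstream gradient through, while the clipped rule zeroes the gradient of any weight outside the interval.

```python
def sign_forward(x):
    """Forward binarization: +1 for non-negative inputs, -1 otherwise."""
    return 1.0 if x >= 0 else -1.0


def identity_backward(x, grad_output):
    """Identity approximation: the upstream gradient passes through unchanged."""
    return grad_output


def clip_backward(x, grad_output, bound=1.0):
    """Clip (Hardtanh) approximation: gradient passes only inside [-bound, bound]."""
    return grad_output if abs(x) <= bound else 0.0


weights = [-1.7, -0.4, 0.0, 0.6, 1.3]
upstream = 0.5   # the same upstream gradient for every weight, for illustration

binary = [sign_forward(w) for w in weights]
identity_grads = [identity_backward(w, upstream) for w in weights]
clip_grads = [clip_backward(w, upstream) for w in weights]
# The weights at -1.7 and 1.3 receive no gradient under Clip and therefore stop updating.
```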
Existing estimators are designed to obtain a gradient close to the derivative of the sign function while retaining the updating capability of the BNN, so most of them have an updatable interval; e.g., for the Clip function, the interval is $[-1, 1]$. However, we observe an interesting trend in the distribution of weights during training. As Fig. 3 shows, the number of weights close to 0 continuously decreases during training, which occurs in most BNNs with various estimators (such as the Identity and Clip approximations). This phenomenon pushes more weights outside the updatable interval and brings great challenges to the design of estimators. For BNNs with the Clip approximation, more weights fall outside $[-1, 1]$ and can no longer be updated, which seriously limits the updating capability. Some soft approximation functions designed to reduce the gradient error are also affected by this problem, since they shrink the updatable interval of the parameters as well. For example, in the later stage of the estimator in IR-Net IRNet, the update range of the estimator keeps shrinking to reduce the information loss caused by gradient errors. At the end of this stage, less than 3% of the weights can be updated (Fig. 7). In other words, the BNN has almost lost its updating capability at this point.
The Identity function causes a gradient error between the sign function in binarization and the gradients in the backward propagation, while the Clip and soft approximation functions leave part of the gradients outside the updatable interval. Our method tries to make a trade-off that takes advantage of both types of gradient approximation while avoiding their drawbacks.
4.2.2 Information Retention via Distribution-sensitive Two-stage Estimator
To strike a balance between them and obtain the optimal approximation of gradients in the backward propagation, we propose the Distribution-sensitive Two-stage Estimator (DTE)
where represents the derivable approximate substitute for the forward function in the backward propagation, and denotes the random variable sampled from the full-precision parameter . The and are distribution-sensitive variables, which changes along with the training process to restrict the shape of approximate function
where denotes the current epoch and is the number of total epochs, and is probability mass function of that reflects the distribution of the element values in the parameter . indicates the lower limit for the percentage of parameters with high updating capability, and means that the number of parameters in the range is of the total. And is empirically set to , taking both updating capability and accurate gradient into account. and are and , respectively.
In order to retain the information originated from the loss function in the backward propagation, the DTE proposes a progressive distribution-sensitive two-stage method to obtain the approximation of gradients.
Stage 1: Retain the updating capability of the backward propagation algorithm. We keep the derivative value of gradient estimation function near one, and then gradually reduce the clipping value from a larger number to one. At the start of this stage, the shape of DTE is depending on the weight distribution of each layer, which ensures all parameters to be fully updated. DTE adaptively changes the clipping value during this stage to get more accurate gradients. The derivation of the DTE in the first stage is presented as:
Applying this method, our estimation function evolves from the Identity to the Clip approximation, which ensures a strong updating capability at the beginning of the training process and alleviates the loss of updating capability.
Stage 2: Keep the balance between accurate gradients and strong updating capability. In this stage, we keep the clipping value at one and gradually push the derivative curve towards the shape of the step function, while ensuring that enough parameters are still updated. Throughout this stage, the shape of DTE changes according to the parameter distribution, and the derivative around 0 is continuously increased to obtain a more accurate gradient until too few parameters would remain updatable. The derivation of the DTE in the second stage is presented as:
Benefiting from the proposed method, our estimation function evolves from the Clip approximation to the sign function, which ensures consistency between forward and backward propagation.
Fig. 4(c) shows how the shape of DTE changes in each stage. DTE updates all parameters in the first stage and further improves the accuracy of parameters in the second stage. Based on this two-stage estimation, DTE reduces the gap between the forward binarization function and the backward approximation function. Meanwhile, the shape of DTE is adaptively adjusted according to the parameter distribution to ensure that a certain volume of parameters can be updated in each iteration. In this way, all the parameters can be reasonably updated.
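The two-stage behaviour described above can be illustrated with a small sketch. It assumes a tanh-shaped surrogate g(x) = k * tanh(t * x), in the spirit of soft approximation estimators; this functional form, the constants `T_MIN`, `T_MAX`, and `GRAD_THRESHOLD`, the geometric back-off factor, and all function names are hypothetical choices made for illustration and are not the DTE equations of this paper. The sketch only shows the qualitative idea: a shrinking clipping value with unit slope in stage 1, and a steepening curve in stage 2 that is held back whenever too few weights would remain updatable.

```python
import math

T_MIN, T_MAX = 0.1, 10.0   # assumed steepness range (hypothetical constants)
GRAD_THRESHOLD = 0.1       # assumed: a weight is "updatable" if g'(w) >= this value


def derivative(x, k, t):
    """Derivative of the assumed surrogate g(x) = k * tanh(t * x)."""
    return k * t * (1.0 - math.tanh(t * x) ** 2)


def updatable_fraction(weights, k, t):
    """Fraction of weights whose surrogate derivative reaches GRAD_THRESHOLD."""
    hits = sum(1 for w in weights if derivative(w, k, t) >= GRAD_THRESHOLD)
    return hits / len(weights)


def two_stage_params(epoch, total_epochs, weights, eps=0.1):
    """Return (k, t) for the current epoch.

    Stage 1 (t < 1): k = 1/t keeps g'(0) = 1 while the clipping value k shrinks to 1.
    Stage 2 (t >= 1): k = 1 and t grows, but t is reduced again whenever fewer
    than `eps` of the weights would remain updatable.
    """
    t = T_MIN * (T_MAX / T_MIN) ** (epoch / total_epochs)
    if t < 1.0:
        return 1.0 / t, t
    # Back off geometrically until enough weights stay updatable.
    while t > 1.0 and updatable_fraction(weights, 1.0, t) < eps:
        t = max(1.0, t * 0.9)
    return 1.0, t
```

With weights clustered near -1 and +1, the unconstrained schedule would reach `t = T_MAX` at the last epoch and leave almost no weight with a usable gradient; the back-off loop stops at the steepest curve that still keeps the requested fraction updatable.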
4.3 Analysis and Discussions
The training process of our DIR-Net is summarized in Algorithm 1. In this section, we will analyze DIR-Net from different aspects.
4.3.1 Complexity Analysis
Since IMB and DTE are applied during the training process, there is no extra operation for binarizing activations in DIR-Net. In IMB, the novel shift-based scalars reduce the computation cost compared with existing solutions that use 32-bit scalars (e.g., XNOR-Net and LQ-Net), as shown in Table 1. We further test the real deployment speed on hardware and report the results in Sec. 5.3.
and , where , , , , , denote the number of output channels, input channels, kernel width, kernel height, output width, and output height, respectively. The Bitwise operation mainly consists of XNOR, Bitcount and Shift.
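To make the three bitwise operations concrete, the following minimal sketch computes a binary dot product with XNOR and Bitcount and then applies a power-of-two scalar with Shift. It works on plain Python integers rather than on SIMD registers, and the helper names `pack_signs`, `binary_dot`, and `shift_scale` are hypothetical, so it illustrates the arithmetic only, not the deployed kernels.

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an int; bit i is 1 when values[i] is +1."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, w_bits, n):
    """Dot product of two n-element +1/-1 vectors stored as bit patterns."""
    mask = (1 << n) - 1
    agree = bin(~(a_bits ^ w_bits) & mask).count("1")  # XNOR, then Bitcount
    return 2 * agree - n  # each agreement adds +1, each disagreement adds -1


def shift_scale(value, s):
    """Multiply an integer by 2**s using a shift (arithmetic right shift if s < 0)."""
    return value << s if s >= 0 else value >> -s
```

For two n-element vectors, the result of `binary_dot` equals the ordinary sum of elementwise products, and `shift_scale` replaces the floating-point multiplication that a 32-bit scalar would require.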
4.3.2 Stabilize Training
In IMB, weight standardization is introduced to stabilize training by avoiding drastic changes of the binarized weights. Fig. 5 shows that, without standardization, the weights are clearly more concentrated around 0. This means that the signs of most weights can easily flip during optimization, which directly causes unstable training of binary neural networks. By redistributing the data, weight standardization implicitly builds a bridge between the forward IMB and the backward DTE, contributing to more stable training of binary neural networks. Moreover, the proposed DTE also stabilizes training, not only by ensuring the updating capability of the network but also by preventing the estimator from being too steep, which avoids excessively enlarged gradients.
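The effect of standardization before binarization can be sketched as follows. The sketch assumes a zero-mean, unit-variance standardization and a power-of-two scalar 2^s with s = round(log2(mean |w|)); both formulas and the function names are assumptions chosen for illustration, since the exact IMB equations are not restated in this section.

```python
import math


def standardize(weights):
    """Zero-mean, unit-variance weights (assumes the weights are not all identical)."""
    mean = sum(weights) / len(weights)
    centered = [w - mean for w in weights]
    std = math.sqrt(sum(c * c for c in centered) / len(centered))
    return [c / std for c in centered]


def binarize_with_shift(weights):
    """Return the signs of the standardized weights and an integer shift s.

    The binarized weights are then represented as sign * 2**s, so the scalar
    can be applied with a bit shift instead of a floating-point multiplication.
    """
    w_hat = standardize(weights)
    signs = [1 if w >= 0 else -1 for w in w_hat]
    mean_abs = sum(abs(w) for w in w_hat) / len(w_hat)
    shift = round(math.log2(mean_abs))
    return signs, shift
```

Because the standardized weights are centered, small shifts in the raw weights no longer flip a large share of the signs at once, which is the stabilizing effect discussed above.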
5 Experiments
We perform image classification on two benchmark datasets, CIFAR-10 CIFAR and ImageNet (ILSVRC12) Deng2009ImageNet, to evaluate our DIR-Net and compare it with other recent SOTA methods.
We implement our DIR-Net in PyTorch because of its flexibility and its powerful automatic differentiation mechanism. To build a binarized model, we simply replace the convolutional layers of the original models with binary convolutional layers binarized by our method.
Network Structures: We evaluate DIR-Net on mainstream and compact CNN structures, including VGG-Small LQ-Net, ResNet-18, and ResNet-20 on CIFAR-10, and ResNet-18, ResNet-34 he2016deep, MobileNetV1 mobilenet, EfficientNet-B0 tan2019efficientnet, and DARTS liu2018darts on ImageNet. To verify the versatility of our method, we evaluate DIR-Net both on networks with normal structure (such as ResNet and MobileNet) and on networks with the ReActNet liu2020reactnet structure, which is specifically proposed for binary networks and enjoys better accuracy. We binarize all convolutional and fully connected layers except the first and last ones, and keep the 1x1 convolutions in full precision in EfficientNet and DARTS. As for the activation layers,
is chosen to be the activation function, rather than ReLU.
Hyper-parameters and other setups: We train our DIR-Net from scratch (i.e., from random initialization) without using any pre-trained models. To evaluate DIR-Net on different CNN structures, we mostly follow the original hyper-parameter settings and training steps of the corresponding papers xnornet; LQ-Net; Liu_2018_ECCV; IRNet; liu2020reactnet. Specifically, for experiments on CIFAR-10, we train the models for up to 400 epochs. The learning rate starts at 1e-1 and decays to 0 during training following the Cosine Annealing schedule loshchilov2016sgdr. For experiments on ImageNet, we train the models for up to 62500 iterations. The learning rate starts at 1e-2 and is divided by 10 at 18750, 37500, and 56250 iterations successively. Following the original papers, a weight decay of 1e-4 and a batch size of 128 are adopted in all experiments, and SGD with a momentum of 0.9 is used as the optimizer.
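The two learning-rate schedules above can be written down directly. The sketch below is plain Python with hypothetical function names; it assumes the cosine schedule is evaluated once per epoch without warm restarts, which the text does not specify. In PyTorch, the same behaviour is typically obtained with `torch.optim.lr_scheduler.CosineAnnealingLR` and `torch.optim.lr_scheduler.MultiStepLR`.

```python
import math


def cosine_lr(epoch, total_epochs=400, base_lr=0.1):
    """Cosine annealing from base_lr at epoch 0 down to 0 at total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


def step_lr(iteration, base_lr=0.01, milestones=(18750, 37500, 56250), gamma=0.1):
    """Step decay: multiply the rate by gamma at every milestone already reached."""
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** drops
```

For example, `cosine_lr(200)` is half the initial rate for CIFAR-10, and `step_lr` returns 1e-2, 1e-3, 1e-4, and 1e-5 in the four intervals delimited by the ImageNet milestones.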
5.1 Ablation Study
In this section, we evaluate the performance and effects of our proposed IMB and DTE on BNNs.
5.1.1 Effect of IMB
Our proposed IMB adjusts the distribution of weights to maximize the information entropy of binary weights and binary activations in the network. Due to the balance operation before binarization, the binary weight parameters of each layer in DIR-Net have the maximal information entropy. The maximization of the information entropy of the binary activations, which are affected by the binary weights, is also guaranteed.
To illustrate the information retention capability of IMB, Fig. 6 shows the reduction in information entropy of each layer's binary activations in the network quantized by IMB and in the vanilla binary neural network, respectively. As the figure shows, vanilla binarization results in a large decrease in the information entropy of binary activations. Notably, the information loss appears to accumulate across layers in the forward propagation. In the IMB-quantized networks, by contrast, the information entropy of the activations of each layer is close to the maximal information entropy under the Bernoulli distribution. IMB thus achieves information retention for the binary activations in each layer.
We further evaluate the impact of information on our DIR-Net in detail. The information in DIR-Net is defined by Eq. (9), and can be adjusted by changing the mean value of activations. Fig. 6 presents the relationship between the information of weights in DIR-Net and the final accuracy. The information entropy of binarized activations is determined by the percentage of their full-precision counterparts that are less than 0. The results show that information entropy is almost positively correlated with network accuracy. When the information is maximized, the BNN achieves the highest accuracy, which verifies the effectiveness of IMB. Therefore, information entropy is an important indicator of the amount of information that a BNN holds, and we can improve the performance of BNNs by maximizing the entropy.
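The dependence of the entropy on the share of negative values can be checked numerically. The sketch below assumes the standard entropy of a Bernoulli variable measured in bits; the function names are hypothetical.

```python
import math


def bernoulli_entropy(p):
    """Entropy in bits of a binary variable that takes one value with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def sign_entropy(values):
    """Entropy of sign(values), determined by the fraction of values below 0."""
    p_neg = sum(1 for v in values if v < 0) / len(values)
    return bernoulli_entropy(p_neg)
```

The entropy reaches its maximum of 1 bit when exactly half of the full-precision values are negative, and falls towards 0 as the binarized values become dominated by a single sign, which is consistent with the trend described above.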
In addition, we analyze the impact of IMB on binarization errors. In Table 2, we compare binarized networks that apply different approaches to quantize weights (activations are binarized directly for the sake of fairness), including vanilla binarization, XNOR binarization, and our IMB. XNOR binarization uses 32-bit scalars, while our IMB uses integer shift-based scalars. Compared with vanilla binarization, BNNs quantized by XNOR and IMB have a much smaller binarization error owing to the use of scalars. Compared with XNOR binarization, our IMB further eliminates all floating-point scalars (1.3e3) and the related floating-point computation, while the binarization error increases by only 5% (0.1e4) and the quantized network achieves better accuracy (84.9%). The results show that our IMB strikes a better balance between inference speedup and binarization error minimization.
5.1.2 Effect of DTE
First, we discuss the setting of the lower-limit parameter in DTE, which determines the degree of updating capability that DTE maintains. As mentioned in Section 4.2, we control the value of this parameter to ensure that at least the corresponding percentage of parameters is updated during the whole training process. However, if the value is set too large, the gradients will not be accurate, since the gap between the estimator and the sign function is huge. Table 3 shows the accuracy of DIR-Net under different settings, based on the ResNet-20 architecture and the CIFAR-10 dataset. The results show that the accuracy increases as the value decreases over a considerable range (approximately 100%-20%). Moreover, the models that properly ensure the updating capability perform better than those that do not control the lower limit of updating capability at all (the value is set to 0%). The results show that, compared with an estimator that simply minimizes the gradient error, appropriately raising the minimum updating capability of the binarized model and keeping enough parameters updated during training are more helpful for improving BNN performance. Therefore, in our experiments, we empirically set the value to 10% to achieve a good trade-off between accurate gradients and updating capability.
Then, to illustrate the effect and necessity of DTE, we show the distribution of weight parameters at different epochs of training in Fig. 7. The figures in the first row show the distribution of data, and those in the second and third rows show the corresponding derivative curves of the existing EDE and our DTE, respectively. Among the derivative curves, the blue curves represent the derivative of DTE and the yellow ones represent the derivative of STE (with clipping). In the first stage of DTE (epoch 10 to epoch 200 in Fig. 7), much of the data lies beyond the range of [-1, +1], so the estimator should have a larger effective updating range to ensure the updating capability of the BNN. In addition, the peakedness of the weight distribution is high and a large amount of data is clustered near zero when training begins. DTE keeps the derivative close to that of the Identity function at this stage to prevent the derivative value near zero from becoming too large, thereby avoiding severely unstable training. As binarization is introduced into training, the weights are gradually redistributed around -1/+1 in the later stages of training. Therefore, we can slowly increase the derivative value and approximate the standard sign function to avoid gradient mismatch. The visualized results show that our DTE approximation for backward propagation is consistent with the real data distribution, which is critical to improving the accuracy of networks. Moreover, compared with the existing EDE, the improvement of DTE lies in its ability to maintain the network's updating capability throughout the training process, especially in the later stages of training. As shown in Fig. 7, at epoch 400, the derivative of EDE is almost the same as that of the sign function and only a small fraction of weights can be updated, while the proposed DTE ensures that at least a fixed fraction (10% by default) of weights can be continuously updated.
5.1.3 Ablation Performance
We further evaluate the performance of different parts of DIR-Net using the ResNet-20 architecture on the CIFAR-10 dataset, which helps explain how DIR-Net works in practice. Table 4 presents the accuracy of networks with different settings. As Table 4 shows, both IMB and DTE improve the accuracy. For IMB, both the weight standardization and the shift-based scalars play an important role. The accuracy of the BNN with weight standardization and with bit-shift scalars is 0.5% and 0.8% higher than that of the naive BNN, respectively, while the full IMB version achieves a 1.1% gain. For DTE, the key to improving BNN performance is the cooperation of its two stages. The BNN that applies only stage 1 of DTE performs even worse than the naive BNN. The results of BNNs applying only stage 2 show that an estimator that always retains a certain updating capability is more effective than one that does not control the lower limit. This observation supports the motivation of DTE, namely ensuring a minimum updating capability of the estimator during the training process. Using DTE brings a 1.7% gain to the BNN. Moreover, the improvements from IMB and DTE can be superimposed, so we can train highly accurate binary neural networks using both of our methods.
|Method||Bit-width (W/A)||Acc. (%)|
|IMB (w/o weight standardization)||1/1||84.3|
|IMB (w/o shift-based scalars)||1/1||84.6|
|DTE (stage2, )||1/1||84.9|
|DTE (stage2, )||1/1||85.1|
|DIR-Net (IMB & DTE)||1/1||86.8|
5.2 Comparison with SOTA methods
We have performed a complete evaluation of the DIR-Net by comparing it with the existing SOTA methods.
Table 5 lists the performance of different methods on the CIFAR-10 dataset, and we compare our DIR-Net with these methods on various widely used architectures, such as ResNet-18 ResNet-18-project, ResNet-20 ResNet-20-project, and VGG-Small. We show the comparison with results of RAD Regularize-act-distribution and IR-Net IRNet over ResNet-18, DoReFa-Net dorefa, LQ-Net LQ-Net, DSQ DSQ and IR-Net IRNet over ResNet-20, BNN hubara2016binarized, LAB Loss-Aware-BNN, RAD Regularize-act-distribution, XNOR-Net xnornet and IR-Net IRNet over VGG-Small.
In all cases in the table, our proposed DIR-Net achieves the highest accuracy. Moreover, with the ResNet architectures, DIR-Net improves significantly over the existing SOTA methods when using 1-bit weights and 1-bit activations (1W/1A). For example, with the 1W/1A bit-width setting, the accuracy of our method is 2.6% higher than that of DSQ on ResNet-20, and the gap to the full-precision counterpart is reduced to 5.0%. Compared with IR-Net, our DIR-Net performs better since it further ensures that enough parameters can be updated during the training process, and it generally outperforms IR-Net by 0.3% across different backbones and bit-width settings. Moreover, we use five different random seeds for each set of experiments on the CIFAR-10 dataset and record the mean and standard deviation of the results. Across all network architectures, the standard deviation of the results over different random seeds is less than 0.13%, and it is as low as 0.09% on ResNet-18 and VGG-Small, which is much lower than the improvement over IR-Net on these architectures (at least 0.3%). The results show that the improvement of our DIR-Net is robust and holds stably under various settings.
We study the performance of DIR-Net over ResNet-18, ResNet-34, MobileNetV1, DARTS, and EfficientNet-B0 structures on the large-scale ImageNet dataset. Table 6 lists the comparison with several SOTA quantization methods, including BWN xnornet, HWGQ DBLP:journals/corr/abs-1708-08687, TWN DBLP:journals/corr/LiL16, LQ-Net LQ-Net, DoReFa-Net dorefa, ABC-Net ABCNet, Bi-Real Liu_2018_ECCV, XNOR++ XNOR++, BWHN DBLP:journals/corr/abs-1802-02733, SQ-BWN and SQ-TWN Dong2017Learning, PCNN DBLP:journals/corr/abs-1811-12755, BONN gu2019bayesian, Si-BNN wang2020sparsity, Real-to-Bin martinez2020training, MeliusNet bethge2021meliusnet, and ReActNet liu2020reactnet.
As shown in Table 6, when only the weights are quantized over ResNet-18 with 1-bit weights, DIR-Net greatly exceeds most other methods, and even outperforms TWN with 2-bit weights by a notable 4.8%. Meanwhile, DIR-Net outperforms IR-Net by 0.3% on Top-1 accuracy and 0.4% on Top-5 accuracy based on the ResNet-34 architecture using the 1W/32A setting. Moreover, under the 1W/1A setting, our DIR-Net also surpasses the SOTA binarization methods. The Top-1 accuracy of our DIR-Net is higher than that of ReActNet (66.1% vs. 65.9% for ResNet-18) and Si-BNN (67.5% vs. 63.3% for ResNet-34). The results show that our DIR-Net is more competitive than the existing binarization methods.
We further implement our DIR-Net on more compact CNN structures, including DARTS, EfficientNet, and MobileNet, and compare it with other SOTA binarization methods. The results in Table 6 show that our DIR-Net outperforms both the vanilla BNN and Bi-Real Net over the DARTS and EfficientNet-B0 structures without any additional computational overhead or training steps. Under the 1W/1A setting, our DIR-Net surpasses Bi-Real by 0.7% on Top-1 accuracy and 0.8% on Top-5 accuracy over the DARTS structure, and by nearly 1% on Top-1 accuracy and 0.4% on Top-5 accuracy over the EfficientNet-B0 structure. In both cases, DIR-Net outperforms BNN by a clear margin. As for the MobileNet structure, DIR-Net also performs well and surpasses the SOTA methods. Under the 1W/1A setting, our DIR-Net loses only 2.4% of the Top-1 accuracy compared with the full-precision counterpart, which is much better than other binarization methods. Experiments on these compact networks show that our binarization scheme is versatile and competitive across various structures.
|DIR-Net² (ours)||1/1||66.1||86.4|
|DIR-Net¹ (ours)||1/32||66.6||86.8|
|DIR-Net² (ours)||1/1||67.5||88.2|
|DIR-Net¹ (ours)||1/32||70.7||89.9|
|DIR-Net² (ours)||1/1||69.7||88.5|
¹ Results of networks with normal structure.
² Results of networks with ReActNet structure liu2020reactnet.
5.3 Deployment Efficiency on Raspberry Pi 3B
To further evaluate the efficiency of our proposed DIR-Net when deployed on real-world mobile devices, we implement DIR-Net on a Raspberry Pi 3B, which has a 1.2 GHz 64-bit quad-core ARM Cortex-A53 processor, and test its running speed in practice. We use the SIMD instruction SSHL on ARM NEON to make the inference framework daBNN zhang2019dabnn compatible with DIR-Net.
|DIR-Net (w/o scalars)||1/1||4.20||252.16|
We must point out that, so far, very few studies have reported the inference speed of their models deployed on real-world devices, which is one of the most important criteria for evaluating quantized models, especially when using 1-bit binarization. As shown in Table 7, we compare DIR-Net with existing high-performance inference implementations, including NCNN ncnn and DSQ DSQ. The inference speed of DIR-Net is much faster than that of the others, since all floating-point operations in convolutional layers are replaced by bitwise operations such as XNOR, Bitcount, and Bit-shift. The model size of DIR-Net is also greatly reduced, and the shift-based scalars in DIR-Net bring almost no extra time consumption or memory footprint compared with the vanilla binarization method without scalars.
In this paper, we introduce a novel DIR-Net that retains information during the forward and backward propagation of binary neural networks. DIR-Net is mainly composed of two practical techniques: IMB, which ensures diversity in the forward propagation, and DTE, which reduces gradient errors in the backward propagation. From the perspective of information entropy, IMB performs a simple but effective transformation on weights, which minimizes the information loss of both weights and activations at the same time, with no additional operations on activations. In this way, we can maintain the diversity of binary neural networks as much as possible without compromising efficiency. Additionally, the well-designed gradient estimator DTE reduces the information errors of gradients in the backward propagation. Because of its powerful updating capability and accurate gradients, the performance of DTE exceeds that of STE by a large margin. Our extensive experiments show that DIR-Net consistently outperforms the existing SOTA BNNs.
This work was supported by the National Natural Science Foundation of China (62022009, 61872021), Beijing Nova Program of Science and Technology (Z191100001119050), and State Key Lab of Software Development Environment (SKLSDE-2020ZX-06).