PyTorch implementation of our paper under review -- ReCU: Reviving the Dead Weights in Binary Neural Networks http://arxiv.org/abs/2103.12369
Binary neural networks (BNNs) have received increasing attention due to their substantial reductions in computation and memory. Most existing works focus on either lessening the quantization error by minimizing the gap between the full-precision weights and their binarization or designing a gradient approximation to mitigate the gradient mismatch, while leaving the "dead weights" untouched. This leads to slow convergence when training BNNs. In this paper, for the first time, we explore the influence of "dead weights", which refer to a group of weights that are barely updated during the training of BNNs, and then introduce the rectified clamp unit (ReCU) to revive the "dead weights" for updating. We prove that reviving the "dead weights" by ReCU can result in a smaller quantization error. Besides, we also take into account the information entropy of the weights, and then mathematically analyze why weight standardization can benefit BNNs. We demonstrate the inherent contradiction between minimizing the quantization error and maximizing the information entropy, and then propose an adaptive exponential scheduler to identify the range of the "dead weights". By considering the "dead weights", our method offers not only faster BNN training, but also state-of-the-art performance on CIFAR-10 and ImageNet, compared with recent methods. Code is available at https://github.com/z-hXu/ReCU.
Deep Neural Networks (DNNs) have shown tremendous success and advanced many visual tasks [Krizhevsky2017ImageNetCW, Redmon2016YouOL, Everingham2009ThePV, simonyan2014very]. Nevertheless, this comes at the price of massive memory usage and computational burden, which poses a great challenge to the resource-constrained cutting-edge devices such as mobile phones and embedded devices. The community has proposed various approaches to solve this problem. Typical techniques include, but are not limited to, efficient architecture design [Han2020GhostNetMF, howard2017mobilenets, Ma2018ShuffleNetVP], knowledge distillation [Hinton2015DistillingTK, Kim2018ParaphrasingCN, romero2014fitnets], network pruning [Ding2019GlobalSM, Lin2017RuntimeNP, Liu2019MetaPruningML], and network quantization [Zhuang2018TowardsEL, Yang2020SearchingFL, Zhou2016DoReFaNetTL].
Among them, by converting the full-precision parameters and activations into low-bit forms, network quantization has offered a promising solution to yield a light and efficient version of DNNs [Jacob2018Quantization, Gong2019DifferentiableSQ, Zhang2018LQNetsLQ, Yang2019QuantizationN]. In the extreme case of a 1-bit representation, a binary neural network (BNN) restricts the weights and activations to only two possible values, i.e., -1 and +1. In comparison with the original networks, BNNs show overwhelming superiority in reducing the model complexity, with around 32× parameter compression and 58× speedup using the efficient XNOR and bitcount operations [Bulat2019XNORNetIB].
Despite the superiority of BNNs in memory saving and computation reduction, they suffer a drastic drop in accuracy compared with their real-valued counterparts [Rastegari2016XNORNetIC, courbariaux2016binarized, Courbariaux2015BinaryConnectTD], which greatly limits their practical deployment. There are two main reasons for the performance degradation: large quantization error in the forward propagation and gradient mismatch during backpropagation.
Specifically, quantization error refers to the residual between the full-precision weight vector and its binarization [Rastegari2016XNORNetIC, Lin2020RotatedBN], as illustrated in Fig. 1(a). The representational ability of BNNs is limited to the vertices of a unit square. In contrast, the full-precision weights possess an almost unlimited representation space. Such a representation gap easily results in a large accumulated error when mapping the real-valued weights into the binary space. To solve this, existing approaches try to lessen the quantization error by introducing a scaling factor to reduce the norm difference [Bulat2019XNORNetIB, Bulat2019ImprovedTO], or devising a rotation matrix to align the angle bias [Lin2020RotatedBN]. Gradient mismatch comes from the disagreement between the presumed and actual gradient functions [lin2016overcoming], as illustrated in the pink area of Fig. 1(b). Since the quantization function in the forward propagation of BNNs has zero gradient almost everywhere, an approximate gradient function is required to enable the network to update. A typical example is the straight-through estimator (STE) [Bengio2013EstimatingOP], which however leads to inaccurate optimization directions and thus hurts the stability of network training, especially at low bit-widths [Alizadeh2019AnES, Bulat2020BATSBA]. To mitigate this, a large collection of works has been proposed, typically by adjusting the network structures [liu2018bi, liu2020reactnet, Chen2020BinarizedNA], or using gradient functions that gradually approach zero [Gong2019DifferentiableSQ, Yang2019QuantizationN, Lin2020RotatedBN].
In this paper, we present a novel perspective to improve the effectiveness and training efficiency of BNNs. Inspired by [Helwegen2019LatentWD], we note that the latent weights, which refer to the real-valued weights used during backpropagation, play an important role in binarizing DNNs. We explore the real-valued weights of a given DNN and find that the weights falling into the two tails of the distribution, as shown in Fig. 2, are barely updated during the training of BNNs. We call them “dead weights” and find that they harm the optimization and slow down the training convergence of BNNs. To solve this problem, we develop a rectified clamp unit (ReCU), which aims to revive the “dead weights” by moving them towards the distribution peak in order to increase the probability of updating these weights. Through a rigorous analysis, we demonstrate that the quantization error after applying ReCU is a convex function, and thus can be further reduced. Instead of simply minimizing the quantization error, we consider the information entropy of the weights to increase the weight diversity of BNNs. For the first time, a systematic analysis is derived to explain why the weight standardization [Qin2020ForwardAB] can boost the performance of BNNs, and then a generalized weight standardization is proposed to further increase the information entropy. Combining the information entropy and the quantization error, we reveal the inherent contradiction between maximizing the former and minimizing the latter, and then propose an adaptive exponential scheduler to identify the range of the “dead weights” and balance the information entropy of the weights and the quantization error.
We conduct extensive experiments for binarizing networks including ResNet-18/20 [He2016DeepRL] and VGG-small [Zhang2018LQNetsLQ] on CIFAR-10 [krizhevsky2009learning], and ResNet-18/34 [He2016DeepRL] on ImageNet [russakovsky2015imagenet]. The experimental results show that ReCU achieves state-of-the-art performance, as well as faster training convergence even with the simple STE [Bengio2013EstimatingOP] as our weight gradient approximation.
To sum up, this paper makes the following contributions:
We explore the influence of “dead weights” showing that they can adversely affect the optimization of BNNs. To the best of our knowledge, this is the first work to analyze the “dead weights” in BNNs.
We introduce a rectified clamp unit (ReCU) to revive the “dead weights” and then provide a rigorous mathematical proof that a smaller quantization error can be derived using our ReCU.
A mathematical analysis on why the weight standardization helps boost BNNs is provided, and the inherent contradiction between minimizing the quantization error and maximizing the information entropy in BNNs is revealed.
Extensive experiments demonstrate that ReCU not only leads to better performance than many state-of-the-art methods [Ding2019RegularizingAD, Qin2020ForwardAB, Lin2020RotatedBN, Zhou2016DoReFaNetTL, Gong2019DifferentiableSQ, Yang2020SearchingFL, Rastegari2016XNORNetIC, Chen2020BinarizedNA, Wan2018TBNCN, liu2018bi, Gu2019ProjectionCN, Gu2019BayesianO1, Cai2017DeepLW, Lin2017TowardsAB], but also results in faster training convergence.
As a pioneering work, Courbariaux et al. [courbariaux2016binarized] binarize both weights and activations with the sign function. To overcome the almost-everywhere zero gradient of the sign function, they adopt the STE [Bengio2013EstimatingOP] as an approximation to enable the gradient to back-propagate. However, the representational ability of BNNs is very limited in a binary space, leading to a significant drop in accuracy. To mitigate the accuracy gap between a BNN and its full-precision counterpart, XNOR-Net [Rastegari2016XNORNetIC] introduces a scaling factor, which is obtained through the $\ell_1$-norm of the weights or activations, to reduce the quantization error. XNOR-Net++ [Bulat2019XNORNetIB] fuses the two scaling factors for quantized weights and activations into one parameter, and makes it learnable via the standard backpropagation. The rotated binary neural network (RBNN) [Lin2020RotatedBN] takes into account the influence of the angular bias between the binarized weight vector and its full-precision version, and then devises a bi-rotation scheme with two rotation matrices for angle alignment, which reduces the quantization error.
Other works propose to boost the performance of BNNs by devising new gradient estimation functions or designing quantization-friendly network architectures. For example, [Gong2019DifferentiableSQ, Yang2019QuantizationN, Lin2020RotatedBN] design a continuous activation gradient that gradually approximates the sign function so as to replace the conventional STE [Bengio2013EstimatingOP]. Qin et al. [Qin2020ForwardAB] proposed an error decay estimator to minimize the information loss of gradients during backpropagation. ABC-Net [Lin2017TowardsAB] utilizes more binary bases for weights and activations to strengthen the model performance. ReActNet [liu2020reactnet] constructs a strong baseline by adding parameter-free shortcuts on top of MobileNetV1 [howard2017mobilenets] and achieves 69.4% top-1 accuracy on ILSVRC-2012. Leng et al. [Leng2018ExtremelyLB] modelled the BNN learning as a discretization-constrained optimization problem solved by the ADMM optimizer, so as to avoid the non-differentiable quantization. In [Yang2020SearchingFL], an auxiliary probability matrix is made to search for the discrete quantized weights, implemented in a differentiable manner.
In this section, we briefly review the optimization of BNNs. Given a DNN, for ease of representation, we simply denote its per-layer real-valued weights as $\mathbf{w}^r$ and the inputs as $\mathbf{a}^r$. Then, the convolutional result can be expressed as

$$\mathbf{y} = \mathbf{w}^r \otimes \mathbf{a}^r, \tag{1}$$

where $\otimes$ represents the standard convolution. For simplicity, we omit the non-linear operations in this paper.

BNN aims to binarize each weight $w^r \in \mathbf{w}^r$ and each activation $a^r \in \mathbf{a}^r$ to $\{-1, +1\}$. Following XNOR-Net [Rastegari2016XNORNetIC], the binarization can be achieved by the sign function,

$$x^b = \operatorname{sign}(x^r) = \begin{cases} +1, & x^r \ge 0, \\ -1, & \text{otherwise}. \end{cases} \tag{2}$$

To mitigate the large quantization error in binarizing a DNN, XNOR-Net [Rastegari2016XNORNetIC] further introduces two scaling factors for the weights and activations, respectively. In this paper, following [Bulat2019XNORNetIB], we simplify these two scaling factors as one parameter, denoted as $\alpha$. Then, the binary convolution operation can be formulated as

$$\mathbf{y} \approx (\mathbf{w}^b \circledast \mathbf{a}^b) \odot \alpha, \tag{3}$$

where $\circledast$ represents the bit-wise operations including XNOR and POPCOUNT, and $\odot$ denotes the element-wise multiplication. Then, the quantization error in a BNN is defined as

$$\text{QE} = \int_{-\infty}^{+\infty} f(w^r)\,\big(w^r - \alpha\,\operatorname{sign}(w^r)\big)^2\, dw^r, \tag{4}$$

where $f(w^r)$ is the probability density function of $w^r$.
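To make the bit-wise operation in the binary convolution concrete, the following Python sketch computes the dot product of two ±1 vectors using XNOR and a population count. The helper names `pack_signs` and `binary_dot` and the bit layout (+1 stored as bit 1) are illustrative assumptions and are not taken from the ReCU code base.

```python
def pack_signs(signs):
    """Pack a list of +1/-1 values into an integer, one bit per value (+1 -> 1)."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits


def binary_dot(w_bits, a_bits, n):
    """Dot product of two packed +/-1 vectors of length n via XNOR + popcount."""
    mask = (1 << n) - 1
    # XNOR is 1 exactly where the two signs agree.
    agree = bin(~(w_bits ^ a_bits) & mask).count("1")
    # Each agreement contributes +1, each disagreement -1.
    return 2 * agree - n
```

For vectors of length n, the result equals the ordinary dot product of the two sign vectors, which is why a binary convolution can replace multiply-accumulate operations with bit-wise ones.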
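To train a BNN, the forward convolution is achieved using the $\mathbf{w}^b$ and $\mathbf{a}^b$ binarized by Eq. (2), while the real-valued $\mathbf{w}^r$ and $\mathbf{a}^r$ are updated during backpropagation. However, the gradient of the sign function is zero-valued almost everywhere, which is not suitable for optimization. Instead, we use the simple STE [Bengio2013EstimatingOP] in this paper to compute the approximate gradient of the loss w.r.t. $\mathbf{w}^r$, as

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}^r} = \frac{\partial \mathcal{L}}{\partial \mathbf{w}^b} \cdot \frac{\partial \mathbf{w}^b}{\partial \mathbf{w}^r} \approx \frac{\partial \mathcal{L}}{\partial \mathbf{w}^b}, \tag{5}$$

where $\mathcal{L}$ is the loss function.

As for the gradient w.r.t. the activations, we consider the piece-wise polynomial function [liu2018bi] as follows

$$\frac{\partial \mathcal{L}}{\partial \mathbf{a}^r} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^b} \cdot \frac{\partial F(\mathbf{a}^r)}{\partial \mathbf{a}^r}, \tag{6}$$

where

$$\frac{\partial F(\mathbf{a}^r)}{\partial \mathbf{a}^r} = \begin{cases} 2 + 2\mathbf{a}^r, & -1 \le \mathbf{a}^r < 0, \\ 2 - 2\mathbf{a}^r, & 0 \le \mathbf{a}^r < 1, \\ 0, & \text{otherwise}. \end{cases} \tag{7}$$

From now on, we drop the superscript “$r$” for real-valued weights for simplicity.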
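The forward binarization and the two gradient approximations reviewed in this section can be sketched on scalars as follows. This is a minimal illustration in plain Python rather than PyTorch autograd. The function names are hypothetical, and mapping zero to +1 is an assumed convention.

```python
def sign(x):
    """Binarize a real value to +1 or -1 (zero is mapped to +1 by assumption)."""
    return 1.0 if x >= 0 else -1.0


def ste_weight_grad(grad_wrt_binary):
    """Straight-through estimator: pass the upstream gradient through unchanged."""
    return grad_wrt_binary


def polynomial_sign_grad(a):
    """Derivative of the piece-wise polynomial surrogate of sign() from Bi-Real Net."""
    if -1.0 <= a < 0.0:
        return 2.0 + 2.0 * a
    if 0.0 <= a < 1.0:
        return 2.0 - 2.0 * a
    return 0.0


def activation_grad(grad_wrt_binary, a):
    """Chain rule for activations: upstream gradient times the surrogate derivative."""
    return grad_wrt_binary * polynomial_sign_grad(a)
```

The surrogate derivative is largest near zero and vanishes outside [-1, 1), so activations far from the origin receive no gradient, whereas the STE passes the weight gradient through regardless of magnitude.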
As pointed out in [Zhong2020TowardsLB, banner2019post], the latent weights of a quantized network roughly follow the zero-mean Laplace distribution due to their quantization in the forward propagation. As can be seen from Fig. 2, most weights are gathered around the distribution peak (origin point), while many outliers fall into the two tails, far away from the peak.
We argue that these outliers adversely affect the training of a BNN and might be the potential reason for the slow convergence when training BNNs. Specifically, in real-valued networks, weights of different magnitudes make different contributions to the network performance; in other words, it is how far each weight is from the origin that matters. In BNNs, however, there is not much distinction between weights of different magnitudes if they have the same sign since only the signs are kept in the forward inference regardless of their magnitudes. Therefore, from the perspective of optimization, though the magnitudes of the weights are updated during the backpropagation by gradient descent, the chances of changing their signs are unequal. Intuitively, the signs of the weights around the distribution peak are easily changed, while it is the opposite for the outliers in the tails, which greatly limits the representational ability of BNNs and thus causes slow convergence in training. For this reason, we call these outliers “dead weights” in BNNs.
To solve this problem, in Sec. 4.2, we introduce our rectified clamp unit to revive these “dead weights” along with a rigorous proof that our clamp function leads to a smaller quantization error. In Sec. 4.3, we analyze why the weight standardization can boost the performance of BNNs, and reveal the inherent contradiction between minimizing the quantization error and maximizing the information entropy of the weights. Correspondingly, in Sec. 5, we introduce an adaptive exponential scheduler to identify the range of the “dead weights” in order to seek a balance between the quantization error and the information entropy.
To solve the aforementioned problem, we propose ReCU, which aims to move the “dead weights” towards the distribution peak to increase the probability of changing their signs. Specifically, for each real-valued weight $w \in \mathbf{w}$, ReCU is formulated as

$$\text{ReCU}(w) = \max\big(\min(w, Q_{(\tau)}),\, Q_{(1-\tau)}\big), \tag{8}$$

where $Q_{(\tau)}$ and $Q_{(1-\tau)}$ respectively denote the $\tau$ quantile and $1-\tau$ quantile [zwillinger2002crc] of $\mathbf{w}$. With $0.5 < \tau \le 1$, ReCU relocates $w$ to $Q_{(1-\tau)}$ if it is smaller than $Q_{(1-\tau)}$, and to $Q_{(\tau)}$ if it is larger than $Q_{(\tau)}$. In this way, the “dead weights” are revived. To measure the contribution of ReCU in the quantization process, we take into account the quantization error for analysis. In what follows, we show that the weights after applying ReCU can yield a smaller quantization error.
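The clamp described above can be sketched in a few lines of plain Python. The sketch assumes empirical quantiles with linear interpolation between order statistics, and the function names `quantile` and `recu` are illustrative. An implementation may instead use a closed-form quantile under a distributional assumption.

```python
def quantile(values, q):
    """Empirical q-quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def recu(weights, tau):
    """Clamp weights to [Q(1 - tau), Q(tau)], pulling tail weights toward the peak."""
    if not 0.5 < tau <= 1.0:
        raise ValueError("tau must lie in (0.5, 1]")
    upper = quantile(weights, tau)
    lower = quantile(weights, 1.0 - tau)
    return [max(min(w, upper), lower) for w in weights]
```

With `tau = 1.0` the bounds are the minimum and maximum of the weights, so nothing is changed. Smaller values of `tau` move more of the tail weights onto the two bounds.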
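Earlier works [Zhong2020TowardsLB, banner2019post] have shown that the latent weights roughly follow the zero-mean Laplace distribution, i.e., $w \sim La(0, b)$, which implies $f(w) = \frac{1}{2b}\exp\big(-\frac{|w|}{b}\big)$. Thus, we have

$$\int_{-\infty}^{Q_{(\tau)}} \frac{1}{2b}\exp\Big(-\frac{|w|}{b}\Big)\, dw = \tau, \tag{9}$$

which results in

$$Q_{(\tau)} = -b\ln(2 - 2\tau), \qquad Q_{(1-\tau)} = b\ln(2 - 2\tau). \tag{10}$$

However, it is difficult to know the exact value of $b$. Luckily, we can obtain its approximation via the maximum likelihood estimation, represented as

$$\hat{b} = \text{Mean}(|\mathbf{w}|), \tag{11}$$

where $\text{Mean}(|\cdot|)$ returns the mean of the absolute values of the inputs. Thus, $Q_{(\tau)}$ is a function of $\tau$.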
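Under the zero-mean Laplace assumption, the upper quantile has the closed form Q(τ) = −b ln(2 − 2τ) for 0.5 ≤ τ < 1, and the scale b can be estimated by the mean absolute value of the weights. The Python sketch below illustrates both steps. The function names are hypothetical.

```python
import math


def laplace_scale_mle(weights):
    """Maximum likelihood estimate of the Laplace scale b: the mean of |w|."""
    return sum(abs(w) for w in weights) / len(weights)


def laplace_quantile(tau, b):
    """Closed-form tau-quantile of a zero-mean Laplace distribution, 0.5 <= tau < 1."""
    if not 0.5 <= tau < 1.0:
        raise ValueError("tau must lie in [0.5, 1)")
    return -b * math.log(2.0 - 2.0 * tau)


def laplace_cdf(x, b):
    """CDF of a zero-mean Laplace distribution, used here only as a sanity check."""
    if x < 0:
        return 0.5 * math.exp(x / b)
    return 1.0 - 0.5 * math.exp(-x / b)
```

By symmetry, the lower bound is simply the negative of the upper one, so a single logarithm per layer is enough to obtain both clamp thresholds once `b` has been estimated.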
After applying ReCU to , the generalized probability density function of can be written as follows
To obtain the quantization error, we first compute the scaling factor in Eq. (3) using the Riemann-Stieltjes integral as
Similarly, setting $b = \hat{b}$ makes the quantization error a function of $\tau$. From Eq. (14), we have two observations: (1) As plotted in Fig. 3, is a convex function when and reaches the minimum when (a rigorous proof is provided in the Appendix). (2) When $\tau = 1$, Eq. (14) degenerates to the normal quantization error as defined in Eq. (4) where ReCU is not introduced.
However, we cannot keep to pursue the least quantization error. In the next subsection, we analyze another important factor to the network performance, i.e., information entropy, which requires to support good performance, and reveal the contradiction between minimizing the quantization error and maximizing the information entropy. Overall, we have the following inequality
That is, ReCU provides a smaller quantization error than when .
The information entropy of a random variable is the average level of uncertainty in the variable’s possible outcomes, which is also used as a quantitative measure to reflect the weight diversity in BNNs [Lin2020RotatedBN, Qin2020ForwardAB, Raj2020UnderstandingLD, Liu2019CirculantBC]. Usually, the more diverse the weights, the better the performance of a BNN. Given a probability density function $f(x)$ on domain $\mathcal{X}$, the information entropy is defined as

$$H(f) = -\int_{\mathcal{X}} f(x)\log f(x)\, dx. \tag{16}$$
Accordingly, the information entropy of after applying ReCU can be computed by
which is a function of by substituting in Eq. (11) for .
For ease of the following analysis, we visualize the information entropy w.r.t. varying values of and in Fig. 4. Then, we have
Case 1: . In this situation, the information entropy is fixed to (the dotted white line in Fig. 4).
Case 2: . is a monotonically decreasing function of .
Case 3: . is a monotonically increasing function of .
Recall that $b$ is estimated by the mean of the absolute values of $\mathbf{w}$ in Eq. (11). We have experimentally observed that in the cases of , the information entropy () is too small to enable good performance (see Sec. 5.2.2). Thus, a larger $b$ should be derived to overcome this problem. However, in practice, the weights gradually become sparse during network training due to the widely-used $\ell_2$-norm regularization in modern neural networks, making the information entropy uncontrollable, which inevitably brings a loss of diversity.
Thus, it is necessary to maintain $b$ at a relatively high value (Case 3) in a controllable manner to retain the information entropy. The previous work [Qin2020ForwardAB] maximizes the information entropy by centralizing and standardizing the weights in each forward propagation as follows

$$\mathbf{w}' = \frac{\mathbf{w} - \mu(\mathbf{w})}{\sigma(\mathbf{w})}, \tag{18}$$

where $\mu(\cdot)$ and $\sigma(\cdot)$ denote the mean and the standard deviation, respectively.
However, we experimentally find that it is the standardization, but not the centralization, that contributes to the performance improvement. The reason comes from the fact that $\mu(\mathbf{w}) \approx 0$ in most cases [He2019SimultaneouslyOW, Qin2020ForwardAB]. This motivates us to generalize Eq. (18) by simply standardizing the weights as

$$\mathbf{w}' = \frac{\mathbf{w}}{\gamma}, \tag{19}$$

where $\gamma$ is a given constant. Then, the mean of the absolute values of the weights after standardization is

$$\hat{b}' = \text{Mean}(|\mathbf{w}'|) = \frac{\text{Mean}(|\mathbf{w}|)}{\gamma} = \frac{\hat{b}}{\gamma}. \tag{20}$$

It is easy to see that $\sigma(\mathbf{w}) = \sqrt{2}\,b$ due to the Laplace distribution. Therefore, by setting $\gamma = \sigma(\mathbf{w})$ as in [Qin2020ForwardAB], $\hat{b}'$ becomes

$$\hat{b}' = \frac{\hat{b}}{\sqrt{2}\,b} \approx \frac{\sqrt{2}}{2}, \tag{21}$$

which increases the information entropy and explains why dividing by the standard deviation can result in better performance when training BNNs [Qin2020ForwardAB]. To the best of our knowledge, this is the first time that a mathematical explanation is provided. Nevertheless, according to Fig. 4, the information entropy can be increased with a larger $b$. Thus, we further define $\gamma = \frac{\sigma(\mathbf{w})}{\sqrt{2}\,b^*}$ where $b^*$ is a pre-defined constant, and the standardized weights become

$$\mathbf{w}' = \sqrt{2}\,b^*\,\frac{\mathbf{w}}{\sigma(\mathbf{w})}. \tag{22}$$

It is easy to see that

$$\hat{b}' = \text{Mean}(|\mathbf{w}'|) \approx b^*. \tag{23}$$

The innovation behind this analysis lies in that our standardization transforms the uncontrolled information entropy to an adjustable one by manually setting $b^*$ based on the premise of , and therefore generalizes the information gain of Eq. (18) by [Qin2020ForwardAB]. Thus, by standardizing the weights using Eq. (22) before applying ReCU, the information entropy can be increased in the learning of a BNN.
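The effect of the generalized standardization can be checked numerically. For a Laplace distribution the standard deviation is √2 times the scale, so dividing the weights by their standard deviation and multiplying by √2 times a target constant should bring the mean absolute value close to that constant. The sketch below assumes Laplace-distributed weights sampled as the difference of two exponentials. The names `standardize`, `sample_laplace`, and `target_b` are illustrative.

```python
import math
import random
import statistics


def sample_laplace(n, b, seed=0):
    """Draw n samples from Laplace(0, b) as the difference of two exponentials."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b) for _ in range(n)]


def standardize(weights, target_b):
    """Rescale weights so that their Laplace scale is approximately target_b."""
    sigma = statistics.pstdev(weights)
    scale = math.sqrt(2.0) * target_b / sigma
    return [w * scale for w in weights]


def mean_abs(weights):
    """Mean of absolute values, i.e. the estimated Laplace scale."""
    return sum(abs(w) for w in weights) / len(weights)
```

Starting from small weights (for example a scale of 0.05), `mean_abs(standardize(ws, 2.0))` comes out close to 2 regardless of the original scale, which is what makes the information entropy adjustable through a single constant.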
Nevertheless, the information increase from enlarging $b$ is still very limited (see Fig. 4). In contrast, the increase of $\tau$ leads to more information gain, with an unexpected increase in quantization error when , as analyzed in Sec. 4.2. Thus, there exists an inherent contradiction between minimizing the quantization error and maximizing the information entropy in BNNs. To the best of our knowledge, we are the first to find this contradiction. In Sec. 5.2.1, we propose an exponential scheduler for adapting $\tau$ along the network training, so as to seek a balance between the information entropy and the quantization error.
Training Procedures. Given a DNN with its per-layer real-valued weights $\mathbf{w}$ and the inputs $\mathbf{a}$, in the forward propagation, we first standardize $\mathbf{w}$ and revive the “dead weights” using Eq. (22) and ReCU of Eq. (8), respectively. Then, we compute the scaling factor $\alpha$ using Eq. (13), and binarize the inputs and the revived weights using the sign function of Eq. (2). Finally, we complete the binary convolution using Eq. (3) for the forward propagation. During backpropagation, we derive the gradients w.r.t. $\mathbf{w}$ and $\mathbf{a}$ using Eq. (5) and Eq. (6), respectively, and update $\mathbf{w}$ using the stochastic gradient descent (SGD) described in Sec. 5.1.
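The weight side of the forward pass described above can be sketched on a flat list of weights. This is a simplified illustration under several assumptions: weights are treated as one flat layer, the clamp bounds use the closed-form Laplace quantile, and the scaling factor is taken as the mean absolute value of the revived weights as a stand-in for Eq. (13). The function name and the default `target_b=2.0` are hypothetical.

```python
import math
import statistics


def forward_binarize(weights, tau, target_b=2.0):
    """Standardize, clamp (ReCU), then return (scaling factor, sign vector)."""
    if not 0.5 < tau <= 1.0:
        raise ValueError("tau must lie in (0.5, 1]")
    # Step 1: standardization so that the Laplace scale is about target_b.
    sigma = statistics.pstdev(weights)
    w = [x * math.sqrt(2.0) * target_b / sigma for x in weights]
    # Step 2: ReCU with symmetric bounds from the Laplace quantile.
    if tau < 1.0:
        b_hat = sum(abs(x) for x in w) / len(w)
        q = -b_hat * math.log(2.0 - 2.0 * tau)
        w = [max(min(x, q), -q) for x in w]
    # Step 3: scaling factor (assumed: mean absolute value) and sign binarization.
    alpha = sum(abs(x) for x in w) / len(w)
    signs = [1.0 if x >= 0 else -1.0 for x in w]
    return alpha, signs
```

Neither step changes the sign of any weight, so the binary codes are unaffected in a single forward pass. The clamp only changes the latent magnitudes, which is what makes later sign flips during training more likely.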
In this section, we evaluate ReCU on the two widely-adopted CIFAR-10 [krizhevsky2009learning] and ILSVRC-2012 [russakovsky2015imagenet] datasets, and then compare it to several state-of-the-art methods [Qin2020ForwardAB, liu2018bi, Lin2020RotatedBN, Yang2020SearchingFL, Gong2019DifferentiableSQ].
Network Structures. On CIFAR-10, we evaluate ReCU with ResNet-18/20 [He2016DeepRL] and VGG-Small [Zhang2018LQNetsLQ]. Following the compared methods, we binarize all convolutional and fully-connected layers except the first and the last ones. For ResNet-18/20, we adopt the double skip connections as proposed in [liu2018bi] for fair comparison.
On ILSVRC-2012, we choose to binarize ResNet-18/34. Following [Bethge2019BackTS], the downsampling layers are not quantized. Similarly, the double skip connections [liu2018bi] are added.
Training Details. Our network is trained from scratch without depending on a pre-trained model. For all experiments, we use SGD for optimization with a momentum of 0.9 and the weight-decay is set to 5e-4. The initial learning rate is 0.1 and then adjusted by the cosine scheduler [Loshchilov2016SGDRSG]. We follow the data augmentation strategies in [He2016DeepRL] which include random crops and horizontal flips.
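For reference, a cosine learning-rate schedule of the kind cited above can be written in a few lines. The sketch assumes a single decay from the initial rate to zero over the whole run, without warm restarts or a minimum rate, since those details are not stated here. The function name `cosine_lr` is illustrative.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr=0.1):
    """Cosine-annealed learning rate: base_lr at epoch 0, zero at total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

The rate stays close to its initial value early in training, decreases fastest around the midpoint, and flattens out again toward the end.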
In this section, we discuss the hyperparameter settings of ReCU, including the pre-defined standardization constant of Eq. (22) and the threshold $\tau$. Recall that the former affects the information entropy, while $\tau$ affects both the quantization error and the information entropy. Each experiment is run three times and we report the mean top-1 accuracy (mean ± std) of ResNet-20 for parametric analyses.
In Sec. 4.2, we demonstrate that the quantization error with ReCU is a convex function of and becomes the minimum when , while the information entropy is a monotonically increasing function of if . Thus, a balance needs to be reached between the quantization error and the information entropy. To this end, following [Qin2020ForwardAB] (Eq. (18)), we set for our analyses.
We first consider setting to a fixed value for the whole training process. As shown in Tab. 1, when , the network reaches the best performance. It is worth noting that a significant drop in accuracy occurs when . This is understandable since it suffers both a large quantization error and small information entropy. Another observation is that ReCU does not obtain the best accuracy when . This is because though the quantization error reaches the minimum when as shown in Fig. 3, the small information entropy cannot support a good performance. In summary, when , we can seek a balance between the quantization error and the information entropy.
Despite the good performance obtained with a fixed value of $\tau$, we find that ReCU increases the variance of the performance when while keeping it stable when , as shown in Tab. 1. To solve this, we further propose an exponential scheduler for adapting $\tau$ along the network training. Our motivation lies in that $\tau$ should start with a value falling within to pursue a good accuracy, and then gradually go to the interval to stabilize the variance of performance. Based on this, given an initial threshold $\tau_s$ and an end threshold $\tau_e$, the value $\tau_i$ at the $i$-th training epoch is calculated as follows
where $I$ denotes the total number of training epochs.
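An exponential scheduler of this kind can be sketched as follows. The exact expression of Eq. (24) is not reproduced in this text, so the sketch assumes one plausible form: an exponential in the training progress, shifted and scaled so that it starts at the initial threshold and ends at the end threshold. The function name and the example values 0.85 and 0.99 are placeholders.

```python
import math


def tau_schedule(epoch, total_epochs, tau_start=0.85, tau_end=0.99):
    """Exponentially interpolate tau from tau_start (epoch 0) to tau_end (last epoch)."""
    # Assumed form: tau = A * exp(epoch / total_epochs) + B, with A and B chosen
    # so that the curve passes through both end points.
    a = (tau_end - tau_start) / (math.e - 1.0)
    b = tau_start - a
    return a * math.exp(epoch / total_epochs) + b
```

Under this form the threshold grows slowly at first and faster toward the end of training, so the clamp is tight early on and is gradually relaxed.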
Tab. 2 shows that ReCU obtains better performance of with and . Besides, it can well overcome the large variance with a fixed .
Tab. 3 displays the results w.r.t. different values of . We use Eq. (24) for adapting $\tau$ with and . The experiments are conducted under three settings for a comprehensive analysis, including training the BNN without our standardization, , and .
As can be observed, without our standardization, the binarized ResNet-20 shows a poor top-1 accuracy of . This is because, during network training, the $\ell_2$-norm regularization sparsifies the network parameters, which reduces the information entropy as discussed in Sec. 4.3.
With our standardization in hand, the information entropy can be manually controlled by adjusting . In Tab. 3, with a small , despite the better performance of , the improvement is very limited. As discussed in Sec. 4.3, the information entropy is still too small to enable good performance when .
We further set . As can be seen, the network reaches a maximum mean top-1 accuracy of 87.31% when , a significant improvement over the model without our standardization. We also observe that as continues to increase, the performance starts to remain stable, which supports our claim in Sec. 4.3 that the improvement from enlarging is limited.
As discussed in Sec. 4.1, the “dead weights” introduce an obstacle to the training convergence of BNNs. To verify the effectiveness of ReCU in overcoming this problem, we compare the convergence when training a BNN with and without ReCU. Fig. 5 shows that the curves equipped with ReCU exhibit significantly faster training convergence, especially on the validation set. This is because ReCU revives the “dead weights” with the $\tau$ quantile and $1-\tau$ quantile of the weights, making it easier for them to change their signs, which leads to faster training convergence.
To quantitatively evaluate the effectiveness of the proposed ReCU, we conduct extensive experiments on CIFAR-10 [krizhevsky2009learning] and ImageNet [russakovsky2015imagenet]. We also compare it with a number of state-of-the-art methods to demonstrate the advantages of ReCU in boosting the performance of BNNs. In the following experiments, we use Eq. (24) for adapting $\tau$ with and . Besides, the pre-defined standardization constant of Eq. (22) is set to 2.
For ResNet-18, we compare ReCU with RAD [Ding2019RegularizingAD], RBNN [Lin2020RotatedBN] and IR-Net [Qin2020ForwardAB]. For ResNet-20, the compared methods include SLB [Yang2020SearchingFL], DSQ [Gong2019DifferentiableSQ], DoReFa [Zhou2016DoReFaNetTL], and IR-Net [Qin2020ForwardAB]. For VGG-Small, we compare ReCU with XNOR-Net [Rastegari2016XNORNetIC], DoReFa [Zhou2016DoReFaNetTL], BNN [courbariaux2016binarized], SLB [Yang2020SearchingFL], IR-Net [Qin2020ForwardAB], RAD [Ding2019RegularizingAD], DSQ [Gong2019DifferentiableSQ], and RBNN [Lin2020RotatedBN]. The experimental results are shown in Tab. 4. As can be observed, ReCU shows the best performance on all the networks. Specifically, with ResNet-18, ReCU obtains a 0.6% performance increase over the recent RBNN. It also yields a 0.9% performance gain over IR-Net in binarizing ResNet-20. Lastly, it retains a top-1 accuracy of 92.2% when binarizing VGG-Small, which is better than the search-based SLB result of 92.0%.
| Network | Method | W/A | Top-1 | Top-5 |
|---|---|---|---|---|
| ResNet-18 | Bi-Real Net [liu2018bi] | 1/1 | 56.4% | 79.5% |
| ResNet-34 | Bi-Real Net [liu2018bi] | 1/1 | 62.2% | 83.9% |
Tab. 5 displays the performance comparison in binarizing ResNet-18/34. For ResNet-18, we compare ReCU with BNN [courbariaux2016binarized], Bi-Real Net [liu2018bi], XNOR-Net [Rastegari2016XNORNetIC], SLB [Yang2020SearchingFL], DoReFa [Zhou2016DoReFaNetTL], IR-Net [Qin2020ForwardAB], RBNN [Lin2020RotatedBN], PDNN [Gu2019ProjectionCN] and BONN [Gu2019BayesianO1]. For ResNet-34, ABC-Net [Lin2017TowardsAB], Bi-Real Net [liu2018bi], IR-Net [Qin2020ForwardAB] and RBNN [Lin2020RotatedBN] are compared. From Tab. 5, we can see that ReCU takes the leading place in both the top-1 and top-5 accuracies. Specifically, it obtains a better performance of 61.0% in top-1 and 82.6% in top-5 compared to RBNN’s 59.9% top-1 accuracy and 81.9% top-5 accuracy with ResNet-18. The advantage in ResNet-34 is even more obvious where ReCU obtains 2.0% and 1.4% performance improvements in top-1 and top-5 accuracy, respectively.
In this paper, we present a novel rectified clamp unit (ReCU) to revive the “dead weights” when training BNNs. We first analyze how the “dead weights” block the optimization of BNNs and slow down the training convergence. Then, ReCU is applied to increase the probability of changing the signs of the “dead weights” on the premise of a rigorous proof that ReCU can lead to a smaller quantization error. Besides, we analyze why the weight standardization can increase the information entropy of the weights, and thus benefit the BNN performance. The inherent contradiction between minimizing the quantization error and maximizing the information entropy is revealed for the first time. Correspondingly, an adaptive exponential scheduler is proposed to seek a balance between the quantization error and the information entropy. Experimental results demonstrate that, by reviving the “dead weights”, ReCU leads to not only faster training convergence but also state-of-the-art performance.
This work is supported by the National Science Fund for Distinguished Young Scholars (No.62025603), the National Natural Science Foundation of China (No.U1705262, No.62072386, No.62072387, No.62072389, No.62002305, No.61772443, No.61802324 and No.61702136) and Guangdong Basic and Applied Basic Research Foundation (No.2019B1515120049).
Convexity and Minimum of .
We first revisit the formulations for the quantile and scaling factor in the following
Recall that the quantization error under our framework is given as
According to Eq. (A2), the derivative of w.r.t. can be derived as
Note that is estimated via the maximum likelihood estimation as
which indicates . We can know that
Thus, the extreme value of is irrelevant to . Further, we yield the derivative of w.r.t. as
From Eq. (A6), it is easy to know that if . Therefore, is monotonically increasing when . By solving , we have . That means when , , while when , . That is to say, when , and . Thus is a convex function w.r.t. and reaches the minimum when , which completes the proof.