Exact Backpropagation in Binary Weighted Networks with Group Weight Transformations

07/03/2021 ∙ by Yaniv Shulman, et al. ∙ Aleph Zero Records

Quantization based model compression serves as a high performing and fast approach for inference that yields models which are highly compressed when compared to their full-precision floating point counterparts. The most extreme quantization is a 1-bit representation of parameters such that they have only two possible values, typically -1(0) or +1, enabling efficient implementation of the ubiquitous dot product using only additions. The main contribution of this work is the introduction of a method to smooth the combinatorial problem of determining a binary vector of weights to minimize the expected loss for a given objective by means of empirical risk minimization with backpropagation. This is achieved by approximating a multivariate binary state over the weights utilizing a deterministic and differentiable transformation of real-valued, continuous parameters. The proposed method adds little overhead in training, can be readily applied without any substantial modifications to the original architecture, does not introduce additional saturating nonlinearities or auxiliary losses, and does not prohibit applying other methods for binarizing the activations. Contrary to common assertions made in the literature, it is demonstrated that binary weighted networks can train well with the same standard optimization techniques and similar hyperparameter settings as their full-precision counterparts, specifically momentum SGD with large learning rates and L_2 regularization. To conclude, experiments demonstrate that the method performs remarkably well across a number of inductive image classification tasks with various architectures compared to their full-precision counterparts. The source code is publicly available at https://bitbucket.org/YanivShu/binary_weighted_networks_public.


1 Introduction

Contemporary artificial neural networks (ANN) have achieved state-of-the-art results in a multitude of learning tasks. Often these models include millions of parameters which form dense structures enabling efficient parallel computing by utilizing specialized software and hardware. However, the dependency of these models on substantial hardware resources limits their utility on resource constrained hardware such as mobile and low power embedded devices. One approach to reduce computational resources is model compression, which transforms an initial cumbersome architecture into a more efficient architecture that requires less space and compute resources while minimizing performance loss to an acceptable degree. Model compression is typically achieved by reducing the number of parameters in the model and/or by quantizing the parameters and activations so that they use fewer bits to encode the data flowing through the network.

There are many approaches suggested for reducing the number of parameters including weight pruning 10.5555/2969239.2969366 , architecture learning DBLP:conf/bmvc/SrinivasB16 , distilling knowledge 44873 , structured pruning NIPS2016_41bfd20a and regularization louizos2018learning ; shulman2020diffprune . The interested reader is referred to DBLP:conf/mlsys/BlalockOFG20 ; journals/corr/abs-1710-09282 for recent reviews.

Quantization based model compression serves as a high performing and fast approach for inference that yields highly compressed models compared to their full-precision floating point counterparts. The most extreme quantization is a 1-bit representation of parameters and activations such that they have only two possible values, typically -1(0) or +1. An ANN that is restricted to binary representations is typically known as a Binary Neural Network (BNN). Models that constrain the weights to binary values enable efficient implementation of the ubiquitous dot product using only additions without requiring floating point multiplications. Furthermore, networks which restrict both weights and activations to binary values enable significant computational acceleration in inference by utilizing highly efficient bitwise XNOR and Bitcount operations that can be further optimized in specialized hardware. Therefore these models are an attractive alternative to full-precision ANNs where power efficiency and constrained compute resources are important considerations Qin_2020 ; NIPS2015_3e15cc11 ; DBLP:conf/eccv/RastegariORF16 ; electronics8060661 . Another compelling approach is to binarize only the parts of the network that benefit the most from the quantization and keep other layers at higher precision. In fact most proposed BNNs use partial binarization, since typically at least the fully connected output layer and the first convolution layer weights are kept at a higher precision electronics8060661 ; Qin_2020 . Additional examples include retaining the parameters of the batch normalization layers at high precision 10.1007/978-3-030-01237-3_23 , applying a scaling factor to the binary weights DBLP:conf/eccv/RastegariORF16 ; DBLP:journals/corr/abs-1909-13863 ; Martinez2020Training ; SakrCWGS18 , or using floating point parametrized activations DBLP:journals/corr/abs-1904-05868 ; Martinez2020Training .

Many learning algorithms, and neural networks in particular, typically employ gradient-based optimizers such as the backpropagation algorithm Rumelhart:1986we . Models that are designed to have a continuous relationship between parameters and the training objective enable the computation of exact gradients, which in turn enables efficient optimization journals/corr/BengioLC13 . Many of the existing methods in the literature for ANN quantization, such as NIPS2016_d8330f85 ; Liu_2018_ECCV ; Cai_2017_CVPR ; DBLP:journals/corr/abs-1812-11800 ; DBLP:conf/cvpr/QinGLSWYS20 , employ non-differentiable quantization techniques that require the use of gradient estimators, resulting in divergence between the forward pass and backpropagation and therefore decreased training efficacy NIPS2017_1c303b0e . The challenge is then to combine discrete valued weights, for which the gradient is undefined, with the effective backpropagation method for training neural networks.

The main contribution of this work is the introduction of a method to smooth the combinatorial problem of finding a binary vector of weights to minimize the expected loss for a given objective by means of empirical risk minimization with backpropagation. This is achieved by approximating a multivariate binary state over the weights utilizing a deterministic and differentiable transformation of real-valued, continuous parameters. The proposed method adds little overhead in training, can be readily applied without any modifications to the original architecture, does not introduce additional saturating nonlinearities or auxiliary losses, and does not prohibit applying other methods for binarizing the activations. Contrary to common assertions made in the literature, it is demonstrated that binary weighted networks can train well with the same standard optimization techniques and similar hyperparameter settings as their full-precision counterparts, specifically momentum SGD with large learning rates and L_2 regularization Qin_2020 . To conclude, experiments demonstrate little change and in some cases even a modest gain in accuracy for a number of inductive image classification tasks compared to their full-precision counterparts. The source code is publicly available at https://bitbucket.org/YanivShu/binary_weighted_networks_public.

Note the term differentiable is used in this paper in the context of training neural networks, i.e. allowing a small number of points where the first order derivatives do not exist. A common example is the use of rectifiers in the calculation graph such as the ReLU activation NairH10 .

2 Proposed method

2.1 Binary group weight transformations

Let b ∈ {−1, +1}^n be the binary valued weights (parameters) of a hypothesis f(·; b), such as a binary weighted neural network, where n denotes the cardinality of b. Let D = {(x_i, y_i)}_{i=1}^{N} be a training set consisting of N i.i.d. instances. The empirical risk associated with the hypothesis is defined as:

R(b; D) = (1/N) Σ_{i=1}^{N} ℓ(f(x_i; b), y_i)    (1)
s.t.  b ∈ {−1, +1}^n    (2)

where b is constrained to take values in {−1, +1}^n and ℓ is a loss function that measures the discrepancy between the true value y_i and the predicted outcome f(x_i; b). The goal of the optimization problem is to find the b, given the hypothesis and data, for which the empirical risk is minimal.

Minimizing the objective (1) provably is a hard combinatorial problem with complexity exponential with respect to n. Alternative methods such as gradient based optimization cannot be readily used since (1) is not differentiable w.r.t. b. To overcome this challenge a deterministic differentiable relaxation of the hard binary constraints governing b is proposed that enables solving a surrogate minimization problem efficiently and deterministically using common gradient based optimizers. To enable efficient backpropagation during training, the hard constraint of the weights being exactly binary may be relaxed and replaced with a soft constraint of being approximately one or negative one. Let θ ∈ ℝ^n be a real valued vector and σ be a differentiable function from the real numbers to the range (−1, 1), e.g. the hyperbolic tangent σ(x) = tanh(x). Equations (3) - (7) define a deterministic and differentiable transformation w = T(θ) that maps vectors in ℝ^n to approximately binary vectors, i.e. each w_j lies within some small ε of either +1 or −1.

p_j = σ(θ_j),  j ∈ P    (3)
n_j = σ(θ_j),  j ∈ N    (4)
w_j = 1 + λ (p_j − μ_P),  j ∈ P    (5)
w_j = −1 + λ (n_j − μ_N),  j ∈ N    (6)
P = { j : θ_j ≥ 0 },  N = { j : θ_j < 0 }    (7)

where θ_j denotes the j-th element of θ; μ_P and μ_N are the means of p and n respectively; and λ > 0 is a small constant. The transformation defined by equations (3) - (7) conceptually comprises two partitions, P and N, such that by definition, under the assumption that σ is bounded in (−1, 1) and λ is small, w_j ≈ 1 for j ∈ P and w_j ≈ −1 for j ∈ N. The variance of both {w_j : j ∈ P} and {w_j : j ∈ N} is controlled by λ, and since the mean-centred terms are bounded, λ may be set as small as practically useful; therefore w is exactly binary in the limit λ → 0. Note the gradient of w w.r.t. θ is non-degenerate provided that |P| ≥ 2 and |N| ≥ 2, i.e. there are at least two members in each of P and N.
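
For illustration, a minimal TensorFlow sketch of a group transformation of this form is given below. It follows the reconstruction of equations (3) - (7) above (sign-based partitioning, per-group mean subtraction and a small scale λ); the function name, argument defaults and exact arithmetic are illustrative assumptions and not taken from the published source code.

```python
import tensorflow as tf


def group_binarize(theta, lam=1.0, use_sigma=True):
    """Approximately binary weights from a 1-D tensor of real parameters theta.

    Sketch of a group transformation of the kind described in equations (3)-(7)
    (as reconstructed): elements with theta >= 0 are mapped near +1, the rest
    near -1, and each group is centred by subtracting its own mean so that the
    deviations average to zero. Smaller lam gives a tighter approximation of
    {-1, +1}.
    """
    v = tf.tanh(theta) if use_sigma else theta          # optional squashing sigma
    pos = theta >= 0.0                                  # partition P; complement is N
    pos_f = tf.cast(pos, theta.dtype)
    neg_f = 1.0 - pos_f
    # Per-group means, guarded against an empty partition.
    mean_p = tf.reduce_sum(v * pos_f) / tf.maximum(tf.reduce_sum(pos_f), 1.0)
    mean_n = tf.reduce_sum(v * neg_f) / tf.maximum(tf.reduce_sum(neg_f), 1.0)
    w_pos = 1.0 + lam * (v - mean_p)                    # approximately +1 for j in P
    w_neg = -1.0 + lam * (v - mean_n)                   # approximately -1 for j in N
    return tf.where(pos, w_pos, w_neg)
```

Because every operation above is differentiable almost everywhere, gradients flow from the loss to θ through standard automatic differentiation, which is the point of the relaxation.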

Having defined the transformation T, reconsider the hypothesis and the associated empirical risk following reparameterization of b, given a partition of θ into G subsets θ^(1), …, θ^(G):

w = [ T(θ^(1)), …, T(θ^(G)) ]    (8)
R(θ; D) = (1/N) Σ_{i=1}^{N} ℓ(f(x_i; w), y_i)    (9)
θ^(g) ∈ ℝ^{n_g},  g = 1, …, G    (10)

The objectives in equations (1) and (9) are equivalent in the limit as λ → 0. However, for reasonably low values of λ the formulation in equations (8) - (10) can be used as a differentiable surrogate to the objective in equation (1) due to the replacement of the binary weights b with the smoothed approximately binary weights w. Subsequently this enables the use of gradient based optimizers to find an approximate solution to the original hard combinatorial problem with low quantization error.

2.2 Reduction of quantization error with regularization

The inclusion of σ in equations (3) and (4) enables theoretical bounds on the divergence of the binarized weights from +1 and −1 respectively. The inclusion of such nonlinearities is a common approach, and often the hyperbolic tangent is used for this purpose in training BNNs DBLP:journals/corr/abs-1904-05868 ; Martinez2020Training ; Gong_2019_ICCV ; lahoud2019selfbinarizing ; DBLP:conf/cvpr/QinGLSWYS20 , or the hard tanh and its variants NIPS2016_d8330f85 ; SakrCWGS18 . The inclusion of superfluous saturating nonlinearities changes the objective in a non-trivial way and slows training, as these typically have substantial areas of their domain where gradients are very small or practically zero. To mitigate these shortcomings it is proposed that in practice the function σ is removed and instead soft constraints are introduced on θ to encourage the parameters to not diverge from each other. This invalidates the theoretical guarantees about the variance of the positive and negative partitions of w as defined in equations (3) - (7), however it works well in practice and alleviates the need to introduce superfluous saturating nonlinearities. All results discussed in subsequent sections do not include any activations or saturating nonlinearities added to the original full-precision architectures in the forward or backward propagation; instead L_2 regularization is applied to θ to encourage the full-precision parameters to not diverge far from zero.
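
In code this amounts to calling the transformation without the squashing function and relying on decoupled weight decay over θ. A minimal sketch is shown below, reusing the group_binarize sketch from section 2.1; the use of TensorFlow Addons' SGDW optimizer and the specific values (taken from Table 2) are illustrative assumptions rather than the authors' exact training code.

```python
import tensorflow as tf
import tensorflow_addons as tfa

theta = tf.Variable(tf.random.normal([64], stddev=0.05))   # full-precision parameters

# No saturating nonlinearity: the soft constraint comes from weight decay on theta
# rather than from sigma.
w = group_binarize(theta, lam=1.0, use_sigma=False)

# Decoupled weight decay SGD with momentum, as described in section 4.1.
optimizer = tfa.optimizers.SGDW(weight_decay=1e-3, learning_rate=0.05, momentum=0.9)
```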

2.3 Progressive binarization

Experimental results demonstrate that it might be beneficial to gradually increase the separation between the quantized values during training by interpolating the binarized weights w and the continuous parameters θ as follows:

w̃_j = α w_j + (1 − α) θ_j    (11)
α = min(1, t / (β T))    (12)

where t is the training step number; T is the total number of training steps; and β is a hyperparameter denoting the fraction of total training steps required for α to reach and remain at 1.
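
A possible implementation of this schedule, reusing the group_binarize sketch from section 2.1 and the symbols α, β, t and T as reconstructed above, could look as follows (illustrative only):

```python
def interpolation_coefficient(step, total_steps, beta):
    # alpha ramps linearly from 0 to 1 over the first beta * total_steps steps
    # and then stays at 1; beta == 0 disables progressive binarization.
    if beta == 0:
        return 1.0
    return min(1.0, step / (beta * total_steps))


def progressive_weights(theta, step, total_steps, beta, lam=1.0):
    alpha = interpolation_coefficient(step, total_steps, beta)
    w = group_binarize(theta, lam=lam, use_sigma=False)   # approximately binary weights
    return alpha * w + (1.0 - alpha) * theta               # interpolation of eq. (11)
```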

2.4 Parameter partitioning

Partitioning the parameters is useful to limit the breadth of the dependencies introduced by the mean subtraction in equations (5) and (6). Whilst the proposed method supports any arbitrary partitioning scheme, in this work the parameters are partitioned by filter for convolutional layers and by neuron for fully connected layers.
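
As an illustration of per-filter partitioning, the sketch below applies the group_binarize sketch from section 2.1 independently to each output filter of a Conv2D kernel; the helper name and the reshaping are assumptions for illustration, not the authors' implementation.

```python
def binarize_conv_kernel(kernel, lam=1.0):
    # kernel has shape [kh, kw, c_in, c_out]; each output filter forms one group.
    kh, kw, cin, cout = kernel.shape
    flat = tf.reshape(kernel, [kh * kw * cin, cout])
    columns = tf.unstack(flat, axis=1)                      # one column per filter
    binarized = [group_binarize(c, lam=lam, use_sigma=False) for c in columns]
    return tf.reshape(tf.stack(binarized, axis=1), kernel.shape)
```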

2.5 Inference

At inference time the parameters are binarized simply by applying the sign function so that they are restricted to exact values in {−1, +1}. Note that any zero valued parameters are assigned the value −1. At the completion of training only the binarized weights are retained; there is no need to keep the full-precision parameters nor any partitioning related information.
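
For example, the inference-time binarization described here can be written as a one-line helper (a sketch; the zero-to-minus-one convention follows the text above):

```python
def inference_weights(theta):
    # Exact binary weights for inference: strictly positive -> +1, otherwise -1,
    # so zero valued parameters are assigned -1 as described above.
    return tf.where(theta > 0.0, tf.ones_like(theta), -tf.ones_like(theta))
```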

3 Related work

The proposed method is closely related to the core idea proposed in shulman2020diffprune , where a similar transformation is used to emulate a multivariate Bernoulli random variable. Whereas in shulman2020diffprune nuisance parameters are added to the model to calculate the MLE for multiplicative binary gates in the context of network pruning, in this work no additional parameters are introduced and the weights themselves are transformed to approximate a multivariate binary state over the network weights.

Figure 1: Typical state evolution of a single layer during training. (a) the L_2 regularized full-precision parameters θ; (b) the interpolated progressively binarized weights of equation (11); (c) the weights used in inference, simply calculated as the sign of the parameters; (d) the minimum, mean and maximum values of the approximately binary weights; (e)-(f) the values of the training schedule coefficients per training step.

Network quantization refers to quantizing the weights and/or the activations of an ANN. It is one of a few methods for model compression and efficient model inference and has a large body of work in the literature dedicated to it. The focus of the method proposed in this work is on the extreme scenario of weight binarization, which offers the maximal compression and speed gains. Since there are too many methods related to BNNs to mention in detail, the interested reader is referred to journals/corr/abs-1710-09282 ; Qin_2020 ; electronics8060661 for a thorough review. The rest of this section is dedicated to methods that solve the binarization problem by smoothing or reinterpreting the combinatorial problem in a way that enables use of exact gradients with backpropagation.

The method proposed in DBLP:journals/corr/abs-1904-05868 approximates the quantization function such that the estimation error is controlled by gradually scaling the inputs to the quantizer during training. A different approach is taken by Martinez2020Training , suggesting to train identical networks four times with an alternating teacher-student relationship. An auxiliary loss is added to coerce the networks to learn similar activations, and the hyperbolic tangent function is utilized to smooth the sign function. Differentiable Soft Quantization (DSQ) is a method proposed in Gong_2019_ICCV to approximate the standard binary and uniform quantization process. DSQ employs a series of hyperbolic tangent functions to form a smooth function that progressively approaches a discrete-like state emulating low-bit uniform quantization, e.g. the sign function for the 1-bit case. Continuous Binarization, introduced in SakrCWGS18 , approximates the binary activation threshold operation using parameterized clipping functions and a scaled binary activation function. This enables training with exact gradients, however the method relies on a custom and lengthy training regime for individual layers and additional regularization. Furthermore the clipping functions are rectified and therefore suffer from zero gradients outside the clip boundaries. Self-Binarizing Networks, introduced in lahoud2019selfbinarizing , approach the binarization task by approximating the sign function with a hyperbolic tangent which is iteratively sharpened during training. Stochastic Quantization (SQ) dong2017learning proposes to quantize only a subset of the parameters at a time based on a stochastic selection criterion such that only a subset of the gradients are estimated during backpropagation.

4 Experiments

4.1 Inductive image classification

To demonstrate the effectiveness of the proposed method the top-1 accuracy is compared between a full-precision architecture and its binary weighted counterpart on a number of inductive image classification tasks. The methodology involves training each model twice, once with full-precision floating point weights and again using the proposed method. Both networks are evaluated at the end of each epoch and the best result achieved on the validation set during training is reported. The models are implemented in TensorFlow tensorflow2015-whitepaper using custom Dense and Conv2D layers. The optimizer used in all experiments is the decoupled weight decay SGD momentum optimizer DBLP:conf/iclr/LoshchilovH19 with a linear learning rate warmup period of 5 epochs. An exponential reduction schedule is applied to both the learning rate and the weight decay. In all experiments with full-precision networks, except for the WRN-28-10 CIFAR10 experiment, the schedule updates by a factor of 0.1 at 1/3 and 2/3 of the overall post-warmup training steps. For the WRN-28-10 CIFAR10 experiment the schedule updates are as recommended in BMVC2016_87 . For the binary variants the updates occur at 0.1, 0.25, 0.4, 0.55, 0.7, 0.85 of the overall post-warmup training steps with a factor of 0.3. The parameters of the batch normalization layers are excluded from weight decay. In all experiments the scaling coefficient is set to 1 for the initial 90% of training steps, and during the last 10% of training it is incremented every step until a final value of 12. The training parameters for all experiments are summarized in table 2. The residual blocks all use parameter free identity mappings that downsample skip connections by average pooling and concatenate zeros where required to match the number of activation planes. For the CIFAR data set classification tasks a basic augmentation of horizontal flips, random translation and zoom is used, and in the binary weighted variants all layers are binarized except for the first and last layers of the networks. Note there was no attempt to perform an exhaustive search of hyperparameters for the best possible result, therefore these results should be taken as indicative only. All image data sets are taken from TensorFlow Datasets TFDS with the default train/test split. The source code is publicly available at https://bitbucket.org/YanivShu/binary_weighted_networks_public.
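
The learning rate and weight decay schedule described above can be expressed as a small helper. The sketch below follows the description in this paragraph; the warmup handling and the exact milestone arithmetic are assumptions where the text is silent, and it is not the released training code.

```python
def schedule_scale(step, total_steps, warmup_steps, binary_weights):
    # Multiplier applied to the initial post-warmup learning rate / weight decay.
    if step < warmup_steps:
        return (step + 1) / warmup_steps                      # linear warmup
    frac = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    if binary_weights:
        milestones, factor = [0.1, 0.25, 0.4, 0.55, 0.7, 0.85], 0.3
    else:
        milestones, factor = [1 / 3, 2 / 3], 0.1              # except WRN-28-10 CIFAR10
    return factor ** sum(frac >= m for m in milestones)
```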

Data Set Architecture 32b Error % 1b Error % Change %
MNIST LeNet5 0.64 0.53 0.11
CIFAR10 VGG-Small 6.41 6.57 -0.16
ResNet-18 5.52 5.63 -0.11
WRN-28-10 4.51 4.39 0.12
CIFAR100 ResNet-18 23.52 24.02 -0.5
WRN-28-10 21.34 20.52 0.82
Table 1: Summary of experimental results by architecture, data set and weight precision. The error rates are the minimum errors obtained during training for the validation set.
Architecture Bits Batch Epochs L.R. W.D. β
LeNet5 32 100 200 0.01 1e-4 -
LeNet5 1 100 200 0.01 1e-3 0.9
VGG-Small 32 128 300 0.1 5e-4 -
VGG-Small 1 128 300 0.05 1e-3 0
ResNet-18 32 128 300 0.1 5e-4 -
ResNet-18 1 128 400 0.05 1e-3 0.9
WRN-28-10 32 128 200/300 0.1 5e-4 -
WRN-28-10 1 128 400 0.05 1e-3 0.9
Table 2: Summary of hyperparameters for all experiments. Bits is the bit depth of the network weights. Batch is the batch size used in training. Epochs is the total number of training epochs. L.R. is the initial post-warmup learning rate. W.D. is the initial weight decay scaler. β denotes the fraction of total training steps required for α to reach and remain at 1, see equation (12). The 200/300 in the WRN-28-10 row indicates the number of training epochs for the CIFAR10/CIFAR100 data sets respectively.

4.2 LeNet5 MNIST classification

The first experiment is the toy classification task of MNIST using the basic CNN LeNet5 Lecun98gradient-basedlearning . In the binary weighted variant all layers except for the last dense prediction layer are binarized.

4.3 CIFAR10 classification

The second experiment is the classification task of the CIFAR10 data set with three different architectures: a VGG-Small like network similar to the one used in 10.1007/978-3-030-01237-3_23 , ResNet-18 7780459 and WRN-28-10 BMVC2016_87 . For WRN-28-10 the baseline architecture is the no-dropout variant with identity mapping. A minor modification was made to the architecture by increasing the number of filters in the first convolution layer from 16 to 64.

VGG based architectures:
Method Architecture Error %
LAB DBLP:conf/iclr/HouYK17 Qin_2020 VGG-Small 10.5
BWN DBLP:conf/eccv/RastegariORF16 Qin_2020 VGG-Small 9.9
Self-Binarizing Networks lahoud2019selfbinarizing VGG-Small 9.4
BWNH conf/aaai/HuWC18 VGG9 9.2
MPT-1/32 (95) diffenderfer2021multiprize VGG-Small 8.5
BinaryConnect NIPS2015_3e15cc11 Qin_2020 VGG-Small 8.3
Proposed method VGG-Small 6.6

ResNet based architectures:
Method Architecture Error %
IR-Net DBLP:conf/cvpr/QinGLSWYS20 Qin_2020 ResNet-20 9.8
ProxQuant bai2018proxquant Qin_2020 ResNet-20 9.3
ProxQuant bai2018proxquant ResNet-44 7.8
SQ-BWN dong2017learning ResNet-56 7.2
Proposed method ResNet-18 5.6
MPT (80) +BN diffenderfer2021multiprize ResNet-18 5.2
Proposed method WRN-28-10 4.4
Table 3: Comparison of reported error rates on the CIFAR10 validation set for binary weighted networks. The first list summarizes the results for VGG based architectures and the second list the results for ResNet based architectures. The citations indicate the paper where the method is proposed and the source of the results if different to the paper.

4.4 CIFAR100 classification

The third experiment is the classification task of the CIFAR100 data set using the ResNet-18 7780459 and WRN-28-10 BMVC2016_87 architectures, identical to those used in the CIFAR10 experiments. Note that attempting to train the full-precision WRN-28-10 network with the same hyperparameter settings and learning rate schedule as specified in BMVC2016_87 resulted in a slightly reduced accuracy.

Method Architecture Error %
Self-Binarizing Networks lahoud2019selfbinarizing VGG-Small 36.5
BWN DBLP:conf/eccv/RastegariORF16 dong2017learning ResNet-56 35.0
BWNH conf/aaai/HuWC18 VGG9 34.4
SQ-BWN dong2017learning ResNet-56 31.6
Proposed method ResNet-18 24.0
Proposed method WRN-28-10 20.5
Table 4: Comparison of reported error rates on the CIFAR100 validation set for binary weighted networks. The citations indicate the paper where the method is proposed and the source of the results if different to the paper.

4.5 Effect of β

This section aims to quantify the effect of progressive binarization with different rates β. For this purpose models are trained a number of times with all settings unchanged except for modifying β. The results summarized in table 5 indicate that the models can train well with or without progressive binarization. Despite no strong evidence to support the usefulness of applying progressive binarization, it seems that for the deeper residual networks slow progressive binarization did slightly improve accuracy on the validation set.

Experiment β=0 β=0.3 β=0.5 β=0.7 β=0.9
VGG-Small CIFAR10 6.57 7.04 6.94 7.2 6.88
ResNet-18 CIFAR100 24.61 24.78 24.81 24.56 24.02
WRN-28-10 CIFAR100 20.59 20.67 20.96 20.92 20.52
Table 5: Lowest validation error (%) obtained during training for different values of β.
Figure 2: (a, c) Accuracy measured on the validation sets for the VGG-Small CIFAR10 and Wide ResNet CIFAR100 classification tasks for the binary weighted variant with different values of β over the entire training. (b, d) The same for the last 50 epochs of training.

5 Discussion

In this section an analysis is performed to investigate the reasons leading to the outstanding experimental results. Consider the dot product, the core operation of neural networks, and its gradient:

z = g(w^T x)    (13)
∂z/∂w = g′(w^T x) x    (14)

where w, x ∈ ℝ^m are the weight and input vectors respectively, and g is an arbitrary nonlinearity. In comparison, consider the positive (or negative) group transformation proposed in this work and its gradient:

w = 1 + λ (θ − μ_θ)    (15)
z = g(w^T x)    (16)
∂z/∂θ_j = λ g′(w^T x) (x_j − μ_x)    (17)

where μ_x is the mean of the elements of the vector x. Equation (17) reveals interesting properties of the proposed method.

The first is that for each of the partitions the gradients are zero centered due to the mean subtraction. This implies that after the gradient update the mean of the parameters will remain unchanged. Assuming the parameters are initialized with zero mean, and considering this in conjunction with the L_2 regularization, this property may have a regularizing effect. If a probabilistic interpretation is assumed, similar to DBLP:conf/cvpr/QinGLSWYS20 , maintaining a close to symmetric parameter distribution with zero mean may increase the entropy of the weight distribution and therefore the representation power of the network.

Secondly, assume that μ_x = 0; then the gradients of the full-precision and binary networks are proportional, i.e. ∂z/∂θ_j = λ ∂z/∂w_j, with equality when λ = 1. For standard gradient descent the proportionality implies the two models can be trained identically simply by scaling the learning rate. The assumption of μ_x ≈ 0 is reasonable for inputs that are normalized by methods such as batch normalization 10.5555/3045118.3045167 , instance normalization DBLP:journals/corr/UlyanovVL16 or group normalization citeulike:14571032 . Therefore training of approximately binary weighted networks with gradient descent can be as effective as the training of full-precision networks as long as the two aforementioned conditions are maintained.
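
The two properties can be checked numerically with a short TensorFlow snippet, assuming the reconstructed form w = 1 + λ(θ − μ_θ) for a single positive group and an arbitrary nonlinearity g = tanh:

```python
import tensorflow as tf

lam = 0.1
theta = tf.Variable(tf.random.normal([16]))
x = tf.random.normal([16])
x = x - tf.reduce_mean(x)                      # zero-mean input, e.g. after batch norm

with tf.GradientTape(persistent=True) as tape:
    w = 1.0 + lam * (theta - tf.reduce_mean(theta))   # positive group transformation
    z = tf.tanh(tf.tensordot(w, x, axes=1))           # g(w^T x)
g_theta = tape.gradient(z, theta)              # gradient through the transformation
g_w = tape.gradient(z, w)                      # gradient of the plain dot product

print(float(tf.reduce_sum(g_theta)))                       # ~0: zero-centred gradients
print(float(tf.reduce_max(tf.abs(g_theta - lam * g_w))))   # ~0: proportional when mean(x) = 0
```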

6 Conclusion

This paper proposes a novel and effective method for training binary weighted networks by smoothing the combinatorial problem of finding a binary vector of weights to minimize the expected loss for a given objective by means of empirical risk minimization with backpropagation. The method adds little computational complexity and can be readily applied to common architectures using automatic differentiation frameworks. Theoretical analysis and experimental results demonstrate that binary weighted networks can train well with the same standard optimization techniques and similar hyperparameter settings as their full-precision counterparts, such as momentum SGD with large learning rates and L_2 regularization.

References

  • [1] TensorFlow Datasets, a collection of ready-to-use datasets. https://www.tensorflow.org/datasets.
  • [2] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • [3] Y. Bai, Y.-X. Wang, and E. Liberty. Proxquant: Quantized neural networks via proximal operators. In International Conference on Learning Representations, 2019.
  • [4] Y. Bengio, N. Léonard, and A. C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013.
  • [5] D. W. Blalock, J. J. G. Ortiz, J. Frankle, and J. V. Guttag. What is the state of neural network pruning? In I. S. Dhillon, D. S. Papailiopoulos, and V. Sze, editors, Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2-4, 2020. mlsys.org, 2020.
  • [6] A. Bulat and G. Tzimiropoulos. Xnor-net++: Improved binary neural networks. CoRR, abs/1909.13863, 2019.
  • [7] A. Bulat, G. Tzimiropoulos, J. Kossaifi, and M. Pantic. Improved training of binary networks for human pose estimation and image recognition. CoRR, abs/1904.05868, 2019.
  • [8] Z. Cai, X. He, J. Sun, and N. Vasconcelos. Deep learning with low precision by half-wave gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [9] Y. Cheng, D. Wang, P. Zhou, and T. Zhang. A survey of model compression and acceleration for deep neural networks. CoRR, abs/1710.09282, 2017.
  • [10] M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with binary weights during propagations. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
  • [11] S. Darabi, M. Belbahri, M. Courbariaux, and V. P. Nia. BNN+: improved binary network training. CoRR, abs/1812.11800, 2018.
  • [12] J. Diffenderfer and B. Kailkhura. Multi-prize lottery ticket hypothesis: Finding accurate binary neural networks by pruning a randomly weighted network. In International Conference on Learning Representations, 2021.
  • [13] Y. Dong, R. Ni, J. Li, Y. Chen, J. Zhu, and H. Su. Learning accurate low-bit deep neural networks with stochastic quantization, 2017.
  • [14] R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  • [15] S. Han, J. Pool, J. Tran, and W. J. Dally. Learning both weights and connections for efficient neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, page 1135–1143, Cambridge, MA, USA, 2015. MIT Press.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [17] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
  • [18] L. Hou, Q. Yao, and J. T. Kwok. Loss-aware binarization of deep networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
  • [19] Q. Hu, P. Wang, and J. Cheng. From hashing to cnns: Training binary weight networks via hashing. In S. A. McIlraith and K. Q. Weinberger, editors, AAAI, pages 3247–3254. AAAI Press, 2018.
  • [20] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
  • [21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, page 448–456. JMLR.org, 2015.
  • [22] F. Lahoud, R. Achanta, P. Márquez-Neila, and S. Süsstrunk. Self-binarizing networks, 2019.
  • [23] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
  • [24] H. Li, S. De, Z. Xu, C. Studer, H. Samet, and T. Goldstein. Training quantized nets: A deeper understanding. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • [25] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K.-T. Cheng. Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
  • [26] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
  • [27] C. Louizos, M. Welling, and D. P. Kingma. Learning sparse neural networks through L_0 regularization. In International Conference on Learning Representations, 2018.
  • [28] B. Martinez, J. Yang, A. Bulat, and G. Tzimiropoulos. Training binary neural networks with real-to-binary convolutions. In International Conference on Learning Representations, 2020.
  • [29] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In J. Fürnkranz and T. Joachims, editors, ICML, pages 807–814. Omnipress, 2010.
  • [30] H. Qin, R. Gong, X. Liu, X. Bai, J. Song, and N. Sebe. Binary neural networks: A survey. Pattern Recognition, 105:107281, Sep 2020.
  • [31] H. Qin, R. Gong, X. Liu, M. Shen, Z. Wei, F. Yu, and J. Song. Forward and backward information retention for accurate binary neural networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 2247–2256. IEEE, 2020.
  • [32] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, volume 9908 of Lecture Notes in Computer Science, pages 525–542. Springer, 2016.
  • [33] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning Representations by Back-propagating Errors. Nature, 323(6088):533–536, 1986.
  • [34] C. Sakr, J. Choi, Z. Wang, K. Gopalakrishnan, and N. R. Shanbhag. True gradient-based training of deep binary activated neural networks via continuous binarization. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15-20, 2018, pages 2346–2350. IEEE, 2018.
  • [35] Y. Shulman. Diffprune: Neural network pruning with deterministic approximate binary gates and regularization. arXiv preprint arXiv:2012.03653, 2020.
  • [36] T. Simons and D.-J. Lee. A review of binarized neural networks. Electronics, 8(6), 2019.
  • [37] S. Srinivas and R. V. Babu. Learning neural network architectures using backpropagation. In R. C. Wilson, E. R. Hancock, and W. A. P. Smith, editors, Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016. BMVA Press, 2016.
  • [38] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016.
  • [39] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
  • [40] Y. Wu and K. He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [41] S. Zagoruyko and N. Komodakis. Wide residual networks. In E. R. H. Richard C. Wilson and W. A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.
  • [42] D. Zhang, J. Yang, D. Ye, and G. Hua. Lq-nets: Learned quantization for highly accurate and compact deep neural networks. In V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, editors, Computer Vision – ECCV 2018, pages 373–390, Cham, 2018. Springer International Publishing.