Contemporary artificial neural networks (ANNs) have achieved state-of-the-art results in a multitude of learning tasks. These models often include millions of parameters that form dense structures, enabling efficient parallel computation using specialized software and hardware. However, the dependency of these models on substantial hardware resources limits their utility on resource-constrained hardware such as mobile and low-power embedded devices. One approach to reducing computational requirements is model compression, which transforms an initial cumbersome architecture into a more efficient architecture that requires less space and compute while keeping the performance loss to an acceptable degree. Model compression is typically achieved by reducing the number of parameters in the model and/or by quantizing the parameters and activations so that fewer bits are used to encode the data flowing through the network.
Many approaches have been suggested for reducing the number of parameters, including weight pruning 10.5555/2969239.2969366, architecture learning DBLP:conf/bmvc/SrinivasB16, knowledge distillation 44873, structured pruning NIPS2016_41bfd20a and regularization louizos2018learning; shulman2020diffprune. The interested reader is referred to DBLP:conf/mlsys/BlalockOFG20; journals/corr/abs-1710-09282 for recent reviews.
Quantization-based model compression is a high-performing and fast approach for inference that yields highly compressed models compared to their full-precision floating-point counterparts. The most extreme quantization is a 1-bit representation of parameters and activations such that they have only two possible values, typically -1 (0) or +1. An ANN restricted to binary representations is typically known as a Binary Neural Network (BNN). Models that constrain the weights to binary values enable an efficient implementation of the ubiquitous dot product using only additions, without requiring floating-point multiplications. Furthermore, networks that restrict both weights and activations to binary values enable significant computational acceleration in inference by utilizing highly efficient bitwise XNOR and Bitcount operations that can be further optimized in specialized hardware. These models are therefore an attractive alternative to full-precision ANNs where power efficiency and constrained compute resources are important considerations Qin_2020; NIPS2015_3e15cc11; DBLP:conf/eccv/RastegariORF16; electronics8060661. Another compelling approach is to binarize only the parts of the network that benefit the most from quantization and keep other layers at higher precision. In fact, most proposed BNNs use partial binarization, since typically at least the fully connected output layer and the first convolution layer weights are kept at a higher precision electronics8060661; Qin_2020. Additional examples include retaining the parameters of the batch normalization layers at high precision 10.1007/978-3-030-01237-3_23, applying a scaling factor to the binary weights DBLP:conf/eccv/RastegariORF16; DBLP:journals/corr/abs-1909-13863; Martinez2020Training; SakrCWGS18, or using floating-point parametrized activations DBLP:journals/corr/abs-1904-05868; Martinez2020Training.
Many learning algorithms, and neural networks in particular, typically employ gradient-based optimizers such as the Backpropagation algorithm Rumelhart:1986we. Models that are designed to have a continuous relationship between parameters and the training objective enable the computation of exact gradients, which in turn enables efficient optimization journals/corr/BengioLC13. Many of the existing methods in the literature for ANN quantization, such as NIPS2016_d8330f85; Liu_2018_ECCV; Cai_2017_CVPR; DBLP:journals/corr/abs-1812-11800; DBLP:conf/cvpr/QinGLSWYS20, employ non-differentiable quantization techniques that require the use of gradient estimators, resulting in a divergence between the forward pass and backpropagation and therefore decreased training efficacy NIPS2017_1c303b0e. The challenge is then to combine discrete-valued weights, for which the gradient is undefined, with the effective backpropagation method for training neural networks.
The main contribution of this work is the introduction of a method to smooth the combinatorial problem of finding a binary vector of weights that minimizes the expected loss for a given objective, by means of empirical risk minimization with backpropagation. This is achieved by approximating a multivariate binary state over the weights using a deterministic and differentiable transformation of real-valued, continuous parameters. The proposed method adds little overhead in training, can be readily applied without any modifications to the original architecture, does not introduce additional saturating nonlinearities or auxiliary losses, and does not preclude applying other methods for binarizing the activations. Contrary to common assertions made in the literature, it is demonstrated that binary weighted networks can train well with the same standard optimization techniques and similar hyperparameter settings as their full-precision counterparts, specifically momentum SGD with large learning rates and regularization Qin_2020. Finally, experiments demonstrate little loss, and in some cases even a modest gain, in accuracy on a number of inductive image classification tasks compared to the full-precision counterparts. The source code is publicly available at https://bitbucket.org/YanivShu/binary_weighted_networks_public.
2 Proposed method
2.1 Binary group weight transformations
Let $\mathbf{w} \in \{-1, +1\}^{d}$ be the binary-valued weights (parameters) of a hypothesis $h(\cdot; \mathbf{w})$, such as a binary weighted neural network, where $d$ denotes the cardinality of $\mathbf{w}$. Let $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ be a training set consisting of $N$ i.i.d. instances. The empirical risk associated with the hypothesis is defined as:

$$R(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(h(\mathbf{x}_i; \mathbf{w}), y_i\big), \qquad (1)$$

where $\mathbf{w}$ is constrained to take values in $\{-1, +1\}^{d}$ and $\ell$ is a loss function that measures the discrepancy between the true value $y_i$ and the predicted outcome $h(\mathbf{x}_i; \mathbf{w})$. The goal of the optimization problem is to find the $\mathbf{w}$, given the hypothesis and data, for which the empirical risk is minimal.
Minimizing the objective (1) is provably a hard combinatorial problem with complexity exponential in $d$. Alternative methods such as gradient-based optimization cannot be readily used since the objective is not differentiable w.r.t. the binary $\mathbf{w}$. To overcome this challenge, a deterministic, differentiable relaxation of the hard binary constraints governing $\mathbf{w}$ is proposed, enabling a surrogate minimization problem to be solved efficiently and deterministically using common gradient-based optimizers. To enable efficient backpropagation during training, the hard constraint of the weights being exactly binary may be relaxed and replaced with a soft constraint of being approximately plus or minus one. Let $\boldsymbol{\theta} \in \mathbb{R}^{d}$ be a real-valued vector and $\sigma$ a differentiable function from the real numbers to the range $(-1, 1)$, e.g. the hyperbolic tangent $\tanh$. Equations (3)-(7) define a deterministic and differentiable transformation $\mathbf{w}(\boldsymbol{\theta})$ that maps vectors in $\mathbb{R}^{d}$ to be approximately binary, i.e. $\big| |w_i| - 1 \big| \le \epsilon$ for some small $\epsilon$.
Here $\theta_i$ denotes the $i$-th element of $\boldsymbol{\theta}$, and the two group means are taken over the positive and negative partitions respectively. The transformation conceptually comprises two partitions, $P^{+}$ and $P^{-}$, such that by definition elements of $P^{+}$ map to values close to $+1$ and elements of $P^{-}$ to values close to $-1$. The variance of both groups is controlled by the scale hyperparameter $\beta$, which may be set as small as practically useful, and therefore $\mathbf{w}(\boldsymbol{\theta})$ is exactly binary in the limit as $\beta \to 0$. Note that the gradient of $\mathbf{w}$ w.r.t. $\boldsymbol{\theta}$ is non-degenerate provided that there are at least two members in each of $P^{+}$ and $P^{-}$.
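To make the construction concrete, the following sketch implements one plausible instantiation of the group transformation described above: within each partition the parameters are mean-centered, scaled by a spread parameter `beta`, and shifted to +1 or -1. The exact form of equations (3)-(7) may differ; the function name `group_binarize`, its signature and the choice of a linear (rather than tanh-squashed) map are illustrative assumptions, not the paper's code.

```python
import numpy as np

def group_binarize(theta, pos_mask, beta=1.0):
    """Approximately binary transform of real-valued parameters theta.

    Hypothetical instantiation: within each partition the parameters are
    mean-centered and shifted to +1 or -1; beta controls the spread around
    the binary values, so the output is exactly binary in the limit beta -> 0.
    """
    theta = np.asarray(theta, dtype=np.float64)
    w = np.empty_like(theta)
    for sign, mask in ((+1.0, pos_mask), (-1.0, ~pos_mask)):
        group = theta[mask]
        # mean-centering keeps the group gradient zero-centered
        w[mask] = sign + beta * (group - group.mean())
    return w
```

Setting `beta` to 0 recovers exactly binary weights, while small positive values keep the map differentiable with non-degenerate gradients whenever each partition contains at least two members.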
Having defined $\mathbf{w}(\boldsymbol{\theta})$, reconsider the hypothesis and the associated empirical risk following the reparameterization of $\mathbf{w}$, given a partition of the parameter indices into subsets:

$$R(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(h(\mathbf{x}_i; \mathbf{w}(\boldsymbol{\theta})), y_i\big). \qquad (9)$$

The objectives in equations (1) and (9) are equivalent in the limit as $\beta \to 0$. However, for reasonably low values of $\beta$, the formulation in equations (8)-(10) can be used as a differentiable surrogate to the objective in equation (1), since the binary weights $\mathbf{w}$ are replaced with the smoothed, approximately binary weights $\mathbf{w}(\boldsymbol{\theta})$. This in turn enables the use of gradient-based optimizers to find an approximate solution to the original hard combinatorial problem with low quantization error.
2.2 Reduction of quantization error with regularization
The inclusion of $\sigma$ in equations (3) and (4) enables theoretical bounds on the divergence of the binarized weights from $\pm 1$. Including such nonlinearities is a common approach: the hyperbolic tangent is often used for this purpose in training BNNs DBLP:journals/corr/abs-1904-05868; Martinez2020Training; Gong_2019_ICCV; lahoud2019selfbinarizing; DBLP:conf/cvpr/QinGLSWYS20, as are the hard tanh and its variants NIPS2016_d8330f85; SakrCWGS18. However, superfluous saturating nonlinearities change the objective in a non-trivial way and slow training, as they typically have substantial areas of their domain where gradients are very small or practically zero. To mitigate these shortcomings, it is proposed that in practice the function $\sigma$ be removed and soft constraints be introduced instead, encouraging the parameters not to diverge from each other. This invalidates the theoretical guarantees about the variance of the positive and negative partitions as defined in equations (3)-(7); however, it works well in practice and alleviates the need to introduce superfluous saturating nonlinearities. None of the results discussed in subsequent sections include any activations or saturating nonlinearities added to the original full-precision architectures in the forward or backward propagation; instead, regularization is applied to encourage the full-precision parameters not to diverge far from zero.
2.3 Progressive binarization
Experimental results demonstrate that it might be beneficial to gradually increase the separation between the quantized values during training by interpolating between the binarized weights and the continuous parameters, where $t$ is the training step number, $T$ is the total number of training steps and $\gamma$ is a hyperparameter denoting the fraction of total training steps required for the interpolation factor to reach and remain at 1.
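A minimal sketch of this schedule follows, assuming a linear ramp of the interpolation factor; the function names `interpolation_factor` and `progressive_weights` are invented for illustration and the exact interpolation equation in the paper may differ.

```python
def interpolation_factor(t, total_steps, gamma):
    """Fraction in [0, 1] that ramps linearly with the step count and
    stays at 1 after gamma * total_steps steps (assumed linear ramp)."""
    return min(t / (gamma * total_steps), 1.0)

def progressive_weights(w_binary, theta, t, total_steps, gamma=0.5):
    """Interpolate between binarized weights and continuous parameters."""
    lam = interpolation_factor(t, total_steps, gamma)
    return lam * w_binary + (1.0 - lam) * theta
```

With `gamma=0.5` and 100 total steps, the weights are fully binarized from step 50 onward; `gamma=1.0` spreads the ramp over the whole run.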
2.4 Parameter partitioning
Whilst the proposed method supports any arbitrary partitioning scheme, in this work the parameters are partitioned by filter for convolutional layers and by neuron for fully connected layers.
For inference the parameters are binarized simply by using the sign function, so that they are restricted to exact values in $\{-1, +1\}$. Note that any zero-valued parameters are assigned the value $-1$. At the completion of training only the binarized weights are retained; there is no need to keep the full-precision parameters or any partitioning-related information.
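The inference-time binarization can be expressed in one line, with the zero case mapped to -1 as stated above (the function name is an assumption):

```python
import numpy as np

def binarize_for_inference(w):
    # sign function with the zero case assigned -1, per the text
    return np.where(w > 0.0, 1.0, -1.0)
```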
3 Related work
The proposed method is closely related to the core idea proposed in shulman2020diffprune, where a similar transformation is used to emulate a multivariate Bernoulli random variable. Whereas in shulman2020diffprune nuisance parameters are added to the model to calculate the MLE of multiplicative binary gates in the context of network pruning, in this work no additional parameters are introduced and the weights themselves are transformed to approximate a multivariate binary state over the network weights.
Network quantization refers to quantizing the weights and/or the activations of an ANN. It is one of several methods for model compression and efficient model inference, and a large body of work in the literature is dedicated to it. The focus of the method proposed in this work is on the extreme scenario of weight binarization, which offers the maximal compression and speed gains. Since there are too many methods related to BNNs to cover in detail, the interested reader is referred to journals/corr/abs-1710-09282; Qin_2020; electronics8060661 for a thorough review. The rest of this section is dedicated to methods that solve the binarization problem by smoothing or reinterpreting the combinatorial problem in a way that enables the use of exact gradients with backpropagation.
The method proposed in DBLP:journals/corr/abs-1904-05868 approximates the quantization function with a smooth surrogate such that the estimation error is controlled by gradually scaling the inputs to the quantizer during training. A different approach is taken by Martinez2020Training, suggesting to train identical networks four times with an alternating teacher-student relationship. An auxiliary loss is added to coerce the networks to learn similar activations, and the hyperbolic tangent function is also utilized to smooth the sign function. Differentiable Soft Quantization (DSQ) Gong_2019_ICCV approximates the standard binary and uniform quantization process: DSQ employs a series of hyperbolic tangent functions to form a smooth function that progressively approaches a discrete-like state, emulating low-bit uniform quantization, e.g. the sign function for the 1-bit case. Continuous Binarization SakrCWGS18 approximates the binary activation threshold operation using parameterized clipping functions and a scaled binary activation function. This enables training with exact gradients; however, the method relies on a custom and lengthy training regime for individual layers and additional regularization. Furthermore, the clipping functions are rectified and therefore suffer from zero gradients outside the clip boundaries. Self-Binarizing Networks lahoud2019selfbinarizing approach the binarization task by approximating the sign function with a hyperbolic tangent that is iteratively sharpened during training. Stochastic Quantization (SQ) dong2017learning proposes to quantize only a subset of the parameters at a time, based on a stochastic selection criterion, such that only a subset of the gradients are estimated during backpropagation.
4.1 Inductive image classification
To demonstrate the effectiveness of the proposed method, the top-1 accuracy of a full-precision architecture is compared to that of its binary weighted counterpart on a number of inductive image classification tasks. The methodology involves training each model twice, once with full-precision floating-point weights and again using the proposed method. Both networks are evaluated at the end of each epoch and the best result achieved on the validation set during training is reported. The models are implemented in TensorFlow tensorflow2015-whitepaper using custom Dense and Conv2D layers. The optimizer used in all experiments is the decoupled-weight-decay SGD momentum optimizer DBLP:conf/iclr/LoshchilovH19 with a linear learning rate warmup period of 5 epochs. An exponential reduction schedule is applied to both the learning rate and the weight decay. In all full-precision experiments, except for the WRN-28-10 CIFAR10 experiment, the schedule updates by a factor of 0.1 at 1/3 and 2/3 of the overall post-warmup training steps. For the WRN-28-10 CIFAR10 experiment the schedule updates are as recommended in BMVC2016_87. For the binary variants the updates occur at 0.1, 0.25, 0.4, 0.55, 0.7 and 0.85 of the overall post-warmup training steps with a factor of 0.3. The parameters of the batch normalization layers are excluded from weight decay. In all experiments the hyperparameter controlling the sharpness of the binarization is set to 1 for the initial 90% of training steps, and during the last 10% of training it is incremented every step until a final value of 12. The training parameters for all experiments are summarized in table 2. The residual blocks all use a parameter-free identity mapping that downsamples skip connections by average pooling and concatenates zeros where required to match the number of activation planes. For the CIFAR data set classification tasks a basic augmentation of horizontal flips, random translations and zoom is used, and in the binary weighted variants all layers are binarized except for the first and last layers of the networks. Note that no attempt was made to perform an exhaustive hyperparameter search for the best possible result; these results should therefore be taken as indicative only. All image data sets are taken from TensorFlow Datasets TFDS with the default train/test split. The source code is publicly available at https://bitbucket.org/YanivShu/binary_weighted_networks_public.
|Data Set|Architecture|32b Error %|1b Error %|Change %|
4.2 LeNet5 MNIST classification
The first experiment is the toy classification task of MNIST using the basic CNN LeNet5 Lecun98gradient-basedlearning . In the binary weighted variant all layers except for the last dense prediction layer are binarized.
4.3 CIFAR10 classification
The second experiment is the classification task of the CIFAR10 data set with three different architectures: a Vgg-Small-like network similar to the one used in 10.1007/978-3-030-01237-3_23, ResNet-18 7780459 and WRN-28-10 BMVC2016_87. For WRN-28-10 the baseline architecture is the no-dropout variant with identity mapping. A minor modification was made to the architecture by increasing the number of filters in the first convolution layer from 16 to 64.
4.4 CIFAR100 classification
The third experiment is the classification task of the CIFAR100 data set using the ResNet-18 7780459 and WRN-28-10 BMVC2016_87 architectures, identical to those used in the CIFAR10 experiments. Note that attempting to train the full-precision WRN-28-10 network with the same hyperparameter settings and learning rate schedule as specified in BMVC2016_87 resulted in slightly reduced accuracy.
|Self-Binarizing Networks lahoud2019selfbinarizing|VGG-Small|36.5|
|BWN DBLP:conf/eccv/RastegariORF16 dong2017learning|ResNet-56|35.0|
4.5 Effect of the progressive binarization rate
This section aims to quantify the effect of progressive binarization with different rates. For this purpose, models are trained a number of times with all settings unchanged except for the progressive binarization rate. The results, summarized in table 5, indicate that the models train well with or without progressive binarization. Despite there being no strong evidence supporting the usefulness of progressive binarization, it seems that for the deeper residual networks slow progressive binarization slightly improved accuracy on the validation set.
In this section an analysis is performed to investigate the reasons leading to the favorable experimental results. Consider the dot product, the core operation of neural networks, and its gradient:

$$y = f(\mathbf{w}^{\top}\mathbf{x}), \qquad \frac{\partial y}{\partial w_i} = f'(\mathbf{w}^{\top}\mathbf{x})\, x_i,$$

where $\mathbf{w}, \mathbf{x} \in \mathbb{R}^{n}$ and $f$ is an arbitrary nonlinearity. In comparison, consider the positive (or negative) group transformation proposed in this work and its gradient:

$$y = f\big(\mathbf{w}(\boldsymbol{\theta})^{\top}\mathbf{x}\big), \qquad \frac{\partial y}{\partial \theta_i} = f'\big(\mathbf{w}(\boldsymbol{\theta})^{\top}\mathbf{x}\big)\, \beta\, (x_i - \bar{x}), \qquad (17)$$

where $\bar{x}$ is the mean of the elements of the vector $\mathbf{x}$. Equation (17) reveals interesting properties of the proposed method.
The first is that for each of the partitions the gradients are zero-centered due to the mean subtraction. This implies that after a gradient update the mean of the parameters remains unchanged. Assuming the parameters are initialized with zero mean, and considering this in conjunction with the regularization, this property may have a regularizing effect. Under a probabilistic interpretation similar to DBLP:conf/cvpr/QinGLSWYS20, maintaining a close-to-symmetric parameter distribution with zero mean may increase the entropy of the weight distribution and therefore the representational power of the network.
Secondly, assume that the inputs have zero mean, i.e. $\bar{x} = 0$; then the gradients of the full-precision and binary networks are proportional, with equality when $\beta = 1$. For standard gradient descent this proportionality implies the two models can be trained identically simply by scaling the learning rate. The zero-mean assumption is reasonable for inputs that are normalized by methods such as batch normalization 10.5555/3045118.3045167, instance normalization DBLP:journals/corr/UlyanovVL16 or group normalization citeulike:14571032. Therefore the training of approximately binary weighted networks with gradient descent can be as effective as the training of full-precision networks, as long as the two aforementioned conditions are maintained.
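The zero-centering property can be checked numerically. The snippet below is an illustrative verification, not the paper's code; it assumes the mean-centering form $w = \beta(\theta - \bar{\theta}) \pm 1$ for a single partition, whose Jacobian is $\beta (I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top})$.

```python
import numpy as np

# The Jacobian of the mean-centering map has columns that sum to zero, so any
# upstream gradient pulled back through it is zero-centered, and a gradient
# step preserves the mean of theta.
n, beta, lr = 4, 0.5, 0.01
J = beta * (np.eye(n) - np.ones((n, n)) / n)

upstream = np.array([0.7, -1.2, 0.3, 2.0])  # arbitrary upstream gradient dL/dw
g = J.T @ upstream                          # gradient w.r.t. theta

theta = np.array([0.3, -0.2, 0.5, -0.9])
theta_new = theta - lr * g

assert abs(g.mean()) < 1e-12                          # zero-centered gradient
assert abs(theta_new.mean() - theta.mean()) < 1e-12   # parameter mean preserved
```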
This paper proposes a novel and effective method for training binary weighted networks by smoothing the combinatorial problem of finding a binary vector of weights to minimize the expected loss for a given objective by means of empirical risk minimization with backpropagation. The method adds little computational complexity and can be readily applied to common architectures using automatic differentiation frameworks. Theoretical analysis and experimental results demonstrate that binary weighted networks can train well with the same standard optimization techniques and similar hyperparameter settings as their full-precision counterparts such as momentum SGD with large learning rates and regularization.
-  TensorFlow Datasets, a collection of ready-to-use datasets. https://www.tensorflow.org/datasets.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
-  Y. Bai, Y.-X. Wang, and E. Liberty. Proxquant: Quantized neural networks via proximal operators. In International Conference on Learning Representations, 2019.
-  Y. Bengio, N. Léonard, and A. C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013.
-  D. W. Blalock, J. J. G. Ortiz, J. Frankle, and J. V. Guttag. What is the state of neural network pruning? In I. S. Dhillon, D. S. Papailiopoulos, and V. Sze, editors, Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2-4, 2020. mlsys.org, 2020.
-  A. Bulat and G. Tzimiropoulos. Xnor-net++: Improved binary neural networks. CoRR, abs/1909.13863, 2019.
-  A. Bulat, G. Tzimiropoulos, J. Kossaifi, and M. Pantic. Improved training of binary networks for human pose estimation and image recognition. CoRR, abs/1904.05868, 2019.
-  Z. Cai, X. He, J. Sun, and N. Vasconcelos. Deep learning with low precision by half-wave gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  Y. Cheng, D. Wang, P. Zhou, and T. Zhang. A survey of model compression and acceleration for deep neural networks. CoRR, abs/1710.09282, 2017.
-  M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with binary weights during propagations. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
-  S. Darabi, M. Belbahri, M. Courbariaux, and V. P. Nia. BNN+: improved binary network training. CoRR, abs/1812.11800, 2018.
-  J. Diffenderfer and B. Kailkhura. Multi-prize lottery ticket hypothesis: Finding accurate binary neural networks by pruning a randomly weighted network. In International Conference on Learning Representations, 2021.
-  Y. Dong, R. Ni, J. Li, Y. Chen, J. Zhu, and H. Su. Learning accurate low-bit deep neural networks with stochastic quantization, 2017.
-  R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
-  S. Han, J. Pool, J. Tran, and W. J. Dally. Learning both weights and connections for efficient neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, page 1135–1143, Cambridge, MA, USA, 2015. MIT Press.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
-  G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
-  L. Hou, Q. Yao, and J. T. Kwok. Loss-aware binarization of deep networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
-  Q. Hu, P. Wang, and J. Cheng. From hashing to cnns: Training binary weight networks via hashing. In S. A. McIlraith and K. Q. Weinberger, editors, AAAI, pages 3247–3254. AAAI Press, 2018.
-  I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, page 448–456. JMLR.org, 2015.
-  F. Lahoud, R. Achanta, P. Márquez-Neila, and S. Süsstrunk. Self-binarizing networks, 2019.
-  Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
-  H. Li, S. De, Z. Xu, C. Studer, H. Samet, and T. Goldstein. Training quantized nets: A deeper understanding. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
-  Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K.-T. Cheng. Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
-  I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
-  C. Louizos, M. Welling, and D. P. Kingma. Learning sparse neural networks through regularization. In International Conference on Learning Representations, 2018.
-  B. Martinez, J. Yang, A. Bulat, and G. Tzimiropoulos. Training binary neural networks with real-to-binary convolutions. In International Conference on Learning Representations, 2020.
-  V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In J. Fürnkranz and T. Joachims, editors, ICML, pages 807–814. Omnipress, 2010.
-  H. Qin, R. Gong, X. Liu, X. Bai, J. Song, and N. Sebe. Binary neural networks: A survey. Pattern Recognition, 105:107281, Sep 2020.
-  H. Qin, R. Gong, X. Liu, M. Shen, Z. Wei, F. Yu, and J. Song. Forward and backward information retention for accurate binary neural networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 2247–2256. IEEE, 2020.
-  M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: Imagenet classification using binary convolutional neural networks. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, volume 9908 of Lecture Notes in Computer Science, pages 525–542. Springer, 2016.
-  D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning Representations by Back-propagating Errors. Nature, 323(6088):533–536, 1986.
-  C. Sakr, J. Choi, Z. Wang, K. Gopalakrishnan, and N. R. Shanbhag. True gradient-based training of deep binary activated neural networks via continuous binarization. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15-20, 2018, pages 2346–2350. IEEE, 2018.
-  Y. Shulman. Diffprune: Neural network pruning with deterministic approximate binary gates and regularization. arXiv preprint arXiv:2012.03653, 2020.
-  T. Simons and D.-J. Lee. A review of binarized neural networks. Electronics, 8(6), 2019.
-  S. Srinivas and R. V. Babu. Learning neural network architectures using backpropagation. In R. C. Wilson, E. R. Hancock, and W. A. P. Smith, editors, Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016. BMVA Press, 2016.
-  D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016.
-  W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
-  Y. Wu and K. He. Group normalization. CoRR, abs/1803.08494, 2018.
-  S. Zagoruyko and N. Komodakis. Wide residual networks. In E. R. H. Richard C. Wilson and W. A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.
-  D. Zhang, J. Yang, D. Ye, and G. Hua. Lq-nets: Learned quantization for highly accurate and compact deep neural networks. In V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, editors, Computer Vision – ECCV 2018, pages 373–390, Cham, 2018. Springer International Publishing.