1 Introduction
Recent methods based on Convolutional Neural Networks (CNNs) have been shown to produce results of high accuracy for a wide range of challenging Computer Vision tasks like image recognition
[22, 32, 16], object detection [29], semantic segmentation [24, 14] and human pose estimation [41, 25]. Two fundamental assumptions made by these methods are that: 1) very large and diverse labelled datasets are available for training, and 2) at least one high-end GPU is available for model training and inference. While it can be assumed that both labelled training data and computational resources are available for training, from a practical perspective, in many applications (e.g. object recognition or human sensing on mobile devices and robots), it is unreasonable to assume that dedicated high-end GPUs are available for inference. The aim of this paper is to enable highly accurate and efficient convolutional networks on devices with limited memory, storage and computational power. Under such constraints, the accuracy and performance of existing methods drop rapidly, and the problem is considered far from solved.

Perhaps the most promising method for model compression and efficient model inference is network binarization, especially when both activations and weights are binary [9, 10, 28]. In this case, the binary convolution operation can be efficiently implemented with the bitwise XNOR, resulting in a speedup of about 58x on CPU (the speedup on FPGAs can be even higher) and a model compression ratio of about 32x [28]. Although no other technique can achieve such impressive speedups and compression rates, this also comes at the cost of reduced accuracy. For example, there is a gap of about 18% in top-1 accuracy between a real-valued ResNet-18 and its binary counterpart on ImageNet [28], and a gap of about 9% between a real-valued state-of-the-art network for human pose estimation and its binary counterpart on MPII [3].
Motivated by the above findings, in this work, we focus on improving the training of binary networks by proposing a series of methodological improvements. In particular, we make the following contributions:

We motivate, provide convincing evidence for, and describe a series of methodological changes for training binary neural networks, including (a) more appropriate non-linear activation functions (Subsection 3.2), (b) reverse-order initialization (Subsection 3.3), (c) progressive quantization (Subsection 3.4), and (d) network stacking (Subsection 3.5), which, individually and combined, are shown to significantly improve existing state-of-the-art network binarization techniques. (e) We also show to what extent network binarization and knowledge distillation can be combined (Subsection 3.6).

We show that our improved training of binary networks is task- and network-agnostic by applying it to two diverse tasks: fine-grained recognition (in particular, human pose estimation) and classification (specifically, ImageNet classification).

Exhaustive experiments conducted on the challenging MPII dataset show that our method offers an improvement of more than 4% in absolute terms over the state-of-the-art (Section 4).

On ImageNet, we report a reduction in error rate of 4% over the current state-of-the-art (Section 5).
2 Related work
In this section, we review related prior work including network quantization and knowledge distillation for image classification, and methods for efficient human pose estimation.
2.1 Network Quantization
Network quantization refers to quantizing the weights and/or the features of a neural network. It is considered the method of choice for model compression and efficient model inference, and is a very active topic of research. Seminal work in this area goes back to [8, 23], who introduced techniques for 16- and 8-bit quantization. The method of [46] proposed a technique which allocates different numbers of bits (1, 2 and 6) for the network parameters, activations and gradients, respectively. For more recent work see [38, 40, 45, 35].
The focus of this work is on binarization of both weights and features, which is the extreme case, quantizing to $\{-1, +1\}$ and thus offering the largest possible compression and speed gains. The work of [9] introduced a technique for training a CNN with binary weights. A follow-up work [10] demonstrated how to binarize both parameters and activations. This has the advantage that, during the forward pass, multiplications can be replaced with binary operations. The method of [28] proposes to model the weights with binary numbers multiplied by a scaling factor. Using this simple modification, which does not sacrifice the beneficial properties of binary networks, [28] was the first to report good results on a large-scale dataset (ImageNet [11]).
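As a toy illustration of this scaling-factor idea (our own sketch, not code from [28]), each real-valued filter $W$ is approximated as $\alpha \cdot \mathrm{sign}(W)$, where $\alpha$ is the mean absolute value of the filter's weights:

```python
def binarize_weights(w):
    # alpha is the mean absolute value of the real weights; the binary
    # weights are their signs, in the spirit of the formulation of [28].
    alpha = sum(abs(v) for v in w) / len(w)
    signs = [1.0 if v >= 0 else -1.0 for v in w]
    return alpha, signs

weights = [0.5, -0.25, 0.75, -1.0]
alpha, signs = binarize_weights(weights)
print(alpha, signs)  # 0.625 [1.0, -1.0, 1.0, -1.0]
```

At inference time, only the signs and the single scalar $\alpha$ per filter need to be stored, which is what enables the XNOR-based implementation of the convolution.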
Our method proposes several extensions to [28], including more appropriate activation functions, reverse-order initialization, progressive quantization, and network stacking, which are shown to produce large improvements of more than 4% (in absolute terms) for human pose estimation over the state-of-the-art [3]. We also report similar improvements for large-scale image classification on ImageNet; in particular, we report a reduction in error rate of 4% over the current state-of-the-art [28].
2.2 Knowledge Distillation
Recent works [18] have shown that, at least for real-valued networks, the performance of a smaller network can be improved by “distilling the knowledge” of another one, where “knowledge distillation” refers to transferring knowledge from one CNN (the so-called “teacher”) to another (the so-called “student”). Typically, the teacher is a high-capacity model of great accuracy, while the student is a compact model with far fewer parameters (thus also requiring much less computation). The goal of knowledge distillation is therefore to use the teacher to train a compact student model with similar accuracy to that of the teacher. The term “knowledge” refers to the soft outputs of the teacher. Such soft outputs provide extra supervisory signals of intra-class and inter-class similarities learned by the teacher. Further extensions include transferring from intermediate representations of the teacher network [30] and from attention maps [43]. While most prior work focuses on distilling real-valued neural networks, little to no work has been done on studying the effectiveness of such approaches for binarized neural networks. In this work, we propose to adapt such techniques to binary networks, showing through empirical evidence their positive effect on accuracy.
2.3 Human pose estimation
A large number of works have recently been proposed for both single-person [25, 39, 2, 20, 34, 7, 42, 6] and multi-person [5, 26, 14, 12] human pose estimation. We note that the primary focus of these works is accuracy (especially for the single-person case) rather than efficient inference under low memory and computational constraints, which is the main focus of our work.
Many of the aforementioned methods use the so-called HourGlass (HG) architecture [25] and its variants. While we also use the HG in our work, our focus is to enhance its efficiency while maintaining as much as possible its high accuracy, which makes our work different from all the aforementioned works. To our knowledge, the only papers with similar aims are the works of [3] and [35]. [3] and its extension [4] aim to improve binary neural networks for human pose estimation by introducing a novel residual block. [35] aims to improve quantized neural networks by introducing a new HG architecture. In contrast, in this work, we focus on improving binary networks for human pose estimation by (a) improving the binarization process per se, and (b) combining binarization with knowledge distillation. Our method is more general than the improvements proposed in [3] and [35]; we illustrate this by showing that the proposed method also improves ImageNet classification with binary networks.
3 Method
This section presents the proposed methodological changes for improving the network binarization process. Throughout this section, we validate the performance gains offered by our method on the single-person human pose estimation dataset MPII. We note that we chose human pose estimation for reporting the bulk of our results because the dataset is considerably smaller and training is much faster (compared to ImageNet).
Subsection 3.1 describes the strong baseline used in our work, briefly explaining the binarization process proposed in [28] and [3], while the proposed improvements are described in Subsections 3.2, 3.3, 3.4, 3.5 and 3.6.
3.1 Baseline
All results reported herein are compared against the state-of-the-art method of [3], which we use as a strong baseline to measure the performance improvements introduced by our method. The method of [3] combines the HourGlass (HG) architecture of [25] with a newly proposed residual block that was specifically designed for binary CNNs (see Fig. 1). The network was binarized using the approach described in [28], as follows:
$$I \ast W \approx \left(\mathrm{sign}(I) \circledast \mathrm{sign}(W)\right) \odot K\alpha, \qquad (1)$$

where $I$ is the input tensor, $W$ is the layer's weight tensor, $K$ is a matrix containing the scaling factors for all the sub-tensors of $I$, and $\alpha$ is a scaling factor for the weights. $\circledast$ denotes the binary convolution operation, which can be efficiently implemented with the bitwise XNOR, resulting in a speedup of about 58x and a model compression ratio of about 32x [28]. Note that, in practice, we follow [3, 28] and drop $K$, since this speeds up the network with a negligible performance drop.

3.2 Leaky non-linearities
Previous work [28, 3] has shown that adding a non-linearity after each convolutional layer can increase the performance of binarized CNNs. In the context of real-valued networks, there exists a plethora of works that explore the effect of non-linearities on overall network accuracy; in contrast, there is little to no work available for binary networks. Herein, we rigorously explore the choice of the non-linearity and its impact on overall performance for the task of human pose estimation, showing empirically the negative impact of the previously proposed ReLU. Instead of a ReLU, we propose to use the recently introduced PReLU [15] function, an adaptation of the leaky ReLU with a learnable negative slope, which we find to perform better than both the ReLU and the leaky ReLU.
There are two main arguments justifying our findings. Firstly, the binarization process, via the sign function, restricts the possible states of the filters and features to $\{-1, +1\}$. As such, the representational power of the network resides in these two states, and removing one of them during training by applying a ReLU after each convolutional layer makes the training unstable. See also Fig. 3. Secondly, this instability is further amplified by the fact that the implementation of the sign function is “leaky” at 0, introducing a third unwanted spurious state, and subsequent iterations can cause easy jumps between the two states. See also Fig. 2. Note that, although the Batch Normalization [19] layer mitigates some of these effects by re-centering the input distribution, as the experiments show, in practice the network can achieve significantly better accuracy if the non-linearity allows negative values to pass. On the other hand, non-linearities should still be used to increase the representational power of the network. We conclude that a PReLU can be safely used for this purpose, while also removing the aforementioned instabilities.
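The argument above can be illustrated with a minimal sketch (hypothetical code, not the authors' implementation): on features restricted to $\{-1, +1\}$, a ReLU collapses the negative state to zero, whereas a PReLU keeps a (scaled) negative state:

```python
def relu(x):
    return max(0.0, x)

def prelu(x, a=0.25):
    # PReLU [15]: identity for x >= 0, learnable slope `a` for x < 0
    # (a=0.25 is an assumed initial value for illustration).
    return x if x >= 0 else a * x

binary_features = [-1.0, 1.0, -1.0, 1.0]

# ReLU maps every -1 to 0, discarding one of the two binary states.
print([relu(v) for v in binary_features])   # [0.0, 1.0, 0.0, 1.0]

# PReLU keeps a scaled negative state, so both states stay distinguishable.
print([prelu(v) for v in binary_features])  # [-0.25, 1.0, -0.25, 1.0]
```

In a real network the slope `a` would be learned per layer (or per channel), but the qualitative point is the same: information carried by the $-1$ state survives the non-linearity.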
3.3 Reverse-order initialization
Initialization of neural networks has been the subject of study of many recent works [13, 15, 31], where it was shown that an appropriate initialization is often required for achieving good performance [33]. The same holds for quantized networks, where most prior works either use an adaptation of the above-mentioned initialization strategies, or start from a pretrained real-valued neural network. However, while weight binarization alone can be done with little to no accuracy loss [28], quantizing the features has a much more detrimental effect [28, 46]. In addition, since the output signal of the sign function is very different from the output of a ReLU layer, the transition from a fully real-valued network to a binary one causes a catastrophic loss in accuracy, often comparable with training from scratch.
To alleviate this, we propose the opposite of what is currently considered the standard way to initialize a binary network from a real-valued one: we propose to firstly train a network with real weights and binary features (the features are binarized using the approach presented in Subsection 3.4) and, only after this network is fully trained, to further binarize the weights. By doing so, we effectively split the problem into two subproblems, weight and feature binarization, which we then solve from the hardest to the easiest one. Fig. 4 shows the advantage of the proposed initialization method against standard pretraining.
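The proposed two-stage procedure can be sketched as follows (hypothetical helper; the names and the equal split between the two stages are our assumptions, not values from the paper):

```python
def training_plan(total_epochs, stage1_epochs):
    # Stage 1: real weights, binary features.
    # Stage 2: both weights and features binary, initialized from stage 1.
    plan = []
    for epoch in range(total_epochs):
        plan.append({
            "epoch": epoch,
            "binary_features": True,                 # binary from the start
            "binary_weights": epoch >= stage1_epochs  # stage 2 only
        })
    return plan

for step in training_plan(total_epochs=4, stage1_epochs=2):
    print(step)
```

This is the reverse of the standard recipe, which would binarize the weights first (or everything at once) starting from a fully real-valued pretrained model.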
3.4 Smooth progressive quantization
Previous works have shown that incrementally quantizing the network, either by gradually decreasing the precision or by partitioning and progressively increasing the amount of quantized weights [44], leads to decent performance improvements. While the latter is more practical, it requires careful fine-tuning of the quantization ratio at each step.
Instead, in this work, we follow a different route by proposing to approximate the sign function with a smoother one, in which the approximation error is controlled by a parameter $\beta$. By gradually increasing $\beta$ during training, we achieve a progressive binarization. This allows for a natural and smoother transition, in which the selection of the weights to be binarized occurs implicitly and can be easily controlled by varying $\beta$, without the need to define a fixed schedule for increasing the amount of quantized weights as in [44].
In the following, we present a few options we explored for approximating the sign function, alongside their derivatives (see also Fig. 6):

Sigmoid:
$$\tilde{\mathrm{sign}}(x) = \frac{2}{1 + e^{-\beta x}} - 1, \qquad \frac{\partial\,\tilde{\mathrm{sign}}(x)}{\partial x} = \frac{2\beta e^{-\beta x}}{\left(1 + e^{-\beta x}\right)^2} \qquad (2)$$

SoftSign:
$$\tilde{\mathrm{sign}}(x) = \frac{\beta x}{1 + |\beta x|}, \qquad \frac{\partial\,\tilde{\mathrm{sign}}(x)}{\partial x} = \frac{\beta}{\left(1 + |\beta x|\right)^2} \qquad (3)$$

Tanh:
$$\tilde{\mathrm{sign}}(x) = \tanh(\beta x), \qquad \frac{\partial\,\tilde{\mathrm{sign}}(x)}{\partial x} = \beta\left(1 - \tanh^2(\beta x)\right) \qquad (4)$$
As $\beta \to \infty$, the approximation function $\tilde{\mathrm{sign}}$ converges to the sign function. In a similar fashion, its derivative converges to the Dirac delta function $\delta(x)$. In practice, as most of the features lie outside the region with high approximation error (see Fig. 5), we started observing close-to-binary results for sufficiently large values of $\beta$. See Fig. 7.
In our tests we found that all the above approximation functions behaved similarly; the best performance was obtained using the tanh approximation, while the softsign offered slightly lower performance. As such, the final reported results are obtained using the tanh approximation, progressively increasing the value of $\beta$ during training.
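A minimal sketch of the tanh-based progressive binarization (our own illustration; the doubling schedule for $\beta$ is an assumption, not a value from the paper):

```python
import math

def soft_sign(x, beta):
    # Smooth approximation of sign(x); converges to sign(x) as beta grows.
    return math.tanh(beta * x)

def soft_sign_grad(x, beta):
    # d/dx tanh(beta*x) = beta * (1 - tanh(beta*x)^2); as beta grows,
    # the gradient concentrates around 0, approaching a Dirac delta.
    t = math.tanh(beta * x)
    return beta * (1.0 - t * t)

x = 0.3
for beta in (1.0, 4.0, 16.0, 64.0):
    print(f"beta={beta:5.1f}  soft_sign={soft_sign(x, beta):+.4f}")
# As beta -> infinity, soft_sign(x, beta) -> sign(x) for any x != 0.
```

Because the approximation is differentiable, ordinary backpropagation can be used throughout training, and the transition to hard binarization is controlled entirely by the $\beta$ schedule.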
3.5 Stacked binary networks
As shown in [25], a stack of HG networks can greatly improve human pose estimation accuracy, allowing the network to gradually refine its prediction at each stage. In a similar fashion, in this work we constructed a stack of binary HG networks that also incorporates the improvements introduced in the previous subsections; we aim to verify to what extent stacking can further contribute on top of these improvements. In addition, our method differs from [4] in that all the intermediate layers used to join the stacks are also binarized. As the results from Section 4 show, stacking further improves upon the gains reported in the previous subsections.
3.6 Combining binarization with distillation
Recent work on knowledge distillation has focused on real-valued networks [18], largely ignoring the quantized and, especially, the binarized case.
In this work, and in light of the methods proposed in the previous subsections, we also study the effect and effectiveness of knowledge distillation for the case of binary networks, evaluating in the process the following options: (a) using a real-valued teacher and a binary student, and (b) using a binary teacher and a binary student, with and without feature matching. During training, we used the output heatmaps of the teacher network as soft labels for the Binary Cross-Entropy (BCE) loss. In addition, we found that the best results are obtained by combining the ground truth and the soft labels using a weighted loss.
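A sketch of how the supervised and distillation terms might be combined under a BCE loss (hypothetical code; the weight `w` and the helper names are our assumptions, not values from the paper):

```python
import math

def bce(target, pred, eps=1e-7):
    # Per-value binary cross-entropy; eps avoids log(0).
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def distillation_loss(pred, gt, teacher, w=0.5):
    # Weighted combination of the ground-truth loss and the loss against
    # the teacher's soft output heatmaps.
    n = len(pred)
    supervised = sum(bce(t, p) for t, p in zip(gt, pred)) / n
    distilled = sum(bce(t, p) for t, p in zip(teacher, pred)) / n
    return w * supervised + (1.0 - w) * distilled

pred = [0.8, 0.2, 0.6]       # student's predicted heatmap values
gt = [1.0, 0.0, 1.0]         # ground-truth confidence map
teacher = [0.9, 0.1, 0.7]    # teacher's soft labels
print(round(distillation_loss(pred, gt, teacher), 4))
```

In practice the same combination would be applied per pixel over whole heatmaps; the scalar lists here just keep the arithmetic visible.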
4 Human pose estimation experiments
In this section, we report our results on MPII, one of the most challenging datasets for single person human pose estimation [1]. MPII contains approximately 25,000 images and more than 40,000 persons annotated with up to 16 landmarks and visibility labels. We use the same split for validation and training as in [37] (3,000 for validation and 22,000 for training). We firstly report the performance improvements, using the PCKh metric [1], obtained by applying incrementally the proposed methods in the same order as these methods appear in the paper. We then evaluate the proposed improvements in isolation.
4.1 Results
Baseline:
The accuracy of the baseline binary network of [3], upon which we build, is reported in the first row of Table 1 (76.6% PCKh).
Leaky non-linearities (Section 3.2):
The performance improvement obtained by replacing the ReLU with Leaky ReLU and then PReLU, as proposed in our work, is shown in the 3rd and 4th rows of Table 1. We observe a large improvement of 2.5% in absolute terms, with the highest gains offered by the PReLU function. Note that we obtained similar accuracy between the variant that uses a single scale factor for the negative slope and the one that uses one per channel.
Reverse-order initialization (Section 3.3):
We observe an additional improvement of 0.8% by firstly binarizing the features and then the weights, as proposed in our work and as shown in the 5th row of Table 1. This, alongside the results from Fig. 4, shows that the proposed strategy is an effective way of improving the performance of binary networks.
Progressive binarization (Section 3.4):
We observe an additional improvement of 0.4% by the proposed progressive binarization as shown in the 6th row of Table 1.
Stacked binary networks (Section 3.5):
We observe additional improvements of 1.5% and 1.9% by using 2-stack and 3-stack HG networks, respectively, as shown in the 4th column of Table 2. While significant improvements can be observed when going from 1 HG to a stack of 2, the gain in performance diminishes when one more binary HG is added to the network. A similar phenomenon is also observed (though to a lesser extent) for real-valued networks.
Binarization plus distillation (Section 3.6):
As shown in the 7th row of Table 1, we obtained an improvement of 0.6% by combining binarization and distillation for a binary network with a single HG, distilled using a high-performing real-valued teacher. Note that the binary network already incorporates the improvements proposed in Section 3. The last column of Table 2 shows the improvements obtained by combining binarization with distillation for multi-stack binary HG networks; we observe additional improvements of 1.5% and 1.9% when using 2-stack and 3-stack HG networks, respectively.
We experimented with both a binary and a real-valued “teacher”. Given that finding a high-performing binary teacher is challenging in its own right, we obtained the best results using a real-valued one; however, in both cases the network converged to a satisfactory solution.
Table 1: Incremental effect of the proposed improvements on MPII.

Method   | Non-linearity | Rev. order init. | Progr. quant. | Distillation | PCKh
---------|---------------|------------------|---------------|--------------|------
[3]      | ✗             | ✗                | ✗             | ✗            | 76.6%
[3]      | ReLU          | ✗                | ✗             | ✗            | 76.3%
Ours     | LReLU         | ✗                | ✗             | ✗            | 78.1%
Ours     | PReLU         | ✗                | ✗             | ✗            | 79.1%
Ours     | PReLU         | ✓                | ✗             | ✗            | 79.9%
Ours     | PReLU         | ✓                | ✓             | ✗            | 80.3%
Ours     | PReLU         | ✓                | ✓             | ✓            | 80.9%
[3] Real | –             | –                | –             | –            | 85.6%
Table 2: Effect of stacking and distillation on MPII.

#stacks | #params | [4]   | Ours w/o distil. | Ours w. distil.
--------|---------|-------|------------------|----------------
1       | 6.2M    | 76.6% | 80.3%            | 80.9%
2       | 11.0M   | 79.9% | 81.8%            | 82.3%
3       | 17.8M   | 81.3% | 82.2%            | 82.7%
While the above results illustrate the accuracy gains obtained by incrementally applying our improvements, it is also important to evaluate the performance gains of each proposed improvement in isolation. As the results from Table 3 show, the proposed techniques also yield large improvements when applied independently. Moreover, when evaluated in isolation, each proposed modification offers a noticeably higher performance increase compared with the case where they are gradually added together (e.g. 1.9% vs 0.8% for reverse-order initialization).
Table 3: Effect of each proposed improvement applied in isolation on MPII.

Method | Non-linearity | Rev. order init. | Progr. quant. | Distillation | PCKh
-------|---------------|------------------|---------------|--------------|------
[3]    | ✗             | ✗                | ✗             | ✗            | 76.6%
[3]    | ReLU          | ✗                | ✗             | ✗            | 76.3%
Ours   | LReLU         | ✗                | ✗             | ✗            | 78.1%
Ours   | PReLU         | ✗                | ✗             | ✗            | 79.1%
Ours   | ✗             | ✓                | ✗             | ✗            | 78.5%
Ours   | ✗             | ✗                | ✓             | ✗            | 78.0%
Ours   | ✗             | ✗                | ✗             | ✓            | 77.6%
4.2 Training
We trained all models for human pose estimation (both real-valued and binary) following the same procedure: they were trained for 120 epochs, using a learning rate that was dropped every 40 epochs by a factor of 10. For data augmentation, we applied random flipping, scale jittering and rotation. Instead of using the MSE loss, we followed the findings of [3] and used the BCE loss, defined as:

$$\mathcal{L} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i,j} \left[ p^{n}_{ij} \log \hat{p}^{n}_{ij} + \left(1 - p^{n}_{ij}\right) \log\left(1 - \hat{p}^{n}_{ij}\right) \right], \qquad (5)$$

where $p^{n}_{ij}$ denotes the ground truth confidence map of the $n$-th part at pixel location $(i, j)$ and $\hat{p}^{n}_{ij}$ is the corresponding predicted output at the same location. For distillation, we simply applied a BCE loss using the predictions of the teacher network as ground truth.
5 ImageNet classification experiments
To emphasize the generalization properties of the proposed improvements, (a) leaky non-linearities (Subsection 3.2), (b) reverse-order initialization (Subsection 3.3), (c) smooth progressive quantization (Subsection 3.4), and (d) knowledge distillation (Subsection 3.6), in this section we show that they are largely task-, architecture- and block-independent, by applying them to both a more traditional architecture (i.e. AlexNet [22]) and a residual-based one (ResNet-18 [16]) for the task of ImageNet [11] classification.
AlexNet:
Similarly to [28, 10], we removed the local normalization layers while preserving the same structure, namely (from input to output): conv(96), conv(256), conv(384), conv(384), conv(256), fc(4096), fc(4096), fc(1000), applying max-pooling after the 1st, 2nd and 5th convolutional layers. Similarly to [28], the first and last layers were kept real-valued.

ResNet:
We used the standard ResNet-18 architecture [16] and, as in [28], kept the first and last layers real-valued.
Results:
As Table 4 shows, when compared against the state-of-the-art methods of [28] and [10], our approach offers a large improvement of up to 4% in absolute terms for both Top-1 and Top-5 accuracy, using both the AlexNet and ResNet-18 architectures. This further validates the generality of our method.
Table 4: Classification accuracy (%) on ImageNet.

Method        | AlexNet Top-1 | AlexNet Top-5 | ResNet-18 Top-1 | ResNet-18 Top-5
--------------|---------------|---------------|-----------------|----------------
BNN [10]      | 41.8%         | 67.1%         | 42.2%           | 69.2%
XNOR-Net [28] | 44.2%         | 69.2%         | 51.2%           | 73.2%
Ours          | 48.6%         | 72.8%         | 53.7%           | 76.8%
Real-valued [22] | 56.6%      | 80.2%         | 69.3%           | 89.2%
Training:
We trained the binarized versions of AlexNet [22] and ResNet-18 [16] using Adam [21], starting with a learning rate that is gradually decreased by a factor of 10 every 25 epochs. We augment the data by firstly resizing the images to have 256px over the smallest dimension and then randomly cropping them to 227x227px for AlexNet and 224x224px for ResNet. We believe that further performance gains can be achieved with more aggressive augmentation. At test time, instead of random cropping, we center-crop the images. To alleviate problems introduced by the binarization process, and similarly to [28], we trained the networks using a large batch size, specifically 400 for AlexNet and 256 for ResNet-18. All models were trained for 80 epochs. Fig. 8 shows the top-1 and top-5 accuracy across training epochs for AlexNet (the network was initialized using the procedure proposed in Subsection 3.3).
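The test-time preprocessing described above can be sketched as follows (hypothetical helper, assuming the shortest side has already been resized to 256px and using a 224px crop as for ResNet):

```python
def center_crop_box(width, height, crop):
    # Returns the (left, top, right, bottom) box of a centered square crop,
    # e.g. for use with PIL's Image.crop.
    left = (width - crop) // 2
    top = (height - crop) // 2
    return (left, top, left + crop, top + crop)

# A 256x341 image (shortest side resized to 256), center-cropped to 224px.
print(center_crop_box(256, 341, 224))  # (16, 58, 240, 282)
```

At training time the crop position would instead be sampled uniformly at random, which is the only difference between the train- and test-time pipelines here.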
6 Conclusions
In this work, we proposed a series of novel techniques for highly efficient binarized convolutional neural networks. We experimentally validated our results on the challenging problems of human pose estimation and large-scale image classification. Specifically, we proposed (a) more appropriate non-linear activation functions, (b) reverse-order initialization, (c) progressive feature quantization, and (d) network stacking, which improve existing state-of-the-art network binarization techniques. Furthermore, we explored the effect and efficiency of knowledge distillation in the context of binary networks, using a real-valued “teacher” and a binary “student”.
Overall, our results show that a performance improvement of up to 5% in absolute terms is obtained on the challenging human pose estimation dataset MPII. Finally, we showed that our approach is architecture- and task-agnostic and can increase the performance of arbitrary networks. In particular, by applying the proposed techniques to ImageNet classification, we report an absolute performance improvement of 4% over the current state-of-the-art using both AlexNet and ResNet architectures.
References
 [1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
 [2] A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In ECCV, 2016.
 [3] A. Bulat and G. Tzimiropoulos. Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In ICCV, 2017.
 [4] A. Bulat and Y. Tzimiropoulos. Hierarchical binary cnns for landmark localization with limited resources. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 [5] Z. Cao, T. Simon, S.E. Wei, and Y. Sheikh. Realtime multiperson 2d pose estimation using part affinity fields. In CVPR, 2017.
 [6] Y. Chen, C. Shen, X.S. Wei, L. Liu, and J. Yang. Adversarial posenet: A structureaware convolutional network for human pose estimation. CoRR, abs/1705.00389, 2017.
 [7] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang. Multicontext attention for human pose estimation. arXiv preprint arXiv:1702.07432, 2017.
 [8] M. Courbariaux, Y. Bengio, and J.P. David. Training deep neural networks with low precision multiplications. arXiv, 2014.
 [9] M. Courbariaux, Y. Bengio, and J.P. David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015.
 [10] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv, 2016.
 [11] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. Imagenet: A largescale hierarchical image database. In CVPR, 2009.

 [12] R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, and D. Tran. Detect-and-track: Efficient pose estimation in videos. In CVPR, 2018.
 [13] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
 [14] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
 [15] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
 [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [17] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
 [18] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv, 2015.
 [20] L. Ke, M.C. Chang, H. Qi, and S. Lyu. Multiscale structureaware network for human pose estimation. arXiv preprint arXiv:1803.09894, 2018.
 [21] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 [23] D. D. Lin, S. S. Talathi, and V. S. Annapureddy. Fixed point quantization of deep convolutional networks. arXiv, 2015.
 [24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
 [25] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
 [26] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multiperson pose estimation in the wild. In CVPR, 2017.
 [27] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
 [28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.
 [29] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
 [30] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
 [31] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
 [32] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. arXiv, 2014.

 [33] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
 [34] W. Tang, P. Yu, and Y. Wu. Deeply learned compositional models for human pose estimation. In ECCV, 2018.
 [35] Z. Tang, X. Peng, S. Geng, L. Wu, S. Zhang, and D. Metaxas. Quantized densely connected unets for efficient landmark localization. In ECCV, 2018.
 [36] T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.
 [37] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.
 [38] F. Tung and G. Mori. Clipq: Deep network compression learning by inparallel pruningquantization. In CVPR, 2018.
 [39] S.E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
 [40] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized convolutional neural networks for mobile devices. In CVPR, 2016.
 [41] B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. In ECCV, 2018.
 [42] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang. Learning feature pyramids for human pose estimation. In ICCV, 2017.
 [43] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
 [44] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental network quantization: Towards lossless cnns with lowprecision weights. arXiv preprint arXiv:1702.03044, 2017.
 [45] A. Zhou, A. Yao, K. Wang, and Y. Chen. Explicit losserroraware quantization for lowbit deep neural networks. In CVPR, 2018.
 [46] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. Dorefanet: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016.