Perhaps the most promising method for model compression and efficient model inference is network binarization, especially when both activations and weights are binary [9, 10, 28]. In this case, the binary convolution operation can be implemented efficiently with the bitwise XNOR, resulting in large speed-ups on CPU (and even higher ones on FPGAs) and high model compression ratios. Although no other technique can achieve such impressive speed-ups and compression rates, this comes at the cost of reduced accuracy. For example, there is a notable gap in top-1 accuracy between a real-valued ResNet-18 and its binary counterpart on ImageNet, and between a real-valued state-of-the-art network for human pose estimation and its binary counterpart on MPII.
Motivated by the above findings, in this work, we focus on improving the training of binary networks by proposing a series of methodological improvements. In particular, we make the following contributions:
We motivate, provide convincing evidence for, and describe a series of methodological changes for training binary neural networks, including (a) more appropriate non-linear activation functions (Sub-section 3.2), (b) reverse-order initialization (Sub-section 3.3), (c) progressive quantization (Sub-section 3.4), and (d) network stacking (Sub-section 3.5), which, individually and combined, are shown to significantly improve existing state-of-the-art network binarization techniques. (e) We also show to what extent network binarization and knowledge distillation can be combined (Sub-section 3.6).
We show that our improved training of binary networks is task- and network-agnostic by applying it to two diverse tasks: fine-grained recognition, in particular human pose estimation, and classification, specifically ImageNet classification.
Exhaustive experiments conducted on the challenging MPII dataset show that our method offers an improvement of more than 4% in absolute terms over the state-of-the-art (Section 4).
On ImageNet we report a reduction of error rate by 4% over the current state-of-the-art (Section 5).
2 Related work
In this section, we review related prior work including network quantization and knowledge distillation for image classification, and methods for efficient human pose estimation.
2.1 Network Quantization
Network quantization refers to quantizing the weights and/or the features of a neural network. It is considered the method of choice for model compression and efficient model inference, and a very active topic of research. Seminal work in this area goes back to [8, 23], which introduced techniques for 16- and 8-bit quantization. The method of  allocates different numbers of bits (1-2-6) to the network parameters, activations and gradients. For more recent work, see [38, 40, 45, 35].
The focus of the first methods proposed in this work is on binarization of both weights and features, which is the extreme, 1-bit case of quantization, thus offering the largest possible compression and speed gains. The work of  introduced a technique for training a CNN with binary weights. A follow-up work  demonstrated how to binarize both parameters and activations. This has the advantage that, during the forward pass, multiplications can be replaced with binary operations. The method of  proposes to model the weights with binary numbers multiplied by a scaling factor. Using this simple modification, which does not sacrifice the beneficial properties of binary networks,  was the first to report good results on a large-scale dataset (ImageNet).
Our method proposes several extensions to , including more appropriate activation functions, reverse-order initialization, progressive quantization, and network stacking, which are shown to produce large improvements of more than 4% (in absolute terms) for human pose estimation over the state-of-the-art. We also report similar improvements for large-scale image classification on ImageNet; in particular, we report a reduction in error rate of 4% over the current state-of-the-art.
2.2 Knowledge Distillation
Recent works  have shown that, at least for real-valued networks, the performance of a smaller network can be improved by "distilling the knowledge" of another one, where "knowledge distillation" refers to transferring knowledge from one CNN (the so-called "teacher") to another (the so-called "student"). Typically, the teacher is a high-capacity model of great accuracy, while the student is a compact model with far fewer parameters (thus also requiring much less computation). The goal of knowledge distillation is therefore to use the teacher to train a compact student model with similar accuracy to that of the teacher. The term "knowledge" refers to the soft outputs of the teacher, which provide extra supervisory signals encoding the intra-class and inter-class similarities learned by the teacher. Further extensions include transferring from intermediate representations of the teacher network  and from attention maps . While most prior work focuses on distilling real-valued neural networks, little to no work has been done on studying the effectiveness of such approaches for binarized neural networks. In this work, we propose to adapt such techniques to binary networks, showing through empirical evidence their positive effect on accuracy.
2.3 Human pose estimation
A large number of works have been recently proposed for both single-person [25, 39, 2, 20, 34, 7, 42, 6] and multi-person [5, 26, 14, 12] human pose estimation. We note that the primary focus of these works is accuracy (especially for the single-person case) rather than efficient inference under low memory and computational constraints which is the main focus of our work.
Many of the aforementioned methods use the so-called HourGlass (HG) architecture  and its variants. While we also use the HG in our work, our focus is on enhancing its efficiency while maintaining as much as possible its high accuracy, which makes our work different from all the aforementioned works. To our knowledge, the only papers with similar aims are the works of  and .  and its extension  aim to improve binary neural networks for human pose estimation by introducing a novel residual block.  aims to improve quantized neural networks by introducing a new HG architecture. In contrast, in this work, we focus on improving binary networks for human pose estimation by (a) improving the binarization process per se, and (b) combining binarization with knowledge distillation. Our method is more general than the improvements proposed in  and . We illustrate this by showing that the proposed method also improves ImageNet classification with binary networks.
3 Method
This section presents the proposed methodological changes for improving the network binarization process. Throughout this section, we validate the performance gains offered by our method on MPII, a single-person human pose estimation dataset. We note that we chose human pose estimation to report the bulk of our results because the dataset is considerably smaller and training is much faster than on ImageNet.
Sub-section 3.1 describes the strong baseline used in our work, briefly explaining the binarization process proposed in  and , while the proposed improvements are described in Sub-sections 3.2, 3.3, 3.4, 3.5 and 3.6.
All results reported herein are against the state-of-the-art method of , which we used as a strong baseline against which to measure the performance improvements introduced by our method. The method of  combines the HourGlass (HG) architecture of  with a newly proposed residual block that was specifically designed for binary CNNs (see Fig. 1). The network was binarized using the approach described in  as follows:
I ∗ W ≈ (sign(I) ⊛ sign(W)) ⊙ K α,

where I is the input tensor, W is the layer's weight tensor, K is a matrix containing the scaling factors for all the sub-tensors of I, and α is a scaling factor for the weights. ⊛ denotes the binary convolution operation, which can be efficiently implemented with the bitwise XNOR, resulting in large speed-ups and compression. Note that in practice, we follow [3, 28] and drop the input scaling matrix K, since this speeds up the network at a negligible performance drop.
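To make the XNOR arithmetic concrete, the following is a minimal, illustrative sketch in plain Python (the function names are ours, not from any released code): encoding −1 as bit 0 and +1 as bit 1, the dot product of two binary vectors equals n − 2·popcount(x XOR w), so all multiplications disappear.

```python
def binarize(w):
    """Return sign(w) in {-1, +1} and the scaling factor alpha = mean(|w|),
    as in the XNOR-Net-style binarization described above."""
    alpha = sum(abs(v) for v in w) / len(w)
    return [1.0 if v >= 0 else -1.0 for v in w], alpha

def binary_dot(xb, wb):
    """Dot product of two {-1, +1} vectors via XOR (the complement of XNOR)
    plus a popcount: with -1 encoded as 0 and +1 as 1, the result equals
    n - 2 * popcount(x XOR w)."""
    n = len(xb)
    hamming = sum(1 for a, b in zip(xb, wb) if (a > 0) != (b > 0))
    return n - 2 * hamming

x = [0.7, -1.2, 0.3, -0.4]
w = [-0.5, 0.8, 0.1, 0.9]
xb, _ = binarize(x)
wb, alpha = binarize(w)
# The bitwise version agrees with the ordinary dot product of the signs;
# the real-valued product is then approximated by alpha * binary_dot(xb, wb).
assert binary_dot(xb, wb) == sum(a * b for a, b in zip(xb, wb))
```

In a real implementation the bits are packed into machine words, so a single XNOR plus popcount instruction processes 64 multiply-accumulates at once, which is the source of the speed-up discussed above.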
3.2 Leaky non-linearities
Previous work [28, 3] has shown that adding a non-linearity after each convolutional layer increases the performance of binarized CNNs. In the context of real-valued networks, there exists a plethora of works exploring the effect of non-linearities on overall network accuracy; in contrast, there is little to no work available for binary networks. Herein, we rigorously explore the choice of non-linearity and its impact on overall performance for the task of human pose estimation, showing empirically the negative impact of the previously proposed ReLU. Instead of a ReLU, we propose to use the recently introduced PReLU  function, an adaptation of the leaky ReLU with a learnable negative slope, which we find performs better than both the ReLU and the leaky ReLU.
There are two main arguments justifying our findings. Firstly, the sign function used for binarization restricts the possible states of the filters and features to two values, −1 and +1. As such, the representational power of the network resides in these two states, and removing one of them during training by placing a ReLU after each convolutional layer makes the training unstable. See also Fig. 3. Secondly, this instability is further amplified by the fact that the implementation of the sign function is "leaky" at 0, introducing a third, unwanted spurious state, and subsequent iterations can cause easy jumps between the two states. See also Fig. 2. Note that, although the Batch Normalisation  layer mitigates some of these effects by re-centering the input distribution, our experiments show that, in practice, the network achieves significantly better accuracy if the non-linearity allows negative values to pass. On the other hand, non-linearities are needed to increase the representational power of the network. We conclude that a PReLU can be safely used for this purpose while also removing the aforementioned instabilities.
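As a minimal illustration of the argument above (pure Python, names ours): a ReLU collapses the −1 state of a binary feature map to 0, whereas a PReLU with a learnable negative slope merely rescales it, preserving the sign information.

```python
def prelu(x, a):
    """PReLU: identity for positive inputs, learnable slope `a` otherwise.
    ReLU is the special case a = 0; leaky ReLU fixes `a` to a small constant."""
    return [v if v > 0 else a * v for v in x]

binary_features = [-1.0, 1.0, -1.0]

# ReLU (a = 0) destroys the -1 state entirely...
assert prelu(binary_features, 0.0) == [0.0, 1.0, 0.0]
# ...while PReLU (a = 0.25 here is an illustrative slope) preserves its sign.
assert prelu(binary_features, 0.25) == [-0.25, 1.0, -0.25]
```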
3.3 Reverse-order initialization
Initialization of neural networks has been the subject of many recent works [13, 15, 31], where it was shown that an appropriate initialization is often required for achieving good performance. The same holds for quantized networks, where most prior work either uses an adaptation of the above-mentioned initialization strategies or starts from a pretrained real-valued network. However, while weight binarization alone can be done with little to no accuracy loss, quantizing the features has a much higher detrimental effect [28, 46]. In addition, since the output of the sign function is very different from the output of a ReLU layer, the transition from a fully real-valued network to a binary one causes a catastrophic loss in accuracy, often comparable with training from scratch.
To alleviate this, we propose the opposite of what is currently considered the standard way to initialize a binary network from a real-valued one: we first train a network with real weights and binary features (the features are binarized using the approach presented in Sub-section 3.4), and only then binarize the weights as well. By doing so, we effectively split the problem into two sub-problems, feature and weight binarization, which we then solve in order from the hardest to the easiest. Fig. 4 shows the advantage of the proposed initialization method over standard pre-training.
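The two-stage schedule can be sketched as follows (a minimal, illustrative sketch; `train_epoch` and the flag names are placeholders, not the paper's code):

```python
def reverse_order_training(train_epoch, stage1_epochs, stage2_epochs):
    """Reverse-order initialization: first binarize only the features (the
    harder sub-problem), then binarize the weights too, continuing from the
    stage-1 checkpoint."""
    log = []
    # Stage 1: real-valued weights, binary features.
    for epoch in range(stage1_epochs):
        log.append(train_epoch(epoch, binary_weights=False, binary_features=True))
    # Stage 2: continue training with both weights and features binarized.
    for epoch in range(stage2_epochs):
        log.append(train_epoch(epoch, binary_weights=True, binary_features=True))
    return log
```

Compare with the standard recipe, which pre-trains a fully real-valued network and then flips both flags at once, incurring the catastrophic accuracy drop described above.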
3.4 Smooth progressive quantization
Previous works have shown that incrementally quantizing the network, either by gradually decreasing the precision or by partitioning the weights and progressively increasing the quantized fraction, leads to decent performance improvements. While the latter is more practical, it requires careful fine-tuning of the quantization ratio at each step.
Instead, in this work, we follow a different route, proposing to approximate the quantization function with a smoother one whose estimation error is controlled by a sharpness parameter. By gradually increasing this parameter during training, we achieve a progressive binarization. This allows a natural and smoother transition in which the selection of the weights to be binarized occurs implicitly and can be easily controlled by varying a single parameter, without the need to define a fixed schedule for increasing the amount of quantized weights as in .
In the following, we present a few options we explored to approximate the function alongside their derivatives (see also Fig. 6):
As the sharpness parameter grows, the approximation converges to the sign function; in a similar fashion, its derivative converges to the Dirac delta function. In practice, as most of the features lie outside the region with high approximation error (see Fig. 5), we observed close-to-binary results already at moderate sharpness values. See Fig. 7.
In our tests we found that all the above approximation functions behaved similarly; one of them performed best, with the softsign offering slightly lower performance, and the final reported results use the best-performing choice. During training, we progressively increased the value of the sharpness parameter.
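As one illustrative instance (our choice of tanh here is for illustration only; the best-performing function in our experiments is among several similar saturating approximations), tanh(βx) converges pointwise to sign(x) as the sharpness β grows:

```python
import math

def soft_sign(x, beta):
    """Smooth approximation of sign(x); converges to sign(x) as beta grows,
    while its derivative concentrates toward the Dirac delta at 0."""
    return math.tanh(beta * x)

# Maximum approximation error over a grid excluding 0 (where sign is undefined).
xs = [i / 50 for i in range(-50, 51) if i != 0]

def max_err(beta):
    return max(abs(soft_sign(x, beta) - (1.0 if x > 0 else -1.0)) for x in xs)

errors = [max_err(beta) for beta in (1, 5, 25, 100)]
# Increasing beta drives the approximation toward an exact binarization.
assert all(a > b for a, b in zip(errors, errors[1:]))
assert errors[-1] < 0.05
```

Because the approximation stays differentiable at every step, ordinary backpropagation can be used throughout training, and the schedule on β replaces the hand-tuned quantization ratios of partition-based schemes.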
3.5 Stacked binary networks
As shown in , a stack of HG networks can greatly improve human pose estimation accuracy by allowing the network to gradually refine its prediction at each stage. In a similar fashion, in this work we constructed a stack of binary HG networks incorporating the improvements introduced in the previous subsections, in order to verify to what extent stacking can further contribute on top of them. In addition, our method differs from  in that all the intermediate layers used to join the stacks are also binarized. As the results of Section 4 show, stacking improves further upon the gains reported in the previous subsections.
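The refinement loop of a stacked network can be sketched as follows (a minimal, illustrative sketch in plain Python; `hg`, `head` and `merge` stand in for a binary HG module, its prediction head, and the binarized intermediate layers joining consecutive stacks):

```python
def stacked_forward(x, stages):
    """Run a stack of hourglass stages; each stage predicts heatmaps and
    passes merged features to the next, so later stages refine earlier
    predictions. All intermediate predictions are returned so each stage
    can receive its own supervision signal."""
    preds = []
    for hg, head, merge in stages:
        feat = hg(x)           # hourglass features at this stage
        p = head(feat)         # intermediate heatmap prediction
        preds.append(p)
        x = merge(x, feat, p)  # join the stacks (binarized layers in our case)
    return preds

# Toy numeric stand-ins just to exercise the control flow.
stage = (lambda x: 2 * x, lambda f: f + 1, lambda x, f, p: x + 1)
assert stacked_forward(0, [stage, stage]) == [1, 3]
```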
3.6 Combining binarization with distillation
Recent work on knowledge distillation has focused on real-valued networks , largely ignoring the quantized, and especially, the binarized case.
In this work, and in light of the methods proposed in the previous sub-sections, we also study the effectiveness of knowledge distillation for the case of binary networks, evaluating the following options: (a) a real-valued teacher with a binary student, and (b) a binary teacher with a binary student, with and without feature matching. During training, we used the output heatmaps of the teacher network as soft labels for the Binary Cross Entropy loss. In addition, we found that the best results are obtained by combining the ground truth and the soft labels with a fixed weighting.
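The combined objective can be sketched as follows (pure Python, names ours; the mixing weight `lam = 0.5` is an illustrative value, not the weight used in our experiments):

```python
import math

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted and target heatmap values."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1 - eps)  # clamp for numerical stability
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(pred)

def distillation_loss(pred, gt, teacher, lam=0.5):
    """BCE against the ground truth combined with BCE against the teacher's
    soft heatmaps, mixed with weight `lam`."""
    return lam * bce(pred, gt) + (1 - lam) * bce(pred, teacher)

pred    = [0.1, 0.9, 0.2]
gt      = [0.0, 1.0, 0.0]   # hard ground-truth heatmap values
teacher = [0.2, 0.8, 0.1]   # soft labels from the teacher network

# lam = 1 recovers plain supervised training; lam = 0 is pure distillation.
assert distillation_loss(pred, gt, teacher, lam=1.0) == bce(pred, gt)
assert distillation_loss(pred, gt, teacher, lam=0.0) == bce(pred, teacher)
```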
4 Human pose estimation experiments
In this section, we report our results on MPII, one of the most challenging datasets for single person human pose estimation . MPII contains approximately 25,000 images and more than 40,000 persons annotated with up to 16 landmarks and visibility labels. We use the same split for validation and training as in  (3,000 for validation and 22,000 for training). We firstly report the performance improvements, using the PCKh metric , obtained by applying incrementally the proposed methods in the same order as these methods appear in the paper. We then evaluate the proposed improvements in isolation.
Leaky non-linearities (Section 3.2):
The performance improvement obtained by replacing the ReLU first with the Leaky ReLU and then with the PReLU, as proposed in our work, is shown in the 3rd and 4th rows of Table 1. We observe a large improvement of 2.5% in absolute terms, with the highest gains offered by the PReLU. Note that we obtained similar accuracy between the variant that uses a single scaling factor for the negative slope and the one that uses one per channel.
Reverse-order initialization (Section 3.3):
We observe an additional improvement of 0.8% by firstly binarizing the features and then the weights, as proposed in our work and as shown in the 5th row of Table 1. This, alongside the results of Fig. 4, shows that the proposed strategy is an effective way of improving the performance of binary networks.
Progressive binarization (Section 3.4):
We observe an additional improvement of 0.4% from the proposed progressive binarization, as shown in the 6th row of Table 1.
Stacked binary networks (Section 3.5):
We observe an additional improvement of 1.5% and 1.9% by using 2-stack and 3-stack HG networks, respectively, as shown in the 4th column of Table 2. While significant improvements can be observed when going from 1 HG to a stack of 2, the gain diminishes when one more binary HG is added. A similar phenomenon is also observed, though to a lesser extent, for real-valued networks.
Binarization plus distillation (Section 3.6):
As shown in the last row of Table 1, we obtained an improvement of 0.6% by combining binarization and distillation for a binary network with a single HG, distilled using a high-performing real-valued teacher. Note that this binary network already incorporates the improvements proposed in Section 3. The last column of Table 2 shows the gains obtained by combining binarization with distillation for multi-stack binary HG networks: an additional improvement of 1.5% and 1.9% for 2-stack and 3-stack HG networks, respectively.
While we experimented with both a binary and a real-valued "teacher", we obtained the best results using a real-valued one, given that finding a high-performing binary teacher is challenging in its own right. In both cases, however, the network converged to a satisfactory solution.
[Table 2 columns: #stacks | #params | Ours w/o distillation | Ours with distillation]
While the above results illustrate the accuracy gains obtained by incrementally applying our improvements, it is also important to evaluate the performance gain of each proposed improvement in isolation. As the results of Table 3 show, the proposed techniques also yield large improvements when applied independently. In fact, when evaluated in isolation, a given modification can offer a noticeably higher performance increase than when gradually added on top of the others (1.9% in isolation vs. 0.8% incrementally for reverse-order initialization).
We trained all models for human pose estimation (both real-valued and binary) following the same procedure: they were trained for 120 epochs, using a learning rate that was dropped every 40 epochs by a factor of 10. For data augmentation, we applied random flipping, scale jittering and rotation. Instead of the MSE loss, we followed the findings of  and used the BCE loss, defined as

  ℓ = −(1/N) Σ_{n=1}^{N} Σ_{i,j} [ M_n(i,j) log M̂_n(i,j) + (1 − M_n(i,j)) log(1 − M̂_n(i,j)) ],

where M_n(i,j) denotes the ground truth confidence map of the n-th part at pixel location (i,j), and M̂_n(i,j) is the corresponding predicted output at the same location. For distillation, we simply applied a BCE loss using the predictions of the teacher network as ground truth.
5 ImageNet classification experiments
To emphasize the generality of the proposed improvements, namely (a) leaky non-linearities (Sub-section 3.2), (b) reverse-order initialization (Sub-section 3.3), (c) smooth progressive quantization (Sub-section 3.4), and (d) knowledge distillation (Sub-section 3.6), in this section we show that they are largely task-, architecture- and block-independent by applying them to both a more traditional architecture (AlexNet) and a residual one (ResNet-18) for the task of ImageNet classification.
As Table 4 shows, when compared against the state-of-the-art methods of  and , our approach offers a large improvement of up to 4% in absolute terms on both the Top-1 and Top-5 error metrics, for both the AlexNet and ResNet-18 architectures. This further validates the generality of our method.
[Table 4 (classification accuracy, %). Real-valued baselines: AlexNet 56.6 Top-1 / 80.2 Top-5; ResNet-18 69.3 Top-1 / 89.2 Top-5.]
We trained the binarized versions of AlexNet  and ResNet-18  using Adam , starting with a learning rate that is gradually decreased every 25 epochs by a factor of 10. We augment the data by firstly resizing the images to 256px over the smallest dimension and then randomly cropping them to the respective network's input size. We believe that further performance gains could be achieved with more aggressive augmentation. At test time, we center-crop the images instead of random-cropping. To alleviate problems introduced by the binarization process, and similarly to , we trained the networks using a large batch size: 400 for AlexNet and 256 for ResNet-18. All models were trained for 80 epochs. Fig. 8 shows the Top-1 and Top-5 accuracy across training epochs for AlexNet (the network was initialized using the procedure proposed in Sub-section 3.3).
6 Conclusion
In this work, we proposed a series of novel techniques for training highly efficient binarized convolutional neural networks, and experimentally validated our results on the challenging problems of human pose estimation and large-scale image classification. In particular, we proposed (a) more appropriate non-linear activation functions, (b) reverse-order initialization, (c) progressive feature quantization, and (d) network stacking, which improve existing state-of-the-art network binarization techniques. Furthermore, we explored the effect and efficiency of knowledge distillation in the context of binary networks, using a real-valued "teacher" and a binary "student".
Overall, our results show that a performance improvement of up to 5% in absolute terms is obtained on the challenging MPII human pose estimation dataset. Finally, we showed that our approach is architecture- and task-agnostic and can increase the performance of arbitrary binary networks. In particular, by applying the proposed techniques to ImageNet classification, we report an absolute performance improvement of 4% over the current state-of-the-art using both AlexNet and ResNet architectures.
-  M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
-  A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In ECCV, 2016.
-  A. Bulat and G. Tzimiropoulos. Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In ICCV, 2017.
-  A. Bulat and G. Tzimiropoulos. Hierarchical binary cnns for landmark localization with limited resources. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
-  Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
-  Y. Chen, C. Shen, X.-S. Wei, L. Liu, and J. Yang. Adversarial posenet: A structure-aware convolutional network for human pose estimation. CoRR, abs/1705.00389, 2017.
-  X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang. Multi-context attention for human pose estimation. arXiv preprint arXiv:1702.07432, 2017.
-  M. Courbariaux, Y. Bengio, and J.-P. David. Training deep neural networks with low precision multiplications. arXiv, 2014.
-  M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015.
-  M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to+ 1 or-1. arXiv, 2016.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
-  R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, and D. Tran. Detect-and-track: Efficient pose estimation in videos. In CVPR, 2018.
-  X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
-  G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv, 2015.
-  L. Ke, M.-C. Chang, H. Qi, and S. Lyu. Multi-scale structure-aware network for human pose estimation. arXiv preprint arXiv:1803.09894, 2018.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
-  D. D. Lin, S. S. Talathi, and V. S. Annapureddy. Fixed point quantization of deep convolutional networks. arXiv, 2015.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
-  A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
-  G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, 2017.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
-  M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
-  A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
-  A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv, 2014.
-  I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
-  W. Tang, P. Yu, and Y. Wu. Deeply learned compositional models for human pose estimation. In ECCV, 2018.
-  Z. Tang, X. Peng, S. Geng, L. Wu, S. Zhang, and D. Metaxas. Quantized densely connected u-nets for efficient landmark localization. In ECCV, 2018.
-  T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2), 2012.
-  J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.
-  F. Tung and G. Mori. Clip-q: Deep network compression learning by in-parallel pruning-quantization. In CVPR, 2018.
-  S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
-  J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized convolutional neural networks for mobile devices. In CVPR, 2016.
-  B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. In ECCV, 2018.
-  W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang. Learning feature pyramids for human pose estimation. In ICCV, 2017.
-  S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
-  A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044, 2017.
-  A. Zhou, A. Yao, K. Wang, and Y. Chen. Explicit loss-error-aware quantization for low-bit deep neural networks. In CVPR, 2018.
-  S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016.