EfficientPose: Scalable single-person pose estimation

by Daniel Groos et al.

Human pose estimation facilitates markerless movement analysis in sports, as well as in clinical applications. Still, state-of-the-art models for human pose estimation generally do not meet the requirements of real-life deployment. The main reason for this is that, as the field progresses, the approaches become ever more expensive, with high computational demands. To cope with the challenges caused by this trend, we propose a convolutional neural network architecture that benefits from the recently proposed EfficientNets to deliver scalable single-person pose estimation. To this end, we introduce EfficientPose, a family of models harnessing an effective multi-scale feature extractor, computation-efficient detection blocks utilizing mobile inverted bottleneck convolutions, and upscaling that improves the precision of pose configurations. EfficientPose enables real-world deployment on edge devices through a 500K-parameter model consuming less than one GFLOP. The results of our experiments on the challenging MPII single-person benchmark show that the proposed EfficientPose models substantially outperform the widely used OpenPose model in terms of accuracy, while at the same time being up to 15 times smaller and 20 times more computationally efficient.




1 Introduction

Human pose estimation (HPE) refers to the computer vision task of localizing human skeletal keypoints from images or video frames. HPE has many real-world applications, ranging from outdoor activity recognition and computer animation to clinical assessments of motor repertoire and skill practice among professional athletes. The proliferation of deep convolutional neural networks (ConvNets) has advanced HPE and further widened its application areas. Although ConvNet-based HPE, with its increasingly complex network structures combined with transfer learning, remains a very challenging task, the availability of high-performing ImageNet backbones Deng et al. (2009), together with large tailor-made datasets, such as MPII for 2D pose estimation Andriluka et al. (2014), has facilitated the development of new and improved methods to address these challenges.

An increasing trend in computer vision drives towards more efficient models Sandler et al. (2018); Tan et al. (2019); Howard et al. (2017); Elsen et al. (2019). Recently, EfficientNet Tan and Le (2019a) was released as a scalable ConvNet architecture, setting a benchmark record on ImageNet with a more computation-efficient architecture. Within HPE, however, there is a lack of accurate and computation-efficient architectures. State-of-the-art architectures in HPE are computationally expensive and highly complex, making the networks hard to replicate, cumbersome to optimize, and impractical to embed into real-world applications. The OpenPose network Cao et al. (2018) is one of the most widely applied HPE methods in real-world applications Nakai et al. (2018); Noori et al. (2019); Firdaus and Rakun (2019); Chambers et al. (2019), and was the first open-source real-time system for HPE. However, the OpenPose architecture is highly computationally inefficient, spending billions of floating-point operations (FLOPs) per inference. Moreover, the level of detail in OpenPose keypoint estimates is limited by its low-resolution outputs. This makes OpenPose less suitable for precision-demanding applications, such as elite sports and medical assessments, which depend on a high degree of precision in the assessment of movement kinematics. Despite these challenges, OpenPose remains a widely used HPE network for markerless motion capture upon which critical decisions are based Chambers et al. (2019); Vitali et al. (2019); Barra et al. (2019).

In this paper, we call attention to the lack of publicly available methods for HPE that both are computationally efficient and provide high-precision estimates. To this end, we harness recent progress in ConvNets to propose a novel approach called EfficientPose, which is a family of scalable ConvNets for single-person pose estimation from 2D images. To show how it stands against existing available methods, we compare the precision and computation efficiency of EfficientPose with OpenPose on single-person HPE. The proposed models aim to elicit improved precision levels, while bridging the gap in availability of high-precision HPE networks.

The main contributions of this paper are the following:

  • We propose a novel method, called EfficientPose, that can overcome the shortcomings of the popular OpenPose network on single-person HPE, with an improved level of precision, rapid convergence during optimization, modest model size, and low computational cost.

  • With EfficientPose, we suggest an approach providing scalable models that can suit various demands, enabling a trade-off between accuracy and efficiency across diverse application constraints and computational budgets.

  • We propose a new way to incorporate mobile ConvNet components, which can address the need for computation efficient architectures for HPE, thus facilitating real-time HPE on the edge.

  • We perform an extensive comparative study to evaluate our approach. Our experimental results show that the proposed method achieves significantly higher efficiency and accuracy in comparison to the baseline method, OpenPose. In addition, compared to existing state-of-the-art methods, it achieves competitive results with a much smaller number of parameters.

The remainder of this paper is organized as follows. Section 2 describes the architecture of OpenPose and highlights research from which it can be improved. Based on this, Section 3 presents our proposed ConvNet-based approach, EfficientPose. Section 4 describes our experiments and presents the results of comparing EfficientPose with OpenPose and other existing approaches. Section 5 discusses our findings and suggests potential future studies. Finally, Section 6 summarizes and concludes the paper.

To promote research on beneficial applications within movement science, we will make the EfficientPose models available at https://github.com/daniegr/Effici

2 Related work

The proliferation of ConvNets for HPE following the success of DeepPose Toshev and Szegedy (2014) has set the path for accurate HPE. With OpenPose, Cao et al. Cao et al. (2018) made HPE available to the public. As depicted in Figure 1, OpenPose comprises a multi-stage architecture performing a series of detection passes. Provided an input image, OpenPose utilizes an ImageNet-pretrained VGG-19 backbone Simonyan and Zisserman (2014) to extract basic features (step 1 in Figure 1). The features are supplied to a DenseNet-inspired detection block (step 2) arranged as five dense blocks Huang et al. (2017), each containing three convolutions with PReLU activations He et al. (2015). The detection blocks are stacked in sequence. First, four passes (steps 3a-d in Figure 1) of part affinity fields Cao et al. (2017) map associations between body keypoints. Subsequently, two detection passes (steps 3e and 3f) predict keypoint heatmaps Tompson et al. (2014) to obtain refined keypoint coordinate estimates. In terms of the level of detail in keypoint coordinates, OpenPose is restricted by its low output resolution.

Figure 1: OpenPose architecture utilizing 1) a VGG-19 feature extractor and 2) detection blocks performing 4+2 passes of part affinity field estimation (3a-d) and confidence map estimation (3e and 3f)

The OpenPose architecture can be improved by recent advancements in ConvNets. First, automated network architecture search has found backbones Tan and Le (2019a); Zoph et al. (2018); Tan and Le (2019b) that are more precise and efficient in image classification than VGG and ResNets Simonyan and Zisserman (2014); He et al. (2016). Tan et al. Tan and Le (2019a) proposed compound model scaling to balance image resolution, width (number of network channels), and depth (number of network layers). This resulted in the scalable EfficientNets Tan and Le (2019a), flexible in model size and precision level. For each model variant φ, from EfficientNet-B0 (φ = 0) to EfficientNet-B7 (φ = 7), the total number of FLOPs increases by a factor of approximately 2^φ:

    FLOPs ∝ (α · β² · γ²)^φ ≈ 2^φ    (1)

Coefficients for depth (α), width (β), and resolution (γ) are given in Equation 2:

    depth: d = α^φ,  width: w = β^φ,  resolution: r = γ^φ,  with α = 1.2, β = 1.1, γ = 1.15    (2)
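To make the scaling concrete, here is a minimal sketch of the compound-scaling arithmetic. The coefficient values α = 1.2, β = 1.1, γ = 1.15 are those published for EfficientNet; their exact mapping onto EfficientPose's Equations 1-2 is an assumption of this illustration.

```python
# Sketch of EfficientNet-style compound scaling (Tan & Le, 2019a).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution coefficients

def scaling(phi: int) -> tuple:
    """Depth, width, and resolution multipliers for model variant phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def flop_factor(phi: int) -> float:
    """FLOPs grow as depth * width^2 * resolution^2, i.e. roughly 2^phi."""
    d, w, r = scaling(phi)
    return d * w ** 2 * r ** 2
```

For φ = 1 the factor is α · β² · γ² ≈ 1.92, close to the per-step doubling of FLOPs that the compound-scaling constraint targets.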
Second, parallel multi-scale feature extraction has raised precision levels in HPE Newell et al. (2016); Ke et al. (2018); Sun et al. (2019); Yang et al. (2017), emphasizing both high spatial resolution and low-scale semantics. However, existing multi-scale approaches in HPE are computationally expensive, reflected in high GFLOP counts and large model sizes, typically tens of millions of parameters Sun et al. (2019); Su et al. (2019); Zhang et al. (2019c); Tang et al. (2018); Yang et al. (2017); Newell et al. (2016); Rafi et al. (2016). To cope with this, we propose cross-resolution features to integrate features from multiple abstraction levels with little overhead in network size and computational cost. Third, mobile inverted bottleneck convolution (MBConv) Sandler et al. (2018) with built-in squeeze-and-excitation (SE) Hu et al. (2018) and Swish activation Ramachandran et al. (2017), as integrated in EfficientNets, proves more accurate in image classification Tan and Le (2019a, b) than regular convolutions He et al. (2016); Huang et al. (2017); Szegedy et al. (2017), while substantially reducing FLOPs Tan and Le (2019a). The efficiency of MBConv modules stems from depthwise convolutions operating in a channel-wise manner Sifre and Mallat (2014). This reduces the computational cost by a factor proportional to the number of channels Tan and Le (2019b). The extensive use of regular convolutions with many input channels in the detection blocks of OpenPose indicates the potential of MBConvs there. Further, SE selectively emphasizes discriminative features Hu et al. (2018), which may reduce the required number of convolutions and detection passes; utilizing MBConv with SE may therefore make it possible to decrease the number of dense blocks in OpenPose. Fourth, transposed convolutions with a bilinear kernel Long et al. (2015) can upscale low-resolution feature maps, enabling a higher level of detail in output confidence maps.
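The cost argument for depthwise convolutions can be sketched numerically. The layer dimensions below are illustrative, not taken from either network:

```python
def conv_flops(h, w, c_in, c_out, k):
    """Multiply-accumulates of a standard k x k convolution."""
    return h * w * c_in * c_out * k * k

def depthwise_separable_flops(h, w, c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    return h * w * c_in * k * k + h * w * c_in * c_out

# Illustrative layer: 46 x 46 feature maps, 128 channels, 3 x 3 kernels.
regular = conv_flops(46, 46, 128, 128, 3)
separable = depthwise_separable_flops(46, 46, 128, 128, 3)
# The reduction factor is 1 / (1/c_out + 1/k^2), i.e. it grows with the
# channel count, as stated above.
```

With these numbers, the separable variant needs roughly 8 times fewer operations than the regular convolution.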

By building upon the work of Tan et al. Tan and Le (2019a), we present a pool of scalable models for single-person HPE, overcoming the shortcomings of the commonly adopted OpenPose architecture. This enables trading off accuracy against efficiency across computational budgets in real-world applications. In contrast to OpenPose, it also enables ConvNets that are small and computationally efficient enough to run on edge devices with little memory and low processing power. We thereby address the gap in openly available, scalable, and computation-efficient ConvNets for single-person HPE.

3 The EfficientPose approach

3.1 Architecture

The EfficientPose network (Figure 2) introduces several modifications of the OpenPose architecture shown in Figure 1, including 1) high- and low-resolution input images, 2) scalable EfficientNet backbones, 3) cross-resolution features, 4 and 5) scalable Mobile DenseNet detection blocks in fewer detection passes, and 6) bilinear upscaling. For a more thorough component analysis of EfficientPose, see Appendix A.

Figure 2: Proposed architecture comprising 1a) high-resolution and 1b) low-resolution inputs, 2a) high-level and 2b) low-level EfficientNet backbones combined into 3) cross-resolution features, 4) Mobile DenseNet detection blocks, 1+2 passes for estimation of part affinity fields (5a) and confidence maps (5b and 5c), and 6) bilinear upscaling

The input of the network is comprised of high- and low-resolution images (1a and 1b in Figure 2). The low-resolution image is downsampled to half the pixel height and width of the high-resolution image by an initial average pooling layer.
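A minimal numpy sketch of such a halving average pool, assuming even input dimensions and a channels-last layout:

```python
import numpy as np

def avg_pool_2x2(img: np.ndarray) -> np.ndarray:
    """Downsample a (H, W, C) image to (H/2, W/2, C) by averaging 2x2 blocks."""
    h, w, c = img.shape
    return img.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
```

Applied to a 4x4 single-channel image, each output pixel is the mean of the corresponding 2x2 input block.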

The feature extractor of EfficientPose is composed of the initial blocks of EfficientNets Tan and Le (2019a) pretrained on ImageNet (steps 2a and 2b in Figure 2). High-level semantic information is obtained from the high-resolution image using the initial three blocks of a high-scale EfficientNet (Equation 1), outputting C feature maps (2a). Low-level local information is extracted from the low-resolution image by the first two blocks of a lower-scale EfficientNet backbone (2b). Table 1 displays the composition of the EfficientNet backbones, from low-scale B0 to high-scale B7. The first block of EfficientNets utilizes the MBConvs described in Figures 3a and 3b, while the second and third blocks comprise the MBConv layers in Figures 3c and 3d.

Table 1: The architecture of the initial three blocks of the relevant EfficientNet backbones (B0-B7). For each MBConv layer, the table specifies the filter size, the number of output feature maps, and the stride; batch normalization is applied throughout. The input size corresponds to the image resolution on ImageNet, and the depth factor is determined by Equation 1. [Table entries not preserved.]

Figure 3: The composition of MBConvs. From left: a-d) MBConvs in EfficientNets perform a depthwise convolution with a given filter size and stride, and output a given number of feature maps; (b and d) extend regular MBConvs by including a dropout layer and a skip connection; e) the MBConv in Mobile DenseNets adjusts this with E-swish activation and a number of feature maps in the expansion phase proportional to the output width. All MBConvs take as input feature maps of a given spatial height and width; SE is applied with a given reduction ratio.

The features generated by the low-level and high-level EfficientNet backbones are concatenated to yield cross-resolution features (step 3 in Figure 2). This enables the EfficientPose architecture to selectively emphasize important local factors from the image of interest and overall structures guiding high-quality pose estimation. In this way, we enable an alternative simultaneous handling of features at multiple abstraction levels.
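The concatenation step can be sketched as follows. The spatial sizes and channel counts are hypothetical, chosen so that both branches land on the same grid (three backbone blocks with total stride 8 on the full-resolution image, two blocks with total stride 4 on the half-resolution image):

```python
import numpy as np

# Hypothetical feature shapes for a 256 x 256 high-resolution input:
high_feats = np.zeros((32, 32, 40))  # high-level branch: 256 / 8 = 32
low_feats = np.zeros((32, 32, 24))   # low-level branch: 128 / 4 = 32

# Channel-wise concatenation yields the cross-resolution features.
cross = np.concatenate([high_feats, low_feats], axis=-1)
```

Because the branches only need to agree spatially, each can be scaled independently, which is what makes the cross-resolution extractor compatible with compound scaling.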

From the extracted features, the desired keypoints are localized through an iterative detection process exploiting intermediate supervision. Each detection pass comprises a detection block and a single convolution for output prediction. The detection blocks across all detection passes share the same basic architecture, comprising Mobile DenseNets (see step 4 in Figure 2). Data from Mobile DenseNets are forwarded to subsequent layers of the detection block using residual connections. The Mobile DenseNet is inspired by DenseNets Huang et al. (2017), supporting reuse of features and avoiding redundant layers, and by MBConv with SE, enabling a low memory footprint. In our adaptation of the MBConv operation (Figure 3e), we consistently utilize the highest-performing combination of kernel size and expansion ratio from Tan et al. (2019). We also avoid downsampling (i.e., a stride of one) and scale the width of Mobile DenseNets by outputting a number of channels relative to the high-level backbone. We modify the original MBConv operation by incorporating E-swish as the activation function Alcaide (2018), which has a tendency to accelerate progression during training compared to the regular Swish activation Ramachandran et al. (2017). We also adjust the first convolution to generate a number of feature maps relative to the output feature maps rather than to the input channels, which reduces memory consumption and computational latency. With each Mobile DenseNet consisting of three consecutive MBConv operations, the module outputs the resulting concatenated feature maps.
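E-swish itself is a one-liner; the β value of 1.25 used below is an assumption within the range recommended by Alcaide (2018), not a value confirmed by the text above:

```python
import numpy as np

BETA = 1.25  # assumed slope parameter; beta = 1 recovers the regular Swish

def e_swish(x: np.ndarray) -> np.ndarray:
    """E-swish activation (Alcaide, 2018): beta * x * sigmoid(x)."""
    return BETA * x / (1.0 + np.exp(-x))
```

Like Swish, E-swish is smooth and non-monotonic with a small negative tail, but the slope β > 1 amplifies gradients, which is the trait associated with faster training progression.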

EfficientPose performs detection in two rounds (step 5a-c in Figure 2). First, the overall pose of the person is anticipated through a single pass of skeleton estimation (5a). This aims to facilitate detection of feasible poses and to avoid confusion in case of several persons being present in an image. Skeleton estimation is performed utilizing part affinity fields as proposed in Cao et al. (2017). Following skeleton estimation, two detection passes are performed to estimate heatmaps for keypoints of interest. The former of these acts as a coarse detector (5b), while the latter (5c) refines localization to yield more accurate outputs.

In OpenPose, the heatmaps of the final detection pass are constrained to a low spatial resolution, incapable of achieving the amount of detail inherent in the high-resolution input Cao et al. (2018). To overcome this limitation, a series of three transposed convolutions performing bilinear upsampling is injected (step 6 in Figure 2). Thus, we project the low-resolution output onto a space of higher resolution, allowing an increased level of detail. To achieve the proper level of interpolation while operating efficiently, each transposed convolution increases the map size by a constant factor using a strided bilinear kernel.
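A transposed convolution performs bilinear upsampling when its kernel is initialized with bilinear interpolation weights, following the construction of Long et al. (2015); the 4x4 kernel size below is illustrative:

```python
import numpy as np

def bilinear_kernel(size: int) -> np.ndarray:
    """2D bilinear weights for initializing a transposed-convolution kernel."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    rows, cols = np.ogrid[:size, :size]
    return (1 - abs(rows - center) / factor) * (1 - abs(cols - center) / factor)
```

For a 4x4 kernel with stride 2, each input pixel spreads its value over a 4x4 neighbourhood with weights summing to 4, so the upsampled map preserves overall intensity while interpolating smoothly between neighbours.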

3.2 Variants

In accordance with the original EfficientNet Tan and Le (2019a), we scale EfficientPose with respect to the three main dimensions, resolution, width, and depth, utilizing the coefficients of Equation 2. Table 2 displays the five variants obtained from scaling the EfficientPose architecture. The resolution, defining the spatial dimensions of the input image, is scaled by selecting the high- and low-level EfficientNet backbones (Table 1) that best match the resolutions of the high- and low-resolution inputs. The network width refers to the number of feature maps generated by each Mobile DenseNet. As described in Section 3.1, width scaling is achieved by using the same width as the high-level backbone. To achieve proper scaling also in the depth dimension, the number of Mobile DenseNets in the detection blocks is varied (Table 2). This ensures similar relative sizes of receptive fields across the models and spatial resolutions. For each model variant, we select the number of Mobile DenseNets that best approximates the original depth factor of the high-level EfficientNet backbone (Table 1); more specifically, it is determined by Equation 3, rounding to the closest integer. Besides EfficientPose I to IV, the single-resolution model EfficientPose RT is formed to match the scale of the smallest EfficientNet model, providing HPE for extremely low-latency applications.

Stage               | EfficientPose RT | EfficientPose I | EfficientPose II | EfficientPose III | EfficientPose IV
High-level backbone | B0 (Blocks 1-3)  | B2 (Blocks 1-3) | B4 (Blocks 1-3)  | B5 (Blocks 1-3)   | B7 (Blocks 1-3)
Low-level backbone  | -                | B0 (Blocks 1-2) | B0 (Blocks 1-2)  | B1 (Blocks 1-2)   | B3 (Blocks 1-2)

Table 2: Variants of EfficientPose obtained by scaling resolution, width, and depth. The input resolutions, detection blocks, and prediction passes 1-3 are also scaled per variant. Mobile DenseNets compute the detection-block feature maps; the prediction passes output 2D part affinity fields and confidence maps, and the transposed convolutions are specified by kernel size, output maps, and stride. [Remaining table entries not preserved.]

3.3 Summary of proposed framework

The EfficientPose framework comprises a family of five ConvNets (EfficientPose RT to IV) that benefit from compound scaling Tan and Le (2019a) and from advances in computationally efficient ConvNets for image recognition to construct a scalable network architecture, performing single-person HPE across different computational constraints. More specifically, EfficientPose utilizes both high- and low-resolution images to provide two separate viewpoints that are processed independently through high- and low-level backbones, respectively. The resulting features are concatenated to yield cross-resolution features, enabling selective emphasis on global and local image information. The detection stage utilizes a scalable mobile detection block to perform detection in three passes. The first pass estimates person skeletons through part affinity fields Cao et al. (2017) to yield feasible pose configurations. The second and third passes estimate keypoint locations with progressive improvement in precision. Finally, the low-resolution prediction of the third pass is upscaled through bilinear interpolation to yield a further improvement in precision.

4 Experiments and results

4.1 Experimental setup

We evaluate EfficientPose and compare it with OpenPose on the single-person MPII dataset Andriluka et al. (2014), containing images of humans in a wide range of activities and situations. All models are optimized on MPII using SGD with cyclical learning rates (Appendix B). From the training and validation portion of the dataset, we adopt a standard random split to obtain separate training and validation instances. We augment images during training using random horizontal flipping, scaling, and rotation. The evaluation of model accuracy is performed using the head-normalized percentage of correct keypoints (PCKh) metric: a prediction is counted as correct when it resides within a given distance of the ground-truth location, where the threshold is a fraction of the diagonal of the head bounding box and the tolerated fraction defines the strictness of the metric. PCKh@50 is the standard performance metric for MPII, but we also include the stricter PCKh@10 for assessing the models' ability to yield highly precise keypoint estimates. In common fashion, final model predictions are obtained by applying a multi-scale testing procedure Yang et al. (2017); Sun et al. (2019); Tang et al. (2018). Due to the restriction on the number of attempts for official evaluation on MPII, test metrics are only supplied for the OpenPose baseline and for the most efficient and most accurate models, EfficientPose RT and EfficientPose IV, respectively. To measure model efficiency, both FLOPs and the number of parameters are supplied.
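A head-normalized PCK computation can be sketched as follows; the array layout and the per-image head-size normalizer are assumptions of this illustration, and MPII's exact head-size definition is not reproduced:

```python
import numpy as np

def pckh(pred: np.ndarray, gt: np.ndarray, head_size: np.ndarray,
         tau: float = 0.5) -> float:
    """Fraction of keypoints within tau * head_size of the ground truth.

    pred, gt: (N, K, 2) arrays of (x, y) coordinates.
    head_size: (N,) per-image normalization lengths.
    tau = 0.5 corresponds to PCKh@50, tau = 0.1 to PCKh@10.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)            # (N, K) distances
    return float((dist <= tau * head_size[:, None]).mean())
```

Tightening tau from 0.5 to 0.1 shrinks the acceptance radius fivefold, which is why PCKh@10 rewards the high-resolution outputs discussed above.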

4.2 Results

Table 3 shows the results of our experiments evaluating OpenPose and EfficientPose on the MPII validation dataset. EfficientPose consistently outperforms OpenPose with regards to efficiency, with smaller model sizes and fewer FLOPs. All model variants also elicit improved high-precision localization, with gains in PCKh@10 compared to OpenPose. Furthermore, the high-end models EfficientPose II-IV yield improvements in PCKh@50. As Table 4 shows, EfficientPose IV achieves state-of-the-art results (mean PCKh@50 of 91.2) on the official MPII test dataset among models with few parameters.

Table 3: Performance of EfficientPose compared to OpenPose on the MPII validation dataset, as evaluated by efficiency (number of parameters and FLOPs, and relative reduction in parameters and FLOPs compared to OpenPose) and accuracy (mean PCKh@50 and mean PCKh@10). Rows: OpenPose Cao et al. (2018), EfficientPose RT, and EfficientPose I-IV. [Numeric entries not preserved.]
Table 4: State-of-the-art results in PCKh@50 (both for individual body parts — head, shoulder, elbow, wrist, hip, knee, and ankle — and the overall mean) on the official MPII test dataset Andriluka et al. (2014), compared against the number of parameters. Methods compared: Pishchulin et al., ICCV'13 Pishchulin et al. (2013); Tompson et al., NIPS'14 Tompson et al. (2014); Lifshitz et al., ECCV'16 Lifshitz et al. (2016); Newell et al., ECCV'16 Newell et al. (2016); Zhang et al., CVPR'19 Zhang et al. (2019b); Yang et al., ICCV'17 Yang et al. (2017); Tang et al., ECCV'18 Tang et al. (2018); Sun et al., CVPR'19 Sun et al. (2019); Zhang et al., arXiv'19 Zhang et al. (2019c); OpenPose Cao et al. (2018); EfficientPose RT; and EfficientPose IV. [Numeric entries not preserved.]

Compared to OpenPose, EfficientPose also displays rapid convergence during training. We optimized both approaches on similar input resolutions, matching the OpenPose default, which corresponds to EfficientPose II. The training graph (Figure 4) demonstrates that EfficientPose converges early, while OpenPose requires many more epochs for proper convergence. As Table 5 points out, OpenPose benefits from prolonged training, with a noticeable improvement in accuracy during the final epochs, while EfficientPose II shows only a minor gain.

Figure 4: The progression of mean error of EfficientPose II and OpenPose on the MPII validation set during the course of training
Table 5: Model accuracy on the MPII validation dataset in relation to the number of training epochs, reported at three checkpoints each for OpenPose Cao et al. (2018) and EfficientPose II. [Numeric entries not preserved.]

5 Discussion

Facilitated by cross-resolution features and upscaling of the output (see Appendix A), EfficientPose elevates the precision level compared to OpenPose Cao et al. (2018), with a relative improvement in PCKh@10 on single-person MPII (Table 3). This reflects the inherent limitation of the OpenPose architecture when it comes to performing HPE in a single-person domain. EfficientPose more consistently supplies movement information of high spatial resolution, and hence proves more promising for precision-demanding single-person HPE applications such as medical assessments and elite sports. The precision of HPE methods is a key factor in analyses of movement kinematics, such as segment positions and joint angles, for assessing sport performance in athletes or motor disabilities in patients. On the contrary, for some applications (e.g., exercise games and baby monitors), we are more interested in the latency of the system and its ability to respond quickly; in these scenarios, the degree of correctness of the keypoint coordinates may be less crucial. The scalability of EfficientPose enables flexibility across these various situations and different types of hardware, whereas OpenPose suffers from its large model size and number of FLOPs.

The use of MBConvs in HPE is, to our knowledge, an unexplored path. Nevertheless, EfficientPose approached state-of-the-art performance on the single-person MPII benchmark despite a large reduction in the number of parameters (Table 4). This suggests that the parameter-efficient MBConvs provide value in HPE as in other computer vision tasks, such as image classification and object detection. Thus, MBConv is a promising component for HPE networks, and it will be interesting to assess its effect when combined with other novel HPE architectures, such as Hourglass and HRNet Newell et al. (2016); Sun et al. (2019). The use of EfficientNet as backbone, and the proposed cross-resolution feature extractor combining several EfficientNets for improved handling of basic features, are also interesting avenues to explore further. From the present study, we conjecture that EfficientNets could replace commonly used backbones for HPE, such as VGG and ResNets, reducing the computational overhead associated with these approaches Simonyan and Zisserman (2014); He et al. (2016). Additionally, a cross-resolution feature extractor could serve precision-demanding applications by improving performance on PCKh@10 (Table 6). We also observed that EfficientPose benefits from compound model scaling across resolution, width, and depth, reflected by incremental improvements in PCKh@50 and PCKh@10 from EfficientPose RT through EfficientPose IV (Table 3). By utilizing this benefit to further examine scalable ConvNets for HPE, we can gain insight into appropriate sizes of HPE models (i.e., number of parameters), required numbers of FLOPs, and obtainable precision levels.

In this study, OpenPose and EfficientPose were optimized on the general-purpose MPII Human Pose dataset. For many applications (e.g., action recognition and video surveillance), the variability in MPII may be sufficient for directly applying the models to real-world problems. Nonetheless, other scenarios deviate from this setting. MPII comprises mostly grown-up persons in a variety of everyday indoor and outdoor activities Andriluka et al. (2014). In less natural environments (e.g., movement science laboratories or hospital settings), and with humans of different anatomical proportions, such as children and infants Sciortino et al. (2017), careful consideration must be taken. This could require fine-tuning of the MPII models on more specific datasets related to the problem at hand. In our experiments, EfficientPose was more easily trainable than OpenPose (Figure 4 and Table 5). This trait of rapid convergence suggests that transfer learning of EfficientPose models to other HPE data will be convenient.

EfficientPose may also perform multi-person HPE in a more computationally efficient manner than OpenPose. State-of-the-art methods for multi-person HPE are dominated by top-down approaches, which require computation proportional to the number of individuals in the image Fieraru et al. (2018); Zhang et al. (2019a). In crowded scenes, top-down approaches are thus highly resource-demanding. Like the original OpenPose Cao et al. (2018), we can take advantage of part affinity fields to group keypoints into persons and perform multi-person HPE in a bottom-up manner. This reduces the computational overhead to a single network inference per image, and hence significantly reduces the computation required for multi-person HPE.

The architecture of EfficientPose and the training process can be improved in several ways. First, the optimization procedure (Appendix B) was developed for maximum accuracy on OpenPose and simply reapplied to EfficientPose. Other optimization procedures may be more appropriate for EfficientPose, such as alternative optimizers (e.g., Adam Kingma and Ba (2014) and RMSProp Tieleman and Hinton (2012)) and other learning rate and sigma schedules. Second, only the backbone of EfficientPose was pretrained on ImageNet. This could restrict the level of accuracy on HPE, as large-scale pretraining supplies not only robust basic features but also higher-level semantics. Thus, it would be valuable to assess the effect of pretraining on model precision in HPE by pretraining the majority of ConvNet layers on ImageNet and then retraining them on HPE data. Third, the proposed compound scaling of EfficientPose assumes that the scaling relationship between resolution, width, and depth defined by Equation 2 holds for both HPE and image recognition. However, the optimal compound scaling coefficients may well differ for HPE, where the precision level depends more strongly on image resolution than in image classification. Thus, further studies could conduct neural architecture search across different combinations of resolution, width, and depth to determine the optimal scaling coefficients for HPE. Regardless of the scaling coefficients, the scaling of detection blocks in EfficientPose could be improved. The block depth (i.e., the number of Mobile DenseNets) deviates slightly from the original depth coefficient of EfficientNets owing to the rigid nature of the Mobile DenseNets. A more carefully designed detection block could address this challenge by providing greater flexibility with regard to the number of layers and the receptive field size. Fourth, the computational efficiency of EfficientPose can be further improved through teacher-student network training (i.e., knowledge distillation) Buciluǎ et al. (2006), transferring knowledge from a high-scale EfficientPose teacher network to a low-scale EfficientPose student network. This technique has already shown promise in HPE when paired with the stacked hourglass architecture Newell et al. (2016); Zhang et al. (2019b). Sparse networks, network pruning, and weight quantization Bulat et al. (2019); Tung and Mori (2018); Elsen et al. (2019) may also be considered to facilitate increasingly accurate and responsive real-life HPE systems. For high-performance inference and deployment on edge devices, further speed-up can be achieved with specialized libraries such as NVIDIA TensorRT and TensorFlow Lite Developer (2020); TensorFlow (2020).
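As an illustration of the teacher-student training mentioned above, heatmap distillation can be sketched as a weighted combination of a supervised loss and an imitation loss. The function below is a minimal numpy sketch under our own assumptions; the `alpha` weighting and all names are illustrative, not part of the EfficientPose training code:

```python
import numpy as np

def distillation_loss(student_maps, teacher_maps, gt_maps, alpha=0.5):
    """Hypothetical heatmap distillation loss: a convex combination of
    the supervised MSE (against ground-truth confidence maps) and the
    imitation MSE (against the teacher's predicted maps)."""
    mse_gt = np.mean((student_maps - gt_maps) ** 2)
    mse_teacher = np.mean((student_maps - teacher_maps) ** 2)
    return alpha * mse_teacher + (1.0 - alpha) * mse_gt
```

With `alpha` close to one, the low-scale student imitates the smoother targets of the high-scale teacher; with `alpha` close to zero, training reduces to the ordinary supervised MSE.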

In summary, EfficientPose tackles single-person HPE with an improved degree of precision compared to the commonly adopted OpenPose network Cao et al. (2018), while at the same time achieving a large reduction in the number of parameters and FLOPs. This is accomplished by taking advantage of contemporary research within image recognition on computationally efficient ConvNet components, most notably MBConvs and EfficientNets Sandler et al. (2018); Tan and Le (2019a). The EfficientPose models are made openly available, serving research initiatives on movement analysis and stimulating further research within high-precision and computationally efficient HPE.

6 Conclusion

In this work, we have stressed the need for a publicly accessible method for single-person HPE that meets the demands for both precision and efficiency across various applications and computational budgets. To this end, we have presented a novel method called EfficientPose, a scalable ConvNet architecture leveraging a computationally efficient multi-scale feature extractor, novel mobile detection blocks, pose association estimates, and bilinear upscaling. To yield model variants that provide flexibility in the trade-off between accuracy and efficiency, we have developed the EfficientPose approach to exploit model scalability in three dimensions: resolution, width, and depth. Our experimental results have demonstrated that the proposed approach provides computationally efficient models, allowing real-time inference on edge devices. At the same time, our framework offers the flexibility to be scaled up to deliver more precise keypoint estimates than commonly used counterparts, at an order of magnitude fewer parameters and FLOPs. Given the versatility and high precision of our proposed framework, EfficientPose provides a foundation for next-generation markerless movement analysis.

In future work, we plan to develop new techniques to further improve model effectiveness, especially in terms of precision, by investigating optimal compound model scaling for HPE. Moreover, we will deploy EfficientPose in a range of applications to validate its applicability, as well as its feasibility, in real-world scenarios.


The research is funded by RSO funds from the Faculty of Medicine and Health Sciences at the Norwegian University of Science and Technology. The experiments were carried out utilizing computational resources provided by the Norwegian Open AI Lab.


  • E. Alcaide (2018) E-swish: adjusting activations to different network depths. arXiv preprint arXiv:1801.07145. Cited by: §3.1.
  • M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele (2014) 2D human pose estimation: new benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §4.1, Table 4, §5.
  • P. Barra, C. Bisogni, M. Nappi, D. Freire-Obregón, and M. Castrillón-Santana (2019) Gait analysis for gender classification in forensics. In International Conference on Dependability in Sensor, Cloud, and Big Data Systems and Applications, pp. 180–190. Cited by: §1.
  • C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil (2006) Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541. Cited by: §5.
  • A. Bulat, G. Tzimiropoulos, J. Kossaifi, and M. Pantic (2019) Improved training of binary networks for human pose estimation and image recognition. arXiv preprint arXiv:1904.05868. Cited by: §5.
  • Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh (2018) OpenPose: realtime multi-person 2d pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008. Cited by: §1, §2, §3.1, Table 3, Table 4, Table 5, §5, §5, §5, Appendix B: Optimization procedure.
  • Z. Cao, T. Simon, S. Wei, and Y. Sheikh (2017) Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299. Cited by: §2, §3.1, §3.3.
  • C. Chambers, N. Seethapathi, R. Saluja, H. Loeb, S. Pierce, D. Bogen, L. Prosser, M. J. Johnson, and K. P. Kording (2019) Computer vision to automatically assess infant neuromotor risk. BioRxiv, pp. 756262. Cited by: §1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1.
  • N. Developer (2020) NVIDIA tensorrt. Note: https://developer.nvidia.com/tensorrt Cited by: §5.
  • E. Elsen, M. Dukhan, T. Gale, and K. Simonyan (2019) Fast sparse convnets. arXiv preprint arXiv:1911.09723. Cited by: §1, §5.
  • M. Fieraru, A. Khoreva, L. Pishchulin, and B. Schiele (2018) Learning to refine human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 205–214. Cited by: §5.
  • N. M. Firdaus and E. Rakun (2019) Recognizing fingerspelling in sibi (sistem isyarat bahasa indonesia) using openpose and elliptical fourier descriptor. In Proceedings of the International Conference on Advanced Information Science and System, pp. 1–6. Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2, §2, §5.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1.
  • J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §2.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §2, §2, §3.1.
  • L. Ke, M. Chang, H. Qi, and S. Lyu (2018) Multi-scale structure-aware network for human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 713–728. Cited by: §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.
  • I. Lifshitz, E. Fetaya, and S. Ullman (2016) Human pose estimation using deep consensus voting. In European Conference on Computer Vision, pp. 246–260. Cited by: Table 4.
  • J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §2.
  • M. Nakai, Y. Tsunoda, H. Hayashi, and H. Murakoshi (2018) Prediction of basketball free throw shooting by openpose. In JSAI International Symposium on Artificial Intelligence, pp. 435–446. Cited by: §1.
  • A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In European conference on computer vision, pp. 483–499. Cited by: §2, Table 4, §5, §5.
  • F. M. Noori, B. Wallace, M. Z. Uddin, and J. Torresen (2019) A robust human activity recognition approach using openpose, motion features, and deep recurrent neural network. In Scandinavian Conference on Image Analysis, pp. 299–310. Cited by: §1.
  • L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele (2013) Poselet conditioned pictorial structures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 588–595. Cited by: Table 4.
  • U. Rafi, B. Leibe, J. Gall, and I. Kostrikov (2016) An efficient convolutional network for human pose estimation.. In BMVC, Vol. 1, pp. 2. Cited by: §2.
  • P. Ramachandran, B. Zoph, and Q. V. Le (2017) Searching for activation functions. arXiv preprint arXiv:1710.05941. Cited by: §2, §3.1.
  • M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §1, §2, §5.
  • G. Sciortino, G. M. Farinella, S. Battiato, M. Leo, and C. Distante (2017) On the estimation of children’s poses. In International Conference on Image Analysis and Processing, pp. 410–421. Cited by: §5.
  • L. Sifre and S. Mallat (2014) Rigid-motion scattering for image classification. Ph. D. thesis. Cited by: §2.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2, §2, §5.
  • L. N. Smith and N. Topin (2019) Super-convergence: very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, Vol. 11006, pp. 1100612. Cited by: Appendix B: Optimization procedure.
  • L. N. Smith (2017) Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472. Cited by: Appendix B: Optimization procedure.
  • Z. Su, M. Ye, G. Zhang, L. Dai, and J. Sheng (2019) Improvement multi-stage model for human pose estimation. arXiv preprint arXiv:1902.07837. Cited by: §2, Appendix B: Optimization procedure.
  • K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5693–5703. Cited by: §2, §4.1, Table 4, §5, Appendix B: Optimization procedure.
  • C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence, Cited by: §2.
  • M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §1, §3.1.
  • M. Tan and Q. V. Le (2019a) Efficientnet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: §1, §2, §2, §2, §3.1, §3.2, §3.3, §5.
  • M. Tan and Q. V. Le (2019b) Mixconv: mixed depthwise convolutional kernels. CoRR, abs/1907.09595. Cited by: §2, §2.
  • W. Tang, P. Yu, and Y. Wu (2018) Deeply learned compositional models for human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 190–206. Cited by: §2, §4.1, Table 4.
  • TensorFlow (2020) Deploy machine learning models on mobile and iot devices. Note: https://www.tensorflow.org/lite Cited by: §5.
  • T. Tieleman and G. Hinton (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4 (2), pp. 26–31. Cited by: §5.
  • J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler (2014) Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in neural information processing systems, pp. 1799–1807. Cited by: §2, Table 4.
  • A. Toshev and C. Szegedy (2014) Deeppose: human pose estimation via deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1653–1660. Cited by: §2.
  • F. Tung and G. Mori (2018) Clip-q: deep network compression learning by in-parallel pruning-quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7873–7882. Cited by: §5.
  • A. Vitali, D. Regazzoni, C. Rizzi, and F. Maffioletti (2019) A new approach for medical assessment of patient’s injured shoulder. In International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Vol. 59179, pp. V001T02A049. Cited by: §1.
  • W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang (2017) Learning feature pyramids for human pose estimation. In proceedings of the IEEE international conference on computer vision, pp. 1281–1290. Cited by: §2, §4.1, Table 4.
  • F. Zhang, X. Zhu, H. Dai, M. Ye, and C. Zhu (2019a) Distribution-aware coordinate representation for human pose estimation. arXiv preprint arXiv:1910.06278. Cited by: §5.
  • F. Zhang, X. Zhu, and M. Ye (2019b) Fast human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3517–3526. Cited by: Table 4, §5.
  • H. Zhang, H. Ouyang, S. Liu, X. Qi, X. Shen, R. Yang, and J. Jia (2019c) Human pose estimation with spatial contextual information. arXiv preprint arXiv:1901.01760. Cited by: §2, Table 4, Appendix B: Optimization procedure.
  • B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: §2.

Appendix A: Ablation study

To determine the effect of different design choices in the EfficientPose architecture, we carried out a component analysis.

Training efficiency

We assessed the number of training epochs to determine the appropriate duration of training, avoiding demanding optimization processes. Figure 4 suggests that the largest improvement in model accuracy occurs until around epochs, after which training saturates. Table 5 supports this observation with less than increase in with epochs of training. From this, it was decided to perform the final optimization of the different variants of EfficientPose over epochs. Table 5 also suggests that most of the learning progress occurs during the first epochs. Hence, for the remainder of the ablation study epochs were used to determine the effect of different design choices.

Cross-resolution features

The value of combining low-level local information with high-level semantic information through a cross-resolution feature extractor was evaluated by optimizing the model with and without the low-level backbone. Experiments were conducted on two different variants of the EfficientPose model. For coarse prediction () there is little to no gain in accuracy (Table 6), while for fine estimation () some improvement () is observed, which is worthwhile given the negligible cost of more parameters and increase in FLOPs.

Model Cross-resolution features Parameters FLOPs
EfficientPose I
EfficientPose I
EfficientPose II
EfficientPose II
Table 6: Model accuracy on the MPII validation dataset in relation to the use of cross-resolution features

Skeleton estimation

The effect of skeleton estimation through the approximation of part affinity fields was assessed by comparing the architecture with and without the single pass of skeleton estimation. Skeleton estimation yields improved accuracy, with a gain in and in (Table 7), while introducing only an overhead in size and complexity of and , respectively.

Model Skeleton estimation Parameters FLOPs
EfficientPose I
EfficientPose I
EfficientPose II
EfficientPose II
Table 7: Model accuracy on the MPII validation dataset in relation to the use of skeleton estimation

Number of detection passes

We also determined the appropriate comprehensiveness of detection, represented by the number of detection passes. EfficientPose I and II were both optimized in three different variants (Table 8). The models seemingly benefit from intermediate supervision, with a general trend of increased performance with the number of detection passes. The major benefit in performance is obtained by expanding from one to two passes of keypoint estimation, reflected by an increase in and in . In comparison, a third detection pass yields only relative improvement in compared to two passes, and no gain in , while increasing size and computation by and , respectively. From these findings, we decided that two detection passes offer a beneficial trade-off between accuracy and efficiency.

Model Detection passes Parameters FLOPs
EfficientPose I
EfficientPose I
EfficientPose I
EfficientPose II
EfficientPose II
EfficientPose II
Table 8: Model accuracy on the MPII validation dataset in relation to the number of detection passes


Upscaling

To assess the impact of upscaling, implemented as bilinear transposed convolutions, we compared the results of the two respective models. Table 9 shows that upscaling improves the precision of keypoint estimates, with large gains of in and smaller improvements of on coarse detection (). As a consequence of the increased output resolution, upscaling slightly increases the number of FLOPs () with a negligible increase in model size.

Model Upscaling Parameters FLOPs
EfficientPose I
EfficientPose I
EfficientPose II
EfficientPose II
Table 9: Model accuracy on the MPII validation dataset in relation to the use of upscaling
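Bilinear transposed convolution is commonly realized by initializing a transposed convolution with a fixed bilinear interpolation kernel. The helper below constructs such a kernel; this is the standard FCN-style construction, included here as an illustrative sketch rather than the exact EfficientPose implementation:

```python
import numpy as np

def bilinear_kernel(factor):
    """2D bilinear interpolation kernel for a transposed convolution
    that upscales feature maps by `factor` (standard FCN-style init)."""
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.arange(size, dtype=np.float64)
    filt_1d = 1.0 - np.abs(og - center) / factor  # triangular 1D filter
    return np.outer(filt_1d, filt_1d)             # separable 2D kernel
```

Because the kernel weights are fixed by the upscaling factor, this layer adds output resolution at almost no cost in parameters, consistent with the small overhead reported in Table 9.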

Appendix B: Optimization procedure

Most state-of-the-art approaches for single-person pose estimation are extensively pretrained on ImageNet Sun et al. (2019); Su et al. (2019); Zhang et al. (2019c), enabling rapid convergence when the models are adapted to other tasks, such as HPE. In contrast, a few models, including OpenPose Cao et al. (2018) and EfficientPose, utilize only the most basic pretrained features. This facilitates the construction of more efficient network architectures but at the same time requires careful design of the optimization procedure to converge towards reasonable parameter values.

Training of pose estimation models is complicated by the intricate nature of the output responses. Overall, optimization is performed in a conventional fashion by minimizing the mean squared error (MSE) of the predicted output maps with respect to the ground truth values across all output responses.

The predicted output maps should ideally have higher values at the spatial locations corresponding to body part positions, while penalizing predictions farther away from the correct location. As a result, the ground truth output maps must be carefully designed to enable proper convergence during training. We achieve this by progressively reducing the circumference around the true location that is rewarded, defined by the sigma parameter. Higher probabilities are assigned to positions closer to the ground truth position (Equation 4).
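A ground-truth confidence map of this kind can be generated with a 2D Gaussian centered on the annotated keypoint. The following numpy sketch illustrates the idea; the function and argument names are illustrative, not taken from the EfficientPose code:

```python
import numpy as np

def confidence_map(height, width, keypoint, sigma):
    """2D Gaussian confidence map peaking at `keypoint` = (x, y).

    Values decay with squared distance from the keypoint, so a smaller
    sigma rewards only a tighter circumference around the true location.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs - keypoint[0]) ** 2 + (ys - keypoint[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))
```

Progressively lowering `sigma` over the course of training then sharpens the target maps, forcing the model to localize keypoints more and more precisely.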


The proposed optimization scheme (Figure 5) incorporates a stepwise sigma scheme and utilizes stochastic gradient descent (SGD) with a decaying triangular cyclical learning rate (CLR) policy Smith (2017). The sigma parameter is normalized according to the output resolution. As suggested by Smith and Topin (2019), the large learning rates in CLR provide regularization during network optimization. This makes training more stable and may even increase training efficiency, which is valuable for network architectures, such as OpenPose and EfficientPose, that rely less heavily on pretraining (i.e., having larger portions of randomized weights). In our adoption of CLR, we utilize a cycle length of epochs. The learning rate () converges towards (Equation 5), where is the highest learning rate for which the model does not diverge during the first cycle and , whereas and are the initial and final sigma values, respectively.
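A decaying triangular CLR schedule of this kind can be sketched as follows. This is our own minimal reimplementation for illustration; the per-cycle `decay` factor and the function signature are assumptions, not the exact schedule used in the paper:

```python
def triangular_clr(epoch, cycle_length, max_lr, min_lr, decay=0.5):
    """Decaying triangular cyclical learning rate (per-epoch).

    Within each cycle the learning rate ramps linearly from min_lr up to
    the cycle's peak and back down; the peak amplitude is multiplied by
    `decay` after every completed cycle, so the triangles shrink over time.
    """
    cycle = epoch // cycle_length
    peak = min_lr + (max_lr - min_lr) * (decay ** cycle)
    pos = (epoch % cycle_length) / (cycle_length / 2.0)  # 0..2 within cycle
    frac = pos if pos <= 1.0 else 2.0 - pos              # triangle wave
    return min_lr + (peak - min_lr) * frac
```

For example, with a cycle length of 10 epochs the schedule peaks at `max_lr` mid-cycle and at half that amplitude in the next cycle, matching the shrinking triangles visible in Figure 5.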

Figure 5: Optimization scheme displaying the learning rates and sigma values corresponding to the training of EfficientPose II over epochs