The task of human pose estimation is to determine the precise pixel locations of body keypoints from a single input image [1, 2, 3, 4, 5, 6, 7]. Closely related tasks include 3D human pose estimation and human pose estimation in videos [9, 10]. Human pose estimation is important for many high-level computer vision tasks, including action and activity recognition [11, 12, 13], semantic content retrieval, human-computer interaction, motion capture, and animation. Estimating human poses from still images is a challenging task. An effective human pose estimation system must be able to handle large pose variations, changes in clothing and lighting conditions, severe body deformations, and heavy body occlusions [16, 17, 18]. A key question in addressing these problems is how to extract strong low- and mid-level appearance features that capture discriminative as well as relevant contextual information, and how to model complex part relationships in a way that allows effective yet efficient pose inference. Traditional methods for pose estimation are mostly based on Pictorial Structure (PS) models [19, 20, 21, 22, 23, 24], which model the spatial relations of rigid body parts using a tree model. A major drawback of such models is the need to hand-design the structure of the model in order to capture important problem-specific dependencies among the different output variables while still allowing tractable inference.
, especially for those applications with complicated loss functions.
Human pose estimation using deep neural networks requires us to map the input images with large variations into multiple body keypoints which must satisfy a set of geometric constraints and interdependence imposed by the human body model. This is a very challenging nonlinear manifold learning process in a very high dimensional feature space. We believe that the deep neural network, which is inherently an algebraic computation system, is not the most efficient way to capture highly sophisticated human knowledge, for example those highly coupled geometric characteristics and interdependence between keypoints in human poses.
In this work, we explore how external knowledge can be effectively represented and injected into deep neural networks to guide their training using learned projections, for more accurate and robust human pose estimation. Specifically, as illustrated in Fig. 3, we use an inception-resnet module and the stacked hourglass structure to construct a fractal network that regresses human pose images into heatmaps with no explicit graphical modeling. We encode external knowledge with visual features that characterize the constraints of human body models and evaluate the fitness of intermediate network output. We then inject these external features into the neural network using a projection matrix learned with an auxiliary cost function. The guidance from the external knowledge is used only during training and is turned off during inference. Its effect is therefore imposed implicitly, through the tuning of the network parameters, rather than through an explicit feature representation in the network. The injected features for pairs of limbs impose a strong prior during training, preventing keypoints of the target person from being connected to noise, e.g., keypoints of other people in the background who are not cropped out.
The major contributions of this work are summarized as follows: (1) We develop a new framework to represent and project human knowledge to guide the training of deep neural networks for human pose estimation. This external knowledge projection framework is generic and can be extended to other learning and training applications and to deep neural network design. (2) We propose an efficient network structure, called the fractal network, for human pose estimation to capture the multi-scale interdependence between body joints in the pose model. This fractal network uses an inception-resnet module as its building block.
The rest of the paper is organized as follows. Section II provides a brief review of recent work on human pose estimation. Section III introduces the concept of knowledge-guided learning, the structure of the fractal network, and the design of the inception-resnet module. Section IV summarizes the training and testing procedures. Section V presents our experimental results. Section VI concludes the paper.
II Related Work
II-A Structured Prediction and Graphical Models
Prior to the advent of neural networks, most work was based on pictorial structures, which model the human body as a collection of rigid templates and a set of pairwise potentials taking the form of a tree structure, thus allowing for efficient and exact inference at test time. Higher-level knowledge of the human body is exploited by modeling humans with body parts that are connected via a skeleton structure. The pictorial structure model [32, 31] models the spatial relations of rigid body parts using a tree model. A pre-defined kinematic body model is often used, under the assumption that each body part is independent of all the others except for the ones it is attached to. A major drawback of such models is the need to hand-design the structure of the model in order to capture important problem-specific dependencies among the different output variables while still allowing tractable inference.
Recent work includes sophisticated extensions such as mixture, hierarchical, multimodal, and strong appearance models [33, 20, 22, 19, 34], non-tree models [24, 23], and cascaded or sequential prediction models such as pose machines. Whereas earlier models represent each limb by a single template parameterized by location, orientation, shape parameters, and an appearance model, Yang and Ramanan propose mixtures of part templates in which each body part is represented by a set of deformable part templates. Although this approach performs well in comparison to classical pictorial structure models for human pose estimation, it has some limitations. For instance, the scanning-window templates it uses, trained with linear SVMs on HOG features, are very sensitive to noise. Hierarchical models [21, 22] represent the relationships between parts at different scales and sizes in a hierarchical tree structure. The underlying assumption of these models is that larger parts (which correspond to full limbs instead of joints) often have discriminative image structure that is easier to detect and consequently helps in reasoning about the location of smaller, harder-to-detect parts. Non-tree models [23, 24], on the other hand, augment the tree structure with additional edges that introduce loops in order to capture symmetry, occlusion, and long-range relationships. These methods usually have to rely on approximate inference both during learning and at test time.
II-B Deep Neural Networks for Human Pose Regression
A key feature of these approaches is that they integrate non-linear hierarchical feature extraction with the classification or regression task at hand, while also capitalizing on the very large data sets that are now readily available.
Since the DeepPose work of Toshev et al., research on human pose estimation has shifted from traditional approaches to deep neural networks (DNN) due to their superior performance. In the context of human pose estimation, it is natural to formulate the problem as a regression in which CNN features are regressed to provide joint predictions of the body parts [16, 10, 42]. For non-visible parts, learning the complex mapping from occluded part appearances to part locations is hard, and the network has to rely on contextual information provided by other visible parts to infer the occluded part locations. DeepPose uses a deep neural network to directly regress the coordinates of body joints. Tompson et al. argued that it is more efficient to use a DNN to regress heatmap images at multiple scales. While body models are not a necessary component for effective part localization, constraints between parts allow us to assemble independent detections into a body configuration. Detection-based methods rely on powerful CNN-based part detectors, which are then combined using a graphical model [43, 17] or refined using regression [44, 45]. Regression-based methods try to learn a mapping from image and CNN features to part locations. Promising results have also been achieved by combining CNN-based body part detectors with a body model.
Human pose estimation methods using deep neural networks have demonstrated significant advantages over traditional approaches. However, deeper and wider networks are often required to improve the feature representation power, which in turn makes the networks harder to train. Recently, residual learning has been used to significantly improve the performance of human pose estimation [46, 18], including for part detection. The stacked hourglass network elegantly extends fully convolutional networks and deconvolution nets with residual learning.
Intermediate supervision, recursive prediction, and inception design [27, 28] are among the other successful techniques that have been applied by recent methods for human pose estimation. Researchers have recognized that successive predictions, in which parts are sequentially refined, can boost the performance of pose estimation [16, 35, 51, 52]. In these models, an initial prediction is made of all the parts; in subsequent steps, all part predictions are refined based on the image and the earlier part predictions. Tompson et al. use a cascade of networks for refined predictions to achieve significantly improved precision in joint localization. Carreira et al. introduce a so-called Iterative Error Feedback scheme, where a set of predictions is included in the input and each pass through the network further refines these predictions. Their method requires multi-stage training, and the weights are shared between iterations. Adding supervision to intermediate layers of deep networks has also been explored to assist the training process [27, 53]. The methods in [18, 51, 46, 50] use intermediate supervision, adding auxiliary supervision branches to the network to assist the training process for human pose estimation. These approaches all employ the inception design by concatenating heatmaps from different stages or abstraction levels as the input for the next layers.
One direction for further improvement of human pose estimation is to design convolutional networks that can produce robust visual features. Multi-scale processing by repetitive down-sampling and up-sampling has been introduced in Stacked Hourglass Networks. Another approach to improving human pose estimation performance is to use explicit part-based models [24, 23, 33] or to implicitly encode the configuration model using its context. These methods involve additional sub-networks to detect parts, which increases the overall complexity. In this work, we leverage these ideas and approaches. We propose a fractal network structure that uses inception-resnet modules as building blocks to explore the multi-scale interdependence of human pose configuration and to capture these characteristics across different scales and resolutions. The network is fractal in that it reflects the co-occurrence of inception and residual design at both the highest and lowest levels of abstraction.
II-C Transfer Learning and Guided Training
Training such deep networks has proven to be challenging, and significant effort has been devoted to alleviating this problem. For instance, there is a line of work in which a student network is trained from scratch to mimic the behavior of a much larger teacher network. Starting from the work of Bucila et al. and the more general Knowledge Distillation (KD) approach of Hinton et al., knowledge transfer in the learning process has attracted a lot of research interest. In this paper, we consider a unique setting of the problem. Instead of transferring knowledge from teacher networks into a student network, we propose an external knowledge representation and projection framework to guide the training process of our deep neural network for human pose estimation. Specifically, we inject hand-designed features that are inferred from the ground truth as external knowledge to aid the training of a highly complex network with a deep structure and multiple loss functions. Inspired by prior work that proposed a locality principle to learn a task-specific feature mapping for shape regression, we project the external knowledge with a learned feature mapping. The procedure involves domain adaptation and model training simultaneously. Since the external knowledge is inferred from the ground truth, it is inherently more reliable and effective than the outputs of a teacher network.
III Proposed Method
III-A Network Structure and Design
Human pose estimation methods that use hand-crafted features or graphical structure models based on human knowledge lack flexibility in learning and the potential to achieve great representation power. On the other hand, purely data-driven neural networks may not be able to capture the sophisticated knowledge involved in human pose estimation. In this work, we propose to represent and inject external human knowledge to guide the learning of deep neural networks (DNN), as illustrated in Fig. 1. Our main idea is that, by enforcing constraints and guidance through external knowledge injection, high-level information about long-range dependencies between image and multi-part cues, which is hard to capture with implicit learning, can be better learned under the guidance of mid-level knowledge projection. As shown in Fig. 3, the projected knowledge affects the gradients propagated back to the convolutional layers during training, but it is not part of the network at test time.
We borrow ideas from inception-residual networks and construct a basic inception-resnet module to replace plain convolutional layers for more robust feature representation. The hourglass network was first introduced in prior work, where features are processed across all scales by repetitive down-sampling and up-sampling and then consolidated to best capture the various spatial relationships associated with the body. We introduce a modified version of the hourglass network built from the proposed inception-resnet module. As shown in Figure 2, we use the proposed inception-resnet modules and improved hourglass sub-networks to construct a fractal network that regresses human pose images into heatmaps with no explicit graphical modeling. The network is fractal in that it has the same network configuration at all levels of analysis and abstraction. This fractal network is designed to capture the multi-scale interdependence of human pose configuration and to represent these characteristics across different scales and resolutions. In the inception network, we perform channel-wise concatenation of two tensors from different sources. This encourages the information represented by the features stored in these tensors to be complementary, and directs the two sources to work on different concepts to produce a more robust union representation. In the ResNet model, we perform pixel-wise addition of two tensors with the same number of channels. From our experiments, we find that this network design allows us to train the network more effectively, since it requires two separate tensors to be simultaneously accurate in order to render the expected outputs.
III-B Fractal Network with Inception-resnet Modules
Our motivation for the fractal network design is that the network needs to focus on various scales across human parts and, at each scale, should also have an overall understanding of its receptive field. At higher levels, the network captures dependencies among various human parts. At lower levels, we use the same fractal design to capture regional dependencies. It is essential to capture local dependencies in addition to local appearances because, at a certain high-level scale, the receptive field may cover a human part as well as noise from other parts. These adjacent parts may belong to the same person or to other persons. Local dependencies are therefore helpful in providing more reliable features to higher-level networks.
The construction of the inception-resnet module is shown in Figure 4. Based on the original hourglass design shown in Figure 5, we develop an improved hourglass network as a mid-level sub-network that also uses inception-resnet modules as its basic units, as illustrated in Figure 6. To combine the advantages of both the inception and resnet designs, we use the inception-resnet module as the basic building block to analyze local fields, while using the improved hourglass network to capture the global information of different parts.
At the bottom level, we use the inception-resnet module as the basic structural unit of the network. It consists of convolutional layers, batch normalization layers, and ReLU units, with channel-wise concatenation and pixel-wise addition. Convolutional layers are padded such that the resolution of the output is the same as that of the input. Although the concatenation of two branches preserves different levels of information, the concatenated features across different channels need to be transformed and normalized by the subsequent convolutional layers. In the proposed inception-resnet module, the concatenation layer is therefore followed by another convolutional layer. The benefit of this module is that the input and output have the same resolution while the number of channels can be flexible.
At the sub-network level, we implement the recursive hourglass shown in Figure 5, so that the image is processed at four scales. The hourglass network is nested in itself. The first level of the hourglass in our network is an inception-resnet module. As illustrated in Figure 6, we also borrow the hourglass idea of down-sampling and then up-sampling the data, while using the inception-resnet module as the common building block. Pixel-wise addition fuses the information from the two branches while keeping the input and output resolution the same.
At the top fractal network level, the input images are down-sampled to a lower working resolution. Subsequently, the inputs and outputs of all modules, including the output heatmaps, share this resolution. The network captures and consolidates information across all scales of the image.
III-C External Knowledge Representation
The fractal network is used to boost the data representation power of the deep neural network for human pose estimation. As the network grows deeper and more complicated, it requires careful attention to the training process. Furthermore, we recognize that the deep neural network is inherently an algebraic computing system, which might not be the most efficient way to capture the highly sophisticated human knowledge during pose estimation, for example those highly coupled geometric constraints and interdependence among body joints. To address these two issues, in this work, we propose to encode and inject external knowledge into the fractal network to guide the training process of the network using learned projections, enforcing a prior during the training process.
In this work, we propose to inject the geometric representation of knowledge into the heatmap layer of the network. Since the heatmaps to be predicted are correlated with each other, as they largely share the parameters of earlier layers, a constraint on one heatmap influences the parameters of these layers and therefore has an impact on the training of the other heatmaps. We observe that the intermediate layers in our network encode low- and mid-level visual features; higher-level semantic features are hard to locate and explicitly interpret. It is easier to enforce the external knowledge and constraints on the predicted heatmaps. During the training process, the external knowledge and its visual representations are projected into the background and keypoint heatmaps using a projection matrix. We find that this type of knowledge-guided learning inherently enforces long-range dependencies and configurations among human joints, while leaving the flexibility of representation to the depth of the network and to the quality and quantity of the training data. In the following, we explain the proposed method in more detail.
During the training process, the external knowledge representation module illustrated in Figure 1 has access to the original training sample image and its ground-truth joint locations.
Specifically, during feature mapping, we perform a Hough transform on each line traversing two separate joints. In Hough space, each such line is represented by a single coordinate.
In order to represent the information in a less crisp manner, we convert the coordinates into a normalized vector representation. To incorporate the inherent learning of geometrical features such as angles and distance, we also inject the joint locations alongside each line. Based on the visibility of each joint, the line traversing it is encoded with the number of visible joints.
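As a minimal sketch of the geometric encoding described above, the snippet below maps the line through two joints to Hough coordinates and normalizes them. The normal-form parameterization (rho, theta), the normalization by the image diagonal and by pi, and the function names `hough_line` and `normalized_line_feature` are assumptions made for illustration; the text does not specify them.

```python
import math


def hough_line(p, q):
    """Return (rho, theta) of the line through points p and q.

    Assumed normal form: x*cos(theta) + y*sin(theta) = rho,
    with theta in [0, pi).
    """
    (x1, y1), (x2, y2) = p, q
    if (x1, y1) == (x2, y2):
        raise ValueError("two distinct joints are required")
    # The line normal is perpendicular to the limb direction.
    theta = (math.atan2(y2 - y1, x2 - x1) + math.pi / 2) % math.pi
    rho = x1 * math.cos(theta) + y1 * math.sin(theta)
    return rho, theta


def normalized_line_feature(p, q, width, height):
    """Scale rho by the image diagonal and theta by pi (assumed scheme)."""
    rho, theta = hough_line(p, q)
    diagonal = math.hypot(width, height)
    return [rho / diagonal, theta / math.pi]
```

For example, `hough_line((0, 3), (5, 3))` describes the horizontal line y = 3, whose normal points along the y axis at distance 3 from the origin.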
In addition to encoding geometric features, we encode image descriptors such as the Histogram of Oriented Gradients (HOG) around each pair of adjacent joints in order to capture visual features that complement the spatial dependencies. While we preserve the flexibility of deep convolutional features that automatically learn visual semantics, we use hand-crafted features to guide the learning by enforcing a strong prior during the training of the neural network. We noticed that human joints may be connected to those of an adjacent person, even when the ground truth joints are neither self-occluded nor object-occluded. We believe HOG features are helpful in observing edges and therefore in distinguishing real limbs from false ones. The injected features for pairs of limbs impose a strong prior during training, preventing keypoints of the target person from being connected to noise, e.g., keypoints of other people in the background who are not cropped out, which helps in learning body part interdependencies. As illustrated in Figure 7, the features are concatenated and normalized to form the external knowledge representation. For self-occluded and object-occluded joints, we mask the corresponding features with zeros. Specifically, we follow the traditional HOG feature extraction scheme, applying derivative filters horizontally and vertically to generate the two gradient maps. Instead of scanning a window of blocks and cells over the image, as is done traditionally, we locate limbs based on metadata from the training set and extract histograms of gradients for those regions. The magnitude and orientation of the gradient are respectively computed by:
We use bins for the pooling, followed by block normalization (L2-norm) to mitigate the effect of unbalanced area of regions:
Here a very small constant is included in the normalization to avoid division by zero.
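The limb-wise descriptor described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: it uses the conventional centered-difference filters [-1, 0, 1] and its transpose, unsigned orientations, 9 bins, an axis-aligned rectangular limb region, and a hypothetical function name `limb_hog`.

```python
import math


def limb_hog(image, region, bins=9, eps=1e-6):
    """Orientation histogram of gradients inside one limb region.

    image  : 2-D list of grayscale values (rows of equal length)
    region : (top, left, bottom, right), bottom/right exclusive
    bins   : number of unsigned-orientation bins (9 is an assumed default)
    eps    : small constant used in the L2 normalization
    """
    top, left, bottom, right = region
    height, width = len(image), len(image[0])
    hist = [0.0] * bins
    # Skip the one-pixel image border, where centered differences are undefined.
    for y in range(max(top, 1), min(bottom, height - 1)):
        for x in range(max(left, 1), min(right, width - 1)):
            # Centered differences: the [-1, 0, 1] filter and its transpose.
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            orientation = math.atan2(gy, gx) % math.pi  # unsigned angle
            index = min(int(orientation / math.pi * bins), bins - 1)
            hist[index] += magnitude  # magnitude-weighted vote
    # L2 block normalization; eps guards against division by zero.
    norm = math.sqrt(sum(v * v for v in hist) + eps * eps)
    return [v / norm for v in hist]
```

Because the histogram is normalized per region, limbs that cover different numbers of pixels yield descriptors on a comparable scale, which is the stated purpose of the block normalization.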
III-D Knowledge Projection into the Deep Neural Network
To decode the abstract external knowledge in a higher-dimensional space, we place 2 fully connected (FC) layers and 3 convolutional layers, for the geometric features and the edge features, between the projection representation and the injected knowledge to learn a linear projection. These layers are removed during testing, since it is undesirable to keep redundant layers.
We inject external features as knowledge via a global feature mapping function and learn a global linear projection by minimizing the loss from the knowledge projection layer:
where the first term is the regression target, the second term is a regularization on the projection, and a weighting coefficient controls the regularization strength. Regularization is necessary because the dimensionality of the features is very high. Since the objective function is quadratic with respect to the projection, we can always reach its global optimum.
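To illustrate why a regularized quadratic objective of this kind has a single global optimum, consider a generic ridge-regression form. The symbols here are assumptions introduced for this sketch and are not the paper's notation: $X$ stacks the network-side features column-wise, $Y$ stacks the corresponding mapped external features, $W$ is the linear projection, and $\lambda > 0$ is the regularization strength.

```latex
\min_{W} \; \lVert Y - W X \rVert_F^2 + \lambda \lVert W \rVert_F^2
\qquad\Longrightarrow\qquad
-2\,(Y - W X)\,X^{\top} + 2\lambda W = 0
\qquad\Longrightarrow\qquad
W^{*} = Y X^{\top} \left( X X^{\top} + \lambda I \right)^{-1}
```

For $\lambda > 0$ the matrix $X X^{\top} + \lambda I$ is positive definite, so the stationary point is unique and is the global minimum. In the method described here the projection is learned by gradient descent through auxiliary layers, so this closed form serves only to motivate the claim.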
Specifically, we enforce two loss functions, one for the injected geometric features and one for the limb-wise edge features. (1) The ground truth heatmap is convolved with 1x1 kernels, producing 8 channels of maps, with padding such that the resolution does not change. A fully connected layer that outputs a 224-dimensional geometric feature is added after this convolutional layer. We add an L2 loss (weighted by 0.05) between the geometric features and the features inferred from the ground truth. (2) We branch out from the 3rd inception-residual module at the early stage and feed its output to a series of convolutional layers with 1x1 kernels. The number of output channels is halved twice until it reaches 32 channels, followed by a fully connected layer. We add an L2 loss (weighted by 0.05) between the injected edge features and the inferred edge features.
Each anatomical landmark (which we refer to as a human joint) has a pixel location drawn from the set of all pixel locations in the image coordinate system. Our goal is to predict the image locations of all joints. The output heatmaps are predicted beliefs for assigning a pixel location to each joint, producing belief scores for all pixels in the heatmap of that joint:
In our experiments, we regress RGB images into a set of heatmaps, one per human joint plus one for the background. The heatmaps are then suppressed into joint locations with our proposed 3D-NMS algorithm, which is specially designed for human pose estimation. During training, we provide ground truth heatmaps for each joint by creating Gaussian peaks at the ground truth locations. The cost function we aim to minimize for the fractal network is given by:
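The construction of ground truth heatmaps with Gaussian peaks can be sketched as below. The peak width `sigma`, the unit peak height, the all-zero map for unannotated joints, and the definition of the background map as one minus the per-pixel maximum over joint maps are assumptions for illustration; the text only states that Gaussian peaks are placed at the ground truth locations.

```python
import math


def gaussian_heatmap(height, width, center, sigma=1.0):
    """Return a height x width map with a Gaussian peak at center=(x, y)."""
    cx, cy = center
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


def ground_truth_heatmaps(height, width, joints, sigma=1.0):
    """One map per joint, followed by a final background map.

    joints: list of (x, y) tuples, or None for joints without annotation.
    The background is taken as 1 - max over joint maps (an assumption).
    """
    maps = []
    for joint in joints:
        if joint is None:
            maps.append([[0.0] * width for _ in range(height)])
        else:
            maps.append(gaussian_heatmap(height, width, joint, sigma))
    background = [[1.0 - max((m[y][x] for m in maps), default=0.0)
                   for x in range(width)]
                  for y in range(height)]
    return maps + [background]
```

Any region that does not belong to a keypoint of the target person then receives a high background value, which matches the way other persons' limbs are treated as background in the ground truth.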
The overall loss for training is a weighted combination of heatmap cost and projection matrix fitness provided by knowledge-guided learning, with a control parameter on how much guidance should be imposed. The overall network is then trained to minimize the following joint loss function:
where the two terms are the loss from the knowledge projection layer and the fractal network loss, the weight parameter decays during training, and the minimization is over the trained parameters of the fractal network.
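A minimal sketch of such a weighted combination is shown below. The exponential decay schedule, the parameter names, and the example values are assumptions; the text states only that the guidance weight decays during training.

```python
def guidance_weight(initial_weight, decay_rate, iteration):
    """Exponentially decayed guidance weight (the schedule is an assumption)."""
    return initial_weight * (decay_rate ** iteration)


def joint_loss(fractal_loss, projection_loss, weight):
    """Heatmap loss plus the weighted loss of the knowledge projection layer."""
    return fractal_loss + weight * projection_loss
```

With a decaying weight, the external knowledge shapes the early parameter updates most strongly, and the heatmap loss dominates as training progresses.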
The output of knowledge projection layer will guide the training of fractal network by generating a strong and explicit gradient applied to backward path to the injection layer in the following form:
where the weight matrix is that of the injection layer in the fractal network. Note that this update occurs only during training. During testing, the knowledge representation and projection modules are removed.
III-E Cross-Heatmap Non-Maximum Suppression
In this work, we introduce a novel pose non-maximum suppression (NMS) algorithm specially designed for human pose estimation. Our experiments in Section V-D show that employing pose-NMS consistently renders better predictions for all models across iterations on both the MPII and LSP datasets. Instead of finding the maximum value at the pixel level to predict the joint location, as in [18, 50, 51, 54], we detect blobs with high responses in each heatmap and gather the blobs from all heatmaps for suppression. We first find the blob with the maximum response, then suppress the other blobs from the same heatmap, as well as blobs from other heatmaps that are very close to this blob in the image coordinate system. We repeat this procedure until all blobs are removed. The suppression takes place both in the image coordinate system and across channels, hence the name cross-heatmap NMS.
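The greedy procedure above can be sketched on a list of already-detected blobs. Blob detection itself is omitted, and the blob representation as `(channel, x, y, score)`, the Euclidean distance test, and the `radius` parameter are assumptions made for this illustration.

```python
import math


def cross_heatmap_nms(blobs, radius):
    """Greedy suppression over blobs gathered from all heatmaps.

    blobs : list of (channel, x, y, score) tuples
    radius: blobs in other channels within this distance are suppressed
    Returns {channel: (x, y, score)} with at most one blob per channel.
    """
    remaining = sorted(blobs, key=lambda b: b[3], reverse=True)
    selected = {}
    while remaining:
        channel, x, y, score = remaining.pop(0)  # strongest remaining blob
        selected[channel] = (x, y, score)
        # Drop blobs from the same heatmap, and blobs from other heatmaps
        # that lie too close to the selected blob in image coordinates.
        remaining = [
            b for b in remaining
            if b[0] != channel and math.hypot(b[1] - x, b[2] - y) > radius
        ]
    return selected
```

In this sketch a weaker response in one heatmap can still be selected when the strongest response of that heatmap coincides with a stronger blob of another joint, which is the behavior that distinguishes this scheme from a per-heatmap arg-max.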
IV Summary of Training and Testing Procedures
We summarize our training and testing procedures in Algorithms 1 and 2, respectively. There are around 250 convolutional layers in the original hourglass network, while the proposed network with inception-resnet modules consists of over 300 convolutional layers. Training the proposed network incurs additional cost from 1 external feature extraction module, 2 fully connected layers, 3 convolutional layers, and 2 additional loss layers. In our implementation, the hourglass network takes an average of 47 ms for a forward pass on a single Pascal TITAN X GPU. In comparison, the forward pass of the proposed network with inception-resnet modules takes 62 ms during testing.
V Experimental Results
For a comprehensive experimental analysis, we first introduce the datasets, evaluation criteria, and implementation details. We then present quantitative evaluations on benchmark datasets. Finally, we provide diagnostic experiments, an analysis of algorithm performance, and discussions.
V-A Datasets and Criteria
We evaluate the proposed method on two widely used benchmarks: MPII Human Pose and extended Leeds Sports Poses (LSP). The MPII Human Pose dataset includes about K images with k annotated poses. The images are collected from YouTube videos covering daily human activities with highly articulated human poses. The LSP dataset with extended training data consists of 11K training images and 1K testing images from sports activities.
Three criteria are used in the experiments to evaluate the performance of the proposed human pose estimation approach: Percentage of Correct Parts (PCP) [33, 62, 63], Percentage of Detected Joints (PDJ) [16, 19, 33], and Percentage of Correct Keypoints (PCK).
A widely used criterion for human pose estimation is PCP, which evaluates the localization accuracy of body parts (sticks of the skeleton). It requires the estimated part end points to be within half of the part length from the ground truth part end points. As pointed out by Yang and Ramanan, some previous work requires only the average of the endpoints of a part to be correct (PCP-average), rather than both endpoints (PCP-strict). Moreover, the early PCP implementation selects the best matched output without penalizing false positives. In all our experiments, we adopt the strictest measure, i.e., PCP-strict with a single output, unless otherwise specified. More detailed descriptions of PCP can be found in the cited literature.
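The difference between PCP-strict and PCP-average can be made concrete with a short sketch. The data layout (each part as a pair of end points), the interpretation of PCP-average as thresholding the mean end-point error, and the function name `pcp` are assumptions for illustration.

```python
import math


def _dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def pcp(pred_parts, gt_parts, strict=True, fraction=0.5):
    """Fraction of correctly localized parts.

    pred_parts, gt_parts: lists of ((x1, y1), (x2, y2)) part end points.
    strict=True  : both end points must lie within fraction * part length.
    strict=False : only the mean end-point error must (PCP-average).
    """
    correct = 0
    for (pa, pb), (ga, gb) in zip(pred_parts, gt_parts):
        limit = fraction * _dist(ga, gb)
        err_a, err_b = _dist(pa, ga), _dist(pb, gb)
        if strict:
            ok = err_a <= limit and err_b <= limit
        else:
            ok = (err_a + err_b) / 2.0 <= limit
        correct += ok
    return correct / len(gt_parts)
```

A part of length 10 whose end points are off by 1 and 6 pixels passes the average test (mean error 3.5) but fails the strict one, since the second end point exceeds the 5-pixel limit.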
Though PCP was initially the preferred criterion for evaluation, it has the drawback of penalizing shorter limbs, such as lower arms. PDJ was therefore introduced [16, 19] to measure the detection rate of body joints, where a joint is considered detected if the distance between the detected joint and the true joint is less than a fraction of the torso diameter. The torso diameter is usually defined as the distance between opposing joints on the human torso, such as the left shoulder and the right hip. The Area Under Curve (AUC) can be used as the overall evaluation of the PDJ curve. In the following experiments, we report AUC as our PDJ performance.
The PCK measure is very similar to the PDJ criterion. The only difference is that the torso diameter is replaced with the maximum side length of the external rectangle of ground truth body joints. For full body images with extreme pose (especially when the torso becomes very small), the PCK may be more suitable to evaluate the accuracy of body part localization.
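Since PDJ and PCK differ only in the reference length, both can be sketched with one detection-rate function and two reference-length helpers. The function names and the list-of-points data layout are assumptions for illustration.

```python
import math


def _dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def detection_rate(pred, gt, reference_length, fraction):
    """Share of joints closer to the truth than fraction * reference_length."""
    threshold = fraction * reference_length
    hits = sum(1 for p, g in zip(pred, gt) if _dist(p, g) < threshold)
    return hits / len(gt)


def torso_diameter(left_shoulder, right_hip):
    """PDJ reference: distance between opposing torso joints."""
    return _dist(left_shoulder, right_hip)


def bounding_box_max_side(joints):
    """PCK reference: longer side of the box enclosing the true joints."""
    xs = [x for x, _ in joints]
    ys = [y for _, y in joints]
    return max(max(xs) - min(xs), max(ys) - min(ys))
```

Passing `torso_diameter(...)` as the reference length gives a PDJ-style score, while passing `bounding_box_max_side(gt)` gives a PCK-style score that stays meaningful when the torso appears very small.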
In our experiments, we follow the official benchmark evaluation protocols (http://human-pose.mpi-inf.mpg.de/#evaluation). The official benchmark on the MPII dataset adopts PCKh (using a portion of the head length as reference), while the official benchmark on the LSP dataset adopts both PCP and PCK. The LSP benchmark provides comparisons on both Observer-Centric (OC) and Person-Centric (PC) evaluations, of which the most widely adopted evaluation protocol is PCK-PC. In addition, both benchmarks adopt AUC scores.
V-B Implementation Details
| Ours (no guidance) | 97.9 | 93.2 | 89.1 | 86.4 | 94.5 | 93.8 | 92.9 | 92.6 |
| Ours (with guidance) | 98.2 | 94.4 | 91.8 | 89.3 | 94.7 | 95.0 | 93.5 | 93.9 |
V-B1 Data Augmentation
We crop the images so that the target human is centered and appears at roughly the same scale, and warp each image patch to a fixed size. We then randomly rotate and flip the images, and perform random re-scaling and color jittering, to make the model more robust to scale and illumination changes.
V-B2 Experimental Settings
We use a modified version of Caffe whose data layer produces three kinds of outputs: the augmented image, the correspondingly transformed ground-truth heatmaps, and the injected knowledge for the augmented image. The knowledge projection is switched off during testing. The parameters are optimized with the RMSprop algorithm. Starting from the initial learning rate, we divide the learning rate by 2 whenever the validation performance plateaus, until the minimum learning rate is reached. We train the model on the merged dataset of MPII and extended LSP using 4 Pascal TITAN GPUs, and adopt Tompson’s validation split of the MPII dataset to monitor the training process. The same model is used for testing on both the MPII and LSP test sets.
Previous work has observed that there is a prior towards the background that forces the network output to converge to zero. It is therefore important to weight the gradient responses so that the foreground and background heatmap pixels contribute equally to the parameter update. In our training process, we weight the foreground and background pixels accordingly.
The neural network takes cropped image patches, or regions of interest (ROIs) of the images, as inputs. In some cases, a cropped patch or ROI contains limbs from other persons. Our ground truth simply ignores these limbs: any region that does not belong to a keypoint of the target person is treated as background in the ground-truth heatmap. Since the target person is always centered in the cropped image or ROI, this enforces a prior during training. As a result, limbs from other persons usually produce lower responses in the predicted heatmaps.
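The foreground/background weighting discussed in this subsection can be illustrated with a small NumPy sketch. It is not the Caffe loss layer used in our system: the function name, the threshold `fg_threshold` that separates foreground from background pixels, and the choice to give each group a total weight of 0.5 are assumptions made for this example.

```python
import numpy as np

def balanced_heatmap_loss(pred, target, fg_threshold=0.01):
    """Squared-error heatmap loss in which foreground and background pixels
    receive the same total weight, regardless of how many pixels each has."""
    fg = target > fg_threshold                     # foreground mask (assumed rule)
    n_fg = max(int(fg.sum()), 1)                   # avoid division by zero
    n_bg = max(int((~fg).sum()), 1)
    weights = np.empty(target.shape, dtype=np.float64)
    weights[fg] = 0.5 / n_fg                       # foreground weights sum to 0.5
    weights[~fg] = 0.5 / n_bg                      # background weights sum to 0.5
    return float(np.sum(weights * (pred - target) ** 2))
```

Without such weighting, the many background pixels dominate the loss, and an all-zero prediction already achieves a low error.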
During testing, we follow the standard routine of cropping image patches using the given rough position and scale of the test person for the MPII dataset. For the LSP dataset, we use the image size as the rough scale and the image center as the rough position of the target person when cropping the image patches. Before feeding the images into the neural network, we further pre-process them with normalization and pixel-wise subtraction of the estimated mean value. All experimental results are produced from pyramids of the original and flipped images at 2 scales (1 and 0.75). Note that we swap the heatmaps of left and right limbs before merging the corresponding heatmaps for each joint. The merged heatmaps are transformed into joint coordinates by the proposed cross-heatmap non-maximum suppression method. The feed-forward time of the network during testing is 62 ms on a single Pascal TITAN X GPU.
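The left/right swap performed before merging can be sketched as follows. This is a minimal NumPy illustration, not our testing code: the heatmap layout (J joints, height, width), the function name, the `swap_pairs` list of (left, right) channel indices, and the use of a plain average for merging are assumptions. Multi-scale merging would additionally require resizing the heatmaps to a common resolution, which is omitted here.

```python
import numpy as np

def merge_flipped(heat, heat_flipped, swap_pairs):
    """Merge heatmaps predicted on an image and on its horizontal mirror.

    heat, heat_flipped: arrays of shape (J, H, W).
    swap_pairs: hypothetical list of (left_index, right_index) channel pairs.
    """
    restored = heat_flipped[:, :, ::-1].copy()     # mirror back to original frame
    for left, right in swap_pairs:
        # A left limb in the mirrored image is a right limb in the original.
        restored[[left, right]] = restored[[right, left]]
    return 0.5 * (heat + restored)
```

If the swap were omitted, the mirrored prediction for a left wrist would be averaged into the right-wrist heatmap and blur both peaks.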
V-C Benchmark Evaluation
We use the Percentage of Correct Keypoints (PCK) metric for comparisons on the LSP dataset, and the PCKh measure, in which the error tolerance is normalized with respect to head size, for comparisons on the MPII Human Pose dataset. We train our model by adding the MPII training set to the extended LSP training set with person-centric (PC) annotations, which is a standard routine [50, 45, 46, 51].
V-C1 Results on the MPII Human Pose Dataset
The AUC score of our network for MPII dataset is .
Table II compares the PCKh performance of our method with previous state-of-the-art methods at a normalized distance of 0.5. Our total PCKh-0.5 score achieves state-of-the-art performance at 91.2%. We apply all the techniques described in Section V-D during testing. Note that we test at the same multiple scales (1 and 0.75) as those used on the LSP dataset, which may not be ideal. When cropping the images with the given scale of the MPII dataset, the feet are cropped out in some images, which leads to a comparatively lower detection rate for ankles.
|Method||Head||Shoulder||Elbow||Wrist||Hip||Knee||Ankle||Total|
|Newell et al., ECCV’16||98.2||96.3||91.2||87.1||90.1||87.4||83.6||90.9|
|Bulat&Tzimiropoulos, ECCV’16 ||97.9||95.1||89.9||85.3||89.4||85.7||81.7||89.7|
|Wei et al., CVPR’16 ||97.8||95.0||88.7||84.0||88.4||82.8||79.4||88.5|
|Insafutdinov et al., ECCV’16 ||96.8||95.2||89.3||84.4||88.4||83.4||78.0||88.5|
|Rafi et al., BMVC’16 ||97.2||93.9||86.4||81.3||86.8||80.6||73.4||86.3|
|Gkioxari et al., ECCV’16||96.2||93.1||86.7||82.1||85.2||81.4||74.1||86.1|
|Lifshitz et al., ECCV’16 ||97.8||93.3||85.7||80.4||85.3||76.6||70.2||85.0|
|Pishchulin et al., CVPR’16 ||94.1||90.2||83.4||77.3||82.6||75.7||68.6||82.4|
|Hu&Ramanan, CVPR’16 ||95.0||91.6||83.0||76.6||81.9||74.5||69.5||82.4|
|Tompson et al., CVPR’15 ||96.1||91.9||83.9||77.8||80.9||72.3||64.8||82.0|
|Carreira et al., CVPR’16 ||95.7||91.7||81.7||72.4||82.8||73.2||66.4||81.3|
|Tompson et al., NIPS’14 ||95.8||90.3||80.5||74.3||77.6||69.7||62.8||79.6|
|Pishchulin et al., ICCV’13 ||74.3||49.0||40.8||34.1||36.5||34.4||35.2||44.1|
V-C2 Results on the Leeds Sports Pose Dataset
The AUC score of our network for LSP dataset is .
Table III reports the PCK at the benchmark threshold, and Fig. 9 shows the PCK over a range of thresholds. Our approach achieves state-of-the-art performance with a PCK value of 93.9%, and outperforms all existing methods on each body part prediction.
|Method||Head||Shoulder||Elbow||Wrist||Hip||Knee||Ankle||Total|
|Bulat&Tzimiropoulos, ECCV’16|
|Wei et al., CVPR’16||97.8||92.5||87.0||83.9||91.5||90.8||89.9||90.5|
|Insafutdinov et al., ECCV’16||97.4||92.7||87.5||84.4||91.5||89.9||87.2||90.1|
|Pishchulin et al., CVPR’16||97.0||91.0||83.8||78.1||91.0||86.7||82.0||87.1|
|Lifshitz et al., ECCV’16||96.8||89.0||82.7||79.1||90.9||86.0||82.5||86.7|
|Belagiannis&Zisserman, FG’17||95.2||89.0||81.5||77.0||83.7||87.0||82.8||85.2|
|Yu et al., ECCV’16||87.2||88.2||82.4||76.3||91.4||85.8||78.7||84.3|
|Rafi et al., BMVC’16||95.8||86.2||79.3||75.0||86.6||83.8||79.8||83.8|
|Yang et al., CVPR’16||90.6||78.1||73.8||68.8||74.8||69.9||58.9||73.6|
|Chen&Yuille, NIPS’14||91.8||78.2||71.8||65.5||73.3||70.2||63.4||73.4|
|Fan et al., CVPR’15||92.4||75.2||65.3||64.0||76.7||68.3||70.4||73.0|
|Tompson et al., NIPS’14||90.6||79.2||67.9||63.4||69.5||71.0||64.2||72.3|
|Pishchulin et al., ICCV’13||87.2||56.7||46.7||38.0||61.0||57.5||52.7||57.1|
|Wang&Li et al., CVPR’13||84.7||57.1||43.7||36.7||56.7||52.4||50.8||54.6|
Table IV reports the PCP at the benchmark threshold.
|Method||Torso||Upper Leg||Lower Leg||Upper Arm||Forearm||Head||Total|
|Bulat&Tzimiropoulos, ECCV’16|
|Wei et al., CVPR’16||98.0||82.2||89.1||85.8||77.9||95.0||88.3|
|Insafutdinov et al., ECCV’16||97.0||90.6||86.9||86.1||79.5||95.4||87.8|
|Yu et al., ECCV’16||98.0||93.1||88.1||82.9||72.6||83.0||85.4|
|Pishchulin et al., CVPR’16||97.0||88.8||82.0||82.4||71.8||95.8||84.3|
|Lifshitz et al., ECCV’16||97.3||88.8||84.4||80.6||71.4||94.8||84.3|
|Belagiannis&Zisserman, FG’17||96.0||86.7||82.2||79.4||69.4||89.4||82.1|
|Rafi et al., BMVC’16||97.6||87.3||80.2||76.8||66.2||93.3||81.2|
|Yang et al., CVPR’16||95.6||78.5||71.8||72.2||61.8||83.9||74.8|
|Chen&Yuille, NIPS’14||96.0||77.2||72.2||69.7||58.1||85.6||73.6|
|Fan et al., CVPR’15||95.4||77.7||69.8||62.8||49.1||86.6||70.1|
|Tompson et al., NIPS’14||90.3||70.4||61.1||63.0||51.2||83.7||66.6|
|Pishchulin et al., ICCV’13||88.7||63.6||58.4||46.0||35.2||85.1||58.0|
|Wang&Li et al., CVPR’13||87.5||56.0||55.8||43.1||32.1||79.1||54.1|
V-D Algorithm Performance Analysis and Ablation Study
Since the test-set ground truth of the MPII dataset is not publicly available and frequent submission of MPII test results to the official evaluation server is not permitted, we perform the component analysis of our proposed method on the LSP dataset. We analyze the contribution of each component in Table I.
We compare the proposed inception-resnet module with the basic resnet module employed by stacked hourglass networks. Since the performance of the latter is not reported on the LSP dataset, we implement their network within our system to enable a fair comparison. Under identical settings, our network with the inception-resnet module achieves higher accuracy than the one with the basic resnet module. We also compare our network under standard training with the same network under knowledge projection and guided learning. The results show that knowledge-guided training yields better accuracy. We then analyze the contributions of the other techniques, which are employed mainly during testing: flipping the image, testing the image at multiple scales, and using the proposed NMS algorithm for pose estimation. Testing on both the original and flipped images improves performance by 0.7%, and testing at multiple scales improves performance further. Cross-heatmap non-maximum suppression provides an additional improvement in the PCK value.
It should be noted that our implementation in PyCaffe (code and models available at: http://github.com/Guanghan/GNet-pose) may not fully reproduce the performance on the MPII dataset of the hourglass network, which is implemented in Torch. Nevertheless, our performance analysis shows that the proposed knowledge-guided training is able to improve the performance of an existing deep neural network. We expect that the same performance gain can be achieved on other network structures.
VI Conclusion
In this work, we have proposed to encode and inject external human knowledge into deep neural networks to guide their training process with learned projections for more effective human pose estimation. We adopt the stacked hourglass design and propose to use inception-resnet as the building block of our fractal network to regress human pose into heatmaps with no explicit graphical modeling. Utilizing a multi-resolution feature representation with guided learning, the network learns an empirical set of low- and high-level features which are typically more tolerant to variations in the training set. Knowledge-guided learning is a generic scheme that can potentially be used to aid other deep neural network training tasks. The effectiveness of the proposed inception-resnet module and the benefit of guided learning with knowledge projection are evaluated on two widely used benchmarks.
References
-  L. Fu, J. Zhang, and K. Huang, “Orgm: Occlusion relational graphical model for human pose estimation,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 927–941, 2017.
-  M. Dantone, J. Gall, C. Leistner, and L. Van Gool, “Body parts dependent joint regressors for human pose estimation in still images,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 11, pp. 2131–2143, 2014.
-  X. Zhang, C. Li, W. Hu, X. Tong, S. Maybank, and Y. Zhang, “Human pose estimation and tracking via parsing a tree structure based human model,” IEEE Transactions On Systems, Man, And Cybernetics: Systems, vol. 44, no. 5, pp. 580–592, 2014.
-  H. Jiang, “Human pose estimation using consistent max covering,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 9, pp. 1911–1918, 2011.
-  L. Zhao, X. Gao, D. Tao, and X. Li, “Learning a tracking and estimation integrated graphical model for human pose tracking,” IEEE transactions on neural networks and learning systems, vol. 26, no. 12, pp. 3176–3186, 2015.
-  Q. Li, F. He, T. Wang, L. Zhou, and S. Xi, “Human pose estimation by exploiting spatial and temporal constraints in body-part configurations,” IEEE Access, vol. 5, pp. 443–454, 2017.
-  M. Eichner and V. Ferrari, “Human pose co-estimation and applications,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 11, pp. 2282–2288, 2012.
-  V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic, “3d pictorial structures revisited: Multiple human pose estimation,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 10, pp. 1929–1942, 2016.
-  F. Zhou and F. De la Torre, “Spatio-temporal matching for human pose estimation in video,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 8, pp. 1492–1504, 2016.
-  T. Pfister, J. Charles, and A. Zisserman, “Flowing convnets for human pose estimation in videos,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1913–1921.
-  N. Ikizler-Cinbis and S. Sclaroff, “Web-based classifiers for human action recognition,” IEEE Transactions on Multimedia, vol. 14, no. 4, pp. 1031–1045, 2012.
-  A. Marcos-Ramiro, D. Pizarro, M. Marron-Romera, and D. Gatica-Perez, “Let your body speak: Communicative cue extraction on natural interaction using rgbd data,” IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1721–1732, 2015.
-  X. Cai, W. Zhou, L. Wu, J. Luo, and H. Li, “Effective active skeleton representation for low latency human action recognition,” IEEE Transactions on Multimedia, vol. 18, no. 2, pp. 141–154, 2016.
-  R. Ren and J. Collomosse, “Visual sentences for pose retrieval over low-resolution cross-media dance collections,” IEEE Transactions on Multimedia, vol. 14, no. 6, pp. 1652–1661, 2012.
-  H. Kadu and C.-C. J. Kuo, “Automatic human mocap data classification,” IEEE Transactions on Multimedia, vol. 16, no. 8, pp. 2191–2202, 2014.
-  A. Toshev and C. Szegedy, “Deeppose: Human pose estimation via deep neural networks,” in CVPR, 2014.
-  J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, “Joint training of a convolutional network and a graphical model for human pose estimation,” in NIPS, 2014.
-  A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in ECCV, 2016.
-  B. Sapp and B. Taskar, “Modec: Multimodal decomposable models for human pose estimation,” in CVPR, 2013.
-  L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele, “Poselet conditioned pictorial structures,” in CVPR, 2013.
-  M. Sun and S. Savarese, “Articulated part-based model for joint object detection and pose estimation,” in ICCV, 2011.
-  Y. Tian, C. L. Zitnick, and S. G. Narasimhan, “Exploring the spatial hierarchy of mixture models for human pose estimation,” in ECCV, 2012.
-  M. Dantone, J. Gall, C. Leistner, and L. Van Gool, “Human pose estimation using body parts dependent joint regressors,” in CVPR, 2013.
-  L. Karlinsky and S. Ullman, “Using linking features in learning non-parametric part models,” in ECCV, 2012.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR, 2015.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in CVPR, 2016.
-  J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Advances in neural information processing systems, 2014, pp. 2654–2662.
-  A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” ICLR, 2015.
-  P. F. Felzenszwalb and D. P. Huttenlocher, “Pictorial structures for object recognition,” International journal of computer vision, vol. 61, no. 1, pp. 55–79, 2005.
-  M. A. Fischler and R. A. Elschlager, “The representation and matching of pictorial structures,” IEEE Transactions on computers, vol. 100, no. 1, pp. 67–92, 1973.
-  Y. Yang and D. Ramanan, “Articulated human detection with flexible mixtures of parts,” TPAMI, vol. 35, no. 12, pp. 2878–2890, 2013.
-  L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele, “Strong appearance and expressive spatial models for human pose estimation,” in ICCV, 2013.
-  V. Ramakrishna, D. Munoz, M. Hebert, J. A. Bagnell, and Y. Sheikh, “Pose machines: Articulated pose estimation via inference machines,” in ECCV, 2014.
-  N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.
-  X. Zhu, C. Vondrick, D. Ramanan, and C. C. Fowlkes, “Do we need more training data or better models for object detection?.” in BMVC, vol. 3. Citeseer, 2012, p. 5.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 4, pp. 640–651, 2017.
-  V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab, “Robust optimization for deep regression,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2830–2838.
-  X. Chen and A. L. Yuille, “Articulated pose estimation by a graphical model with image dependent pairwise relations,” in NIPS, 2014.
-  J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, “Efficient object localization using convolutional networks,” in CVPR, 2015.
-  L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele, “Deepcut: Joint subset partition and labeling for multi person pose estimation,” in CVPR, 2016.
-  E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, “Deepercut: A deeper, stronger, and faster multi-person pose estimation model,” in ECCV, 2016.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
-  M. D. Zeiler, G. W. Taylor, and R. Fergus, “Adaptive deconvolutional networks for mid and high level feature learning,” in ICCV, 2011.
-  L. Wang, C.-Y. Lee, Z. Tu, and S. Lazebnik, “Training deeper convolutional networks with deep supervision,” arXiv preprint arXiv:1505.02496, 2015.
-  V. Belagiannis and A. Zisserman, “Recurrent human pose estimation,” arXiv preprint arXiv:1605.02914, 2016.
-  S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in CVPR, 2016.
-  J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik, “Human pose estimation with iterative error feedback,” in CVPR, 2016.
-  C.-Y. Lee, S. Xie, P. W. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets.” in AISTATS, vol. 2, no. 3, 2015, p. 5.
-  A. Bulat and G. Tzimiropoulos, “Human pose estimation via convolutional part heatmap regression,” in ECCV, 2016.
-  D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent, “The difficulty of training deep architectures and the effect of unsupervised pre-training.” in AISTATS, vol. 5, 2009, pp. 153–160.
-  C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006, pp. 535–541.
-  G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
-  S. Ren, X. Cao, Y. Wei, and J. Sun, “Face alignment at 3000 fps via regressing local binary features,” in CVPR, 2014.
-  C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” arXiv preprint arXiv:1602.07261, 2016.
-  M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2d human pose estimation: New benchmark and state of the art analysis,” in CVPR, 2014.
-  S. Johnson and M. Everingham, “Clustered pose and nonlinear appearance models for human pose estimation.” in BMVC, 2010.
-  V. Ferrari, M. Marin-Jimenez, and A. Zisserman, “Progressive search space reduction for human pose estimation,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.
-  M. Eichner, M. Marin-Jimenez, A. Zisserman, and V. Ferrari, “2d articulated human pose estimation and retrieval in (almost) unconstrained still images,” International journal of computer vision, vol. 99, no. 2, pp. 190–214, 2012.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014.
-  T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, 2012.
-  P. H. Pinheiro and R. Collobert, “Recurrent convolutional neural networks for scene labeling.” in ICML, 2014.
-  U. Rafi, I. Kostrikov, J. Gall, and B. Leibe, “An efficient convolutional network for human pose estimation,” in BMVC, 2016.
-  G. Gkioxari, A. Toshev, and N. Jaitly, “Chained predictions using convolutional neural networks,” in ECCV, 2016.
-  I. Lifshitz, E. Fetaya, and S. Ullman, “Human pose estimation using deep consensus voting,” in ECCV, 2016.
-  P. Hu and D. Ramanan, “Bottom-up and top-down reasoning with hierarchical rectified gaussians,” in CVPR, 2016.
-  X. Yu, F. Zhou, and M. Chandraker, “Deep deformation network for object landmark localization,” in ECCV, 2016.
-  W. Yang, W. Ouyang, H. Li, and X. Wang, “End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation,” in CVPR, 2016.
-  X. Fan, K. Zheng, Y. Lin, and S. Wang, “Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation,” in CVPR, 2015.
-  F. Wang and Y. Li, “Beyond physical connections: Tree models in human pose estimation,” in CVPR, 2013.
-  R. Collobert, K. Kavukcuoglu, and C. Farabet, “Torch7: A matlab-like environment for machine learning,” in BigLearn, NIPS Workshop, 2011.