1 Introduction
Despite a long history of prior work, human body pose estimation, or specifically the localization of human joints in monocular RGB images, remains a very challenging task in computer vision. Complex joint interdependencies, partial or full joint occlusions, variations in body shape, clothing or lighting, and unrestricted viewing angles result in a very high dimensional input space, making naive search methods intractable.
Recent approaches to this problem fall into two broad categories: 1) more traditional deformable part models [sapp13cvpr] and 2) deep-learning based discriminative models [jainiclr2014, deeppose]. Bottom-up part-based models are a common choice for this problem since the human body naturally segments into articulated parts. Traditionally these approaches have relied on the aggregation of hand-crafted low-level features such as SIFT [lowe1999object] or HoG [Dalal2005], which are then input to a standard classifier or a higher level generative model. Care is taken to ensure that these engineered features are sensitive to the part that they are trying to detect and are invariant to numerous deformations in the input space (such as variations in lighting). On the other hand, discriminative deep-learning approaches learn an empirical set of low and high-level features which are typically more tolerant to variations in the training set and have recently outperformed part-based models [sapp13cvpr]. However, incorporating priors about the structure of the human body (such as our prior knowledge about joint inter-connectivity) into such networks is difficult since the low-level mechanics of these networks are often hard to interpret.
In this work we attempt to combine a Convolutional Network (ConvNet) PartDetector, which alone outperforms all other existing methods, with a part-based SpatialModel into a unified learning framework. Our translation-invariant ConvNet architecture utilizes a multi-resolution feature representation with overlapping receptive fields. Additionally, our SpatialModel is able to approximate MRF loopy belief propagation, which is subsequently back-propagated through, and learned using the same learning framework as the PartDetector. We show that the combination and joint training of these two models improves performance, and allows us to significantly outperform existing state-of-the-art models on the task of human body pose recognition.
2 Related Work
For unconstrained image domains, many architectures have been proposed, including “shape-context” edge-based histograms from the human body [mori2002estimating] or just silhouette features [Grauman2003]. Many techniques have been proposed that extract, learn, or reason over entire body features. Some use a combination of local detectors and structural reasoning ([ramanan2005strike] for coarse tracking and [buehler2009learning] for person-dependent tracking). In a similar spirit, more general techniques using “Pictorial Structures” such as the work by Felzenszwalb et al. [felzenszwalb2008discriminatively] made this approach tractable with so-called ‘Deformable Part Models (DPM)’. Subsequently a large number of related models were developed [andriluka2009pictorial, Eichner:2009:BAM, yang11cvpr, dantone13cvpr]. Algorithms which model more complex joint relationships, such as Yang and Ramanan [yang11cvpr], use a flexible mixture of templates modeled by linear SVMs. Johnson and Everingham [johnson11cvpr] employ a cascade of body part detectors to obtain more discriminative templates. Most recent approaches aim to model higher-order part relationships. Pishchulin [pishchulin13cvpr, pishchulin13iccv] proposes a model that augments the DPM model with Poselet [PoseletsICCV09] priors. Sapp and Taskar [sapp13cvpr] propose a multi-modal model which includes both holistic and local cues for mode selection and pose estimation. Following the Poselets approach, the Armlets approach by Gkioxari et al. [Gkioxari:2013:APE] employs a semi-global classifier for part configuration, and shows good performance on real-world data; however, it is tested only on arms. Furthermore, all these approaches suffer from the fact that they use hand-crafted features such as HoG features, edges, contours, and color histograms.
The best performing algorithms today for many vision tasks, and human pose estimation in particular [deeppose, jainiclr2014, tompsonTOG14], are based on deep convolutional networks. Toshev et al. [deeppose] show state-of-the-art performance on the ‘FLIC’ [sapp13cvpr] and ‘LSP’ [Johnson10] datasets. However, their method suffers from inaccuracy in the high-precision region, which we attribute to inefficient direct regression of pose vectors from images, which is a highly non-linear and difficult to learn mapping.
Joint training of neural networks and graphical models has been previously reported by Ning et al. [ning05] for image segmentation, and by various groups in speech and language modeling [bourlard1995remap, Morin05]. To our knowledge no such model has been successfully used for the problem of detecting and localizing body part positions of humans in images. Recently, Ross et al. [ross_learning_message_passing] use a message-passing inspired procedure for structured prediction on computer vision tasks, such as 3D point cloud classification and 3D surface estimation from single images. In contrast to this work, we formulate our message-passing inspired network in a way that is more amenable to back-propagation and so can be implemented in existing neural networks. Heitz et al. [Heitz_cascadedclassification] train a cascade of off-the-shelf classifiers for simultaneously performing object detection, region labeling, and geometric reasoning. However, because of the forward nature of the cascade, a later classifier is unable to encourage earlier ones to focus their effort on fixing certain error modes, or allow the earlier classifiers to ignore mistakes that can be undone by classifiers further in the cascade. Bergtholdt et al. [bergtholdt2010study] propose an approach for object class detection using a parts-based model where they are able to create a fully connected graph on parts and perform MAP-inference using search, but rely on SIFT and color features to create the unary and pairwise potentials.
3 Model
3.1 Convolutional Network PartDetector
The first stage of our detection pipeline is a deep ConvNet architecture for body part localization. The input is an RGB image containing one or more people and the output is a heatmap, which produces a per-pixel likelihood for key joint locations on the human skeleton.
A sliding-window ConvNet architecture is shown in Fig 1. The network is slid over the input image to produce a dense heatmap output for each body joint. Our model incorporates a multi-resolution input with overlapping receptive fields. The upper convolution bank in Fig 1 sees a standard 64x64 resolution input window, while the lower bank sees a larger 128x128 input context down-sampled to 64x64. The input images are then Local Contrast Normalized (LCN [torch7]) (after down-sampling with anti-aliasing in the lower resolution bank) to produce an approximate Laplacian pyramid. The advantage of using overlapping contexts is that it allows the network to see a larger portion of the input image with only a moderate increase in the number of weights. The role of the Laplacian pyramid is to provide each bank with non-overlapping spectral content, which minimizes network redundancy.
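As a rough illustration of the two input banks described above, the following NumPy sketch builds a 64x64 window and a 128x128 context down-sampled to 64x64 around the same pixel. It is a minimal sketch under stated assumptions, not the network's actual preprocessing: the helper names are hypothetical, the image is single-channel, 2x2 block averaging stands in for the anti-aliased down-sampling, and a whole-window mean/variance normalization stands in for true Local Contrast Normalization.

```python
import numpy as np

def crop_centered(image, cy, cx, size):
    """Crop a size x size window around (cy, cx), zero-padding outside the image."""
    half = size // 2
    padded = np.pad(image, half, mode="constant")
    # Original pixel (cy, cx) sits at (cy + half, cx + half) in `padded`,
    # so this slice places it at index `half` of the window.
    return padded[cy:cy + size, cx:cx + size]

def downsample2x(window):
    """Halve the resolution by averaging 2x2 blocks (assumed anti-aliasing filter)."""
    h, w = window.shape
    return window.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def normalize(window, eps=1e-6):
    """Zero-mean, unit-variance normalization (simplified stand-in for LCN)."""
    return (window - window.mean()) / (window.std() + eps)

def two_bank_inputs(image, cy, cx, size=64):
    """Return (high-resolution window, low-resolution context), both size x size."""
    high = normalize(crop_centered(image, cy, cx, size))
    low = normalize(downsample2x(crop_centered(image, cy, cx, 2 * size)))
    return high, low

image = np.random.default_rng(0).random((200, 300))
high, low = two_bank_inputs(image, cy=100, cx=150)
```

Both banks receive an array of the same shape, but the second covers four times the image area, which is the sense in which the receptive fields overlap.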
An advantage of the sliding-window model (Fig 1) is that the detector is translation invariant. However, a major drawback is that evaluation is expensive due to redundant convolutions. Recent work [fastcnn, overfeatSermanet] has addressed this problem by performing the convolution stages on the full input image to efficiently create dense feature maps. These dense feature maps are then processed through convolution stages to replicate the fully-connected network at each pixel. An equivalent but efficient version of the sliding-window model for a single resolution bank is shown in Fig 2. Note that due to pooling in the convolution stages, the output heatmap will be a lower resolution than the input image.
For our PartDetector, we combine an efficient sliding-window-based architecture with multi-resolution and overlapping receptive fields; the resulting model is shown in Fig 3. Since the large context (low resolution) convolution bank requires a stride of $1/2$ pixels in the lower resolution image to produce the same dense output as the sliding-window model, the bank must process four down-sampled images, each with a $1/2$ pixel offset, using shared weight convolutions. These four outputs, along with the high resolution convolutional features, are processed through a 9x9 convolution stage (with 512 output features) using the same weights as the first fully connected stage (Fig 1) and then the outputs of the low resolution bank are added and interleaved with the output of the high resolution bank.
To improve training time we simplify the above architecture by replacing the lower-resolution stage with a single convolution bank as shown in Fig 4 and then upscale the resulting feature map. In our practical implementation we use 3 resolution banks. Note that the simplified architecture is no longer equivalent to the original sliding-window network of Fig 1 since the lower resolution convolution features are effectively decimated and replicated leading into the fully-connected stage; however, we have found empirically that the performance loss is minimal.
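The idea of processing four offset down-sampled images and interleaving their outputs, used by the exact (non-simplified) multi-resolution model, can be pictured with a toy NumPy example. This is only a sketch under strong assumptions: the image is decimated by plain 2x sub-sampling without anti-aliasing, a per-pixel function stands in for the shared-weight convolution bank, and the helper names are hypothetical.

```python
import numpy as np

def split_into_offsets(image):
    """Return four 2x-decimated copies of `image`, one per (dy, dx) pixel offset."""
    return {(dy, dx): image[dy::2, dx::2] for dy in (0, 1) for dx in (0, 1)}

def interleave(parts, shape):
    """Re-assemble the four offset outputs into one dense map of the given shape."""
    dense = np.empty(shape, dtype=float)
    for (dy, dx), part in parts.items():
        dense[dy::2, dx::2] = part
    return dense

image = np.arange(48, dtype=float).reshape(6, 8)
parts = split_into_offsets(image)

# Stand-in for the shared-weight bank: the same function applied to every part.
processed = {offset: np.sqrt(part) for offset, part in parts.items()}
dense = interleave(processed, image.shape)
```

Each of the four parts has a quarter of the pixels, and interleaving restores one output per input pixel, which is why four passes are needed to match the dense sliding-window output.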
Supervised training of the network is performed using batched Stochastic Gradient Descent (SGD) with Nesterov Momentum. We use a Mean Squared Error (MSE) criterion to minimize the distance between the predicted output and a target heatmap. The target is a 2D Gaussian with a small variance and mean centered at the ground-truth joint locations. At training time we also perform random perturbations of the input images (randomly flipping and scaling the images) to increase generalization performance.
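The training target and criterion described above can be sketched in a few lines of NumPy. The function names and the value of `sigma` are illustrative assumptions; the text only states that the Gaussian has a small variance.

```python
import numpy as np

def gaussian_target(height, width, joint_xy, sigma=1.5):
    """Heatmap holding a 2D Gaussian centered on the ground-truth joint (x, y)."""
    ys, xs = np.mgrid[0:height, 0:width]
    x0, y0 = joint_xy
    return np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2.0 * sigma ** 2))

def mse_loss(predicted, target):
    """Mean squared error between a predicted heatmap and its target."""
    return float(np.mean((predicted - target) ** 2))

# One joint at x=45, y=30 on a heatmap that is 90 pixels wide and 60 pixels high.
target = gaussian_target(60, 90, joint_xy=(45, 30))
blank_prediction = np.zeros_like(target)
loss = mse_loss(blank_prediction, target)
```

Because the Gaussian is narrow, almost all target pixels are close to zero, so an all-zero prediction already has a small MSE; the informative part of the loss comes from the few pixels around the joint.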
3.2 Higher-Level SpatialModel
On our validation set, the PartDetector (Section 3.1) predicts heatmaps that contain many false positives and poses that are anatomically incorrect; for instance, a peak for face detection may be unusually far from a peak in the corresponding shoulder detection. Therefore, in spite of the improved PartDetector context, the feed-forward network still has difficulty learning an implicit model of the constraints of the body parts for the full range of body poses. We use a higher-level SpatialModel to constrain joint inter-connectivity and enforce global pose consistency. The expectation of this stage is not to increase the performance of detections that are already close to the ground-truth pose, but to remove false positive outliers that are anatomically incorrect.
Similar to Jain et al. [jainiclr2014], we formulate the SpatialModel as an MRF-like model over the distribution of spatial locations for each body part. However, the biggest drawback of their model is that the body part priors and the graph structure are explicitly hand crafted. On the other hand, we learn the prior model and implicitly the structure of the spatial model. Unlike [jainiclr2014], we start by connecting every body part to itself and to every other body part in a pair-wise fashion in the spatial model to create a fully connected graph. The PartDetector (Section 3.1) provides the unary potentials for each body part location. The pair-wise potentials in the graph are computed using convolutional priors, which model the conditional distribution of the location of one body part to another. For instance, given that body part $B$ is located at the center pixel, the convolution prior $p_{A|B}(i,j)$ is the likelihood of the body part $A$ occurring in pixel location $(i,j)$. For a body part $A$, we calculate the final marginal likelihood $\bar{p}_A$ as:
$$\bar{p}_A = \frac{1}{Z} \prod_{v \in V} \left( p_{A|v} * p_v + b_{v \to A} \right) \qquad (1)$$
where $v$ is the joint location, $p_{A|v}$ is the conditional prior described above, $b_{v \to A}$ is a bias term used to describe the background probability for the message from joint $v$ to $A$, and $Z$ is the partition function. Evaluation of Eq 1 is analogous to a single round of sum-product belief propagation. Convergence to a global optimum is not guaranteed given that our spatial model is not tree structured. However, as can be seen in our results (Fig 7(b)), the inferred solution is sufficiently accurate for all poses in our datasets. The learned pair-wise distributions are purely uniform when any pair-wise edge should be removed from the graph structure. Fig 5 shows a practical example of how the SpatialModel is able to remove an anatomically incorrect strong outlier from the face heatmap by incorporating the presence of a strong shoulder detection. For simplicity, only the shoulder and face joints are shown; however, this example can be extended to incorporate all body part pairs. If the shoulder heatmap shown in Fig 5 had an incorrect false-negative (i.e. no detection at the correct shoulder location), the addition of the background bias would prevent the output heatmap from having no maxima in the detected face region.
Fig 5 contains the conditional distributions for face and shoulder parts learned on the FLIC [sapp13cvpr] dataset. For any part $A$ the distribution $p_{A|A}$ is the identity map, and so the message passed from any joint to itself is its unary distribution. Since the FLIC dataset is biased towards front-facing poses where the right shoulder is directly to the lower right of the face, the model learns the correct spatial distribution between these body parts and has high probability in the spatial locations describing the likely displacement between the shoulder and face. For datasets that cover a larger range of the possible poses (for instance the LSP [Johnson10] dataset), we would expect these distributions to be less tightly constrained, and therefore this simple SpatialModel will be less effective.
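The face and shoulder example can be reproduced on a toy scale. The sketch below computes, for one target part, the product over all parts of (conditional prior convolved with that part's unary heatmap, plus a bias), and divides by the sum as a stand-in for the partition function. The 9x9 heatmaps, the hand-made priors, the bias value and all function names are assumptions chosen for illustration, not learned quantities.

```python
import numpy as np

def conv2d_same(heatmap, kernel):
    """True 2D convolution with an odd-sized, centered kernel; output matches the input size."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(heatmap, ((ph, ph), (pw, pw)))
    flipped = kernel[::-1, ::-1]  # flipping turns correlation into convolution
    out = np.zeros(heatmap.shape, dtype=float)
    for y in range(heatmap.shape[0]):
        for x in range(heatmap.shape[1]):
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * flipped)
    return out

def spatial_model_marginal(unaries, priors, biases, target):
    """Product over parts v of (prior[target|v] * unary[v] + bias), normalized to sum to 1."""
    result = np.ones(unaries[target].shape, dtype=float)
    for v, p_v in unaries.items():
        result *= conv2d_same(p_v, priors[(target, v)]) + biases[(target, v)]
    return result / result.sum()

# Face heatmap: a correct peak at (row 2, col 4) and a stronger false positive at (7, 1).
face = np.zeros((9, 9)); face[2, 4] = 0.6; face[7, 1] = 0.9
shoulder = np.zeros((9, 9)); shoulder[5, 4] = 1.0

# Kernel index (3 + dy, 3 + dx) holds the prior for the target lying (dy, dx) from part v.
identity = np.zeros((7, 7)); identity[3, 3] = 1.0                        # face given face
face_given_shoulder = np.zeros((7, 7)); face_given_shoulder[0, 3] = 1.0  # face 3 rows above

unaries = {"face": face, "shoulder": shoulder}
priors = {("face", "face"): identity, ("face", "shoulder"): face_given_shoulder}
biases = {("face", "face"): 0.01, ("face", "shoulder"): 0.01}

refined_face = spatial_model_marginal(unaries, priors, biases, "face")
```

In this toy setting the strongest input response is the false positive, while the refined heatmap peaks at the location that agrees with the shoulder detection. The small bias keeps the product from collapsing to zero wherever one message has no support.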
For our practical implementation we treat the distributions above as energies to avoid the evaluation of the partition function $Z$. There are 3 reasons why we do not include the partition function. Firstly, we are only concerned with the maximum output value of our network, and so we only need the output energy to be proportional to the normalized distribution. Secondly, since both the part detector and spatial model parameters contain only shared weight (convolutional) parameters that are equal across pixel positions, evaluation of the partition function during back-propagation will only add a scalar constant to the gradient weight, which would be equivalent to applying a per-batch learning-rate modifier. Lastly, since the number of parts is not known a priori (since there can be unlabeled people in the image), and since the distributions $p_v$ describe the part location of a single person, we cannot normalize the PartModel output. Our final model is a modification to Eq 1:
$$\bar{e}_A = \exp\left( \sum_{v \in V} \log\left( \mathrm{SoftPlus}\left(e_{A|v}\right) * \mathrm{ReLU}\left(e_v\right) + \mathrm{SoftPlus}\left(b_{v \to A}\right) \right) \right) \qquad (2)$$
where $e_v$ and $e_{A|v}$ are the energies corresponding to the unary distributions and the conditional priors, respectively.
Note that the above formulation is no longer exactly equivalent to an MRF, but still satisfactorily encodes the spatial constraints of Eq 1. The network-based implementation of Eq 2 is shown in Fig 6. Eq 2 replaces the outer multiplication of Eq 1 with a log space addition to improve numerical stability and to prevent coupling of the convolution output gradients (the addition in log space means that the partial derivative of the loss function with respect to the convolution output is not dependent on the output of any other stages). The inclusion of the SoftPlus and ReLU stages on the weights, biases and input heatmap maintains a strictly greater than zero convolution output, which prevents numerical issues for the values leading into the Log stage. Finally, a SoftPlus stage is used to maintain continuous and non-zero weight and bias gradients during training. With this modified formulation, Eq 2 is trained using back-propagation and SGD.
The convolution sizes are adjusted so that the largest joint displacement is covered within the convolution window. For our 90x60 pixel heatmap output, this results in large 128x128 convolution kernels to account for a joint displacement radius of 64 pixels (note that padding is added on the heatmap input to prevent pixel loss). Therefore for such large kernels we use FFT convolutions based on the GPU implementation by Mathieu et al. [fft].
The convolution weights are initialized using the empirical histogram of joint displacements created from the training examples. This initialization improves learned performance, decreases training time and improves optimization stability. During training we randomly flip and scale the heatmap inputs to improve generalization performance.
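The log-space formulation and the use of FFT convolutions can be sketched together in NumPy. This is a hedged illustration rather than the GPU implementation referenced above: the function names, the ReLU floor `eps`, the unit-slope SoftPlus, and the toy weights and biases are all assumptions, and the FFT routine is the standard zero-padded linear convolution.

```python
import numpy as np

def fft_conv_same(heatmap, kernel):
    """'Same'-size 2D convolution via FFT; the kernel must have odd height and width."""
    h, w = heatmap.shape
    kh, kw = kernel.shape
    full = (h + kh - 1, w + kw - 1)  # size of the full linear convolution
    spectrum = np.fft.rfft2(heatmap, s=full) * np.fft.rfft2(kernel, s=full)
    out = np.fft.irfft2(spectrum, s=full)
    return out[kh // 2:kh // 2 + h, kw // 2:kw // 2 + w]

def softplus(x):
    """log(1 + exp(x)), computed stably; strictly positive for finite x."""
    return np.logaddexp(0.0, x)

def log_space_energy(energies, weights, biases, target, eps=1e-3):
    """exp of the sum over parts v of log(SoftPlus(w) conv max(e_v, eps) + SoftPlus(b))."""
    log_sum = np.zeros(energies[target].shape, dtype=float)
    for v, e_v in energies.items():
        conv = fft_conv_same(np.maximum(e_v, eps), softplus(weights[(target, v)]))
        conv = np.maximum(conv, 0.0)  # guard against tiny negative FFT round-off
        log_sum += np.log(conv + softplus(biases[(target, v)]))
    return np.exp(log_sum)

# Toy energies: a correct face peak at (2, 4), a false positive at (7, 1), one shoulder peak.
face = np.zeros((9, 9)); face[2, 4] = 0.6; face[7, 1] = 0.9
shoulder = np.zeros((9, 9)); shoulder[5, 4] = 1.0

# Raw (pre-SoftPlus) weights: strongly negative everywhere except the preferred displacement.
w_self = np.full((7, 7), -10.0); w_self[3, 3] = 10.0
w_pair = np.full((7, 7), -10.0); w_pair[0, 3] = 10.0  # face three rows above the shoulder

energies = {"face": face, "shoulder": shoulder}
weights = {("face", "face"): w_self, ("face", "shoulder"): w_pair}
biases = {("face", "face"): -5.0, ("face", "shoulder"): -5.0}

out = log_space_energy(energies, weights, biases, "face")
```

Because every term entering the logarithm is strictly positive, the output stays finite even where a heatmap is empty, and the sum of logs means the gradient of each convolution output does not depend on the other messages.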
3.3 Unified Model
Since our SpatialModel (Section 3.2) is trained using back-propagation, we can combine our PartDetector and SpatialModel stages in a single Unified Model. To do so, we first train the PartDetector separately and store the heatmap outputs. We then use these heatmaps to train a SpatialModel. Finally, we combine the trained PartDetector and SpatialModel and back-propagate through the entire network.
This unified fine-tuning further improves performance. We hypothesize that because the SpatialModel is able to effectively reduce the output dimension of possible heatmap activations, the PartDetector can use available learning capacity to better localize the precise target activation.
4 Results
The models from Sections 3.1 and 3.2 were implemented within the Torch7 [torch7] framework (with custom GPU implementations for the non-standard stages above). Training the PartDetector takes approximately 48 hours, the SpatialModel 12 hours, and forward-propagation for a single image through both networks takes 51ms (we use a 12 CPU workstation with an NVIDIA Titan GPU).
We evaluated our architecture on the FLIC [sapp13cvpr] and extended-LSP [Johnson10] datasets. These datasets consist of still RGB images with 2D ground-truth joint information generated using Amazon Mechanical Turk. The FLIC dataset consists of 5003 images from Hollywood movies with actors in predominantly front-facing standing-up poses (with 1016 images used for testing), while the extended-LSP dataset contains a wider variety of poses of athletes playing sport (10442 training and 1000 test images). The FLIC dataset contains many frames with more than a single person, while the joint locations from only one person in the scene are labeled. Therefore an approximate torso bounding box is provided for the single labeled person in the scene. We incorporate this data by including an extra “torso-joint heatmap” in the input of the SpatialModel so that it can learn to select the correct feature activations in a cluttered scene.
The FLIC-full dataset contains 20928 training images; however, many of these training set images contain samples from the 1016 test set scenes and so would allow unfair overtraining on the FLIC test set. Therefore, we propose a new dataset, called FLIC-plus (http://cims.nyu.edu/tompson/flic_plus.htm), which is a 17380 image subset of the FLIC-full dataset. To create this dataset, we produced unique scene labels for both the FLIC test set and FLIC-plus training sets using Amazon Mechanical Turk. We then removed all images from the FLIC-plus training set that shared a scene with the test set. Since 253 of the sample images from the original 3987 FLIC training set came from the same scene as a test set sample (and were therefore removed by the above procedure), we added these images back so that the FLIC-plus training set is a superset of the original FLIC training set. Using this procedure we can guarantee that the additional samples in FLIC-plus are sufficiently independent of the FLIC test set samples.
For evaluation of the test-set performance we use the measure suggested by Sapp et al. [sapp13cvpr]. For a given normalized pixel radius (normalized by the torso height of each sample) we count the number of images in the test-set for which the distance of the predicted UV joint location to the ground-truth location falls within the given radius.
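The measure above can be written compactly. The sketch below is an assumption-laden illustration: it reports the fraction (rather than the raw count) of samples within the radius, divides each pixel distance by that sample's torso height, and uses made-up coordinates; the function name is hypothetical.

```python
import numpy as np

def detection_rate(predicted, ground_truth, torso_heights, radius):
    """Fraction of samples whose normalized joint error is within `radius`.

    predicted, ground_truth: arrays of shape (num_samples, 2) with UV pixel positions.
    torso_heights: array of shape (num_samples,) used to normalize each distance.
    """
    pixel_error = np.linalg.norm(predicted - ground_truth, axis=1)
    normalized_error = pixel_error / torso_heights
    return float(np.mean(normalized_error <= radius))

# Three made-up samples with normalized errors of 0.02, 0.10 and 0.00.
predicted = np.array([[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]])
ground_truth = np.array([[10.0, 12.0], [20.0, 30.0], [30.0, 30.0]])
torso_heights = np.array([100.0, 100.0, 50.0])

rate_tight = detection_rate(predicted, ground_truth, torso_heights, radius=0.05)
rate_loose = detection_rate(predicted, ground_truth, torso_heights, radius=0.2)
```

Sweeping `radius` over a range of values and plotting the resulting rates gives a detection-rate curve for each joint.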
Fig 6(a) and 6(b) show our model’s performance on the FLIC test-set for the elbow and wrist joints respectively, trained using both the FLIC and FLIC-plus training sets. Performance on the LSP dataset is shown in Fig 6(c) and 7(a). For LSP evaluation we use person-centric (or non-observer-centric) coordinates for fair comparison with prior work [deeppose, dantone13cvpr]. Our model outperforms existing state-of-the-art techniques on both of these challenging datasets by a considerable margin.

Fig 7(b) illustrates the performance improvement from our simple SpatialModel. As expected, the SpatialModel has little impact on accuracy for low radii thresholds; however, for large radii it increases performance by 8 to 12%. Unified training of both models (after independent pre-training) adds an additional 4-5% detection rate for large radii thresholds.
The impact of the number of resolution banks is shown in Fig 7(c). As expected, we see a big improvement when multiple resolution banks are added. Also note that the size of the receptive fields as well as the number and size of the pooling stages in the network also have a large impact on the performance. We tune the network hyper-parameters using coarse meta-optimization to obtain maximal validation set performance within our computational budget (less than 100ms per forward-propagation).
Fig 9 shows the predicted joint locations for a variety of inputs in the FLIC and LSP test-sets. Our network produces convincing results on the FLIC dataset (with low joint position error); however, because our simple SpatialModel is less effective for a number of the highly articulated poses in the LSP dataset, our detector results in incorrect joint predictions for some images. We believe that increasing the size of the training set will improve performance for these difficult cases.
5 Conclusion
We have shown that the unification of a novel ConvNet PartDetector and an MRF-inspired SpatialModel into a single learning framework significantly outperforms existing architectures on the task of human body pose recognition. Training and inference of our architecture uses commodity-level hardware and runs at close to real-time frame rates, making this technique tractable for a wide variety of application areas.
For future work we expect to further improve upon these results by increasing the complexity and expressiveness of our simple spatial model (especially for unconstrained datasets like LSP).
6 Acknowledgments
The authors would like to thank Mykhaylo Andriluka for his support. This research was funded in part by the Office of Naval Research ONR Award N000141210327.