One of the hardest tasks in computer vision is determining the high degree-of-freedom configuration of a human body with all its limbs, complex self-occlusion, self-similar parts, and large variations due to clothing, body-type, lighting, and many other factors. The most challenging scenario for this problem is from a monocular RGB image and with no prior assumptions made using motion models, pose models, background models, or any other common heuristics that current state-of-the-art systems utilize. Finding a face in frontal or side view is relatively simple, but determining the exact location of body parts such as hands, elbows, shoulders, hips, knees and feet, each of which sometimes occupies only a few pixels in the image in front of an arbitrary cluttered background, is significantly harder.
The best performing pose estimation methods, including those based on deformable part models, are typically based on body part detectors. Such body part detectors commonly consist of multiple stages of processing. The first stage of processing in a typical pipeline consists of extracting sets of low-level features such as SIFT [lowe1999object], HoG [dalal2005histograms], or other filters that describe orientation statistics in local image patches. Next, these features are pooled over local spatial regions and sometimes across multiple scales to reduce the size of the representation and also develop local shift/scale invariance. Finally, the aggregate features are mapped to a vector, which is then input either to 1) a standard classifier such as a support vector machine (SVM) or 2) the next stage of processing (e.g. assembling the parts into a whole). Much work is devoted to engineering the system to produce a vector representation that is sensitive to class (e.g. head, hands, torso) while remaining invariant to the various nuisance factors (lighting, viewpoint, scale, etc.).
An alternative approach is representation learning: relying on the data, instead of feature engineering, to learn a good representation that is invariant to nuisance factors. For a recent review, see [Bengio2012]. It is common to learn multiple layers of representation, which is referred to as deep learning. Several such techniques have used unsupervised or semi-supervised learning to extract multi-layer domain-specific invariant representations; however, it is purely supervised techniques that have won several recent challenges by large margins, including ImageNet LSVRC 2012 and 2013 [krizhevsky2012imagenet, zeiler2013visualizing]. These end-to-end learning systems have capitalized on advances in computing hardware (notably GPUs), larger datasets like ImageNet, and algorithmic advances (specifically gradient-based training methods and regularization).
While these methods are now proven in generic object recognition, their use in pose estimation has been limited. Part of the challenge in making end-to-end learning work for human pose estimation is related to the nonrigid structure of the body, the necessity for precision (deep recognition systems often throw away precise location information through pooling), and the complex, multi-modal nature of pose.
In this paper, we present the first end-to-end learning approach for full-body human pose estimation. While our approach is based on convolutional networks (convnets) [LeCun1998], we want to stress that the naïve implementation of applying this model “off-the-shelf” will not work. Therefore, the contribution of this work is both a model that outperforms state-of-the-art deformable part models (DPMs) on a modern, challenging dataset, and an analysis of what is needed to make convnets work in human pose estimation. In particular, we present a two-stage filtering approach whereby the response maps of convnet part detectors are denoised by a second process informed by the part hierarchy.
2 Related Work
Detecting people and their pose has been investigated for decades. Many early techniques rely on sliding-window part detectors based on hand-crafted or learned features or silhouette extraction techniques applied to controlled recording conditions. Examples include [farhadi2007transfer, wren1997pfinder, athitsos2004boostmap, nowlan1995convolutional]. We refer to [poppe2007vision] for a complete survey of this era. More recently, several new approaches have been proposed that are applied to unconstrained domains. In such domains, good performance has been achieved with so-called “bag of features” followed by regression-based, nearest neighbor or SVM-based architectures. Examples include “shape-context” edge-based histograms from the human body [mori2002estimating, agarwal2006recovering] or just silhouette features [Grauman2003]. Shakhnarovich et al. [Shakhnarovich2003] learn a parameter sensitive hash function to perform example-based pose estimation. Many relevant techniques have also been applied to hand tracking such as [wang2009real]. A more general survey of the large field of hand tracking can be found in [erol2007vision].
Many techniques have been proposed that extract, learn, or reason over entire body features. Some use a combination of local detectors and structural reasoning (see [ramanan2005strike] for coarse tracking and [buehler2009learning] for person-dependent tracking). In a similar spirit, more general techniques using pictorial structures [andriluka2009pictorial, andriluka2010monocular, Ferrari2009, Sapp2010, pishchulin12cvpr, pishchulin11bmvc], “poselets” [PoseletsICCV09], and other part-models [Felzenszwalb2010PAMI, yang2011articulated] have received increased attention. We will focus on these techniques and their latest incarnations in the following sections.
Further examples come from the HumanEva dataset competitions [Sigal2010], or approaches that use higher-resolution shape models such as SCAPE [anguelov2005scape] and further extensions [HasStoSunRosSei09, Coreg:Patent:2012]. These differ from our domain in that the images considered are of higher quality and less cluttered. Also many of these techniques work on images from a single camera, but need video sequence input (not single images) to achieve impressive results [stoll2011fast, zuffiestimating].
As an example of a technique that works for single images against cluttered backgrounds, Shotton et al.’s Kinect-based body part detector [shotton2013real] uses a random forest of decision trees trained on synthetic depth data to create simple body part detectors. In the proposed work, we also adopt simple part-based detectors; however, we focus on a different learning strategy.
There are a number of successful end-to-end representation learning techniques which perform pose estimation on a limited subset of body parts or body poses. One of the earliest examples of this type was Nowlan and Platt’s convolutional neural network hand tracker [nowlan1995convolutional], which tracked a single hand. Osadchy et al. applied a convolutional network to simultaneously detect and estimate the pitch, yaw and roll of a face [osadchy2007synergistic]. Taylor et al. [taylor2010embedding] trained a convolutional neural network to learn an embedding in which images of people in similar pose lie nearby. They used a subset of body parts, namely, the head and hand locations to learn the “gist” of a pose, and resorted to nearest-neighbour matching rather than explicitly modeling pose. Perhaps most relevant to our work is Taylor et al.’s work on tracking people in video [taylor2010tracking], augmenting a particle filter with a structured prior over human pose and dynamics based on learning representations. While they estimated a posterior over the whole body (60 joint angles), their experiments were limited to the HumanEva dataset [Sigal2010], which was collected in a controlled laboratory setting. The datasets we consider in our experiments are truly poses “in the wild”, though we do not consider dynamics.
A factor limiting earlier methods from tackling full pose estimation with end-to-end learning methods, in particular deep networks, was the limited amount of labeled data. Such techniques, with millions or more parameters, require more data than structured techniques that have more a priori knowledge, such as DPMs. We attack this issue on two fronts. First, directly, by using larger labeled training sets which have become available in the past year or two, such as FLIC [sapp13cvpr]. Second, indirectly, by better exploiting the data we have. The annotations provided by typical pose estimation datasets contain much richer information compared to the class labels in object recognition datasets. In particular, we show that the relationships among parts contained in these annotations can be used to build better detectors.
To perform pose estimation with a convolutional network architecture [LeCun1998] (convnet), the most obvious approach would be to map the image input directly to a vector coding the articulated pose: i.e. the type of labels found in pose datasets. The convnet output would represent the unbounded 2-D or 3-D positions of joints, or alternatively a hierarchy of joint angles. However, we found that this worked very poorly. One issue is that pooling, while useful for improving translation invariance during object recognition, destroys precise spatial information which is necessary to accurately predict pose. Convnets that produce segmentation maps, for example, avoid pooling completely [Turaga2010, FarabetCouprieNajmanLeCun2012]. Another issue is that the direct mapping from input space to kinematic body pose coefficients is highly non-linear and not one-to-one. However, even if we took this route, there is a deeper issue with attempting to map directly to a representation of full body pose. Valid poses represent a much lower-dimensional manifold in the high-dimensional space in which they are captured. It seems troublesome to make a discriminative network map to a space in which the majority of configurations do not represent valid poses. In other words, it makes sense to restrict the net’s output to a much smaller class of valid configurations.
Rather than perform multiple-output regression using a single convnet to learn pose coefficients directly, we found that training multiple convnets to perform independent binary body-part classification, with one network per feature, resulted in improved performance on our dataset. These convnets are applied as sliding windows to overlapping regions of the input, and map a window of pixels to a single binary output: the presence or absence of that body part. The result of applying the convnet is a response-map indicating the confidence of the body part at that location. This lets us use much smaller convnets, and retain the advantages of pooling, at the expense of having to maintain a separate set of parameters for each body part. Of course, a series of independent part detectors cannot enforce consistency in pose in the same way as a structured output model, which produces valid full-body configurations. In the following sections, we first describe in detail the convolutional network architecture and then a method of enforcing pose consistency using parent-child relationships.
3.1 Convolutional Network Architecture
The lowest level of our two-stage feature detection pipeline is based on a standard convnet architecture, an overview of which is shown in Figure 2. Convnets, like their fully-connected, deep neural network counterparts, perform end-to-end feature learning and are trained with the back-propagation algorithm. However, they differ in a number of respects, most notably local connectivity, weight sharing, and local pooling. The first two properties significantly reduce the number of free parameters, and reduce the need to learn repeated feature detectors at different locations of the input. The third property makes the learned representation invariant to small translations of the input.
The convnet pipeline shown in Figure 2 starts with a 64×64 pixel RGB input patch which has been local contrast normalized (LCN) [yann_lcn_cite] to emphasize geometric discontinuities and improve generalization performance [pinto2008real]. The LCN layer comprises a 9×9 pixel local subtractive normalization, followed by a 9×9 local divisive normalization. The input is then processed by three convolution and subsampling layers, which use rectified linear units (ReLUs) [glorot2011deep].
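The two LCN steps can be sketched as follows for a single image channel. This is an illustrative version under stated assumptions: it uses a uniform 9×9 box window with edge replication (a Gaussian-weighted window is also common), it floors the divisor at the mean local standard deviation, and the names `box_mean` and `local_contrast_normalize` are hypothetical.

```python
import numpy as np

def box_mean(img, size=9):
    """Mean over a size x size neighbourhood, replicating edge pixels."""
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    # Summed-area table with a leading row and column of zeros.
    sat = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    sat[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, w = img.shape
    total = (sat[size:size + h, size:size + w] - sat[:h, size:size + w]
             - sat[size:size + h, :w] + sat[:h, :w])
    return total / (size * size)

def local_contrast_normalize(img, size=9, eps=1e-4):
    """Subtractive then divisive normalization over local windows."""
    centered = img - box_mean(img, size)                  # subtractive step
    variance = np.maximum(box_mean(centered ** 2, size), 0.0)
    local_std = np.sqrt(variance)
    # Assumed floor: never divide by less than the mean local deviation.
    divisor = np.maximum(local_std, max(float(local_std.mean()), eps))
    return centered / divisor                             # divisive step

rng = np.random.default_rng(0)
patch = rng.random((64, 64))          # one channel of a 64x64 input patch
normalized = local_contrast_normalize(patch)
```

With this formulation the output is unchanged when a constant brightness offset is added to the patch or when its contrast is rescaled, which is the kind of lighting invariance the normalization is meant to provide.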
As expected, we found that internal pooling layers help to a) reduce computational complexity and b) improve classification tolerance to small input image translations. (The number of operations required to calculate the output of the three fully-connected layers grows with the size of the input vectors; therefore, even small amounts of pooling in earlier stages can drastically reduce training time.) Unfortunately, pooling also results in a loss of spatial precision. Since the target application for this convnet was offline (rather than real-time) body-pose detection, and since we found that with sufficient training exemplars, invariance to input translations can be learned, we chose to use only 2 stages of pooling (where the total image downsampling rate is ).
Following the three stages of convolution and subsampling, the top-level pooled map is flattened to a vector and processed by three fully connected layers, analogous to those used in deep neural networks. Each of these output stages is composed of a linear matrix-vector multiplication with learned bias, followed by a point-wise non-linearity (ReLU). The output layer has a single logistic unit, representing the probability of the body part being present in that patch.
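A forward pass through such a classifier head might look like the sketch below. The layer sizes, the random weights, and the way the stages are split into two ReLU layers plus a logistic output are assumptions made only for illustration; they are not the dimensions of our network.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def classifier_head(features, hidden_layers, output_layer):
    """Flattened features -> ReLU stages -> single logistic unit."""
    activation = features
    for weights, bias in hidden_layers:
        activation = relu(weights @ activation + bias)
    out_weights, out_bias = output_layer
    return float(logistic(out_weights @ activation + out_bias))

rng = np.random.default_rng(0)
sizes = [256, 128, 64]                      # hypothetical layer widths
hidden = [(rng.normal(0.0, 0.05, (n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
output = (rng.normal(0.0, 0.05, sizes[-1]), 0.0)

pooled_map = rng.random((4, 8, 8))          # hypothetical top-level pooled maps
probability = classifier_head(pooled_map.ravel(), hidden, output)

# Multiply-adds of the head; the first term scales with the input length,
# which is why pooling before this stage reduces the cost.
multiply_adds = sum(w.size for w, _ in hidden) + output[0].size
```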
To train the convnet, we performed standard batch stochastic gradient descent. From the training set images, we set aside a validation set to tune the network hyper-parameters, such as number and size of features, learning rate, momentum coefficient, etc. We used Nesterov momentum [sutskeverimportance] as well as RMSPROP [tieleman2012rmsprop] to accelerate learning, and we used L2 regularization and dropout [hinton2012improving] on the input to each of the fully-connected linear stages to reduce over-fitting to the restricted-size training set.
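The text does not specify how these ingredients were combined, so the following is only one plausible composition, shown on a toy quadratic objective: the gradient is evaluated at the Nesterov look-ahead point, an L2 penalty term is added, and the step is scaled by an RMSProp running average. All hyper-parameter values and function names here are placeholders.

```python
import numpy as np

def nesterov_rmsprop_step(theta, velocity, mean_square, grad_fn,
                          lr=0.01, momentum=0.9, decay=0.9,
                          l2=1e-4, eps=1e-8):
    """One update: look-ahead gradient, L2 penalty, RMS-scaled step."""
    lookahead = theta + momentum * velocity
    grad = grad_fn(lookahead) + l2 * lookahead      # gradient of loss + L2 term
    mean_square = decay * mean_square + (1.0 - decay) * grad ** 2
    velocity = momentum * velocity - lr * grad / (np.sqrt(mean_square) + eps)
    return theta + velocity, velocity, mean_square

def dropout(x, rate, rng):
    """Inverted dropout: zero a fraction `rate` of inputs, rescale the rest."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

# Toy objective 0.5 * ||theta - target||^2, whose gradient is theta - target.
target = np.array([1.0, -2.0])
theta, velocity, mean_square = np.zeros(2), np.zeros(2), np.zeros(2)
for _ in range(3000):
    theta, velocity, mean_square = nesterov_rmsprop_step(
        theta, velocity, mean_square, lambda p: p - target, lr=0.001)

rng = np.random.default_rng(0)
dropped = dropout(np.ones(1000), rate=0.5, rng=rng)
```

In a real network `grad_fn` would be the back-propagated gradient of the classification loss for a mini-batch, and `dropout` would be applied to the input of each fully-connected stage during training only.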
3.2 Enforcing Global Pose Consistency with a Spatial Model
When applied to the validation set, the raw output of the network presented in Section 3.1 produces many false positives. We believe this is due to two factors: 1) the small image context as input to the convnet (64×64 pixels or approximately 5% of the input image area) does not give the model enough contextual information to perform anatomically consistent joint position inference and 2) the training set size is limited. We therefore use a higher-level spatial model with simple body-pose priors to remove strong outliers from the convnet output. We do not expect this model to improve the performance of poses that are close to the ground truth labels (within 10 pixels for instance), but rather it functions as a post-processing step to de-emphasize anatomically impossible poses due to strong outliers.
The inter-node connectivity of our simple spatial model is displayed in Figure 3. It consists of a linear chain of kinematic 2D nodes for a single side of the human body. Throughout our experiments we used the left shoulder, elbow and wrist; however, we could have used the right side joints without loss of generality (since detection of the right body parts simply requires a horizontal mirror of the input image). For each node in the chain, our convnet detector generates response-map unary distributions $p_{\text{fac}}(x)$, $p_{\text{sho}}(x)$, $p_{\text{elb}}(x)$, $p_{\text{wri}}(x)$ over the dense pixel positions $x$, for the face, shoulder, elbow and wrist joints respectively. For the remainder of this section, all distributions are assumed to be a function over the pixel position, and so the $(x)$ notation will be dropped. The output of our spatial model will produce filtered response maps: $\hat{p}_{\text{fac}}$, $\hat{p}_{\text{sho}}$, $\hat{p}_{\text{elb}}$, and $\hat{p}_{\text{wri}}$.
The body part priors for a pair of joints $(a, b)$, $p_{a|b=\vec{0}}$, are calculated by creating a histogram of the locations of joint $a$ over the training set, given that the adjacent joint $b$ is located at the image center ($b = \vec{0}$). The histograms are then smoothed (using a Gaussian filter) and normalized. Learned priors for three such pairs are shown in Figure 4. Note that due to symmetry, the prior $p_{a|b=\vec{0}}$ is a 180° rotation of $p_{b|a=\vec{0}}$ (as is the case for all adjacent pairs). Rather than assume a simple Gaussian distribution for modeling pairwise interactions of adjacent nodes, as is standard in many parts-based detector implementations, we have found that these non-parametric spatial priors lead to improved detection performance.
Given the full set of prior conditional distributions and the convnet unary distributions, we can now construct the filtered distribution for each part by using an approach that is analogous to the sum-product belief propagation algorithm. For body part $i$, with a set of neighbouring nodes $U$, the final distribution is defined as:

$$\hat{p}_i \propto p_i^{\lambda} \prod_{u \in U} \left( p_{i|u=\vec{0}} * p_u \right)$$

where $*$ denotes 2D convolution and $\lambda$ is a mixing parameter that controls the confidence of each joint’s unary distribution towards its final filtered distribution ($\lambda$ was held fixed in our experiments). The final joint distribution is therefore a product of the unary distribution for that joint and the beliefs from neighbouring nodes (as with standard sum-product belief propagation). In log space, the above product for the shoulder joint becomes:

$$\log \hat{p}_{\text{sho}} = \lambda \log p_{\text{sho}} + \log\left( p_{\text{sho}|\text{fac}=\vec{0}} * p_{\text{fac}} \right) + \log\left( p_{\text{sho}|\text{elb}=\vec{0}} * p_{\text{elb}} \right) + \text{const}$$

We also perform an equivalent computation for the elbow and wrist joints. The face joint is treated as a special case. Empirically, we found that incorporating image evidence from the shoulder joint into the filtered face distribution resulted in poor performance. This is likely due to the fact that the convnet does a very good job of localizing the face position, and so incorporating noisy evidence from the shoulder detector actually increases uncertainty. Instead, we use a global position prior for the face, $h_{\text{fac}}$, which is obtained by learning a location histogram over the face positions in the training set images, as shown in Figure 5. In log space, the output distribution for the face is then given by:

$$\log \hat{p}_{\text{fac}} = \lambda \log p_{\text{fac}} + \log h_{\text{fac}} + \text{const}$$
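The filtering step can be illustrated with a small NumPy sketch. It builds a smoothed displacement histogram from a few made-up training offsets, convolves it with a neighbour's response map to form a message, and adds the log terms. The function names, the prior size and smoothing width, the toy response maps, and the mixing value `lam=1.0` are all assumptions chosen for the example, not values from our experiments.

```python
import numpy as np

def displacement_prior(joint_rc, neighbour_rc, size=31, sigma=1.0):
    """Histogram of (joint - neighbour) offsets, smoothed and normalized.
    The centre pixel of the result corresponds to zero displacement."""
    centre = size // 2
    hist = np.zeros((size, size))
    for dr, dc in np.asarray(joint_rc) - np.asarray(neighbour_rc):
        r, c = centre + int(dr), centre + int(dc)
        if 0 <= r < size and 0 <= c < size:
            hist[r, c] += 1.0
    radius = int(3 * sigma)
    taps = np.exp(-0.5 * (np.arange(-radius, radius + 1) / sigma) ** 2)
    taps /= taps.sum()
    # Separable Gaussian smoothing along rows, then columns.
    hist = np.apply_along_axis(np.convolve, 0, hist, taps, mode="same")
    hist = np.apply_along_axis(np.convolve, 1, hist, taps, mode="same")
    return hist / hist.sum()

def message(prior, neighbour_unary):
    """Convolve the prior with the neighbour's map, cropped to image size."""
    H, W = neighbour_unary.shape
    h, w = prior.shape
    shape = (H + h - 1, W + w - 1)
    spectrum = np.fft.rfft2(neighbour_unary, s=shape) * np.fft.rfft2(prior, s=shape)
    full = np.fft.irfft2(spectrum, s=shape)
    top, left = h // 2, w // 2
    return np.maximum(full[top:top + H, left:left + W], 0.0)

def filter_log(unary, messages, lam, eps=1e-12):
    """Log of the filtered map: lam * log(unary) + sum of log(messages)."""
    log_p = lam * np.log(unary + eps)
    for m in messages:
        log_p = log_p + np.log(m + eps)
    return log_p

# Made-up training annotations as (row, col): elbows sit below-right of shoulders.
shoulders = np.array([[10, 10], [12, 20], [30, 15]])
elbows = shoulders + np.array([[5, 3], [5, 3], [6, 3]])
prior_elb_given_sho = displacement_prior(elbows, shoulders)

shape = (48, 64)
p_sho = np.full(shape, 1e-6)
p_sho[20, 30] = 1.0                     # confident shoulder detection
p_elb = np.full(shape, 1e-3)
p_elb[25, 33] = 0.4                     # true elbow
p_elb[5, 50] = 0.6                      # stronger false positive

before = np.unravel_index(np.argmax(p_elb), shape)
log_elb = filter_log(p_elb, [message(prior_elb_given_sho, p_sho)], lam=1.0)
after = np.unravel_index(np.argmax(log_elb), shape)
```

In this toy case the raw elbow map peaks at the false positive, while the filtered map peaks at the anatomically plausible location next to the shoulder. A full chain would add one message per neighbour, and the face would use a global location histogram in place of the message term.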
Lastly, since the learned convolution features and the spatial priors are not explicitly invariant to scale, we must run the convnet and spatial model on images at multiple scales at test time, and then use the most likely joint location across those scales as the final joint location. For datasets containing examples with multiple persons (known a priori), we use non-maximal suppression [Neubeck:2006:ENS:1170749.1172615] to find multiple local maxima across the filtered response-maps from each scale, and we then take the most likely joint candidates for each person in the scene.
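A simple way to picture this selection step is sketched below. It assumes each response map is expressed in the coordinates of its rescaled image, so that dividing by the scale maps a peak back to the original image, and it uses a plain greedy suppression loop rather than the efficient algorithm of the cited work. Function names and the toy maps are hypothetical.

```python
import numpy as np

def best_across_scales(maps_by_scale):
    """Return (score, row, col, scale) of the strongest peak over all scales,
    with row and col expressed in original-image coordinates."""
    best = None
    for scale, rmap in maps_by_scale.items():
        r, c = np.unravel_index(np.argmax(rmap), rmap.shape)
        score = float(rmap[r, c])
        if best is None or score > best[0]:
            best = (score, r / scale, c / scale, scale)
    return best

def greedy_peaks(rmap, count, radius):
    """Greedy non-maximum suppression: take the maximum, blank its
    neighbourhood, and repeat `count` times."""
    work = rmap.astype(float).copy()
    peaks = []
    for _ in range(count):
        r, c = np.unravel_index(np.argmax(work), work.shape)
        peaks.append((int(r), int(c), float(work[r, c])))
        work[max(r - radius, 0):r + radius + 1,
             max(c - radius, 0):c + radius + 1] = -np.inf
    return peaks

half = np.zeros((30, 40))
half[10, 10] = 0.7
full = np.zeros((60, 80))
full[30, 40] = 0.9
best = best_across_scales({0.5: half, 1.0: full})

crowd = np.zeros((40, 40))
crowd[5, 5], crowd[6, 5], crowd[20, 20] = 0.9, 0.8, 0.7
two_people = greedy_peaks(crowd, count=2, radius=3)
```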
We evaluated our architecture on the FLIC [sapp13cvpr] dataset, which is comprised of 5003 still RGB images taken from an assortment of Hollywood movies. Each frame in the dataset contains at least one person in a frontal pose (facing the camera), and each frame was processed by Amazon Mechanical Turk to obtain ground truth labels for the joint positions of the upper body of a single person. The FLIC dataset is very challenging for state-of-the-art pose estimation methodologies because the poses are unconstrained, body parts are often occluded, and clothing and background are not consistent.
We use the training images from the dataset, which we also mirror horizontally to double the number of training examples. Since the training images are not at the same scale, we also manually annotate the bounding box for the head in these training set images, and bring them to a canonical scale. Further, we crop them such that the center of the shoulder annotations lies at (160 px, 80 px). We do not perform this image normalization at test time. Following the methodology of Felzenszwalb et al. [felzenszwalb2008discriminatively], at test time we run our model on images with only one person (351 images of the 1016 test examples). As stated in Section 3, the model is run on 6 different input image scales and we then use the joint location with highest confidence across those scales as the final location.
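Horizontal mirroring of an annotated example can be sketched as follows. The joint naming scheme (`left_`/`right_` prefixes), the (x, y) coordinate convention, and the swapping of left and right labels after the flip are assumptions made for illustration.

```python
import numpy as np

def mirror_example(image, joints):
    """Flip an image left-right and move the joint annotations with it.
    `joints` maps a joint name to an (x, y) pixel position."""
    width = image.shape[1]
    flipped = image[:, ::-1].copy()
    mirrored = {}
    for name, (x, y) in joints.items():
        if name.startswith("left_"):
            new_name = "right_" + name[len("left_"):]
        elif name.startswith("right_"):
            new_name = "left_" + name[len("right_"):]
        else:
            new_name = name
        mirrored[new_name] = (width - 1 - x, y)
    return flipped, mirrored

image = np.zeros((4, 6, 3))
image[:, 0] = 1.0                      # mark the left-most column
joints = {"left_wrist": (1, 2), "right_elbow": (4, 3), "face": (2, 0)}
flipped, mirrored = mirror_example(image, joints)
```

Applying the function twice returns the original image and annotations, which is a convenient sanity check for this kind of augmentation.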
For training the convnet we use Theano [theano], which provides a Python-based framework for efficient GPU processing and symbolic differentiation of complex compound functions. To reduce GPU memory usage while training, we cache only 100 mini-batches on the GPU; this allows us to use larger convnet models and keep all training data on a single GPU. As part of this framework, our system has two main threads of execution: 1) a training function which runs on the GPU evaluating the batched-SGD updates, and 2) a data dispatch function which preprocesses the data on the CPU and transfers it to the GPU when thread 1) has finished processing the 100 mini-batches. Training each convnet on an NVIDIA TITAN GPU takes 1.9 ms per patch (fprop + bprop), or 41 min in total. We test on a CPU cluster with 5000 nodes. Testing takes 0.49 s per image (at 0.94× scale), or 2.8 min in total. NMS and the spatial model take negligible time.
For testing, because of the shared nature of weights for all windows in each image, we convolve the learned filters with the full image instead of individual windows. This dramatically reduces the time to perform forward propagation on the full test set.
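The equivalence behind this speed-up can be checked on a single linear filter. In the sketch below, `per_window` evaluates the filter separately on every window, while `whole_image` accumulates one shifted copy of the full image per filter tap; both produce the same response map. The function names and sizes are illustrative, and a real convnet would apply the same idea layer by layer.

```python
import numpy as np

def per_window(image, kernel):
    """Evaluate the filter independently on every window position."""
    kh, kw = kernel.shape
    rows, cols = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def whole_image(image, kernel):
    """Same result, computed by sweeping each filter tap over the full image."""
    kh, kw = kernel.shape
    rows, cols = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((rows, cols))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + rows, j:j + cols]
    return out

rng = np.random.default_rng(0)
image = rng.random((40, 50))
kernel = rng.normal(size=(5, 5))
windowed = per_window(image, kernel)
shared = whole_image(image, kernel)
```

The second form loops over the 25 filter taps instead of the 1,656 window positions, which is where the saving comes from when weights are shared across windows.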
To evaluate our model on the FLIC dataset we use a measure of accuracy suggested by Sapp et al. [sapp13cvpr]: for a given joint precision radius we report the percentage of joints in the test set correct within the radius threshold (where distance is defined as 2D Euclidean distance in pixels). In Figure 6 we evaluate this performance measure on the wrist, elbow and shoulder joints. We also compare our detector to the DPM [felzenszwalb2008discriminatively] and MODEC [sapp13cvpr] architectures. Note that we use the same subset of 351 images when testing all detectors.
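This measure is straightforward to compute. The sketch below assumes predictions and ground truth are given as arrays of (x, y) pixel positions and counts a joint as correct when its distance is less than or equal to the radius; whether the boundary is inclusive is our assumption, and the function name is hypothetical.

```python
import numpy as np

def accuracy_within_radius(predicted, truth, radii):
    """Percentage of joints whose 2D Euclidean error is within each radius."""
    error = np.linalg.norm(np.asarray(predicted, dtype=float)
                           - np.asarray(truth, dtype=float), axis=1)
    return [100.0 * float(np.mean(error <= r)) for r in radii]

predicted = [[0, 0], [3, 4], [10, 0]]
truth = [[0, 0], [0, 0], [0, 0]]        # errors of 0, 5 and 10 pixels
curve = accuracy_within_radius(predicted, truth, radii=[1, 5, 10])
```

Evaluating the function over a range of radii gives one accuracy curve per joint type, which is the form of the plots in Figure 6.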
Figure 6: a) Wrist, b) Elbow, c) Shoulder.
Figure 6 shows that our architecture outperforms or equals the MODEC and DPM detectors for all three body parts. For the wrist and elbow joints our simple spatial model improves joint localization for approximately 5% of the test set cases (at a 5 pixel threshold), which enables us to outperform all other detectors. However, for the shoulder joint our spatial model actually decreases the joint location accuracy for large thresholds. This is likely due to the poor performance of the convnet on the elbow.
As expected, the spatial model cannot improve the joint accuracy of points that are already close to the correct value; however, it is nevertheless successful in removing outliers for the wrist and elbow joints. Figure 7 is an example where a strong false positive results in an incorrect part location before the spatial model is applied, and is subsequently removed after applying our spatial model.
Figure 7: a) RGB and joints, b) distribution before spatial model, c) distribution after spatial model.
We have shown successfully how to improve the state-of-the-art on one of the most complex computer vision tasks: unconstrained human pose estimation. Convnets are impressive low-level feature detectors, which when combined with a global position prior are able to outperform much more complex and popular models. We explored many different higher level structural models with the aim of further improving the results, but the most generic higher level spatial model achieved the best results. As mentioned in the introduction, this is counter-intuitive to common belief for human kinematic structures, but it mirrors results in other domains. For instance, in speech recognition, researchers observed that if the learned transition probabilities (higher level structure) are reset to equal probabilities, the recognition performance, now mainly driven by the emission probabilities, does not reduce significantly [MorganPrivateHMM]. Other domains are discussed in more detail by [lucchi2011spatial].
We expect to obtain further improvement by enlarging the training set with a new pose-based warping technique that we are currently investigating. Furthermore, we are also currently experimenting with multi-resolution input representations that take a larger spatial context into account.
This research was funded in part by the Office of Naval Research ONR Award N000141210327 and by a Google award.