Mass Displacement Networks

08/12/2017 ∙ by Natalia Neverova, et al. ∙ Facebook

Despite the large improvements in performance attained by using deep learning in computer vision, one can often further improve results with some additional post-processing that exploits the geometric nature of the underlying task. This commonly involves displacing the posterior distribution of a CNN in a way that makes it more appropriate for the task at hand, e.g. better aligned with local image features, or more compact. In this work we integrate this geometric post-processing within a deep architecture, introducing a differentiable and probabilistically sound counterpart to the common geometric voting technique used for evidence accumulation in vision. We refer to the resulting neural models as Mass Displacement Networks (MDNs), and apply them to human pose estimation in two distinct setups: (a) landmark localization, where we collapse a distribution to a point, allowing for precise localization of body keypoints and (b) communication across body parts, where we transfer evidence from one part to the other, allowing for a globally consistent pose estimate. We evaluate on large-scale pose estimation benchmarks, such as MPII Human Pose and COCO datasets, and report systematic improvements when compared to strong baselines.




1 Introduction

The advent of deep learning has reduced the amount of hand-engineered processing required for computer vision by integrating many operations such as pooling, normalization, and resampling within Convolutional Neural Networks (CNN). The succession of such operations gradually discards the effects of irrelevant signal transformations, allowing the higher layers of CNNs to exhibit increased robustness to small input perturbations. While this invariance is desirable for high-level vision tasks, it can harm tasks such as pose estimation where one aims at precise spatial localization, rather than abstraction.

It is therefore common to apply some form of computer vision-based post-processing on top of CNN-based scores to obtain sharp, localized geometric features. One of the first steps in this direction has been the use of structured prediction on top of semantic segmentation, e.g. by combining image-based DenseCRF Koltun13 inference with CNNs for semantic segmentation ChenPKMY14 , training both systems jointly crfrnn , or more recently learning CNN-based pairwise terms in structured prediction modules Chandra16 ; Hengel16 . All of these works couple decisions so as to reach some consistency in the labeling of global structures, typically in the form of smoothness constraints. While this is meaningful for tasks where information is spread out, such as semantic segmentation, we are interested in more general transformations, some of which are illustrated in Fig. 1. For instance, we consider tasks that require 1-D or 0-D outputs (boundary and keypoint detection, respectively), effectively collapsing the spatially extended output of a CNN into lower-dimensional structures. Even though in principle this could be cast in structured prediction terms, the resulting optimization problem amounts to maximizing a submodular function Blaschko11 and can only be approximately optimized. We therefore turn to geometry-based, rather than optimization-based methods, and pursue their incorporation in the context of deep learning.

Our starting point is the understanding that requiring high spatial accuracy from a purely CNN-based deep architecture is misusing the network’s abilities: by design, the CNN feature maps get increasingly smooth as we go deeper. We can instead combine these smooth CNN-based classification results with an equally smooth displacement field obtained from another CNN branch, indicating to every pixel where its mass should be displaced. This is achieved by separately predicting the x- and y-components of the displacement vectors (or offsets) of all pixels. Even though the displacement field may be smooth, if its values are accurate the result can become sharp: in Fig. 1 we show some indicative examples of a smooth response being manipulated by smooth displacement fields that turn it into quite different shapes, which could be appropriate for a variety of visual tasks.

Figure 1: The low spatial resolution of CNNs results in overly smooth per-pixel confidence scores (mass), as shown in the image on the left. Rather than stretch the CNN’s capabilities in order to obtain spatially sharp responses, we propose instead to append a displacement field as another CNN output that rearranges the classification scores, lending more evidence to the ground truth positions. The x- and y-components of different displacement fields are shown in the top row on the right (the middle row shows the same components presented as a vector field and displayed in color, for illustrative purposes). These are combined by a Mass Displacement module into a sharp decision, shown in the bottom row. This can amount to making the classification obtain a particular shape, e.g. through alignment with image boundaries (a), a 1D structure such as a line or a curve (b), a point (c), or displacing to another position (d). While both the raw network outputs (mass) and the displacement fields are smooth, the final results are sharp.

What we are proposing can be understood as reinventing geometric voting in the context of deep learning: in a host of computer vision tasks Ballard81 ; LeibeLS08 ; MajiM09 ; GallYRGL11 ; RazaviGKG12 ; BarinovaLK12 voting can be used to first associate an observation with the positions that it supports and then shortlist structures that are supported by multiple observations, e.g. many points voting for a line or a circle Ballard81 , object parts voting for an object’s 2D LeibeLS08 or 3D pose ism3d , or many object hypotheses voting for a single object bounding box GidarisK15 . Our work was actually motivated by the recent success of such schemes for landmark localization in gpapan , instance segmentation in WuSH16b , and bounding box post-processing in GidarisK15 .

All of these approaches, however, are plagued by the heuristic nature of geometric voting, which makes them applicable only as post-processing steps. For example, in gpapan posterior probabilities are displaced and then accumulated, which results in score maps that can be larger than one, disqualifying them from training with losses appropriate for classification. The authors end up using the cross-entropy loss for the original CNN and the L2 loss for the second stage, while also not training the displacement fields end-to-end; as such it is unclear whether the displacement fields really point to the positions that they should. Instead, in this work we develop this somewhat ad hoc post-processing into a module that can easily be combined with existing architectures and trained end-to-end.

In particular, we treat geometric voting as a differentiable operation, allowing us to train the CNN-based score maps and displacement fields in an end-to-end manner, ensuring that both arguments to the voting function are optimizing the final system’s performance. Each displaced point is dilated by a kernel to support a region around its novel position, and in the output space every position accumulates evidence from input points that can support it. The currently common approach of adding the posterior probabilities is also not justified probabilistically Williams . Instead we consider a probabilistically sound method of accumulating evidence that forces the final outcomes to stay smaller than one – as such the output of our operation lends itself to training with probabilistic criteria, such as the cross-entropy loss.

Since our approach combines spatial transformation with the geometric manipulation of a probability mass function, we call a network incorporating our method a Mass Displacement Network (MDN). We explore two tasks: (i) human body landmark localization through within-part voting, where the coarse score map of a part is sharpened by a voting process; and (ii) human pose estimation through across-part voting, where every body part score map votes for the presence of other parts. We provide systematic demonstrations of improvements achieved by MDNs over strong baselines on large-scale benchmarks in human pose estimation, in both the single-person (MPII Human Pose dataset) and multi-person (COCO dataset) setups.

Connections to other works:

Apart from the works mentioned already, our approach has connections to Spatial Transformer Networks (STNs), introduced in JaderbergSZK15 to bring raw images into correspondence and remove intra-class variation that can be modelled in terms of image deformations. The tacit assumption underlying STNs is that the input and output fields are related by a diffeomorphic transformation, such as a similarity transformation or an affine map, meaning that the dimensionality of structures is preserved. Instead, here we consider transformations that allow us to collapse 2D structures into lower-dimensional structures, such as points or lines. Furthermore, STNs typically consider a single global parametric transformation, while we have a non-parametric transformation determined by a fully convolutional layer. Finally, as we explain in Sec. 2 , STNs are designed like image interpolation operations, and are typically used at the input of a network, while we cater for evidence accumulation, and our module is intended to be appended at the end of a network, or generally after some decisions have been produced by a CNN.

In a work parallel to our own daiactive , the authors have introduced active CNNs, which allow a neuron to pick incoming neurons from input positions determined dynamically through a CNN-based deformation. While this work shares with us the idea of using a convolutional, CNN-based deformation field, in our case we have input neurons deciding where they move to, rather than output neurons deciding from where to pool their information. As such our approach seems to be better suited for collapsing densities and accumulating evidence to certain positions, while the work of daiactive seems better suited for the task of discarding the effect of deformations.

Figure 2: Architecture of a Mass Displacement Network (MDN): the convolutional layers of a CNN are trained with loss functions that allow for some uncertainty in the localization of landmarks, accommodating their inherently smooth responses. A voting operation combines these and collapses the smooth CNN predictions into sharp landmarks. We treat the voting mechanism as a differentiable module and use it for end-to-end training.

2 Mass Displacement Networks

We start by describing in Sec. 2.1 the non-probabilistic, geometric voting process currently employed in recent works gpapan ; Hengel16 and then propose a principled variant that relies on the noisy-or rule Pear88 allowing us to use the cross-entropy loss during training. We then turn in Sec. 2.2 to the equations used for end-to-end training of the resulting Mass Displacement Network.

2.1 Additive and Noisy-OR Voting

We consider that both our local evidence functions and the output structures reside in a two-dimensional space. In particular we consider that for any position $\mathbf{x} = (h, v)$ a convolutional network provides us with two outputs: firstly an estimate $P(\mathbf{x})$ of the local confidence for the presence of a feature, and secondly an estimate of the predicted structure’s position. The latter is expressed as a horizontal/vertical displacement (or offset) $D(\mathbf{x}) = (D_h(\mathbf{x}), D_v(\mathbf{x}))$ that should be applied to the current position $\mathbf{x}$ to obtain the refined estimate $\mathbf{x}'$:

$$\mathbf{x}' = \mathbf{x} + D(\mathbf{x}).$$

For landmark localization in within-part voting the displacement field can act like a residual correction signal, while for across-part voting it reflects relative part locations. We can accommodate spatial uncertainty in the predicted position by supporting a structure not only at $\mathbf{x}'$, but also in the vicinity of the same point. This can be accomplished by dilating the local confidence with a kernel, e.g. $K(\mathbf{x}) = \exp(-\|\mathbf{x}\|^2 / 2\sigma^2)$, that allows us to smoothly decrease our support as we move further away from $\mathbf{x}'$. Combining evidence from multiple points is typically done through summation:

$$A(\mathbf{y}) = \sum_{\mathbf{x}} K(\mathbf{y} - \mathbf{x} - D(\mathbf{x})) \, P(\mathbf{x}), \tag{1}$$

where for every output position $\mathbf{y}$ we sum the support delivered by all input positions $\mathbf{x}$. This has been the setting used for instance in gpapan and Hengel16 for landmark localization and instance segmentation, respectively. In these works a CNN is trained with a cross-entropy loss for $P$ and a regression loss for $D$, while Eq. 1 is used at test time to deliver more accurate estimates of the desired structures.
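The additive voting rule of Eq. 1 can be sketched in a few lines of NumPy. This is only an illustrative, dense implementation under stated assumptions: the function name `additive_voting`, the unnormalized Gaussian kernel with `K(0) = 1`, the width `sigma`, and the toy 5x5 input are all hypothetical choices, not details taken from the paper. It also costs O((HW)^2) memory, so it is suited to small maps only.

```python
import numpy as np

def additive_voting(conf, dh, dv, sigma=1.0):
    """Additive voting: every input pixel adds kernel-weighted support
    around its displaced position (assumed Gaussian kernel, K(0) = 1).

    conf   : (H, W) confidences in [0, 1]
    dh, dv : (H, W) horizontal / vertical offsets, in pixels
    """
    H, W = conf.shape
    rows, cols = np.mgrid[0:H, 0:W]            # input positions
    tgt_h = (cols + dh).ravel()                # displaced horizontal coordinate
    tgt_v = (rows + dv).ravel()                # displaced vertical coordinate
    out_h = cols.ravel()[:, None]              # output positions
    out_v = rows.ravel()[:, None]
    sq_dist = (out_h - tgt_h[None, :]) ** 2 + (out_v - tgt_v[None, :]) ** 2
    kernel = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return (kernel @ conf.ravel()).reshape(H, W)

# Toy example: a smooth 3x3 blob whose offsets all point at the centre pixel.
H = W = 5
conf = np.zeros((H, W))
conf[1:4, 1:4] = 0.6
rows, cols = np.mgrid[0:H, 0:W]
dh = (2 - cols).astype(float)
dv = (2 - rows).astype(float)
votes = additive_voting(conf, dh, dv)
print(votes[2, 2])   # nine votes of 0.6 accumulate, so the score exceeds 1
```

In this toy case the centre score is about 5.4, which illustrates why the summed output cannot be treated as a probability.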

The operation in Eq. 1 can be justified in the context of image interpolation, as in the case of Spatial Transformer Networks JaderbergSZK15 , or in standard Kernel Density Estimation (KDE), but not as a method of accumulating evidence Williams . The main problem, detailed in Appendix A, is that we cannot simultaneously guarantee that the input and output fields both lie in $[0, 1]$, so that they can be trained with the cross-entropy loss, and that a confident posterior at $\mathbf{x}$ will confidently support its displaced replica at $\mathbf{x}' = \mathbf{x} + D(\mathbf{x})$, i.e. that $P(\mathbf{x}) = 1$ implies $A(\mathbf{x}') = 1$.

We can guarantee both requirements by replacing summation with maximization (i.e. performing a “Transformed-Max-Pooling” operation). Our experiments with this approach were underwhelming, understandably because we do not really accumulate evidence from many points, but rather rely on the single most confident one. Instead we propose to use differentiable approximations of the maximum operation Pear88 ; ViolaPZ05 ; babenko that allow us to softly combine multiple pieces of evidence while ensuring that the outputs are probabilistically valid.

In particular we use the noisy-or combination rule Pear88 , which provides a probabilistic counterpart to a logical OR-ing operation. We consider that we have $N$ pieces of evidence about the presence of a feature, each being true with a probability of $p_i$. The noisy-or operation expresses the probability $p$ of the presence of the feature as follows: $p = 1 - \prod_{i=1}^{N} (1 - p_i)$, namely the feature is absent if all supporting pieces of evidence are simultaneously absent; as such, any additional piece of evidence can only increase the estimated value of $p$. If we now replace $p_i$ in the above formula with $K(\mathbf{y} - \mathbf{x} - D(\mathbf{x})) \, P(\mathbf{x})$, the kernel-weighted confidence that input position $\mathbf{x}$ delivers to output position $\mathbf{y}$, we obtain the following rule for combining evidence in the MDN:

$$A(\mathbf{y}) = 1 - \prod_{\mathbf{x}} \left( 1 - K(\mathbf{y} - \mathbf{x} - D(\mathbf{x})) \, P(\mathbf{x}) \right). \tag{2}$$

We note that we can use a first-order approximation to obtain Eq. 1 from Eq. 2 if all of the individual terms $K(\mathbf{y} - \mathbf{x} - D(\mathbf{x})) \, P(\mathbf{x})$ are very small, which in hindsight gives some explanation for the practical success of Eq. 1. However, in Eq. 2 we have $A(\mathbf{y}) \in [0, 1]$, which allows us to use the cross-entropy loss throughout training, by virtue of being probabilistically meaningful. Our experiments show that this yields results as good as the currently broadly used heuristic of regressing to Gaussian functions TompsonGJLB15 ; NewellYD16 ; BulatT16 , while being simpler and cleaner.
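To make the contrast between the two combination rules concrete, the following NumPy sketch implements noisy-or voting next to additive voting on the same toy input. The helper names (`vote_weights`, `noisy_or_voting`, `additive_voting`), the unnormalized Gaussian kernel, and the toy data are assumptions made for illustration. The sketch checks two properties stated in the text: the noisy-or output stays within [0, 1], and for very small confidences the two rules nearly coincide (the first-order approximation).

```python
import numpy as np

def vote_weights(dh, dv, sigma=1.0):
    """G[y, x]: kernel weight with which input pixel x supports output pixel y
    (assumed unnormalized Gaussian, so every weight lies in (0, 1])."""
    H, W = dh.shape
    rows, cols = np.mgrid[0:H, 0:W]
    tgt_h = (cols + dh).ravel()
    tgt_v = (rows + dv).ravel()
    sq_dist = ((cols.ravel()[:, None] - tgt_h[None, :]) ** 2
               + (rows.ravel()[:, None] - tgt_v[None, :]) ** 2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def noisy_or_voting(conf, dh, dv, sigma=1.0):
    """Noisy-or rule: 1 - prod_x (1 - G[y, x] * conf[x])."""
    G = vote_weights(dh, dv, sigma)
    absent = np.prod(1.0 - G * conf.ravel()[None, :], axis=1)
    return (1.0 - absent).reshape(conf.shape)

def additive_voting(conf, dh, dv, sigma=1.0):
    """Additive rule: sum_x G[y, x] * conf[x]."""
    G = vote_weights(dh, dv, sigma)
    return (G @ conf.ravel()).reshape(conf.shape)

# Same toy blob as before: nine pixels with confidence 0.6 voting for the centre.
conf = np.zeros((5, 5))
conf[1:4, 1:4] = 0.6
rows, cols = np.mgrid[0:5, 0:5]
dh = (2 - cols).astype(float)
dv = (2 - rows).astype(float)

nor = noisy_or_voting(conf, dh, dv)
add = additive_voting(conf, dh, dv)
print(nor[2, 2], add[2, 2])     # noisy-or stays below 1, the sum does not

# With very weak evidence the two rules almost agree.
weak = 1e-3 * (conf > 0)
weak_nor = noisy_or_voting(weak, dh, dv)
weak_add = additive_voting(weak, dh, dv)
print(np.abs(weak_nor - weak_add).max())
```

At the centre pixel the noisy-or value is `1 - 0.4**9`, close to but never above one, so it remains a valid target for a cross-entropy loss.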

2.2 Back-propagation through an MDN module

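The input-output mappings defined by Eq. 1 and Eq. 2 are differentiable with respect to both input functions, $P$ and $D$, and as such lend themselves to end-to-end training with back-propagation. Given a gradient signal $\partial L / \partial A(\mathbf{y})$ that dictates how the output layer activations should change to decrease the network loss $L$, we obtain the update equations for $P(\mathbf{x})$ and $D(\mathbf{x})$ through the following chain rule:

$$\frac{\partial L}{\partial P(\mathbf{x})} = \sum_{\mathbf{y}} \frac{\partial L}{\partial A(\mathbf{y})} \frac{\partial A(\mathbf{y})}{\partial P(\mathbf{x})}, \qquad \frac{\partial L}{\partial D(\mathbf{x})} = \sum_{\mathbf{y}} \frac{\partial L}{\partial A(\mathbf{y})} \frac{\partial A(\mathbf{y})}{\partial D(\mathbf{x})}, \tag{3}$$

where the summation runs over the top-layer neurons $\mathbf{y}$ that send gradients back to neuron $\mathbf{x}$. Turning to the computation of the partial derivatives in Eq. 3, the use of displacement fields means that we no longer have a standard convolutional layer; an input position $\mathbf{x}$ can potentially influence any other output position $\mathbf{y}$, as dictated by Eq. 2. For convenience we rewrite Eq. 2 as follows:

$$A(\mathbf{y}) = 1 - \prod_{\mathbf{x}} \left( 1 - G(\mathbf{y}, \mathbf{x}) \, P(\mathbf{x}) \right), \qquad G(\mathbf{y}, \mathbf{x}) = K(\mathbf{y} - \mathbf{x} - D(\mathbf{x})), \tag{4}$$

where $G(\mathbf{y}, \mathbf{x})$ indicates the amount by which $\mathbf{x}$ influences $\mathbf{y}$. Using the same steps as in ViolaPZ05 , in case of a Gaussian kernel $K(\mathbf{x}) = \exp(-\|\mathbf{x}\|^2 / 2\sigma^2)$ we have:

$$\frac{\partial A(\mathbf{y})}{\partial P(\mathbf{x})} = \frac{1 - A(\mathbf{y})}{1 - G(\mathbf{y}, \mathbf{x}) P(\mathbf{x})} \, G(\mathbf{y}, \mathbf{x}), \qquad \frac{\partial A(\mathbf{y})}{\partial D_h(\mathbf{x})} = \frac{1 - A(\mathbf{y})}{1 - G(\mathbf{y}, \mathbf{x}) P(\mathbf{x})} \, P(\mathbf{x}) \, G(\mathbf{y}, \mathbf{x}) \, \frac{h' - h - D_h(\mathbf{x})}{\sigma^2},$$

where $h', h$ are the horizontal components of $\mathbf{y}, \mathbf{x}$ respectively. The derivative with respect to the vertical offset $D_v(\mathbf{x})$ is obtained analogously.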
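The back-propagation step can be checked numerically. The sketch below assumes the noisy-or rule with an unnormalized Gaussian kernel of width `sigma`, derives the Jacobians of the output with respect to the confidences and the horizontal offsets in closed form, and compares them with central finite differences. All function names, the grid size, and the random inputs are hypothetical; the closed-form expressions are one derivation under these assumptions, not necessarily the exact form printed in the paper.

```python
import numpy as np

def forward(conf, dh, dv, sigma):
    """Noisy-or voting; returns flat output A, weights G[y, x] and u_h[y, x],
    the horizontal distance between output y and the displaced input x."""
    H, W = conf.shape
    rows, cols = np.mgrid[0:H, 0:W]
    u_h = cols.ravel()[:, None] - (cols + dh).ravel()[None, :]
    u_v = rows.ravel()[:, None] - (rows + dv).ravel()[None, :]
    G = np.exp(-(u_h ** 2 + u_v ** 2) / (2.0 * sigma ** 2))
    A = 1.0 - np.prod(1.0 - G * conf.ravel()[None, :], axis=1)
    return A, G, u_h

def jacobians(conf, dh, dv, sigma):
    """Closed-form dA[y]/dP[x] and dA[y]/dD_h[x] (requires conf < 1)."""
    A, G, u_h = forward(conf, dh, dv, sigma)
    p = conf.ravel()[None, :]
    rest = (1.0 - A)[:, None] / (1.0 - G * p)   # product over all inputs but x
    dA_dP = rest * G
    dA_dDh = rest * p * G * u_h / sigma ** 2
    return dA_dP, dA_dDh

def numeric_jacobian(fn, arr, eps=1e-5):
    """Central finite differences, one input element at a time."""
    J = np.zeros((arr.size, arr.size))
    for i in range(arr.size):
        plus, minus = arr.copy(), arr.copy()
        plus.flat[i] += eps
        minus.flat[i] -= eps
        J[:, i] = (fn(plus) - fn(minus)) / (2.0 * eps)
    return J

rng = np.random.default_rng(0)
H, W, sigma = 3, 4, 1.5
conf = rng.uniform(0.05, 0.9, size=(H, W))
dh = rng.normal(size=(H, W))
dv = rng.normal(size=(H, W))

dA_dP, dA_dDh = jacobians(conf, dh, dv, sigma)
num_P = numeric_jacobian(lambda c: forward(c, dh, dv, sigma)[0], conf)
num_Dh = numeric_jacobian(lambda d: forward(conf, d, dv, sigma)[0], dh)
print(np.abs(dA_dP - num_P).max(), np.abs(dA_dDh - num_Dh).max())

# Chain rule: push an upstream gradient dL/dA back to both inputs.
grad_A = rng.normal(size=H * W)
grad_P = (dA_dP.T @ grad_A).reshape(H, W)
grad_Dh = (dA_dDh.T @ grad_A).reshape(H, W)
```

Because every input can influence every output, the Jacobians are dense; a practical layer would presumably truncate the kernel support rather than materialize them.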

Figure 3: MDN computation in practice: when presented with an image, the three convolutional branches of our network deliver the smooth posterior probabilities $P$ and the horizontal and vertical offsets $D_h$ and $D_v$ shown in the middle row (for simplicity, for every kind of output we display a sum over all planes corresponding to different keypoints). The MDN combines these into the sharper joint estimates shown on the right.

3 Experimental Evaluation

We present experiments in two setups: firstly, single-person pose estimation on the MPII Human Pose dataset AndrilukaPGB14 , where the position and scale of a human is considered known in advance. This disentangles the performance of the pose estimation and object detection systems. Secondly, we consider human pose estimation “in-the-wild” on the COCO dataset mscoco , where one needs to jointly tackle detection and pose estimation. We use different baselines for both setups, since there is no common strong baseline for both. In both cases MDNs systematically improve strong baselines.

          Model   No voting Bilinear   kernel Gaussian kernel
Baseline, additive 84.31 87.54 87.70 88.01 88.11 88.15 88.19 88.19
Baseline, noisyOR 87.49 87.63 87.84 87.98 88.08 88.19 88.11
Baseline, max 86.69 86.38 86.22 86.03 85.34 85.12
Spatial Transformer JaderbergSZK15 88.28
MDN-additive 88.60 88.63 88.61
MDN-noisyOR 88.61 88.58 88.32
Table 1: Relative performance of the MDN applied to isolated landmarks and trained with different combination rules. All models are based on ResNet-152 and are tested on the validation set of MPII Single person. The third baseline is obtained by applying a max operation instead of a sum (additive MD) or a product (noisyOR). denotes the kernel size.
Model                                 Resnet-50   Resnet-101   Resnet-152   Hourglass-8
Baseline, no voting                   83.29       84.28        84.31        89.24
Baseline, additive, bilinear kernel   86.50       87.50        87.54        89.43
Baseline, additive, Gaussian          87.17       88.10        88.19        89.70
Baseline, noisyOR, bilinear kernel    86.42       87.46        87.49        89.49
Baseline, noisyOR, Gaussian           87.12       88.08        88.11        89.67
MDN-additive, bilinear kernel         87.23       88.42        88.60        89.72
MDN-noisyOR, bilinear kernel          87.25       88.52        88.61        89.64
Table 2: Ablation of interplay between MDN and network architecture choices (PCKh on MPII-val).

3.1 Single person pose estimation

3.1.1 Experimental setup

Dataset & Evaluation: We evaluate several variants of MDNs on the MPII Human Pose dataset AndrilukaPGB14 which consists of 25K images containing over 40K people with annotated body joints. We follow the single person evaluation protocol, i.e. use a subset of the data with isolated people assuming their positions and corresponding scales to be known at test time. We follow the standard evaluation procedure of AndrilukaPGB14 and report performance with the common Percentage Correct Keypoints-w.r.t.-head (PCKh) metric YangR13 . As in BulatT16 ; NewellYD16 , we refine the test joint positions by averaging network predictions obtained with the original and horizontally flipped images.
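For readers unfamiliar with the metric, a PCKh-style score can be sketched as follows. This is a simplified illustration, not the official MPII evaluation code: it assumes that a keypoint counts as correct when its distance to the ground truth is at most a fraction `alpha` of the person's head size (the common setting is `alpha = 0.5`), and the function name `pckh`, the array layout, and the toy data are hypothetical.

```python
import numpy as np

def pckh(pred, gt, head_sizes, visible=None, alpha=0.5):
    """Percentage of keypoints within alpha * head size of the ground truth.

    pred, gt   : (N, K, 2) predicted / ground-truth keypoint coordinates
    head_sizes : (N,) per-person head size used for normalization
    visible    : optional (N, K) boolean mask of annotated keypoints
    """
    dist = np.linalg.norm(pred - gt, axis=-1)            # (N, K) pixel errors
    correct = dist <= alpha * head_sizes[:, None]
    if visible is None:
        visible = np.ones_like(correct, dtype=bool)
    return 100.0 * correct[visible].mean()

# Toy data: two people, three joints each, head size of 10 pixels.
gt = np.zeros((2, 3, 2))
pred = gt.copy()
pred[0, 0] = (3.0, 0.0)    # 3 px error, within the 5 px threshold
pred[1, 2] = (6.0, 8.0)    # 10 px error, outside the threshold
score = pckh(pred, gt, head_sizes=np.array([10.0, 10.0]))
print(score)
```

Normalizing by head size makes the threshold scale with the apparent size of each person, which is why the metric is comparable across images.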


We conduct the first exhaustive set of experiments by fine-tuning ImageNet-pretrained ResNet architectures HeZRS15 . We substitute the output linear layer and the average pooling that precedes it with a bottleneck convolution layer that projects its 2048-dimensional input down to 512 dimensions. This acts like a buffer layer between the pretrained network and the pose-specific output layers. As in gpapan , we reduce the amount of spatial downsampling in such networks by reducing the stride of the first residual module in the conv5 block from 2 to 1, and employ atrous convolutions afterwards ChenPKMY14 . As a result, the network takes a cropped image as input and produces a set of feature planes at a higher spatial resolution than the unmodified network would. These are then bilinearly upsampled to the output resolution.
On top of this common network trunk operate three convolutional branches that deliver the three inputs of the MDN, namely the confidence and the two displacement fields. Each such branch is a single convolutional layer that maps the feature planes to one output plane per landmark to be localized. The outputs of these branches are passed to the MD layer, which in turn outputs the final refined localizations at the same resolution.

We also present preliminary experimental results with hourglass networks NewellYD16 , that have even higher performance on MPII – we apply a similar re-purposing as the one outlined above by introducing additional convolutional heads for predicting displacement fields after each stack of the network (where the final estimates for the offsets are obtained by taking a sum over predictions at each step).

Training: In these experiments, we test the performance of both additive and noisyOR MDNs. We train the network with three kinds of supervision signals applied to the following outputs:
(a) the confidence maps trained with pixelwise binary cross entropy loss. The supervision signal at each point from output plane is formulated in the form of binary disks centered at each keypoint location: , where is the ground truth position of joint , .
(b) two offset planes learned with robust Huber loss applied solely in the -vicinity of the ground truth position of every keypoint. The ground truth value for each point in the -vicinity of joint voting for joint is defined as follows:


where, as before, is the ground truth position of joint and is a normalization factor (defined below). The vertical component is defined analogously.
(c) the final refined localizations . In this case, depending on the aggregation rule, we apply either MSE regression loss (for additive mass displacement) or binary cross entropy loss (in case of noisyOR aggregation). The final supervision signal is formulated in the form of a Gaussian (additive MDN) or a binary disk (noisyOR MDN) in the same way as in (a) but with a smaller value of .
We would like to note here that supervising the network with a single loss (c) is possible and produces similar final results, but at the cost of significantly slower convergence.
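The supervision signals (a) and (b) can be sketched as below. This is a minimal illustration under explicit assumptions: the disk radius `radius`, the normalization factor `norm`, the Huber threshold `delta`, and the convention that the offset target equals (keypoint position minus pixel position) divided by `norm` are hypothetical choices, selected so that adding the de-normalized offset to a pixel lands on the keypoint. The values actually used in the experiments are not reproduced here.

```python
import numpy as np

def disk_target(shape, keypoint, radius):
    """Binary disk: 1 within `radius` pixels of the keypoint, 0 elsewhere."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    kh, kv = keypoint                      # (horizontal, vertical) position
    inside = (cols - kh) ** 2 + (rows - kv) ** 2 <= radius ** 2
    return inside.astype(np.float32)

def offset_targets(shape, keypoint, radius, norm):
    """Normalized offsets pointing at the keypoint, defined only in its vicinity."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    kh, kv = keypoint
    mask = (cols - kh) ** 2 + (rows - kv) ** 2 <= radius ** 2
    dh = np.where(mask, (kh - cols) / norm, 0.0)
    dv = np.where(mask, (kv - rows) / norm, 0.0)
    return dh, dv, mask

def masked_huber(pred, target, mask, delta=1.0):
    """Huber loss averaged over the supervised (masked) pixels only."""
    a = np.abs(pred - target)
    loss = np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))
    return loss[mask].mean()

shape, keypoint, radius, norm = (8, 8), (4, 3), 2, 2.0
disk = disk_target(shape, keypoint, radius)
dh, dv, mask = offset_targets(shape, keypoint, radius, norm)
rows, cols = np.mgrid[0:8, 0:8]
print(int(disk.sum()))                          # number of supervised pixels
print(masked_huber(np.zeros(shape), dh, mask))  # loss of an all-zero prediction
```

Restricting the offset loss to the vicinity of each keypoint means that pixels far from any joint are left free, which matches the description of loss (b) being applied solely near the ground truth.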

All networks are trained using the training set of the MPII Single person dataset with artificial data augmentation in the form of flipping, scaling and rotation, as described in NewellYD16 . We employ the RMSProp update rule with an initial learning rate of 0.0025 and a learning rate decay of 0.99, and as in NewellYD16 use a validation set of 3000 held-out images for our ablation study.

We perform evaluation on two separate tasks of within-part and cross-part voting:
(a) local mass displacement(within-part voting): in this setting, the offset branches receive their supervision signal in the form of local distributions of horizontal and vertical offsets defined as in 5, where and ;
(b) global mass displacement (cross-part voting): the implementation of the cross voting mechanism is similar to the previous case, but and , where is the output resolution. In this case, we found it more effective to restrict connectivity between joints and perform cross-joint voting along the kinematic tree starting from the center of the body.


Figure 4: MDN-based improvements in human pose estimation through (a) within-part voting (MPII dataset, Single person track, ResNet-152 backbone) and (b) cross-part voting (COCO dataset, Mask-RCNN backbone). Top row: baseline performance; bottom row: MDN-corrected pose estimates.
Gaussian kernel
Baseline, additive 83.96 87.72 87.64 87.73
MDN-additive 88.05 88.08 87.83
Table 3: Relative performance of a ResNet-152-MD network applied to cross-voting between joints.
Model Head Shoulder Elbow Wrist Hip Knee Ankle Mean Mean-val
Chu et al.ChuYOMYW17 98.5 96.3 91.9 88.1 90.6 88.0 85.0 91.5 89.4
Newell et al. NewellYD16 98.2 96.3 91.2 87.1 90.1 87.4 83.6 89.4
Bulat et al. BulatT16 97.9 95.1 89.9 85.3 89.4 85.7 81.7 88.2
Wei et al. WeiRKS16 97.8 95.0 88.7 84.0 88.4 82.8 79.4 88.5
Insafutdinov et al.InsafutdinovPAA16 96.8 95.2 89.3 84.4 88.4 83.4 78.0 88.5
Belagiannis et al. BelagiannisZ17 97.7 95.0 88.2 83.0 87.9 82.6 78.4 88.1 86.3
Rafi et al. RafiGL16 97.2 93.9 86.4 81.3 86.8 80.6 73.4 86.3
Gkioxari et al.GkioxariTJ2016 96.2 93.1 86.7 82.1 85.2 81.4 74.1 86.1 85.3
Lifshitz et al. LifshitzFU16 97.8 93.3 85.7 80.4 85.3 76.6 70.2 85.0
Pishchulin et al. deepcut 94.1 90.2 83.4 77.3 82.6 75.7 68.6 82.4
Hu&Ramanan HuR16 95.0 91.6 83.0 76.6 81.9 74.5 69.5 82.4
Carreira et al. carreiraA2016 95.7 91.7 81.7 72.4 82.8 73.2 66.4 81.3
Hourglass-8-MDN 98.2 96.4 91.6 87.4 90.8 87.9 84.3 91.3 89.7
Resnet-152-MDN 97.7 95.8 90.4 85.1 88.9 85.6 81.6 89.7 88.6
Table 4: Comparison with the state of the art frameworks on MPII Single person dataset (test set). Mean-val denotes mean PCKh on the validation set. (*) – models based on the Hourglass architecture.

3.1.2 Evaluation results

In Table 1 we compare Mass Displacement Networks for within-part voting against a set of increasingly complicated baselines: a) a network trained with the binary cross-entropy loss with a single objective in the form of a joint heatmap, b) a network that outputs the first round of posterior probabilities and displacements independently, with the corresponding votes then aggregated as post-processing, i.e. without end-to-end training, and c) a modified Spatial Transformer Network (STN) JaderbergSZK15 that aims at shrinking the produced distributions from iteration to iteration and, just as our architecture, is trained end-to-end. In this last case, the spatial transformation is not defined globally but instead learned in the form of a vector field describing pixel-wise linear translation.

In the top rows of Table 1 we evaluate the baseline performance for different filter sizes and gauge the impact of this choice. This is easier to do since mass aggregation is done as postprocessing and does not require re-training of the model. We then train MDN models specifically with selected filter sizes.

We observe that MDNs yield a substantial boost over the different simpler baselines, even when end-to-end training is used, as in the case of STNs. This latter aspect can be attributed to the evidence accumulation operation of MDNs, which is better suited than interpolation (STNs) for the task.
The support of the kernel determines the computational complexity of the MD module; we note that by training MDNs end-to-end we achieve excellent results even with bilinear kernels, rather than using extended Gaussian kernels. Intuitively, we train our voting network to throw more accurate shots towards the center of the geometric structures, making the use of large kernels unnecessary.

In Table 2 we repeat the same evaluation for different feature extractors with a varying set of network architectures – the results indicate that there is a consistent improvement thanks to the MDN module, and that in all tasks the noisy-or and the additive voting yield virtually identical results. This confirms that we can discard the ad-hoc choice of training the second stage with regression, and replace it with the more meaningful cross-entropy loss.

We next evaluate MDNs on the task of passing information across different joints (cross-voting). The corresponding results are shown in Table 3. All models have now been trained to produce three kinds of outputs: posterior probabilities, local offsets and across-part voting offsets. This explains the drop in the baseline’s performance, which was forced to a harder multi-task learning setting (see Table 1 for comparison of the single-task network performance). However, employing an MD layer in the global setting leads to substantial improvement in the localization performance.

Comparison with the state-of-the-art methods is provided in Table 4. It shows that the MDN version of Resnet-152 outperforms all methods not based on Hourglass architecture, while Hourglass-MDN gives a 0.4 point boost over the corresponding baseline and is competitive with the most complex methods ChuYOMYW17 ; BulatT16 .

Finally, our experiments have shown that stacking several mass displacement modules in different ways (within+across, across+within, as well as several modules of the same kind) does not further improve performance. This could be explained by the fact that within-part voting is included in cross-joint aggregation (each joint also votes for itself) and, at the same time, dropping across-joint connections in simple cases allows the model to focus its capacity on local aggregation more efficiently. As a result, local voting performs better in the single-person setting, while the cross-joint scheme turns out to be most effective in the multi-person scenario.

3.2 Multi-person pose estimation

We have obtained similar improvements as the ones reported above also on the challenging task of multi-person pose estimation in the wild, which includes both object detection and pose estimation. We have built on the recently-introduced Mask-RCNN system of maskRCNN which largely simplifies the task by integrating object detection and pose estimation in an end-to-end trainable architecture. This method has been shown to be only marginally inferior to two-stage architectures, like gpapan that first detect objects, and then apply pose estimation on images cropped around the detection results.

As in our previous experiments, we have extended the Mask-RCNN architecture with two displacement branches (for the horizontal and vertical offsets) that operate in parallel to the original classification, bounding box regression and pose estimation heads. In the setting of cross-part voting, we trained the whole architecture on COCO end-to-end, using experimental settings identical to those reported in maskRCNN . As shown in Table 5, our MDN-based modification of Mask-RCNN yields a substantial boost in performance over the original Mask-RCNN architecture. We also obtain results that are directly comparable to gpapan , while employing a substantially simpler and faster architecture.

Finally, in Appendix B we show that adding the mass displacement module with additional supervision on offsets further improves performance of detection branches (see Tables 6 and 7).

Mask R-CNN, keypoints maskRCNN 62.7 87.0 68.4 57.4 71.1
Mask R-CNN, masks+keypoints maskRCNN 63.1 87.3 68.7 57.8 71.4
RMPE CmuPose 61.0 82.9 68.8 57.9 66.5
CMU-Pose CmuPose 61.8 84.9 67.5 57.1 68.2 66.5 87.2 71.8 60.6 74.6
G-RMI, COCO only gpapan 64.9 85.5 71.3 62.3 70.0 69.7 88.7 75.5 64.4 77.1
Mask R-CNN-MDN, keypoints 63.9 87.2 70.0 58.5 72.3 70.7 91.9 76.2 64.8 78.8
Table 5: Performance of state-of-the-art pose estimation models trained exclusively on COCO data and tested on COCO test-dev (same as in maskRCNN ).

4 Conclusion

In this work we have introduced Mass Displacement Networks, a principled approach to integrate voting-type operations within deep architectures. MDNs provide us with a method to accumulate evidence from the image domain through an end-to-end learnable operation. We have demonstrated systematic improvements over strong baselines in human pose estimation, in both the single-person and multi-person settings. The geometric accumulation of evidence implemented by MDNs is generic and can apply to other tasks such as surface, curve and landmark estimation in 3D volumetric data in medical imaging, or curve tracking in space and time – we intend to explore these in the future.

Appendix A

The voting transformation is described in Eq. 1 as follows:


If one interprets both and as fields of posterior probability values, one has:


In this case, ensuring that would mean that we must use a normalized kernel, e.g. , as used in gpapan . One counter-intuitive resulting property is that the input-output mapping function defined by Eq. 1 can result in a decrease, rather accumulation of evidence. Consider in particular a perfectly-localized and perfectly-confident local evidence signal expressed in the form of a delta function centered at :


The result of voting according to Eq. 1 would then be a blurred support map whose maximal support is only the peak value of the normalized kernel:


For a wide kernel this can result in an arbitrarily low maximal support, which is counter-intuitive given the originally strong evidence at the location of the delta. At the root of this problem lies the operation of summing probabilities, which is common when marginalizing over hidden variables, but does not make sense as a method of accumulating evidence Williams.
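The dilution effect described in this appendix can be checked numerically. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses a 1D grid, integer displacements that are all zero (the delta votes for its own location), and a discrete Gaussian kernel normalized to sum to one. The names `normalized_gaussian`, `vote`, `sigma` and `radius` are illustrative and do not come from the paper.

```python
import numpy as np

def normalized_gaussian(radius, sigma):
    """Discrete Gaussian on offsets -radius..radius, normalized to sum to 1."""
    offsets = np.arange(-radius, radius + 1)
    weights = np.exp(-0.5 * (offsets / sigma) ** 2)
    return weights / weights.sum()

def vote(evidence, displacement, kernel):
    """Sum-based voting: location src spreads evidence[src] * kernel
    around src + displacement[src]; votes leaving the grid are dropped."""
    n = evidence.shape[0]
    radius = (kernel.shape[0] - 1) // 2
    support = np.zeros(n)
    for src in range(n):
        center = src + int(displacement[src])
        for k, w in enumerate(kernel):
            dst = center + k - radius
            if 0 <= dst < n:
                support[dst] += evidence[src] * w
    return support

n, x0 = 201, 100
evidence = np.zeros(n)
evidence[x0] = 1.0           # perfectly confident, perfectly localized
displacement = np.zeros(n)   # assumption: every location votes for itself

peaks = {}
for sigma in (1.0, 4.0, 16.0):
    kernel = normalized_gaussian(radius=64, sigma=sigma)
    support = vote(evidence, displacement, kernel)
    peaks[sigma] = support.max()
    print(f"sigma={sigma:5.1f}  peak={support.max():.4f}  mass={support.sum():.4f}")
```

In this sketch the total mass stays at one for every kernel width, while the peak support equals the maximum of the kernel and therefore shrinks as `sigma` grows, even though the input evidence was a unit-confidence delta.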

Appendix B

Finally, we perform an ablation study in the multi-task setting to analyze the effect of the introduced cross-part MDN module on the performance of the other branches of Mask R-CNN, namely the bounding box regressor (Table 6) and the predictor of binary masks (Table 7), for the person class from COCO minival. In both cases, we observed consistent improvements in performance across the whole set of evaluation metrics. However, in the presence of the MDN module, activating the mask branch does not further improve the quality of pose estimation, as it does in the baseline case.

Mask R-CNN, bb                     51.5  82.5  55.0  59.4  68.5  18.2  52.2  59.8  66.9  76.3
Mask R-CNN, bb+mask                52.2  83.1  55.9  59.8  69.7  18.4  52.8  60.4  66.9  77.2
Mask R-CNN, bb+keypoints           51.6  81.4  55.3  60.1  69.7  18.3  52.4  60.0  67.4  77.0
Mask R-CNN-MDN, bb+keypoints       52.0  81.8  55.9  60.7  70.0  18.5  52.9  60.3  67.7  77.3
Mask R-CNN, bb+mask+keypoints      51.7  81.6  55.6  60.1  69.8  18.4  52.6  60.3  67.6  77.2
Mask R-CNN-MDN, bb+mask+keypoints  52.2  81.6  56.4  60.6  71.1  18.7  53.3  61.0  68.1  78.3
Table 6: Object detection performance (bounding box AP/AR) on COCO minival, person class.

Mask R-CNN, bb+mask                44.8  79.4  45.9  50.5  64.4  16.7  47.0  53.3  59.4  70.7
Mask R-CNN, bb+mask+keypoints      45.0  78.5  47.3  51.4  65.2  16.8  47.3  53.8  60.6  71.4
Mask R-CNN-MDN, bb+mask+keypoints  45.6  78.3  48.1  52.0  66.1  17.0  48.1  54.5  61.3  72.3
Table 7: Instance segmentation performance (mask AP/AR) on COCO minival, person class.
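The per-metric effect of the MDN module can be read off Tables 6 and 7 by subtracting each baseline row from its MDN counterpart. The sketch below does only that arithmetic on the numbers as printed; the helper name `deltas` and the dictionary layout are illustrative, and the columns are kept in the order in which they appear in the tables.

```python
# Rows copied from Tables 6 and 7 (COCO minival, person class), columns in printed order.
table6 = {
    "Mask R-CNN, bb+keypoints":          [51.6, 81.4, 55.3, 60.1, 69.7, 18.3, 52.4, 60.0, 67.4, 77.0],
    "Mask R-CNN-MDN, bb+keypoints":      [52.0, 81.8, 55.9, 60.7, 70.0, 18.5, 52.9, 60.3, 67.7, 77.3],
    "Mask R-CNN, bb+mask+keypoints":     [51.7, 81.6, 55.6, 60.1, 69.8, 18.4, 52.6, 60.3, 67.6, 77.2],
    "Mask R-CNN-MDN, bb+mask+keypoints": [52.2, 81.6, 56.4, 60.6, 71.1, 18.7, 53.3, 61.0, 68.1, 78.3],
}
table7 = {
    "Mask R-CNN, bb+mask+keypoints":     [45.0, 78.5, 47.3, 51.4, 65.2, 16.8, 47.3, 53.8, 60.6, 71.4],
    "Mask R-CNN-MDN, bb+mask+keypoints": [45.6, 78.3, 48.1, 52.0, 66.1, 17.0, 48.1, 54.5, 61.3, 72.3],
}

def deltas(baseline, mdn):
    """Column-wise difference (MDN minus baseline), rounded to one decimal."""
    return [round(m - b, 1) for b, m in zip(baseline, mdn)]

box_kp = deltas(table6["Mask R-CNN, bb+keypoints"],
                table6["Mask R-CNN-MDN, bb+keypoints"])
box_mask_kp = deltas(table6["Mask R-CNN, bb+mask+keypoints"],
                     table6["Mask R-CNN-MDN, bb+mask+keypoints"])
mask_mask_kp = deltas(table7["Mask R-CNN, bb+mask+keypoints"],
                      table7["Mask R-CNN-MDN, bb+mask+keypoints"])

for name, row in [("Table 6, bb+keypoints", box_kp),
                  ("Table 6, bb+mask+keypoints", box_mask_kp),
                  ("Table 7, bb+mask+keypoints", mask_mask_kp)]:
    print(f"{name:28s}", row)
```

On these numbers every column improves except the second, which is unchanged for the bb+mask+keypoints detector in Table 6 and 0.2 lower for the same configuration in Table 7.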


References

  • (1) M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. CVPR, 2014.
  • (2) B. Babenko. Multiple instance learning: algorithms and applications. Technical report, UCSD, 2008.
  • (3) D. H. Ballard. Generalizing the hough transform to detect arbitrary shapes. Pattern Recognition, 13(2):111–122, 1981.
  • (4) O. Barinova, V. S. Lempitsky, and P. Kohli. On detection of multiple object instances using hough transforms. CVPR, 2010.
  • (5) V. Belagiannis and A. Zisserman. Recurrent human pose estimation. FG, 2017.
  • (6) M. B. Blaschko. Branch and bound strategies for non-maximal suppression in object detection. EMMCVPR, 2011.
  • (7) A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. ECCV, 2016.
  • (8) Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. CVPR, 2017.
  • (9) J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. CVPR, 2016.
  • (10) S. Chandra and I. Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep gaussian crfs. ECCV, 2016.
  • (11) L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. ICLR, 2015.
  • (12) X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang. Multi-context attention for human pose estimation. CVPR, 2017.
  • (13) J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. CoRR, abs/1703.06211, 2017.
  • (14) J. Gall, A. Yao, N. Razavi, L. J. V. Gool, and V. S. Lempitsky. Hough forests for object detection, tracking, and action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(11):2188–2202, 2011.
  • (15) S. Gidaris and N. Komodakis. Object detection via a multi-region & semantic segmentation-aware CNN model. ICCV, 2015.
  • (16) G. Gkioxari, A. Toshev, and N. Jaitly. Chained predictions using convolutional neural networks. ECCV, 2016.
  • (17) K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask r-cnn. CVPR, 2017.
  • (18) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, 2016.
  • (19) P. Hu and D. Ramanan. Bottom-up and top-down reasoning with hierarchical rectified gaussians. CVPR, 2016.
  • (20) E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. ECCV, 2016.
  • (21) M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. NIPS, 2015.
  • (22) P. Krähenbühl and V. Koltun. Parameter learning and convergent inference for dense random fields. ICML, 2013.
  • (23) B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization and segmentation. International Journal of Computer Vision, 77(1-3):259–289, 2008.
  • (24) I. Lifshitz, E. Fetaya, and S. Ullman. Human pose estimation using deep consensus voting. ECCV, 2016.
  • (25) G. Lin, C. Shen, A. van den Hengel, and I. D. Reid. Efficient piecewise training of deep structured models for semantic segmentation. CVPR, 2016.
  • (26) T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. ECCV, 2014.
  • (27) S. Maji and J. Malik. Object detection using a max-margin hough transform. CVPR, 2009.
  • (28) A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. ECCV, 2016.
  • (29) G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. P. Murphy. Towards accurate multi-person pose estimation in the wild. CVPR, 2017.
  • (30) J. Pearl. Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann, 1988.
  • (31) L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. CVPR, 2016.
  • (32) U. Rafi, J. Gall, and B. Leibe. An efficient convolutional network for human pose estimation. BMVC, 2016.
  • (33) N. Razavi, J. Gall, P. Kohli, and L. J. V. Gool. Latent hough transform for object detection. ECCV, 2012.
  • (34) A. Thomas, V. Ferrari, B. Leibe, T. Tuytelaars, B. Schiele, and L. J. V. Gool. Towards multi-view object class detection. CVPR, 2006.
  • (35) J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. CVPR, 2015.
  • (36) P. A. Viola, J. C. Platt, and C. Zhang. Multiple instance boosting for object detection. NIPS, 2005.
  • (37) S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. CVPR, 2016.
  • (38) C. K. I. Williams and M. Allan. On a connection between object localization with a generative template of features and pose-space prediction methods. Technical report, Edinburgh University, 2006.
  • (39) Z. Wu, C. Shen, and A. van den Hengel. Bridging category-level and instance-level semantic image segmentation. CoRR, abs/1605.06885, 2016.
  • (40) Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2878–2890, 2013.
  • (41) S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. ICCV, 2015.