1 Introduction
The advent of deep learning has reduced the amount of hand-engineered processing required for computer vision by integrating many operations such as pooling, normalization, and resampling within Convolutional Neural Networks (CNNs). The succession of such operations gradually discards the effects of irrelevant signal transformations, allowing the higher layers of CNNs to exhibit increased robustness to small input perturbations. While this invariance is desirable for high-level vision tasks, it can harm tasks such as pose estimation, where one aims at precise spatial localization rather than abstraction.
It is therefore common to apply some form of computer-vision-based post-processing on top of CNN-based scores to obtain sharp, localized geometric features. One of the first steps in this direction has been the use of structured prediction on top of semantic segmentation, e.g. by combining image-based DenseCRF Koltun13 inference with CNNs for semantic segmentation ChenPKMY14 , training both systems jointly crfrnn , or more recently learning CNN-based pairwise terms in structured prediction modules Chandra16 ; Hengel16 . All of these works involve coupling decisions so as to reach some consistency in the labeling of global structures, typically coming in the form of smoothness constraints. While this is meaningful for tasks where information is spread out, such as semantic segmentation, we are interested in more general transformations, some of which are illustrated in Fig. 1. For instance, we consider tasks that require 1D or 0D outputs (boundary and keypoint detection, respectively), effectively collapsing the spatially extended output of a CNN into lower-dimensional structures. Even though in principle this could be cast in structured prediction terms, the resulting optimization problem amounts to maximizing a submodular function Blaschko11 and can only be approximately optimized. We therefore turn to geometry-based, rather than optimization-based, methods and pursue their incorporation in the context of deep learning.
Our starting point is the understanding that requiring high spatial accuracy from a purely CNN-based deep architecture misuses the network's abilities: by design, CNN feature maps become increasingly smooth as we go deeper. We can instead combine these smooth CNN-based classification results with an equally smooth displacement field obtained from another CNN branch, indicating to every pixel where its mass should be displaced. This is achieved by separately predicting the x and y components of the displacement vectors (or offsets) of all pixels. Even though the displacement field may be smooth, if its value is accurate then the result can become sharp: in Fig. 1 we display indicative examples of a smooth response being manipulated by smooth displacement fields that turn it into quite different shapes, appropriate for a variety of visual tasks.

What we propose can be understood as reinventing geometric voting in the context of deep learning: in a host of computer vision tasks Ballard81 ; LeibeLS08 ; MajiM09 ; GallYRGL11 ; RazaviGKG12 ; BarinovaLK12 voting can be used to first associate an observation with the positions that it supports and then shortlist structures that are supported by multiple observations, e.g. many points voting for a line or a circle Ballard81 , object parts voting for an object's 2D LeibeLS08 or 3D pose ism3d , or many object hypotheses voting for a single object bounding box GidarisK15 . Our work was motivated by the recent success of such schemes for landmark localization gpapan , instance segmentation WuSH16b , and bounding box post-processing GidarisK15 .
All of these approaches, however, are plagued by the heuristic nature of geometric voting, which makes them applicable only as post-processing steps. For example, in gpapan posterior probabilities are displaced and then accumulated, which results in score maps that can be larger than one, disqualifying them from training with losses appropriate for classification. The authors end up using the cross-entropy loss for the original CNN and the L2 loss for the second stage, while also not training the displacement fields end-to-end; as such it is unclear whether the displacement fields really point to the positions that they should. Instead, in this work we develop this somewhat ad-hoc post-processing into a module that can easily be combined with existing architectures and trained end-to-end.

In particular, we treat geometric voting as a differentiable operation, allowing us to train the CNN-based score maps and displacement fields in an end-to-end manner, ensuring that both arguments of the voting function optimize the final system's performance. Each displaced point is dilated by a kernel to support a region around its new position, and in the output space every position accumulates evidence from the input points that can support it. The currently common approach of adding posterior probabilities is also not justified probabilistically Williams . Instead we consider a probabilistically sound method of accumulating evidence that forces the final outcomes to stay smaller than one; as such, the output of our operation lends itself to training with probabilistic criteria, such as the cross-entropy loss.
Since our approach combines spatial transformation with the geometric manipulation of a probability mass function, we call a network incorporating our method a Mass Displacement Network (MDN). We explore two tasks: (i) human body landmark localization through within-part voting, where the coarse score map of a part is sharpened by a voting process, and (ii) human pose estimation through across-part voting, where every body part score map votes for the presence of other parts. We provide systematic demonstrations of the improvements achieved by MDNs over strong baselines on large-scale benchmarks in human pose estimation, in both single-person (MPII Human Pose dataset) and multi-person (COCO dataset) setups.
Connections to other works:
Apart from the works mentioned already, our approach has connections to Spatial Transformer Networks (STNs), introduced in JaderbergSZK15 to bring raw images into correspondence and remove intra-class variation that can be modelled in terms of image deformations. The tacit assumption underlying STNs is that the input and output fields are related by a diffeomorphic transformation, such as a similarity transformation or an affine map, meaning that the dimensionality of structures is preserved. Instead, here we consider transformations that allow us to collapse 2D structures into lower-dimensional structures, such as points or lines. Furthermore, STNs typically consider a single global parametric transformation, while we have a non-parametric transformation determined by a fully convolutional layer. Finally, as we explain in Sec. 2, STNs are designed like image interpolation operations and are typically used at the input of a network, while we cater for evidence accumulation, and our module is intended to be appended at the end of a network, or generally after some decisions have been produced by a CNN.
In a work parallel to our own daiactive , the authors have introduced active CNNs, which allow a neuron to pick incoming neurons from input positions determined dynamically through a CNN-based deformation. While this work shares with ours the idea of using a convolutional, CNN-based deformation field, in our case input neurons decide where they move to, rather than output neurons deciding from where to pool their information. As such, our approach seems better suited to collapsing densities and accumulating evidence at certain positions, while the work of daiactive seems better suited to discarding the effect of deformations.

2 Mass Displacement Networks
We start by describing in Sec. 2.1 the non-probabilistic, geometric voting process currently employed in recent works gpapan ; Hengel16 and then propose a principled variant that relies on the noisy-OR rule Pear88 , allowing us to use the cross-entropy loss during training. We then turn in Sec. 2.2 to the equations used for end-to-end training of the resulting Mass Displacement Network.
2.1 Additive and Noisy-OR Voting
We consider that both our local evidence functions and the output structures reside in a two-dimensional space. In particular, we consider that for any position $x$ a convolutional network provides us with two outputs: firstly an estimate $p(x)$ of the local confidence for the presence of a feature, and secondly an estimate of the predicted structure's position. The latter is expressed as a horizontal/vertical displacement (or offset) $d(x) = (d^h(x), d^v(x))$ that should be applied to the current position to obtain the refined estimate $x' = x + d(x)$.
For landmark localization in within-part voting the displacement field can act like a residual correction signal, while for across-part voting it reflects relative part locations. We can accommodate spatial uncertainty in the predicted position by supporting a structure not only at $x'$, but also in the vicinity of the same point. This can be accomplished by dilating the local confidence with a kernel, e.g. a Gaussian $K(u) = \exp(-\|u\|^2 / 2\sigma^2)$, that allows us to smoothly decrease our support as we move further away from $x'$. Combining evidence from multiple points is typically done through summation:
$$o(y) \;=\; \sum_{x} p(x)\, K\big(y - (x + d(x))\big) \qquad (1)$$
where for every output position $y$ we sum the support delivered by all input positions $x$. This has been the setting used for instance in gpapan and Hengel16 for landmark localization and instance segmentation, respectively. In these works a CNN is trained with a cross-entropy loss for $p$ and a regression loss for $d$, while Eq. 1 is used at test time to deliver more accurate estimates of the desired structures.
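As an illustration of the additive rule of Eq. 1, the following pure-Python sketch (our own toy example: the grid, field values and function names are not from the paper) displaces each cell's confidence along its offset vector and dilates it with a Gaussian kernel:

```python
import math

def additive_voting(p, dh, dv, sigma=1.0):
    """Additive voting (Eq. 1): every input cell (x, y) displaces its
    confidence p[y][x] to (x + dh, y + dv) and dilates it with a Gaussian."""
    H, W = len(p), len(p[0])
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            if p[y][x] == 0.0:
                continue  # cells with no confidence cast no vote
            ty, tx = y + dv[y][x], x + dh[y][x]  # displaced target position
            for oy in range(H):
                for ox in range(W):
                    k = math.exp(-((oy - ty) ** 2 + (ox - tx) ** 2)
                                 / (2 * sigma ** 2))
                    out[oy][ox] += p[y][x] * k
    return out

# Two confident cells whose displacement vectors point at the same cell (2, 2):
p  = [[0.0] * 5 for _ in range(5)]
dh = [[0.0] * 5 for _ in range(5)]
dv = [[0.0] * 5 for _ in range(5)]
p[0][0], dh[0][0], dv[0][0] = 0.9, 2.0, 2.0    # (0, 0) votes for (2, 2)
p[4][4], dh[4][4], dv[4][4] = 0.9, -2.0, -2.0  # (4, 4) votes for (2, 2)
out = additive_voting(p, dh, dv, sigma=0.5)
# out[2][2] == 0.9 + 0.9 = 1.8 > 1: sums of posteriors are not probabilities.
```

The example also exposes the problem with summation: two agreeing confident votes push the output above one, so it can no longer be interpreted as a probability.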
The operation in Eq. 1 can be justified in the context of image interpolation, as in the case of Spatial Transformer Networks JaderbergSZK15 , or in standard Kernel Density Estimation (KDE), but not as a method of accumulating evidence Williams . The main problem, detailed in Appendix A, is that we cannot simultaneously guarantee that the input and output fields both lie in $[0, 1]$, so that they can be trained with the cross-entropy loss, and that a confident posterior at $x$ will confidently support its displaced replica at $x' = x + d(x)$.

We can guarantee both requirements by replacing summation with maximization (i.e. performing a "Transformed-Max-Pooling" operation). Our experiments with this approach were underwhelming, understandably because we do not really accumulate evidence from many points, but rather rely on the single most confident one. Instead we propose to use differentiable approximations of the maximum operation Pear88 ; ViolaPZ05 ; babenko that allow us to softly combine multiple pieces of evidence while ensuring that the outputs are probabilistically valid.

In particular we use the noisy-OR combination rule Pear88 , which provides a probabilistic counterpart to a logical ORing operation. We consider that we have $N$ pieces of evidence about the presence of a feature, each being true with a probability $p_i$. The noisy-OR operation expresses the probability of the presence of the feature as $P = 1 - \prod_{i=1}^{N}(1 - p_i)$, namely the feature is absent if all supporting pieces of evidence are simultaneously absent; as such, any additional piece of evidence can only increase the estimated value of $P$. If we now replace $p_i$ in the above formula with the displaced, dilated support $p(x)\, K\big(y - (x + d(x))\big)$ of Eq. 1, we obtain the following rule for combining evidence in the MDN:
$$o(y) \;=\; 1 - \prod_{x}\Big(1 - p(x)\, K\big(y - (x + d(x))\big)\Big) \qquad (2)$$
We note that we can use a first-order approximation to obtain Eq. 1 from Eq. 2 if all of the individual terms are very small, which in hindsight gives some explanation for the practical success of Eq. 1. However, in Eq. 2 we have $o(y) \in [0, 1]$, which allows us to use the cross-entropy loss throughout training, by virtue of the output being probabilistically meaningful. Our experiments show that this yields results as good as the currently broadly used heuristic of regressing to Gaussian functions TompsonGJLB15 ; NewellYD16 ; BulatT16 , while being simpler and cleaner.
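A minimal sketch of the noisy-OR rule (our own illustration; variable names are ours), contrasting it with additive accumulation and checking the first-order relation to Eq. 1 noted above:

```python
def noisy_or(votes):
    """Noisy-OR combination (Eq. 2): the feature is absent only if every
    supporting piece of evidence is simultaneously absent."""
    prod = 1.0
    for v in votes:
        prod *= 1.0 - v
    return 1.0 - prod

# Two confident votes: additive voting gives 1.8, noisy-OR stays a probability.
confident = noisy_or([0.9, 0.9])        # 1 - 0.1 * 0.1 = 0.99

# For small individual terms the product expands to ~1 - sum, so Eq. 2
# reduces to Eq. 1 to first order:
small = [1e-4, 2e-4, 3e-4]
approx_error = abs(noisy_or(small) - sum(small)) / sum(small)
```

However many votes are combined, the output stays below one, which is what makes cross-entropy training possible.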
2.2 Backpropagation through an MDN module
The input-output mapping defined by Eq. 1 is differentiable with respect to both input functions, $p$ and $d$, and as such lends itself to end-to-end training with backpropagation. Given a gradient signal $\partial L / \partial o(y)$ that dictates how the output layer activations should change to decrease the network loss $L$, we obtain the update equations for $p(x)$ and $d(x)$ through the following chain rule:

$$\frac{\partial L}{\partial p(x)} = \sum_{y} \frac{\partial L}{\partial o(y)}\, \frac{\partial o(y)}{\partial p(x)}, \qquad \frac{\partial L}{\partial d^{h}(x)} = \sum_{y} \frac{\partial L}{\partial o(y)}\, \frac{\partial o(y)}{\partial d^{h}(x)} \qquad (3)$$
where the summation runs over the top-layer neurons $y$ that send gradients back to neuron $x$. Turning to the computation of the partial derivatives in Eq. 3, the use of displacement fields means that we no longer have a standard convolutional layer; an input position $x$ can potentially influence any output position $y$, as dictated by Eq. 2. For convenience we rewrite Eq. 2 as follows:
$$o(y) = 1 - \prod_{x}\big(1 - m(y, x)\big), \qquad m(y, x) = p(x)\, K\big(y - (x + d(x))\big) \qquad (4)$$

where $m(y, x)$ indicates the amount by which $x$ influences $y$. Using the same steps as in ViolaPZ05 , in the case of a Gaussian kernel we have:

$$\frac{\partial o(y)}{\partial p(x)} = \frac{1 - o(y)}{1 - m(y, x)}\, \frac{m(y, x)}{p(x)}, \qquad \frac{\partial o(y)}{\partial d^{h}(x)} = \frac{1 - o(y)}{1 - m(y, x)}\, m(y, x)\, \frac{y^{h} - x^{h} - d^{h}(x)}{\sigma^{2}}$$

where $y^{h}, x^{h}$ are the horizontal components of $y, x$ respectively; the derivative with respect to $d^{v}(x)$ is analogous.
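These gradient expressions can be checked numerically. The sketch below (our own 1-D toy setup; names such as `forward` and the parameter values are not from the paper) implements the noisy-OR output of Eq. 4 for a handful of 1-D positions and compares the analytic derivative with respect to a horizontal offset against a finite-difference estimate:

```python
import math

def forward(p, dh, xs, y, sigma):
    """Noisy-OR output o(y) (Eq. 4) for 1-D input positions xs with
    confidences p and horizontal displacements dh, Gaussian kernel."""
    prod = 1.0
    for pi, di, xi in zip(p, dh, xs):
        m = pi * math.exp(-((y - xi - di) ** 2) / (2 * sigma ** 2))
        prod *= 1.0 - m
    return 1.0 - prod

def grad_dh(p, dh, xs, y, sigma, i):
    """Analytic do(y)/d dh[i] = (1-o)/(1-m_i) * m_i * (y - x_i - d_i)/sigma^2."""
    o = forward(p, dh, xs, y, sigma)
    m = p[i] * math.exp(-((y - xs[i] - dh[i]) ** 2) / (2 * sigma ** 2))
    return (1.0 - o) / (1.0 - m) * m * (y - xs[i] - dh[i]) / sigma ** 2

p, dh, xs, y, sigma = [0.7, 0.4], [0.5, -0.3], [0.0, 2.0], 1.0, 0.8
eps = 1e-6
numeric = (forward(p, [dh[0] + eps, dh[1]], xs, y, sigma)
           - forward(p, dh, xs, y, sigma)) / eps
analytic = grad_dh(p, dh, xs, y, sigma, 0)
# numeric and analytic agree to within finite-difference error.
```

The agreement of the two estimates is what allows the displacement branch to receive meaningful gradients through the MD layer.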
3 Experimental Evaluation
We present experiments in two setups: firstly, single-person pose estimation on the MPII Human Pose dataset AndrilukaPGB14 , where the position and scale of a human are considered known in advance. This disentangles the performance of the pose estimation and object detection systems. Secondly, we consider human pose estimation "in the wild" on the COCO dataset mscoco , where one needs to jointly tackle detection and pose estimation. We use different baselines for the two setups, since there is no common strong baseline for both. In both cases MDNs systematically improve strong baselines.
Model  |  No voting  |  Bilinear kernel  |  Gaussian kernel
Baseline, additive  |  84.31  |  87.54  87.70  88.01  |  88.11  88.15  88.19  88.19
Baseline, noisy-OR  |  –  |  87.49  87.63  87.84  |  87.98  88.08  88.19  88.11
Baseline, max  |  –  |  86.69  86.38  86.22  |  86.03  85.34  85.12
Spatial Transformer JaderbergSZK15  |  88.28  |  –  |  –
MDN-additive  |  –  |  88.60  |  88.63  88.61
MDN-noisyOR  |  –  |  88.61  |  88.58  88.32
Model  |  ResNet-50  |  ResNet-101  |  ResNet-152  |  Hourglass-8
Baseline, no voting  |  83.29  |  84.28  |  84.31  |  89.24
Baseline, additive, bilinear kernel  |  86.50  |  87.50  |  87.54  |  89.43
Baseline, additive, Gaussian kernel  |  87.17  |  88.10  |  88.19  |  89.70
Baseline, noisy-OR, bilinear kernel  |  86.42  |  87.46  |  87.49  |  89.49
Baseline, noisy-OR, Gaussian kernel  |  87.12  |  88.08  |  88.11  |  89.67
MDN-additive, bilinear kernel  |  87.23  |  88.42  |  88.60  |  89.72
MDN-noisyOR, bilinear kernel  |  87.25  |  88.52  |  88.61  |  89.64
3.1 Single person pose estimation
3.1.1 Experimental setup
Dataset & Evaluation: We evaluate several variants of MDNs on the MPII Human Pose dataset AndrilukaPGB14 , which consists of 25K images containing over 40K people with annotated body joints. We follow the single-person evaluation protocol, i.e. we use a subset of the data with isolated people, assuming their positions and corresponding scales to be known at test time. We follow the standard evaluation procedure of AndrilukaPGB14 and report performance with the common Percentage of Correct Keypoints w.r.t. head size (PCKh) metric YangR13 . As in BulatT16 ; NewellYD16 , we refine the test joint positions by averaging network predictions obtained from the original and horizontally flipped images.
Implementation:
We conduct the first exhaustive set of experiments by fine-tuning ImageNet-pretrained ResNet architectures HeZRS15 . We substitute the output linear layer and the average pooling that precedes it with a bottleneck convolutional layer that projects its 2048-dimensional input down to 512 dimensions. This acts as a buffer layer between the pretrained network and the pose-specific output layers. As in gpapan , we reduce the amount of spatial downsampling in such networks by reducing the stride of the first residual module of the conv5 block from 2 to 1, and employ atrous convolutions afterwards ChenPKMY14 . As a result, the network takes a cropped image as input and produces a set of feature planes at a higher spatial resolution than the original architecture; these are then bilinearly upsampled to the output resolution.

On top of this common network trunk operate three convolutional branches that deliver the three inputs of the MDN, namely the confidence field $p$ and the displacement fields $d^h$ and $d^v$. Each such branch is a single convolutional layer which maps the feature planes to as many planes as there are landmarks to be localized. The outputs of these branches are passed to the MD layer, which in turn outputs the final refined localizations at the same resolution.
We also present preliminary experimental results with hourglass networks NewellYD16 , which achieve even higher performance on MPII. We apply a repurposing similar to the one outlined above by introducing additional convolutional heads for predicting displacement fields after each stack of the network (the final offset estimates are obtained by summing the predictions of all stacks).
Training:
In these experiments, we test the performance of both additive and noisy-OR MDNs. We train the network with three kinds of supervision signals applied to the following outputs:
(a) the confidence maps $p$, trained with a pixel-wise binary cross-entropy loss. The supervision signal for output plane $j$ takes the form of a binary disk centered at the keypoint location: $t_j(x) = 1$ if $\|x - x_j\| \le R$ and $t_j(x) = 0$ otherwise, where $x_j$ is the ground truth position of joint $j$ and $R$ is the disk radius.
(b) the two offset planes, learned with a robust Huber loss applied solely in the vicinity of the ground truth position of every keypoint. The ground truth value at each point $x$ in the vicinity of joint $i$ voting for joint $j$ is defined as follows:
$$d^{h}_{i \to j}(x) = \frac{x_{j}^{h} - x^{h}}{N} \qquad (5)$$
where, as before, $x_j$ is the ground truth position of joint $j$ and $N$ is a normalization factor (defined below). The vertical component $d^{v}_{i \to j}(x)$ is defined analogously.
(c) the final refined localizations $o$. In this case, depending on the aggregation rule, we apply either an MSE regression loss (for additive mass displacement) or a binary cross-entropy loss (for noisy-OR aggregation). The final supervision signal takes the form of a Gaussian (additive MDN) or a binary disk (noisy-OR MDN), constructed in the same way as in (a) but with a smaller spatial extent.
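The targets for (a) and (b) can be generated with a few lines of code. Below is a hypothetical target generator (the function name, grid size and parameter values are ours, not from the paper) producing the binary-disk confidence target and the Eq. 5-style normalized offsets for a single joint:

```python
def make_targets(joint, H, W, R, N):
    """Binary-disk confidence target of radius R around the ground-truth
    joint, plus normalized offset targets (Eq. 5) inside the disk."""
    jx, jy = joint
    conf = [[0.0] * W for _ in range(H)]
    offh = [[0.0] * W for _ in range(H)]
    offv = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            if (x - jx) ** 2 + (y - jy) ** 2 <= R ** 2:
                conf[y][x] = 1.0            # supervision (a): binary disk
                offh[y][x] = (jx - x) / N   # supervision (b): points at joint
                offv[y][x] = (jy - y) / N
    return conf, offh, offv

# Joint at (3, 3) on a 7x7 plane, disk radius 2, normalization factor 2:
conf, offh, offv = make_targets((3, 3), 7, 7, 2, 2.0)
```

Inside the disk every offset vector points back at the ground-truth joint, which is exactly the residual correction signal the offset branches are trained to reproduce.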
We note that supervising the network with the single loss (c) alone is possible and produces similar final results, but at the cost of significantly slower convergence.
All networks are trained on the training set of the MPII single-person dataset with artificial data augmentation in the form of flipping, scaling and rotation, as described in NewellYD16 . We employ the RMSProp update rule with an initial learning rate of 0.0025 and a learning rate decay of 0.99, and as in NewellYD16 use a validation set of 3000 held-out images for our ablation study.

We perform evaluation on the two separate tasks of within-part and across-part voting:
(a) local mass displacement (within-part voting): in this setting, the offset branches receive their supervision signal in the form of local distributions of horizontal and vertical offsets defined as in Eq. 5, with each joint voting for itself ($i = j$) and the normalization factor set to the radius of the supervision disk ($N = R$);

(b) global mass displacement (across-part voting): the implementation of the cross-voting mechanism is similar to the previous case, but with joints voting for other joints ($i \ne j$) and $N = S$, where $S$ is the output resolution. In this case, we found it more effective to restrict connectivity between joints and perform cross-joint voting along the kinematic tree, starting from the center of the body.
Model  |  No voting  |  Gaussian kernel
Baseline, additive  |  83.96  |  87.72  87.64  87.73
MDN-additive  |  –  |  88.05  88.08  87.83
Model  |  Head  |  Shoulder  |  Elbow  |  Wrist  |  Hip  |  Knee  |  Ankle  |  Mean  |  Mean-val
Chu et al. ChuYOMYW17  |  98.5  |  96.3  |  91.9  |  88.1  |  90.6  |  88.0  |  85.0  |  91.5  |  89.4
Newell et al. NewellYD16  |  98.2  |  96.3  |  91.2  |  87.1  |  90.1  |  87.4  |  83.6  |  89.4  |  –
Bulat et al. BulatT16  |  97.9  |  95.1  |  89.9  |  85.3  |  89.4  |  85.7  |  81.7  |  88.2  |  –
Wei et al. WeiRKS16  |  97.8  |  95.0  |  88.7  |  84.0  |  88.4  |  82.8  |  79.4  |  88.5  |  –
Insafutdinov et al. InsafutdinovPAA16  |  96.8  |  95.2  |  89.3  |  84.4  |  88.4  |  83.4  |  78.0  |  88.5  |  –
Belagiannis et al. BelagiannisZ17  |  97.7  |  95.0  |  88.2  |  83.0  |  87.9  |  82.6  |  78.4  |  88.1  |  86.3
Rafi et al. RafiGL16  |  97.2  |  93.9  |  86.4  |  81.3  |  86.8  |  80.6  |  73.4  |  86.3  |  –
Gkioxari et al. GkioxariTJ2016  |  96.2  |  93.1  |  86.7  |  82.1  |  85.2  |  81.4  |  74.1  |  86.1  |  85.3
Lifshitz et al. LifshitzFU16  |  97.8  |  93.3  |  85.7  |  80.4  |  85.3  |  76.6  |  70.2  |  85.0  |  –
Pishchulin et al. deepcut  |  94.1  |  90.2  |  83.4  |  77.3  |  82.6  |  75.7  |  68.6  |  82.4  |  –
Hu & Ramanan HuR16  |  95.0  |  91.6  |  83.0  |  76.6  |  81.9  |  74.5  |  69.5  |  82.4  |  –
Carreira et al. carreiraA2016  |  95.7  |  91.7  |  81.7  |  72.4  |  82.8  |  73.2  |  66.4  |  81.3  |  –
Hourglass8-MDN  |  98.2  |  96.4  |  91.6  |  87.4  |  90.8  |  87.9  |  84.3  |  91.3  |  89.7
ResNet-152-MDN  |  97.7  |  95.8  |  90.4  |  85.1  |  88.9  |  85.6  |  81.6  |  89.7  |  88.6
3.1.2 Evaluation results
In Table 1 we compare Mass Displacement Networks for within-part voting against a set of increasingly complicated baselines: (a) a network trained with the binary cross-entropy loss and a single objective in the form of a joint heatmap; (b) a network outputting the first round of posterior probabilities and displacements independently, with subsequent aggregation of the corresponding votes as a post-processing step, i.e. without end-to-end training; (c) a modified Spatial Transformer Network (STN) JaderbergSZK15 aiming to shrink the produced distributions from iteration to iteration and, just like our architecture, trained end-to-end. In this last case, the spatial transformation is not defined globally but is instead learned in the form of a vector field describing a pixel-wise linear translation.
In the top rows of Table 1 we evaluate the baseline performance for different filter sizes and gauge the impact of this choice. This is easy to do since mass aggregation is performed as post-processing and does not require retraining the model. We then train MDN models with the selected filter sizes.
We observe that MDNs yield a substantial boost over the different simpler baselines, even when end-to-end training is used, as in the case of STNs. The latter aspect can be attributed to the evidence accumulation operation of MDNs, which is better suited to the task than interpolation (STNs).
The support of the kernel determines the computational complexity of the MD module; we note that by training MDNs end-to-end we achieve excellent results even with bilinear kernels, rather than extended Gaussian kernels. Intuitively, we train our voting network to throw more accurate shots towards the center of the geometric structures, making the use of large kernels unnecessary.
In Table 2 we repeat the same evaluation for a varying set of network architectures; the results indicate that there is a consistent improvement thanks to the MDN module, and that in all tasks noisy-OR and additive voting yield virtually identical results. This confirms that we can discard the ad-hoc choice of training the second stage with regression and replace it with the more meaningful cross-entropy loss.
We next evaluate MDNs on the task of passing information across different joints (cross-voting). The corresponding results are shown in Table 3. All models have now been trained to produce three kinds of outputs: posterior probabilities, local offsets and across-part voting offsets. This explains the drop in the baseline's performance, which was forced into a harder multi-task learning setting (see Table 1 for the single-task network performance). However, employing an MD layer in the global setting leads to a substantial improvement in localization performance.
A comparison with state-of-the-art methods is provided in Table 4. It shows that the MDN version of ResNet-152 outperforms all methods not based on the hourglass architecture, while Hourglass-MDN gives a 0.4-point boost over the corresponding baseline and is competitive with the most complex methods ChuYOMYW17 ; BulatT16 .
Finally, our experiments have shown that stacking several mass displacement modules in different ways (within+across, across+within, as well as several modules of the same kind) does not further improve performance. This can be explained by the fact that within-part voting is included in cross-joint aggregation (each joint also votes for itself) and, at the same time, dropping across-joint connections in simple cases allows the model to focus its capacity on local aggregation more efficiently. As a result, local voting performs better in the single-person setting, while the cross-joint scheme turned out to be most effective in the multi-person scenario.
3.2 Multi-person pose estimation
We have obtained improvements similar to those reported above on the challenging task of multi-person pose estimation in the wild, which includes both object detection and pose estimation. We build on the recently-introduced Mask R-CNN system maskRCNN , which largely simplifies the task by integrating object detection and pose estimation in an end-to-end trainable architecture. This method has been shown to be only marginally inferior to two-stage architectures, like gpapan , that first detect objects and then apply pose estimation to images cropped around the detection results.
As in our previous experiments, we have extended the Mask R-CNN architecture with two displacement branches ($d^h$ and $d^v$) that operate in parallel to the original classification, bounding box regression and pose estimation heads. In the setting of cross-part voting, we trained the whole architecture end-to-end on COCO, using identical experimental settings to those reported in maskRCNN . As shown in Table 5, our MDN-based modification of Mask R-CNN yields a substantial boost in performance over the original Mask R-CNN architecture. We also obtain results that are directly comparable to gpapan , while employing a substantially simpler and faster architecture.
Finally, in Appendix B we show that adding the mass displacement module with additional supervision on offsets further improves the performance of the detection branches (see Tables 6 and 7).
Method  |  AP  |  AP50  |  AP75  |  APM  |  APL  |  AR  |  AR50  |  AR75  |  ARM  |  ARL
Mask R-CNN, keypoints maskRCNN  |  62.7  |  87.0  |  68.4  |  57.4  |  71.1  |  –  |  –  |  –  |  –  |  –
Mask R-CNN, masks+keypoints maskRCNN  |  63.1  |  87.3  |  68.7  |  57.8  |  71.4  |  –  |  –  |  –  |  –  |  –
RMPE CmuPose  |  61.0  |  82.9  |  68.8  |  57.9  |  66.5  |  –  |  –  |  –  |  –  |  –
CMU-Pose CmuPose  |  61.8  |  84.9  |  67.5  |  57.1  |  68.2  |  66.5  |  87.2  |  71.8  |  60.6  |  74.6
G-RMI, COCO only gpapan  |  64.9  |  85.5  |  71.3  |  62.3  |  70.0  |  69.7  |  88.7  |  75.5  |  64.4  |  77.1
Mask R-CNN-MDN, keypoints  |  63.9  |  87.2  |  70.0  |  58.5  |  72.3  |  70.7  |  91.9  |  76.2  |  64.8  |  78.8
4 Conclusion
In this work we have introduced Mass Displacement Networks, a principled approach to integrating voting-type operations within deep architectures. MDNs provide us with a method to accumulate evidence from the image domain through an end-to-end learnable operation. We have demonstrated systematic improvements over strong baselines in human pose estimation, in both the single-person and multi-person settings. The geometric accumulation of evidence implemented by MDNs is generic and can apply to other tasks such as surface, curve and landmark estimation in 3D volumetric data in medical imaging, or curve tracking in space and time; we intend to explore these in the future.
Appendix A
The voting transformation is described in Eq. 1 as follows:

$$o(y) \;=\; \sum_{x} p(x)\, K\big(y - (x + d(x))\big) \qquad (6)$$

If one interprets both $p$ and $o$ as fields of posterior probability values, one has:

$$\sum_{y} o(y) \;=\; \sum_{x} p(x) \sum_{y} K\big(y - (x + d(x))\big) \qquad (7)$$

In this case, ensuring that the total probability mass is preserved, $\sum_{y} o(y) = \sum_{x} p(x)$, would mean that we must use a normalized kernel, e.g. $K(u) = \frac{1}{2\pi\sigma^2} \exp\big(-\|u\|^2 / 2\sigma^2\big)$, as used in gpapan . One counterintuitive resulting property is that the input-output mapping defined by Eq. 1 can then result in a decrease, rather than an accumulation, of evidence. Consider in particular a perfectly-localized and perfectly-confident local evidence signal expressed in the form of a delta function centered at $x_0$:

$$p(x) = \delta(x - x_0) \qquad (8)$$

The result of voting according to Eq. 1 would then be a blurred support map that yields a maximal support of only $\frac{1}{2\pi\sigma^2}$ to the displaced position $x_0 + d(x_0)$:

$$o(y) = K\big(y - (x_0 + d(x_0))\big) \qquad (9)$$

For a large value of $\sigma$ this can result in an arbitrarily low value of $o(x_0 + d(x_0))$, which is counterintuitive given the originally strong evidence at $x_0$. At the root of this problem lies the operation of summing probabilities, which is common when marginalizing over hidden variables but does not make sense as a method of accumulating evidence Williams .
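The dilution effect of Eq. 9 is easy to quantify. The snippet below (our own illustration) computes the maximal support that a perfectly confident delta retains after additive voting with a normalized 2-D Gaussian kernel:

```python
import math

def peak_support(sigma):
    """Maximal support K(0) = 1/(2*pi*sigma^2) that a unit delta gives its
    displaced replica under additive voting with a normalized 2-D Gaussian."""
    return 1.0 / (2.0 * math.pi * sigma ** 2)

# Perfect evidence is diluted as the kernel widens:
# sigma = 1 gives ~0.159, sigma = 3 only ~0.018, despite p(x0) = 1.
```

Under the noisy-OR rule with an unnormalized kernel (K(0) = 1), the same delta instead keeps its full confidence at the displaced position, which is the behavior one would expect from evidence accumulation.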
Appendix B

Finally, we perform an ablation study in the multi-task setting to analyze the effect of the introduced cross-part MDN module on the performance of the other branches of Mask R-CNN, namely the bounding box regressor (Table 6) and the binary mask predictor (Table 7), for the person class on COCO minival. In both cases we observe consistent improvements across the whole set of evaluation metrics. However, in the presence of the MDN module, activating the mask branch does not further improve the quality of pose estimation, as it does in the baseline case.
Method  |  AP  |  AP50  |  AP75  |  APM  |  APL  |  AR1  |  AR10  |  AR100  |  ARM  |  ARL
Mask R-CNN, bb  |  51.5  |  82.5  |  55.0  |  59.4  |  68.5  |  18.2  |  52.2  |  59.8  |  66.9  |  76.3
Mask R-CNN, bb+mask  |  52.2  |  83.1  |  55.9  |  59.8  |  69.7  |  18.4  |  52.8  |  60.4  |  66.9  |  77.2
Mask R-CNN, bb+keypoints  |  51.6  |  81.4  |  55.3  |  60.1  |  69.7  |  18.3  |  52.4  |  60.0  |  67.4  |  77.0
Mask R-CNN-MDN, bb+keypoints  |  52.0  |  81.8  |  55.9  |  60.7  |  70.0  |  18.5  |  52.9  |  60.3  |  67.7  |  77.3
Mask R-CNN, bb+mask+keypoints  |  51.7  |  81.6  |  55.6  |  60.1  |  69.8  |  18.4  |  52.6  |  60.3  |  67.6  |  77.2
Mask R-CNN-MDN, bb+mask+keypoints  |  52.2  |  81.6  |  56.4  |  60.6  |  71.1  |  18.7  |  53.3  |  61.0  |  68.1  |  78.3
Method  |  AP  |  AP50  |  AP75  |  APM  |  APL  |  AR1  |  AR10  |  AR100  |  ARM  |  ARL
Mask R-CNN, bb+mask  |  44.8  |  79.4  |  45.9  |  50.5  |  64.4  |  16.7  |  47.0  |  53.3  |  59.4  |  70.7
Mask R-CNN, bb+mask+keypoints  |  45.0  |  78.5  |  47.3  |  51.4  |  65.2  |  16.8  |  47.3  |  53.8  |  60.6  |  71.4
Mask R-CNN-MDN, bb+mask+keypoints  |  45.6  |  78.3  |  48.1  |  52.0  |  66.1  |  17.0  |  48.1  |  54.5  |  61.3  |  72.3
References
 (1) M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. CVPR, 2014.
 (2) B. Babenko. Multiple instance learning: Algorithms and applications. Technical report, UCSD, 2008.
 (3) D. H. Ballard. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 13(2):111–122, 1981.
 (4) O. Barinova, V. S. Lempitsky, and P. Kohli. On detection of multiple object instances using Hough transforms. CVPR, 2010.
 (5) V. Belagiannis and A. Zisserman. Recurrent human pose estimation. FG, 2017.
 (6) M. B. Blaschko. Branch and bound strategies for non-maximal suppression in object detection. EMMCVPR, 2011.
 (7) A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. ECCV, 2016.
 (8) Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. CVPR, 2017.
 (9) J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. CVPR, 2016.
 (10) S. Chandra and I. Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. ECCV, 2016.
 (11) L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. ICLR, 2015.
 (12) X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang. Multi-context attention for human pose estimation. CVPR, 2017.
 (13) J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. CoRR, abs/1703.06211, 2017.
 (14) J. Gall, A. Yao, N. Razavi, L. J. V. Gool, and V. S. Lempitsky. Hough forests for object detection, tracking, and action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(11):2188–2202, 2011.
 (15) S. Gidaris and N. Komodakis. Object detection via a multi-region & semantic segmentation-aware CNN model. ICCV, 2015.
 (16) G. Gkioxari, A. Toshev, and N. Jaitly. Chained predictions using convolutional neural networks. ECCV, 2016.
 (17) K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. ICCV, 2017.
 (18) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, 2016.
 (19) P. Hu and D. Ramanan. Bottom-up and top-down reasoning with hierarchical rectified Gaussians. CVPR, 2016.
 (20) E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. ECCV, 2016.
 (21) M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. NIPS, 2015.
 (22) P. Krähenbühl and V. Koltun. Parameter learning and convergent inference for dense random fields. ICML, 2013.
 (23) B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization and segmentation. International Journal of Computer Vision, 77(1–3):259–289, 2008.
 (24) I. Lifshitz, E. Fetaya, and S. Ullman. Human pose estimation using deep consensus voting. ECCV, 2016.
 (25) G. Lin, C. Shen, A. van den Hengel, and I. D. Reid. Efficient piecewise training of deep structured models for semantic segmentation. CVPR, 2016.
 (26) T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. ECCV, 2014.
 (27) S. Maji and J. Malik. Object detection using a max-margin Hough transform. CVPR, 2009.
 (28) A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. ECCV, 2016.
 (29) G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. P. Murphy. Towards accurate multi-person pose estimation in the wild. CVPR, 2017.
 (30) J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
 (31) L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele. DeepCut: Joint subset partition and labeling for multi-person pose estimation. CVPR, 2016.
 (32) U. Rafi, J. Gall, and B. Leibe. An efficient convolutional network for human pose estimation. BMVC, 2016.
 (33) N. Razavi, J. Gall, P. Kohli, and L. J. V. Gool. Latent Hough transform for object detection. ECCV, 2012.
 (34) A. Thomas, V. Ferrari, B. Leibe, T. Tuytelaars, B. Schiele, and L. J. V. Gool. Towards multi-view object class detection. CVPR, 2006.
 (35) J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. CVPR, 2015.
 (36) P. A. Viola, J. C. Platt, and C. Zhang. Multiple instance boosting for object detection. NIPS, 2005.
 (37) S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. CVPR, 2016.
 (38) C. K. I. Williams and M. Allan. On a connection between object localization with a generative template of features and pose-space prediction methods. Technical report, Edinburgh University, 2006.
 (39) Z. Wu, C. Shen, and A. van den Hengel. Bridging category-level and instance-level semantic image segmentation. CoRR, abs/1605.06885, 2016.
 (40) Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2878–2890, 2013.
 (41) S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. ICCV, 2015.