1 Introduction and Related Work
In this paper we address the problem of localizing and segmenting partially occluded objects. We do this by generating a bounding box around the full extent of each object, while also segmenting the visible parts inside the box. This is different from semantic segmentation, which typically does not provide information about the spatial position of labelled pixels inside the object. While a lot of progress has been made in object detection (Felzenszwalb et al. (2010); Hinterstoisser et al. (2012); Zhu et al. (2014)), occlusion by other objects still remains a challenge. A common theme is to model occlusion geometrically or appearance-wise, thereby allowing it to contribute to the detection process. Wang et al. (2009) use a holistic Histogram of Oriented Gradients (HOG) template (Dalal and Triggs (2005)) to scan through the image, and use specially trained part templates for instances where some cells of the holistic template respond poorly. Girshick et al. (2011) force the Deformable Part Model detector to place a trained ‘occluder’ part in regions where the original parts respond weakly. The object masks produced by both of these algorithms are only accurate up to the parts and hence not usable for many applications, e.g. edge-based 3D pose estimation. Xiang and Savarese (2013) approximate object structure in 3D using planar parts. A Conditional Random Field (CRF) is then used to reason about visibility of the parts when the 3D planes are projected to the image. However, such methods work well only for large objects that can be approximated with planar parts.
Our approach is entitled Segmentation and Detection using Higher-Order Potentials (SDHOP). It is based on discriminatively learned HOG templates for objects and occlusion. Whereas the object templates model the objects of interest, the occlusion templates provide discriminative support and do not model a specific occluder. Segmentation is done by considering not only the response of patches to these templates, but also the segmentation of neighbouring patches, through a CRF with higher-order connections that encompass image regions.
We will compare our approach to two existing approaches that have been designed to handle partial occlusion. Hsiao and Hebert (2012) approximate all occluders by boxes and generate occlusion hypotheses by finding locations of mismatch between image gradients and object model gradients. These hypotheses are then validated by the visibility of other points of the object and by an occlusion prior which assumes all objects rest on the same planar surface. Our algorithm does not need such assumptions, which reduce segmentation accuracy. Gao et al. (2011) learn discriminative appearance models of the object and of occlusion seen during training. Segmentation is achieved by defining a CRF that assigns binary labels to patches based on their response to these two filters. We build on their work but add several important modifications that lead to better localization and segmentation performance. Firstly, we replace the edge-based pairwise terms with 4-connected pairwise terms that are better able to propagate visibility relations. Secondly, we introduce the use of higher-order potentials defined over groups of patches, allowing us to reason at the level of image segments, which contain much more information than pairs of patches. We also introduce a new loss function for structured learning that targets both localization and segmentation performance but is still decomposable over the energy terms. Lastly, we introduce a simple procedure to convert the granular patch-level object mask produced by the algorithm to a fine pixel-level mask that can be used to make 3D pose estimation of detected objects robust to partial occlusion. Our algorithm outperforms these approaches (Hsiao and Hebert (2012); Gao et al. (2011)) at both object localization and segmentation on the CMU Kitchen Occlusion dataset, as shown in Section 3.
2 Method
The training phase for SDHOP requires a set of images with different occlusions of the object(s) of interest. Each training sample is (1) oversegmented and (2) annotated with a bounding box around the full extent of the object and a binary segmentation of the area inside the box into object vs. non-object pixels. Given these training images and labels, we train a structured Support Vector Machine (SVM) that produces the HOG templates and CRF weights. Figure 1 shows an overview of our approach.

Object segmentation is done by assigning binary labels to all HOG cells within the bounding box: 1 for visible and 0 for occluded. Instead of making independent decisions for every cell, we allow neighbouring cells to influence each other. Neighbour influence can take two forms: (1) pairwise terms (Rother et al. (2004)) that impose a cost for 4-connected neighbours to have different labels, and (2) higher-order potentials (Kohli et al. (2009)) that impose a cost for cells to have a different label than the dominant label in their segment of the image. These segments are produced separately by an unsupervised segmentation algorithm.
2.1 Notation
The label for an object in an image $x$ is represented as $y = (p, v, \theta)$, where $p$ is the bounding box, $v$ is a vector of binary variables indicating the visibility of the HOG cells within $p$, and $\theta$ indexes the discrete viewpoint. $p = (p_x, p_y, p_s)$ indicates the position of the top-left corner and the level $p_s$ in a scale-space pyramid. The width and height of the box are fixed per viewpoint at $N_\theta$ and $M_\theta$ HOG cells respectively; hence $v$ has $M_\theta N_\theta$ elements. All training images are also oversegmented to collect statistics for the higher-order potentials. Any unsupervised algorithm can be used for this, e.g. Felzenszwalb and Huttenlocher (2004) or Arbelaez et al. (2011).

2.2 Feature Extraction
Given an image $x$ and a labelling $y$, a sparse joint feature vector $\Psi(x, y)$ is formed by stacking $|\Theta|$ vectors, where $\Theta$ is the set of discretized viewpoints. Each of these vectors has features for a different viewpoint; all vectors except the one corresponding to viewpoint $\theta$ are zeroed out. Below, we describe the components of this vector.
- HOG features: 31-dimensional HOG features are extracted for all cells of 8×8 pixels in $p$, as described in Felzenszwalb et al. (2010). The feature vector is constructed by stacking two groups which are formed by zeroing out different parts, similarly to Vedaldi and Zisserman (2009). The visible group has the HOG features zeroed out for cells labelled 0, and the occlusion group has them zeroed out for cells labelled 1.
- Complemented visibility labels $\mathbf{1} - v$, used to learn a prior for a cell to be labelled 0.
- The count of cells in bounding box $p$ lying outside the image boundaries, used to learn a cost for truncation by the image boundary, similarly to Vedaldi and Zisserman (2009).
- The number of 4-connected neighbouring cells in the bounding box that have different labels, used to learn a pairwise cost.
- Clique statistics: each segment in the bounding box obtained from unsupervised segmentation defines a clique of cells. To learn higher-order potentials, we need a vector that captures the distribution of 0/1 label agreement within cliques. For each clique $c$ a vector $h^c$ is constructed, whose $j$-th element is 1 if exactly $j$ cells in $c$ are labelled 1, and 0 otherwise. The sum of all $h^c$ within $p$ gives this component. In practice, since cliques do not all have the same size, we employ the normalization strategy described in Gould (2011) and transform the statistics of all cliques to a standard clique size $K$ (fixed in our experiments).
- The constant 1, used to learn a bias term for different viewpoints.
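To make the stacking concrete, the following Python sketch assembles a toy joint feature vector. Function and variable names are ours, not the paper's implementation, and the pairwise and clique components are omitted for brevity; it shows the two HOG groups, the complemented-label prior, the bias constant, and the per-viewpoint zeroing.

```python
def joint_feature(hog, v, theta, n_viewpoints):
    """Toy sketch of the sparse joint feature vector Psi(x, y).

    hog: list of per-cell HOG feature lists inside the box,
    v: list of 0/1 visibility labels (one per cell),
    theta: index of the active discretized viewpoint.
    """
    # visible group: HOG features zeroed out for cells labelled 0
    visible = [x * vi for cell, vi in zip(hog, v) for x in cell]
    # occlusion group: HOG features zeroed out for cells labelled 1
    occluded = [x * (1 - vi) for cell, vi in zip(hog, v) for x in cell]
    prior = [1 - vi for vi in v]        # complemented visibility labels
    block = visible + occluded + prior + [1.0]   # trailing bias constant
    # stack one block per viewpoint; all blocks except theta's stay zero
    psi = [0.0] * (n_viewpoints * len(block))
    psi[theta * len(block):(theta + 1) * len(block)] = block
    return psi
```

For a two-cell box with `v = [1, 0]`, the first cell's HOG features land in the visible group and the second cell's in the occlusion group, and the whole block occupies only the slice belonging to viewpoint `theta`.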
2.3 Learning
Suppose $w$ is a vector of weights for the elements of the joint feature vector. We define $E(x, y; w) = \langle w, \Psi(x, y) \rangle$ as the ‘energy’ of the labelling $y$. The aim of learning is to find $w$ such that the energy of the correct label is minimum. Hence we define the label predicted by the algorithm as
$\hat{y} = \arg\min_{y} E(x, y; w) \qquad (1)$
We use a labelled dataset $\{(x_i, y_i)\}_{i=1}^{n}$ and learn $w$ by solving the following constrained Quadratic Program (QP):
$\min_{w,\, \xi \geq 0} \;\; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \qquad (2)$

s.t. $\;\; E(x_i, y; w) - E(x_i, y_i; w) \geq \Delta(y_i, y) - \xi_i \quad \forall i,\; \forall y \neq y_i$
Intuitively, this formulation requires that the energy of the ground-truth labelling of each training image be smaller than the energy of any other labelling by at least the distance between the two labellings, minus a slack variable $\xi_i$; both $\|w\|$ and the slack variables are minimized. The regularization constant $C$ adjusts the importance of minimizing the slack variables. The above formulation has exponentially many constraints for each training image. For tractability, training is performed using the cutting-plane algorithm of Joachims et al. (2009), which maintains a working set of most violated constraints (MVCs) for each image. Gould (2011) adapts this algorithm for training higher-order potentials. It adds a second-order curvature constraint on the weights for the higher-order potentials, which forces them to form a concave lower envelope. This encourages most cells in an image segment to agree in their visibility labelling.
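The shape of cutting-plane training can be illustrated with a deliberately simplified sketch on a toy problem. The real algorithm (Joachims et al. (2009)) keeps a working set of MVCs and re-solves a QP; here a subgradient step stands in for the QP solve, and the tiny enumerable label space stands in for loss-augmented inference. All names are ours.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def train_sketch(samples, psi, labels, loss, rounds=50, lr=0.1):
    """Simplified flavour of cutting-plane training for an
    energy-minimization structured SVM: per sample, find the most
    violated constraint by loss-augmented inference, and if it is
    violated, lower E(true) relative to E(mvc) with a subgradient step."""
    w = [0.0] * len(psi(samples[0][0], labels[0]))
    for _ in range(rounds):
        for x, y_true in samples:
            # most violated constraint: argmin_y  E(x, y; w) - loss(y_true, y)
            y_mvc = min(labels, key=lambda y: dot(w, psi(x, y)) - loss(y_true, y))
            slack = loss(y_true, y_mvc) - dot(w, psi(x, y_mvc)) + dot(w, psi(x, y_true))
            if slack > 1e-9:  # constraint violated
                w = [wi + lr * (a - b) for wi, a, b in
                     zip(w, psi(x, y_mvc), psi(x, y_true))]
    return w

# toy problem: 2-D inputs, two per-label feature blocks (all but one zeroed)
def psi(x, y):
    out = [0.0] * 4
    out[2 * y:2 * y + 2] = x
    return out

zero_one = lambda a, b: 0.0 if a == b else 1.0
w = train_sketch([([1.0, 0.0], 0), ([0.0, 1.0], 1)], psi, [0, 1], zero_one)
predict = lambda x: min([0, 1], key=lambda y: dot(w, psi(x, y)))
```

After a few passes the ground-truth labellings attain the lowest energy, so `predict` recovers them, mirroring the role Eq. 2 plays at full scale.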
The distance $\Delta(y, \bar{y})$ between two labels $y$ and $\bar{y}$ is called the loss function. It depends on the amount of overlap between the two bounding boxes and the Hamming distance between the visibility labellings:

$\Delta(y, \bar{y}) = \big(1 - \mathrm{ov}(p, \bar{p})\big) + \mathrm{ov}(p, \bar{p})\, H(\tilde{v}, \tilde{\bar{v}}) \qquad (3)$

where $\mathrm{ov}(\cdot, \cdot)$ is the area of intersection divided by the area of union of the two boxes, $H(\cdot, \cdot)$ is the mean Hamming distance, and $\tilde{v}$ is an appropriately zero-padded (to the left and right) version of $v$. The mean Hamming distance between two labellings $v$ and $\bar{v}$ (potentially having different sizes, as they might belong to different viewpoints) is calculated after projecting them to the lowest level of the pyramid. By construction of the loss function, the difference in segmentation starts contributing to the loss only after the two bounding boxes start overlapping each other. It also has the nice property of decomposing over the energy terms, as described in Section 2.4.1.
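The loss just described, a localization term plus an overlap-gated mean Hamming distance, can be sketched in a few lines. The `(x, y, w, h)` box format and the equal-size labelling assumption are ours for illustration; the paper zero-pads and projects labellings to a common pyramid level.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def structured_loss(box_a, v_a, box_b, v_b):
    """Localization term plus overlap-gated mean Hamming distance:
    with no box overlap the loss is 1; as overlap grows, the
    segmentation disagreement takes over."""
    ov = iou(box_a, box_b)
    hamming = sum(p != q for p, q in zip(v_a, v_b)) / len(v_a)
    return (1.0 - ov) + ov * hamming
```

Identical boxes reduce the loss to the mean Hamming distance, while disjoint boxes give a loss of exactly 1 regardless of the labellings, matching the property that segmentation only contributes once the boxes overlap.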
2.4 Inference
To perform the inference described in Eq. 1 we have to search through $\Theta \times P \times V$, where $\Theta$ is the set of viewpoints, $P$ is the set of all pyramid locations, and $V$ is the exponential set of all combinations of visibility variables. We enumerate over $\Theta$ and $P$, and use a min-cut to search over $V$ at every location.
By construction, the weight vector $w$ can be decomposed into weight vectors for the different viewpoints, i.e. $w = [w^{(1)}; \ldots; w^{(|\Theta|)}]$. In the following description we consider one viewpoint and omit the superscript for brevity of notation. The per-viewpoint weight vector can be further decomposed into the six components described in Section 2.2. We define the following terms that are used to construct the graph shown in Figure 2.
$\phi_i$ denotes the vectorized HOG features extracted at cell $i$ in bounding box $p$. Unary terms $u^o_i = \langle w^o, \phi_i \rangle$ and $u^c_i = \langle w^c, \phi_i \rangle$ are the responses at cell $i$ of the object and occlusion filters respectively. $m$ is the prior for a cell to be labelled 0. The constant term $k$ is the sum of the image-boundary truncation cost and the bias. $\mathcal{N}$ is the set of 4-connected neighbouring cell pairs in $p$ and $\lambda$ is the pairwise weight. $\mathcal{S}$ is the set of all cliques in $p$ and $E_c(v_c)$ is the higher-order potential for clique $c$ having nodes with visibility labels $v_c$. Combining these terms, the energy for a particular labelling is formulated as

$E(x, y; w) = \sum_{i} \big( v_i u^o_i + (1 - v_i)(u^c_i + m) \big) + \lambda \sum_{(i,j) \in \mathcal{N}} [v_i \neq v_j] + \sum_{c \in \mathcal{S}} E_c(v_c) + k \qquad (4)$
$E_c(v_c)$, the higher-order potential for clique $c$, is defined following Gould (2011) as

$E_c(v_c) = \min_{t} \Big( a_t \tfrac{1}{|c|} \textstyle\sum_{i \in c} v_i + b_t \Big)$

Intuitively, it is the lower envelope of a set of lines whose slopes $a_t$ and intercepts $b_t$ are linear functions of the $K$-dimensional higher-order weight vector. $|c|$ is the size of the clique; the normalization by $|c|$ makes the potential invariant to clique size (refer to Gould (2011) for details). Figure 2 shows a sample higher-order potential curve for a clique of cells.
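Evaluating such a lower linear envelope is a one-liner, shown below for a hand-picked concave "tent" envelope (our example lines, not learned weights): it is zero when the clique agrees unanimously and largest when the labels are evenly split, which is exactly the label-consistency pressure the potential is meant to apply.

```python
def envelope_potential(v_clique, lines):
    """Lower linear envelope potential in the style of Gould (2011):
    the minimum over (slope, intercept) lines evaluated at the fraction
    of clique cells labelled 1. Training keeps the envelope concave via
    curvature constraints so the potential stays graph-cut friendly."""
    frac = sum(v_clique) / len(v_clique)
    return min(a * frac + b for a, b in lines)
```

With the concave pair `[(1.0, 0.0), (-1.0, 1.0)]` (i.e. min(frac, 1 - frac)), fully agreeing cliques cost 0 and a half-and-half clique costs 0.5.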
Given an image, a location, and a viewpoint, we use min-cut on the graph construction shown in Figure 2 to find the labelling that minimizes the energy in Eq. 4. Each visibility variable $v_i$ defines a node in the graph, and each clique contributes additional auxiliary nodes for the linear segments of its envelope. For a detailed derivation of this graph structure please see Boykov and Kolmogorov (2004) and Gould (2011).
After the max-flow algorithm finishes, the nodes still connected to the source are labelled 1 and the others are labelled 0.
2.4.1 Loss-augmented Inference
Loss-augmented inference is an important part of the cutting-plane training algorithm (the ‘separation oracle’ in Joachims et al. (2009)) and is used to find the most violated constraints. It is defined as $\hat{y} = \arg\min_{y} \big( E(x_i, y; w) - \Delta(y_i, y) \big)$, where $y_i$ is the ground-truth labelling. Our formulation of the loss function makes it solvable with the same complexity as normal inference (Eq. 1) by decomposing the loss over the terms in Eq. 4: the first term of Eq. 3 is added to the constant term, while the second term is distributed across the object and occlusion unary terms.
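Under the overlap-gated Hamming reading of the loss, this distribution over the unaries is mechanical: each cell that disagrees with the ground truth contributes ov/n to the loss, so subtracting that amount from the unary of the disagreeing choice turns normal inference into loss-augmented inference (we minimize E minus the loss). The sketch below is our illustration of that bookkeeping, not the paper's code.

```python
def loss_augment_unaries(unary_obj, unary_occ, v_gt, ov):
    """Fold the per-cell part of an overlap-gated mean Hamming loss into
    the unary terms: labelling a cell 1 where the ground truth v_gt is 0
    (or vice versa) earns loss ov/n, so that amount is subtracted from
    the corresponding unary before running ordinary inference."""
    n = len(v_gt)
    aug_obj = [u - (ov / n) * (g == 0) for u, g in zip(unary_obj, v_gt)]
    aug_occ = [u - (ov / n) * (g == 1) for u, g in zip(unary_occ, v_gt)]
    return aug_obj, aug_occ
```

The graph structure is untouched, which is why the separation oracle costs no more than a normal min-cut.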
2.5 Detection of Multiple Objects
Multiple objects of interest might overlap. Running the individual object detectors separately leaves regions of ambiguity in overlapping areas if multiple detectors mark the same location as visible. We find that running one iteration of α-expansion (see Boykov et al. (2001)) in the overlapping areas resolves these ambiguities coherently. The detectors are run sequentially. We maintain a label map that stores, for each cell, the label of the object that last marked it visible, and a collected response map that stores, for each cell, the object filter response ($u^o$) of the object that last marked it visible. While running the location search for an object, we transfer object filter responses from the collected response map to the occlusion filter responses ($u^c$) of the current object, as described in Algorithm 1.
This is effectively one iteration of α-expansion (see the supplementary material for details). It causes decisions in overlapping regions to be made between responses of well-defined object filters, rather than between responses of an object filter and a generic occlusion filter.
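Since Algorithm 1 itself is not reproduced here, the following is our hypothetical sketch of the transfer step: wherever a previously run detector claimed a cell, its object-filter response replaces the generic occlusion response seen by the current detector.

```python
def transfer_responses(occ_resp, label_map, collected_resp, current_obj):
    """Hypothetical sketch of the response transfer of Section 2.5.
    occ_resp: the current object's occlusion-filter responses per cell,
    label_map: which object (or None) last marked each cell visible,
    collected_resp: that object's object-filter response at the cell.
    Cells claimed by a rival object compete with its concrete response."""
    out = list(occ_resp)
    for i, (lbl, r) in enumerate(zip(label_map, collected_resp)):
        if lbl is not None and lbl != current_obj:
            out[i] = r
        # unclaimed cells (or our own) keep the generic occlusion response
    return out
```

In the toy call below, the cell claimed by 'cup' now presents the cup's object response to the 'pan' detector, while the unclaimed cell and the pan's own cell are unchanged.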
Such response transfer requires the object models to be compatible with each other. We achieve this by training the object models together, as if they were different viewpoint components of the same object. The bias term in the feature vector makes the filter responses of the different components comparable.
2.6 3D Pose Estimation
The basic principle of many model-based 3D pose estimation algorithms is to fit a given 3D model of the object to its corresponding edges in the image. For example, in Choi and Christensen (2012) the 3D CAD model is projected into the image and correspondences between the projected model edges and image edges are set up; the pose is then estimated by solving an Iteratively Reweighted Least Squares (IRLS) problem. However, partial occlusion causes such approaches to fail by introducing new edges. We make the algorithm robust to partial occlusion by first identifying the visible pixels of the object using SDHOP and discarding correspondences outside the visibility mask. We call our extension of the algorithm Occlusion Reasoning IRLS (OR-IRLS).
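The occlusion-reasoning step itself reduces to a filter over correspondences before IRLS runs. The interface below is illustrative, not Choi and Christensen's actual API: each correspondence is a (model_point, image_point) pair and the mask is a set of visible pixel coordinates.

```python
def filter_correspondences(correspondences, visible):
    """OR-IRLS idea in miniature: discard model-to-image edge
    correspondences whose image point lies outside the SDHOP visibility
    mask, so edges introduced by occluders cannot pull the pose fit."""
    return [(m, p) for m, p in correspondences if p in visible]
```

The surviving correspondences are then handed to the unmodified IRLS solver, which is why the extension requires no change to the pose estimator itself.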
3 Evaluation
We implemented SDHOP in Matlab, with the MVC search and inference implemented in CUDA, since they are massively parallel problems. Inference on a 640×480 image with 11 scales takes 3 s for a single object with a single viewpoint on our 3.4 GHz CPU and NVIDIA GT730 GPU.
3.1 Localization and Segmentation
We evaluated our approach on the CMU Kitchen Occlusion dataset from Hsiao and Hebert (2012). This dataset was chosen because (1) it provides extensive labelled training data in the form of images with bounding boxes and object masks, and (2) it is challenging and offers the opportunity to compare against an algorithm designed specifically to handle occlusion. For the localization task we generated false positives per image (FPPI) vs. recall curves, while for the segmentation task we measured the mean segmentation error against ground truth, as defined by the Pascal VOC segmentation challenge (Everingham et al. (2010)). The regularization constant $C$ (see Eq. 2) was chosen by 5-fold cross-validation. While both results are presented for the single-pose part of the dataset, multiple poses are easily handled by our algorithm as different components of the feature vector. Figure 3 shows FPPI vs. recall curves compared with those reported for the rLINE2d+OCLP algorithm of Hsiao and Hebert (2012) and those generated from our implementation of Gao et al. (2011). Table 2 presents segmentation errors compared with Gao et al. (2011). Hsiao and Hebert (2012) do not report a segmentation of the object.
Object       Gao et al. (2011)   SDHOP
Bakingpan    0.2904              0.1516
Colander     0.2095              0.1249
Cup          0.2144              0.1430
Pitcher      0.2499              0.1131
Saucepan     0.1956              0.1103
Scissors     0.2391              0.1649
Shaker       0.2654              0.1453
Thermos      0.2271              0.1285
Pose parameter    IRLS     OR-IRLS
X (cm)            1.6874   0.5774
Y (cm)            1.4953   0.6516
Z (cm)            8.228    2.1506
Roll (degrees)    1.1711   0.7152
Pitch (degrees)   7.9100   2.3191
Yaw (degrees)     5.7712   2.6055
Figure 3 shows that while both SDHOP and Gao et al. (2011) have similar recall at 1.0 FPPI, SDHOP consistently performs better in terms of area under the curve (AUC). Averaged over the 8 objects, SDHOP achieves 16.13% more AUC than Gao et al. (2011). Table 2 shows that SDHOP consistently outperforms Gao et al. (2011) in terms of segmentation error, achieving 42.44% lower segmentation error averaged over the 8 objects. Figure 5 shows examples of the algorithm’s output on various images from the CMU Kitchen Occlusion dataset.
3.2 Ablation Study
We conducted an ablation study on the ‘pitcher’ object of the CMU Kitchen Occlusion dataset to determine the individual effect of our contributions. Using the loss function from Gao et al. (2011) caused the segmentation error to increase from 0.1131 to 0.1547 and the area under the curve (AUC) of FPPI vs. recall to drop from 0.7877 to 0.7071. To discern the effect of the 4-connected pairwise terms, we also removed the higher-order terms from the model. Using the pairwise terms as described in Gao et al. (2011) caused the segmentation error to increase from 0.1547 to 0.2499 and the AUC to decrease from 0.7071 to 0.6414.
Lastly, to quantify the effect of the higher-order potentials, we compared the full SDHOP model against one with the higher-order potentials removed. Removing them caused the segmentation error to increase from 0.1131 to 0.1430 and the AUC to drop from 0.7877 to 0.7544. We hypothesize that for small objects like the ones in the CMU Kitchen Occlusion dataset, 4-connected pairwise terms are almost as informative as higher-order terms. To check this hypothesis, we tested the effect of removing the higher-order potentials on a close-up dataset of 41 images of a pasta box occluded to various degrees by various household objects. Removing the higher-order potentials caused the segmentation error to increase from 0.1308 to 0.1516 and the AUC to drop from 0.9546 to 0.9008. This indicates that higher-order terms are more useful for objects with larger, and hence more informative, segments.
3.3 3D Pose Estimation
We collected 3D pose estimation results produced by IRLS and OR-IRLS on a dataset of 17 images of a car door in an indoor environment. The ground truth pose for the car door was obtained using an ALVAR marker (ALVAR tracking library). Table 2 shows the mean errors in the six pose parameters. To separate errors inherent in the pose estimation process from the effect of occlusion reasoning, the pose of the car door was kept constant throughout the dataset, while various partial occlusions were introduced.
The granular HOG cell-level mask produced by SDHOP caused some important silhouette edges to be missed during pose estimation. To solve this problem, we utilized the unsupervised segmentation computed earlier for the higher-order terms: if more than 80% of the area within a segment was marked 1, we marked the whole segment with 1. Since segments follow object boundaries, this produced much cleaner masks for pose estimation. Figure 4 shows the masks and pose estimation results for an example image from the dataset, with more such examples presented in the supplementary material. Note that the segmentation errors reported in Table 2 use the raw masks.
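The segment-promotion rule above takes only a few lines; this sketch (our naming) operates on a flat cell mask and a list of segments, each given as the indices of its member cells.

```python
def refine_mask(cell_mask, segments, threshold=0.8):
    """Mask refinement of Section 3.3: any segment with more than
    `threshold` of its cells marked visible is promoted wholesale to
    visible; segments below the threshold are left untouched. Because
    segments follow object boundaries, this cleans up the silhouette."""
    out = list(cell_mask)
    for seg in segments:
        if sum(cell_mask[i] for i in seg) / len(seg) > threshold:
            for i in seg:
                out[i] = 1
    return out
```

A segment with 5 of 6 cells visible (≈83%) is filled in completely, while a mostly occluded segment keeps its raw labelling.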
4 Conclusion
We presented an algorithm (SDHOP) that localizes partially occluded objects robustly and segments their visible regions accurately. In contrast to previous approaches that model occlusion, our algorithm uses higher-order potentials to reason at the level of image segments, and employs a loss function that targets both localization and segmentation performance. We demonstrated that our algorithm outperforms existing approaches on both tasks when evaluated on a challenging dataset. Finally, we have shown that the segmentation output from SDHOP can be used to improve pose estimation performance in the presence of occlusion. Avenues of future research include (1) training from weakly labelled data, i.e. without segmentations, (2) a post-training algorithm to make object models comparable without having to train them together, and (3) using the occlusion information to reason about interactions between objects in scene-understanding applications.
We would like to acknowledge Ana Huamán Quispe’s help with implementing this system on a bimanual robot. The system was used to enable the robot to pick up partially visible objects lying on a table.
References
 (1) ALVAR tracking library. http://virtual.vtt.fi/virtual/proj2/multimedia/alvar/index.html. Accessed: 2015-05-03.
 Arbelaez et al. (2011) Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(5):898–916, 2011. URL http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5557884.
 Boykov and Kolmogorov (2004) Yuri Boykov and Vladimir Kolmogorov. An experimental comparison of mincut/maxflow algorithms for energy minimization in vision. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(9):1124–1137, 2004. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1316848.
 Boykov et al. (2001) Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 23(11):1222–1239, 2001. URL http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=969114.
 Choi and Christensen (2012) Changhyun Choi and Henrik I Christensen. Robust 3D visual tracking using particle filtering on the special Euclidean group: A combined approach of keypoint and edge features. The International Journal of Robotics Research, 31(4):498–519, 2012. URL http://ijr.sagepub.com/content/early/2012/03/01/0278364912437213.
 Dalal and Triggs (2005) Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005. URL http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1467360.
 Everingham et al. (2010) M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88(2):303–338, June 2010. URL http://link.springer.com/article/10.1007%2Fs1126300902754.
 Felzenszwalb and Huttenlocher (2004) Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graphbased image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004. URL http://link.springer.com/article/10.1023%2FB%3AVISI.0000022288.19776.77.
 Felzenszwalb et al. (2010) Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained partbased models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5255236.
 Gao et al. (2011) Tianshi Gao, Benjamin Packer, and Daphne Koller. A segmentationaware object detection model with occlusion handling. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1361–1368. IEEE, 2011. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5995623.
 Girshick et al. (2011) Ross B Girshick, Pedro F Felzenszwalb, and David A Mcallester. Object detection with grammar models. In Advances in Neural Information Processing Systems, pages 442–450, 2011. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.231.2429.
 Gould (2011) Stephen Gould. Max-margin learning for lower linear envelope potentials in binary Markov random fields. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 193–200, 2011. URL http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6945904.
 Hinterstoisser et al. (2012) Stefan Hinterstoisser, Cedric Cagniart, Slobodan Ilic, Peter Sturm, Nassir Navab, Pascal Fua, and Vincent Lepetit. Gradient response maps for real-time detection of textureless objects. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(5):876–888, 2012. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6042881.
 Hsiao and Hebert (2012) Edward Hsiao and Martial Hebert. Occlusion reasoning for object detection under arbitrary viewpoint. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3146–3153. IEEE, 2012. URL http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6248048.
 Joachims et al. (2009) Thorsten Joachims, Thomas Finley, and ChunNam John Yu. Cuttingplane training of structural SVMs. Machine Learning, 77(1):27–59, 2009. URL http://link.springer.com/article/10.1007%2Fs1099400951088.
 Kohli et al. (2009) Pushmeet Kohli, Philip HS Torr, et al. Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision, 82(3):302–324, 2009. URL http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4587417.
 Rother et al. (2004) Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In ACM Transactions on Graphics (TOG), volume 23, pages 309–314. ACM, 2004. URL http://dl.acm.org/citation.cfm?id=1015720.
 Vedaldi and Zisserman (2009) Andrea Vedaldi and Andrew Zisserman. Structured output regression for detection with partial truncation. In Advances in neural information processing systems, pages 1928–1936, 2009.
 Wang et al. (2009) Xiaoyu Wang, Tony X Han, and Shuicheng Yan. An HOGLBP human detector with partial occlusion handling. In Computer Vision, 2009 IEEE 12th International Conference on, pages 32–39. IEEE, 2009. URL http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5459207.
 Xiang and Savarese (2013) Yu Xiang and Silvio Savarese. Object Detection by 3D Aspectlets and Occlusion Reasoning. In Computer Vision Workshops (ICCVW), 2013 IEEE International Conference on, pages 530–537. IEEE, 2013. URL http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6755942.
 Zhu et al. (2014) Menglong Zhu, Konstantinos G Derpanis, Yinfei Yang, Samarth Brahmbhatt, Mabel Zhang, Cody Phillips, Matthieu Lecce, and Kostas Daniilidis. Single image 3D object detection and pose estimation for grasping. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 3936–3943. IEEE, 2014. URL http://www.cis.upenn.edu/~menglong/papers/icra2014_object_grasping.pdf.