
Occlusion-Aware Object Localization, Segmentation and Pose Estimation

by   Samarth Brahmbhatt, et al.

We present a learning approach for localization and segmentation of objects in an image in a manner that is robust to partial occlusion. Our algorithm produces a bounding box around the full extent of the object and labels pixels in the interior that belong to the object. Like existing segmentation-aware detection approaches, we learn an appearance model of the object and consider regions that do not fit this model as potential occlusions. However, in addition to the established use of pairwise potentials for encouraging local consistency, we use higher-order potentials which capture information at the level of image segments. We also propose an efficient loss function that targets both localization and segmentation performance. Our algorithm achieves 13.52% segmentation error on average over the challenging CMU Kitchen Occlusion Dataset. This is a 42.44% decrease in segmentation error, together with a 16.13% increase in area under the false positives per image vs. recall curve (localization performance), compared to the state of the art. Finally, we show that the visibility labelling produced by our algorithm can make full 3D pose estimation from a single image robust to occlusion.



1 Introduction and Related Work

In this paper we address the problem of localizing and segmenting partially occluded objects. We do this by generating a bounding box around the full extent of the objects, while also segmenting the visible parts inside the box. This is different from semantic segmentation, which typically does not provide information about the spatial position of labelled pixels inside the object. While a lot of progress has been made in object detection (Felzenszwalb et al. (2010), Hinterstoisser et al. (2012), Zhu et al. (2014)), occlusion by other objects still remains a challenge. A common theme is to model occlusion geometrically or appearance-wise, thereby allowing it to contribute to the detection process. Wang et al. (2009) use a holistic Histogram of Oriented Gradients (HOG) template (Dalal and Triggs (2005)) to scan through the image and use specially trained part templates for instances where some cells of the holistic template respond poorly. Girshick et al. (2011) force the Deformable Part Model detector to place a trained ‘occluder’ part in regions where the original parts respond weakly. The object masks produced by both of these algorithms are only accurate up to the parts and hence not usable for many applications, e.g. edge-based 3D pose estimation. Xiang and Savarese (2013) approximate object structure in 3D using planar parts. A Conditional Random Field (CRF) is then used to reason about visibility of the parts when the 3D planes are projected to the image. However, such methods work well only for large objects that can be approximated with planar parts.

Our approach is entitled Segmentation and Detection using Higher-Order Potentials (SD-HOP). It is based on discriminatively learned HOG templates for objects and occlusion. Whereas the object templates model the objects of interest, the occlusion templates provide discriminative support and do not model a specific occluder. Segmentation is done by considering not only the response of patches to these templates, but also the segmentation of neighbouring patches through a CRF with higher-order connections that encompass image regions.

We will compare our approach to two existing approaches that have been designed to handle partial occlusion. Hsiao and Hebert (2012) approximate all occluders by boxes and generate occlusion hypotheses by finding locations of mismatch between image gradient and object model gradient. These hypotheses are then validated by the visibility of other points of the object and by an occlusion prior which assumes all objects rest on the same planar surface. Our algorithm does not need such assumptions, which reduce the segmentation accuracy. Gao et al. (2011) learn discriminative appearance models of the object and occlusion seen during training. Segmentation is achieved by defining a CRF to assign binary labels to patches based on their response to these two filters. We build on their work but add several important modifications that lead to better localization and segmentation performance. Firstly, we replace the edge-based pairwise terms with 4-connected pairwise terms that are better able to propagate visibility relations. Secondly, we introduce the use of higher-order potentials defined over groups of patches, allowing us to reason at the level of image segments which contain much more information than pairs of patches. Thirdly, we introduce a new loss function for structured learning that targets both localization and segmentation performance but is still decomposable over the energy terms. Lastly, we introduce a simple procedure to convert the granular patch-level object mask produced by the algorithm to a fine pixel-level mask that can be used to make 3D pose estimation of detected objects robust to partial occlusion. Our algorithm outperforms both of these approaches (Hsiao and Hebert (2012), Gao et al. (2011)) at object localization and segmentation on the CMU Kitchen Occlusion dataset, as shown in Section 3.

The rest of the paper is organized as follows. Section 2 describes our proposed approach. We present evaluations on standard datasets and our own laboratory dataset in Section 3 and summarize in Section 4.

2 Method

Figure 1: Overview of our approach. Top: During training, images are segmented and features are extracted from pyramids of segmentations and HOG features. An SVM model is learned by max-margin learning. Bottom: After training, the model can be used to infer a bounding box and visible segments of the object.

The training phase for SD-HOP requires a set of images with different occlusions of the object(s) of interest. Each training sample is (1) over-segmented and (2) annotated with a bounding box around the full extent of the object and a binary segmentation of the area inside the box into object vs. non-object pixels. Given these training images and labels, we train a structured Support Vector Machine (SVM) that produces the HOG templates and CRF weights. Figure 1 shows an overview of our approach.

Object segmentation is done by assigning binary labels to all HOG cells within the bounding box, 1 for visible and 0 for occluded. Instead of making independent decisions for every cell, we allow neighbouring cells to influence each other. Neighbour influence can take two forms: (1) pairwise terms (Rother et al. (2004)) that impose a cost for 4-connected neighbours to have different labels and (2) higher-order potentials (Kohli et al. (2009)) that impose a cost for cells to have a different label than the dominant label in their segment of the image. These segments are produced separately by an unsupervised segmentation algorithm.

2.1 Notation

The label for an object in an image $x$ is represented as $y = (p, v, a)$, where $p$ is the bounding box, $v$ is a vector of binary variables indicating the visibility of HOG cells within $p$ and $a \in \{1, \dots, A\}$ indexes the discrete viewpoint. $p = (p_x, p_y, p_l)$ indicates the position of the top left corner and the level in a scale-space pyramid. The width and height of the box are fixed per viewpoint as $W_a$ and $H_a$ HOG cells respectively. Hence $v$ has $W_a H_a$ elements. All training images are also over-segmented to collect statistics for higher-order potentials. Any unsupervised algorithm can be used for this, e.g. Felzenszwalb and Huttenlocher (2004) and Arbelaez et al. (2011).

2.2 Feature Extraction

Given an image $x$ and a labelling $y = (p, v, a)$, a sparse joint feature vector $\Psi(x, y)$ is formed by stacking $A$ vectors. Each of these vectors has features for a different discretized viewpoint. All vectors except for the one corresponding to viewpoint $a$ are zeroed out. Below, we describe the components of this vector.

  1. 31-dimensional HOG features are extracted for all cells of 8x8 pixels in $p$ as described in Felzenszwalb et al. (2010). The feature vector is constructed by stacking two groups which are formed by zeroing out different parts, similarly to Vedaldi and Zisserman (2009). The visible group has the HOG features zeroed out for cells labelled 0 and the occlusion group has them zeroed out for cells labelled 1.

  2. Complemented visibility labels, to learn a prior for a cell to be labelled 0: $1 - v_i$ for every cell $i$.

  3. Count of cells in bounding box $p$ lying outside the image boundaries, to learn a cost for truncation by the image boundary, similarly to Vedaldi and Zisserman (2009).

  4. Number of 4-connected neighbouring cells in the bounding box that have different labels, to learn a pairwise cost.

  5. Each segment in the bounding box obtained from unsupervised segmentation defines a clique of cells. To learn higher-order potentials, we need a vector that captures the distribution of 0/1 label agreement within cliques. A vector $z_c \in \{0, 1\}^{K+1}$ is constructed for each clique $c$ as $z_{c,k} = 1$ if $\sum_{i \in c} v_i = k$ and 0 otherwise. The sum of all $z_c$ within $p$ gives the higher-order feature vector. In practice, since cliques do not have the same size we employ the normalization strategy described in Gould (2011) and transform statistics of all cliques to a standard clique size $K$.

  6. The constant 1, used to learn a bias term for different viewpoints.
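As a rough illustration of components 1 and 4, the sketch below splits per-cell features into a visible and an occlusion group and counts 4-connected label disagreements. The function names, the list-of-lists grid layout and the toy 2-dimensional features (standing in for 31-dimensional HOG) are assumptions made for illustration and are not taken from the original implementation.

```python
def split_groups(features, labels):
    """Stack a visible group and an occlusion group of per-cell features.

    features: rows x cols grid, each entry a list of feature values.
    labels:   rows x cols grid of 0/1 visibility labels.
    The visible group zeroes cells labelled 0; the occlusion group
    zeroes cells labelled 1.
    """
    visible, occluded = [], []
    for feat_row, lab_row in zip(features, labels):
        for feat, lab in zip(feat_row, lab_row):
            zeros = [0.0] * len(feat)
            visible.extend(feat if lab == 1 else zeros)
            occluded.extend(feat if lab == 0 else zeros)
    return visible + occluded


def count_disagreements(labels):
    """Number of 4-connected neighbour pairs with different labels."""
    rows, cols = len(labels), len(labels[0])
    count = 0
    for r in range(rows):
        for c in range(cols):
            # Look right and down so every neighbour pair is counted once.
            if c + 1 < cols and labels[r][c] != labels[r][c + 1]:
                count += 1
            if r + 1 < rows and labels[r][c] != labels[r + 1][c]:
                count += 1
    return count


features = [[[1.0, 2.0], [3.0, 4.0]],
            [[5.0, 6.0], [7.0, 8.0]]]
labels = [[1, 0],
          [1, 1]]
stacked = split_groups(features, labels)
pairwise = count_disagreements(labels)
```

Because each cell's features appear in exactly one of the two groups, a single linear weight vector can hold separate object and occlusion templates.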

2.3 Learning

Suppose $w$ is a vector of weights for elements of the joint feature vector. We define $w^\top \Psi(x, y)$ as the ‘energy’ of the labelling $y$. The aim of learning is to find $w$ such that the energy of the correct label is minimum. Hence we define the label predicted by the algorithm as

$$\hat{y} = \arg\min_{y} \; w^\top \Psi(x, y) \qquad (1)$$

We use a labelled dataset $\{(x_i, y_i)\}_{i=1}^{n}$ and learn $w$ by solving the following constrained Quadratic Program (QP)

$$\min_{w,\; \xi \ge 0} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad w^\top \Psi(x_i, y) - w^\top \Psi(x_i, y_i) \ge \Delta(y_i, y) - \xi_i \;\; \forall i,\; \forall y \ne y_i, \qquad \tilde{D} w \le 0 \qquad (2)$$

Intuitively this formulation requires that the score (energy) of any ground truth labelled image must be smaller than the score of any other labelling by the distance between the two labellings minus the slack variable $\xi_i$, where $\|w\|$ and $\xi_i$ are minimized. The regularization constant $C$ adjusts the importance of minimizing the slack variables. The above formulation has exponentially many constraints for each training image. For tractability, training is performed by using the cutting plane training algorithm of Joachims et al. (2009) which maintains a working set of most violated constraints (MVCs) for each image. Gould (2011) adapts this algorithm for training higher-order potentials. It uses $\tilde{D} w \le 0$ as a second order curvature constraint on the weights for the higher-order potentials, which forces them to make a concave lower envelope. This encourages most cells in the image segments to agree in visibility labelling. $\tilde{D}$ is an appropriately 0-padded (to the left and right) version of $D^2$, the discrete second-difference operator with rows $(1, -2, 1)$, so that it acts only on the higher-order weights.

The distance $\Delta(y_i, y)$ between two labels $y_i$ and $y$ is called the loss function. It depends on the amount of overlap between the two bounding boxes and the Hamming distance between the visibility labellings

$$\Delta(y_i, y) = \left(1 - o(p_i, p)\right) + o(p_i, p)\, d_H(v_i, v), \qquad o(p_i, p) = \frac{\mathrm{area}(p_i \cap p)}{\mathrm{area}(p_i \cup p)} \qquad (3)$$

The mean Hamming distance $d_H(v_i, v)$ between two labellings $v_i$ and $v$ (potentially having different sizes as they might belong to different viewpoints) is calculated after projecting them to the lowest level of the pyramid. By construction of the loss function, the difference in segmentation starts contributing to the loss only after the two bounding boxes start overlapping each other. It also has the nice property of decomposing over the energy terms, as described in Section 2.4.1.
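A loss with the properties described above can be sketched as follows, taking it to be (1 - o) + o * d_H, where o is the intersection-over-union of the two boxes and d_H is the mean Hamming distance. This specific form, the (x0, y0, x1, y1) box format and the function names are assumptions for illustration; the sketch also assumes both labellings have already been aligned to the same length.

```python
def box_overlap(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_hamming(v_true, v_pred):
    """Fraction of cells whose labels differ (equal-length label lists)."""
    return sum(a != b for a, b in zip(v_true, v_pred)) / len(v_true)


def loss(box_true, v_true, box_pred, v_pred):
    """Localization term plus overlap-weighted segmentation term."""
    o = box_overlap(box_true, box_pred)
    return (1.0 - o) + o * mean_hamming(v_true, v_pred)


# Disjoint boxes: only the localization term contributes.
disjoint = loss((0, 0, 2, 2), [1, 1, 0, 0], (5, 5, 7, 7), [0, 0, 1, 1])
# Identical boxes: only the segmentation term contributes.
same_box = loss((0, 0, 2, 2), [1, 1, 0, 0], (0, 0, 2, 2), [1, 0, 0, 0])
```

With disjoint boxes the loss is 1 regardless of the visibility labels, which matches the statement that segmentation differences only matter once the boxes overlap.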

2.4 Inference

To perform the inference as described in Eq. 1 we have to search through $\mathcal{A} \times \mathcal{P} \times \mathcal{V}$ where $\mathcal{A}$ is the set of viewpoints, $\mathcal{P}$ is the set of all pyramid locations and $\mathcal{V}$ is the exponential set of all combinations of visibility variables. We enumerate over $\mathcal{A}$ and $\mathcal{P}$ and use mincut to search over $\mathcal{V}$ at every location.

By construction, the weight vector $w$ can be decomposed into weight vectors for the different viewpoints, i.e. $w = [w^1, \dots, w^A]$. In the following description, we will consider one viewpoint and omit the superscript for brevity of notation. $w$ can also be decomposed into the six components described in Section 2.2. We define the following terms that are used to construct the graph shown in Figure 2.

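$\phi_i$ are the vectorized HOG features extracted at cell $i$ in bounding box $p$. Unary terms $r^{obj}_i = w_{obj,i}^\top \phi_i$ and $r^{occ}_i = w_{occ,i}^\top \phi_i$ are the responses at cell $i$ for object and occlusion filters respectively. $w_{prior,i}$ is the prior for cell $i$ to be labelled 0. Constant term $\kappa$ is the sum of image boundary truncation cost and bias. $\mathcal{E}$ is the set of 4-connected neighbouring cells in $p$ and $w_{pw}$ is the pairwise weight. $\mathcal{C}$ is the set of all cliques in $p$ and $\psi_c(v_c)$ is the higher-order potential for clique $c$ having nodes with visibility labels $v_c$. Combining these terms, the energy for a particular labelling $v$ is formulated as

$$E(v) = \sum_{i} \left[ v_i\, r^{obj}_i + (1 - v_i) \left( r^{occ}_i + w_{prior,i} \right) \right] + \kappa + w_{pw} \sum_{(i,j) \in \mathcal{E}} [v_i \ne v_j] + \sum_{c \in \mathcal{C}} \psi_c(v_c) \qquad (4)$$
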
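$\psi_c(v_c)$, the higher-order potential for clique $c$, is defined as $\psi_c(v_c) = \min_{k = 1, \dots, K} \left\{ s_k\, n_c + \theta_k - s_k k \right\}$ with $s_k = \theta_k - \theta_{k-1}$ and $n_c = \frac{K}{N_c} \sum_{i \in c} v_i$, following Gould (2011). Intuitively, it is the lower envelope of a set of lines whose slope is defined as $s_k$ and intercept as $\theta_k - s_k k$ (recall that $w_{hop} = (\theta_0, \dots, \theta_K)$ is a $(K+1)$-dimensional weight vector). $N_c$ is the size of the clique. The normalization in $n_c$ makes the potential invariant to the size of the clique (refer to Gould (2011) for details). Figure 2 shows a sample higher-order potential curve for a clique of cells.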
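The lower-envelope idea can be illustrated with a few lines of code. The sketch below evaluates a clique potential as the minimum over a set of (slope, intercept) lines at the fraction of visible cells in the clique; the three example lines and the function names are hypothetical and chosen only so that the envelope is concave.

```python
def lower_envelope(lines, fraction):
    """Minimum over lines (slope, intercept) evaluated at `fraction`."""
    return min(slope * fraction + intercept for slope, intercept in lines)


def clique_potential(lines, labels):
    """Potential of a clique given the 0/1 labels of its cells.

    Using the fraction of visible cells (rather than the raw count)
    makes the value independent of the clique size.
    """
    fraction = sum(labels) / len(labels)
    return lower_envelope(lines, fraction)


# Hypothetical concave "tent": cheapest when all cells agree.
lines = [(2.0, 0.0), (0.0, 0.5), (-2.0, 2.0)]

all_visible = clique_potential(lines, [1, 1, 1, 1])
all_occluded = clique_potential(lines, [0, 0, 0, 0])
mixed = clique_potential(lines, [1, 1, 0, 0])
```

Here a clique whose cells all share one label costs nothing, while an evenly split clique pays the largest penalty, which is the behaviour a concave potential is meant to encourage.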

Given an image, a location $p$, and a viewpoint $a$ we use mincut on the graph construction shown in Figure 2 to find the labelling $v$ that minimizes the energy in Eq. 4. Each variable $v_i$ defines a node and each clique contributes additional auxiliary nodes to the graph. For a detailed derivation of this graph structure please see Boykov and Kolmogorov (2004) and Gould (2011).

Figure 2: (a): Concave higher-order potentials encouraging cells in a clique to have the same binary label, (b): Construction of graph to compute the energy minimizing binary labelling of cells by mincut.

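After the maxflow algorithm finishes, each cell is labelled according to the side of the cut on which its node lies: the nodes still connected to the source receive one label and all others receive the other.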
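For very small grids, the labelling found by mincut can be checked against exhaustive search. The sketch below is not the mincut construction; it enumerates all labellings of an energy with unary and 4-connected pairwise terms only (the higher-order and constant terms are omitted for brevity). The cost values and function names are hypothetical.

```python
from itertools import product


def energy(labels, obj_cost, occ_cost, edges, pair_weight):
    """Unary cost per cell plus a penalty for each disagreeing edge."""
    unary = sum(obj_cost[i] if v == 1 else occ_cost[i]
                for i, v in enumerate(labels))
    pairwise = pair_weight * sum(labels[i] != labels[j] for i, j in edges)
    return unary + pairwise


def brute_force_min(obj_cost, occ_cost, edges, pair_weight):
    """Exhaustive search over all 2^n labellings (tiny n only)."""
    n = len(obj_cost)
    return min(product((0, 1), repeat=n),
               key=lambda v: energy(v, obj_cost, occ_cost, edges, pair_weight))


# A 1x4 strip of cells; on its own, cell 2 slightly prefers "occluded".
obj_cost = [0.0, 0.0, 0.6, 0.0]
occ_cost = [1.0, 1.0, 0.5, 1.0]
edges = [(0, 1), (1, 2), (2, 3)]

independent = brute_force_min(obj_cost, occ_cost, edges, pair_weight=0.0)
smoothed = brute_force_min(obj_cost, occ_cost, edges, pair_weight=0.5)
```

Without the pairwise term the weakly occluded cell is labelled 0; with it, the neighbours pull that cell to label 1, which illustrates how neighbour influence changes independent per-cell decisions.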

2.4.1 Loss-augmented Inference

Loss-augmented inference is an important part of the cutting plane training algorithm (‘separation oracle’ in Joachims et al. (2009)) and is used to find the most violated constraints. It is defined as $\hat{y} = \arg\min_{y} \left[ w^\top \Psi(x_i, y) - \Delta(y_i, y) \right]$, where $y_i$ is the ground-truth labelling. Our formulation of the loss function makes it solvable with the same complexity as normal inference (Eq. 1) by decomposing the loss over the terms in Eq. 4. The first term of Eq. 3 is added to the constant term, while the second term is distributed across the unary terms in Eq. 4.
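The decomposition can be made explicit with a short worked step. It assumes a loss of the form $(1 - o) + o\,d_H$, where $o$ is the bounding-box overlap and $d_H$ is the mean Hamming distance over $N$ aligned cells; the symbols $o$, $N$ and $v^*_i$ (the ground-truth label of cell $i$) are notation introduced here for illustration.

```latex
d_H(v^*, v) = \frac{1}{N} \sum_{i=1}^{N} \left[ v_i \,(1 - v^*_i) + (1 - v_i)\, v^*_i \right]
\quad\Longrightarrow\quad
-\,o\, d_H(v^*, v) = \sum_{i=1}^{N} \left[ v_i \cdot \frac{-\,o\,(1 - v^*_i)}{N} + (1 - v_i) \cdot \frac{-\,o\, v^*_i}{N} \right]
```

Every summand involves a single cell, so subtracting the loss only shifts the cost of labelling cell $i$ as 1 by $-o(1 - v^*_i)/N$ and the cost of labelling it 0 by $-o\,v^*_i/N$. The remaining term $(1 - o)$ does not depend on $v$ and can be absorbed into the constant, so the same mincut machinery applies unchanged.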

2.5 Detection of Multiple Objects

Multiple objects of interest might overlap. Running the individual object detectors separately leaves regions of ambiguity in overlapping areas if multiple detectors mark the same location as visible. We find that running one iteration of $\alpha$-expansion (see Boykov et al. (2001)) in overlapping areas resolves ambiguities coherently. The detectors are run sequentially. We maintain a label map $L$ that stores for each cell the label of the object that last marked it visible, and a collected response map $R$ that stores for each cell the object filter response ($r^{obj}$) from the object that last marked it visible. While running the location search for object $k$, we transfer object filter responses from $R$ to the occlusion filter response map ($r^{occ}_k$) for the current object as described in Algorithm 1.

  for all $k \in \{1, \dots, n_o\}$ do {$n_o$ is the number of objects, $\odot$ denotes the Hadamard product}
     for all locations $p$ do
         $r^{occ}_k \leftarrow m \odot R + (1 - m) \odot r^{occ}_k$, where $m = [L \ne 0]$ {Transfer equation for all cells in $p$}
     end for
     $L \leftarrow (1 - \hat{v}) \odot L + k\,\hat{v}$, $\; R \leftarrow (1 - \hat{v}) \odot R + \hat{v} \odot r^{obj}_k$ {Update equations for all cells in $\hat{p}$}
  end for
Algorithm 1 Response-transfer between object detectors in overlapping regions

This is effectively one iteration of $\alpha$-expansion (see supplementary material for details). It causes decisions in overlapping regions to be made between responses of well-defined object filters rather than between responses of an object filter and a generic occlusion filter.

Such response-transfer requires the object models to be compatible with each other. We achieve this by training the object models together as if they were different viewpoint components of the same object. The bias term in the feature vector makes the filter responses of different components comparable.
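The bookkeeping behind response transfer can be sketched as below. This is one reading of the prose description, not the authors' code: cells are flattened into lists, label 0 means that no object has claimed the cell yet, and the function names and response values are hypothetical.

```python
def transfer(occ_response, label_map, collected):
    """Occlusion responses to use for the current object.

    In cells already claimed by an earlier object (label != 0) the
    generic occlusion response is replaced by the stored object filter
    response of the claiming object.
    """
    return [collected[i] if label_map[i] != 0 else occ_response[i]
            for i in range(len(occ_response))]


def update(label_map, collected, obj_id, visible, obj_response):
    """Record the cells that object `obj_id` marked visible (in place)."""
    for i, v in enumerate(visible):
        if v == 1:
            label_map[i] = obj_id
            collected[i] = obj_response[i]


label_map = [0, 0, 0, 0]
collected = [0.0, 0.0, 0.0, 0.0]

# Object 1 is detected first and marks the first two cells visible.
update(label_map, collected, 1, [1, 1, 0, 0], [-2.0, -1.5, 0.3, 0.4])

# Object 2 now competes against object 1's responses in those cells.
occ_for_obj2 = transfer([0.1, 0.1, 0.1, 0.1], label_map, collected)
```

In the overlapping cells the second detector must now beat the first object's own filter response rather than a generic occlusion response, which is the effect described in the text.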

2.6 3D Pose Estimation

The basic principle of many model-based 3D pose estimation algorithms is to fit a given 3D model of the object to its corresponding edges in the image. For example, in Choi and Christensen (2012) the 3D CAD model is projected into the image and correspondences between the projected model edges and image edges are set up. The pose is estimated by solving an Iterative Re-weighted Least Squares (IRLS) problem. However, partial occlusion causes these approaches to fail by introducing new edges. We make the algorithm robust to partial occlusion by first identifying visible pixels of the object using SD-HOP and discarding correspondences outside the visibility mask. We call our extension of the algorithm Occlusion Reasoning-IRLS (OR-IRLS).
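The correspondence filtering step can be sketched as follows. The data layout is an assumption made for illustration: each correspondence is a (model point, image point) pair, image points are (x, y) pixel coordinates, and the mask is a row-major grid of 0/1 values. The IRLS solver itself is not shown.

```python
def filter_correspondences(correspondences, mask):
    """Keep pairs whose image point falls on a visible (1) mask pixel."""
    rows, cols = len(mask), len(mask[0])
    kept = []
    for model_pt, (x, y) in correspondences:
        col, row = int(round(x)), int(round(y))
        inside = 0 <= row < rows and 0 <= col < cols
        if inside and mask[row][col] == 1:
            kept.append((model_pt, (x, y)))
    return kept


mask = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]

correspondences = [
    ((0.0, 0.0, 0.0), (1.2, 0.1)),  # visible pixel: kept
    ((1.0, 0.0, 0.0), (0.0, 1.0)),  # occluded pixel: dropped
    ((0.0, 1.0, 0.0), (2.0, 1.0)),  # visible pixel: kept
    ((1.0, 1.0, 0.0), (5.0, 5.0)),  # outside the mask: dropped
]

kept = filter_correspondences(correspondences, mask)
```

Only the surviving pairs would then be passed to the least-squares pose update, so edges created by the occluder no longer pull the estimate away from the object.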

3 Evaluation

We implemented SD-HOP in Matlab, with MVC search and inference implemented in CUDA since they are massively parallel problems. Inference on a 640x480 image with 11 scales takes 3 s for a single object with a single viewpoint on our 3.4 GHz CPU and NVIDIA GT-730 GPU.

3.1 Localization and Segmentation

We evaluated our approach on the CMU Kitchen Occlusion Dataset from Hsiao and Hebert (2012). This dataset was chosen because (1) it provides extensive labelled training data in the form of images with bounding boxes and object masks, and (2) the dataset is challenging and offers the opportunity to compare against an algorithm designed specifically to handle occlusion. For the localization task we generated false positives per image (FPPI) vs. recall curves, while for the segmentation task we measured the mean segmentation error against ground truth as defined by the Pascal VOC segmentation challenge in Everingham et al. (2010). The regularization constant $C$ (see Eq. 2) was chosen by 5-fold cross-validation. While both results are presented for the single pose part of the dataset, multiple poses are easily handled in our algorithm as different components of the feature vector. Figure 3 shows FPPI vs. recall curves compared with those reported for the rLINE2d+OCLP algorithm of Hsiao and Hebert (2012) and those generated from our implementation of Gao et al. (2011). Table 1 presents segmentation errors compared with Gao et al. (2011); Hsiao and Hebert (2012) do not report a segmentation of the object.

Figure 3: Object localization results on the CMU Kitchen Occlusion dataset

Table 1: Mean object segmentation error
Object       Gao et al. (2011)   SD-HOP
Bakingpan    0.2904              0.1516
Colander     0.2095              0.1249
Cup          0.2144              0.1430
Pitcher      0.2499              0.1131
Saucepan     0.1956              0.1103
Scissors     0.2391              0.1649
Shaker       0.2654              0.1453
Thermos      0.2271              0.1285

Table 2: Mean 3D pose estimation error
Pose parameter     IRLS     OR-IRLS
X (cm)             1.6874   0.5774
Y (cm)             1.4953   0.6516
Z (cm)             8.228    2.1506
Roll (degrees)     1.1711   0.7152
Pitch (degrees)    7.9100   2.3191
Yaw (degrees)      5.7712   2.6055

Figure 3 shows that while both SD-HOP and Gao et al. (2011) have similar recall at 1.0 FPPI, SD-HOP consistently performs better in terms of area under the curve (AUC). Averaged over the 8 objects, SD-HOP achieves 16.13% more AUC than Gao et al. (2011). Table 1 shows that SD-HOP consistently outperforms Gao et al. (2011) in terms of segmentation error, achieving 42.44% less segmentation error averaged over the 8 objects. Figure 5 shows examples of the algorithm’s output on various images from the CMU Kitchen Occlusion dataset.
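The averaged segmentation figures can be recomputed from the per-object errors reported in the segmentation error table. The sketch below assumes that the quoted reduction is the mean of the per-object relative reductions; that interpretation is an assumption, and the variable names are illustrative.

```python
# Per-object mean segmentation errors, copied from the table.
gao = {
    "Bakingpan": 0.2904, "Colander": 0.2095, "Cup": 0.2144,
    "Pitcher": 0.2499, "Saucepan": 0.1956, "Scissors": 0.2391,
    "Shaker": 0.2654, "Thermos": 0.2271,
}
sdhop = {
    "Bakingpan": 0.1516, "Colander": 0.1249, "Cup": 0.1430,
    "Pitcher": 0.1131, "Saucepan": 0.1103, "Scissors": 0.1649,
    "Shaker": 0.1453, "Thermos": 0.1285,
}

# Mean SD-HOP segmentation error over the 8 objects.
mean_sdhop = sum(sdhop.values()) / len(sdhop)

# Mean of the per-object relative reductions with respect to Gao et al.
mean_reduction = sum(1.0 - sdhop[k] / gao[k] for k in gao) / len(gao)

print(f"mean SD-HOP error: {100 * mean_sdhop:.2f}%")
print(f"mean reduction:    {100 * mean_reduction:.2f}%")
```

Under this interpretation the two values agree, to two decimal places, with the 13.52% average error and the 42.44% reduction quoted in the text.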

3.2 Ablation Study

We conducted an ablation study on the ‘pitcher’ object of the CMU Kitchen Occlusion dataset to determine the individual effect of our contributions. Using the loss function from Gao et al. (2011) caused the segmentation error to increase from 0.1131 to 0.1547 and area under curve (AUC) of FPPI vs. recall to drop from 0.7877 to 0.7071. To discern the effect of 4-connected pairwise terms we removed the higher order terms from the model too. Using the pairwise terms as described in Gao et al. (2011) caused the segmentation error to increase from 0.1547 to 0.2499 and AUC to decrease from 0.7071 to 0.6414.

Lastly, to quantify the effect of higher order potentials, we compared the full SD-HOP model against one with higher order potentials removed. Removing higher order potentials caused the segmentation error to increase from 0.1131 to 0.1430 and AUC to drop from 0.7877 to 0.7544. We hypothesize that for small objects like the ones in the CMU Kitchen Occlusion dataset, 4-connected pairwise terms are almost as informative as higher order terms. To check this hypothesis we tested the effect of removing higher order potentials on a close-up dataset of 41 images of a pasta box occluded to varying degrees by various household objects. Removing the higher order potentials caused the segmentation error to increase from 0.1308 to 0.1516 and AUC to drop from 0.9546 to 0.9008. This indicates that higher order terms are more useful for objects with larger and hence more informative segments.

3.3 3D Pose Estimation

We collected 3D pose estimation results produced by IRLS and OR-IRLS on a dataset which has 17 images of a car door in an indoor environment. The ground truth pose for the car door was obtained using an ALVAR marker. Table 2 shows the mean errors in the six pose parameters. To separate errors inherent in the pose estimation process from the effect of occlusion reasoning, the pose of the car door was kept constant throughout the dataset, with various partial occlusions being introduced.

The granular HOG cell-level mask produced by SD-HOP caused some important silhouette edges to be missed for pose estimation. To solve this problem we utilized the unsupervised segmentation done earlier for defining higher order terms. If more than 80% of the area within a segment was marked 1, we marked the whole segment with 1. Since segments follow object boundaries, this produced much cleaner masks for pose estimation. Figure 4 shows the masks and pose estimation results for an example image from the dataset, with more such examples presented in the supplementary material. Note that the segmentation errors mentioned in Table 1 use the raw masks.
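The 80% rule can be sketched in a few lines. The sketch assumes the cell-level mask has already been upsampled to pixel resolution and that the segmentation is given as a grid of integer segment ids of the same size; the function name and the toy data are illustrative.

```python
def refine_mask(mask, segments, threshold=0.8):
    """Fill every segment whose visible fraction exceeds `threshold`.

    mask:     grid of 0/1 pixel labels.
    segments: grid of segment ids with the same shape as `mask`.
    Pixels in other segments keep their original label.
    """
    totals, visible = {}, {}
    for mask_row, seg_row in zip(mask, segments):
        for m, s in zip(mask_row, seg_row):
            totals[s] = totals.get(s, 0) + 1
            visible[s] = visible.get(s, 0) + m
    full = {s for s in totals if visible[s] / totals[s] > threshold}
    return [[1 if s in full else m for m, s in zip(mask_row, seg_row)]
            for mask_row, seg_row in zip(mask, segments)]


segments = [[0, 0, 0, 1, 1],
            [0, 0, 0, 1, 1]]
mask = [[1, 1, 1, 1, 0],
        [1, 1, 0, 0, 0]]

refined = refine_mask(mask, segments)
```

In this toy example segment 0 is 5/6 visible and is filled completely, while segment 1 is only 1/4 visible and is left untouched.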

Figure 4: 3D pose estimation. Left to right: Pose estimation with IRLS, SD-HOP raw segmentation mask, SD-HOP refined segmentation mask, Pose estimation with OR-IRLS. Best viewed in colour.
Figure 5: Object localization and segmentation results on the CMU Kitchen Occlusion dataset. Left: Image, Center: Raw mask from SD-HOP, Right: Refined mask from SD-HOP

4 Conclusion

We presented an algorithm (SD-HOP) that localizes partially occluded objects robustly and segments their visible regions accurately. In contrast to previous approaches that model occlusion, our algorithm uses higher order potentials to reason at the level of image segments and employs a loss function that targets both localization and segmentation performance. We demonstrated that our algorithm outperforms existing approaches on both tasks, when evaluated on a challenging dataset. Finally, we have shown that the segmentation output from SD-HOP can be used to improve pose estimation performance in the presence of occlusion. Avenues of future research include (1) training from weakly labelled data i.e. without segmentations, (2) a post-training algorithm to make object models comparable without having to train them together, and (3) using the occlusion information to reason about interactions between objects in scene understanding applications.

We would like to acknowledge Ana Huamán Quispe’s help with implementing this system on a bimanual robot. The system was used to enable the robot to pick up partially visible objects lying on a table.