Segmenting an image into multiple regions has long been considered a plausible precursor of many high-level visual recognition routines. Indeed, if plausible image regions could be extracted so that they at least partly overlap the projections of visible surfaces in the scene, it would be conceivable that such interpretations could later be lifted to high-level scene percepts by invoking part-based object models and scene consistency rules. This has motivated research into (hierarchical) multipart image segmentations, for which many excellent methods are available [1, 2, 3, 4]. But finding good multipart image segmentations in one step has proven difficult, partly due to the inherently local nature of the grouping process. The competition constraints implicit in various methods make it difficult to integrate scene constraints and mid-level grouping into early computations, and can influence results in ways that do not always correlate with scene properties. Learning segmentation models has also been problematic, partly because of insufficient support for reliable feature extraction and because inference, the inner core of learning, is usually very expensive.
The alternative computational framework we pursue assembles multipart image interpretations by tiling multiple figure-ground image segment hypotheses using mid-level scene constraints. The problem of hypothesis selection and consistent (full) image segmentation is formulated as optimization over sets of maximal cliques, sampled from a graph that connects non-overlapping image segments. By designing and learning clique potentials that encode both intrinsic, unary Gestalt segment properties and pairwise spatial compatibilities that account for plausible configurations of neighboring, spatially non-overlapping segments, we are able to eliminate many implausible image segments and tilings that cannot possibly arise from the projection of surfaces in typical, structured 3d scenes. We show that such a strategy achieves the state of the art in benchmarks like Berkeley and VOC2009.
1.1 Related work
Multipart segmentations are usually computed multiple times, to increase the probability that some of the retrieved segments capture full objects, or their significant parts in images. Another methodology to obtain multiple segmentations is to aggregate in a hierarchy, two well-known examples being multigrid methods and the Ultrametric Contour Maps. The latter achieved state-of-the-art results in a number of challenging segmentation datasets. These algorithms partition the image into a number of regions by using pairwise pixel dependencies. Direct learning is usually targeted at finding the parameters of local affinities [4, 6]. Other techniques work at coarser scales by optimizing over superpixels. This allows features to be computed over a larger spatial support. Ren and Malik learn a classification model to combine superpixels based on their Gestalt properties. Hoiem et al. proposed a model that reasons jointly over scene geometry and occlusion boundaries, progressively merging superpixels so as to maximize the likelihood of a qualitative 3d scene interpretation. Our goal is instead complementary: a set of consistent full image segmentation hypotheses, computed based on mid-level Gestalt cues and implicit 3d constraints.
While multipart image segmentation algorithms are most commonly used, a number of figure-ground methods have been pursued recently. Bagon et al. proposed an algorithm that generates figure-ground segmentations by maximizing a self-similarity criterion around a user-selected image point. Malisiewicz and Efros showed that good object-level segments could be obtained by merging pairs and triplets of segments from multipart segmentations, but at the expense of also generating a large quantity of implausible ones. Carreira and Sminchisescu generate a compact set of segments using parametric minimum cuts and learn to score them using region and Gestalt-based features. These algorithms were shown to be quite successful in extracting full object segments, suggesting that a promising research direction is to develop methods that combine multiple figure-ground segmentations (or just segments obtained at multiple scales, potentially from different methods) into plausible full image segmentations. Still missing is a formal multiple-hypothesis computational framework for consistent selection (tiling) and learning, which we pursue here. Providing a compact set of multiple hypotheses rather than a single answer is desirable for learning, for high-level, informed processing, and for graceful performance degradation.
In sec. 2 we present our maximal clique formulation, including both the search procedure and the parameterization of the clique potentials. Sec. 3 describes our ranking-based learning framework, which alternates between sampling new tilings (a discrete optimization method) and optimizing the parameters of our clique potentials (a continuous problem) against the test error measure, here the full image segmentation quality. Sec. 4 discusses our segment-level, mid-level unary and pairwise terms based on Gestalt measures and the statistics of projected boundaries of 3d surfaces, including T-junctions and extremal edges. We show inference and learning statistics as well as experiments on the Berkeley and PASCAL VOC 2009 segmentation datasets in sec. 5. We conclude with ideas for future work in sec. 6.
2 Image tiling as sampling maximal cliques
Given a set of segments S, our aim is to generate several tilings such that no two segments in a tiling overlap and each tiling has a high score. Consider for that a graph G, called the consistency graph, whose vertices are the segments in S. Two vertices are connected by an edge if the corresponding segments do not overlap. (While disallowing overlap increases the exposure to imperfect boundary alignments between the available segments, it leads to a dramatic reduction in the solution space and doesn't require additional processing to assign pixels lying on the intersection of overlapping segments.) A clique of G, which is a fully connected subgraph of G, corresponds to a set of segments that can form a tiling. A clique is called maximal (also known as inclusion-maximal) if it is not included in any other clique, hence a larger clique cannot be obtained by adding vertices to it. In our case a maximal clique corresponds to a tiling that cannot be extended using any other segment in S. A maximum clique of a graph is a clique with the largest number of vertices. A maximum weighted clique is a clique that maximizes the sum of weights associated with its vertices.
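The consistency-graph construction above can be sketched in a few lines, assuming each segment is represented as a set of pixel coordinates (a hypothetical encoding; the paper does not prescribe a particular data structure):

```python
def consistency_graph(segments):
    """Adjacency matrix of the consistency graph: vertices i and j are
    connected iff the corresponding segments share no pixels."""
    n = len(segments)
    adj = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if segments[i].isdisjoint(segments[j]):  # non-overlapping
                adj[i][j] = adj[j][i] = True
    return adj

def is_clique(adj, nodes):
    """A clique (all pairs connected) corresponds to a set of mutually
    non-overlapping segments, i.e. a candidate tiling."""
    return all(adj[i][j] for i in nodes for j in nodes if i < j)
```

With three toy segments where the third overlaps the first two, `is_clique` accepts only the pair of disjoint segments, mirroring the tiling semantics described above.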
We formulate the search for tilings as finding maximal cliques $C$ with high potential

$F_{\mathbf{w}}(C) = \sum_{s_i \in C} \mathbf{w}_u^\top \phi_u(s_i) + \sum_{(s_i, s_j) \in \mathcal{N}(C)} \mathbf{w}_p^\top \phi_p(s_i, s_j)$  (1)

where $\phi_u(s_i)$ and $\phi_p(s_i, s_j)$ are feature vectors extracted for, respectively, segment $s_i$ and pairs of image neighbors $(s_i, s_j)$ (the set of neighboring pairs in $C$ denoted $\mathcal{N}(C)$), and $\mathbf{w} = (\mathbf{w}_u, \mathbf{w}_p)$ are the corresponding weights, learned as described in Section 3.
The problem of finding the maximum (weighted) clique of a general graph is known to be both NP-complete and hard to approximate within a given bound. Existing algorithms produce one single solution which equals or approximates the maximum clique. In the weighted case, maximization is done only over unary terms associated with vertices. This is different from our case: we desire multiple tilings for each image, and the potential of a clique (tiling) depends on both unary and pairwise terms. Enumerating all cliques to find the optimum is not feasible, as we deal with many vertices (over 150) and the number of cliques of size k in a graph with n vertices grows as O(n^k). Finding a maximal clique can be done in linear time in the number of vertices, by starting with one vertex and adding each of the other vertices in some order. But graphs that have a large maximum clique can have maximal cliques of arbitrarily small size. To obtain multiple estimates we follow a two-step greedy approach: (i) starting with each vertex, generate a maximal clique; (ii) refine each solution using a local search in the space of maximal cliques, based on the trained cost function. We generate up to one tiling per seed segment, ranked in decreasing order of their potential. Notice that our approach is based on established strategies to find approximations of the maximum clique (step 1 is known as a sequential greedy heuristic and step 2 as a local search heuristic). Algorithm 1 describes the proposed method.
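The sequential greedy step (i) can be sketched as follows; the function names and the simple score-sorted seed order are illustrative stand-ins for the paper's Algorithm 1, not the authors' implementation:

```python
def greedy_maximal_clique(seed, order, adj):
    """Start from `seed` and add vertices in the given order whenever
    they remain compatible (non-overlapping) with everything selected.
    The result cannot be extended further, i.e. it is a maximal clique."""
    clique = [seed]
    for v in order:
        if v != seed and all(adj[v][u] for u in clique):
            clique.append(v)
    return clique

def fg_tiling_step1(unary, adj):
    """One maximal clique (tiling) per seed vertex, with seeds and
    candidates visited in decreasing order of unary score."""
    order = sorted(range(len(unary)), key=lambda v: -unary[v])
    return [greedy_maximal_clique(seed, order, adj) for seed in order]
```

Each returned clique is maximal with respect to the pool, matching the property that a tiling cannot be extended by any remaining segment; step (ii), the local search, would then swap segments in and out to improve the trained potential.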
The size of the largest clique that can be formed with a certain vertex is bounded by the degree d of that vertex. If a set is kept containing the segments in S which do not overlap any segment in the current clique, the complexity of step 1 is O(n·d) for n segments: at most n steps are needed to build a clique from the list of sorted segments, and d is an upper bound for the loop in step 1 and the verification inside.
Step 2 can be executed in O(T·n·d) time, where T is the maximum number of iterations allowed (a small fixed value in our experiments). The inner loop over candidate segments is bounded by d, as they must not overlap the current clique. Rejecting segments in the clique that overlap a newly considered segment is also bounded by d, as all segments previously in the clique are mutually non-overlapping. Finally, extending the result to a maximal clique has the same complexity as step 1, namely O(n·d).
Ordering the segments is done only once; the complexity of running FG-Tiling for all segments is therefore dominated in the worst case by the step 2 term when the iteration limit is fixed. In practice our Matlab implementation takes on average 20 seconds per image on the BSDS test set.
3 Learning mid-level vision
Assume we are given sets of features computed, respectively, for segments and for pairs of segments which are neighbors in the image, i.e. share a common boundary and do not overlap. We search for the weights such that the ranking of tilings induced by the potential (eq. 1) is as close as possible to the ranking induced by the quality of the tilings with respect to the ground truth.
The learning process alternates between the discrete optimization of tilings, where it runs FG-Tiling with the existing parameters to create a new pool of tilings for each of the images in the training set, and a continuous parameter optimization step that finds parameters maximizing an objective function on the produced tilings, the same quality measure used for testing: the overlap with ground truth (Algorithm 2). Instead of aiming to place only the best tiling in the first position, which might be impossible, we design a scoring (with the best-only case as a special instance) that aims at ranking tilings in decreasing order of their quality. For an image I, weights w, and a pool of tilings in which t_r is the tiling at rank r when sorting in decreasing order of the value of the potential, the objective function is:
$\Phi(\mathbf{w}, I) = \sum_{r=1}^{K} w_r \, q(t_r)$

where q(t_r) is the quality of t_r measured using the ground truth, w_r is the weighting of rank r, and K is the rank parameter which determines the constraint we want to enforce (e.g. K=1 for only the best ranked tiling, K equal to the pool size for a full ordering). We define q as the average covering of the tiling with all ground truth segmentations. The covering is the sum of overlaps between each individual segment in a ground truth segmentation and the closest segment in a tiling, multiplied by the area of the ground truth segment; the overlap is the standard intersection-over-union measure between two segments. For rank weighting we use a decay over ranks. This decay is similar to the Discounted Cumulative Gain (DCG), which uses a logarithmic reduction factor of the form 1/log(1+r). DCG penalizes errors in the first ranks more aggressively; we found this to work slightly less well in our tests. (For an image segment graph, clique potentials can be used to define a constrained probability distribution over partitions: one can write a Gibbs distribution over cliques and learn using maximum likelihood, with partition functions approximated by summing only over the cliques computed by FG-Tiling; this approach will be presented in an upcoming technical report.) Here we choose a different loss that directly optimizes the overlap measure used during test time. Notice, however, our very different use of cliques compared to product expansions in graphical models. Along this path, modeling the nodes as binary variables in a random field would neither produce the semantics we need, nor necessarily lead to clique-consistent inference.
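The rank-weighted objective can be sketched as follows. The 1/r decay used here is a hypothetical instantiation chosen for illustration; the paper describes a decay in the same spirit as DCG's logarithmic factor, without this exact form:

```python
def ranking_objective(scores, qualities, K, weight=lambda r: 1.0 / r):
    """Quality-weighted objective over the top-K tilings as ranked by
    the model score. `scores` are model potentials, `qualities` the
    ground-truth coverings; `weight` is the (assumed) rank decay w_r."""
    # Rank tilings in decreasing order of model score.
    order = sorted(range(len(scores)), key=lambda t: -scores[t])
    # Accumulate decayed ground-truth quality over the first K ranks.
    return sum(weight(r) * qualities[order[r - 1]]
               for r in range(1, min(K, len(order)) + 1))
```

Maximizing this over the weights pushes high-quality tilings toward the top ranks; K=1 recovers the best-only scoring mentioned above.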
4 Mid-level image descriptors
Our model aims to generate full image tilings that have properties similar to those of ground truth segmentations produced by human annotators. We use both unary features inspired by Gestalt properties and pairwise features sensitive to the boundary statistics arising from projections of 3d surfaces. These features are computed once and do not change during learning and inference. All features are individually normalized to zero mean and unit standard deviation.
Unary Descriptors: As unary features, we primarily use the ones proposed in earlier work, which include the amount of contrast along the boundary of the segment (8 features), region properties such as position in the image, area and orientation (18 features), as well as Gestalt properties such as convexity and dissimilarity between the segment interior and the rest of the image in terms of intensity and texture (8 features).
We complemented these unary features with a novel set of responses quantifying center-surround dissimilarity. We define three image strips of increasing width around each segment. We compute how dissimilar each strip and the segment are according to different local features: hue, rgb, SIFT and textons. For each type of local feature and each strip, dissimilarity is determined as the chi-square distance between the histogram of quantized local features in the strip and in the segment, resulting in 12 features. The local features are sampled on a regular grid. The color histograms use patches of two widths, as do the SIFT patches. The textons are the ones used in globalPb, quantized into a fixed number of bins. We quantize the other features similarly, with the codebook being obtained in each image at test time by k-means.
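The strip-vs-segment comparison reduces to a chi-square distance between normalized histograms of quantized local features; a minimal sketch (the helper names are illustrative):

```python
from collections import Counter

def histogram(labels, n_bins):
    """Normalized histogram of quantized feature labels (codebook ids)."""
    counts = Counter(labels)
    total = max(len(labels), 1)
    return [counts.get(b, 0) / total for b in range(n_bins)]

def chi_square(h1, h2, eps=1e-10):
    """Chi-square distance between two normalized histograms, as used
    for the center-surround dissimilarity features."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h1, h2))
```

Identical distributions give distance 0 and fully disjoint ones approach 1, so each of the 12 features lies in a comparable range before the global normalization step.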
Pairwise Descriptors: We define a segment neighborhood between pairs of segments sharing a boundary and not overlapping. The occurrence of such pairs is usually non-accidental, particularly in our pool of figure-ground segmentations, because we don't consider the ground. Segments that are artifacts of the particular parameter and location constraints that generated them will tend to have few neighbors. This type of neighborhood can be computed robustly by growing all segments by a small amount (4 pixels in our implementation) and then detecting the pairs that overlap. The pairwise features capture the configuration of pairs of segments. We use two sets of pairwise features. The first encodes pairwise region properties such as relative area, position and orientation (18 features).
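The dilate-then-intersect neighborhood test can be sketched with plain pixel sets; the 4-connected growth below is a stdlib stand-in for morphological dilation, and the representation is again hypothetical:

```python
def grow(pixels, steps=4):
    """Dilate a segment (set of (row, col) pixels) by `steps` rounds of
    4-connected growth, approximating a morphological dilation."""
    region = set(pixels)
    for _ in range(steps):
        region |= {(r + dr, c + dc) for (r, c) in region
                   for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))}
    return region

def are_neighbors(seg_a, seg_b, steps=4):
    """Two non-overlapping segments are neighbors if their grown
    versions overlap, i.e. they share a boundary up to the tolerance."""
    if not seg_a.isdisjoint(seg_b):
        return False  # overlapping pairs are excluded by construction
    return not grow(seg_a, steps).isdisjoint(grow(seg_b, steps))
```

This tolerates small gaps from imperfect boundary alignment between independently generated segments, which is exactly why the growth step is needed.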
We also employ features which signal occlusion. In ground truth segmentations, neighboring segments often correspond to projections of objects at different depths, which results in distinctive image statistics. These are sufficiently informative even for determining which of the two neighboring regions corresponds to the occluding surface in 3d space, the so-called figure-ground assignment problem. The occluding segment usually has a higher convexity coefficient and is often surrounded by the occluded segment. Given the unary convexity features of the two segments, the relative convexity feature is implemented as their difference. Let the length of the adjacent boundary between two segments be l, and the segment perimeters be p1 and p2; surroundedness is then defined from the fractions of the perimeters covered by the adjacent boundary. Other important occlusion features are t-junctions, boundary patterns shaped as a T, usually caused by the intersection of the boundaries of two objects in an occlusion relationship. Typically the location of the leg of the T indicates which segment is occluding the other. T-junctions were used in recent approaches to figure-ground assignment as an energy term for triplets of regions in CRFs. Here we model them directly as a pairwise segment compatibility feature, by measuring the consistency with which the leg of the t-junctions belongs to the same segment, weighted by the quality of the fit of each junction to a T, as opposed to being Y-like; the weighted votes are summed over all junctions between the pair of segments. The weighting depends on the angle formed by the leg of the junction with the base. When the leg of the junction is on the boundary separating both segments, or the leg is not on the boundary of either segment, the weight is set to zero. Junctions are hard to detect when considering pixel intensities locally, even for humans. But given a pair of neighboring segments this can be done robustly, as illustrated in fig. 1.
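A sketch of the two simplest occlusion cues, under explicitly assumed formulas (the exact expressions were lost in extraction; the signed convexity difference and perimeter-fraction reading below are plausible instantiations, not the paper's definitions):

```python
def relative_convexity(conv_a, conv_b):
    """Assumed form: signed difference of the unary convexity
    coefficients; the occluding segment typically scores higher."""
    return conv_a - conv_b

def surroundedness(shared_len, perim_a, perim_b):
    """Assumed form: the larger fraction of either perimeter covered by
    the shared boundary l; a surrounded (occluding) segment has most of
    its perimeter adjacent to its neighbor."""
    return max(shared_len / perim_a, shared_len / perim_b)
```

A segment whose perimeter is half covered by the shared boundary scores 0.5 regardless of how large its neighbor is, capturing the asymmetry the cue is after.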
The shading along region borders was shown to provide information about occlusion in both computational and psychophysical tests, under the name of extremal edges. The phenomenon is explained by the illumination gradient tending to be orthogonal to the boundary on the occluding side. We implement a gradient-orthogonality feature following this work, and produce the compatibility feature as the absolute difference of the orthogonality responses of the two segments. The absolute value is computed because we're not interested here in determining which segment is in front, just in having an occlusion indicator.
5 Experiments
Our inference and learning methods were tested on the Berkeley Segmentation Dataset (BSDS) and on the PASCAL VOC 2009 Segmentation Dataset (VOC2009). For comparison we show results of the Oriented Watershed Transform / Ultrametric Contour Maps using globalPb as contour detector (gPb-owt-ucm).
We generate a pool of segments using the publicly available implementation of Constrained Parametric Min-Cuts (CPMC) , which produces nested sets of segments around rectangular seeds on a regular grid with predicted qualities for each segment. Per image an average of 194 segments is generated for the BSDS test set and 156 segments for the VOC2009 validation set. This algorithm was recently shown to produce compact sets of segments that accurately cover ground truth objects.
Fig. 2 shows the evaluation of FG-Tiling and two baselines, Enum-1min and Constrained-random, on the BSDS dataset. All methods produce maximal cliques, i.e. tilings whose segments do not overlap and which cannot be extended using the current pool of segments. For each method the produced tilings are ranked using the scoring function in eq. 1.
Enum-1min is an algorithm that recursively and exhaustively enumerates maximal cliques until a budget of 1 minute per image is reached, and returns the highest scoring cliques found. (The 1 minute given to Enum-1min is 3 times the average running time of FG-Tiling on the BSDS test set. Without the time constraint, the algorithm had not finished enumerating cliques after 48 hours on a test image where a pool of figure-ground segmentations had been used.) Similar to line 1 of FG-Tiling, Enum-1min first sorts the segments by their unary score. During enumeration, it quickly finds one tiling similar to the result of step 1 in FG-Tiling. However, within 1 minute, it produces only small variations of the same tiling, as seen also in fig. 2, right. Constrained-random is similar to step 1 in Algorithm 1, with the difference that the order of the segments is randomized. The method gets a few "lucky shots", which explains the quite high values in the plot in fig. 2, left, but overall the average quality of the produced tilings is much lower than for the other two methods (23% less than FG-Tiling on the test set of BSDS). FG-Tiling balances the diversity and quality of the produced tilings to give the best results of all methods.
During learning, for the initial run of FG-Tiling we set the weights corresponding to the pairwise terms to zero. The weights corresponding to the unary terms are set using linear regression, such that the unary score of each segment approximates its overlap with the ground truth segments for the image. Parameter optimization is done using a Quasi-Newton method. During this step, the sum of the objective over all images in the training set and their corresponding pools of tilings is maximized. The first time this step is executed, the initial weight estimates required to initialize the search are obtained using linear regression over all tilings produced for the training set, with the quality of each tiling as regression target. The inner loop (line 6) needs on average 15 iterations to converge. The outer loop (lines 5–10) saturates after a few iterations (3–4), and both the quality of the first ranked tiling and the highest quality over all tilings for each image are maximized.
Fig. 3 shows the progress of learning on the Berkeley Segmentation Dataset (BSDS) and a comparison of the results without learning and with learning at two settings of the rank parameter K. We observe that, compared to K=1, using a full ordering produces a slightly better ranking also in the first position, presumably due to the additional constraints from lower ranks.
Table 1 shows benchmark results on the test set of BSDS and on the validation set of VOC2009. The values represent average covering scores of ground truth segmentations by the output segmentations. BIS measures the best covering of the ground truth segmentations by individual segments from any segmentation produced by the evaluated method. OIS and ODS have been used to evaluate the results of gPb-owt-ucm. They were introduced in the context of hierarchical segmentation, where scale is used to navigate from coarser to finer segmentations. The optimal image scale (OIS) measures for each image the quality of the produced segmentation that best covers the ground truth. The optimal dataset scale (ODS) measures the quality of the segmentations when the same scale is selected for all images; the scale to be evaluated is chosen to maximize the score on the test set. "First" evaluates the results using the predicted best segmentation for each image. "First" is only applicable to our method, since the segmentations from gPb-owt-ucm don't have associated scores to select a single segmentation. ODS is not applicable to our method, as FG-Tiling generates independent segmentations. Note that "First" does not use any ground truth information to select the tiling to be evaluated for each image.
The BSDS dataset has multiple ground truth (human) segmentations for each image. To evaluate the quality of a segmentation, the average over all ground truth segmentations for that image is considered. As the provided human segmentations differ, the upper bound for OIS, "First", and ODS on the BSDS test set is 0.73. A score of 1.00 for BIS could be obtained by generating segments that perfectly cover all ground truth segments.
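The covering score underlying these metrics can be sketched directly from its definition above (area-weighted best intersection-over-union per ground-truth region); segments are again represented as pixel sets for illustration:

```python
def overlap(a, b):
    """Standard intersection-over-union between two pixel sets."""
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return inter / union if union else 0.0

def covering(ground_truth, tiling):
    """Area-weighted covering of a ground-truth segmentation by a
    tiling: each ground-truth region contributes its best overlap with
    any tiling segment, weighted by the region's area."""
    total = sum(len(g) for g in ground_truth)
    return sum(len(g) * max(overlap(g, t) for t in tiling)
               for g in ground_truth) / total
```

Averaging this over all human segmentations of an image yields the per-image quality q used throughout; the 0.73 upper bound reflects the disagreement among the human segmentations themselves.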
The results obtained by FG-Tiling are competitive on BSDS and superior on VOC2009. Note that the given VOC2009 scores do not use the "segmentation challenge" evaluation, which requires recognition, but instead evaluate the quality of unlabeled segmentations, like the method we compare with. The results of gPb-owt-ucm on VOC2009 were computed by us using the code provided by the authors and are consistent with their published results on VOC2008.
6 Conclusions
We have proposed a mid-level computational learning and inference framework for image segmentation that tiles multiple figure-ground hypotheses into a complete interpretation. The inference problem is formulated as searching for high-scoring maximal cliques in a graph connecting non-overlapping putative figure-ground hypotheses. Clique potentials are based on both intrinsic Gestalt segment quality and compatibilities among neighboring image segments, as derived from statistics of 3d scene boundaries. Learning is formulated as optimizing the ranking of the best-K hypotheses, directly on the testing error, measuring the overlap between image tilings and the ground truth human annotations. We have empirically analyzed the performance of our learning and inference components and have shown that they achieve state-of-the-art results in the Berkeley and VOC2009 segmentation benchmarks. In the latter, the proposed method improves on the state of the art by 28% when considering the full set of generated tilings, and by 16% for the predicted best tiling. In future work we plan to combine segmentation and partial recognition in order to be able to interpret images that contain both familiar and unknown objects.
-  J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
-  D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
-  P. Felzenszwalb and D. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.
-  P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. From contours to regions: An empirical evaluation. In IEEE International Conference on Computer Vision and Pattern Recognition, 2009.
-  E. Sharon, M. Galun, D. Sharon, R. Basri, and A. Brandt. Hierarchy and adaptivity in segmenting visual scenes. Nature, 442(7104):719–846, 2006.
-  T. Cour, N. Gogin, and J. Shi. Learning spectral graph segmentation. In International Conference on Artificial Intelligence and Statistics, 2005.
-  X. Ren and J. Malik. Learning a classification model for segmentation. IEEE International Conference on Computer Vision, 2003.
-  D. Hoiem, A. Efros, and M. Hebert. Recovering surface layout from an image. International Journal of Computer Vision, 75(1):151–172, 2007.
-  S. Bagon, O. Boiman, and M. Irani. What is a good image segment? a unified approach to segment extraction. In European Conference on Computer Vision, 2008.
-  T. Malisiewicz and A. Efros. Improving spatial support for objects via multiple segmentations. In British Machine Vision Conference, 2007.
-  J. Carreira and C. Sminchisescu. Constrained Parametric Min-Cuts for Automatic Object Segmentation. In IEEE International Conference on Computer Vision and Pattern Recognition, 2010.
-  I. Bomze, M. Budinich, P. Pardalos, and M. Pelillo. The maximum clique problem. In Handbook of Combinatorial Optimization, pages 1–74. Kluwer Academic Publishers, 1999.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2009 (VOC2009) Results. http://www.pascal-network.org/challenges/VOC/voc2009/workshop/index.html.
-  K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002.
-  X. Ren, C. Fowlkes, and J. Malik. Figure/ground assignment in natural images. In European Conference on Computer Vision, 2006.
-  I. Leichter and M. Lindenbaum. Boundary ownership by lifting to 2.1d. In IEEE International Conference on Computer Vision, 2009.
-  D. Hoiem, A. Stein, A. A. Efros, and M. Hebert. Recovering occlusion boundaries from a single image. In IEEE International Conference on Computer Vision, 2007.
-  J. McDermott. Psychophysics with junctions in real images. Journal of Vision, 2(7):131–131, November 2002.
-  P. Huggins, H. Chen, P. Belhumeur, and S. Zucker. Finding folds: On the appearance and identification of occlusion. In IEEE International Conference on Computer Vision and Pattern Recognition, 2001.
-  T. Ghose and S. Palmer. Surface convexity and extremal edges in depth and figure-ground perception. Journal of Vision, 5(8):970–970, September 2005.
-  D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In IEEE International Conference on Computer Vision, 2001.