ae-multigrid
Progressive Multigrid Eigensolver for Multiscale Angular Embedding Problems
Spectral embedding provides a framework for solving perceptual organization problems, including image segmentation and figure/ground organization. From an affinity matrix describing pairwise relationships between pixels, it clusters pixels into regions, and, using a complex-valued extension, orders pixels according to layer. We train a convolutional neural network (CNN) to directly predict the pairwise relationships that define this affinity matrix. Spectral embedding then resolves these predictions into a globally-consistent segmentation and figure/ground organization of the scene. Experiments demonstrate significant benefit to this direct coupling compared to prior works which use explicit intermediate stages, such as edge detection, on the pathway from image to affinities. Our results suggest spectral embedding as a powerful alternative to the conditional random field (CRF)-based globalization schemes typically coupled to deep neural networks.
Systems for perceptual organization of scenes are commonly architected around a pipeline of intermediate stages. For example, image segmentation follows from edge detection [12, 1, 2, 7, 4]; figure/ground, occlusion, or depth layering follows from reasoning over discrete contours or regions [27, 16, 21, 36, 18] with some systems also reliant on motion cues [30, 15, 32, 31]. This trend holds even in light of rapid advancements from designs centered on convolutional neural networks (CNNs). Rather than directly focus on image segmentation, recent CNN architectures [14, 3, 28, 4] target edge detection. Turaga et al. [33] make the connection between affinity learning and segmentation, yet restrict affinities to be precisely local edge strengths. Pure CNN approaches for depth from a single image do focus on directly constructing the desired output [9, 8]. However, these works do not address the problem of perceptual grouping without fixed semantic classes.
We engineer a system for simultaneous segmentation and figure/ground organization by directly connecting a CNN to an inference algorithm which produces a globally consistent scene interpretation. Training the CNN with a target appropriate for the inference procedure eliminates the need for hand-designed intermediate stages such as edge detection. Our strategy parallels recent work connecting CNNs and conditional random fields (CRFs) for semantic segmentation [6, 20, 35]. A crucial difference, however, is that we handle the generic, or class independent, image partitioning problem. In this context, spectral embedding, and specifically Angular Embedding (AE) [37, 38], is a more natural inference algorithm. Figure 1 illustrates our architecture.
Angular Embedding, an extension of the spectral relaxation of Normalized Cuts [29] to complex-valued affinities, provides a mathematical framework for solving joint grouping and ranking problems. Previous works established this framework as a basis for segmentation and figure/ground organization [22] as well as object-part grouping and segmentation [24]. We follow the spirit of [22], but employ major changes to achieve high-quality figure/ground results:
We reformulate segmentation and figure/ground layering in terms of an energy model with pairwise forces between pixels. Pixels either bind together (group) or differentially repel (layer separation), with strength of interaction modulated by confidence in the prediction.
We train a CNN to directly predict all data-dependent terms in the model.
We predict interactions across multiple distance scales and use an efficient solver [23] for spectral embedding.
Our new energy model replaces the ad-hoc boundary-centric interactions employed by [22]. Our CNN replaces hand-designed features. Together they facilitate learning of pairwise interactions across a regular stencil pattern. Choosing a sparse stencil pattern, yet including both short- and long-range connections, allows us to incorporate multiscale cues while remaining computationally efficient.
Section 2 develops our model for segmentation and figure/ground while providing the necessary background on Angular Embedding. Section 3 details the structure of our CNN for predicting pairwise interaction terms in the model.
As our model is fully learned, it could be trained according to different notions of segmentation and figure/ground. For example, consistent definitions for figure/ground include true depth ordering as in [9], object class-specific foreground/background separation as in [24], and boundary ownership or occlusion as in [27, 13, 22]. We focus on the latter and define segmentation as a region partition and figure/ground as an ordering of regions by occlusion layer. The Berkeley segmentation dataset (BSDS) provides ground-truth annotation of this form [25, 13]. We demonstrate segmentation results competitive with the state-of-the-art on the BSDS benchmark [11], while simultaneously generating high-quality figure/ground output.
The occlusion layering interpretation of figure/ground is the one most likely to be portable across datasets; it corresponds to a mid-level perceptual task. We find this to be precisely the case for our learned model. Trained on BSDS, it generates quite reasonable output when tested on other image sources, including the PASCAL VOC dataset [10]. We believe this to be a significant advance in fully automatic perceptual organization. Section 4 presents experimental results across all datasets, while Section 5 concludes.
We abstract the figure/ground problem to that of assigning each pixel p a rank θ(p), such that θ(·) orders pixels by occlusion layer. Assume we are given estimates Θ(p,q) of the relative order between many pairs of pixels p and q. The task is then to find θ(·) that agrees as best as possible with these pairwise estimates. Angular Embedding [38] addresses this optimization problem by minimizing the error ε:

    ε = Σ_p (D(p,p) / Σ_q D(q,q)) · |z(p) − z̃(p)|²    (1)

where D(p,p) accounts for possibly differing confidences in the pairwise estimates and θ(p) is replaced by z(p) = e^{iθ(p)}. As Figure 2 shows, this mathematical convenience permits interpretation of z(·) as an embedding into the complex plane, with the desired ordering corresponding to absolute angle. z̃(p) is defined as the consensus embedding location for p according to its neighbors q and the estimates Θ:

    z̃(p) = Σ_q C̃(p,q) · e^{iΘ(p,q)} · z(q)    (2)

    C̃(p,q) = C(p,q) / Σ_q C(p,q)    (3)
Relaxing the unit norm constraint on z(·) yields a generalized eigenproblem:

    W z = λ D z    (4)

with D and W defined in terms of C and Θ by:

    D = Diag(C · 1_n)    (5)

    W = C ∘ e^{iΘ}    (6)

where n is the number of pixels, 1_n is a column vector of ones, Diag(·) is a matrix with its vector argument on the main diagonal, and ∘ denotes the matrix Hadamard (elementwise) product.

For Θ everywhere zero, this eigenproblem is identical to the spectral relaxation of Normalized Cuts [29], in which the second and higher eigenvectors encode grouping [29, 2]. With nonzero entries in Θ, the first of the now complex-valued eigenvectors is nontrivial and its angle encodes rank ordering, while the subsequent eigenvectors still encode grouping [22]. We use the same decoding procedure as [22] to read off this information.
Specifically, given the m eigenvectors z₀, z₁, …, z_{m−1} and corresponding eigenvalues λ₀ ≤ λ₁ ≤ … ≤ λ_{m−1} solving Equation 4, the angle of z₀ recovers figure/ground ordering. Treating the eigenvectors as an embedding of pixels into ℂ^m, distance in this embedding space reveals perceptual grouping. We follow [2, 22] to recover both boundaries and segmentation from the embedding by taking the (spatial) gradient of the eigenvectors and applying the watershed transform. This is equivalent to a form of agglomerative clustering in the embedding space, with merging constrained to be between neighbors in the image domain.

A remaining issue, solved by [24], is to avoid circular wrap-around in angular span by guaranteeing that the solution fits within a wedge of the complex plane. It suffices to rescale Θ prior to embedding.
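As a concrete illustration of the eigenproblem and angle decoding, the following NumPy sketch solves a dense toy instance with perfectly consistent pairwise estimates. This is not the paper's implementation (the authors use the sparse multigrid solver of [23]); the helper name and toy data are illustrative only.

```python
import numpy as np

def angular_embedding(C, Theta):
    """Solve W z = lambda D z for the leading eigenvector and return
    per-pixel angles relative to pixel 0.  C holds confidences, Theta
    pairwise relative-order estimates.  Dense toy version."""
    W = C * np.exp(1j * Theta)            # Eq. (6): Hadamard product
    d = C.sum(axis=1)                     # Eq. (5): D = Diag(C 1_n)
    P = W / d[:, None]                    # D^{-1} W; eigenvalue magnitudes <= 1
    vals, vecs = np.linalg.eig(P)
    z = vecs[:, np.argmax(np.abs(vals))]  # leading (largest-magnitude) eigenvector
    return np.angle(z / z[0])             # ordering = relative angle

# Toy problem: 4 "pixels" with true ranks theta, full confidence everywhere,
# and perfectly consistent estimates Theta(p,q) = theta(p) - theta(q).
theta = np.array([0.0, 0.3, 0.6, 0.9])
Theta = theta[:, None] - theta[None, :]
C = np.ones((4, 4))
print(angular_embedding(C, Theta))        # recovers [0.0, 0.3, 0.6, 0.9]
```

With consistent estimates, z(p) = e^{iθ(p)} is exactly an eigenvector with eigenvalue 1, so the decoded angles reproduce the true ranks; with noisy estimates the eigenvector provides a least-squares-style compromise.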
Having chosen Angular Embedding as our inference procedure, it remains to define the pairwise pixel relationships C and Θ. In the special case of Normalized Cuts, C(p,q) represents a clustering affinity, or confidence in zero separation (in both clustering and figure/ground). For the more general case, we must also predict non-zero figure/ground separation values and assign them confidences.
Let us develop the model in terms of probabilities:

    b(p,q) = Pr[seg(p) ≠ seg(q)]    (7)

    1 − b(p,q) = Pr[seg(p) = seg(q)]    (8)

    f(p,q) = Pr[fig(q,p) | seg(p) ≠ seg(q)]    (9)

    g(p,q) = Pr[fig(p,q) | seg(p) ≠ seg(q)]    (10)

where seg(p) is the region (segment) containing pixel p and fig(q,p) means that q is figure with respect to p, according to the true segmentation and figure/ground ordering. b(p,q) is the probability that some boundary separates p and q. f(p,q) and g(p,q) are the conditional probabilities of figure and ground, respectively. Note f(p,q) + g(p,q) = 1.
There are three possible transitions between p and q: none (same region), ground → figure, and figure → ground. Selecting the most likely, the probabilities of erroneously binding p and q into the same region, transitioning to figure, or transitioning to ground are respectively:

    E_b(p,q) = b(p,q)    (11)

    E_f(p,q) = 1 − b*(p,q) · f(p,q)    (12)

    E_g(p,q) = 1 − b*(p,q) · g(p,q)    (13)
where b*(p,q) is the probability that there is a boundary between p and q, but that neither p nor q themselves lie on any boundary. Figure/ground repulsion forces act long-range and across boundaries. We convert error probabilities to confidences via exponential reweighting:

    C_b(p,q) = e^{−E_b(p,q)/σ_b}    (14)

    C_f(p,q) = e^{−E_f(p,q)/σ_f}    (15)

    C_g(p,q) = e^{−E_g(p,q)/σ_f}    (16)
where σ_b and σ_f control scaling. Using a fixed angle θ₀ for the rotational action of figure/ground transitions (Θ = ±θ₀), we obtain complex-valued affinities:

    W_b(p,q) = C_b(p,q)    (17)

    W_f(p,q) = C_f(p,q) · e^{iθ₀}    (18)

    W_g(p,q) = C_g(p,q) · e^{−iθ₀}    (19)
Figure 3 illustrates how W_b (shown in green), W_f (red), and W_g (blue) cover the base cases in the space of pairwise grouping relationships. Combining them into a single energy model (generalized affinity) spans the entire space:

    W(p,q) = W_b(p,q) + W_f(p,q) + W_g(p,q)    (20)
One can regard W(p,q) as a sum of binding, figure transition, and ground transition forces acting between p and q. Figure 4 plots the configuration space of W in terms of b(p,q) and f(p,q). As the areas of this space in which each component force is strong do not overlap, W behaves in distinct modes, with a smooth transition between them through the area of weak affinity near the origin.
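The error, confidence, and affinity terms for a single pixel pair can be sketched as follows. The values of the scaling parameters and the rotation angle here (sigma_b, sigma_f, theta0) are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def generalized_affinity(b, b_star, f, sigma_b=0.25, sigma_f=0.25,
                         theta0=np.pi / 4):
    """Sketch of Eqs. (11)-(20) for one pixel pair.  b: boundary
    probability, b_star: probability of a boundary strictly between the
    pixels, f: conditional probability of figure.  The parameter values
    are placeholders for illustration."""
    g = 1.0 - f                        # Eq. (10): figure and ground sum to 1
    E_b = b                            # error of binding across a boundary
    E_f = 1.0 - b_star * f             # error of a figure transition
    E_g = 1.0 - b_star * g             # error of a ground transition
    C_b = np.exp(-E_b / sigma_b)       # confidence reweighting, Eqs. (14)-(16)
    C_f = np.exp(-E_f / sigma_f)
    C_g = np.exp(-E_g / sigma_f)
    # Eqs. (17)-(20): binding is real; figure/ground rotate by +/- theta0.
    return C_b + C_f * np.exp(1j * theta0) + C_g * np.exp(-1j * theta0)

W_bind = generalized_affinity(b=0.0, b_star=0.0, f=0.5)  # same region
W_fig  = generalized_affinity(b=1.0, b_star=1.0, f=1.0)  # q figure w.r.t. p
print(np.angle(W_bind), np.angle(W_fig))  # ~0 vs. a positive rotation
```

The two evaluations probe the distinct modes: a confident same-region pair yields a large, nearly real affinity (binding), while a confident figure transition yields an affinity rotated toward +theta0.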
Learning to predict b, b*, and f suffices to determine all components of W. For computational efficiency, we predict pairwise relationships between each pixel and its immediate neighbors across multiple spatial scales. This defines a multiscale sparse W. As an adjustment prior to feeding W to the Angular Embedding solver of [23], we enforce Hermitian symmetry by assigning:

    W ← (W + W*) / 2    (21)

where W* denotes the conjugate transpose of W.
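One standard way to realize this symmetrization step, sketched here for a densely stored matrix, is to average W with the conjugate of its transpose:

```python
import numpy as np

def symmetrize(W):
    """Enforce Hermitian symmetry on a complex affinity matrix by
    averaging each entry with the conjugate of its transpose partner."""
    return 0.5 * (W + W.conj().T)

W = np.array([[1.0 + 0.0j, 0.2 + 0.5j],
              [0.1 - 0.3j, 1.0 + 0.0j]])
W_sym = symmetrize(W)
assert np.allclose(W_sym, W_sym.conj().T)  # Hermitian by construction
```

In practice W is sparse with the multiscale stencil's support, but the same averaging applies entrywise to paired offsets.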
Supervised training of our system proceeds from a collection of images and associated ground-truth, {(I, S, F)}. Here, I is an image defined on domain Ω. S is a segmentation mapping each pixel to a region id, and F is a rank ordering of pixels according to figure/ground layering. This data defines ground-truth pairwise relationships:

    b̂(p,q) = 1[S(p) ≠ S(q)]    (22)

    f̂(p,q) = 1[F(q) > F(p)]    (23)

As f is a conditional probability (Equation 9), we only generate training examples for f on pairs satisfying S(p) ≠ S(q).
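The target generation of Equations 22 and 23, including the conditioning rule, can be sketched as below. The helper name and toy maps are illustrative, and we assume larger rank values in F denote more figural pixels:

```python
import numpy as np

def pair_labels(S, F, p, q):
    """Ground-truth targets for one pixel pair (Eqs. 22-23).
    S: region-id map, F: figure/ground rank map; p, q: (row, col) tuples.
    Returns (b_hat, f_hat); f_hat is None when the pair lies in the same
    region, since f is conditioned on a boundary being present."""
    b_hat = int(S[p] != S[q])
    f_hat = int(F[q] > F[p]) if b_hat else None
    return b_hat, f_hat

S = np.array([[0, 0, 1], [0, 0, 1]])      # two regions
F = np.array([[0, 0, 2], [0, 0, 2]])      # right region is figural
print(pair_labels(S, F, (0, 0), (0, 2)))  # (1, 1): boundary, q is figure
print(pair_labels(S, F, (0, 0), (1, 1)))  # (0, None): same region
```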
In all experiments, we sample pixel pairs from a multiscale stencil pattern. For each pixel p, we consider as q each of its 8 immediate neighbors in the pixel grid, at each of three scales of increasing neighbor distance. The stencil pattern thus consists of 24 neighbors total. We train two predictors, b and f, at each of the 24 offsets, for 48 outputs describing the pairwise affinity between a pixel and its neighbors. We derive the predictor b* as a local average of b:

    b*(p,q) = (1 / |N(p)|) Σ_{p′ ∈ N(p)} b(p′, q)    (24)

where N(p) consists of the neighbors to p at the finest scale.
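Enumerating the stencil offsets is straightforward: the 8-neighborhood replicated at several dilation scales. The particular scale distances used here (1, 4, 16) are illustrative placeholders, not necessarily the paper's values:

```python
# Multiscale stencil: the 8 immediate grid neighbors replicated at
# several dilation scales.
def stencil_offsets(scales=(1, 4, 16)):
    base = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0)]
    return [(dy * s, dx * s) for s in scales for (dy, dx) in base]

offsets = stencil_offsets()
print(len(offsets))  # 24 neighbors -> 2 * 24 = 48 predictors (b and f)
```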
Choosing a CNN to implement these predictors, we regard the problem as mapping an input image to a multi-channel output over the same domain. We adapt prior CNN designs for predicting output quantities at every pixel [9, 8, 26] to our somewhat higher-dimensional prediction task. Specifically, we reuse the basic network design of [26], which first passes a large-scale coarse receptive field through an AlexNet [19]-like subnetwork. It then feeds this subnetwork's output into a second subnetwork acting on a finer-scale receptive field. Figure 5 provides a complete layer diagram. In modifying [26], we increase the size of the penultimate feature map as well as the output dimensionality.
For modularity at training time, we separately train two networks, one for b and one for f, each with the layer architecture of Figure 5. We use a modified version of Caffe [17] for training, with a log loss between truth and prediction applied pixel-wise to each output:

    ℓ(y, ŷ) = −y · log(ŷ) − (1 − y) · log(1 − ŷ)    (25)

making the total loss for b:

    L_b = Σ_p Σ_{q ∈ N(p)} ℓ(b̂(p,q), b(p,q))    (26)

with an analogous loss applied for f. Here N(p) denotes all neighbors of p according to the stencil pattern.
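The pixel-wise log loss and its summation over the stencil (Equations 25 and 26) amount to standard binary cross-entropy; a minimal sketch, with the clipping epsilon added for numerical safety:

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-12):
    """Per-pair binary cross-entropy (Eq. 25)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def total_loss(targets, preds):
    """Eq. (26): sum of the pairwise log loss over every pixel and every
    stencil neighbor.  targets, preds: arrays of matching shape
    (pixels x stencil offsets)."""
    return log_loss(targets, preds).sum()

print(round(float(log_loss(1.0, 0.9)), 4))  # -log(0.9), about 0.1054
```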
Using stochastic gradient descent with random initialization and momentum, we train with mini-batches for a fixed number of iterations. Learning rates for each layer are tuned by hand. We utilize data augmentation in the form of translation and left-right mirroring of examples.

Training our system for the generic perceptual task of segmentation and figure/ground layering requires a dataset fully annotated in this form. While there appears to be renewed interest in creating large-scale datasets with such annotation [39], none has yet been released. We therefore use the Berkeley segmentation dataset [25] for training. Though the dataset provides segmentations for all of its images, only a subset has been annotated with ground-truth figure/ground [13]. We resplit this subset into disjoint training and testing portions.
The following subsections detail how, even with such scarcity of training data, our system achieves substantial improvements in figure/ground quality over prior work.
Our model formulation relies on dense labeling of pixel relationships. The BSDS ground-truth provides a dense segmentation in the form of a region map, but only defines local figure/ground relationships between pixels immediately adjacent along a region boundary [13]. We would like to train predictors for long-range figure/ground relationships (our multiscale stencil pattern) in addition to short-range ones.
Figure 6 illustrates our method for overcoming this limitation. Given perfect (e.g., ground-truth) short-range predictions as input, Angular Embedding generates an extremely high-quality global figure/ground estimate. In a real setting, we want robustness from having many estimates of pairwise relations over many scales; ground-truth short-range connections suffice here because they are perfect estimates. We use the globalized ground-truth figure/ground map (Figure 6) as our training signal in Equation 23. The usual ground-truth segmentation serves as the target in Equation 22.
Figure 7 shows results on examples from our test set. Compared to the previous attempt [22] to use Angular Embedding as an inference engine for figure/ground, our results are strikingly better: our system visibly improves on every example in terms of figure/ground quality.
Our segmentation, as measured by boundary quality, is comparable to that of similar systems using spectral clustering for segmentation alone [2]. On the standard boundary precision-recall benchmark on BSDS, our spectral boundaries achieve an F-measure identical to that of the spectral component ("spectral Pb") of the gPb boundary detector [2]. Thus, as a segmentation engine, our system is on par with the previous best spectral-clustering-based systems.

To our knowledge, there is no well-established benchmarking methodology for dense figure/ground predictions. While [34] proposes metrics coupling figure/ground classification accuracy along boundaries to boundary detection performance, we develop a simpler alternative.
Given a per-pixel figure/ground ordering assignment, and a segmentation partitioning an image into regions, we can easily order the regions according to figure/ground layering. Simply assign each region a rank order equal to the mean figure/ground order of its member pixels. For robustness to minor misalignment between the figure/ground assignment and the boundaries of regions in the segmentation, we use median in place of mean.
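A minimal sketch of this transfer step, assuming larger rank values mean closer to the viewer (the function name and toy data are illustrative):

```python
import numpy as np

def transfer_order(fg, seg):
    """Assign each region in seg the median figure/ground rank of its
    member pixels (median rather than mean, for robustness to slight
    misalignment between fg and region boundaries)."""
    return {r: float(np.median(fg[seg == r])) for r in np.unique(seg)}

seg = np.array([[0, 0, 1], [0, 1, 1]])
fg  = np.array([[0.1, 0.2, 0.9], [0.1, 0.8, 0.7]])
print(transfer_order(fg, seg))  # region 1 ordered in front of region 0
```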
This transfer procedure serves as a basis for comparing different figure/ground orderings: we transfer them both onto the same segmentation. In particular, given a predicted figure/ground ordering, a ground-truth figure/ground ordering, and a ground-truth segmentation, we transfer both orderings onto that segmentation. This gives two orderings of its regions, which we compare according to the following metrics:
Pairwise region accuracy (R-ACC): For each pair of neighboring regions in , if the ground-truth figure/ground assignment shows them to be in different layers, we test whether the predicted relative ordering of these regions matches the ground-truth relative ordering. That is, we measure accuracy on the classification problem of predicting which region is in front.
Boundary ownership accuracy (B-ACC): We define the front region as owning the pixels on the common boundary of the region pair and measure the per-pixel accuracy of predicting boundary ownership. This is a reweighting of R-ACC. In R-ACC, all region pairs straddling a ground-truth figure/ground boundary count equally. In B-ACC, their importance is weighted according to length of the boundary.
Boundary ownership of foreground regions (B-ACC-50, B-ACC-25): Identical to B-ACC, except we only consider boundaries belonging to the foreground-most 50% or 25% of regions in the ground-truth figure/ground ordering of each image. These metrics emphasize the importance of correct predictions for foreground objects while ignoring more distant objects.
Note that the segmentation used here need not be the ground-truth segmentation. We can project ground-truth figure/ground onto any segmentation (say, a machine-generated one) and compare it to predicted figure/ground projected onto that same segmentation.
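Under these definitions, R-ACC can be sketched as follows (a hypothetical helper; orderings are represented as dicts from region id to transferred rank, with larger meaning more figural):

```python
def pairwise_region_accuracy(gt_order, pred_order, neighbor_pairs):
    """R-ACC sketch: over neighboring region pairs whose ground-truth
    layers differ, the fraction for which the prediction puts the same
    region in front."""
    trials = [(a, b) for a, b in neighbor_pairs
              if gt_order[a] != gt_order[b]]
    correct = sum((gt_order[a] > gt_order[b]) == (pred_order[a] > pred_order[b])
                  for a, b in trials)
    return correct / len(trials) if trials else float('nan')

gt   = {0: 0.1, 1: 0.8, 2: 0.5}
pred = {0: 0.2, 1: 0.9, 2: 0.1}          # gets the (0, 2) pair wrong
print(pairwise_region_accuracy(gt, pred, [(0, 1), (0, 2), (1, 2)]))  # 2 of 3 correct
```

B-ACC follows the same logic with each pair weighted by the length of its shared boundary instead of counting once.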
| Segmentation | Figure/Ground | R-ACC | B-ACC | B-ACC-50 | B-ACC-25 |
|---|---|---|---|---|---|
| Ground-truth | Ours | | | | |
| Ground-truth | Maire [22] | | | | |
| Ours | Ours | | | | |
| Ours | Maire [22] | | | | |
Table 1 reports a complete comparison of our figure/ground predictions and those of [22] against ground-truth figure/ground on our BSDS test subset [25]. We consider projection both onto the ground-truth segmentation and onto our own system's segmentation output. For the latter, as our system produces a hierarchical segmentation, we use the region partition at a fixed level of the hierarchy, calibrated for optimal boundary F-measure. Figures 8 and 9 provide visual comparisons on test images.
Figure 10 demonstrates that our BSDS-trained system captures generally-applicable notions of both segmentation and figure/ground. On both PASCAL VOC [10] and the Weizmann Horse database [5], it generates figure/ground layering that respects scene organization. On the Weizmann examples, though having only been trained for perceptual organization, it behaves like an object detector.
We demonstrate that Angular Embedding, acting on CNN predictions about pairwise pixel relationships, provides a powerful framework for segmentation and figure/ground organization. Our work is the first to formulate a robust interface between these two components. Our results are a dramatic improvement over prior attempts to use spectral methods for figure/ground organization.