Kernel Cuts: MRF meets Kernel & Spectral Clustering

We propose a new segmentation model combining common regularization energies, e.g. Markov Random Field (MRF) potentials, and standard pairwise clustering criteria like Normalized Cut (NC), average association (AA), etc. These clustering and regularization models are widely used in machine learning and computer vision, but they were not combined before due to significant differences in the corresponding optimization, e.g. spectral relaxation and combinatorial max-flow techniques. On the one hand, we show that many common applications using MRF segmentation energies can benefit from a high-order NC term, e.g. enforcing balanced clustering of arbitrary high-dimensional image features combining color, texture, location, depth, motion, etc. On the other hand, standard clustering applications can benefit from an inclusion of common pairwise or higher-order MRF constraints, e.g. edge alignment, bin-consistency, label cost, etc. To address joint energies like NC+MRF, we propose efficient Kernel Cut algorithms based on bound optimization. While focusing on graph cut and move-making techniques, our new unary (linear) kernel and spectral bound formulations for common pairwise clustering criteria allow to integrate them with any regularization functionals with existing discrete or continuous solvers.

Authors

• 11 publications
• 8 publications
• 46 publications
• 20 publications
• Tight Continuous Relaxation of the Balanced k-Cut Problem

Spectral Clustering as a relaxation of the normalized/ratio cut has beco...
05/24/2015 ∙ by Syama Sundar Rangapuram, et al. ∙ 0

• Constrained 1-Spectral Clustering

An important form of prior information in clustering comes in form of ca...
05/24/2015 ∙ by Syama Sundar Rangapuram, et al. ∙ 0

• Spectral Graph Cut from a Filtering Point of View

Spectral graph theory is well known and widely used in computer vision. ...
05/20/2012 ∙ by Chengxi Ye, et al. ∙ 0

• Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction

Dense semantic 3D reconstruction is typically formulated as a discrete o...
06/25/2019 ∙ by Nikolay Savinov, et al. ∙ 2

• Normalized Cut with Adaptive Similarity and Spatial Regularization

In this paper, we propose a normalized cut segmentation algorithm with s...
06/06/2018 ∙ by Faqiang Wang, et al. ∙ 0

• Kernel Spectral Clustering and applications

In this chapter we review the main literature related to kernel spectral...
05/03/2015 ∙ by Rocco Langone, et al. ∙ 0

• Exact and Approximate Inference in Associative Hierarchical Networks using Graph Cuts

Markov Networks are widely used through out computer vision and machine ...
03/15/2012 ∙ by Chris Russell, et al. ∙ 0

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction and Motivation

The terms clustering and segmentation are largely synonyms. The latter is common specifically for computer vision when data points are intensities or other features sampled at regularly spaced image pixels . The pixel’s location is essential information. Many image segmentation methods treat as a function and process intensity and location in fundamentally different ways. This applies to many regularization methods including discrete MRF-based techniques [1] and continuous variational methods [2]. For example, such methods often use pixel locations to represent the segment’s geometry/shape and intensities to represent its appearance [3, 4, 5, 6].

The term clustering is often used in the general context where data point is an arbitrary observation indexed by integer . Data clustering techniques [7] are widely used outside of vision. Variants of K-means, spectral or other clustering methods are also used for image segmentation. They often join the pixel’s location and its feature into one combined data point in . We focus on a well-known general group of pairwise clustering methods [8]

based on some estimated

affinities between all pairs of points.

While independently developed from different methodologies, standard regularization and pairwise clustering methods are defined by objective functions that have many complementary properties reviewed later. Our goal is to combine these functions into a joint energy applicable to image segmentation or general clustering problems. Such objectives could not be combined before due to significant differences in the underlying optimization methods, e.g. combinatorial graph cut versus spectral relaxation. While focused on MRF regularization, our approach to integrating pairwise clustering is based on a bound formulation that easily extends to any regularization functional with existing solvers.

We use basic notation applicable to either image segmentation or general data clustering. Let be a set of pixels, voxels, or any other points . For example, for 2D images could be a subset of regularly spaced points in . Set could also represent data points indices. Assume that every comes with an observed feature . For example, could be a greyscale intensity in , an RGB color in , or any other high-dimensional observation. If needed, could also include the pixel’s location.

A segmentation of can be equivalently represented either as a labeling including integer node labels or as a partitioning of set into non-overlapping subsets or segments . With minor abuse of notation, we will also use

as indicator vectors in

.

We combine standard pairwise clustering criteria such as Average Association (AA) or Normalized Cut (NC) [8] and common regularization functionals such as pairwise or high-order MRF potentials [1, 9]. The general form of our joint energy is

 E(S)=EA(S)+γ∑c∈FEc(Sc) (1)

where the first term is some pairwise clustering objective based on data affinity matrix or kernel with defined by some similarity function . The second term in (1) is a general formulation of MRF potentials [10, 11, 12]. Constant is a relative weight of this term. Subset represents a factor typically consisting of pixels with nearby locations. Factor labels is a restriction of labeling to . Potentials for a given set of factors represent various forms of unary, second, or higher order constraints, where factor size defines the order. Factor features often work as parameters for potential . Section 1.1 reviews standard MRF potentials. Note that standard clustering methods encourage balanced segments using ratio-based objectives. They correspond to high-order potential of order in (1). Sections 1.2 and 1.3 review standard clustering objectives used in our paper.

1.1 Overview of MRF regularization

Probably the most basic MRF regularization potential corresponds to the second-order Potts model [10] used for edge alignment

 ∑c∈FEc(Sc)=∑pq∈Nwpq⋅[Sp≠Sq]≈||∂S|| (2)

where a set of pairwise factors includes edges between pairs of neighboring nodes and are Iverson brackets. Weight is a discontinuity penalty between and . It could be a constant or may be set by a decreasing function of intensity difference attracting the segmentation boundary to image contrast edges [4]. This is similar to the image-based boundary length in geodesic contours [3, 5].

A useful bin consistency constraint enforced by the -Potts model [11] is defined over an arbitrary collection of high-order factors . Factors correspond to predefined subsets of nodes such as superpixels [11] or bins of pixels with the same color/feature [13, 14]. The model penalizes inconsistency in segmentation of each factor

 ∑c∈FEc(Sc)=∑c∈Fmin{T,|c|−|Sc|∗} (3)

where is some threshold and is the cardinality of the largest segment inside . Potential (3) has its lowest value (zero) when all nodes in each factor are within the same segment.

Standard label cost [12] is a sparsity potential defined for a single high-order factor . In its simplest form this potential penalizes the number of distinct segments (labels) in

 EΩ(S)=∑khk⋅[|Sk|>0] (4)

where could be a constant or a cost for each specific label.

Potentials (2), (3), (4) are only a few examples of regularization terms widely used in combination with powerful discrete solvers like graph cut [10], belief propagation [15], TRWS [16], LP relaxation [17, 18], or continuous methods [19, 20, 21].

Image segmentation methods often combine regularization with a likelihood term integrating segments/objects color models. For example, [4, 22] used graph cuts to combine second-order edge alignment (2) with a unary (first-order) appearance term

 −∑k∑p∈SklogPk(Ip) (5)

where

are given probability distributions. Unary terms like (

5) are easy to integrate into any of the solvers above.

If unknown, parameters of the models in a regularization energy including (5) are often estimated by iteratively minimizing the energy with respect to and model parameters [23, 24, 25, 26, 12]. In presence of variable model parameters, (5) can be seen as a maximum likelihood (ML) model-fitting term or a probabilistic K-means clustering objective [27]. The next section reviews K-means and other standard clustering methods.

1.2 Overview of K-means and clustering

Many clustering methods are based on K-means (KM). The most basic iterative KM algorithm [28] can be described as the block-coordinate descent for the following mixed objective

 F(S,m):=∑k∑p∈Sk∥Ip−mk∥2 (6)

combining discrete variables with continuous variables representing cluster “centers”. Norm denotes the Euclidean metric. For any given the optimal centers are the means

 μSk:=∑q∈SkIq|Sk| (7)

where denotes the segment’s cardinality. Assuming current segments the update operation giving

defines the next solution as per standard K-means algorithm. This greedy descent technique converges only to a local minimum of KM objective (6), which is known to be NP hard to optimize. There are also other approximation methods. Below we review the properties of KM objective (6) independently of optimization.

The optimal centers in (7) allow to represent (6) via an equivalent objective of a single argument

 ∑k∑p∈Sk∥Ip−μSk∥2≡∑k|Sk|⋅var(Sk). (9)

The sum of squared distances between data points and mean normalized by gives the sample variance denoted by . Formulation (9) presents the basic KM objective as a standard variance criterion

for clustering. That is, K-means attempts to find K compact clusters with small variance.

K-means can also be presented as a pairwise clustering criteria with Euclidean affinities. The sample variance can be expressed as the sum of distances between all pairs of the points. For example, plugging (7) into (9) reduces this KM objective to

 ∑k∑pq∈Sk∥Ip−Iq∥22|Sk|. (10)

Taking the square in the denominator transforms (10) to another equivalent KM energy with Euclidean dot-product affinities

 c=−∑k∑pq∈Sk⟨Ip,Iq⟩|Sk|. (11)

Note that we use and for “up to additive constant” relations.

Alternatively, K-means clustering can be seen as Gaussian model fitting. Formula (5

) for normal distributions with variable means

and some fixed variance

 −∑k∑p∈SklogN(Ip|mk) (12)

equals objective (6) up to a constant.

Various extensions of objectives (6), (9), (10), (11), or (12) lead to many powerful clustering methods such as kernel K-means, average association, and Normalized Cut, see Tab.I and Fig.34.

1.2.1 Probabilistic K-means (pKM) and model fitting

One way to generalize K-means is to replace squared Euclidean distance in (6) by other distortion measures leading to a general distortion energy commonly used for clustering

 ∑k∑p∈Sk∥Ip−mk∥d. (13)

The optimal value of parameter may no longer correspond to a mean. For example, the optimal for non-squared metric is a geometric median. For exponential distortions the optimal may correspond to modes [29, 30], see Appendix B.

A seemingly different way to generalize K-means is to treat both means and covariance matrices for the normal distributions in (12) as variables. This corresponds to the standard elliptic K-means [31, 32, 12]. In this case variable model parameters and data points are not in the same space. Yet, it is still possible to present elliptic K-means as distortion clustering (13) with “distortion” between and defined by an operator corresponding to a likelihood function

 ∥Ip−θk∥d:=−logN(Ip|θk).

Similar distortion measures can be defined for arbitrary probability distributions with any variable parameters . Then, distortion clustering (13) generalizes to ML model fitting objective

 ∑k∑p∈Sk∥Ip−θk∥d≡−∑k∑p∈SklogP(Ip|θk) (14)

which is (5) with explicit model parameters . This formulation suggests probabilistic K-means111The name probabilistic K-means in the general clustering context was coined by [27]. They formulated (14) after representing distortion energy (13) as ML fitting of Gibbs models for arbitrary integrable metrics. (pKM) as a good idiomatic name for ML model fitting or distortion clustering (13), even though the corresponding parameters are not “means”, in general.

Probabilistic K-means (14) is used in image segmentation with models such as elliptic Gaussians [31, 32, 12], gamma/exponential [25], or other generative models [33]. Zhu-Yuille [23] and GrabCut [26] use pKM with highly descriptive probability models such as GMM or histograms. Information theoretic analysis in [27] shows that in this case pKM objective (14) reduces to the standard entropy criterion for clustering

 ∑k|Sk|⋅H(Sk) (15)

where is the entropy of the distribution for .

Intuitively, minimization of the entropy criterion (15) favors clusters with tight or “peaked” distributions. This criterion is widely used in categorical clustering [34]

[35, 36] where the entropy evaluates histograms over “naturally” discrete features. However, the entropy criterion with either discrete histograms or continuous GMM densities has limitations in the context of continuous feature spaces, see Appendix C. Iterative fitting of descriptive models is highly sensitive to local minima [14, 37] and easily over-fits even low dimentional features in (Fig.1b,e) or in (RGB colors, Fig.2b). This may explain why this approach to clustering is not too common in the learning community. As proposed in (1), instead of entropy criterion we will combine MRF regularization with general pairwise clustering objectives widely used for balanced partitioning of arbitrary high-dimensional features [8].

1.2.2 Kernel K-means (kKm) and pairwise clustering

This section reviews pairwise extensions of K-means (11) such as kernel K-means (KM) and related clustering criteria. In machine learning, KM is a well established data clustering technique [42, 43, 44, 40, 45, 46] that can identify non-linearly separable structures. In contrast to pKM based on complex models, KM corresponds to complex (nonlinear) mappings

 ϕ:RN→H

embedding data as points in a higher-dimensional Hilbert space . The original non-linear problem can often be solved by simple linear separators of the embedded points . Kernel K-means corresponds to the basic K-means (6) in the embedding space

 F(S,m)=∑k∑p∈Sk∥ϕp−mk∥2. (16)

Optimal segment centers corresponding to the means

 μSk=∑q∈Skϕq|Sk|. (17)

reduce (16) to KM energy of the single variable similar to (9)

 F(S)=∑k∑p∈Sk∥ϕp−μSk∥2. (18)

Similarly to (10) and (11) one can write pairwise clustering criteria equivalent to (18) based on Euclidean distances or inner products , which are commonly represented via kernel function

 k(x,y):=⟨ϕ(x),ϕ(y)⟩. (19)

The (non-linear) kernel function corresponds to the inner product in . It also defines Hilbertian metric222Such metrics can be isometrically embedded into a Hilbert space [47].

 ∥x−y∥2k := ∥ϕ(x)−ϕ(y)∥2 (20) ≡ k(x,x)+k(y,y)−2k(x,y)

isometric to the Euclidean metric in the embedding space. Then, pairwise formulations (10) and (11) for K-means in the embedding space (18) can be written with respect to the original data points using isometric kernel distance in (20)

 F(S)≡∑k∑pq∈Sk∥Ip−Iq∥2k2|Sk| (21)

or using kernel function in (19)

 F(S)c=−∑k∑pq∈Skk(Ip,Iq)|Sk|. (22)

The definition of kernel in (19) requires embedding . Since pairwise objectives (21) and (22) are defined for any kernel function in the original data space, it is possible to formulate KM by directly specifying an affinity function or kernel rather than embedding . This is typical for KM explaining why the method is called kernel K-means rather than embedding K-means333This could be a name for some clustering techniques constructing explicit embeddings [48, 49] instead of working with pairwise affinities/kernels..

Given embedding , kernel function defined by (19) is positive semi-definite (p.s.d), that is for any . Moreover, Mercer’s theorem [50] states that p.s.d. condition for any given kernel is sufficient to guarantee that is an inner product in some Hilbert space. That is, it guarantees existence of some embedding such that (19) is satisfied. Therefore, KM objectives (18), (21), (22) are equivalently defined either by embeddings or p.s.d. kernels . Thus, kernels are commonly assumed p.s.d. However, as discussed later, pairwise clustering objective (22) is also used with non p.s.d. affinities.

To optimize KM objectives (18), (21), (22) one can use the basic KM procedure (8) iteratively minimizing mixed objective (16) explicitly using embedding

where is the mean (17) for current segment . Equivalently, this procedure can use kernel instead of . Indeed, as in Section 8.2.2 of [51], the square of the objective in (23) is

 ∥ϕp∥2−2ϕp′μSkt+∥μSkt∥2=−2ϕp′ϕSkt|Skt|+Skt′ϕ′ϕSkt|Skt|2

where we use segment as an indicator vector, embedding as an embedding matrix where points are columns, and denotes the transpose. Since the crossed term is a constant at , the right hand side gives an equivalent objective for computing in (23). Using kernel matrix and indicator vector for element we get

where the kernel matrix is directly determined by kernel

 Kpq≡ϕ′pϕq=⟨ϕp,ϕq⟩=k(Ip,Iq).

Approach (24) has quadratic complexity iterations. But, it avoids explicit high-dimensional embeddings in (23) replacing them by kernel in all computations, a.k.a. the kernel trick.

Note that the implicit KM procedure (24) is guaranteed to decrease pairwise KM objectives (21) or (22) only for p.s.d. kernels. Indeed, equation (24) is derived from the standard greedy K-means procedure in the embedding space (23) assuming kernel (19). The backward reduction of (24) to (23) can be done only for p.s.d. kernels when Mercer’s theorem guarantees existence of some embedding such that .

Pairwise energy (21) helps to explain the positive result for KM with common Gaussian kernel in Fig.1(h). Gaussian kernel distance (red plot below)

 ∥Ip−Iq∥2k∝1−k(Ip,Iq)=1−exp−(Ip−Iq)22σ2 (25)

is a “robust” version of Euclidean metric (green) in basic K-means (10). Thus, Gaussian KM finds clusters with small local variances, Fig.1(h). In contrast, basic K-means (c) tries to find good clusters with small global variances, which is impossible for non-compact clusters.

Average association (AA) or distortion (AD): Equivalent pairwise objectives (21) and (22) suggest natural extensions of KM. For example, one can replace Hilbertian metric in (21) by an arbitrary zero-diagonal distortion matrix generating average distortion (AD) energy

reducing to KM energy (21) for . Similarly, p.s.d. kernel in (22) can be replaced by an arbitrary pairwise similarity or affinity matrix defining standard average association (AA) energy

 Eaa(S):=−∑k∑pq∈SkApq|Sk| (27)

reducing to KM objective (22) for . We will also use association between any two segments and

 assoc(Si,Sj):=∑p∈Si,q∈SjApq≡Si′ASj (28)

allowing to rewrite AA energy (27) as

The matrix expressions in (28) and (29) represent segments as indicator vectors such that iff and symbol  means a transpose. Matrix notation as in (29) will be used for all pairwise clustering objectives discussed in this paper.

KM algorithm (24) is not guaranteed to decrease (27) for improper (non p.s.d.) kernel matrix , but general AA and AD energies could be useful despite optimization issues. However, [38] showed that dropping metric and proper kernel assumptions are not essential; there exist p.s.d. kernels with KM energies equivalent (up to constant) to AD (26) and AA (27) for arbitrary associations and zero-diagonal distortions , see Fig. 3.

For example, for any given affinity in (27) the diagonal shift trick of Roth et al. [38] generates the “kernel matrix”

 K=A+A′2+δ⋅I. (30)

For sufficiently large scalar matrix is positive definite yielding a proper discrete kernel

 k(Ip,Iq):χ×χ→R

for finite set . It is easy to check that KM energy (22) with kernel in (30) is equivalent to AA energy (27) with affinity , up to a constant. Indeed, for any indicator we have implying

 X′KX1′X=X′AX2(1′X)+X′A′X2(1′X)+δX′X1′X=X′AX1′X+δ.

Also, Section 3.1 uses eigen decomposition of to construct an explicit finite-dimensional Euclidean embedding444Mercer’s theorem is a similar eigen decomposition for continuous p.d. kernels giving infinite-dimensional Hilbert embedding . Discrete kernel embedding in Sec. 3.1 (56) has finite dimension , which is still much higher than the dimension of points , e.g.  for colors. Sec. 3.1 also shows lower dimensional embeddings approximating isometry (20). satisfying isometry (20) for any p.d. discrete kernel . Minimizing KM energy (18) over such embedding isometric to in (30) is equivalent to optimizing (22) and, therefore, (27).

Since average distortion energy (26) for arbitrary is equivalent to average association for , it can also be converted to KM with a proper kernel [38]. Using the corresponding kernel matrix (30) and (20) it is easy to derive Hilbertian distortion (metric) equivalent to original distortions

 ∥Ip−Iq∥2k:=D+D′2+2δ(1⋅1′−I). (31)

For simplicity and without loss of generality, the rest of the paper assumes symmetric affinities since non-symmetric ones can be equivalently replaced by . However, we do not assume positive definiteness and discuss diagonal shifts, if needed.

Weighted KM and weighted AA: Weighted K-means [28] is a common extension of KM techniques incorporating some given point weights . In the context of embedded points it corresponds to weighted KM iteratively minimizing the weighted version of the mixed objective in (16)

 Fw(S,m):=∑k∑p∈Skwp∥ϕp−mk∥2. (32)

Optimal segment centers are now weighted means

 μwSk=∑q∈Skwqϕq∑q∈Skwq≡ϕWSkw′Sk (33)

where the matrix formulation has weights represented by column vector and diagonal matrix . Assuming a finite dimensional data embedding this formulation uses embedding matrix with column vectors . This notation implies two simple identities used in (33)

 ∑q∈Skwq≡w′Skand∑q∈Skwqϕp≡ϕWSk. (34)

Inserting weighted means (33) into mixed objective (32) produces a pairwise energy formulation for weighted KM similar to (22)

 Fw(S) := ∑k∑p∈Skwp∥ϕp−μwSk∥2 (35) c= −∑k∑pq∈SkwpwqKpq∑p∈Skwp ≡ −∑kSk′WKWSkw′Sk

where p.s.d kernel matrix corresponds to the dot products in the embedding space, i.e. .

Replacing the p.s.d. kernel with an arbitrary affinity matrix defines a weighted AA objective generalizing (27) and (29)

 Ewaa(S) := −∑kSk′WAWSkw′Sk. (37)

Weighted AD can also be defined. Equivalence of KM, AA, and AD in the general weighted case is discussed in Appendix A.

Other pairwise clusteing criteria: Besides AA there are many other standard pairwise clustering criteria defined by affinity matrices . For example, Average Cut (AC)

 Eac(S) := ∑kassoc(Sk,¯Sk)|Sk|≡∑kSk′A(1−Sk)1′Sk (38) = ∑kSk′(D−A)Sk1′Sk

where is a degree matrix defined by node degrees vector . The formulation on the last line (38) comes from the following identity valid for Boolean

 X′DX=X′d.

Normalized Cut (NC) [8] in (39) is another well-known pairwise clustering criterion. Both AC and NC can also be reduced to KM [38, 52, 40]. However, spectral relaxation [8] is more common optimization method for pairwise clustering objectives than iterative KM procedure (24). Due to popularity of NC and spectral relaxation we review them in a dedicated Section 1.3.

1.2.3 Pairwise vs. pointwise distortions

Equivalence of KM to pairwise distortion criteria (26) helps to juxtapose kernel K-means with probabilistic K-means (Sec.1.2.1) from one more point of view. Both methods generalize the basic K-means (6), (10) by replacing the Euclidean metric with a more general distortion measure . While pKM uses “pointwise” formulation (13) where measures distortion between a point and a model, KM uses “pairwise” formulation (21) where measures distortion between pairs of points.

These two different formulations are equivalent for Euclidean distortion (i.e. basic K-means), but the pairwise approach is strictly stronger than the pointwise version using the same Hilbertian distortion in non-Euclidean cases (see Appendix B). The corresponding pointwise approach is often called weak kernel K-means. Interestingly, weak KM with standard Gaussian kernel can be seen as K-modes [29], see Fig. 1(g). Appendix B also details relations between K-modes and popular mean-shift clustering [39]. An extended version of Table I including weighted KM and weak KM techniques is given in Figure 34.

1.3 Overview of Normalized Cut and spectral clustering

Section 1.2.2 has already discussed KM and many related pairwise clustering criteria based on specified affinities . This section is focused on a related pairwise clustering method, Normalized Cut (NC) [8]. Shi and Malik [8] also popularized pairwise clustering optimization via spectral relaxation, which is different from iterative K-means algorithms (23) (24). Note that there are many other popular optimization methods for different clustering energies using pairwise affinities [53, 54, 55, 56, 57], which are outside the scope of this work.

1.3.1 Nc objective and its relation to Aa and kKm

Normalized Cut (NC) energy [8] can be written as

where association (28) is defined by a given affinity matrix . The second matrix formulation above shows that the difference between NC and AA (29) is in the normalization. AA objective uses denominator , which is -th segment’s size. NC (39) normalizes by weighted size. Indeed, using

where weights are node degrees

 dp:=∑q∈ΩApq. (40)

For convenience, NC objective can be formatted similarly to (29)

For some common types of affinities where , e.g. -nearest neighbor () graphs, NC and AA objectives (41) and (29) are equivalent. More generally, Bach & Jordan [52], Dhillon et al. [40] showed that NC objective can be reduced to weighted AA or KM with specific weights and affinities.

Our matrix notation makes equivalence between NC (41) and weighted AA (37) straightforward. Indeed, objective (41) with affinity coincides with (37) for weights and affinity

 w=d=A′1and~A=W−1AW−1. (42)

The weighted version of KM procedure (24) detailed in Appendix A minimizes weighted AA (37) only for p.s.d. affinities, but positive definiteness of is not critical. For example, an extension of the diagonal shift (30) [38] can convert NC (41) with arbitrary (symmetric) to an equivalent NC objective with p.s.d. affinity

 K=A+δ⋅D (43)

using degree matrix and sufficiently large . Indeed, for indicators we have and

 X′KXd′X=X′AXd′X+%$δ$X′DXd′X=X′AXd′X+δ.

Positive definite (43) implies p.d. affinity (42) of weighted AA

The weighted version of KM procedure (24) for this p.d. kernel [58] greedily optimizes NC objective (41) for any (symmetric) .

1.3.2 Spectral relaxation and other optimization methods

There are many methods for approximate optimization of NP-hard pairwise clustering energies besides greedy K-mean procedures. In particular, Shi, Malik, and Yu [8, 59] popularized spectral relaxation methods in the context of normalized cuts. Such methods also apply to AA and other problems [8]. For example, similarly to [59] one can rewrite AA energy (27) as

 Eaa(S)=−tr(Z′AZ)forZ:=[…,Sk√|Sk|,…]

where is a matrix of normalized indicator vectors . Orthogonality implies where

is an identity matrix of size

. Minimization of the trace energy above with relaxed constrained to a “unit sphere” is a simple representative example of spectral relaxation in the context of AA. This relaxed trace optimization is a generalization of Rayleigh quotient problem that has an exact closed form solution in terms of

largest eigenvectors for matrix

. This approach extends to general multi-label weighted AA and related graph clustering problems, e.g. AC and NC [8, 59]

. The main computational difficulties for spectral relaxation methods are explicit eigen decomposition for large matrices and integrality gap - there is a final heuristics-based discretization step for extracting an integer solution for the original combinatorial problem from an optimal relaxed solution. For example, one basic discretization heuristic is to run K-means over the row-vectors of the optimal relaxed

.

Other optimization techniques are also used for pairwise clustering. Buhmann et al. [60, 38] address the general AD and AA energies via mean field approximation and deterministic annealing. It can be seen as a fuzzy version555Fuzzy version of K-means in Duda et al. [61] generalizes to KM. of the kernel K-means algorithm. In the context of normalized cuts Dhillon et al. [40, 58] use spectral relaxation to initialize KM algorithm.

In computer vision it is common to combine various constraints or energy terms into one objective function. Similar efforts were made in the context of pairwise clustering as well. For example, to combine KM or NC objectives with Potts regularization [62] normalizes the corresponding pairwise constraints by cluster sizes. This alters the Potts model to fit the problem to a standard trace-based formulation.

Adding non-homogeneous linear constraints into spectral relaxation techniques also requires approximations [63] or model modifications [64]. Exact optimization for the relaxed quadratic ratios (including NC) with arbitrary linear equality constraints is possible by solving a sequence of spectral problems [65].

Our bound optimization approach allows to combine many standard pairwise clustering objectives with any regularization terms with existing solvers. In our framework such pairwise clustering objectives are interpreted as high-order energy terms approximated via linear upper bounds during optimization.

1.4 Our approach summary

We propose to combine standard pairwise clustering criteria such as AA, AC, or NC with standard regularization constraints such as geometric or MRF priors. To achieve this goal, we propose unary (linear) bounds for the pairwise clustering objectives that are easy to integrate into any standard regularization algorithms. Below we summarize our motivation and our main technical contributions.

1.4.1 Motivation and Related work

Due to significant differences in existing optimization methods, pairwise clustering (e.g. NC) and Markov Random Fields (MRF) techniques are used separately in many applications of vision and learning. They have complementary strengths and weaknesses.

For example, NC can find a balanced partitioning of data points from pairwise affinities for high-dimensional features [8, 66, 67]. In contrast, discrete MRF as well as continuous regularization methods commonly use probabilistic K-means [27] or model fitting to partition image features [24, 23, 26, 12]

. Clustering data by fitting simple parametric models seems intuitively justified when data supports such simple models,

e.g. Gaussians [24] or lines/planes [12]. But, clustering arbitrary data by fitting complex models like GMM or histograms [23, 26] seems like an idea bound to over-fit the data. Indeed, over-fitting happens even for low dimensional color features [14], see also Fig.1(b,e) and Fig.2(b). Our energy (1) allows a general MRF framework to benefit from standard pairwise clustering criteria, e.g. NC, widely used for complex data. In general, kernel-based clustering methods are a prevalent choice in the learning community as model fitting (e.g. EM) becomes intractable for high dimensions. We show potent segmentation results for basic formulations of energy (1) with higher-dimensional features like RGBXY, RGBD, RGBM where standard MRF methods using model-fitting fail.

On the other hand, standard NC applications can also benefit from additional constraints [63, 65, 68]. We show how to add a wide class of standard MRF potentials. For example, standard NC segmentation has weak alignment to contrast edges [67]. While this can be addressed by post-processing, inclusion of the standard pair-wise Potts term [10, 4] offers a principled solution. We show benefits from combining NC with lower and higher-order constraints, such as sparsity or label costs [12]. In the context of a general graph clustering, higher-order consistency terms based on a -Potts model [11] also give significant improvements.

The synergy of the joint energy (1) is illustrated by juxtaposing the use of the pixel location information (XY) in standard NC and MRF techniques. The basic pairwise Potts model typically works on the nearest-neighbor pixel grids or where XY information helps contrast/edge alignment and enforces “smooth” segment boundaries. Wider connectivity Potts leads to denser graphs with slower optimization and poorer edge localization. In contrast, common NC methods [8] augment pixel color features with location using relatively wide bandwidth for the XY dimension. This encourages segments with spatially “compact” regions. Narrower XY kernels improve edge alignment [67], but weaken regional color/feature consistency. On the other hand, very large XY kernels produce color-only clustering with spatially incoherent segments. Combining regional color consistency with spatial coherence in a single NC energy requires a compromise when selecting XY bandwidth. Our general energy (1) can separate the regional consistency (e.g. NC clustering term) from the boundary smoothness or edge alignment (e.g. Potts potential). Interestingly, it may still be useful to augment colors with XY in the NC term in (1) since XY kernel bandwidth allows to separate similar-appearance objects at distances larger than , see Sec.6.2.3.

Adding high-order term to MRF potentials in (1) differs from fully-connected CRF [69]. Like many MRF/CRF models the method in [69] lacks balancing. It is a denser version of the pairwise Potts model giving a trivial solution without a data term. In contrast, ratio-based objectives are designed for balanced clustering. In fact, our unary bounds for such objectives allow to combine them with fully connected CRF or any other MRF/CRF model with known solvers, e.g. mean field approximation [69].

Some earlier work also motivates a combination of pairwise clustering (e.g. NC) with MRF potentials like the Potts model [62]. They alter the Potts model to fit it to the standard trace-based formulation of NC. Our general bound approach can combine many pairwise clustering methods with any solvable discrete or continuous regularization potentials.

1.4.2 Main contributions

We propose a new energy model (1) for multi-label image segmentation and clustering, efficient bound optimization algorithms, and demonstrate many useful applications. Our preliminary results appear in [41] and [70]. Our main contributions are:

• We propose a general multi-label segmentation or clustering energy (1) combining pairwise clustering (e.g. NC) with standard second or higher-order MRF regularization potentials. A pairwise clustering term can enforce balanced partitioning of observed image features and MRF terms can enforce many standard regularization constraints.

• We obtain kernel (exact) and spectral (approximate) bounds for common pairwise clustering criteria providing two auxiliary functions for joint energy (1). In the context of standard MRF potentials (e.g. Potts, robust -Potts, label cost) we propose move-making algorithms for energy (1) generalizing -expansion and -swap moves666Our bounds for pairwise criteria can be also integrated into auxiliary functions with other standard regularization potentials (truncated, cardinality, TV) addressed by discrete (e.g. message passing, relaxations, mean-field approximations) or continuous (e.g. convex, primal-dual) algorithms..

• For our spectral bound777Our term spectral bound means “spectral” auxiliary function

in the context of bound optimization, not to be confused with bounds on eigenvalues.

we derive a new low-dimensional data embedding minimizing isometry errors w.r.t. affinities. These optimal embeddings complete the gap between standard KM discretization heuristics in spectral relaxation [8] and the exact KM formulations [52, 40].

• Our experiments demonstrate that typical NC applications benefit from extra MRF constraints, as well as, MRF segmentation benefit from the high-order NC term encouraging balanced partitioning of image features. In particular, our NC+MRF framework works for higher-dimensional features (e.g. RGBXY, RGBD, RGBM) where standard model-fitting clustering [23, 26, 12] fails.

The rest of the paper is organized as follows. Sections 2 and 3 present our kernel and spectral bounds for (1) and detail combinatorial move making graph cut algorithms for its optimization. Section 4 discusses a number of extensions for our bound optimization approach. Sections 5 analyses kernel and bandwidth selection strategies. Section 6 presents many experiments where either pairwise clustering criteria benefit from the additional MRF constraints or common MRF formulations benefit from an additional clustering term for various high-dimensional features.

2 Kernel Bounds

First, we review the general bound optimization principle and present basic K-means as an example. Section 2.2 derives kernel bounds for standard pairwise clustering objectives. Without loss of generality, we assume symmetric affinities since non-symmetric ones can be equivalently replaced by , e.g. see (30) in Sec.1.2.2. Positive definiteness of is not assumed and diagonal shifts are discussed when needed. Move-making bound optimization for energy (1) is discussed in Section 2.3.

2.1 Bound optimization and K-means

In general, bound optimizers are iterative algorithms that optimize auxiliary functions (upper bounds) for a given energy assuming that these auxiliary functions are more tractable than the original difficult optimization problem [71, 72, 73, 37]. Let be a current iteration index. Then is an auxiliary function of at current solution if