1 Introduction and Motivation
The terms clustering and segmentation are largely synonyms. The latter is common specifically for computer vision when data points are intensities or other features sampled at regularly spaced image pixels . The pixel’s location is essential information. Many image segmentation methods treat as a function and process intensity and location in fundamentally different ways. This applies to many regularization methods including discrete MRFbased techniques [1] and continuous variational methods [2]. For example, such methods often use pixel locations to represent the segment’s geometry/shape and intensities to represent its appearance [3, 4, 5, 6].
The term clustering is often used in the general context where data point is an arbitrary observation indexed by integer . Data clustering techniques [7] are widely used outside of vision. Variants of Kmeans, spectral or other clustering methods are also used for image segmentation. They often join the pixel’s location and its feature into one combined data point in . We focus on a wellknown general group of pairwise clustering methods [8]
based on some estimated
affinities between all pairs of points.While independently developed from different methodologies, standard regularization and pairwise clustering methods are defined by objective functions that have many complementary properties reviewed later. Our goal is to combine these functions into a joint energy applicable to image segmentation or general clustering problems. Such objectives could not be combined before due to significant differences in the underlying optimization methods, e.g. combinatorial graph cut versus spectral relaxation. While focused on MRF regularization, our approach to integrating pairwise clustering is based on a bound formulation that easily extends to any regularization functional with existing solvers.
We use basic notation applicable to either image segmentation or general data clustering. Let be a set of pixels, voxels, or any other points . For example, for 2D images could be a subset of regularly spaced points in . Set could also represent data points indices. Assume that every comes with an observed feature . For example, could be a greyscale intensity in , an RGB color in , or any other highdimensional observation. If needed, could also include the pixel’s location.
A segmentation of can be equivalently represented either as a labeling including integer node labels or as a partitioning of set into nonoverlapping subsets or segments . With minor abuse of notation, we will also use
as indicator vectors in
.We combine standard pairwise clustering criteria such as Average Association (AA) or Normalized Cut (NC) [8] and common regularization functionals such as pairwise or highorder MRF potentials [1, 9]. The general form of our joint energy is
(1) 
where the first term is some pairwise clustering objective based on data affinity matrix or kernel with defined by some similarity function . The second term in (1) is a general formulation of MRF potentials [10, 11, 12]. Constant is a relative weight of this term. Subset represents a factor typically consisting of pixels with nearby locations. Factor labels is a restriction of labeling to . Potentials for a given set of factors represent various forms of unary, second, or higher order constraints, where factor size defines the order. Factor features often work as parameters for potential . Section 1.1 reviews standard MRF potentials. Note that standard clustering methods encourage balanced segments using ratiobased objectives. They correspond to highorder potential of order in (1). Sections 1.2 and 1.3 review standard clustering objectives used in our paper.
1.1 Overview of MRF regularization
Probably the most basic MRF regularization potential corresponds to the secondorder Potts model [10] used for edge alignment
(2) 
where a set of pairwise factors includes edges between pairs of neighboring nodes and are Iverson brackets. Weight is a discontinuity penalty between and . It could be a constant or may be set by a decreasing function of intensity difference attracting the segmentation boundary to image contrast edges [4]. This is similar to the imagebased boundary length in geodesic contours [3, 5].
A useful bin consistency constraint enforced by the Potts model [11] is defined over an arbitrary collection of highorder factors . Factors correspond to predefined subsets of nodes such as superpixels [11] or bins of pixels with the same color/feature [13, 14]. The model penalizes inconsistency in segmentation of each factor
(3) 
where is some threshold and is the cardinality of the largest segment inside . Potential (3) has its lowest value (zero) when all nodes in each factor are within the same segment.
Standard label cost [12] is a sparsity potential defined for a single highorder factor . In its simplest form this potential penalizes the number of distinct segments (labels) in
(4) 
where could be a constant or a cost for each specific label.
Potentials (2), (3), (4) are only a few examples of regularization terms widely used in combination with powerful discrete solvers like graph cut [10], belief propagation [15], TRWS [16], LP relaxation [17, 18], or continuous methods [19, 20, 21].
Image segmentation methods often combine regularization with a likelihood term integrating segments/objects color models. For example, [4, 22] used graph cuts to combine secondorder edge alignment (2) with a unary (firstorder) appearance term
(5) 
where
are given probability distributions. Unary terms like (
5) are easy to integrate into any of the solvers above.If unknown, parameters of the models in a regularization energy including (5) are often estimated by iteratively minimizing the energy with respect to and model parameters [23, 24, 25, 26, 12]. In presence of variable model parameters, (5) can be seen as a maximum likelihood (ML) modelfitting term or a probabilistic Kmeans clustering objective [27]. The next section reviews Kmeans and other standard clustering methods.
1.2 Overview of Kmeans and clustering
Many clustering methods are based on Kmeans (KM). The most basic iterative KM algorithm [28] can be described as the blockcoordinate descent for the following mixed objective
(6) 
combining discrete variables with continuous variables representing cluster “centers”. Norm denotes the Euclidean metric. For any given the optimal centers are the means
(7) 
where denotes the segment’s cardinality. Assuming current segments the update operation giving
(8) 
defines the next solution as per standard Kmeans algorithm. This greedy descent technique converges only to a local minimum of KM objective (6), which is known to be NP hard to optimize. There are also other approximation methods. Below we review the properties of KM objective (6) independently of optimization.
The optimal centers in (7) allow to represent (6) via an equivalent objective of a single argument
(9) 
The sum of squared distances between data points and mean normalized by gives the sample variance denoted by . Formulation (9) presents the basic KM objective as a standard variance criterion
for clustering. That is, Kmeans attempts to find K compact clusters with small variance.
Kmeans can also be presented as a pairwise clustering criteria with Euclidean affinities. The sample variance can be expressed as the sum of distances between all pairs of the points. For example, plugging (7) into (9) reduces this KM objective to
(10) 
Taking the square in the denominator transforms (10) to another equivalent KM energy with Euclidean dotproduct affinities
(11) 
Note that we use and for “up to additive constant” relations.
Alternatively, Kmeans clustering can be seen as Gaussian model fitting. Formula (5
) for normal distributions with variable means
and some fixed variance(12) 
equals objective (6) up to a constant.
Various extensions of objectives (6), (9), (10), (11), or (12) lead to many powerful clustering methods such as kernel Kmeans, average association, and Normalized Cut, see Tab.I and Fig.34.
1.2.1 Probabilistic Kmeans (pKM) and model fitting
One way to generalize Kmeans is to replace squared Euclidean distance in (6) by other distortion measures leading to a general distortion energy commonly used for clustering
(13) 
The optimal value of parameter may no longer correspond to a mean. For example, the optimal for nonsquared metric is a geometric median. For exponential distortions the optimal may correspond to modes [29, 30], see Appendix B.

A seemingly different way to generalize Kmeans is to treat both means and covariance matrices for the normal distributions in (12) as variables. This corresponds to the standard elliptic Kmeans [31, 32, 12]. In this case variable model parameters and data points are not in the same space. Yet, it is still possible to present elliptic Kmeans as distortion clustering (13) with “distortion” between and defined by an operator corresponding to a likelihood function
Similar distortion measures can be defined for arbitrary probability distributions with any variable parameters . Then, distortion clustering (13) generalizes to ML model fitting objective
(14) 
which is (5) with explicit model parameters . This formulation suggests probabilistic Kmeans^{1}^{1}1The name probabilistic Kmeans in the general clustering context was coined by [27]. They formulated (14) after representing distortion energy (13) as ML fitting of Gibbs models for arbitrary integrable metrics. (pKM) as a good idiomatic name for ML model fitting or distortion clustering (13), even though the corresponding parameters are not “means”, in general.
Probabilistic Kmeans (14) is used in image segmentation with models such as elliptic Gaussians [31, 32, 12], gamma/exponential [25], or other generative models [33]. ZhuYuille [23] and GrabCut [26] use pKM with highly descriptive probability models such as GMM or histograms. Information theoretic analysis in [27] shows that in this case pKM objective (14) reduces to the standard entropy criterion for clustering
(15) 
where is the entropy of the distribution for .
Intuitively, minimization of the entropy criterion (15) favors clusters with tight or “peaked” distributions. This criterion is widely used in categorical clustering [34]
and decision trees
[35, 36] where the entropy evaluates histograms over “naturally” discrete features. However, the entropy criterion with either discrete histograms or continuous GMM densities has limitations in the context of continuous feature spaces, see Appendix C. Iterative fitting of descriptive models is highly sensitive to local minima [14, 37] and easily overfits even low dimentional features in (Fig.1b,e) or in (RGB colors, Fig.2b). This may explain why this approach to clustering is not too common in the learning community. As proposed in (1), instead of entropy criterion we will combine MRF regularization with general pairwise clustering objectives widely used for balanced partitioning of arbitrary highdimensional features [8].A. basic Kmeans (KM) (e.g. [28]) 

Variance criterion 
[1ex]
[1ex] B. probabilistic Kmeans (pKM) C. kernel Kmeans (KM) equivalent energy formulations: equivalent energy formulations: related examples: related examples: elliptic Kmeans [31, 32] Average Association or Distortion [38] geometric model fitting [12] Average Cut [8] Kmodes [29] or meanshift [39] (weak KM) Normalized Cut [8, 40] (weighted KM) Entropy criterion [23, 26] Gini criterion [35, 41] for highly descriptive models (GMMs, histograms) for smallwidth normalized kernels (see Sec.5.1)
1.2.2 Kernel Kmeans (Km) and pairwise clustering
This section reviews pairwise extensions of Kmeans (11) such as kernel Kmeans (KM) and related clustering criteria. In machine learning, KM is a well established data clustering technique [42, 43, 44, 40, 45, 46] that can identify nonlinearly separable structures. In contrast to pKM based on complex models, KM corresponds to complex (nonlinear) mappings
embedding data as points in a higherdimensional Hilbert space . The original nonlinear problem can often be solved by simple linear separators of the embedded points . Kernel Kmeans corresponds to the basic Kmeans (6) in the embedding space
(16) 
Optimal segment centers corresponding to the means
(17) 
reduce (16) to KM energy of the single variable similar to (9)
(18) 
Similarly to (10) and (11) one can write pairwise clustering criteria equivalent to (18) based on Euclidean distances or inner products , which are commonly represented via kernel function
(19) 
The (nonlinear) kernel function corresponds to the inner product in . It also defines Hilbertian metric^{2}^{2}2Such metrics can be isometrically embedded into a Hilbert space [47].
(20)  
isometric to the Euclidean metric in the embedding space. Then, pairwise formulations (10) and (11) for Kmeans in the embedding space (18) can be written with respect to the original data points using isometric kernel distance in (20)
(21) 
or using kernel function in (19)
(22) 
The definition of kernel in (19) requires embedding . Since pairwise objectives (21) and (22) are defined for any kernel function in the original data space, it is possible to formulate KM by directly specifying an affinity function or kernel rather than embedding . This is typical for KM explaining why the method is called kernel Kmeans rather than embedding Kmeans^{3}^{3}3This could be a name for some clustering techniques constructing explicit embeddings [48, 49] instead of working with pairwise affinities/kernels..
Given embedding , kernel function defined by (19) is positive semidefinite (p.s.d), that is for any . Moreover, Mercer’s theorem [50] states that p.s.d. condition for any given kernel is sufficient to guarantee that is an inner product in some Hilbert space. That is, it guarantees existence of some embedding such that (19) is satisfied. Therefore, KM objectives (18), (21), (22) are equivalently defined either by embeddings or p.s.d. kernels . Thus, kernels are commonly assumed p.s.d. However, as discussed later, pairwise clustering objective (22) is also used with non p.s.d. affinities.
To optimize KM objectives (18), (21), (22) one can use the basic KM procedure (8) iteratively minimizing mixed objective (16) explicitly using embedding
(23) 
where is the mean (17) for current segment . Equivalently, this procedure can use kernel instead of . Indeed, as in Section 8.2.2 of [51], the square of the objective in (23) is
where we use segment as an indicator vector, embedding as an embedding matrix where points are columns, and denotes the transpose. Since the crossed term is a constant at , the right hand side gives an equivalent objective for computing in (23). Using kernel matrix and indicator vector for element we get
(24) 
where the kernel matrix is directly determined by kernel
Approach (24) has quadratic complexity iterations. But, it avoids explicit highdimensional embeddings in (23) replacing them by kernel in all computations, a.k.a. the kernel trick.
Note that the implicit KM procedure (24) is guaranteed to decrease pairwise KM objectives (21) or (22) only for p.s.d. kernels. Indeed, equation (24) is derived from the standard greedy Kmeans procedure in the embedding space (23) assuming kernel (19). The backward reduction of (24) to (23) can be done only for p.s.d. kernels when Mercer’s theorem guarantees existence of some embedding such that .
Pairwise energy (21) helps to explain the positive result for KM with common Gaussian kernel in Fig.1(h). Gaussian kernel distance (red plot below)
(25) 
is a “robust” version of Euclidean metric (green) in basic Kmeans (10). Thus, Gaussian KM finds clusters with small local variances, Fig.1(h). In contrast, basic Kmeans (c) tries to find good clusters with small global variances, which is impossible for noncompact clusters.
Average association (AA) or distortion (AD): Equivalent pairwise objectives (21) and (22) suggest natural extensions of KM. For example, one can replace Hilbertian metric in (21) by an arbitrary zerodiagonal distortion matrix generating average distortion (AD) energy
(26) 
reducing to KM energy (21) for . Similarly, p.s.d. kernel in (22) can be replaced by an arbitrary pairwise similarity or affinity matrix defining standard average association (AA) energy
(27) 
reducing to KM objective (22) for . We will also use association between any two segments and
(28) 
allowing to rewrite AA energy (27) as
(29) 
The matrix expressions in (28) and (29) represent segments as indicator vectors such that iff and symbol means a transpose. Matrix notation as in (29) will be used for all pairwise clustering objectives discussed in this paper.
KM algorithm (24) is not guaranteed to decrease (27) for improper (non p.s.d.) kernel matrix , but general AA and AD energies could be useful despite optimization issues. However, [38] showed that dropping metric and proper kernel assumptions are not essential; there exist p.s.d. kernels with KM energies equivalent (up to constant) to AD (26) and AA (27) for arbitrary associations and zerodiagonal distortions , see Fig. 3.
For example, for any given affinity in (27) the diagonal shift trick of Roth et al. [38] generates the “kernel matrix”
(30) 
For sufficiently large scalar matrix is positive definite yielding a proper discrete kernel
for finite set . It is easy to check that KM energy (22) with kernel in (30) is equivalent to AA energy (27) with affinity , up to a constant. Indeed, for any indicator we have implying
Also, Section 3.1 uses eigen decomposition of to construct an explicit finitedimensional Euclidean embedding^{4}^{4}4Mercer’s theorem is a similar eigen decomposition for continuous p.d. kernels giving infinitedimensional Hilbert embedding . Discrete kernel embedding in Sec. 3.1 (56) has finite dimension , which is still much higher than the dimension of points , e.g. for colors. Sec. 3.1 also shows lower dimensional embeddings approximating isometry (20). satisfying isometry (20) for any p.d. discrete kernel . Minimizing KM energy (18) over such embedding isometric to in (30) is equivalent to optimizing (22) and, therefore, (27).
Since average distortion energy (26) for arbitrary is equivalent to average association for , it can also be converted to KM with a proper kernel [38]. Using the corresponding kernel matrix (30) and (20) it is easy to derive Hilbertian distortion (metric) equivalent to original distortions
(31) 
For simplicity and without loss of generality, the rest of the paper assumes symmetric affinities since nonsymmetric ones can be equivalently replaced by . However, we do not assume positive definiteness and discuss diagonal shifts, if needed.
Weighted KM and weighted AA: Weighted Kmeans [28] is a common extension of KM techniques incorporating some given point weights . In the context of embedded points it corresponds to weighted KM iteratively minimizing the weighted version of the mixed objective in (16)
(32) 
Optimal segment centers are now weighted means
(33) 
where the matrix formulation has weights represented by column vector and diagonal matrix . Assuming a finite dimensional data embedding this formulation uses embedding matrix with column vectors . This notation implies two simple identities used in (33)
(34) 
Inserting weighted means (33) into mixed objective (32) produces a pairwise energy formulation for weighted KM similar to (22)
(35)  
where p.s.d kernel matrix corresponds to the dot products in the embedding space, i.e. .
Replacing the p.s.d. kernel with an arbitrary affinity matrix defines a weighted AA objective generalizing (27) and (29)
(37) 
Weighted AD can also be defined. Equivalence of KM, AA, and AD in the general weighted case is discussed in Appendix A.
Other pairwise clusteing criteria: Besides AA there are many other standard pairwise clustering criteria defined by affinity matrices . For example, Average Cut (AC)
(38)  
where is a degree matrix defined by node degrees vector . The formulation on the last line (38) comes from the following identity valid for Boolean
Normalized Cut (NC) [8] in (39) is another wellknown pairwise clustering criterion. Both AC and NC can also be reduced to KM [38, 52, 40]. However, spectral relaxation [8] is more common optimization method for pairwise clustering objectives than iterative KM procedure (24). Due to popularity of NC and spectral relaxation we review them in a dedicated Section 1.3.
1.2.3 Pairwise vs. pointwise distortions
Equivalence of KM to pairwise distortion criteria (26) helps to juxtapose kernel Kmeans with probabilistic Kmeans (Sec.1.2.1) from one more point of view. Both methods generalize the basic Kmeans (6), (10) by replacing the Euclidean metric with a more general distortion measure . While pKM uses “pointwise” formulation (13) where measures distortion between a point and a model, KM uses “pairwise” formulation (21) where measures distortion between pairs of points.
These two different formulations are equivalent for Euclidean distortion (i.e. basic Kmeans), but the pairwise approach is strictly stronger than the pointwise version using the same Hilbertian distortion in nonEuclidean cases (see Appendix B). The corresponding pointwise approach is often called weak kernel Kmeans. Interestingly, weak KM with standard Gaussian kernel can be seen as Kmodes [29], see Fig. 1(g). Appendix B also details relations between Kmodes and popular meanshift clustering [39]. An extended version of Table I including weighted KM and weak KM techniques is given in Figure 34.
1.3 Overview of Normalized Cut and spectral clustering
Section 1.2.2 has already discussed KM and many related pairwise clustering criteria based on specified affinities . This section is focused on a related pairwise clustering method, Normalized Cut (NC) [8]. Shi and Malik [8] also popularized pairwise clustering optimization via spectral relaxation, which is different from iterative Kmeans algorithms (23) (24). Note that there are many other popular optimization methods for different clustering energies using pairwise affinities [53, 54, 55, 56, 57], which are outside the scope of this work.
1.3.1 Nc objective and its relation to Aa and Km
Normalized Cut (NC) energy [8] can be written as
(39) 
where association (28) is defined by a given affinity matrix . The second matrix formulation above shows that the difference between NC and AA (29) is in the normalization. AA objective uses denominator , which is th segment’s size. NC (39) normalizes by weighted size. Indeed, using
where weights are node degrees
(40) 
For convenience, NC objective can be formatted similarly to (29)
(41) 
For some common types of affinities where , e.g. nearest neighbor () graphs, NC and AA objectives (41) and (29) are equivalent. More generally, Bach & Jordan [52], Dhillon et al. [40] showed that NC objective can be reduced to weighted AA or KM with specific weights and affinities.
Our matrix notation makes equivalence between NC (41) and weighted AA (37) straightforward. Indeed, objective (41) with affinity coincides with (37) for weights and affinity
(42) 
The weighted version of KM procedure (24) detailed in Appendix A minimizes weighted AA (37) only for p.s.d. affinities, but positive definiteness of is not critical. For example, an extension of the diagonal shift (30) [38] can convert NC (41) with arbitrary (symmetric) to an equivalent NC objective with p.s.d. affinity
(43) 
using degree matrix and sufficiently large . Indeed, for indicators we have and
Positive definite (43) implies p.d. affinity (42) of weighted AA
(44) 
The weighted version of KM procedure (24) for this p.d. kernel [58] greedily optimizes NC objective (41) for any (symmetric) .
1.3.2 Spectral relaxation and other optimization methods
There are many methods for approximate optimization of NPhard pairwise clustering energies besides greedy Kmean procedures. In particular, Shi, Malik, and Yu [8, 59] popularized spectral relaxation methods in the context of normalized cuts. Such methods also apply to AA and other problems [8]. For example, similarly to [59] one can rewrite AA energy (27) as
where is a matrix of normalized indicator vectors . Orthogonality implies where
is an identity matrix of size
. Minimization of the trace energy above with relaxed constrained to a “unit sphere” is a simple representative example of spectral relaxation in the context of AA. This relaxed trace optimization is a generalization of Rayleigh quotient problem that has an exact closed form solution in terms oflargest eigenvectors for matrix
. This approach extends to general multilabel weighted AA and related graph clustering problems, e.g. AC and NC [8, 59]. The main computational difficulties for spectral relaxation methods are explicit eigen decomposition for large matrices and integrality gap  there is a final heuristicsbased discretization step for extracting an integer solution for the original combinatorial problem from an optimal relaxed solution. For example, one basic discretization heuristic is to run Kmeans over the rowvectors of the optimal relaxed
.Other optimization techniques are also used for pairwise clustering. Buhmann et al. [60, 38] address the general AD and AA energies via mean field approximation and deterministic annealing. It can be seen as a fuzzy version^{5}^{5}5Fuzzy version of Kmeans in Duda et al. [61] generalizes to KM. of the kernel Kmeans algorithm. In the context of normalized cuts Dhillon et al. [40, 58] use spectral relaxation to initialize KM algorithm.
In computer vision it is common to combine various constraints or energy terms into one objective function. Similar efforts were made in the context of pairwise clustering as well. For example, to combine KM or NC objectives with Potts regularization [62] normalizes the corresponding pairwise constraints by cluster sizes. This alters the Potts model to fit the problem to a standard tracebased formulation.
Adding nonhomogeneous linear constraints into spectral relaxation techniques also requires approximations [63] or model modifications [64]. Exact optimization for the relaxed quadratic ratios (including NC) with arbitrary linear equality constraints is possible by solving a sequence of spectral problems [65].
Our bound optimization approach allows to combine many standard pairwise clustering objectives with any regularization terms with existing solvers. In our framework such pairwise clustering objectives are interpreted as highorder energy terms approximated via linear upper bounds during optimization.
1.4 Our approach summary
We propose to combine standard pairwise clustering criteria such as AA, AC, or NC with standard regularization constraints such as geometric or MRF priors. To achieve this goal, we propose unary (linear) bounds for the pairwise clustering objectives that are easy to integrate into any standard regularization algorithms. Below we summarize our motivation and our main technical contributions.
1.4.1 Motivation and Related work
Due to significant differences in existing optimization methods, pairwise clustering (e.g. NC) and Markov Random Fields (MRF) techniques are used separately in many applications of vision and learning. They have complementary strengths and weaknesses.
For example, NC can find a balanced partitioning of data points from pairwise affinities for highdimensional features [8, 66, 67]. In contrast, discrete MRF as well as continuous regularization methods commonly use probabilistic Kmeans [27] or model fitting to partition image features [24, 23, 26, 12]
. Clustering data by fitting simple parametric models seems intuitively justified when data supports such simple models,
e.g. Gaussians [24] or lines/planes [12]. But, clustering arbitrary data by fitting complex models like GMM or histograms [23, 26] seems like an idea bound to overfit the data. Indeed, overfitting happens even for low dimensional color features [14], see also Fig.1(b,e) and Fig.2(b). Our energy (1) allows a general MRF framework to benefit from standard pairwise clustering criteria, e.g. NC, widely used for complex data. In general, kernelbased clustering methods are a prevalent choice in the learning community as model fitting (e.g. EM) becomes intractable for high dimensions. We show potent segmentation results for basic formulations of energy (1) with higherdimensional features like RGBXY, RGBD, RGBM where standard MRF methods using modelfitting fail.On the other hand, standard NC applications can also benefit from additional constraints [63, 65, 68]. We show how to add a wide class of standard MRF potentials. For example, standard NC segmentation has weak alignment to contrast edges [67]. While this can be addressed by postprocessing, inclusion of the standard pairwise Potts term [10, 4] offers a principled solution. We show benefits from combining NC with lower and higherorder constraints, such as sparsity or label costs [12]. In the context of a general graph clustering, higherorder consistency terms based on a Potts model [11] also give significant improvements.
The synergy of the joint energy (1) is illustrated by juxtaposing the use of the pixel location information (XY) in standard NC and MRF techniques. The basic pairwise Potts model typically works on the nearestneighbor pixel grids or where XY information helps contrast/edge alignment and enforces “smooth” segment boundaries. Wider connectivity Potts leads to denser graphs with slower optimization and poorer edge localization. In contrast, common NC methods [8] augment pixel color features with location using relatively wide bandwidth for the XY dimension. This encourages segments with spatially “compact” regions. Narrower XY kernels improve edge alignment [67], but weaken regional color/feature consistency. On the other hand, very large XY kernels produce coloronly clustering with spatially incoherent segments. Combining regional color consistency with spatial coherence in a single NC energy requires a compromise when selecting XY bandwidth. Our general energy (1) can separate the regional consistency (e.g. NC clustering term) from the boundary smoothness or edge alignment (e.g. Potts potential). Interestingly, it may still be useful to augment colors with XY in the NC term in (1) since XY kernel bandwidth allows to separate similarappearance objects at distances larger than , see Sec.6.2.3.
Adding highorder term to MRF potentials in (1) differs from fullyconnected CRF [69]. Like many MRF/CRF models the method in [69] lacks balancing. It is a denser version of the pairwise Potts model giving a trivial solution without a data term. In contrast, ratiobased objectives are designed for balanced clustering. In fact, our unary bounds for such objectives allow to combine them with fully connected CRF or any other MRF/CRF model with known solvers, e.g. mean field approximation [69].
Some earlier work also motivates a combination of pairwise clustering (e.g. NC) with MRF potentials like the Potts model [62]. They alter the Potts model to fit it to the standard tracebased formulation of NC. Our general bound approach can combine many pairwise clustering methods with any solvable discrete or continuous regularization potentials.
1.4.2 Main contributions
We propose a new energy model (1) for multilabel image segmentation and clustering, efficient bound optimization algorithms, and demonstrate many useful applications. Our preliminary results appear in [41] and [70]. Our main contributions are:

We propose a general multilabel segmentation or clustering energy (1) combining pairwise clustering (e.g. NC) with standard second or higherorder MRF regularization potentials. A pairwise clustering term can enforce balanced partitioning of observed image features and MRF terms can enforce many standard regularization constraints.

We obtain kernel (exact) and spectral (approximate) bounds for common pairwise clustering criteria providing two auxiliary functions for joint energy (1). In the context of standard MRF potentials (e.g. Potts, robust Potts, label cost) we propose movemaking algorithms for energy (1) generalizing expansion and swap moves^{6}^{6}6Our bounds for pairwise criteria can be also integrated into auxiliary functions with other standard regularization potentials (truncated, cardinality, TV) addressed by discrete (e.g. message passing, relaxations, meanfield approximations) or continuous (e.g. convex, primaldual) algorithms..

For our spectral bound^{7}^{7}7Our term spectral bound means “spectral” auxiliary function
in the context of bound optimization, not to be confused with bounds on eigenvalues.
we derive a new lowdimensional data embedding minimizing isometry errors w.r.t. affinities. These optimal embeddings complete the gap between standard KM discretization heuristics in spectral relaxation [8] and the exact KM formulations [52, 40]. 
Our experiments demonstrate that typical NC applications benefit from extra MRF constraints, as well as, MRF segmentation benefit from the highorder NC term encouraging balanced partitioning of image features. In particular, our NC+MRF framework works for higherdimensional features (e.g. RGBXY, RGBD, RGBM) where standard modelfitting clustering [23, 26, 12] fails.
The rest of the paper is organized as follows. Sections 2 and 3 present our kernel and spectral bounds for (1) and detail combinatorial move making graph cut algorithms for its optimization. Section 4 discusses a number of extensions for our bound optimization approach. Sections 5 analyses kernel and bandwidth selection strategies. Section 6 presents many experiments where either pairwise clustering criteria benefit from the additional MRF constraints or common MRF formulations benefit from an additional clustering term for various highdimensional features.
objective  matrix formulation  concave relaxation  and  kernel bound 
in  (51)  in Lemma 1  for at  
AA (29)  ,  
AC (38)  ,  
for in (52)  
NC (41)  , 
2 Kernel Bounds
First, we review the general bound optimization principle and present basic Kmeans as an example. Section 2.2 derives kernel bounds for standard pairwise clustering objectives. Without loss of generality, we assume symmetric affinities since nonsymmetric ones can be equivalently replaced by , e.g. see (30) in Sec.1.2.2. Positive definiteness of is not assumed and diagonal shifts are discussed when needed. Movemaking bound optimization for energy (1) is discussed in Section 2.3.
2.1 Bound optimization and Kmeans
In general, bound optimizers are iterative algorithms that optimize auxiliary functions (upper bounds) for a given energy assuming that these auxiliary functions are more tractable than the original difficult optimization problem [71, 72, 73, 37]. Let be a current iteration index. Then is an auxiliary function of at current solution if
(45a)  