# Constrained 1-Spectral Clustering

An important form of prior information in clustering comes in the form of cannot-link and must-link constraints. We present a generalization of the popular spectral clustering technique which integrates such constraints. Motivated by the recently proposed 1-spectral clustering for the unconstrained problem, our method is based on a tight relaxation of the constrained normalized cut into a continuous optimization problem. In contrast to all other methods that have been suggested for constrained spectral clustering, we can always guarantee that all constraints are satisfied. Moreover, our soft formulation allows optimizing a trade-off between normalized cut and the number of violated constraints. An efficient implementation is provided which scales to large datasets. We consistently outperform all other proposed methods in the experiments.

## 1 Introduction

The task of clustering is to find a natural grouping of items given, e.g., pairwise similarities. In real-world problems, such a natural grouping is often hard to discover from the given similarities alone, or there is more than one way to group the given items. In either case, clustering methods benefit from domain knowledge that biases the result towards the desired clustering. Wagstaff et al. [17] were the first to consider constrained clustering by encoding available domain knowledge in the form of pairwise must-link (ML, for short) and cannot-link (CL) constraints. By incorporating these constraints into $k$-means they achieve much better performance. Since acquiring such constraint information is relatively easy, constrained clustering has become an active area of research; see [1] for an overview.

Spectral clustering is a graph-based clustering algorithm originally derived as a relaxation of the NP-hard normalized cut problem. The spectral relaxation leads to an eigenproblem for the graph Laplacian, see [7, 14, 16]. However, it is known that the spectral relaxation can be quite loose [6]. More recently, it has been shown that one can equivalently rewrite the discrete (combinatorial) normalized Cheeger cut problem as a continuous optimization problem using the nonlinear 1-graph Laplacian [8, 15], which yields much better cuts than the spectral relaxation. In further work it is shown that all balanced graph cut problems, including normalized cut, have a tight relaxation into a continuous optimization problem [9].

The first approach to integrate constraints into spectral clustering was based on the idea of modifying the weight matrix in order to enforce the must-link and cannot-link constraints and then solving the resulting unconstrained problem [11]. Another idea is to adapt the embedding obtained from the first eigenvectors of the graph Laplacian [12]. Closer to the original normalized graph cut problem are the approaches that start with the optimization problem of the spectral relaxation and add constraints that encode must-links and cannot-links [20, 5, 19, 18]. Furthermore, the case where the constraints are allowed to be inconsistent is considered in [4].

In this paper we contribute in various ways to the area of graph-based constrained learning. First, we show in the spirit of 1-spectral clustering [8, 9] that the constrained normalized cut problem has a tight relaxation as an unconstrained continuous optimization problem. Our method, which we call COSC, is the first in the field of constrained spectral clustering that can guarantee that all given constraints are fulfilled. While we present arguments that in practice it is the best choice to satisfy all constraints even if the data is noisy, in the case of inconsistent or unreliable constraints one should refrain from doing so. Thus our second contribution is to show that our framework can be extended to handle degree-of-belief and even inconsistent constraints. In this case COSC optimizes a trade-off between having a small normalized cut and a small number of violated constraints. We present an efficient implementation of COSC based on an optimization technique proposed in [9] which scales to large datasets. While the continuous optimization problem is non-convex and thus convergence to the global optimum is not guaranteed, we can show that our method either improves any given partition which satisfies all constraints or stops after one iteration.

All omitted proofs and additional experimental results can be found in the supplementary material.

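Notation. Set functions are denoted by a hat, $\hat{S}$, while the corresponding extension is $S$. In this paper, we consider the normalized cut problem with general vertex weights. Formally, let $G(V,E,w,b)$ be an undirected graph with vertex set $V$ and edge set $E$ together with edge weights $w: V \times V \to \mathbb{R}_+$ and vertex weights $b: V \to \mathbb{R}_+$, and let $n = |V|$. Let $C \subset V$ and denote by $\overline{C} = V \setminus C$ its complement. We define respectively the cut, the generalized volume and the normalized cut (with general vertex weights) of a partition $(C, \overline{C})$ as

$$\mathrm{cut}(C,\overline{C}) = 2\sum_{i\in C,\, j\in\overline{C}} w_{ij}, \qquad \mathrm{gvol}(C) = \sum_{i\in C} b_i, \qquad \mathrm{bal}(C) = 2\,\frac{\mathrm{gvol}(C)\,\mathrm{gvol}(\overline{C})}{\mathrm{gvol}(V)}, \qquad \mathrm{NCut}(C,\overline{C}) = \frac{\mathrm{cut}(C,\overline{C})}{\mathrm{bal}(C)}.$$

We obtain ratio cut and normalized cut for the special cases of the vertex weights $b_i = 1$ and $b_i = d_i$, where $d_i = \sum_{j=1}^n w_{ij}$ is the degree of vertex $i$, respectively. In the ratio cut case, $\mathrm{gvol}(C)$ is the cardinality of $C$, and in the normalized cut case it is the volume of $C$, denoted by $\mathrm{vol}(C)$.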
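The definitions above translate directly into a few lines of code. The following is a minimal sketch in plain Python; the function name `ncut` and the dense list-of-lists representation of the edge weights are illustrative assumptions, not part of the paper.

```python
def ncut(W, b, C):
    """NCut(C, C-bar) with general vertex weights, following the definitions above.

    W: symmetric n x n edge weights (list of lists); b: vertex weights;
    C: iterable of vertex indices. Names and data layout are illustrative.
    """
    n = len(W)
    C = set(C)
    C_bar = set(range(n)) - C
    if not C or not C_bar:
        raise ValueError("the partition must be non-trivial")
    # cut(C, C-bar) = 2 * sum of edge weights crossing the partition
    cut = 2.0 * sum(W[i][j] for i in C for j in C_bar)
    gvol_C = sum(b[i] for i in C)
    gvol_C_bar = sum(b[i] for i in C_bar)
    # bal(C) = 2 gvol(C) gvol(C-bar) / gvol(V)
    bal = 2.0 * gvol_C * gvol_C_bar / (gvol_C + gvol_C_bar)
    return cut / bal


# Path graph 0 - 1 - 2 - 3 with unit edge weights.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
ones = [1, 1, 1, 1]                 # b_i = 1: ratio-cut case
degrees = [sum(row) for row in W]   # b_i = d_i: normalized-cut case

ratio_value = ncut(W, ones, {0, 1})
normalized_value = ncut(W, degrees, {0, 1})
```

Only the choice of `b` distinguishes the two criteria; the rest of the computation is shared.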

## 2 The Constrained Normalized Cut Problem

We consider the normalized cut problem with must-link and cannot-link constraints. Let $G(V,E,w,b)$ denote the given graph and $Q^m$, $Q^c$ be the constraint matrices, where the element $q^m_{ij}$ (or $q^c_{ij}$) $\in \{0,1\}$ specifies the must-link (or cannot-link) constraint between $i$ and $j$. In the following we always assume that $G$ is connected. Everything stated below, including our suggested algorithm, can be easily generalized to degree-of-belief constraints by allowing $q^m_{ij}$ (and $q^c_{ij}$) $\in [0,1]$. However, in the following we consider only $q^m_{ij}$ (and $q^c_{ij}$) $\in \{0,1\}$, in order to keep the theoretical statements more accessible.

###### Definition 2.1.

We call a partition $(C, \overline{C})$ consistent if it satisfies all constraints in $Q^m$ and $Q^c$.

Then the constrained normalized cut problem is to minimize $\mathrm{NCut}(C,\overline{C})$ over all consistent partitions. If the constraints are unreliable or inconsistent, one can relax this problem and optimize a trade-off between normalized cut and the number of violated constraints. In this paper, we address both problems in a common framework.

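We define the set functions $\hat{M}, \hat{N}: 2^V \to \mathbb{R}$ as

$$\hat{M}(C) := 2\sum_{i\in C,\, j\in\overline{C}} q^m_{ij}, \qquad \hat{N}(C) := \sum_{i\in C,\, j\in C} q^c_{ij} + \sum_{i\in\overline{C},\, j\in\overline{C}} q^c_{ij} = \mathrm{vol}(Q^c) - 2\sum_{i\in C,\, j\in\overline{C}} q^c_{ij},$$

where $\mathrm{vol}(Q^c) = \sum_{i,j=1}^n q^c_{ij}$. $\hat{M}(C)$ and $\hat{N}(C)$ are equal to twice the number of violated must-link and cannot-link constraints of the partition $(C, \overline{C})$, respectively.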

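As we show in the following, both the constrained normalized cut problem and its soft version can be addressed by minimization of $\hat{F}_\gamma$ defined as

$$\hat{F}_\gamma(C) = \frac{\mathrm{cut}(C,\overline{C}) + \gamma\big(\hat{M}(C) + \hat{N}(C)\big)}{\mathrm{bal}(C)}, \qquad (1)$$

where $\gamma \in \mathbb{R}_+$. Note that $\hat{F}_\gamma(C) = \mathrm{NCut}(C,\overline{C})$ if $(C,\overline{C})$ is consistent. Thus the minimization of $\hat{F}_\gamma$ corresponds to a trade-off between having a small normalized cut and satisfying all constraints, parameterized by $\gamma$.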
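To make the penalized objective concrete, here is a small sketch that evaluates the must-link penalty, the cannot-link penalty and the objective (1) for a given partition. The function name `f_hat` and the symmetric 0/1 constraint matrices `Qm`, `Qc` stored as lists of lists are assumptions made for illustration.

```python
def f_hat(W, b, Qm, Qc, C, gamma):
    """Evaluate objective (1): (cut + gamma * (M_hat + N_hat)) / bal."""
    n = len(W)
    C = set(C)
    C_bar = set(range(n)) - C
    if not C or not C_bar:
        raise ValueError("the partition must be non-trivial")
    cut = 2.0 * sum(W[i][j] for i in C for j in C_bar)
    # Twice the number of violated must-links (pairs that are separated).
    M_hat = 2.0 * sum(Qm[i][j] for i in C for j in C_bar)
    # Twice the number of violated cannot-links (pairs that stay together).
    vol_Qc = sum(sum(row) for row in Qc)
    N_hat = vol_Qc - 2.0 * sum(Qc[i][j] for i in C for j in C_bar)
    gvol_C = sum(b[i] for i in C)
    gvol_C_bar = sum(b[i] for i in C_bar)
    bal = 2.0 * gvol_C * gvol_C_bar / (gvol_C + gvol_C_bar)
    return (cut + gamma * (M_hat + N_hat)) / bal


# Path graph 0 - 1 - 2 - 3, must-link (0, 1), cannot-link (0, 3).
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
b = [1, 1, 1, 1]
Qm = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
Qc = [[0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0]]

consistent_value = f_hat(W, b, Qm, Qc, {0, 1}, gamma=1.0)  # no violations
violating_value = f_hat(W, b, Qm, Qc, {0}, gamma=1.0)      # must-link (0, 1) is cut
```

For the consistent partition the penalty vanishes and the value coincides with the normalized cut, as noted in the text.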

The relation between the parameter $\gamma$ and the number of constraints violated by the partition minimizing $\hat{F}_\gamma$ is quantified in the following lemma.

###### Lemma 2.1.

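Let $(C,\overline{C})$ be consistent and $\lambda = \mathrm{NCut}(C,\overline{C})$. If $\gamma \geq \frac{\mathrm{gvol}(V)}{4(l+1)}\lambda$, then any minimizer of $\hat{F}_\gamma$ violates no more than $l$ constraints.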

###### Proof.

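Any partition $(C,\overline{C})$ that violates more than $l$ constraints satisfies $\hat{M}(C) + \hat{N}(C) \geq 2(l+1)$ and thus

$$\hat{F}_\gamma(C) = \mathrm{NCut}(C,\overline{C}) + \gamma\,\frac{\hat{M}(C)+\hat{N}(C)}{\mathrm{bal}(C)} \;\geq\; \mathrm{NCut}(C,\overline{C}) + \gamma\,\frac{2(l+1)}{\frac{1}{2}\mathrm{gvol}(V)} \;>\; \frac{4\gamma(l+1)}{\mathrm{gvol}(V)},$$

where we have used that $\mathrm{bal}(C) \leq \frac{1}{2}\mathrm{gvol}(V)$ and that $\mathrm{NCut}(C,\overline{C}) > 0$ as the graph is connected. Assume now that the partition $(D,\overline{D})$ minimizes $\hat{F}_\gamma$ for $\gamma \geq \frac{\mathrm{gvol}(V)}{4(l+1)}\lambda$ and violates more than $l$ constraints. Then

$$\hat{F}_\gamma(D) > \frac{4\gamma(l+1)}{\mathrm{gvol}(V)} \geq \lambda = \hat{F}_\gamma(C),$$

which contradicts the optimality of $(D,\overline{D})$. ∎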

Note that it is easy to construct a partition which is consistent, and thus the above choice of $\gamma$ is constructive. The following theorem is immediate from the above lemma for the special case $l = 0$.

###### Theorem 2.1.

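Let $(C,\overline{C})$ be consistent with the given constraints and $\lambda = \mathrm{NCut}(C,\overline{C})$. Then for $\gamma \geq \frac{\mathrm{gvol}(V)}{4}\lambda$, it holds that

$$\mathop{\mathrm{arg\,min}}_{C\subset V:\ (C,\overline{C})\ \text{consistent}} \mathrm{NCut}(C,\overline{C}) \;=\; \mathop{\mathrm{arg\,min}}_{C\subset V} \hat{F}_\gamma(C)$$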

and the optimum values of both problems are equal.

Thus the constrained normalized cut problem can be equivalently formulated as the combinatorial problem of minimizing $\hat{F}_\gamma$. In the next section we will show that this problem allows for a tight relaxation into a continuous optimization problem.
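On a tiny graph the equivalence can be checked by brute force. The sketch below enumerates all non-trivial partitions of a four-vertex path graph with a single cannot-link constraint, picks $\gamma = \frac{\mathrm{gvol}(V)}{4}\lambda$ from an arbitrary consistent starting partition, and compares the two minimizers. All helper names are hypothetical, and exhaustive enumeration is used only for illustration since it does not scale.

```python
from itertools import combinations


def f_hat(W, b, Qm, Qc, C, gamma):
    """Penalized objective (1); equals NCut for gamma = 0."""
    n = len(W)
    C_bar = set(range(n)) - set(C)
    cut = 2.0 * sum(W[i][j] for i in C for j in C_bar)
    M_hat = 2.0 * sum(Qm[i][j] for i in C for j in C_bar)
    N_hat = sum(map(sum, Qc)) - 2.0 * sum(Qc[i][j] for i in C for j in C_bar)
    gC, gCb = sum(b[i] for i in C), sum(b[i] for i in C_bar)
    return (cut + gamma * (M_hat + N_hat)) / (2.0 * gC * gCb / (gC + gCb))


def is_consistent(Qm, Qc, C):
    """True if no must-link is separated and no cannot-link is kept together."""
    n = len(Qm)
    side = [i in C for i in range(n)]
    return all(not (Qm[i][j] and side[i] != side[j]) and
               not (Qc[i][j] and side[i] == side[j])
               for i in range(n) for j in range(n))


def subsets_with_vertex_0(n):
    """All non-trivial C containing vertex 0 (one representative per partition)."""
    for k in range(n - 1):
        for extra in combinations(range(1, n), k):
            yield frozenset((0,) + extra)


W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
b = [1, 1, 1, 1]
Qm = [[0] * 4 for _ in range(4)]
Qc = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]  # cannot-link (0, 1)

start = frozenset({0, 3})                      # some consistent partition
lam = f_hat(W, b, Qm, Qc, start, 0.0)          # its normalized cut
gamma = sum(b) / 4.0 * lam                     # threshold from the theorem

candidates = list(subsets_with_vertex_0(4))
best_soft = min(candidates, key=lambda C: f_hat(W, b, Qm, Qc, C, gamma))
best_hard = min((C for C in candidates if is_consistent(Qm, Qc, C)),
                key=lambda C: f_hat(W, b, Qm, Qc, C, 0.0))
```

In this example the unconstrained minimizer of the penalized objective is itself consistent and coincides with the constrained optimum.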

### 2.1 A tight continuous relaxation of $\hat{F}_\gamma$

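Minimizing $\hat{F}_\gamma$ is a hard combinatorial problem. In the following, we derive an equivalent continuous optimization problem. Let $f \in \mathbb{R}^n$ denote a function on $V$, and let $\mathbf{1}_C$ denote the vector that is 1 on $C$ and 0 elsewhere. Define

$$M(f) := \sum_{i,j=1}^n q^m_{ij}\,|f_i - f_j| \qquad \text{and} \qquad N(f) := \mathrm{vol}(Q^c)\big(\max(f) - \min(f)\big) - \sum_{i,j=1}^n q^c_{ij}\,|f_i - f_j|,$$

where $\max(f)$ and $\min(f)$ are respectively the maximum and minimum elements of $f$. Note that $M(\mathbf{1}_C) = \hat{M}(C)$ and $N(\mathbf{1}_C) = \hat{N}(C)$ for any non-trivial partition $(C,\overline{C})$. (A partition $(C,\overline{C})$ is non-trivial if neither $C = \emptyset$ nor $C = V$.)

Let $B$ denote the diagonal matrix with the vertex weights $b$ on the diagonal. We define

$$F_\gamma(f) := \frac{\sum_{i,j=1}^n w_{ij}\,|f_i - f_j| + \gamma M(f) + \gamma N(f)}{\big\| B\big(f - \frac{1}{\mathrm{gvol}(V)}\langle f, b\rangle \mathbf{1}\big)\big\|_1}.$$

We denote the numerator of $F_\gamma(f)$ by $R_\gamma(f)$ and the denominator by $S(f)$.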

###### Lemma 2.2.

For any non-trivial partition $(C,\overline{C})$ it holds that $F_\gamma(\mathbf{1}_C) = \hat{F}_\gamma(C)$.

###### Proof.

We have,

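$$\sum_{i,j=1}^n w_{ij}\,\big|(\mathbf{1}_C)_i - (\mathbf{1}_C)_j\big| = \mathrm{cut}(C,\overline{C}),$$

$$\begin{aligned}
\Big\| B\Big(\mathbf{1}_C - \tfrac{1}{\mathrm{gvol}(V)}\langle \mathbf{1}_C, b\rangle \mathbf{1}\Big)\Big\|_1
&= \Big\| B\Big(\big(1 - \tfrac{\mathrm{gvol}(C)}{\mathrm{gvol}(V)}\big)\mathbf{1}_C - \tfrac{\mathrm{gvol}(C)}{\mathrm{gvol}(V)}\mathbf{1}_{\overline{C}}\Big)\Big\|_1 \\
&= \big(1 - \tfrac{\mathrm{gvol}(C)}{\mathrm{gvol}(V)}\big)\,\mathrm{gvol}(C) + \tfrac{\mathrm{gvol}(C)}{\mathrm{gvol}(V)}\,\mathrm{gvol}(\overline{C}) = \mathrm{bal}(C).
\end{aligned}$$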

This together with the discussion on $M$ and $N$ finishes the proof. ∎
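The identity of Lemma 2.2 can be checked numerically. The sketch below assumes that the continuous functional is the ratio whose numerator and denominator are exactly the two quantities evaluated in the proof above (total variation plus $\gamma(M+N)$, over the weighted $\ell_1$ deviation from the $b$-weighted mean); the function names are illustrative.

```python
def F_continuous(W, b, Qm, Qc, f, gamma):
    """Continuous functional: (sum_ij w_ij|f_i-f_j| + gamma(M+N)) / ||B(f - mean)||_1."""
    idx = range(len(f))
    tv = sum(W[i][j] * abs(f[i] - f[j]) for i in idx for j in idx)
    M = sum(Qm[i][j] * abs(f[i] - f[j]) for i in idx for j in idx)
    vol_Qc = sum(sum(row) for row in Qc)
    N = vol_Qc * (max(f) - min(f)) - sum(
        Qc[i][j] * abs(f[i] - f[j]) for i in idx for j in idx)
    weighted_mean = sum(b[i] * f[i] for i in idx) / sum(b)
    S = sum(b[i] * abs(f[i] - weighted_mean) for i in idx)
    return (tv + gamma * (M + N)) / S


def F_set(W, b, Qm, Qc, C, gamma):
    """Set functional (1) evaluated on the partition (C, C-bar)."""
    C_bar = set(range(len(W))) - set(C)
    cut = 2.0 * sum(W[i][j] for i in C for j in C_bar)
    M_hat = 2.0 * sum(Qm[i][j] for i in C for j in C_bar)
    N_hat = sum(map(sum, Qc)) - 2.0 * sum(Qc[i][j] for i in C for j in C_bar)
    gC, gCb = sum(b[i] for i in C), sum(b[i] for i in C_bar)
    return (cut + gamma * (M_hat + N_hat)) / (2.0 * gC * gCb / (gC + gCb))


W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
b = [sum(row) for row in W]                      # degree weights
Qm = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
Qc = [[0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0]]
gamma, n = 0.7, 4

# Largest discrepancy over all non-trivial subsets C (encoded as bit masks).
max_gap = 0.0
for mask in range(1, 2 ** n - 1):
    C = {i for i in range(n) if mask >> i & 1}
    indicator = [1.0 if i in C else 0.0 for i in range(n)]
    gap = abs(F_continuous(W, b, Qm, Qc, indicator, gamma)
              - F_set(W, b, Qm, Qc, C, gamma))
    max_gap = max(max_gap, gap)

# The functional is invariant under positive scaling and shifting of f.
f = [0.3, -1.0, 2.0, 0.5]
g = [3.0 * x + 2.0 for x in f]
invariance_gap = abs(F_continuous(W, b, Qm, Qc, f, gamma)
                     - F_continuous(W, b, Qm, Qc, g, gamma))
```

Both gaps are zero up to floating-point rounding on this example.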

From Lemma 2.2 it immediately follows that minimizing $F_\gamma$ is a relaxation of minimizing $\hat{F}_\gamma$. In our main result (Theorem 2.2), we establish that both problems are actually equivalent, so that we have a tight relaxation. In particular, a minimizer of $F_\gamma$ is an indicator function corresponding to the optimal partition minimizing $\hat{F}_\gamma$.

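The proof is based on the following key property of the functional $F_\gamma$. Given any non-constant $f \in \mathbb{R}^n$, optimal thresholding,

$$C^*_f = \mathop{\mathrm{arg\,min}}_{\min_i f_i \,\leq\, t \,<\, \max_i f_i} \hat{F}_\gamma(C^t_f),$$

where $C^t_f = \{i \in V \mid f_i > t\}$, yields an indicator function on some $C \subset V$ with smaller or equal value of $F_\gamma$.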
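Optimal thresholding is straightforward to implement because the set $\{i \mid f_i > t\}$ only changes when $t$ passes an entry of $f$, so at most $n-1$ candidate sets need to be evaluated. The sketch below takes the set objective as a callable; the names `optimal_threshold` and `ncut` are hypothetical, and for brevity the example uses the unconstrained normalized cut as the objective.

```python
def optimal_threshold(f, set_objective):
    """Return (C, value) minimizing set_objective over C_t = {i : f_i > t}."""
    thresholds = sorted(set(f))[:-1]   # t = max(f) would give the empty set
    if not thresholds:
        raise ValueError("f must be non-constant")
    best_C, best_value = None, float("inf")
    for t in thresholds:
        C = frozenset(i for i, f_i in enumerate(f) if f_i > t)
        value = set_objective(C)
        if value < best_value:
            best_C, best_value = C, value
    return best_C, best_value


def ncut(W, b, C):
    """Normalized cut with general vertex weights b."""
    C_bar = set(range(len(W))) - set(C)
    cut = 2.0 * sum(W[i][j] for i in C for j in C_bar)
    gC, gCb = sum(b[i] for i in C), sum(b[i] for i in C_bar)
    return cut / (2.0 * gC * gCb / (gC + gCb))


# Path graph 0 - 1 - 2 - 3 and a function f that roughly separates {0,1} from {2,3}.
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
b = [1, 1, 1, 1]
f = [0.1, 0.2, 0.8, 0.9]

C_star, value = optimal_threshold(f, lambda C: ncut(W, b, C))
```

Passing the objective as a callable keeps the thresholding step independent of whether the constrained or the unconstrained criterion is used.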

###### Theorem 2.2.

For $\gamma \geq 0$, we have

$$\min_{C\subset V} \hat{F}_\gamma(C) \;=\; \min_{f\in\mathbb{R}^n,\ f\ \text{non-constant}} F_\gamma(f).$$

Moreover, a solution of the first problem can be obtained from the solution of the second problem.

###### Proof.

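It has been shown in [8] that

$$\sum_{i,j=1}^n w_{ij}\,|f_i - f_j| = \int_{-\infty}^{\infty} \mathrm{cut}\big(C^t_f, \overline{C^t_f}\big)\,dt.$$

We define $\hat{P}: 2^V \to \mathbb{R}$ as $\hat{P}(A) = 1$ if $A \neq \emptyset$ and $A \neq V$, and $\hat{P}(A) = 0$ otherwise. Denoting by $\mathrm{cut}_Q$ the cut on the constraint graph whose weight matrix is given by $Q$, we have

$$\begin{aligned}
R_\gamma(f) &= \int_{-\infty}^{\infty} \mathrm{cut}\big(C^t_f, \overline{C^t_f}\big)\,dt + \gamma \int_{-\infty}^{\infty} \mathrm{cut}_{Q^m}\big(C^t_f, \overline{C^t_f}\big)\,dt + \gamma\,\mathrm{vol}(Q^c)\int_{\min_i f_i}^{\max_i f_i} 1\,dt - \gamma \int_{-\infty}^{\infty} \mathrm{cut}_{Q^c}\big(C^t_f, \overline{C^t_f}\big)\,dt \\
&= \int_{-\infty}^{\infty} \mathrm{cut}\big(C^t_f, \overline{C^t_f}\big)\,dt + \gamma \int_{-\infty}^{\infty} \mathrm{cut}_{Q^m}\big(C^t_f, \overline{C^t_f}\big)\,dt + \gamma\,\mathrm{vol}(Q^c)\int_{-\infty}^{\infty} \hat{P}(C^t_f)\,dt - \gamma \int_{-\infty}^{\infty} \mathrm{cut}_{Q^c}\big(C^t_f, \overline{C^t_f}\big)\,dt \\
&= \int_{-\infty}^{\infty} \hat{R}_\gamma(C^t_f)\,dt,
\end{aligned}$$

where $\hat{R}_\gamma(C)$ denotes the numerator of $\hat{F}_\gamma(C)$ for non-trivial $C$ and is zero otherwise.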

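Note that $S$ is an even, convex and positively one-homogeneous function. (A function $S$ is positively one-homogeneous if $S(\alpha f) = \alpha S(f)$ for all $\alpha > 0$.) Moreover, every even, convex, positively one-homogeneous function has the form $S(f) = \sup_{u\in U}\langle u, f\rangle$, where $U$ is a symmetric convex set, see e.g., [10]. Note that $S(\mathbf{1}) = 0$ and thus, because of the symmetry of $U$, it has to hold that $\langle u, \mathbf{1}\rangle = 0$ for all $u \in U$. Since $\hat{R}_\gamma(C) \geq 0$ and $\hat{S}(C) := S(\mathbf{1}_C) \geq \langle u, \mathbf{1}_C\rangle$, we have for all $u \in U$,

$$R_\gamma(f) \;\geq\; \int_{-\infty}^{\infty} \frac{\hat{R}_\gamma(C^t_f)}{\hat{S}(C^t_f)}\,\big\langle u, \mathbf{1}_{C^t_f}\big\rangle\,dt \;\geq\; \inf_{t\in\mathbb{R}} \frac{\hat{R}_\gamma(C^t_f)}{\hat{S}(C^t_f)} \int_{\min_i f_i}^{\max_i f_i} \big\langle u, \mathbf{1}_{C^t_f}\big\rangle\,dt, \qquad (2)$$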

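where in the last inequality we changed the limits of integration using the fact that $\langle u, \mathbf{1}\rangle = 0$, so that the integrand vanishes outside $[\min_i f_i, \max_i f_i)$. Let the entries of $f$ be sorted in increasing order, $f_1 \leq \dots \leq f_n$, and let $C_0 = V$ and $C_i = \{j \in V \mid f_j > f_i\}$ for $i = 1, \dots, n$. Then

$$\int_{\min_i f_i}^{\max_i f_i} \big\langle u, \mathbf{1}_{C^t_f}\big\rangle\,dt = \sum_{i=1}^{n-1} \langle u, \mathbf{1}_{C_i}\rangle\,(f_{i+1} - f_i) = \sum_{i=1}^n f_i\big(\langle u, \mathbf{1}_{C_{i-1}}\rangle - \langle u, \mathbf{1}_{C_i}\rangle\big) = \sum_{i=1}^n f_i u_i = \langle u, f\rangle.$$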

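Noting that (2) holds for all $u \in U$, we have

$$R_\gamma(f) \;\geq\; \inf_{t\in\mathbb{R}} \hat{F}_\gamma(C^t_f)\, \sup_{u\in U}\langle u, f\rangle \;=\; \inf_{t\in\mathbb{R}} \hat{F}_\gamma(C^t_f)\; S(f).$$

This implies that

$$F_\gamma(f) \;\geq\; \inf_{t\in\mathbb{R}} \hat{F}_\gamma(C^t_f) \;=\; F_\gamma(\mathbf{1}_{C^*_f}), \qquad (3)$$

where $C^*_f$ is the set obtained by optimal thresholding of $f$.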

This shows that we always get descent by optimal thresholding. Thus the actual minimizer of $F_\gamma$ is a two-valued function, which can be transformed to an indicator function on some $C \subset V$ because of the scale and shift invariance of $F_\gamma$. Then from Lemma 2.2, which shows that $F_\gamma(\mathbf{1}_C) = \hat{F}_\gamma(C)$ for non-trivial partitions, the statement follows. ∎

Now we state our second result: the problem of minimizing the functional $F_\gamma$ over arbitrary real-valued non-constant $f$, for a particular choice of $\gamma$, is in fact equivalent to the NP-hard problem of minimizing normalized cut with constraints.

###### Theorem 2.3.

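Let $(C,\overline{C})$ be consistent and $\lambda = \mathrm{NCut}(C,\overline{C})$. Then for $\gamma \geq \frac{\mathrm{gvol}(V)}{4}\lambda$, it holds that

$$\min_{C\subset V:\ (C,\overline{C})\ \text{consistent}} \mathrm{NCut}(C,\overline{C}) \;=\; \min_{f\in\mathbb{R}^n,\ f\ \text{non-constant}} F_\gamma(f).$$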

Furthermore, an optimal partition of the constrained problem can be obtained from a minimizer of the right problem.

###### Proof.

From Theorem 2.1 we know that, for the chosen value of $\gamma$, the constrained problem is equivalent to

$$\min_{C\subset V} \hat{F}_\gamma(C),$$

which in turn is equivalent, by Theorem 2.2, to the right problem in the statement. Moreover, as shown in Theorem 2.2, a minimizer of $F_\gamma$ is an indicator function on some $C \subset V$ and hence we immediately get an optimal partition of the constrained problem. ∎

A few comments on the implications of Theorem 2.3. It shows that the constrained normalized cut problem can be equivalently solved by minimizing $F_\gamma$ for the given value of $\gamma$. The value of $\gamma$ depends on the normalized cut value of a partition consistent with the given constraints. Note that such a partition can be obtained in polynomial time by 2-coloring the constraint graph as long as the constraints are consistent.
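One way to realize the 2-coloring is sketched below: must-linked vertices are first merged with a union-find structure, and the resulting components are then 2-colored along the cannot-link edges by breadth-first search. This particular procedure and the name `consistent_partition` are illustrative assumptions; the text only states that a 2-coloring of the constraint graph suffices.

```python
from collections import deque


def consistent_partition(n, must_links, cannot_links):
    """Return a set C such that (C, V \\ C) satisfies all constraints, or None."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # Must-linked vertices end up in the same component.
    for p, q in must_links:
        parent[find(p)] = find(q)

    # Cannot-link edges between components.
    adj = {find(i): set() for i in range(n)}
    for p, q in cannot_links:
        rp, rq = find(p), find(q)
        if rp == rq:
            return None                     # must-linked and cannot-linked at once
        adj[rp].add(rq)
        adj[rq].add(rp)

    # Breadth-first 2-coloring of the component graph.
    color = {}
    for root in adj:
        if root in color:
            continue
        color[root] = 0
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None             # odd cycle of cannot-links
    return {i for i in range(n) if color[find(i)] == 0}


C = consistent_partition(5, must_links=[(0, 1)], cannot_links=[(1, 2), (2, 3)])
clash = consistent_partition(2, must_links=[(0, 1)], cannot_links=[(0, 1)])
odd_cycle = consistent_partition(3, [], [(0, 1), (1, 2), (0, 2)])
```

Components that are not touched by any cannot-link constraint may be assigned to either side; this sketch always places them in `C`, so without any cannot-link constraint the returned set is all of $V$ and one component would have to be flipped to obtain a non-trivial partition.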

### 2.2 Integration of must-link constraints via sparsification

If the must-link constraints are reliable and therefore should be enforced, one can directly integrate them by merging the corresponding vertices and redefining the edge and vertex weights. In this way one derives a new reduced graph in which the normalized cut values of all partitions that satisfy the must-link constraints are preserved.

The construction of a reduced graph is given below for a must-link constraint $(p, q)$.

1. merge $p$ and $q$ into a single vertex $\tau$.

2. update the vertex weight of $\tau$ by $b_\tau = b_p + b_q$.

3. update the edges as follows: if $r$ is any vertex other than $p$ and $q$, then add an edge between $\tau$ and $r$ with weight $w(\tau, r) = w(p, r) + w(q, r)$.

Note that this construction leads to a graph with general vertex weights even if the original graph had vertex weights equal to 1. If there are many must-links, one can efficiently integrate all of them together by first constructing the must-link constraint graph and then merging each connected component in this way.
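The merging steps above can be sketched as follows for an arbitrary set of must-links. Connected components of the must-link graph are found with a union-find structure, vertex weights are summed per component, and edge weights between different components are accumulated. The function names and the dense matrix representation are assumptions for illustration.

```python
def merge_must_links(W, b, must_links):
    """Build the reduced graph; returns (W_red, b_red, vertex_to_reduced)."""
    n = len(W)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for p, q in must_links:
        parent[find(p)] = find(q)

    # Number the components in order of first appearance.
    label, vertex_to_reduced = {}, []
    for i in range(n):
        root = find(i)
        if root not in label:
            label[root] = len(label)
        vertex_to_reduced.append(label[root])

    m = len(label)
    W_red = [[0.0] * m for _ in range(m)]
    b_red = [0.0] * m
    for i in range(n):
        b_red[vertex_to_reduced[i]] += b[i]          # step 2: sum vertex weights
        for j in range(n):
            a, c = vertex_to_reduced[i], vertex_to_reduced[j]
            if a != c:                               # edges inside a component vanish
                W_red[a][c] += W[i][j]               # step 3: sum edge weights
    return W_red, b_red, vertex_to_reduced


def ncut(W, b, C):
    C_bar = set(range(len(W))) - set(C)
    cut = 2.0 * sum(W[i][j] for i in C for j in C_bar)
    gC, gCb = sum(b[i] for i in C), sum(b[i] for i in C_bar)
    return cut / (2.0 * gC * gCb / (gC + gCb))


# Path graph 0 - 1 - 2 - 3 with degree weights and must-link (0, 1).
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
b = [sum(row) for row in W]
W_red, b_red, mapping = merge_must_links(W, b, [(0, 1)])

original_value = ncut(W, b, {0, 1})
reduced_value = ncut(W_red, b_red, {mapping[0]})
```

On this example the normalized cut of the partition that keeps vertices 0 and 1 together is the same in the original and in the reduced graph.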

The following lemma shows that the above construction preserves all normalized cuts which respect the must-link constraints. We prove it for the simple case where we merge $p$ and $q$; the proof can easily be extended to the general case by induction.

###### Lemma 2.3.

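Let $G'(V',E',w',b')$ be the reduced graph of $G(V,E,w,b)$ obtained by merging vertices $p$ and $q$. If a partition $(C,\overline{C})$ does not separate $p$ and $q$, we have $\mathrm{NCut}(C,\overline{C}) = \mathrm{NCut}(C',\overline{C'})$, where $(C',\overline{C'})$ is the corresponding partition of $V'$.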

###### Proof.

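Note that $V' = (V \setminus \{p,q\}) \cup \{\tau\}$. If $(C,\overline{C})$ does not separate $p$ and $q$, then we have either $\{p,q\} \subset C$ or $\{p,q\} \subset \overline{C}$. W.l.o.g. assume that $\{p,q\} \subset C$. The corresponding partition of $V'$ is then $C' = (C \setminus \{p,q\}) \cup \{\tau\}$ and $\overline{C'} = \overline{C}$. We get

$$\begin{aligned}
\mathrm{cut}(C',\overline{C'}) &= 2\sum_{i\in C',\, j\in\overline{C'}} w'_{ij} = 2\sum_{j\in\overline{C'}} w'_{\tau j} + 2\sum_{i\in C'\setminus\{\tau\},\, j\in\overline{C'}} w'_{ij} \\
&= 2\sum_{k\in\{p,q\},\, j\in\overline{C}} w_{kj} + 2\sum_{i\in C\setminus\{p,q\},\, j\in\overline{C}} w_{ij} = \mathrm{cut}(C,\overline{C}), \\
\mathrm{gvol}(C') &= \sum_{i\in C'} b'_i = b'_\tau + \sum_{i\in C'\setminus\{\tau\}} b'_i = b_p + b_q + \sum_{i\in C\setminus\{p,q\}} b_i = \sum_{i\in C} b_i = \mathrm{gvol}(C), \\
\mathrm{gvol}(\overline{C'}) &= \sum_{i\in\overline{C'}} b'_i = \sum_{i\in\overline{C}} b_i = \mathrm{gvol}(\overline{C}).
\end{aligned}$$

Thus we have $\mathrm{NCut}(C',\overline{C'}) = \mathrm{NCut}(C,\overline{C})$. ∎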

All partitions of the reduced graph fulfill all must-link constraints, and thus any relaxation of the unconstrained normalized cut problem can now be used. Moreover, this is not restricted to the cut criterion we are using: any other graph cut criterion based on the cut and the volume of the subsets is preserved by the reduction.

## 3 Algorithm for Constrained 1-Spectral Clustering

In this section we discuss the efficient minimization of $F_\gamma$ based on recent ideas from unconstrained 1-spectral clustering [8, 9]. Note that $F_\gamma$ is a non-negative ratio of a difference of convex (d.c.) functions and a convex function, all of which are positively one-homogeneous. In recent work [8, 9], a general scheme, shown in Algorithm 1 (where $\partial S(f)$ denotes the subdifferential of the convex function $S$ at $f$), is proposed for the minimization of a non-negative ratio of a d.c. function and a convex function, both of which are positively one-homogeneous.

It is shown in [9] that Algorithm 1 generates a sequence $f^k$ such that either $F_\gamma(f^{k+1}) < F_\gamma(f^k)$ or the sequence terminates. Moreover, the cluster points of $f^k$ correspond to critical points of $F_\gamma$. The scheme is given in Algorithm 1 for the problem $\min_{f\in\mathbb{R}^n} F_\gamma(f)$, where

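$$R_1(f) := \frac{1}{2}\sum_{i,j=1}^n (w_{ij} + \gamma q^m_{ij})\,|f_i - f_j| + \frac{\gamma}{2}\sum_{i,j=1}^n q^c_{ij}\,\big(\max(f) - \min(f)\big),$$

$$R_2(f) := \frac{1}{2}\sum_{i,j=1}^n q^c_{ij}\,|f_i - f_j|, \qquad S(f) := \frac{1}{2}\Big\| B\Big(f - \frac{1}{\mathrm{gvol}(V)}\langle f, b\rangle \mathbf{1}\Big)\Big\|_1.$$

Note that $R_1$, $R_2$ are both convex functions and $F_\gamma(f) = \frac{R_1(f) - \gamma R_2(f)}{S(f)}$.

Moreover, it is shown in [9] that if one wants to minimize only over non-constant functions, one has to ensure that $\langle r_2, \mathbf{1}\rangle = \langle s, \mathbf{1}\rangle = 0$ for the subgradients $r_2 \in \partial R_2(f^k)$ and $s \in \partial S(f^k)$ used in the algorithm. Note that

$$\partial S(f) = \frac{1}{2}\Big(I - \frac{1}{\mathrm{gvol}(V)}\, b\,\mathbf{1}^T\Big) B\,\mathrm{sign}\Big(f - \frac{\langle b, f\rangle}{\mathrm{gvol}(V)}\mathbf{1}\Big), \qquad \partial R_2(f) = \Big\{ \Big(\sum_{j=1}^n q^c_{ij} u_{ij}\Big)_{i=1}^n \;\Big|\; u_{ij} = -u_{ji},\ u_{ij} \in \mathrm{sign}(f_i - f_j) \Big\},$$

where $\mathrm{sign}(x) = [-1, 1]$ if $x = 0$; otherwise it is just the sign function. It is easy to check that $\langle u, \mathbf{1}\rangle = 0$ for all $u \in \partial R_2(f)$ and all $f \in \mathbb{R}^n$, and that for all $f$ there always exists a vector $s \in \partial S(f)$ such that $\langle s, \mathbf{1}\rangle = 0$.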

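The key part of the algorithm is the inner convex problem which one has to solve at each step. In our case it has the form

$$\min_{\|f\|_2 \leq 1}\ \frac{1}{2}\sum_{i,j=1}^n (w_{ij} + \gamma q^m_{ij})\,|f_i - f_j| + \frac{\gamma}{2}\sum_{i,j=1}^n q^c_{ij}\,\big(\max(f) - \min(f)\big) - \big\langle f, \gamma r_2 + \lambda^k s\big\rangle, \qquad (4)$$

where $r_2 \in \partial R_2(f^k)$, $s \in \partial S(f^k)$ and $\lambda^k = F_\gamma(f^k)$.

To solve it more efficiently we derive an equivalent smooth dual formulation for this non-smooth convex problem. We replace $\frac{\gamma}{2}\sum_{i,j=1}^n q^c_{ij}$ by $c$ in the following.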

###### Lemma 3.1.

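Let $E \subset V \times V$ denote the set of edges and let $A: \mathbb{R}^E \to \mathbb{R}^V$ be defined as $(A\alpha)_i = \sum_{j:(i,j)\in E} w'_{ij}\,\alpha_{ij}$, where $w'_{ij} = w_{ij} + \gamma q^m_{ij}$. Moreover, let $U$ denote the simplex, $U = \{u \in \mathbb{R}^n \mid \sum_{i=1}^n u_i = 1,\ u_i \geq 0\ \forall i\}$. The above inner problem is equivalent to

$$\min_{\{\alpha\in\mathbb{R}^E \,\mid\, \|\alpha\|_\infty \leq 1,\ \alpha_{ij} = -\alpha_{ji}\},\ v\in U} \Psi(\alpha, v) := c\,\Big\| -\frac{A\alpha}{c} + v + b - P_U\Big(-\frac{A\alpha}{c} + v + b\Big)\Big\|_2, \qquad (5)$$

where $b = \frac{\gamma r_2 + \lambda^k s}{c}$ (not to be confused with the vertex weights), and $P_U(x)$ is the projection of $x$ onto the simplex $U$.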

###### Proof.

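Noting that $\frac{1}{2}\sum_{i,j=1}^n w'_{ij}\,|f_i - f_j| = \max_{\{\alpha\in\mathbb{R}^E \,\mid\, \|\alpha\|_\infty\leq 1,\ \alpha_{ij}=-\alpha_{ji}\}} \langle f, A\alpha\rangle$ (see [8]) and $\max(f) = \max_{u\in U}\langle f, u\rangle$, $\min(f) = \min_{v\in U}\langle f, v\rangle$, the inner problem can be rewritten as

$$\begin{aligned}
& \min_{\|f\|_2\leq 1}\ \max_{\alpha}\ \langle f, A\alpha\rangle + c\,\big(\max(f) - \min(f)\big) - \big\langle f, \gamma r_2 + \lambda^k s\big\rangle \\
=\ & \min_{\|f\|_2\leq 1}\ \max_{\alpha,\ u,v\in U}\ \langle f, A\alpha\rangle + c\,\langle f, u\rangle - c\,\langle f, v\rangle - \gamma\langle f, r_2\rangle - \lambda^k\langle f, s\rangle \\
\overset{(s_1)}{=}\ & \max_{\alpha,\ u,v\in U}\ \min_{\|f\|_2\leq 1}\ \big\langle f,\ A\alpha + c(u - v) - \gamma r_2 - \lambda^k s\big\rangle \\
\overset{(s_2)}{=}\ & \max_{\alpha,\ u,v\in U}\ -\big\| A\alpha - \gamma r_2 - \lambda^k s + c(u - v)\big\|_2 \\
=\ & -\min_{\alpha,\ u,v\in U}\ \Psi(\alpha, u, v),
\end{aligned}$$

where $\alpha$ always ranges over $\{\alpha\in\mathbb{R}^E \mid \|\alpha\|_\infty\leq 1,\ \alpha_{ij}=-\alpha_{ji}\}$ and $\Psi(\alpha,u,v) := \| A\alpha - \gamma r_2 - \lambda^k s + c(u - v)\|_2$.

The step $(s_1)$ follows from the standard min-max theorem (see Corollary 37.3.2 in [13]) since $\alpha$, $u$, $v$ and $f$ lie in non-empty compact convex sets. In the step $(s_2)$, we used that the minimizer of the linear function over the Euclidean ball is given by

$$f^* = -\frac{A\alpha - \gamma r_2 - \lambda^k s + cu - cv}{\| A\alpha - \gamma r_2 - \lambda^k s + cu - cv\|_2},$$

if $\| A\alpha - \gamma r_2 - \lambda^k s + cu - cv\|_2 \neq 0$; otherwise $f^*$ is an arbitrary element of the Euclidean unit ball.

Finally, we have $\| A\alpha - \gamma r_2 - \lambda^k s + c(u - v)\|_2 = c\,\big\| u - \big(-\frac{A\alpha}{c} + v + b\big)\big\|_2$. We also know that for a convex set $U$ and any given $x$, $\min_{u\in U}\|u - x\|_2 = \|P_U(x) - x\|_2$, where $P_U(x)$ is the projection of $x$ onto the set $U$. With $x = -\frac{A\alpha}{c} + v + b$, we have $\min_{u\in U}\Psi(\alpha, u, v) = \Psi(\alpha, v)$ for any $\alpha$, $v$, and from this the result follows. ∎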

The smooth dual problem can be solved efficiently using first-order projected gradient methods like FISTA [2], which has a guaranteed convergence rate of $O(L/k^2)$, where $k$ is the number of steps and $L$ is the Lipschitz constant of the gradient of the objective. The bound on the Lipschitz constant for the gradient of the objective in (5) can be rather loose if the weights vary a lot. The rescaling of the variable $\alpha$ introduced in Lemma 3.2 leads to a better condition number and also to a tighter bound on the Lipschitz constant. This results in a significant improvement in practical performance.
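Each evaluation of the dual objective requires the Euclidean projection onto the simplex. The text does not say how this projection is computed; the sketch below uses the standard sort-based method, which is swapped in here as an assumption, and the function names are hypothetical.

```python
def project_simplex(x):
    """Euclidean projection of x onto {u : u_i >= 0, sum_i u_i = 1}.

    Sort-based method: find the shift theta such that the clipped,
    shifted vector sums to one.
    """
    u = sorted(x, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        if u_k - candidate > 0:
            theta = candidate      # the last k satisfying the test gives the shift
    return [max(x_i - theta, 0.0) for x_i in x]


def half_squared_distance(d):
    """0.5 * ||d - P_U(d)||_2^2, the form of the smooth objective in Lemma 3.2."""
    p = project_simplex(d)
    return 0.5 * sum((d_i - p_i) ** 2 for d_i, p_i in zip(d, p))


inside = project_simplex([0.5, 0.5])       # already on the simplex
corner = project_simplex([2.0, 0.0])       # projected onto a vertex
uniform = project_simplex([0.2, 0.2, 0.2])
distance = half_squared_distance([2.0, 0.0])
```

The sort dominates the cost, so one projection takes $O(n \log n)$ time; a projected gradient method such as FISTA would call it once per iteration.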

###### Lemma 3.2.

Let be a linear operator defined as and let , for positive constant . The above inner problem is equivalent to

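$$\min_{\{\beta\in\mathbb{R}^E \,\mid\, \|\beta\|_\infty \leq s_{ij},\ \beta_{ij} = -\beta_{ji}\},\ v\in U} \tilde{\Psi}(\beta, v) := \frac{1}{2}\,\big\| d - P_U(d)\big\|_2^2,$$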

where . The Lipschitz constant of the gradient of is upper bounded by 4.

###### Proof.

Let . Then and constraints on transform to and . Since the mapping between and is one-to-one, the transformation yields an equivalent problem (in the sense that minimizer of one problem can be easily derived from minimizer of the other problem).

Now we derive a bound on the Lipschitz constant.

The gradient of at w.r.t , and are given by

 (∇Ψ(x))β=−BTM(d−PU(d)), (∇Ψ(x))v=(d−PU(d)),

where is the adjoint operator of given by .

Let denote any other point and . then we have