The task of clustering is to find a natural grouping of items given, e.g., pairwise similarities. In real-world problems, such a natural grouping is often hard to discover from the given similarities alone, or there is more than one way to group the given items. In either case, clustering methods benefit from domain knowledge that biases the result toward the desired clustering. Wagstaff et al.  were the first to consider constrained clustering by encoding available domain knowledge in the form of pairwise must-link (ML, for short) and cannot-link (CL) constraints. By incorporating these constraints into k-means they achieve much better performance. Since acquiring such constraint information is relatively easy, constrained clustering has become an active area of research; see  for an overview.
Spectral clustering is a graph-based clustering algorithm originally derived as a relaxation of the NP-hard normalized cut problem. The spectral relaxation leads to an eigenproblem for the graph Laplacian, see [7, 14, 16]. However, it is known that the spectral relaxation can be quite loose . More recently, it has been shown that one can equivalently rewrite the discrete (combinatorial) normalized Cheeger cut problem as a continuous optimization problem using the nonlinear 1-graph Laplacian [8, 15], which yields much better cuts than the spectral relaxation. In further work it is shown that all balanced graph cut problems, including normalized cut, have a tight relaxation into a continuous optimization problem .
The first approach to integrate constraints into spectral clustering was based on the idea of modifying the weight matrix in order to enforce the must-link and cannot-link constraints and then solving the resulting unconstrained problem . Another idea is to adapt the embedding obtained from the first eigenvectors of the graph Laplacian . Closer to the original normalized graph cut problem are the approaches that start with the optimization problem of the spectral relaxation and add constraints that encode must-links and cannot-links [20, 5, 19, 18]. Furthermore, the case where the constraints are allowed to be inconsistent is considered in .
In this paper we contribute in various ways to the area of graph-based constrained learning. First, we show, in the spirit of 1-spectral clustering [8, 9], that the constrained normalized cut problem has a tight relaxation as an unconstrained continuous optimization problem. Our method, which we call COSC, is the first method in the field of constrained spectral clustering that can guarantee that all given constraints are fulfilled. While we present arguments that in practice it is the best choice to satisfy all constraints even if the data is noisy, in the case of inconsistent or unreliable constraints one should refrain from doing so. Thus our second contribution is to show that our framework can be extended to handle degree-of-belief and even inconsistent constraints. In this case COSC optimizes a trade-off between having a small normalized cut and a small number of violated constraints. We present an efficient implementation of COSC based on an optimization technique proposed in  which scales to large datasets. While the continuous optimization problem is non-convex and thus convergence to the global optimum is not guaranteed, we can show that our method either improves any given partition which satisfies all constraints or stops after one iteration.
All omitted proofs and additional experimental results can be found in the supplementary material.
Notation. Set functions are denoted by a hat, , while the corresponding extension is . In this paper, we consider the normalized cut problem with general vertex weights. Formally, let be an undirected graph with vertex set and edge set together with edge weights and vertex weights and . Let and denote by . We define respectively the cut, the generalized volume and the normalized cut (with general vertex weights) of a partition as
We obtain ratio cut and normalized cut for special cases of the vertex weights, , respectively. In the ratio cut case, is the cardinality of and in the normalized cut case, it is the volume of , denoted by .
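The following is a minimal sketch of how these quantities could be computed. It assumes the common convention that the normalized cut equals the cut divided by the generalized volume of each side, summed over both sides; the paper's own definition may differ by a constant factor. The names `cut_value`, `gvol` and `normalized_cut` are illustrative and do not come from the paper.

```python
import numpy as np

def cut_value(W, in_C):
    """Total weight of edges with one endpoint in C and the other outside C."""
    in_C = np.asarray(in_C, dtype=bool)
    return float(W[np.ix_(in_C, ~in_C)].sum())

def gvol(b, in_C):
    """Generalized volume: sum of the vertex weights b over the vertices in C."""
    in_C = np.asarray(in_C, dtype=bool)
    return float(np.asarray(b, dtype=float)[in_C].sum())

def normalized_cut(W, b, in_C):
    """Assumed convention: cut/gvol(C) + cut/gvol(complement of C)."""
    in_C = np.asarray(in_C, dtype=bool)
    c = cut_value(W, in_C)
    return c / gvol(b, in_C) + c / gvol(b, ~in_C)

# Path graph 0-1-2-3 with unit edge weights, split in the middle.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
C = np.array([True, True, False, False])

ratio_cut = normalized_cut(W, np.ones(4), C)   # unit vertex weights
ncut = normalized_cut(W, W.sum(axis=1), C)     # vertex weights equal to degrees
```

With unit vertex weights the generalized volume is the cardinality of the set, and with degree weights it is the usual volume, which matches the two special cases named above.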
2 The Constrained Normalized Cut Problem
We consider the normalized cut problem with must-link and cannot-link constraints. Let denote the given graph and , be the constraint matrices, where the element (or ) specifies the must-link (or cannot-link) constraint between and . In the following we always assume that is connected. Everything stated below, as well as our suggested algorithm, can be easily generalized to degree-of-belief constraints by allowing (and ) . However, in the following we consider only (and ) , in order to keep the theoretical statements more accessible.
We call a partition consistent if it satisfies all constraints in and .
Then the constrained normalized cut problem is to minimize over all consistent partitions. If the constraints are unreliable or inconsistent one can relax this problem and optimize a trade-off between normalized cut and the number of violated constraints. In this paper, we address both problems in a common framework.
We define the set functions, , as
and are equal to twice the number of violated must-link and cannot-link constraints of partition .
As we show in the following, both the constrained normalized cut problem and its soft version can be addressed by minimization of defined as
where . Note that if is consistent. Thus the minimization of corresponds to a trade-off between having small normalized cut and satisfying all constraints parameterized by .
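As a rough illustration, the sketch below counts violated constraints and evaluates a penalized objective. The additive form (cut plus the parameter times the violation counts, divided by the balance term) is an assumption made for this example, since the defining equation is not reproduced here; it is chosen so that the value reduces to the normalized cut for consistent partitions. The constraint matrices `Qm` and `Qc` are assumed symmetric, 0/1 valued, with zero diagonal, and all function names are hypothetical.

```python
import numpy as np

def violations(Qm, Qc, in_C):
    """Return (M, N): twice the number of violated must-links and cannot-links."""
    in_C = np.asarray(in_C, dtype=bool)
    same = in_C[:, None] == in_C[None, :]   # True where i and j are on the same side
    M = float(Qm[~same].sum())              # must-linked pairs that are separated
    N = float(Qc[same].sum())               # cannot-linked pairs that are together
    return M, N

def penalized_cut(W, b, Qm, Qc, in_C, gamma):
    """Assumed form: (cut + gamma * (M + N)) / balance."""
    in_C = np.asarray(in_C, dtype=bool)
    b = np.asarray(b, dtype=float)
    cut = W[np.ix_(in_C, ~in_C)].sum()
    balance = b[in_C].sum() * b[~in_C].sum() / b.sum()
    M, N = violations(Qm, Qc, in_C)
    return float((cut + gamma * (M + N)) / balance)

# Path graph 0-1-2-3, must-link (0, 1), cannot-link (0, 3).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
b = np.ones(4)
Qm = np.zeros((4, 4)); Qm[0, 1] = Qm[1, 0] = 1.0
Qc = np.zeros((4, 4)); Qc[0, 3] = Qc[3, 0] = 1.0

consistent = [True, True, False, False]   # satisfies both constraints
violating = [True, False, False, True]    # violates both constraints

value_consistent = penalized_cut(W, b, Qm, Qc, consistent, gamma=1.0)
value_violating = penalized_cut(W, b, Qm, Qc, violating, gamma=1.0)
```

Because the matrices are symmetric, each violated pair is counted once as (i, j) and once as (j, i), which is why the counts equal twice the number of violated constraints.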
The relation between the parameter and the number of violated constraints by the partition minimizing is quantified in the following lemma.
Let be consistent and . If , then any minimizer of violates no more than constraints.
Any partition that violates more than constraints satisfies and thus
where we have used that and as the graph is connected. Assume now that the partition minimizes for and violates more than constraints. Then
which leads to a contradiction. ∎
Note that it is easy to construct a partition which is consistent and thus the above choice of is constructive. The following theorem is immediate from the above lemma for the special case .
Let be consistent with the given constraints and . Then for , it holds that
and the optimum values of both problems are equal.
Thus the constrained normalized cut problem can be equivalently formulated as the combinatorial problem of minimizing . In the next section we will show that this problem allows for a tight relaxation into a continuous optimization problem.
2.1 A tight continuous relaxation of
Minimizing is a hard combinatorial problem. In the following, we derive an equivalent continuous optimization problem. Let denote a function on , and
denote the vector that is 1 on and 0 elsewhere. Define
where and are respectively the maximum and minimum elements of . Note that and for any non-trivial partition . (A partition is non-trivial if neither of its two sets is empty.)
Let denote the diagonal matrix with the vertex weights on the diagonal. We define
We denote the numerator of by and the denominator by .
For any non-trivial partition it holds that .
This together with the discussion on finishes the proof. ∎
From Lemma 2.2 it immediately follows that minimizing is a relaxation of minimizing . In our main result (Theorem 2.2), we establish that both problems are actually equivalent, so that we have a tight relaxation. In particular a minimizer of is an indicator function corresponding to the optimal partition minimizing .
The proof is based on the following key property of the functional . Given any non-constant , optimal thresholding,
where , yields an indicator function on some with smaller or equal value of .
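Optimal thresholding can be sketched as follows. The sketch assumes that the candidate sets are of the form "all vertices whose value exceeds a threshold" and that the set objective is passed in as a callable; both the function name `optimal_threshold` and the toy objective (a ratio cut on a path graph) are illustrative choices and not taken from the paper.

```python
import numpy as np

def optimal_threshold(f, objective):
    """Try every threshold set {i : f_i > t} and return the best one.

    `objective` maps a boolean membership vector to a real value.
    """
    f = np.asarray(f, dtype=float)
    levels = np.unique(f)                 # sorted distinct values of f
    if levels.size < 2:
        raise ValueError("f must be non-constant")
    best_set, best_val = None, np.inf
    # Thresholds strictly below the maximum keep both sides non-empty.
    for t in levels[:-1]:
        in_C = f > t
        val = objective(in_C)
        if val < best_val:
            best_set, best_val = in_C, val
    return best_set, best_val

# Toy objective: ratio cut on the path graph 0-1-2-3 with unit weights.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

def ratio_cut(in_C):
    c = W[np.ix_(in_C, ~in_C)].sum()
    return c / in_C.sum() + c / (~in_C).sum()

f = np.array([0.9, 0.7, -0.2, -0.8])
best_set, best_val = optimal_threshold(f, ratio_cut)
```

Only as many thresholds as there are distinct values of the input need to be tried, so the search costs at most n - 1 objective evaluations.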
For , we have
Moreover, a solution of the first problem can be obtained from the solution of the second problem.
It has been shown in , that
We define as , if and , and otherwise. Denoting by , the cut on the constraint graph whose weight matrix is given by , we have
Note that is an even, convex and positively one-homogeneous function. (A function is positively one-homogeneous if for all .) Moreover, every even, convex, positively one-homogeneous function has the form , where is a symmetric convex set, see e.g., . Note that and thus because of the symmetry of it has to hold for all . Since and , we have for all ,
where in the last inequality we changed the limits of integration using the fact that . Let and . Then
Noting that (2.1) holds for all , we have
This implies that
This shows that we always get descent by optimal thresholding. Thus the actual minimizer of is a two-valued function, which can be transformed to an indicator function on some , because of the scale and shift invariance of . Then from Lemma 2.2, which shows that for non-trivial partitions, , the statement follows. ∎
Now, we state our second result: the problem of minimizing the functional over arbitrary real-valued non-constant , for a particular choice of , is in fact equivalent to the NP-hard problem of minimizing normalized cut with constraints.
Let be consistent and . Then for , it holds that
Furthermore, an optimal partition of the constrained problem can be obtained from a minimizer of the right problem.
From Theorem 2.1 we know that, for the chosen value of , the constrained problem is equivalent to
which in turn is equivalent, by Theorem 2.2, to the right problem in the statement. Moreover, as shown in Theorem 2.2, a minimizer of is an indicator function on and hence we immediately get an optimal partition of the constrained problem. ∎
A few comments on the implications of Theorem 2.3. First, it shows that the constrained normalized cut problem can be equivalently solved by minimizing for the given value of . The value of depends on the normalized cut value of a partition consistent with given constraints. Note that such a partition can be obtained in polynomial time by 2-coloring the constraint graph as long as the constraints are consistent.
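One way to obtain such a consistent partition is sketched below: must-linked vertices are first merged with a union-find structure, and the resulting cannot-link graph is then 2-colored by breadth-first search. This is an illustrative sketch under the assumption that constraints are given as lists of index pairs; the function name `consistent_partition` is hypothetical.

```python
from collections import deque

def consistent_partition(n, must_links, cannot_links):
    """Return a 0/1 label per vertex satisfying all constraints, or None."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Must-linked vertices end up in the same group.
    for i, j in must_links:
        parent[find(i)] = find(j)

    # Build the cannot-link graph between groups.
    adj = {}
    for i, j in cannot_links:
        ri, rj = find(i), find(j)
        if ri == rj:
            return None          # a pair is both must-linked and cannot-linked
        adj.setdefault(ri, set()).add(rj)
        adj.setdefault(rj, set()).add(ri)

    # 2-color each connected component of the cannot-link graph by BFS.
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None  # odd cycle: constraints are inconsistent

    # Groups without cannot-links are unconstrained and default to label 0.
    return [color.get(find(i), 0) for i in range(n)]

labels = consistent_partition(4, must_links=[(0, 1)], cannot_links=[(1, 2)])
```

If there are no cannot-link constraints at all, every vertex receives label 0, so the caller would still have to move some unconstrained group to the other side to get a non-trivial partition.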
2.2 Integration of must-link constraints via sparsification
If the must-link constraints are reliable and therefore should be enforced, one can directly integrate them by merging the corresponding vertices and redefining the edge and vertex weights. In this way one derives a new reduced graph in which the value of the normalized cut of every partition that satisfies the must-link constraints is preserved.
The construction of a reduced graph is given below for a must-link constraint .
merge and into a single vertex .
update the vertex weight of by .
update the edges as follows: if is any vertex other than and , then add an edge between and with weight .
Note that this construction leads to a graph with vertex weights even if the original graph had vertex weights equal to . If there are many must-links, one can efficiently integrate all of them together by first constructing the must-link constraint graph and merging each connected component in this way.
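The merging steps can be sketched with an assignment matrix that maps each original vertex to its merged group, so that edge and vertex weights are summed within groups. This is a sketch under the assumptions that the graph is given as a dense symmetric weight matrix and that edges inside a merged group are dropped (they can never be cut); the names `merge_must_links` and `ncut` are illustrative.

```python
import numpy as np

def merge_must_links(W, b, must_links):
    """Merge each connected component of the must-link graph into one vertex."""
    n = W.shape[0]
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in must_links:
        parent[find(i)] = find(j)

    roots = np.array([find(i) for i in range(n)])
    _, labels = np.unique(roots, return_inverse=True)  # group index per vertex
    k = int(labels.max()) + 1

    P = np.zeros((n, k))
    P[np.arange(n), labels] = 1.0      # P[i, g] = 1 if vertex i is in group g

    W_red = P.T @ W @ P                # sum edge weights between groups
    np.fill_diagonal(W_red, 0.0)       # drop edges inside a merged group
    b_red = P.T @ np.asarray(b, dtype=float)   # sum vertex weights per group
    return W_red, b_red, labels

def ncut(W, b, in_C):
    """Assumed convention: cut/gvol(C) + cut/gvol(complement of C)."""
    in_C = np.asarray(in_C, dtype=bool)
    c = W[np.ix_(in_C, ~in_C)].sum()
    return float(c / b[in_C].sum() + c / b[~in_C].sum())

# Path graph 0-1-2-3 with degree vertex weights and must-link (0, 1).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
b = W.sum(axis=1)
W_red, b_red, labels = merge_must_links(W, b, [(0, 1)])

original = ncut(W, b, [True, True, False, False])
reduced = ncut(W_red, b_red, [True, False, False])
```

In this toy example the partition that keeps vertices 0 and 1 together has the same value on the original and on the reduced graph, in line with the preservation property stated in the text.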
The following lemma shows that the above construction preserves all normalized cuts which respect the must-link constraints. We prove it for the simple case where we merge and and the proof can easily be extended to the general case by induction.
Let be the reduced graph of obtained by merging vertices and . If a partition does not separate and , we have .
Note that . If does not separate and , then we have either or . W.l.o.g. assume that . The corresponding partition of is then and . We get
Thus we have ∎
All partitions of the reduced graph fulfill all must-link constraints and thus any relaxation of the unconstrained normalized cut problem can now be used. Moreover, this is not restricted to the cut criterion we are using but any other graph cut criterion based on cut and the volume of the subsets will be preserved in the reduction.
3 Algorithm for Constrained 1-Spectral Clustering
In this section we discuss the efficient minimization of based on recent ideas from unconstrained 1-spectral clustering [8, 9]. Note that is a non-negative ratio of a difference-of-convex (d.c.) function and a convex function, both of which are positively one-homogeneous. In recent work [8, 9], a general scheme, shown in Algorithm 1 (where denotes the subdifferential of the convex function at ), is proposed for the minimization of a non-negative ratio of a d.c. function and a convex function, both of which are positively one-homogeneous.
It is shown in  that Algorithm 1 generates a sequence such that either or the sequence terminates. Moreover, the cluster points of correspond to critical points of . The scheme is given in Algorithm 1 for the problem , where
Note that are both convex functions and .
Moreover, it is shown in , that if one wants to minimize only over non-constant functions, one has to ensure that . Note that
where if , otherwise it is just the sign function. It is easy to check that for all and all and there always exists a vector for all such that .
In the algorithm the key part is the inner convex problem which one has to solve at each step. In our case it has the form,
where , and .
To solve it more efficiently we derive an equivalent smooth dual formulation for this non-smooth convex problem. We replace by in the following.
Let denote the set of edges and be defined as . Moreover, let denote the simplex, . The above inner problem is equivalent to
where , and is the projection of onto the simplex .
Noting that (see ) and , the inner problem can be rewritten as
The step follows from the standard min-max theorem (see Corollary 37.3.2 in ) since , , and lie in non-empty compact convex sets. In the step , we used that the minimizer of the linear function over the Euclidean ball is given by
if ; otherwise is an arbitrary element of the Euclidean unit ball.
Finally, we have = c . We also know that for a convex set and any given , , where is the projection of onto the set . With , we have for any , and from this the result follows. ∎
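The projection onto the simplex used above can be computed exactly with a standard sort-based procedure. The sketch below assumes the probability simplex (non-negative entries summing to one); if the simplex in the text is scaled differently, the constant `1.0` would have to be adapted. The function name `project_simplex` is illustrative.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                      # entries in decreasing order
    css = np.cumsum(u) - 1.0                  # cumulative sums minus the target sum
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # last index with a positive entry
    tau = css[rho] / (rho + 1)                # shift that makes the result sum to 1
    return np.maximum(v - tau, 0.0)

p_inside = project_simplex([0.2, 0.3, 0.5])   # already on the simplex
p_corner = project_simplex([2.0, 0.0])
p_middle = project_simplex([1.0, 1.0])
```

The cost is dominated by the sort, so the projection takes O(n log n) time for a vector of length n.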
The smooth dual problem can be solved efficiently using first-order projected gradient methods like FISTA , which has a guaranteed convergence rate of $O(L/k^2)$, where $k$ is the number of steps and $L$ is the Lipschitz constant of the gradient of the objective. The bound on the Lipschitz constant for the gradient of the objective in (5) can be rather loose if the weights vary a lot. The rescaling of the variable introduced in Lemma 3.2 leads to a better condition number and also to a tighter bound on the Lipschitz constant. This results in a significant improvement in practical performance.
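For orientation, a generic FISTA iteration for a smooth objective over a convex set looks roughly as follows. This is a sketch on a toy problem (projecting a point onto a box), not the dual problem of the paper; the callables `grad` and `project` and the function name `fista` are assumptions of this example.

```python
import numpy as np

def fista(grad, project, x0, L, n_steps=200):
    """Projected FISTA: gradient step of size 1/L, projection, then momentum."""
    x = project(np.asarray(x0, dtype=float))
    y = x.copy()
    t = 1.0
    for _ in range(n_steps):
        x_next = project(y - grad(y) / L)                   # projected gradient step
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))   # momentum schedule
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)    # extrapolation
        x, t = x_next, t_next
    return x

# Toy problem: minimize 0.5 * ||x - c||^2 subject to -1 <= x_i <= 1.
c = np.array([2.0, -0.3, -5.0])
solution = fista(
    grad=lambda x: x - c,                  # gradient of the quadratic, L = 1
    project=lambda x: np.clip(x, -1.0, 1.0),
    x0=np.zeros(3),
    L=1.0,
)
```

The step size is the reciprocal of the Lipschitz constant, which is why a tighter bound on that constant translates directly into larger steps and faster progress in practice.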
Let be a linear operator defined as and let , for positive constant . The above inner problem is equivalent to
where . The Lipschitz constant of the gradient of is upper bounded by 4.
Let . Then and constraints on transform to and . Since the mapping between and is one-to-one, the transformation yields an equivalent problem (in the sense that a minimizer of one problem can be easily derived from a minimizer of the other problem).
Now we derive a bound on the Lipschitz constant.
The gradient of at w.r.t , and are given by
where is the adjoint operator of given by .
Let denote any other point and . Then we have