# Tight Continuous Relaxation of the Balanced k-Cut Problem

Spectral Clustering as a relaxation of the normalized/ratio cut has become one of the standard graph-based clustering methods. Existing methods for the computation of multiple clusters, corresponding to a balanced k-cut of the graph, are either based on greedy techniques or on heuristics that have a weak connection to the original motivation of minimizing the normalized cut. In this paper we propose a new tight continuous relaxation for any balanced k-cut problem and show that a related recently proposed relaxation is in most cases loose, leading to poor performance in practice. For the optimization of our tight continuous relaxation we propose a new algorithm for the difficult sum-of-ratios minimization problem which achieves monotonic descent. Extensive comparisons show that our method outperforms all existing approaches for ratio cut and other balanced k-cut criteria.


## 1 Introduction

Graph-based techniques for clustering have become very popular in machine learning as they allow for an easy integration of pairwise relationships in data. The problem of finding k clusters in a graph can be formulated as a balanced k-cut problem [1, 2, 3, 4], where ratio and normalized cut are famous instances of balanced graph cut criteria employed for clustering, community detection and image segmentation. The balanced k-cut problem is known to be NP-hard [4] and thus in practice relaxations [4, 5] or greedy approaches [6] are used for finding the optimal multi-cut. The most famous approach is spectral clustering [7], which corresponds to the spectral relaxation of the ratio/normalized cut and uses k-means in the embedding of the vertices found by the first k eigenvectors of the graph Laplacian in order to obtain the clustering. However, the spectral relaxation has been shown to be loose for k = 2 [8] and for k > 2 no guarantees are known on the quality of the obtained k-cut with respect to the optimal one. Moreover, in practice even greedy approaches [6] frequently outperform spectral clustering.

This paper is motivated by another line of recent work [9, 10, 11, 12] where it has been shown that an exact continuous relaxation for the two-cluster case (k = 2) is possible for a quite general class of balancing functions. Moreover, efficient algorithms for its optimization have been proposed which produce much better cuts than the standard spectral relaxation. However, the multi-cut problem still has to be solved via a greedy recursive splitting technique.

Inspired by the recent approach in [13], in this paper we directly tackle the general balanced k-cut problem based on a new tight continuous relaxation. We show that the relaxation for the asymmetric ratio Cheeger cut proposed recently by [13] is loose when the data does not contain well-separated clusters and thus leads to poor performance in practice. Similar to [13] we can also integrate label information, leading to a transductive clustering formulation. Moreover, we propose an efficient algorithm for the minimization of our continuous relaxation for which we can prove monotonic descent. This is in contrast to the algorithm proposed in [13], for which no such guarantee holds. In extensive experiments we show that our method outperforms all existing methods in terms of the achieved balanced k-cuts. Moreover, our clustering error is competitive with respect to several other clustering techniques based on balanced k-cuts and recently proposed approaches based on non-negative matrix factorization. Also, we observe that already a small amount of label information improves the clustering error significantly.

## 2 Balanced Graph Cuts

Graphs are used in machine learning typically as similarity graphs, that is, the weight of an edge between two instances encodes their similarity. Given such a similarity graph of the instances, the problem of clustering the instances into k sets can be transformed into a graph partitioning problem, where the goal is to construct a partition of the graph into k sets such that the cut, that is the sum of weights of the edges from each set to all other sets, is small and all sets in the partition are roughly of equal size.

Before we introduce balanced graph cuts, we briefly fix the setting and notation. Let G = (V, W) denote an undirected, weighted graph with vertex set V of n vertices and symmetric weight matrix W ∈ R^{n×n}_+ with w_ij ≥ 0. There is an edge between two vertices i, j if w_ij > 0. The cut between two sets A, B ⊂ V is defined as cut(A, B) = ∑_{i∈A, j∈B} w_ij and we write 1_C

for the indicator vector of the set

C ⊂ V. A collection of k sets (C_1, …, C_k) is a partition of V if ∪_{i=1}^k C_i = V, C_i ∩ C_j = ∅ if i ≠ j, and C_i ≠ ∅, i = 1, …, k. We denote the set of all k-partitions of V by P_k. Furthermore, we denote by Δ_k the simplex {x ∈ R^k : x_i ≥ 0, ∑_{i=1}^k x_i = 1}.
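As a quick sanity check of this notation, both the cut value and the indicator vector can be computed directly from the weight matrix. The helper names below (`cut`, `indicator`) are ours for illustration, not part of the paper:

```python
import numpy as np

def cut(W, A, B):
    """cut(A, B): sum of the weights w_ij with i in A and j in B."""
    return W[np.ix_(sorted(A), sorted(B))].sum()

def indicator(C, n):
    """Indicator vector 1_C of a vertex set C in a graph on n vertices."""
    v = np.zeros(n)
    v[sorted(C)] = 1.0
    return v
```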

Finally, a set function ^S : 2^V → R is called submodular if for all A, B ⊂ V, ^S(A ∪ B) + ^S(A ∩ B) ≤ ^S(A) + ^S(B). Furthermore, we need the concept of the Lovasz extension of a set function.

###### Definition 1

Let ^S : 2^V → R be a set function with ^S(∅) = 0. Let f ∈ R^V with its components ordered increasingly, f_{j_1} ≤ f_{j_2} ≤ … ≤ f_{j_n}, and define C_i = {j_{i+1}, …, j_n}, where C_0 = V. Then S : R^V → R given by S(f) = ∑_{i=1}^n f_{j_i} (^S(C_{i−1}) − ^S(C_i)) is called the Lovasz extension of ^S. Note that S(1_C) = ^S(C) for all C ⊂ V.

The Lovasz extension of a set function is convex if and only if the set function is submodular [14]. The cut function cut(C, C̄), where C̄ = V \ C, is submodular and its Lovasz extension is given by TV(f) = (1/2) ∑_{i,j=1}^n w_ij |f_i − f_j|.
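To make Definition 1 and the TV identity concrete, the following NumPy sketch (our own helper, assuming the set function is given as a Python callable on frozensets) evaluates the Lovasz extension by the sorted-threshold formula; on the cut function it reproduces TV(f) and satisfies S(1_C) = ^S(C):

```python
import numpy as np

def lovasz_extension(f, S_hat):
    """Lovasz extension of a set function S_hat with S_hat(empty set) = 0,
    evaluated at f via the sorted-threshold formula of Definition 1."""
    n = len(f)
    order = np.argsort(f)                # positions of f sorted increasingly
    val = 0.0
    prev = frozenset(range(n))           # C_0 = V
    for i in range(n):
        Ci = frozenset(order[i + 1:])    # vertices ranked strictly above position i
        val += f[order[i]] * (S_hat(prev) - S_hat(Ci))
        prev = Ci
    return val
```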

### 2.1 Balanced k-cuts

The balanced k-cut problem is defined as

 min_{(C_1,…,C_k)∈P_k}  ∑_{i=1}^k cut(C_i, C̄_i)/^S(C_i) =: BCut(C_1,…,C_k)    (1)

where ^S : 2^V → R_+ is a balancing function with the goal that all sets C_i are of the same “size”. In this paper, we assume that ^S(∅) = 0 and that for any C ≠ ∅, C ⊊ V, ^S(C) ≥ m, for some m > 0. In the literature one finds mainly the following submodular balancing functions (in brackets is the name of the overall balanced graph cut criterion BCut(C_1,…,C_k)),

 ^S(C) = |C|,                      (Ratio Cut),                        (2)
 ^S(C) = min{|C|, |C̄|},           (Ratio Cheeger Cut),
 ^S(C) = min{(k−1)|C|, |C̄|}.      (Asymmetric Ratio Cheeger Cut)

The Ratio Cut is well studied in the literature, e.g. [3, 7, 6], and corresponds to a balancing function without bias towards a particular size of the sets, whereas the Asymmetric Ratio Cheeger Cut recently proposed in [13] has a bias towards sets of size |V|/k (^S attains its maximum at this point), which makes perfect sense if one expects clusters of roughly equal size. An intermediate version between the two is the Ratio Cheeger Cut, which has a symmetric balancing function and strongly penalizes overly large clusters. For the ease of presentation we restrict ourselves to these balancing functions. However, we can also handle the corresponding weighted cases, e.g., ^S(C) = vol(C) = ∑_{i∈C} d_i, where d_i = ∑_{j=1}^n w_ij is the degree of vertex i, leading to the normalized cut [4].
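As a small illustration (our own helper, not the paper's code), the objective BCut in (1) can be evaluated for a given partition under each of the three balancing functions of (2):

```python
import numpy as np

def bcut(W, partition, balance="ratio"):
    """Balanced k-cut objective (1): sum over sets of cut(C, complement)/S_hat(C).
    `partition` is a list of k disjoint index lists covering all vertices."""
    n, k = W.shape[0], len(partition)
    total = 0.0
    for C in partition:
        mask = np.zeros(n, dtype=bool)
        mask[list(C)] = True
        cut_val = W[mask][:, ~mask].sum()       # cut(C, V \ C)
        if balance == "ratio":                  # Ratio Cut: |C|
            s = len(C)
        elif balance == "cheeger":              # Ratio Cheeger Cut: min{|C|, |V\C|}
            s = min(len(C), n - len(C))
        else:                                   # Asymmetric: min{(k-1)|C|, |V\C|}
            s = min((k - 1) * len(C), n - len(C))
        total += cut_val / s
    return total
```

For a perfectly balanced partition the three criteria coincide; they differ on unbalanced partitions, which is exactly where the balancing term matters.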

## 3 Tight Continuous Relaxation for the Balanced k-Cut Problem

In this section we discuss our proposed relaxation for the balanced k-cut problem (1). It turns out that a crucial question towards a tight multi-cut relaxation is the choice of the constraints so that the continuous problem also yields a partition (together with a suitable rounding scheme). The motivation for our relaxation is taken from the recent work of [9, 10, 11], where exact relaxations are shown for the case k = 2. Basically, they replace the ratio of set functions with the ratio of the corresponding Lovasz extensions. We use the same idea for the objective of our continuous relaxation of the k-cut problem (1), which is given as

 min_{F=(F_1,…,F_k), F∈R^{n×k}_+}  ∑_{l=1}^k TV(F_l)/S(F_l)                    (3)
 subject to:  F^{(i)} ∈ Δ_k,    i = 1,…,n,      (simplex constraints)
              max{F^{(i)}} = 1,  ∀ i ∈ I,       (membership constraints)
              S(F_l) ≥ m,      l = 1,…,k,       (size constraints)

where S is the Lovasz extension of the balancing function ^S and m = min_{C⊊V, C≠∅} ^S(C). We have m = 1 for the Ratio Cut and the Ratio Cheeger Cut, whereas m = k − 1 for the Asymmetric Ratio Cheeger Cut. Note that TV is the Lovasz extension of the cut functional cut(C, C̄). In order to simplify notation we denote for a matrix F ∈ R^{n×k} by F_l the l-th column of F and by F^{(i)} the i-th row of F. Note that the rows of F correspond to the vertices of the graph and the l-th column F_l corresponds to the set C_l of the desired partition. The set I ⊆ V in the membership constraints is chosen adaptively by our method during the sequential optimization described in Section 4.

An obvious question is how to get from the continuous solution F* of (3) to a partition (C_1, …, C_k), which is typically called rounding. Given F*, we construct the sets by assigning each vertex i to the column l where the i-th row F*^{(i)} attains its maximum. Formally,

 C_i = {j ∈ V | i = argmax_{s=1,…,k} F_{js}},   i = 1,…,k,    (Rounding)    (4)

where ties are broken randomly. If there exists a row of F* for which the maximum is not unique, we say that the solution is weakly degenerated. If furthermore the resulting sets do not form a partition, that is, one of the sets is empty, then we say that the solution is strongly degenerated.
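The rounding rule (4) and the two degeneracy notions can be sketched as follows; note that `np.argmax` breaks ties deterministically (first maximum) rather than randomly, and weak degeneracy is detected only up to numerical tolerance:

```python
import numpy as np

def round_to_partition(F):
    """Round a continuous solution F (n x k, rows on the simplex) to sets
    via the row-wise argmax rule (4)."""
    n, k = F.shape
    labels = np.argmax(F, axis=1)                     # column of the row maximum
    sets = [np.flatnonzero(labels == i) for i in range(k)]
    weakly = any(np.sum(np.isclose(F[i], F[i].max())) > 1 for i in range(n))
    strongly = any(len(C) == 0 for C in sets)         # some set is empty
    return sets, weakly, strongly
```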

First, we connect our relaxation to the previous work of [11] for the case k = 2. Indeed, for a symmetric balancing function such as the Ratio Cheeger Cut, our continuous relaxation (3) is exact even without membership and size constraints.

###### Theorem 1

Let ^S : 2^V → R_+ be a non-negative symmetric balancing function, let k = 2, and denote by p* the optimal value of (3) without membership and size constraints. Then it holds

 p* = min_{(C_1,C_2)∈P_2}  ∑_{i=1}^2 cut(C_i, C̄_i)/^S(C_i).

Furthermore, there exists a solution F* of (3) such that F*_l = 1_{C*_l}, l = 1, 2, where (C*_1, C*_2) is the optimal balanced 2-cut partition.

Proof:  Note that the cut is a symmetric set function and ^S is symmetric by assumption. Thus with C_2 = C̄_1,

 cut(C_1, C̄_1)/^S(C_1) + cut(C_2, C̄_2)/^S(C_2) = 2 cut(C_1, C̄_1)/^S(C_1).

Moreover, TV(f) = TV(1 − f) and by symmetry of ^S also S(f) = S(1 − f) (see [14, 11]). The simplex constraint implies that F_2 = 1 − F_1 and thus

 TV(F_2)/S(F_2) = TV(1 − F_1)/S(1 − F_1) = TV(F_1)/S(F_1).

Thus we can write problem (3) equivalently as

 min_{f∈[0,1]^V}  2 TV(f)/S(f).

As TV(1_C) = cut(C, C̄) and S(1_C) = ^S(C) for all C ⊂ V, and 1_C ∈ [0,1]^V, we have

 min_{f∈[0,1]^V} TV(f)/S(f) ≤ min_{C⊂V} cut(C, C̄)/^S(C).

However, it has been shown in [11] that min_{f∈R^V} TV(f)/S(f) = min_{C⊂V} cut(C, C̄)/^S(C) and that there exists a continuous solution f* such that f* = 1_{C*}, where C* is an optimal balanced 2-cut. As [0,1]^V ⊂ R^V this finishes the proof.
Note that rounding (4) trivially yields a solution in the setting of the previous theorem.

A second result shows that our proposed optimization problem (3) is indeed a relaxation of the balanced k-cut problem (1). Furthermore, the relaxation is exact if I = V.

###### Proposition 1

The continuous problem (3) is a relaxation of the k-cut problem (1). The relaxation is exact, i.e., both problems are equivalent, if I = V.

Proof:  For any k-way partition (C_1,…,C_k) ∈ P_k, we can construct F = (1_{C_1},…,1_{C_k}). It obviously satisfies the membership and size constraints, and the simplex constraint is satisfied as the sets C_l are disjoint and cover V, so each row of F contains exactly one entry equal to 1. Thus F is feasible for problem (3) and has the same objective value because

 TV(1_C) = cut(C, C̄),    S(1_C) = ^S(C).

Thus problem (3) is a relaxation of (1).

If I = V, then the simplex constraints together with the membership constraints imply that each row of F contains exactly one non-zero element, which equals 1, i.e., F ∈ {0,1}^{n×k}. Define for l = 1,…,k, C_l = {i ∈ V | F_{il} = 1} (i.e., F_l = 1_{C_l}); then it holds ∪_{l=1}^k C_l = V and C_l ∩ C_j = ∅, l ≠ j. From the size constraints, we have for l = 1,…,k, ^S(C_l) = S(F_l) ≥ m > 0, which by the assumption on ^S implies that each C_l is non-empty. Hence the only feasible points allowed are indicator matrices of k-way partitions and the equivalence of (1) and (3) follows.
The row-wise simplex and membership constraints enforce that each vertex in I belongs to exactly one component. Note that these constraints alone (even if I = V) still cannot guarantee that F corresponds to a k-way partition, since an entire column of F can be zero. This is avoided by the column-wise size constraints, which enforce that each component contains at least one vertex.

If I = V, it is immediate from the proof that problem (3) is no longer a continuous problem, as the feasible set consists only of indicator matrices of partitions. In this case rounding trivially yields a partition. On the other hand, if I = ∅ (i.e., no membership constraints), the problem remains continuous and it is not guaranteed that rounding of the solution of the continuous problem yields a partition. Indeed, we will see in the following that for symmetric balancing functions one can, under these conditions, show that the solution is always strongly degenerated and rounding does not yield a partition (see Theorem 2). Thus we observe that the index set I controls the degree to which the partition constraint is enforced. The idea behind our suggested relaxation is that it is well known in image processing that minimizing the total variation yields piecewise constant solutions (in fact this follows from seeing the total variation as the Lovasz extension of the cut). Thus if I is sufficiently large, the vertices whose values are fixed to 0 or 1 propagate this to their neighboring vertices and finally to the whole graph. We discuss the choice of I in more detail in Section 4.

##### Simplex constraints alone are not sufficient to yield a partition:

Our approach has been inspired by [13] who proposed the following continuous relaxation for the Asymmetric Ratio Cheeger Cut

 min_{F=(F_1,…,F_k), F∈R^{n×k}_+}  ∑_{l=1}^k TV(F_l) / ‖F_l − quant_{k−1}(F_l)‖_1    (5)
 subject to:  F^{(i)} ∈ Δ_k,  i = 1,…,n,   (simplex constraints)

where TV is the Lovasz extension of the cut and quant_{k−1}(f) is the

(k−1)-quantile of

f ∈ R^n. Note that in their approach no membership constraints and size constraints are present.

We now show that the use of simplex constraints alone in the optimization problem (3) is not sufficient to guarantee that the solution can be rounded to a partition for any symmetric balancing function in (1). For asymmetric balancing functions, as employed for the Asymmetric Ratio Cheeger Cut by [13] in their relaxation (5), we can prove such a strong result only in the case where the graph is disconnected. However, note that if the number of connected components of the graph is less than the number of desired clusters k, the multi-cut problem is still non-trivial.

###### Theorem 2

Let ^S be any non-negative symmetric balancing function. Then the continuous relaxation

 min_{F=(F_1,…,F_k), F∈R^{n×k}_+}  ∑_{l=1}^k TV(F_l)/S(F_l)                    (6)
 subject to:  F^{(i)} ∈ Δ_k,  i = 1,…,n,   (simplex constraints)

of the balanced k-cut problem (1) is void in the sense that the optimal solution of the continuous problem can be constructed from the optimal solution of the balanced 2-cut problem and cannot be rounded into a k-way partition, see (4). If the graph is disconnected, then the same holds also for any non-negative asymmetric balancing function.

Proof:  First, we derive a lower bound on the optimum of the continuous relaxation (6). Then we construct a feasible point for (6) that achieves this lower bound but cannot yield a partition, thus finishing the proof.

Let (C*, C̄*) be an optimal balanced 2-way partition for the given graph. Using the exact relaxation result for the balanced 2-cut problem (Theorem 3.1 in [11]), we have

 min_{F: F^{(i)}∈Δ_k} ∑_{l=1}^k TV(F_l)/S(F_l) ≥ ∑_{l=1}^k min_{f∈R^n} TV(f)/S(f) = ∑_{l=1}^k min_{C⊂V} cut(C, C̄)/^S(C) = k · cut(C*, C̄*)/^S(C*).

Now define F_1 = 1_{C*} and F_l = α_l 1_{C̄*}, l = 2,…,k, with α_l > 0 such that ∑_{l=2}^k α_l = 1. Clearly F is feasible for the problem (6) and the corresponding objective value is

 TV(1_{C*})/S(1_{C*}) + ∑_{l=2}^k α_l TV(1_{C̄*}) / (α_l S(1_{C̄*})) = ∑_{l=1}^k cut(C*, C̄*)/^S(C*),

where we used the 1-homogeneity of TV and S [14] and the symmetry of ^S and cut.

Thus the solution constructed as above from the 2-cut problem is indeed optimal for the continuous relaxation (6), and it is not possible to obtain a k-way partition from this solution, as k − 2 of the rounded sets will be empty. Finally, the argument can be extended to asymmetric set functions if there exists a set C* with cut(C*, C̄*) = 0, as in this case it does not matter that ^S(C*) ≠ ^S(C̄*) in order for the argument to hold.
The proof of Theorem 2 additionally shows that for any balancing function, if the graph is disconnected, the solution of the continuous relaxation (6) is always zero, while clearly the solution of the balanced k-cut problem need not be zero. This shows that the relaxation can be arbitrarily bad in this case. In fact, the relaxation for the asymmetric case can even fail if the graph is not disconnected but there exists a cut of the graph which is very small, as the following corollary indicates.

###### Corollary 1

Let ^S be an asymmetric balancing function, let (C*, C̄*) be an optimal balanced 2-way partition and suppose that cut(C*, C̄*)/^S(C̄*) + (k−1)·cut(C*, C̄*)/^S(C*) < min_{(C_1,…,C_k)∈P_k} BCut(C_1,…,C_k). Then there exists a feasible F for (6) with F_1 = 1_{C̄*} and F_l = α_l 1_{C*}, l = 2,…,k, where α_l > 0, ∑_{l=2}^k α_l = 1, which has objective value strictly smaller than the optimal balanced k-cut and which cannot be rounded to a k-way partition.

Proof:  Let F_1 = 1_{C̄*} and F_l = α_l 1_{C*}, l = 2,…,k, with α_l > 0 such that ∑_{l=2}^k α_l = 1. Clearly F is feasible for the problem (6) and the corresponding objective value is

 ∑_{l=1}^k TV(F_l)/S(F_l) = cut(C*, C̄*)/^S(C̄*) + (k−1)·cut(C*, C̄*)/^S(C*),

where we used the 1-homogeneity of TV and S [14] and the symmetry of the cut. This F cannot be rounded into a k-way partition as k − 2 of the resulting sets will be empty.
Theorem 2 shows that the membership and size constraints which we have introduced in our relaxation (3) are essential to obtain a partition for symmetric balancing functions. For the asymmetric balancing function, failure of the relaxation (6), and thus also of the relaxation (5) of [13], is only guaranteed for disconnected graphs. However, Corollary 1 indicates that degenerated solutions should also be a problem when the graph is connected but contains a dominating cut. We illustrate this with a toy example in Figure 1, where the algorithm of [13] for solving (5) fails as it converges exactly to the solution predicted by Corollary 1 and thus only produces a 2-partition instead of the desired k-partition. The algorithm for our relaxation enforcing membership constraints converges to a continuous solution which is in fact a partition matrix, so that no rounding is necessary.

## 4 Monotonic Descent Method for Minimization of a Sum of Ratios

Apart from the new relaxation, another key contribution of this paper is the derivation of an algorithm which yields a sequence of feasible points for the difficult non-convex problem (3) and monotonically reduces the corresponding objective. We would like to note that the algorithm proposed by [13] for (5) does not yield monotonic descent. In fact, it is unclear what the guarantee derived in [13] implies for the generated sequence. Moreover, our algorithm works for any non-negative submodular balancing function.

The key insight in order to derive a monotonic descent method for solving the sum-of-ratios minimization problem (3) is to eliminate the ratios by introducing a new set of variables β = (β_1,…,β_k) ∈ R^k_+.

 min_{F=(F_1,…,F_k), F∈R^{n×k}_+, β∈R^k_+}  ∑_{l=1}^k β_l                      (7)
 subject to:  TV(F_l) ≤ β_l S(F_l),  l = 1,…,k,   (descent constraints)
              F^{(i)} ∈ Δ_k,    i = 1,…,n,        (simplex constraints)
              max{F^{(i)}} = 1,  ∀ i ∈ I,         (membership constraints)
              S(F_l) ≥ m,      l = 1,…,k.         (size constraints)

Note that at the optimal solution of this problem it holds that TV(F_l) = β_l S(F_l) (otherwise one can decrease β_l and hence the objective), and thus equivalence with (3) holds. This is still a non-convex problem as the descent, membership and size constraints are non-convex. Our algorithm now proceeds in a sequential manner. At each iterate F^t we construct a convex inner approximation of the constraint set, that is, the convex approximation is a subset of the non-convex constraint set, based on the current iterate F^t. Then we solve the resulting convex optimization problem and repeat the process. In this way we obtain a sequence of feasible points for the original problem (7) for which we will prove monotonic descent in the sum of ratios.

Convex approximation: As ^S is submodular, S is convex. Let s^t_l ∈ ∂S(F^t_l) be an element of the sub-differential of S at the current iterate F^t_l. By Prop. 3.2 in [14], such a subgradient is given by (s^t_l)_{j_i} = ^S(C_{i−1}) − ^S(C_i), where j_i is the index of the i-th smallest component of F^t_l and C_i = {j_{i+1},…,j_n}. Moreover, using the definition of the subgradient and the 1-homogeneity of S, we have S(F_l) ≥ S(F^t_l) + ⟨s^t_l, F_l − F^t_l⟩ = ⟨s^t_l, F_l⟩.
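The subgradient from Prop. 3.2 in [14] can be computed by a single sort, as in this sketch (our own helper; the set function is passed as a Python callable on frozensets):

```python
import numpy as np

def lovasz_subgradient(f, S_hat):
    """A subgradient of the Lovasz extension S at f (Prop. 3.2 in [14]):
    with f_{j_1} <= ... <= f_{j_n}, set s_{j_i} = S_hat(C_{i-1}) - S_hat(C_i),
    where C_i = {j_{i+1}, ..., j_n}."""
    n = len(f)
    order = np.argsort(f)
    s = np.zeros(n)
    for i in range(n):
        upper = frozenset(order[i:])       # C_{i-1} = {j_i, ..., j_n}
        rest = frozenset(order[i + 1:])    # C_i
        s[order[i]] = S_hat(upper) - S_hat(rest)
    return s
```

By 1-homogeneity, ⟨s^t_l, F^t_l⟩ = S(F^t_l), which is what makes the linearization ⟨s^t_l, F_l⟩ a tight lower bound at the current iterate.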

For the descent constraints, let λ^t_l = TV(F^t_l)/S(F^t_l) and introduce new variables δ_l that capture the amount of change in each ratio, β_l = λ^t_l + δ_l. We further decompose δ_l as δ_l = δ^+_l − δ^−_l with δ^+_l ≥ 0, δ^−_l ≥ 0. Let M = max_{C⊆V} ^S(C); then for β_l = λ^t_l + δ^+_l − δ^−_l and l = 1,…,k,

 TV(F_l) − β_l S(F_l) ≤ TV(F_l) − λ^t_l ⟨s^t_l, F_l⟩ − δ^+_l S(F_l) + δ^−_l S(F_l)
                      ≤ TV(F_l) − λ^t_l ⟨s^t_l, F_l⟩ − δ^+_l m + δ^−_l M.

Finally, note that because of the simplex constraints, the membership constraints can be rewritten as max_{s=1,…,k} F_{is} = 1, i ∈ I. Let F^t be the current iterate and define j_i = argmax_{s=1,…,k} F^t_{is} (ties are broken randomly). Then the membership constraints can be relaxed as follows: F_{i j_i} = 1, ∀ i ∈ I. As F^{(i)} ∈ Δ_k we get F_{is} = 0, s ≠ j_i. Thus the convex approximation of the membership constraints fixes the assignment of the i-th point to a cluster and can be interpreted as a “label constraint”. However, unlike the transductive setting, the labels for the vertices in I are automatically chosen by our method. The actual choice of the set I will be discussed in Section 4.1. We use the notation L = {(i, j_i) | i ∈ I} for the label set generated from I (note that L is fixed once F^t and I are fixed).

Descent algorithm: Our descent algorithm for minimizing (7) solves at each iteration the following convex optimization problem (8).

 min_{F∈R^{n×k}_+, δ^+∈R^k_+, δ^−∈R^k_+}  ∑_{l=1}^k δ^+_l − δ^−_l             (8)
 subject to:  TV(F_l) ≤ λ^t_l ⟨s^t_l, F_l⟩ + δ^+_l m − δ^−_l M,  l = 1,…,k,   (descent constraints)
              F^{(i)} ∈ Δ_k,    i = 1,…,n,      (simplex constraints)
              F_{i j_i} = 1,    ∀ (i, j_i) ∈ L,  (label constraints)
              ⟨s^t_l, F_l⟩ ≥ m,  l = 1,…,k.      (size constraints)

As its solution F^{t+1} is feasible for (3), we update λ^{t+1}_l = TV(F^{t+1}_l)/S(F^{t+1}_l) and s^{t+1}_l ∈ ∂S(F^{t+1}_l) and repeat the process until the sequence terminates, that is, no further descent is possible (as the following theorem states), or the relative descent in ∑_{l=1}^k λ_l is smaller than a predefined threshold ε. The following Theorem 3 shows the monotonic descent property of our algorithm.

###### Theorem 3

The sequence {F^t} produced by the above algorithm satisfies ∑_{l=1}^k TV(F^{t+1}_l)/S(F^{t+1}_l) < ∑_{l=1}^k TV(F^t_l)/S(F^t_l) for all t ≥ 0, or the algorithm terminates.

Proof:  Let (F^{t+1}, δ^{+,t+1}, δ^{−,t+1}) be the optimal solution of the inner problem (8). By the feasibility of F^{t+1} and using ⟨s^t_l, F^{t+1}_l⟩ ≤ S(F^{t+1}_l) as well as m ≤ S(F^{t+1}_l) ≤ M,

 TV(F^{t+1}_l)/S(F^{t+1}_l) ≤ λ^t_l + (m δ^{+,t+1}_l − M δ^{−,t+1}_l)/S(F^{t+1}_l) ≤ λ^t_l + δ^{+,t+1}_l − δ^{−,t+1}_l.

Summing over all ratios, we have

 ∑_{l=1}^k TV(F^{t+1}_l)/S(F^{t+1}_l) ≤ ∑_{l=1}^k λ^t_l + ∑_{l=1}^k (δ^{+,t+1}_l − δ^{−,t+1}_l).

Noting that (F^t, δ^+ = 0, δ^− = 0) is feasible for (8), the optimal value ∑_{l=1}^k (δ^{+,t+1}_l − δ^{−,t+1}_l) has to be either strictly negative, in which case we have strict descent,

 ∑_{l=1}^k TV(F^{t+1}_l)/S(F^{t+1}_l) < ∑_{l=1}^k λ^t_l = ∑_{l=1}^k TV(F^t_l)/S(F^t_l),

or the previous iterate F^t together with δ^+ = 0, δ^− = 0 is already optimal and hence the algorithm terminates.
The inner problem (8) is convex but contains the non-smooth term TV(F_l) in the constraints. We eliminate the non-smoothness by introducing additional variables and derive an equivalent linear programming (LP) formulation, which we solve via the PDHG algorithm [15, 16]. The LP and the exact iterates can be found in the supplementary material.

###### Lemma 1

The convex inner problem (8) is equivalent to the following linear optimization problem (9), where E is the set of edges of the graph and w ∈ R^{|E|}_+ are the edge weights.

 min_{F∈R^{n×k}_+, α∈R^{|E|×k}_+, δ^+∈R^k_+, δ^−∈R^k_+}  ∑_{l=1}^k δ^+_l − δ^−_l                  (9)
 subject to:  ⟨w, α_l⟩ ≤ λ^t_l ⟨s^t_l, F_l⟩ + δ^+_l m − δ^−_l M,  l = 1,…,k,   (descent constraints)
              F^{(i)} ∈ Δ_k,    i = 1,…,n,       (simplex constraints)
              F_{i j_i} = 1,    ∀ (i, j_i) ∈ L,   (label constraints)
              ⟨s^t_l, F_l⟩ ≥ m,  l = 1,…,k,       (size constraints)
              −(α_l)_{ij} ≤ F_{il} − F_{jl} ≤ (α_l)_{ij},  l = 1,…,k,  ∀ (i,j) ∈ E.

Proof:  We define new variables α_l ∈ R^{|E|} for each column F_l and introduce the constraints (α_l)_{ij} = |F_{il} − F_{jl}|, which allows us to rewrite TV(F_l) as ⟨w, α_l⟩. These equality constraints can be replaced by the inequality constraints (α_l)_{ij} ≥ |F_{il} − F_{jl}| without changing the optimum of the problem, because at the optimum these constraints are active; otherwise one could decrease (α_l)_{ij} while remaining feasible since w is non-negative. Finally, these inequality constraints are rewritten as stated using the fact that |a| ≤ b ⟺ −b ≤ a ≤ b.

#### 4.0.1 Solving the LP via PDHG

Recently, first-order primal-dual hybrid gradient (PDHG) methods have been proposed [17, 15] to efficiently solve a class of convex optimization problems that can be rewritten as the following saddle-point problem

 min_{x∈X} max_{y∈Y}  ⟨Ax, y⟩ + G(x) − Φ*(y),

where X and Y are finite-dimensional vector spaces, A : X → Y is a linear operator, and G : X → R ∪ {∞} and Φ* : Y → R ∪ {∞}

are convex functions. It has been shown that the PDHG algorithm achieves good performance in solving huge linear programming problems that appear in computer vision applications. We now show how the linear programming problem

 min_{x≥0}  ⟨c, x⟩
 subject to:  A_1 x ≤ b_1
              A_2 x = b_2

can be rewritten as a saddle-point problem so that PDHG can be applied.

By introducing the Lagrange multipliers y_1 ≥ 0 for the inequality constraints and y_2 for the equality constraints, the optimal value of the LP can be written as

 min_x max_{y_1, y_2}  ⟨c, x⟩ + ι_{x≥0}(x) + ⟨y_1, A_1 x⟩ + ⟨y_2, A_2 x⟩ − ⟨b_1, y_1⟩ − ⟨b_2, y_2⟩ − ι_{y_1≥0}(y_1),

where ι_{·≥0} is the indicator function that takes the value 0 on the non-negative orthant and +∞ elsewhere.

Define y = (y_1, y_2), b = (b_1, b_2) and A = (A_1; A_2), the row-wise concatenation. Then the saddle-point problem corresponding to the LP is given by

 min_x max_{y_1, y_2}  ⟨c, x⟩ + ι_{x≥0}(x) + ⟨y, Ax⟩ − ⟨b, y⟩ − ι_{y_1≥0}(y_1).

The primal and dual iterates for this saddle-point problem can be obtained as

 x^{r+1} = max{0, x^r − τ(A^T y^r + c)},
 y^{r+1}_1 = max{0, y^r_1 + σ(A_1 x̄^{r+1} − b_1)},
 y^{r+1}_2 = y^r_2 + σ(A_2 x̄^{r+1} − b_2),

where x̄^{r+1} = 2x^{r+1} − x^r and τ, σ > 0 are the primal and dual step sizes.
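These iterates can be written as a short generic loop. The sketch below is our own minimal template for the LP above, assuming step sizes with τσ‖A‖² ≤ 1 (the helper name `pdhg_lp` is ours):

```python
import numpy as np

def pdhg_lp(c, A1, b1, A2, b2, tau, sigma, iters=20000):
    """PDHG iterates for  min_{x >= 0} <c, x>  s.t.  A1 x <= b1,  A2 x = b2."""
    x = np.zeros(len(c))
    y1 = np.zeros(A1.shape[0])     # multipliers for A1 x <= b1, kept >= 0
    y2 = np.zeros(A2.shape[0])     # multipliers for A2 x = b2, free sign
    for _ in range(iters):
        x_old = x
        x = np.maximum(0.0, x - tau * (A1.T @ y1 + A2.T @ y2 + c))  # primal step
        x_bar = 2 * x - x_old                                       # over-relaxation
        y1 = np.maximum(0.0, y1 + sigma * (A1 @ x_bar - b1))        # dual steps
        y2 = y2 + sigma * (A2 @ x_bar - b2)
    return x
```

For the LP (9), the inequality block A_1 collects the descent, size and α-box constraints and the equality block A_2 the simplex and label constraints; the loop above is only the generic template.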