Graph-based techniques for clustering have become very popular in machine learning as they allow for an easy integration of pairwise relationships in data. The problem of finding $k$ clusters in a graph can be formulated as a balanced $k$-cut problem [1, 2, 3, 4], where ratio and normalized cut are famous instances of balanced graph cut criteria employed for clustering, community detection and image segmentation. The balanced $k$-cut problem is known to be NP-hard  and thus in practice relaxations [4, 5] or greedy approaches  are used to approximate the optimal multi-cut. The most famous approach is spectral clustering , which corresponds to the spectral relaxation of the ratio/normalized cut and uses $k$-means in the embedding of the vertices found by the first $k$ eigenvectors of the graph Laplacian in order to obtain the clustering. However, the spectral relaxation has been shown to be loose for $k=2$  and for $k>2$ no guarantees are known on the quality of the obtained $k$-cut with respect to the optimal one. Moreover, in practice even greedy approaches  frequently outperform spectral clustering.
This paper is motivated by another line of recent work [9, 10, 11, 12] where it has been shown that an exact continuous relaxation for the two-cluster case ($k=2$) is possible for a quite general class of balancing functions. Moreover, efficient algorithms for its optimization have been proposed which produce much better cuts than the standard spectral relaxation. However, the multi-cut problem still has to be solved via the greedy recursive splitting technique.
Inspired by the recent approach in , in this paper we tackle directly the general balanced $k$-cut problem based on a new tight continuous relaxation. We show that the relaxation for the asymmetric ratio Cheeger cut proposed recently by  is loose when the data does not contain well-separated clusters and thus leads to poor performance in practice. Similar to  we can also integrate label information, leading to a transductive clustering formulation. Moreover, we propose an efficient algorithm for the minimization of our continuous relaxation for which we can prove monotonic descent. This is in contrast to the algorithm proposed in  for which no such guarantee holds. In extensive experiments we show that our method outperforms all existing methods in terms of the achieved balanced $k$-cuts. Moreover, our clustering error is competitive with respect to several other clustering techniques based on balanced $k$-cuts and recently proposed approaches based on non-negative matrix factorization. We also observe that already with a small amount of label information the clustering error improves significantly.
2 Balanced Graph Cuts
Graphs are used in machine learning typically as similarity graphs, that is, the weight of an edge between two instances encodes their similarity. Given such a similarity graph of the instances, the problem of clustering into $k$ sets can be transformed into a graph partitioning problem, where the goal is to construct a partition of the graph into $k$ sets such that the cut, that is the sum of the weights of the edges from each set to all other sets, is small and all sets in the partition are roughly of equal size.
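Before we introduce balanced graph cuts, we briefly fix the setting and notation. Let $G=(V,W)$ denote an undirected, weighted graph with vertex set $V$ with $n=|V|$ vertices and weight matrix $W\in\mathbb{R}_+^{n\times n}$ with $W=W^T$. There is an edge between two vertices $i,j\in V$ if $w_{ij}>0$. The cut between two sets $A,B\subset V$ is defined as $\mathrm{cut}(A,B)=\sum_{i\in A,\,j\in B}w_{ij}$ and we write $\mathbf{1}_A$ for the indicator vector of a set $A\subset V$. A collection of $k$ sets $(C_1,\ldots,C_k)$ is a partition of $V$ if $\cup_{i=1}^k C_i=V$, $C_i\cap C_j=\emptyset$ if $i\neq j$ and $|C_i|\geq 1$, $i=1,\ldots,k$. We denote the set of all $k$-partitions of $V$ by $P_k$. Furthermore, we denote by $\Delta_k$ the simplex $\{x\in\mathbb{R}^k : x\geq 0,\ \sum_{i=1}^k x_i=1\}$.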
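Finally, a set function $\hat{S}:2^V\to\mathbb{R}$ is called submodular if for all $A,B\subset V$, $\hat{S}(A\cup B)+\hat{S}(A\cap B)\leq\hat{S}(A)+\hat{S}(B)$. Furthermore, we need the concept of the Lovász extension of a set function.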
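Let $\hat{S}:2^V\to\mathbb{R}$ be a set function with $\hat{S}(\emptyset)=0$. Let $f\in\mathbb{R}^V$ be ordered in increasing order $f_1\leq f_2\leq\ldots\leq f_n$ and define $C_i=\{j\in V \mid f_j>f_i\}$ where $C_0=V$. Then $S:\mathbb{R}^V\to\mathbb{R}$ given by $S(f)=\sum_{i=1}^n f_i\big(\hat{S}(C_{i-1})-\hat{S}(C_i)\big)$ is called the Lovász extension of $\hat{S}$. Note that $S(\mathbf{1}_A)=\hat{S}(A)$ for all $A\subset V$.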
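The Lovász extension of a set function is convex if and only if the set function is submodular . The cut function $\mathrm{cut}(C,\overline{C})$, where $\overline{C}=V\setminus C$, is submodular and its Lovász extension is given by the total variation $TV(f)=\frac{1}{2}\sum_{i,j=1}^n w_{ij}|f_i-f_j|$.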
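As a numerical sanity check of the definition above, the following Python sketch evaluates the Lovász extension of an arbitrary set function by sorting the entries of $f$ and telescoping over the nested sets, and compares the result for the cut function with the total variation. The helper names (`lovasz_extension`, `cut_value`, `total_variation`) and the small path graph are illustrative assumptions and do not come from the paper.

```python
def lovasz_extension(set_fn, f):
    """Evaluate the Lovasz extension of set_fn (with set_fn(empty set) = 0) at f."""
    order = sorted(range(len(f)), key=lambda j: f[j])  # indices in increasing order of f
    value = 0.0
    prev = set_fn(frozenset(order))                    # C_0 = V
    for pos, j in enumerate(order):
        cur = set_fn(frozenset(order[pos + 1:]))       # indices ranked after position pos
        value += f[j] * (prev - cur)
        prev = cur
    return value


def cut_value(W, C):
    """Weight of the edges between C and its complement (W symmetric, list of lists)."""
    return sum(W[i][j] for i in C for j in range(len(W)) if j not in C)


def total_variation(W, f):
    """TV(f) = 1/2 * sum_{i,j} w_ij * |f_i - f_j|."""
    n = len(W)
    return 0.5 * sum(W[i][j] * abs(f[i] - f[j]) for i in range(n) for j in range(n))


# Hypothetical toy graph: the weighted path 0 - 1 - 2 - 3.
W = [[0, 1, 0, 0],
     [1, 0, 2, 0],
     [0, 2, 0, 3],
     [0, 0, 3, 0]]
f = [0.2, 0.9, 0.1, 0.5]

s_f = lovasz_extension(lambda C: cut_value(W, C), f)    # Lovasz extension of the cut
tv_f = total_variation(W, f)                            # should agree with s_f

indicator = [1.0, 1.0, 0.0, 0.0]                        # indicator vector of A = {0, 1}
s_ind = lovasz_extension(lambda C: cut_value(W, C), indicator)
```

On this example both `s_f` and `tv_f` evaluate the same quantity, and `s_ind` reproduces the cut of the set $\{0,1\}$, in line with $S(\mathbf{1}_A)=\hat{S}(A)$.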
2.1 Balanced $k$-cuts
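The balanced $k$-cut problem is defined as
$$\min_{(C_1,\ldots,C_k)\in P_k}\ \sum_{i=1}^k \frac{\mathrm{cut}(C_i,\overline{C_i})}{\hat{S}(C_i)} \tag{1}$$
where $\hat{S}:2^V\to\mathbb{R}_+$ is a balancing function with the goal that all sets $C_i$ are of the same “size”. In this paper, we assume that $\hat{S}(\emptyset)=0$ and for any $C\subsetneq V$, $C\neq\emptyset$, $\hat{S}(C)\geq m$, for some $m>0$. In the literature one finds mainly the following submodular balancing functions (in brackets is the name of the overall balanced graph cut criterion),
$$\begin{aligned}
\hat{S}(C)&=|C| && \text{(Ratio Cut)},\\
\hat{S}(C)&=\min\{|C|,|\overline{C}|\} && \text{(Ratio Cheeger Cut)},\\
\hat{S}(C)&=\min\{(k-1)|C|,|\overline{C}|\} && \text{(Asymmetric Ratio Cheeger Cut)}.
\end{aligned} \tag{2}$$
The Ratio Cut is well studied in the literature, e.g. [3, 7, 6], and corresponds to a balancing function without bias towards a particular size of the sets, whereas the Asymmetric Ratio Cheeger Cut recently proposed in  has a bias towards sets of size $\frac{|V|}{k}$ ($\hat{S}(C)$ attains its maximum at this point), which makes perfect sense if one expects clusters which have roughly equal size. An intermediate version between the two is the Ratio Cheeger Cut, which has a symmetric balancing function and strongly penalizes overly large clusters. For ease of presentation we restrict ourselves to these balancing functions. However, we can also handle the corresponding weighted cases, e.g., $\hat{S}(C)=\mathrm{vol}(C)=\sum_{i\in C}d_i$, where $d_i=\sum_{j=1}^n w_{ij}$, leading to the normalized cut.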
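To make the objective concrete, the sketch below evaluates the balanced $k$-cut of a given partition, assuming the standard forms of the three balancing functions named in the text: $|C|$, $\min\{|C|,|\overline{C}|\}$ and $\min\{(k-1)|C|,|\overline{C}|\}$. The function names and the toy graph (two triangles joined by one weak edge) are hypothetical and chosen only for illustration.

```python
def cut_value(W, C):
    """Weight of the edges between C and its complement (W symmetric, list of lists)."""
    return sum(W[i][j] for i in C for j in range(len(W)) if j not in C)


def ratio_cut(C, n, k):
    return len(C)


def ratio_cheeger(C, n, k):
    return min(len(C), n - len(C))


def asymmetric_ratio_cheeger(C, n, k):
    return min((k - 1) * len(C), n - len(C))


def balanced_cut(W, partition, balance):
    """Sum over all sets of cut(C, complement) / balance(C)."""
    n, k = len(W), len(partition)
    return sum(cut_value(W, C) / balance(C, n, k) for C in partition)


def two_triangles(bridge=0.1):
    """Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge (2,3)."""
    W = [[0.0] * 6 for _ in range(6)]
    edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
             (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, bridge)]
    for a, b, w in edges:
        W[a][b] = W[b][a] = w
    return W


W = two_triangles()
good = [{0, 1, 2}, {3, 4, 5}]      # cuts only the weak bridge
bad = [{0}, {1, 2, 3, 4, 5}]       # isolates a single vertex

good_rcut = balanced_cut(W, good, ratio_cut)
bad_rcut = balanced_cut(W, bad, ratio_cut)
bad_rcc = balanced_cut(W, bad, ratio_cheeger)
```

The partition along the bridge has a far smaller value than the one isolating a single vertex, and the Ratio Cheeger Cut penalizes the unbalanced partition more strongly than the Ratio Cut.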
3 Tight Continuous Relaxation for the Balanced $k$-Cut Problem
In this section we discuss our proposed relaxation for the balanced $k$-cut problem (1). It turns out that a crucial question towards a tight multi-cut relaxation is the choice of the constraints so that the continuous problem also yields a partition (together with a suitable rounding scheme). The motivation for our relaxation is taken from the recent work of [9, 10, 11], where exact relaxations are shown for the case $k=2$. Basically, they replace the ratio of set functions with the ratio of the corresponding Lovász extensions. We use the same idea for the objective of our continuous relaxation of the $k$-cut problem (1), which is given as
$$\begin{aligned}
\min_{F=(F_1,\ldots,F_k),\ F\in\mathbb{R}_+^{n\times k}}\quad & \sum_{l=1}^k \frac{TV(F_l)}{S(F_l)}\\
\text{subject to:}\quad & F_{(i)}\in\Delta_k,\quad i=1,\ldots,n, && \text{(simplex constraints)}\\
& \max\{F_{(i)}\}=1,\quad \forall i\in I, && \text{(membership constraints)}\\
& S(F_l)\geq m,\quad l=1,\ldots,k, && \text{(size constraints)}
\end{aligned} \tag{3}$$
where $S$ is the Lovász extension of the set function $\hat{S}$ and $m=\min_{C\subsetneq V,\,C\neq\emptyset}\hat{S}(C)$. We have $m=1$ for Ratio Cut and Ratio Cheeger Cut whereas $m=k-1$ for Asymmetric Ratio Cheeger Cut. Note that $TV$ is the Lovász extension of the cut functional $\mathrm{cut}(C,\overline{C})$. In order to simplify notation we denote for a matrix $F\in\mathbb{R}^{n\times k}$ by $F_l$ the $l$-th column of $F$ and by $F_{(i)}$ the $i$-th row of $F$. Note that the rows of $F$ correspond to the vertices of the graph and the $l$-th column of $F$ corresponds to the set $C_l$ of the desired partition. The set $I\subset V$ in the membership constraints is chosen adaptively by our method during the sequential optimization described in Section 4.
An obvious question is how to get from the continuous solution $F^*$ of (3) to a partition $(C_1,\ldots,C_k)$, which is typically called rounding. Given $F^*$ we construct the sets by assigning each vertex $i$ to the column where the $i$-th row attains its maximum. Formally,
$$C_l=\{\,i\in V \mid l=\mathop{\mathrm{arg\,max}}_{s=1,\ldots,k} F^*_{is}\,\},\quad l=1,\ldots,k, \tag{4}$$
where ties are broken randomly. If there exists a row such that the rounding is not unique, we say that the solution is weakly degenerated. If furthermore the resulting sets do not form a partition, that is one of the sets is empty, then we say that the solution is strongly degenerated.
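The rounding step can be sketched in a few lines of Python. The function below is a minimal illustration, assuming the continuous solution is given as a list of rows; the name `round_to_partition` and the two returned flags (whether any row had a tie, and whether any set ended up empty) are assumptions introduced here to mirror the two kinds of degeneracy described above.

```python
import random


def round_to_partition(F, rng=random):
    """Assign each vertex (row of F) to a column where its row attains the maximum.

    Returns (sets, has_ties, has_empty_set); ties are broken at random via rng.
    """
    k = len(F[0])
    sets = [set() for _ in range(k)]
    has_ties = False
    for vertex, row in enumerate(F):
        best = max(row)
        winners = [l for l, value in enumerate(row) if value == best]
        if len(winners) > 1:
            has_ties = True                      # rounding of this row is not unique
        sets[rng.choice(winners)].add(vertex)
    has_empty_set = any(len(C) == 0 for C in sets)
    return sets, has_ties, has_empty_set


# A solution that rounds cleanly to three non-empty sets.
F_ok = [[0.9, 0.1, 0.0],
        [0.2, 0.7, 0.1],
        [0.1, 0.2, 0.7]]
sets_ok, ties_ok, empty_ok = round_to_partition(F_ok)

# A solution whose third column never attains the row maximum.
F_bad = [[1.0, 0.0, 0.0],
         [0.0, 0.6, 0.4],
         [0.0, 0.6, 0.4]]
sets_bad, ties_bad, empty_bad = round_to_partition(F_bad)
```

In the second example the rounding is unique for every row, yet the third set stays empty, so no 3-way partition is obtained.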
First, we connect our relaxation to the previous work of  for the case $k=2$. Indeed, for a symmetric balancing function such as that of the Ratio Cheeger Cut, our continuous relaxation (3) is exact even without membership and size constraints.
Proof: Note that is a symmetric set function and by assumption. Thus with ,
Thus we can write problem (3) equivalently as
As for all , and , we have
However, it has been shown in  that
and that there exists a continuous solution such that , where . As this finishes the proof.
Note that rounding trivially yields a solution in the setting of the previous theorem.
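Proof: For any $k$-way partition $(C_1,\ldots,C_k)$, we can construct $F=(\mathbf{1}_{C_1},\ldots,\mathbf{1}_{C_k})$. It obviously satisfies the membership and size constraints, and the simplex constraint is satisfied as $\cup_{i}C_i=V$ and $C_i\cap C_j=\emptyset$ if $i\neq j$. Thus $F$ is feasible for problem (3) and has the same objective value because $TV(\mathbf{1}_{C_l})=\mathrm{cut}(C_l,\overline{C_l})$ and $S(\mathbf{1}_{C_l})=\hat{S}(C_l)$ for every $l$.
If $I=V$, then the simplex together with the membership constraints imply that each row $F_{(i)}$ contains exactly one non-zero element which equals 1, i.e., $F\in\{0,1\}^{n\times k}$.
Define for $l=1,\ldots,k$, $C_l=\{i\in V \mid F_{il}=1\}$ (i.e., $F_l=\mathbf{1}_{C_l}$), then it holds $\cup_{l}C_l=V$ and $C_l\cap C_j=\emptyset$, $l\neq j$.
From the size constraints, we have for $l=1,\ldots,k$, $0<m\leq S(F_l)=S(\mathbf{1}_{C_l})=\hat{S}(C_l)$. Thus $\hat{S}(C_l)>0$, which by assumption on $\hat{S}$ implies that each $C_l$ is non-empty.
Hence the only feasible points allowed are indicators of $k$-way partitions and the equivalence of (1) and (3) follows.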
The row-wise simplex and membership constraints enforce that each vertex in $I$ belongs to exactly one component. Note that these constraints alone (even if $I=V$) still cannot guarantee that $F$ corresponds to a $k$-way partition since an entire column of $F$ can be zero. This is avoided by the column-wise size constraints that enforce that each component has at least one vertex.
If $I=V$ it is immediate from the proof that problem (3) is no longer a continuous problem, as the feasible set consists only of indicator matrices of partitions. In this case rounding trivially yields a partition. On the other hand, if $I=\emptyset$ (i.e., no membership constraints) and $k>2$, it is not guaranteed that rounding of the solution of the continuous problem yields a partition. Indeed, we will see in the following that for symmetric balancing functions one can, under these conditions, show that the solution is always strongly degenerated and rounding does not yield a partition (see Theorem 2). Thus we observe that the index set $I$ controls the degree to which the partition constraint is enforced. The idea behind our suggested relaxation is that it is well known in image processing that minimizing the total variation yields piecewise constant solutions (in fact this follows from seeing the total variation as the Lovász extension of the cut). Thus if $|I|$ is sufficiently large, the vertices where the values are fixed to $0$ or $1$ propagate this to their neighboring vertices and finally to the whole graph. We discuss the choice of $I$ in more detail in Section 4.
Simplex constraints alone are not sufficient to yield a partition:
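Our approach has been inspired by  who proposed the following continuous relaxation for the Asymmetric Ratio Cheeger Cut
$$\min_{F\in\mathbb{R}_+^{n\times k},\ F_{(i)}\in\Delta_k,\ i=1,\ldots,n}\ \sum_{l=1}^k \frac{TV(F_l)}{\bar{S}(F_l)} \tag{5}$$
where $\bar{S}(f)=\|f-\mathrm{quant}_{k-1}(f)\|_{1,k-1}$ is the Lovász extension of $\hat{S}(C)=\min\{(k-1)|C|,|\overline{C}|\}$ and $\mathrm{quant}_{k-1}(f)$ is the $(k-1)$-quantile of $f\in\mathbb{R}^n$. Note that in their approach no membership constraints and size constraints are present.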
We now show that the usage of simplex constraints in the optimization problem (3) is not sufficient to guarantee that the solution can be rounded to a partition for any symmetric balancing function in (1). For asymmetric balancing functions as employed for the Asymmetric Ratio Cheeger Cut by  in their relaxation (5) we can prove such a strong result only in the case where the graph is disconnected. However, note that if the number of components of the graph is less than the number of desired clusters $k$, the multi-cut problem is still non-trivial.
Let $\hat{S}(C)$ be any non-negative symmetric balancing function. Then the continuous relaxation
$$\min_{F=(F_1,\ldots,F_k),\ F\in\mathbb{R}_+^{n\times k}}\ \sum_{l=1}^k \frac{TV(F_l)}{S(F_l)}\quad \text{subject to: } F_{(i)}\in\Delta_k,\ i=1,\ldots,n \tag{6}$$
of the balanced $k$-cut problem (1) is void in the sense that the optimal solution $F^*$ of the continuous problem can be constructed from the optimal solution of the $2$-cut problem and $F^*$ cannot be rounded into a $k$-way partition, see (4). If the graph is disconnected, then the same holds also for any non-negative asymmetric balancing function.
Proof: First, we derive a lower bound on the optimum of the continuous relaxation (6). Then we construct a feasible point for (6) that achieves this lower bound but cannot yield a partitioning thus finishing the proof.
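Let $(C^*,\overline{C^*})$ be an optimal $2$-way partition for the given graph. Using the exact relaxation result for the balanced $2$-cut problem in Theorem 3.1. in , we have
$$\min_{F:\,F_{(i)}\in\Delta_k}\ \sum_{l=1}^k \frac{TV(F_l)}{S(F_l)}\ \geq\ \sum_{l=1}^k \min_{f\in\mathbb{R}^n}\frac{TV(f)}{S(f)}\ =\ k\min_{C\subsetneq V,\,C\neq\emptyset}\frac{\mathrm{cut}(C,\overline{C})}{\hat{S}(C)}\ =\ k\,\frac{\mathrm{cut}(C^*,\overline{C^*})}{\hat{S}(C^*)}.$$
Now define $F_1=\mathbf{1}_{C^*}$ and $F_l=\alpha_l\mathbf{1}_{\overline{C^*}}$, $l=2,\ldots,k$, such that $\sum_{l=2}^k\alpha_l=1$, $\alpha_l>0$. Clearly $F=(F_1,\ldots,F_k)$ is feasible for the problem (6) and the corresponding objective value is
$$\frac{TV(\mathbf{1}_{C^*})}{S(\mathbf{1}_{C^*})}+\sum_{l=2}^k \frac{\alpha_l\,TV(\mathbf{1}_{\overline{C^*}})}{\alpha_l\,S(\mathbf{1}_{\overline{C^*}})}\ =\ \sum_{l=1}^k \frac{\mathrm{cut}(C^*,\overline{C^*})}{\hat{S}(C^*)},$$
where we used the $1$-homogeneity of $TV$ and $S$ and the symmetry of $\mathrm{cut}$ and $\hat{S}$.
Thus the solution $F$ constructed as above from the $2$-cut problem is indeed optimal for the continuous relaxation (6) and it is not possible to obtain a $k$-way partition from this solution as there will be sets that are empty.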
Finally, the argument can be extended to asymmetric set functions if there exists a set $C$ such that $\mathrm{cut}(C,\overline{C})=0$, as in this case it does not matter that $\hat{S}(C)\neq\hat{S}(\overline{C})$ in order that the argument holds.
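The construction used in this proof can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the Ratio Cheeger balancing function $\min\{|C|,|\overline{C}|\}$, a hypothetical toy graph (two triangles joined by a weak edge), $k=3$, and hand-picked weights $0.7$ and $0.3$ for the scaled copies of the second indicator. All helper names are invented for this example.

```python
def total_variation(W, f):
    """TV(f) = 1/2 * sum_{i,j} w_ij * |f_i - f_j|."""
    n = len(W)
    return 0.5 * sum(W[i][j] * abs(f[i] - f[j]) for i in range(n) for j in range(n))


def lovasz_extension(set_fn, f):
    """Lovasz extension of set_fn evaluated at f (sorting plus telescoping sum)."""
    order = sorted(range(len(f)), key=lambda j: f[j])
    value, prev = 0.0, set_fn(frozenset(order))
    for pos, j in enumerate(order):
        cur = set_fn(frozenset(order[pos + 1:]))
        value += f[j] * (prev - cur)
        prev = cur
    return value


def round_rows(F):
    """Assign every row to the first column attaining its maximum (no ties occur below)."""
    sets = [set() for _ in F[0]]
    for vertex, row in enumerate(F):
        sets[max(range(len(row)), key=lambda l: row[l])].add(vertex)
    return sets


n, k = 6, 3
W = [[0.0] * n for _ in range(n)]
for a, b, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]:
    W[a][b] = W[b][a] = w


def ratio_cheeger(C):
    return min(len(C), n - len(C))


C_star = {0, 1, 2}                   # one side of the 2-way partition along the weak edge
alphas = [0.7, 0.3]                  # positive weights summing to one

# First column: indicator of C_star; remaining columns: scaled indicators of its complement.
F = [[1.0, 0.0, 0.0] if i in C_star else [0.0] + alphas for i in range(n)]
columns = [[F[i][l] for i in range(n)] for l in range(k)]

objective = sum(total_variation(W, col) / lovasz_extension(ratio_cheeger, col)
                for col in columns)
two_cut_ratio = 0.1 / ratio_cheeger(C_star)   # cut(C_star, complement) / S(C_star)
rounded = round_rows(F)
```

Every row of `F` lies in the simplex, the objective equals $k$ times the 2-cut ratio, and rounding leaves the third set empty, which is the degenerate behaviour described in the proof.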
The proof of Theorem 2 shows additionally that for any balancing function, if the graph is disconnected, the solution of the continuous relaxation (6) is always zero, while clearly the solution of the balanced $k$-cut problem need not be zero. This shows that the relaxation can be arbitrarily bad in this case. In fact the relaxation for the asymmetric case can even fail if the graph is not disconnected but there exists a cut of the graph which is very small, as the following corollary indicates.
Let be an asymmetric balancing function and and suppose that Then there exists a feasible with and such that for (6) which has objective and which cannot be rounded to a -way partition.
Proof: Let and such that . Clearly is feasible for the problem (6) and the corresponding objective value is
where we used the $1$-homogeneity of $TV$ and $S$ and the symmetry of $\mathrm{cut}$. This cannot be rounded into a $k$-way partition as there will be sets that are empty.
Theorem 2 shows that the membership and size constraints which we have introduced in our relaxation (3) are essential to obtain a partition for symmetric balancing functions. For the asymmetric balancing function failure of the relaxation (6) and thus also of the relaxation (5) of  is only guaranteed for disconnected graphs. However, Corollary 1 indicates that degenerated solutions should also be a problem when the graph is still connected but there exists a dominating cut. We illustrate this with a toy example in Figure 1 where the algorithm of  for solving (5) fails as it converges exactly to the solution predicted by Corollary 1 and thus only produces a -partition instead of the desired -partition. The algorithm for our relaxation enforcing membership constraints converges to a continuous solution which is in fact a partition matrix so that no rounding is necessary.
4 Monotonic Descent Method for Minimization of a Sum of Ratios
Apart from the new relaxation another key contribution of this paper is the derivation of an algorithm which yields a sequence of feasible points for the difficult non-convex problem (3) and reduces monotonically the corresponding objective. We would like to note that the algorithm proposed by  for (5) does not yield monotonic descent. In fact it is unclear what the derived guarantee for the algorithm in  implies for the generated sequence. Moreover, our algorithm works for any non-negative submodular balancing function.
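The key insight in order to derive a monotonic descent method for solving the sum-of-ratio minimization problem (3) is to eliminate the ratio by introducing a new set of variables $\beta=(\beta_1,\ldots,\beta_k)$.
$$\begin{aligned}
\min_{F\in\mathbb{R}_+^{n\times k},\ \beta\in\mathbb{R}_+^{k}}\quad & \sum_{l=1}^k \beta_l\\
\text{subject to:}\quad & TV(F_l)\leq\beta_l\,S(F_l),\quad l=1,\ldots,k, && \text{(descent constraints)}\\
& F_{(i)}\in\Delta_k,\quad i=1,\ldots,n, && \text{(simplex constraints)}\\
& \max\{F_{(i)}\}=1,\quad \forall i\in I, && \text{(membership constraints)}\\
& S(F_l)\geq m,\quad l=1,\ldots,k. && \text{(size constraints)}
\end{aligned} \tag{7}$$
Note that for the optimal solution $(F^*,\beta^*)$ of this problem it holds $TV(F_l^*)=\beta_l^*\,S(F_l^*)$, $l=1,\ldots,k$ (otherwise one can decrease $\beta_l^*$ and hence the objective) and thus equivalence holds. This is still a non-convex problem as the descent, membership and size constraints are non-convex. Our algorithm now proceeds in a sequential manner. At each iteration we construct a convex inner approximation of the constraint set, that is, the convex approximation is a subset of the non-convex constraint set, based on the current iterate $(F^t,\beta^t)$. Then we solve the resulting convex optimization problem and repeat the process. In this way we get a sequence of feasible points for the original problem (7) for which we will prove monotonic descent in the sum-of-ratios.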
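Convex approximation: As $\hat{S}$ is submodular, $S$ is convex. Let $s_l^t\in\partial S(F_l^t)$ be an element of the sub-differential of $S$ at the current iterate $F_l^t$. We have by Prop. 3.2 in , $(s_l^t)_{\pi(i)}=\hat{S}(C_{i-1})-\hat{S}(C_i)$, where $\pi(i)$ is the index of the $i$-th smallest component of $F_l^t$, $C_i=\{j\in V \mid (F_l^t)_j>(F_l^t)_{\pi(i)}\}$ and $C_0=V$. Moreover, using the definition of the subgradient and the $1$-homogeneity of $S$, we have $S(F_l)\geq S(F_l^t)+\langle s_l^t,F_l-F_l^t\rangle=\langle s_l^t,F_l\rangle$.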
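The subgradient described above can be computed with one sort. The following sketch assumes the Ratio Cheeger balancing function $\min\{|C|,|\overline{C}|\}$ as the submodular set function and uses invented helper names. It checks two properties the convex approximation relies on: $S(f)=\langle s,f\rangle$ at the point where the subgradient is taken, and $S(g)\geq\langle s,g\rangle$ at other points.

```python
def lovasz_subgradient(set_fn, f):
    """One element of the subdifferential of the Lovasz extension of set_fn at f.

    The entry of the i-th smallest component of f receives
    set_fn(C_{i-1}) - set_fn(C_i), with C_0 = V and C_i the indices ranked after i.
    """
    order = sorted(range(len(f)), key=lambda j: f[j])
    s = [0.0] * len(f)
    prev = set_fn(frozenset(order))
    for pos, j in enumerate(order):
        cur = set_fn(frozenset(order[pos + 1:]))
        s[j] = prev - cur
        prev = cur
    return s


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lovasz_extension(set_fn, f):
    """S(f) equals the inner product of f with the subgradient computed at f."""
    return dot(lovasz_subgradient(set_fn, f), f)


n = 5


def ratio_cheeger(C):
    return min(len(C), n - len(C))


f = [0.1, 0.8, 0.3, 0.5, 0.0]                 # point at which the subgradient is taken
s = lovasz_subgradient(ratio_cheeger, f)

test_points = [[0.9, 0.2, 0.4, 0.1, 0.6],
               [0.0, 0.0, 1.0, 1.0, 0.0],
               [0.5, 0.5, 0.5, 0.5, 0.5]]
# Gap S(g) - <s, g>; non-negative for a submodular set function.
gaps = [lovasz_extension(ratio_cheeger, g) - dot(s, g) for g in test_points]
```

Because the linear function $\langle s,\cdot\rangle$ never exceeds $S$, replacing $S(F_l)$ by $\langle s_l^t,F_l\rangle$ in a constraint of the form $S(F_l)\geq m$ can only shrink the feasible set, which is what makes it an inner approximation.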
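For the descent constraints, let $\lambda_l^t=\frac{TV(F_l^t)}{S(F_l^t)}$ and introduce new variables $\delta_l=\beta_l-\lambda_l^t$ that capture the amount of change in each ratio. We further decompose $\delta_l$ as $\delta_l=\delta_l^+-\delta_l^-$, $\delta_l^+\geq0$, $\delta_l^-\geq0$. Let $M=\max_{f\in[0,1]^n}S(f)=\max_{C\subset V}\hat{S}(C)$, then for $S(F_l)\geq m$,
$$TV(F_l)-\beta_l\,S(F_l)\ \leq\ TV(F_l)-\lambda_l^t\langle s_l^t,F_l\rangle-\delta_l^+\,m+\delta_l^-\,M,$$
which follows from $S(F_l)\geq\langle s_l^t,F_l\rangle$ for a subgradient $s_l^t\in\partial S(F_l^t)$, together with $S(F_l)\geq m$ and $S(F_l)\leq M$. Requiring the right-hand side to be non-positive therefore yields a convex constraint whose feasible set is contained in that of the descent constraint.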
Finally, note that because of the simplex constraints, the membership constraints can be rewritten as $\max\{F_{(i)}\}\geq1$. Let $i\in I$ and define $j_i:=\mathop{\mathrm{arg\,max}}_j F^t_{ij}$ (ties are broken randomly). Since $\max\{F_{(i)}\}\geq F_{ij_i}$, the membership constraints can be replaced by the stronger convex constraint $F_{ij_i}\geq1$. As $F_{ij}\leq1$ we get $F_{ij_i}=1$. Thus the convex approximation of the membership constraints fixes the assignment of the $i$-th point to a cluster and thus can be interpreted as a “label constraint”. However, unlike the transductive setting, the labels for the vertices in $I$ are automatically chosen by our method. The actual choice of the set $I$ will be discussed in Section 4.1. We refer to the pairs $(i,j_i)$, $i\in I$, as the label set generated from $I$ (note that the label set is fixed once $I$ is fixed).
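Collecting these convex approximations, with $\lambda_l^t=TV(F_l^t)/S(F_l^t)$, $s_l^t\in\partial S(F_l^t)$, $M=\max_{C\subset V}\hat{S}(C)$ and $j_i=\mathop{\mathrm{arg\,max}}_j F^t_{ij}$, the convex inner problem at iteration $t$ reads
$$\begin{aligned}
\min_{F\in\mathbb{R}_+^{n\times k},\ \delta^+,\delta^-\in\mathbb{R}_+^{k}}\quad & \sum_{l=1}^k \big(\delta_l^+-\delta_l^-\big)\\
\text{subject to:}\quad & TV(F_l)\leq\lambda_l^t\langle s_l^t,F_l\rangle+\delta_l^+\,m-\delta_l^-\,M,\quad l=1,\ldots,k, && \text{(descent constraints)}\\
& F_{(i)}\in\Delta_k,\quad i=1,\ldots,n, && \text{(simplex constraints)}\\
& F_{ij_i}=1,\quad \forall i\in I, && \text{(membership constraints)}\\
& \langle s_l^t,F_l\rangle\geq m,\quad l=1,\ldots,k. && \text{(size constraints)}
\end{aligned} \tag{8}$$
As its solution $F^{t+1}$ is feasible for (3), we update $\lambda_l^{t+1}$ and $s_l^{t+1}$ and repeat the process until the sequence terminates, that is, no further descent is possible as the following theorem states, or the relative descent in $\sum_{l=1}^k\lambda_l^t$ is smaller than a predefined $\epsilon$. The following Theorem 3 shows the monotonic descent property of our algorithm.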
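The sequence $\{F^t\}$ produced by the above algorithm satisfies $\sum_{l=1}^k\frac{TV(F_l^{t+1})}{S(F_l^{t+1})}<\sum_{l=1}^k\frac{TV(F_l^{t})}{S(F_l^{t})}$ for all $t\geq0$ or the algorithm terminates.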
Proof: Let be the optimal solution of the inner problem (8). By the feasibility of and ,
Summing over all ratios, we have
Noting that $(F^t,\delta^+=0,\delta^-=0)$ is feasible for (8), the optimal value has to be either strictly negative, in which case we have strict descent, or the previous iterate $F^t$ together with $\delta^+=\delta^-=0$ is already optimal and hence the algorithm terminates.
The inner problem (8) is convex, but contains the non-smooth term $TV$ in the constraints. We eliminate the non-smoothness by introducing additional variables and derive an equivalent linear programming (LP) formulation. We solve this LP via the PDHG algorithm [15, 16]. The LP and the exact iterates can be found in the supplementary material.
The convex inner problem (8) is equivalent to the following linear optimization problem where $E$ is the set of edges of the graph and $w\in\mathbb{R}^{|E|}$ are the edge weights.
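We define new variables $\alpha_l\in\mathbb{R}^{|E|}$ for each column $l$ and introduce constraints $(\alpha_l)_{ij}=|(F_l)_i-(F_l)_j|$, which allows us to rewrite $TV(F_l)$ as $\langle w,\alpha_l\rangle$. These equality constraints can be replaced by the inequality constraints $(\alpha_l)_{ij}\geq|(F_l)_i-(F_l)_j|$ without changing the optimality of the problem, because at the optimum these constraints are active. Otherwise one can decrease $(\alpha_l)_{ij}$ while still being feasible since $w$ is non-negative. Finally, these inequality constraints are rewritten using the fact that $|x|\leq y\Leftrightarrow -y\leq x\leq y$, for $y\geq0$.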
4.0.1 Solving LP via PDHG
Recently, first-order primal-dual hybrid gradient descent (PDHG for short) methods have been proposed [17, 15] to efficiently solve a class of convex optimization problems that can be rewritten as the following saddle-point problem
$$\min_{x\in X}\ \max_{y\in Y}\ \langle Kx,y\rangle+G(x)-\Phi^*(y),$$
where $X$ and $Y$ are finite-dimensional vector spaces, $K:X\to Y$ is a linear operator and $G$ and $\Phi^*$ are convex functions. It has been shown that the PDHG algorithm achieves good performance in solving huge linear programming problems that appear in computer vision applications. We now show how the linear programming problem
can be rewritten as a saddle-point problem so that PDHG can be applied.
By introducing the Lagrange multipliers , the optimal value of the LP can be written as
where the indicator function takes the value $0$ on the non-negative orthant and $\infty$ elsewhere.
Define and . Then the saddle point problem corresponding to the LP is given by
The primal and dual iterates for this saddle-point problem can be obtained as
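For orientation, the sketch below shows generic PDHG iterates for a small LP. It is not the solver from the supplementary material: it assumes an LP of the form $\min_{x\geq0}\langle c,x\rangle$ subject to $A_1x\leq b_1$, $A_2x=b_2$, uses the saddle-point function $\langle c,x\rangle+\langle y,Kx-b\rangle$ with $K$ obtained by stacking $A_1$ and $A_2$, and picks equal primal and dual step sizes from the Frobenius norm of $K$. The function name `pdhg_lp`, the step-size rule and the toy problem are all assumptions made for illustration.

```python
import math


def pdhg_lp(c, A_ineq, b_ineq, A_eq, b_eq, iters=5000):
    """PDHG sketch for: min <c, x>  s.t.  A_ineq x <= b_ineq, A_eq x = b_eq, x >= 0.

    Matrices are lists of rows. Returns the last primal and dual iterates.
    """
    K = A_ineq + A_eq                       # stacked constraint operator
    b = b_ineq + b_eq
    n, m, m_ineq = len(c), len(K), len(A_ineq)

    # The Frobenius norm upper-bounds the operator norm, so tau * sigma * ||K||^2 < 1.
    frob = math.sqrt(sum(v * v for row in K for v in row))
    tau = sigma = 0.9 / frob

    x = [0.0] * n
    x_bar = x[:]
    y = [0.0] * m
    for _ in range(iters):
        # Dual ascent step on the constraint residual K x_bar - b.
        y = [y[i] + sigma * (sum(K[i][j] * x_bar[j] for j in range(n)) - b[i])
             for i in range(m)]
        # Multipliers of inequality constraints are projected onto y >= 0.
        for i in range(m_ineq):
            y[i] = max(0.0, y[i])
        # Primal descent step followed by projection onto x >= 0.
        grad = [c[j] + sum(K[i][j] * y[i] for i in range(m)) for j in range(n)]
        x_new = [max(0.0, x[j] - tau * grad[j]) for j in range(n)]
        # Extrapolation step.
        x_bar = [2.0 * x_new[j] - x[j] for j in range(n)]
        x = x_new
    return x, y


# Toy LP: min x1 + 2*x2  s.t.  x1 <= 0.6,  x1 + x2 = 1,  x >= 0.
x_opt, y_opt = pdhg_lp(c=[1.0, 2.0],
                       A_ineq=[[1.0, 0.0]], b_ineq=[0.6],
                       A_eq=[[1.0, 1.0]], b_eq=[1.0])
objective = x_opt[0] + 2.0 * x_opt[1]
```

For this toy problem the minimizer is $x=(0.6,0.4)$ with objective value $1.4$. Each iteration needs only matrix-vector products with $K$ and $K^T$ and two componentwise projections, which is what makes the scheme attractive for large sparse LPs.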