Consider an undirected graph with the set of vertices and the set of edges. Each edge is associated with a weight that indicates the similarity between its two end vertices–the lager the weight, the more “similar” the two vertices are. The hard graph clustering (HGC) problem is to create distinct partitions (clusters, or, communities) of the set of vertices according to their similarities, i.e., to form , where , and for all , . For a thorough literature review of the graph clustering problems, see, e.g., [Schaeffer2007], and for fast algorithms for large-scale networks, see, e.g., [Girvan and Newman2002, Clauset et al.2004, Rosvall and Bergstrom2008]. For datasets, see, e.g., SNAP network datasets [sna] and SNAP biomedical datasets [Marinka Zitnik and Leskovec2018].
The soft graph clustering (SGC) problem, (also known as fuzzy graph clustering), on the other hand, allows clusters to have overlaps. A vertex may be a member of more than one cluster. There are numerous applications of SGC, such as: brain research, social network research, natural language processing, citation, and collaboration networks, and so on. A precise problem definition of the SGC varies and is dependent on the application, and sometimes it may not be possible to provide a precise problem definition.
The subject of study in this paper considers the combinatorial optimisation problem where we are required to determine: 1) the composition of each of the clusters; 2) for each vertex that belongs to more than one cluster, how the membership is distributed amongst the clusters (we denote this by , for Vertex in Cluster , hence for all ). We consider the case that an equal balance of the cluster total vertex memberships is desirable, and that not all vertices are required to be in a cluster. We consider two equally important objectives: 1) to minimize the sum of inter-cluster edge weights (cut across clusters); and 2) to maximize the sum of intra-cluster edge weights (cluster association).
1.1 Literature review
There are a number of existing soft clustering algorithms, each designed to suit different applications see, e.g., CFinder of [Palla et al.2005], (see also [Derényi et al.2005], [Adamcsek et al.2006]), the MaxMax Algorithm of [Hope and Keller2013], the WATSET methods of [Ustalov et al.2018] for NLP, the Chinese Whisper method of [Biemann2006], Betweenness-based method of [Pinney and Westhead2006], and the Purifying and Filtering the Coupling Matrix approach of [Liu and Foroushani2016]. Of these methods, [Biemann2006], [Pinney and Westhead2006], [Liu and Foroushani2016], and [Ustalov et al.2018] are designed for unweighted graphs only (i.e., graphs with unit edge weight).The MaxMax Algorithm is designed for weighted undirected graphs. For unweighted graphs, however, it will return a trivial solution–each connected component of the graph will be a cluster. The CFinder is based on the finding of -clique neighbourhoods. The mixed-integer linear programming method we propose in this paper is able to accomodate both weighted and unweighted graphs, with a small modification required for the latter. In the preliminary experiments section, we will compare and contrast the different methods.
We are not aware of any mixed-integer linear programming (MILP) models for the SGC problems. There are, however, MILP models for other graph clustering, machine learning, and data classification problems. The article[Bertsimas and Shioda2007] presents MILP formulations for classification and regression. The idea for classification, e.g., is to partition Class 1 points into
disjoint subsets by finding the hyperplanes that describe the partitioning polyhedrons such that no Class 0 points can be expressed as a convex combination of the Class 1 points in each partition. In general clustering problems,[Sağlam et al.2006] proposes a MILP formulation where one wishes to partition a set of data set into
(a predetermined number of) clusters. The objective is to minimize the maximum diameter of the generated clusters in order to obtain evenly compact clusters. Essentially the method is an IP-based heuristic method, with some variables fixed by the solution of maximal independent set of size, where each member of this set is a seed member in the clusters. The IP model (which is in fact a bilinear model, but linearized using standard linearization strategies) is then solved to obtain an optimal solution to the general clustering problem. Other MILP-based work can be found in, e.g., [Gilpin et al.2013] and [Ye2007]
for hierarchical clustering. The latter presents an application in recommendation systems. In Clique Covering Problem (CCP), an NP-hard combinatorial optimisation problem where an undirected graph is to be partitioned to form complete subgraphs,[Miyauchi et al.2018] proposes a compact ILP formulation for a relaxed problem, as well as a post-optimization repair procedure and a proof of optimality for the final solution to the original problem.
1.2 Contribution of the paper
As far as we aware, this paper is the first to propose a methodology for SGC that i) deals with undirected graphs with general integer edge weights and truely takes the values of into the optimization process in the way that the larger the value is, the more favourably it will be considered, and at the same time, also deals with unweighted graphs (i.e., graphs with unit edge weight); ii) simultaneously allocates membership proportion (the values) for vertices that lie in multiple clusters; and iii) it enforces an equal balance of the clusters that the sum of vertex memberships over all clusters are roughly the same. We propose an approach that is based on a polynomial-size MILP model, beginning with a small value for –we enforce that the graph has at least clusters. One can apply an adaptive approach to find the best value of iteratively, but the focus of this research is to solve an instantaneous SGC problem with a given .
The method of [Palla et al.2005] requires obtaining , and the graph is subsequently clustered by finding -clique neighbourhoods on an unweighted graph . Our method does not require the finding of -cliques, and it takes the values of into account during optimization. Comparing with the method of [Hope and Keller2013] (MaxMax), for unweighted graphs, if the graph is connected, then MaxMax will produce only one cluster which is the entire graph. Our method, however, can deal with unweighted graphs by converting them into weighted ones via a simple transformation.
In Section 2.1, we present our basic MILP model by considering a number of standard requirements for the SGCP. In Section 2.2, we discuss our strategies for graph connectivity. In Section 2.3, we discuss two objectives: 1) minimizing the total inter-cluster cut, and 2) maximize the total intra-cluster association. In Section 3, we present preliminary numerical results. We then conclude our findings and discuss future research directions in Section 4.
2 A mixed-integer linear programming formulation
2.1 The basic model
We first introduce the notation used in this paper. Let:
be the adjacency matrix of ;
the maximum edge weight;
be the number of clusters;
be the set of clusters;
be a binary decision variable with indicating Vertex is a member of Cluster ;
be a continuous decision variable indicating the membership of Vertex in Cluster ;
be the cut between clusters and ;
a predetermined minimum membership if a vertex is a member of a cluster;
a predetermined tolerance equal balance of cluster membership;
a predetermined maximum overlap factor.
Now we introduce the constraints. First, we have a set of Membership Constraints. The membership of Vertex in Cluster can only be non-zero if it is a member of , and when it is, the membership must be no less than a predetermined value.
Let , for each
, be an auxiliary binary variable withindicating is in at least one of the clusters and otherwise. We require that the sum of memberships for any vertex over all clusters is exactly 1 if the vertex is a member of at least one cluster and 0 otherwise.
We consider an Equal Balance Requirement where the sum of memberships in all clusters are “roughly” the same. As far as we aware, this is the first method that considers such a requirement. This requirement can be modelled as below.
Next, we consider the Overlap Cardinality Constraints. Let be auxiliary binary decision variable such that if and only if is a member of both clusters , , i.e., . We have that:
Given any pairs of clusters , with , the number of vertices that are in both clusters cannot be larger than a predetermined fraction, , of the cardinality of either of the clusters.
To calculate the Inter-cluster Cuts between and , we require auxiliary binary variables and nonlinear terms. First, let be a binary variable such that if both of are in the intersection of clusters and .
We then use a binary variable to indicate the existence of an edge (cut) (i.e., ) with in and in , but not both in the intersection of and (otherwise the “cut” should not be counted). The constraints are as below. For each pair of distinct vertices and each pair of distinct clusters , we have that:
Now, the cut between two distinct vertices across two distinct clusters , if edge exists, is defined by: , for
The terms and are bilinear, and can be linearized by introducing auxiliary non-negative continuous variables and and the following constraints. For each , we have that:
The cut is calculated by the following linear term.
Now, we consider the Intra-cluster Association calculations. Let be an auxiliary binary variable with if are both in and that the edge exists in . The constraints are give by:
The intra-cluster association of is given by
We linearize the association using auxiliary continuous variables and to capture the memberships of two vertices in the same cluster should an edge exists between them. I.e., when , , otherwise, . (Similarly for ). The constraints are as below.
The association within a cluster , denoted by , is calculated as follows:
2.2 Cluster connectivity
Clusters are required to be connected, it is likely that this can only be achieved with exponentially many variables and solved using a branch-and-price approach, or with exponentially many constraints and solved using a branch-and-cut approach. Our approach here is not an exact one, as we wish to keep the formulation compact (i.e., polynomial in size). We derive a number of constraints that capture a few conditions for graph connectivity that are necessary but not sufficient. When there is a connectivity violation, violation elimination constraints can be added in a lazy fashion. (Lazy constraints is a technical term in integer programming–hard (and often exponentially many) constraints are relaxed, and are only added when the current integer optimal solution to the relaxed problem violates them–usually only a very small number of them are violated–the problem is then re-optimized, and the procedure recurs until there is no more violated hard constraints). In our preliminary test where we used randomly generated undirected graphs (see Preliminary Numerical Results section), most of the problem instances produced connected clusters.
First of all, the number of edges in a connected undirected graph cannot be smaller than the cardinality of the graph minus one. We define to be a span variable that can only be one when and are both in Cluster and that the edge exists in . We have that:
We also require that all vertices in each cluster must be connected to at least one other vertex in the same cluster.
However, Constraints (39)–(43) are not enough to eliminate multiple loops within a cluster. A cluster may contain vertices , but the constraints cannot eliminate the formation of subgraphs and within the cluster.
Therefore, we borrowed the time constraints idea for Asymmetric Travelling Salesman Problem (ATSP) subtour elimination ([Miller et al.1960]). Let , , be a decision variable indicating the “time of arrival” at Vertex . Suppose is a strict subset of a cluster , to prevent the relevant span variables from forming a loop, we require that if and only if , for each pair of distinct , and each . We have that:
E.g., consider the vertices a strict subset of , a loop with edges a violation of the time constraints, as cannot be equal to and simultaneously. Constraints (44) and (45) can eliminate certain types of loops formed by the span variables, but not all. In any case, adding them will give the span variables a better chance at making a full span in a cluster and thus the connectivity.
Notice that the time constraints do not cut off feasible solutions due to the fact that even if a variable is forced to be 0, there is no impact on the values of and , as (39)–(41) do not induce a bi-conditional relation. In our preliminary numerical experiments, the time constraints are expensive to implement, and often we do not have disconnected clusters with just (39)–(43) alone.
When cluster size is a hard constraint, and there exist a disconnected cluster, in the case of maximizing total association, e.g., one can consider adding the following cut in a lazy fashion and re-optimize. Let be the set of -variables with a value of 1 in the optimal solution, we add:
2.3 The objective function
Notice that it is necessary to enforce a minimum cluster size, otherwise a trivial optimal solution to (47) is to assign no vertices to any clusters. Thus we have:
for a predetermined value.
The two objectives should be considered simultaneously. Some may consider minimizing the sum of ratio of inter-cluster cuts and intra-cluster associations over all pairs of distinct clusters.
Diagram (1) below shows how the sum of ratios of inter-cluster cuts and intra-cluster associations changes. The first data point on the left is obtained when we optimize (47) and set to be the value of in the optimal solution. We can see that the total intra-cluster association is also small, and so is the value of (50). The last data point on the right is obtained when we optimize (48) and we set to be the optimal objective value. We then minimize (47) with , for , .
One can, however, consider recent advances in Bi-objective Integer Programming (see, e.g., [Dai and Charkhgard2018]). In fact, different applications of the SGC may have a different objective function that describes the SGC better. Besides, the structure of the graph must be taken into consideration in determining what the most appropriate objective function is. One may therefore consider applying machine learning to automate the finding of an appropriate objective function.
2.4 Edge weight transformation for unweighted graphs
For undirected graphs with , for all , the MILP does not work very well because it cannot distinguish between an edge in a sparse neighbourhood with one in a dense neighbourhood. To give favour to edges in a dense neighbourhood, we obtain new edge weights by calculating , i.e., the new edge weight of will be one plus the number of vertices that are connected to both of the end nodes of .
3 Preliminary numerical results
We compared our method with CFinder [Palla et al.2005, Derényi et al.2005, Adamcsek et al.2006] and MaxMax Algorithm [Hope and Keller2013]. We used a 21-vertex instance of weighted graph. We can see that the clusters of CFinder are the -clique neighbourhoods, and therefore some of the edges with heavy weight are not included, e.g., , , .
The MaxMax solution, however, have these heavy-weight edges covered. However, for unweighted graphs, MaxMax will cluster the graph by connected components, so if the graph is connected, then there will be only one cluster.
The MILP solution produces not only the clusters, but also the membership proportion for each vertex that belongs to more than one cluster. None of MaxMax or CFinder provides this information. One can see clearly that the Min Cut solution can also produce a relatively balanced set of clusters in terms of total weighted memberships of the clusters when the clusters are connected. None of the existing SGC methods considered such a constraint.
In the Max Association solution below, unfortunately, because the MILP enforced three clusters, and the blue cluster is not connected, we have four clusters instead, so the equal balance cannot be guarantee in this case. In any case, the four clusters demonstrated the four strongest connections by . As for the membership proportions, take Vertex 9 as an example, since there are more heavily weighted links in the pink cluster rather than the green cluster, 0.9 of its membership is in the former, and 0.1 in the latter.
3.1 KKI instances
In this section, we applied our method on some KKI instances from https://github.com/shiruipan/graph_datasets. Since the KKI instances are unweighted, we perform the edge weight transformation described in Section 2.4.
In Table 1, column Opt presents the time taken (in seconds) for proving optimality, and the column Gap presents the gap between the MIP relative gap, (i.e., the gap between the best integer objective and the objective of the best active node on the branch-and-bound tree), for instances that are not solved to optimality within the given time limit (10 minutes). The column r is the ratio of total inter-cluster cut over total intra-cluster association. We used IBM ILOG CPLEX Optimization Studio v12.8 [Support2018] for solving the MILPs.
|Min Cut||Max Association|
3.2 Instances from a random problem generator
We have generated 10 problem classes with 5 instances for each problem class. The problem instances were generated using a random problem generator where number of vertices on the undirected graph, (), density of graph , and an upper bound on the edge weight are taken as user input. The edge weights are generated by each given a number chosen uniformly randomly between . In Table 2, the names of the problem classes are in the form of: N followed by the value of , d followed by the density of , and then followed by the value of . The column CON
is the percentage of clusters that are connected in each problem class. In the columns where two numbers were given for each problem class, the first row presents the average value over 5 problem instances, and the second row presents the value of the standard deviation. The bracket underneath each problem class indicates the number of instances that are solved to optimality versus the number of instances that are not.
3.3 Preliminary testing
We experimented with instances with maximum edge weight and , clusters, and . From the results of Table 2, we can see that the computation time grows exponentially as the size of the problem grows, or the density of the graph grows. For problem instances that are not solved to optimality, the MIP gaps are large. However in all problem instances, feasible solutions are found reasonably quickly. We can see that the computation time for maximizing total intra-cluster association is substantially longer than minimizing total inter-cluster cuts. For problem instances that are not solved to optimality, however, the former has a smaller MIP gap whilst the latter can solve larger problem instances. With an objective of maximizing total association, we did not include Constraint (49) as the objective will drive as much edges used as possible. We cannot obtain a conclusive remark on the effects of in terms of computation time.
|Min Cut||Max Association|
|Min Cut||Max Association|
|Min Cut||Max Association|
|Min Cut||Max Association|
4 Conclusions and future research directions
We proposed and developed a polynomial-size MILP model for the SGCP. As far as we aware, this is the first approach that simultaneously allocates membership proportion (the values) for vertices that lie in multiple clusters, and that enforces an equal balance of the clusters so that the sum of vertex memberships over all clusters are roughly the same. Compared to CFinder ([Palla et al.2005], [Derényi et al.2005], [Adamcsek et al.2006]), the clusters found in our method are not limited to -clique neighbourhoods. Compared to MaxMax ([Hope and Keller2013]), our method can produce non-trivial clusters even for a connected unweighted graph. An obvious future research direction is to perform a thorough numerical experiment on all parameters used in the model. One can consider alternative formulations, e.g., a branch-and-cut approach, using constraints to cut off infeasible solutions (namely, solutions with disconnected clusters); or, a branch-and-price approach, using exponentially many variables each representing a feasible cluster (in which case connectivity is guaranteed). Even though the cut separation and column generation subproblems themselves are expected to be NP-hard, heuristic approaches can be derived for speedy execution. Further, one may consider an adaptive approach to find the optimal number of clusters, instead of iteratively optimising over different values of .
- [Adamcsek et al.2006] Balázs Adamcsek, Gergely Palla, Illés Farkas, Imre Derényi, and Tamás Vicsek. Cfinder: Locating cliques and overlapping modules in biological networks. Bioinformatics (Oxford, England), 22:1021–3, 05 2006.
- [Bertsimas and Shioda2007] Dimitris Bertsimas and Romy Shioda. Classification and regression via integer optimization. Oper. Res., 55(2):252–271, March 2007.
- [Biemann2006] Christian Biemann. Chinese whispers - an efficient graph clustering algorithm and its application to natural language processing problems. 2006.
- [Clauset et al.2004] Aaron Clauset, M. E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Phys. Rev. E, 70:066111, Dec 2004.
- [Dai and Charkhgard2018] Rui Dai and Hadi Charkhgard. A two-stage approach for bi-objective integer linear programming. Oper. Res. Lett., 46:81–87, 2018.
- [Derényi et al.2005] Imre Derényi, Gergely Palla, and Tamás Vicsek. Clique percolation in random networks. Phys. Rev. Lett., 94:160202, Apr 2005.
- [Gilpin et al.2013] Sean Gilpin, Siegried Nijssen, and Ian Davidson. Formalizing hierarchical clustering as integer linear programming, 2013.
- [Girvan and Newman2002] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.
- [Hope and Keller2013] David Hope and Bill Keller. Maxmax: A graph-based soft clustering algorithm applied to word sense induction. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, pages 368–381, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.
- [Liu and Foroushani2016] Ying Liu and Amir Foroushani. Pfc: An efficient soft graph clustering method for ppi networks based on purifying and filtering the coupling matrix. In De-Shuang Huang, Vitoantonio Bevilacqua, and Prashan Premaratne, editors, Intelligent Computing Theories and Application, pages 247–257, Cham, 2016. Springer International Publishing.
- [Marinka Zitnik and Leskovec2018] Sagar Maheshwari Marinka Zitnik, Rok Sosič and Jure Leskovec. BioSNAP Datasets: Stanford biomedical network dataset collection. http://snap.stanford.edu/biodata, August 2018.
- [Miller et al.1960] C. E. Miller, A. W. Tucker, and R. A. Zemlin. Integer programming formulation of traveling salesman problems. Journal of the ACM, 7(4):326–329, October 1960.
- [Miyauchi et al.2018] Atsushi Miyauchi, Tomohiro Sonobe, and Noriyoshi Sukegawa. Exact clustering via integer programming and maximum satisfiability, 2018.
- [Palla et al.2005] Gergely Palla, Imre Derényi, Illés Farkas, and Tamás Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435(7043):814–818, June 2005.
- [Pinney and Westhead2006] John W. Pinney and David R. Westhead. Betweenness-based decomposition methods for social and biological networks. 2006.
- [Rosvall and Bergstrom2008] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.
- [Sağlam et al.2006] Burcu Sağlam, F. Sibel Salman, Serpil Sayın, and Metin Türkay. A mixed-integer programming approach to the clustering problem with an application in customer segmentation. European Journal of Operational Research, 173(3):866 – 879, 2006.
- [Schaeffer2007] Satu Elisa Schaeffer. Graph clustering. Computer Science Review, 1(1):27 – 64, 2007.
- [Support2018] IBM Support. Cplex optimization studio v12.8. http://www-01.ibm.com/support/docview.wss?uid=swg27050618, 2018.
- [Ustalov et al.2018] Dmitry Ustalov, Alexander Panchenko, Chris Biemann, and Simone Paolo Ponzetto. Local-global graph clustering with applications in sense and frame induction. CoRR, abs/1808.06696, 2018.
- [Ye2007] M. Ye. An integer programming clustering approach with application to recommendation systems. Master’s thesis, Iowa State University, 2007.