# Cluster deletion revisited

In the Cluster Deletion problem the input is a graph $G$ and an integer $k$, and the goal is to decide whether there is a set of at most $k$ edges whose removal from $G$ results in a graph in which every connected component is a clique. In this paper we give an algorithm for Cluster Deletion whose running time is $O^*(1.404^k)$.

## 1 Introduction

A graph $G$ is called a cluster graph if every connected component of $G$ is a clique. In the Cluster Deletion problem the input is a graph $G$ and an integer $k$, and the goal is to decide whether there is a set of at most $k$ edges whose removal from $G$ results in a cluster graph.
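The definition of a cluster graph can be checked directly. The following is a minimal Python sketch, assuming the graph is stored as a dictionary `adj` that maps each vertex to the set of its neighbors; this representation and the name `is_cluster_graph` are illustrative choices and do not come from the paper.

```python
def is_cluster_graph(adj):
    """Return True if every connected component of the graph is a clique.

    adj: dict mapping each vertex to the set of its neighbors
    (undirected simple graph; an assumed representation).
    """
    seen = set()
    for start in adj:
        if start in seen:
            continue
        # Collect the connected component of `start` with a DFS.
        component = {start}
        stack = [start]
        while stack:
            v = stack.pop()
            for u in adj[v]:
                if u not in component:
                    component.add(u)
                    stack.append(u)
        seen |= component
        # In a clique, each vertex is adjacent to all other component vertices.
        if any(adj[v] != component - {v} for v in component):
            return False
    return True


# A triangle plus a disjoint edge is a cluster graph; a path on 3 vertices is not.
triangle_and_edge = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}
path3 = {1: {2}, 2: {1, 3}, 3: {2}}
```

Here `is_cluster_graph(triangle_and_edge)` holds while `is_cluster_graph(path3)` does not.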

A graph $G$ is a cluster graph if and only if there is no induced $P_3$ in $G$, where $P_3$ is a path on 3 vertices. Therefore, there is a simple $O^*(2^k)$-time branching algorithm for Cluster Deletion: If the current graph is not a cluster graph, find an induced $P_3$ in the graph and create two new instances by removing each edge of the path. Faster algorithms for Cluster Deletion were given in [4, 3, 2, 1]: the first in [4], followed by successive improvements in [3] and [2]. Finally, Böcker and Damaschke [1] gave an algorithm with a claimed running time of $O^*(1.415^k)$.
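The simple branching algorithm described above can be sketched as follows. This is a minimal illustration, assuming an adjacency-set dictionary as the graph representation; the function names `find_induced_p3`, `delete_edge` and `cluster_deletion` are hypothetical, and no attempt is made at the polynomial-factor optimizations a real implementation would use.

```python
def find_induced_p3(adj):
    """Return (u, v, w) with edges uv and vw but no edge uw, or None."""
    for v in adj:
        nbrs = list(adj[v])
        for i, u in enumerate(nbrs):
            for w in nbrs[i + 1:]:
                if w not in adj[u]:
                    return u, v, w
    return None


def delete_edge(adj, a, b):
    """Return a copy of the graph with the edge ab removed."""
    new_adj = {x: set(ns) for x, ns in adj.items()}
    new_adj[a].discard(b)
    new_adj[b].discard(a)
    return new_adj


def cluster_deletion(adj, k):
    """Decide whether at most k edge deletions yield a cluster graph."""
    p3 = find_induced_p3(adj)
    if p3 is None:
        return True   # no induced P3, so the graph is a cluster graph
    if k == 0:
        return False  # an induced P3 remains but the budget is exhausted
    u, v, w = p3
    # Any solution must delete uv or vw; branch on both options.
    return (cluster_deletion(delete_edge(adj, u, v), k - 1)
            or cluster_deletion(delete_edge(adj, v, w), k - 1))


cycle4 = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
```

Each call makes at most two recursive calls with the parameter decreased by one, which gives the $2^k$ bound on the size of the search tree. On `cycle4`, one deletion is not enough but two are.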

In this paper we show that there is an error in the analysis of the algorithm of Böcker and Damaschke. We give a corrected analysis that shows that the running time of the algorithm is as claimed in [1]. Additionally, we give an algorithm for Cluster Deletion whose running time is $O^*(1.404^k)$.

## 2 Preliminaries

For a set $S$ of vertices in a graph $G$, $G[S]$ is the subgraph of $G$ induced by $S$ (namely, $G[S] = (S, \{uv \in E(G) : u, v \in S\})$). For a set of edges $F$, $G - F$ is the graph obtained from $G$ by deleting the edges of $F$. For a vertex $v$, $N(v)$ is the set of vertices that are adjacent to $v$.

Let $P_3$ denote a path on 3 vertices, and $C_4$ denote a chordless cycle on 4 vertices.
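As an illustration of the second structure, the following Python sketch searches for an induced (chordless) cycle on 4 vertices. It assumes an adjacency-set dictionary as the graph representation, and the name `find_induced_c4` is hypothetical. The idea is that $a, b, c, d$ form a chordless cycle exactly when $a, c$ are non-adjacent, $b, d$ are non-adjacent, and both $b$ and $d$ are common neighbors of $a$ and $c$.

```python
from itertools import combinations


def find_induced_c4(adj):
    """Return (a, b, c, d) forming a chordless cycle a-b-c-d-a, or None."""
    for a, c in combinations(list(adj), 2):
        if c in adj[a]:
            continue  # a and c must be non-adjacent, otherwise ac is a chord
        common = list(adj[a] & adj[c])
        for b, d in combinations(common, 2):
            if d not in adj[b]:  # b and d must be non-adjacent as well
                return a, b, c, d
    return None


cycle4 = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
complete4 = {v: {u for u in range(1, 5) if u != v} for v in range(1, 5)}
```

The search is a brute-force scan over vertex pairs, which is enough for small examples such as `cycle4` (a cycle is found) and `complete4` (none exists).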

A graph $G$ is an $s$-almost clique if there is a set $X \subseteq V(G)$ of size at most $s$ such that $G[V(G) \setminus X]$ is a clique.

###### Lemma 1 (Damaschke [2]).

Let $s$ be a constant. There is a polynomial time algorithm that, given an $s$-almost clique $G$, finds a set of edges $F$ of minimum size such that $G - F$ is a cluster graph.

## 3 The algorithm of Böcker and Damaschke

In this section we describe the algorithm of Böcker and Damaschke [1] and give a corrected analysis of the algorithm. The algorithm is a branching algorithm. When we say that the algorithm branches on sets $F_1, \ldots, F_t$, we mean that the algorithm is called recursively on the instances $(G - F_1, k - |F_1|), \ldots, (G - F_t, k - |F_t|)$.

The algorithm uses the following branching rules.

(B1)  Suppose that is an induced path in with at least 7 vertices, and let be the edges of (in order). Branch on and .

We note that Rule (B1) is a simplified version of the corresponding rule given in [1]. To show the safeness of this rule, suppose that $(G,k)$ is a yes instance and let $F$ be a solution of $(G,k)$ (namely, $G - F$ is a cluster graph and $|F| \le k$). Since induces a $P_3$ in $G$, $F$ contains at least one edge from . Suppose that . The set induces a $P_3$ in , so either or . Therefore, the set is also a solution of $(G,k)$ (since the deletion of from does not generate new induced $P_3$'s). By repeating this process we obtain a solution such that . Similarly, if we can obtain a solution such that . Therefore, Rule (B1) is safe.

The branching vector of Rule (B1) is at least $(3,3)$, and the branching number is less than 1.26.
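Branching numbers such as the one above can be checked numerically. Under the standard definition, the branching number of a vector $(a_1, \ldots, a_t)$ is the unique root $x > 1$ of $\sum_{i} x^{-a_i} = 1$. The following Python sketch finds this root by bisection; the function name `branching_number` and the tolerance are illustrative assumptions, and this is not the script used in the paper.

```python
def branching_number(vector, tol=1e-12):
    """Root x > 1 of sum(x**(-a) for a in vector) == 1.

    Assumes `vector` has at least two entries, all positive, so that the
    function below is positive at x = 1 and strictly decreasing in x.
    """
    def f(x):
        return sum(x ** (-a) for a in vector) - 1.0

    lo, hi = 1.0, 2.0
    while f(hi) > 0:      # grow the bracket until the root is enclosed
        hi *= 2.0
    while hi - lo > tol:  # bisection on the decreasing function f
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return hi


b_33 = branching_number((3, 3))  # equals 2 ** (1 / 3), about 1.2599
b_22 = branching_number((2, 2))  # equals 2 ** 0.5, about 1.4142
```

For instance, a vector $(3,3)$, in which each of two branches decreases the parameter by three, has branching number $2^{1/3} < 1.26$, and a vector $(2,2)$ has branching number $\sqrt{2} < 1.415$.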

For an edge , let be a set containing all edges such that induces a .

(B2)  If there is an edge for which , branch on and .

The branching vector of Rule (B2) is at least and the branching number is less than 1.381.

(B3)  Let be an induced in . Branch on and .

The branching vector of Rule (B3) is $(2,2)$ and the branching number is less than 1.415.

The algorithm applies Rules (B1)–(B3) until none of these rules is applicable. If $G$ is a graph on which Rules (B1)–(B3) are not applicable and $G$ is not a cluster graph, the algorithm applies a new branching rule, denoted (B4), which is described below.

In Rule (B4), the algorithm finds an induced $P_3$. To describe the rule, we define the following sets. Let . Let be the set of all vertices not in with one or two neighbors in . Let (resp., ) be the set of all vertices such that is adjacent to exactly one vertex from (resp., from ). Clearly, . Let be the set of all vertices with three neighbors in . Let be the set of all vertices not in that have at least one neighbor in .

For , let be a set containing every vertex not in such that the minimum distance between and a vertex in is exactly . Additionally, denote and . For a vertex , let , , and denote the sets of neighbors of in , , and , respectively. Note that by definition, for every . For , let be the set of all edges with one endpoint in and the other endpoint in .
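Sets of vertices at a given minimum distance from a fixed vertex set can be computed with a multi-source breadth-first search. The following Python sketch assumes an adjacency-set dictionary as the graph representation; the name `distance_layers` and the choice of source set in the example are illustrative and are not taken from the paper.

```python
from collections import deque


def distance_layers(adj, sources):
    """Map each distance i >= 1 to the set of vertices whose minimum
    distance to a vertex of `sources` is exactly i."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:  # first visit gives the minimum distance
                dist[u] = dist[v] + 1
                queue.append(u)
    layers = {}
    for v, d in dist.items():
        if d > 0:              # skip the source vertices themselves
            layers.setdefault(d, set()).add(v)
    return layers


# Path a-b-c-d-e with the first three vertices as the source set.
path5 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
         "d": {"c", "e"}, "e": {"d"}}
layers = distance_layers(path5, ["a", "b", "c"])
```

In this example `layers` is `{1: {"d"}, 2: {"e"}}`. Vertices that are unreachable from the source set do not appear in any layer.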

The following lemmas are proved in .

###### Lemma 2.

If is a graph in which Rule (B2) is not applicable then and .

###### Lemma 3.

If is a graph in which Rule (B3) is not applicable then is a clique.

###### Lemma 4.

Let be a graph in which Rule (B2) is not applicable and is a clique. Then, . Additionally, for every , .

We now give several lemmas which will be used in the analysis of Rule (B4).

###### Lemma 5.

If is a graph in which Rule (B2) cannot be applied, for every .

###### Proof.

We first consider the case . Let and . The set contains the edge and the edge for every . Therefore, .

We now consider the case . By the definition of , there are vertices such that is adjacent to and , and is not adjacent to . Therefore, the set contains the edge and the edge for every . We obtain again that . ∎

By Lemma 5 we have that and .

Let be the minimum index such that there is an edge for which (we note that while does not have an edge with , we will later use this definition on graphs which can have such edges). If no such index exists, .

###### Lemma 6.

If , for every .

###### Proof.

Let and . As in the proof of Lemma 5, . Since the edge is in , by the definition of , and the lemma follows. ∎

###### Lemma 7.

If , for every . Additionally, if then .

###### Proof.

Let and . Since , by Lemma 6 we have that . Therefore, for every we have that . Since also contains every , we conclude that , and the lemma follows. ∎

###### Lemma 8.

If , for every . Additionally, if then .

###### Lemma 9.

If , if and then . Additionally, let be the unique vertex in . Then, .

The proofs of Lemma 8 and Lemma 9 are similar to the proof of Lemma 7 and are therefore omitted.

From the lemmas above we obtain the following corollaries.

###### Corollary 10.

###### Corollary 11.

If , is a collection of at most disjoint paths.

When we consider a subgraph of , we use in superscript to refer to a set (or an integer) defined for . For example, the set of all vertices such that and is adjacent in to one or two vertices in is denoted .

We now describe Rule (B4). Recall that the algorithm first finds an induced $P_3$. The main idea is to disconnect from the rest of the graph. If , the algorithm picks an arbitrary edge such that . Then, the algorithm branches on and . Namely, the algorithm builds two instances and . This process is repeated on each of these two instances: Let be one of the two instances. If , the algorithm picks an arbitrary edge such that and creates two new instances: and . This is repeated until every instance satisfies . An instance with generated by Rule (B4) will be called a first stage instance.

Next, on each first stage instance , the algorithm repeatedly applies Rule (B1) as follows. Since , we have by Corollary 11 that is a collection of at most disjoint paths. On each of these paths that has at least 7 vertices, the algorithm applies Rule (B1). If there are such paths, this generates instances from . These instances will be called second stage instances.

Let be a second stage instance and let be the first stage instance from which was generated. Note that is a collection of at most disjoint paths, where each path contains at most 6 vertices. Let be the connected component of in . We will later show that is an -almost clique. The algorithm uses the algorithm of Lemma 1 to find a set of minimum size such that is a cluster graph. Then, the algorithm makes a recursive call on . Note that the size of is at least 1 since is an induced in . This is repeated for every second stage instance. The instances will be called third stage instances.

We now show that is an -almost clique. By Lemma 4, . We have that and (Lemma 4). Additionally, and therefore . Moreover, and (since every edge in is an edge in ). Therefore, by Lemma 3, is an -almost clique.

We note that in the analysis of Rule (B4) in [1], it is claimed that if is a first stage instance then . However, this is not true. Suppose that , , , , , , , , , , and . In this graph, Rule (B4) generates an instance by deleting one by one all 8 edges between and .

We now bound the branching number of Rule (B4). We consider several cases. In the first case, suppose that and . Recall that the algorithm picks an edge and generates the instances and . Clearly, . We now show that . Since is obtained from by erasing the edges in , and all the edges in that are incident on are in , we have that . Therefore, . We now claim that does not contain an edge that is not in . Suppose conversely that is such an edge. Then must be in and for some . Therefore, in we have . This contradicts Lemma 7. Thus, does not contain edges that are not in . It follows that . By Corollary 10, for . Using the same arguments, for every first stage instance we have that . This implies that . Therefore, the number of first stage instances generated by Rule (B4) is at most .

Suppose for example that . In the worst case, Rule (B4) generates four first stage instances: , , , and . Additionally, , , and . Suppose that Rule (B1) is not applied on any of the four instances above. Then, the algorithm generates a third stage instance from each first stage instance . Since for every , we obtain that , , , . In other words, the branching vector of Rule (B4) in this case is at least .

To analyze the case in which Rule (B1) is applied (at least once) on at least one of the four first stage instances, note that for the sake of the analysis we can assume that the applications of Rule (B1) are done after the application of Rule (B4). For example, suppose that Rule (B1) is applied only on the instance and it is applied once on this instance, generating instances and , where . Therefore, the branching vector of Rule (B4) in this case is at least . However, we can assume for the analysis that Rule (B4) generates only four third stage instances, namely the instances , and then the algorithm applies Rule (B1) on . The branching vectors for these two rules are and , respectively.

For a general value of , we have that in the worst case Rule (B4) generates third stage instances (as discussed above, we can assume that Rule (B1) was not applied on the first stage instances). The branching vector is at least , where is a vector of length in which the value appears times, for . We note that this bound is not good enough for our purpose. Even for , we have that the branching vector has branching number of approximately 1.406 and we need a branching number less than 1.404. The solution is to give better bounds on the sizes of the sets . This will be discussed below.

Now consider the case . In this case as in the first case. However, we now can have . This occurs if . In this case, belongs to , and for every , the edge is in and not in . Note that in this case we have that and are not adjacent in to vertices in (If is adjacent in to or then must be adjacent to both and , otherwise the edge between and or is in . is also adjacent to and in . Therefore, is in or . By Lemma 6, , so ). Therefore, we can ignore the edges and . Formally, if the case above occurs, we mark the edges and for every . We now change the definition of to include all the unmarked edges with one endpoint in and the other endpoint in . For this new definition we have . Additionally, Corollary 10 remains true. Therefore, the analysis of the case is the same as the analysis for the case . That is, for a specific value of , the worst branching vector is at least .

We now consider the case . To handle this case (and to get a better bound on the branching number for the case ), we use a Python script. The script goes over possible cases for the graph and for each case it computes a branching vector that gives an upper bound on the branching number for this case. Formally, a configuration graph is a graph whose vertices are partitioned into 4 sets: (1) A set that contains 3 vertices which form a . (2) A set that contains vertices that are adjacent to 1 or 2 vertices of . (3) A set that contains vertices that are adjacent to at least one vertex in and are not adjacent to vertices in . (4) A set such that every vertex in has exactly one neighbor and this neighbor is in . Additionally, for every edge with at least one endpoint in . We note that the restriction on the degree of the vertices in is required in order to restrict the number of configuration graphs.

Let be a graph with an induced . We say that matches a configuration graph if there is a bijection such that (1) is an isomorphism between and . (2) maps to , respectively. (3) For a vertex , the number of neighbors of in is equal to the number of neighbors of in .

The script goes over all possible configuration graphs. For each configuration graph , the script builds a vector whose branching number is an upper bound on the branching number of Rule (B4) when it is applied on a graph which matches .

For each configuration graph the script generates a branching vector as follows. Suppose that is a graph that matches . The script generates graphs of the form like the generation of first stage instances in Rule (B4), except that now the process ends when . A graph with generated by the script will be called a first stage configuration graph. Consider some first stage configuration graph . This graph corresponds to the instance that Rule (B4) generates when it is applied on the graph and the induced path . Note that if then is not a first stage instance and Rule (B4) will continue generating instances from . By the analysis of the case above, in the worst case, the number of first stage instances generated from is . The vector gives a lower bound on the differences between the parameter and the parameters of these instances. Therefore, the vector can be the concatenation of for every first stage graph . We can get a better branching vector by giving a better bound on for the graphs of the second stage instances that are generated by Rule (B4) from an instance , where matches the first stage configuration graph . For our purpose, it suffices to give a simple bound based on the edges between and . For example, suppose that there are vertices such that and . Then, in there are vertices such that and . The edges and remain in every graph of a second stage instance generated from . Therefore, each such graph contains two edge disjoint induced $P_3$s ( and ). Therefore, . We use similar lower bounds in case of other edges between and .

All the branching vectors generated by the script have branching numbers less than 1.393. Therefore, the branching number of Rule (B4) is less than 1.393.

Since all the branching rules of the algorithm have branching numbers less than 1.415, it follows that the running time of the algorithm is $O^*(1.415^k)$.

## 4 New algorithm

Our algorithm first applies Rule (B1) and Rule (B2) until these rules cannot be applied (note that Rule (B3) is not applied). Let $G$ be a graph in which these rules cannot be applied. If $G$ does not contain an induced $C_4$ then the algorithm applies Rule (B4). Otherwise, the algorithm applies a new branching rule, denoted (B5), which is as follows. The algorithm picks an induced $C_4$ . Note that is an induced $P_3$. Let and be the sets defined in the previous section. We have that since . If is a clique then the algorithm applies Rule (B4) on . Otherwise, let be two non-adjacent vertices. Note that is an induced $C_4$. The algorithm applies Rule (B3) on the cycle . This generates two instances: and .

Consider an instance and note that is an induced $P_3$ in . If Rule (B2) is applicable on , the algorithm applies Rule (B2). Now suppose that Rule (B2) is not applicable. In , the vertex is adjacent to and not adjacent to . Therefore, . We also have , so by Lemma 2 we have that . Additionally, . If is a clique then the algorithm applies Rule (B4) on the graph and the path . Now suppose that is not a clique. Let be two non-adjacent vertices. We have that is an induced $C_4$ in . The algorithm applies Rule (B3) on the cycle . This generates two instances: and . Consider the instance . Again, we have that is an induced $P_3$ in . We now have that . By Lemma 2, Rule (B2) is applicable on . Using the same arguments, Rule (B2) is applicable on . Thus, the algorithm applies Rule (B2) on and on .

We now analyze the branching number of Rule (B5). There are three cases that we need to consider. In the first case, the algorithm generates four instances , , , and , and then applies Rule (B2) on each of these instances. Therefore, the branching vector in this case is and the branching number is less than 1.404. In the second case, the algorithm generates, without loss of generality, the instances , , and . The algorithm then applies Rule (B2) on and , and applies Rule (B2) or Rule (B4) on . The worst case is when Rule (B4) is applied on . The branching vector in this case is , and the branching number is less than 1.402. In the third case, the algorithm generates the instances and and applies Rule (B2) or Rule (B4) on each of these instances. The worst branching vector in this case is and the branching number is less than 1.398. Therefore, the branching number of Rule (B5) is less than 1.404.

Since all the branching rules of the algorithm have branching numbers less than 1.404, it follows that the running time of the algorithm is $O^*(1.404^k)$.

## References

[1] S. Böcker and P. Damaschke. Even faster parameterized cluster deletion and cluster editing. Information Processing Letters, 111(14):717–721, 2011.
[2] P. Damaschke. Bounded-degree techniques accelerate some parameterized graph algorithms. In Proc. 4th International Workshop on Parameterized and Exact Computation (IWPEC), pages 98–109, 2009.
[3] J. Gramm, J. Guo, F. Hüffner, and R. Niedermeier. Automated generation of search tree algorithms for hard graph modification problems. Algorithmica, 39(4):321–347, 2004.
[4] J. Gramm, J. Guo, F. Hüffner, and R. Niedermeier. Graph-modeled data clustering: Exact algorithms for clique generation. Theory of Computing Systems, 38(4):373–392, 2005.