In graph-theoretic terms, clustering is the task of partitioning the vertices of a graph into subsets, called clusters, such that there are many edges within each cluster and relatively few edges between clusters. In many applications, the clusters are restricted to induce cliques, as the represented data of each edge corresponds to a similarity value between two objects [17, 18]. Under the term cluster graph, which refers to a disjoint union of cliques, one may find a variety of applications that have been extensively studied [1, 6, 22]. Here we consider the Cluster Deletion problem, which asks for a minimum number of edge deletions from an input graph so that the resulting graph is a disjoint union of cliques. In the decision version of the problem, we are also given an integer $k$ and we want to decide whether at most $k$ edge deletions are enough to produce a cluster graph.
Although Cluster Deletion is NP-hard on general graphs, settling its complexity status when restricted to graph classes has attracted several researchers. Regarding the maximum degree of a graph, Komusiewicz and Uhlmann have shown an interesting complexity dichotomy result: Cluster Deletion remains NP-hard on -free graphs with maximum degree four, whereas it can be solved in polynomial time for graphs having maximum degree at most three. Quite recently, Golovach et al. have shown that it remains NP-hard on planar graphs. For graph classes characterized by forbidden induced subgraphs, Gao et al. showed that Cluster Deletion is NP-hard on -free graphs and on -free graphs. Regarding -free graphs, Grüttemeier et al. showed a complexity dichotomy result for any graph consisting of at most four vertices. In particular, for any graph on four vertices such that , Cluster Deletion is NP-hard on -free graphs, whereas it can be solved in polynomial time on - or paw-free graphs. Interestingly, Cluster Deletion remains NP-hard on -free chordal graphs.
On the positive side, Cluster Deletion has been shown to be solvable in polynomial time on cographs, proper interval graphs, split graphs, and -reducible graphs. More precisely, iteratively picking maximum cliques defines a clustering on the graph which actually gives an optimal solution on cographs (i.e., $P_4$-free graphs), as shown by Gao et al. In fact, the greedy approach of selecting a maximum clique provides a -approximation algorithm, though not necessarily in polynomial time. As the problem is already NP-hard on chordal graphs, it is natural to consider subclasses of chordal graphs such as interval graphs and split graphs. Although for split graphs there is a simple polynomial-time algorithm, for interval graphs only the complexity on a proper subclass, proper interval graphs, had been determined, by means of a polynomial-time algorithm. Settling the complexity of Cluster Deletion on interval graphs was left open [3, 2, 11].
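To illustrate the greedy strategy mentioned above, the following Python sketch repeatedly extracts a maximum clique from the remaining vertices. It is only a minimal sketch under stated assumptions: the function names are hypothetical, the graph is assumed to be given as a dictionary mapping each vertex to its set of neighbors, and the maximum clique is found by exhaustive search, so the running time is exponential in general.

```python
from itertools import combinations


def is_clique(adj, vertices):
    """Return True if the given vertices are pairwise adjacent."""
    return all(v in adj[u] for u, v in combinations(vertices, 2))


def maximum_clique(adj, vertices):
    """Exhaustive search for a maximum clique among `vertices` (illustration only)."""
    ordered = sorted(vertices)
    for size in range(len(ordered), 0, -1):
        for candidate in combinations(ordered, size):
            if is_clique(adj, candidate):
                return set(candidate)
    return set()


def greedy_clusters(adj):
    """Iteratively pick a maximum clique of the remaining graph as the next cluster."""
    remaining = set(adj)
    clusters = []
    while remaining:
        cluster = maximum_clique(adj, remaining)
        clusters.append(cluster)
        remaining -= cluster
    return clusters


# A paw (triangle 1-2-3 with pendant vertex 4), which is a cograph.
paw = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(greedy_clusters(paw))
```

On the paw, the sketch first selects the triangle and then the remaining vertex as a trivial cluster, so only the pendant edge is deleted.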
For proper interval graphs, Bonomo et al. characterized an optimal solution by the consecutiveness of each cluster with respect to the natural ordering of the vertices. Based on this fact, a dynamic programming approach led to a polynomial-time algorithm. It is not difficult to see that such consecutiveness does not hold on interval graphs, as potential clusters might need to break in the corresponding vertex ordering. Here we characterize an optimal solution on interval graphs whenever a cluster is required to break. In particular, we take advantage of the consecutive arrangement of maximal cliques and describe subproblems of maximal cliques containing the last vertex. One of our key observations is that the candidate clusters containing the last vertex can be enumerated in polynomial time given two vertex orderings of the graph. We further show that each such candidate cluster separates the graph in a recursive way with respect to optimal subsolutions, which enables us to define a dynamic programming table that keeps track of partial solutions. Thus, our algorithm for interval graphs suggests considering a particular consecutiveness of a solution and applying a dynamic programming approach defined by two vertex orderings.
Furthermore, we complement the previously known NP-hardness of Cluster Deletion on -free chordal graphs by providing a proper subclass of such graphs for which we prove that the problem remains NP-hard. This result is inspired and motivated by the very simple characterization of an optimal solution on split graphs: either a maximal clique constitutes the only non-edgeless cluster, or there are exactly two non-edgeless clusters whenever there is a vertex of the independent set that is adjacent to all the vertices of the clique except one. Due to the fact that true twins belong to the same cluster in an optimal solution, it is natural to consider true twins at the independent set, as they are expected not to influence the solution characterization. Surprisingly, we show that Cluster Deletion remains NP-complete even on such a slight generalization of split graphs. We then study two different classes of such generalizations of split graphs that can be viewed as the parallel of split graphs that admit disjoint clique-neighborhood and nested clique-neighborhood. For Cluster Deletion we provide polynomial-time algorithms on both classes of graphs.
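All graphs considered here are simple and undirected. A graph is denoted by $G=(V,E)$ with vertex set $V$ and edge set $E$. We use the convention that $n=|V|$ and $m=|E|$. The neighborhood of a vertex $v$ of $G$ is $N(v)=\{x \mid vx\in E\}$ and the closed neighborhood of $v$ is $N[v]=N(v)\cup\{v\}$. For $S\subseteq V$, $N(S)=\bigcup_{v\in S}N(v)\setminus S$ and $N[S]=N(S)\cup S$. A graph $H$ is a subgraph of $G$ if $V(H)\subseteq V(G)$ and $E(H)\subseteq E(G)$. For $X\subseteq V(G)$, the subgraph of $G$ induced by $X$, $G[X]$, has vertex set $X$, and for each vertex pair $u,v$ from $X$, $uv$ is an edge of $G[X]$ if and only if $u\neq v$ and $uv$ is an edge of $G$. For $R\subseteq E(G)$, $G\setminus R$ denotes the graph $(V(G),E(G)\setminus R)$, which is a subgraph of $G$, and for $S\subseteq V(G)$, $G-S$ denotes the graph $G[V(G)\setminus S]$, which is an induced subgraph of $G$. For two sets of vertices $A$ and $B$, we write $E(A,B)$ to denote the edges that have one endpoint in $A$ and one endpoint in $B$. Two adjacent vertices $u$ and $v$ are called true twins if $N[u]=N[v]$, whereas two non-adjacent vertices $u$ and $v$ are called false twins if $N(u)=N(v)$.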
A clique of is a set of pairwise adjacent vertices of , and a maximal clique of is a clique of that is not properly contained in any clique of . An independent set of is a set of pairwise non-adjacent vertices of . For , the chordless path on vertices is denoted by and the chordless cycle on vertices is denoted by . For an induced path , the vertices of degree one are called endvertices. A vertex is universal in if and is isolated if . A graph is connected if there is a path between any pair of vertices. A connected component of is a maximal connected subgraph of . For a set of finite graphs , we say that a graph is -free if does not contain an induced subgraph isomorphic to any of the graphs of .
The problem of Cluster Deletion is formally defined as follows: given a graph $G=(V,E)$, the goal is to compute a minimum set of edges $F\subseteq E$ such that every connected component of $G\setminus F$ is a clique. A cluster graph is a $P_3$-free graph, or equivalently, a graph in which every connected component is a clique. Thus, the task of Cluster Deletion is to turn the input graph $G$ into a cluster graph by deleting the minimum number of edges. Let $S=C_1,\ldots,C_k$ be a solution of Cluster Deletion such that each $G[C_i]$ is a clique. In such terms, the problem can be viewed as a vertex partition problem into $C_1,\ldots,C_k$. Each $C_i$ is simply called a cluster. Edgeless clusters, i.e., clusters containing exactly one vertex, are called trivial clusters. The edges of $G$ are partitioned into internal and external edges: an internal edge $uv$ has both its endpoints in the same cluster $C_i$, whereas an external edge $uv$ has its endpoints in different clusters $C_i$ and $C_j$, for $i\neq j$. Then, the goal of Cluster Deletion is to minimize the number of external edges, which is equivalent to maximizing the number of internal edges. We write $S(G)$ to denote an optimal solution for Cluster Deletion of the graph $G$, that is, a cluster subgraph of $G$ having the maximum number of edges. Given a solution $S(G)$, the number of edges incident only to the same cluster, that is, the number of internal edges, is denoted by $|S(G)|$.
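The equivalence between minimizing external edges and maximizing internal edges can be checked on small instances with the following Python sketch. It is an exhaustive search over all vertex partitions, intended only as an illustration of the problem definition and not as the algorithm developed in this paper; all function names are hypothetical, and the graph is assumed to be a dictionary mapping each vertex to its set of neighbors.

```python
from itertools import combinations


def partitions(items):
    """Yield every partition of the list `items` as a list of lists."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part  # `first` forms a new (trivial) cluster
        for i in range(len(part)):  # or `first` joins an existing cluster
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def is_clustering(adj, clusters):
    """Every cluster must induce a clique."""
    return all(v in adj[u] for c in clusters for u, v in combinations(c, 2))


def internal_edges(adj, clusters):
    """Number of edges with both endpoints in the same cluster."""
    return sum(1 for c in clusters for u, v in combinations(c, 2) if v in adj[u])


def optimal_cluster_deletion(adj):
    """Return (clusters, internal edge count, deleted edge count) by brute force."""
    m = sum(len(neighbors) for neighbors in adj.values()) // 2
    best, best_internal = None, -1
    for clusters in partitions(sorted(adj)):
        if is_clustering(adj, clusters):
            value = internal_edges(adj, clusters)
            if value > best_internal:
                best, best_internal = clusters, value
    return best, best_internal, m - best_internal


# Chordless path on four vertices: 1 - 2 - 3 - 4.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(optimal_cluster_deletion(path))
```

Since every edge is either internal or external, the number of deleted edges is always the total number of edges minus the number of internal edges, which is what the last component of the returned tuple reports.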
For a clique $C$, we say that a vertex $x$ is $C$-compatible if $C\setminus\{x\}\subseteq N(x)$. We start with a few preliminary observations regarding twin vertices. Notice that for true twins $x$ and $y$, if $x$ belongs to any cluster $C$ then $y$ is $C$-compatible.
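The notion of compatibility can be made concrete with a short Python sketch. It assumes that a vertex is compatible with a clique when it is adjacent to every other vertex of that clique; the function name `is_compatible` and the dictionary-of-sets graph representation are hypothetical choices made for illustration.

```python
def is_compatible(adj, x, clique):
    """Assumed definition: x is adjacent to every vertex of `clique` other than itself."""
    return all(u in adj[x] for u in clique if u != x)


# Vertices 1 and 2 are true twins (both have closed neighborhood {1, 2, 3}).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
cluster = {1, 3}

print(is_compatible(adj, 2, cluster))  # the true twin of vertex 1
print(is_compatible(adj, 4, cluster))  # vertex 4 is not adjacent to vertex 1
```

In this example the true twin of a cluster member is compatible with the cluster, in line with the observation above, whereas vertex 4 is not.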
Lemma 2.1.
Let and be true twins in . Then, in any optimal solution and belong to the same cluster.
The above lemma shows that we can contract true twins and look for a solution on a vertex-weighted graph that does not contain true twins. Even though false twins cannot be grouped into the same cluster as they are non-adjacent, we can actually disregard one of the false twins whenever their neighborhood forms a clique.
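The contraction of true twins can be sketched in Python as follows. This is a hedged illustration rather than a prescribed implementation: the function name `contract_true_twins` is hypothetical, the graph is assumed to be a dictionary of neighbor sets, and each class of true twins is represented by its smallest vertex, with a weight equal to the class size.

```python
def contract_true_twins(adj):
    """Group vertices by closed neighborhood and build the weighted quotient graph."""
    classes = {}
    for v in adj:
        # Vertices with equal closed neighborhoods are pairwise adjacent true twins.
        classes.setdefault(frozenset(adj[v] | {v}), []).append(v)

    representative = {v: min(members) for members in classes.values() for v in members}
    weight = {min(members): len(members) for members in classes.values()}

    quotient = {r: set() for r in weight}
    for u in adj:
        for v in adj[u]:
            if representative[u] != representative[v]:
                quotient[representative[u]].add(representative[v])
    return quotient, weight


adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
quotient, weight = contract_true_twins(adj)
print(quotient)
print(weight)
```

A cluster of the contracted graph whose vertices have total weight $w$ corresponds to a clique on $w$ original vertices, and therefore contributes $w(w-1)/2$ internal edges.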
Let and be false twins in such that is a clique. Then, there is an optimal solution such that constitutes a trivial cluster.
Let and be the clusters of and , respectively, in an optimal solution such that and . We construct another solution by replacing both clusters by and , respectively. To see that this is indeed a solution, first observe that is adjacent to all the vertices of because , and forms a clique by the assumption. Moreover, since and , we know that , implying that the number of internal edges in the constructed solution is at least as large as the number of internal edges of the optimal solution. ∎
Moreover, we prove the following generalization of Lemma 2.1.
Let and be two clusters of an optimal solution and let and . If is -compatible then is not -compatible.
Let be an optimal solution such that . Assume for contradiction that is -compatible. We show that is not optimal. Since is -compatible, we can move to and obtain a solution that contains the clusters and . Similarly, we construct a solution from , by moving to so that . Notice that the forms a clustering, since is -compatible. We distinguish between the following cases, according to the values and .
If then , because .
If then , because .
In both cases we reach a contradiction to the optimality of . Therefore, is not -compatible. ∎
Let be a cluster of an optimal solution and let . If there is a vertex that is -compatible and , then belongs to .
Assume for contradiction that belongs to a cluster different than . Then, observe that is -compatible. Indeed, for any vertex of , we know , since is adjacent to and . Thus, by Lemma 2.3 we reach a contradiction, so that . ∎
3 Polynomial-time algorithm on interval graphs
Here we present a polynomial-time algorithm for the Cluster Deletion problem on interval graphs. A graph is an interval graph if there is a bijection between its vertices and a family of closed intervals of the real line such that two vertices are adjacent if and only if the two corresponding intervals intersect. Such a bijection is called an interval representation of the graph, denoted by . We identify the intervals of the given representation with the vertices of the graph, interchanging these notions appropriately. Whether a given graph is an interval graph can be decided in linear time and if so, an interval representation can be generated in linear time . Notice that every induced subgraph of an interval graph is an interval graph.
Let be an interval graph. Instead of working with the interval representation of , we consider its sequence of maximal cliques. It is known that a graph with maximal cliques is an interval graph if and only if there is an ordering of the maximal cliques of , such that for each vertex of , the maximal cliques containing appear consecutively in the ordering (see e.g., ). A path following such an ordering is called a clique path of . Notice that a clique path is not necessarily unique for an interval graph. Also note that an interval graph with vertices contains at most maximal cliques. By definition, for every vertex of , the maximal cliques containing form a connected subpath in .
Given a vertex , we denote by the maximal cliques containing with respect to , where and are the first (leftmost) and last (rightmost) maximal cliques containing . Notice that holds. Moreover, for every edge of there is a maximal clique of that contains both endpoints of the edge. Thus, two vertices and are adjacent if and only if or .
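The following Python sketch illustrates how a clique path and the first and last clique indices of each vertex could be obtained from an interval representation, and how adjacency can then be tested from these indices alone. It is a minimal sketch under assumptions: the names `clique_path`, `clique_ranges` and `adjacent` are hypothetical, intervals are closed and given as a dictionary, clique indices are 0-based, and adjacency is tested as overlap of the two index ranges, which is an equivalent reformulation of the condition stated above.

```python
def clique_path(intervals):
    """Sweep the endpoints left to right and list the maximal cliques in order."""
    events = []
    for v, (left, right) in intervals.items():
        events.append((left, 0, v))   # 0: interval opens (processed first on ties)
        events.append((right, 1, v))  # 1: interval closes
    events.sort(key=lambda e: (e[0], e[1]))

    active, cliques, grew = set(), [], False
    for _, kind, v in events:
        if kind == 0:
            active.add(v)
            grew = True
        else:
            if grew:  # the active set is maximal just before it shrinks
                cliques.append(frozenset(active))
                grew = False
            active.remove(v)
    return cliques


def clique_ranges(cliques):
    """First and last index of a maximal clique containing each vertex."""
    first, last = {}, {}
    for i, clique in enumerate(cliques):
        for v in clique:
            first.setdefault(v, i)
            last[v] = i
    return first, last


def adjacent(first, last, u, v):
    """Two vertices are adjacent iff their clique-index ranges overlap."""
    return first[u] <= last[v] and first[v] <= last[u]


intervals = {"a": (0, 2), "b": (1, 4), "c": (3, 5), "d": (4, 6)}
cliques = clique_path(intervals)
first, last = clique_ranges(cliques)
print(cliques)
print(adjacent(first, last, "a", "b"), adjacent(first, last, "a", "c"))
```

The sweep records a clique only when an interval is about to close right after at least one interval has opened, so each recorded set is a maximal clique and the cliques appear in left-to-right order.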
For a set of vertices , we write and to denote the minimum and maximum value, respectively, among all with . Similarly, and correspond to the minimum and maximum value, respectively, with respect to .
With respect to the Cluster Deletion problem, observe that for any cluster of a solution, we know that where , as forms a clique. A vertex is called guarded by two vertices and if
For a clique , observe that is -compatible if and only if there exists a maximal clique such that with .
Let be three vertices of such that is guarded by and . If and belong to the same cluster of an optimal solution and is -compatible then .
To ease the presentation, for three non-negative numbers we write if holds. Without loss of generality, assume that . Assume for contradiction that belongs to another cluster . We apply Lemma 2.3 to either and or and . To do so, we need to show that is -compatible or is -compatible, as is already -compatible. Since is a cluster that contains , there is a maximal clique such that with .
We show that or . If then , because . As is guarded by and , we know that . Now observe that if then , implying that and are non-adjacent, reaching a contradiction to the fact that . Thus, which shows that . This means that or .
Hence, or belong to the maximal clique for which . Therefore, at least one of or is -compatible and by Lemma 2.3 we conclude that . ∎
Let be an ordering of the vertices such that . For every with , we define the following set of vertices:
That is, contains all vertices that are guarded by and . We write to denote the value of and we simply write and instead of and . Notice that for a neighbor of with , we have either or . This means that all neighbors of that are totally included (i.e., all vertices such that ) belong to for any with . To distinguish such neighbors of , we define the following sets:
contains the neighbors of such that (neighbors of in that partially overlap ).
contains the neighbors of such that (neighbors of that are totally included within ).
In the forthcoming arguments, we restrict ourselves to the graph induced by . It is clear that the first maximal clique that contains a vertex of is , whereas the last maximal clique is .
We now explain the necessary sets that our dynamic programming algorithm uses in order to compute an optimal solution of . For two vertices with , we define the following:
is the value of an optimal solution for Cluster Deletion of the graph .
To ease the notation, when we say a cluster of we mean a cluster of an optimal solution of . Notice that is the desired value for the whole graph , since .
Our task is to construct the values for by taking into account all possible clusters that contain . To do so, we show that (i) the number of clusters containing in is polynomial and (ii) each such candidate cluster containing separates the graph in a recursive way with respect to optimal subsolutions.
Observe that if then if and only if , whereas if and only if ; in the latter case, it is not difficult to see that , according to the definition of . Thus, whenever holds, we have . The candidates of a cluster of containing lie among and . Let us show with the next two lemmas that we can restrict ourselves to a polynomial number of such candidates. To avoid repeating ourselves, in the forthcoming statements we let be two vertices with .
Let be a cluster of containing . If there is a vertex such that then there is a maximal clique with such that and .
Since , we know that there is a maximal clique for which with . We show that all other vertices of are guarded by and . Notice that for every vertex we already know that and . Thus, for every vertex we have and . This means that all vertices of are guarded by and . Moreover, since , we know that all vertices of are -compatible. Therefore, we apply Lemma 3.1 to every vertex of , showing that . Furthermore, there is no vertex of that belongs to , because . ∎
By Lemma 3.2, we know that we have to pick the entire set for constructing candidates to form a cluster that contains and some vertices of . As there are at most choices for , we get a polynomial number of such candidate sets. We next show that we can construct a polynomial number of candidate sets that contain and vertices of . To do so, we consider the vertices of in increasing order with respect to their first maximal clique. More precisely, let be an increasing order of the vertices of such that . The right part of Figure 1 illustrates the corresponding case.
Let be a cluster of containing and let . If then every vertex of that is -compatible belongs to .
Let be a vertex of . We show that is guarded by and . By the definition of , we know that . Moreover, observe that holds by the fact that and . Thus, we apply Lemma 3.1 to , because and is -compatible, showing that as desired. ∎
For , let . Observe that each may be an empty set. On the part , all vertices are grouped into the sets . Similar to , let . Then, all vertices of are -compatible and all vertices of are -compatible. Figure 1 depicts the corresponding sets.
Let be a cluster of containing . Then, there is such that .
Assume for contradiction that no set is contained in . Let and let . Notice that because of the assumption as there are no other neighbors of in . Then, holds, because . We show that . Observe that . If then clearly . Assume that and let be a non-empty subset of that forms a cluster in . Then, all vertices of are -compatible and all vertices of are -compatible, because . Thus, we reach a contradiction by Lemma 2.3 to the optimality of . This means that there is a vertex that is contained in together with . Therefore, by Lemma 3.2, there is a set that is included in . ∎
All vertices of a cluster containing belong to . Thus, can be partitioned into and . Also notice that for some . Combined with the previous lemmas, we can enumerate all such subsets of in polynomial time. In particular, we first build all candidates for , which are exactly the sets by Lemma 3.2 and Lemma 3.4. Then, for each such candidate , we apply Lemma 3.3 to construct all subsets containing the last vertices of . Thus, there are at most number of candidate sets from the vertices of that belong to the same cluster with .
3.1 Splitting into partial solutions
We further partition the vertices of . Given a pivot group , we consider the vertices that lie on the right part of . More formally, for , we define the set
The reason of breaking the vertices of the part into sets is the following.
Let be a cluster of such that , for . Then, for any two vertices and , there is no cluster of that contains both of them.
First observe that . We consider two cases for , depending on whether or not. Assume that . If , then by Lemma 3.2, which implies that . If then .
Now assume that . If , then does not belong to , so that . If , then we show that does not belong to a cluster with any vertex of . Assume for contradiction that belongs to a cluster such that . This means that with and . Then is -compatible and is -compatible, as both and belong to . Therefore, by Lemma 2.3 we reach a contradiction to and belonging to different clusters. ∎
For a non-empty set , we write to denote the following solutions:
, where is the vertex of having the smallest and is the vertex of having the largest .
Having this notation, observe that , for any with . However, it is important to notice that does not necessarily represent the optimal solution of , since the vertices of may not be consecutive with respect to , so that is only a subset of in the corresponding solution for . Under the following assumptions, with the next result we show that for the chosen sets we have .
Let be two vertices with and let , for any maximal clique of with .
If then , where and .
If then , where and .
We prove the case for . As each contains vertices of , we have . Observe that either or . In both cases we show that . Assume that there is a vertex with . Then as , and by the consecutiveness of the clique path. This shows that because . Thus, . We show that . If there is a vertex in with then leading to a contradiction that . Hence we have and . Moreover, observe that by the definition of , we already know that . Now it remains to notice that for every vertex with and we have . This follows from the fact that and . Therefore we get . Completely symmetric arguments show the case for . ∎
Given the clique path , a clique-index is an integer . Let be two clique-indices such that and . We denote by the minimum value of among all vertices of having . Clearly, holds. A pair of clique-indices is called an admissible pair for a vertex , if both and hold. Given an admissible pair , we define the following set of vertices:
Observe that all vertices of induce a clique in , because . We say that a vertex crosses the pair if and . It is not difficult to see that for a vertex that crosses , we have . We prove the following properties of .
Let be two vertices with and let be an admissible pair for . Moreover, let be the vertices of having the smallest and largest , respectively. If the vertices of form a cluster in then the following statements hold:
If holds for a vertex , then crosses .
Every vertex of does not belong to the same cluster with any vertex of .
Every vertex that crosses does not belong to the same cluster with any vertex having .
First we show that . Assume that there is a vertex . Then and is distinct from because, by definition, . Also notice that implies and . By the second inequality, we get . Suppose that . As we already know that , we conclude that leading to a contradiction that . Thus we have and , showing that . This means that , so that .
For the second statement, observe that if then . Since , we conclude that by the first statement. Thus holds, implying that crosses .
With respect to the third statement, observe that no vertex of belongs to the clique . This means that all vertices of belong to both sets and . Thus Lemma 3.5 and the first statement show that no two vertices and belong to the same cluster.
For the fourth statement, let be a vertex that crosses . By the first statement we know that . If then and the third statement show that and do not belong to the same cluster. Suppose that . If then contradicting the fact that . Putting together, we have . Now assume for contradiction that and belong to the same cluster . By the fact that , observe that . We consider the graph induced by . We show that there is a vertex of that is -compatible and there is a vertex of that is -compatible. Notice that is -compatible, because crosses so that . To see that there is a vertex of that is -compatible, choose to be the vertex of having the smallest . This means that . Then is adjacent to every vertex of because and . Thus, is -compatible. Therefore, Lemma 2.3 shows the desired contradiction, implying that and do not belong to the same cluster. ∎
Notice that the number of admissible pairs for is polynomial because there are at most choices for each clique-index. Moreover, if then . A pair of clique-indices with is called a bounding pair for if either holds, or crosses . Given a bounding pair for , we write to denote the set of bounding pairs for such that
, whenever holds, and