 # Improved approximation algorithms for hitting 3-vertex paths

We study the problem of deleting a minimum cost set of vertices from a given vertex-weighted graph in such a way that the resulting graph has no induced path on three vertices. This problem is often called cluster vertex deletion in the literature and admits a straightforward 3-approximation algorithm since it is a special case of the vertex cover problem on a 3-uniform hypergraph. Recently, You, Wang, and Cao described an efficient 5/2-approximation algorithm for the unweighted version of the problem. Our main result is a 9/4-approximation algorithm for arbitrary weights, using the local ratio technique. We further conjecture that the problem admits a 2-approximation algorithm and give some support for the conjecture. This is in sharp contrast with the fact that the similar problem of deleting vertices to eliminate all triangles in a graph is known to be UGC-hard to approximate to within a ratio better than 3, as proved by Guruswami and Lee.


## 1. Introduction

Graphs in this paper are finite, simple, and undirected. Given a graph $G$ and a cost function $c$ on its vertices, the cluster vertex deletion problem (Cluster-VD) is to find a minimum cost set $X$ of vertices such that each component of $G - X$ is a complete graph. Equivalently, $X$ is a feasible solution if and only if $G - X$ contains no induced subgraph isomorphic to $P_3$, the path on three vertices.

The problem admits a straightforward 3-approximation algorithm: assuming unit costs for simplicity, build any inclusionwise maximal collection of vertex-disjoint induced $P_3$'s in the graph and include in the solution every vertex covered by some member of the collection. If the collection contains $k$ subgraphs, then $k$ is a lower bound on the optimum, since every feasible solution contains at least one vertex from each of them. On the other hand, the cost of the solution is $3k$.
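The greedy argument above can be sketched in a few lines of Python. This is only an illustration for unit costs; the adjacency representation (a dict mapping each vertex to its set of neighbours) and the function names `find_induced_p3` and `greedy_three_approx` are assumptions made for this sketch, not part of the paper.

```python
from itertools import combinations


def find_induced_p3(adj, allowed):
    """Return (a, b, c) with middle vertex b forming an induced P3 inside
    `allowed`, or None if the induced subgraph on `allowed` has none."""
    for b in sorted(allowed):
        nbrs = sorted(adj[b] & allowed)
        for a, c in combinations(nbrs, 2):
            if c not in adj[a]:  # a-b-c is a path and a, c are nonadjacent
                return (a, b, c)
    return None


def greedy_three_approx(adj):
    """Build a maximal packing of vertex-disjoint induced P3's and return
    (solution, packing), where solution is the set of covered vertices."""
    remaining = set(adj)
    packing = []
    while True:
        p3 = find_induced_p3(adj, remaining)
        if p3 is None:
            break
        packing.append(p3)
        remaining -= set(p3)  # keeps later P3's disjoint from earlier ones
    return set(adj) - remaining, packing


# Path on five vertices: 0 - 1 - 2 - 3 - 4
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
solution, packing = greedy_three_approx(path)
print(solution, packing)
```

The loop stops exactly when the remaining graph has no induced $P_3$, so the returned set is feasible and has three vertices per packed path.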

The problem also admits an approximation-preserving reduction from Vertex Cover: given any graph, form a new graph by adding a pendent edge to every vertex. Solving Vertex Cover on the original graph is then equivalent to solving Cluster-VD on the new graph. Hence, known hardness and inapproximability results for Vertex Cover apply to Cluster-VD as well, and in particular it is UGC-hard to approximate Cluster-VD to within any ratio better than 2. We show, however, that we can come close to 2.
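As a sanity check of this reduction, the following sketch builds the graph with pendent edges and compares the two optima by brute force on a 5-cycle with unit costs. The helper names and the `("leaf", v)` labels for the new vertices are hypothetical choices for this example; the exhaustive search is exponential and only meant for tiny graphs.

```python
from itertools import combinations


def add_pendent_edges(adj):
    """Return a copy of the graph in which every original vertex v gets a
    new degree-one neighbour labelled ("leaf", v)."""
    new_adj = {v: set(nbrs) for v, nbrs in adj.items()}
    for v in adj:
        leaf = ("leaf", v)
        new_adj[leaf] = {v}
        new_adj[v].add(leaf)
    return new_adj


def is_vertex_cover(adj, cover):
    return all(u in cover or v in cover for u in adj for v in adj[u])


def is_cluster_deletion_set(adj, removed):
    """True if deleting `removed` leaves no induced P3."""
    keep = set(adj) - removed
    for b in keep:
        for a, c in combinations(adj[b] & keep, 2):
            if c not in adj[a]:
                return False
    return True


def brute_force_min_size(adj, feasible):
    """Smallest size of a vertex set accepted by `feasible`."""
    vertices = list(adj)
    for size in range(len(vertices) + 1):
        for subset in combinations(vertices, size):
            if feasible(adj, set(subset)):
                return size


cycle5 = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
vc = brute_force_min_size(cycle5, is_vertex_cover)
cvd = brute_force_min_size(add_pendent_edges(cycle5), is_cluster_deletion_set)
print(vc, cvd)
```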

###### Theorem 1.

Cluster-VD admits a 9/4-approximation algorithm.

We further conjecture that Cluster-VD can be 2-approximated in polynomial time, as is the case for Vertex Cover. We give some support for this conjecture in Section 5, where we notice that our 9/4-approximation algorithm is in fact a 2-approximation algorithm for the case where the largest clique in the input graph has size at most 4, and can be easily modified to a 2-approximation algorithm if the input graph does not contain any diamond ($K_4$ minus an edge) as an induced subgraph.

In contrast, the problem of finding a minimum cost set of vertices whose deletion leaves a graph with no triangle is known to be UGC-hard to approximate to within any ratio better than 3, as proved by Guruswami and Lee.

#### Previous Work.

Cluster-VD was previously mostly studied in terms of fixed parameter algorithms. Hüffner, Komusiewicz, Moser, and Niedermeier  first gave a -time fixed-parameter algorithm, parameterized by the solution size , where and denote the number of vertices and edges of the graph, respectively. This was subsequently improved by Boral, Cygan, Kociumaka, and Pilipczuk , who gave a -time algorithm. See also Iwata and Oka  for related results in the fixed parameter setting.

As for approximation algorithms, nothing better than a 3-approximation was known until the recent work of You, Wang, and Cao, who showed that the unweighted version of Cluster-VD admits a 5/2-approximation algorithm.

In a previous version of this paper , we gave a -approximation algorithm for Cluster-VD. The algorithm in this version of the paper achieves a better approximation ratio and is at the same time much simpler.

Finally, we note that there has been recent activity on another restriction of the vertex cover problem on 3-uniform hypergraphs, namely, the feedback vertex set problem in tournaments. For that problem, the 5/2-approximation algorithm by Cai, Deng and Zang was the best known for many years, until the very recent work of Mnich, Vassilevska Williams and Végh, who found a 7/3-approximation algorithm for the problem.

#### Our approach.

Our approximation algorithm is based on the local ratio technique. In order to illustrate the general approach, let us give a very simple 2-approximation algorithm for hitting all $P_3$-subgraphs (instead of induced subgraphs) in a given weighted graph, see Algorithm 1 below.

It can be easily verified that the set returned by Algorithm 1 is an inclusionwise minimal feasible solution. The reason why the algorithm is a 2-approximation is that the solution returned by the algorithm misses at least one of the vertices of the weighted star, and thus has a local cost of at most twice the optimum cost of the star.
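Since Algorithm 1 is not reproduced in this text, the following Python sketch shows one local ratio scheme that is consistent with the description. The star weighting used here (weight $d - 1$ on a centre with $d \geqslant 2$ neighbours and unit weight on each leaf) is an assumption of this sketch, as are the function names; with these weights the star has optimum $d - 1$ and total weight $2d - 1$, so a minimal solution missing one star vertex pays at most $2(d - 1)$ locally.

```python
from fractions import Fraction


def has_p3_subgraph(adj, keep):
    """A (not necessarily induced) P3 exists iff some kept vertex has at
    least two kept neighbours."""
    return any(len(adj[v] & keep) >= 2 for v in keep)


def hit_p3_subgraphs(adj, cost, keep=None):
    """Local ratio sketch: returns an inclusionwise minimal set hitting all
    P3-subgraphs of the graph induced on `keep`."""
    keep = set(adj) if keep is None else keep
    cost = {v: Fraction(cost[v]) for v in keep}
    if not has_p3_subgraph(adj, keep):
        return set()
    zero = next((v for v in keep if cost[v] == 0), None)
    if zero is not None:
        sol = hit_p3_subgraphs(adj, cost, keep - {zero})
        # Put the zero-cost vertex back only if it is really needed,
        # which preserves inclusionwise minimality.
        if has_p3_subgraph(adj, keep - sol):
            sol = sol | {zero}
        return sol
    centre = next(v for v in keep if len(adj[v] & keep) >= 2)
    leaves = adj[centre] & keep
    star = {u: Fraction(1) for u in leaves}        # assumed star weights
    star[centre] = Fraction(len(leaves) - 1)
    lam = min(cost[v] / star[v] for v in star)     # largest feasible step
    reduced = dict(cost)
    for v in star:
        reduced[v] -= lam * star[v]                # some vertex drops to 0
    return hit_p3_subgraphs(adj, reduced, keep)


path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
costs = {0: 1, 1: 3, 2: 1, 3: 1}
print(hit_p3_subgraphs(path, costs))
```

Exact rational arithmetic via `Fraction` avoids floating point issues when testing for zero cost.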

We remark that a 2-approximation algorithm for the problem of hitting $P_3$-subgraphs can also be obtained via a straightforward modification of the primal/dual 2-approximation algorithm of Chudak et al. for the feedback vertex set problem. (Indeed, this is exactly what was done by Tu and Zhou.) However, the resulting algorithm is much more complicated than Algorithm 1.

It is perhaps worth pointing out that, in the case of triangle-free graphs, hitting $P_3$'s and hitting induced $P_3$'s are the same problem. This was actually an important insight for the 5/2-approximation algorithm of You, Wang, and Cao. However, for arbitrary graphs the induced version of the problem seems much more difficult. Nevertheless, we are tempted to take the simplicity of Algorithm 1 as a hint that the local ratio technique is a good approach to attack the problem.

From a high-level point of view, the structure of our 9/4-approximation algorithm for Cluster-VD is as follows: as long as there is an induced $P_3$ in the graph, either we can apply a reduction operation (identifying true twins) that does not change the optimum, or we find some induced subgraph $H$ and decrease the weights of its vertices in $G$ proportionally to a carefully chosen weighting $c_H$ for the vertices of $H$, ensuring a local ratio of 9/4. (We remark that $c_H$ depends on $H$ only and is thus independent of the weights of vertices in $G$, similarly as in Algorithm 1.)

The induced subgraphs we consider are as follows: cycles of length 4 ($C_4$'s), 5-cliques plus distinguishing sets ($K_5$'s plus distinguishing sets), and second-neighborhood subgraphs induced by the vertices at distance at most two from a maximum degree vertex of $G$.

## 2. Definitions and Preliminaries

Let be a graph. Recall that the feasible solutions to Cluster-VD in are the sets of vertices that intersect every induced subgraph isomorphic to . For this reason, we call such sets hitting sets of . We denote by the minimum size of a hitting set of . The definitions extend naturally in the weighted setting: Given a weighted graph , where , we let denote the minimum weight (cost) of a hitting set of . As expected, the weight (or cost) of set is defined as .

For , the subgraph of induced by is denoted by . When is an induced subgraph of or isomorphic to an induced subgraph of , we sometimes say that contains . If does not contain , we also say that is -free.

For , the neighborhood of is denoted by . From time to time, to indicate that is a neighbor of , we simply say that sees .

## 3. Tools

### 3.1. True Twins and Distinguishers

Two vertices of a graph are called true twins if they are adjacent and have the same neighborhood in . True twins have a particularly nice behavior regarding Cluster-VD, as proved in our next lemma. This is our first main technical tool.

###### Lemma 2.

Let be a weighted graph and be true twins. Let denote the weighted graph obtained from by transferring the whole cost of to and then deleting , that is, let and if and if . Then .

###### Proof.

We have because every hitting set of yields a hitting set of with the same cost: we let if contains and otherwise. Here we use that no induced in contains both and .

Conversely, we have because any inclusionwise minimal cost hitting set of either contains both of the true twins and , or none of them. ∎
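The twin reduction of Lemma 2 can be checked on a small example. The sketch below is illustrative only: the names `find_true_twins`, `merge_true_twins` and `opt` are hypothetical, graphs are dicts of neighbour sets, and the optimum is computed by exhaustive search.

```python
from itertools import combinations


def find_true_twins(adj):
    """Return adjacent vertices with equal closed neighbourhoods, or None."""
    for u in adj:
        for v in adj[u]:
            if adj[u] | {u} == adj[v] | {v}:
                return u, v
    return None


def merge_true_twins(adj, cost, u, v):
    """Move the whole cost of u onto v, then delete u."""
    new_adj = {x: nbrs - {u} for x, nbrs in adj.items() if x != u}
    new_cost = {x: w for x, w in cost.items() if x != u}
    new_cost[v] = cost[u] + cost[v]
    return new_adj, new_cost


def has_induced_p3(adj, keep):
    for b in keep:
        for a, c in combinations(adj[b] & keep, 2):
            if c not in adj[a]:
                return True
    return False


def opt(adj, cost):
    """Brute-force minimum cost of a hitting set (tiny graphs only)."""
    best = None
    for size in range(len(adj) + 1):
        for removed in combinations(adj, size):
            if not has_induced_p3(adj, set(adj) - set(removed)):
                weight = sum(cost[x] for x in removed)
                best = weight if best is None else min(best, weight)
    return best


# Triangle 0, 1, 2 with a pendent vertex 3 attached to 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cost = {0: 1, 1: 2, 2: 5, 3: 4}
u, v = find_true_twins(graph)
merged_graph, merged_cost = merge_true_twins(graph, cost, u, v)
print(opt(graph, cost), opt(merged_graph, merged_cost))
```

In this example vertices 0 and 1 are the only true twins, and the cheapest solution deletes both of them, which matches the "both or none" argument in the proof.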

If does not contain any pair of true twins, we say that is twin-free.

Notice that two adjacent vertices and are not true twins if and only if has an induced containing and . The third vertex of such a is adjacent to one of and , and nonadjacent to the other. We say that it is a distinguisher for the edge , and call the induced a distinguishing .
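A distinguisher for an edge is simply a third vertex adjacent to exactly one endpoint, so the set of all distinguishers is a symmetric difference of neighbourhoods. The function name `distinguishers` and the dict-of-sets graph representation are assumptions of this small sketch.

```python
def distinguishers(adj, u, v):
    """All vertices adjacent to exactly one of the adjacent vertices u, v.
    The edge uv joins true twins exactly when this set is empty."""
    if v not in adj[u]:
        raise ValueError("u and v must be adjacent")
    # The symmetric difference also contains u and v themselves, because
    # u is a neighbour of v but not of itself (and vice versa).
    return (adj[u] ^ adj[v]) - {u, v}


path = {0: {1}, 1: {0, 2}, 2: {1}}
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(distinguishers(path, 0, 1))      # vertex 2 sees 1 but not 0
print(distinguishers(triangle, 0, 1))  # empty: 0 and 1 are true twins
```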

Now let . A set disjoint from is said to be a distinguishing set for if for every edge whose endpoints are true twins in , the set contains a distinguisher for the edge .

###### Lemma 3.

Let be a graph whose vertex set is partitioned into a clique and a distinguishing set for . Then, there exists a weight function such that for all , and every set hitting each distinguishing has weight . In particular, .

###### Proof.

First, we claim that for every fixed , the set of edges of that are distinguished by but are not distinguished by any other vertex of forms a matching. Indeed, assume that has two incident edges and that are uniquely distinguished by . Then either is adjacent to both and , or to none of them. Thus does not distinguish the edge . Let be any vertex distinguishing the edge . Then is a distinguisher of or that is distinct from , a contradiction.

Next, we define the weight function by the following iterative procedure.

• Pick any distinguisher .

• Let denote the edges of that are uniquely distinguished by . By the claim, is a matching. Define .

• Let be any set of vertices hitting each edge of exactly once. Delete the vertices of from , delete from , and repeat until there are no more vertices in .

Notice that at each step remains a distinguishing set for . Notice also that the graph obtained after deleting from the distinguishing set and from the clique does not depend on the particular choice of . Indeed, all the possible choices for lead to isomorphic graphs since gets deleted.
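The iterative weighting procedure can be sketched as follows. The formula defining the weight of the picked distinguisher is not legible in this text; the sketch assumes it equals the number of edges uniquely distinguished by that vertex, which is the choice that makes the total weight on the distinguishing set equal to $|C| - 1$, as used later in the proof of Lemma 5. All function names are hypothetical.

```python
from itertools import combinations


def distinguishes(adj, d, u, v):
    """True if d is adjacent to exactly one of u and v."""
    return (u in adj[d]) != (v in adj[d])


def lemma3_weights(adj, clique, dist_set):
    """Unit weights on the clique; weights on the distinguishing set are
    assigned one distinguisher at a time (assumed weight: |M|)."""
    clique = set(clique)
    remaining = list(dist_set)
    weight = {v: 1 for v in clique}
    while remaining:
        d = remaining.pop()  # pick any distinguisher
        # Edges of the current clique distinguished by d and by no other
        # remaining distinguisher; by the claim they form a matching.
        unique = [
            (u, v)
            for u, v in combinations(sorted(clique), 2)
            if distinguishes(adj, d, u, v)
            and not any(distinguishes(adj, x, u, v) for x in remaining)
        ]
        weight[d] = len(unique)
        # Delete one endpoint per matching edge from the clique.
        clique -= {u for u, _ in unique}
    return weight


# Clique {0, 1, 2}; vertex 3 sees only 0, vertex 4 sees only 1.
graph = {0: {1, 2, 3}, 1: {0, 2, 4}, 2: {0, 1}, 3: {0}, 4: {1}}
weights = lemma3_weights(graph, {0, 1, 2}, [3, 4])
print(weights)
```

The procedure assumes its input really is a clique together with a distinguishing set; it does not validate this.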

Finally, we show that every set hitting all the distinguishing ’s has weight at least , by induction.

Let denote the first distinguisher picked by the weighting procedure and the corresponding set . If , consider the reduced instance , . It is true that hits all the distinguishing ’s for this new instance. By induction, we get .

Now assume that . Thus meets each edge of at least once. Let be any set meeting each edge of exactly once. By the remark above, we may assume that . As before, consider the reduced instance , . Clearly, hits all the distinguishing ’s for this instance. By induction, we get . ∎

### 3.2. α-Good Induced Subgraphs

Given a graph , an induced subgraph of , and a weighting , we say that is -good in if for every inclusionwise minimal hitting set of we have

$$\sum_{v \in X \cap V(H)} c_H(v) \leqslant \alpha \cdot \mathrm{OPT}(H, c_H). \tag{1}$$

Moreover, we say that an induced subgraph of is itself -good in if there exists a weighting such that is -good.
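For very small graphs, condition (1) can be verified exhaustively. The sketch below enumerates all inclusionwise minimal hitting sets of the host graph and compares their local cost with $\alpha$ times the optimum of the weighted subgraph. The names `is_alpha_good`, `hitting_sets` and `opt` are assumptions of this sketch, and the running time is exponential.

```python
from itertools import combinations


def has_induced_p3(adj, keep):
    for b in keep:
        for a, c in combinations(adj[b] & keep, 2):
            if c not in adj[a]:
                return True
    return False


def hitting_sets(adj, vertices):
    """All hitting sets of the subgraph induced on `vertices`."""
    vertices = set(vertices)
    for size in range(len(vertices) + 1):
        for subset in combinations(sorted(vertices), size):
            if not has_induced_p3(adj, vertices - set(subset)):
                yield set(subset)


def opt(adj, vertices, cost):
    return min(sum(cost[v] for v in s) for s in hitting_sets(adj, vertices))


def is_alpha_good(adj, h_vertices, c_h, alpha):
    """Check inequality (1) for every inclusionwise minimal hitting set."""
    h_vertices = set(h_vertices)
    all_sets = list(hitting_sets(adj, adj))
    minimal = [s for s in all_sets if not any(t < s for t in all_sets)]
    bound = alpha * opt(adj, h_vertices, c_h)
    return all(sum(c_h[v] for v in s & h_vertices) <= bound for s in minimal)


# A 4-cycle 0-1-2-3 with a pendent vertex 4 attached to 0.
graph = {0: {1, 3, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 0}, 4: {0}}
unit = {v: 1 for v in (0, 1, 2, 3)}
print(opt(graph, {0, 1, 2, 3}, unit), is_alpha_good(graph, {0, 1, 2, 3}, unit, 2))
```

With unit weights the 4-cycle has total weight 4 and optimum 2, so the check with $\alpha = 2$ succeeds for any host graph containing it as an induced subgraph.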

We start by considering two different types of weighted induced subgraphs that satisfy the stronger condition , which obviously implies that they are -good.

###### Lemma 4.

Let be a graph. If is an induced in , then is -good.

###### Proof.

We let for all . Then and

$$\sum_{v \in V(H)} c_H(v) = 4 = 2 \cdot \mathrm{OPT}(H, c_H).$$

###### Lemma 5.

Let be a twin-free graph, let be a -clique in and let be a distinguishing set for . The induced subgraph is -good in for .

###### Proof.

With the weight function defined in Lemma 3, we have

$$\sum_{v \in V(H)} c_H(v) = |C| + |C| - 1 = 9 \leqslant (9/4) \cdot \mathrm{OPT}(H, c_H).$$

The next lemma is our main tool for constructing -good weighted induced subgraphs for . This time, we use the minimality of the hitting set to establish -goodness, however in a very simple way.

###### Lemma 6.

Let be a graph that is twin-free, -free and -free. Let be a vertex of maximum degree, and let , …, denote the components of . For , let denote the set of vertices in that see at least one vertex in . Let denote the subgraph of induced by . Then there exists a weight function such that is -good in .

###### Proof.

Notice that since is -free, the sets are pairwise disjoint.

In all cases except in one sporadic case (part of Case 1.3 below), we let for all , that is, we put unit weight on these vertices. The weights on the vertices in will be determined later.

Let denote a minimal hitting set of . We wish to show that (1) always holds for our choice of weights and . We split the discussion into two cases according to the number of components of . Each of these cases is split into several subcases according to the structure of the induced subgraphs , .

In all the cases, we make sure that the weight on is at least , and hence

$$\sum_{v \in X \cap V(H)} c_H(v) \leqslant \sum_{v \in V(H)} c_H(v) - 1.$$

This follows from the assumption that is minimal: has to exclude at least one of the vertices of , and each of these vertices has weight at least . In order to prove -goodness, it suffices then to show that .

Case 1. . Then .

Case 1.1. is a clique. We let and use Lemma 3 on the clique and distinguishing set to define weights on . We get and , and thus .

Case 1.2. is not a clique and has clique number . If , we let and for . Then . This can be seen as follows. Let denote a minimum weight hitting set of . Either contains and at least one vertex of , or does not contain and is a clique. In both cases the weight of is at least . We have .

Otherwise, and is a . Let denote the middle vertex of this . Since is twin-free, and are not true twins. Thus, there exists a vertex that sees and not . We put unit weights on and , and zero weights on the vertices of . We get .

Case 1.3. is not a clique and has clique number . First, assume that and the minimum size of a hitting set of is at least . We let and for . By an argument similar to that used in Case 1.2, we have . Then .

Second, assume that there is a vertex that is a hitting set of . Because is connected, not a clique, and does not contain any -clique, one can check that the following holds for the graph : (1) has no true twin, (2) every pair of true twins lie in a triangle, (3) every triangle contains a pair of true twins, and (4) every two pairs of true twins are vertex-disjoint and there is no edge between them.

There is at least one pair of true twins in (since has a triangle), and each pair of true twins in is distinguished in by some vertex in . Moreover, every two such pairs are distinguished by distinct vertices in , since is -free. So there is a nonempty set with the following properties: (i) every pair of true twins in has a distinguisher in , (ii) there are vertex-disjoint induced ’s with one endvertex in and the other two vertices in .

Assume for now that . Then, we put a weight of on , unit weights on the vertices of and zero weights on the vertices of . Consider a minimum weight hitting set of . Either contains and at least further vertices in , or does not contain and contains at least vertices in . Therefore, we have and .

Otherwise, and using , and , we have . In both cases, contains a vertex (possibly ) that sees every vertex in , in addition to . Since has maximum degree, has exactly the same neighbors as , and is thus a true twin of , a contradiction.

Finally, the last case to consider is when the minimum size of a hitting set in is at least and . Using that is -free, one can check that in this case, and that the minimum size of a hitting set in is exactly . Then the maximum degree in is at most since otherwise by maximality of its degree, would have a true twin in . Since contains at least one triangle and is -free, this leaves only one possibility: is a bull, that is, a triangle with two extra vertices of degree 1, say and , each seeing a different vertex in the triangle. We increase the weight of one of these vertices to , say , put a unit weight on , and zero weights on . Then and .

Case 2. . In this case the weight on is set implicitly. Remember that we require

$$c_H(v_0) \geqslant 1. \tag{2}$$

For , we let denote the minimum weight of a hitting set of , and denote the minimum weight of a hitting set of not containing . Notice that these quantities depend on the weight function , which is not fully determined at this point.

We claim that the following lower bound holds on , regardless of how is chosen for :

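$$\mathrm{OPT}(H, c_H) \geqslant \min\left( \Bigl\{ c_H(v_0) + \sum_i \mathrm{OPT}_i \Bigr\} \cup \Bigl\{ \sum_{i \neq j} |A_i| + \mathrm{OPT}'_j \;\Big|\; j \in [k] \Bigr\} \right).$$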

In order to verify that this claim is true, consider a hitting set of . If contains , then is a hitting set of for each . In this case, . Otherwise, does not contain . Then, there exists an index such that contains for all . Moreover, is a hitting set of not containing . In this case, .

Thanks to the above lower bound on , it suffices to satisfy the following inequalities in order to guarantee that is -good (remember that we put unit weights over the ’s, thus for every ):

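$$c_H(v_0) + \sum_i \bigl(|A_i| + c_H(B_i)\bigr) \leqslant 2\Bigl(c_H(v_0) + \sum_i \mathrm{OPT}_i\Bigr) + 1 \iff c_H(v_0) \geqslant \sum_i \bigl(|A_i| + c_H(B_i) - 2\,\mathrm{OPT}_i\bigr) - 1 \tag{3}$$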

and, for all ,

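$$c_H(v_0) + \sum_i \bigl(|A_i| + c_H(B_i)\bigr) \leqslant 2\Bigl(\sum_{i \neq j} c_H(A_i) + \mathrm{OPT}'_j\Bigr) + 1 \iff c_H(v_0) \leqslant \sum_{i \neq j} \bigl(|A_i| - c_H(B_i)\bigr) + 2\,\mathrm{OPT}'_j - |A_j| - c_H(B_j) + 1. \tag{4}$$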

By eliminating the variable from the system (2)–(4), we get the following inequalities not involving . For all :

$$|A_j| + c_H(B_j) \leqslant \sum_{i \neq j} \bigl(\mathrm{OPT}_i - c_H(B_i)\bigr) + \mathrm{OPT}_j + \mathrm{OPT}'_j + 1 \tag{5}$$

and

$$|A_j| + c_H(B_j) \leqslant \sum_{i \neq j} \bigl(|A_i| - c_H(B_i)\bigr) + 2\,\mathrm{OPT}'_j. \tag{6}$$

If (5) and (6) are satisfied for all , then is -good.

In order to simplify these constraints, we add the extra requirements that and for all . Since and , both (5) and (6) follow if, for all :

$$|A_j| + c_H(B_j) \leqslant 1 + \mathrm{OPT}_j + \mathrm{OPT}'_j. \tag{7}$$

Fix any . We set the weights on the vertices of by inspecting the structure of the induced graph . We consider three subcases, see below. In each of these cases, it is straightforward to check that the two extra requirements are satisfied for .

Case 2.1. is a clique. By Lemma 3, we may set weights on to have and . Now , so that inequality (7) is satisfied, since

$$|A_j| + c_H(B_j) = |A_j| + |A_j| - 1 = 1 + (|A_j| - 1) + (|A_j| - 1) \leqslant 1 + \mathrm{OPT}_j + \mathrm{OPT}'_j.$$

Case 2.2. is not a clique and has clique number . In this case, we put zero costs on . We get because is not a clique and also , so that

$$|A_j| + c_H(B_j) = |A_j| = 1 + 1 + (|A_j| - 2) \leqslant 1 + \mathrm{OPT}_j + \mathrm{OPT}'_j.$$

Case 2.3. is not a clique and has clique number . If the minimum size of a hitting set in is at least , we put zero weights on . Thus (7) is satisfied, since then and

$$|A_j| + c_H(B_j) = |A_j| \leqslant 1 + 2 + |A_j| - 3 \leqslant 1 + \mathrm{OPT}_j + \mathrm{OPT}'_j.$$

Now, assume that there exists some vertex that hits all the induced -paths in . As in Case 1.3, we see that there is a set with the following properties: (i) every pair of true twins in has a distinguisher in , (ii) among the distinguishing ’s defined by the vertices in , there are vertex-disjoint ’s.

We put unit weights on the vertices of and zero weight on the vertices of . We get since a hitting set in has to have one vertex on each of the vertex-disjoint distinguishing ’s but this is not enough to hit all the induced ’s. And also since every triangle in has one pair of true twins, which is distinguished by some vertex of . We have

$$|A_j| + c_H(B_j) = |A_j| + |B'_j| \leqslant 1 + (|A_j| - 2) + (|B'_j| + 1) \leqslant 1 + \mathrm{OPT}_j + \mathrm{OPT}'_j.$$

## 4. Algorithm

Our 9/4-approximation algorithm is described below, see Algorithm 2. Although we could have presented it as a primal-dual algorithm, we chose to present it within the local ratio framework in order to avoid some technicalities, especially those related to the elimination of true twins.

The following lemma makes explicit a simple property of Cluster-VD that is key when using the local ratio technique. This property is common to many minimization problems, and is often referred to as the Local Ratio Lemma; see e.g. the survey of Bar-Yehuda, Bendel, Freund, and Rawitz.

###### Lemma 7 (Local Ratio Lemma).

Let be a weighted graph with the sum of two cost functions and , and let . If is a hitting set of such that and , then .

###### Proof.

Since , it is enough to show that . To see this, let be a minimum weight hitting set for . Then . ∎
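Because the formulas in the lemma and its proof are not legible here, the following derivation spells out the standard local ratio argument in an assumed notation: $c = c_1 + c_2$ is the decomposition of the cost function, $\alpha \geqslant 1$ is the ratio, $X$ is the hitting set from the hypothesis, and $X^*$ is a minimum weight hitting set for $(G, c)$. It is offered as a reading aid rather than as the authors' exact wording.

```latex
\begin{align*}
\mathrm{OPT}(G, c) &= c(X^*) = c_1(X^*) + c_2(X^*) \\
                   &\geqslant \mathrm{OPT}(G, c_1) + \mathrm{OPT}(G, c_2), \\[4pt]
c(X) &= c_1(X) + c_2(X) \\
     &\leqslant \alpha \, \mathrm{OPT}(G, c_1) + \alpha \, \mathrm{OPT}(G, c_2) \\
     &\leqslant \alpha \, \mathrm{OPT}(G, c).
\end{align*}
```

The first chain uses only that $X^*$ is a feasible hitting set for both $(G, c_1)$ and $(G, c_2)$, since feasibility does not depend on the costs.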

Algorithm 2 uses an ordered list of weighted induced subgraphs of as defined in Lemmas 4, 5 and 6. We order the weighted induced subgraphs in in order to make sure that the hypotheses of the corresponding lemma are satisfied when is used. The first elements of the list are induced ’s (if any), next come the induced ’s (if any) each of them taken together with a distinguishing set, and finally the second neighborhood of any maximum degree vertex . Notice that the list is always nonempty and of polynomial size. This ensures that Algorithm 2 has polynomial complexity.

We are now ready to prove our main result.

###### Proof of Theorem 1.

By induction on the number of recursive calls, we prove the following claim:

The set output by Algorithm 2 on input is an inclusionwise minimal hitting set of and .

If the algorithm does not call itself, then it returns the empty set and in this case the claim trivially holds. Now assume that the algorithm calls itself at least once and that the output of the recursive call is an inclusionwise minimal hitting set of that satisfies . There are three cases to consider.

Case 1: The recursive call occurs at Step 6. Then we have and because is simply with one zero-cost vertex removed. By construction, is an inclusionwise minimal hitting set of . Moreover, by what precedes, .

Case 2: The recursive call occurs at Step 11. Again, is an inclusionwise minimal hitting set of and , where the last equality holds by Lemma 2.

Case 3: The recursive call occurs at Step 18. In this case, and , thus is automatically an inclusionwise minimal hitting set of . Let denote the weighting extended to by letting for . We have by induction and since all the weighted induced subgraphs in are -good in (see Lemmas 4, 5 and 6). Because , Lemma 7 implies . ∎

## 5. Conclusion

In this paper we presented a -approximation algorithm for the Cluster-VD problem, based on the local ratio technique. The main idea underlying the algorithm is that in a twin-free, -free graph, one can define weights on the vertices of the second neighborhood of any maximum degree vertex in order to guarantee a local ratio of at most . Moreover, the input graph can be made twin-free and -free without worsening the approximation ratio beyond . Making the graph -free is what causes the approximation ratio to increase to . If the input graph is -free, our algorithm is in fact a -approximation algorithm.

Furthermore, looking closely at the proof of Lemma 6, we see that one can also obtain a -approximation algorithm for diamond-free graphs. This is due to the fact that, if is diamond-free, the open neighborhood of any vertex is a union of cliques.
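The structural fact about diamond-free graphs can be tested directly: an induced $P_3$ inside the open neighborhood of a vertex, together with that vertex, forms a diamond. The sketch below checks both properties by brute force on small graphs; the function names are hypothetical and graphs are dicts of neighbour sets.

```python
from itertools import combinations


def is_diamond_free(adj):
    """A diamond is four vertices spanning exactly five edges."""
    for quad in combinations(adj, 4):
        edges = sum(1 for a, b in combinations(quad, 2) if b in adj[a])
        if edges == 5:
            return False
    return True


def neighbourhoods_are_cluster_graphs(adj):
    """True if every open neighbourhood induces a disjoint union of
    cliques, that is, contains no induced P3."""
    for v in adj:
        for b in adj[v]:
            # a and c are common neighbours of v and b, so a-b-c is a
            # path inside the neighbourhood of v.
            for a, c in combinations(adj[v] & adj[b], 2):
                if c not in adj[a]:
                    return False
    return True


# Two triangles sharing vertex 0 (diamond-free).
bowtie = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {0, 3}}
# K4 minus the edge 2-3 (a diamond).
diamond = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1}, 3: {0, 1}}
for g in (bowtie, diamond):
    print(is_diamond_free(g), neighbourhoods_are_cluster_graphs(g))
```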

###### Theorem 8.

There is a -approximation algorithm for Cluster-VD in the class of -free graphs, and in the class of diamond-free graphs.

We note that Theorem 8 can be seen as a generalization of the fact that there is a 2-approximation for Cluster-VD in triangle-free graphs, a result that was used by You, Wang, and Cao in their 5/2-approximation algorithm for (unweighted) Cluster-VD.