Clustering is a fundamental task in a variety of areas; clustering problems are ubiquitous in practice and well-studied in algorithms and discrete optimization. Recently, fairness has become an important concern as automated data analysis and decision making have become increasingly prevalent in society. This has motivated several problems in fair clustering and associated algorithmic challenges. In this paper, we show that two different fairness views are inherently connected with a previously studied clustering problem called the Priority k-Center problem.
The input to Priority k-Center is a metric space (X, d) and a priority radius r(v) for each v ∈ X. The objective is to choose k centers S ⊆ X such that max_{v ∈ X} d(v, S)/r(v) is minimized. If one imagines a client located at each point of X, with r(v) the "speed" of the client at point v, then the objective is to open k centers so that every client can reach an open center as quickly as possible. When all the r(v)'s are the same, one obtains the classic k-center problem [HS86]. Plesník [PLE87] introduced this problem under the name of weighted k-center. (Plesník [PLE87] considered every client to have a weight, hence the name. At around the same time, Hochbaum and Shmoys [HS86] used "weighted k-center" for the version of k-center where every center has a weight and the total weight of the chosen centers is bounded. Possibly to allay this confusion, Gørtz and Wirth [GW06] called Plesník's version the Priority k-Center problem. Hochbaum and Shmoys' weighted k-center problem is nowadays, including in this paper, called the knapsack center problem to reflect the knapsack-style constraint on the possible centers.) The name Priority k-Center was given by Gørtz and Wirth [GW06] and is the one we use. Plesník [PLE87] generalized Hochbaum and Shmoys' [HS86] 2-approximation algorithm for the k-center problem and obtained the same bound for Priority k-Center. This approximation ratio is tight, since a (2 − ε)-factor approximation is ruled out even for the classic k-center problem under the assumption that P ≠ NP [GON85, HS85].
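In symbols, the objective described above can be written as follows (a sketch, with d(v, S) denoting the distance from v to the nearest chosen center):

```latex
\min_{S \subseteq X,\; |S| \le k}\ \max_{v \in X}\ \frac{d(v, S)}{r(v)},
\qquad \text{where } d(v, S) = \min_{c \in S} d(v, c).
```

With uniform radii r(v) ≡ r this reduces to the classic k-center objective up to the scaling by r.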
Connections to Fair Clustering. Our motivation to revisit Priority k-Center came from two recent papers that considered fair variants in clustering, without explicitly realizing the connection to Priority k-Center. One of them is the paper of Jung, Kannan and Lutz [JKL20], who defined a version of fair clustering as follows. Given a set of points X representing clients/people in a geographic area and an integer k, for each v ∈ X let r(v) denote the smallest radius such that a ball of radius r(v) around v contains at least |X|/k points of X. They suggested a notion of fair k-clustering as one in which each point v should be served by a center not farther than r(v), since the average size of a cluster in a k-clustering is |X|/k. [JKL20] describe an algorithm that finds k centers such that each point v is served by a center at most distance 2r(v) away from v. Once the radii r(v) are fixed, one obtains an instance of Priority k-Center, and the result essentially follows from the algorithm in [PLE87] (one needs to observe that Plesník's analysis [PLE87] can be made with respect to a natural LP which has a feasible solution for these radii; see Section 6 for more details); indeed, the algorithm in [JKL20] is the same.
Another notion of fairness related to Priority k-Center is the lottery model introduced by Harris et al. [HLP+19b]. In this model, every client v has a "probability demand" p(v) and a "distance demand" r(v). The objective is to find a distribution over sets of k center locations such that for every client v, the probability that some open center lies within distance r(v) of v is at least p(v). One needs to either prove that such a solution is not possible, or provide a distribution where the distance demand of each v is relaxed to α · r(v). Using a by now almost standard reduction via the ellipsoid method [CV02, AAK+20], this boils down to the outlier version of Priority k-Center, where some points in X are allowed to be discarded. The outlier version of Priority k-Center had not been explicitly studied before.
Our Contributions. Motivated by these connections to fairness, we study natural generalizations of Priority k-Center that have been studied for the classical k-center problem. The main generalization is the outlier version of Priority k-Center: the algorithm is allowed to discard a certain number of points when evaluating the quality of the chosen centers. First, the outlier version arises in the lottery model of fairness. Second, in many situations it is useful and important to discard outliers to obtain a better solution. Finally, it is also interesting from a technical point of view. We also consider settings where the constraint on where centers can be opened is more general than a cardinality constraint. In particular, we study the matroid priority center problem, where the set of centers must be an independent set of a given matroid, and the knapsack priority center problem, where the total weight of the opened centers must be at most a given budget. Our main contribution is an algorithmic framework that handles the outlier problem in all these variations.
1.1 Statement of Results
We briefly describe some variants of Priority k-Center. In the supplier version, the metric space is partitioned into facilities F and clients C, and the goal is to select k facilities S ⊆ F to minimize max_{v ∈ C} d(v, S)/r(v). In the Priority Matroid Supplier problem, the chosen subset of facilities must be an independent set of a matroid defined on F. In the Priority Knapsack Supplier problem, the chosen subset of facilities must have total weight at most a given budget. All these generalizations have a 3-approximation [HS86, CLL+16] in the vanilla version where all the r(v)'s are the same. Our first observation is that these results extend to the priority setting in a simple fashion. This result also implicitly relates the approximation ratio to the integrality gap of the natural LP relaxation. This allows us to rederive and extend the algorithmic results in [JKL20]; we give details in Section 6.
There is a 3-approximation for Priority k-Supplier, Priority Matroid Supplier, and Priority Knapsack Supplier.
Our second, and main, technical contribution is a general framework to handle outliers. Given an instance of Priority k-Center and an integer m, the outlier version, which we refer to as PCO, is to find k centers S and a set of at least m points from X such that the maximum of d(v, S)/r(v) over those points is minimized. While k-Center with outliers admits a clever, yet relatively simple, greedy 3-approximation due to Charikar et al. [CKM+01], a similar approach seems difficult to adapt for Priority k-Center. Instead, we take a more general and powerful LP-based approach from [CGK20, CN19] to develop a framework that handles PCO, and also the outlier versions of Priority Matroid Center (PMCO), where the opened centers must form an independent set, and Priority Knapsack Center (PKnapCO), where the total weight of the open centers must fit in a budget. We obtain the following results.
There is a 9-approximation for PCO and PMCO, and a 14-approximation for PKnapCO. Moreover, the approximation ratios for PCO and PMCO are with respect to a natural LP relaxation.
At this point we remark that a result in Harris et al. [HLP+19b] (Theorem 2.8 in the arXiv version) also indirectly gives a constant-factor approximation for PCO. We believe that our framework is more general, and it handles PMCO and PKnapCO easily. The paper [HLP+19b] does not consider these versions, and indeed for the PKnapCO problem their framework cannot give a constant-factor approximation, for they (in essence) use a weak LP relaxation.
Furthermore, our framework yields better approximation factors when either the number of distinct priorities is small, or the priorities are at different scales. In practice, one indeed expects this to be the case. In particular, when there are only two distinct types of radii, we get a 3-approximation, which is tight; it is not too hard to show that it is NP-hard to obtain a better than 3-approximation for PCO with two types of priorities. (Interestingly, when there is a single priority, the vanilla k-center with outliers has a 2-approximation [CGK20], showing a gap between the two problems.) We get improved factors (5 and 7) when the number of radii is three and four as well. On the other hand, if all the distinct priorities are powers of c (for some parameter c > 1), then we get a (1 + 2c/(c − 1))-approximation. Thus, if all the priorities are at vastly different scales (large c), our approximation factor approaches 3. A summary of our results can be found in the third column of Table 1.
Suppose there are only two distinct priority radii among the clients. Then there is a 3-approximation for PCO, PMCO and PKnapCO. With t distinct types of priorities, the approximation factor for PCO and PMCO is 2t − 1. If all the distinct radii are powers of c, the approximation factor for PCO and PMCO becomes 1 + 2c/(c − 1).
It is possible that the PCO problem has a 3-approximation in general, and even the natural LP relaxation may suffice; we have not been able to construct an integrality gap example worse than 3. As we explain in Section 1.2 below, many approaches to k-center type problems begin with a Hochbaum–Shmoys [HS86] style partition of the points into representatives. We can show examples where such an approach has a gap worse than 3, though these do not yield integrality gap instances. Resolving the integrality gap of the natural LP relaxation and/or obtaining improved approximation ratios are interesting open questions highlighted by our work.
Consequences for Fair Clustering. The algorithm in [JKL20] for their fair clustering model is made much more transparent by the connection to Priority k-Center. Since Priority k-Center is more general, it allows one to refine and generalize the constraints that one can impose in the clustering model, as well as use LP relaxations to find more effective solutions in particular scenarios. In addition, by allowing outliers, one can find tradeoffs between the quality of the solution and the number of points served. We give more details in Section 6.
Recall the lottery model of Harris et al. [HLP+19b] discussed above. The algorithm in [HLP+19b] is based on a sophisticated dependent rounding scheme and analysis. In fact, we observe that implicit in their result is a constant-factor approximation for PCO, modulo some technical details. We can ask whether the result in [HLP+19b] extends to the more general settings of Matroid Center and Knapsack Center. We prove that an approximation algorithm for a weighted version of the outlier problem can be translated, via the ellipsoid method, to yield the corresponding results in the probabilistic model of [HLP+19b]. This is not surprising, and a very similar reduction was shown in [AAK+20] in the context of the colorful k-Center problem with outliers. The advantage of this black-box reduction is evident from our algorithm for PKnapCO, which is non-trivial and is based on dynamic programming and on the round-or-cut approach, since the natural LP has an unbounded integrality gap. It is not at all obvious how one can directly round a fractional solution to this problem, while the generic transformation is clean and simple at a high level. For instance, our 3-approximation for two radii extends to the lottery model immediately.
Table 1: Summary of known and new approximation ratios.

| Problem | Without priorities | With priorities (this paper) |
| --- | --- | --- |
| k-Center | 2 [HS86] | 2 [PLE87, JKL20] (Theorem 1) |
| k-Supplier | 3 [HS86] | 3 (Theorem 2) |
| Knapsack Supplier | 3 [HS86] | 3 (Theorem 2) |
| Matroid Supplier | 3 [CLL+16] | 3 (Theorem 2) |
| k-Center with Outliers | 2 [CGK20] | 9 (Theorem 3) |
| Matroid Center with Outliers | 3 [HPS+19a, CN19] | 9 (Theorem 10) |
| Knapsack Center with Outliers | 3 [CN19] | 14 (Theorem 14) |
1.2 Technical Discussion
Almost all clustering algorithms for the k-center objective proceed via a partitioning subroutine due to Hochbaum and Shmoys [HS86] (HS, henceforth). This procedure returns a partition of X along with a representative for each part, such that all vertices of a part "piggy-back" on the representative. More precisely, if the representative is assigned to a center c, then so are all other vertices in that part. For vanilla k-center, it suffices to ensure that the radius of each part is small.
For the Priority k-Center objective one needs to be more careful: to use the above idea, one needs to make sure that if vertex u is piggybacking on vertex v, then r(u) had better be at least r(v). Indeed, this can be ensured by running the HS procedure in a particular order, namely by letting vertices with smaller r(v) form their parts first. This is precisely Plesník's algorithm [PLE87]. In fact, this idea easily gives a 3-approximation for the matroid and supplier versions as well.
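A minimal sketch of this radius-ordered HS procedure (function and variable names here are illustrative, not the paper's; `dist` is the metric and `r` the priority radii):

```python
def priority_hs(points, dist, r):
    """Greedy HS-style partitioning: process points in increasing order of
    priority radius; each chosen representative absorbs every point it can
    reach within the sum of the two radii."""
    reps = []
    covered = set()
    for u in sorted(points, key=lambda v: r[v]):  # smallest radius first
        if u in covered:
            continue
        reps.append(u)  # u becomes a representative (a center)
        for v in points:
            if dist(u, v) <= r[u] + r[v]:
                covered.add(v)  # v piggybacks on u; note r[u] <= r[v]
    return reps
```

If the instance is feasible with dilation 1, every point v ends up within r(u) + r(v) ≤ 2 r(v) of its representative u, which is exactly the factor-2 guarantee discussed above.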
Outliers are challenging in the setting of Priority k-Center. We start with the approach of Chakrabarty et al. [CGK20] for k-center. They write a natural LP in which cov(v) denotes the fractional coverage of a point v (the amount to which v is not an outlier). They show that if the HS procedure is run according to the cov order (higher-coverage vertices first), then the resulting partition can be used to obtain a 2-approximation for the k-center with outliers problem.
When one moves to priority k-center with outliers, one sees the obvious trouble: what if the cov order and the radius order are at loggerheads? Our way out is a simple bucketing idea. We first write a natural LP with fractional coverages cov(v) for every point. Then we partition the vertices into classes: all vertices whose radii are within a factor 2 of each other are in the same class. We then run the HS partitioning algorithm, in decreasing cov order, separately on each class. The issue now is to handle the interaction across classes. To do so, we define a directed acyclic graph over the representatives of these various partitions, where representative u has an arc to representative v iff some point can serve both u and v within their respective radii. It is a DAG because we direct arcs from higher-radius classes to lower-radius ones. Our main observation is that if we can peel out at most k paths of "large value" (each representative's value is the number of points that piggyback on it), then we can get a 9-approximation for the priority k-center with outliers problem. We can show that a fractional solution of large value does exist, using the fact that the DAG was constructed in a greedy fashion. Also, since the graph is a DAG, the relevant LP is an integral min-cost max-flow LP. The factor 9 arises out of a geometric series and the bucketing. Indeed, when the radii are exact powers of 2, we get a 5-approximation, and when there are only two types of radii, we get a 3-approximation, which is tight.
The above framework can also handle the outlier versions of the matroid and knapsack variants. For the matroid version, the flow problem is no longer a min-cost max-flow problem, but it reduces to a submodular flow problem, which is solvable in polynomial time. Modulo this, the above framework gives a 9-approximation. For the knapsack version, there are two issues. One is that the flow problem involves non-uniform numbers and is no longer integral, and solving the underlying optimization problem is likely to be NP-hard (we did not attempt a formal proof). Nevertheless, our framework has sufficient flexibility that, by increasing the approximation factor from 9 to 14, the DAG can in fact be made into a rooted forest. On this rooted forest we can employ dynamic programming to find the desired paths. The second issue is that a fractional solution of the natural LP does not suffice for the DP-based algorithm on the forest; indeed, the natural LP has an unbounded gap. Here we use the round-or-cut framework from [CN19]: either the DP on the rooted forest succeeds, or we find a violated inequality for the large implicit LP that we use.
1.3 Other Related Works
There is a huge literature on clustering, and instead of summarizing the landscape, we mention a few works relevant to our paper. Gørtz and Wirth [GW06] study the priority k-center problem in asymmetric metrics, and prove that it is NP-hard to obtain any non-trivial approximation. A problem related to priority k-center is the non-uniform k-center problem of Chakrabarty et al. [CGK20], where, instead of clients having radius bounds, the input specifies several types of ball radii and the objective is to place centers for balls of these radii. Another related problem [GGS16] is the local k-median problem, where clients need to connect to facilities within a certain radius, but the objective is the sum of distances instead of the max.
Fairness in clustering has also seen a lot of work recently. Apart from the two notions of fairness described above, which can be thought of as "individual fairness" guarantees, Chierichetti et al. [CKL+17] introduced a "group fairness" notion where points belong to color classes, and each cluster needs to contain a similar proportion of the colors as the universe. Their results were generalized by a series of follow-ups [RS18, BGK+19, BCF+19]. A similar concept for outliers led to the study of the fair colorful k-center problem, where the objective is to find k centers that cover at least a prescribed number of points from each color class. This was introduced by Bandyapadhyay et al. [BIP+19], and recently true approximation algorithms were concurrently obtained by Jia et al. [JSS20] and Anegg et al. [AAK+20].
Another notion of fairness was introduced by Chen et al. [CFL+19], in which a solution is called fair if there is no facility and no group of sufficiently many clients such that opening that facility would lower the cost of all members of the group. They give a constant-factor approximation for this notion under several norm distances, in the setting where facilities can be placed anywhere in the real space. Recently, Micha and Shah [MS20] showed that a modification of the same approach gives an improved approximation for the Euclidean case, and proved that their factor is tight for the other norms they consider.
Coming back to the model of Jung et al. [JKL20], the local notion of neighborhood radius is also present in the metric embedding works of [CDG06, CMM10], and was recently used by Mahabadi and Vakilian [MV20] to extend the results in [JKL20] to other objectives such as k-median and k-means. We leave the outlier versions of these problems as an open direction of study.
We provide some formal definitions and describe a clustering routine from [HS86].
Definition 1 (Priority k-Center).
The input is a metric space (X, d). We are also given a radius function r : X → R+ and an integer k. The goal is to find S ⊆ X of size at most k to minimize α such that for all v ∈ X, d(v, S) ≤ α · r(v).
Definition 2 (Generalization: Priority k-Supplier [CN19]).
The input is a metric space (F ∪ C, d), where C is the set of points (clients) and F the set of facilities. We are also given a radius function r : C → R+. The goal is to find S ⊆ F to minimize α such that for all v ∈ C, d(v, S) ≤ α · r(v). The constraint on S is that it must be selected from a downward-closed family F of subsets of F. Different families lead to different problems. We get the priority k-supplier problem if F = {S : |S| ≤ k}. We get the priority matroid supplier problem when F is the family of independent sets of a matroid. We get the priority knapsack supplier problem when there is a weight function w on the facilities and F = {S : w(S) ≤ B} for some budget B.
For the remainder of this manuscript, we focus on the feasibility version of the problem. More precisely, given an instance of the problem, we either want to show that there is no solution with α = 1, or find a solution with α ≤ ρ. If we succeed, then via binary search we get a ρ-approximation.
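The reduction from approximation to feasibility can be sketched as follows. The `feasible` oracle below is a hypothetical stand-in for a feasibility routine that returns a solution at dilation rho or `None`; only the finitely many ratios d(u, v)/r(v) need to be tried, since the optimal dilation is one of them:

```python
def approx_via_feasibility(points, dist, r, feasible):
    # Candidate optimal dilations: all ratios d(u, v) / r(v).
    cands = sorted({dist(u, v) / r[v] for u in points for v in points if u != v})
    lo, hi, best = 0, len(cands) - 1, None
    while lo <= hi:  # binary search for the smallest feasible candidate
        mid = (lo + hi) // 2
        sol = feasible(cands[mid])
        if sol is not None:
            best, hi = sol, mid - 1
        else:
            lo = mid + 1
    return best
```

If `feasible` succeeds whenever a dilation-1 solution exists at the guessed scale and outputs a dilation-rho solution, the wrapper returns a rho-approximate solution after O(log n) oracle calls.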
Plesník [PLE87] obtained a 2-approximation for Priority k-Center. Algorithm 1 is a slight generalization of his algorithm; in addition to the radius function and the metric, it takes as input a function π which encodes an ordering over the points (we can think of the points as being ordered from largest to smallest π values). The algorithm is similar to the procedure of Hochbaum and Shmoys from [HS86], but while [HS86] picks points arbitrarily, here points are picked in the order mandated by π.
The following is true for the output (R, {Child(u)}_{u ∈ R}) of HS: (a) R ⊆ X, (b) the sets {Child(u)}_{u ∈ R} partition X, (c) for every u ∈ R and v ∈ Child(u), d(u, v) ≤ r(u) + r(v) and π(u) ≥ π(v), and (d) for distinct u, u' ∈ R, d(u, u') > r(u) + r(u').
[PLE87] There is a 2-approximation for Priority k-Center.
(For completeness and later use.) We claim that S = R, the output of Algorithm 1 when π orders the points from smallest radius to largest, is a 2-approximate solution; this follows from the observations in Claim 1. For any v ∈ X there is some u ∈ S for which d(v, u) ≤ r(v) + r(u). By our choice of π, r(u) ≤ r(v). Hence d(v, u) ≤ 2 r(v). To see why |S| ≤ k, recall that for any distinct u, u' ∈ S we have d(u, u') > r(u) + r(u') by Claim 1, so the same point cannot cover more than one member of S. Thus any feasible solution needs at least |S| many points to cover all of S. ∎
In fact, the algorithm almost immediately gives a 3-approximation for the priority supplier versions for many families F via the framework in [CN19].
One needs to check, given the partition returned by Algorithm 1, whether the following partition feasibility problem is solvable: does there exist S ∈ F such that d(u, S) ≤ r(u) for every representative u ∈ R? If no such S exists, then the instance is infeasible, since even the representatives of the parts cannot be covered. If such an S exists, then by construction every v in the part of representative u satisfies d(v, S) ≤ d(v, u) + d(u, S) ≤ 3 r(v), since d(v, u) ≤ r(v) + r(u) and r(u) ≤ r(v). It is easy to see that for the supplier, knapsack, and matroid versions, the partition feasibility problem is solvable in polynomial time. This leads to the following theorem.
There is a 3-approximation for the Priority k-Supplier, Priority Knapsack Center, and Priority Matroid Center problems.
3 Priority k-Center with Outliers
In this section we describe our framework for handling priorities and outliers, and give a 9-approximation algorithm for the following problem.
Definition 3 (Priority k-Center with Outliers (PCO)).
The input is a metric space (X, d), a radius function r : X → R+, and parameters k and m. The goal is to find S ⊆ X of size at most k to minimize α such that for at least m points v ∈ X, d(v, S) ≤ α · r(v).
There is a 9-approximation for PCO.
The following is the natural LP relaxation for the feasibility version of PCO. For each point u ∈ X, there is a variable x_u that denotes the (fractional) amount by which u is opened as a center. We use cov(v) to indicate the amount by which v is covered by itself or other open facilities. To be precise, cov(v) is the sum of x_u over all u at distance at most r(v) from v, capped at 1. We want to ensure that at least m units of coverage are assigned using at most k centers (hence the first two constraints).
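A plausible write-up of this relaxation, following the description above (the authors' exact constraint set may be organized differently):

```latex
\sum_{u \in X} x_u \le k, \qquad
\sum_{v \in X} \mathrm{cov}(v) \ge m, \qquad
\mathrm{cov}(v) \le \sum_{u:\, d(u,v) \le r(v)} x_u \quad \forall v \in X,
```
```latex
0 \le \mathrm{cov}(v) \le 1 \quad \forall v \in X, \qquad
0 \le x_u \le 1 \quad \forall u \in X.
```

The instance is declared (fractionally) feasible at dilation 1 iff this system has a solution.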
Next, we define another problem, called Weighted k-Path Packing (WPP), on a DAG. Our approach is an LP-aware reduction from PCO to WPP. To be precise, we use a fractional solution of the PCO LP to reduce to a WPP instance I_W. We show that a good integral solution for I_W translates to a 9-approximate solution for the PCO instance. We prove that I_W has a good integral solution by constructing a feasible fractional solution for an LP relaxation of WPP; this LP relaxation is integral. Henceforth, for a DAG D, P(D) denotes the set of all paths in D, where each path is an ordered subset of the edges of D.
Definition 4 (Weighted k-Path Packing (WPP)).
The input is (D = (V, A), val, k), where D is a DAG, val : V → Z≥0, and k is an integer. The goal is to find a set P of at most k vertex-disjoint paths that maximizes val(P) := Σ_{P ∈ P} Σ_{v ∈ P} val(v).
Even though this problem is NP-hard on general graphs (with k = 1 and unit val it is the longest path problem, which is known to be NP-hard [GJ79]), it can be easily solved when D is a DAG by reducing it to Min-Cost Max-Flow (MCMF). To build the corresponding flow network, we augment D to a new DAG D' with a source node s and a sink node t. Each node v ∈ V has unit capacity and cost −val(v); s and t have zero cost, with capacity k each. As for the arcs, D' includes the entirety of A, plus arcs (s, v) and (v, t) for all v ∈ V. All arcs have unit capacity and zero cost. One can now write the MCMF LP for WPP, which is known to be integral. We use δ+(v) and δ−(v) to denote the sets of outgoing and incoming edges of a vertex v, respectively. The LP has a variable f_a for each arc a to denote the amount of (fractional) flow passing through it. Similarly, the amount of flow entering a vertex v is denoted by f^in(v). The objective is to minimize the cost of the flow, which is equivalent to maximizing the negation of the cost.
WPP is equivalent to solving MCMF on D'.
Proof of Claim 4.
Observe that any solution P for the WPP instance translates to a valid flow of cost −val(P) in the flow network. For any path P ∈ P with start vertex u and end vertex w, send one unit of flow from s to u, through P to w, and then to t. Since the paths in P are vertex-disjoint and there are at most k of them, the edge and vertex capacity constraints of the network are satisfied.
Now we argue that any solution to the MCMF instance with cost −W translates to a solution P for the original WPP instance with val(P) = W. To see this, note that the MCMF solution consists of at most k many s–t paths that are vertex-disjoint with respect to V; this is because of our choice of vertex capacities. Let P be those paths restricted to V, i.e., with the vertices s and t removed. For a v ∈ V, −val(v) is counted towards the MCMF cost iff v has flow passing through it, which means v is included in some path of P. Thus val(P) = W. ∎
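For intuition, the k = 1 case of WPP is simply a maximum-value path in a DAG, which a topological-order DP solves directly without any flow machinery (a sketch with illustrative names; `nodes` must be given in topological order):

```python
def best_path_value(nodes, succ, val):
    """Max total val(v) over a single directed path in a DAG.
    nodes: vertices in topological order; succ: vertex -> successor list."""
    best = {}
    for v in reversed(nodes):  # process sinks first
        # Best path starting at v: take v, then optionally continue to a successor.
        best[v] = val[v] + max((best[w] for w in succ.get(v, [])), default=0)
    return max(best.values())
```

The general k-path case needs the MCMF formulation above because the k paths must be vertex-disjoint, which a per-vertex DP does not enforce.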
3.1 Reduction to WPP
Using a fractional solution of the PCO LP we construct a WPP instance; in particular, we use the cov assignment generated by the LP solution. Without loss of generality, by scaling the distances, we assume that the smallest neighborhood radius is 1. Let T = ⌈log r_max⌉, where r_max is the largest value of r(v) (after scaling), so T bounds the number of radius classes. Partition X according to each point's radius into X_1, ..., X_T, where the radii of points within a class are within a factor 2 of each other and the classes are ordered by decreasing radius (X_1 contains the largest radii). Note that some of these sets may be empty if no radius falls within their range.
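The bucketing step can be sketched as follows (a minimal illustration, assuming radii are already scaled so the minimum is 1; the class indexing here is by increasing radius purely for brevity, whereas the text orders rows by decreasing radius):

```python
import math

def bucket_by_radius(r):
    """Group points whose radii are within a factor 2 of each other.
    Class i holds radii in [2**i, 2**(i+1))."""
    buckets = {}
    for v, rv in r.items():
        buckets.setdefault(math.floor(math.log2(rv)), []).append(v)
    return buckets
```

Each bucket is then handed separately to the HS procedure, run in decreasing cov order.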
Algorithm 2 shows the PCO to WPP reduction. The algorithm constructs a DAG called the contact DAG (see Definition 5) as part of the WPP instance. We first run Algorithm 1 on each X_i to produce a set of representatives R_i and their parts Child(·). The ordering function π is constructed using the cov values. Each R_i defines a row of the contact DAG, with R_1 at the top. Arcs in the contact DAG exist only between points in different rows, and only when they share a point in X that can cover them both within their desired radii. Arcs always point downwards, that is, from points in R_i to points in R_j where j > i. See Figure 1 for an example of what a contact DAG looks like.
Definition 5 (contact DAG). The contact DAG has vertex set R_1 ∪ ... ∪ R_T, and an arc from u ∈ R_i to v ∈ R_j with i < j whenever there is a point w ∈ X with d(w, u) ≤ r(u) and d(w, v) ≤ r(v).
Our first observation (Lemma 5) is that the WPP instance has a fractional solution of value at least m, and since the LP is integral, it also has an integral solution of value at least m. The proof is straightforward and we defer it to later.
Theorem 3 now follows from the following lemma.
Any solution with value at least m for the WPP instance given by Algorithm 2 translates to a 9-approximate solution for the PCO instance.
We begin with a few observations. By the definition of the contact DAG we have the following property; note that the converse is not necessarily true.
If u ∈ R_i, v ∈ R_j, and (u, v) is an arc of the contact DAG, then d(u, v) ≤ r(u) + r(v).
The sets Child(u), over all representatives u constructed in Algorithm 2, partition X.
The X_i's partition X, and HS further partitions each X_i according to Claim 1. ∎
For any u ∈ R_i and any v reachable from u in the contact DAG, d(u, v) ≤ 3 r̄_i, where r̄_i denotes the largest radius in class i.
Proof of Claim 7.
Observe that, by the definition of the contact DAG, d(a, b) ≤ r(a) + r(b) for every arc (a, b) (Claim 2). A path from u to v may contain a vertex from any level of the DAG between i and T. In the worst case, the path has a vertex from every level l = i, ..., T, and by the triangle inequality
d(u, v) ≤ Σ_{l=i}^{T−1} (r̄_l + r̄_{l+1}) ≤ r̄_i + 2 Σ_{l>i} r̄_l ≤ 3 r̄_i,
where the last inequality uses r̄_{l+1} ≤ r̄_l / 2. ∎
Now we are armed with all the facts we need to prove Lemma 6. We are assuming that the constructed WPP instance has a solution of value at least m, which means there exists a set P of at most k vertex-disjoint paths in the contact DAG such that val(P) ≥ m. For any path P ∈ P, let t_P denote the last node on this path. Our final solution is S = {t_P : P ∈ P}. We argue that this is a 9-approximate solution for the initial PCO instance. Since P has at most k paths, |S| ≤ k.
Now we show that any v ∈ Child(u), where u lies on some path P ∈ P, is covered by S with dilation at most 9. Assume u ∈ R_i for some i. Then d(v, t_P) ≤ d(v, u) + d(u, t_P) ≤ (r(v) + r(u)) + 3 r̄_i ≤ 3 r(v) + 6 r(v) = 9 r(v), where we used Claim 1, Claim 7, and the fact that r(u) ≤ r̄_i ≤ 2 r(v) since u and v lie in the same class.
The last piece is to argue that at least m points are covered by S. The set of points covered by S within 9 times their radius includes U := ∪_{P ∈ P} ∪_{u ∈ P} Child(u). So we need to show |U| ≥ m. By Claim 3 the Child sets are disjoint, hence |U| = Σ_{P ∈ P} Σ_{u ∈ P} |Child(u)| = val(P) ≥ m.
In the special case where there are two types of radii, we can slightly modify our approach to get a 3-approximation algorithm, and this is tight. To see the tightness, consider PCO instances where clients have one of two priority radii, the smaller of which is negligible. Clients with the negligible priority radius either need to have a facility opened at (essentially) their own location, or need to be outliers. When the budget exactly accounts for all the allowed outliers and centers, all outliers and centers must sit on these points; thus these points act as the facility set of a k-supplier instance, which is hard to approximate with a factor better than 3. This shows a gap with the vanilla k-center with outliers, which has a 2-approximation [CGK20].
In general, our framework yields improved approximation factors when the number of distinct priorities is less than 5 (see Theorem 8). In the special case when all radii are powers of 2, our algorithm is actually a 5-approximation. This factor improves if the radii are powers of some c > 2, and approaches 3 as c goes to infinity (see Theorem 9).
There is a (2t − 1)-approximation for PCO instances where there are only t types of radii.
Given a PCO instance, obtain a fractional solution by solving the PCO LP. Partition X according to each point's radius into X_1, ..., X_t, where X_i is the set of points of the i-th largest radius type, for i ∈ [t]. Run Algorithm 2 with the cov input corresponding to this fractional solution and take the resulting WPP instance I_W. Assuming the WPP instance has a solution with value at least m, we can obtain a (2t − 1)-approximate solution as follows. Let P be the WPP solution. Take any P ∈ P. If P is a single vertex, simply add it to the solution S. Otherwise, let s' be the vertex before t_P on P; instead of adding t_P to S, add a point w that covers both the endpoints s' and t_P within their radii (w exists by Definition 5).
Take any v ∈ Child(u) where u lies on P, and assume u ∈ R_i for some i. Similar to the proof of Claim 7, one can bound d(u, s') by bounding the radius of any vertex between them by r(u), noting that t_P is in level 2 or higher (remember that P ends at t_P and (s', t_P) is an edge). Since d(v, u) ≤ r(v) + r(u) (Claim 1) and d(s', w) ≤ r(s'), putting the pieces together gives d(v, w) ≤ (2t − 1) r(v). But w also covers t_P within r(t_P) by its definition, so v is covered by S with dilation at most 2t − 1. The argument that at least m points are covered by S is as in the proof of Lemma 6.
The remainder of the proof, i.e., showing that I_W does indeed have a solution of value at least m that can be found in polynomial time using an MCMF algorithm, is identical to the proof of Theorem 3. ∎
There is a (1 + 2c/(c − 1))-approximation for PCO instances where the radii are powers of c.
Given a PCO instance, obtain a fractional solution by solving the PCO LP. Partition X according to each point's radius into X_1, X_2, ..., where X_i contains the points whose radius is the i-th largest power of c present. Run Algorithm 2 with the cov input corresponding to this fractional solution and take the resulting WPP instance I_W. Assume the WPP instance has a solution P with value at least m. For any P ∈ P, add t_P to the solution S. Consider an arbitrary v ∈ Child(u) where u lies on P, and assume u ∈ R_i for some i. Similar to the proof of Claim 7, one can show d(u, t_P) ≤ ((c + 1)/(c − 1)) r(u), since the radii now shrink by a factor c from level to level. By Claim 1, d(v, u) ≤ 2 r(v). Thus any such v is covered with dilation 2 + (c + 1)/(c − 1) = 1 + 2c/(c − 1), as r(u) = r(v) (points in the same class have the same radius). To argue that at least m points are covered by S, see the proof of Lemma 6. Showing that I_W does indeed have a solution of value at least m that can be found in polynomial time using an MCMF algorithm is again identical to the proof of Theorem 3. ∎
Proof of Lemma 5.
We construct a fractional solution f for the WPP LP with objective value at least m. Recall that the vertex set of the flow network consists of s, t, and the representatives, per the definition of the contact DAG. For any point w ∈ X, let T_w be the set of representatives u for which w contributes to cov(u); that is, for all representatives u: u ∈ T_w iff d(w, u) ≤ r(u).
There is an arc between representatives u, v iff there exists some w ∈ X for which d(w, u) ≤ r(u) and d(w, v) ≤ r(v). Thus any two members of T_w lying in different rows are adjacent in the contact DAG; moreover, by Claim 1, T_w contains at most one representative per row. Now recall how we augmented the contact DAG to get a flow network by adding source and sink vertices s and t, plus arcs (s, u) and (u, t) for all representatives u (refer to the definition of WPP and the LP formulation based on MCMF). Observe that T_w thus traces an s–t path in the network. Formally, let T_w = {u_1, ..., u_l}, sorted in decreasing order of neighborhood radii. Define P_w to be the s–t path that passes through u_1, ..., u_l in this order: P_w = (s, u_1, ..., u_l, t). Note that the same arc can lie on P_w for multiple w. This motivates the definition of f: for every arc a, set f_a = Σ_{w : a ∈ P_w} x_w.
Now we argue that f is a feasible solution for the WPP LP of value at least m. First, notice that flow is conserved at each representative u, i.e., f^in(u) = f^out(u); this is because, for any w, we add the same amount x_w to every arc of P_w. Moreover, f^in(u) = Σ_{w : u ∈ T_w} x_w = cov(u) ≤ 1, so the vertex capacities are respected, and the flow out of s is Σ_w x_w ≤ k. Finally, the value of f is Σ_u val(u) · f^in(u) = Σ_u |Child(u)| · cov(u) ≥ Σ_v cov(v) ≥ m, where the first inequality uses the fact that each representative has the largest cov value in its part (this is how the ordering π was chosen).
This concludes the proof of Theorem 3. ∎
Remark: The only place we used the cardinality constraint on the facilities (i.e., |S| ≤ k) is to make sure our solution corresponds to at most k many paths. In other words, the reduction algorithm and the analysis of the approximation ratio are completely independent of this constraint. This observation lets us generalize our approach to handle a wider range of constraints defining a feasible set of facilities.
4 Priority Matroid-Center with Outliers
In this section, we show how to generalize the results of the previous section to the case of Priority Matroid-Center with Outliers (PMCO).
Definition 6 (Priority Matroid-Center with Outliers (PMCO)).
The input is a metric space (X, d), a parameter m, a radius function r : X → R+, and a family I of independent sets of a matroid on X. The goal is to find S ∈ I to minimize α such that for at least m points v ∈ X, d(v, S) ≤ α · r(v).
Theorem 10. There is a 9-approximation for PMCO.
As in the previous section, we assume and consider the feasibility version of the problem. For any , let be the rank of in the given matroid. The natural LP relaxation for this problem is very similar to the PCO LP, except that we replace the cardinality constraints with rank constraints for all . This is because for any , .
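In standard matroid-polytope form (the notation below is an assumption for illustration, with $y_u$ denoting the fractional opening of a point $u$ and $X$ the ground set), the rank constraints read:

```latex
\[
  \sum_{u \in Y} y_u \;\le\; \mathrm{rank}(Y) \qquad \text{for all } Y \subseteq X .
\]
```

Taking $Y = X$ recovers a cardinality-style bound, since no set can fractionally open more centers than the rank of the matroid; the uniform matroid of rank $k$ recovers the PCO LP constraint exactly.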
As with WPP, we have a path-packing version of PMCO, defined below. Recall from the last section that after reducing PCO to WPP we returned a set of vertices in the DAG as our final solution. Now that we have matroid constraints, we must instead return a set of vertices such that . Doing so is not as straightforward, since our reduction does not guarantee that such a subset of vertices exists and covers enough points via their corresponding vertex-disjoint paths. Instead, we show there is an such that each member of is close to some vertex of . These close points in correspond to a set of vertex-disjoint paths that cover enough points.
Definition 7 (Weighted -Path Packing (WMatPP)).
The input consists of and as in WPP, plus a finite set , , and a family of independent sets of a matroid. The goal is to find a set of vertex-disjoint paths of maximum for which there exists such that , .
Observe that the reduction procedure in Algorithm 2 and all of our subsequent observations in Section 3.1 do not rely on how we define a feasible set of centers. Hence, the main obstacle in proving Theorem 10 lies in our reduction to MCMF. Luckily, the results of [CKR+15] help us address this by giving LP integrality results similar to MCMF, using the following formulation of directed polymatroidal flows [EG77, LM82]: for a network , for all , we are given polymatroids666Monotone integer-valued submodular functions. and on and , respectively. For every arc there is a variable . The capacity constraints for each are defined as:
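In this model the capacities couple the arcs incident to a vertex rather than bounding each arc individually. In the standard directed polymatroidal-flow notation (the symbol names here are assumptions, following the conventions of [EG77, LM82]), with $\delta^-(v)$ and $\delta^+(v)$ the in-arcs and out-arcs of $v$, they take the form:

```latex
\[
  \sum_{a \in S} f_a \;\le\; \rho_v^{-}(S) \quad \text{for all } S \subseteq \delta^{-}(v),
  \qquad
  \sum_{a \in S} f_a \;\le\; \rho_v^{+}(S) \quad \text{for all } S \subseteq \delta^{+}(v).
\]
```

Ordinary arc capacities are the special case where each $\rho_v^{\pm}$ is a sum of per-arc bounds; the extra generality is exactly what lets us encode matroid rank constraints on the arcs leaving the source.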
We augment the DAG given in WMatPP to construct a polymatroidal flow network . In this new network, , where each node has cost . Note that even though a vertex might correspond to a point in , in we distinguish between the two copies. includes all of , plus arcs for all . Finally, instead of adding arcs , we add arcs and for all .
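Keeping the two copies of each vertex distinct is the standard node-splitting construction for enforcing vertex-disjoint flow paths. A minimal sketch, with all identifiers hypothetical:

```python
def split_nodes(vertices, arcs):
    """Build the split network: each vertex v becomes (v, 'in') and
    (v, 'out') joined by an internal arc, and every original arc (u, v)
    becomes ((u, 'out'), (v, 'in')). Unit capacity on the internal arcs
    then makes flow paths vertex disjoint."""
    new_arcs = [((v, 'in'), (v, 'out')) for v in vertices]
    new_arcs += [((u, 'out'), (v, 'in')) for (u, v) in arcs]
    return new_arcs
```

The polymatroids described next are then placed on the arc sets incident to these split copies.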
The polymatroids for this instance are constructed as follows: for any , for all non-empty , and is defined similarly on . For , we only have outgoing edges, where for all . Finally, we enforce the matroid constraints of on . For any , let be the set of starting nodes in ; that is, . Set . Since , these capacity constraints on are equivalent to the following set of constraints:
Now, we prove a claim analogous to Claim 4.
WMatPP is equivalent to solving the polymatroidal flow problem on the network .
Any solution for the WMatPP instance translates to a valid flow of cost for the flow problem. Let be the independent set that intersects for all . For any path with start vertex and sink vertex , take an arbitrary . Send one unit of flow from to , through to , and then to and . All the polymatroidal constraints in the WMatPP LP are satisfied.
Now we argue that any solution to the flow instance with cost translates to a solution for WMatPP with . To see this, note that the flow solution consists of paths that are vertex disjoint with respect to . This is due to our choice of polymatroids. Each path passes through one , then immediately to , and then ends at . By the polymatroidal constraints on , the subset of that has flow passing through it is an independent set of .
Let be the described paths induced on . For a , is counted towards the MCMF cost iff has flow passing through it. This means is included in some path in . Thus . ∎
The polymatroidal LP for this particular construction is as follows (recall ):
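In outline, and with symbol names assumed purely for illustration ($c_v$ for the node costs, $v_{\mathrm{in}}, v_{\mathrm{out}}$ for the two split copies of $v$, and $\rho_v^{\pm}$ for the polymatroids above), such a polymatroidal LP has the shape:

```latex
\begin{align*}
  \max\quad & \textstyle\sum_{v} c_v \, f_{(v_{\mathrm{in}}, v_{\mathrm{out}})} \\
  \text{s.t.}\quad
  & \textstyle\sum_{a \in \delta^{-}(v)} f_a \;=\; \sum_{a \in \delta^{+}(v)} f_a
    && \text{for every internal vertex } v, \\
  & \textstyle\sum_{a \in S} f_a \;\le\; \rho_v^{-}(S)
    && \text{for all } v,\ S \subseteq \delta^{-}(v), \\
  & \textstyle\sum_{a \in S} f_a \;\le\; \rho_v^{+}(S)
    && \text{for all } v,\ S \subseteq \delta^{+}(v), \\
  & f_a \;\ge\; 0 && \text{for every arc } a .
\end{align*}
```

The integrality results of [CKR+15] apply to LPs of this shape, which is what makes the rounding go through despite the matroid constraints.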
As for reducing PMCO to WMatPP, most of the notation and results can be recycled from Section 3.1. Specifically, the reduction itself (Algorithm 3) is just Algorithm 2 with line 5 added. Note: By the definition of an arc in the contact DAG, for two nodes , is an arc iff intersects .
Before proving our 9-approximation result for PMCO, we need to slightly modify Claim 7 to account for the fact that a vertex covered by (the sink of some path) has to travel slightly farther than to reach an . Fortunately, the proof of Claim 7 has enough slack that we can derive the same distance guarantees even with this extra step.
For any and reachable from in the contact DAG , and any , .
By the definition of , it must be the case that . Also, for all , . If is reachable from , a path between and may contain a vertex from every level for :
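The distance bound then follows by telescoping the triangle inequality along such a path. Schematically, writing $w_j, \dots, w_0$ as assumed names for the path’s vertices, one per level, and using that the endpoints of any contact-DAG arc have intersecting neighborhoods:

```latex
\[
  d(u, v) \;\le\; \sum_{i=0}^{j-1} d(w_{i+1}, w_i)
         \;\le\; \sum_{i=0}^{j-1} \bigl( r(w_{i+1}) + r(w_i) \bigr),
\]
```

and if the radii shrink geometrically from level to level, as in the partition used here, the right-hand side is dominated by a constant multiple of the largest radius on the path, which is what yields the constant dilation.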