In this paper, we aim to design approximation algorithms for several natural graph problems, in the setting where the points in the graph lie in a metric space. Following the seminal work of , we aim to provide sublinear approximation algorithms; that is, on problems with points and hence edge distances, we aim to provide randomized algorithms that require time and in fact only consider edges, by making use of sampling. Similar to the previous work, we assume we can query the weight of any single edge in time; when we use the term “query”, we mean an edge weight query throughout.
A well-known technique to design sublinear algorithms is uniform sampling; that is, a subset of edges (or vertices) is sampled uniformly at random. Several algorithms use uniform sampling to improve speed, space, or the number of queries [4, 5, 6, 7, 12, 13, 18, 26, 31, 46]. Uniform sampling is very easy to implement, but problematically it is oblivious to the edge weights. When it comes to maximization problems on graphs, a few high weight edges may have a large effect on the solution, and hence the uniform sampling technique may fail to provide a suitable solution because it fails to sample these edges. For example, consider the densest subgraph problem, where the density of a subgraph is the sum of the edge weights divided by the number of vertices. It is known that for general unweighted graphs, the densest subgraph of a uniformly sampled subgraph with edges is a approximation of the densest subgraph of the original graph [31, 46, 47]. However, as we show in Appendix A, this result is not true for weighted graphs, even in a metric space. This problem suggests we should design approaches that sample edges with probabilities proportional to (or otherwise related to) their weight in a metric space.
As our main result, we design a novel sampling approach using a sublinear number of queries for graphs in a metric space, where independently for each edge, the probability the edge is in the sample is proportional to its weight; we call such a sampling a linear sampling. Specifically, for a fixed factor , we can ensure for an edge with weight , if then the edge appears in the sample with weight 1 with probability , and if , then the edge is in the sample with weight . Hence the edge weights are “downsampled” by a factor of , in a natural way. We can choose an to suitably sparsify our sampled graph, run an approximation algorithm on that sample, and use that result to obtain a corresponding, nearly-as-good approximation to the original problem. Interestingly, we only query edge weights to provide the sample, where is “almost” the expected weight of the edges in the sampled graph (see Subsection 1.1 for a formal definition). (Footnote 1: The $\tilde{O}$ notation hides logarithmic factors.) Our algorithm to construct the sample also runs in time.
Utilizing our sampling approach, we show that for several problems a -approximate solution on a linear sample with expected weight (roughly) is a -approximate solution on the input graph. From an information theory perspective this says that queries are sufficient to find a approximate solution for these problems. Moreover, as the sampled graph has a reduced number of edges, if an approximation algorithm on the sampled graph runs in linear time on the sampled edges, the total time is sublinear in the size of the original graph.
In what follows, after describing the related work and a summary of our results, we present our sampling method. Our approach decomposes the graph into a sequence of subgraphs, where the decomposition depends strongly on the fact that the graph lies in a metric space. Using this decomposition, and an estimate of the average edge weight in the graph, we can determine a suitable sampled graph. We then show this sampling approach allows us to find sublinear approximation algorithms for several problems, including densest subgraph and max cut, in the manner described above.
In some applications, such as diversity maximization, it can be beneficial to go slightly beyond metric distances . We can extend our results to more general spaces that satisfy what is commonly referred to as a parametrized triangle inequality [8, 15, 20], in which for every three points , and we have for a parameter . As an example, if the weight of each edge is the squared distance between the two points, the graph satisfies a parametrized triangle inequality with . We provide analysis for this more general setting throughout, and refer to a graph satisfying such a parametrized triangle inequality as a -metric graph. (Throughout, we take ).
1.1 Our Results
As our main technical contribution we provide an approach to sample a graph from a -metric graph with the properties specified below that makes only queries and succeeds with probability at least . It is easy to observe that our algorithm runs in time as well.
For some fixed factor (which is a function of ) independently for each edge we have:
If , we have edge with weight in with probability .
If , we have edge with weight in .
We have , where is the weight of in . (Footnote 2: This can be extended to , for any arbitrary ; see the footnote on Theorem 10 for details. In our work, the upper bound only affects the number of queries; we prefer to set and simplify the argument.)
As the weight of each edge in is at least , implies that .
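The sampling rule described by these properties can be illustrated with a minimal Python sketch. It assumes the fixed factor is called `alpha`, that an edge with `w / alpha >= 1` is always kept with weight `w / alpha`, and that a lighter edge is kept with weight 1 with probability `w / alpha`. The function name and the dictionary representation are hypothetical. Unlike the construction in this paper, the sketch reads every edge weight, so it only illustrates the target distribution and not the sublinear query bound.

```python
import random


def linear_sample(weights, alpha, rng=None):
    """Naive linear sample of a weighted graph.

    weights: dict mapping an edge (u, v) to its positive weight.
    alpha:   positive downsampling factor (name assumed for this sketch).
    Returns a dict mapping each kept edge to its weight in the sample.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = rng or random.Random()
    sample = {}
    for edge, w in weights.items():
        scaled = w / alpha
        if scaled >= 1.0:
            # Heavy edge: always kept, weight scaled down by alpha.
            sample[edge] = scaled
        elif rng.random() < scaled:
            # Light edge: kept with weight 1 with probability w / alpha.
            sample[edge] = 1.0
    return sample
```

In both cases the expected weight of an edge in the sample is `w / alpha`, which is why every kept edge has weight at least 1.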
We note that for three points , and in a -metric space and any parameter , directly implies . Therefore one can use our technique to sample edges proportional to (a.k.a. sampling). In the streaming setting, sampling has been extensively studied and appears to have several applications ; as far as we are aware, our approach provides the first sampling technique that uses a sublinear number of edge weight queries.
As previously mentioned, in Section 3 we consider several problems and show that for some , any -approximate solution of the problem on is an -approximate solution on the original graph with high probability. Specifically, we show that is sufficient to approximate densest subgraph and max cut, is sufficient to approximate -hypermatching, and is sufficient to approximate the average distance. Notice that these results directly imply (potentially exponential time) approximation algorithms with a sublinear number of queries for each of the problems. Often our methodology can also yield sublinear time algorithms (since it uses a sublinear number of edges) with possibly worse approximation ratios.
We now briefly describe specific results for the various problems we consider, although we defer the formal problem definitions to Section 3. All of the algorithms discussed below work with high probability. We note that, throughout the paper, we use for .
For average distance, we provide a -approximation algorithm that simply finds the sum of the weights of the edges in for , and hence our algorithm runs in time . For a metric graph, this improves the running time of the previous result of Indyk  that runs in time, with constant probability.
For densest subgraph, the greedy algorithm yields a -approximate solution in time quasilinear in the number of edges . The expected number of edges of can be bounded by for the densest subgraph on -metric graphs. Therefore, our result implies a -approximation algorithm for densest subgraph in -metric spaces requiring time.
A sublinear time algorithm for a approximation for metric max cut is already known . The previous result uses queries, while we use only queries. (We note that this result does not improve the running time, but remains interesting from an information theoretic point of view. Indeed, there are several interesting results on sublinear space algorithms that ignore the computational complexity e.g., max cut [16, 38, 37, 40], set cover [9, 33], vertex cover and hypermatching [24, 22].)
Finally, on the hardness side, in Section 4 we show that queries are necessary even if one just wants to approximate the size of the solution for densest subgraph, -hypermatching, max cut, and average distance.
1.2 Other Related Work
Metric spaces are natural in their own right. For example, they represent geographic information, and hence graph problems such as the densest subgraph problem often have a natural interpretation in metric spaces. It also is often reasonable to manage large data sets by embedding objects within a suitable metric space. In networks, for example, the idea of finding network coordinates consistent with latency measurements to predict latency has been widely studied [25, 44, 52, 53, 54, 55].
There are several works on designing sublinear algorithms for different variants of clustering problems in metric spaces due to their application to machine learning [7, 11, 26, 27, 35]. We briefly summarize some of these papers. Alon et al. study the efficiency of uniform sampling of vertices to check for given parameters and if the set of points can be clustered into subsets each with diameter at most , ignoring up to an fraction of the vertices . Czumaj and Sohler study the efficiency of uniform sampling of vertices for -median, min-sum -clustering, and balanced -median . Badoiu et al. consider the facility location problem in metric space . They compute the optimal cost of the minimum facility location problem, assuming uniform costs and demands, and assuming every point can open a facility. Moreover, they show that there is no time algorithm that approximates the optimal solution of the general case of the metric facility location problem to within any factor.
A basic and natural difference between these previous works on clustering problems and the densest subgraph problem that we consider here is that all previous problems aim to decompose the graph into two or more subsets, where each subset consists of points that are close to each other. However, densest subgraph in a metric space aims to pick a diverse, spread out subset of points. (While perhaps counterintuitive, this is clear from the definition, which we provide shortly.) The application of metric densest subgraph in diversity maximization and feature selection is well studied [17, 56].
Sublinear algorithms may also refer to sublinear space algorithms such as streaming algorithms. A related, well-studied setting is semi-streaming , often used for graph problems. In the semi-streaming setting the input is a stream of edges and we take one (or a few) passes over the stream, while only using space. Semi-streaming algorithms have been extensively studied [1, 2, 29, 32, 39, 43].
For the densest subgraph problem, there have been a number of recent papers showing the efficiency of uniform edge sampling in unweighted graphs [18, 31, 46, 47]. Initially, Bhattacharya et al. provided a approximation semi-streaming algorithm for this problem . They extended their approach to obtain a approximation algorithm for this problem for dynamic streams with update time and space. McGregor et al. and Esfandiari et al. independently provide a -approximation semi-streaming algorithm for this problem [31, 46]. Esfandiari et al. extend the analysis of uniform sampling of edges to several other problems. Mitzenmacher et al. study the efficiency of uniform edge sampling for densest subgraph in hypergraphs .
For the max cut problem, Kapralov, Khanna, and Sudan  and independently Kogan and Krauthgamer  showed that a streaming -approximation algorithm to estimate the size of max cut requires space. Later, Kapralov, Khanna and Sudan  show that for some small any streaming -approximation algorithm to estimate the size of max cut requires space. Very recently, Bhaskara et al.  provide a -pass -approximation streaming algorithm using space for graphs with average degree .
Finally, when considering matching algorithms, there are numerous works on maximum matching in the streaming and semi-streaming settings [10, 14, 23, 24, 22, 42, 41, 30, 36]. Note that a maximal matching is a approximation to the maximum matching, and it is easy to provide one in the semi-streaming setting. However, improving this approximation factor in one pass is still open. There are several works that improve this approximation factor in a few passes [3, 14, 42, 45]. Maximum matching in hypergraphs has also been considered in the streaming setting .
While a -approximation for unweighted matching in the semi-streaming setting is trivial, such an approximation for weighted matching appears nontrivial. There is a sequence of works improving the approximation factor of weighted matching in the semi-streaming setting [19, 29, 51], and just recently Paz and Schwartzman provide a semi-streaming -approximation algorithm.
There are, of course, many, many other related problems; see , for example, for a survey on sublinear algorithms.
2 Providing a Linear Sampling
In this section we provide a technique to construct the desired sampled graph from a metric graph . We first provide a useful decomposition of the graph. We show this decomposition allows us to obtain a graph that satisfies the first property of , namely that edge weights are scaled down (in expectation, for edges with scaled weights less than 1) by a factor of . We then show how to determine a proper value so that the expected sum of the edge weights is between and as desired.
2.1 A Graph Decomposition
We start with a decomposition for a metric graph , assuming an upper bound on the weight of the edges. For a suitable number determined later we define the following sequences.
A sequence of graphs .
A sequence of vertex sets .
A sequence of weights, .
We denote the vertex set and edge set of by and respectively. We begin with , and is constructed from by removing vertices in , i.e. . However, defining , which depends on , requires the following additional definitions. For any and , define to be the graph obtained by removing all edges with weight less than from , i.e., if and only if and .
We now define and iteratively as follows. We define to be the set of vertices in with degree at least . We let be an arbitrary subset such that . As mentioned, and . We define . Note that is the set of edges neighboring in .
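One step of such a decomposition can be sketched in Python as follows. The sketch only shows the mechanics: drop edges lighter than a weight threshold, collect the vertices whose remaining degree meets a degree threshold, and remove a chosen set of vertices to obtain the next graph. The parameters `w_min` and `min_degree` and all function names are hypothetical stand-ins for the thresholds of the decomposition, and the choice of an arbitrary subset among the candidate vertices is not modeled.

```python
from collections import defaultdict


def threshold_degrees(weights, w_min):
    """Vertex degrees after removing every edge with weight below w_min."""
    deg = defaultdict(int)
    for (u, v), w in weights.items():
        if w >= w_min:
            deg[u] += 1
            deg[v] += 1
    return dict(deg)


def high_degree_vertices(weights, w_min, min_degree):
    """Vertices with at least min_degree incident edges of weight >= w_min."""
    deg = threshold_degrees(weights, w_min)
    return {v for v, d in deg.items() if d >= min_degree}


def remove_vertices(weights, removed):
    """Graph induced on the vertices that are not in `removed`."""
    return {
        (u, v): w
        for (u, v), w in weights.items()
        if u not in removed and v not in removed
    }
```

Iterating these helpers with decreasing weight thresholds would peel off the vertices incident to the heaviest edges first, which is the general shape of the decomposition described here.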
For any , the set of vertices in with degree at least (i.e., ) is a vertex cover for .
Proof : Let be an edge in , and let and be the degrees of and in respectively. Next we show that . Hence we have or . This means that covers as desired.
Notice that, means that . Hence, by the -triangle inequality, for any we have . Now we are ready to bound .
|Extra is for|
which completes the proof.
For any , is an upper bound on the weight of the edges in , i.e., we have .
Proof : For , which is an upper bound on the weight of the edges in . For , is a vertex cover of , by Lemma 1. Moreover, by definition we have . Hence is a vertex cover of . This means that every edge with weight at least has a neighbor in . Recall that , and hence, has no edge with weight at least .
The following lemma compares the average weight of the edges in with . We later use this in Theorem 7 to bound the number of queries.
For any , we have
Proof : We start by proving the upper bound. Recall that is the set of edges neighboring in . Hence the number of edges in is upper bounded by the sum of the degrees of the vertices of in . The degree of each vertex in is , and there are vertices in . Thus, we have . Moreover, by Lemma 2, for each we have . Therefore we have
Next we prove the lower bound. Recall that we have . Thus, for each , the degree of in is at least . Thus, for any fixed we have
|Definition of for||(1)|
Note that each edge in intersects at most two vertices in . Therefore, we have
|By Inequality 1|
which completes the proof of the lemma.
Lemma 5 provides a technique to construct using queries, with high probability. To prove this lemma we sample some edges. Notice that these sampled edges are different from the edges that we sample to keep in . We use the following standard version of the Chernoff bound (see e.g. ) in Lemma 5 as well as the rest of the paper.
Lemma 4 (Chernoff Bound)
Let be a sequence of independent binary (i.e., or ) random variables, and let . For any , we have
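For reference, one common two-sided multiplicative form of the Chernoff bound is written out below in LaTeX. The symbol names $x_i$, $X$, and $\delta$, the range of $\delta$, and the constant 3 in the exponent are assumptions of this sketch; the version used in the paper may differ in constants.

```latex
\[
  \Pr\bigl[\,|X - \mathbb{E}[X]| \ge \delta\,\mathbb{E}[X]\,\bigr]
  \;\le\; 2\exp\!\left(-\frac{\delta^{2}\,\mathbb{E}[X]}{3}\right),
  \qquad X = \sum_{i=1}^{n} x_i,\quad 0 \le \delta \le 1 .
\]
```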
As we are now moving to doing sampling, we briefly remark on some noteworthy points. First, there is some probability of failure in our results. We therefore refer to the success probability in our results, and note that our algorithms may fail “silently”; that is, we may not realize the algorithm has failed (because of a low probability event in the sampling). Also, we emphasize that in general, in what follows, when referring to the number of queries required, we mean the expected number of queries. However, using expectations is for convenience; all of our results throughout the paper could instead be turned into results bounding the number of queries required with high probability (say, probability ) using Chernoff bounds, at the cost of at most constant factors, in the standard way. Finally, in some places we may sample which edges we decide to query from a set of edges with a fixed probability . In such situations, instead of iterating through each edge (which could take time quadratic in the number of vertices) we can generate the number of samples from a binomial distribution and then generate the samples without replacement; alternatively, we could determine which sample is the next sample at each step by calculating a geometrically distributed random variable. We assume this work can be done in constant time per sample. For this reason, our time depends on the number of queries, and not the total number of edges.
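The geometric-skip idea can be sketched in Python as follows. The sketch selects each of `n_items` positions independently with probability `p` by drawing the gap to the next selected position from a geometric distribution, so the running time is proportional to the number of selected positions. The function names and the indexing of vertex pairs by triangular numbers are assumptions of this illustration, not taken from the paper.

```python
import math
import random


def bernoulli_sample_indices(n_items, p, rng=None):
    """Indices in range(n_items), each selected independently with prob. p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 0.0 or n_items == 0:
        return []
    if p == 1.0:
        return list(range(n_items))
    rng = rng or random.Random()
    log_q = math.log1p(-p)  # log(1 - p), strictly negative here
    picked = []
    i = -1
    while True:
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        # Number of unselected positions before the next selected one.
        skip = int(math.floor(math.log(u) / log_q))
        i += skip + 1
        if i >= n_items:
            return picked
        picked.append(i)


def index_to_pair(k):
    """Map k = 0, 1, 2, ... to the k-th vertex pair (i, j) with i < j."""
    j = (1 + math.isqrt(1 + 8 * k)) // 2
    i = k - j * (j - 1) // 2
    return i, j
```

For a graph on `n` vertices one would call `bernoulli_sample_indices(n * (n - 1) // 2, p)` and translate each returned index with `index_to_pair`, querying only the selected pairs.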
For any , given and , one can construct using expected queries, succeeding with probability at least .
Proof : If , we have . Hence in this case we query all the edges and construct . In what follows we assume . To construct we sample each edge in with probability . We add a vertex to if and only if at least of its sampled neighbors has weight . The number of sampled edges is .
We denote the degree of a vertex in by . Let be a binary random variable that is if we sample and otherwise. Let us define and . Recall that we add to if and only if . Notice that, for any we have
As is the sum of independent binary random variables, by the Chernoff bound we have
By applying the union bound we have
This means that with probability at least , simultaneously for all vertices we have
Next, assuming that for all vertices we have , we show that the that we pick satisfies the property .
This means that if we have and hence . Therefore we have . Similarly, we have
This means that if we have and hence . Therefore we have .
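The sampling test used in this proof can be sketched in Python as follows. Each vertex pair is queried with probability `p`, and a vertex is selected when enough of its sampled incident edges are heavy. The parameters `p`, `w_min`, and `min_hits`, the `query` callback, and the function name are hypothetical; the concrete values used in the lemma are not reproduced. For clarity the sketch loops over all pairs, although only sampled pairs are queried.

```python
import random


def estimate_heavy_vertices(n, query, p, w_min, min_hits, rng=None):
    """Select vertices with many sampled incident edges of weight >= w_min.

    n:        number of vertices, labeled 0 .. n-1.
    query:    function (u, v) -> edge weight; called only on sampled pairs.
    p:        probability of sampling each pair.
    Returns (selected vertex set, number of weight queries made).
    """
    rng = rng or random.Random()
    hits = [0] * n
    queries = 0
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                queries += 1
                if query(u, v) >= w_min:
                    hits[u] += 1
                    hits[v] += 1
    selected = {v for v in range(n) if hits[v] >= min_hits}
    return selected, queries
```

With `p = 1` the test is exact; with smaller `p` the count of heavy sampled neighbors concentrates around `p` times the true degree in the thresholded graph, which is where a Chernoff bound enters.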
Finally, for completeness we use the following lemma to find a good upper bound on in order to start our construction of the graph decomposition (which required an upper bound on the weight of the edges).
For any -metric graph , one can compute a number such that using queries.
Proof : Let be an arbitrary vertex. We set . Note that, one can simply query all the neighbors of and calculate . Clearly, we have . Next, we show that .
Let be an edge such that . If we have which directly implies as desired. Otherwise, note that by the -triangle inequality we have . Thus, we have . Therefore, we have
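For the ordinary triangle inequality, the argument can be sketched in Python as follows: query all edges incident to one pivot vertex and return twice the largest weight seen. Any edge (a, b) satisfies w(a, b) <= w(a, pivot) + w(pivot, b), so the returned value is an upper bound on the maximum edge weight and is at most twice that maximum. The factor 2 is specific to the ordinary metric case and is an assumption of this sketch; the parametrized triangle inequality is not covered. The function name and `query` callback are hypothetical.

```python
def max_weight_upper_bound(n, query, pivot=0):
    """Upper bound on the maximum edge weight of a metric graph.

    n:     number of vertices, labeled 0 .. n-1 (n >= 2).
    query: function (u, v) -> edge weight obeying the triangle inequality.
    Uses n - 1 weight queries, all incident to the pivot vertex.
    """
    if n < 2:
        raise ValueError("need at least two vertices")
    farthest = max(query(pivot, u) for u in range(n) if u != pivot)
    # Triangle inequality through the pivot: w(a, b) <= 2 * farthest.
    return 2.0 * farthest
```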
We now show how to construct what we call , which is derived from our original metric graph . Recall has the property that for each original edge of weight , independently, if , then contains edge with weight , and if , then contains edge with weight 1 with probability .
The following theorem constructs using an expected queries.
For any one can construct using queries in expectation, succeeding with probability at least .
Proof : By Lemma 6 we find an upper bound on the weight of the edges, using queries. Recall that is the number of graphs in our decomposition. We set . Given , by definition we have
Recall that, using Lemma 5, one can construct and thus using queries, succeeding with probability at least . We start with and iteratively apply Lemma 5 to construct the sequence and . We apply Lemma 5 times, and hence using a union bound, all of the were successfully constructed with probability at least . Next, we show how to construct assuming the sequences and are valid. Note that in constructing the graph decomposition we use at most queries.
Recall that . Also, we have . Thus, the sequence is a decomposition of . Also, note that . Therefore, given and we can decompose the edge set into .
Let be the smallest index such that . Notice that by Inequality 2.2. For each we query each edge . If , add edge with weight to . If , we add edge with weight to with probability independently.
For each we query each edge with probability . We add a queried edge to with probability and discard it otherwise. Note that . Also is an upper bound on the edge weights in , and thus . Therefore, the probabilities and are valid. Also, notice that we add each edge to with probability as desired.
For each edge we query with probability . We add a queried edge to with probability and discard it otherwise. Recall is an upper bound on the edge weights in , and by Inequality 2.2 we have . Thus, we have
Therefore, is a valid probability. Again, notice that we add each edge to with probability as desired. Next we bound the total number of edges that we query.
Let be a random variable that is if we query and otherwise. We bound the expected number of edges that we query by
|is a decomposition of|
|By Lemma 3|
We used queries to construct the sequences and , and used queries to construct based on these sequences. Therefore, in total we used queries in expectation.
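The two-stage step used for the lighter edge classes can be sketched in Python as follows. The sketch assumes that an edge class has a known upper bound `upper` on its weights, that each edge is queried with probability `upper / alpha`, and that a queried edge of weight `w` is then kept with probability `w / upper`, so that it is kept with overall probability `w / alpha`. The names `upper`, `alpha`, `query`, and `thinned_sample` are hypothetical. For clarity the loop visits every edge, although only queried edges have their weight read.

```python
import random


def thinned_sample(edges, query, upper, alpha, rng=None):
    """Keep each edge with probability w / alpha while querying few weights.

    edges: iterable of edge identifiers in one class of the decomposition.
    query: function edge -> weight, assumed to satisfy 0 < weight <= upper.
    Returns (kept edges, number of weight queries made).
    """
    p_query = upper / alpha
    if not 0.0 < p_query <= 1.0:
        raise ValueError("need 0 < upper <= alpha")
    rng = rng or random.Random()
    kept, queries = [], 0
    for e in edges:
        if rng.random() >= p_query:
            continue  # edge is never queried
        queries += 1
        w = query(e)
        if w > upper:
            raise ValueError("weight exceeds the assumed upper bound")
        if rng.random() < w / upper:
            kept.append(e)  # this edge enters the sample with weight 1
    return kept, queries
```

The expected number of queries is the class size times `upper / alpha`, so a tighter upper bound on the class directly reduces the number of weights read.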
The following lemma relates with . We use this to construct using .