# Optimization from Structured Samples for Coverage Functions

We revisit the optimization from samples (OPS) model, which studies the problem of optimizing objective functions directly from the sample data. Previous results showed that we cannot obtain a constant approximation ratio for the maximum coverage problem using polynomially many independent samples of the form {S_i, f(S_i)}_i=1^t (Balkanski et al., 2017), even if coverage functions are (1 - ϵ)-PMAC learnable using these samples (Badanidiyuru et al., 2012), which means most of the function values can be approximately learned very well with high probability. In this work, to circumvent the impossibility result of OPS, we propose a stronger model called optimization from structured samples (OPSS) for coverage functions, where the data samples encode the structural information of the functions. We show that under three general assumptions on the sample distributions, we can design efficient OPSS algorithms that achieve a constant approximation for the maximum coverage problem. We further prove a constant lower bound under these assumptions, which is tight when not considering computational efficiency. Moreover, we also show that if we remove any one of the three assumptions, OPSS for the maximum coverage problem has no constant approximation.


## 1 Introduction

Traditional optimization problems in textbooks are often formulated as mathematical models with specified parameters. The computational task is to optimize an objective function given the parameters of the model. One such example is the maximum coverage problem: given a family of subsets of a ground set U and a positive integer k, the problem asks to find k subsets whose union contains the largest number of elements of U. In practice, however, the parameters of the model are often hidden in the complex real world and cannot be observed directly. Instead, we can only learn information about the model from passively observed sample data. Returning to the maximum coverage problem, in this case we may not know the exact elements contained in every subset, but only observe samples of subsets S_i, and for each sample we only observe the number of elements it covers. An immediate question, recently raised by Balkanski et al. (2017), asks to what extent we can optimize objective functions based on the sample data that we use to learn them. More specifically, given samples {(S_i, f(S_i))}_{i=1}^t, where the S_i's are drawn i.i.d. from some distribution D on the subsets of the ground set and f is an unknown objective function, can we maximize f over a constraint M? For maximum coverage, each S_i would be a collection of subsets, and f(S_i) would be the number of elements covered by that collection. Such problems form a new approach to optimization called optimization from samples (OPS) (Balkanski et al., 2017).
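For contrast with the samples-only setting, the full-information maximum coverage problem admits the classic greedy approximation; here is a minimal sketch on a made-up instance (the family and names are illustrative):

```python
from typing import Dict, List, Set

def greedy_max_coverage(subsets: Dict[str, Set[int]], k: int) -> List[str]:
    """Repeatedly pick the subset covering the most uncovered elements.
    This achieves a (1 - 1/e)-approximation (Nemhauser et al., 1978), but it
    needs full knowledge of every subset, which OPS does not provide."""
    covered: Set[int] = set()
    chosen: List[str] = []
    for _ in range(k):
        best = max(subsets, key=lambda s: len(subsets[s] - covered))
        if not subsets[best] - covered:
            break  # no subset adds new elements
        chosen.append(best)
        covered |= subsets[best]
    return chosen

family = {"A": {1, 2, 3}, "B": {3, 4}, "C": {5, 6}, "D": {1, 4}}
print(greedy_max_coverage(family, 2))  # → ['A', 'C']
```

In the OPS setting the dictionary `subsets` is exactly what the algorithm does not get to see; it only observes sampled collections and their coverage counts.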

A reasonable and perhaps the most natural approach is to first learn a surrogate function f̃ that approximates the original function f well, and then optimize f̃ instead of f. One may expect that if we can approximate a function well, then we can also optimize it well. Standard frameworks of learnability in the literature include PAC learnability for boolean functions due to Valiant (1984) and PMAC learnability for real-valued set functions due to Balcan and Harvey (2011).

Unfortunately, the learn-then-optimize approach does not work in general. Indeed, Balkanski et al. (2017) show the striking result that the maximum coverage problem cannot be approximated within any constant ratio using only polynomially many samples drawn i.i.d. from any distribution, even though (a) for any constant ϵ > 0, coverage functions are (1 − ϵ)-PMAC learnable over any distribution (Badanidiyuru et al., 2012), which means most of the function values can be approximately learned very well with high probability; and (b) the maximum coverage problem, as a special case of submodular function maximization, has a (1 − 1/e)-approximation given a value oracle to the coverage function (Nemhauser et al., 1978).

The impossibility result by Balkanski et al. (2017) uses coverage functions defined over a partition of the ground set, which ensure the “good” and “bad” parts of the partition cannot be distinguished from the samples. In other words, the impossibility result arises because the samples do not provide information on the structure of coverage functions.

To circumvent the above impossibility result, we propose a stronger model called optimization from structured samples (OPSS) for coverage functions, which encodes structural information of the coverage functions into the samples. In many real-world applications, such structural information is often revealed in the data: for example, a crowd-sourcing platform records the crowd-workers' coverage of the tasks they took, and a document analysis application records the documents in which each keyword appears. Thus the OPSS model is reasonable in practice. However, even in the stronger OPSS model, not all sample distributions allow a constant approximation for the maximum coverage problem. In this paper, we study the assumptions that enable constant approximation in the OPSS model and the related algorithmic and hardness results. We now state our model and results in more detail.

### 1.1 Model

For the sake of comparison, we first state the definition of optimization from samples (Balkanski et al., 2017) for general set functions.

###### Definition 1 (Optimization from samples (OPS)).

Let F be a class of set functions defined on a ground set. F is α-optimizable from samples in constraint M over distribution D on subsets of the ground set, if there exists a (not necessarily polynomial time) algorithm such that, given any parameter δ ∈ (0, 1) and sufficiently large n, there exists some integer t₀ such that for all t ≥ t₀, for any f ∈ F and any set of samples {(S_i, f(S_i))}_{i=1}^t with the S_i's drawn i.i.d. from D, the algorithm takes the samples as input and returns S ∈ M such that

 Pr_{S_1,⋯,S_t∼D}[E[f(S)] ≥ α · max_{T∈M} f(T)] ≥ 1 − δ,

where the expectation is taken over the randomness of the algorithm.

Next we state the definition of coverage functions in terms of bipartite graphs as well as the definition of optimization from structured samples for coverage functions.

###### Definition 2 (Coverage functions).

Assume there is a bipartite graph G = (L, R, E). For a node u ∈ L, let N_G(u) denote its set of neighbors in R. The neighborhood of a subset S ⊆ L is N_G(S) = ∪_{u∈S} N_G(u). The coverage function f_G counts the number of neighbors covered by a set S ⊆ L, i.e. f_G(S) = |N_G(S)|.

###### Definition 3 (Optimization from structured samples (OPSS)).

Let C be the class of coverage functions defined on all bipartite graphs G = (L, R, E) with two components L and R, where |L| = n and |R| = m. C is α-optimizable under OPSS in constraint M over distribution D on 2^L, if there exists a (not necessarily polynomial time) algorithm such that, given any parameter δ ∈ (0, 1) and sufficiently large n, there exists some integer t₀ such that for all t ≥ t₀, for any set of samples {(S_i, N_G(S_i))}_{i=1}^t with the S_i's drawn i.i.d. from D, the algorithm takes the samples as input and returns S ∈ M such that

 Pr_{S_1,⋯,S_t∼D}[E[f_G(S)] ≥ α · max_{T∈M} f_G(T)] ≥ 1 − δ,

where the expectation is taken over the randomness of the algorithm.

Samples in OPSS are structured in that the exact members covered by a set are revealed, instead of only the number of covered members as in OPS. In this paper we focus on the cardinality constraint M = {S ⊆ L : |S| ≤ k}. Maximizing coverage functions under this constraint is known as the maximum coverage problem.
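To make the distinction concrete, the following toy sketch contrasts the two kinds of samples on a hypothetical bipartite graph (the graph, node names, and sampler are illustrative, not from the paper):

```python
import random

# Hypothetical bipartite graph: each left node maps to its neighbors N_G(u) in R.
N_G = {"u1": {1, 2}, "u2": {2, 3}, "u3": {4}}

def coverage(S):
    """N_G(S): the union of neighborhoods of the left nodes in S."""
    return set().union(*(N_G[u] for u in S)) if S else set()

def f_G(S):
    """Coverage function: f_G(S) = |N_G(S)|."""
    return len(coverage(S))

def draw_samples(k, rng):
    S = frozenset(rng.sample(sorted(N_G), k))
    ops_sample = (S, f_G(S))        # OPS reveals only the covered count
    opss_sample = (S, coverage(S))  # OPSS reveals the covered members
    return ops_sample, opss_sample

ops, opss = draw_samples(2, random.Random(0))
print(ops)
print(opss)
```

The OPSS sample carries strictly more information: the OPS value can always be recovered from it by taking the size of the revealed set.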

Our OPSS model is so far defined only for coverage functions. One reason is that the OPS impossibility result of Balkanski et al. (2017) concerns coverage functions, which is striking because coverage functions admit a simple constant approximation algorithm given a value oracle and are (1 − ϵ)-PMAC learnable, as mentioned before. Thus coverage functions are the natural first class to consider for circumventing the impossibility result for OPS. Another reason is that coverage functions exhibit a natural structure via the bipartite graph representation. Other set functions may exhibit different combinatorial structures, and the OPSS problem may need to be defined accordingly to reflect the specific structural information of those functions.

### 1.2 Our Results

One of our main results is a set of three general assumptions on the sample distribution, together with an algorithm that achieves a constant approximation ratio for the maximum coverage problem in OPSS under these assumptions. The assumptions are summarized below.

###### Assumption 1.

We assume that the distribution D on 2^L satisfies the following three conditions:

1. Feasibility. A sample S ∼ D is always feasible, i.e. |S| ≤ k.

2. Polynomially bounded sample complexity. For any u ∈ L, the probability Pr_{S∼D}[u ∈ S] satisfies Pr_{S∼D}[u ∈ S] ≥ n^{−c} for some constant c.

3. Negative correlation. The indicator random variables {1[u ∈ S]}_{u∈L} are "negatively correlated" (see Definition 4) over distribution D.

All three conditions above are natural. In particular, Assumption 1.2 means that every element of the ground set has a sufficient probability of being sampled, and Assumption 1.3 means, informally, that the appearance of one element in the sampled set reduces the probability of the appearance of another element. In fact, typical distributions, such as the uniform distribution over all feasible subsets or the uniform distribution D_k over all subsets of size exactly k, satisfy all of these conditions. Our result based on the above assumptions is summarized by the following theorem.

###### Theorem 1.

If a distribution D satisfies Assumption 1, given any α-approximation algorithm A for the standard maximum coverage problem, coverage functions are (α/2)-optimizable under OPSS in the cardinality constraint over D. Furthermore, the OPSS algorithm uses a polynomial number of arithmetic operations and one call to algorithm A.

The general approximation ratio α covers both polynomial-time and non-polynomial-time algorithms. If we require a polynomial-time algorithm, then the best ratio achievable is α = 1 − 1/e unless P = NP (Nemhauser et al., 1978; Feige, 1998), so our OPSS algorithm achieves a (1 − 1/e)/2 approximation. If running time is not a concern, then we can take α = 1 via an exhaustive search algorithm, and our OPSS algorithm achieves a 1/2 approximation.

We further show that if the distribution is D_k, i.e. the uniform distribution over all subsets of size exactly k, we have another OPSS algorithm that achieves an (α − ϵ) approximation, as shown below. This implies that our OPSS algorithm (almost) matches the approximation ratio of any algorithm for the standard maximum coverage problem.

###### Theorem 2.

For any constant ϵ > 0, given any α-approximation algorithm A for the standard maximum coverage problem, coverage functions are (α − ϵ)-optimizable under OPSS in the cardinality constraint over D_k, assuming that k and n are sufficiently large. Furthermore, the OPSS algorithm uses a polynomial number of arithmetic operations and one call to algorithm A.

Next, we prove a hardness result showing that the approximation ratio of 1/2 is unavoidable for some distributions, which means that when efficiency is not a concern, our upper and lower bounds are tight.

###### Theorem 3.

There is a distribution D satisfying Assumption 1 such that coverage functions are not (1/2 + ϵ)-optimizable under OPSS in the cardinality constraint over D for any constant ϵ > 0.

Finally, we also show that the three conditions given in Assumption 1 are necessary, in the sense that dropping any one of them would result in no constant approximation for the OPSS problem. This demonstrates that our three conditions need to work together to make OPSS solvable.

###### Theorem 4.

By dropping any one of the conditions in Assumption 1, there is a distribution D such that coverage functions are not c-optimizable under OPSS for any constant c > 0 in the cardinality constraint over D.

To summarize, in this paper we investigate the structural information on coverage functions that allows us to circumvent the impossibility result of (Balkanski et al., 2017). We show that when the samples reveal the covered elements rather than just their count, under certain reasonable assumptions on the sample distribution (Assumption 1), we can design an OPSS algorithm that achieves an α/2 approximation, where α is the approximation ratio of an algorithm for the standard maximum coverage problem. Moreover, for the uniform distribution D_k on subsets of size k, we provide an efficient algorithm that achieves a tight (α − ϵ) approximation, matching the performance of any algorithm for the standard maximum coverage problem. On the lower bound side, we show that the approximation ratio of 1/2 is unavoidable, which matches the upper bound when computational complexity is not a concern. Finally, we show that if we remove any one of the three conditions in Assumption 1, no constant approximation for OPSS is achievable. Our study opens up the possibility of exploiting structural information to achieve optimization from samples, which is needed in many applications in the big data era.

### 1.3 Related Work

The study of optimization from samples (OPS) was initiated by Balkanski et al. (2017). They proved that no algorithm can achieve a constant approximation ratio for the maximum coverage problem under OPS. The same set of authors gave an optimal approximation algorithm for maximizing monotone submodular functions with bounded curvature subject to a cardinality constraint over uniform distributions under OPS (Balkanski et al., 2016). For submodular function minimization, it was proved in (Balkanski & Singer, 2017) that no algorithm can obtain an approximation strictly better than that of a trivial algorithm under OPS, so the trivial approximation is tight. Rosenfeld et al. (2018) defined a weaker variant of OPS called distributional optimization from samples (DOPS). They showed that a class of set functions is optimizable under DOPS if and only if it is PMAC-learnable.

## 2 Concepts and Tools

We first discuss the definition of negative correlation. Negative dependence among random variables has been extensively studied in the literature, and there are many qualitative versions of this concept (Jogdeo & Patil, 1975; Karlin & Rinott, 1980; Ghosh, 1981; Block et al., 1982; Joag-Dev & Proschan, 1983). Among them, the most widely accepted one is negative association (NA), defined in (Joag-Dev & Proschan, 1983). However, in this paper we only use a weaker version of NA; thus, more distributions satisfy our definition of negative correlation. It is also easy to see that the uniform distributions mentioned in Section 1.2 both satisfy this definition.

###### Definition 4 (Negative correlation).

A set of 0-1 random variables X_1, …, X_n is negatively correlated if, for any disjoint subsets I, J ⊆ [n],

 E[∏_{i∈I∪J}(1−X_i)] ≤ E[∏_{i∈I}(1−X_i)] · E[∏_{j∈J}(1−X_j)].
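As a sanity check, the inequality in Definition 4 can be verified by brute force for the uniform distribution over fixed-size subsets of a small ground set (an illustrative script, not part of the paper; the parameters are arbitrary):

```python
from itertools import combinations

def negatively_correlated(n: int, k: int) -> bool:
    """Brute-force check of E[prod_{I∪J}(1-X_i)] <= E[prod_I(1-X_i)]*E[prod_J(1-X_j)]
    for all disjoint nonempty I, J, where X_i = 1[i in S] and S is a uniform
    k-subset of {0, ..., n-1}."""
    subsets = list(combinations(range(n), k))

    def avoid_prob(idx):  # E[prod_{i in idx}(1 - X_i)] = Pr[S avoids idx]
        return sum(1 for S in subsets if not set(idx) & set(S)) / len(subsets)

    for i_size in range(1, n):
        for I in combinations(range(n), i_size):
            rest = [x for x in range(n) if x not in I]
            for j_size in range(1, len(rest) + 1):
                for J in combinations(rest, j_size):
                    if avoid_prob(I + J) > avoid_prob(I) * avoid_prob(J) + 1e-12:
                        return False
    return True

print(negatively_correlated(5, 2))  # → True
```

Here `avoid_prob` computes the expectation directly, since ∏_{i∈A}(1−X_i) is 1 exactly when the sample avoids all of A.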

We then prove the following lemma, which shows that the occurrence of one event reduces the probability of the occurrence of other events.

###### Lemma 1.

Assume that X_1, …, X_n are negatively correlated 0-1 random variables. Then for any I ⊆ [n] and j ∉ I,

 Pr[∨_{i∈I}(X_i=1) ∣ X_j=1] ≤ Pr[∨_{i∈I}(X_i=1)].
###### Proof.

Since X_1, …, X_n are negatively correlated, taking the disjoint sets I and {j} in Definition 4 gives

 Pr[∧_{i∈I∪{j}}(X_i=0)] ≤ Pr[∧_{i∈I}(X_i=0)] · Pr[X_j=0],

which is equivalent to

 Pr[∧_{i∈I}(X_i=0)] − Pr[∧_{i∈I}(X_i=0), X_j=1] ≤ Pr[∧_{i∈I}(X_i=0)] · Pr[X_j=0].

Rearranging the last inequality, we have

 Pr[∧_{i∈I}(X_i=0)] · Pr[X_j=1] ≤ Pr[∧_{i∈I}(X_i=0), X_j=1],

which is equivalent to

 (1 − Pr[∨_{i∈I}(X_i=1)]) · Pr[X_j=1] ≤ Pr[X_j=1] − Pr[∨_{i∈I}(X_i=1), X_j=1].

Rearranging the last inequality, we have

 Pr[∨_{i∈I}(X_i=1), X_j=1] ≤ Pr[∨_{i∈I}(X_i=1)] · Pr[X_j=1].

Dividing both sides by Pr[X_j=1] yields the claimed conditional form. This concludes the proof. ∎

Next we state the Chernoff bound used in the analysis of probability concentration.

###### Lemma 2 (Chernoff bound, (Mitzenmacher & Upfal, 2005)).

Let X_1, …, X_t be independent random variables taking values in [0, 1], and let X = ∑_{i=1}^t X_i. Suppose μ_L ≤ E[X]. Then, for any 0 < δ < 1,

 Pr[X ≤ (1−δ)·μ_L] ≤ e^{−μ_L·δ²/2}.
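A quick numerical illustration of this lower-tail bound (all parameters below are chosen arbitrarily for the demonstration):

```python
import math
import random

def lower_tail(t=500, p=0.3, delta=0.5, trials=2000, seed=1):
    """Compare the empirical frequency of {X <= (1-delta)*mu} for
    X = sum of t Bernoulli(p) variables against the bound e^{-mu*delta^2/2}."""
    rng = random.Random(seed)
    mu = t * p
    threshold = (1 - delta) * mu
    hits = sum(
        sum(rng.random() < p for _ in range(t)) <= threshold
        for _ in range(trials)
    )
    return hits / trials, math.exp(-mu * delta * delta / 2)

empirical, bound = lower_tail()
print(empirical, bound)  # the empirical tail frequency should not exceed the bound
```

With these parameters the bound e^{−μδ²/2} = e^{−18.75} is already tiny, and a deviation that large essentially never occurs in simulation.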

## 3 Constant Approximations for OPSS

In this section, we present two constant-approximation algorithms for OPSS and their guarantees: one for general distributions satisfying Assumption 1 (Theorem 1) and the other for the uniform distribution D_k (Theorem 2).

### 3.1 A Constant Approximation under Assumption 1

The algorithm is shown in Algorithm 1. It returns one of the two solutions T_1 and T_2 with equal probability, where T_1 is just the first sample S_1, and T_2 is the solution of an α-approximation algorithm A on a constructed surrogate bipartite graph G̃ for the standard maximum coverage problem. The parameters of algorithm A denote the input graph and the cardinality constraint, respectively. The surrogate graph G̃ is constructed from the samples as follows: for each node u ∈ L, we set u's coverage in G̃ to

 Ñ_G(u) = ∩_{i : u ∈ S_i} N_G(S_i),

which is an estimate of N_G(u). The intuition is as follows. If some singleton {u} is drawn from D, the knowledge about N_G(u) is completely revealed. However, it might be the case that D always returns a large set, so that the exact knowledge about N_G(u) for each u in the sampled set is hidden behind the sample's overall coverage. Thus, to reveal as much knowledge about N_G(u) as possible, it is natural to use the intersection of the coverages of the samples that contain u as an estimate.

The difficulty in the analysis is that Ñ_G(u) is always an overestimate of N_G(u), and it is impossible to show that f̃_G is a good approximation of f_G in general. As an extreme example, suppose that some node v ∈ R is covered by every sample; then v belongs to every Ñ_G(u), so Ñ_G(u) might be much larger than N_G(u) itself. Thus T_2 by itself might not be a good solution on the original graph G. To circumvent this difficulty, the key step is to show that for every u ∈ L, Ñ_G(u) ∖ N_G(u) ⊆ N_G(S_1) with high probability (Lemma 3). Consequently, f̃_G(T_2) ≤ f_G(T_2) + f_G(T_1), and we can obtain a constant approximation ratio by combining the random sample T_1 = S_1 with T_2 as in Algorithm 1. Note that T_1 and T_2 may be correlated since they both depend on S_1, but this causes no issue in our analysis.
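Since Algorithm 1's pseudocode is not reproduced here, the following sketch reflects our reading of the construction described above; the greedy subroutine merely stands in for the α-approximation algorithm A, and the instance at the bottom is made up:

```python
import random
from typing import Dict, FrozenSet, List, Set, Tuple

Sample = Tuple[FrozenSet[str], Set[int]]  # (S_i, N_G(S_i)): a structured sample

def build_surrogate(samples: List[Sample]) -> Dict[str, Set[int]]:
    """Estimate each node's coverage as the intersection of N_G(S_i) over all
    samples S_i containing u; this always overestimates the true N_G(u)."""
    est: Dict[str, Set[int]] = {}
    for S, covered in samples:
        for u in S:
            est[u] = est[u] & covered if u in est else set(covered)
    return est

def greedy(neigh: Dict[str, Set[int]], k: int) -> Set[str]:
    """Stand-in for the alpha-approximation subroutine A(G~, k)."""
    covered: Set[int] = set()
    chosen: Set[str] = set()
    for _ in range(min(k, len(neigh))):
        best = max(neigh, key=lambda u: len(neigh[u] - covered))
        chosen.add(best)
        covered |= neigh[best]
    return chosen

def algorithm1(samples: List[Sample], k: int, rng=random) -> Set[str]:
    T1 = set(samples[0][0])                    # T1: the first sample (feasible by Assumption 1.1)
    T2 = greedy(build_surrogate(samples), k)   # T2: solution on the surrogate graph
    return T1 if rng.random() < 0.5 else T2    # return either with probability 1/2

# Toy run on a hypothetical graph.
N_G = {"u1": {1, 2}, "u2": {2, 3}, "u3": {3, 4}, "u4": {5}}
rng = random.Random(7)
samples = []
for _ in range(200):
    S = frozenset(rng.sample(sorted(N_G), 2))
    samples.append((S, set().union(*(N_G[u] for u in S))))
print(algorithm1(samples, 2, rng))
```

Note that `build_surrogate` only ever produces supersets of the true neighborhoods, which is exactly the overestimation issue the analysis has to control.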

###### Lemma 3.

For a given δ ∈ (0, 1), suppose that the number of samples satisfies t ≥ 2t̄·n^c, where t̄ = (2m/δ)·ln(4mn/δ) and c is the constant in Assumption 1.2. Under Assumption 1, we have

 Pr_{S_1,⋯,S_t∼D}[∪_{u∈L}(Ñ_G(u) ∖ N_G(u)) ⊆ N_G(S_1)] ≥ 1 − δ.

The proof of Lemma 3 is deferred to Section 3.1.1. For now, we use it to prove Theorem 5, which is a more concrete version of Theorem 1.

###### Theorem 5.

If a distribution D satisfies Assumption 1, given any α-approximation algorithm A for the standard maximum coverage problem, coverage functions are (α/2)-optimizable under OPSS in the cardinality constraint over D. More precisely, for any δ ∈ (0, 1), suppose that the number of samples satisfies t ≥ 2t̄·n^c, where c is the constant in Assumption 1.2. Let ALG be the solution returned by Algorithm 1 and OPT be the optimal solution on the original graph G. Then under Assumption 1, we have

 Pr_{S_1,⋯,S_t∼D}[E[f_G(ALG)] ≥ (α/2)·f_G(OPT)] ≥ 1 − δ.
###### Proof.

By the construction of G̃, N_G(u) ⊆ Ñ_G(u) for any u ∈ L. Therefore, G is a subgraph of G̃ and f_G ≤ f̃_G pointwise. Since A is an α-approximation algorithm,

 f̃_G(T_2) ≥ α·f̃_G(OPT) ≥ α·f_G(OPT).

On the other hand, it holds that Ñ_G(T_2) ∖ N_G(T_2) ⊆ ∪_{u∈L}(Ñ_G(u) ∖ N_G(u)). Since T_1 = S_1, by Lemma 3, it holds with probability 1 − δ that Ñ_G(T_2) ∖ N_G(T_2) ⊆ N_G(T_1), and

 f̃_G(T_2) = |N_G(T_2) ∪ (Ñ_G(T_2) ∖ N_G(T_2))| ≤ |N_G(T_2)| + |Ñ_G(T_2) ∖ N_G(T_2)| ≤ |N_G(T_2)| + |N_G(T_1)| = f_G(T_2) + f_G(T_1).

Therefore, with probability 1 − δ,

 E[f_G(ALG)] = (1/2)·f_G(T_1) + (1/2)·f_G(T_2) ≥ (1/2)·f̃_G(T_2) ≥ (α/2)·f_G(OPT). ∎

For common distributions, the constant c in Assumption 1.2 is usually small, so Algorithm 1 requires only a moderately small number of samples. For instance, the uniform distributions mentioned in Section 1.2 satisfy the assumption with a small constant, and hence require only polynomially many samples in m, n and 1/δ.

#### 3.1.1 Proof of Lemma 3

We first introduce some notation. Let t̄ = (2m/δ)·ln(4mn/δ). For any node u ∈ L, let t_u be the number of samples in which u appears. For any node v ∈ R, let q_v be the probability that v is covered by a sample S ∼ D. Our analysis starts with partitioning R into two subsets R_1 = {v ∈ R : q_v ≤ 1 − δ/(2m)} and R_2 = {v ∈ R : q_v > 1 − δ/(2m)}. In general, we will show that nodes in R_1 will not appear in ∪_{u∈L}(Ñ_G(u) ∖ N_G(u)) with high probability (Lemma 7), while nodes in R_2 will be covered by any single sample with high probability (Lemma 8). These facts together suffice to prove Lemma 3.

###### Lemma 4.

Assume that t ≥ 2t̄·n^c. For fixed u ∈ L, Pr[t_u ≤ t̄] ≤ δ/(4mn).

###### Proof.

For fixed u, let X_i = 1 if u ∈ S_i and X_i = 0 otherwise. Then t_u = ∑_{i=1}^t X_i. By Assumption 1.2, E[X_i] = Pr_{S∼D}[u ∈ S] ≥ n^{−c}. Thus E[t_u] ≥ t·n^{−c} ≥ 2t̄. By the Chernoff bound (Lemma 2),

 Pr[t_u ≤ t̄] = Pr[t_u ≤ (1 − 1/2)·2t̄] ≤ e^{−t̄/4} ≤ δ/(4mn).

The last inequality needs t̄ ≥ 4·ln(4mn/δ), which is satisfied for all nontrivial instances. ∎

###### Lemma 5.

For any v ∈ R and u ∈ L such that v ∉ N_G(u), Pr_{S∼D}[v ∈ N_G(S) ∣ u ∈ S] ≤ Pr_{S∼D}[v ∈ N_G(S)].

###### Proof.

Just note that the event v ∈ N_G(S) is equivalent to ∨_{w∈N_G(v)}(w ∈ S), and u ∉ N_G(v). The lemma then follows directly from Lemma 1 and Assumption 1.3. ∎

###### Lemma 6.

For any v ∈ R and u ∈ L such that v ∉ N_G(u), Pr[v ∈ Ñ_G(u) ∖ N_G(u), t_u = ℓ] ≤ q_v^ℓ · Pr[t_u = ℓ] for any ℓ ≥ 0.

###### Proof.

By the law of total probability, the left-hand side is equal to ∑_{I⊆[t]:|I|=ℓ} Pr[v ∈ Ñ_G(u) ∖ N_G(u), u ∈ ∩_{i∈I}S_i, u ∉ ∪_{j∉I}S_j]. Since the S_i's are independent samples, by the construction of Ñ_G(u) and Lemma 5, we have

 Pr[v ∈ Ñ_G(u) ∖ N_G(u), u ∈ ∩_{i∈I}S_i, u ∉ ∪_{j∉I}S_j] = Pr[v ∈ ∩_{i∈I}N_G(S_i), u ∈ ∩_{i∈I}S_i, u ∉ ∪_{j∉I}S_j] = ∏_{i∈I}Pr[v ∈ N_G(S_i), u ∈ S_i] · ∏_{j∉I}Pr[u ∉ S_j] ≤ ∏_{i∈I}(Pr[v ∈ N_G(S_i)]·Pr[u ∈ S_i]) · ∏_{j∉I}Pr[u ∉ S_j] = ∏_{i∈I}Pr[v ∈ N_G(S_i)] · ∏_{i∈I}Pr[u ∈ S_i] · ∏_{j∉I}Pr[u ∉ S_j] = q_v^ℓ · Pr[u ∈ ∩_{i∈I}S_i, u ∉ ∪_{j∉I}S_j].

Thus

 Pr[v ∈ Ñ_G(u) ∖ N_G(u), t_u = ℓ] ≤ q_v^ℓ · ∑_{I⊆[t]:|I|=ℓ} Pr[u ∈ ∩_{i∈I}S_i, u ∉ ∪_{j∉I}S_j] = q_v^ℓ · Pr[t_u = ℓ]. ∎

###### Lemma 7.

Assume that t ≥ 2t̄·n^c. Then Pr[R_1 ∩ (∪_{u∈L}(Ñ_G(u) ∖ N_G(u))) = ∅] ≥ 1 − δ/2.

###### Proof.

For a node v ∈ R_1 and a node u ∈ L such that v ∉ N_G(u), we have

 Pr[v ∈ Ñ_G(u) ∖ N_G(u)] = ∑_{ℓ≥0} Pr[v ∈ Ñ_G(u) ∖ N_G(u), t_u = ℓ] ≤ ∑_{ℓ≥0} Pr[t_u = ℓ]·q_v^ℓ ≤ ∑_{ℓ≤t̄} Pr[t_u = ℓ]·1 + ∑_{ℓ>t̄} Pr[t_u = ℓ]·q_v^{t̄} = Pr[t_u ≤ t̄] + Pr[t_u > t̄]·q_v^{t̄} ≤ δ/(4mn) + (1 − δ/(2m))^{(2m/δ)·ln(4mn/δ)} ≤ δ/(4mn) + δ/(4mn) = δ/(2mn).

The first inequality holds due to Lemma 6. The second-to-last inequality holds due to Lemma 4, the fact that Pr[t_u > t̄] ≤ 1, and q_v ≤ 1 − δ/(2m) for all v ∈ R_1. Finally, by the union bound, we have

 Pr[R_1 ∩ (∪_{u∈L}(Ñ_G(u) ∖ N_G(u))) ≠ ∅] = Pr[∃ v ∈ R_1, u ∈ L s.t. v ∈ Ñ_G(u) ∖ N_G(u)] ≤ ∑_{v∈R_1, u∈L} Pr[v ∈ Ñ_G(u) ∖ N_G(u)] ≤ ∑_{v∈R_1, u∈L} δ/(2mn) ≤ δ/2.

The proof is completed. ∎

###### Lemma 8.

For a single sample S ∼ D, Pr[R_2 ⊆ N_G(S)] ≥ 1 − δ/2.

###### Proof.

For a node v ∈ R_2, by definition, 1 − q_v < δ/(2m). By the union bound, we have Pr[∃ v ∈ R_2 s.t. v ∉ N_G(S)] ≤ ∑_{v∈R_2}(1 − q_v) < |R_2|·δ/(2m) ≤ δ/2. That is, Pr[R_2 ⊆ N_G(S)] ≥ 1 − δ/2. ∎

Proof of Lemma 3. By Lemma 7, with probability 1 − δ/2, R_1 ∩ (∪_{u∈L}(Ñ_G(u) ∖ N_G(u))) = ∅, and therefore ∪_{u∈L}(Ñ_G(u) ∖ N_G(u)) ⊆ R_2. On the other hand, by Lemma 8, with probability 1 − δ/2, R_2 ⊆ N_G(S_1). Finally, by the union bound, ∪_{u∈L}(Ñ_G(u) ∖ N_G(u)) ⊆ N_G(S_1) with probability 1 − δ. ∎

### 3.2 A Tight Algorithm for OPSS under Dk

In this section, we present a tight algorithm for OPSS under the distribution D_k, the uniform distribution over all subsets of L of size exactly k. Compared with Algorithm 1, Algorithm 2 takes an additional input ϵ and has two other modifications. First, when constructing T_2, the cardinality constraint k is replaced by (1 − ϵ/2)k, which incurs only a small loss in the approximation ratio. Second, instead of assigning a sample to T_1, the algorithm picks a set of size (ϵ/2)k uniformly at random and assigns it to T_1. The key observation is that under distribution D_k, although T_1 is quite small, it suffices to cover the nodes in R_2 with high probability. However, this is not true for general distributions. As a result, T_1 ∪ T_2 yields an (α − ϵ)-approximation for the problem, and it is also feasible since |T_1| + |T_2| ≤ k.

We begin the analysis with some notation. In the analysis, we assume that k and n are sufficiently large; this is a sufficient condition for a key inequality, as we will further explain after Theorem 6. For any node u ∈ L, let t_u be the number of samples in which u appears. For any node v ∈ R, let q_v be the probability that v is covered by a sample S ∼ D_k, and let d(v) denote the number of v's neighbors. Partition R into two subsets R_1 = {v ∈ R : d(v) ≤ (2n/(ϵk))·ln(2m/ϵ)} and R_2 = R ∖ R_1. While in the general case discussed in the previous section R is partitioned according to the value of q_v, here we partition R according to the value of d(v). The reason is that D_k is a uniform distribution, so for v ∈ R, the more neighbors it has, the higher the probability that it is covered by a sample S ∼ D_k. This observation is formalized as Lemma 9. Based on it, we can show that with high probability nodes in R_1 will not appear in ∪_{u∈L}(Ñ_G(u) ∖ N_G(u)) (Lemma 10). Besides, q_v increases rapidly with respect to d(v); thus, instead of picking a sample from D_k, drawing a much smaller random set for T_1 suffices to cover the nodes in R_2 (Lemma 11).
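Analogously to the sketch of Algorithm 1, the two modifications for D_k can be sketched as follows (again our reading of the construction, with hypothetical helper names; the greedy subroutine stands in for the α-approximation algorithm A):

```python
import random
from math import ceil, floor
from typing import Dict, Set

def greedy(neigh: Dict[str, Set[int]], k: int) -> Set[str]:
    """Stand-in for the alpha-approximation subroutine A."""
    covered: Set[int] = set()
    chosen: Set[str] = set()
    for _ in range(min(k, len(neigh))):
        best = max(neigh, key=lambda u: len(neigh[u] - covered))
        chosen.add(best)
        covered |= neigh[best]
    return chosen

def algorithm2(samples, k: int, eps: float, ground_L, rng=random) -> Set[str]:
    # Surrogate neighborhoods, exactly as in Algorithm 1.
    est: Dict[str, Set[int]] = {}
    for S, covered in samples:
        for u in S:
            est[u] = est[u] & covered if u in est else set(covered)
    # Modification 1: run A with the reduced budget (1 - eps/2) * k.
    T2 = greedy(est, floor((1 - eps / 2) * k))
    # Modification 2: T1 is a uniformly random set of the remaining budget.
    T1 = set(rng.sample(sorted(ground_L), ceil(eps * k / 2)))
    return T1 | T2  # floor((1-eps/2)k) + ceil(eps*k/2) = k, so the union is feasible

# Toy run on a hypothetical instance.
N_G = {"a": {1, 2}, "b": {2, 3}, "c": {4}, "d": {5}}
rng = random.Random(0)
samples = [(frozenset(S), set().union(*(N_G[u] for u in S)))
           for S in (rng.sample(sorted(N_G), 2) for _ in range(100))]
result = algorithm2(samples, 2, 0.5, set(N_G), rng)
print(result)
```

The floor/ceil split of the budget guarantees |T_1| + |T_2| ≤ k for any ϵ, since for an integer k the two parts always sum to exactly k.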

###### Lemma 9.

For any v ∈ R_1, 1 − q_v ≥ (ϵ/2m)^{8/ϵ}.

###### Proof.

It is easy to verify that when k ≤ n/2, we have n − k + 1 ≥ n/2. Moreover, for any v ∈ R_1, d(v) ≤ (2n/(ϵk))·ln(2m/ϵ) ≤ n/4 by the assumed lower bounds on k and n. Together with the definition of q_v under D_k, we have

 1 − q_v = C(n−d(v), k) / C(n, k) = [(n−d(v)) ⋯ (n−d(v)−k+1)] / [n ⋯ (n−k+1)] ≥ (1 − d(v)/(n−k+1))^k ≥ (1 − d(v)/(n/2))^k ≥ exp(−4k·d(v)/n) ≥ exp(−(8/ϵ)·ln(2m/ϵ)) = (ϵ/2m)^{8/ϵ}.

The third inequality holds since 1 − x ≥ e^{−2x} for x ∈ [0, 1/2]. The last inequality holds since d(v) ≤ (2n/(ϵk))·ln(2m/ϵ) for all v ∈ R_1. ∎
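As a sanity check on the first inequality in the chain above, C(n−d, k)/C(n, k) ≥ (1 − d/(n−k+1))^k can be verified numerically for arbitrary small parameters (the values below are illustrative):

```python
from math import comb

def ratio_lower_bound_holds(n: int, k: int, d: int) -> bool:
    """Check C(n-d, k)/C(n, k) >= (1 - d/(n-k+1))**k, which holds because each
    factor (n-d-j)/(n-j) = 1 - d/(n-j) is at least 1 - d/(n-k+1) for j < k."""
    lhs = comb(n - d, k) / comb(n, k)
    rhs = (1 - d / (n - k + 1)) ** k
    return lhs >= rhs - 1e-12

print(all(ratio_lower_bound_holds(50, 10, d) for d in range(41)))  # → True
```

The small epsilon in the comparison only guards against floating-point rounding; the inequality itself is exact.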

Similar to Lemmas 7 and 8, we show the following lemmas. The proofs are included in Section 3.2.1.

###### Lemma 10.

Assume that the number of samples t is sufficiently large (polynomial in n, m, 1/ϵ and 1/δ). We have

 Pr_{S_1,⋯,S_t∼D_k}[R_1 ∩ (∪_{u∈L}(Ñ_G(u) ∖ N_G(u))) = ∅] ≥ 1 − δ.
###### Lemma 11.

With probability at least 1 − δ, R_2 ⊆ N_G(T_1).

Now we prove Theorem 6, which is a more concrete version of Theorem 2.

###### Theorem 6.

For any constant ϵ > 0, given any α-approximation algorithm A for the standard maximum coverage problem, coverage functions are (α − ϵ)-optimizable under OPSS in the cardinality constraint over D_k, assuming that k and n are sufficiently large. More precisely, for any δ ∈ (0, 1), suppose that the number of samples t is sufficiently large (polynomial in n, m, 1/ϵ and 1/δ). Let ALG be the solution returned by Algorithm 2 and OPT be the optimal solution on the original graph G. Then

 Pr_{S_1,⋯,S_t∼D_k}[E[f_G(ALG)] ≥ (α − ϵ)·f_G(OPT)] ≥ 1 − δ.
###### Proof.

By the construction of G̃, N_G(u) ⊆ Ñ_G(u) for any u ∈ L. Therefore, G is a subgraph of G̃ and f_G ≤ f̃_G pointwise. Let OPT_ℓ be the optimal solution when selecting at most ℓ elements. Since A is an α-approximation algorithm run with budget (1 − ϵ/2)k,

 f̃_G(T_2) ≥ α·f̃_G(OPT_{(1−ϵ/2)k}) ≥ α(1 − ϵ/2)·f̃_G(OPT_k) ≥ α(1 − ϵ/2)·f_G(OPT),

where the second inequality above uses the submodularity of coverage functions.

Let be the event