 # Learning and Generalization for Matching Problems

We study a classic algorithmic problem through the lens of statistical learning. That is, we consider a matching problem where the input graph is sampled from some distribution. This distribution is unknown to the algorithm; however, an additional graph sampled from the same distribution is given during a training phase (preprocessing). More specifically, the algorithmic problem is to match $k$ out of $n$ items that arrive online to $d$ categories ($d \ll k \ll n$). Our goal is to design a two-stage online algorithm that, in the first stage, retains a small subset of items containing an offline matching of maximum weight. We then compute this optimal matching in the second stage. The added statistical component is that before the online matching process begins, our algorithms learn from a training set consisting of another matching instance drawn from the same unknown distribution. Using this training set, we learn a policy that we apply during the online matching process. We consider a class of online policies that we term thresholds-policies. For this class, we derive uniform convergence results both for the number of retained items and the value of the optimal matching. We show that the number of retained items and the value of the offline optimal matching deviate from their expectation by $O(\sqrt{k})$. This requires less-standard concentration inequalities (standard ones give deviations of $O(\sqrt{n})$). Furthermore, we design an algorithm that outputs the optimal offline solution with high probability while retaining only $O(k \log\log n)$ items in expectation.


## 1 Introduction

Matching is the bread-and-butter of many real-life problems from the fields of computer science, operations research, game theory, and economics. Some examples include job scheduling where we assign jobs to machines, economic markets where we allocate products to buyers, online advertising where we assign advertisers to ad slots, assigning medical interns to hospitals, and many more.

Let us now discuss a particular motivating example from labor markets in detail. Imagine a firm that is planning a large recruitment. Candidates arrive one-by-one and the HR department immediately decides whether to summon them for an interview. Moreover, the firm has multiple departments, each requiring different skills and having a different target number of hires. Different employees have different subsets of the required skills, and thus fit only certain departments and with a certain quality. The firm’s HR department, following the interviews, decides which candidates to recruit and to which departments to assign them. The HR department has to maximize the total quality of the hired employees such that each department gets its required number of hires with the required skills. In addition, the HR uses data from the previous recruitment season in order to minimize the number of interviews while not compromising the quality of the solution.

To formulate the example above, we study the following problem. We receive $n$ items, where each item has a subset of $d$ properties denoted by $P_1,\ldots,P_d$. We select $k$ items out of the $n$, subject to $d$ constraints of the form

exactly $k_i$ of the selected items must satisfy property $P_i$,

where $\sum_{i=1}^{d} k_i = k$ and we assume that $d \ll k \ll n$. Furthermore, if item $c$ possesses property $P_i$, then it has a value $v_i(c)$ associated with this property. Our goal is to compute a matching of maximum value that associates items to the properties subject to the constraints above.

We consider matching algorithms in the following online setting. Before the matching process begins, there is a preprocessing phase in which the algorithm learns an online policy from a training set. The training set is a single problem instance that consists of $n$ items drawn independently from an unknown distribution $D$. Following the preprocessing phase, the algorithm receives an additional $n$ items online, also drawn independently from $D$, and uses the learned policy to either reject or retain each item. Finally, the algorithm utilizes the retained items and outputs an (approximately) optimal feasible solution.

We address the statistical aspects of this problem and develop efficient learning algorithms. In particular, we define a class of thresholds-policies. Each thresholds-policy is a simple rule for deciding whether to retain an item. We present uniform convergence rates for both the number of items retained by a thresholds-policy and the value of the resulting solution. We show that these quantities deviate from their expected value by order of $\sqrt{k}$ (rather than an easier $\sqrt{n}$ bound; recall that we assume $k \ll n$), which we prove using non-trivial concentration inequalities and tools from VC-theory.

Lastly, using these concentration inequalities, we analyze an efficient online algorithm that returns the optimal offline solution with high probability, and retains a near-optimal number of items in expectation. We show that this is an improvement over a naive greedy algorithm that always returns the optimal solution and retains items in expectation while ignoring the training set.

#### Related work.

Celis et al. (2017, 2018) study similar problems of ranking and voting with fairness constraints. In fact, the optimization problem that they consider allows more general constraints, and the value of a candidate is determined from votes/comparisons. The main difference from our framework is that they do not consider a statistical setting (i.e., there is no distribution over the items and no training set for preprocessing) and focus mostly on approximation algorithms for the optimization problem.

Our model is related to the online secretary problem in which one needs to select the best secretary in an online manner (see Ferguson, 1989). Our setting differs from this classical model due to the two-stage process and the complex feasibility constraints. Nonetheless, we remark that there are a few works on the secretary model that allow delayed selection (see Vardi, 2015; Ezra et al., 2018) as well as matroid constraints (Babaioff et al., 2007). These works differ from ours in the way the decision is made, in the feasibility constraints, and in the learning aspect of receiving a single problem instance as a training example.

Another related line of work in algorithmic economics studies the statistical learnability of pricing schemes (see e.g., Morgenstern and Roughgarden, 2015, 2016; Hsu et al., 2016; Balcan et al., 2018). The main difference of these works from ours is that our training set consists of a single “example” (namely the set of items that are used for training), whereas in their setting (as well as in most typical statistical learning settings) the training set consists of many i.i.d. examples. This difference also affects the technical tools used for obtaining generalization bounds. For example, some of our bounds exploit Talagrand’s concentration inequality rather than the more standard Chernoff/McDiarmid/Bernstein inequalities. We note that Talagrand’s inequality and other advanced inequalities were applied in machine learning in the context of learning combinatorial functions (Vondrák, 2010; Blum et al., 2017). See also the survey by Bousquet et al. (2004) or the book by Boucheron et al. (2013) for a more thorough review of concentration inequalities.

Furthermore, there is a large body of work on online matching in which the vertices arrive in various models (see Mehta et al., 2013; Gupta and Molinaro, 2016). We differ from this line of research by allowing a two-stage algorithm and requiring it to output the optimal matching in the second stage.

## 2 Our model and results

Let $X$ be a domain of items, where each item $c \in X$ can possess any subset of $d$ properties denoted by $P_1,\ldots,P_d$ (we view $P_i \subseteq X$ as the set of items having property $P_i$). Each item $c$ has a value $v_i(c) \in [0,1]$ associated with each property $P_i$ such that $c \in P_i$.

We are given a set $C \subseteq X$ of $n$ items as well as counts $k_1,\ldots,k_d$ such that $\sum_{i=1}^{d} k_i = k$. Our goal is to select exactly $k$ items in total, constrained on selecting exactly $k_i$ items with property $P_i$. We assume that these constraints are exclusive, in the sense that each item in $C$ can be used to satisfy at most one of the constraints. Formally, a feasible solution is a subset $S \subseteq C$ such that $|S| = k$ and there is a partition of $S$ into $d$ disjoint subsets $S_1,\ldots,S_d$ such that $S_i \subseteq P_i$ and $|S_i| = k_i$. We aim to compute a feasible subset $S$ that maximizes $\sum_{i=1}^{d}\sum_{c \in S_i} v_i(c)$.

Furthermore, we assume that $d \ll k \ll n$. Namely, the number of constraints is much smaller than the number of items that we have to select, which is much smaller than the total number of items in $C$. In order to avoid feasibility issues we assume that there is a set of $k$ dummy 0-value items with all the properties (we assume that the algorithm always has access to these dummy items and do not view them as part of $C$).

#### The offline problem

We first discuss the offline versions of these allocation problems. That is, we assume that $C$ and the capacities $k_1,\ldots,k_d$ are all given as an input before the algorithm starts. We are interested in an algorithm for computing an optimal set $S$, that is, a set of $k$ items of maximum total value that satisfies the constraints. This problem is equivalent to a maximum-weight matching problem in a bipartite graph defined as follows.

• One side of the bipartite graph contains $k$ vertices, where each constraint $i$ is represented by $k_i$ of these vertices.

• The other side of the bipartite graph contains a vertex for each item $c \in C$ and for each dummy item.

• Each item vertex is connected by an edge to every vertex of each of the constraints that the item satisfies.

• The weight of an edge between item $c$ and a vertex of constraint $i$ is $v_i(c)$: the value of item $c$ associated with property $P_i$.

There is a natural correspondence between saturated matchings in this graph, that is, matchings in which every constraint vertex is matched, and feasible solutions (i.e., solutions that satisfy the constraints) to the allocation problem. Thus, a saturated matching of maximum value corresponds to an optimal solution. It is well known that the problem of finding such a maximum-weight bipartite matching can be solved in polynomial time (see e.g., Lawler, 2001).
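To make the reduction concrete, here is a minimal sketch that builds one slot per required selection and searches all assignments exhaustively. It is a brute-force illustration for tiny inputs only, not the polynomial-time matching algorithm the text refers to. The item encoding (a dictionary from property index to value), the use of `None` for dummy items, and the function name `offline_optimum` are assumptions made for this example.

```python
from itertools import permutations
from typing import Dict, List, Optional, Tuple

Item = Dict[int, float]  # hypothetical encoding: property index -> value of the item for it


def offline_optimum(items: List[Item], counts: List[int]) -> Tuple[float, List[Tuple[int, int]]]:
    """Brute-force a maximum-value saturated matching (only sensible for tiny inputs)."""
    # One slot per required selection: property i contributes counts[i] slots.
    slots = [prop for prop, count in enumerate(counts) for _ in range(count)]
    # None stands for a zero-value dummy item that fits every property.
    candidates: List[Optional[int]] = list(range(len(items))) + [None] * len(slots)

    best_value, best_matching = -1.0, []
    for assignment in permutations(candidates, len(slots)):
        pairs = list(zip(assignment, slots))
        # Skip assignments that use an item for a property it does not have.
        if any(c is not None and p not in items[c] for c, p in pairs):
            continue
        value = sum(items[c][p] for c, p in pairs if c is not None)
        if value > best_value:
            best_value = value
            best_matching = [(c, p) for c, p in pairs if c is not None]
    return best_value, best_matching


items = [{0: 0.9}, {0: 0.5, 1: 0.8}, {1: 0.7}, {0: 0.3}, {1: 0.2}]
value, matching = offline_optimum(items, counts=[1, 2])
print(round(value, 6), sorted(matching))
```

In this toy instance the second item has both properties, and the search assigns it to the property where it contributes more to the total.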

### 2.1 Our results

In our work, we consider the following online learning model. We assume that $n$ items are sequentially drawn i.i.d. from an unknown distribution $D$ over the domain of items. Upon receiving each item, we decide whether to retain it, or reject it irrevocably (the first stage of the algorithm). Thereafter, we select a feasible solution consisting only of retained items (the second stage of the algorithm); in addition to the retained items, the algorithm has access to the dummy items. Most importantly, before accessing the online sequence and taking irreversible online decisions of which items to reject, we have access to a training set consisting of $n$ independent draws from $D$. We design online algorithms that use the training set to learn a thresholds-policy $T$ such that with high probability: (i) the number of items that are retained in the online phase is small, and (ii) there is a feasible solution consisting of retained items whose value is optimal (or close to optimal).

Thresholds-policies are studied in Section 3 and are defined as follows.

###### Definition 1 (Thresholds-policies).

A thresholds-policy is parametrized by a vector $T = (t_1,\ldots,t_d)$ of thresholds, where $t_i$ corresponds to property $P_i$ for $1 \le i \le d$. The semantics of $T$ is as follows: given a sample $C$ of $n$ items, each item $c \in C$ is retained if and only if there exists a property $P_i$ satisfied by $c$, such that its value $v_i(c)$ passes the threshold $t_i$. More formally, $c$ is retained if and only if $\exists i \in [d]$ such that $c \in P_i$ and $v_i(c) \ge t_i$.
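The retention rule of Definition 1 can be sketched in a few lines. The encoding of an item as a dictionary from property index to value, and the function names, are assumptions made for illustration only.

```python
from typing import Dict, List

Item = Dict[int, float]  # hypothetical encoding: property index -> value for that property


def is_retained(item: Item, thresholds: List[float]) -> bool:
    """Retain the item iff some property it has meets that property's threshold."""
    return any(value >= thresholds[prop] for prop, value in item.items())


def retained_items(sample: List[Item], thresholds: List[float]) -> List[Item]:
    """First-stage filter: apply the thresholds-policy to every item of the sample."""
    return [item for item in sample if is_retained(item, thresholds)]


sample = [{0: 0.9}, {0: 0.5, 1: 0.8}, {1: 0.7}, {0: 0.3}, {1: 0.2}]
kept = retained_items(sample, thresholds=[0.9, 0.7])
print(len(kept))  # 3
```

An item with several properties is retained as soon as one of them passes; here the second item fails the first threshold but passes the second.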

Thresholds policies are highly attractive. In fact, the optimal solution in hindsight is a thresholds-policy in itself. This is formalized by the following theorem.

###### Theorem 2 (Existence of a thresholds-policy that retains an optimal solution).

For any set of items $C$, there exists a thresholds vector $T$ that retains exactly $k$ items that participate in an optimal solution for $C$.

For a sample $C$ and a thresholds-policy $T$, we denote by $R^T(C)$ the set of items that are retained by $T$, and we consider its expected size $\mathbb{E}[|R^T(C)|]$. Similarly, we denote by $R^T_i(C)$ the items retained by the threshold of property $i$, and consider the expectation $\mathbb{E}[|R^T_i(C)|]$ of its size. We prove that the sizes of $R^T(C)$ and $R^T_i(C)$ are concentrated around their expectations uniformly for all thresholds-policies.

###### Theorem 3 (Uniform convergence of the total number of retained items).

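With probability at least $1-\delta$ over $C \sim D^n$, the following holds for all thresholds-policies $T \in \mathcal{T}$ simultaneously:

1. If $\mathbb{E}[|R^T(C)|] \ge k$, then $(1-\epsilon)\,\mathbb{E}[|R^T(C)|] \le |R^T(C)| \le (1+\epsilon)\,\mathbb{E}[|R^T(C)|]$, and

2. if $\mathbb{E}[|R^T(C)|] < k$, then $\mathbb{E}[|R^T(C)|] - \epsilon k \le |R^T(C)| \le \mathbb{E}[|R^T(C)|] + \epsilon k$,

where

$$\epsilon = O\left(\sqrt{\frac{d\log(d)\log(n/k)+\log(1/\delta)}{k}}\right).$$

###### Theorem 4 (Uniform convergence of the number of retained items per constraint).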

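With probability at least $1-\delta$ over $C \sim D^n$, the following holds for all thresholds-policies $T \in \mathcal{T}$ and all $i \le d$ simultaneously:

1. If $\mathbb{E}[|R^T_i(C)|] \ge k$, then $(1-\epsilon)\,\mathbb{E}[|R^T_i(C)|] \le |R^T_i(C)| \le (1+\epsilon)\,\mathbb{E}[|R^T_i(C)|]$, and

2. if $\mathbb{E}[|R^T_i(C)|] < k$, then $\mathbb{E}[|R^T_i(C)|] - \epsilon k \le |R^T_i(C)| \le \mathbb{E}[|R^T_i(C)|] + \epsilon k$,

where

$$\epsilon = O\left(\sqrt{\frac{\log(d)\log(n/k)+\log(1/\delta)}{k}}\right).$$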

Furthermore, we denote by $V^T(C)$ the value of the optimal solution among the items retained by the thresholds-policy $T$, and we denote its expectation by $\nu_T$. We show that $V^T(C)$ is also concentrated uniformly for all thresholds-policies.

###### Theorem 5 (Uniform convergence of values).

With probability at least $1-\delta$ over $C \sim D^n$, the following holds for all thresholds-policies $T \in \mathcal{T}$ simultaneously:

$$\left|\nu_T - V^T(C)\right| \le \epsilon k, \quad\text{where}\quad \epsilon = O\left(\sqrt{\frac{d\log k+\log(1/\delta)}{k}}\right).$$

We note that a bound of $O(\sqrt{n})$ (rather than $O(\sqrt{k})$) on the additive deviation of $V^T(C)$ from its expectation can be derived using McDiarmid’s inequality (McDiarmid, 1989). However, this bound is meaningless when $\sqrt{n} \ge k$ (because $k$ upper bounds the value of the optimal solution). We use Talagrand’s concentration inequality (Talagrand, 1995) to derive the $O(\sqrt{k})$ upper bound on the additive deviation. Talagrand’s concentration inequality allows us to utilize the fact that an optimal solution uses only $k$ items, and therefore replacing an item that does not participate in the solution does not affect its value.

We next use these uniform convergence results to design our learning algorithms. In Section 4 we prove the following.

###### Theorem 6.

There exists an algorithm that learns a thresholds-policy $T$ from a single training sample, such that when processing online the “test sample” using $T$, then

• It outputs an optimal solution with probability at least $1-\delta$.

• Its expected number of retained items in the first phase is $O\left(k\left(\log d + \log\log(n/k) + \log\log(1/\delta)\right)\right)$.

We compare this result to an oblivious greedy online algorithm that ignores the training set. In the first phase, this greedy algorithm acts greedily by keeping an item if it participates in the best solution thus far. In the second phase, the algorithm computes an optimal matching among the retained items. We have the following guarantee for this greedy algorithm proven in Section A.1.
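As a hedged illustration of the greedy first phase, consider the special case of a single property ($d = 1$). Under that assumption, "participates in the best solution thus far" reduces to "is among the $k$ largest values seen so far", which a min-heap tracks directly. The general case requires re-solving a matching after each arrival and is not covered by this sketch; the function name is hypothetical.

```python
import heapq
from typing import List


def greedy_single_property(values: List[float], k: int) -> List[float]:
    """Retain an item iff it is among the k largest values seen so far (case d = 1)."""
    top_k: List[float] = []    # min-heap holding the current best solution
    retained: List[float] = []
    for v in values:
        if len(top_k) < k:
            heapq.heappush(top_k, v)
            retained.append(v)
        elif v > top_k[0]:
            # v displaces the weakest member of the current best solution.
            heapq.heapreplace(top_k, v)
            retained.append(v)
    return retained


values = [0.3, 0.9, 0.1, 0.5, 0.95, 0.2]
retained = greedy_single_property(values, k=2)
print(retained)  # [0.3, 0.9, 0.5, 0.95]
```

The retained list always contains the final top $k$ values, so the second phase can recover the optimum from it, at the cost of also keeping items that were only temporarily in the best solution.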

###### Theorem 7.

The greedy algorithm always outputs the optimal solution and retains items in expectation.

Thus, with the additional information given by the training set, the algorithm presented in Theorem 6 improves the dependence from to .

Finally, in Section 5 we show a lower bound implying that our algorithm is nearly-optimal in the following sense.

###### Theorem 8.

Consider the case where and . There exists a universe and a distribution over such that for the following holds: Any online learning algorithm that retains a subset of items that contains an optimal solution must satisfy that .

## 3 Thresholds-policies

We next discuss a framework to design algorithms that exploit the training set to learn policies that are applied in the first phase of the matching process. We would like to frame this in standard ML formalism by phrasing this problem as learning a class of policies such that:

• The class is not too small: The policies in the class should yield solutions with high values (optimal, or near-optimal).

• The class is not too large: It should satisfy some uniform convergence properties; i.e., the performance of each policy in the class on the training set is close, with high probability, to its expected real-time performance on the sampled items during the online selection process.

Indeed, as we now show these demands are met by the class of thresholds policies (Definition 1). We first show that the class of thresholds-policies contains an optimal policy, and in the sequel we show that it satisfies attractive uniform convergence properties.

#### An assumption (values are unique).

We assume that for each constraint $i$, the marginal distribution over the value $v_i(c)$ conditioned on $c \in P_i$ is atomless; namely $\Pr_{c \sim D}[v_i(c) = v \mid c \in P_i] = 0$ for every $v \in [0,1]$. This assumption can be removed by adding artificial tie-breaking rules, but making it will simplify some of the technical statements.

###### Theorem (There is a thresholds policy that retains an optimal solution – restatement of Theorem 2).

For any set of items $C$, there exists a thresholds vector $T$ that retains exactly $k$ items that form an optimal solution for $C$.

###### Proof.

Let $S$ denote the set of $k$ items in an optimal solution for $C$, and let $S_i$ be the subset of $S$ that is assigned to the constraint $i$. Define $t_i = \min_{c \in S_i} v_i(c)$, for $i \le d$. Clearly, $T = (t_1,\ldots,t_d)$ retains all the items in $S$. Assume towards contradiction that $T$ retains an item $c \notin S$, and assume that $i$ is a constraint such that $c \in P_i$ and $v_i(c) \ge t_i$. Since by our assumption all the values are distinct, it follows that $v_i(c) > t_i$. Thus, we can modify $S$ by replacing the item of minimum value in $S_i$ with $c$ and increase the total value. This contradicts the optimality of $S$. ∎
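The construction in this proof can be checked on a small instance. The sketch below takes an optimal partition as given (found by hand for the toy data, with one selection required for property 0 and two for property 1), sets each threshold to the minimum value in the corresponding part, and verifies which items pass. The item encoding and helper names are assumptions for illustration, and every part is assumed to be non-empty.

```python
from typing import Dict, List

Item = Dict[int, float]  # hypothetical encoding: property index -> value


def thresholds_from_solution(items: List[Item], parts: Dict[int, List[int]],
                             num_properties: int) -> List[float]:
    """Threshold of property i = smallest value among the items assigned to property i."""
    return [min(items[c][i] for c in parts[i]) for i in range(num_properties)]


def retained_indices(items: List[Item], thresholds: List[float]) -> List[int]:
    """Indices of items with at least one property value meeting its threshold."""
    return [idx for idx, item in enumerate(items)
            if any(v >= thresholds[p] for p, v in item.items())]


items = [{0: 0.9}, {0: 0.5, 1: 0.8}, {1: 0.7}, {0: 0.3}, {1: 0.2}]
parts = {0: [0], 1: [1, 2]}  # optimal partition for this toy instance, found by hand
thresholds = thresholds_from_solution(items, parts, num_properties=2)
print(thresholds)                           # [0.9, 0.7]
print(retained_indices(items, thresholds))  # [0, 1, 2]
```

Exactly the three items of the optimal solution are retained, matching the statement of the theorem for this instance.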

We next establish generalization bounds for the class of thresholds-policies.

### 3.1 Uniform convergence of the number of retained items

The following theorems establish uniform convergence results for the number of retained items. Namely, with high probability we have $|R^T_i(C)| \approx \mathbb{E}[|R^T_i(C)|]$, simultaneously for all $T \in \mathcal{T}$ and $i \le d$.

###### Theorem (Uniform convergence of the number of retained items – restatement of Theorem 3).

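With probability at least $1-\delta$ over $C \sim D^n$, the following holds for all thresholds-policies $T \in \mathcal{T}$ simultaneously:

1. If $\mathbb{E}[|R^T(C)|] \ge k$, then $(1-\epsilon)\,\mathbb{E}[|R^T(C)|] \le |R^T(C)| \le (1+\epsilon)\,\mathbb{E}[|R^T(C)|]$, and

2. if $\mathbb{E}[|R^T(C)|] < k$, then $\mathbb{E}[|R^T(C)|] - \epsilon k \le |R^T(C)| \le \mathbb{E}[|R^T(C)|] + \epsilon k$,

where

$$\epsilon = O\left(\sqrt{\frac{d\log(d)\log(n/k)+\log(1/\delta)}{k}}\right).$$

###### Theorem (Uniform convergence of the number of retained items per constraint – restatement of Theorem 4).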

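With probability at least $1-\delta$ over $C \sim D^n$, the following holds for all thresholds-policies $T \in \mathcal{T}$ and all $i \le d$ simultaneously:

1. If $\mathbb{E}[|R^T_i(C)|] \ge k$, then $(1-\epsilon)\,\mathbb{E}[|R^T_i(C)|] \le |R^T_i(C)| \le (1+\epsilon)\,\mathbb{E}[|R^T_i(C)|]$, and

2. if $\mathbb{E}[|R^T_i(C)|] < k$, then $\mathbb{E}[|R^T_i(C)|] - \epsilon k \le |R^T_i(C)| \le \mathbb{E}[|R^T_i(C)|] + \epsilon k$,

where

$$\epsilon = O\left(\sqrt{\frac{\log(d)\log(n/k)+\log(1/\delta)}{k}}\right).$$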

The proofs of Theorem 3 and Theorem 4 are based on standard VC-based uniform convergence results, and technically they boil down to bounding the VC dimension of the families

$$\mathcal{R} = \{R^T : T \in \mathcal{T}\} \quad\text{and}\quad \mathcal{Q} = \{R^T_i : T \in \mathcal{T},\ i \le d\}.$$

#### Technical notation.

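For $m \in \mathbb{N}$, the set $\{1,\ldots,m\}$ is denoted by $[m]$. Given a family of sets $\mathcal{F}$ over a domain $X$, and $Y \subseteq X$, the family $\{F \cap Y : F \in \mathcal{F}\}$ is denoted by $\mathcal{F}|_Y$. Recall that the VC dimension of $\mathcal{F}$ is the maximum size of $Y \subseteq X$ such that $\mathcal{F}|_Y$ contains all subsets of $Y$.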

###### Proof.

Let $Y$ be a set of items shattered by $\mathcal{R}$ and denote its size by $m$; since $Y$ is arbitrary, an upper bound on $m$ implies an upper bound on the VC dimension of $\mathcal{R}$. To this end we upper bound the number of subsets in $\mathcal{R}|_Y$. Now, there are $m$ items in $Y$ with at most $m$ different values for each property. Therefore, we can restrict our attention to thresholds-policies where each threshold $t_i$ is picked from a fixed set of $m+1$ meaningful locations (one location in between the values of two consecutive items when we sort the items by value). Thus $|\mathcal{R}|_Y| \le (m+1)^d$, but, as $Y$ is shattered, $|\mathcal{R}|_Y| = 2^m$ and we get $2^m \le (m+1)^d$. This implies $m \le d\log(m+1)$, from which we conclude that $m = O(d\log d)$. ∎

###### Proof.

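For $i \le d$, let $\mathcal{Q}_i = \{R^T_i : T \in \mathcal{T}\}$. Note that $\mathcal{Q} = \bigcup_{i \le d} \mathcal{Q}_i$. We claim that the VC dimension of $\mathcal{Q}_i$ is at most $1$ for all $i \le d$. Indeed, let $c', c''$ be two items. Note that if $c' \notin P_i$ or $c'' \notin P_i$ then $\{c', c''\}$ is not contained in any set of $\mathcal{Q}_i$ and therefore is not shattered by it. Therefore, assume that $c', c'' \in P_i$ and $v_i(c') \le v_i(c'')$. Now, it follows that any threshold that retains $c'$ must also retain $c''$, and so it follows that also in this case $\{c', c''\}$ is not shattered.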

The bound on the VC dimension of follows from the next lemma.

###### Lemma 11.

Let $m \ge 2$ and let $\mathcal{F}_1,\ldots,\mathcal{F}_m$ be classes with VC dimension at most $1$. Then, the VC dimension of $\bigcup_i \mathcal{F}_i$ is at most $10\log m$.

###### Proof.

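We show that $\bigcup_i \mathcal{F}_i$ does not shatter a set of size $10\log m$. Let $Y$ be a set of size $10\log m$. Indeed, by Sauer’s Lemma (Sauer, 1972):

$$\left|\Big(\bigcup_i \mathcal{F}_i\Big)\Big|_Y\right| \le m\left(\binom{10\log m}{0} + \binom{10\log m}{1}\right) = m(1 + 10\log m),$$

which is smaller than $2^{10\log m} = m^{10}$, the number of subsets of $Y$, and therefore $Y$ is not shattered by $\bigcup_i \mathcal{F}_i$. ∎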

This finishes the proof of Lemma 10. ∎

Using Lemma 9, we can now apply standard uniform convergence results from VC-theory to derive Theorem 3 and Theorem 4.

###### Definition 12 (Relative (p,ϵ)-approximation; Har-Peled and Sharir, 2011).

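Let $\mathcal{F}$ be a family of subsets over a domain $X$, and let $D$ be a distribution on $X$. A finite sample $Z \subseteq X$ is a relative $(p,\epsilon)$-approximation for $\mathcal{F}$ if for each $F \in \mathcal{F}$ we have,

1. If $\Pr_D[F] \ge p$, then $(1-\epsilon)\Pr_D[F] \le \Pr_Z[F] \le (1+\epsilon)\Pr_D[F]$,

2. If $\Pr_D[F] < p$, then $\Pr_D[F] - \epsilon p \le \Pr_Z[F] \le \Pr_D[F] + \epsilon p$,

where $\Pr_Z[F] = |Z \cap F| / |Z|$ is the (“empirical”) measure of $F$ with respect to $Z$.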

The proof of Theorems 3 and 4 now follows by plugging in Har-Peled and Sharir (2011, Theorem 2.11), which we state in the next proposition.

###### Proposition 13 (Har-Peled and Sharir, 2011).

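Let $\mathcal{F}$ and $D$ be as in Definition 12. Suppose $\mathcal{F}$ has VC dimension $m$. Then, with probability at least $1-\delta$, a random sample of size

$$\Omega\left(\frac{m\log(1/p)+\log(1/\delta)}{\epsilon^2 p}\right)$$

is a relative $(p,\epsilon)$-approximation for $\mathcal{F}$.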
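One way to see how the rates of Theorems 3 and 4 could follow from this proposition is sketched below. It assumes, as our reading of the argument, that the proposition is applied with sample size $n$ and $p = k/n$; the constant $c$ is a placeholder for the constant hidden in the $\Omega(\cdot)$.

$$n \;\ge\; c\,\frac{m\log(1/p)+\log(1/\delta)}{\epsilon^2 p}
\quad\text{with } p = \frac{k}{n}
\quad\Longleftrightarrow\quad
\epsilon^2 \;\ge\; c\,\frac{m\log(n/k)+\log(1/\delta)}{k},$$

so it suffices to take

$$\epsilon = O\left(\sqrt{\frac{m\log(n/k)+\log(1/\delta)}{k}}\right).$$

Multiplying the empirical and true measures by $n$ turns the threshold $p$ into $k$ and the additive slack $\epsilon p$ into $\epsilon k$. Taking $m = O(d\log d)$ gives the rate stated for the total number of retained items, and $m = O(\log d)$ gives the per-constraint rate.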

### 3.2 Uniform convergence of values

We now prove a concentration result for the value of an optimal solution among the retained items. Unlike the number of retained items, the value of an optimal solution corresponds to a more complex random variable, and analyzing the concentration of its empirical estimate requires more advanced techniques. We prove the following concentration result for this random variable.

###### Theorem (Uniform convergence of values – restatement of Theorem 5).

With probability at least $1-\delta$ over $C \sim D^n$, the following holds for all thresholds-policies $T \in \mathcal{T}$ simultaneously:

$$\left|\nu_T - V^T(C)\right| \le \epsilon k, \quad\text{where}\quad \epsilon = O\left(\sqrt{\frac{d\log k+\log(1/\delta)}{k}}\right).$$

Note that unlike most uniform convergence results that guarantee simultaneous convergence of empirical averages to expectations, here $V^T(C)$ is not an average of the $n$ samples, but rather a more complicated function of them. To prove the theorem we need the following concentration inequality for the value of the optimal selection in hindsight. Note that by Theorem 2 this value equals $V^T(C)$ for some $T$.

###### Lemma 14.

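Let $\mathrm{OPT}(C)$ denote the value of the optimal solution for a sample $C$. We have that

$$\Pr_{C \sim D^n}\Big[\big|\mathrm{OPT}(C) - \mathbb{E}[\mathrm{OPT}(C)]\big| \ge \alpha\Big] \le 2\exp(-\alpha^2/2k).$$

So, for example, it happens that $\big|\mathrm{OPT}(C) - \mathbb{E}[\mathrm{OPT}(C)]\big| \le \sqrt{2k\log(2/\delta)}$ with probability at least $1-\delta$.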
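The high-probability form of this lemma follows by solving the tail bound for the deviation. The short derivation below is a worked step added for convenience; it uses natural logarithms.

$$2\exp\left(-\frac{\alpha^2}{2k}\right) = \delta
\quad\Longleftrightarrow\quad
\frac{\alpha^2}{2k} = \ln\frac{2}{\delta}
\quad\Longleftrightarrow\quad
\alpha = \sqrt{2k\ln\frac{2}{\delta}}.$$

The deviation therefore scales with $\sqrt{k}$ and is independent of the sample size $n$, which is the point of using Talagrand's inequality instead of a bounded-differences argument over all $n$ coordinates.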

To prove this lemma we use the following version of Talagrand’s inequality (that appears for example in lecture notes by van Handel (2014)).

###### Proposition 15 (Talagrand’s Concentration Inequality).

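Let $f : \mathcal{X}^n \to \mathbb{R}$ be a function, and suppose that there exist $g_1,\ldots,g_n : \mathcal{X}^n \to \mathbb{R}$ such that for any $x, y \in \mathcal{X}^n$

$$f(x) - f(y) \le \sum_{i=1}^{n} g_i(x)\,\mathbb{1}[x_i \ne y_i]. \tag{1}$$

Then, for independent random variables $X = (X_1,\ldots,X_n)$ we have

$$\Pr\Big[\big|f(X) - \mathbb{E}[f(X)]\big| > \alpha\Big] \le 2\exp\left(-\frac{\alpha^2}{2\sup_x \sum_{i=1}^{n} g_i^2(x)}\right).$$

###### Proof of Lemma 14.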

We apply Talagrand’s concentration inequality to the random variable $\mathrm{OPT}(C)$. Our $X_i$’s are the items $c_1,\ldots,c_n$ in the order that they are given. We show that Eq. 1 holds for $g_i(C) = \mathbb{1}[c_i \in S]$ where $S = S(C)$ is a fixed optimal solution for $C$ (we use some arbitrary tie breaking among optimal solutions). We then have $\sum_{i=1}^{n} g_i^2(C) \le |S| = k$, thus completing the proof.

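Now, let $C = (c_1,\ldots,c_n)$, $C' = (c'_1,\ldots,c'_n)$ be two samples of $n$ items. Recall that we need to show that

$$\mathrm{OPT}(C) - \mathrm{OPT}(C') \le \sum_{i=1}^{n} g_i(C)\,\mathbb{1}[c_i \ne c'_i].$$

We use $S$ to construct a solution $S'$ for $C'$ as follows. Let $S_j$ be the subset of $S$ matched to $P_j$. For each $i$, if $c_i \in S_j$ for some $j$, and $c_i = c'_i$, then we add $c'_i$ to $S'_j$. Otherwise, we add a dummy item to $S'_j$ (with value zero). Let $V(S')$ denote the value of $S'$. Note that the difference between the values of $S$ and $S'$ is the total value of all items $c_i \in S$ such that $c_i \ne c'_i$. Since the item values are bounded in $[0,1]$ we get that

$$\mathrm{OPT}(C) - V(S') = \sum_{j=1}^{d}\sum_{c_i \in S_j} v_j(c_i)\,\mathbb{1}[c_i \ne c'_i] \le \sum_{j=1}^{d}\sum_{c_i \in S_j} \mathbb{1}[c_i \ne c'_i] = \sum_{i=1}^{n} g_i(C)\,\mathbb{1}[c_i \ne c'_i].$$

The proof is complete by noticing that $\mathrm{OPT}(C') \ge V(S')$. ∎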

We also require the following construction of a bracketing of which is formally presented in Section A.2.

###### Lemma 16.

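There exists a collection of thresholds-policies $\mathcal{N}$ such that $|\mathcal{N}| \le k^{O(d)}$, and for every thresholds-policy $T \in \mathcal{T}$ there are $T^+, T^- \in \mathcal{N}$ such that

1. $V^{T^-}(C) \le V^T(C) \le V^{T^+}(C)$ for every sample of items $C$; note that by taking expectations this implies that $\nu_{T^-} \le \nu_T \le \nu_{T^+}$, and

2. $\nu_{T^+} - \nu_{T^-} \le 10$.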

###### Proof of Theorem 5.

The items in $C$ that are retained by $T$ are independent samples from a distribution $D_T$ that is sampled as follows: (i) sample $c \sim D$, and (ii) if $c$ is retained by $T$ then keep it, and otherwise discard it. This means that $V^T(C)$ is in fact the value of the optimal solution of the retained items with respect to $D_T$. Since Lemma 14 applies to every distribution we can apply it to $D_T$ and get that for any fixed $T$

$$\Pr_{C \sim D^n}\Big[\big|\nu_T - V^T(C)\big| \ge \alpha\Big] \le 2\exp(-\alpha^2/2k).$$

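Now, by the union bound, for $\mathcal{N}$ as in Lemma 16 we get that the probability that there is $T \in \mathcal{N}$ such that $|\nu_T - V^T(C)| \ge \alpha$ is at most $2|\mathcal{N}|\exp(-\alpha^2/2k)$. Thus, since $|\mathcal{N}| \le k^{O(d)}$, it follows that with probability at least $1-\delta$,

$$(\forall T \in \mathcal{N}):\quad \big|\nu_T - V^T(C)\big| \le O\left(\sqrt{k\,(d\log k+\log(1/\delta))}\right). \tag{2}$$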

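We now show why uniform convergence for $\mathcal{N}$ implies uniform convergence for $\mathcal{T}$. Combining Lemma 16 with Equation 2 we get that with probability at least $1-\delta$, every $T \in \mathcal{T}$ satisfies:

$$\begin{aligned}
\big|\nu_T - V^T(C)\big| &\le \max\Big\{\big|\nu_{T^+} - V^{T^-}(C)\big|,\ \big|\nu_{T^-} - V^{T^+}(C)\big|\Big\} && \text{(by Item 1 of Lemma 16)} \\
&\le \max\Big\{\big|\nu_{T^-} - V^{T^-}(C)\big|,\ \big|\nu_{T^+} - V^{T^+}(C)\big|\Big\} + 10 && \text{(by Item 2 of Lemma 16)} \\
&\le 10 + O\left(\sqrt{k\,(d\log k+\log(1/\delta))}\right).
\end{aligned}$$

Here the first inequality follows from Item 1 by noticing that if $[a,b]$, $[a',b']$ are intervals on the real line and $x \in [a,b]$, $y \in [a',b']$, then $|x-y| \le \max\{|b-a'|, |b'-a|\}$, and plugging in $x = \nu_T$, $y = V^T(C)$, $a = \nu_{T^-}$, $b = \nu_{T^+}$, $a' = V^{T^-}(C)$, $b' = V^{T^+}(C)$.

This finishes the proof, by setting $\epsilon$ such that $\epsilon k = 10 + O\left(\sqrt{k\,(d\log k+\log(1/\delta))}\right)$. ∎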

## 4 Algorithms based on learning thresholds-policies

We next exemplify how one can use the above properties of thresholds-policies to design algorithms. A natural algorithm would be to use the training set to learn a threshold-policy that retains an optimal solution with items from the training set as specified in Theorem 2, and then use this online policy to retain a subset of the items in the first phase. Theorem 3 and Theorem 5 imply that with probability , the number of retained items is at most and that the value of the resulting solution is at least .

We can improve this algorithm by combining it with the greedy algorithm of Theorem 7 described in Section A.1. During the first phase, we retain an item only if (i) is retained by , and (ii) participates in the optimal solution among the items that were retained thus far. Theorem 7 then implies that out of these items greedy keeps a subset of

$$O\left(k\log\frac{m}{k}\right) = O\left(k\left(\log\log\frac{n}{k} + \log\log\frac{1}{\delta}\right)\right)$$

items in expectation that still contains a solution of value at least .

We can further improve the value of the solution and guarantee that it will be optimal (with respect to all items) with probability . This is based on the observation that if the set of retained items contains the top items of each property then it also contains an optimal solution. Thus, we can compute a thresholds-policy that retains the top items of each property from the training set (if the training set does not have this many items with some property then set the corresponding threshold to ). Then, it follows from Theorem 4, that with probability , will retain the top  items of each property in the first online phase and therefore will retain an optimal solution. Now, Theorem 4 implies that with probability the total number of items that are retained by in real-time is at most . By filtering the retained elements with the greedy algorithm of Theorem 7 as before it follows that the total number of retained items is at most

$$k + k\log\left(\frac{m}{k}\right) = O\left(k\left(\log d + \log\log\frac{n}{k} + \log\log\frac{1}{\delta}\right)\right)$$

with probability . This proves Theorem 6.
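The learning step of this section can be sketched as follows. This is a minimal illustration, not the algorithm as specified in the paper: the number of top items per property is left as a free parameter `top`, and the fallback threshold used when the training set has too few items with some property is assumed to be `0.0` (which retains every item with that property). The item encoding and function names are likewise assumptions.

```python
from typing import Dict, List

Item = Dict[int, float]  # hypothetical encoding: property index -> value


def learn_thresholds(training: List[Item], num_properties: int, top: int,
                     fallback: float = 0.0) -> List[float]:
    """For each property, use the value of its `top`-th best training item as threshold."""
    thresholds = []
    for prop in range(num_properties):
        values = sorted((item[prop] for item in training if prop in item), reverse=True)
        # Assumption: with too few training items, fall back to a permissive threshold.
        thresholds.append(values[top - 1] if len(values) >= top else fallback)
    return thresholds


def first_stage(stream: List[Item], thresholds: List[float]) -> List[Item]:
    """Online filter: keep an item iff some property value meets its threshold."""
    return [item for item in stream if any(v >= thresholds[p] for p, v in item.items())]


training = [{0: 0.9}, {0: 0.5, 1: 0.8}, {1: 0.7}, {0: 0.3}, {1: 0.2}]
thresholds = learn_thresholds(training, num_properties=2, top=2)
online = [{0: 0.6}, {1: 0.1}, {0: 0.4, 1: 0.75}]
kept = first_stage(online, thresholds)
print(thresholds, len(kept))
```

The items kept by `first_stage` could then be filtered further by the greedy rule before the second-stage matching is computed.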

## 5 A lower bound

In the previous section we have presented an algorithm that with probability at least outputs an optimal solution while retaining at most items in expectation during the first phase.

We now present a proof of Theorem 8. We start with the following lemma that shows the dependence on cannot be improved in general, even for , when there are no constraints, and the distribution over the items is known to the algorithm (so there is no need to train it on a sample from the distribution):

###### Lemma 17.

Let be drawn uniformly and independently, let and let be an algorithm that retains the maximal value among the ’s with probability at least . Then,

$$\mathbb{E}[|S|] = \Omega\left(\log\log\frac{1}{\delta}\right),$$

where is the set of values retained by the algorithm.

Thus, it follows that for and the bound in Theorem 6 is tight.

###### Proof.

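Define $\alpha = \frac{\ln(1/\delta)}{2n}$. Let $E_t$ denote the event that $v_t \ge 1-\alpha$ and $v_t$ is the largest among $v_1,\ldots,v_t$. We have that

$$\mathbb{E}[|S|] \ge \sum_t \Pr[v_t \text{ is picked and } E_t] = \sum_t \big(\Pr[E_t] - \Pr[v_t \text{ is rejected and } E_t]\big). \tag{3}$$

We show that since $A$ errs with probability at most $\delta$ then $\sum_t \Pr[v_t \text{ is rejected and } E_t]$ is small.

$$\begin{aligned}
\delta &\ge \Pr[A \text{ rejects } v_{\max}] \ge \sum_t \Pr[A \text{ rejects } v_t \text{ and } E_t \text{ and } v_t = v_{\max}] \\
&= \sum_t \Pr[v_t = v_{\max} \mid A \text{ rejects } v_t \text{ and } E_t] \cdot \Pr[A \text{ rejects } v_t \text{ and } E_t] \\
&\ge \sum_t \Pr[v_i \le 1-\alpha \text{ for all } i > t \mid A \text{ rejects } v_t \text{ and } E_t] \cdot \Pr[A \text{ rejects } v_t \text{ and } E_t] \\
&= \sum_t \Pr[v_i \le 1-\alpha \text{ for all } i > t] \cdot \Pr[A \text{ rejects } v_t \text{ and } E_t] \\
&\ge \sum_t (1-\alpha)^{n-t} \cdot \Pr[A \text{ rejects } v_t \text{ and } E_t] \\
&\ge (1-\alpha)^{n} \sum_t \Pr[A \text{ rejects } v_t \text{ and } E_t].
\end{aligned}$$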

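The crucial part of the above derivation is in the third line. It replaces the event “$v_t = v_{\max}$” by the event “$v_i \le 1-\alpha$ for all $i > t$” (which is contained in the event “$v_t = v_{\max}$” under the above conditioning). The gain is that the events “$v_i \le 1-\alpha$ for all $i > t$” and “$A$ rejects $v_t$ and $E_t$” are independent (the first depends only on $v_i$ for $i > t$ and the latter on $v_i$ for $i \le t$). This justifies the “$=$” in the fourth line.

Rearranging, we have $\sum_t \Pr[A \text{ rejects } v_t \text{ and } E_t] \le \frac{\delta}{(1-\alpha)^n}$. Substituting this bound in Eq. 3,

$$\begin{aligned}
\mathbb{E}[|S|] &\ge \sum_t \Pr[v_t \text{ is picked and } E_t] \\
&= \sum_t \big(\Pr[E_t] - \Pr[v_t \text{ is rejected and } E_t]\big) \\
&\ge \sum_t \Pr[E_t] - \frac{\delta}{(1-\alpha)^n} \\
&\ge \tfrac{1}{4}\ln(\alpha n) - \delta\exp(2\alpha n) && \text{(explained below)} \\
&= \tfrac{1}{4}\ln\left(\frac{\ln(1/\delta)}{2}\right) - \delta\exp(\ln(1/\delta)) && \text{(by the definition of } \alpha\text{)} \\
&= \tfrac{1}{4}\ln\ln(1/\delta) - \tfrac{1}{4}\ln 2 - 1 = \Omega(\log\log(1/\delta)),
\end{aligned}$$

which is what we needed to prove. The last inequality follows because

• (i) $\sum_t \Pr[E_t] \ge \tfrac{1}{4}\ln(\alpha n)$ (as is explained next), and

• (ii) $1-\alpha \ge \exp(-2\alpha)$ for every $\alpha \in [0, \tfrac{1}{2}]$ (which can be verified using basic analysis).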

To see (i), note that

Let . Since the ’s are uniform in then by the same argument as in the proof of Lemma 19 we get that

$$\mathbb{E}\Big[\sum_t \mathbb{1}_{E_t} \,\Big|\, z\Big] = \sum_{i=1}^{z} \frac{1}{i} \ge \int_{1}^{z+1} \frac{1}{x}\,dx = \ln(z+1),$$

and therefore

Let , and therefore we need to lower bound for . To this end, we use the assumption that , and therefore (see Greenberg and Mohri, 2013 for a proof of this basic fact). In particular, this implies that , which finishes the proof. ∎
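The harmonic sum in the proof above reflects a standard fact: among $z$ i.i.d. values from a continuous distribution, the expected number of running maxima (positions whose value exceeds everything before them) is $\sum_{i=1}^{z} 1/i$. The sketch below checks this exactly for small $z$ by enumerating all orderings, and also checks the comparison with $\ln(z+1)$. Reading the events $E_t$ as running maxima among the values above $1-\alpha$ is our interpretation, and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations
from math import log


def count_records(seq) -> int:
    """Number of positions whose value exceeds everything before it."""
    records, best = 0, float("-inf")
    for v in seq:
        if v > best:
            records, best = records + 1, v
    return records


def expected_records(z: int) -> Fraction:
    """Average number of records over all z! orderings of z distinct values."""
    perms = list(permutations(range(z)))
    return Fraction(sum(count_records(p) for p in perms), len(perms))


def harmonic(z: int) -> Fraction:
    """Exact harmonic number 1 + 1/2 + ... + 1/z."""
    return sum((Fraction(1, i) for i in range(1, z + 1)), Fraction(0))


for z in range(1, 7):
    assert expected_records(z) == harmonic(z)
    assert float(harmonic(z)) >= log(z + 1)
print(expected_records(4))  # 25/12
```

Exhaustive enumeration is only feasible for small $z$; it is used here because it gives an exact rational answer rather than a simulation estimate.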

Lemma 17 implies Theorem 8 as follows: set , , , and . Pick a distribution which is uniform over items, each satisfying exactly one of properties, and with value drawn uniformly from .

It suffices to show that with probability of at least , the algorithm retains an expected number of items from a constant fraction, say , of the properties . This follows from Lemma 17 as we argue next. Let denote the number of observed items of property . Then, since , the multiplicative Chernoff bound implies that with high probability (probability suffices). Therefore, the expected number of properties ’s for which is at least . Now, consider the random variable which counts for how many properties we have . Since is at most and , then a simple averaging argument implies that with probability of at least we have that . Conditioning on this event (which happens with probability ), Lemma 17 implies that for each of these ’s. (Note that to apply Lemma 17 on we need , which is equivalent to .)

## References

• Babaioff et al. (2007)  M. Babaioff, N. Immorlica, and R. Kleinberg. Matroids, secretary problems, and online mechanisms. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 434–443. Society for Industrial and Applied Mathematics, 2007.
• Balcan et al. (2018)  M. Balcan, T. Sandholm, and E. Vitercik. A general theory of sample complexity for multi-item profit maximization. In EC, pages 173–174. ACM, 2018.
• Blum et al. (2017)  A. Blum, I. Caragiannis, N. Haghtalab, A. D. Procaccia, E. B. Procaccia, and R. Vaish. Opting into optimal matchings. In SODA, pages 2351–2363. SIAM, 2017.
• Boucheron et al. (2013)  S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. ISBN 9780191747106.
• Bousquet et al. (2004)  O. Bousquet, U. von Luxburg, and G. Rätsch, editors. Advanced Lectures on Machine Learning, ML Summer Schools 2003, Canberra, Australia, February 2-14, 2003, Tübingen, Germany, August 4-16, 2003, Revised Lectures, volume 3176 of Lecture Notes in Computer Science, 2004. Springer.
• Celis et al. (2017)  L. E. Celis, D. Straszak, and N. K. Vishnoi. Ranking with fairness constraints. arXiv preprint arXiv:1704.06840, 2017.
• Celis et al. (2018)  L. E. Celis, L. Huang, and N. K. Vishnoi. Multiwinner voting with fairness constraints. In IJCAI, pages 144–151, 2018.
• Ezra et al. (2018)  T. Ezra, M. Feldman, and I. Nehama. Prophets and secretaries with overbooking. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 319–320. ACM, 2018.
• Ferguson (1989)  T. S. Ferguson. Who solved the secretary problem? Statistical Science, 4(3):282–289, 1989.
• Greenberg and Mohri (2013)  S. Greenberg and M. Mohri. Tight lower bound on the probability of a binomial exceeding its expectation. CoRR, abs/1306.1433, 2013.
• Gupta and Molinaro (2016)  A. Gupta and M. Molinaro. How the experts algorithm can help solve LPs online. Math. Oper. Res., 41(4):1404–1431, 2016.
• Har-Peled and Sharir (2011)  S. Har-Peled and M. Sharir. Relative (p, ε)-approximations in geometry. Discrete & Computational Geometry, 45(3):462–496, 2011.
• Hsu et al. (2016)  J. Hsu, J. Morgenstern, R. M. Rogers, A. Roth, and R. Vohra. Do prices coordinate markets? In STOC, pages 440–453. ACM, 2016.
• Lawler (2001)  E. L. Lawler. Combinatorial Optimization: Networks and Matroids. Courier Corporation, 2001.
• McDiarmid (1989)  C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics 1989. Cambridge University Press, Cambridge, 1989.
• Mehta et al. (2013)  A. Mehta et al. Online matching and ad allocation. Foundations and Trends in Theoretical Computer Science, 8(4):265–368, 2013.
• Morgenstern and Roughgarden (2015)  J. Morgenstern and T. Roughgarden. On the pseudo-dimension of nearly optimal auctions. In NIPS, pages 136–144, 2015.
• Morgenstern and Roughgarden (2016)  J. Morgenstern and T. Roughgarden. Learning simple auctions. In COLT, volume 49 of JMLR Workshop and Conference Proceedings, pages 1298–1318. JMLR.org, 2016.
• Sauer (1972)  N. Sauer. On the density of families of sets. J. Combinatorial Theory Ser. A, 13:145–147, 1972.
• Talagrand (1995)  M. Talagrand. Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l’Institut des Hautes Etudes Scientifiques, 81(1):73–205, 1995.
• van Handel (2014)  R. van Handel. Probability in high dimension. Technical report, Princeton University, 2014.
• Vardi (2015)  S. Vardi. The returning secretary. In 32nd International Symposium on Theoretical Aspects of Computer Science, page 716, 2015.
• Vondrák (2010)  J. Vondrák. A note on concentration of submodular functions. CoRR, abs/1005.2791, 2010.

## Appendix A Deferred Proofs

### A.1 The Greedy Online Algorithm

A simple way to collect a small set of items that contains the optimal solution is to select the  largest items of each property. This set clearly contains the optimal solution. A simple argument, as in the proof of Lemma 19, shows that this implementation of the first stage keeps items on average. In the following we present a greedy algorithm that retains an average number of items in the first phase.

The greedy algorithm works as follows: when we process the ’th item, , the algorithm computes the optimal solution of the first items (recall that we assume the algorithm has access to , a large enough pool of zero-valued items, so there is always a feasible solution). The greedy algorithm retains if and only if participates in . We assume that is unique for every (we can achieve this with an arbitrary consistent tie-breaking rule; say, among matchings of the same value we prefer the one that maximizes the sum of the indices of the matched items). Since the optimal solutions correspond to maximum-weight bipartite matchings between the items and the constraints, we have the following lemma.

###### Lemma 18.

The optimal solution, denoted by , is a subset of the retained items.

###### Proof.

Consider an item matched by and assume by contradiction that is not matched in . Consider (we take the symmetric difference of and as sets of edges). Since and do not necessarily match the same items, the edges in induce a collection of alternating paths and cycles, where each path has an item matched by and not by at one end, and an item matched by and not by at the other end. Except for its two ends, an alternating path contains items that are matched by both and . From the optimality and the uniqueness of it follows that for each path the value of is larger than the value of .

Since is matched by and not by there is a path in that starts at and ends at some item that is matched by and not by .

It follows that all the items in are in and if we match them according to then the value that we gain from them increases. This contradicts the optimality of .

(Note that, in fact, there are no cycles in , since they would imply that there are multiple optimal solutions, contradicting the uniqueness of and .) ∎
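The containment claimed by Lemma 18 can be checked numerically on small instances. The sketch below is only an illustration under assumptions that are not taken from the paper: the constraints are modeled as generic "slots", `weights[i][s]` is a hypothetical value for placing item `i` in slot `s`, and the optimum is found by brute force. Random real-valued weights make the optimum unique in practice, which stands in for the tie-breaking rule.

```python
import itertools
import random


def best_matching(weights, num_items):
    """Brute-force maximum-weight matching of items 0..num_items-1 to slots.

    weights[i][s] is a hypothetical value of placing item i in slot s.
    Each slot takes at most one item and each item fills at most one slot;
    an empty slot (None) contributes 0. Returns the set of matched items.
    """
    num_slots = len(weights[0])
    choices = list(range(num_items)) + [None]
    best_value, best_items = -1.0, frozenset()
    for assignment in itertools.product(choices, repeat=num_slots):
        used = [i for i in assignment if i is not None]
        if len(used) != len(set(used)):
            continue  # an item may fill only one slot
        value = sum(weights[i][s] for s, i in enumerate(assignment) if i is not None)
        if value > best_value:
            best_value, best_items = value, frozenset(used)
    return best_items


def greedy_first_stage(weights):
    """Retain item i iff it is matched in the optimum of items 0..i."""
    retained = set()
    for i in range(len(weights)):
        if i in best_matching(weights, i + 1):
            retained.add(i)
    return retained


def check_containment(num_items, num_slots, seed):
    """Return True if the final optimum is a subset of the retained items."""
    rng = random.Random(seed)
    weights = [[rng.random() for _ in range(num_slots)] for _ in range(num_items)]
    return best_matching(weights, num_items) <= greedy_first_stage(weights)
```

Calling `check_containment(7, 3, seed)` for several seeds exercises the exchange argument of the proof on random instances; it is a sanity check, not a substitute for the proof.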

Lemma 18 implies that if we collect all items that are in the optimal solution of the subset of items that precedes them then the set of items that we have at the end contains the optimal solution. The next question is: how large is the subset of the items which we retain? The next lemma answers this question in an average sense.

###### Lemma 19.

Assume that in the first stage the algorithm receives the items in a random order. Then the expected number of items that the first stage keeps is .

###### Proof.

Let be an indicator that is one if and only if the ’th item belongs to . Condition the probability space on the set of the first items (but not on their order). Each element of is equally likely to arrive last. Since , the probability that the element arriving last in is in is at most if or at most otherwise. It follows that . Since this holds for any , it also holds unconditionally. The lemma now follows by linearity of expectation and the fact that . ∎
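The counting argument in this proof can be illustrated by simulation in a simplified setting. The sketch below assumes (this is not the paper's model) a single cardinality constraint, so that the optimum of a prefix is simply its `k` largest values, with `n` and `k` used as in the abstract. Under that assumption, with distinct values in random order, the `i`-th item is retained with probability `min(1, k/i)`, and the expected number of retained items is the sum of these terms.

```python
import heapq
import random


def greedy_retained_count(values, k):
    """Count items retained by the greedy first stage, simplified setting.

    Assumption (not from the paper): one cardinality constraint, so the
    optimum of a prefix is its k largest values. An item is retained iff
    it is among the k largest values seen so far when it arrives.
    """
    heap = []  # min-heap holding the k largest values seen so far
    retained = 0
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
            retained += 1
        elif v > heap[0]:
            heapq.heapreplace(heap, v)
            retained += 1
    return retained


def expected_retained(n, k):
    """Sum over i = 1..n of min(1, k/i), per the conditioning argument."""
    return sum(min(1.0, k / i) for i in range(1, n + 1))


def simulate(n, k, trials, seed=0):
    """Average retained count over i.i.d. uniform values (a random order)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        values = [rng.random() for _ in range(n)]
        total += greedy_retained_count(values, k)
    return total / trials
```

Comparing `simulate(200, 5, 2000)` with `expected_retained(200, 5)` shows the empirical average tracking the harmonic-type sum. How this simplified count relates to the bound stated in Lemma 19 depends on the paper's full constraint structure, which the sketch does not model.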

### A.2 Generalization and concentration

###### Lemma (restatement of Lemma 16).

There exists a collection of thresholds-policies such that , and for every thresholds-policy there are such that

1. for every sample of items . (By taking expectations this also implies that .)

2. .

For every