Sublinear Algorithms for MAXCUT and Correlation Clustering

by Aditya Bhaskara, et al.

We study sublinear algorithms for two fundamental graph problems, MAXCUT and correlation clustering. Our focus is on constructing core-sets as well as developing streaming algorithms for these problems. Constant space algorithms are known for dense graphs for these problems, while Ω(n) lower bounds exist (in the streaming setting) for sparse graphs. Our goal in this paper is to bridge the gap between these extremes. Our first result is to construct core-sets of size Õ(n^{1-δ}) for both problems, on graphs with average degree n^δ (for any δ > 0). This turns out to be optimal, under the exponential time hypothesis (ETH). Our core-set analysis is based on studying random-induced sub-problems of optimization problems. To the best of our knowledge, all the known results in our parameter range rely crucially on near-regularity assumptions. We avoid these by using a biased sampling approach, which we analyze using recent results on concentration of quadratic functions. We then show that our construction yields a 2-pass streaming (1+ϵ)-approximation for both problems; the algorithm uses Õ(n^{1-δ}) space, for graphs of average degree n^δ.




1 Introduction

Sublinear algorithms are a powerful tool for dealing with large data problems. The range of questions that can be answered accurately using sublinear (or even polylogarithmic) space or time is enormous, and the underlying techniques of sketching, streaming, sampling and core-sets have proven to be a rich toolkit.

When dealing with large graphs, the sublinear paradigm has yielded many powerful results. For many NP-hard problems on graphs, classic results from property testing [22, 7] imply extremely efficient sublinear approximations. In the case of dense graphs, these results (and indeed older ones of [10, 18]) provide constant time/space algorithms. More recently, graph sketching techniques have been used to obtain efficient approximation algorithms for cut problems on graphs [2, 3] in a streaming setting. These algorithms use space that is nearly linear in $n$ (the number of vertices) and are sublinear in the number of edges as long as $|E| = \omega(n)$ (this is called the “semi-streaming” setting).

By way of lower bounds, recent results have improved our understanding of the limits of sketching and streaming. In a sequence of results [24, 25, 27], it was shown that for problems like matching and MaxCut in a streaming setting, $\Omega(n)$ space is necessary in order to obtain any approximation better than a factor 2 in one round. (Note that a factor 2 is trivial by simply counting edges.) Furthermore, Andoni et al. [9] showed that any sketch that preserves all the cuts in a graph to within a $(1 \pm \epsilon)$ factor must have size $\Omega(n/\epsilon^2)$.

While these lower bounds show that $\Omega(n)$ space is the best possible for approximating problems like MaxCut in general, the constructions used in these bounds are quite specialized. In particular, the graphs involved are sparse, i.e., have $\Theta(n)$ edges. Meanwhile, as we mentioned above, if a graph is dense ($\Omega(n^2)$ edges), random sampling is known to give $O(1)$ space and time algorithms. The question we study in this paper is whether there is a middle ground: can we get truly sublinear (i.e., $o(n)$) algorithms for natural graph problems in between (easy) dense graphs and (hard) sparse graphs?

Our main contribution is to answer this in the affirmative. As long as a graph has average degree $n^{\delta}$ for some $\delta > 0$, truly sublinear space $(1+\epsilon)$ approximation algorithms are possible for problems such as MaxCut and correlation clustering. Note that we consider the max-agreement version of correlation clustering (see Section 2). Indeed, we show that a biased sample of vertices forms a “core-set” for these problems. A core-set for an optimization problem (see [1]) is a subset of the input with the property that a solution to the subset provides an approximation to the solution on the entire input.

Our arguments rely on understanding the following fundamental question: given a graph $G$, is the induced subgraph on a random subset of vertices a core-set for problems such as MaxCut? This question of sub-sampling and its effect on the value of an optimization problem is well studied. Results from property testing imply that a uniformly random sample of constant size suffices for many problems on dense graphs. [18, 6] generalized these results to the case of arbitrary $k$-CSPs. More recently, [12], extending a result in [16], studied the setting closest to ours. For graphs, their results imply that when the maximum and minimum degrees are both $\Theta(n^{\delta})$, then a random induced subgraph with $\tilde{O}(n^{1-\delta})$ vertices acts as a core-set for problems such as MaxCut. Moreover, they showed that for certain lifted relaxations, subsampling does not preserve the value of the objective. Finally, using more modern techniques, [33] showed that the cut norm of a matrix (a quantity related to the MaxCut) is preserved up to a constant under random sampling, improving on [18, 6]. While powerful, we will see that these results are not general enough for our setting. Thus we propose a new, conceptually simple technique to analyze sub-sampling, and present it in the context of MaxCut and correlation clustering.

1.1 Our Results

As outlined above, our main result is to show that there exist core-sets of size $\tilde{O}(n^{1-\delta})$ for MaxCut and correlation clustering for graphs with $\Omega(n^{1+\delta})$ edges (where $\delta > 0$). This then leads to a two-pass streaming algorithm for MaxCut and correlation clustering on such graphs, that uses $\tilde{O}(n^{1-\delta})$ space and produces a $(1+\epsilon)$ approximation.

This dependence of the core-set size on $\delta$ is optimal up to logarithmic factors, by a result of [17]. Specifically, [17] showed that any $(1+\epsilon)$ approximation algorithm for MaxCut on graphs of average degree $n^{\delta}$ must have running time $2^{\Omega(n^{1-\delta})}$, assuming the exponential time hypothesis (ETH). Since a core-set of size $o(n^{1-\delta})$ would trivially allow an algorithm beating this bound (we can perform exhaustive search over the core-set), our construction is optimal up to a logarithmic factor, assuming ETH.

Our streaming algorithm for correlation clustering can be viewed as improving the semi-streaming (space $\tilde{O}(n)$) result of Ahn et al. [4], while using an additional pass over the data. Also, in the context of the lower bound of Andoni et al. [9], our result for MaxCut can be interpreted as saying that while a sketch that approximately maintains all cuts in a graph requires $\Omega(n/\epsilon^2)$ size, one that preserves the MaxCut can be significantly smaller, when the graph has a polynomial average degree.

At a technical level, we analyze the effect of sampling on the value of the MaxCut and correlation clustering objectives. As outlined above, several techniques are known for such an analysis, but we give a new and conceptually simple framework that (a) allows one to analyze non-uniform sampling for the first time, and (b) gets over the assumptions of near-regularity (crucial for [16, 12]) and density (as in [18, 6]). We expect the ideas from our analysis to be applicable to other settings as well, especially ones for which the ‘linearization’ framework of [10] is applicable.

The formal statement of results, an outline of our techniques and a comparison with earlier works are presented in Section 4.

1.2 Related Work

MaxCut and correlation clustering are both extremely well-studied problems, and thus we will only mention the results most relevant to our work.

Dense graphs. A graph is said to be dense if its average degree is $\Omega(n)$. Starting with the work of Arora et al. [10], many NP-hard optimization problems have been shown to admit a PTAS when the instances are dense. Indeed, a small random induced subgraph is known to be a core-set for problems such as MaxCut, and indeed all $k$-CSPs [22, 6, 18, 31]. The work of [10] relies on an elegant linearization procedure, while [18, 6] give a different (and more unified) approach based on “cut approximations” of a natural tensor associated with a CSP.

Polynomial density. The focus of our work is on graphs that are in between sparse (constant average degree) and dense graphs. These are graphs whose density (i.e., average degree) is $n^{\delta}$, for some $0 < \delta < 1$. Fotakis et al. [17] extended the approach of [10] to this setting, and obtained $(1+\epsilon)$ approximation algorithms with run-time $2^{\tilde{O}(n^{1-\delta})}$. They also showed that this was the best possible, under the exponential time hypothesis (ETH). By way of core-sets, in their celebrated work on the optimality of the Goemans-Williamson rounding, Feige and Schechtman [16] showed that a random sample of $\tilde{O}(n^{1-\delta})$ vertices is a core-set for MaxCut, if the graphs are almost regular and have an average degree $n^{\delta}$. This was extended to other CSPs by [12]. These arguments seem to use near-regularity in a crucial way, and are based on restricting the number of possible ‘candidates’ for the maximum cut.

Streaming algorithms and lower bounds. In the streaming setting, there are several algorithms [2, 29, 20, 3, 21, 28] that produce cut or spectral sparsifiers with $O(n/\epsilon^2)$ edges using $\tilde{O}(n/\epsilon^2)$ space. Such algorithms preserve every cut within a $(1+\epsilon)$ factor (and therefore also preserve the max cut). Andoni et al. [9] showed that such a space complexity is essential; in fact, [9] show that any sketch for all the cuts in a graph (not necessarily a streaming one) must have bit complexity $\Omega(n/\epsilon^2)$. However, this does not rule out the possibility of being able to find a maximum cut in much smaller space.

For MaxCut, Kapralov et al. [26] and independently Kogan et al. [30] proved that any streaming algorithm that can approximate the MaxCut value to a factor better than $2$ requires $\tilde{\Omega}(\sqrt{n})$ space, even if the edges are presented in random order. For adversarial orders, they showed that for any $\epsilon > 0$, a one-pass $(1+\epsilon)$-approximation to the max cut value must use $n^{1-O(\epsilon)}$ space. Very recently, Kapralov et al. [27] went further, showing that there exists an $\epsilon^* > 0$ such that every randomized single-pass streaming algorithm that yields a $(1+\epsilon^*)$-approximation to the MAXCUT size must use $\Omega(n)$ space.

Correlation clustering. Correlation clustering was formulated by Bansal et al. [11] and has been studied extensively. There are two common variants of the problem: maximizing agreement and minimizing disagreement. While these are equivalent for exact optimization (their sum is a constant), they look very different under an approximation lens. Maximizing agreement typically admits constant factor approximations, but minimizing disagreement is much harder. In this paper, we focus on the maximum-agreement variant of correlation clustering, and in particular on $(1+\epsilon)$-approximations. Here, Ailon and Karnin [5] presented an approximation scheme with sublinear query complexity (which also yields a semi-streaming algorithm) for dense instances of correlation clustering. Giotis and Guruswami [19] described a sampling based algorithm combined with a greedy strategy which guarantees a solution within $\epsilon n^2$ additive error. (Their work is similar to the technique of Mathieu and Schudy [31].) Most recently, Ahn et al. [4] gave a single-pass semi-streaming algorithm for max-agreement, with separate approximation guarantees for graphs with bounded weights and for graphs with arbitrary weights. Both algorithms require space that is nearly linear in $n$. The key idea in their approach was to adapt multiplicative-weight-update methods for solving the natural SDPs for correlation clustering in a streaming setting using linear sketching techniques.

2 Definitions

Definition 2.1 (MaxCut).

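Let $G = (V, E, w)$ be a graph with weights $w : E \to \mathbb{R}^+$. Let $(A, B)$ be a partition of $V$ and let $w(A, B)$ denote the sum of weights of edges between $A$ and $B$. Then $\mathrm{MaxCut}(G) = \max_{(A, B) \text{ partition of } V} w(A, B)$.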
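To make the definition concrete, here is a minimal sketch of an exhaustive-search MaxCut solver, of the kind one could run on a small core-set. The function names and the `(u, v, weight)` edge representation are illustrative assumptions, not notation from the paper.

```python
from itertools import product

def cut_weight(edges, side):
    """Total weight of edges whose endpoints lie on different sides."""
    return sum(w for (u, v, w) in edges if side[u] != side[v])

def max_cut_brute_force(n, edges):
    """Exhaustive search over bipartitions of vertices 0..n-1.

    Vertex 0 is pinned to side 0, so 2^(n-1) assignments are tried.
    Only practical for very small n (e.g. a small core-set).
    """
    best_value, best_side = 0.0, (0,) * n
    for rest in product((0, 1), repeat=n - 1):
        side = (0,) + rest
        value = cut_weight(edges, side)
        if value > best_value:
            best_value, best_side = value, side
    return best_value, best_side

# A 4-cycle is bipartite, so all four edges can be cut;
# in a triangle at most two of the three edges can be cut.
square = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 0, 1.0)]
triangle = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)]
```

The running time is exponential in the number of vertices, which is why the size of the core-set directly controls the cost of this search.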

For ease of exposition, we will assume that the input graph for MaxCut is unweighted. Our techniques apply as long as all the weights are . Also, we denote by the average degree, i.e., .

Moving now to correlation clustering, let $G = (V, E, c^+, c^-)$ be a graph with edge weights $c^+_{ij}$ and $c^-_{ij}$, where for every edge $ij$ we have $c^+_{ij}, c^-_{ij} \ge 0$ and only one of them is nonzero. For every edge $ij \in E$, we define $\eta_{ij} = c^+_{ij} - c^-_{ij}$ and for each vertex, $d_i = \sum_j |\eta_{ij}|$. We will also assume that all the weights are bounded by an absolute constant in magnitude (for simplicity, we assume it is $1$). We define the “average degree” (used in the statements that follow) of a correlation clustering instance to be $\frac{1}{n} \sum_i d_i$.

Definition 2.2 (MAX-AGREE correlation clustering).

Given $G = (V, E, c^+, c^-)$ as above, consider a partition of $V$ into clusters $C_1, C_2, \dots, C_k$, and let $\chi_{ij}$ be an indicator that is $1$ if $i$ and $j$ are in the same cluster and $0$ otherwise. The MAX-AGREE score of this clustering is given by $\sum_{ij} c^+_{ij} \chi_{ij} + \sum_{ij} c^-_{ij} (1 - \chi_{ij})$. The goal is to find a partition maximizing this score. The maximum value of the score over all partitions of $V$ will be denoted by $\mathrm{CC}(G)$.

Note that the objective value can be simplified to $C^- + \sum_{ij} \chi_{ij} \eta_{ij}$, where $C^-$ denotes the sum $\sum_{ij} c^-_{ij}$.
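As an illustration of the MAX-AGREE objective, the following sketch scores a given clustering. It assumes the standard form of the objective: a positively weighted pair counts when its endpoints share a cluster, and a negatively weighted pair counts when they do not. The dictionaries `c_plus` and `c_minus` and the mapping `cluster_of` are hypothetical names chosen for this example.

```python
def max_agree_score(c_plus, c_minus, cluster_of):
    """MAX-AGREE score of a clustering.

    c_plus, c_minus: dicts mapping a pair (i, j) to a nonnegative weight
                     (assumed: at most one of the two is nonzero per pair).
    cluster_of:      dict mapping each vertex to its cluster label.
    """
    score = 0.0
    for (i, j) in set(c_plus) | set(c_minus):
        if cluster_of[i] == cluster_of[j]:
            # Agreement on a "similar" pair placed together.
            score += c_plus.get((i, j), 0.0)
        else:
            # Agreement on a "dissimilar" pair placed apart.
            score += c_minus.get((i, j), 0.0)
    return score

# Pairs (0,1) and (1,2) are similar, pair (0,2) is dissimilar.
c_plus = {(0, 1): 1.0, (1, 2): 1.0}
c_minus = {(0, 2): 1.0}
one_cluster = {0: "a", 1: "a", 2: "a"}
singletons = {0: "a", 1: "b", 2: "c"}
```

No clustering can satisfy all three pairs here, since similarity of (0,1) and (1,2) forces 0 and 2 together.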

We will also frequently use concentration bounds, which we state next.

3 Preliminaries

We will frequently appeal to Bernstein’s inequality for concentration of linear forms of random variables. For completeness, we state it here.

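Theorem 3.1 (Bernstein’s inequality [15]).

Let the random variables $X_1, \dots, X_n$ be independent with $X_i - \mathbb{E}[X_i] \le b$ for each $i \in [n]$. Let $X = \sum_i X_i$ and let $\sigma^2 = \sum_i \sigma_i^2$ be the variance of $X$, where $\sigma_i^2$ is the variance of $X_i$. Then, for any $t > 0$,

$$\Pr\left[X > \mathbb{E}[X] + t\right] \le \exp\left(-\frac{t^2}{2\sigma^2 \left(1 + bt/3\sigma^2\right)}\right).$$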
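For intuition, the sketch below evaluates the standard one-sided form of Bernstein's bound, $\exp(-t^2 / (2(\sigma^2 + bt/3)))$, and compares it with an empirical tail for sums of fair coin flips. The function names and the choice of test distribution are assumptions made for illustration.

```python
import math
import random

def bernstein_bound(variance, b, t):
    """One-sided Bernstein tail bound exp(-t^2 / (2 (sigma^2 + b t / 3)))."""
    return math.exp(-t * t / (2.0 * (variance + b * t / 3.0)))

def empirical_tail(n, t, trials, rng):
    """Fraction of trials in which a sum of n fair coin flips
    exceeds its mean n/2 by more than t."""
    hits = 0
    for _ in range(trials):
        total = sum(1 for _ in range(n) if rng.random() < 0.5)
        if total - n / 2.0 > t:
            hits += 1
    return hits / trials

# 100 fair coins: each centered variable is at most b = 0.5,
# and the variance of the sum is 100 * 0.25 = 25.
bound = bernstein_bound(variance=25.0, b=0.5, t=15.0)
observed = empirical_tail(n=100, t=15.0, trials=2000, rng=random.Random(0))
```

The bound is not tight for this distribution, but it captures the sub-Gaussian behaviour for moderate deviations that the proofs rely on.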

A slightly more non-standard concentration inequality we use is from Boucheron, Massart and Lugosi [13]. It can be viewed as an exponential version of the classic Efron-Stein lemma.

Theorem 3.2 ([13]).

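Assume that $X_1, \dots, X_n$ are independent random variables, and $X_1^n$ is the vector of these $n$ random variables. Let $Z = f(X_1, \dots, X_n)$, where $f : \chi^n \to \mathbb{R}$ is a measurable function. Define $Z^{(i)} = f(X_1, \dots, X_{i-1}, X_i', X_{i+1}, \dots, X_n)$, where $X_1', \dots, X_n'$ denote the independent copies of $X_1, \dots, X_n$. Then, for all $\theta > 0$ and $\lambda \in (0, 1/\theta)$,

$$\log \mathbb{E}\left[e^{\lambda (Z - \mathbb{E}[Z])}\right] \le \frac{\lambda \theta}{1 - \lambda \theta} \log \mathbb{E}\left[e^{\lambda V_+ / \theta}\right],$$

where $V_+$ is the random variable defined as

$$V_+ = \mathbb{E}\left[\sum_{i=1}^{n} \left(Z - Z^{(i)}\right)^2 \mathbb{1}_{Z > Z^{(i)}} \,\middle|\, X_1^n\right].$$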

4 Technical overview

We now present an outline of our main ideas. Suppose we have a graph $G = (V, E)$. First, we define a procedure vertex sample. This takes as input probabilities $p_i$ for every vertex, and produces a random weighted induced subgraph.

Procedure vertex sample $(\{p_i\}_{i \in V})$. Sample a set $S'$ of vertices by selecting each vertex $i$ with probability $p_i$ independently. Define $H$ to be the induced subgraph of $G$ on the vertex set $S'$. For every edge $ij$ with $i, j \in S'$, define $w_{ij} = \frac{1}{p_i p_j}$. (In correlation clustering, we have edge weights to start with, so the weight in $H$ will be $\frac{c^+_{ij}}{p_i p_j}$ (or $\frac{c^-_{ij}}{p_i p_j}$).)
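The vertex sample procedure can be sketched as follows. This is a minimal illustration under one assumption: each surviving edge $ij$ receives weight $1/(p_i p_j)$, which is the choice that keeps the total edge weight equal to the original edge count in expectation. The function and variable names are hypothetical.

```python
import random

def vertex_sample(n, edges, p, rng):
    """Keep each vertex i independently with probability p[i] and return
    the sampled vertex set plus the reweighted induced subgraph.

    edges: list of (i, j) pairs of an unweighted graph on vertices 0..n-1.
    Assumed reweighting: a surviving edge (i, j) gets weight 1/(p[i]*p[j]).
    """
    sampled = {i for i in range(n) if rng.random() < p[i]}
    weighted = {
        (i, j): 1.0 / (p[i] * p[j])
        for (i, j) in edges
        if i in sampled and j in sampled
    }
    return sampled, weighted

# Complete graph on 4 vertices, every vertex kept with probability 1/2.
k4_edges = [(i, j) for i in range(4) for j in range(i + 1, 4)]
rng = random.Random(1)
trials = 4000
average_total = sum(
    sum(vertex_sample(4, k4_edges, [0.5] * 4, rng)[1].values())
    for _ in range(trials)
) / trials
```

Averaged over many runs, the total weight of the sample stays close to the six edges of the original graph, even though any single sample may contain far fewer edges.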

Intuitively, the edge weights are chosen so that the total number of edges remains the same, in expectation. Next, we define the notion of an importance score for vertices. Let $d_i$ denote the degree of vertex $i$.

Definition 4.1.

The importance score of a vertex is defined as , where is an appropriately chosen parameter (for MaxCut, we set it to , and for correlation clustering, we set it to , where is an absolute constant).

The main result is now the following:

Theorem 4.2 (Core-set).

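Let $G = (V, E)$ have an average degree $n^{\delta}$. Suppose we apply vertex sample with probabilities $p_i$ derived from the importance scores to obtain a weighted graph $H$. Then $H$ has $\tilde{O}(n^{1-\delta})$ vertices and the quantities $\mathrm{MaxCut}(H)$ and $\mathrm{CC}(H)$ are within a $(1+\epsilon)$ factor of the corresponding quantities $\mathrm{MaxCut}(G)$ and $\mathrm{CC}(G)$, w.h.p.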

While the number of vertices output by the vertex sample procedure is small, we would like a core-set of small “total size”. This is ensured by the following.

Procedure edge sample . Given a weighted graph with total edge weight , sample each edge independently with probability , to obtain a graph . Now, assign a weight to the edge in .

The procedure samples roughly edges, with probability proportional to the edge weights. The graph is then re-weighted in order to preserve the total edge weight in expectation, yielding:
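A hedged sketch of the edge sample step follows. The exact sampling probability is not recoverable from the text here, so the code assumes a generic importance-sampling rule: edge $e$ is kept with probability $\min(1, k \cdot w_e / W)$ for a target count $k$, and a kept edge is reweighted to $w_e$ divided by its sampling probability. The parameter `target` and the function name are hypothetical.

```python
import random

def edge_sample(weighted_edges, target, rng):
    """Sample edges with probability proportional to weight.

    weighted_edges: dict mapping an edge to a positive weight.
    target:         assumed knob for the expected number of kept edges.
    A kept edge gets weight w / prob, so the total weight is preserved
    in expectation.
    """
    total = sum(weighted_edges.values())
    sampled = {}
    for edge, w in weighted_edges.items():
        prob = min(1.0, target * w / total)
        if rng.random() < prob:
            sampled[edge] = w / prob
    return sampled

# Ten unit-weight edges with target 5: each edge is kept with
# probability 1/2 and, if kept, carries weight 2.
path = {(i, i + 1): 1.0 for i in range(10)}
sparse = edge_sample(path, target=5, rng=random.Random(3))
```

With unit weights every retained edge ends up with the same weight `total / target`, so the sparsified graph can be stored as an unweighted edge list plus one scalar.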

Theorem 4.3 (Sparse core-set).

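Let $G$ be a graph on $n$ vertices with average degree $n^{\delta}$. Let $H$ be the graph obtained by first applying vertex sample and then applying edge sample. Then $H$ is a core-set for $\mathrm{MaxCut}$ and $\mathrm{CC}$ (preserving both up to a $(1+\epsilon)$ factor), having size $\tilde{O}(n^{1-\delta})$.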

We then show how to implement the above procedures in a streaming setting. This gives:

Theorem 4.4 (Streaming algorithm).

Let $G$ be a graph on $n$ vertices with average degree $n^{\delta}$, whose edges arrive in a streaming fashion in adversarial order. There is a two-pass streaming algorithm with space complexity $\tilde{O}(n^{1-\delta})$ for computing a $(1+\epsilon)$-approximation to $\mathrm{MaxCut}(G)$ and $\mathrm{CC}(G)$.

Of these, Theorem 4.2 is technically the most challenging. Theorem 4.3 follows via standard edge sampling methods akin to those in [2] (which show that w.h.p., every cut size is preserved). It is presented in Section 7, for completeness. The streaming algorithm, and a proof of Theorem 4.4, are presented in Section 8. In the following section, we give an outline of the proof of Theorem 4.2.

4.1 Proof of the sampling result (Theorem 4.2): an outline

In this outline we will restrict ourselves to the case of MaxCut as it illustrates our main ideas. Let $G$ be a graph as in the statement of the theorem, and let $H$ be the output of the procedure vertex sample.

Showing that $\mathrm{MaxCut}(H)$ is at least $\mathrm{MaxCut}(G)$ up to a small additive term is easy. We simply look at the projection of the maximum cut in $G$ to $H$ (see, for instance, [16]). Thus, the challenge is to show that a sub-sample cannot have a significantly larger cut, w.h.p. The natural approach of showing that every cut in $G$ is preserved does not work, as $2^n$ cuts are too many for the purposes of a union bound.

There are two known ways to overcome this. The first approach is the one used in [22, 16] and [12]. These works essentially show that in a graph of average degree , we need to consider only roughly cuts for the union bound. If all the degrees are roughly , then one can show that all these cuts are indeed preserved, w.h.p. There are two limitations of this argument. First, for non-regular graphs, the variance (roughly , where is the sampling probability) can be large, and we cannot take a union bound over cuts. Second, the argument is combinatorial, and it seems difficult to generalize this to analyze non-uniform sampling.

The second approach is via cut decompositions, developed in [18, 6]. Here, the adjacency matrix is decomposed into rank-1 matrices, plus a matrix that has a small cut norm. It turns out that solving many quadratic optimization problems (including MaxCut) on is equivalent (up to an additive ) to solving them over the sum of rank-1 terms (call this ). Now, the adjacency matrix of is an induced square sub-matrix of , and since we care only about (which has a simple structure), [6] could show that , w.h.p. To the best of our knowledge, such a result is not known in the “polynomial density” regime (though the cut decomposition still exists).

Our technique. We consider a new approach. While inspired by ideas from the works above, it also allows us to reason about non-uniform sampling in the polynomial density regime. Our starting point is the result of Arora et al. [10], which gives a method to estimate the MaxCut value using a collection of linear programs (which are, in turn, derived using a small random sample of vertices). Now, by a double sampling trick (which is also used in the approaches above), it turns out that showing a sharp concentration bound for the value of an induced sub-program of an LP as above implies Theorem 4.2. As the argument goes via linear programming and not a combinatorial argument, analyzing non-uniform sampling turns out to be quite direct. Let us now elaborate on this high level plan.

Induced sub-programs. First, we point out that an analysis of induced sub-programs is also an essential idea in the work of [6]. The main difference is that in their setting, only the variables are sub-sampled (and the number of constraints remains the same). In our LPs, the constraints correspond to the vertices, and thus there are fewer constraints in the sampled LP. This makes it harder to control the value of the objective. At a technical level, while a duality-based argument using Chernoff bounds for linear forms suffices in the setting of [6], we need the more recent machinery on concentration of quadratic functions.

We start by discussing the estimation technique of [10].

Estimation with Linear Programs. The rough idea is to start with the natural quadratic program for MaxCut: $\max \sum_{(i,j) \in E} x_i (1 - x_j)$, subject to $x_i \in \{0, 1\}$, where the sum is over ordered pairs, so that each edge appears in both orientations. (This is a valid formulation, because every edge $ij$ with $x_i \ne x_j$ contributes $1$ to the objective, and every edge with $x_i = x_j$ contributes $0$.) This is then “linearized” using a seed set of vertices sampled from $V$. We refer to Section 5 for details. For now, Est is a procedure that takes a graph and a set of sampling probabilities, samples a seed set using those probabilities, and produces an estimate of the MaxCut value of that graph.

Now, suppose we have a graph $G$ and a sample $H$. We can imagine running Est on $G$ and on $H$ to obtain good estimates of the respective MaxCut values. But now suppose that in both cases, we could use precisely the same seed set. Then, it turns out that the LPs used in Est on $H$ would be ‘induced’ sub-programs (in a sense we will detail in Section 6) of those used in Est on $G$, and thus proving Theorem 4.2 reduces to showing a sufficiently strong concentration inequality for sub-programs.

The key step above was the ability to use the same seed set in the Est procedures. This can be formalized as follows.

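Double sampling. Consider the following two strategies for sampling a pair of subsets $(S, S')$ of a universe $[n]$ (here, $q_i \le p_i$ for all $i$):

  • Strategy A: choose $S' \subseteq [n]$, by including every $i$ w.p. $p_i$, independently; then for $i \in S'$, include them in $S$ w.p. $q_i / p_i$, independently.

  • Strategy B: pick $S \subseteq [n]$, by including every $i$ w.p. $q_i$; then iterate over $[n]$ once again, placing $i \in S'$ with a probability equal to $1$ if $i \in S$, and $p_i^*$ if $i \notin S$.

Lemma 4.5.

Suppose $p_i^* = \frac{p_i - q_i}{1 - q_i}$. Then the distributions on pairs $(S, S')$ obtained by strategies A and B are identical.

The proof is by a direct calculation, which we state here.

Proof. Let us examine strategy A. It is clear that the distribution over $S$ is precisely the same as the one obtained by strategy B, since in both cases, every $i$ is included in $S$ independently of the other $i$, with probability precisely $p_i \cdot \frac{q_i}{p_i} = q_i$. Now, to understand the joint distribution $(S, S')$, we need to consider the conditional distribution of $S'$ given $S$. Firstly, note that in both strategies, $S \subseteq S'$, i.e., $\Pr[i \in S' \mid i \in S] = 1$. Next, we can write $\Pr[i \in S' \mid i \notin S]$ in strategy A as

$$\frac{\Pr[i \in S' \wedge i \notin S]}{\Pr[i \notin S]} = \frac{p_i \left(1 - q_i / p_i\right)}{1 - q_i} = \frac{p_i - q_i}{1 - q_i}.$$

Noting that this is precisely $p_i^*$ (by definition) concludes the proof. ∎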
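The calculation behind the double sampling lemma can be checked exactly for a single element. The sketch below uses assumed symbols: `p` is the probability of landing in the outer set, `q` (with `q <= p` and `q < 1`) the probability of landing in the inner set, and `p_star = (p - q) / (1 - q)` the correction probability used by strategy B. Since elements are treated independently, matching the per-element joint distribution is enough.

```python
from fractions import Fraction

def joint_strategy_a(p, q):
    """Outer set first (prob p), then inner set with prob q/p inside it."""
    return {
        "inner_and_outer": p * (q / p),
        "outer_only": p * (1 - q / p),
        "neither": 1 - p,
    }

def joint_strategy_b(p, q):
    """Inner set first (prob q); elements outside it join the outer set
    with the corrected probability p_star = (p - q) / (1 - q)."""
    p_star = (p - q) / (1 - q)
    return {
        "inner_and_outer": q,              # inner elements are always in the outer set
        "outer_only": (1 - q) * p_star,
        "neither": (1 - q) * (1 - p_star),
    }

# Exact rational arithmetic avoids any floating-point doubt.
p, q = Fraction(3, 5), Fraction(1, 5)
dist_a = joint_strategy_a(p, q)
dist_b = joint_strategy_b(p, q)
```

Both dictionaries come out as `{inner_and_outer: 1/5, outer_only: 2/5, neither: 2/5}` for these values, and the same identity holds symbolically for any admissible `p` and `q`.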

Proof of Theorem 4.2. To show the theorem, we use $p_i$ as in the statement of the theorem, and set $q_i$ to be the uniform distribution. The proof now proceeds as follows. Let $S'$ be a set sampled using the probabilities $p_i$. These form the vertex set of $H$. Now, the procedure Est on $H$ (with sampling probabilities $q_i / p_i$) samples the set $S$ (as in strategy A). By the guarantee of the estimation procedure (Corollary 5.1.2), the resulting estimate approximates $\mathrm{MaxCut}(H)$, w.h.p. Next, consider the procedure Est on $G$ with sampling probabilities $q_i$. Again, by the guarantee of the estimation procedure (Corollary 5.1.1), the resulting estimate approximates $\mathrm{MaxCut}(G)$, w.h.p.

Now, we wish to show that the two estimates are close to each other. By the equivalence of the sampling strategies, we can now take the strategy B view above. This allows us to assume that the Est procedures use the same $S$, and that we pick $S'$ after picking $S$. This reduces our goal to one of analyzing the value of a random induced sub-program of an LP, as mentioned earlier. The details of this step are technically the most involved, and are presented in Section 6. This completes the proof of the theorem. (Note that the statement also includes a bound on the number of vertices of $H$. This follows immediately from the choice of $p_i$.) ∎

5 Estimation via linear programming

We now present the estimation procedure Est used in our proof. It is an extension of [10] to the case of weighted graphs and non-uniform sampling probabilities.

Let $H = (V, E, w)$ be a weighted, undirected graph with edge weights $w_{ij}$, together with sampling probabilities for its vertices. The starting point is the quadratic program for MaxCut: $\max \sum_{(i,j) \in E} w_{ij} x_i (1 - x_j)$, subject to $x_i \in \{0, 1\}$. The objective can be re-written as $\sum_{i \in V} x_i \left(d_i - \sum_j w_{ij} x_j\right)$, where $d_i$ is the weighted degree, $d_i = \sum_j w_{ij}$. The key idea now is to “guess” the value of $\sum_j w_{ij} x_j$ for each $i$, by using a seed set of vertices. Given a guess, the idea is to solve the following linear program.

subject to

The variables are . Note that if we fix the , the optimal will satisfy . Also, note that if we have a perfect guess for ’s (coming from the MaxCut), the objective can be made .

Estimation procedure. The procedure Est is the following: first sample a set where each is included w.p. independently. For every partition of , set , and solve (in what follows, we denote this LP by , as this makes the partition and the sampling probabilities clear). Return the maximum of the objective values.

Our result here is a sufficient condition for having .

Theorem 5.1.

Let be a weighted graph on vertices, with edge weights that add up to . Suppose the sampling probabilities satisfy the condition


Then, we have , with probability at least (where the probability is over the random choice of ).

The proof of the Theorem consists of claims showing the upper and lower bound separately.

Claim 1. The estimate is not too small. I.e., w.h.p. over the choice of , there exists a cut of such that .

Claim 2. The estimate is not much larger than an optimal cut. Formally, for any feasible solution to the LP (and indeed any values ), there is a cut in of value at least the LP objective.

Proof of Claim 1.

Let be the max cut in the full graph . Now consider a sample , and let be its projection onto . For any vertex , recall that , where is the indicator for . Thus

We will use Bernstein’s inequality to bound the deviation in from its mean. To this end, note that the variance can be bounded as

In what follows, let us write and . Then, for every , our assumption on the implies that . Thus, summing over , we can bound the variance by . Now, using Bernstein’s inequality (Theorem 3.1),


Setting , and simplifying, we have that the probability above is . Thus, we can take a union bound over all , and conclude that w.p. ,


For any that satisfies the above, consider the solution that sets for and otherwise. We can choose , by the above reasoning (eq. (3)). Thus the LP objective can be lower bounded as

This is precisely . This completes the proof of the claim. ∎

Proof of Claim 2.

Suppose we have a feasible solution to the LP, of objective value , and we wish to move to a cut of at least this value. To this end, define the quadratic form

The first observation is that for any , and any real numbers , we have

This is true simply because , and the fact that the second term is at least , as .

Next, note that the maximum of the form over $[0, 1]^n$ has to occur at a boundary point, since for any fixing of the variables other than a given $x_i$, the form reduces to a linear function of $x_i$, which attains its maximum at one of the boundaries. Using this observation repeatedly lets us conclude that there is an integral point in $\{0, 1\}^n$ at which the form is at least as large as at the given feasible solution. Since any such point corresponds to a cut, and the value of the form there corresponds to the cut value, the claim follows. (We note that the proof in [10] used randomized rounding to conclude this claim, but this argument is simpler; also, later papers such as [17] used such arguments for derandomization.)
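The boundary-point argument is constructive, and can be sketched as a coordinate-by-coordinate rounding. The code below assumes the multilinear form $\sum_{ij \in E} w_{ij} \left(x_i (1 - x_j) + x_j (1 - x_i)\right)$, which equals the cut weight at 0/1 points; the function names are illustrative and this is not the exact form used in the proof.

```python
def cut_form(edges, x):
    """Multilinear extension of the cut weight; at 0/1 points it equals
    the total weight of edges crossing the cut."""
    return sum(w * (x[i] * (1 - x[j]) + x[j] * (1 - x[i])) for (i, j, w) in edges)

def round_to_cut(edges, x):
    """Round a fractional point in [0,1]^n to a 0/1 point, one coordinate
    at a time, never decreasing the form. This works because the form is
    linear in each coordinate when the others are held fixed."""
    x = list(x)
    for i in range(len(x)):
        low = x[:i] + [0.0] + x[i + 1:]
        high = x[:i] + [1.0] + x[i + 1:]
        x = high if cut_form(edges, high) >= cut_form(edges, low) else low
    return x

# Start from the all-halves point on a triangle.
triangle_edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)]
fractional = [0.5, 0.5, 0.5]
rounded = round_to_cut(edges=triangle_edges, x=fractional)
```

On the triangle the fractional point has value 1.5, and the rounded point is a genuine cut of value 2, which is the maximum cut.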

Finally, to show Theorem 4.2 (as outlined in Section 4.1), we need to apply Theorem 5.1 with specific values for and . Here we state two related corollaries to Theorem 5.1 that imply good estimates for the MaxCut.

Corollary 5.1.1.

Let in the framework be the original graph , and let for all . Then the condition holds for all , and therefore , w.p. .

The proof is immediate (with a slack of ), as , , and all are equal.

Corollary 5.1.2.

Let be the weighted sampled graph obtained from vertex sample, and let . Then the condition (1) holds w.p. , and therefore w.p. .


In this case, we have . Thus, simplifying the condition, we need to show that

Now, for sampled via probabilities , we have (in expectation) , and . A straightforward application of Bernstein’s inequality yields that and , w.p. at least . This completes the proof. ∎

6 Random induced linear programs

We will now show that the Est on has approximately the same value as the estimate on (with appropriate values). First, note that is , where . To write the LP, we need the constants , defined by the partition as . For the graph , the estimation procedure uses an identical program, but the sampling probabilities are now , and the estimates , which we now denote by , are defined by . Also, by the way we defined , . The degrees are now . The two LPs are shown in Figure 1.

Figure 1: The two LPs. (a) The LP on the full graph. (b) The sampled LP.

Our aim in this section is to show the following:

Theorem 6.1.

Let be an input graph, and let be sampled as described in Section 4.1. Then, with probability , we have

Proof outline. To prove the theorem, the idea is to take the “strategy B” viewpoint of sampling , i.e., fix , and sample using the probabilities . Then, we only need to understand the behavior of an “induced sub-program” sampled with the probabilities . This is done by considering the duals of the LPs, and constructing a feasible solution to the induced dual whose cost is not much larger than the dual of the full program, w.h.p. This implies the result, by linear programming duality.

Let us thus start by understanding the dual of given , shown in Figure 1(a). We note that for any given , the optimal choice of is ; thus we can think of the dual solution as being the vector . The optimal may thus be bounded by , a fact that we will use later. Next, we write down the dual of the induced program, , as shown in Figure 1(b).

Figure 2: The dual LPs. (a) The dual of the LP on the full graph. (b) The dual of the induced program.

Following the outline above, we will construct a feasible solution to LP (1(b)), whose cost is close to the optimal dual solution to LP (1(a)). The construction we consider is very simple: if is the optimal dual solution to (1(a)), we set for as the candidate solution to (1(b)). This is clearly feasible, and thus we only need to compare the solution costs. The dual objective values are as follows


Note that there is a in (5), as is simply one feasible solution to the dual (which is a minimization program). Next, our goal is to prove that w.p. at least ,

Note that here, the probability is over the choice of given (as we are taking view-B of the sampling). The first step in proving the above is to move to a slight variant of the quantity , which is motivated by the fact that is not quite , but (as we have conditioned on ). Let us define (recall that is ), and . So also, let . Then, define


A straightforward lemma, which we use here, is the following. Here we bound the difference between the “corrected” dual we used to analyze, and the value we need for the main theorem. Specifically, we bound .

Lemma 6.2.

Let be sampled as in Section 4.1. Then w.p. at least , we have that for all and for all partitions of , . (Note that the partition defines the .)


To prove the lemma, it suffices to prove that w.p. ,


This is simply by using the fact that are always in . Before showing this, we introduce some notation and make some simple observations. First, denote by the indicator vector for and by the indicator for .

Observation 6.3.

With probability over the choice of , we have:

  1. For all , .

  2. For all , .

  3. .

  4. .

All the inequalities are simple consequences of Bernstein’s inequality (and our choice of parameters ,