## 1 Introduction

Statistical inference over structured instances of dependent variables (e.g., labeled sequences, trees, or general graphs) is a fundamental problem in many areas. Examples include computer vision (Nowozin et al., 2011; Dollár & Zitnick, 2013; Chen et al., 2018), natural language processing (Huang et al., 2015; Hu et al., 2016), and computational biology (Li et al., 2007). In many practical setups (Shin et al., 2015; Rekatsinas et al., 2017; Sa et al., 2019; Heidari et al., 2019b), inference problems involve noisy observations of discrete labels assigned to the nodes and edges of a given structured instance, and the goal is to infer a labeling of the vertices that achieves a low disagreement rate between the correct ground truth labels and the predicted labels , i.e., low Hamming error. We refer to this problem as statistical recovery.

Our motivation to study the problem of statistical recovery stems from our recent work on data cleaning (Rekatsinas et al., 2017; Sa et al., 2019; Heidari et al., 2019b). This work introduces HoloClean, a state-of-the-art inference engine for data curation that casts data cleaning as a structured prediction problem (Sa et al., 2019): Given a dataset as input, it associates each of its cells with a random variable, and uses logical integrity constraints over this dataset (e.g., key constraints or functional dependencies) to introduce dependencies over these random variables. The labels that each random variable can take are determined by the domain of the attribute associated with the corresponding cell. Since we focus on data cleaning, the input dataset corresponds to a noisy version of the latent, clean dataset. Our goal is to recover the latter. Hence, the initial value of each cell corresponds to a noisy observation of our target random variables. HoloClean employs approximate inference methods to solve this structured prediction problem. While its inference procedure comes with no rigorous guarantees, HoloClean achieves state-of-the-art results in practice. Our goal in this paper is to understand this phenomenon.

Recent works have also studied the problem of approximate inference in the presence of noisy vertex and edge observations. However, they are limited to the case of binary-labeled variables: Globerson et al. focused on two-dimensional grid graphs and showed that a polynomial-time algorithm based on MaxCut can achieve optimal Hamming error for planar graphs for which a weak expansion property holds (Globerson et al., 2015). More recently, Foster et al. introduced an approximate inference algorithm based on tree decompositions that achieves low expected Hamming error for general graphs with bounded treewidth (Foster et al., 2018). In this paper, we generalize these results to the case of categorical labels.

Problem and Challenges We study the problem of statistical recovery over categorical data. We consider structured instances where each variable takes a ground truth label in the discrete set . We assume that for all variables , we observe a noisy version of its ground truth labeling such that with probability . We also assume that for all variable pairs , we observe noisy measurements of the indicator such that with probability . Given these noisy measurements, our goal is to obtain a labeling of the variables such that the expected Hamming error between and is minimized. We now provide some intuition on the challenges that categorical variables pose and why current approximate inference methods are not applicable.

First, in contrast to the binary case, negative edge measurements do not carry the same amount of information. Consider a simple uniform noise model. In the case of binary labels, observing an edge measurement and a binary label allows us to estimate that is correct with probability when and are bounded away from 1/2. However, in the categorical setup, can take any of the labels, hence the probability of estimate being correct is up to a factor of smaller than the binary case. Our main insight is that while the binary case leverages edge labels for inference, approximate inference methods for categorical instances need to rely on the noisy node measurements and the positive edge measurements.

Second, existing approximate inference methods for statistical recovery (Globerson et al., 2015; Foster et al., 2018) rely on a “Flipping Argument” that is limited to binary variables to obtain low Hamming error: for binary node and edge observations, if all nodes in a maximal connected subgraph are labeled incorrectly with respect to the ground truth, then at least half of the edge observations on the boundary of are incorrect, or else the inference method would have flipped all node labels in to obtain a better solution with respect to the total Hamming error. As we discuss later, in the categorical case a naive extension implies that one needs to reason about all possible label permutations over the labels.

Contributions We present a new approximate inference algorithm for statistical recovery with categorical variables. Our approach is inspired by that of Foster et al. (2018) but generalizes it to categorical variables.

First, we show that, when a variable is assigned one of the erroneous labels with uniform probability , the optimal Hamming error for trees with nodes is , when . This is obtained by solving a linear program using dynamic programming. Here, we derive a tight upper bound on the number of erroneous edge measurements, which we use to restrict the space of solutions explored by the linear program.

Second, we extend our method to general graphs using a tree decomposition of the structured input. We show how to combine our tree-based algorithm with correlation clustering over a fixed number of clusters (Giotis & Guruswami, 2006) to obtain a non-trivial error rate for graphs with bounded treewidth and a specified number of classes. Our method achieves an expected Hamming error of where is the maximum degree of graph . We show that local pairwise label swaps are enough to obtain a globally consistent labeling with low expected Hamming error.

Finally, we validate our theoretical bounds via experiments on tree graphs and image data. Our empirical study demonstrates that our approximate inference algorithm achieves low Hamming error in practical scenarios.

## 2 Preliminaries

We introduce the problem of statistical recovery, and describe concepts, definitions, and notation used in the paper. We consider a structured instance represented by a graph with and . Each vertex represents a random variable with ground truth label in the discrete set . Edges in represent dependencies between random variables and each edge has a ground truth measurement where if and otherwise.

Uniform Noise Model and Hamming Error We assume access to noisy observations over the nodes and edges of . For each variable , we are given a noisy label observation , and for each edge we are given a noisy edge observation . These noisy observations are assumed to be generated from , and by the following process: We are given and two parameters, edge noise and node noise with . For each edge , the observation is independently sampled to be with probability (a good edge) and with probability (a bad edge). For each node , the node observation is independently sampled to be with probability (a good node) and can take any other label in with a uniform probability . The uniform noise model is a direct extension of that considered by prior work (Globerson et al., 2015; Foster et al., 2018), and a first natural step towards studying statistical recovery for categorical variables.

Given the noisy measurements and over graph , a labeling algorithm is a function : . We follow the setup of Globerson et al. (2015) to measure the performance of . We consider the expectation of the Hamming error (i.e., the number of mispredicted labels) over the observation distribution induced by . We consider as error the worst-case (over the draw of ) expected Hamming error, where the expectation is taken over the process generating the observations from . Our goal is to find an algorithm such that with high probability it yields bounded worst-case expected Hamming error. In the remainder of the paper, we will refer to the worst-case expected Hamming error as simply Hamming error.
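The uniform noise model and the Hamming error can be made concrete with a short simulation. The sketch below is illustrative only: it assumes labels are the integers 0 to k-1, that edge measurements use a +1/-1 encoding (equal or different labels), and that `p` and `q` denote the edge and node noise. The function names are hypothetical and not taken from the paper.

```python
import random

def edge_measurement(a, b):
    """Assumed encoding: +1 if the two labels agree, -1 otherwise."""
    return 1 if a == b else -1

def sample_observations(labels, edges, k, p, q, rng):
    """Draw noisy node labels Z and noisy edge measurements X from ground truth labels."""
    Z = []
    for y in labels:
        if rng.random() < q:
            # Bad node: pick one of the k-1 wrong labels uniformly at random.
            Z.append(rng.choice([c for c in range(k) if c != y]))
        else:
            Z.append(y)
    X = {}
    for (u, v) in edges:
        x = edge_measurement(labels[u], labels[v])
        # Bad edge: the measurement is flipped with probability p.
        X[(u, v)] = -x if rng.random() < p else x
    return Z, X

def hamming_error(predicted, truth):
    """Number of positions where the two labelings differ."""
    return sum(1 for a, b in zip(predicted, truth) if a != b)

rng = random.Random(0)
truth = [0, 0, 1, 2, 2]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
Z, X = sample_observations(truth, edges, k=3, p=0.1, q=0.2, rng=rng)
print("Hamming error of the raw node observations:", hamming_error(Z, truth))
```

Keeping the raw node observations as the prediction corresponds to the trivial baseline discussed later in the paper; any labeling algorithm that uses the edge measurements should be compared against it.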

Categorical Labels and Edge Measurements When is close to , one needs to leverage the edge measurements to predict the node labels correctly. For binary labels, the structure of the graph alone determines if one can obtain algorithms with a small error for low constant edge noise (Globerson et al., 2015; Foster et al., 2018). We argue that this is not the case for categorical labels. Beyond the structure of the graph , the number of labels also determines when we can obtain labeling algorithms with non-trivial error bounds.

We use the next example to provide some intuition on how affects the amount of information in the edge measurements of : Let nodes take labels in . We fix a vertex , and for each vertex in its neighborhood set the estimate label to if and to one of uniformly at random if . For a correct negative edge measurement and a correct label assignment to , we are not guaranteed to obtain the correct label for as we would be able in the binary case.

Given the above setup, the probability that node is labeled correctly is where is the probability of an edge being negative in the ground truth labeling of . Two observations emerge from this expression: (1) As the number of colors increases, the probability decreases, hence, for a fixed graph as increases, statistical recovery becomes harder; (2) For a fixed graph , as increases the probability of obtaining a negative edge in the ground truth labeling of increases— this holds for a fixed graph and under the assumption that each label should appear at least once in the ground truth—and the term approaches zero. This implies that for to be meaningful the term should be maximized for fixed , and hence, the edge noise should approach zero as a function of . In other words, should be upper bounded by a function such that as increases goes to zero. We leverage these two observations to specify when statistical recovery is possible.
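The two observations above can be checked numerically. The following sketch simulates the labeling rule of the example for a single edge. It assumes that the fixed vertex carries its correct label, that a negative measurement leads to a uniform draw among the other k-1 labels, and that `r` is the probability of a negative edge in the ground truth. The closed form in `p_correct` is derived from these assumptions alone and is not quoted from the paper.

```python
import random

def p_correct(k, p, r):
    """Probability that the neighbor is labeled correctly under the stated assumptions.

    Positive edge observed correctly: the copied label is right.
    Negative edge observed correctly: a uniform guess over k-1 labels is right
    with probability 1/(k-1). A flipped measurement always leads to a wrong label.
    """
    return (1 - p) * (1 - r) + (1 - p) * r / (k - 1)

def simulate(k, p, r, trials, rng):
    """Monte Carlo estimate of the same probability."""
    hits = 0
    for _ in range(trials):
        y_v = 0  # label of the fixed vertex, assumed to be known and correct
        # The neighbor differs from y_v with probability r (labels 1..k-1).
        y_u = rng.randrange(1, k) if rng.random() < r else y_v
        x = 1 if y_u == y_v else -1
        if rng.random() < p:
            x = -x  # bad edge
        guess = y_v if x == 1 else rng.randrange(1, k)
        hits += guess == y_u
    return hits / trials

for k in (2, 5, 20):
    print(k, round(p_correct(k, p=0.1, r=0.6), 4))
```

For k = 2 the closed form reduces to 1 - p, which matches the binary intuition that a correct edge measurement always determines the neighbor's label; for larger k the value drops because correct negative edges no longer identify the label.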

Statistical Recovery Statistical recovery is possible for the family of structured instances with categories, if there exists a function with such that for every that is upper bounded by a function with , the Hamming error of a labeling algorithm on graph with vertices is at most .

## 3 Approach Overview

We consider a graph with node labels in . The space of all possible labelings of defines a *hypothesis space* . In this space, we denote the latent, ground truth labeling of . In the absence of any information the size of this space is . Access to any side information allows us to identify a subspace of that is close to .

First, we consider access only to noisy node labels of and denote the point in for this labeling. If we have no side information on the edges of , the information theoretic optimal solution to statistical recovery is (because we assume ). Second, we assume access only to edge measurements for . We denote the observed edge measurements. If the edge measurements are accurate (i.e., ) the size of reduces to . We assume that is such that one can obtain a labeling for that is edge-compatible with by traversing . Under this assumption, the number of edge-compatible labelings is equal to all possible label permutations, i.e., . Finally, in the presence of both node and edge observations the information theoretic optimal solution to statistical recovery corresponds to a point that is obtained by running exact marginal inference (Globerson et al., 2015). However, exact inference can be intractable, and even when it is efficient, it is not clear what is the optimal Hamming error that yields with respect to .

To address these issues, we propose an approximate inference scheme and obtain a bound on the worst-case expected Hamming error that it obtains. We start with the noisy edge observations and use them to find a subspace that contains node labelings which induce edge labelings that are close to (in terms of Hamming distance). We formalize this in the next two sections. Intuitively, we have that noisy edge measurements partition the space in a collection of *edge classes*.

###### Definition 1.

The edge class of a point is a set such that for all , induces the same edge measurements as . All points in can be derived via a label permutation of . In general, for any labeling , set is the set of all labelings that can be generated by a label permutation of .
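As an illustration of Definition 1, the sketch below enumerates the edge class of a small labeling by applying every permutation of the label names and confirms that all members induce the same edge measurements. The +1/-1 edge encoding, the integer labels, and the helper names are assumptions made for this example.

```python
from itertools import permutations

def induced_edges(labeling, edges):
    """Edge measurements implied by a node labeling (+1 equal, -1 different)."""
    return tuple(1 if labeling[u] == labeling[v] else -1 for u, v in edges)

def edge_class(labeling, k):
    """All labelings reachable from `labeling` by permuting the k label names."""
    return {tuple(perm[y] for y in labeling) for perm in permutations(range(k))}

edges = [(0, 1), (1, 2), (2, 3)]
labeling = (0, 0, 1, 2)
members = edge_class(labeling, k=3)
reference = induced_edges(labeling, edges)
print(len(members), all(induced_edges(m, edges) == reference for m in members))
```

When the labeling uses all k labels, the class has k! members, which is the count of edge-compatible labelings mentioned earlier in this section.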

The restricted subspace contains those edge classes that are close to the noisy edge observations .

Given the restricted subspace , we design an algorithm to find a point such that the Hamming error between and is minimized. We define the Hamming error with respect to an edge class as:

###### Definition 2.

Point might not be in and the distance between and is the approximation error we have due to approximate inference. Finally, we prove that the expected Hamming error between and is bounded. A schematic diagram of our approximate inference method is shown in Figure 1. In the following sections, we study statistical recovery for trees (in Section 4) and general graphs (in Section 5). All proofs can be found in the supplementary material of our paper (Heidari et al., 2019a).

## 4 Recovery in Trees

We focus on trees and introduce a linear program for statistical recovery over -categorical random variables. We prove that under a uniform noise model the optimal Hamming error is .

### 4.1 A Linear Program for Statistical Recovery

We follow the steps described in Section 3. First, we use the noisy edge observations to restrict the search for to a subspace . We describe via a constraint on the number of edge disagreements between the edge labeling implied by and the noisy edge observations . Second, we form an optimization problem to find a point with minimum Hamming distance from that satisfies the aforementioned constraint.

The ground truth edge labeling (corresponding to the ground truth node labeling ) has bounded Hamming distance from the observed noisy labeling . Hence, we can restrict the space of considered solutions to node labelings that induce an edge labeling with a bounded Hamming distance from the observed noisy labeling . We have: Under the uniform noise model, edge measurements are flipped independently. Thus, the total number of bad edges is a sum over independent and identically distributed (iid) random variables. The expected number of flipped edges is . Using the Bernstein inequality, we have:

###### Lemma 1.

Let be a graph with noisy edge observations with noise parameter . With probability at least over the draw of :

This lemma states that under the uniform noise model the ground truth edge labeling for Graph is in the neighborhood of with high probability. Given this bound, we use the following linear program to find :

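$$
\begin{aligned}
\min_{\hat{Y} \in [k]^{|V|}} \quad & \sum_{v \in V} \mathbf{1}\{\hat{Y}_v \neq Z_v\} \\
\text{s.t.} \quad & \sum_{(u,v) \in E} \mathbf{1}\{\varphi(\hat{Y}_u, \hat{Y}_v) \neq X_{u,v}\} \leq t
\end{aligned}
\tag{1}
$$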

where is defined as in Lemma 1. This problem can be solved via a dynamic programming algorithm with cost . We describe this algorithm in the supplementary material of the paper (Heidari et al., 2019a).
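The authors' dynamic program is given in their supplementary material and is not reproduced here. As a rough illustration of how Linear Program 1 can be solved on a tree, the sketch below uses one possible dynamic program over (node, label, violation budget) states and checks it against brute-force enumeration. The rooted-tree representation, the +1/-1 edge encoding, and all function names are assumptions for this example; it returns only the optimal objective value.

```python
import itertools
import math

def phi(a, b):
    """Assumed edge encoding: +1 for equal labels, -1 otherwise."""
    return 1 if a == b else -1

def min_node_disagreements(children, root, Z, X, k, t):
    """Minimum number of node disagreements with Z over labelings of a rooted
    tree that violate at most t edge observations.
    children: dict parent -> list of children; X: dict (parent, child) -> +1/-1."""
    INF = math.inf

    def solve(v):
        # table[a][b]: best cost in the subtree of v when v gets label a and
        # exactly b edge observations inside that subtree are violated.
        table = [[INF] * (t + 1) for _ in range(k)]
        for a in range(k):
            table[a][0] = 0 if a == Z[v] else 1
        for c in children.get(v, []):
            child = solve(c)
            merged = [[INF] * (t + 1) for _ in range(k)]
            for a in range(k):
                # best[b]: cheapest labeling of the subtree of c that violates
                # b observations, counting the edge (v, c) itself.
                best = [INF] * (t + 1)
                for a2 in range(k):
                    d = 0 if phi(a, a2) == X[(v, c)] else 1
                    for b in range(t + 1 - d):
                        best[b + d] = min(best[b + d], child[a2][b])
                # Knapsack-style merge of the two violation budgets.
                for b1 in range(t + 1):
                    for b2 in range(t + 1 - b1):
                        cost = table[a][b1] + best[b2]
                        merged[a][b1 + b2] = min(merged[a][b1 + b2], cost)
            table = merged
        return table

    return min(min(row) for row in solve(root))

def brute_force(edges, Z, X, k, t):
    """Reference answer by enumerating all k^n labelings (tiny inputs only)."""
    best = math.inf
    for Y in itertools.product(range(k), repeat=len(Z)):
        violated = sum(phi(Y[u], Y[v]) != X[(u, v)] for u, v in edges)
        if violated <= t:
            best = min(best, sum(y != z for y, z in zip(Y, Z)))
    return best

children = {0: [1, 2], 1: [3]}
edges = [(0, 1), (0, 2), (1, 3)]
Z = [0, 1, 1, 2]
X = {(0, 1): 1, (0, 2): -1, (1, 3): 1}
for t in range(3):
    print(t, min_node_disagreements(children, 0, Z, X, k=3, t=t))
```

This version uses recursion and is meant for small trees; it illustrates why a tight budget t matters, since the table size and the merge cost both grow with t.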

Discussion Our approach is similar to that of Foster et al. (2018) for binary random variables. However, we use the Bernstein inequality to obtain a tighter concentration bound on the number of flipped edge measurements. In the case of categorical random variables, it is critical to obtain a tight description of the space of the possible labeling solutions as we have a larger hypothesis space.
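To give a feel for the concentration step, the sketch below computes a deviation radius from the standard Bernstein inequality for a sum of independent, centered indicator variables and compares it with simulated counts of flipped edges. It uses the textbook form of the inequality with variance p(1-p) per edge and range bound 1; the constants are therefore not necessarily those of Lemma 1, and the names `bernstein_radius` and `exceed_fraction` are hypothetical.

```python
import math
import random

def bernstein_radius(m, p, delta):
    """Deviation t for which the Bernstein tail bound
    exp(-(t^2 / 2) / (m * p * (1 - p) + t / 3)) equals delta."""
    L = math.log(1 / delta)
    v = m * p * (1 - p)  # total variance of the m flip indicators
    # Positive root of t^2 - (2L/3) t - 2 L v = 0.
    return L / 3 + math.sqrt(L * L / 9 + 2 * v * L)

def exceed_fraction(m, p, delta, trials, rng):
    """Fraction of simulated draws whose flip count exceeds m*p + t."""
    t = bernstein_radius(m, p, delta)
    bad = 0
    for _ in range(trials):
        flips = sum(rng.random() < p for _ in range(m))
        bad += flips > m * p + t
    return bad / trials

m, p, delta = 500, 0.1, 0.05
print("expected flips:", m * p)
print("Bernstein radius:", round(bernstein_radius(m, p, delta), 2))
print("empirical exceed fraction:", exceed_fraction(m, p, delta, 300, random.Random(0)))
```

Because the radius scales with the standard deviation of the flip count rather than with the number of edges, the resulting constraint on edge disagreements is tighter for small p than a range-based bound would give.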

Let be the size of hypothesis space with labels and nodes. If we increase by one, the rate of change for the hypothesis space is , which is multiplicative with respect to . Similarly, as we increase to the size of the hypothesis space changes by , which is exponential in the size of our input. We need a tight bound to obtain an efficient dynamic programming algorithm with respect to and .

### 4.2 Upper Bound on the Hamming Error for Trees

The Hamming error of obtained by Linear Program 1 is bounded by with high probability. For our analysis, we draw connections to statistical learning.

We define a hypothesis class that contains all points that satisfy the bound in Lemma 1:

From Lemma 1, we have that the edge class that corresponds to the ground truth labeling is contained in with high probability over the draw of . Moreover, since the node noise is bounded away from , we can use the noisy node measurements to find a labeling that is in the same edge class as and close to . Such a labeling is obtained by solving Linear Program 1. From a statistical learning perspective, corresponds to the *empirical risk minimizer* (ERM) over given . Thus, the Hamming error between and is associated with the *excess risk* over for Class . We have:

###### Lemma 2.

(Foster et al., 2018) Let be the empirical risk minimizer over given and let and a constant number, then with probability over the draw of ,

We now analyze how the Hamming error relates to excess risk for categorical random variables. We have:

###### Lemma 3.

The Hamming error is proportional to the excess risk: For fixed and distributed according to the uniform noise model we have that:

With we have that , which recovers the result of Foster et al. (2018) for binary random variables.

Using Lemma 2, we can bound the excess risk in terms of the size of the hypothesis class. We have:

###### Corollary 1.

When and , we have that with probability at least over the draw of :

We now combine these results with the complexity of class to obtain a bound for the Hamming error:

###### Theorem 1.

Let be the solution to Problem 1. Then with probability at least over the draw of and

## 5 Recovery in General Graphs

We now show how our tree-based algorithm can be combined with correlation clustering to obtain a non-trivial error rate for graphs with bounded treewidth and -categorical random variables. We first describe our approximate inference algorithm and then show that our algorithm achieves an expected Hamming error of where is the maximum degree of the structured instance .

### 5.1 Approximate Statistical Recovery

We build upon the concept of *tree decompositions* (Diestel, 2018). Let be a graph, be a tree, and be a family of vertex sets indexed by the nodes of . We denote a tree-decomposition with . The width of is defined as and the treewidth of is the minimum width among all possible decompositions of . We also denote with the edges connecting the bags in in and represent as .

Given a graph , a tree decomposition of defines a series of local subproblems whose solutions can be combined via dynamic programming to obtain a global solution for the original problem on . For graphs of bounded treewidth, this approach allows us to obtain efficient algorithms (Bodlaender, 1988). Our solution proceeds as follows: Let be a tree decomposition of . We first find a local labeling for each . Then, we design a dynamic programming algorithm that combines all local labelings to obtain a global labeling .

#### 5.1.1 Finding Local Labelings

We recover the labeling of the nodes in a bag as follows: (1) Given , we consider a superset of , defined as where is the one-hop neighborhood of node ; (2) Given , we use the edge observations in the edge subset induced by to find a restricted hypothesis space . We then find a labeling that has the minimum Hamming error with respect to for the nodes in . Let denote this subset of ; (3) For , we assign to be the restriction of on .

We consider two cases for Step 2 from above: (1) If , we can enumerate all labelings for and choose the one with minimum Hamming distance from . The complexity of this brute-force algorithm is ; (2) If , we use the MaxAgree[] algorithm of Giotis & Guruswami (2006) over the noisy edge measurements to restrict the subspace in the neighborhood of . MaxAgree[] is a polynomial-time approximation scheme (PTAS) for solving the Max-Agreement version of correlation clustering for a fixed number of labels. In the worst case, MaxAgree[] obtains an approximation of 0.7666 Opt[]. In our analysis, we account for the approximation factor 0.7666 by changing the probability to . A detailed discussion is provided in the supplementary material of the paper (Heidari et al., 2019a). Given the output of MaxAgree[], let be the restricted subspace of solutions for . We pick an arbitrary labeling and use Algorithm 1 to get a permutation that transforms to a point that has minimum Hamming distance to .

Algorithm 1 greedily permutes the labels in to obtain a labeling with minimum Hamming distance to . The complexity of this algorithm is .
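Algorithm 1 itself is not shown in this text, so the following is only a sketch of what a greedy label alignment could look like. It assumes integer labels 0 to k-1 and matches label names by repeatedly taking the most frequent (current label, observed label) pair that is still unassigned; `greedy_alignment` is a hypothetical name. An exact alternative would be to treat the alignment as an assignment problem on the k-by-k co-occurrence matrix.

```python
from collections import Counter

def greedy_alignment(labeling, Z, k):
    """Permute label names in `labeling` to agree with Z as much as possible."""
    counts = Counter(zip(labeling, Z))  # (current label, observed label) -> count
    mapping, used = {}, set()
    # Most frequent pairs first; ties are broken by label values for determinism.
    for (a, b), _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if a not in mapping and b not in used:
            mapping[a] = b
            used.add(b)
    # Complete the partial map to a full permutation of the k labels.
    free = [b for b in range(k) if b not in used]
    for a in range(k):
        if a not in mapping:
            mapping[a] = free.pop()
    return [mapping[y] for y in labeling], mapping

labeling = [2, 2, 0, 0, 1]   # an arbitrary member of the edge class
Z = [0, 0, 1, 1, 2]          # noisy node observations
aligned, mapping = greedy_alignment(labeling, Z, k=3)
print(aligned, mapping)
```

Since the output is a relabeling of the input, it stays in the same edge class; only its Hamming distance to the node observations changes.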

###### Lemma 4.

We combine all steps in Algorithm 2. The output of this algorithm is a collection of labelings for the local problems. Lemma 4 states that minimizes the Hamming distance to . We also show that remains a minimizer with respect to after the swaps due to .

###### Definition 3.

Given a graph , the function changes all node labels to , and all node labels to .

The swap operation enables us to switch between elements within an edge class. We show that a does not affect the disagreements between the node labeling and edge labeling of a graph.

###### Lemma 5.

Let be a set of labels . Consider a graph for which we are given a node labeling and an edge labeling . For any pair , let be the node labeling of after swapping label with . We have that: .

This lemma implies that is a minimizer of since minimizes this quantity, and is a permutation of .
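The invariance stated in Lemma 5 is easy to check on a small example. The sketch below assumes integer labels and a +1/-1 edge encoding, and uses hypothetical helper names; it swaps every pair of labels in turn and confirms that the number of disagreements with a fixed edge labeling does not change.

```python
from itertools import combinations

def swap(labeling, a, b):
    """Relabel every node with label a as b and every node with label b as a."""
    return [b if y == a else a if y == b else y for y in labeling]

def edge_disagreements(labeling, X):
    """Number of edges whose induced measurement differs from X."""
    return sum(1 for (u, v), x in X.items()
               if (1 if labeling[u] == labeling[v] else -1) != x)

labeling = [0, 1, 1, 2, 0]
X = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (3, 4): -1}
before = edge_disagreements(labeling, X)
after = [edge_disagreements(swap(labeling, a, b), X)
         for a, b in combinations(range(3), 2)]
print(before, after)
```

The check works because a swap is a bijection on label names, so two nodes share a label after the swap exactly when they shared one before it.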

#### 5.1.2 From Local Labelings to a Global Labeling

We now describe how to combine labelings into a global labeling . For binary random variables, the following procedure plays a central role in enforcing agreement across local labelings (Foster et al., 2018): Given a bag and a neighbor with conflicting node labels with respect to , we can maximize the agreement between and by flipping labeling to its mirror labeling. This operation leads to consistent solutions since for binary random variables there is only one mirror labeling. However, for categorical random variables we have possible mirror labelings for . We show that it suffices to consider only one label swap per bag instead of labelings.

We consider the swap operation (see Section 5.1.1) and two bags and with labelings and . We resolve conflicts in as follows: Let be the set of all permutations restricted to one pairwise color swap. Given a bag with labeling , we define a swap to be valid if color is present in . Given a valid swap for , we define to be the label assignment for all nodes in after applying to . Also, let be the labeling for a node after . Finally, we define as the set of all labelings for that can be obtained if we apply any valid pairwise label swap on . To resolve inconsistencies between and , we consider pairs in such that the labeling in the intersection of and is consistent and the number of nodes whose label is swapped is minimum.
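The conflict-resolution step for two neighboring bags can be sketched as a small search over pairs of pairwise swaps. This is a simplified illustration rather than Algorithm 3: it assumes bag labelings are dictionaries from node to integer label, treats "no swap" as an allowed candidate, and considers a swap valid when at least one of its two labels appears in the bag. All names are hypothetical.

```python
from itertools import combinations, product

def swap(labels, a, b):
    """Apply the pairwise swap (a, b) to a bag labeling (dict node -> label)."""
    return {v: (b if y == a else a if y == b else y) for v, y in labels.items()}

def candidate_labelings(labels, k):
    """The unchanged labeling plus every valid pairwise swap, with its cost."""
    present = set(labels.values())
    out = [(0, labels)]  # (number of relabeled nodes, labeling)
    for a, b in combinations(range(k), 2):
        if a in present or b in present:
            new = swap(labels, a, b)
            changed = sum(new[v] != labels[v] for v in labels)
            out.append((changed, new))
    return out

def resolve(bag1, bag2, k):
    """Cheapest pair of candidates that agree on the nodes shared by both bags."""
    shared = bag1.keys() & bag2.keys()
    best = None
    for (c1, l1), (c2, l2) in product(candidate_labelings(bag1, k),
                                      candidate_labelings(bag2, k)):
        if all(l1[v] == l2[v] for v in shared):
            if best is None or c1 + c2 < best[0]:
                best = (c1 + c2, l1, l2)
    return best  # None if no pair of single swaps makes the bags consistent

bag1 = {0: 0, 1: 0, 2: 1}
bag2 = {2: 2, 3: 2, 4: 0}
print(resolve(bag1, bag2, k=3))
```

Each bag has on the order of k^2 candidates, so the pairwise search costs on the order of k^4 consistency checks per edge of the decomposition tree, which is far fewer than enumerating all label permutations.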

The procedure we use is shown in Algorithm 3. The algorithm takes as input a tree decomposition of and the local labelings . For each with labeling , we compute the cost of swapping label with label for each . Then, we iterate over edges in to identify incompatibilities between local node labelings. Finally, we use all the computed costs to find the single swap to be applied locally to each bag such that global agreement is maximized. To this end, we solve a linear program similar to program 1. This program is shown in Algorithm 4.

#### 5.1.3 Discussion on Correlation Clustering

We use correlation clustering in our algorithm for practical reasons. If the cardinality of the bags is bounded by , we can efficiently find a local labeling for each that has minimum Hamming distance to . However, obtaining such a decomposition is an NP-complete problem. This challenge is also highlighted by Foster et al. (2018). To address this issue, they assume a sampling procedure for removing edges from to obtain a subgraph for which a low-width tree decomposition is easy to find. This procedure is a graph-specific exercise and not easily generalizable to arbitrary graphs. We follow a different approach. Instead of using specialized procedures, we rely on heuristics to obtain low-width decompositions (de Givry et al., 2006; Dermaku et al., 2008) and use correlation clustering for large bags. This scheme allows us to use our algorithm with arbitrary graphs.

### 5.2 A Bound for Low Treewidth Graphs

We state our main theorem for statistical recovery over general graphs. We also provide a proof sketch.

###### Theorem 2.

We see that the Hamming error obtained by our approach goes to zero as . Theorem 2 allows us to understand when statistical recovery over a graph with categorical random variables is possible (i.e., when we can rely on edge observations to solve statistical recovery more accurately than the trivial solution of keeping the initially assigned node labels). Theorem 2 connects the level of edge-noise with the degree of the input graph, the number of labels , and the noise q on node labels. We have that for the edge noise it should be , where is the node noise parameter, for the side information in to be useful for statistical recovery. Otherwise, one should just use the initially observed node labels.

Proof Sketch Let denote a maximal connected subgraph of . Let be the boundary of , i.e., the set of edges with exactly one endpoint in . Let be the local labeling for nodes in . We say that is incorrectly labeled if for all we have . We have:

###### Lemma 6.

(Swapping lemma) Let be a maximal connected subgraph of with every node incorrectly labelled by . Then at least half the edges of are bad.

For a bag , let set be the largest connected component in such that for all nodes in it . It must be the case that at least half of the edges are incorrect or else there exists a different labeling that agrees with better than . This contradicts the fact that is a minimizer of . This result extends the Flipping Lemma of Globerson et al. (2015) from the binary to the categorical case.

We use this result to bound the probability that a local labeling (see Lemma 4) will fail to recover the ground truth node label for . The probability of local labelings having large Hamming error is upper bounded:

###### Lemma 7.

Let be the set of all label permutations on the set . We have for :

with .

We now build upon Lemma 7 and leverage the result introduced by Boucheron et al. (2003) to obtain an upper bound on the total number of mislabeled nodes across all bags in for any labeling permutation over the local labeling :

###### Lemma 8.

Let be the set of all label permutations on the set . For all , with probability at least over the draw of we have that:

where denotes the set of bags in that contain edge and denotes the set of edges in bag .

This lemma can be extended to as well. This lemma combined with Lemma 6 implies that the labeling disagreements across bags in the tree decomposition are bounded. The analysis continues in a way similar to that for trees (see Section 4). Given the local bag labelings, we seek to find the labeling swaps across bags such that the global labeling has minimum Hamming error with respect to . We use the inequality from Lemma 8 to restrict the space of all possible pairwise label swaps over the local bag labelings. Let be the optimal point in such that the global labeling has minimum Hamming error with respect to . Given the tree decomposition of , we define the hypothesis space:

with , , and and denote the pairwise swaps and labeling disagreements between bags from Algorithm 3. We show that the optimal permutation is a member of with high probability and also have that . Combining this with Lemma 2, we take that is most correlated with , i.e., it is a minimizer for . Directly from statistical learning theory we have that the Hamming error of this estimator is , which establishes our main theorem.

## 6 Experiments

Experimental Setup We evaluate our approach on trees and grid graphs. For trees, we use Erdős–Rényi random trees to obtain ground truth instances. For grids, we use real images to obtain the ground truth. We create noisy observations via a uniform noise model. We compare our approach with two approximate inference baselines: (1) a Majority Vote algorithm, where we leverage the neighborhood of a node to predict its label, and (2) (Loopy) Belief Propagation. To evaluate performance we use the normalized Hamming distance . We provide more details in the Supplementary Material.

Hamming Error of Random Trees Our analysis suggests that Linear Program 1 yields a solution with Hamming error . We verify experimentally that the Hamming error increases at a logarithmic rate with respect to . Figure 2 shows the Hamming error for a fixed tree generative model with and as we increase the number of labels . We fix away from and generate trees for each . We report the average error. As shown, we observe the logarithmic behavior predicted by our analysis. The graph size is chosen randomly .

Hamming Error of Grids We run two experiments on grids. In the first experiment, we select grayscale images and compute the Hamming error obtained by our algorithm. We consider a uniform noise model with and . Figure 3 shows the Hamming error as increases. As expected, the Hamming error increases. This is because, as increases, negative edges carry less information, and with non-zero edge noise (p), the positive edges also provide low-information observations (i.e., a wrong measurement). In the supplementary material of our paper, we present a qualitative evaluation of our results on the grayscale images.

In the second experiment, we evaluate the effect of edge noise on the quality of solution obtained by our methods for a fixed number of labels and fixed node noise . In Figure 4, we show the effect of on the average of Hamming error when other parameters are fixed (). We vary from zero to . We repeat each experiment times. We find that our approximate inference algorithm is robust to small amounts of noise.

This experiment also validates Theorem 2 which states when the side information from edges helps with statistical recovery. For the setups we consider in this experiment, we have k = 128 and vary q in 0.1, 0.15, 0.2. If we keep the initial node labels the expected normalized Hamming error will be 0.1, 0.15, and 0.2 respectively. Theorem 2 states that to obtain a better Hamming error than the above one, the edge noise has to be less than , , respectively. Figure 4 shows that the normalized Hamming error obtained by our algorithm reaches the Hamming error of the trivial algorithm (and plateaus around it) at the expected edge-noise levels of 0.04, 0.05, and 0.06.

Our approximate inference algorithm is robust to small amounts of noise. As expected, when the noise increases the Hamming error increases.

## 7 Conclusion

We considered the problem of statistical recovery in structured instances with noisy categorical observations. We presented a new approximate algorithm for inference over graphs with categorical random variables. We showed a logarithmic dependency of the Hamming error on the number of categories the random variables can take. We also explored the connections between approximate inference and correlation clustering with a fixed number of clusters. There are several future directions suggested by this work. One interesting direction would be to understand under which noise models the problem of statistical recovery is solvable. Moreover, it would be interesting to explore the direction of correlation clustering further and extend our analysis beyond graphs with small treewidth.

##### Acknowledgements

The authors thank Shai Ben David, Fereshte Heidari Khazaei, and Joshua McGrath for the helpful discussions. This work was supported by Amazon under an ARA Award, by NSERC under a Discovery Grant, and by NSF under grant IIS-1755676.

## 8 Analysis for Trees

### 8.1 Proof of Lemma 1

###### Proof.

In , for each edge , we have a random variable with distribution:

To apply the Bernstein inequality, we must consider . We have and . We must also have that the random variables are constrained. We know that and so . Now, we apply the Bernstein inequality:

Let . Solving for we obtain:

Now we have that:

We choose , and substituting for trees and , we have that with probability :