    # On Solving Linear Systems in Sublinear Time

We study sublinear algorithms that solve linear systems locally. In the classical version of this problem the input is a matrix S∈R^n× n and a vector b∈R^n in the range of S, and the goal is to output x∈R^n satisfying Sx=b. For the case when the matrix S is symmetric diagonally dominant (SDD), the breakthrough algorithm of Spielman and Teng [STOC 2004] approximately solves this problem in near-linear time (in the input size which is the number of non-zeros in S), and subsequent papers have further simplified, improved, and generalized the algorithms for this setting. Here we focus on computing one (or a few) coordinates of x, which potentially allows for sublinear algorithms. Formally, given an index u∈ [n] together with S and b as above, the goal is to output an approximation x̂_u for x^*_u, where x^* is a fixed solution to Sx=b. Our results show that there is a qualitative gap between SDD matrices and the more general class of positive semidefinite (PSD) matrices. For SDD matrices, we develop an algorithm that approximates a single coordinate x_u in time that is polylogarithmic in n, provided that S is sparse and has a small condition number (e.g., Laplacian of an expander graph). The approximation guarantee is additive | x̂_u-x^*_u | <ϵ x^* _∞ for accuracy parameter ϵ>0. We further prove that the condition-number assumption is necessary and tight. In contrast to the SDD matrices, we prove that for certain PSD matrices S, the running time must be at least polynomial in n. This holds even when one wants to obtain the same additive approximation, and S has bounded sparsity and condition number.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Solving linear systems is a fundamental problem in many areas. A basic version of the problem has as input a matrix and a vector , and the goal is to find such that . The fastest known algorithm for general is by a reduction to matrix multiplication, and takes time, where [Gal14] is the matrix multiplication exponent. When is sparse, one can do better (by applying the conjugate gradient method to the equivalent positive semidefinite (PSD) system , see for example [Spi10]), namely, time where is the number of non-zeros in . Note that this bound for exact solvers assumes exact arithmetic, and in practice, one seeks fast approximate solvers.

One interesting subclass of PSD matrices is that of symmetric diagonally dominant (SDD) matrices.111A symmetric matrix is called SDD if for all . Many applications require solving linear systems in SDD matrices, and most notably their subclass of graph-Laplacian matrices, see e.g. [Spi10, Vis13, CKM14]. Solving SDD linear systems received a lot of attention in the past decade after the breakthrough result by Spielman and Teng in 2004 [ST04], showing that a linear system in SDD matrix can be solved approximately in near-linear time , where and is an accuracy parameter. A series of improvements led to the state-of-the-art SDD solver of Cohen et al. [CKM14] that runs in near-linear time . Recent improvements extend to connection Laplacians [KLP16]. Obtaining similar results for all PSD matrices remains a major open question.

Motivated by fast linear-system solvers in alternative models, here we study which linear systems can be solved in sublinear time. We can hope for such sublinear times if only one (or a few) coordinates of the solution are sought. Formally, given a matrix , a vector , and an index , we want to approximate the coordinate of a solution to the linear system (assume for now the solution is unique), and we want the running time to be sublinear in .

Our main contribution is a qualitative separation between the class of SDD matrices and the larger class of PSD matrices, as follows. For well-conditioned SDD matrices , we develop a (randomized) algorithm that approximates a single coordinate fast — in time. In contrast, for some well-conditioned PSD (but not SDD) matrices , we show that the same task requires time. In addition, we justify the dependence on the condition number.

Our study is partly motivated by the advent of quantum algorithms that solve linear systems in sublinear time, which were introduced in [HHL09], and subsequently improved in [Amb12, CKS17]

, and meanwhile used for a number of (quantum) machine learning algorithms (see, e.g., the survey

[DHM18]). In particular, [HHL09] consider the system given: (1) oracle access to entries of (including fast access to the -th non-zero entry in the -th row), and (2) a fast black-box procedure to prepare a quantum state . Then, if the matrix has condition number , at most non-zeros per row/column, and , their quantum algorithm runs in time , and outputs a quantum state within -distance from . The runtime was later improved in [CKS17] to depend logarithmically on . (The original goal of [HHL09] was different — to output a “classical” value, a linear combination of — and for this goal the improved dependence on is not possible unless .) These quantum sublinear-time algorithms raise the question whether there are analogues classical algorithms for the same problems; for example, a very recent success story is a classical algorithm [Tan18] for a certain variant of recommendation systems, inspired by an earlier quantum algorithm [KP17]. Our lower bound precludes a classical analogue to the aforementioned linear-system solver, which works for all matrices and in particular for PSD ones.

##### Problem Formulation.

To formalize the problem, we need to address a common issue for linear systems — they may be underdetermined and thus have many solutions , which is a nuissance when solving for a single coordinate. We require that the algorithm approximates a single solution , in the sense that invoking the algorithm with different indices will output coordinates that are all consistent with one “global” solution. This formulation follows the concept of Local Computation Algorithms, see Section 1.3.

Formally, given a matrix , a vector in the range (column space) of , and an accuracy parameter , there exists satisfying , such that upon query the (randomized) algorithm outputs that satisfies

 ∀u∈[n],Pr[|^xu−x∗u|≤ϵ||x∗||∞]≥34. (1)

This guarantee corresponds (modulo amplification of the success probability) to finding a solution

with . We remark that the guarantee in [ST04] is different, that where .

##### Basic Notation.

Given a (possibly edge-weighted) undirected graph , we assume for convenience . Its Laplacian is the matrix , where is the (weighted) adjacency matrix of , and is the diagonal matrix of (weighted) degrees in . It is well-known that all Laplacians are SDD matrices, which in turn are always PSD.

The sparsity of a matrix is the maximum number of non-zero entries in a single row/column. The condition number of a PSD matrix , denoted

, is the ratio between its largest and smallest non-zero eigenvalues.

222Our definition is in line with the standard one, for a general matrix

, which uses singular values instead of eigenvalues. If

is singular, one could alternatively define , which would only make the problem simpler (it becomes easier to be linear in ), see e.g. [Spi10].
For example, for the Laplacian of a -regular graph , let denote its eigenvalues, then the condition number is . This follows from two well-known facts, that , and that if is connected ( is called the spectral gap). Throughout, denotes the spectral norm of a matrix , and denotes the Moore-Penrose pseudo-inverse of .333For a PSD matrix , let its eigen-decomposition be , then the Moore-Penrose pseudo-inverse of is .

### 1.1 Our Results

Below we describe our results, which include both algorithms and lower bounds. First, we present a polylogarithmic-time algorithm for the simpler case of Laplacian matrices, and then we generalize it to all SDD matrices. We further prove two lower bounds, which show that our algorithms cannot be substantially improved to handle more general inputs or to run faster. The first lower bound shows that general PSD matrices require polynomial time, thereby showing a strong separation from the SDD case. The second one shows that our SDD algorithm’s dependence on the condition number is necessary and in fact near-tight.

##### Algorithm for Laplacian matrices.

We first present our simpler algorithm for linear systems in Laplacians with a bounded condition number.

###### Theorem 1.1 (Laplacian Solver, see Section 2).

There exists a randomized algorithm, that given input , where

• is a connected -regular graph given as an adjacency list,

• is in the range of (equivalently, orthogonal to the all-ones vector),

• , , and

• is an upper bound on the condition number ,

the algorithm outputs with the following guarantee. If satisfies then

 ∀u∈[n],Pr[|^xu−x∗u|≤ϵ⋅∥x∗∥∞]≥1−1s,

and the algorithm runs in time , for suitable .

A few extensions of the theorem follow easily from our proof. First, if the algorithm is given also an upper bound on , then the expression for can be refined by replacing with . Second, we can improve the runtime to whenever the representation of allows to sample a uniformly random neighbor of a vertex in constant time. Third, the algorithm has an (essentially) cubic dependence on the condition number , which can be improved to quadratic if we allow a preprocessing of (or, equivalently if we only count the number of probes into ). Later we show that this quadratic dependence is near-optimal.

##### Algorithm for SDD matrices.

We further design an algorithm for SDD matrices with bounded condition number. The formal statement appears in Theorem 5.1 and is a natural generalization of Theorem 1.1 with two differences. One difference is that a natural solution to the system is , but our method requires to have normalized diagonal entries, and thus we aim at another solution , construed as follows. Define

 Ddef=diag(S11,...,Snn) and ~Sdef=D−1/2SD−1/2, (2)

then our linear system can be written as , which has a solution

 x∗def=D−1/2~S+D−1/2b, (3)

that is expressed using the pseudo-inverse of , rather than of .

A second difference is that Theorem 5.1 makes no assumptions about the multiplicity of the eigenvalue of , e.g., if is a graph Laplacian, then the graph need not be connected. The assumptions needed to achieve a polylogarithmic time, beyond having a bounded condition number, are only that a random “neighbor” in the graph corresponding to can be sampled quickly, and that , which holds if has polynomially-bounded entries.

##### Lower Bound for PSD matrices.

Our first lower bound shows that the above guarantees cannot be obtained for a general PSD matrix, even if we are allowed to preprocess the matrix , and only count probes into . The proof employs a PSD matrix that is invertible (i.e., positive definite), in which case the linear system has a unique solution .

###### Theorem 1.2 (Lower Bound for PSD Systems, see Section 3).

For every large enough , there exists an invertible PSD matrix with uniformly bounded sparsity and condition number , and a distinguished index , with the following property. Every randomized algorithm that, given as input , outputs satisfying

 Pr[|^xu−x∗u|≤15∥x∗∥∞]≥67,

where , must probe coordinates of (in the worst case).

##### Dependence on Condition Number.

The second lower bound shows that our SDD algorithm has a near-optimal dependence on the condition number of , even if we are allowed to preprocess the matrix S, and only count probes into . The lower bound holds even for Laplacian matrices.

###### Theorem 1.3 (Lower Bound for Laplacian Systems, see Section 4).

For every large enough and , there exist an unweighted graph with maximum degree and whose Laplacian has condition number , and an edge in , which satisfy the following. Every randomized algorithm that, given input in the range of , succeeds with probability to approximate within additive error for , where is any solution to , must probe coordinates of (in the worst case).

##### Applications.

An example application of our algorithmic results is computing the effective resistance between a pair of vertices in a graph (given , and as input). It is well known that the effective resistance, denoted , can be expressed as , where solves . The spectral-sparsification algorithm of Spielman and Srivastava [SS11] relies on a near-linear time algorithm (that they devise) for approximating the effective resistances of all edges in . For unweighted graphs, there is also a faster algorithm [Lee14] that runs in time , which is sublinear in the number of edges, and approximates effective resistances within a larger factor . In a -regular expander , the effective resistance between every two vertices is , and our algorithm in Theorem 1.1 can quickly compute an arbitrarily good approximation (factor ). Indeed, observe that we can use , hence the runtime is , independently of . The additive accuracy is ; in fact, each is the potential difference between and when a potential difference of is imposed between and , thus , and hence with high probability the output actually achieves a multiplicative guarantee .

### 1.2 Technical Outline

##### Algorithms.

Our basic technique relies on a classic idea of von Neumann and Ulam [FL50, Was52]

for estimating a matrix inverse by a power series; see Section

1.3 for a discussion of related work. Our starting point is the identity

 ∀X∈Rn×n,∥X∥<1,(I−X)−1=∞∑t=0Xt.

(Recall that denotes the spectral norm of a matrix .) Now given a Laplacian of a -regular graph , observe that , where is the adjacency matrix of

. Assume for a moment that

; then by the above identity, , and the solution of the linear system would be . The point is that the summands decay exponentially because . Therefore, we can estimate using the first terms, i.e., , where is logarithmic (with base ). In order to compute each term , observe that is exactly the probability that a random walk of length starting at will end at vertex . Thus, if we perform a random walk of length starting at , and let be its (random) end vertex, then

 Ez[bz]=∑w∈VeTu(1dA)tewbw=eTu(1dA)tb.

If we perform several random walks (specifically, walk suffice), average the resulting ’s, and then multiply by , then with high probability, we will obtain a good approximation to .

As a matter of fact, we have a non-strict inequality , because of the all-ones vector . Nevertheless, we can still get a meaningful result if all eigenvalues of except for the largest one are smaller than (equivalently, the graph is connected). First, we get rid of any negative eigenvalues by the standard trick of considering instead of , which is equivalent to adding self-loops at every vertex. We may assume is orthogonal to (otherwise the linear system has no solution), hence the linear system has infinitely many solutions, and since is PSD, we can aim to estimate the specific solution by . Indeed, the idealized analysis above still applies by restricting all our calculations to the subspace orthogonal to . This is carried out in Theorem 1.1.

To generalize the above approach to SDD matrices, we face three issues. First, due to the irregularity of general SDD matrices, it is harder to properly define the equivalent random walk matrix. We resolve this by normalizing the SDD matrix into defined in (2), and solving the equivalent (normalized) system . Second, general SDD matrices can have positive off-diagonal elements, in constrast to the Laplacians. To address this, we interpret such entries as negative-weight edges, and employ random walks that “remember” the signs of the traversed edges. Third, diagonal elements may strictly dominate their row, which we address by terminating the random walk early with some positive probability.

##### Lower Bound: Polynomial Time for PSD Matrices.

We first discuss our lower bound for PSD matrices, which is one of the main contributions of our work. The goal is to exhibit a matrix for which estimating a coordinate of the solution requires probes into the input .

Without the sparsity constraint on , one can deduce such a lower bound via a reduction to the communication complexity of the Vector in Subspace Problem (VSP), in which Alice has an -dimensional subspace , Bob has a vector , and their goal is to determine whether or . The randomized communication complexity of this promise problem is between [KR11] and [Raz99] (while for quantum communication it is ). To reduce this problem to linear-system solvers, let be the projection operator onto the subspace , and set . Consider the system , and note that Alice knows and Bob has . It is easy to see that the unique solution is either or , depending on whether or . Alice and Bob could use a solver that makes few probes to , as follows. Bob would pick an index that maximizes (and thus also ), and send it to Alice. She would then apply the solver, receiving from Bob only a few entries of , to estimate within additive error , which suffices to distinguish the two cases. This matrix is PSD with condition number . However it is dense.

We thus revert to a different approach of proving it from basic principles. Our high-level idea is to take a -regular expander and assign its edges with signs () that are random but balanced everywhere (namely, at every vertex, the incident edges are split evenly between positive and negative). The signed adjacency matrix should have spectral norm , and then instead of the (signed) Laplacian , we consider , which is PSD with condition number , as well as invertible and sparse. Now following similar arguments as in our algorithm, we can write as a power series of the matrix , and express coordinate of the solution via where is the (random) end vertex of a random walk that starts at and its length is bounded by some (performed in the “signed” graph corresponding to ). Now if the graph around looks like a tree (e.g., it has high girth), then not-too-long walks are highly symmetric and easy to count. We now let be non-zero only at vertices at distance exactly from , and for these vertices is set to or at random but with a small bias towards one of the values. Some calculations show that , and consequently , will be according to our bias (with high probability), however discovering this via probes to is essentially the problem of learning a biased coin, which requires coin observations. An additional technical obstacle is to prove that the solution has a small -norm, so that we can argue that an -additive error to will not change its sign. Overall, we show we can set and , thus concluding that the algorithm must observe entries of .

It is instructive to ask where in the above argument is it crucial to have , because if it were also valid for , in which case the matrix is SDD, then it would contradict our own algorithm for SDD matrices. The answer is that is required to bound , specifically in the analysis that follows immediately after Eqn. (10).

##### Lower Bound: Quadratic Dependence on Condition Number.

We now outline the ideas to prove the lower bound even for Laplacian systems with condition number . First we note that it is relatively straight-forward to prove that a linear dependence on the condition number is necessary. Namely, consider a dumbbell graph (two 3-regular expanders connected by a bridge edge ), for which we need to estimate . For defined as , the value of will be non-zero iff vertices are on the opposite sides of the bridge. To determine the latter, one requires queries into . Since this graph has a condition number of , we obtain an lower bound.

The quadratic lower bound requires both a different graph and a different distribution over . We use the following graph with condition number : take two 3-regular expanders and connect them with edges (“bridge edges”). The vector will be dense and in particular it is either: 1) balanced, i.e., on each expander is zero, or 2) unbalanced, i.e., each is chosen with a bias towards on the first expander, and towards on the second one. Now, as above, it is simple to prove that on average over the bridge edges : 1) in the balanced case, the average of must be zero, and 2) in the unbalanced case, the average must be . However, the main challenge is that the actual values may differ from the average — e.g., even in the balanced case, each bridge edge will likely have non-zero value of . Nonetheless, we manage to prove an upper bound on the maximum value of over all edges (as in the previous lower bound, we need to bound as well). For the latter, we need to again analyze where is the endpoint of a random walk of some fixed length starting from in the graph . Since the vector is not symmetric over the graph , a direct analysis seems hard — instead we estimate via a coupling of such walks in with random walks in an expander, which is amenable to a direct analysis.

### 1.3 Related Work

The idea of approximating the inverse (for ) by random walks dates back to von Neumann and Ulam [FL50, Was52]. While we approximate each power by separate random walks of length and truncate the tail (powers above some ), their method employs random walks whose length is random and whose expectation gives exactly the infinite sum, achieved by assigning some probability to terminate the walk at each step, and weighting the contributions of the walks accordingly (to correct the expectation).

The idea of approximating a generalized inverse of by the truncated series on directions that are orthogonal to the all-ones vector was recently used by Doron, Le Gall, and Ta-Shma [DGT17] to show that can be approximated in probabilistic log-space. However, since they wanted to output explicitly, they could not ignore the all-ones direction and they needed to relate to by “peeling off” the all-ones direction, inverting using the infinite sum formula, and then adding back the all-ones direction.

The idea of estimating powers of a normalized adjacency matrix

(or more generally, a stochastic matrix) by performing random walks is well known, and was used also in

[DGT17] mentioned above, and in [DSTS17]. Chung and Simpson [CS15] used it in a context that is related to ours, of solving a Laplacian system but with a boundary condition, namely, a constraint that for all in the support of . Their algorithm solves for a subset of the coordinates , i.e., it approximates (the restriction of to coordinates in ) where solves under the boundary condition. They relate the solution to the Dirichlet heat-kernel PageRank vector, which in turn is related to an infinite power series of a transition matrix (specifically, to where is the transition matrix of the graph induced by , , and ), and their algorithm uses random walks to approximate the not-too-large powers of the transition matrix, proving that the remainder of the infinite sum is small enough.

Recently, Shyamkumar, Banerjee and Lofgren [SBL16] considered a related matrix-power problem, where the input is a matrix , a power , a vector , and an index , and the goal is to compute coordinate of . They devised for this problem a sublinear (in ) algorithm, under some bounded-norm conditions and assuming is uniformly random. Their algorithm relies, in part, on von Neumann and Ulam’s technique of computing matrix powers using random walks, but of prescribed length. It can be shown that approximately solving positive definite systems for a particular coordinate is reducible to the matrix-power problem.444Let be a linear system where is positive definite. Let be the largest eigenvalue of . Let and . Consider the equivalent system . As the eigenvalues of are in , the eigenvalues of are in . Thus, the solution to the linear system is given by . Therefore, we can approximate by truncating the infinite sum at some and approximating each power by the algorithm for the matrix-power problem. However, in contrast to our results, their expected runtime is polynomial in the input size, namely , and holds only for a random .

##### Comparison with PageRank.

An example application of our results is computing quickly the PageRank (defined in [BP98]) of a single node in an undirected -regular graph. Recall that the PageRank vector of an -vertex graph with associated transition matrix is the solution to the linear system , where is a given parameter. In personalized PageRank, one replaces

(the uniform distribution) with some

, e.g., a standard basis vector. Equivalently, solves the system where is an SDD matrix with ’s on the diagonal. As all eigenvalues of are of magnitude at most 1 (recall is a transition matrix), all eigenvalues of are of magnitude at most , and the running time guaranteed by Theorem 5.1 is logarithmic (with base ).

Algorithms for the PageRank model were studied extensively. In particular, the sublinear algorithms of [BPP18] approximate the PageRank of a vertex using queries to an arbitrary graph, or using queries when the maximum degree is . Another example is the heavy-hitters algorithm of [BBCT14], which reports all vertices whose approximate PageRank exceeds a threshold in sublinear time , when PageRanks are viewed as probabilities and sum to . Other work explores connections to other graph problems, including for instances using PageRank algorithms to approximate effective resistances [CZ10], the PageRank vector itself, and computing sparse cuts [ACL07].

##### Local Algorithms.

Our algorithms in Theorems 1.1 and 5.1 are local in the sense that they query a small portion of their input, usually around the input vertex, when viewed as graph algorithms. Local algorithms for graph problems were studied in several contexts, like graph partitioning [ST04, AP09], Web analysis [CGS04, ABC08], and distributed computing [Suo13]. Rubinfeld, Tamir, Vardi, and Xie [RTVX11] introduced a formal concept of Local Computation Algorithms that requires consistency between the local outputs of multiple executions (namely, these local outputs must all agree with a single global solution). As explained earlier, our problem formulation (1) follows this consistency requirement.

### 1.4 Future Work

One may study alternative ways of defining the problem of solving a linear system in sublinear time, in particular if the algorithm can access in a different way. For example, similarly to assumptions and guarantees in [Tan18], the goal may be to produce an -sample from the solution (i.e., report a random index in such that the probability of each coordinate is proportional to ) assuming oracle access to an -sampler from , i.e., use an -sampler for to construct an -sampler for . Another version of the problem may ask to produce heavy hitters in , assuming, say,555This kind of oracle seems necessary even when . heavy hitters in (which may be useful for the PageRank application). We leave these extensions as interesting open questions, focusing here on the classical access mode to , via queries to its coordinates.

## 2 Laplacian Solver (for Regular Graphs)

In this section we prove Theorem 1.1. The ensuing description deals mostly with a slightly simplified scenario, where the algorithm is given not one but two vertices , and returns an approximation to with a slightly different error bound, see Theorem 2.5 for the precise statement. We will then explain the modifications required to prove Theorem 1.1 (which actually follows also from our more general Theorem 5.1).

Let be a connected -regular graph with adjacency matrix . Let the eigenvalues of be

, and let their associated orthonormal eigenvectors be

. Then , and we can write where is unitary and . For , let where is the -th standard basis vector. Then the Laplacian of is given by

 Ldef=∑uv∈Eχu,vχTu,v=dI−A=U(dI−Λ)UT.

Observe that does not depend on the orientation of each edge , and that is the smallest non-zero eigenvalue of . The Moore-Penrose pseudo-inverse of is

 L+def=U⋅diag(0,(d−λ2)−1,…,(d−λn)−1)⋅UT.

We assume henceforth that all eigenvalues of are non-negative. At the end of the proof, we will remove this assumption (by adding self-loops).

The idea behind the next fact is that , and has norm strictly smaller than one when operating on the subspace that is orthogonal to the all-ones vector, and hence, the formula for is applicable for the span of .

###### Fact 2.1.

For every that is orthogonal to the all-ones vector, .

###### Proof.

It suffices to prove the claim for each of as the fact will then follow by linearity. Fix . Then since ,

We now describe an algorithm that on input that is orthogonal to the all-ones vector, and two vertices , returns an approximation to , where solves . As is connected, the null space of is equal to span and hence is uniquely defined, and can be written as .

###### Claim 2.2.

For that is orthogonal to the all-ones vector, .

###### Proof.

Using Fact 2.1,

 χTu,vL+b−χTu,v1ds−1∑t=0(1dA)tb=χTu,v1d∞∑t=s(1dA)tb,

and thus

 |χTu,vL+b−χTu,v1ds−1∑t=0(1dA)tb|≤||χTu,v||2⋅||1d∞∑t=s(1dA)tb||2.

We know that , so it remains to bound . Decomposing we get that and

 ∞∑t=s(1dA)tb=n∑i=2ciui∞∑t=s(λid)t=n∑i=2(λid)s1−λidciui=dn∑i=2(λid)sd−λiciui.

Hence,

 ||1d∞∑t=s(1dA)tb||22=n∑i=2⎛⎜⎝(λid)sd−λi⎞⎟⎠2c2i||ui||22≤⎛⎜⎝(λ2d)sd−λ2⎞⎟⎠2n∑i=2c2i=⎛⎝(1−μ2d)sμ2⎞⎠2||b||22,

where the first equality is because the ’s are orthogonal. Altogether,

 |χTu,vL+b−χTu,v1ds−1∑t=0(1dA)tb|≤√2(1−μ2d)sμ2||b||2≤√2(1−μ2d)sμ2√||b||0⋅||b||∞=ϵ2d||b||∞,

as claimed. ∎

.

###### Proof.

Observe that is a probability vector over , and is exactly the probability that a random walk of length starting at will end at . Thus, for every and , we have

 E[bu(t)i]=∑w∈[n]eTu(1dA)tewbw=eTu(1dA)tb,

and similarly . By a union bound over Hoeffding bounds, with probability at least , for every , we have and . Recalling that , with probability at least we have , as claimed. ∎

Combining Claim 2.2 and Claim 2.3 we get that (with probability )