 # The Power of Many Samples in Query Complexity

The randomized query complexity R(f) of a boolean function f:{0,1}^n→{0,1} is famously characterized (via Yao's minimax) by the least number of queries needed to distinguish a distribution D_0 over 0-inputs from a distribution D_1 over 1-inputs, maximized over all pairs (D_0,D_1). We ask: Does this task become easier if we allow query access to infinitely many samples from either D_0 or D_1? We show the answer is no: There exists a hard pair (D_0,D_1) such that distinguishing D_0^∞ from D_1^∞ requires Θ(R(f)) many queries. As an application, we show that for any composed function f∘g we have R(f∘g) ≥ Ω(fbs(f)R(g)) where fbs denotes fractional block sensitivity.


## 1 Introduction

Randomized query complexity (see [Buhrman2002] for a classic survey) is often studied using Yao’s minimax principle [Yao1977]. The principle states that for every boolean function $f\colon\{0,1\}^n\to\{0,1\}$,

Yao’s minimax: $R_\epsilon(f) = \max_D D_\epsilon(f,D)$.

• Here $R_\epsilon(f)$ is the randomized $\epsilon$-error query complexity of $f$. More precisely, $R_\epsilon(f)$ equals the least number of queries a randomized algorithm (decision tree) must make to the input bits $x_i$ of an unknown input $x\in\{0,1\}^n$ in order to output $f(x)$ with probability at least $1-\epsilon$ (where the probability is over the coin tosses of the algorithm). We often set $\epsilon = 1/3$ and omit $\epsilon$ from notation, as it is well known that this choice only affects constant factors in query complexity.

• $D$ is a distribution over the inputs $\{0,1\}^n$. We may assume wlog that $D$ is balanced: $D = \frac12 D_0 + \frac12 D_1$ where $D_b$ is a distribution over $f^{-1}(b)$.

• $D_\epsilon(f,D)$ is the distributional $\epsilon$-error query complexity of $f$ relative to $D$. More precisely, $D_\epsilon(f,D)$ equals the least number of queries a deterministic algorithm must make to an input $x\sim D$ in order to output $f(x)$ with probability at least $1-\epsilon$ (where the probability is over $x\sim D$).

### 1.1 Correlated samples problem

One way to think about the distributional complexity of $f$ relative to $D = \frac12 D_0 + \frac12 D_1$ is as the following task: A deterministic algorithm is given query access to a sample from either $D_0$ or $D_1$ and it needs to decide which is the case. In this work, we ask: Does this task become easier if we allow query access to an unlimited number of independent samples from either $D_0$ or $D_1$? In short,

Is it easier to distinguish $D_0^\infty$ from $D_1^\infty$ than it is to distinguish $D_0$ from $D_1$?

More formally, we define the correlated samples problem for $f$ relative to $D = \frac12 D_0 + \frac12 D_1$ by

$$\mathrm{Corr}_\epsilon(f,D) := \min_{k\ge 1} D_\epsilon\big(f^k,\ \tfrac12 D_0^k + \tfrac12 D_1^k\big).$$

Here $f^k$ is the function that evaluates $k$ copies of $f$ on disjoint inputs. We also use the notation $D_b^k := D_b\times\cdots\times D_b$ ($k$ times) for the $k$-fold product distribution. In particular, under $\frac12 D_0^k + \frac12 D_1^k$, the function $f^k$ outputs either $0^k$ or $1^k$; the correlated samples problem is to decide which is the case. We note that the expression to be minimized on the right side is a non-increasing function of $k$ (access to more samples is only going to help). We may also assume wlog that $k$ is no larger than the number of queries made (when an algorithm queries a sample for the first time, we may assume it is the first unqueried sample so far).

#### Shaltiel examples.

It is not hard to give examples of input distributions where access to multiple correlated samples does help. Such examples were already discussed by Shaltiel [Shaltiel2004] in the context of direct product theorems. For instance, consider the $n$-bit $\mathrm{Xor}_n$ function. It is well known that $R_\epsilon(\mathrm{Xor}_n) = n$ for all $\epsilon < 1/2$. Define a balanced input distribution (here $U$ is a uniform random bit in $\{0,1\}$)

$$D := \begin{cases} 0\,U^{n-1} & \text{with probability } 99\%,\\ 1\,U\,0^{n-2} & \text{with probability } 1\%. \end{cases}$$

This distribution is hard 99% of the time: if the first bit is 0, an algorithm has to compute $\mathrm{Xor}_{n-1}$ relative to $U^{n-1}$, which requires $n-1$ queries. For the remaining 1%, the distribution is easy: if the first bit is 1, the output can be deduced from the second bit. Here multiple correlated samples help a lot (for $\epsilon = 1/3$):

$$D(\mathrm{Xor}_n, D) = \Omega(n), \qquad \mathrm{Corr}(\mathrm{Xor}_n, D) = O(1).$$

Indeed, given a single sample from $D$, an algorithm is likely to have to solve the hard case of the distribution. By contrast, given multiple correlated samples, we can query the first bit for a large constant number of samples. This will give us a high chance to encounter at least one easy sample.
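The search for an easy sample can be sketched in a few lines of Python. This is only an illustration, assuming the reading of $D$ as "first bit 0 followed by uniform bits, or the pattern 1, U, 0...0"; the helper names `sample_shaltiel` and `find_easy_sample` are hypothetical and do not come from the paper.

```python
import random

def sample_shaltiel(n, b, rng):
    """Draw one sample from D conditioned on Xor_n(x) == b (assumes n >= 3)."""
    if rng.random() < 0.99:
        # Hard part: first bit 0, remaining n-1 bits uniform subject to parity b.
        tail = [rng.randrange(2) for _ in range(n - 2)]
        last = (b - sum(tail)) % 2
        return [0] + tail + [last]
    # Easy part: 1, U, 0^{n-2}; the xor equals 1 ^ U, so U = 1 ^ b.
    return [1, 1 ^ b] + [0] * (n - 2)

def find_easy_sample(samples):
    """Query the first bit of each sample; on an easy one, read off the answer.

    Returns (guess, number_of_queries, found_easy_sample)."""
    queries = 0
    for x in samples:
        queries += 1                  # query the first bit
        if x[0] == 1:
            queries += 1              # query the second bit
            return 1 ^ x[1], queries, True
    return 0, queries, False          # no easy sample seen: arbitrary guess
```

With $m$ correlated samples the search fails only when every sample is hard, which happens with probability $0.99^m$; a few hundred samples, and at most $m+1$ queries, already bring this to roughly 5%.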

#### Error reduction.

An important fact (which fails in the single-sample setting!) is that we can amplify the success probability of any algorithm for correlated samples. This is achieved by a variant of the usual trick: repeatedly run the algorithm on fresh samples to gain more confidence about the output. (In more detail: An algorithm with error $1/2-\delta$ has $p_1 - p_0 \ge 2\delta$ where $p_b$ is the probability that it accepts an input drawn from $D_b^k$, for $b\in\{0,1\}$. Reducing error below $\epsilon$ boils down to distinguishing two random coins with heads-probabilities $p_0$ and $p_1$. Given multiple samples from one of the coins, Chernoff bounds state that $O(\log(1/\epsilon)/\delta^2)$ samples are enough to tell which coin the samples came from.)

###### Fact 1.

for every . ∎
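The coin-distinguishing step behind error reduction can be made concrete. The sketch below is an assumption-laden illustration rather than the paper's procedure: it uses a plain midpoint threshold test on $t$ independent tosses, and the names `binom_tail` and `coin_test_error` are hypothetical. It computes the exact worst-case error of that test, which can be compared against the Hoeffding bound $\exp(-2t\delta^2)$ with $\delta = (p_1 - p_0)/2$.

```python
from math import comb, exp, floor

def binom_tail(t, p, m):
    """Pr[Bin(t, p) >= m], computed term by term."""
    return sum(comb(t, i) * p**i * (1 - p)**(t - i) for i in range(m, t + 1))

def coin_test_error(p0, p1, t):
    """Worst-case error of the midpoint threshold test on t tosses (p0 < p1).

    The test declares 'coin 1' iff the number of heads exceeds t*(p0+p1)/2."""
    m = floor(t * (p0 + p1) / 2) + 1       # smallest head count above the midpoint
    false_accept = binom_tail(t, p0, m)    # coin 0 looks like coin 1
    false_reject = 1 - binom_tail(t, p1, m)  # coin 1 looks like coin 0
    return max(false_accept, false_reject)
```

For instance, `coin_test_error(0.4, 0.6, t)` decreases as `t` grows, illustrating why a number of repetitions that depends only on the gap and the target error suffices.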

The aforementioned Shaltiel example can alternatively be computed as follows: By querying the first two bits of a single sample one can predict $\mathrm{Xor}_n$ to within error $49.5\%$. Now apply Fact 1 to reduce the error below $1/3$ at the cost of a constant-factor blowup in query cost.

### 1.2 Main result

We study whether Shaltiel examples can be avoided if we restrict our attention to the hardest possible input distribution. Namely, we define a distribution-free complexity measure by

$$\mathrm{Corr}_\epsilon(f) := \max_D \mathrm{Corr}_\epsilon(f,D).$$

Our main result is that multiple correlated samples do not help for the hardest distribution.

###### Theorem 1.

$\mathrm{Corr}(f) = \Theta(R(f))$ for any (partial) boolean function $f$.

The main challenge in proving Theorem 1 is precisely the existence of Shaltiel examples: How to construct hard distributions that do not contain any hidden easy parts? We resolve it by building decision trees that can exploit the easy parts not only in their own input distribution, but in various other distributions as well.

### 1.3 Application 1: Selection problem

Next we describe a consequence of our main result to a natural query task that we dub the selection problem. A similar problem, called choose, was studied by [Beimel2014] in communication complexity.

Fix an $n$-bit function $f$ together with an input distribution $D$. In the $k$-selection problem for $(f,D)$ the input is a random $kn$-bit string $x = (x^1,\dots,x^k)\sim D^k$, and the goal is to output $f(x^i)$ for some $i\in[k]$. That is, the algorithm gets access to $k$ independent samples from $D$ and it selects one of them to solve. We define

$$\begin{aligned} k\text{-}\mathrm{Sel}_\epsilon(f,D) &:= \text{$\epsilon$-error query complexity of $k$-selection for $(f,D)$},\\ \mathrm{Sel}_\epsilon(f,D) &:= \min_{k\ge 1}\, k\text{-}\mathrm{Sel}_\epsilon(f,D),\\ \mathrm{Sel}_\epsilon(f) &:= \max_D \mathrm{Sel}_\epsilon(f,D). \end{aligned}$$

The selection problem is interesting because it, too, is subject to Shaltiel examples: for $(\mathrm{Xor}_n, D)$ as described in Section 1.1, we have $\mathrm{Sel}(\mathrm{Xor}_n, D) = O(1)$ using the same idea of searching for an easy sample.

The following relates selection to correlated samples; see Section 5 for the proof.

###### Theorem 2.

The correlated samples problem is easier than selection:

1. $\mathrm{Corr}_\epsilon(f,D) \le \mathrm{Sel}_\epsilon(f,D)$ for every $(f,D)$.

2. There exists an -bit such that but .

3. Selection does not admit efficient error reduction (as in Fact 1).

Combining the first item of Theorem 2 with our main result (Theorem 1) we conclude that multiple samples do not help in the selection problem for the hardest distribution.

###### Corollary 1.

$\mathrm{Sel}(f) = \Theta(R(f))$ for any (partial) boolean function $f$. ∎

### 1.4 Application 2: Randomized composition

We give another application of our main result to the randomized composition conjecture studied in [Ben-David2016, Anshu2018, Gavinsky2019, Ben-David2020]. In fact, this application is what originally motivated our research project!

For an $n$-bit function $f$ and an $m$-bit function $g$ we define their composition

$$f\circ g\colon (\{0,1\}^m)^n \to \{0,1\} \quad\text{such that}\quad (f\circ g)(x^1,\dots,x^n) := f(g(x^1),\dots,g(x^n)).$$
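As a small illustration of the definition, composition can be written as a higher-order function. The sketch below is a minimal example with hypothetical names (`compose`, `or_n`, `xor_m`); inputs to the composed function are given as a list of $n$ blocks of $m$ bits each.

```python
from functools import reduce
from operator import xor

def compose(f, g):
    """Return f∘g: apply g to each block, then f to the resulting bit vector."""
    def composed(blocks):
        return f([g(block) for block in blocks])
    return composed

def or_n(bits):
    """OR of a list of bits."""
    return int(any(bits))

def xor_m(bits):
    """XOR (parity) of a list of bits."""
    return reduce(xor, bits, 0)

# Example: OR of parities, evaluated on n = 2 blocks of m = 2 bits.
or_of_xors = compose(or_n, xor_m)
```

Here `or_of_xors([[1, 1], [0, 0]])` evaluates both inner parities to 0 and hence returns 0.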

A composition theorem aims to understand the query complexity of $f\circ g$ in terms of $f$ and $g$. Such theorems are known for deterministic query complexity, $D(f\circ g) = D(f)D(g)$ [Savicky2002, Tal2013, Montanaro2014], and quantum query complexity, $Q(f\circ g) = \Theta(Q(f)Q(g))$ [Hoyer2007, Reichardt2011]. The conjecture in the randomized case is:

###### Conjecture 1.

$R(f\circ g) \ge \Omega(R(f)R(g))$ for all boolean functions $f$ and $g$.

Gavinsky et al. [Gavinsky2019] have shown that the conjecture fails if $f$ is allowed to be a relation. They also show $R(f\circ g) \ge \Omega(R_{4/9}(f)\cdot\sqrt{R(g)})$ for any relation $f$ and partial function $g$. In a very recent work (concurrent to ours) Ben-David and Blais [Ben-David2020] have found a counterexample to the randomized conjecture for partial $f$ and $g$, albeit with a tiny query complexity compared to input length; see also Section 1.5. The conjecture is still open for total functions.

#### Fractional block sensitivity.

We show a new composition theorem in terms of fractional block sensitivity $\mathrm{fbs}$, introduced by [Tal2013, Gilmer2016]; see also [Kulkarni2016, Ambainis2018]. This measure is at most randomized query complexity, $\mathrm{fbs}(f) \le O(R(f))$, and it is equivalent to randomized certificate complexity [Aaronson2008].

Let us define $\mathrm{fbs}(f)$ for an $n$-bit $f$. We say that a block $B\subseteq[n]$ is sensitive on input $x$ iff $f(x)\ne f(x^B)$ where $x^B$ is $x$ but with bits in $B$ flipped. Fix an input $x$ and introduce a real weight $w_B$ for each sensitive block $B$ of $x$. Define $\mathrm{fbs}(f,x)$ as the optimum value of the following linear program

$$\max \sum_B w_B \quad\text{subject to}\quad \sum_{B\ni i} w_B \le 1 \ \ \forall i\in[n], \qquad w_B \ge 0 \ \ \forall B.$$

Finally, define $\mathrm{fbs}(f) := \max_x \mathrm{fbs}(f,x)$. For comparison, the more usual block sensitivity $\mathrm{bs}(f)$ [Nisan1991] is defined the same way except with the integral constraint $w_B\in\{0,1\}$. In particular $\mathrm{bs}(f)\le\mathrm{fbs}(f)$, and moreover a polynomial gap (power $3/2$) between the two is known for a total function [Gilmer2016].
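A tiny worked example may help separate the fractional and integral measures. The sketch below is an illustration under our own choice of function (3-bit majority at the all-zeros input, which is not an example from the paper) with hypothetical helper names. It brute-forces block sensitivity, then exhibits a feasible weighting of value $3/2$ together with a fractional cover of the same value, which by LP duality pins the optimum at $3/2$.

```python
from fractions import Fraction
from itertools import combinations

def maj3(x):
    """Majority of three bits."""
    return int(sum(x) >= 2)

def flip(x, block):
    """Return x with the bits indexed by block flipped."""
    return tuple(b ^ 1 if i in block else b for i, b in enumerate(x))

def sensitive_blocks(f, x):
    """All nonempty blocks B with f(x^B) != f(x)."""
    n = len(x)
    return [frozenset(B) for r in range(1, n + 1)
            for B in combinations(range(n), r)
            if f(flip(x, B)) != f(x)]

def block_sensitivity(f, x):
    """Brute force: size of the largest family of pairwise disjoint sensitive blocks."""
    blocks = sensitive_blocks(f, x)
    best = 0
    def extend(start, used, count):
        nonlocal best
        best = max(best, count)
        for idx in range(start, len(blocks)):
            if not (blocks[idx] & used):
                extend(idx + 1, used | blocks[idx], count + 1)
    extend(0, frozenset(), 0)
    return best

def is_feasible(weights, n):
    """LP constraints: w_B >= 0 and, for each position i, total weight of blocks containing i is <= 1."""
    return (all(w >= 0 for w in weights.values())
            and all(sum(w for B, w in weights.items() if i in B) <= 1
                    for i in range(n)))

x0 = (0, 0, 0)
blocks = sensitive_blocks(maj3, x0)
# Primal solution: weight 1/2 on each sensitive pair, total 3/2.
weights = {B: Fraction(1, 2) for B in blocks if len(B) == 2}
# Dual solution: 1/2 on each position covers every sensitive block, total 3/2.
cover = [Fraction(1, 2)] * 3
```

In this example the integral optimum is 1 (any two sensitive blocks intersect), while the fractional optimum is $3/2$.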

We make progress towards the composition conjecture; see Section 6 for the proof.

###### Theorem 3.

$R(f\circ g) \ge \Omega(\mathrm{fbs}(f)R(g))$ for any (partial) boolean functions $f$ and $g$.

The previous best comparable composition theorem was $R(f\circ g) \ge \Omega(\mathrm{bs}(f)R(g))$, a proof of which is virtually the same as for the result that ; see [Goos2018, §5.1]. In fact, we were originally motivated to consider the correlated samples problem when trying to strengthen this composition result from block sensitivity to fractional block sensitivity.

### 1.5 Independent work by Ben-David and Blais

In an independent and concurrent work, Ben-David and Blais [Ben-David2020] have also studied the randomized composition conjecture and ways of circumventing Shaltiel examples via improved minimax theorems. They develop a powerful framework for constructing hard Shaltiel-free distributions, which is general enough to apply not only to query complexity but also, for instance, to communication complexity. In particular, their framework is able to give an alternative proof of our main result (Theorem 1) as well as our $\mathrm{fbs}$-based composition theorem (Theorem 3). Their proof techniques involve information theory and analysis; by contrast, our techniques are more elementary and directly tailored to the correlated samples problem (which does not explicitly appear in their work).

We will prove our main theorem (Theorem 1) in Section 3 and Section 4. Before that, we introduce our basic notions regarding decision trees in Section 2. In Section 3, we characterize decision trees as likelihood boosters, emphasizing that a good query algorithm must make significant progress in terms of boosting the likelihood of one of the outputs (0 or 1) to much higher than the other, and vice versa. This characterization frees us from considering inputs from both $D_0$ and $D_1$ simultaneously: if an algorithm is certain about the output on $D_1$, then it must also make few errors on $D_0$. We thus reduce the proof of Theorem 1 to bootstrapping decision trees that can make overall progress across multiple samples to a decision tree that makes uniform progress. In Section 4, we build such a bootstrapping algorithm and show that it makes satisfactory progress with a careful analysis. Proofs for our two applications are in Section 5 and Section 6.

## 2 Preliminaries

Let $f\colon\Sigma^n\to\{0,1\}$ be a partial function for some alphabet $\Sigma$ (typically $\Sigma=\{0,1\}$). Let $D_0$, $D_1$ be distributions supported on $f^{-1}(0)$, $f^{-1}(1)$, respectively. For each $x\in\Sigma^n$, let $D_0(x)$ (resp. $D_1(x)$) denote the probability mass on $x$ in distribution $D_0$ (resp. $D_1$). For a subset $S\subseteq\Sigma^n$, we define $D_b(S) := \sum_{x\in S} D_b(x)$ for $b\in\{0,1\}$. If $D_b(S)>0$, we define the conditional distribution $D_b|S$ by $(D_b|S)(x) := D_b(x)/D_b(S)$ when $x\in S$, and $(D_b|S)(x) := 0$ when $x\notin S$. We define the likelihood-ratio of $S$ as

$$\mathrm{LR}(S) := \frac{D_1(S)}{D_0(S)}.$$

Let $T$ be a deterministic decision tree that takes as input a sample $x$ drawn from either $D_0$ or $D_1$. For every vertex $v$ in $T$, we use $S_v$ to denote the set of strings that can reach $v$, or equivalently, the set of strings that agree with all the queries made so far. Typically, every non-leaf vertex in $T$ corresponds to a query to a certain position in the sample, but we will allow non-leaf vertices in $T$ that do not make any query, each of them having only a single child $u$ with $S_u = S_v$. We abuse notation slightly and use $v$ as a shorthand for $S_v$, so we have $D_b(v) = D_b(S_v)$, and

$$\mathrm{LR}(v) = \frac{D_1(v)}{D_0(v)}.$$

Note that the likelihood-ratio is non-negative, but could be zero or infinite. We can eliminate the undefined case ($D_0(v) = D_1(v) = 0$) by trimming the unreachable parts of the decision tree.

Now if the decision tree $T$ takes as input $k$ samples from $D_0^k$ or $D_1^k$, it is not hard to see that $S_v$ can be written as a Cartesian product $S_{v_1}\times\cdots\times S_{v_k}$, where $S_{v_j}$ is the set of strings that agree with all the queries made to the $j$-th sample so far. Again, we abuse notation slightly and use $v_j$ as a shorthand for $S_{v_j}$, so we will often write $v = v_1\times\cdots\times v_k$. We define the overall likelihood ratio of $v$ as the product

$$\mathrm{OLR}(v) := \mathrm{LR}(v_1)\cdots\mathrm{LR}(v_k) = \frac{D_1(v_1)}{D_0(v_1)}\cdots\frac{D_1(v_k)}{D_0(v_k)}.$$

It is often more convenient to consider the logarithm of likelihood ratios. We will use natural logarithm throughout the paper, i.e. $\log e = 1$.
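The likelihood ratios above are easy to compute for explicit distributions. The following sketch is an illustration under our own assumptions: distributions are dictionaries from bit strings to probabilities, a vertex is represented by the partial assignment of queried positions, and all function names are hypothetical. The overall ratio is computed from products of masses, which agrees with the product of per-sample ratios whenever those are well defined.

```python
from math import inf, log

def mass(dist, partial):
    """Probability under dist of the strings consistent with {position: symbol}."""
    return sum(p for x, p in dist.items()
               if all(x[i] == c for i, c in partial.items()))

def ratio(m0, m1):
    """m1/m0, with the conventions LR = inf if only m0 is zero; undefined if both are."""
    if m0 == 0 and m1 == 0:
        raise ValueError("unreachable vertex: likelihood ratio is undefined")
    return inf if m0 == 0 else m1 / m0

def likelihood_ratio(d0, d1, partial):
    """LR(v) for the vertex described by one sample's partial assignment."""
    return ratio(mass(d0, partial), mass(d1, partial))

def overall_likelihood_ratio(d0, d1, partials):
    """OLR(v) for a vertex given as one partial assignment per sample."""
    m0 = m1 = 1.0
    for partial in partials:
        m0 *= mass(d0, partial)
        m1 *= mass(d1, partial)
    return ratio(m0, m1)

# Example: 3-bit majority, uniform over 0-inputs and over 1-inputs.
D0 = {x: 0.25 for x in ("000", "001", "010", "100")}
D1 = {x: 0.25 for x in ("111", "110", "101", "011")}
```

Seeing a 1 in the first position of one sample gives likelihood ratio 3; seeing it in two independent samples gives overall ratio 9, i.e. the log likelihood ratios add.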

## 3 Query Algorithms as Likelihood Boosters

Our overarching goal (Theorem 1) is to construct an efficient deterministic query algorithm that distinguishes $D_0$ from $D_1$, assuming the existence of one that distinguishes $D_0^k$ from $D_1^k$. As the starting point, we introduce the notion of likelihood boosters as a way of measuring the progress made by a query algorithm $T$ in distinguishing $D_0$ from $D_1$. The key idea is that, as more queries are being made, the algorithm narrows down the possibilities of the unknown input, driving the likelihood of one of the outputs (0 or 1) much higher than the other. In fact, we show that $T$ can distinguish $D_0$ from $D_1$ well if and only if a sample drawn from $D_1$ has a high probability of arriving at a leaf of $T$ where most of the remaining possibilities produce output 1 (Lemma 1 and Lemma 2).

In the multiple-sample setting, we use the notions of overall likelihood boosters and uniform likelihood boosters, which have different levels of guarantees, to measure the progress of a query algorithm on simultaneously classifying each of the samples in the input. We show that an efficient query algorithm that distinguishes $D_0^k$ from $D_1^k$ is an efficient overall likelihood booster (Corollary 2). Moreover, we show that an efficient uniform likelihood booster on multiple samples induces an efficient likelihood booster on a single sample (Lemma 3), which in turn implies an efficient query algorithm that distinguishes $D_0$ from $D_1$ (Lemma 1). These results will enable us to reduce proving Theorem 1 to relating overall likelihood boosters to uniform likelihood boosters, which is the focus of Section 4 (see Theorem 4).

We now formally define the three types of likelihood boosters mentioned above:

###### Definition 1.

We say a deterministic decision tree $T$ is a $(M,\delta)$-likelihood booster for $(D_0,D_1)$ if, with probability at least $1-\delta$, an input sample drawn from $D_1$ reaches a leaf $\ell$ of $T$ with likelihood ratio $\mathrm{LR}(\ell)\ge M$.

###### Definition 2.

We say a deterministic decision tree $T$ is a $(M,\delta)$-overall likelihood booster for $(D_0,D_1)$ if, with probability at least $1-\delta$, an input drawn from $D_1^k$ consisting of $k$ samples reaches a leaf $\ell$ of $T$ with overall likelihood ratio $\mathrm{OLR}(\ell)\ge M$.

###### Definition 3.

We say a deterministic decision tree is a -uniform likelihood booster for and if, with probability at least , an input drawn from consisting of samples reaches a leaf of with the property that at least different samples satisfy .

Note that the above definitions do not depend on the actual output of the decision tree $T$. We now show in the following two lemmas that likelihood boosters are in some sense equivalent to query algorithms that distinguish $D_0$ from $D_1$.

###### Lemma 1.

Suppose $T$ is a $(M,\delta)$-likelihood booster for $(D_0,D_1)$. Consider the deterministic decision tree $T'$ that makes exactly the same queries as $T$ and accepts if and only if a leaf $\ell$ with $\mathrm{LR}(\ell)\ge M$ is reached. Then $T'$ distinguishes $D_0$ from $D_1$ with the following guarantees:

1. (Completeness) $T'$ accepts a sample from $D_1$ with probability at least $1-\delta$.

2. (Soundness) $T'$ accepts a sample from $D_0$ with probability at most $1/M$.

###### Proof.

Completeness follows directly from the definition of likelihood booster. To prove soundness, consider the set $L$ of leaves $\ell$ with $\mathrm{LR}(\ell)\ge M$. For all $\ell\in L$, we have $D_0(\ell)\le D_1(\ell)/M$. Therefore, $D_0(L)\le D_1(L)/M\le 1/M$. This means that a sample from $D_0$ reaches leaves in $L$ with probability at most $1/M$, which is exactly the desired soundness. ∎
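The soundness argument can be checked mechanically on a toy instance. The sketch below rests on assumptions of ours: it reads the lemma as "accept exactly the leaves whose likelihood ratio is at least a threshold $M$", uses a tree that queries the first `depth` positions in order, and picks two small distributions on 3-bit strings with disjoint supports. The names `mass` and `booster_guarantees` are hypothetical. Exact rational arithmetic avoids rounding issues in the comparison with $M$.

```python
from fractions import Fraction as F
from itertools import product

# Toy distributions with disjoint supports (0-inputs and 1-inputs of 3-bit majority).
D0 = {"000": F(4, 10), "001": F(3, 10), "010": F(2, 10), "100": F(1, 10)}
D1 = {"111": F(4, 10), "011": F(3, 10), "101": F(2, 10), "110": F(1, 10)}

def mass(dist, prefix):
    """Probability under dist of the strings starting with prefix."""
    return sum(p for x, p in dist.items() if x.startswith(prefix))

def booster_guarantees(d0, d1, depth, M):
    """Tree that queries positions 0..depth-1 and accepts leaves with LR >= M.

    Returns (Pr_{D0}[accept], Pr_{D1}[accept])."""
    acc0 = acc1 = F(0)
    for bits in product("01", repeat=depth):
        leaf = "".join(bits)
        m0, m1 = mass(d0, leaf), mass(d1, leaf)
        # LR(leaf) >= M, where LR = infinity (m0 == 0 < m1) also counts.
        if m1 > 0 and m1 >= M * m0:
            acc0 += m0
            acc1 += m1
    return acc0, acc1
```

On every accepted leaf the $D_0$-mass is at most $1/M$ times the $D_1$-mass, so the acceptance probability under $D_0$ is at most $1/M$ regardless of how the tree is built.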

###### Lemma 2.

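Suppose a deterministic decision tree $T$ can distinguish $D_0$ from $D_1$ with the following guarantees: $T$ accepts a sample from $D_0$ with probability at most $\delta_0$, and $T$ accepts a sample from $D_1$ with probability at least $1-\delta_1$. Then $T$ is a $(M,\delta_1+M\delta_0)$-likelihood booster for any $M$.

###### Proof.

Let $U$ denote the set of leaves $\ell$ with $\mathrm{LR}(\ell)<M$. We can partition $U$ as $U = U_0\cup U_1$, where $U_1$ corresponds to the leaves at which $T$ accepts. Since $T$ accepts with probability at most $\delta_0$ on $D_0$, we have $\sum_{\ell\in U_1}D_0(\ell)\le\delta_0$. Similarly, we have $\sum_{\ell\in U_0}D_1(\ell)\le\delta_1$. Therefore,

$$\sum_{\ell\in U}D_1(\ell) = \sum_{\ell\in U_0}D_1(\ell)+\sum_{\ell\in U_1}D_1(\ell) \le \sum_{\ell\in U_0}D_1(\ell)+M\sum_{\ell\in U_1}D_0(\ell) \le \delta_1+M\delta_0.$$

In other words, a sample from $D_1$ has probability at most $\delta_1+M\delta_0$ of reaching a leaf in $U$, which means that $T$ is a $(M,\delta_1+M\delta_0)$-likelihood booster. ∎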

In the multiple-sample setting, if we view the pair $D_0^k$ and $D_1^k$ as $D_0$ and $D_1$ in the single-sample setting with input length multiplied by $k$, the definition of overall likelihood ratio coincides with the definition of likelihood ratio in the single-sample setting. Therefore, we have the following corollary of Lemma 2, which essentially shows that an efficient query algorithm for the correlated samples problem is an efficient overall likelihood booster:

###### Corollary 2.

Suppose a deterministic decision tree $T$ can distinguish $D_0^k$ from $D_1^k$ in that $T$ accepts an input from $D_0^k$ with probability at most $\delta_0$, and $T$ accepts an input from $D_1^k$ with probability at least $1-\delta_1$. Then $T$ is a $(M,\delta_1+M\delta_0)$-overall likelihood booster for any $M$.

To conclude this section, we show that an efficient uniform likelihood booster in the multiple-sample setting implies an efficient likelihood booster in the single-sample setting.

###### Lemma 3.

For any -uniform likelihood booster for and and any , there is a -likelihood booster for and with .

###### Proof.

Define . We first build a randomized query algorithm for and , and later derandomize it as . On input , generates random samples , selects a uniformly random index , replaces with ’s own input , and finally simulates on the modified samples . If attempts to make the -th query to the -th (modified) sample, halts.

It is easy to see that the maximum number of queries made by is at most . Moreover, by Markov’s inequality, if the input to is drawn from , the probability that halts early because of making more than queries to the -th sample is at most , since the average number of queries makes to the -th sample for a uniformly random is at most .

We now show that with probability at least , reaches a leaf of with when its own input is drawn from . By a union bound, we only need to show that this holds with probability at least for the extended version of that doesn’t halt early. If we switch the order of randomness so that is chosen after a leaf of is reached, this follows easily from the definition of uniform likelihood boosters (Definition 3).

Finally, we derandomize . Note that the randomness in only comes from the randomness in and in all the generated samples except the -th sample. We can simply fix them so that the probability of reaching a leaf of with is maximized, assuming that the -th sample is from . Since and all generated samples other than the -th sample have been fixed, the decision tree now “shrinks” to a decision tree with only the first queries to the -th sample remaining, and every leaf of that is reachable when we run now becomes a leaf of . Shrinking the tree doesn’t affect the computation history regarding the -th sample, so we have and . This proves that is a ()-likelihood booster. ∎

## 4 Bootstrapping Overall Booster to Uniform Booster

The results from the previous section (Section 3) reduce proving our main result (Theorem 1) to proving relations between overall likelihood boosters and uniform likelihood boosters. In this section, we complete this step with the following result:

###### Theorem 4.

Assume that there is a depth- -overall likelihood booster for every distribution pair . Then there is a depth- -uniform likelihood booster for every distribution pair whenever .

We first show how to derive Theorem 1 from Theorem 4:

###### Proof of Theorem 1.

We prove the inequality (the converse inequality is trivial). Suppose we have a depth- deterministic decision tree that solves the correlated samples problem on with success probability at least (recall that the success probability can be amplified by Fact 1). That is, the decision tree accepts inputs drawn from with probability at least and accepts inputs drawn from with probability at most . By Corollary 2, it is a -overall likelihood booster for and .

By Theorem 4, for any pair of distributions , there is a -uniform likelihood booster with depth . Then by Lemma 3, there is a -likelihood booster with depth for and , which by Lemma 1 implies a query algorithm for and with success probability at least . By the arbitrariness of and , we have via Yao’s minimax, as desired. ∎

The rest of this section is dedicated to proving Theorem 4. We construct the desired uniform likelihood booster, described in Section 4.1, by applying different overall likelihood boosters to appropriate sets of samples at different phases of computation. To quantify the progress made by this booster, we design a measure based on a “truncated” log likelihood ratio which handles samples that the booster is confident about with special care. As the technical core of the proof, we show that under our carefully constructed measure, the booster in expectation makes positive and constant progress during each phase of computation (Lemmas 4 and 5). Therefore, it is able to achieve the desired guarantees after sufficiently many phases.

### 4.1 Bootstrapping algorithm

We describe our depth- -uniform likelihood booster taking samples. Recall that each vertex of can be written as a Cartesian product , where is the set of strings that are consistent with the queries made to the -th sample so far. We say that the -th sample is settled at if

$$\mathrm{LR}(v_j) = \frac{D_1(v_j)}{D_0(v_j)} \notin [e^{-100}, e^{100}].$$

Note that it is possible for a sample to be settled in the wrong direction (e.g. $\mathrm{LR}(v_j) < e^{-100}$ on input drawn from $D_1^K$), but we will show that this is not a serious issue.

The query algorithm proceeds in at most phases (for some large constant ). Each phase consists of at most queries and is described as follows:

Phase :

1. If fewer than out of the samples are unsettled, halt.

2. Else, since each is determined by a string in recording the queries made so far to the -th sample, by the Pigeonhole Principle there exist unsettled samples with .

3. Run the depth- -overall likelihood booster , assumed in Theorem 4 to exist, relative to the input-distribution pair

$$(D_0|v^*)^k, \quad (D_1|v^*)^k$$

on the samples

$$(x^{j_1},\dots,x^{j_k}).$$

If any query causes one of these samples to become settled (i.e. for some ), halt and go to the next Phase. Otherwise we proceed to the next Phase after terminates. If fewer than queries are made in the current phase, insert dummy vertices that do not make any query (see Section 2) to so that each phase corresponds to a path in with length exactly .

### 4.2 Sub-martingale property of progress measure

It’s not hard to see that the overall likelihood ratio ($\mathrm{OLR}$) is not an effective measure of progress for the bootstrapping algorithm: $\mathrm{OLR}$ can rocket to infinity even when there is only one settled sample. In this subsection, we introduce a better progress measure: overall truncated log likelihood ratio ($\mathrm{OTLLR}$), and show that it is a sub-martingale along the computation path of any decision tree (Lemma 4). In other words, the algorithm always makes non-negative progress in expectation. We will show that each phase of the algorithm makes positive expected progress in the next subsection (Section 4.3).

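Let $T$ be a deterministic decision tree that takes as input $K$ samples. For every vertex $v = v_1\times\cdots\times v_K$ of $T$, we define the truncated log likelihood ratio of $v_j$ as

$$\mathrm{TLLR}(v_j) := \begin{cases}\log(\mathrm{LR}(v_j)), & \text{if } |\log(\mathrm{LR}(v_j))|\le 100,\\ 500, & \text{otherwise.}\end{cases}$$

Note that if $\log(\mathrm{LR}(v_j))$ slightly exceeds the upper threshold 100, we set $\mathrm{TLLR}(v_j)$ to a much higher value 500. Also, when $\log(\mathrm{LR}(v_j))$ drops below the lower threshold $-100$, we also set $\mathrm{TLLR}(v_j)$ to 500. Thus, the $j$-th sample is settled at $v$ if and only if $\mathrm{TLLR}(v_j) = 500$.

We define the overall truncated log-likelihood-ratio of $v$ as the sum

$$\mathrm{OTLLR}(v) := \sum_{j=1}^{K}\mathrm{TLLR}(v_j).$$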
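The truncation rule translates directly into code. The sketch below is a minimal illustration with hypothetical names (`tllr`, `otllr`); it takes the per-sample likelihood ratios as plain numbers in $[0,\infty]$ and applies the thresholds 100 and 500 stated above.

```python
from math import exp, inf, log

THRESHOLD = 100.0      # a sample is unsettled while |log LR| <= 100
SETTLED_VALUE = 500.0  # value assigned to every settled sample

def tllr(lr):
    """Truncated log likelihood ratio of one sample, given its LR in [0, inf]."""
    if lr == 0 or lr == inf:
        return SETTLED_VALUE          # log LR is -inf or +inf: settled
    llr = log(lr)
    return llr if abs(llr) <= THRESHOLD else SETTLED_VALUE

def otllr(lrs):
    """Overall truncated log likelihood ratio: sum of per-sample TLLR values."""
    return sum(tllr(lr) for lr in lrs)
```

Note that a sample settled in either direction contributes the same value 500, so a single extreme sample can no longer dominate the measure the way it dominates the overall likelihood ratio.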

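The input $x$ to $T$ determines a computation path from the root of $T$ to a leaf: $v^0, v^1, \dots$. The randomness in $x$ transfers to the randomness in the path, so the path is a stochastic process. We now show that $\mathrm{OTLLR}$ along the path is a sub-martingale when $x$ is drawn from $D_1^K$:

###### Lemma 4.

Assume that $T$ never queries a settled sample. Assume that the input $x$ to $T$ is drawn from $D_1^K$, $v$ is a non-leaf vertex with distance $t$ from the root, and $v$ is reachable (i.e. $\Pr[v^t = v] > 0$ on $x\sim D_1^K$). Define $\Delta_t := \mathrm{OTLLR}(v^{t+1}) - \mathrm{OTLLR}(v^t)$. Then we have

$$\mathbb{E}[\Delta_t \mid v^t = v] \ge 0.001\cdot\mathbb{E}[(\Delta_t)^2 \mid v^t = v] \ge 0.$$

###### Proof.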

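Let us condition on $v^t = v$ in the whole proof. If $v$ is a dummy vertex that does not make any query, then $\Delta_t = 0$ deterministically and the lemma holds trivially. We assume that $v$ is not a dummy vertex henceforth.

Suppose sample $j$ is queried at vertex $v$. We have $\Delta_t = \mathrm{TLLR}(v^{t+1}_j) - \mathrm{TLLR}(v^t_j)$. Since $T$ never queries a settled sample, we know $|\log\mathrm{LR}(v^t_j)|\le 100$.

Let $\sigma$ denote the random outcome of the query, and let $p_0(\sigma), p_1(\sigma)$ denote the probability that the outcome to the query is $\sigma$ under $D_0|v^t_j$, $D_1|v^t_j$, respectively. Let $H$ denote the set of $\sigma$ with $\big|\log\mathrm{LR}(v^t_j) + \log\frac{p_1(\sigma)}{p_0(\sigma)}\big| > 100$. Note that $\mathrm{TLLR}(v^t_j) = \log\mathrm{LR}(v^t_j)$ and $\mathrm{LR}(v^{t+1}_j) = \mathrm{LR}(v^t_j)\cdot\frac{p_1(\sigma)}{p_0(\sigma)}$, so

$$\mathrm{TLLR}(v^{t+1}_j) = \begin{cases}\mathrm{TLLR}(v^t_j) + \log\frac{p_1(\sigma)}{p_0(\sigma)}, & \sigma\notin H,\\ 500, & \sigma\in H.\end{cases}$$

Thus, $H$ is precisely the set of outcomes that make sample $j$ settled at $v^{t+1}$. Let $W$ denote the difference $\mathrm{TLLR}(v^{t+1}_j) - \mathrm{TLLR}(v^t_j)$. Our goal is to prove $\mathbb{E}[W]\ge 0.001\cdot\mathbb{E}[W^2]$.

Note that $W = \log\frac{p_1(\sigma)}{p_0(\sigma)}$ when $\sigma\notin H$ and $400\le W\le 600$ when $\sigma\in H$. We have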

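$$\mathbb{E}[W] \ \ge\ 400\sum_{\sigma\in H}p_1(\sigma) + \sum_{\sigma\notin H}p_1(\sigma)\log\frac{p_1(\sigma)}{p_0(\sigma)} \ =\ 400\sum_{\sigma\in H}p_1(\sigma) + \sum_{\sigma\notin H}p_0(\sigma)\cdot\frac{p_1(\sigma)}{p_0(\sigma)}\log\frac{p_1(\sigma)}{p_0(\sigma)}. \tag{1}$$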

By a helper lemma (Lemma 6) proved in Section 4.4, we know that

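$$\frac{p_1(\sigma)}{p_0(\sigma)}\log\frac{p_1(\sigma)}{p_0(\sigma)} \ \ge\ \left(\frac{p_1(\sigma)}{p_0(\sigma)} - 1\right) + \frac{1}{400}\cdot\frac{p_1(\sigma)}{p_0(\sigma)}\left(\log\frac{p_1(\sigma)}{p_0(\sigma)}\right)^2.$$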

Plugging this into (1), we have

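$$\begin{aligned}
\mathbb{E}[W] &\ge 400\sum_{\sigma\in H}p_1(\sigma) + \sum_{\sigma\notin H}p_1(\sigma) - \sum_{\sigma\notin H}p_0(\sigma) + \frac{1}{400}\sum_{\sigma\notin H}p_1(\sigma)\left(\log\frac{p_1(\sigma)}{p_0(\sigma)}\right)^2\\
&\ge 400\sum_{\sigma\in H}p_1(\sigma) + \Big(\sum_{\sigma\notin H}p_1(\sigma) - 1\Big) + \frac{1}{400}\sum_{\sigma\notin H}p_1(\sigma)\left(\log\frac{p_1(\sigma)}{p_0(\sigma)}\right)^2\\
&= 400\sum_{\sigma\in H}p_1(\sigma) - \sum_{\sigma\in H}p_1(\sigma) + \frac{1}{400}\sum_{\sigma\notin H}p_1(\sigma)\left(\log\frac{p_1(\sigma)}{p_0(\sigma)}\right)^2\\
&= 399\sum_{\sigma\in H}p_1(\sigma) + \frac{1}{400}\sum_{\sigma\notin H}p_1(\sigma)\left(\log\frac{p_1(\sigma)}{p_0(\sigma)}\right)^2\\
&= 399\sum_{\sigma\in H}p_1(\sigma) + \frac{1}{400}\sum_{\sigma\notin H}p_1(\sigma)(W(\sigma))^2\\
&\ge \frac{1}{1000}\sum_{\sigma\in H}p_1(\sigma)(W(\sigma))^2 + \frac{1}{400}\sum_{\sigma\notin H}p_1(\sigma)(W(\sigma))^2\\
&\ge \frac{1}{1000}\,\mathbb{E}[W^2].
\end{aligned}$$

∎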
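The two numeric facts used in this chain can be sanity-checked directly. The sketch below is only a numerical check on a grid, not a proof, and it makes an assumption about the range: for outcomes $\sigma\notin H$ both the old and the new log likelihood ratios lie in $[-100,100]$, so $y = \log\frac{p_1(\sigma)}{p_0(\sigma)}$ lies in $[-200,200]$. Dividing the helper inequality by $e^y > 0$ turns it into $y - 1 + e^{-y} \ge y^2/400$. The name `helper_gap` is hypothetical.

```python
from math import expm1

def helper_gap(y):
    """(LHS - RHS) / x of the helper inequality, where x = e^y.

    x*y >= (x - 1) + x*y^2/400  is equivalent to  y + (e^{-y} - 1) - y^2/400 >= 0."""
    return y + expm1(-y) - y * y / 400

# Grid over the assumed range [-200, 200] in steps of 0.01.
GRID = [i / 100 for i in range(-20000, 20001)]
```

The gap is non-negative on the whole grid but becomes negative for larger arguments such as $y = 400$, which is why the restriction to unsettled outcomes matters. The last step of the chain uses $399 \ge W^2/1000$, which holds for every $W\in[400,600]$ since $600^2/1000 = 360$.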

### 4.3 Bounding the conditional expectation of progress

In the previous subsection, we showed that $\mathrm{OTLLR}$, as a progress measure, is a sub-martingale. Now we refine our progress measure to also include the natural measure given by the number of settled samples, and show that each phase of the bootstrapping algorithm makes positive progress in expectation.

Recall that we inserted dummy vertices in to ensure that each phase corresponds to a computation path of length exactly . Therefore, an entire computation path of must have length divisible by : . The sub-path is the computation path of the -th phase.

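Define $S(v)$ as the number of settled samples at vertex $v$. Our new measure of progress is

$$P(v^t) := S(v^t) + \mathrm{OTLLR}(v^t).$$

###### Lemma 5.

Assume that the input $x$ to the bootstrapping algorithm is drawn from $D_1^K$, $v$ is a non-leaf vertex with distance $tL$ from the root, and $v$ is reachable (i.e. $\Pr[v^{tL} = v] > 0$ on $x\sim D_1^K$). Then we have

$$\mathbb{E}\big[P(v^{(t+1)L}) - P(v^{tL}) \mid v^{tL} = v\big] \ge 0.001.$$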

Before proving the lemma, we first show how it implies Theorem 4.

###### Proof of Theorem 4.

We consider an extended version of that always halts after exactly phases: whenever it would halt at line 1, it instead enters dummy phases and increases its total progress by 0.001 per phase (so that now