 # A Tight Composition Theorem for the Randomized Query Complexity of Partial Functions

We prove two new results about the randomized query complexity of composed functions. First, we show that the randomized composition conjecture is false: there are families of partial Boolean functions f and g such that R(f∘ g)≪ R(f) R(g). In fact, we show that the left hand side can be polynomially smaller than the right hand side (though in our construction, both sides are polylogarithmic in the input size of f). Second, we show that for all f and g, R(f∘ g)=Ω(noisyR(f)· R(g)), where noisyR(f) is a measure describing the cost of computing f on noisy oracle inputs. We show that this composition theorem is the strongest possible of its type: for any measure M(·) satisfying R(f∘ g)=Ω(M(f)R(g)) for all f and g, it must hold that noisyR(f)=Ω(M(f)) for all f. We also give a clean characterization of the measure noisyR(f): it satisfies noisyR(f)=Θ(R(f∘ gapmaj_n)/R(gapmaj_n)), where n is the input size of f and gapmaj_n is the √(n)-gap majority function on n bits.


## 1 Introduction

In any computational model, one may ask the following basic question: is computing $n$ independent copies of a function $g$ roughly $n$ times as hard as computing one copy? If so, a natural followup question arises: how hard is computing some function $f$ of these $n$ copies? Can this be characterized in terms of the complexity of the function $f$?

Query complexity is one of the simplest models of computation in which one can study these joint computation questions. In query complexity, a natural conjecture is that for any such functions $f$ and $g$, the cost of computing $f$ of $n$ copies of $g$ is roughly the cost of computing $f$ times the cost of computing $g$. Indeed, using $f\circ g$ to denote the composition of $f$ with $n$ copies of $g$, it is known that the deterministic query complexity (also known as the decision tree complexity) of composed functions satisfies $D(f\circ g)=D(f)D(g)$ [Tal13, Mon14]. It is also known that the quantum query complexity (to bounded error) of composed functions satisfies $Q(f\circ g)=\Theta(Q(f)Q(g))$ [Rei11, LMR11, Kim12].

However, despite significant interest, the situation for randomized query complexity is not well understood, and it is currently unknown whether $R(f\circ g)=\tilde{\Theta}(R(f)R(g))$ holds for all Boolean functions $f$ and $g$. It is known that the upper bound of $R(f\circ g)=O(R(f)R(g)\log R(f))$ holds. This follows from running an algorithm for $f$ on the outside, and then using an algorithm for $g$ to answer each query made by the algorithm for $f$. (The log factor in the bound is due to the need to amplify the success probability of the algorithm for $g$ so that it has small error.) The randomized composition conjecture in query complexity posits that there is a lower bound that matches this upper bound up to logarithmic factors; this conjecture is the focus of the current work.

###### Main Question.

Do all Boolean functions $f$ and $g$ satisfy $R(f\circ g)=\tilde{\Omega}(R(f)R(g))$?

Note that there are actually two different versions of this question, depending on whether $f$ and $g$ are allowed to be partial functions. A partial function is a function $f\colon S\to\{0,1\}$ where $S$ is a subset of $\{0,1\}^n$, and a randomized algorithm computing it is only required to be correct on the domain $S$ of $f$. (Effectively, the input string is promised to be inside this domain.) When composing partial functions $f$ and $g$, we get a new partial function $f\circ g$, whose domain is the set of strings for which the computation of $f$ and of each copy of $g$ are all well-defined. Since partial functions are a generalization of total Boolean functions, it is possible that the composition conjecture holds for total functions but not for partial functions. In this work, we will mainly focus on the more general partial function setting; when we do not mention anything about $f$ or $g$, they should be assumed to be partial Boolean functions.

### 1.1 Previous work

##### Direct sum and product theorems.

Direct sum theorems and direct product theorems study the complexity of , where is an arbitrary Boolean function but is the identity function. These are not directly comparable to composition theorems, but they are of a similar flavor.

Jain, Klauck, and Santha [JKS10] showed that randomized query complexity satisfies a direct sum theorem. Drucker [Dru12] showed that randomized query complexity also satisfies a direct product theorem, which means that cannot be solved even with small success probability. More recently, Blais and Brody [BB19] proved a strong direct sum theorem, showing that computing copies of can be even harder for randomized query complexity than times the cost of computing (due to the need for amplification).

##### Composition theorems for other complexity measures.

Several composition theorems are known for measures that lower bound $R(f)$; as such, these theorems can be used to lower bound $R(f\circ g)$ in terms of some smaller measure of $f$ and $g$.

First, though it is not normally phrased this way, the composition theorem for quantum query complexity [Rei11, LMR11] can be viewed as a composition theorem for a measure $Q(f)$ which lower bounds $R(f)$, since $Q(f)\le R(f)$ for all $f$. Interestingly, as a lower bound technique for $R(f)$, $Q(f)$ turns out to be incomparable to the other lower bounds on randomized query complexity for which composition is known, meaning that this composition theorem can sometimes be stronger than everything we know how to do using classical techniques.

Tal [Tal13] and independently Gilmer, Saks, and Srinivasan [GSS16] studied the composition behavior of simple measures like sensitivity, block sensitivity, and fractional block sensitivity. The behavior turns out to be somewhat complicated, but is reasonably well characterized in these works.

Göös and Jayram [GJ15] studied the composition behavior of conical junta degree, also known as approximate non-negative degree. This measure is a powerful lower bound technique for randomized algorithms and seems to be equal to for all but the most artificial functions; however, Göös and Jayram were only able to prove a composition for a variant of conical junta degree, and the variant appears to be weaker in some cases (or at least harder to use).

Ben-David and Kothari [BK16] showed a composition theorem for a measure they defined called randomized sabotage complexity, denoted . They showed that this measure is larger than fractional block sensitivity, and incomparable to quantum query complexity and conical junta degree. It is also nearly quadratically related to for total functions.

##### Composition theorems with a loss in g.

Some composition theorems are known that lower bound in terms of and some smaller measure of . For example, Ben-David and Kothari [BK16] also showed that , for the same measure mentioned above.

Anshu et al. [AGJ17] showed that , where is the randomized query complexity of to bias .

The above two results can also be used to give composition theorems of the form , where and are arbitrary Boolean functions but is a fixed small gadget designed to break up any “collusion” between and . [BK16] proved such a theorem when is the index function, while [AGJ17] proved it when is the parity function of size .

Finally, Gavinsky, Lee, Santha, and Sanyal [GLSS18] showed that , where is a measure they define. They showed that and that (even for partial functions ), which means their theorem also shows .

##### Composition theorems with a loss in f.

There have been very few composition theorems of the form for some measure . Göös, Jayram, Pitassi, and Watson [GJPW15] showed that , which can be generalized to , where denotes the sensitivity of .

Extremely recently, in work concurrent with this one, Bassilakis, Drucker, Göös, Hu, Ma, and Tan [BDG20] showed that , where is the fractional block sensitivity of . (This result also follows from our independent work in this paper.)

##### A relational counterexample to composition.

Gavinsky, Lee, Santha, and Sanyal [GLSS18] showed that the randomized composition conjecture is false when $f$ is allowed to be a relation. Relations are generalizations of partial functions, in which $f$ has non-Boolean output alphabet and there can be multiple allowed outputs for each input string. The authors exhibited a family of relations $f_n$ and a family of partial functions $g_n$ such that , , but .

This counterexample of Gavinsky, Lee, Santha, and Sanyal does not directly answer the randomized composition conjecture (which usually refers to Boolean functions only), but it does place restrictions on the types of tools which might prove it true. It appears that most or all of the composition theorems mentioned above do not use the fact that $f$ has Boolean outputs and apply equally well when $f$ is a relation, so those techniques cannot be used to prove the composition conjecture true without major new ideas.

### 1.2 Our results

Our first result shows that the randomized composition conjecture is false for partial functions.

###### Theorem 1.

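There is a family of partial Boolean functions $f_n$ and a family of partial Boolean functions $g_n$ such that $R(f_n)\to\infty$ and $R(g_n)\to\infty$ as $n\to\infty$, but

$$R(f_n\circ g_n)=O\!\left(R(f_n)^{2/3}\,R(g_n)^{2/3}\,\log^{2/3}R(f_n)\right).$$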

In this counterexample, $R(f_n\circ g_n)$ is polynomially smaller than what it was conjectured to be in the randomized composition conjecture. However, this counterexample actually uses functions $f_n$ and $g_n$ for which $R(f_n)$ and $R(g_n)$ are logarithmic in the input size of $f_n$. Therefore, the following slight weakening of the original randomized composition conjecture is still viable.

###### Conjecture 2.

For all partial Boolean functions $f$ and $g$,

$$R(f\circ g)=\Omega\!\left(\frac{R(f)R(g)}{\log n}\right),$$

where $n$ is the input size of $f$.

Hence, even for partial functions, the composition story is far from complete. This is in contrast to the setting in which $f$ is a relation, where in the counterexample of [GLSS18], the query complexity $R(f\circ g)$ is smaller than $R(f)R(g)$ by a polynomial factor even relative to the input size.

Our second contribution is a new composition theorem for randomized algorithms with a loss only in terms of $f$.

###### Theorem 3.

For all partial functions $f$ and $g$,

$$R(f\circ g)=\Omega(\mathrm{noisyR}(f)\,R(g)).$$

Here $\mathrm{noisyR}(f)$ is a measure we introduce, which is defined as the cost of computing $f$ when given noisy oracle access to the input bits; for a full definition, see Definition 18. As it turns out, $\mathrm{noisyR}(f)$ has a very natural interpretation, as the following theorem shows.

###### Theorem 4.

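For all partial functions $f$, we have

$$\mathrm{noisyR}(f)=\Theta\!\left(\frac{R(f\circ\textsc{GapMaj}_n)}{n}\right),$$

where $n$ is the input size of $f$ and $\textsc{GapMaj}_n$ is the majority function on $n$ bits with the promise that the Hamming weight of the input is either $n/2+\sqrt{n}$ or $n/2-\sqrt{n}$. Note that $R(\textsc{GapMaj}_n)=\Theta(n)$.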

In other words, $\mathrm{noisyR}(f)$ characterizes the cost of computing $f$ when the inputs to $f$ are given as $\sqrt{n}$-gap majority instances (divided by $n$, so that $\mathrm{noisyR}(f)=O(R(f))$). This means that our composition theorem reduces the randomized composition problem on arbitrary $f$ and $g$ to the randomized composition problem of $f$ with $\textsc{GapMaj}_n$.

###### Corollary 5.

For all partial functions $f$ and $g$, we have

$$R(f\circ g)=\Omega\!\left(\frac{R(f\circ\textsc{GapMaj}_n)}{R(\textsc{GapMaj}_n)}\cdot R(g)\right),$$

where $n$ is the input size of $f$.

These results hold even when $f$ is a relation. We also note that the counterexamples to composition theorems (the one for partial functions in Theorem 1 and the relational one in [GLSS18]) use the same function GapMaj as the inner function (or close variants of it). Therefore, there is a strong sense in which the function $\textsc{GapMaj}_n$ is the only interesting inner function for studying the randomized composition behavior of $f$.

Next, we observe that our composition theorem is the strongest possible theorem of the form $R(f\circ g)=\Omega(M(f)R(g))$ for any complexity measure $M$ of $f$. Formally, we have the following.

###### Lemma 6.

Let $M(\cdot)$ be any positive-real-valued measure of Boolean functions. Suppose that for all (possibly partial) Boolean functions $f$ and $g$, we have $R(f\circ g)=\Omega(M(f)R(g))$. Then for all $f$, we have $\mathrm{noisyR}(f)=\Omega(M(f))$.

###### Proof.

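By Theorem 4, we have

$$n\cdot\mathrm{noisyR}(f)=\Omega(R(f\circ\textsc{GapMaj}_n)),$$

where $n$ is the input size of $f$. Now, by our assumption on $M$, taking $g=\textsc{GapMaj}_n$ we obtain

$$R(f\circ\textsc{GapMaj}_n)=\Omega(M(f)\,R(\textsc{GapMaj}_n))=\Omega(M(f)\cdot n).$$

Hence $\mathrm{noisyR}(f)=\Omega(M(f))$, as desired. ∎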

The natural next step is to study the measure . We observe in Lemma 38 that . However, we believe that a much stronger lower bound should be possible. The following conjecture is equivalent to Conjecture 2.

###### Conjecture 7 (Equivalent to Conjecture 2).

For all (possibly partial) Boolean functions $f$,

$$\mathrm{noisyR}(f)=\Omega\!\left(\frac{R(f)}{\log n}\right).$$

The equivalence of the two conjectures follows from Theorem 3 in one direction, and from Lemma 6 in the other direction (taking $M(f)=R(f)/\log n$).
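The two directions of this equivalence can be written out as follows. This is a short sketch that assumes the statements of Theorem 3, Lemma 6, Conjecture 2, and Conjecture 7 as reconstructed from the surrounding text, with $n$ the input size of $f$.

```latex
% Conjecture 7 implies Conjecture 2, via Theorem 3:
R(f \circ g) = \Omega\big(\mathrm{noisyR}(f)\, R(g)\big)
             = \Omega\!\left(\frac{R(f)}{\log n}\, R(g)\right).

% Conjecture 2 implies Conjecture 7, via Lemma 6 with M(f) := R(f)/\log n:
R(f \circ g) = \Omega\big(M(f)\, R(g)\big) \text{ for all } f, g
\;\Longrightarrow\;
\mathrm{noisyR}(f) = \Omega\big(M(f)\big)
                   = \Omega\!\left(\frac{R(f)}{\log n}\right).
```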

One major barrier for proving Conjecture 7 is that it is false for relations. Indeed, the family of relations from [GLSS18] has and . Any lower bound for must therefore either be specific to functions (and not work for relations), or else must satisfy for that family of relations, even though (which means is a poor lower bound on , at least for some relations).

### 1.3 Our techniques

#### 1.3.1 Main idea for the counterexample

The main idea for the counterexample to composition is to take and to construct a function that only requires some of its bits to be computed to bias instead of exactly. Achieving bias will be disproportionately cheap for an input to compared to an input to .

This is the same principle used for the relational counterexample of [GLSS18]. There, the authors took to be the relational problem of taking an input and returning an output with the property that . This can be done using either exact queries to , or using queries to with bias each. When is composed with and , it’s not hard to verify that , even though and .

To convert into a partial Boolean function, we use the indexing trick. We let the first bits of represent a string , and we want to force an algorithm to find a string that’s within Hamming weight of . To do so, we can try adding an array of length to the input of , with entries indexed by . We’ll fill the array with on positions indexed by strings that are far from . On positions corresponding to strings within of , we’ll put either all s or all s, and we’ll require the algorithm to output in the former case and in the latter case (promised one of the two cases hold).

The above construction doesn’t quite work, because a randomized algorithm can cheat: instead of finding a string close to , it can simply search the array for a non- bit and output that bit. Since a constant fraction of the Boolean hypercube is within of , this strategy will succeed after a constant number of queries. To fix this, all we need to do is increase the gap from to , so that is required to be within of . Now the non- positions in the array will fill only a fraction of the array, and a randomized algorithm has no hope of finding one of those positions with a small number of random guesses. The input size of will be . Then we have , , but as we can solve by querying each of the first copies of times each, getting bias for each of the bits of , which provides a good string with high probability.

#### 1.3.2 Main idea for the composition theorem

The main idea for proving the composition theorem is to try to turn an algorithm for into an algorithm for . This is the standard approach for most composition theorems, and the main question becomes how to solve when we only have an algorithm which makes queries to an -length input for . When the algorithm queries bit inside copy of , and we only have an -bit input to , what do we query?

One solution would be to fix hard distributions and for , and then, when makes a query to bit inside copy of , we can query , sample an -bit string from , and then return the -th bit of that string. However, this uses a lot of queries: in the worst case, one query to would be needed for each query makes, giving only the upper bound instead of something closer to . The goal is to simulate the behavior of while avoiding making queries to as much as possible.

One insight (also used in previous work) is that if bit is queried inside copy of , we only need to query from the real input if and disagree on the -th bit with substantial probability. In [GLSS18], the approach was to first try to generate the answer from and , and see if they happen to agree; this way, querying the real input is only needed in case they disagree.

We do something slightly different: we assume we have access to a (very) noisy oracle for , and use calls to the oracle to generate bit from without actually finding out . In effect, this lets us use the squared-Hellinger distance between the marginal distributions and as the cost of generating the sample, instead of using the total variation distance between and . That is, we charge a cost for the noisy oracle calls in a special way, which ensures that the total cost of the noisy oracle calls will be proportional to the squared-Hellinger distance between the transcript of when run on and when run on . In other words, the cost our algorithm pays for simulating will be proportional to how much solved the copies of , as tracked by the Hellinger distance of the transcript of (i.e. its set of queries and query answers) on vs. . It turns out this way of tracking the progress of in solving is tight, at least for the appropriate choice of hard distributions and for . Therefore, this will give us an algorithm for that has only cost, though this algorithm for will require noisy oracles for the bits of the input—that is to say, it will be a algorithm instead of an algorithm.

One wrinkle is that the hard distribution produced by Yao’s minimax theorem is not sufficient to give the hardness guarantee we will need from and . Roughly speaking, we will need and to be such that distinguishing them with squared-Hellinger distance requires at least queries, uniformly across all choices of . To get such a hard distribution, we use our companion paper [BB20]. The concurrent work of [BDG20] also gives a sufficiently strong hard distribution for (though it is phrased somewhat differently).

#### 1.3.3 Noisy oracle model

The noisy oracle model we will use is the following. There is a hidden bit $b\in\{0,1\}$ known to the oracle. The oracle will accept queries with any parameter $\gamma\in[0,1]$, and will return a bit that has bias $\gamma$ towards $b$, that is, a bit that equals $b$ with probability $(1+\gamma)/2$ (independently sampled for each query call). This oracle can be called any number of times with possibly different parameters, but each call with parameter $\gamma$ costs $\gamma^2$. (The cost $\gamma^2$ is a natural choice, as it would take $\Theta(1/\gamma^2)$ bits of bias $\gamma$ to determine the bit $b$ with constant error.)
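The oracle described above can be mocked up in a few lines. The following is a minimal sketch, not the paper's formal definition: it assumes the parameter is a number `gamma` in $[0,1]$, that the answer equals the hidden bit with probability `(1 + gamma) / 2`, and that each call is charged `gamma ** 2`. The class name `NoisyOracle` and its fields are hypothetical.

```python
import random


class NoisyOracle:
    """Hypothetical noisy oracle for one hidden bit (illustration only).

    Assumption: a call with parameter gamma returns the hidden bit with
    probability (1 + gamma) / 2 and is charged a cost of gamma ** 2.
    """

    def __init__(self, hidden_bit, rng=None):
        self._bit = hidden_bit
        self._rng = rng if rng is not None else random.Random(0)
        self.total_cost = 0.0

    def query(self, gamma):
        if not 0.0 <= gamma <= 1.0:
            raise ValueError("gamma must lie in [0, 1]")
        self.total_cost += gamma ** 2
        # The answer is correct with probability (1 + gamma) / 2.
        correct = self._rng.random() < (1 + gamma) / 2
        return self._bit if correct else 1 - self._bit


oracle = NoisyOracle(hidden_bit=1)
exact_answer = oracle.query(1.0)   # gamma = 1 behaves like an ordinary query
noisy_answer = oracle.query(0.1)   # weak evidence, charged only 0.01
```

With `gamma = 1.0` every call returns the hidden bit and costs 1, which matches the remark that the model then reverts to ordinary queries.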

The measure is defined as the cost of computing (to worst-case bounded error) using noisy oracle access to each bit in the input of . That is, instead of receiving query access to the -bit string , we now have access to noisy oracles, one for each bit of . We can call each oracle with any parameter of our choice, at the cost of per such call. The goal is to compute to bounded error using minimum expected cost (measured in the worst case over inputs ). We note that by using each time, this reverts to the usual query complexity of , meaning that .

The key to our composition theorem lies in using such a noisy oracle for a bit to generate a sample from a distribution (distribution marginalized to bit ) without learning . More generally, suppose we have two distributions, and , and we wish to sample from one of them, but we don’t know which one. The choice of which distribution to sample from depends on a hidden bit , and we have noisy oracle access to . Suppose we know that and are close, say . How many queries to this noisy oracle do we need to make in order to generate this sample?

We show that using such noisy oracle calls, we can return a sample from with an expected cost of . When and are close, this is a much lower cost than the cost of extracting . In other words, when the distributions are close, we can return a sample from (without any error) without learning the value of the bit ! This is the key insight that allows our composition result to work.

#### 1.3.4 Main idea for characterizing noisyR(f)

In order to show that , we first note that the upper bound follows from our composition theorem: that is, , and . For the lower bound direction, we need to convert a algorithm (which makes noisy oracle calls to the input bits, with cost for a noisy oracle call with parameter ) into an algorithm for where each query costs . Recalling that is the majority function with the promise that the Hamming weight of the input is , it’s not hard to see that a single random query to a gadget (with cost each) is the same thing as a noisy oracle query with . Also, querying all bits in a (with cost in total) is the same thing as a noisy oracle query with .

To finish the argument, all we have to show is that a algorithm can always be assumed to make only queries with or . Now, it is well-known that an oracle with bias can be amplified to an oracle with bias by calling it times and taking the majority of the answers. Since oracle calls with parameter cost us , this fact ensures that we only need to make noisy oracle calls with parameter either or , where is extremely small – smaller than anything used by an optimal (or at least near-optimal) algorithm. This is because for any desired bias level larger than , we could simply amplify the calls.
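The amplification step can be checked numerically. The sketch below computes the exact bias of a majority vote over `k` independent answers, under the assumption that an answer of bias `gamma` is correct with probability `(1 + gamma) / 2`; the function name `majority_bias` is hypothetical.

```python
from math import comb


def majority_bias(gamma, k):
    """Exact bias of the majority of k independent answers of bias gamma.

    Assumption: each answer is correct with probability (1 + gamma) / 2.
    k must be odd so that the majority is always well defined.
    """
    if k % 2 == 0:
        raise ValueError("use an odd number of calls to avoid ties")
    p = (1 + gamma) / 2
    # Probability that more than half of the k answers are correct.
    p_majority_correct = sum(
        comb(k, j) * p ** j * (1 - p) ** (k - j)
        for j in range(k // 2 + 1, k + 1)
    )
    return 2 * p_majority_correct - 1


# Three calls of bias 0.2 cost 3 * 0.2**2 = 0.12 in the noisy oracle model
# and yield a single answer of slightly larger bias.
amplified = majority_bias(0.2, 3)
```

Under the cost model above, `k` calls with parameter `gamma` cost `k * gamma ** 2`, so the interesting comparison is between that total and the square of the amplified bias.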

Hence it only remains to show how to simulate noisy oracle queries with an arbitrarily small parameter using noisy oracle queries with parameter . For this, we consider a random walk on a line that starts at $0$ and flips a coin when deciding whether to step forwards or backwards. Consider making this walk starting at $0$, walking until either $k$ or $-k$ is reached, and then stopping (where $k$ is some fixed integer). Note that the probability that neither $k$ nor $-k$ is ever reached after infinitely many steps is $0$. We then make the following key observation: the probability distribution over the sequence of steps of this walk, conditioned on reaching $k$ before $-k$, is the same whether $b=0$ or $b=1$. Therefore, it is possible to generate the full walk by generating the sequence of multiples of $k$ the walk will reach (in a way that depends on $b$), and then completely separately, and independently of $b$, generating the sequence of steps between one multiple and the next, up to negation.
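The key observation about the walk can be verified by brute force on small cases. The sketch below assumes each step is $+1$ with probability `(1 + gamma) / 2` under one value of the hidden bit and with probability `(1 - gamma) / 2` under the other; all function names are hypothetical. It enumerates every step sequence that reaches $+k$ before $-k$ within a length cap and compares the two probabilities of each sequence. If the ratio is the same for every sequence, the conditional distributions coincide.

```python
def paths_hitting_top_first(k, max_len):
    """All +1/-1 step sequences of length <= max_len that stop on first
    reaching +k and never touch -k beforehand."""
    found = []

    def extend(prefix, pos):
        if pos == k:
            found.append(tuple(prefix))
            return
        if pos == -k or len(prefix) == max_len:
            return
        for step in (1, -1):
            prefix.append(step)
            extend(prefix, pos + step)
            prefix.pop()

    extend([], 0)
    return found


def path_probability(path, gamma):
    """Probability of a step sequence when +1 has probability (1 + gamma) / 2."""
    p_up = (1 + gamma) / 2
    prob = 1.0
    for step in path:
        prob *= p_up if step == 1 else 1 - p_up
    return prob


def likelihood_ratios(k, max_len, gamma):
    """Ratio of path probabilities under bias +gamma versus bias -gamma."""
    return [
        path_probability(path, gamma) / path_probability(path, -gamma)
        for path in paths_hitting_top_first(k, max_len)
    ]


ratios = likelihood_ratios(k=2, max_len=8, gamma=0.3)
```

Every such path has exactly `k` more up-steps than down-steps, so each ratio equals `((1 + gamma) / (1 - gamma)) ** k` regardless of the path, which is why conditioning on the endpoint removes the dependence on the hidden bit.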

To simulate a bias oracle with a bias oracle, we can use the latter to generate the sequence of multiples of described above, with . We generate this sequence one at a time. For each one, we can then generate calls to the bias oracle, where is the (random) number of steps the random walk takes to go from one multiple of to the next. This simulation is perfect: it produces the distribution of any number of calls to the -bias oracle. It also turns out to use the right number of noisy oracle queries in the long run. The only catch is that if the algorithm makes only one noisy oracle call with bias , this still requires one call to the oracle of bias , at a cost of instead of . Since there are total bits, this means the simulation can suffer an additive cost of . To complete the argument, we then show that for every non-constant Boolean function .

## 2 Preliminaries and definitions

### 2.1 Query complexity

We introduce some basic concepts in query complexity. For a survey, see [BdW02]. Fractional block sensitivity can be found in [Aar08, KT13].

##### Partial Boolean functions.

In this work, we will refer to partial Boolean functions, which are functions $f\colon S\to\{0,1\}$ where $S\subseteq\{0,1\}^n$ and $n$ is a positive integer. For a partial function $f$, the term promise refers to its domain $S$, which we also denote by $\mathrm{Dom}(f)$. If $S=\{0,1\}^n$, we say $f$ is a total function.

##### Composition.

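For partial Boolean functions $f$ and $g$ on $n$ and $m$ bits respectively, we define their composition, denoted $f\circ g$, as the Boolean function on $nm$ bits with the following properties. $\mathrm{Dom}(f\circ g)$ will contain the set of $nm$-bit strings which are concatenations of $n$ different $m$-bit strings in $\mathrm{Dom}(g)$, say $x^1,x^2,\dots,x^n$, where the tuple must have the property that the string $g(x^1)g(x^2)\dots g(x^n)$ is in $\mathrm{Dom}(f)$. The value of $f\circ g$ on such a string is then defined as $f(g(x^1),g(x^2),\dots,g(x^n))$.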
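The definition of composition for partial functions translates directly into code. The following is a minimal sketch under an assumed encoding: a partial function is a Python callable on bit tuples that returns `None` outside its promise. The names `compose`, `or_fn`, and `all_equal` are hypothetical.

```python
def compose(f, g, n, m):
    """Return f composed with n copies of g, as a function on n*m bits.

    Assumption: f and g take tuples of bits and return 0, 1, or None,
    where None means the input is outside the function's promise.
    """
    def composed(x):
        if len(x) != n * m:
            raise ValueError("expected an input of length n * m")
        inner_values = []
        for i in range(n):
            value = g(tuple(x[i * m:(i + 1) * m]))
            if value is None:
                return None  # some copy of g is outside its promise
            inner_values.append(value)
        return f(tuple(inner_values))  # may itself be None

    return composed


def or_fn(bits):
    """Total function: OR of the bits."""
    return int(any(bits))


def all_equal(bits):
    """Partial function: defined only on constant strings."""
    return bits[0] if len(set(bits)) == 1 else None


h = compose(or_fn, all_equal, n=2, m=2)
```

Here `h((0, 0, 1, 1))` evaluates the two inner blocks to 0 and 1 and returns their OR, while `h((0, 1, 1, 1))` returns `None` because the first block violates the promise of `all_equal`.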

##### Partial assignments.

A partial assignment is a string in $\{0,1,*\}^n$ representing partial knowledge of a string in $\{0,1\}^n$. We say two partial assignments $p$ and $q$ are consistent if they agree on the non-$*$ bits, that is, for every $i\in[n]$ we have either $p_i=q_i$ or $p_i=*$ or $q_i=*$ (we use $[n]$ to denote $\{1,2,\dots,n\}$).

##### Decision trees.

A decision tree on bits is a rooted binary tree whose leaves are labeled by and whose internal nodes are labeled by . We do not allow two internal nodes of a decision tree to have the same label if one is a descendant of the other. We interpret a decision tree as a deterministic algorithm which takes as input a string , starts at the root, and at each internal node with label , the algorithm queries and then goes left down the tree if and right if . When this algorithm reaches a leaf, it outputs its label. We denote by the number of queries makes when run on , and by the height of the tree . We denote the output of on input by . We say computes Boolean function if for all .

##### Randomized decision trees.

A randomized decision tree on bits is a probability distribution over deterministic decision trees on bits. We denote by the expectation of over decision trees sampled from . If is a distribution over , we further denote by the expectation of over sampled from . We denote by the maximum of over , and by the maximum of over in the support of . Further, we let

denote the random variable

with sampled from . We say computes to error if for all .

##### Randomized query complexity.

The randomized query complexity of a Boolean function to error , denoted , is the minimum height of a randomized decision tree computing to error . The expectation version of the randomized query complexity of , denoted , is the minimum value of of a randomized decision tree computing to error . When , we omit it and write and . We note that randomized query complexity can be amplified by repeating the algorithm a few times and taking the majority vote of the answers; for this reason, the constant is arbitrary and any other constant in could work for the definition. Note that in the constant error regime, , since we can cut off paths of a algorithm that run too long and use Markov’s inequality to argue that we only suffer a constant error penalty for this.

##### Block sensitivity.

Let be a Boolean function and let . A sensitive block of at is a subset such that and , where denotes the string with bits in flipped (i.e.  for and for ). The block sensitivity of at , denoted , is the maximum number of disjoint sensitive blocks of at . The block sensitivity of , denoted , is the maximum value of over . We note that , since if are disjoint sensitive blocks of at , then a randomized algorithm must make queries to determine whether the input is or for some .
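For small inputs, block sensitivity can be computed by exhaustive search, which makes the definition concrete. This is a brute-force sketch that runs in exponential time and is meant for illustration only; it assumes a function is a callable on bit tuples returning `None` outside its domain, and the names `flip` and `block_sensitivity_at` are hypothetical.

```python
from itertools import combinations


def flip(x, block):
    """Return x with the bits indexed by block flipped."""
    return tuple(b ^ 1 if i in block else b for i, b in enumerate(x))


def block_sensitivity_at(f, x):
    """Maximum number of pairwise disjoint sensitive blocks of f at x."""
    n = len(x)
    fx = f(x)
    sensitive = []
    for size in range(1, n + 1):
        for block in combinations(range(n), size):
            fy = f(flip(x, block))
            # The flipped input must be in the domain and change the value.
            if fy is not None and fy != fx:
                sensitive.append(frozenset(block))

    def best(start, used):
        # Largest disjoint family drawn from sensitive[start:], avoiding `used`.
        result = 0
        for j in range(start, len(sensitive)):
            if not (sensitive[j] & used):
                result = max(result, 1 + best(j + 1, used | sensitive[j]))
        return result

    return best(0, frozenset())


def or3(bits):
    return int(any(bits))


bs_at_zero = block_sensitivity_at(or3, (0, 0, 0))
```

For OR on three bits every nonempty block is sensitive at the all-zero input, so the three singleton blocks give the maximum disjoint family there.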

##### Fractional block sensitivity.

Fix a Boolean function $f$ and an input $x\in\mathrm{Dom}(f)$, and let $\mathcal{B}$ be the set of all sensitive blocks of $f$ at $x$. We consider weighting schemes assigning non-negative weights $w_B$ to blocks $B\in\mathcal{B}$. We say such a scheme is feasible if for each $i\in[n]$, the sum of $w_B$ over all blocks $B$ containing $i$ is at most $1$. The fractional block sensitivity of $f$ at $x$, denoted $\mathrm{fbs}_x(f)$, is the maximum total weight in such a feasible weighting scheme. The fractional block sensitivity of $f$, denoted $\mathrm{fbs}(f)$, is the maximum of $\mathrm{fbs}_x(f)$ over all $x\in\mathrm{Dom}(f)$. We note that $R(f)=\Omega(\mathrm{fbs}(f))$. To see this, let $R$ be a randomized algorithm solving $f$, let $x$ be an input, and for $i\in[n]$ let $p_i$ be the probability that $R$ queries bit $i$ when run on $x$. If, for any sensitive block $B$, we have $\sum_{i\in B}p_i<1/3$, then $R$ does not distinguish $x$ from $x^B$ with constant probability, which means $R$ fails to compute $f$ to bounded error (since $f(x)\ne f(x^B)$). So we have $\sum_{i\in B}p_i=\Omega(1)$ for all $B\in\mathcal{B}$. Then

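$$\mathrm{height}(R)\ge\sum_{i\in[n]}p_i\ge\sum_{i\in[n]}p_i\sum_{B\in\mathcal{B}:\,i\in B}w_B=\sum_{B\in\mathcal{B}}w_B\sum_{i\in B}p_i=\Omega\Big(\sum_{B\in\mathcal{B}}w_B\Big)=\Omega(\mathrm{fbs}_x(f)).$$
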
##### Relations.

A relation is a subset of for some finite alphabet . When computing a relation , we only require that an algorithm given input outputs some satisfying . In other words, each input may have many valid outputs. It is not hard to generalize the definitions of and to include relations: the decision trees need leaves labeled by , but otherwise everything works the same (though one catch is that amplification no longer works, which means becomes a different measure for different values of ). Note that relations generalize partial functions, because instead of restricting the inputs to a promise set , we can simply allow all possible outputs for every . With this in mind, it is not hard to see that composition is well-defined if is a relation, so long as remains a (possibly partial) Boolean function. In general, we will define measures for Boolean functions and later wish to apply them to relations; this will usually work without too much trouble.

### 2.2 Distance measures for distributions

In this work, we will only consider finite-support distributions and finite-support random variables. For a distribution , we will use to denote the conditional distribution of conditioned on event . If is a distribution over and is a partial assignment, we will also use to denote the distribution conditioned on the string sampled from agreeing with the partial assignment . If is a distribution over and is an index, we will use to denote the marginal distribution of on the bit (the distribution we get by sampling from and returning ).

The following distance measures will be useful. All logarithms are base 2.

###### Definition 8 (Distance measures).

For probability distributions $\mu_0$ and $\mu_1$ over a finite support $S$, define the squared-Hellinger, symmetrized chi-squared, Jensen-Shannon, and total variation distances respectively as follows:

$$\begin{aligned}
h^2(\mu_0,\mu_1) &\coloneqq \frac12\sum_{x\in S}\left(\sqrt{\mu_0[x]}-\sqrt{\mu_1[x]}\right)^2,\\
S^2(\mu_0,\mu_1) &\coloneqq \frac12\sum_{x\in S}\frac{(\mu_0[x]-\mu_1[x])^2}{\mu_0[x]+\mu_1[x]},\\
\mathrm{JS}(\mu_0,\mu_1) &\coloneqq \frac12\sum_{x\in S}\left(\mu_0[x]\log\frac{2\mu_0[x]}{\mu_0[x]+\mu_1[x]}+\mu_1[x]\log\frac{2\mu_1[x]}{\mu_0[x]+\mu_1[x]}\right),\\
\Delta(\mu_0,\mu_1) &\coloneqq \frac12\sum_{x\in S}\left|\mu_0[x]-\mu_1[x]\right|.
\end{aligned}$$

We will need a few basic claims regarding the properties of various distance measures between probability distributions. The first one relates these distance measures to each other. This is known in the literature, though the citations are hard to track down; some parts of this inequality chain follow from [Top00], some parts from [MCAL17], and for others we cannot find a good citation. In any case, a proof of the complete chain is provided in the appendix of our companion manuscript [BB20].

###### Claim 9 (Relationship of distance measures).

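For probability distributions $\mu_0$ and $\mu_1$,

$$h^2(\mu_0,\mu_1)\le\mathrm{JS}(\mu_0,\mu_1)\le S^2(\mu_0,\mu_1)\le 2h^2(\mu_0,\mu_1).$$

We also have $h^2(\mu_0,\mu_1)\le\Delta(\mu_0,\mu_1)\le\sqrt{2}\,h(\mu_0,\mu_1)$.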
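The four distances and the inequality chain of Claim 9 are easy to evaluate on concrete distributions. The sketch below assumes the definitions $h^2=\frac12\sum(\sqrt{\mu_0}-\sqrt{\mu_1})^2$, $S^2=\frac12\sum(\mu_0-\mu_1)^2/(\mu_0+\mu_1)$, the base-2 Jensen-Shannon distance, and $\Delta=\frac12\sum|\mu_0-\mu_1|$, with distributions given as equal-length lists of probabilities. The function names are hypothetical, and terms with zero probability are skipped using the convention $0\log 0=0$.

```python
from math import log2, sqrt


def hellinger_sq(p, q):
    return 0.5 * sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q))


def sym_chi_sq(p, q):
    # Points where both distributions vanish contribute nothing.
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def jensen_shannon(p, q):
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            total += a * log2(2 * a / (a + b))
        if b > 0:
            total += b * log2(2 * b / (a + b))
    return 0.5 * total


def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


mu0 = [0.5, 0.5]
mu1 = [0.9, 0.1]
h2 = hellinger_sq(mu0, mu1)
js = jensen_shannon(mu0, mu1)
s2 = sym_chi_sq(mu0, mu1)
chain_holds = h2 <= js <= s2 <= 2 * h2
```

A single example does not prove the claim, but it is a quick sanity check on the normalizations: all four distances are 0 for identical distributions and 1 for distributions with disjoint supports.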

Since the distance measures $h^2$, $\mathrm{JS}$, and $S^2$ are equivalent up to constant factors, one might wonder why we need all three. It turns out that the squared-Hellinger distance is mathematically the nicest (e.g. it tensorizes and behaves nicely under disjoint mixtures), the Jensen-Shannon distance has an information-theoretic interpretation that allows us to use tools from information theory, and the symmetrized chi-squared distance $S^2$ is the one that most naturally captures the cost of outputting a sample from $\mu_b$ given noisy oracle access to the bit $b$ (see Lemma 27). In addition, while the three are equivalent up to constant factors, this equivalence is fairly annoying to prove, so it makes more sense to refer to Claim 9 when we need to convert between them instead of reproving the conversion from scratch.

#### 2.2.1 Properties of the squared-Hellinger distance

###### Claim 10 (Hellinger tensorization).

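Fix distributions $\mu_0$ and $\mu_1$ with finite support, and let $\mu_0^{\otimes k}$ denote the distribution where $k$ independent samples from $\mu_0$ are returned (with $\mu_1^{\otimes k}$ defined similarly). Then

$$h^2(\mu_0^{\otimes k},\mu_1^{\otimes k})=1-\left(1-h^2(\mu_0,\mu_1)\right)^k.$$

###### Proof.

From the definition of $h^2$, it is not hard to see that $h^2(\mu_0,\mu_1)=1-F(\mu_0,\mu_1)$, with $F(\mu_0,\mu_1)=\sum_x\sqrt{\mu_0[x]\mu_1[x]}$ denoting the fidelity between $\mu_0$ and $\mu_1$. The claim that $F(\mu_0^{\otimes k},\mu_1^{\otimes k})=F(\mu_0,\mu_1)^k$ is easy to see, as it is simply the claim

$$\sum_{x_1}\sum_{x_2}\dots\sum_{x_k}\sqrt{\mu_0[x_1]\dots\mu_0[x_k]\cdot\mu_1[x_1]\dots\mu_1[x_k]}=\Big(\sum_x\sqrt{\mu_0[x]\mu_1[x]}\Big)^k.$$

∎
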
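The tensorization identity can be confirmed numerically by building the $k$-fold product distributions explicitly. This sketch assumes distributions are lists of probabilities over a common finite support; `tensor_power` is a hypothetical helper that enumerates all $k$-tuples of outcomes in a fixed order.

```python
from itertools import product
from math import prod, sqrt


def hellinger_sq(p, q):
    return 0.5 * sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q))


def tensor_power(p, k):
    """Probabilities of all k-tuples of independent samples from p."""
    return [prod(p[i] for i in idx) for idx in product(range(len(p)), repeat=k)]


mu0 = [0.5, 0.3, 0.2]
mu1 = [0.2, 0.2, 0.6]
k = 3

lhs = hellinger_sq(tensor_power(mu0, k), tensor_power(mu1, k))
rhs = 1 - (1 - hellinger_sq(mu0, mu1)) ** k
```

Both product distributions are enumerated in the same order, so `zip` pairs matching outcomes; `lhs` and `rhs` then agree up to floating-point rounding.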
###### Claim 11 (Hellinger interpretation).

For distributions $\mu_0$ and $\mu_1$, let $k$ be the minimum number of independent samples from $\mu_b$ necessary to be able to deduce $b$ with error at most $1/3$. Then

$$k=\Theta\!\left(\frac{1}{h^2(\mu_0,\mu_1)}\right),$$

with the constants in the big-$\Theta$ notation being universal.

###### Proof.

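This minimum $k$ is the minimum such that $\mu_0^{\otimes k}$ and $\mu_1^{\otimes k}$ can be distinguished with constant error; it is well-known that this is the same as saying $\Delta(\mu_0^{\otimes k},\mu_1^{\otimes k})$ is at least a constant. By Claim 9, this is the same as saying $h^2(\mu_0^{\otimes k},\mu_1^{\otimes k})$ is at least a constant. By Claim 10, this is the same as saying $1-(1-h^2(\mu_0,\mu_1))^k$ is at least a constant. The function $1-(1-x)^k$ behaves like $kx$ when $x$ is small compared to $1/k$, so the minimum such $k$ must be $\Theta(1/h^2(\mu_0,\mu_1))$. ∎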
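The last step of this proof can be made explicit with two elementary inequalities. This is a sketch, writing $x=h^2(\mu_0,\mu_1)\in[0,1]$ and using $1-t\le e^{-t}$ together with Bernoulli's inequality $(1-t)^k\ge 1-kt$.

```latex
1 - e^{-kx} \;\le\; 1 - (1-x)^k \;\le\; \min\{kx,\, 1\}.

% Upper bound: if k \le c/x for a small constant c, then 1-(1-x)^k \le c,
% so too few samples cannot give constant distinguishing advantage.
% Lower bound: if k \ge 1/x, then 1-(1-x)^k \ge 1 - 1/e,
% so k = \Theta(1/x) samples are both necessary and sufficient.
```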

###### Claim 12 (Hellinger of disjoint mixtures).

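Let $\{p_a\}$ and $\{q_a\}$ be families of distributions, with $a$ ranging over a finite set $S$. Suppose that for each $a,a'\in S$ with $a\ne a'$, it holds that the support of $p_a$ and $q_a$ is disjoint from the support of $p_{a'}$ and $q_{a'}$. Let $\mu$ be a distribution over $S$. Let $p_\mu$ denote the distribution that samples $a\sim\mu$ and then returns a sample from $p_a$, and let $q_\mu$ be defined similarly. Then

$$h^2(p_\mu,q_\mu)=\mathbb{E}_{a\sim\mu}\left[h^2(p_a,q_a)\right].$$

###### Proof.

As in the proof of Claim 10, it suffices to prove that the fidelity satisfies $F(p_\mu,q_\mu)=\mathbb{E}_{a\sim\mu}[F(p_a,q_a)]$. This is clear, as it is simply the claim

$$\sum_{a\in S}\sum_{x\in U_a}\sqrt{\mu[a]\,p_a[x]\,\mu[a]\,q_a[x]}=\sum_{a\in S}\mu[a]\sum_{x\in U_a}\sqrt{p_a[x]\,q_a[x]},$$

where $U_a$ denotes the union of the supports of $p_a$ and $q_a$. ∎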

#### 2.2.2 Properties of the Jensen-Shannon distance

Here we will need some standard notation from information theory. For random variables and with finite supports, we write for the entropy of , and for the mutual information between and . If is another random variable, we will write for the conditional mutual information, where we use the notation to denote the random variable conditioned on the event . We note that and .

The chain rule for mutual information is well-known.

###### Claim 13 (Chain rule for mutual information).

For random variables $X$, $Y$, and $Z$, we have

$$I(X;Y\mid Z)=I(XZ;Y)-I(Z;Y).$$

We now use information theory to characterize the Jensen-Shannon distance .

###### Claim 14 (Jensen-Shannon interpretation).

For finite-support probability distributions $\mu_0$ and $\mu_1$,

$$\mathrm{JS}(\mu_0,\mu_1)=I(X;\mu_X),$$

where $X$ is a $\mathrm{Bernoulli}(1/2)$ random variable.
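This identity can be checked numerically before reading the proof. The sketch below assumes $X$ is a uniform random bit and that $\mu_X$ denotes a sample drawn from $\mu_0$ or $\mu_1$ according to $X$; the mutual information is computed as $H(X)+H(\mu_X)-H(X\mu_X)$ with base-2 logarithms. The function names are hypothetical.

```python
from math import log2


def entropy(dist):
    """Shannon entropy in bits, with the convention 0 * log(1/0) = 0."""
    return sum(p * log2(1 / p) for p in dist if p > 0)


def selector_mutual_information(mu0, mu1):
    """I(X; mu_X) for a uniform bit X and a sample mu_X drawn from mu_X's law."""
    mixture = [(a + b) / 2 for a, b in zip(mu0, mu1)]   # law of the sample
    joint = [a / 2 for a in mu0] + [b / 2 for b in mu1]  # law of (X, sample)
    return 1.0 + entropy(mixture) - entropy(joint)       # H(X) = 1 bit


def jensen_shannon(mu0, mu1):
    total = 0.0
    for a, b in zip(mu0, mu1):
        if a > 0:
            total += a * log2(2 * a / (a + b))
        if b > 0:
            total += b * log2(2 * b / (a + b))
    return 0.5 * total


mu0 = [0.5, 0.5]
mu1 = [0.9, 0.1]
gap = abs(selector_mutual_information(mu0, mu1) - jensen_shannon(mu0, mu1))
```

The two quantities agree up to floating-point rounding, so `gap` is on the order of machine precision.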

###### Proof.

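Let $\mu=(\mu_0+\mu_1)/2$. We have

$$\begin{aligned}
I(X;\mu_X) &= H(X)+H(\mu_X)-H(X\mu_X)\\
&= 1+\sum_x\mu[x]\log\frac{1}{\mu[x]}-\frac12\sum_x\left(\mu_0[x]\log\frac{2}{\mu_0[x]}+\mu_1[x]\log\frac{2}{\mu_1[x]}\right)\\
&= 1+\frac12\sum_x\left(\mu_0[x]\log\frac{\mu_0[x]}{\mu_0[x]+\mu_1[x]}+\mu_1[x]\log\frac{\mu_1[x]}{\mu_0[x]+\mu_1[x]}\right).
\end{aligned}$$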

This last line equals the definition of by using