1 Introduction
Communication complexity is a common technique for establishing lower bounds on the resources required to solve problems, such as the memory required of a streaming algorithm. The multiplayer promise set disjointness problem is one of the most widely used problems from communication complexity in applications, not only in data streams [AMS99, BJK+04, CKS03, GRO09, JW09, JAY09, CLL+11], but also in compressed sensing [PW13], distributed functional monitoring [WZ12, WZ14], distributed learning [GMN14, KVW14, BGM+16a], matrix-vector query models
[SWY+19], voting [MPS+19, MSW20], and so on. We shall restrict ourselves to the study of set disjointness in the number-in-hand communication model, described below, which covers all of the above applications. Set disjointness is also well-studied in the number-on-forehead communication model; see, e.g., [GRO94, TES03, BPS+06, LS09, CA08, BH12, SHE12, SHE14], though we will not discuss that model here.

In the number-in-hand multiplayer promise set disjointness problem there are $t$ players with subsets $S_1, \ldots, S_t$, each drawn from $\{1, 2, \ldots, n\}$, and we are promised that either:

the sets $S_i$ are pairwise disjoint, or

there is a unique element occurring in all the sets, which are otherwise pairwise disjoint.
The promise set disjointness problem was posed by Alon, Matias, and Szegedy [AMS99], who showed an $\Omega(n/t^4)$ total communication bound in the blackboard communication model, where each player's message can be seen by all other players. This total communication bound was then improved to $\Omega(n/t^2)$ by Bar-Yossef, Jayram, Kumar, and Sivakumar [BJK+04], who further improved this bound to $\Omega(n/t^{1+\epsilon})$ for an arbitrarily small constant $\epsilon > 0$ in the one-way model of communication. These bounds were further improved by Chakrabarti, Khot, and Sun [CKS03] to $\Omega(n/(t \log t))$ in the general communication model and an optimal $\Omega(n/t)$ bound for one-way communication. The optimal $\Omega(n/t)$ total communication bound for general communication was finally obtained in [GRO09, JAY09].
To illustrate a simple example of how this problem can be used, consider the streaming model, one of the most important models for processing massive datasets. One can model a stream as a list of $m$ integers $a_1, \ldots, a_m \in \{1, 2, \ldots, n\}$, where each item $i$ has a frequency $f_i$ denoting its number of occurrences in the stream. We refer the reader to [BBD+02, MUT05] for further background on the streaming model of computation.
An important problem in this model is computing the $k$-th frequency moment $F_k = \sum_{i=1}^{n} f_i^k$. To reduce from the promise set disjointness problem, the first player runs a streaming algorithm on the items in its set, passes the state of the algorithm to the next player, and so on. The total communication is $O(ts)$ bits, where $s$ is the amount of memory used by the streaming algorithm. Observe that in the first case of the promise we have $F_k \le n$, while in the second case we have $F_k \ge t^k$. Setting $t = \Theta(n^{1/k})$ therefore implies that an algorithm estimating $F_k$ up to a constant factor can solve promise set disjointness, and therefore $ts = \Omega(n/t)$, that is, $s = \Omega(n/t^2) = \Omega(n^{1-2/k})$. For $k > 2$, this is known to be best possible up to a constant factor [BKS+14].

Notice that nothing substantial would change in this reduction if one were to change the second case in the promise to instead say: (2) there is a unique element occurring in at least half of the sets, and the sets are otherwise disjoint. Indeed, in the above reduction, in one case we have $F_k \le n$, while in the second case we have $F_k \ge (t/2)^k$. This recovers the same lower bound, up to a constant factor. We call this new problem "mostly" set disjointness (MostlyDISJ).
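As a sanity check on this reduction, the following toy snippet (our own illustration, with an exact frequency-moment computation standing in for a space-efficient streaming algorithm, and hypothetical small parameters) exhibits the $F_k$ gap between the two cases of the promise:

```python
from collections import Counter

def f_k(stream, k):
    """Exact k-th frequency moment F_k = sum_i f_i^k."""
    return sum(c ** k for c in Counter(stream).values())

def concatenated_stream(sets):
    """The reduction: each player feeds its set into the stream, passing
    the algorithm's state (here, the stream itself) to the next player."""
    return [x for s in sets for x in s]

t, k = 4, 3
no_instance  = [{1, 2}, {3, 4}, {5, 6}, {7, 8}]   # pairwise disjoint
yes_instance = [{0, 1}, {0, 3}, {0, 5}, {0, 7}]   # item 0 in all t sets

# NO case: every frequency is 1, so F_k equals the number of items.
assert f_k(concatenated_stream(no_instance), k) == 8
# YES case: the shared item alone contributes t^k = 64.
assert f_k(concatenated_stream(yes_instance), k) >= t ** k
```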
While it is seemingly inconsequential to consider MostlyDISJ instead of promise set disjointness, there are some peculiarities about this problem that one cannot help but wonder about. In the promise set disjointness problem, there is a deterministic protocol solving the problem with $O(\frac{n}{t} \log n)$ bits of communication: we walk through the players one at a time, and each indicates whether its set has size at most $n/t + 1$. Eventually we must reach such a player, and when we do, that player posts its set to the blackboard. We then ask one other player to confirm an intersection. Notice that there must always exist a player with a set of size at most $n/t + 1$, by the pigeonhole principle. On the other hand, for the MostlyDISJ problem, it does not seem so easy to achieve a deterministic protocol with $o(n)$ bits of communication. Indeed, in the worst case we could have a constant fraction of the players post their entire sets to the blackboard and still be unsure whether we are in Case (1) or Case (2).
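A minimal sketch of this deterministic protocol, simulated centrally (the function and instance names are ours; a real protocol would exchange these messages over the blackboard):

```python
def promise_disjointness(sets, n):
    """Deterministic protocol for *promise* set disjointness: find a player
    whose set has size at most n/t + 1 (one must exist by pigeonhole, since
    every element occurs once except possibly one occurring t times), post
    that set, and let one other player check for an intersection."""
    t = len(sets)
    small = next(i for i, s in enumerate(sets) if len(s) <= n / t + 1)
    posted = sets[small]                  # roughly (n/t) log n bits posted
    other = (small + 1) % t               # any other player can check
    return bool(posted & sets[other])     # True = YES (intersecting)

assert promise_disjointness([{1, 2}, {3, 4}, {5, 6}, {7, 8}], 8) is False
assert promise_disjointness([{0, 1}, {0, 3}, {0, 5}, {0, 7}], 8) is True
```

In the YES case the common element lies in every set, so the posted small set intersects any other player's set; in the NO case the sets are disjoint and the check fails.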
More generally, is there a gap in the dependence on the error probability $\delta$ between algorithms for promise set disjointness and for MostlyDISJ? Even if one's main interest is in constant error probability protocols, is there anything that can be learned from this new problem?
1.1 Our Results
We define MostlyDISJ more generally so that in Case (2) there is an item occurring in $\epsilon t$ of the sets, though it is still convenient to think of $\epsilon = 1/2$. Our main theorem is that MostlyDISJ requires $\Omega(n)$ communication to solve deterministically, or even with failure probability $2^{-\Theta(t)}$.
Theorem 1.1.
MostlyDISJ with $n$ elements, $t$ players, and $\epsilon$ an absolute constant requires $\Omega\left(\min\left(n, \frac{n}{t} \log \frac{1}{\delta}\right)\right)$ bits of communication for failure probability $\delta$.
This result does not have any restriction on the order of communication, and it is in the "blackboard model," where each message is visible to all other players. We note that as $\epsilon \to 0$, our lower bound degrades, but for any absolute constant $\epsilon > 0$ we achieve the stated lower bound. We did not explicitly compute our lower bound as a function of $\epsilon$.
Notice that for constant $\delta$, Theorem 1.1 recovers the $\Omega(n/t)$ total communication bound for promise set disjointness, which was the result of a long sequence of work. Our proof of Theorem 1.1 gives a much simpler proof of an $\Omega(n/t)$ total communication lower bound, avoiding Hellinger distance and Poincaré inequalities altogether, which were the main ingredients in obtaining the optimal lower bound for promise set disjointness in previous work. Moreover, as far as we are aware, an $\Omega(n/t)$ lower bound for the MostlyDISJ problem suffices to recover all of the lower bounds in applications that promise set disjointness has been applied to. Unlike our work, however, existing lower bounds for promise set disjointness do not give improved bounds for small error probability $\delta$. Indeed, it is impossible for them to do so, because of the deterministic protocol described above. We next use this bound in terms of $\delta$ to obtain the first lower bounds for deterministic streaming algorithms and randomized $\delta$-error algorithms for a large number of problems.
We note that other work on deterministic communication lower bounds for streaming, e.g., the work of Chakrabarti and Kale [CK16], does not apply here. They study multiparty equality problems, and it is not clear how to use their fooling set arguments to prove a lower bound for MostlyDISJ. One of the challenges in designing a fooling set is the promise, namely, that a single item occurs on a constant fraction of the players while all remaining items occur on at most one player. This promise is crucial for the applications of MostlyDISJ.
We now formally introduce notation for the data stream model. In the streaming model, an integer vector $f \in \mathbb{Z}^n$ is initialized to $0^n$ and undergoes a sequence of $m$ updates. The streaming algorithm is typically allowed one (or a few) passes over the stream, and the goal is to use a small amount of memory. We cannot afford to store the entire stream, since $n$ and $m$ are typically very large. In this paper, we mostly restrict our focus to the insertion-only model, where the updates to the vector are of the form $f \leftarrow f + e_i$, where $e_i$ is a standard basis vector. There are also the turnstile data stream models, in which $f \leftarrow f + \Delta \cdot e_i$ where $\Delta \in \{-1, +1\}$. In the strict turnstile model it is promised that $f \ge 0$ coordinate-wise at all times in the stream, whereas in the general turnstile model there are no restrictions on $f$. Therefore, an algorithm in the general turnstile model also works in the strict turnstile and insertion-only models.
Finding Heavy Hitters.
Finding the heavy hitters, or frequent items, is one of the most fundamental problems in data streams. These are useful in IP routers [EV03], in association rules and frequent itemsets [AS94, SON95, TOI96, HID99, HPY00] and databases [FSG+98, BR99, HPD+01]. Finding the heavy hitters is also frequently used as a subroutine in data stream algorithms for other problems, such as moment estimation [IW05], entropy estimation [CCM10, HNO08], sampling [MW10], finding duplicates [GR09], and so on. For surveys on algorithms for heavy hitters, see, e.g., [CH08, WOO16].
In the $\ell_p$ heavy hitters problem, for parameters $0 < \epsilon < \phi$, the goal is to find a set $S$ which contains all indices $i$ for which $f_i \ge \phi \|f\|_p$, and contains no indices $i$ for which $f_i \le (\phi - \epsilon) \|f\|_p$.
The first heavy hitters algorithms were for $\ell_1$, given by Misra and Gries [MG82], who achieved $O(1/\epsilon)$ words of memory, where a word consists of $O(\log n)$ bits. Interestingly, their algorithm is deterministic, i.e., the failure probability is $\delta = 0$. This algorithm was rediscovered by Demaine, López-Ortiz, and Munro [DLM02], and again by Karp, Shenker, and Papadimitriou [KSP03]. Other than these algorithms, which are deterministic, there are a number of randomized algorithms, such as the CountMin sketch [CM05], sticky sampling [MM02], lossy counting [MM02], space-saving [MAE05], sample and hold [EV03], multistage Bloom filters [CFM09], and sketch-guided sampling [KX06]. One can also achieve stronger residual error guarantees [BIC+10].
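For concreteness, here is a compact rendering of the Misra-Gries algorithm (a sketch of the classic method, not the original papers' exact pseudocode):

```python
def misra_gries(stream, k):
    """Keep k-1 counters; any item with frequency > len(stream)/k is
    guaranteed to survive, and counters undercount by at most len(stream)/k."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            for y in list(counters):          # decrement-all step
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters

summary = misra_gries([7] * 10 + [0, 1, 2, 3, 4], k=3)
assert 7 in summary            # frequency 10 > 15/3, so 7 must be present
assert summary[7] >= 10 - 15 // 3
```

Each decrement-all step removes $k$ occurrences from consideration, which is why the undercount is bounded by $m/k$; this is the source of the deterministic guarantee mentioned above.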
An often much stronger notion than an $\ell_1$ heavy hitter is an $\ell_2$ heavy hitter. Consider the $n$-dimensional vector $f = (\sqrt{n}, 1, 1, \ldots, 1)$. The first coordinate is an $\ell_2$ heavy hitter with parameter $\phi = \Theta(1)$, but it is only an $\ell_1$ heavy hitter with parameter $\phi = \Theta(1/\sqrt{n})$. Thus, the algorithms above would require at least $\Omega(\sqrt{n})$ words of memory to find this heavy hitter. In [CCF04] this problem was solved by the CountSketch algorithm, which provides a solution to the $\ell_2$ heavy hitters problem, and more generally to the $\ell_p$ heavy hitters problem for any $p \in (0, 2]$ (for $p < 1$ the quantity $\|f\|_p$ is not a norm, but it is still a well-defined quantity), in one pass and in the general turnstile model using $O(\epsilon^{-2} \log n)$ words of memory. For insertion-only streams, this was recently improved [BCI+16c, BCI+16b] to $O(\epsilon^{-2} \log(1/\epsilon))$ words of memory for constant $\delta$, and slightly more in general. See also work [LNN+19] on reducing the decoding time for finding the heavy hitters from the algorithm's memory contents, without sacrificing additional memory.
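A small self-contained CountSketch implementation may help make the $\ell_2$ guarantee concrete (the hash construction and constants below are our own illustrative choices):

```python
import random

class CountSketch:
    """depth x width table; each row hashes an item to a bucket with a
    random sign, and estimates are medians of signed bucket counts."""
    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        self.salts = [rng.randrange(1 << 30) for _ in range(depth)]

    def _loc(self, r, x):
        h = hash((self.salts[r], x))
        return h % self.width, 1 if (h >> 17) & 1 else -1

    def update(self, x, delta=1):
        for r in range(self.depth):
            b, sign = self._loc(r, x)
            self.table[r][b] += sign * delta

    def estimate(self, x):
        vals = sorted(sign * self.table[r][b]
                      for r in range(self.depth)
                      for b, sign in [self._loc(r, x)])
        return vals[len(vals) // 2]

cs = CountSketch(width=32, depth=5)
for _ in range(1000):
    cs.update(0)               # one heavy item
for i in range(1, 51):
    cs.update(i)               # 50 light items
assert abs(cs.estimate(0) - 1000) <= 60
```

Each row's error for item 0 is the signed contribution of colliding light items (here at most 50 in absolute value, and typically far less), and taking the median over rows suppresses outlier rows.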
There is also work establishing lower bounds for heavy hitters. The works of [DIP+11, JST11] establish word lower bounds for any value of $p \in (0, 2]$ and constant $\delta$, for any algorithm in the strict turnstile model, showing that the above algorithms are optimal for constant $\delta$. Also, for $p > 2$, it is known that solving the $\ell_p$ heavy hitters problem even with constant $\epsilon$ and $\delta$ requires $\Omega(n^{1-2/p})$ words of memory [BJK+04, GRO09, JAY09]; thus $p = 2$ is often considered the gold standard for space-efficient streaming algorithms, since it is the largest value of $p$ for which there is a poly$(\log n)$ space algorithm. For deterministic algorithms computing linear sketches, the work of [GAN09] shows the sketch requires $\Omega(n)$ dimensions (a similar bound was shown by [CDD09]). This also implies a lower bound for general turnstile algorithms for streams with several important restrictions; see also [LNW14a, KP20]. There is also work on the related compressed sensing problem, which studies small $\delta$ [GNP+13].
Despite the work above, for all we knew it could be entirely possible that, in the insertion-only model, an $\ell_2$ heavy hitters algorithm could achieve $O(1)$ words of memory and solve the problem deterministically, i.e., with $\delta = 0$. In fact, it is well-known that the above lower bound for heavy hitters for linear sketches does not hold in the insertion-only model. Indeed, by running a deterministic algorithm for $\ell_1$ heavy hitters, we have that if $f_i \ge \epsilon \|f\|_2$, then $f_i \ge \frac{\epsilon}{\sqrt{n}} \|f\|_1$, and consequently one can find all $\ell_2$ heavy hitters using $O(\sqrt{n}/\epsilon)$ words of memory. Thus, for constant $\epsilon$, there is a deterministic $O(\sqrt{n})$ words of memory upper bound, but only a trivial $\Omega(1)$ word lower bound. Surprisingly, this $\sqrt{n}$ factor gap was left wide open, and the main question we ask about heavy hitters is:
Can one deterministically solve heavy hitters in insertiononly streams in constant memory?
One approach to solve MostlyDISJ would be for each player to insert their elements into a stream and apply an $\ell_2$ heavy hitters algorithm. For example, if $t = \Theta(\sqrt{n})$, there will be an $\ell_2$ heavy hitter if and only if the instance is a YES instance. For a space-$s$ streaming algorithm, this uses $O(ts)$ communication to pass the data structure from player to player; hence any communication lower bound for MostlyDISJ, divided by $t$, is a space lower bound for heavy hitters. In general:
Theorem 4.3.
Given $\epsilon$ and $\delta$, any $\delta$-error one-pass insertion-only streaming algorithm for $\ell_2$ heavy hitters requires $\Omega\left(\min\left(\frac{\sqrt{n}}{\epsilon}, \frac{1}{\epsilon^2} \log \frac{1}{\delta}\right)\right)$ bits of space.
Most notably, setting $\delta = 0$ and $\epsilon = \Theta(1)$, this gives an $\Omega(\sqrt{n})$ bound for deterministic $\ell_2$ heavy hitters. The FrequentElements algorithm [MG82] matches this up to a factor of $\log n$ (i.e., it uses this many words, not bits). For $\delta > 0$, the other term ($\frac{1}{\epsilon^2} \log \frac{1}{\delta}$) is also achievable up to the bit/word distinction, this time by CountSketch. For small $\epsilon$, we note that it already takes $\Omega(\epsilon^{-2})$ bits to encode the output, which can contain $\Theta(\epsilon^{-2})$ indices. As a result, we show that the existing algorithms are within a $\log n$ factor of optimal.
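The reduction from MostlyDISJ to heavy hitters sketched above can be checked on a toy instance (exact counting stands in for the streaming algorithm; the parameters are illustrative):

```python
from collections import Counter

def l2_heavy_hitters(stream, phi):
    """Indices i with f_i > phi * ||f||_2, computed exactly."""
    f = Counter(stream)
    l2 = sum(c * c for c in f.values()) ** 0.5
    return {x for x, c in f.items() if c > phi * l2}

def reduce_to_stream(sets):
    return [x for s in sets for x in s]

# t = 4 players over a small universe: in the YES case the shared item has
# frequency t, a constant fraction of the l2 norm of the whole stream.
yes_hh = l2_heavy_hitters(reduce_to_stream([{0, 1}, {0, 3}, {0, 5}, {0, 7}]), 0.5)
no_hh  = l2_heavy_hitters(reduce_to_stream([{1, 2}, {3, 4}, {5, 6}, {7, 8}]), 0.5)
assert yes_hh == {0} and no_hh == set()
```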
One common motivation for heavy hitters is that many distributions are power-law or Zipfian distributions. For such distributions, the $i$-th most frequent element has frequency approximately proportional to $i^{-\alpha}$ for some constant $\alpha > 0$ [CSN09]. Such distributions have significant $\ell_2$ heavy hitters. Despite our lower bound for general heavy hitters, one might hope for more efficient deterministic, or very high probability, insertion-only algorithms in this special case. We rule this out as well, getting a polynomial lower bound for finding the heavy hitters of these distributions (see Theorem 4.6). This again matches the upper bounds from FrequentElements or CountSketch up to a logarithmic factor.
To extend our lower bound to power-law distributions, we embed our hard instance as the single largest and smallest entries of a power-law distribution; we then insert the rest of the power-law distribution deterministically, so the overall distribution is power-law distributed. Solving heavy hitters will identify whether this single largest element exists or not, solving the communication problem.
Frequency Moments.
We next turn to the problem of estimating the frequency moments $F_k$, which in our reduction from the MostlyDISJ problem corresponds to estimating $F_k$ up to a constant factor. Our hard instance for MostlyDISJ immediately gives us the following theorem:
Theorem 1.2.
For any constant $k > 2$, any $\delta$-error one-pass insertion-only streaming algorithm for constant-factor $F_k$ estimation must have space complexity of $\Omega\left(\min\left(n^{1-2/k} \log \frac{1}{\delta},\; n^{1-1/k}\right)\right)$ bits.
The proof of Theorem 1.2 follows immediately by setting the number of players in MostlyDISJ to be $t = \Theta(n^{1/k})$, and performing the reduction to $F_k$ estimation described before Section 1.1. This improves the previous lower bound, which follows from [BJK+04, JAY09], as well as a simple reduction from the Equality function [AMS99]; see also [CK16]. It matches an upper bound of [BKS+14] for constant $k$, by repeating their algorithm independently $O(\log \frac{1}{\delta})$ times. Our lower bound instance shows that to approximate the $\ell_k$ norm of an integer vector, with $O(\log n)$-bit coordinates in $n$ dimensions, up to a small additive error deterministically, one needs $\Omega(n^{1-1/k})$ bits of memory. This follows from our hard instance. Approximating the $\ell_k$ norm is an important problem in streaming, and its complexity was asked about in Question 3 of [COR06].
Low Rank Approximation.
Our heavy hitters lower bound also has applications to deterministic low rank approximation in a stream, a topic of recent interest [LIB13, GP14, WOO14, GLP+16a, GLP16b, HUA18]. Here we see rows of an $n \times d$ matrix $A$ one at a time. At the end of the stream we should output a projection $P$ onto a rank-$k$ space for which $\|A - AP\|_F^2 \le (1 + \epsilon) \|A - A_k\|_F^2$, where $A_k$ is the best rank-$k$ approximation to $A$. A natural question is whether the deterministic FrequentDirections algorithm of [GLP+16a], which uses $O(dk/\epsilon)$ words of memory, can be improved when the rows of $A$ are sparse. The sparse setting was shown to have faster running times in [GLP16b, HUA18], and more efficient randomized communication protocols in [BWZ16]. Via a reduction from our problem, we show that a polynomial dependence on $d$ is necessary.
Theorem 5.1.
Any one-pass deterministic streaming algorithm outputting a rank-$k$ projection matrix $P$ providing a $(1 + \epsilon)$-approximate rank-$k$ low rank approximation requires $\Omega(\sqrt{d})$ bits of memory, even for $k = 1$, constant $\epsilon$, and when each row of $A$ has only a single nonzero entry.
Algorithms and Lower Bounds in Other Streaming Models.
We saw above that deterministic insertion-only $\ell_2$ heavy hitters requires $\Omega(\sqrt{n})$ space for constant $\epsilon$. We now consider turnstile streaming and linear sketching.
The work of [GAN09, CDD09] shows that $\Omega(n)$ space is needed for general deterministic linear sketching, but the corresponding hard instances have negative entries. We extend this in two ways: when negative entries are allowed, an $\Omega(n)$ lower bound is easy even in turnstile streaming (for heavy hitters, but not the closely related sparse recovery guarantee; see Remark 6.2). If negative entries are not allowed, we still get an $\Omega(n)$ bound on the number of linear measurements for deterministic linear sketching (see Theorem 4.2).
A natural question is whether we can solve heavy hitters deterministically in the strict turnstile model in $o(n)$ space. In some sense the answer is no, due to the near equivalence between turnstile streaming and linear sketching [GAN09, LNW14b, AHL+16], but this equivalence has significant limitations. Recent work has shown that with relatively mild restrictions on the stream, such as a bound on the length $m$, significant improvements over linear sketching are possible [JW18, KP20]. Can we get that here? We show that this is indeed possible: streams with $m$ updates can be solved in space depending polynomially on $m$ alone. While this does not reach the lower bound from insertion-only streams (Theorem 4.4), it is significantly better than the $\Omega(n)$ bound for linear sketches. In general, we show:
Theorem 6.1.
There is a deterministic heavy hitters algorithm for strict turnstile streams with $m$ updates using $O(\sqrt{m/\epsilon})$ words of space.
Our algorithm for short strict turnstile streams is a combination of FrequentElements and exact sparse recovery. With space $s$, FrequentElements (modified to handle negative updates) gives estimation error $O(m/s)$, which is good unless $\|f\|_1 = O(m/(\epsilon s))$. But if it is not good, then the final vector has at most $O(m/(\epsilon s))$ nonzero entries. Hence in that case sparse recovery will recover the vector (and hence the heavy hitters). Running both algorithms and combining the results takes $O(s + m/(\epsilon s))$ space, which is optimized at $s = \sqrt{m/\epsilon}$.
1.2 Our Techniques
Our key lemma is that solving MostlyDISJ on $t$ players and $n$ items with failure probability $2^{-\Theta(t)}$ has conditional information complexity $\Omega(n)$ for any constant $\epsilon$. It is well-known that the conditional information complexity of a problem lower bounds its communication complexity (see, e.g., [BJK+04]).
This can then be extended to general $\delta$ using repetition: namely, we can amplify the success probability of the protocol to $1 - 2^{-\Theta(t)}$ by $O(t)$ independent repetitions, apply our lower bound on the new protocol, and then conclude a lower bound on the original protocol. Indeed, this is how we obtain our total communication lower bound of $\Omega(n/t)$ for constant $\delta$, providing a much simpler proof than that of the $\Omega(n/t)$ total communication lower bound for promise set disjointness in prior work.
Our bound is tight up to a logarithmic factor. MostlyDISJ can be solved deterministically with $O(n)$ communication (for each bit, the first player with that bit publishes it), and with failure probability $\delta$ using $O(\frac{n}{t} \log \frac{1}{\delta})$ communication (each player publishes each bit only with probability $O(\frac{1}{t} \log \frac{1}{\delta})$). Setting $\delta = 2^{-t}$, any failure probability is possible with $O(n)$ communication.
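A simulation of the sampling-based upper bound may be helpful (the duplicate-detection rule and constants are our own illustration): each player publishes each of its elements independently with a small probability, and a published duplicate can only come from the common item.

```python
import random

def sampled_mostly_disj(sets, q, rng):
    """Each player publishes each element of its set with probability q.
    Output YES iff two different players publish the same element, which
    is impossible when the sets are pairwise disjoint."""
    seen = set()
    for s in sets:
        for x in s:
            if rng.random() < q:
                if x in seen:
                    return True
                seen.add(x)
    return False

rng = random.Random(1)
t = 40
no_case  = [{i} for i in range(t)]             # pairwise disjoint
yes_case = [{0, i + 1} for i in range(t)]      # item 0 held by all t players

assert not any(sampled_mostly_disj(no_case, 0.3, rng) for _ in range(200))
hits = sum(sampled_mostly_disj(yes_case, 0.3, rng) for _ in range(200))
assert hits >= 190    # fails only if fewer than 2 holders publish item 0
```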
We lower bound MostlyDISJ using conditional information complexity. Using the direct sum property of conditional information cost, analogous to previous work (see, e.g., [BJK+04]), it suffices to get an $\Omega(1)$ conditional information cost bound for the one-bit problem: we have $t$ players, each of whom receives one bit, and the players must distinguish (with failure probability $\delta$) between at most one player having a $1$, and at least $\epsilon t$ players having $1$s. In particular, it suffices to show for correct protocols that
$$\frac{1}{t} \sum_{i=1}^{t} D_{\mathrm{JS}}(P_0, P_i) = \Omega(1), \qquad (1)$$
where $P_0$ is the distribution of protocol transcripts if the players all receive $0$, and $P_i$ is the distribution if player $i$ receives a $1$ (and the others receive $0$). The main challenge is therefore in bounding this expression.
Consider any protocol that does not satisfy (1). We show that, when $\mathrm{TV}(P_0, P_i)$ is small, player $i$ can be implemented with an equivalent protocol in which the player usually does not even observe its input bit. That is, if every other player receives a $0$, player $i$ will only observe its bit with probability proportional to $\mathrm{TV}(P_0, P_i)$. This means that most players only have a small probability of observing their bit. The probability that any two players observe their bits may be correlated; still, we show that this implies the existence of a large set $S$ of players such that the probability (if every player receives a zero) that no player in $S$ observes their bit throughout the protocol is non-negligible. But then the transcript distribution changes very little when the players in $S$ all receive $1$s, so the protocol cannot distinguish these cases with the desired probability. We now give the full proof.
2 Preliminaries
We use the following measures of distance between distributions in our proofs.
Definition 2.1.
Let $P$ and $Q$ be probability distributions over the same countable universe $\mathcal{U}$. The total variation distance between $P$ and $Q$ is defined as $\mathrm{TV}(P, Q) = \frac{1}{2} \sum_{x \in \mathcal{U}} |P(x) - Q(x)|$.

In our proofs we also use the Jensen-Shannon divergence and Kullback-Leibler divergence. We define these notions of divergence here:
Definition 2.2.
Let $P$ and $Q$ be probability distributions over the same discrete universe $\mathcal{U}$. The Kullback-Leibler divergence, or KL-divergence, from $P$ to $Q$ is defined as $D_{\mathrm{KL}}(P \| Q) = \sum_{x \in \mathcal{U}} P(x) \log \frac{P(x)}{Q(x)}$. This is an asymmetric notion of divergence. The Jensen-Shannon divergence between two distributions $P$ and $Q$ is the symmetrized version of the KL divergence, defined as $D_{\mathrm{JS}}(P, Q) = \frac{1}{2} D_{\mathrm{KL}}\left(P \,\Big\|\, \frac{P + Q}{2}\right) + \frac{1}{2} D_{\mathrm{KL}}\left(Q \,\Big\|\, \frac{P + Q}{2}\right)$.
From Pinsker's inequality, for any two distributions $P$ and $Q$, $\mathrm{TV}(P, Q) \le \sqrt{\frac{1}{2} D_{\mathrm{KL}}(P \| Q)}$ (with the KL divergence measured in nats).
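For discrete distributions these quantities are straightforward to compute; the following snippet (our own, with the KL divergence measured in bits, so Pinsker's inequality takes the form $\mathrm{TV}(P, Q)^2 \le \frac{\ln 2}{2} D_{\mathrm{KL}}(P \| Q)$) checks the stated relations numerically:

```python
import math

def tv(p, q):
    """Total variation distance: half the l1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    """KL divergence D(p || q) in bits; assumes supp(p) is inside supp(q)."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]
assert abs(js(p, q) - js(q, p)) < 1e-12                 # JS is symmetric
assert tv(p, q) ** 2 <= (math.log(2) / 2) * kl(p, q)    # Pinsker, in bits
```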
In the multiparty communication model we consider $t$-ary functions $f : \mathcal{X}_1 \times \cdots \times \mathcal{X}_t \to \{0, 1\}$. There are $t$ parties (or players) who receive inputs $(X_1, \ldots, X_t)$ which are jointly distributed according to some distribution $\mu$. We consider protocols in the blackboard model, where players speak in any order and each player broadcasts their message to all other players. So the message of player $i$ is a function of the messages they receive, their input, and their randomness. The final player's message is the output of the protocol.

The communication cost of a multiparty protocol is the sum of the lengths of the individual messages. A protocol is a $\delta$-error protocol for the function $f$ if, for every input, the output of the protocol equals the value of $f$ with probability at least $1 - \delta$. The randomized communication complexity of $f$, denoted $R_\delta(f)$, is the cost of the cheapest randomized protocol that computes $f$ correctly on every input with error at most $\delta$ over the randomness of the protocol.
The distributional communication complexity of the function $f$ for error parameter $\delta$ under input distribution $\mu$ is denoted $D_{\mu, \delta}(f)$. This is the communication cost of the cheapest deterministic protocol which computes $f$ with error at most $\delta$ under $\mu$. By Yao's minimax theorem, $R_\delta(f) \ge \max_\mu D_{\mu, \delta}(f)$, and hence it suffices to prove a lower bound for a hard distribution $\mu$. In our proofs, we bound the conditional information complexity of a function in order to prove lower bounds on $R_\delta(f)$. We define this notion below.
Definition 2.3.
Let $\Pi$ be a randomized protocol whose inputs belong to $\mathcal{X}_1 \times \cdots \times \mathcal{X}_t$. Suppose $((X_1, \ldots, X_t), D) \sim \eta$, where $\eta$ is a distribution over $(\mathcal{X}_1 \times \cdots \times \mathcal{X}_t) \times \mathcal{D}$ for some set $\mathcal{D}$. The conditional information cost of $\Pi$ with respect to $\eta$ is defined as $I(X_1, \ldots, X_t ; \Pi \mid D)$.
Definition 2.4.
The $\delta$-error conditional information complexity of $f$ with respect to $\eta$, denoted $\mathrm{CIC}_{\eta, \delta}(f)$, is defined as the minimum conditional information cost of a $\delta$-error protocol for $f$ with respect to $\eta$.
In [BJK+04] it was shown that the randomized communication complexity of a function is at least the conditional information complexity of the function with respect to any input distribution .
Proposition 2.5 (Corollary 4.7 of [Bjk+04]).
Let $\delta > 0$, and let $\eta$ be a distribution over $(\mathcal{X}_1 \times \cdots \times \mathcal{X}_t) \times \mathcal{D}$ for some set $\mathcal{D}$. Then $R_\delta(f) \ge \mathrm{CIC}_{\eta, \delta}(f)$.
Direct Sum.
Per [BJK+04], conditional information complexity obeys a direct sum theorem under various conditions. The Direct Sum Theorem of [BJK+04] allows us to reduce a $t$-player conditional information complexity problem with an $n$-dimensional input to each player to a $t$-player conditional information complexity problem with a one-dimensional input to each player. This theorem applies when the function is "decomposable" and the input distribution is "collapsing". We define both these notions here.
Definition 2.6.
Suppose $f : (\mathcal{X}^n)^t \to \{0, 1\}$ and $h : \mathcal{X}^t \to \{0, 1\}$. The function $f$ is decomposable with primitive $h$ if it can be written as $f(x_1, \ldots, x_t) = g\big(h(x^{(1)}), \ldots, h(x^{(n)})\big)$ for some function $g : \{0, 1\}^n \to \{0, 1\}$, where $x^{(j)} = (x_1^{(j)}, \ldots, x_t^{(j)})$ denotes the $j$-th coordinate of each player's input.
Definition 2.7.
Suppose $f : (\mathcal{X}^n)^t \to \{0, 1\}$ and $h : \mathcal{X}^t \to \{0, 1\}$. A distribution $\mu$ over $(\mathcal{X}^n)^t$ is a collapsing distribution for $f$ with respect to $h$ if, for all $(x_1, \ldots, x_t)$ in the support of $\mu$, for all $j \in [n]$, and for all $(y_1, \ldots, y_t) \in \mathcal{X}^t$, replacing the $j$-th coordinates of $x_1, \ldots, x_t$ by $y_1, \ldots, y_t$ yields an input on which $f$ evaluates to $h(y_1, \ldots, y_t)$.
We state the Direct Sum Theorem for conditional information complexity below. The proof of this theorem in [BJK+04] applies to the blackboard model of multiparty communication. We state it in its most general form here and then show that it may be applied to the hard distribution which we choose in Section 3.
Theorem 2.8 (Multiparty version of Theorem 5.6 of [Bjk+04]).
Let $\delta > 0$, and let $\eta = \zeta^n$ for a distribution $\zeta$ as below. Suppose that the following conditions hold:

$f$ is a decomposable function with primitive $h$,

$\zeta$ is a distribution over $\mathcal{X}^t \times \mathcal{D}$, such that for any $d \in \mathcal{D}$ the conditional distribution of the inputs given $D = d$ is a product distribution,

$\zeta$ is supported on $h^{-1}(0) \times \mathcal{D}$, and

the marginal probability distribution of $\zeta$ over $\mathcal{X}^t$ is a collapsing distribution for $f$ with respect to $h$.
Then $\mathrm{CIC}_{\eta, \delta}(f) \ge n \cdot \mathrm{CIC}_{\zeta, \delta}(h)$.
3 Communication Lower Bound for Mostly Set Disjointness
3.1 The Hard Distribution
Definition 3.1.
Denote by $\mathrm{MostlyDISJ}_{n, t, \epsilon}$ the multiparty Mostly Set Disjointness problem, in which each player $i \in [t]$ receives an $n$-dimensional input vector $X_i \in \{0, 1\}^n$, where the input to the protocol falls into either of the following cases:

NO: For all $j \in [n]$, $\sum_{i=1}^{t} X_{i, j} \le 1$.

YES: There exists a unique $j \in [n]$ such that $\sum_{i=1}^{t} X_{i, j} \ge \epsilon t$, and for all other $j' \ne j$, $\sum_{i=1}^{t} X_{i, j'} \le 1$.
The final player must output $1$ if the input is in the YES case and $0$ in the NO case.
Let $\mathcal{V} \subseteq \{0, 1\}^t$ be the set of valid inputs along one index $j \in [n]$, i.e., the set of elements of $\{0, 1\}^t$ with Hamming weight at most $1$ or at least $\epsilon t$. Let $\mathcal{V}_f$ denote the set of valid inputs to the function.
We define $f$ as the OR of $n$ copies of a primitive $h$ applied to the columns of the input: $f(X) = \bigvee_{j=1}^{n} h(X_{1, j}, \ldots, X_{t, j})$, where $h : \{0, 1\}^t \to \{0, 1\}$ outputs $1$ exactly when its input has Hamming weight at least $\epsilon t$. This means $f$ is OR-decomposable into $n$ copies of $h$, and we may hope to apply a direct sum theorem with an appropriate distribution over the inputs.
In order to prove a lower bound on the conditional information complexity, we need to define a "hard" distribution over the inputs to $f$. We define the distribution $\eta$ over pairs $(X, D)$, where $X \in \{0, 1\}^{t \times n}$ and $D \in [t]^n$, as follows:

For each $j \in [n]$, pick $D_j \in [t]$ uniformly at random, sample $X_{D_j, j}$ uniformly from $\{0, 1\}$, and for all $i \ne D_j$ set $X_{i, j} = 0$.

Pick $Z \in \{0, 1\}$ uniformly at random, and

if $Z = 1$, pick a set $T \subseteq [t]$ with $|T| = \epsilon t$ uniformly at random and a uniformly random coordinate $j \in [n]$; for all $i \in T$ set $X_{i, j} = 1$, and for all $i \notin T$, set $X_{i, j} = 0$.
Let $\zeta$ denote the distribution of each coordinate conditioned on $Z = 0$. For any $j$, when $D_j = i$, the conditional distribution over $(X_{1, j}, \ldots, X_{t, j})$ is the uniform distribution over $\{0^t, e_i\}$, and hence a product distribution. Clearly, $\zeta$ is supported only on inputs in the NO case.

This definition of $f$ and the hard distribution allow us to apply the Direct Sum theorem (Theorem 2.8) of [BJK+04]. Note that: (i) $f$ is OR-decomposable by $h$, (ii) $\zeta$ is a distribution over $\{0, 1\}^t \times [t]$ such that, conditioned on $D_j$, the marginal distribution over $\{0, 1\}^t$ is a product distribution, (iii) $\zeta$ is supported on $h^{-1}(0) \times [t]$, and (iv) since $f$ is OR-decomposable and $\zeta$ has support only on inputs in the NO case, the marginal of $\zeta$ is a collapsing distribution for $f$ with respect to $h$. Hence:
$$\mathrm{CIC}_{\eta, \delta}(f) \ge n \cdot \mathrm{CIC}_{\zeta, \delta}(h). \qquad (2)$$
3.2 Information Cost for a Single Bit
A key lemma for our argument is that the players can be implemented so that they only "observe" their input bits with small probability. The model here is that each player's input starts out hidden, but the player can at any time choose to observe it. Before observing its input, however, all of a player's decisions (including messages sent and the choice of whether to observe) depend on the transcript and randomness, but not on the player's input.
In this section we use $\Pi$ to denote the protocol under consideration, and we abuse notation slightly by using $\Pi(x)$ to denote the distribution of the transcript of the protocol on input $x$.
Definition 3.2.
Any (possibly multiround) communication protocol involving $t$ players, where each player receives one input bit, is defined to be a "clean" protocol with respect to player $i$ if, in each round,

if player $i$ has previously not "observed" his input bit, he "observes" his input bit with some probability that is a function only of the previous messages in the protocol,

if player $i$ has not observed his input bit in this round or any previous round, then his message distribution depends only on the previous messages in the protocol but not on his input bit, and

if player $i$ has observed his input bit in this round or any previous round, then, for a fixed value of the previous messages in the protocol, his message distributions on input $0$ and on input $1$ have disjoint supports.
We start off by proving a lemma about decomposing any two arbitrary distributions into one "common" distribution and two distributions with disjoint supports. This lemma will enable us to show that any communication protocol can be simulated in a clean manner.
Lemma 3.3.
Let $p$ and $q$ be two distributions over the same universe. There exist three distributions $r$, $p'$, $q'$ and a parameter $\delta = \mathrm{TV}(p, q)$ such that $p = (1 - \delta) r + \delta p'$, $q = (1 - \delta) r + \delta q'$, and $p'$ has a disjoint support from $q'$.
Proof.
We begin with two special cases. If $p = q$, then setting $\delta = 0$ allows us to set $r = p$; here $p'$ and $q'$ may be any arbitrary distributions with disjoint supports. If $p$ and $q$ already have disjoint supports, we may set $\delta = 1$, $p' = p$, and $q' = q$, with $r$ arbitrary.

So it suffices to consider the case where $p \ne q$ and their supports overlap. Let $m(x) = \min(p(x), q(x))$ and $\delta = 1 - \sum_x m(x)$, so that $r(x) = m(x)/(1 - \delta)$ is a distribution over the support of $m$. Then it suffices to define $p'(x) = (p(x) - m(x))/\delta$ and $q'(x) = (q(x) - m(x))/\delta$: at every $x$ at most one of $p(x) - m(x)$ and $q(x) - m(x)$ is nonzero, so $p'$ and $q'$ have disjoint supports, $\delta = \mathrm{TV}(p, q)$, and $p = (1 - \delta) r + \delta p'$ and $q = (1 - \delta) r + \delta q'$ by construction. ∎
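The decomposition in the proof is constructive; here is a sketch for finite distributions represented as probability vectors (variable names are ours):

```python
def decompose(p, q):
    """Write p = (1-d)*r + d*p1 and q = (1-d)*r + d*q1 with p1, q1 having
    disjoint supports, where d is the total variation distance."""
    common = [min(a, b) for a, b in zip(p, q)]
    d = 1.0 - sum(common)
    if d == 0:                    # p == q: the leftover parts carry no weight
        return d, p, p, q
    r  = [c / (1 - d) for c in common] if d < 1 else common
    p1 = [max(a - b, 0) / d for a, b in zip(p, q)]
    q1 = [max(b - a, 0) / d for a, b in zip(p, q)]
    return d, r, p1, q1

p, q = [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]
d, r, p1, q1 = decompose(p, q)
assert abs(d - 0.4) < 1e-12                        # d equals TV(p, q)
assert all(x * y == 0 for x, y in zip(p1, q1))     # disjoint supports
assert all(abs((1 - d) * ri + d * xi - pi) < 1e-12
           for ri, xi, pi in zip(r, p1, p))        # reconstructs p
```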
Lemma 3.4.
Consider any (possibly multiround) communication protocol where each player receives one input bit. Then for any player $i$, the protocol can be simulated in a manner that is "clean" with respect to that player.
Proof.
Let $b$ denote player $i$'s bit. We use "round $k$" to refer to the $k$-th time that player $i$ is asked to speak. Let $\tau_k$ be the transcript of the protocol just before player $i$ speaks in round $k$, and let $\tau_k'$ denote the transcript immediately after player $i$ speaks in round $k$. Let $m_b^{(k)}$ be the distribution of player $i$'s message the $k$-th time he is asked to speak, conditioned on the transcript so far being $\tau_k$ and on player $i$ having the bit $b$. We will describe an implementation of player $i$ that produces outputs with the correct distribution such that the implementation only looks at $b$ with relatively small probability.
In the first round, given $\tau_1$, player $i$ looks at $b$ with probability $\mathrm{TV}(m_0^{(1)}, m_1^{(1)})$. If he does not look at the bit, he outputs each message $M$ with probability proportional to $\min(m_0^{(1)}(M), m_1^{(1)}(M))$; if he sees the bit $b$, he outputs each message $M$ with probability proportional to $m_b^{(1)}(M) - \min(m_0^{(1)}(M), m_1^{(1)}(M))$. His output is then distributed according to $m_b^{(1)}$. Note also that, for any message $M$, it is not possible that the player can send $M$ both after reading a $0$ and after reading a $1$.
In subsequent rounds $k > 1$, given $\tau_k$, player $i$ needs to output a message with distribution $m_b^{(k)}$. Let $\alpha_0(\tau_k)$ denote the probability that the player has already observed his bit in a previous round, conditioned on $\tau_k$ and $b = 0$; let $\alpha_1(\tau_k)$ be analogous for $b = 1$. We will show by induction that $\alpha_0(\tau_k) \cdot \alpha_1(\tau_k) = 0$ for all $\tau_k$. That is, any given transcript may be compatible with having already observed a $0$ or with having already observed a $1$, but not both. As noted above, this is true after the first round.
Without loss of generality, suppose the transcript so far is incompatible with having observed a $1$. We apply Lemma 3.3 to the two conditional message distributions on bits $0$ and $1$, obtaining three distributions $r$, $p'$, $q'$ and a parameter $\delta_k$ such that the two message distributions equal $(1 - \delta_k) r + \delta_k p'$ and $(1 - \delta_k) r + \delta_k q'$, where $p'$ is disjoint from $q'$.
Player $i$ behaves as follows: if he has not observed his bit already, he does so with probability $\delta_k$. After this, if he still has not observed his bit, he outputs a message according to $r$; if he has observed his bit, he outputs according to $p'$ or $q'$ as appropriate.
The resulting message distribution is correct regardless of $b$, and the set of possible transcripts where a $0$ has been observed is disjoint from those possible where a $1$ has been observed. By induction, this holds for all rounds $k$. Thus, this is a simulation of the original protocol that is "clean" with respect to player $i$. ∎
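One round of the clean simulation can be sketched directly from the Lemma 3.3 decomposition; the following empirical check (with our own names, where m0 and m1 are the message distributions given bit 0 and bit 1) verifies both the output distribution and the observation probability:

```python
import random

def clean_message(bit, m0, m1, rng):
    """Emit a message distributed as m0 or m1 (per `bit`), but observe the
    bit only with probability TV(m0, m1); unobserved messages come from the
    common part, and the observed parts have disjoint supports."""
    common = [min(a, b) for a, b in zip(m0, m1)]
    delta = 1.0 - sum(common)
    if rng.random() >= delta:                 # bit not observed
        weights, observed = common, False     # proportional to common part
    else:                                     # bit observed
        src, other = (m0, m1) if bit == 0 else (m1, m0)
        weights = [max(a - b, 0) for a, b in zip(src, other)]
        observed = True
    return rng.choices(range(len(weights)), weights=weights)[0], observed

rng = random.Random(0)
m0, m1 = [0.7, 0.2, 0.1], [0.1, 0.2, 0.7]     # TV(m0, m1) = 0.6
runs = [clean_message(0, m0, m1, rng) for _ in range(20000)]
freq0 = sum(1 for msg, _ in runs if msg == 0) / len(runs)
obs_rate = sum(1 for _, obs in runs if obs) / len(runs)
assert abs(freq0 - 0.7) < 0.03                # output matches m0
assert abs(obs_rate - 0.6) < 0.03             # observe w.p. TV(m0, m1)
```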
Lemma 3.5.
Consider any (possibly multiround) communication protocol where each player receives one bit. Each player $i$ can be implemented such that, if every other player receives a $0$ input, player $i$ only observes his input with probability proportional to the total variation distance between the transcript distributions on the all-zeros input and on the input where only player $i$ holds a $1$.
Proof.
Using Lemma 3.4, we know that player $i$ can be implemented such that the protocol is clean with respect to that player.
We may now analyze the probability that player $i$ ever observes his bit, assuming that all other players receive the input zero. For every possible transcript $\tau$, let $\beta_0(\tau)$ denote the probability, conditioned on the transcript being $\tau$ and player $i$'s bit being $0$, that player $i$ observes his bit at any point during the protocol; let $\beta_1(\tau)$ be analogous for the bit being $1$. The choice of player $i$ to observe his input bit in a clean protocol is independent of the bit, and the protocol is independent of the bit if it is not observed. By the definition of a clean protocol, the last message player $i$ sends can be consistent with him observing a $0$ or a $1$, but not both; therefore $\beta_0(\tau) = 0$ or $\beta_1(\tau) = 0$ for every transcript $\tau$. Summing over transcripts, the probability that player $i$ observes his bit is bounded in terms of the total variation distance between the two transcript distributions, as desired. ∎
Lemma 3.5 will be used to show that each player has a decent chance of not reading their input. But to get a lower bound for the protocol, we need a large set of players that have a nontrivial chance of all ignoring their input at the same time. We show the existence of such a set, despite the players not being independent. For any $\delta$, define
(3) 
We have
Lemma 3.6.
Let $t$, $\delta$, and $k$ be as in (3). For a set of $0$-$1$ random variables $Z_1, \ldots, Z_t$ such that the expected Hamming weight $\sum_{i=1}^{t} \mathbb{E}[Z_i]$ is sufficiently small, there exists $S \subseteq [t]$ of size $k$ such that $\Pr\left[\sum_{i \in S} Z_i = 0\right] \ge 2\delta$.
Proof.
We wish to show that there exists a set $S$ of size $k$ such that, with nontrivial probability, $Z_i = 0$ for all $i \in S$. Observe that if $S$ were chosen uniformly at random among size-$k$ subsets of $[t]$,

where the first inequality considers the existence of such a set, the second inequality uses the bound on $\sum_i \mathbb{E}[Z_i]$ together with Markov's inequality, and $|Z|$ denotes the Hamming weight of $Z$, i.e., the number of nonzero entries of the vector $(Z_1, \ldots, Z_t)$. Therefore there exists a set $S$ of size $k$ such that $\Pr\left[\sum_{i \in S} Z_i = 0\right] \ge 2\delta$. ∎
We can now bound the one-bit communication cost of our problem.
Lemma 3.7.
Given $t$, $\delta$, and $k$ as in (3), for any $\delta$-error protocol for the one-bit problem $h$ we have that
Proof.
Let $\Pi$ be a protocol for $h$, and let $\Pi(x)$ denote the distribution of the transcript of the protocol on input $x$. We start by establishing a connection between conditional information cost and total variation distances. First observe that, due to the choice of distribution $\zeta$, we may write the conditional mutual information as:
Since $D$ is picked uniformly from $[t]$, this mutual information is a Jensen-Shannon divergence (see, for example, Wikipedia [WIK20] or Proposition A.6 of [BJK+04]):
From Pinsker’s inequality,