1 Introduction
Consider some classic streaming problems: heavy hitters, approximate counting, ℓ2 approximation, finding a nonzero entry in a vector (for turnstile algorithms), and counting the number of distinct elements in a stream. These problems were shown to have low-space randomized algorithms in [CCFC04, Mor78, Fla85, AMS99, IW05, MW10], respectively. All of these algorithms exhibit the property that running the algorithm multiple times on the same stream may produce different outputs on different executions.
For the sake of concreteness, let’s consider the problem of ℓ2 approximation: given a stream of polynomially many updates to a vector (the vector begins as the zero vector, and updates are of the form “increase a given entry by 1” or “decrease a given entry by 1”), output an approximation of the ℓ2 norm of the vector. There exists a celebrated randomized algorithm for this problem [AMS99]. This algorithm has the curious property that running the same algorithm multiple times on the same stream may result in different approximations. That is, if Alice runs the algorithm on the same stream as Bob (but using different randomness), Alice may get some approximation of the norm (such as 27839.8), and Bob (running the same algorithm, but with his own randomness) may get a different approximation (such as 27840.2). The randomized algorithm has the guarantee that both of the approximations will be close to the true value. However, interestingly, Alice and Bob end up with slightly different approximations. Is this behavior inherent? That is, could there exist a randomized algorithm such that, for every stream, with high probability both Alice and Bob end up with the same approximation of the norm?
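The phenomenon is easy to see concretely. The sketch below (our own illustration, not the paper’s construction) runs an AMS-style sign-sketch estimator for the ℓ2 norm twice with different randomness; storing the full sign matrix is a simplification, since a real streaming implementation would generate signs from hash functions in small space:

```python
import random

def l2_estimate(stream, n, k, seed):
    """AMS-style l2 estimator: average of k independent sign-sketch
    estimators.  `stream` is a list of (index, delta) turnstile updates.
    Storing all signs explicitly is for illustration only."""
    rng = random.Random(seed)
    signs = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]
    sketch = [0] * k
    for i, delta in stream:
        for j in range(k):
            sketch[j] += signs[j][i] * delta
    # Each sketch coordinate squared is an unbiased estimate of ||x||^2.
    return (sum(z * z for z in sketch) / k) ** 0.5

# The all-ones vector on 100 coordinates: true l2 norm is 10.
updates = [(i, 1) for i in range(100)]
alice = l2_estimate(updates, n=100, k=500, seed=1)
bob = l2_estimate(updates, n=100, k=500, seed=2)
# Both estimates are close to 10, but Alice's and Bob's answers differ.
```

Both runs are accurate, yet they disagree in the low-order digits, which is exactly the behavior the question above asks about.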
Such an algorithm, which when run on the same stream multiple times outputs the same output with high probability, is called pseudodeterministic. The main question we tackle in this paper is:
What streaming problems have low-memory pseudodeterministic algorithms?
1.1 Our Contributions
This paper is the first to investigate pseudodeterminism in the context of streaming algorithms. We show that certain problems have pseudodeterministic algorithms using substantially less space than the optimal deterministic algorithm, while other problems do not.
1.1.1 Lower Bounds
FindSupportElem:
We show pseudodeterministic lower bounds for finding a nonzero entry in a vector in the turnstile model. Specifically, consider the problem FindSupportElem of finding a nonzero entry in a vector in the turnstile model (the input is a stream of updates of the form “increase a given entry by 1” or “decrease a given entry by 1”, and we wish to find a nonzero entry in the final vector). We show this problem does not have a low-memory pseudodeterministic algorithm:
Theorem 1.1.
There is no pseudodeterministic algorithm for FindSupportElem which uses memory.
This is in contrast with the work of [MW10], which gives a randomized algorithm for the problem using polylogarithmic space.
Theorem 1.1 can be viewed as showing that any low-memory algorithm for FindSupportElem must have an input on which the output (viewed as a random variable over the randomness used by the algorithm) has at least a little bit of entropy. The algorithms we know for FindSupportElem have a very high amount of entropy in their outputs (the standard algorithms, on the all-1s input vector, output a uniformly random entry). Is this inherent, or can the entropy of the output be reduced? We show that it is inherent: for every low-memory algorithm there is an input on which the output has high entropy.

Theorem 1.2.
Every randomized algorithm for FindSupportElem using a small amount of space must have an input on which its output has high entropy.
So, in particular, a sufficiently space-efficient algorithm must have outputs whose entropy is maximal up to constant factors.
We also show analogous lower bounds for the problem FindDuplicate, in which the input is a stream of integers between 1 and n, and the goal is to output a number which appears at least twice in the stream:
Theorem 1.3.
Every randomized algorithm for FindDuplicate using a small amount of space must have an input on which its output has high entropy.
Techniques
To prove a pseudodeterministic lower bound for FindSupportElem, the idea is to show that if a pseudodeterministic algorithm existed for FindSupportElem, then there would also exist a pseudodeterministic one-way communication protocol for the problem OneWayFindDuplicate, in which Alice has a subset of the universe, and so does Bob, and they wish to find an element which they share.
To prove a lower bound on the one-way communication problem OneWayFindDuplicate, we show that if such a pseudodeterministic protocol existed, then Bob could use Alice’s message to recover many elements of her input (which together contain much more information than one short message). The idea is that using Alice’s message, Bob can find an element they have in common. Then, he can remove the element he found from his input, and repeat to find another element they have in common (using the original message Alice sent, so Alice does not have to send another message). After repeating many times, he will have found many elements which Alice has.
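The structure of this recovery argument can be simulated in a few lines. The inner protocol below is a toy stand-in (it "sends" Alice's whole set, which of course defeats compression); what matters for the argument is only that Bob's answer is a fixed function of the two inputs, and that one fixed message supports many rounds of recovery:

```python
def find_shared(alice_message, bob_set):
    # Stand-in for the one-way protocol: a pseudodeterministic protocol's
    # answer is (w.h.p.) a fixed function of the two inputs; here we model
    # that function as "the minimum shared element".
    shared = alice_message & bob_set
    return min(shared) if shared else None

def recover(alice_message, bob_set, r):
    """Bob reuses Alice's single message to recover r of Alice's elements."""
    recovered, current = set(), set(bob_set)
    for _ in range(r):
        e = find_shared(alice_message, current)
        if e is None:
            break
        recovered.add(e)       # e is certified to be in Alice's input
        current.remove(e)      # remove it and repeat with the same message
    return recovered

alice_input = set(range(0, 50))
bob_input = set(range(25, 75))
out = recover(alice_input, bob_input, r=10)
```

Since `out` grows by one genuine element of Alice's input per round, a short message supporting many rounds would encode far more information than its length allows, which is the contradiction the proof exploits.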
It may not be immediately obvious where pseudodeterminism is being used in this proof. The point is that because the algorithm is pseudodeterministic, the element which Bob finds in the intersection with high probability does not depend on the randomness used by Alice. That is, consider the sequence of elements which Bob finds. Because the algorithm is pseudodeterministic, there exists a specific sequence such that with high probability this is exactly the sequence of elements which Bob finds. Notice that a randomized (but not pseudodeterministic) algorithm for OneWayFindDuplicate would result in different sequences on different executions.
When the sequence is determined in advance, we can use a union bound and argue that with high probability, one of Alice’s messages will work on all of Bob’s inputs. If the sequence is not determined in advance, then it is not possible to use a union bound.
Proving a lower bound on the entropy of the output of an algorithm for FindSupportElem uses a similar idea, but is more technically involved: it is harder to ensure that Bob’s later inputs will succeed with Alice’s original message. The idea, at a very high level, is to have Alice send many messages (but not too many), so that Bob’s new inputs will not depend strongly on any part of Alice’s randomness, and also to have Alice send additional messages to keep Bob from going down a path where Alice’s messages no longer work.
This lower bound technique may seem similar to the way one would show a deterministic lower bound. It’s worth noting that for certain problems, deterministic lower bounds do not generalize to pseudodeterministic lower bounds; see our results on pseudodeterministic upper bounds for some examples and intuition for why certain problems remain hard in the pseudodeterministic setting while others do not.
Sketching lower bounds for pseudodeterministic ℓ2 norm estimation:
The known randomized algorithms (such as [AMS99]) for approximating the ℓ2 norm of a vector in a stream rely on sketching, i.e., storing the image of the input vector under a random matrix with far fewer rows than columns, and outputting the norm of this sketch. More generally, an abstraction of this framework is the setting where one has a distribution over matrices and a post-processing function. One then stores as a sketch the product of a matrix drawn from the distribution with the input vector, and outputs the value of the function on the sketch. By far, most streaming algorithms fall into this framework, and in fact some recent work [LNW14, AHLW16] proves, under some caveats and assumptions, that low-space turnstile streaming algorithms imply algorithms based on low-dimensional sketches. Since sketching-based streaming algorithms are provably optimal in many settings, this motivates studying whether there are low-dimensional sketches from which the ℓ2 norm can be estimated pseudodeterministically.
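The sketch-and-post-process framework can be made concrete as follows (our own minimal sketch, not a construction from the paper): the distribution over matrices is i.i.d. Gaussian, and the post-processing function is simply the norm of the sketch.

```python
import random

def sample_sketch_matrix(k, n, seed):
    """Draw a sketching matrix from a distribution: i.i.d. Gaussians scaled
    by 1/sqrt(k), so the norm of the sketch concentrates around ||x||."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) / k ** 0.5 for _ in range(n)] for _ in range(k)]

def apply_sketch(Pi, x):
    # The only thing stored by a sketching algorithm: the k-dimensional
    # matrix-vector product (maintainable under turnstile updates).
    return [sum(row[j] * x[j] for j in range(len(x))) for row in Pi]

def f(sketch):
    # Post-processing function applied to the sketch.
    return sum(z * z for z in sketch) ** 0.5

n, k = 400, 200
x = [1.0] * n            # true l2 norm is 20
Pi = sample_sketch_matrix(k, n, seed=0)
estimate = f(apply_sketch(Pi, x))
```

A randomized sketch like this approximates the norm well, but the estimate varies with the drawn matrix; the lower bound below says that forcing the post-processed value to be unique with high probability requires a much higher-dimensional sketch.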
We prove a lower bound on the dimension of sketches from which the ℓ2 norm can be estimated pseudodeterministically:
Theorem 1.4.
Suppose we have a distribution over matrices and a function on sketches such that for every input vector, when a matrix is drawn from the distribution:

(i) the value of the function on the sketch approximates the ℓ2 norm of the input vector to within a constant factor with high probability, and

(ii) the value of the function on the sketch takes a unique value with high probability.

Then the dimension of the sketch must be linear in the dimension of the input vector.
As an extension, we also show that
Theorem 1.5.
For every constant approximation factor and every randomized sketching algorithm for ℓ2 norm estimation using a low-dimensional sketch, there is a vector on which the output entropy of the algorithm is large. Furthermore, there is a randomized algorithm using a low-dimensional sketch with correspondingly small output entropy on all input vectors.
Techniques
The first insight in our lower bound is that if there is a pseudodeterministic streaming algorithm for ℓ2 norm estimation in small space, then there is a fixed function approximating the ℓ2 norm, together with a randomized algorithm that computes this function with high probability. The next step uses a result in the work of [HW13] to exhibit a (randomized) sequence of vectors, depending only on this function, such that any linear-sketching-based algorithm that uses a sublinear-dimensional sketch outputs an incorrect approximation to the ℓ2 norm of some vector in the sequence with constant probability, thereby implying a dimension lower bound.
1.1.2 Upper Bounds
All the problems considered so far were such that:

(i) there were “low-space” randomized algorithms, and

(ii) the pseudodeterministic and deterministic space complexities were “high” and equal up to logarithmic factors.
This raises the question of whether there are natural problems where pseudodeterministic algorithms outperform deterministic algorithms (by more than logarithmic factors). We answer this question in the affirmative.
We illustrate several natural problems where the pseudodeterministic space complexity is strictly smaller than the deterministic space complexity.
The first problem is that of finding a nonzero row in a matrix given as input in a turnstile stream. Our result for this problem has the bonus of giving a natural problem where the pseudodeterministic streaming space complexity is strictly sandwiched between the deterministic and randomized streaming space complexity.
In the problem FindNonzeroRow, the input is a matrix streamed in the turnstile model, and the goal is to output the index of a nonzero row of the matrix.
Theorem 1.6.
The randomized space complexity for FindNonzeroRow is , the pseudodeterministic space complexity for FindNonzeroRow is , and the deterministic space complexity for FindNonzeroRow is .
The idea behind the proof of Theorem 1.6 is to sample a random vector, and then deterministically find a nonzero entry of the product of the streamed matrix with this vector. With high probability, if a row of the matrix is nonzero, then the corresponding entry of the product will be nonzero as well.
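A minimal sketch of this idea, under our own illustrative choices (arithmetic modulo a large prime so entries stay small; a full vector is kept in memory, whereas a genuine low-space algorithm would regenerate it from a short seed):

```python
import random

P = (1 << 31) - 1  # a large prime; arithmetic mod P keeps entries small

def find_nonzero_row(updates, n, seed):
    """Process turnstile updates (i, j, delta) to an n x n matrix A while
    maintaining y = A v (mod P) for a random vector v.  A nonzero entry
    of y identifies a nonzero row of A (up to tiny failure probability)."""
    rng = random.Random(seed)
    v = [rng.randrange(1, P) for _ in range(n)]
    y = [0] * n
    for i, j, delta in updates:
        # An update to entry (i, j) changes row i's inner product with v.
        y[i] = (y[i] + delta * v[j]) % P
    # Deterministically find a nonzero entry of y.
    for i, yi in enumerate(y):
        if yi != 0:
            return i
    return None

# A 5x5 matrix whose only nonzero row is row 3 (row 1's updates cancel).
updates = [(1, 2, 4), (1, 2, -4), (3, 0, 7), (3, 4, -2)]
row = find_nonzero_row(updates, n=5, seed=0)
```

A nonzero row can give a zero inner product with the random vector only with probability about 1/P, which is where the "with high probability" in the argument comes from.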
Discussion:
Roughly speaking, in this problem there is a certain structure that allows us to use randomness to “hash” pieces of the input together, and then apply a deterministic algorithm on the hashed pieces. The other upper bounds we show for pseudodeterministic algorithms also have a structure which allows us to hash, and then use a deterministic algorithm. It is interesting to ask whether there are natural problems which have faster pseudodeterministic algorithms than the best deterministic algorithms, but for which the pseudodeterministic algorithms follow a different structure.
The next problems we show upper bounds for are estimating the frequencies of elements from a large universe in a stream, up to additive error, and estimating the inner product of two vectors in an insertion-only stream, up to additive error. We show a separation between the deterministic and (weak) pseudodeterministic space complexity in a suitable regime of parameters.
Theorem 1.7.
There is a pseudodeterministic algorithm for point query estimation and inner product estimation that uses bits of space. On the other hand, any deterministic algorithm needs bits of space.
1.2 Related work
Pseudodeterministic algorithms were introduced by Gat and Goldwasser [GG11]. Such algorithms have since been studied in the context of standard (sequential) algorithms [Gro15, OS16], average-case algorithms [Hol17], parallel algorithms [GG15], decision tree algorithms [GGR13, Gol19], interactive proofs [GGH17], learning algorithms [OS18], approximation algorithms [OS18, DPV18], and low-space algorithms [GL19]. In this work, we initiate the study of pseudodeterminism in the context of streaming algorithms (and in the context of one-way communication complexity).

The problem of finding duplicates in a stream of integers between 1 and n was first considered by [GR09], where a polylogarithmic-space randomized algorithm is given, later improved by [JST11]. We show that, in contrast to these low-space randomized algorithms, a pseudodeterministic algorithm needs significantly more space in the regime where the length of the stream is, say, n+1. [KNP17] shows optimal lower bounds for randomized algorithms solving the problem.
The method of ℓp sampling, which samples an index of a turnstile vector with probability proportional to its weighted mass and whose study was initiated in [MW10], is one way of outputting an element from the support of a turnstile stream. A line of work [FIS08, MW10, JST11, AKO10], ultimately leading to an optimal algorithm in [JW18] and tight lower bounds in [KNP17], characterizes the space complexity of randomized algorithms that output an element from the support of a turnstile vector, in contrast with the space lower bounds we show for algorithms constrained to a low-entropy output.
1.3 Open Problems
Morris Counters:
In [Mor78], Morris showed that one can approximate (up to a constant multiplicative error) the number of elements in a stream of up to n elements using O(log log n) bits of space. Does there exist a similarly space-efficient pseudodeterministic algorithm for the problem?
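For reference, the classic Morris counter is tiny to state: keep only an exponent, and increment it with exponentially decaying probability. The averaging over independent counters below is our own variance-reduction choice for the demonstration, not part of Morris's original scheme:

```python
import random

def morris_count(n_events, rng):
    """Morris's approximate counter: store only an exponent x, incremented
    with probability 2**-x; the estimate 2**x - 1 is unbiased."""
    x = 0
    for _ in range(n_events):
        if rng.random() < 2.0 ** -x:
            x += 1
    return 2.0 ** x - 1

rng = random.Random(7)
# A single counter has high variance; average many independent counters.
trials = [morris_count(5000, rng) for _ in range(200)]
estimate = sum(trials) / len(trials)
```

Note that the output of a single counter is highly non-unique across runs, which is exactly why a pseudodeterministic analogue is an open problem.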
ℓ2 norm estimation:
In this work, we show that there are no low-dimensional pseudodeterministic sketching algorithms for estimating the ℓ2 norm of a vector. However, we do not show a turnstile streaming lower bound for pseudodeterministic algorithms, which motivates the following question. Does there exist a low-space pseudodeterministic algorithm for ℓ2 norm estimation?
Multipass streaming lower bounds:
All the streaming lower bounds we prove are in the single-pass model, i.e., where the algorithm receives the stream exactly once. How do these lower bounds extend to the multi-pass model, where the algorithm receives the stream multiple times? None of the pseudodeterministic streaming lower bounds in this paper extends even to 2-pass streaming algorithms.
1.4 Table of complexities
In the table below, we outline the known space complexities of the various problems considered in our work.
Problem  Randomized  Deterministic  Pseudodeterministic 
Morris Counters  ,  
FindDuplicate  
ℓ2 approximation (streaming)  
ℓ2 approximation (sketching)  ,  
FindNonzeroRow 
2 Preliminaries
A randomized algorithm is called pseudodeterministic if for every valid input, when running the algorithm twice on that input, the same output is obtained with probability at least 2/3. Equivalently (up to amplification of error probabilities), one can think of an algorithm as pseudodeterministic if for every input, there is a unique value such that with probability at least 2/3 the algorithm outputs that value on that input.
Definition 2.1 (Pseudodeterministic).
A (randomized) algorithm is called pseudodeterministic if for all valid inputs, there is a unique value which the algorithm outputs with probability at least 2/3.
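The "amplification of error probabilities" alluded to above is standard majority voting; the following sketch (with a hypothetical toy algorithm of our own, outputting its canonical value with probability 0.7) illustrates it:

```python
import random
from collections import Counter

def amplify(alg, x, rng, t=101):
    """Boost a pseudodeterministic algorithm that outputs its canonical
    value with probability bounded above 1/2: run it t times and return
    the most common output."""
    outputs = [alg(x, rng) for _ in range(t)]
    return Counter(outputs).most_common(1)[0][0]

def toy_alg(x, rng):
    # Hypothetical algorithm: canonical value 2*x with probability 0.7,
    # arbitrary junk otherwise.
    return 2 * x if rng.random() < 0.7 else rng.randrange(10**6)

rng = random.Random(0)
result = amplify(toy_alg, 21, rng)
```

By a Chernoff bound the canonical value wins the vote except with probability exponentially small in t, which is why the two formulations of pseudodeterminism coincide up to amplification.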
An extension of pseudodeterminism is that of low-entropy randomized algorithms [GL19]. Such algorithms have the guarantee that for every input, the distribution of the output (over a random choice of the algorithm’s randomness) has low entropy.
Another extension of pseudodeterminism is that of k-pseudodeterministic algorithms, from [Gol19]. Intuitively speaking, an algorithm is k-pseudodeterministic if for every valid input, with high probability the algorithm outputs one of at most k options (so, a 1-pseudodeterministic algorithm is the same as a standard pseudodeterministic algorithm, since it outputs the one unique option with high probability):
Definition 2.2 (k-pseudodeterministic).
We say that an algorithm is k-pseudodeterministic if for all valid inputs, there is a set of size at most k such that the algorithm outputs an element of that set with high probability.
For the purposes of this work, we define a simple notion that we call a concentrated algorithm.
Definition 2.3.
We say that an algorithm is concentrated if for all valid inputs, there is some output which the algorithm returns with probability bounded away from zero.
The reason for making this definition is that any low-entropy randomized algorithm and any k-pseudodeterministic algorithm is concentrated (for appropriate parameters). Thus, showing an impossibility result for concentrated algorithms also shows an impossibility result for low-entropy and k-pseudodeterministic algorithms. Indeed, in this work, we use space lower bounds against concentrated algorithms to simultaneously conclude space lower bounds against low-entropy and multi-pseudodeterministic algorithms.
Definition 2.4.
A turnstile streaming algorithm is one where there is a vector, initially zero, and the input is a stream of updates of the form “increase a given coordinate of the vector by a given amount” or “decrease a given coordinate by a given amount”. The goal is to compute something about the final vector, after all of the updates.
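As a tiny concrete illustration of the model (ours, with unit updates for simplicity; note that an actual streaming algorithm must not store the whole vector as done here):

```python
def run_turnstile(updates, n):
    """Maintain the vector defined by a turnstile stream of updates
    (i, delta), with delta = +1 or -1, and return the final vector."""
    x = [0] * n
    for i, delta in updates:
        x[i] += delta
    return x

# Entry 2 is incremented twice and decremented once; entry 0 cancels out.
final = run_turnstile([(2, 1), (0, 1), (2, 1), (0, -1), (2, -1)], n=4)
# A valid FindSupportElem answer is any index i with final[i] != 0.
support = [i for i, v in enumerate(final) if v != 0]
```

The cancellation in entry 0 shows why turnstile problems like FindSupportElem are harder than their insertion-only counterparts: an element seen in the stream need not be in the final support.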
We use a pseudorandom generator for spacebounded computation due to Nisan [Nis92], which we recap below.
Theorem 2.5.
There is a pseudorandom generator stretching a short seed to a long output such that:

(i) any bit of the output, for any seed, can be computed in small space, and

(ii) for all functions computable by a finite state machine on a bounded number of states, the total variation distance between the value of the function on the generator’s output (for a uniformly random seed) and its value on a uniformly random string is small.
3 FindDuplicate: Pseudodeterministic lower bounds
Consider the following problem: the input is a stream of integers between 1 and n. The goal is to output a number which appears at least twice in the stream. Call this problem FindDuplicate. Recall that this problem has been considered in prior literature, specifically in [GR09, JST11, KNP17], where upper and lower bounds for randomized algorithms have been shown.
Theorem 3.1.
FindDuplicate has a randomized algorithm which uses polylogarithmic memory and succeeds with all but polynomially small probability.
We formally define a pseudodeterministic streaming algorithm and show a pseudodeterministic lower bound for FindDuplicate to contrast with the randomized algorithm from Theorem 3.1.
Definition 3.2 (Pseudodeterministic Streaming Algorithm).
A pseudodeterministic streaming algorithm is a (randomized) streaming algorithm such that for all valid input streams, running the algorithm twice on the stream yields the same output with probability at least 2/3.
One can also think of a pseudodeterministic streaming algorithm as an algorithm such that for every valid input stream, there exists some valid output which the algorithm outputs with high probability (one would have to amplify the success probability using repetition to see that this alternate notion is the same as the definition above).
Definition 3.3 (FindDuplicate).
Define FindDuplicate to be the streaming problem where the input is a stream of length n+1 consisting of integers between 1 and n, and the output must be an integer which has occurred at least twice in the stream.
Theorem 3.4.
FindDuplicate has no pseudodeterministic algorithm with memory .
Proof Overview:
In order to prove Theorem 3.4, we introduce two communication complexity problems: OneWayFindDuplicate and OneWayPartialRecovery.
In the OneWayFindDuplicate problem, Alice has a list of integers between 1 and n, and so does Bob. Alice sends a message to Bob, after which Bob must output an integer which is in both Alice’s and Bob’s lists. Formally:
Definition 3.5 (OneWayFindDuplicate).
Define OneWayFindDuplicate to be the one-way communication complexity problem where Alice and Bob each have as input a subset of {1, …, n}, of a prescribed size large enough that the two sets must intersect. The goal is for Bob to output an element in the intersection.
The idea is that one can reduce OneWayFindDuplicate to FindDuplicate. So, our new goal will be to show that OneWayFindDuplicate requires high communication. To do so, we will show that it is possible to reduce a different problem, denoted OneWayPartialRecovery (defined below), to OneWayFindDuplicate. Informally, in the OneWayPartialRecovery problem, Alice has a list of integers between 1 and n. Bob does not have an input. Alice sends a message to Bob, after which Bob must output many distinct elements which are all in Alice’s list. Formally:
Definition 3.6 (OneWayPartialRecovery).
Define OneWayPartialRecovery to be the one-way communication complexity problem where Alice has as input a subset of {1, …, n} and Bob has no input. The goal is for Bob to output a set of a prescribed number of distinct elements, all of which belong to Alice’s input.
We will show in Claim 1 that a low-memory pseudodeterministic algorithm for FindDuplicate implies a low-communication pseudodeterministic algorithm for OneWayFindDuplicate, and in Claim 2 that a low-communication pseudodeterministic algorithm for OneWayFindDuplicate implies a low-communication algorithm for OneWayPartialRecovery. Finally, in Claim 3, we show that OneWayPartialRecovery cannot be solved with low communication. Combining the claims yields Theorem 3.4.
Proof of Theorem 3.4.
Claim 1.
A pseudodeterministic algorithm for FindDuplicate with space and success probability implies a pseudodeterministic communication protocol for OneWayFindDuplicate with communication and success probability at least .
Proof.
To prove the above claim, we construct a protocol for OneWayFindDuplicate from a streaming algorithm for FindDuplicate. Given an instance of OneWayFindDuplicate, Alice can stream her input set of integers in increasing order and simulate the streaming algorithm for FindDuplicate. Then, she sends the current state of the algorithm (whose size is at most the space usage of the algorithm) to Bob, who continues the execution of the streaming algorithm on his own set. At the end, the streaming algorithm outputs a repeated element with high probability; since each player’s set consists of distinct integers, a repeated element must have shown up in both Alice’s and Bob’s lists. Note that for a given input to Alice and Bob, Bob outputs a unique element with high probability because the streaming algorithm is pseudodeterministic. ∎
Claim 2.
A pseudodeterministic oneway communication protocol for OneWayFindDuplicate with communication and failure probability implies a pseudodeterministic communication protocol for OneWayPartialRecovery with communication and failure probability.
Proof.
We will show how to use a protocol for OneWayFindDuplicate to solve an instance of OneWayPartialRecovery.
Suppose we have an instance of OneWayPartialRecovery. Alice sends the same message to Bob as if the input were an instance of OneWayFindDuplicate, which is valid since in both of these problems, Alice’s input is a list of integers between 1 and n.
Now, Bob’s goal is to use the message sent by Alice to recover many elements of Alice’s input. Bob maintains the (initially empty) set of Alice’s elements that he has recovered so far, together with a simulated input set disjoint from it, initialized arbitrarily. While he has not yet recovered enough elements, Bob simulates the protocol for OneWayFindDuplicate with Alice’s message and his simulated input. This results in Bob finding a single element of Alice’s input that is (i) in his simulated input, and (ii) not yet recovered. Bob adds this element to his recovered set and deletes it from his simulated input. Once he has recovered enough elements, Bob outputs the recovered set.
For any set that Bob might use as his input, define its canonical answer to be the output which the pseudodeterministic protocol for OneWayFindDuplicate produces with high probability when Alice’s input is her true input and Bob’s input is that set. Starting from Bob’s initial simulated input, this defines a predetermined sequence of sets, each obtained from the previous one by deleting its canonical answer. These are the sets which, assuming the protocol never errs during the reduction (where we say the protocol errs if it does not output the unique element which is guaranteed to be output with high probability), Bob will use as his inputs for the simulated executions of OneWayFindDuplicate. By a union bound, the protocol errs on none of the sets in the sequence except with small probability. If Bob succeeds on all of them, then the sequence of inputs he actually uses for the simulated executions is indeed the predetermined sequence. So, since with high probability the protocol succeeds on all sets in the sequence, and therefore with high probability these sets are also Bob’s actual inputs for the simulated executions, we see that with high probability Bob succeeds on all of the inputs on which he simulates executions of OneWayFindDuplicate.
Note that we used the union bound over all the sets in the predetermined sequence. All of these are a function of Alice’s input and Bob’s initial simulated input alone; in particular, notice that by definition, they do not depend on the randomness chosen by Alice. ∎
Claim 3.
Every pseudodeterministic OneWayPartialRecovery protocol which succeeds with probability at least requires bits of communication.
Proof.
We prove this lower bound by showing that a protocol for OneWayPartialRecovery can be used to obtain a protocol with exactly the same communication for the problem in which Alice is given a bit string as input, she sends a message to Bob, and Bob must exactly recover the string from Alice’s message with probability at least 2/3. This problem is well known to require communication linear in the length of the string.
Suppose there exists a pseudodeterministic protocol for OneWayPartialRecovery that succeeds with probability at least 2/3. Then there is a function mapping each possible input of Alice to the set which is Bob’s output at the end of the protocol with probability at least 2/3 on that input.
We will construct a large family of candidate input sets for Alice such that no set in the family is contained in the canonical output associated with a different set in the family. To do so, we use the probabilistic method: choose the sets in the family uniformly at random. The probability that one fixed set is contained in the canonical output of another fixed set is small; thus, by a union bound over all pairs, with positive probability no such containment occurs, so a family satisfying the desired guarantee exists.
Alice and Bob can (ahead of time) agree on an injective encoding of bit strings as sets in this family. Now, if Alice is given a bit string as input, she can send a message to Bob according to the pseudodeterministic protocol for OneWayPartialRecovery by treating her input as the corresponding set. Bob then recovers the canonical output with probability at least 2/3, and can use it to determine Alice’s set, since there is a unique set in the family contained in this output. Since the encoding is injective, Bob can then also recover Alice’s bit string with probability at least 2/3.
This reduction establishes the claimed lower bound on the pseudodeterministic communication complexity of OneWayPartialRecovery. ∎
Combining Claims 1, 2 and 3 completes the proof of Theorem 3.4. ∎
It is worth noting that the problem has pseudodeterministic algorithms with sublinear space if one allows multiple passes through the input. Informally, a multi-pass streaming algorithm is a streaming algorithm which, instead of seeing the stream only once, gets to see the stream several times.
Claim 4.
There is a multi-pass deterministic streaming algorithm that uses low memory for the FindDuplicate problem.
Proof.
At the start of each pass, the algorithm maintains a candidate interval from which it seeks to find a repeated element; at the very beginning, this candidate interval is all of {1, …, n}. In each pass, it partitions the current candidate interval into equal-sized subintervals (the width of an interval is the number of integers in it) and counts the number of elements of the stream that lie in each subinterval. The number of stream elements lying in the candidate interval exceeds its width: this holds at the beginning by the pigeonhole principle, since the stream has length n+1, and it is maintained inductively; hence the count must exceed the width of at least one subinterval. The algorithm updates the candidate interval to such a subinterval and proceeds to the next pass. After sufficiently many passes, this interval will contain at most one integer, and that integer must be repeated. ∎
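The pigeonhole search above can be sketched with two subintervals per pass (a binary-search variant of the same idea; the choice of two halves per pass is ours, for simplicity):

```python
def find_duplicate_multipass(stream, n):
    """Multi-pass pigeonhole search: keep a candidate interval [lo, hi]
    whose stream-count exceeds its width; each pass counts the left half
    and keeps a half in which the count still exceeds the width."""
    lo, hi = 1, n
    while lo < hi:
        mid = (lo + hi) // 2
        # Each iteration of this loop corresponds to one pass over the
        # stream; only lo, hi, and one counter are stored between passes.
        count_left = sum(1 for v in stream if lo <= v <= mid)
        if count_left > mid - lo + 1:
            hi = mid
        else:
            lo = mid + 1
    return lo

stream = list(range(1, 101)) + [42]   # 101 integers in {1, ..., 100}
dup = find_duplicate_multipass(stream, 100)
```

The invariant is that the count of stream elements in [lo, hi] exceeds hi − lo + 1; if the left half fails this, the right half must satisfy it, so the surviving single integer is a duplicate.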
4 Entropy Lower Bound for FindDuplicate
Theorem 4.1.
Every zero-error randomized algorithm for FindDuplicate that is concentrated must use space.
By zero-error, we mean that the algorithm never outputs a number which is not repeated: with probability one, it either outputs a valid output or outputs ⊥.
Proof.
We use a reduction similar to that of the pseudodeterministic case (cf. the proof of Claim 2). Using the exact same reduction as in the proof of Claim 1, a concentrated streaming algorithm for FindDuplicate using small space gives us a concentrated protocol for OneWayFindDuplicate with correspondingly small communication. If we can convert such a protocol for OneWayFindDuplicate into a low-communication protocol for OneWayPartialRecovery, the desired space lower bound follows from the lower bound on the communication complexity of OneWayPartialRecovery from Claim 3. We will now show how to make such a conversion by describing a protocol for OneWayPartialRecovery.
Alice sends Bob several messages according to the protocol for OneWayFindDuplicate (that is, she simulates the protocol for OneWayFindDuplicate several times independently). Bob’s goal is to use these messages to recover enough of Alice’s input elements. Towards this goal, he maintains a set of elements recovered so far (initially empty) and a family of ‘active sets’ (initially containing just the starting set). While he has not recovered enough elements, Bob simulates the remainder of the OneWayFindDuplicate protocol on every possible pair consisting of a set in the family and one of Alice’s messages. For each such pair where the protocol is successful in finding a duplicate element, Bob adds the found element to the recovered set, removes it from the active set, and adds the resulting set to the family.
We now wish to prove that this protocol indeed lets Bob recover enough elements of Alice’s input. For each set Bob might use, define its canonical answer to be an element that has a good probability of being output by Bob on that input at the end of a OneWayFindDuplicate protocol; such an element exists because the protocol is concentrated. As in the pseudodeterministic case, this gives a predetermined sequence of sets, each obtained from the previous by removing its canonical answer. Note that this sequence is predetermined: it is a function of Alice’s input (and, in particular, not a function of the randomness she uses when choosing her messages). For a fixed set in the sequence, the probability that Bob fails to recover its canonical answer from every one of Alice’s messages decays exponentially in the number of messages. A failure to recover enough elements implies that for some set in the sequence, Bob failed to recover its canonical answer from all of Alice’s messages. By taking the number of messages large enough, the probability of this happening for a specific set is small, and so by a union bound the probability that some canonical answer in the sequence is not recovered by Bob is small.
Thus, we obtain a low-communication protocol for OneWayPartialRecovery, and the claimed space lower bound follows, completing the proof. ∎
We obtain the following as immediate corollaries:
Corollary 4.2.
Any zero-error low-entropy randomized algorithm for FindDuplicate must use space.
Corollary 4.3.
Any zero-error pseudodeterministic algorithm for FindDuplicate must use space.
Below we show that the above lower bound is tight up to log factors.
Theorem 4.4.
For every setting of the parameter, there exists a zero-error randomized algorithm for FindDuplicate that is concentrated and uses correspondingly little space (up to factors polylogarithmic in n).
Proof.
Define the following basic algorithm for FindDuplicate: pick a random position in the stream, remember the element at that position, and see if it appears again later in the stream. If it does, return that element. Otherwise return ⊥.
The concentrated algorithm is as follows: run several copies of the basic algorithm independently (in parallel), and then output the minimum of the non-⊥ outputs (or ⊥ if there are none).
We are left to show that this algorithm is indeed concentrated.
For each possible output value, consider the total number of times it shows up in the stream. The probability that a given value is output by a single run of the basic algorithm is proportional to one less than its number of occurrences, since the value is output exactly when the algorithm chooses to remember any of its occurrences other than the last.

Consider the smallest threshold value such that the probability that a single run of the basic algorithm outputs a value at most the threshold is substantial. We will show that with high probability the final output is at most this threshold. It will follow that the algorithm is concentrated, since only a bounded number of valid outputs are at most the threshold; so, with high probability, one of a bounded number of values (namely, the valid outputs at most the threshold) is output, and hence at least one of them is output with probability bounded below.

Since the probability that a single run of the basic algorithm outputs a value at most the threshold is substantial, the probability that no run among the many parallel runs does so is polynomially small in n. Hence, with high probability some element which is at most the threshold is output, and the minimum of the outputs is therefore at most the threshold.
∎
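As a sanity check, the argument above can be simulated directly. The following sketch (illustrative only: the names are ours, and the real low-space algorithm stores a single remembered element rather than the whole stream) runs independent copies of the basic algorithm and outputs the minimum:

```python
import random

INF = float("inf")  # sentinel meaning "this copy found no duplicate"

def single_run(stream, rng):
    """One copy of the basic algorithm: remember the element at a random
    position and report it if it reappears strictly later in the stream."""
    j = rng.randrange(len(stream))
    remembered = stream[j]
    for x in stream[j + 1:]:
        if x == remembered:
            return remembered
    return INF

def concentrated_find_duplicate(stream, k, seed=0):
    """Run k independent copies (in parallel, in the streaming model) and
    output the minimum of the k outputs."""
    rng = random.Random(seed)
    return min(single_run(stream, rng) for _ in range(k))
```

On the stream `[1, 2, 3, 2, 1]`, both 1 and 2 are valid duplicates; with enough copies the minimum valid output, 1, wins with overwhelming probability, matching the concentration argument.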
4.1 Getting Rid of the Zero Error Requirement
A downside of thm:nuancedfindrep is that it shows a lower bound only for zero-error algorithms. In this section, we strengthen the theorem by removing that requirement:
Theorem 4.5.
Every randomized algorithm for FindDuplicate that is concentrated and errs with probability at most must use space (for all ).
Proof overview:
We begin by outlining why the approach of thm:nuancedfindrep does not work without the zeroerror requirement. Recall that the idea in the proof was to have Alice send many messages (for OneWayFindDuplicate) to Bob, and Bob simulates the OneWayFindDuplicate algorithm (using simulated inputs he creates for himself) using these messages to find elements in Alice’s input.
The problem is that the elements we end up removing from Bob’s simulated input (recall that Bob simulates an input to the OneWayFindDuplicate problem, and then repeatedly finds elements he shares with Alice, removes them from the “fake” input, and thereby reconstructs a large fraction of Alice’s input) depend on Alice’s messages, and therefore we cannot use a union bound to bound the probability that the protocol fails for a certain simulated input. So, we want the elements we remove from Bob’s fake input not to depend on the messages Alice sent. One idea for achieving this is to have Alice send a bunch of messages (for finding a shared element), and then have Bob remove the element that is output the largest number of times (when simulating the protocol with each of the many messages Alice sent). The issue with this is that if the two most common outputs have very similar probabilities, the chosen element depends not only on Alice’s input, but also on the randomness she uses when choosing what messages to send to Bob. This again makes it impossible to use a union bound.
There are two new ideas to fix this issue. The first is to use the following “Threshold” technique: Bob will pick a random “threshold” T between and (where we wish to show a lower bound on concentrated algorithms, and is the total number of messages Alice sends to Bob). He simulates the algorithm for OneWayFindDuplicate with all the messages Alice sent him, and gets a list of outputs. Then, he considers the “real” output to be the lexicographically first output such that there are more than copies of it in the list (note that since the algorithm is concentrated, it is very unlikely for no such element to exist).
Now, it follows that with high probability, the shared element does not really depend on the messages. This is because with all but probability approximately , the threshold is far (more than away) from the frequency of every element in . We note that we pick since from noise we would expect the frequencies of elements in to change by up to , depending on the randomness of . We want the threshold to be further than that from the expected frequencies, so that with high probability there will be no element which sometimes has frequency more than and sometimes has frequency less than , depending on Alice’s messages (recall that the goal is to make the outputs depend as little as possible on Alice’s messages, and only on shared randomness and on Alice’s input).
This is still not enough for us: we still cannot use a union bound, since a fraction of the time Bob’s output will depend on Alice’s messages (and not just her input). The next idea resolves this. Alice will send approximately additional pieces of information: telling Bob where the chosen thresholds are bad, and what thresholds to use instead. We assume that we have shared randomness, so Alice knows all of the thresholds that will be chosen by Bob (note that heavyrecovery is hard even in the presence of shared randomness, so the lower bound is sufficient with shared randomness). Hence, Alice can tell for which executions the chosen threshold will be too close to the likelihood of some element, and correct exactly those executions. By doing so, Alice guarantees that a path independent of her messages will be taken.
To recap, idea 1 is to use the threshold technique so that with probability what Bob does doesn’t depend on Alice’s messages (only on her input). Idea 2 is to have Alice tell Bob where these bad situations are, and how to fix them.
The total amount of information Alice sends (ignoring logs) is (where is the message size we are assuming exists for pseudodeterministically finding a shared element, and k is the number of messages Alice sends). The factor follows since of the time, short messages will be sent to Bob due to a different threshold. A threshold requires bits to describe, which can be dropped since we are ignoring log factors. Setting , we conclude that Alice sends a total of bits. This establishes a contradiction, since we need bits to solve OneWayPartialRecovery. So, whenever , we can pick a such that we get a contradiction.
Proof.
Below we write out the full reduction as an algorithm for OneWayFindDuplicate.

Alice creates messages for OneWayFindDuplicate and sends them to Bob (call these messages of type A).

Additionally, Alice looks at the thresholds in the shared randomness. Every time there is a threshold that is close (within ) to the expected number of times a certain will be outputted on the corresponding input (that is, for each fake input Bob will try, Alice checks whether the probability of outputting some is close to – to be precise, she checks whether its probability of being outputted, assuming a randomly chosen message by Alice, is close to ), she sends a message to Bob informing him of the bad threshold and suggesting a good threshold to use instead (call these messages of type B). Notice that these messages do not depend on the messages of type A that Alice sends, and that each such message is of size .

Bob sets to be the simulated input

Bob uses each of the messages of type that Alice sent, along with , to construct a list of outputs.

Bob looks at the shared randomness to find a threshold (if Alice has informed him that it is a bad threshold, he uses the threshold Alice suggested instead), and considers the lexicographically minimal output y that is contained in the multiset more than times.

Bob removes from the fake input and repeats the last three steps of the algorithm (this time using a new threshold).
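The selection rule in the steps above can be made concrete. Below is a minimal sketch (our own naming; the simulation of the one-way algorithm and the type-B corrections are abstracted away) of how Bob picks the lexicographically minimal output whose multiplicity clears the threshold:

```python
from collections import Counter

def select_output(outputs, threshold):
    """Bob's rule: among the candidate outputs obtained by simulating the
    one-way algorithm once per received message, return the smallest one
    that appears strictly more than `threshold` times. Returns None if no
    output clears the threshold, which is unlikely when the underlying
    algorithm is concentrated."""
    counts = Counter(outputs)
    winners = [y for y, c in counts.items() if c > threshold]
    return min(winners) if winners else None
```

For example, on the candidate list `[5, 3, 3, 5, 3, 7]` with threshold 2, only the value 3 appears more than twice, so it is selected regardless of how the remaining noise is distributed.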
Claim 5.
The above protocol solves OneWayPartialRecovery with high probability using bits.
Proof.
First we show that the total number of bits communicated is . Notice that the total number of messages of type that are sent is . We assume that each of these is of size at most , giving us a total of bits sent in messages of type . Under the assumption that , we see that this is total bits for messages of type .
We now count the total number of bits communicated in messages of type . Each message of type is of size (it is describing a single element, and a number corresponding to which execution the message is relevant for, each requiring bits). So, we wish to show that with high probability the total number of messages of type is . The total number of messages of type that will be sent is , since for every input, the probability that the randomly chosen threshold (which is sampled using public randomness) is more than away from the frequency of every output is . Note that since , and we assume .
We are now left to show that the protocol correctly solves OneWayPartialRecovery with high probability. We will first show that, after fixing Alice’s input and the public randomness, with high probability there is a single sequence of inputs that Bob will try (that is, there is a sequence of ’s that Bob goes through with high probability). To do this, consider a certain input that Bob tries. We will bound the probability that there are two values and such that both and have probability at least of being outputted. Suppose there exist two such and ; that means that at least one of them (say , without loss of generality) has to be the output of more than of the executions with probability more than , but less than . Additionally, we know that the expected number of times that will be outputted out of the times is more than away from (otherwise Alice would pick a different value of such that this is true, and send that value to Bob in a message of type ). However, the probability of being more than away from the expectation, by a Chernoff bound, is (asymptotically) less than .
Notice also that, by the assumption that the algorithm is concentrated, there will always be an output which is expected to appear at least of the time. Also, since the threshold is at most , the probability that appeared fewer than times is exponentially small in , and so with high probability there will always exist a which was outputted in more than of the executions; so, in the second-to-last step, the multiset will always have an element that appears at least times.
Hence, by a union bound over all inputs that Bob tries, with high probability there is a single sequence of inputs which Bob goes through (which depends only on the public thresholds and Alice’s input).
We will show that each generated by Bob is an element in Alice’s input with high probability. Notice that the that Bob picks has appeared more than times out of , where is at least . If is not a valid output then its probability of being outputted is . The probability it is outputted at least once is at most . Taking a union bound over the inputs that Bob tries (of which there are ), we get that the probability that there is an invalid at any point is at most 1/10. So, with probability 9/10, no invalid is ever outputted. ∎
∎
5 Entropy lower bounds for finding a support element
Consider the turnstile model of streaming, where a vector starts out as and receives updates of the form ‘increment by 1’ or ‘decrement by 1’, and the goal is to output a nonzero coordinate of . This is a well-studied problem, and a common randomized algorithm that solves it in a small amount of space is known as sampling [FIS08]. sampling uses polylogarithmic space and outputs a uniformly random coordinate from the support of . A natural question one could ask is whether the output of any low-space randomized algorithm is necessarily close to uniform, i.e., has high entropy. We answer this affirmatively and show a nearly tight tradeoff between the space needed to solve this problem and the entropy of the output of a randomized algorithm, under the assumption that the algorithm is not allowed to output anything outside the support (we note that using ideas similar to those in Subsection 4.1, the zero-error requirement could be removed; we omit this adaptation since it is very similar to that of Subsection 4.1).
Theorem 5.1.
Every zero-error randomized algorithm for FindSupportElem that is concentrated must use space.
We only provide a sketch of the proof and omit details since they are nearly identical to the proof of thm:nuancedfindrep.
Proof Sketch.
Let be such an algorithm that uses space. Just like the proof of thm:nuancedfindrep, the way we show this lower bound is by illustrating that can be used to obtain an communication protocol for OneWayPartialRecovery, which combined with claim:partreclowerbound yields the desired result.
For every element in Alice’s input set , she streams ‘increment by 1’ and runs independent copies of on the input. She then sends the states of each of these independent runs of to Bob, which is at most bits. Bob maintains a set of states , initially filled with all of Alice’s messages. While he has not yet recovered elements, Bob picks a message and recovers in using algorithm . Then, for each , Bob resumes on state , streams ‘decrement by 1’, adds the new state to , and deletes from .
The proof of correctness for why Bob indeed eventually recovers elements of is identical to that in the proof of thm:nuancedfindrep, thus giving a protocol for OneWayPartialRecovery and proving the statement. ∎
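The interaction pattern of this reduction can be sketched as follows. Here `ToyFindSupportElem` is a stand-in of our own devising that stores the vector exactly (so it is not low-space); only the pattern of Alice sending algorithm states and Bob peeling off recovered elements by decrementing is being illustrated:

```python
import copy

class ToyFindSupportElem:
    """Stand-in for the low-space algorithm: it stores the vector exactly.
    Only the interaction pattern matters in this sketch, not the space."""
    def __init__(self, n):
        self.v = [0] * n

    def update(self, i, delta):
        self.v[i] += delta

    def query(self):
        """Return some support element (here: the smallest), or None."""
        support = [i for i, x in enumerate(self.v) if x != 0]
        return min(support) if support else None

def alice(elements, n, copies=3):
    """Alice streams 'increment coordinate a' for each a in her set and
    sends the states of several independent runs of the algorithm."""
    runs = []
    for _ in range(copies):
        alg = ToyFindSupportElem(n)
        for a in elements:
            alg.update(a, +1)
        runs.append(alg)
    return runs

def bob(states, how_many):
    """Bob repeatedly recovers an element, then 'decrements' it out of
    every saved state and continues on the updated states."""
    recovered = []
    states = [copy.deepcopy(s) for s in states]
    while len(recovered) < how_many:
        y = states[0].query()
        if y is None:
            break
        recovered.append(y)
        for s in states:
            s.update(y, -1)
    return recovered
```

For instance, if Alice's set is {2, 5, 7} over a universe of size 10, Bob recovers all three elements by resuming the saved states after each decrement.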
We can immediately conclude the following.
Corollary 5.2.
Any zero-error entropy randomized algorithm for FindSupportElem must use space.
Corollary 5.3.
Any zero-error pseudodeterministic algorithm for FindSupportElem must use space.
This lower bound is also tight up to polylogarithmic factors due to an algorithm nearly identical to the one from thm:findreplowentropyalgo. In particular, we have:
Theorem 5.4.
For all , there exists a zero-error randomized algorithm for FindSupportElem using space that is concentrated.
6 Space complexity of pseudodeterministic norm estimation
In this section, we once again consider the pseudodeterministic complexity of norm estimation in the sketching model. The algorithmic question here is to design a distribution over matrices along with a function so that for any :
Further, we want to be a pseudodeterministic function; i.e., we want to be a unique number with high probability.
Theorem 6.1.
The pseudodeterministic sketching complexity of norm estimation is .
The following query problem is key to our lower bound.
Definition 6.2 ( adaptive attack).
Let be some constant. Let be an matrix with realvalued entries and be some function. Now, consider the query model where an algorithm is allowed to specify a vector as a query and is given as a response. The goal of the algorithm is to output such that
in as few queries as possible. We call this algorithmic problem the adaptive attack problem.
We use a theorem on adaptive attacks on sketches proved in [HW13].
Theorem 6.3.
There is a query protocol to solve the adaptive attack problem with probability at least , i.e., the problem in def:adaptiveattack when .
Proof of thm:l2main.
Suppose is a distribution over sketching matrices and is a function mapping to with the property that the pair gives a pseudodeterministic sketching algorithm for norm estimation. Henceforth, we use to denote a random matrix sampled from . Then there is a function such that:

is an approximation of the norm.

On every input , with probability at least for some constant .
We will show that must be by deriving a contradiction when . Let be a parameter to be chosen later. Let be the (random) sequence of vectors in obtained by the adaptive query protocol from thm:adaptiveattack based on responses where , and let be the (random) output of the protocol. Note that the guarantee that hinges on assuming . From the guarantees of thm:adaptiveattack, for any fixed matrix and function such that for all , it is true with probability at least that . On the other hand, for any sequence of fixed vectors , for all with probability at least . Call this event . Let
be the probability density function of
and let be the probability density function of . This results in the following two estimates of . On the one hand,
and on the other hand,
The contradiction arises since cannot simultaneously be at least and at most , and hence cannot be .
∎
Corollary 6.4.
For any constant , any concentrated sketching algorithm where the sketching matrix is can be turned into a pseudodeterministic one by running independent copies of the sketch and outputting the majority answer. Thus, as an upshot of thm:l2main, we obtain a lower bound of on concentrated algorithms for pseudodeterministic norm estimation in the sketching model.
In contrast to cor:tightell_2lower, which says that concentrated algorithms for estimation in the sketching model need near-linear dimension, we show that there is an dimension concentrated sketching algorithm to solve the problem, thus exhibiting a ‘phase transition’.
Theorem 6.5.
For every constant , there is a distribution over matrices and a function giving an space concentrated sketching algorithm for norm estimation.
Proof.
Let the true norm of the input vector be . Run the classic sketching algorithm of [AMS99] for randomized norm estimation with error and failure probability , where is the desired approximation ratio. This uses a sketch of dimension . Now, we describe the function we use. Take the output of the sketching algorithm of [AMS99] and return the number obtained by zeroing out all its bits beyond the first significant bits (the parameters and are chosen purely for safety reasons). First, the outputted number is a approximation. Further, for each input, the output is one of two candidates with probability for every constant . This is because [AMS99] produces a approximation to , and there are only two candidates for the most significant bits of any real number that lies in an interval . ∎
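The rounding step can be illustrated concretely. The sketch below (our own helper, `round_to_msb`) keeps only the leading significant bits of a positive number, so that two close multiplicative approximations of the same norm collapse to the same output:

```python
import math

def round_to_msb(x, bits):
    """Keep only the first `bits` significant binary digits of x > 0,
    zeroing out everything below them."""
    if x <= 0:
        return 0.0
    e = math.floor(math.log2(x))      # position of the leading bit
    scale = 2.0 ** (e - bits + 1)     # weight of the last kept bit
    return math.floor(x / scale) * scale

# Two nearby approximations of the same norm collapse to one value:
a = round_to_msb(27839.8, 4)
b = round_to_msb(27840.2, 4)
```

Here both approximations from the introduction (27839.8 and 27840.2) round to the same number, illustrating why the truncated output takes one of at most two values with high probability.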
7 Pseudodeterministic Upper Bounds
7.1 Finding a nonzero row
Given updates to an matrix (where we assume ) that is initially in a turnstile stream such that all entries of are always in the range , the problem FindNonzeroRow is to either output an index such that the th row of is nonzero, or output none if is the zero matrix.
Theorem 7.1.
The randomized space complexity for FindNonzeroRow is , the pseudodeterministic space complexity for FindNonzeroRow is , and the deterministic space complexity for FindNonzeroRow is .
Proof.
We will first show a randomized space algorithm for the problem, then pseudodeterministic upper and lower bounds, and finally the deterministic lower bound.
Randomized algorithm for FindNonzeroRow.
A randomized algorithm for this problem is given below. Note that the version of the algorithm as stated below does not have the desirable space guarantee, but we will show how to use a pseudorandom generator of Nisan [Nis92] to convert the below algorithm to one that uses low space.

Sample a random dimensional vector where each entry is an independently drawn integer in and store it.

Simulate a turnstile stream which maintains . In particular, consider the dimensional vector , which is initially , and for each update to of the form “add to ”, add to . We run an sampling algorithm [FIS08] on this simulated stream updating , and return the output of the sampler, which is close in total variation distance to a uniformly random element in the support of .
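A toy rendition of these two steps follows (hypothetical names; a real implementation would run an sampler on the maintained vector and derive the random vector from Nisan's generator rather than storing either explicitly):

```python
import random

def find_nonzero_row(n, m, updates, field=10**9 + 7, seed=0):
    """Toy version of the two steps above: sample a random vector x,
    maintain y = A @ x under turnstile updates, and sample a coordinate
    from the support of y."""
    rng = random.Random(seed)
    x = [rng.randrange(1, field) for _ in range(m)]  # random test vector
    y = [0] * n                                      # y = A @ x, maintained online
    for i, j, delta in updates:                      # update: "add delta to A[i][j]"
        y[i] += delta * x[j]
    support = [i for i, v in enumerate(y) if v != 0]
    return rng.choice(support) if support else None
```

A row that receives cancelling updates (e.g. +3 and -3 on the same entry) drops out of the support of `y`, while a genuinely nonzero row stays in it with high probability over the choice of `x`.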
In the above algorithm, step:samplevec is not low-space as stated. Before we give a way to perform step:samplevec in space, we prove the correctness of the above algorithm. Suppose is a nonzero row of , and let be an index where is nonzero. Suppose all coordinates of except for the th coordinate have been sampled; then there is at most one value of for which is , and so there is at most a probability that equals . This means that if is a nonzero row, then is nonzero except with probability at most . In fact, by taking a union bound over all nonzero rows, we can conclude that the set of nonzero rows and the set of nonzero indices of are exactly the same, except with probability bounded by .
Now we turn our attention to implementing step:samplevec in low space. Towards doing so we use Nisan’s pseudorandom generator for space bounded computation in a very similar manner to [Ind06].
Instead of sampling bits to store , we sample and store a uniformly random seed of length , and add to when an update “add to ” is received, where is the function from thm:nisanprg that maps the random seed to a sequence of bits. To prove the algorithm is still correct when we use the pseudorandom vector instead of the uniformly random vector , we must show that when is nonzero, then is nonzero with probability at least . Towards this, for a fixed dimensional vector , consider the following finite state machine. The states are labeled by pairs where is in and is in . The FSM takes a dimensional vector as input, starts at state , and transitions from state to until it reaches a state . The FSM then outputs . This establishes that for a fixed , the function is computable by an FSM on states, and hence from thm:nisanprg, and are close in total variation distance, which means that when is nonzero, then is nonzero except with probability bounded by .
A pseudodeterministic algorithm and lower bound for FindNonzeroRow
The pseudodeterministic algorithm is very similar to the randomized algorithm from the previous section.

Sample a random dimensional vector where each entry is an independently drawn integer in . Store and maintain .

Output the smallest index such that is nonzero.
Storing takes space, and maintaining takes space. Recall from the discussion surrounding the randomized algorithm that the set of nonzero indices of and the set of nonzero rows were equal with probability , which establishes correctness of the above pseudodeterministic algorithm. The space complexity is thus , which is equal to from the assumption that .
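The pseudodeterministic rule can be illustrated in the same toy setting as before: output the smallest index with a nonzero entry of the maintained vector, so that independent runs with different randomness agree on the answer with high probability. (Names and parameters below are ours, for illustration only.)

```python
import random

def psd_find_nonzero_row(n, m, updates, seed, field=10**9 + 7):
    """Toy version of the pseudodeterministic rule: maintain y = A @ x for
    a random x, but output the *smallest* index i with y[i] != 0, so that
    different random seeds agree on the answer with high probability."""
    rng = random.Random(seed)
    x = [rng.randrange(1, field) for _ in range(m)]
    y = [0] * n
    for i, j, delta in updates:
        y[i] += delta * x[j]
    for i, v in enumerate(y):
        if v != 0:
            return i
    return None

# Same stream, ten different seeds -> one common output:
ups = [(3, 1, 2), (0, 0, 4), (0, 0, -4)]   # row 0 cancels; row 3 stays nonzero
answers = {psd_find_nonzero_row(5, 2, ups, seed=s) for s in range(10)}
```

All ten runs output index 3, since the output depends only on which entries of `y` are nonzero, not on the particular random vector sampled.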
A pseudodeterministic lower bound of follows immediately from cor:multipsdfindsupelem since FindNonzeroRow specialized to the case is the same as FindSupportElem.
Lower Bound for deterministic algorithms.
An bit space lower bound for deterministic algorithms follows from a reduction from the communication complexity problem Equality. Alice and Bob are each given bit strings as input, which they interpret as matrices and respectively, where each entry is a chunk of length . Suppose a deterministic algorithm takes bits of space to solve this problem. We will show that it can be converted into a bit communication protocol for Equality. Alice runs on a turnstile stream updating a matrix initialized at 0 by adding to for all in . Alice then sends the bits corresponding to the state of the algorithm to Bob, and he continues running on the updates ‘add to ’. outputs none if and only if , and thus Bob outputs the answer to Equality depending on the output of . Due to a communication complexity lower bound of on Equality, must be .
∎
7.2 Point Query Estimation and Inner Product Estimation
In this section, we give pseudodeterministic algorithms that beat the deterministic lower bounds for two closely related streaming problems — point query estimation and inner product estimation.
Point Query Estimation.
Given a parameter and a stream of elements where each element comes from a universe , followed by a query , output such that where is the frequency of element in the stream.
Inner Product Estimation.
Given a parameter and a stream of updates to (initially valued) dimensional vectors and in an insertion-only stream (i.e., a stream in which only increments by positive numbers are promised), output an estimate satisfying .
In the above problems, we will be interested in the regime where .
Our main result regarding a pseudodeterministic algorithm for point query estimation is:
Theorem 7.2.
There is an space pseudodeterministic algorithm for point query estimation with the following precise guarantees. For every sequence in , there is a sequence such that

For all , where is the frequency of in the stream.

Except with probability , for all outputs on query .
We remark that the deterministic complexity of the problem is (see Theorem 7.7).
Towards establishing thm:pointquery, we recall two facts.
Theorem 7.3 (Misra–Gries algorithm [Mg82]).
Given a parameter and a length stream of elements in , there is a deterministic space algorithm that given any query , outputs such that where is the number of occurrences of in the stream. An additional guarantee that the algorithm satisfies is the following, which we call permutation invariance. Consider the stream
and for any permutation , consider the stream
When the algorithm is given the first stream as input, let denote its output on query , and when the algorithm is given the second stream as input, let denote its output on query . The algorithm has the guarantee that
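For concreteness, a minimal sketch of the standard Misra–Gries summary with counters follows (our own naming; the permutation-invariance guarantee in the theorem statement is a property of the full analysis and is not illustrated here):

```python
def misra_gries(stream, k):
    """Standard Misra-Gries with at most k-1 counters; every estimate
    f_hat(q) satisfies f(q) - n/k <= f_hat(q) <= f(q) for a stream of
    length n, where f(q) is the true frequency of q."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # no free counter: decrement everything, dropping zeros
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters

def point_query(counters, q):
    """Estimate the frequency of q from the summary."""
    return counters.get(q, 0)
```

On the stream `[1, 1, 1, 2, 2, 3]` with `k = 3` (two counters), the element 3 forces one round of decrements, and each resulting estimate undercounts its true frequency by at most n/k = 2.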