Pseudo-deterministic Streaming

11/26/2019
by   Shafi Goldwasser, et al.

A pseudo-deterministic algorithm is a (randomized) algorithm which, when run multiple times on the same input, with high probability outputs the same result on all executions. Classic streaming algorithms, such as those for finding heavy hitters, approximate counting, ℓ_2 approximation, and finding a nonzero entry in a vector (for turnstile algorithms), are not pseudo-deterministic. For example, in the instance of finding a nonzero entry in a vector, for any known low-space algorithm A, there exists a stream x so that running A twice on x (using different randomness) would with high probability result in two different entries as the output. In this work, we study whether it is inherent that these algorithms output different values on different executions. That is, we ask whether these problems have low-memory pseudo-deterministic algorithms. For instance, we show that there is no low-memory pseudo-deterministic algorithm for finding a nonzero entry in a vector (given in a turnstile fashion), and also that there is no low-dimensional pseudo-deterministic sketching algorithm for ℓ_2 norm estimation. We also exhibit problems which do have low-memory pseudo-deterministic algorithms but no low-memory deterministic algorithm, such as outputting a nonzero row of a matrix, or outputting a basis for the row-span of a matrix. We also investigate multi-pseudo-deterministic algorithms: algorithms which with high probability output one of a few options. We show the first lower bounds for such algorithms. This implies that there are streaming problems such that every low-space algorithm for the problem must have inputs where there are many valid outputs, all with a significant probability of being outputted.


1 Introduction

Consider some classic streaming problems: heavy hitters, approximate counting, ℓ_2 approximation, finding a nonzero entry in a vector (for turnstile algorithms), and counting the number of distinct elements in a stream. These problems were shown to have low-space randomized algorithms in [CCFC04, Mor78, Fla85, AMS99, IW05, MW10], respectively. All of these algorithms exhibit the property that when running the algorithm multiple times on the same stream, different outputs may result on different executions.

For the sake of concreteness, let us consider the problem of ℓ_2 approximation: given a stream of polynomially many updates to a vector (the vector begins as the zero vector, and updates are of the form “increase a given entry” or “decrease a given entry”), output an approximation of the ℓ_2 norm of the vector. There exists a celebrated randomized algorithm for this problem [AMS99]. This algorithm has the curious property that running the same algorithm multiple times on the same stream may result in different approximations. That is, if Alice runs the algorithm on the same stream as Bob (but using different randomness), Alice may get some approximation of the ℓ_2 norm (such as 27839.8), and Bob (running the same algorithm, but with his own randomness) may get a different approximation (such as 27840.2). The randomized algorithm has the guarantee that both of the approximations will be close to the true value. However, interestingly, Alice and Bob end up with slightly different approximations. Is this behavior inherent? That is, could there exist an algorithm which, while being randomized, guarantees that for every stream, with high probability Alice and Bob end up with the same approximation of the ℓ_2 norm?
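The following minimal Python sketch of an AMS-style ℓ_2 estimator illustrates this behavior (it is not the paper's algorithm: it stores the random signs explicitly, whereas the actual low-space algorithm uses 4-wise independent hashing and median-of-means amplification). Two runs process the identical stream with different randomness and return slightly different estimates, which is exactly what pseudo-determinism rules out.

```python
import random

def l2_estimate(stream, n, k, seed):
    """AMS-style l2 estimation: keep k counters, each the inner product of the
    stream vector with a random +/-1 sign vector; the average of the squared
    counters is an unbiased estimate of the squared l2 norm."""
    rng = random.Random(seed)
    signs = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]
    counters = [0] * k
    for i, delta in stream:            # turnstile update: add delta to entry i
        for j in range(k):
            counters[j] += signs[j][i] * delta
    return (sum(c * c for c in counters) / k) ** 0.5

updates = [(0, 3), (5, -2), (0, 1), (7, 4)]      # a hypothetical stream of updates
print(l2_estimate(updates, n=10, k=50, seed=1))  # Alice's estimate
print(l2_estimate(updates, n=10, k=50, seed=2))  # Bob's estimate: close, but different
```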

Such an algorithm, which when run multiple times on the same stream outputs the same result with high probability, is called pseudo-deterministic. The main question we tackle in this paper is:

What streaming problems have low-memory pseudo-deterministic algorithms?

1.1 Our Contributions

This paper is the first to investigate pseudo-determinism in the context of streaming algorithms. We show that certain problems have pseudo-deterministic algorithms using substantially less space than the optimal deterministic algorithm, while other problems do not.

1.1.1 Lower Bounds

Find-Support-Elem:

We show pseudo-deterministic lower bounds for finding a nonzero entry in a vector in the turnstile model. Specifically, consider the problem Find-Support-Elem of finding a nonzero entry in a vector for a turnstile algorithm (the input is a stream of updates of the form “increase a given entry by 1” or “decrease a given entry by 1”, and we wish to find a nonzero entry in the final vector). We show this problem does not have a low-memory pseudo-deterministic algorithm:

Theorem 1.1.

There is no pseudo-deterministic algorithm for Find-Support-Elem which uses memory.

This is in contrast with the work of [MW10], which shows a randomized algorithm for the problem using polylogarithmic space.

Theorem 1.1 can be viewed as showing that any low-memory algorithm A for Find-Support-Elem must have an input where the output (viewed as a random variable depending on the randomness used by A) has at least a little bit of entropy. The algorithms we know for Find-Support-Elem have a very high amount of entropy in their outputs (the standard algorithms, for an input which is the all-1s vector, will find a uniformly random entry). Is this inherent, or can the entropy of the output be reduced? We show that this is inherent: for every low-memory algorithm there is an input on which the output has high entropy.

Theorem 1.2.

Every randomized algorithm for Find-Support-Elem using space must have an input such that has entropy at least .

So, in particular, an algorithm using space must have outputs with entropy , which is maximal up to constant factors.

We also show analogous lower bounds for the problem Find-Duplicate, in which the input is a stream of integers between and , and the goal is to output a number which appears at least twice in the stream:

Theorem 1.3.

Every randomized algorithm for Find-Duplicate using space must have an input such that has entropy at least .

Techniques

To prove a pseudo-deterministic lower bound for Find-Support-Elem, the idea is to show that if a pseudo-deterministic algorithm existed for Find-Support-Elem, then there would also exist a pseudo-deterministic one-way communication protocol for the problem One-Way-Find-Duplicate, where Alice has a subset of of size , and so does Bob, and they wish to find an element which they share.

To prove a lower bound for the one-way communication problem One-Way-Find-Duplicate, we show that if such a pseudo-deterministic protocol existed, then Bob could use Alice's message to recover many () elements of her input (which contain much more information than one short message). The idea is that using Alice's message, Bob can find an element they have in common. He can then remove that element from his input and repeat to find another element they have in common (using the original message Alice sent, so Alice does not have to send another message). After repeating times, he will have found many elements which Alice has.
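The recovery loop can be summarized by the following schematic Python sketch. Here find_shared is a hypothetical stand-in for one execution of the One-Way-Find-Duplicate protocol using Alice's (reused) message; only the structure of the reduction is shown, not its communication cost.

```python
def partial_recovery(alice_message, bob_set, find_shared, num_rounds):
    """Bob reuses Alice's single message to recover many of her elements.

    find_shared(message, bob_set) models one run of the One-Way-Find-Duplicate
    protocol: given Alice's message and Bob's current set, it returns an element
    that (with high probability) also lies in Alice's set."""
    recovered = set()
    current = set(bob_set)
    for _ in range(num_rounds):
        x = find_shared(alice_message, current)   # element shared with Alice
        recovered.add(x)
        current.discard(x)     # remove it so the next run finds a new shared element
    return recovered
```

Because the protocol is pseudo-deterministic, the sequence of elements returned by find_shared is essentially determined by Alice's input and Bob's current set, which is what makes the union bound discussed below possible.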

It may not be immediately obvious where pseudo-determinism is being used in this proof. The idea is that because the algorithm is pseudo-deterministic, the element which Bob finds as the intersection with high probability does not depend on the randomness used by Alice. That is, let be the sequence of elements which Bob finds. Because the algorithm is pseudo-deterministic, there exists a specific sequence such that with high probability this will be the sequence of elements which Bob finds. Notice that a randomized (but not pseudo-deterministic) algorithm for One-Way-Find-Duplicate would result in different sequences on different executions.

When the sequence is determined in advance, we can use a union bound and argue that with high probability Alice's message works for all of Bob's inputs. If the sequence is not determined in advance, then it is not possible to use a union bound.

Proving a lower bound on the entropy of the output of an algorithm for Find-Support-Elem uses a similar idea, but is more technically involved. It is harder to ensure that Bob’s later inputs will be able to succeed with Alice’s original message. The idea, at a very high level, is to have Alice send many messages (but not too many), so that Bob’s new inputs will not strongly depend on any part of Alice’s randomness, and also to have Alice send additional messages to keep Bob from going down a path where Alice’s messages will no longer work.

This lower bound technique may seem similar to the way one would show a deterministic lower bound. It’s worth noting that for certain problems, deterministic lower bounds do not generalize to pseudo-deterministic lower bounds; see our results on pseudo-deterministic upper bounds for some examples and intuition for why certain problems remain hard in the pseudo-deterministic setting while others do not.

Sketching lower bounds for pseudo-deterministic norm estimation:

The known randomized algorithms (such as [AMS99]) for approximating the ℓ_2 norm of a vector in a stream rely on sketching, i.e., storing Sx, where S is a random matrix with few rows, and outputting an estimate computed from Sx. More generally, an abstraction of this framework is the setting where one has a distribution D over matrices and a function f. One then stores a sketch Sx of the input vector x, where S is drawn from D, and outputs f(Sx). By far, most streaming algorithms fall into this framework, and in fact some recent work [LNW14, AHLW16] proves, under some caveats and assumptions, that low-space turnstile streaming algorithms imply algorithms based on low-dimensional sketches. Since sketching-based streaming algorithms are provably optimal in many settings, it is natural to study whether there are low-dimensional sketches of x from which the ℓ_2 norm can be estimated pseudo-deterministically.

We prove a lower bound on the dimension of sketches from which the ℓ_2 norm can be estimated pseudo-deterministically:

Theorem 1.4.

Suppose D is a distribution over matrices and f is a function such that for all input vectors x, when S is drawn from D:

  • f(Sx) approximates the ℓ_2 norm of x to within a constant factor with high probability,

  • f(Sx) takes a unique value with high probability.

Then the number of rows of S must be near-linear in the dimension of x.

As an extension, we also show that

Theorem 1.5.

For every constant and every randomized sketching algorithm for ℓ_2 norm estimation using a -dimensional sketch, there is a vector such that the output entropy of the algorithm is at least . Furthermore, there is a randomized algorithm using a -dimensional sketch with output entropy at most on all input vectors.

Techniques

The first insight in our lower bound is that if there is a pseudo-deterministic streaming algorithm for ℓ_2 norm estimation in low space, then there is a fixed function g such that g approximates the ℓ_2 norm and the algorithm is a randomized procedure that computes g with high probability. The next step uses a result in the work of [HW13] to exhibit a (randomized) sequence of vectors, depending only on g, such that any linear sketching-based algorithm that uses sublinear-dimensional sketches outputs an incorrect approximation to the ℓ_2 norm of some vector in that sequence with constant probability, thereby implying a dimension lower bound.

1.1.2 Upper Bounds

On the one hand, all the problems considered so far were such that

  1. There were “low-space” randomized algorithms.

  2. The pseudo-deterministic and deterministic space complexity were “high” and equal up to logarithmic factors.

This raises the question of whether there are natural problems where pseudo-deterministic algorithms outperform deterministic algorithms (by more than logarithmic factors). We answer this question in the affirmative.

We illustrate several natural problems where the pseudo-deterministic space complexity is strictly smaller than the deterministic space complexity.

The first problem is that of finding a nonzero row in a matrix given as input in a turnstile stream. Our result for this problem has the bonus of giving a natural problem where the pseudo-deterministic streaming space complexity is strictly sandwiched between the deterministic and randomized streaming space complexity.

In the problem Find-Nonzero-Row, the input is a matrix streamed in the turnstile model, and the goal is to output an index such that the corresponding row of the matrix is nonzero.

Theorem 1.6.

The randomized space complexity for Find-Nonzero-Row is , the pseudo-deterministic space complexity for Find-Nonzero-Row is , and the deterministic space complexity for Find-Nonzero-Row is .

The idea behind the proof of Theorem 1.6 is to sample a random vector v, and then deterministically find a nonzero entry of Mv, where M denotes the input matrix. With high probability, if a row of M is nonzero, then the corresponding entry of Mv will be nonzero as well.
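The following is a minimal, offline Python sketch of this hash-then-deterministic idea (a hypothetical illustration rather than the streaming algorithm, which maintains Mv under turnstile updates instead of storing M; see Section 7.1).

```python
import random

def find_nonzero_row(M, seed=0, value_range=10**9):
    """Hash the columns of M together using a random vector v, then scan M @ v
    deterministically: a nonzero entry of M @ v certifies that the corresponding
    row of M is nonzero, and a nonzero row yields a nonzero entry except with
    probability about 1/value_range."""
    rng = random.Random(seed)
    v = [rng.randrange(1, value_range) for _ in range(len(M[0]))]
    for i, row in enumerate(M):
        if sum(a * b for a, b in zip(row, v)) != 0:
            return i          # deterministic choice: the first (smallest) such index
    return None               # M is (with high probability) the zero matrix

M = [[0, 0, 0], [0, 2, -2], [1, 0, 0]]
print(find_nonzero_row(M))    # 1 for almost every choice of v
```

Note that randomness is only needed so that nonzero rows are unlikely to be hashed to zero; once v is fixed, the choice of which index to output is entirely deterministic.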

Discussion:

Roughly speaking, in this problem there is a certain structure that allows us to use randomness to “hash” pieces of the input together, and then apply a deterministic algorithm on the hashed pieces. The other upper bounds we show for pseudo-deterministic algorithms also have a structure which allows us to hash, and then use a deterministic algorithm. It is interesting to ask if there are natural problems which have faster pseudo-deterministic algorithms than the best deterministic algorithms, but for which the pseudo-deterministic algorithms follow a different structure.

The next problems we show upper bounds for are estimating frequencies in a length- stream of elements from a large universe up to error , and that of estimating the inner product of two vectors and in an insertion-only stream of length- up to error . We show a separation between the deterministic and (weak) pseudo-deterministic space complexity in the regime where .

Theorem 1.7.

There is a pseudo-deterministic algorithm for point query estimation and inner product estimation that uses bits of space. On the other hand, any deterministic algorithm needs bits of space.

1.2 Related work

Pseudo-deterministic algorithms were introduced by Gat and Goldwasser [GG11]. Such algorithms have since been studied in the context of standard (sequential) algorithms [Gro15, OS16], average-case algorithms [Hol17], parallel algorithms [GG15], decision tree algorithms [GGR13, Gol19], interactive proofs [GGH17], learning algorithms [OS18], approximation algorithms [OS18, DPV18], and low-space algorithms [GL19]. In this work, we initiate the study of pseudo-determinism in the context of streaming algorithms (and in the context of one-way communication complexity).

The problem of finding duplicates in a stream of integers between 1 and was first considered by [GR09], where an algorithm using bits of space is given, later improved to bits by [JST11]. We show that, in contrast to these low-space randomized algorithms, a pseudo-deterministic algorithm needs significantly more space in the regime where the length of the stream is, say, . [KNP17] shows optimal lower bounds for randomized algorithms solving the problem.

The method of ℓ_p-sampling, which samples an index of a turnstile vector with probability proportional to its mass and whose study was initiated in [MW10], is one way of outputting an element from the support of a turnstile stream. A line of work [FIS08, MW10, JST11, AKO10], ultimately leading to an optimal algorithm in [JW18] and tight lower bounds in [KNP17], characterizes the space complexity of randomized algorithms that output an element from the support of a turnstile vector as , in contrast with the space lower bounds we show for algorithms constrained to a low-entropy output.

1.3 Open Problems

Morris Counters:

In [Mor78], Morris showed that one can approximate (up to a multiplicative error) the number of elements in a stream of up to n elements using O(log log n) bits of space. Does there exist a pseudo-deterministic algorithm for the problem using a comparable amount of space?

ℓ_2-norm estimation:

In this work, we show that there are no low-dimensional pseudo-deterministic sketching algorithms for estimating the ℓ_2 norm of a vector. However, we do not show a turnstile streaming lower bound for pseudo-deterministic algorithms, which motivates the following question. Does there exist a low-space pseudo-deterministic streaming algorithm for ℓ_2 norm estimation?

Multi-pass streaming lower bounds:

All the streaming lower bounds we prove are in the single-pass model, i.e., where the algorithm receives the stream exactly once. How do these lower bounds extend to the multi-pass model, where the algorithm receives the stream multiple times? None of the pseudo-deterministic streaming lower bounds in this paper extend even to 2-pass streaming algorithms.

1.4 Table of complexities

In the table below, we outline the known space complexity of various problems considered in our work.

Problem | Randomized | Deterministic | Pseudo-deterministic
Morris Counters | | |
Find-Duplicate | | |
ℓ_2-approximation (streaming) | | |
ℓ_2-approximation (sketching) | | |
Find-Nonzero-Row | | |

Table 1: Table of space complexities.

2 Preliminaries

A randomized algorithm is called pseudo-deterministic if, for every valid input, when running the algorithm twice on that input, the same output is obtained on both runs with high probability. Equivalently (up to amplification of error probabilities), one can think of an algorithm as pseudo-deterministic if for every input there is a unique value such that, with probability at least 2/3, the algorithm outputs that value on that input.

Definition 2.1 (Pseudo-deterministic).

A (randomized) algorithm A is called pseudo-deterministic if for all valid inputs x, the algorithm satisfies Pr_{r_1, r_2}[A(x, r_1) = A(x, r_2)] ≥ 2/3, where r_1 and r_2 denote two independent choices of the algorithm's internal randomness.

An extension of pseudo-determinism is that of -entropy randomized algorithms [GL19]. Such algorithms have the guarantee that for every input, the distribution of the output (over a random choice of the randomness) has low entropy, in particular bounded by .

Another extension of pseudo-determinism is that of -pseudo-deterministic algorithms, from [Gol19]. Intuitively speaking, an algorithm is -pseudo-deterministic if for every valid input, with high probability the algorithm outputs one of options (so, a 1-pseudo-deterministic algorithm is the same as a standard pseudo-deterministic algorithm, since it outputs the one unique option with high probability):

Definition 2.2 (-pseudo-deterministic).

We say that an algorithm is -pseudo-deterministic if for all valid inputs , there is a set of size at most , such that

For the purposes of this work, we define a simple notion that we call a -concentrated algorithm.

Definition 2.3.

We say that an algorithm is -concentrated if for all valid inputs , there is some output such that .

The reason for making this definition is that any -entropy randomized algorithm, and any -pseudo-deterministic algorithm is -concentrated. Thus, showing an impossibility result for -concentrated algorithms also shows an impossibility result for -entropy and -pseudo-deterministic algorithms. Indeed, in this work, we use space lower bounds against -concentrated algorithms to simultaneously conclude space lower bounds against low entropy and multi-pseudo-deterministic algorithms.

Definition 2.4.

A turnstile streaming algorithm is one where there is a vector, initially zero, and the input is a stream of updates of the form “increase a specified coordinate of the vector” or “decrease a specified coordinate of the vector”. The goal is to compute something about the final vector, after all of the updates.

We use a pseudorandom generator for space-bounded computation due to Nisan [Nis92], which we recap below.

Theorem 2.5.

There is a function such that

  1. Any bit of for any input can be computed in space.

  2. For all functions from to some set such that is computable by a finite state machine on states, the total variation distance between the random variables and where is uniformly drawn from and is uniformly drawn from is at most .

3 Find-Duplicate: Pseudo-deterministic lower bounds

Consider the following problem: the input is a stream of integers between and . The goal is to output a number which appears at least twice in the stream. Call this problem Find-Duplicate. Recall that this problem has been considered in the past literature, specifically in [GR09, JST11, KNP17], where upper and lower bounds for randomized algorithms have been shown.

Indeed, we know the following is true from [GR09, JST11].

Theorem 3.1.

Find-Duplicate has an algorithm which uses memory and succeeds with all but probability .

We formally define a pseudo-deterministic streaming algorithm and show a pseudo-deterministic lower bound for Find-Duplicate to contrast with the randomized algorithm from Theorem 3.1.

Definition 3.2 (Pseudo-deterministic Streaming Algorithm).

A pseudo-deterministic streaming algorithm is a (randomized) streaming algorithm A such that for all valid input streams x, the algorithm satisfies Pr_{r_1, r_2}[A(x, r_1) = A(x, r_2)] ≥ 2/3, where r_1 and r_2 denote two independent choices of the algorithm's internal randomness.

One can also think of a pseudo-deterministic streaming algorithm as an algorithm such that for every valid input stream there exists some valid output such that the algorithm outputs it with probability at least (one would have to amplify the success probability using repetition to see that this alternate notion is the same as the definition above).

Definition 3.3 (Find-Duplicate).

Define Find-Duplicate to be the streaming problem where the input is a stream of length consisting of integers between and , and the output must be an integer which has occurred at least twice in the stream.

Theorem 3.4.

Find-Duplicate has no pseudo-deterministic algorithm with memory .

Proof Overview:

In order to prove Theorem 3.4, we introduce two communication complexity problems, One-Way-Find-Duplicate and One-Way-Partial-Recovery:

In the One-Way-Find-Duplicate problem, Alice has a list of integers between and , and so does Bob. Alice sends a message to Bob, after which Bob must output an integer which is in both Alice’s and Bob’s list. Formally:

Definition 3.5 (One-Way-Find-Duplicate).

Define One-Way-Find-Duplicate to be the one-way communication complexity problem where Alice has input and Bob has input , where . The goal is for Bob to output an element in .

The idea is that one can reduce One-Way-Find-Duplicate to Find-Duplicate. So, our new goal will be to show that One-Way-Find-Duplicate requires high communication. To do so, we will show that it is possible to reduce a different problem, denoted One-Way-Partial-Recovery (defined below), to One-Way-Find-Duplicate. Informally, in the One-Way-Partial-Recovery problem, Alice has a list of integers between and . Bob does not have an input. Alice sends a message to Bob, after which Bob must output distinct elements which are all in Alice's list. Formally:

Definition 3.6 (One-Way-Partial-Recovery).

Define One-Way-Partial-Recovery to be the one-way communication complexity problem where Alice has input and Bob has no input. The goal is for Bob to output a set satisfying and .

We will show in Claim 1 that a low memory pseudo-deterministic algorithm for Find-Duplicate implies a low-communication pseudo-deterministic algorithm for One-Way-Find-Duplicate, and in Claim 2 that a low-communication pseudo-deterministic algorithm for One-Way-Find-Duplicate implies a low communication algorithm for One-Way-Partial-Recovery. Finally, in Claim 3, we show that One-Way-Partial-Recovery cannot be solved with low communication. Combining the claims yields Theorem 3.4.

Proof of Theorem 3.4.
Claim 1.

A pseudo-deterministic algorithm for Find-Duplicate with space and success probability implies a pseudo-deterministic communication protocol for One-Way-Find-Duplicate with communication and success probability at least .

Proof.

To prove the above claim, we construct a protocol for One-Way-Find-Duplicate from a streaming algorithm for Find-Duplicate. Given an instance of One-Way-Find-Duplicate, Alice can stream her input set of integers in increasing order, and simulate the streaming algorithm for Find-Duplicate. Then, she sends the current state of the algorithm (which is at most bits) to Bob, who continues the execution of the streaming algorithm on his own input set. At the end, the streaming algorithm outputs a repeated element with probability , which means that element showed up in both Alice's and Bob's lists. Note that for a given input to Alice and Bob, Bob outputs a unique element with high probability because the streaming algorithm is pseudo-deterministic. ∎
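The state-passing reduction can be sketched as follows; make_streaming_alg and its feed/output interface are hypothetical stand-ins for a low-space streaming algorithm whose internal state fits into the allowed message size.

```python
def one_way_find_duplicate(alice_set, bob_set, make_streaming_alg):
    """One-way protocol from a streaming algorithm: Alice feeds her elements,
    "sends" the algorithm's state, and Bob resumes the same execution on his
    own elements."""
    alg = make_streaming_alg()
    for x in sorted(alice_set):     # Alice streams her set in increasing order
        alg.feed(x)
    # In the real protocol only the algorithm's s-bit state is communicated.
    for x in sorted(bob_set):       # Bob continues the execution on his set
        alg.feed(x)
    return alg.output()             # a duplicate, i.e., an element both parties hold
```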

Claim 2.

A pseudo-deterministic one-way communication protocol for One-Way-Find-Duplicate with communication and failure probability implies a pseudo-deterministic communication protocol for One-Way-Partial-Recovery with communication and failure probability.

Proof.

We will show how to use a protocol for One-Way-Find-Duplicate to solve the instance of One-Way-Partial-Recovery.

Suppose we have an instance of One-Way-Partial-Recovery. Alice sends the same message to Bob as if the input was an instance of One-Way-Find-Duplicate, which is valid since in both of these problems, Alice’s input is a list of length of integers between and .

Now, Bob’s goal is to use the message sent by Alice to recover elements of Alice. Let be the (initially empty) set of elements of Alice’s input that Bob knows and let be a set of elements in disjoint from , where we initially set to . While the size of is less than , Bob simulates the protocol of One-Way-Find-Duplicate with Alice’s message and input . This will result in Bob finding a single element in Alice’s input that is (i) in , and (ii) not in . Bob adds to , and deletes from . Once the size of is , Bob outputs .

If Alice has the set as her input, define to be the output which the pseudo-deterministic algorithm for One-Way-Find-Duplicate outputs with high probability when Alice's input is and Bob's input is . Now, set , and . Note that these (for through ) are the sets which Bob will use as his inputs for the simulated executions of One-Way-Find-Duplicate, assuming the pseudo-deterministic algorithm never errs during the reduction (where we say the algorithm errs if it does not output the unique element which is guaranteed to be output with high probability). The pseudo-deterministic algorithm does not err on any of these sets except with probability at most , by a union bound. If the algorithm succeeds on all of them, then these are indeed the inputs Bob uses for the simulated executions of One-Way-Find-Duplicate. So, since with high probability the algorithm succeeds on all of these sets, and therefore with high probability they are also Bob's inputs for the simulated executions of One-Way-Find-Duplicate, we see that with high probability Bob succeeds on all of the inputs with which he simulates executions of One-Way-Find-Duplicate.

Note that we used the union bound over all the for through . All of these are a function of . In particular, notice that by definition, the do not depend on the randomness chosen by Alice. ∎

Claim 3.

Every pseudo-deterministic One-Way-Partial-Recovery protocol which succeeds with probability at least requires bits of communication.

Proof.

We prove this lower bound by showing that a protocol for One-Way-Partial-Recovery can be used to obtain a protocol with exactly the same communication for the problem where Alice is given a string as input, she sends a message to Bob, and Bob must exactly recover Alice's string from her message with probability at least 2/3. This problem has a lower bound of bits of communication.

Suppose there exists a pseudo-deterministic algorithm for One-Way-Partial-Recovery. Given such a pseudo-deterministic protocol that succeeds with probability at least , there is a function such that (a set with elements) is Bob’s output after the protocol with probability at least when Alice is given as input.

We will construct sets to be subsets of of size such that for any , is not a subset of . To do so, we use the probabilistic method: set be random subsets of of size . The probability that is contained for fixed is at most . Thus, by a union bound, the probability that for any , is contained is at most , a quantity which is strictly less than when is , so satisfying the desired guarantee exist.

Alice and Bob can (ahead of time) agree on an encoding of -bit strings that is an injective function from to . Now, if Alice is given a -bit string as input, she can send a message to Bob according to the pseudo-deterministic protocol for One-Way-Partial-Recovery by treating her input as . Bob then recovers with probability at least 2/3, and can use it to recover since there is unique in which is contained. Since is injective, Bob can also recover with probability .

This reduction establishes a lower bound of on the pseudo-deterministic communication complexity of One-Way-Partial-Recovery, which is an lower bound. ∎

Combining claim:find-rep-communication, claim:find-rep-to-part-rec and claim:part-rec-lower-bound completes the proof of thm:psd-lower-bound. ∎

It is worth noting that the problem has pseudo-deterministic algorithms with sublinear space if one allows multiple passes through the input. Informally, a -pass streaming algorithm is a streaming algorithm which, instead of seeing the stream only once, gets to see the stream times.

Claim 4.

There is a -pass deterministic streaming algorithm that uses memory for the Find-Duplicate problem.

Proof.

At the start of the -th pass, the algorithm maintains a candidate interval of width from which it seeks to find a repeated element. At the very beginning, this candidate interval is . In the -th pass, it partitions the current interval into equal-sized subintervals (the width of an interval is the number of integers it contains) and counts the number of elements of the stream that lie in each such subinterval; by the pigeonhole principle, this count must exceed the width of at least one subinterval. The algorithm updates the candidate interval to that subinterval and proceeds to the next pass. After all the passes, the candidate interval will contain at most one integer. ∎
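A schematic, offline Python rendering of this pass structure is below (hypothetical parameters; each loop over the stream stands in for one pass, and a real multi-pass streaming algorithm would keep only the counters between passes).

```python
def find_duplicate_multipass(stream, n, branching=2):
    """Deterministic multi-pass duplicate finding: keep a candidate interval,
    split it into `branching` parts each pass, count stream elements per part,
    and recurse into a part holding more elements than its width (pigeonhole)."""
    lo, hi = 1, n                                   # the stream holds > n values from [1, n]
    while hi > lo:
        width = -(-(hi - lo + 1) // branching)      # ceiling division
        counts = [0] * branching
        for x in stream:                            # one pass over the stream
            if lo <= x <= hi:
                counts[(x - lo) // width] += 1
        for j, c in enumerate(counts):
            sub_lo = lo + j * width
            if sub_lo > hi:
                break                               # remaining parts are empty
            sub_hi = min(hi, sub_lo + width - 1)
            if c > sub_hi - sub_lo + 1:             # more elements than distinct values
                lo, hi = sub_lo, sub_hi
                break
    return lo                                       # a value that must appear twice

print(find_duplicate_multipass([1, 3, 2, 5, 3, 4], n=5))   # -> 3
```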

4 Entropy Lower Bound for Find-Duplicate

Theorem 4.1.

Every zero-error randomized algorithm for Find-Duplicate that is -concentrated must use space.

By zero error, we mean that the algorithm never outputs a number which is not repeated. With probability one, it either outputs a valid output or declares failure.

Proof.

We use a reduction similar to that of the pseudo-deterministic case (cf. the proof of Claim 2). Using the exact same reduction from the proof of Claim 1, we get that a -concentrated streaming algorithm for Find-Duplicate using space gives a -concentrated protocol for One-Way-Find-Duplicate with communication complexity . If we can give a way to convert such a protocol for One-Way-Find-Duplicate into an -communication protocol for One-Way-Partial-Recovery, the desired lower bound follows from the lower bound on the communication complexity of One-Way-Partial-Recovery from Claim 3. We will now show how to make such a conversion by describing a protocol for One-Way-Partial-Recovery.

Alice sends Bob messages according to the protocol for One-Way-Find-Duplicate (that is, she simulates the protocol for One-Way-Find-Duplicate a total of times). Bob's goal is to use these messages to recover at least input elements of Alice. Towards this goal, he maintains a set of elements recovered so far (initially empty), and a family of “active sets” (initially containing the set ). While the size of is smaller than , Bob simulates the remainder of the One-Way-Find-Duplicate protocol on every possible pair where is a set in and is one of the messages of Alice. For each such pair where the protocol is successful in finding a duplicate element , Bob adds to , removes from , and adds to .

We now wish to prove that this protocol indeed lets Bob recover elements of Alice. Suppose Alice has input . For each set , define to be an element of that has probability at least of being outputted by Bob on input at the end of a One-Way-Find-Duplicate protocol. Let and be defined for . Note that these are predetermined: they are a function of Alice's input (and, in particular, not a function of the randomness she uses when choosing her messages). For a fixed , the probability of failure to recover from any of Alice's messages is at most . A failure to fill in with elements implies that for some , Bob failed to recover from all of Alice's messages. The probability that such a failure happens for a specific is at most . By setting the constant in the to be large enough, we can have this be at most , and so by a union bound the probability that there is an such that is not recovered by Bob is at most .

Thus, we obtain a protocol for One-Way-Partial-Recovery with communication complexity , and so , completing the proof. ∎

We obtain the following as immediate corollaries:

Corollary 4.2.

Any zero-error -entropy randomized algorithm for Find-Duplicate must use space.

Corollary 4.3.

Any zero-error -pseudo-deterministic algorithm for Find-Duplicate must use space.

Below we show that the above lower bound is tight up to log factors.

Theorem 4.4.

For all , there exists a zero-error randomized algorithm for Find-Duplicate using space (where hides factors polylogarithmic in ) that is -concentrated.

Proof.

Define the following algorithm, which we call A, for Find-Duplicate: pick a random position in the stream, remember the element at that position, and check whether it appears again later in the stream. If it does, return that element. Otherwise, declare failure.

The -concentrated algorithm is as follows: run copies of Algorithm A independently (in parallel), and then output the minimum of the (non-failure) outputs.
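A schematic, offline Python version of this construction is below (the streaming implementation remembers only the sampled position's element and a success flag per copy; the names and the concrete example stream are hypothetical).

```python
import random

def single_probe(stream, rng):
    """Remember the element at a random position; output it only if it appears
    again later in the stream (so the probe never outputs a non-duplicate)."""
    pos = rng.randrange(len(stream))
    target = stream[pos]
    return target if target in stream[pos + 1:] else None

def concentrated_find_duplicate(stream, copies=30, seed=0):
    """Run independent copies of the probe and output the minimum of the
    successful outputs; the minimum concentrates the output distribution."""
    rng = random.Random(seed)
    found = [y for y in (single_probe(stream, rng) for _ in range(copies)) if y is not None]
    return min(found) if found else None

print(concentrated_find_duplicate([4, 2, 7, 2, 9, 4, 2]))   # almost always 2
```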

We are left to show that this algorithm is indeed -concentrated.

Define to be a function where is the total number of times which shows up in the stream, and define . Note that then, the probability that is outputted by algorithm is , since will be outputted if chooses to remember one of the first .

Consider the smallest such that . We will show that the output is less than with high probability. It will follow that the algorithm is -concentrated, since of the smallest elements, at most are possible outputs (since if an element never repeats, it is not a possible output). So, with high probability, one of at most outputs (namely, the valid outputs less than or equal to ) will be outputted, and hence at least one of them will be outputted with probability at least .

The probability that the output is at most in a single run of algorithm is . So, the probability that in runs of algorithm, in at least one of them an element which is at most is outputted is , which is polynomially small in . Hence, with high probability an element which is at most (and there are valid outputs less than ) will be outputted.

4.1 Getting Rid of the Zero Error Requirement

A downside of Theorem 4.1 is that it shows a lower bound only for zero-error algorithms. In this section, we strengthen the theorem by getting rid of that requirement:

Theorem 4.5.

Every randomized algorithm for Find-Duplicate that is -concentrated and errs with probability at most must use space (for all ).

Proof overview:

We begin by outlining why the approach of Theorem 4.1 does not work without the zero-error requirement. Recall that the idea in the proof was to have Alice send many messages (for One-Way-Find-Duplicate) to Bob, and have Bob simulate the One-Way-Find-Duplicate algorithm (using simulated inputs he creates for himself) with these messages to find elements in Alice's input.

The problem is that the elements we end up removing from Bob's simulated input (recall that Bob simulates an input to the One-Way-Find-Duplicate problem, and then repeatedly finds elements he shares with Alice, removes them from the “fake” input, and thereby reconstructs a large fraction of Alice's input) depend on Alice's messages, and therefore we cannot use a union bound to bound the probability that the protocol failed for a certain simulated input. So, we want the elements we remove from Bob's fake input not to depend on the messages Alice sent. One idea to achieve this is to have Alice send a bunch of messages (for finding a shared element), and then Bob will remove the element that gets output the largest number of times (by simulating the protocol with each of the many messages Alice sent). The issue with this is that if the two most common outputs have very similar probability, the outputted element depends not only on Alice's input, but also on the randomness she uses when choosing what messages to send to Bob. This makes it again not possible to use a union bound.

There are two new ideas to fix this issue. The first is to use the following “threshold” technique: Bob will pick a random “threshold” T between and (where we wish to show a lower bound on -concentrated algorithms, and is the total number of messages Alice sends to Bob). He simulates the algorithm for One-Way-Find-Duplicate with all messages Alice sent him, and gets a list of outputs. Then, he will consider the “real” output to be the lexicographically first output which has more than copies in the list (note that since the algorithm is -concentrated, it is very unlikely for no such element to exist).

Now, it follows that with high probability, the shared element does not really depend on the messages. This is because with all but probability approximately , the threshold is far (more than away) from the frequency of every element in . We note that we pick since from noise we would expect the frequencies of elements in to change by up to , depending on the randomness of . We want the threshold to be further than that from the expected frequencies, so that with high probability there will be no element which sometimes has frequency above the threshold and sometimes has frequency below the threshold, depending on Alice's messages (recall that the goal is to make the outputs depend as little as possible on Alice's messages, and instead depend only on the shared randomness and on Alice's input).

This is still not enough for us: we still cannot use a union bound, as a fraction of the time Bob's output will depend on Alice's messages (and not just her input). The next idea resolves this. What Alice will do is send additional pieces of information: telling Bob where the chosen thresholds are bad, and what threshold to use instead. We assume that we have shared randomness, so Alice knows all of the thresholds that will be chosen by Bob (note that the recovery problem remains hard even in the presence of shared randomness, so this lower bound suffices). Now, Alice can tell for which executions the chosen threshold will be too close to the likelihood of an element. So, Alice will send approximately additional pieces of information telling Bob where the chosen thresholds are bad and what thresholds to use instead. By doing so, Alice has guaranteed that a path independent of her messages will be taken.

To recap, idea 1 is to use the threshold technique so that with probability what Bob does doesn’t depend on Alice’s messages (only on her input). Idea 2 is to have Alice tell Bob where these bad situations are, and how to fix them.

The total amount of information Alice sends (ignoring logarithmic factors) is (where is the message size we are assuming exists for pseudo-deterministically finding a shared element, and k is the number of messages Alice sends). The factor follows since of the time, short messages will be sent to Bob due to a different threshold. A threshold requires bits to describe, which can be dropped since we are ignoring log factors. Setting , we conclude that Alice sends a total of bits. This establishes a contradiction, since we need bits to solve One-Way-Partial-Recovery. So, whenever , we can pick a such that we get a contradiction.

Proof.

Below, we write out the full reduction, which uses the protocol for One-Way-Find-Duplicate as a subroutine.

  • Alice creates messages for One-Way-Find-Duplicate, and sends them to Bob (call these messages of type A).

  • Additionally, Alice looks at the thresholds in the shared randomness. Every time there is a threshold that is close (within ) to the expected number of times a certain element will be outputted on the corresponding input (that is, for each fake input Bob will try, Alice checks if the probability of outputting some element is close to the threshold; to be precise, say she checks if its probability of being outputted, assuming a randomly chosen message by Alice, is close to the threshold), she sends a message to Bob informing him about the bad threshold, and suggests a good threshold to be used instead (call these messages of type B). Notice that these messages do not depend on the messages of type A that Alice sends, and that each such message is of size .

  • Bob sets up his initial simulated input.

  • Bob uses each of the messages of type A that Alice sent, along with his simulated input, to construct a list of outputs.

  • Bob looks at the shared randomness to find a threshold (if Alice has informed him it is a bad threshold, he uses the threshold Alice suggests instead), and considers the lexicographically minimal output y that is contained in the multiset more than the threshold number of times (see the sketch after this list).

  • Bob removes y from the fake input and repeats the last three steps of the algorithm (this time using a new threshold).
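The selection rule in the last two steps can be sketched on its own as follows (a hypothetical, self-contained illustration; in the protocol, the outputs come from simulating One-Way-Find-Duplicate with each of Alice's type A messages, and the threshold comes from the shared randomness or from a type B message).

```python
from collections import Counter

def pick_output(outputs, threshold_count):
    """Bob's selection rule: from the multiset of outputs produced by the
    simulated executions, take the smallest value appearing more than
    threshold_count times. Returns None if nothing clears the threshold
    (unlikely when the underlying algorithm is concentrated)."""
    counts = Counter(outputs)
    candidates = [y for y, c in counts.items() if c > threshold_count]
    return min(candidates) if candidates else None

# hypothetical outputs of 10 simulated executions, with a threshold of 4
print(pick_output([7, 3, 7, 7, 3, 7, 7, 9, 7, 7], threshold_count=4))   # -> 7
```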

Claim 5.

The above protocol solves One-Way-Partial-Recovery with high probability using bits.

Proof.

First, we show that the total number of bits communicated is . Notice that the total number of messages of type A that are sent is . We assume that each of these is of size at most , giving us a total of bits sent in messages of type A. Under the assumption that , we see that this is bits in total for messages of type A.

We now count the total number of bits communicated in messages of type B. Each message of type B is of size (it describes a single element, and a number indicating which execution the message is relevant for, each requiring bits). So, we wish to show that with high probability the total number of messages of type B is . The total number of messages of type B that will be sent is , since for every input, the probability that the randomly chosen threshold (which is sampled using public randomness) is more than away from the frequency of every output is . Note that since , and we assume .

We are now left to show that the protocol correctly solves One-Way-Partial-Recovery with high probability. We will first show that, after fixing Alice's input and the public randomness, there is a single sequence of simulated inputs that Bob goes through with high probability. To do this, consider a certain input that Bob tries. We will bound the probability that there are two values and such that both and have probability at least of being outputted. Suppose there exist two such values; then at least one of them (say , without loss of generality) has to be the output of more than of the executions with probability more than , but less than . Additionally, we know that the expected number of times that it will be outputted out of the executions is more than away from the threshold (otherwise Alice will pick a different threshold such that this is true, and send that value to Bob in a message of type B). However, the probability of being more than away from the expectation is, by a Chernoff bound, (asymptotically) less than .

Notice also that, by the assumption that the algorithm is -concentrated, there will always be an output which is expected to appear at least of the time. Also, since the threshold is at most , the probability that it appeared fewer than times is exponentially small in , and so with high probability there will always exist an output which was outputted on more than of the executions; so in the second-to-last step, the multiset will always have an element that appears at least times.

Hence, by a union bound over all inputs that Bob tries, with high probability there will be a single sequence of inputs which Bob goes through (which depends only on the public thresholds and Alice's input).

We will show that each element generated by Bob is an element in Alice's input with high probability. Notice that the element that Bob picks has appeared more than times out of , where is at least . If it is not a valid output, then its probability of being outputted is . The probability it is outputted at least once is at most . Taking a union bound over the inputs that Bob tries (of which there are ), we get that the probability that there is an invalid output at any point is at most 1/10. So, with probability 9/10, no invalid output is ever outputted. ∎

5 Entropy lower bounds for finding a support element

Consider the turnstile model of streaming, where a vector starts out as the zero vector and receives updates of the form “increment a coordinate by 1” or “decrement a coordinate by 1”, and the goal is to output a nonzero coordinate of the final vector. This is a well-studied problem, and a common randomized algorithm to solve it in a small amount of space is known as ℓ_0-sampling [FIS08]. ℓ_0-sampling uses polylogarithmic space and outputs a uniformly random coordinate from the support of the vector. A natural question one could ask is whether the output of any low-space randomized algorithm is necessarily close to uniform, i.e., has high entropy. We answer this affirmatively and show a nearly tight tradeoff between the space needed to solve this problem and the entropy of the output of a randomized algorithm, under the assumption that the algorithm is not allowed to output anything outside the support. (We note that using similar ideas to those in Subsection 4.1, the zero-error requirement could be removed; we omit this adaptation since it is very similar to that of Subsection 4.1.)

Theorem 5.1.

Every zero-error randomized algorithm for Find-Support-Elem that is -concentrated must use space.

We only provide a sketch of the proof and omit details since they are nearly identical to the proof of Theorem 4.1.

Proof Sketch.

Let be such an algorithm that uses space. Just like the proof of Theorem 4.1, the way we show this lower bound is by illustrating that the algorithm can be used to obtain an -communication protocol for One-Way-Partial-Recovery, which combined with Claim 3 yields the desired result.

For every element in Alice’s input set , she streams ‘increment by 1’ and runs independent copies of on the input. She then sends the states of each these independent runs of to Bob, which is at most bits, to Bob. Bob maintains a set of states , initially filled with all of Alice’s messages. While he has not yet recovered elements, Bob picks a message and recovers in using algorithm . And for each , Bob resumes on state and streams ‘decrement by 1’ and adds the new state to , and deletes from .

The proof of correctness for why Bob indeed eventually recovers elements of Alice's input is identical to that in the proof of Theorem 4.1, thus giving a protocol for One-Way-Partial-Recovery and proving the statement. ∎

We can immediately conclude the following.

Corollary 5.2.

Any zero-error -entropy randomized algorithm for Find-Support-Elem must use space.

Corollary 5.3.

Any zero-error -pseudo-deterministic algorithm for Find-Support-Elem must use space.

This lower bound is also tight up to polylogarithmic factors, due to an algorithm nearly identical to the one from Theorem 4.4. In particular, we have:

Theorem 5.4.

For all , there exists a zero-error randomized algorithm for Find-Support-Elem using space that is -concentrated.

6 Space complexity of pseudo-deterministic ℓ_2-norm estimation

In this section, we once again consider the pseudo-deterministic complexity of ℓ_2 norm estimation in the sketching model. The algorithmic question here is to design a distribution D over matrices along with a function f so that for any input vector x, with high probability over S drawn from D, f(Sx) is a good approximation of the ℓ_2 norm of x.

Further, we want f(Sx) to behave pseudo-deterministically; i.e., we want f(Sx) to be a unique number with high probability.

Theorem 6.1.

The pseudo-deterministic sketching complexity of ℓ_2 norm estimation is .

The following query problem is key to our lower bound.

Definition 6.2 (Adaptive attack).

Let be some constant. Let S be a matrix with real-valued entries and f be some function. Now, consider the query model where an algorithm is allowed to specify a vector x as a query and is given f(Sx) as a response. The goal of the algorithm is to output a vector x for which f(Sx) fails to approximate the ℓ_2 norm of x to within the desired factor, in as few queries as possible. We call this algorithmic problem the adaptive attack problem.

We use a theorem on adaptive attacks on sketches proved in [HW13].

Theorem 6.3.

There is a -query protocol to solve the adaptive attack problem with probability at least , i.e., the problem in Definition 6.2 when .

Proof of Theorem 6.1.

Suppose D is a distribution over sketching matrices and f is a function with the property that the pair (D, f) gives a pseudo-deterministic sketching algorithm for ℓ_2 norm estimation. Henceforth, we use S to denote a random matrix sampled from D. Then there is a function g such that:

  1. g is an approximation of the ℓ_2 norm.

  2. On every input x, f(Sx) = g(x) with probability at least for some constant .

We will show that the sketching dimension must be large by deducing a contradiction when it is too small. Let be a parameter to be chosen later. Let be the (random) sequence of vectors obtained by the adaptive query protocol from Theorem 6.3 based on responses where , and let be the (random) output of the protocol. Note that the guarantee that hinges on assuming . From the guarantees of Theorem 6.3, for any fixed matrix and function such that for all , it is true with probability at least that . On the other hand, for any sequence of fixed vectors, for all of them with probability at least . Call this event . Let be the probability density function of and let be the probability density function of . This results in the following two estimates of .

On the one hand, and on the other hand, . The contradiction arises since cannot simultaneously be at least and at most , and hence cannot be . ∎

Corollary 6.4.

For any constant , any -concentrated sketching algorithm where the sketching matrix is can be turned into a pseudo-deterministic one by running independent copies of the sketch and outputting the majority answer. Thus, as an upshot of Theorem 6.1, we obtain a lower bound of on -concentrated algorithms for ℓ_2-norm estimation in the sketching model.

In contrast to Corollary 6.4, which says that -concentrated algorithms for ℓ_2 estimation in the sketching model need near-linear dimension, we show that there is an -dimension -concentrated sketching algorithm for the problem, thus exhibiting a “phase transition”.

Theorem 6.5.

For every constant , there is a distribution over matrices and a function giving an -space -concentrated sketching algorithm for ℓ_2-norm estimation.

Proof.

Let the true ℓ_2 norm of the input vector be . Run the classic sketching algorithm of [AMS99] for randomized ℓ_2 norm estimation with error and failure probability , where is the desired approximation ratio. This uses a sketch of dimension . Now, we describe the function we use. Take the output of the sketching algorithm of [AMS99] and return the number obtained by zeroing out all its bits beyond the first significant bits. (The parameters are chosen purely for safety reasons.) First, the outputted number is a approximation. Further, for each input, the output is one of two candidates with probability for every constant . This is because [AMS99] produces a -approximation to the true norm, and there are only two candidates for the most significant bits of any real number that lies in a sufficiently narrow interval. ∎
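The rounding step can be illustrated with the following small Python sketch (hypothetical parameters: it keeps the top three significant bits in base 2, whereas the proof fixes the number of retained bits as a function of the approximation error).

```python
import math

def rounded_estimate(randomized_estimate, keep_bits=3):
    """Keep only the top `keep_bits` significant (binary) bits of a randomized
    norm estimate and zero out the rest, so that all likely estimates for a
    given input collapse to one of at most two rounded values."""
    if randomized_estimate <= 0:
        return 0.0
    exponent = math.floor(math.log2(randomized_estimate)) - keep_bits + 1
    return math.floor(randomized_estimate / 2 ** exponent) * 2 ** exponent

# two runs of a randomized estimator whose true value is 100.0
print(rounded_estimate(100.4))   # 96.0
print(rounded_estimate(99.7))    # 96.0 (near a rounding boundary, one of two adjacent values)
```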

7 Pseudo-deterministic Upper Bounds

7.1 Finding a nonzero row

Given updates to a matrix M (where we assume ) that is initially zero, given in a turnstile stream such that all entries of M are always in the range , the problem Find-Nonzero-Row is to either output an index such that the corresponding row of M is nonzero, or output none if M is the zero matrix.

Theorem 7.1.

The randomized space complexity for Find-Nonzero-Row is , the pseudo-deterministic space complexity for Find-Nonzero-Row is , and the deterministic space complexity for Find-Nonzero-Row is .

Proof.

We will first show a randomized low-space algorithm for the problem, then we will show pseudo-deterministic upper and lower bounds, and then show the deterministic lower bound.

Randomized algorithm for Find-Nonzero-Row.

A randomized algorithm for this problem is given below. Note that the algorithm as stated does not have the desired space guarantee, but we will show how to use a pseudorandom generator of Nisan [Nis92] to convert it into one that uses low space.

  1. Sample a random vector v, with one entry per column of M, where each entry is an independently drawn integer in , and store it.

  2. Simulate a turnstile stream which maintains the vector y = Mv (one entry per row of M). In particular, consider the vector y, which is initially zero, and for each update to M of the form “add a value to entry (i, j)”, add that value times v_j to y_i. We run an ℓ_0-sampling algorithm [FIS08] on this simulated stream updating y, and return the output of the ℓ_0-sampler, which is close in total variation distance to a uniformly random element in the support of y.

In the above algorithm, step 1 is not low-space as stated. Before we give a way to perform step 1 in low space, we prove the correctness of the above algorithm. Suppose some row of M is nonzero, and let j be an index where that row is nonzero. Suppose all coordinates of v except for the j-th coordinate have been sampled; then there is at most one value of v_j for which the corresponding entry of Mv is zero, and so there is only a small probability that v_j equals that value, which means that if a row of M is nonzero, then the corresponding entry of Mv is nonzero except with small probability. In fact, by taking a union bound over all nonzero rows, we can conclude that the set of nonzero rows of M and the set of nonzero indices of Mv are exactly the same, except with probability bounded by .

Now we turn our attention to implementing step 1 in low space. Towards doing so, we use Nisan's pseudorandom generator for space-bounded computation in a very similar manner to [Ind06].

Instead of sampling and storing all the bits of v, we sample and store a uniformly random seed of length , and add the appropriate multiple of the corresponding pseudorandom entry of v to y whenever an update to M is received, where the pseudorandom entries are obtained from the function of Theorem 2.5 that maps the random seed to a sequence of bits. To prove that the algorithm is still correct if we use the pseudorandom vector instead of the uniformly random vector v, we must show that when a row of M is nonzero, then the corresponding entry of the maintained vector is nonzero with probability at least . Towards this, for a fixed row of M, consider the following finite state machine. The states are labeled by pairs consisting of a coordinate index and a running sum. The FSM takes a vector as input, starts at the initial state, transitions from state to state as it reads the coordinates, and outputs the final running sum. This establishes that, for a fixed row, the inner product with the input vector is computable by an FSM on states, and hence, from Theorem 2.5, the inner product under the uniformly random vector and under the pseudorandom vector are close in total variation distance, which means that when a row of M is nonzero, the corresponding entry of the maintained vector is nonzero except with probability bounded by .

A pseudo-deterministic algorithm and lower bound for Find-Nonzero-Row

The pseudo-deterministic algorithm is very similar to the randomized algorithm from the previous section.

  1. Sample a random vector v (one entry per column of M) where each entry is an independently drawn integer in . Store v and maintain the vector Mv.

  2. Output the smallest index such that the corresponding entry of Mv is nonzero.

Storing v takes space, and maintaining Mv takes space. Recall from the discussion surrounding the randomized algorithm that the set of nonzero indices of Mv and the set of nonzero rows of M are exactly the same with probability , which establishes correctness of the above pseudo-deterministic algorithm. The space complexity is thus , which is equal to under the assumption on the matrix dimensions.
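A minimal Python sketch of this pseudo-deterministic algorithm is below (hypothetical parameters; here v is generated from a stored seed with Python's default generator, whereas the actual algorithm derives v from Nisan's pseudorandom generator so that only a short seed is stored).

```python
import random

class NonzeroRowStream:
    """Maintain y = M v under turnstile updates to M, for a random vector v
    derived from a stored seed, and answer with the smallest index of a
    nonzero entry of y."""

    def __init__(self, n_rows, n_cols, seed=0, value_range=10**9):
        rng = random.Random(seed)                    # stands in for the PRG seed
        self.v = [rng.randrange(1, value_range) for _ in range(n_cols)]
        self.y = [0] * n_rows                        # y = M v, updated incrementally

    def update(self, i, j, delta):
        """Turnstile update: add delta to M[i][j]."""
        self.y[i] += delta * self.v[j]

    def query(self):
        """Smallest index of a nonzero row (with high probability), else None."""
        return next((i for i, yi in enumerate(self.y) if yi != 0), None)

alg = NonzeroRowStream(n_rows=3, n_cols=3, seed=7)
for i, j, d in [(1, 0, 5), (2, 2, -1), (1, 0, -5)]:  # row 1 cancels back to zero
    alg.update(i, j, d)
print(alg.query())    # 2: with high probability, the same answer on every rerun
```

The output is determined by which entries of Mv are nonzero, so reruns with fresh randomness agree with high probability; this is the "hash, then run a deterministic algorithm" structure discussed in Section 1.1.2.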

A pseudo-deterministic lower bound of follows immediately from Corollary 5.3, since Find-Nonzero-Row specialized to the case of a single column is the same as Find-Support-Elem.

Lower Bound for deterministic algorithms.

A space lower bound of bits for deterministic algorithms follows from a reduction from the communication complexity problem of Equality. Alice and Bob are each given bit strings as input, which they interpret as matrices A and B, respectively, where each entry is a chunk of bits. Suppose a deterministic algorithm takes bits of space to solve Find-Nonzero-Row. We will show that this can be converted to a -bit communication protocol to solve Equality. Alice runs the algorithm on a turnstile stream updating a matrix, initialized at 0, by adding her entries of A. Alice then sends the bits corresponding to the state of the algorithm to Bob, and he continues running the algorithm on updates that subtract his entries of B. The algorithm outputs none if and only if A = B, and thus Bob outputs the answer to Equality based on the output of the algorithm. Due to a communication complexity lower bound of on Equality, the space must be .

7.2 Point Query Estimation and Inner Product Estimation

In this section, we give pseudo-deterministic algorithms that beat the deterministic lower bounds for two closely related streaming problems — point query estimation and inner product estimation.

Point Query Estimation.

Given a parameter and a stream of elements where each element comes from a universe , followed by a query , output such that where is the frequency of element in the stream.

Inner Product Estimation.

Given a parameter and a stream of updates to two (initially zero-valued) vectors given in an insertion-only stream (a stream where only increments by positive numbers are promised), output an estimate of their inner product satisfying .

In the above problems, we will be interested in the regime where .

Our main result regarding a pseudo-deterministic algorithm for point query estimation is:

Theorem 7.2.

There is an -space pseudo-deterministic algorithm for point query estimation with the following precise guarantees. For every sequence in , there is a sequence such that

  1. For all , where is the frequency of in the stream.

  2. Except with probability , for all outputs on query .

We remark that the deterministic complexity of the problem is (see Theorem 7.7).

Towards establishing Theorem 7.2, we recall two facts.

Theorem 7.3 (Misra–Gries algorithm [Mg82]).

Given a parameter and a length- stream of elements in , there is a deterministic -space algorithm that, given any query , outputs an estimate that is within of the number of occurrences of the query in the stream. An additional guarantee that the algorithm satisfies is the following, which we call permutation invariance. Consider a stream and, for any permutation of its elements, consider the permuted stream. When the algorithm is given the first stream as input, consider its output on a query, and when the algorithm is given the second stream as input, consider its output on the same query. The algorithm has the guarantee that the two outputs are equal.
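For reference, a standard textbook rendering of the Misra-Gries algorithm recalled in Theorem 7.3 is below (the example stream and the number of counters are hypothetical).

```python
def misra_gries(stream, k):
    """Misra-Gries summary with k counters: each element's estimated count is
    below its true count by at most len(stream) / (k + 1)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            for y in list(counters):     # decrement all counters; drop zeros
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters

summary = misra_gries([1, 2, 1, 3, 1, 2, 1, 4, 1], k=2)
print(summary.get(1, 0))   # 3: an underestimate of the true count 5, off by at most 3
```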