A fundamental paradigm in the design and analysis of symmetric encryption schemes is the following two-step process: (1) design a symmetric encryption scheme assuming the availability of a uniformly-random permutation; (2) analyze the security of the scheme when the uniformly-random permutation is replaced by a uniformly-random function.
Step (1) relies on the widely-believed existence of pseudorandom permutations (see, for example, [1, 2]), which are efficiently-computable and efficiently-invertible keyed permutations $\{\pi_k\}_{k \in \mathcal{K}}$ over $\{0,1\}^n$ that are computationally indistinguishable from a uniformly-random permutation in a standard cryptographic sense, where $\mathcal{K}$ is the set of all possible keys $k$. Pseudorandom permutations are realized via a variety of known practical constructions, such as the well-studied and standardized Advanced Encryption Standard, for which $n = 128$.
Step (2) relies on the fact that a uniformly-random function can serve as a perfectly-secure one-time pad for the encryption of an exponentially-large number of messages. For example, assuming that two parties secretly share a uniformly-random permutation $\pi$ over $\{0,1\}^n$ (this would correspond to actually sharing a key for a pseudorandom permutation), they can use the widely-deployed counter mode for the encryption of multiple messages, and encrypt their $i$th message $m_i$ as the pair $(i, \pi(i) \oplus m_i)$. Modifying the scheme by replacing its random permutation $\pi$ with a random function $f$ enables one to argue that an attacker observing a sequence of ciphertexts obtains no information on their corresponding messages. Note, however, that these ciphertexts result from the modified scheme that uses the function $f$, and not from the original one that uses the permutation $\pi$. Thus, it must be argued that the security of the modified scheme provides a meaningful guarantee for the security of the original one.
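As a concrete illustration of the scheme just described, the following toy Python sketch encrypts the $i$th message as the pair $(i, \pi(i) \oplus m_i)$. The permutation is tabulated explicitly, which is feasible only for toy values of $n$; all names are illustrative, and a real deployment would instantiate $\pi$ with a pseudorandom permutation such as AES.

```python
import secrets

def counter_mode_encrypt(perm, messages):
    """Toy counter mode: the i-th message m_i (an n-bit integer) is
    encrypted as the pair (i, perm(i) XOR m_i)."""
    return [(i, perm(i) ^ m) for i, m in enumerate(messages)]

def counter_mode_decrypt(perm, ciphertexts):
    """Decrypt by re-deriving perm(i) and XOR-ing it away."""
    return [perm(i) ^ c for i, c in ciphertexts]

# A uniformly-random permutation over {0,1}^n, represented as an
# explicit table (only feasible for small n).
n = 8
table = list(range(2 ** n))
secrets.SystemRandom().shuffle(table)
perm = lambda x: table[x]

msgs = [0x12, 0x34, 0x56]
cts = counter_mode_encrypt(perm, msgs)
assert counter_mode_decrypt(perm, cts) == msgs
```

Replacing `perm` by a random function would make each pad value $f(i)$ an independent uniform string, which is exactly the one-time-pad situation analyzed in step (2).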
The switching lemma. The security of the modified scheme and that of the original scheme are tied together via a simple argument, commonly referred to as the “switching lemma”. This lemma captures the advantage of distinguishing between a random permutation and a random function. For an algorithm (an attacker) that observes $q$ ciphertexts, this translates to upper bounding its advantage in distinguishing a sequence of $q$ values that are sampled uniformly with replacement from $\{0,1\}^n$ (corresponding to the values $f(1), \ldots, f(q)$ in the case of a random function $f$) from a sequence of $q$ values that are sampled uniformly without replacement from $\{0,1\}^n$ (corresponding to the values $\pi(1), \ldots, \pi(q)$ in the case of a random permutation $\pi$). The distinguishing advantage of such an algorithm is defined as the dissimilarity between the distributions of its output as induced under the two cases. Note that the total variation distance between these two distributions is $\Theta(q^2/2^n)$, and this serves as a tight bound on the distinguishing advantage when no restrictions are placed on the distinguisher.
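The total variation distance between the two distributions above admits a closed form: a short calculation shows that it equals the birthday-collision probability $1 - \prod_{i=0}^{q-1}(1 - i/2^n)$, which is at most $q(q-1)/2^{n+1}$. The following sketch (illustrative parameter values) computes it exactly for small domains:

```python
def tv_with_vs_without_replacement(N, q):
    """Exact total variation distance between q uniform samples from a
    size-N set drawn with replacement and without replacement.  A direct
    computation shows it equals the collision probability
    1 - prod_{i=0}^{q-1} (1 - i/N)."""
    p = 1.0
    for i in range(q):
        p *= 1 - i / N
    return 1 - p

N = 2 ** 16
for q in [2, 16, 256]:
    tv = tv_with_vs_without_replacement(N, q)
    birthday = q * (q - 1) / (2 * N)
    # The Theta(q^2 / 2^n) upper bound from the switching lemma.
    assert tv <= birthday
```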
This implies, in particular, that encryption in the widely-deployed counter mode cannot be used when the number $q$ of encrypted messages approaches $2^{n/2}$. In fact, the switching lemma is applicable, and places rather similar bounds on the number of encrypted messages, not only for symmetric encryption in the above-described counter mode but also for other fundamental modes of encryption. We refer the reader to the work of Jaeger and Tessaro [3] for an in-depth discussion of the cryptographic applications of the switching lemma.
The streaming switching lemma. As discussed above, the bound provided by the switching lemma is tight when no restrictions are placed on the distinguisher. Specifically, the following simple algorithm achieves the bound: When given a sequence of values as input, the algorithm outputs $1$ if there is some value that appears more than once (i.e., if a “collision” exists), and outputs $0$ otherwise. Note that when given a sequence of $q$ values that are sampled uniformly with replacement this algorithm outputs $1$ with probability $\Theta(q^2/2^n)$, and when given a sequence of values that are sampled uniformly without replacement this algorithm always outputs $0$. However, a significant drawback of this algorithm is that it needs an internal memory of size $q \cdot n$ bits for storing the entire sequence in order to identify whether or not a collision exists.
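The collision-finding distinguisher just described can be sketched as follows, together with a Monte-Carlo estimate of its advantage (the parameter values and the helper names are illustrative):

```python
import random

def collision_distinguisher(stream):
    """Outputs 1 iff some value repeats; stores the whole stream."""
    seen = set()
    for x in stream:
        if x in seen:
            return 1
        seen.add(x)
    return 0

def estimate_advantage(N, q, trials=2000, seed=0):
    """Empirical gap between the output distributions under sampling
    with replacement (random function) and without (permutation)."""
    rng = random.Random(seed)
    hits_f = sum(collision_distinguisher(rng.choices(range(N), k=q))
                 for _ in range(trials))   # with replacement
    hits_p = sum(collision_distinguisher(rng.sample(range(N), q))
                 for _ in range(trials))   # without replacement: always 0
    return (hits_f - hits_p) / trials

adv = estimate_advantage(N=1024, q=64)
```

With these toy parameters the advantage is close to the birthday-collision probability, but the `seen` set grows to hold up to $q$ values, which is exactly the $q \cdot n$-bit memory cost noted above.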
This observation motivated Jaeger and Tessaro [3] to refine the framework of the switching lemma by restricting the amount of internal memory used by the distinguisher. That is, they analyzed the advantage of distinguishing the above two distributions where: (1) the values are provided one by one in a streaming manner, and (2) the internal memory of the distinguisher is restricted to at most $S$ bits. The most interesting regime is where there is a noticeable gap between $S$ and $q \cdot n$, which is motivated by the fact that large amounts of data cannot always be stored in their entirety.
Known bounds. Jaeger and Tessaro [3] proved a conditional upper bound on the distinguishing advantage of any streaming algorithm that uses at most $S$ bits of internal memory. Specifically, they introduced a combinatorial conjecture regarding certain hypergraphs, and showed that, based on their conjecture, the advantage of any such distinguisher is at most $O(S \cdot q / 2^n)$, when measured as the KL divergence between the output distributions of the memory-bounded streaming algorithm under the two cases. Applying Pinsker's inequality, this implies an upper bound of $O(\sqrt{S \cdot q / 2^n})$ when measured via the total variation distance, which is more standard for cryptographic applications.
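Pinsker's inequality, used in the step above, bounds the total variation distance by $\sqrt{D(P \| Q)/2}$ (with the KL divergence in nats). A quick numerical sanity check on random discrete distributions (illustrative only):

```python
import math, random

def kl(p, q):
    """KL divergence in nats between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

rng = random.Random(1)
for _ in range(100):
    raw_p = [rng.random() + 1e-9 for _ in range(8)]
    raw_q = [rng.random() + 1e-9 for _ in range(8)]
    p = [x / sum(raw_p) for x in raw_p]
    q = [x / sum(raw_q) for x in raw_q]
    # Pinsker's inequality: TV(P, Q) <= sqrt(KL(P || Q) / 2).
    assert tv(p, q) <= math.sqrt(kl(p, q) / 2) + 1e-12
```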
In a follow-up work, Dinur [4] proved an unconditional upper bound of $O(\sqrt{S \cdot q / 2^n} \cdot \log q)$ on the distinguishing advantage of any such algorithm, when measured as the total variation distance between the output distributions of the memory-bounded streaming algorithm under the two cases. Note that this should be compared to the $O(\sqrt{S \cdot q / 2^n})$ upper bound on the total variation distance obtained by applying Pinsker's inequality to the result of Jaeger and Tessaro [3].
Dinur’s result is based on reducing the task of distinguishing between these two distributions via a memory-bounded algorithm to constructing communication-efficient protocols for the two-party set-disjointness problem. Three decades of extensive research on the communication complexity of this canonical problem (e.g., [5, 6, 7]) have recently led to new lower bounds [8] on which Dinur relied via his reduction.
Our contributions. We present an information-theoretic and unconditional proof showing that the distinguishing advantage of any streaming algorithm that uses at most $S$ bits of internal memory is at most , measured via KL divergence as in the work of Jaeger and Tessaro [3]. When for any constant , we obtain an improved upper bound of , which is asymptotically tight with respect to the KL divergence.
Moreover, we prove our results within a more refined framework that considers the accumulated memory usage of streaming algorithms throughout their computation, and not only their worst-case memory usage. This shows that any non-negligible advantage must be obtained by using a substantial amount of internal memory on average throughout the computation, and not only in the worst case.
II. Setup and Main Results
All logarithms in this paper are to the natural base unless denoted otherwise in a subscript. For two probability distributions $P$ and $Q$ on a common discrete alphabet $\mathcal{X}$, where $P$ is absolutely continuous with respect to $Q$, the KL divergence is defined as $D(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$. For probability distributions $P_{Y|X}$ and $Q_{Y|X}$ on a common discrete alphabet $\mathcal{Y}$, where $P_{Y|X=x}$ is absolutely continuous with respect to $Q_{Y|X=x}$, we further define the conditional divergence as $D(P_{Y|X} \| Q_{Y|X} \mid P_X) = \sum_{x} P_X(x) \, D(P_{Y|X=x} \| Q_{Y|X=x})$. The mutual information between $X$ and $Y$ with respect to the probability distribution $P_{XY}$ is $I(X;Y) = D(P_{Y|X} \| P_Y \mid P_X)$, where $P_X$ and $P_Y$ are the marginals induced by $P_{XY}$.
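These definitions can be checked numerically; the sketch below (with an arbitrary toy joint distribution) verifies that the conditional-divergence form of mutual information agrees with $D(P_{XY} \,\|\, P_X \times P_Y)$:

```python
import math

def kl(p, q):
    """KL divergence in nats; requires p absolutely continuous w.r.t. q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A toy joint distribution P_{XY} on a 2x3 alphabet.
joint = [[0.10, 0.20, 0.10],
         [0.25, 0.05, 0.30]]
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

# Mutual information as D(P_{XY} || P_X x P_Y).
flat_joint = [v for row in joint for v in row]
flat_prod = [px * py for px in p_x for py in p_y]
mi = kl(flat_joint, flat_prod)

# Equivalently, the conditional divergence D(P_{Y|X} || P_Y | P_X).
cond = sum(p_x[x] * kl([joint[x][y] / p_x[x] for y in range(3)], p_y)
           for x in range(2))
assert abs(mi - cond) < 1e-9
```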
Setup. For stating our results, we briefly describe the notion of memory-bounded streaming indistinguishability introduced by Jaeger and Tessaro [3], as well as our refinement that considers accumulated memory usage. For an algorithm $\mathcal{A}$ and a sequence $x^q = (x_1, \ldots, x_q)$, the streaming computation of $\mathcal{A}$ on $x^q$ is defined via the following process:
Set $\sigma_0 = \epsilon$, where $\epsilon$ is the empty string, and for each $t = 1, \ldots, q$ set $\sigma_t = \mathcal{A}(\sigma_{t-1}, x_t)$.
We abuse notation and denote the output of this computation by $\mathcal{A}(x^q) = \sigma_q$. Following Jaeger and Tessaro, we say that an algorithm $\mathcal{A}$ is $S$-memory-bounded if for every input $x^q$ and for every $t$ it holds that $|\sigma_t| \le S$, where $|\sigma_t|$ is the bit length of the internal state $\sigma_t$. For our purpose of considering accumulated memory usage, we naturally extend this notion to that of an $(S_1, \ldots, S_q)$-memory-bounded algorithm, where for every input $x^q$ and for every $t$ it holds that $|\sigma_t| \le S_t$. From this point on, and without loss of generality, we assume that for any $(S_1, \ldots, S_q)$-memory-bounded algorithm it holds that $S_t \le t \cdot n$ for all $t$.¹ ¹For any sequence $(S_1, \ldots, S_q)$, we may recursively define by and . Then, any $(S_1, \ldots, S_q)$-memory-bounded algorithm with internal states can be transformed into an -memory-bounded algorithm with internal states by defining if and defining otherwise, where is the input sequence (i.e., the current input can always be stored explicitly together with the previous state instead of updating the state). Note that the transformed algorithm perfectly simulates the execution of the original one for any input, and thus achieves the same distinguishing advantage.
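The streaming computation and the memory-bounded condition can be phrased operationally as in the following sketch; the state is a byte string purely for illustration, and all names are hypothetical:

```python
def run_streaming(algo, xs, state_bits_log):
    """Run a streaming computation: the state starts as the empty string
    and is updated as state = algo(state, x) for each stream element x.
    The bit length of every intermediate state is recorded."""
    state = b""
    for x in xs:
        state = algo(state, x)
        state_bits_log.append(8 * len(state))
    return state  # the final state serves as the algorithm's output

def is_memory_bounded(log, bounds):
    """Check the (S_1, ..., S_q)-memory-bounded condition |state_t| <= S_t."""
    return all(bits <= s for bits, s in zip(log, bounds))

# Toy algorithm: keep (at most) the 4 most recent bytes of the stream.
algo = lambda state, x: (state + bytes([x]))[-4:]
log = []
run_streaming(algo, [7, 1, 2, 8, 3, 9], log)
assert is_memory_bounded(log, [32] * 6)   # 32-memory-bounded
```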
From this point on, we let $P$ and $Q$ denote the probability distributions on $(\{0,1\}^n)^q$ corresponding to sampling the sequence $X^q = (X_1, \ldots, X_q)$ uniformly with and without replacement, respectively, from $\{0,1\}^n$. Namely, under $P$ we have that $X_t \sim \mathrm{Unif}(\{0,1\}^n)$ independently for $t = 1, \ldots, q$, whereas under $Q$ we have that $X_1, \ldots, X_q$ are the first $q$ entries of a uniform random permutation on $\{0,1\}^n$. The distribution of the algorithm's output under $P$ (respectively $Q$) is denoted by $P_{\mathcal{A}}$ (respectively $Q_{\mathcal{A}}$).
Main results. The following theorem states our main result, upper bounding the distinguishing advantage of any memory-bounded streaming algorithm, when measured via KL divergence:
For any , and such that for all , and for any -memory-bounded algorithm it holds that
In particular, when for any constant , then also , and we obtain the following corollary:
For any , constant , and such that for all , and for any -memory-bounded algorithm it holds that
Finally, for this range of parameters we observe that our bound is nearly tight:
For any , and such that for all , there exists an -memory-bounded algorithm for which
III. Proof of Theorem 1
Our proof is based on an induction argument showing that , where the mutual information is computed with respect to $Q$, and $\sigma_t$ is the state of the internal memory at step $t$ of the computation. Then, we leverage the fact that form a Markov chain in this order, and that due to the memory constraints, in order to derive an information-bottleneck [9] upper bound on .
III-A. An Induction Argument
We prove the following lemma, which is similar to a lemma proved by Jaeger and Tessaro [3].
Let $P$ and $Q$ be two distributions on $(\{0,1\}^n)^q$, where the induced marginals satisfy $P_{X_t} = Q_{X_t}$ for all $t$, and in addition $P$ is memoryless (i.e., under the distribution $P$ the random variables $X_1, \ldots, X_q$ are independent, where each $X_t$ is distributed according to the marginal distribution $P_{X_t}$). For a streaming computation performed by the algorithm $\mathcal{A}$, let $\sigma_t$ be the random variable corresponding to the state produced in the $t$th step of the computation. Then
where the mutual information is computed with respect to the joint distribution induced by $Q$.
Proof. By definition of we have that . Moreover, since is obtained by processing , the data processing inequality yields
Applying the chain rule yields
where (1) follows from the fact that $P$ is memoryless, so that under this distribution $X_t$ is statistically independent of , and (2) follows from the assumption that . Thus, by induction we obtain that . Recalling that and that , our claim follows.
III-B. An Information-Bottleneck Argument
We make use of the following functions:
For $p \in [0,1]$, the binary entropy function (with respect to the natural base) is $h(p) = -p \log p - (1-p)\log(1-p)$, and we let $h^{-1}$ be its inverse restricted to $[0, \tfrac{1}{2}]$.
For we let .
For we let
and for we let .
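As a numerical companion to the definitions above, the binary entropy function and its inverse on $[0, 1/2]$ can be realized as follows; the bisection-based inverse is our own illustrative implementation, not part of the formal development:

```python
import math

def h(p):
    """Binary entropy in nats."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def h_inv(y, tol=1e-12):
    """Inverse of h restricted to [0, 1/2], computed by bisection
    (valid since h is increasing on this interval)."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if h(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Sanity checks: h_inv inverts h, and h is non-decreasing on [0, 1/2].
assert abs(h_inv(h(0.3)) - 0.3) < 1e-9
xs = [i / 1000 for i in range(501)]
vals = [h(x) for x in xs]
assert all(a <= b + 1e-15 for a, b in zip(vals, vals[1:]))
```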
We claim that is non-decreasing over , that is non-decreasing and convex, and that for every it holds that . We defer the proofs to Section V. We now state and prove our main technical lemma.
Let $N \ge q$ be integers, and let $X^q$ be the random process of sampling $q$ elements of $[N]$ uniformly without replacement. Denote , , and let $T$ be a random variable such that form a Markov chain in this order. Then, it holds that
Proof. We first note that
Consequently, we derive a lower bound on in terms of . To that end, we first compute the distribution for and . We have
It follows that
Defining the random variables , we further write
where in the last step we used the convexity of . Next, using the fact that conditioning reduces entropy, we note that
Note that dictate the elements that belong to . Let be the order in which these elements appear. Together, and completely determine , and vice versa. We have that
Recalling that and , and using the convexity of , we obtain
Next, we simplify the bound of Lemma 5.
In the setting of Lemma 5, if and then
To that end, we will use the following well-known estimate (which can be proved using Stirling's approximation; see, e.g., [10]):
In particular, for large enough it holds that
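The estimate in question is an instance of the standard entropy bound $\binom{n}{k} \le e^{\,n\,h(k/n)}$ for natural-base binary entropy $h$, which can be verified numerically (an illustrative sketch, not part of the formal proof):

```python
import math

def h(p):
    """Binary entropy in nats."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

# Standard entropy estimate for binomial coefficients:
#   log C(n, k) <= n * h(k / n)   (logarithms in the natural base),
# tight up to an additive O(log n) term by Stirling's formula.
for n in [10, 50, 200]:
    for k in range(n + 1):
        log_binom = (math.lgamma(n + 1) - math.lgamma(k + 1)
                     - math.lgamma(n - k + 1))
        assert log_binom <= n * h(k / n) + 1e-9
```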
Let . Using the monotonicity of , the bound of Lemma 5 reads as
Let . Due to the convexity of it holds that
where . Recall that and that is increasing at , and hence
and we can further upper bound (8) by
Denoting and recalling the definition of , we further develop (9)
where in (10) we used the inequality that holds for every . Since we can estimate
Since we assumed that , it also holds that , and we conclude that
III-C. Application to Our Setup
Finally, we can derive Theorem 1 from Corollary 6. Recall that $Q$ and $P$ designate the probability distributions corresponding to sampling $X^q$ uniformly without and with replacement, respectively, from $\{0,1\}^n$. Thus, $P_{X_t} = Q_{X_t}$ for all $t$, and furthermore, $P$ is a memoryless distribution. Hence, the conditions of Lemma 4 hold, and , where the mutual information is with respect to $Q$. Now, recalling that forms a Markov chain in this order, that under $Q$ the sequence $X^q$ is a random process of sampling elements of $\{0,1\}^n$ uniformly without replacement, and that by the constraints on the internal memory, we can apply Corollary 6 to obtain
IV. Proof of Theorem 3
Informally, given such that for all , we construct an -memory-bounded algorithm that stores a list of values that it has seen, where every new value is added to the list if the state size allows it. More formally, with a loss of at most bits per each , we may assume that is of the form for an integer for all . We recall that we assume that for all and that , thus it holds that for all and it holds that . We also assume without loss of generality that (i.e., the final output of $\mathcal{A}$ is a single bit). For we define the computation as follows:
If , output or according to whether or , respectively.
Else, if the first bit of is , output .
Else, parse .
If , output .
Else, if , output .
Else, output .
Note that for this algorithm and that , so it holds that
and this settles the proof of Theorem 3.
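The list-based distinguisher described above can be summarized in the following toy sketch; the parameter names are illustrative, and the capacity parameter stands in for the roughly $S/n$ values that fit in $S$ bits of state:

```python
import random

def list_distinguisher(stream, capacity):
    """Bounded-memory streaming distinguisher: keep a list of up to
    `capacity` previously seen values; output 1 on the first collision
    with a stored value, and 0 if the stream ends without one."""
    stored = []
    for x in stream:
        if x in stored:
            return 1          # collision found: guess "random function"
        if len(stored) < capacity:
            stored.append(x)  # store only while the memory bound allows
    return 0

# Under sampling without replacement there are no repeated values at
# all, so the output is always 0; under sampling with replacement a
# collision with one of the ~S/n stored values occurs with probability
# roughly q * (S/n) / 2^n.
rng = random.Random(2)
N, q, cap = 512, 64, 8
assert list_distinguisher(rng.sample(range(N), q), cap) == 0
```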
V. Proofs for the Properties of the Auxiliary Functions
In this section we give proofs for the properties of the auxiliary functions that we used.
We start by showing that is increasing over . Indeed , so as long as . We then show that is increasing over . Recall that , and the claim follows from the fact that is increasing and is increasing over . Next, we show that is convex by showing that its derivative is increasing. It holds that
Computing the derivative, for we get
So is increasing, thus is increasing and is convex as claimed. Finally, we show that for every it holds that . When it simply holds that . When , it holds that , thus we need to show that . Let . Then, (when ), so is concave over . Together with the fact that we get that when .
-  O. Goldreich, Foundations of Cryptography – Volume 1: Basic Techniques. Cambridge University Press, 2001.
-  J. Katz and Y. Lindell, Introduction to Modern Cryptography (2nd Edition). CRC Press, 2014.
-  J. Jaeger and S. Tessaro, “Tight time-memory trade-offs for symmetric encryption,” in Advances in Cryptology – EUROCRYPT, 2019, pp. 467–497.
-  I. Dinur, “On the streaming indistinguishability of a random permutation and a random function,” to appear in Advances in Cryptology – EUROCRYPT, 2020.
-  Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar, “An information statistics approach to data stream and communication complexity,” J. Comput. Syst. Sci., vol. 68, no. 4, pp. 702–732, 2004.
-  B. Kalyanasundaram and G. Schnitger, “The probabilistic communication complexity of set intersection,” SIAM J. Discrete Math., vol. 5, no. 4, pp. 545–557, 1992.
-  A. A. Razborov, “On the distributional complexity of disjointness,” Theor. Comput. Sci., vol. 106, no. 2, pp. 385–390, 1992.
-  M. Göös and T. Watson, “Communication complexity of set-disjointness for all probabilities,” Theory of Computing, vol. 12, no. 1, pp. 1–23, 2016.
-  N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in 37th Annual Allerton Conference on Communications, Control, and Computing, 1999, pp. 368–377.
-  F. MacWilliams and N. Sloane, The Theory of Error-Correcting Codes. Elsevier Science, 1977.