# An Information-Theoretic Proof of the Streaming Switching Lemma for Symmetric Encryption

Motivated by a fundamental paradigm in cryptography, we consider a recent variant of the classic problem of bounding the distinguishing advantage between a random function and a random permutation. Specifically, we consider the problem of deciding whether a sequence of q values was sampled uniformly with or without replacement from [N], where the decision is made by a streaming algorithm restricted to using at most s bits of internal memory. In this work, the distinguishing advantage of such an algorithm is measured by the KL divergence between the distributions of its output as induced under the two cases. We show that for any s = Ω(log N) the distinguishing advantage is upper bounded by O(q · s / N), and even by O(q · s / (N log N)) when q ≤ N^(1−ϵ) for any constant ϵ > 0, where it is nearly tight with respect to the KL divergence.

## I Introduction

A fundamental paradigm in the design and analysis of symmetric encryption schemes is the following two-step process: (1) Design a symmetric encryption scheme assuming the availability of a uniformly-random permutation; (2) Analyze the security of the scheme assuming that the permutation is switched to a uniformly-random function.

Step (1) relies on the widely-believed existence of pseudorandom permutations (see, for example, [1, 2]), which are efficiently-computable and efficiently-invertible keyed permutations over that are computationally indistinguishable from a uniformly-random permutation in a standard cryptographic sense, where is the set of all possible keys . Pseudorandom permutations are realized via a variety of known practical constructions, such as the well-studied and standardized Advanced Encryption Standard for which .

Step (2) relies on the fact that a uniformly-random function can serve as a perfectly-secure one-time pad for the encryption of an exponentially-large number of messages. For example, assuming that two parties secretly share a uniformly-random permutation over (this would correspond to actually sharing a key for a pseudorandom permutation), they can use the widely-deployed counter mode for the encryption of multiple messages, and encrypt their th message as the pair . Modifying the scheme by replacing its random permutation with a random function enables one to argue that an attacker observing a sequence of ciphertexts obtains no information on their corresponding messages . Note, however, that these ciphertexts result from the modified scheme that uses the function , and not from the original one that uses the permutation . Thus, it must be argued that the security of the modified scheme provides a meaningful guarantee for the security of the original one.

The switching lemma. The security of the modified scheme and that of the original scheme are tied together via a simple argument, commonly referred to as the “switching lemma”. This lemma captures the advantage of distinguishing between a random permutation and a random function. For an algorithm (an attacker) that observes ciphertexts, this translates to upper bounding its advantage in distinguishing a sequence of values that are sampled uniformly with replacement from (corresponding to the values in the case of a random function ) from a sequence of values that are sampled uniformly without replacement from (corresponding to the values in the case of a random permutation ). The distinguishing advantage of such an algorithm is defined by the dissimilarity between the distribution of its output as induced under the two cases. Note that the total variation distance between these two distributions is , and this serves as a tight bound on the distinguishing advantage when no restrictions are placed on the distinguisher.

This implies, in particular, that encryption in the widely-deployed counter mode cannot be used when the number is approaching messages. In fact, the switching lemma is applicable, and places rather similar bounds on the number of encrypted messages, not only for symmetric encryption in the above-described counter mode but also for other fundamental modes of encryption. We refer the reader to the work of Jaeger and Tessaro [3] for an in-depth discussion of the cryptographic applications of the switching lemma.

The streaming switching lemma. As discussed above, the bound provided by the switching lemma is tight when no restrictions are placed on the distinguisher. Specifically, the following simple algorithm achieves the bound: When given a sequence of values as input, the algorithm outputs if there is some value that appears more than once (i.e., if a “collision” exists), and outputs otherwise. Note that when given a sequence of values that are sampled uniformly with replacement this algorithm outputs with probability , and when given a sequence of values that are sampled uniformly without replacement this algorithm always outputs . However, a significant drawback of this algorithm is that it needs an internal memory of size bits for storing the entire sequence in order to identify whether or not a collision exists.
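The collision-based distinguisher described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the convention that 1 means "collision found", and the use of a Python set as the unbounded internal memory are all choices made for this example. The helper `no_collision_probability` computes the standard probability that q uniform draws from a domain of size n are all distinct.

```python
import random


def collision_distinguisher(values):
    """Return 1 if some value repeats in the sequence, else 0.

    The 1/0 output convention is an assumption of this sketch.
    The set `seen` plays the role of the (unbounded) internal memory.
    """
    seen = set()
    for v in values:
        if v in seen:
            return 1
        seen.add(v)
    return 0


def sample_with_replacement(n, q, rng):
    """q independent uniform draws from {0, ..., n-1}."""
    return [rng.randrange(n) for _ in range(q)]


def sample_without_replacement(n, q, rng):
    """First q entries of a uniformly random permutation of {0, ..., n-1}."""
    return rng.sample(range(n), q)


def no_collision_probability(n, q):
    """Exact probability that q uniform draws from n values are all distinct."""
    p = 1.0
    for i in range(q):
        p *= 1.0 - i / n
    return p
```

On a sequence sampled without replacement the sketch always returns 0, while on a sequence sampled with replacement it returns 1 with probability `1 - no_collision_probability(n, q)`. The memory cost is visible in `seen`, which may grow to hold all q values.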

This observation motivated Jaeger and Tessaro [3] to refine the framework of the switching lemma by restricting the amount of internal memory used by the distinguisher. That is, they analyzed the advantage of distinguishing the above two distributions where: (1) the values are provided one by one in a streaming manner, and (2) the internal memory of the distinguisher is restricted to at most bits. The most interesting regime is where there is a noticeable gap between and , which is motivated by the fact that large amounts of data cannot always be stored in their entirety.

Known bounds. Jaeger and Tessaro proved a conditional upper bound on the distinguishing advantage of any streaming algorithm that uses at most $s$ bits of internal memory. Specifically, they introduced a combinatorial conjecture regarding certain hypergraphs, and showed that based on their conjecture the advantage of any such distinguisher is at most , when measured as the KL divergence between the output distributions of the memory-bounded streaming algorithm under the two cases. Applying Pinsker’s inequality, this implies an upper bound of when measured via the total variation distance, which is more standard for cryptographic applications.

In a follow-up work, Dinur [4] proved an unconditional upper bound of on the distinguishing advantage of any such algorithm, when measured as the total variation distance between the output distributions of the memory-bounded streaming algorithm under the two cases. Note that this should be compared to the upper bound on the total variation distance obtained by applying Pinsker’s inequality to the result of Jaeger and Tessaro.

Dinur’s result is based on reducing the task of distinguishing between these two distributions via a memory-bounded algorithm to constructing communication-efficient protocols for the two-party set-disjointness problem. Three decades of extensive research on the communication complexity of this canonical problem (e.g., [5, 6, 7]) have recently led to new lower bounds [8] on which Dinur relied via his reduction.

Our contributions. We present an information-theoretic and unconditional proof showing that the distinguishing advantage of any streaming algorithm that uses at most $s$ bits of internal memory is at most $O(q \cdot s / N)$, measured via KL divergence as in the work of Jaeger and Tessaro [3]. When $q \le N^{1-\epsilon}$ for any constant $\epsilon > 0$, we obtain an improved upper bound of $O(q \cdot s / (N \log N))$, which is asymptotically tight with respect to the KL divergence.

Moreover, we prove our results within a more refined framework that considers the accumulated memory usage of streaming algorithms throughout their computation, and not only their worst-case memory usage. This shows that any non-negligible advantage must be obtained by using a substantial amount of internal memory on average throughout the computation, and not only in the worst case.

## II Setup and Main Results

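Notation. All logarithms in this paper are to the natural base unless denoted otherwise in a subscript. For two probability distributions $P$ and $Q$ on a common discrete alphabet $\mathcal{X}$, where $P$ is absolutely continuous with respect to $Q$, the KL-divergence is defined as $D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$. For probability distributions $P_{XY}$ and $Q_{XY}$ on a common discrete alphabet $\mathcal{X} \times \mathcal{Y}$, where $P_{XY}$ is absolutely continuous with respect to $Q_{XY}$, we further define the conditional divergence as $D_{\mathrm{KL}}(P_{Y|X} \,\|\, Q_{Y|X} \mid P_X) = \sum_{x \in \mathcal{X}} P_X(x) \, D_{\mathrm{KL}}(P_{Y|X=x} \,\|\, Q_{Y|X=x})$. The mutual information between $X$ and $Y$ with respect to the probability distribution $P_{XY}$ is $I(X;Y) = D_{\mathrm{KL}}(P_{Y|X} \,\|\, P_Y \mid P_X)$, where $P_X$ and $P_Y$ are the marginals of $P_{XY}$.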
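As a small numerical companion to these definitions, the following Python sketch computes the KL divergence (in nats) and the mutual information of finite distributions. It uses the standard identity that the mutual information equals the KL divergence between the joint distribution and the product of its marginals. The dictionary-based representation and the function names are assumptions of this example only.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in nats; p and q are dicts mapping symbols to probabilities.

    Assumes p is absolutely continuous with respect to q.
    """
    total = 0.0
    for x, px in p.items():
        if px > 0.0:
            total += px * math.log(px / q[x])
    return total


def mutual_information(joint):
    """I(X;Y) in nats; joint is a dict mapping (x, y) pairs to probabilities."""
    px, py = {}, {}
    for (x, y), v in joint.items():
        px[x] = px.get(x, 0.0) + v
        py[y] = py.get(y, 0.0) + v
    # Product of the two marginals, defined on the full product alphabet.
    product = {(x, y): px[x] * py[y] for x in px for y in py}
    return kl_divergence(joint, product)
```

For instance, two perfectly correlated fair bits have mutual information log 2, and two independent fair bits have mutual information 0.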

Setup. For stating our results we briefly describe the notion of memory-bounded streaming indistinguishability, introduced by Jaeger and Tessaro [3], as well as our refinement that considers accumulated memory usage. For an algorithm and a sequence , , the streaming computation of on is defined via the following process:

• Set , where is the empty string.

• For :

• Let .

• Output .
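The process above can be sketched in Python as a simple fold over the input stream. This is only an illustration under stated assumptions: the state is modeled as a tuple of integers rather than a bit string, its "size" is counted in stored values rather than bits, and the step function is assumed to take only the previous state and the current input value. The names `streaming_computation` and `keep_last_k` are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

State = Tuple[int, ...]  # stand-in for the bit-string state of the algorithm


def streaming_computation(step: Callable[[State, int], State],
                          xs: Iterable[int]) -> Tuple[State, List[int]]:
    """Run `step` over the stream, starting from the empty state.

    Returns the final state (the output of the computation) and the
    list of state sizes after each step, so that a per-step memory
    budget can be checked.
    """
    state: State = ()
    sizes: List[int] = []
    for x in xs:
        state = step(state, x)
        sizes.append(len(state))
    return state, sizes


def keep_last_k(k: int) -> Callable[[State, int], State]:
    """Toy step function that remembers the last k >= 1 values seen."""
    def step(state: State, x: int) -> State:
        return (state + (x,))[-k:]
    return step
```

Recording the per-step sizes mirrors the refinement discussed in this section, where the memory bound may differ from step to step instead of being a single worst-case value.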

We abuse notation and denote the output of this computation by . Following Jaeger and Tessaro, we say that an algorithm is -memory-bounded if for every input and for every it holds that , where is the bit length of the internal state . For our purpose of considering accumulated memory usage, we naturally extend this notion to that of an -memory-bounded algorithm, where for every input and for every it holds that . From this point, and without loss of generality, we assume that for any -memory-bounded algorithm it holds that for all and that it holds that .¹

¹ For any sequence , we may recursively define by and . Then, any -memory-bounded algorithm with internal states can be transformed into an -memory-bounded algorithm with internal states by defining if and defining otherwise, where is the input sequence (i.e., can always be stored explicitly together with the previous state instead of updating the state to ). Note that perfectly simulates the execution of for any input, and thus achieves the same distinguishing advantage.

From this point on we let $Q$ and $P$ denote the probability distributions on $[N]^q$ corresponding to sampling the sequence $X_1,\dots,X_q$ uniformly with and without replacement, respectively, from $[N]$. Namely, under $Q$ we have that $X_i \sim \mathrm{Uniform}([N])$ independently for $i = 1,\dots,q$, whereas under $P$ we have that $X_1,\dots,X_q$ are the first $q$ entries of a uniform random permutation on $[N]$. The distribution of the algorithm’s output under $P$ (respectively $Q$) is denoted by $P_A$ (respectively $Q_A$).

Main results. The following theorem states our main result, upper bounding the distinguishing advantage of any memory-bounded streaming algorithm, when measured via KL divergence:

###### Theorem 1

For any , and such that for all , and for any -memory-bounded algorithm it holds that

$$D_{\mathrm{KL}}(P_A \,\|\, Q_A) \le (1+o(1)) \cdot \frac{\sum_{i=1}^{q-1} s_i + q \cdot \log_2 N}{N \log_2 (N/q)}.$$

In particular, when $q \le N^{1-\epsilon}$ for any constant $\epsilon > 0$, then also $\log_2(N/q) \ge \epsilon \cdot \log_2 N$, and we obtain the following corollary:

###### Corollary 2

For any , constant , and such that for all , and for any -memory-bounded algorithm it holds that

$$D_{\mathrm{KL}}(P_A \,\|\, Q_A) \le (1+o(1)) \cdot \frac{\sum_{i=1}^{q-1} s_i + q \cdot \log_2 N}{\epsilon \cdot N \log_2 N}.$$

Finally, for this range of parameters we observe that our bound is nearly tight:

###### Theorem 3

For any , and such that for all , there exists an -memory-bounded algorithm for which

$$D_{\mathrm{KL}}(P_A \,\|\, Q_A) \ge \frac{\sum_{i=1}^{q-1} s_i - q \cdot (\log_2 N + 1)}{N \log_2 N}.$$
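To get a feel for the two bounds, the following Python sketch evaluates the leading term of the upper bound of Theorem 1 (dropping the 1+o(1) factor, so it is only indicative) next to the lower bound of Theorem 3. The function names and the parameter values in the usage example are assumptions chosen for illustration; `sizes` stands for the per-step memory budgets for the first q−1 steps.

```python
import math


def kl_upper_bound(n, q, sizes):
    """Leading term of the Theorem 1 bound, ignoring the (1 + o(1)) factor.

    n: domain size N, q: stream length, sizes: budgets s_1, ..., s_{q-1} in bits.
    """
    return (sum(sizes) + q * math.log2(n)) / (n * math.log2(n / q))


def kl_lower_bound(n, q, sizes):
    """Right-hand side of the Theorem 3 bound for the same parameters."""
    return (sum(sizes) - q * (math.log2(n) + 1)) / (n * math.log2(n))


# Example parameters (illustrative only): N = 2^20, q = 2^10, 400 bits per step.
N, Q_LEN = 2 ** 20, 2 ** 10
budgets = [400] * (Q_LEN - 1)
upper = kl_upper_bound(N, Q_LEN, budgets)
lower = kl_lower_bound(N, Q_LEN, budgets)
```

With these example parameters the two expressions differ by a small constant factor, in line with the statement that the bound is nearly tight in this range of parameters.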

## III Proof of Theorem 1

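Our proof is based on an induction argument showing that $D_{\mathrm{KL}}(P_A \,\|\, Q_A) \le \sum_{i=1}^{q} I(X_i;\Sigma_{i-1})$, where the mutual information is computed with respect to $P$, and $\Sigma_i$ is the state of the internal memory at step $i$ of the computation. Then, we leverage the fact that $X_{i+1} \leftrightarrow (X_1,\dots,X_i) \leftrightarrow \Sigma_i$ form a Markov chain in this order, and that $I(X_1,\dots,X_i;\Sigma_i) \le H(\Sigma_i) \le s_i \log 2$ due to the memory constraints, in order to derive an information bottleneck [9] upper bound on $I(X_{i+1};\Sigma_i)$.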

### III-A An Induction Argument

We prove the following lemma which is similar to a lemma proved by Jaeger and Tessaro [3].

###### Lemma 4

Let $P$ and $Q$ be two distributions on $[N]^q$, where the induced marginals satisfy $P_{X_i} = Q_{X_i}$ for all $i \in [q]$, and in addition $Q = \prod_{i=1}^{q} Q_{X_i}$ (i.e., under the distribution $Q$ the variables $X_1,\dots,X_q$ are independent, where each $X_i$ is distributed according to the distribution $P_{X_i}$). For a streaming computation performed by the algorithm $A$, let $\Sigma_i$ be the random variable corresponding to the state produced in the $i$th step of the computation. Then

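$$D_{\mathrm{KL}}(P_A \,\|\, Q_A) \le \sum_{i=1}^{q} D_{\mathrm{KL}}(P_{X_i|\Sigma_{i-1}} \,\|\, P_{X_i} \mid P_{\Sigma_{i-1}}) = \sum_{i=1}^{q} I(X_i;\Sigma_{i-1}),$$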

where the mutual information is computed with respect to the joint distribution $P_{X_i \Sigma_{i-1}}$, induced by $P$.

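Proof. By definition of $P_A$ and $Q_A$ we have that $D_{\mathrm{KL}}(P_A \,\|\, Q_A) = D_{\mathrm{KL}}(P_{\Sigma_q} \,\|\, Q_{\Sigma_q})$. Moreover, since $\Sigma_q$ is obtained by processing $(X_q, \Sigma_{q-1})$, the data processing inequality yields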

$$D_{\mathrm{KL}}(P_{\Sigma_q} \,\|\, Q_{\Sigma_q}) \le D_{\mathrm{KL}}(P_{X_q \Sigma_{q-1}} \,\|\, Q_{X_q \Sigma_{q-1}}).$$

Applying the chain rule yields

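$$
\begin{aligned}
D_{\mathrm{KL}}(P_{X_q \Sigma_{q-1}} \,\|\, Q_{X_q \Sigma_{q-1}})
&= D_{\mathrm{KL}}(P_{\Sigma_{q-1}} \,\|\, Q_{\Sigma_{q-1}}) + D_{\mathrm{KL}}(P_{X_q|\Sigma_{q-1}} \,\|\, Q_{X_q|\Sigma_{q-1}} \mid P_{\Sigma_{q-1}}) \\
&= D_{\mathrm{KL}}(P_{\Sigma_{q-1}} \,\|\, Q_{\Sigma_{q-1}}) + D_{\mathrm{KL}}(P_{X_q|\Sigma_{q-1}} \,\|\, Q_{X_q} \mid P_{\Sigma_{q-1}}) \qquad (1) \\
&= D_{\mathrm{KL}}(P_{\Sigma_{q-1}} \,\|\, Q_{\Sigma_{q-1}}) + D_{\mathrm{KL}}(P_{X_q|\Sigma_{q-1}} \,\|\, P_{X_q} \mid P_{\Sigma_{q-1}}), \qquad (2)
\end{aligned}
$$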

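where (1) follows from the fact that $Q$ is memoryless, such that under this distribution $X_q$ is statistically independent of $\Sigma_{q-1}$, and (2) follows from the assumption that $Q_{X_q} = P_{X_q}$. Thus, by induction we obtain that $D_{\mathrm{KL}}(P_{\Sigma_q} \,\|\, Q_{\Sigma_q}) \le D_{\mathrm{KL}}(P_{\Sigma_0} \,\|\, Q_{\Sigma_0}) + \sum_{i=1}^{q} D_{\mathrm{KL}}(P_{X_i|\Sigma_{i-1}} \,\|\, P_{X_i} \mid P_{\Sigma_{i-1}})$. Recalling that $\Sigma_0$ is the deterministic empty string, so that $D_{\mathrm{KL}}(P_{\Sigma_0} \,\|\, Q_{\Sigma_0}) = 0$, and that $P_A = P_{\Sigma_q}$ and $Q_A = Q_{\Sigma_q}$, our claim follows.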

### III-B An Information-Bottleneck Argument

We make use of the following functions:

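• For $x \in [0,1]$ the binary entropy function (with respect to the natural basis) is

$$h_2(x) = -x \log(x) - (1-x) \log(1-x),$$

and we let $h_2^{-1} : [0, \log 2] \to [0, 1/2]$ be its inverse restricted to $[0, 1/2]$.

• For $x \in [0,1]$ we let $f(x) = -(1-x)\log(1-x)$.

• For $t \in [0, \log 2]$ we let

$$\varphi(t) = f(h_2^{-1}(t)) = -(1 - h_2^{-1}(t)) \log(1 - h_2^{-1}(t)),$$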

and for we let .

We claim that $f$ is non-decreasing over $[0, 1/2]$, that $\varphi$ is non-decreasing and convex, and that for every $x \in [0,1]$ it holds that $f(x) \ge \varphi(h_2(x))$. We defer the proofs to Section V. We now state and prove our main technical lemma.
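The claimed properties can be checked numerically with a short Python sketch. It is a sanity check on a grid, not a proof. The sketch assumes that the inverse of the binary entropy is taken on [0, 1/2] and computes it by bisection, and it reads f from the display above as f(x) = −(1−x) log(1−x). All function names are chosen for this example.

```python
import math


def h2(x):
    """Binary entropy in nats, with h2(0) = h2(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1 - x) * math.log(1 - x)


def h2_inv(t, iters=80):
    """Inverse of h2 on [0, 1/2] for t in [0, log 2], found by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(iters):
        mid = (lo + hi) / 2
        if h2(mid) < t:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def f(x):
    """f(x) = -(1 - x) log(1 - x), with f(1) = 0 by continuity."""
    return 0.0 if x >= 1.0 else -(1 - x) * math.log(1 - x)


def phi(t):
    """phi(t) = f(h2_inv(t)) for t in [0, log 2]."""
    return f(h2_inv(t))
```

Because the binary entropy is flat near 1/2, the bisection inverse loses precision as t approaches log 2, so any numerical comparison should allow a small tolerance.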

###### Lemma 5

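Let $1 \le i < N$ be integers, let $X_1,\dots,X_{i+1}$ be the random process of sampling $i+1$ elements of $[N]$ uniformly without replacement. Denote $V = (X_1,\dots,X_i)$, $W = X_{i+1}$, and let $\Gamma$ be a random variable such that $W \leftrightarrow V \leftrightarrow \Gamma$ form a Markov chain in this order. Then, it holds that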

$$I(W;\Gamma) \le \log\frac{N}{N-i} - \frac{N}{N-i} \cdot \varphi\!\left(\frac{\log\binom{N}{i}}{N} - \frac{I(V;\Gamma)}{N}\right).$$

Proof. We first note that

$$I(W;\Gamma) = H(W) - H(W|\Gamma) = \log N - H(W|\Gamma), \qquad (3)$$

and that

$$I(V;\Gamma) = H(V) - H(V|\Gamma) = \log\frac{N!}{(N-i)!} - H(V|\Gamma). \qquad (4)$$

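Consequently, we derive a lower bound on $H(W|\Gamma)$ in terms of $H(V|\Gamma)$. To that end, we first compute the distribution $\Pr(W=j \mid \Gamma=\gamma)$ for $j \in [N]$ and $\gamma$ in the support of $\Gamma$. We have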

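$$
\begin{aligned}
\Pr(W=j \mid \Gamma=\gamma) &= \Pr(W=j,\, j \notin V \mid \Gamma=\gamma) \\
&= \Pr(j \notin V \mid \Gamma=\gamma) \Pr(W=j \mid j \notin V,\, \Gamma=\gamma) \\
&= \Pr(j \notin V \mid \Gamma=\gamma) \Pr(W=j \mid j \notin V) \\
&= \frac{1 - \Pr(j \in V \mid \Gamma=\gamma)}{N-i}.
\end{aligned}
$$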

It follows that

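$$
\begin{aligned}
H(W \mid \Gamma=\gamma) &= \sum_{j=1}^{N} \Pr(W=j \mid \Gamma=\gamma) \log\frac{1}{\Pr(W=j \mid \Gamma=\gamma)} \\
&= \sum_{j=1}^{N} \frac{1 - \Pr(j \in V \mid \Gamma=\gamma)}{N-i} \log\frac{N-i}{1 - \Pr(j \in V \mid \Gamma=\gamma)} \\
&= \log(N-i) + \frac{1}{N-i} \sum_{j=1}^{N} f(\Pr(j \in V \mid \Gamma=\gamma)) \\
&\ge \log(N-i) + \frac{1}{N-i} \sum_{j=1}^{N} \varphi(h_2(\Pr(j \in V \mid \Gamma=\gamma))). \qquad (5)
\end{aligned}
$$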

Defining the random variables $A_j = \mathbb{1}\{j \in V\}$ for $j \in [N]$, we further write

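$$\sum_{j=1}^{N} \varphi(h_2(\Pr(j \in V \mid \Gamma=\gamma))) = \sum_{j=1}^{N} \varphi(H(A_j \mid \Gamma=\gamma)) \ge N \varphi\!\left(\frac{1}{N} \sum_{j=1}^{N} H(A_j \mid \Gamma=\gamma)\right), \qquad (6)$$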

where in the last step we used the convexity of $\varphi$. Next, using the fact that conditioning reduces entropy, we note that

$$\sum_{j=1}^{N} H(A_j \mid \Gamma=\gamma) \ge \sum_{j=1}^{N} H(A_j \mid A_1,\dots,A_{j-1},\, \Gamma=\gamma) = H(A_1,\dots,A_N \mid \Gamma=\gamma).$$

Note that $A_1,\dots,A_N$ dictate the elements that belong to $V$. Let $\pi$ be the order at which these elements appear. Together, $(A_1,\dots,A_N)$ and $\pi$ completely determine $V$, and vice versa. We have that

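$$
\begin{aligned}
H(A_1,\dots,A_N \mid \Gamma=\gamma) &= H(A_1,\dots,A_N,\pi \mid \Gamma=\gamma) - H(\pi \mid A_1,\dots,A_N,\, \Gamma=\gamma) \\
&= H(V \mid \Gamma=\gamma) - H(\pi \mid A_1,\dots,A_N,\, \Gamma=\gamma) \\
&\ge H(V \mid \Gamma=\gamma) - \log(i!).
\end{aligned}
$$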

Plugging this into (6) (using the monotonicity of $\varphi$), and then into (5), we obtain

$$H(W \mid \Gamma=\gamma) \ge \log(N-i) + \frac{N}{N-i} \varphi\!\left(\frac{H(V \mid \Gamma=\gamma) - \log(i!)}{N}\right).$$

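Recalling that $H(W|\Gamma) = \mathbb{E}_{\gamma}[H(W \mid \Gamma=\gamma)]$ and $H(V|\Gamma) = \mathbb{E}_{\gamma}[H(V \mid \Gamma=\gamma)]$, and using the convexity of $\varphi$, we obtain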

$$H(W|\Gamma) \ge \log(N-i) + \frac{N}{N-i} \varphi\!\left(\frac{H(V|\Gamma) - \log(i!)}{N}\right), \qquad (7)$$

and the statement follows by plugging (7) into (3) using (4).

Next, we simplify the bound of Lemma 5.

###### Corollary 6

In the setting of Lemma 5, if and then

$$I(W;\Gamma) \le (1+o(1)) \cdot \frac{I(V;\Gamma) + \log N}{N \log(N/i)}.$$

Proof. We will use the following well-known estimate (proved using Stirling’s approximation; see, e.g., [10]):

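$$N h_2\!\left(\frac{i}{N}\right) - \frac{1}{2}\log\!\left(8i\left(1 - \frac{i}{N}\right)\right) \le \log\binom{N}{i} \le N h_2\!\left(\frac{i}{N}\right) - \frac{1}{2}\log\!\left(2\pi i\left(1 - \frac{i}{N}\right)\right).$$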

In particular, for large enough $N$ it holds that

$$\log\binom{N}{i} \ge N h_2\!\left(\frac{i}{N}\right) - \log N.$$
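The two-sided estimate and its simplified consequence can be checked numerically. The following Python sketch is an illustration only: it computes the log-binomial coefficient through `math.lgamma` and compares it with both sides of the estimate for 0 < i < N. The function names are assumptions of this example.

```python
import math


def h2(x):
    """Binary entropy in nats, with h2(0) = h2(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1 - x) * math.log(1 - x)


def log_binom(n, i):
    """Natural logarithm of the binomial coefficient C(n, i)."""
    return math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)


def stirling_bounds(n, i):
    """Lower and upper estimates of log C(n, i) for 0 < i < n."""
    a = i / n
    main = n * h2(a)
    lower = main - 0.5 * math.log(8 * i * (1 - a))
    upper = main - 0.5 * math.log(2 * math.pi * i * (1 - a))
    return lower, upper


def simplified_lower(n, i):
    """The weaker bound N * h2(i/N) - log N used in the proof."""
    return n * h2(i / n) - math.log(n)
```

The simplified bound follows from the lower estimate whenever 8·i·(1 − i/N) ≤ N², which holds for every N ≥ 2 since i·(1 − i/N) ≤ N/4.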

Let $\alpha = i/N$. Using the monotonicity of $\varphi$, the bound of Lemma 5 is upper bounded by

$$-\log(1-\alpha) - \frac{1}{1-\alpha} \varphi\!\left(h_2(\alpha) - \frac{I(V;\Gamma) + \log N}{N}\right). \qquad (8)$$

Let $\beta = \frac{I(V;\Gamma) + \log N}{N}$. Due to the convexity of $h_2^{-1}$ it holds that

$$h_2^{-1}(h_2(\alpha) - \beta) \ge \alpha - g(\alpha) \cdot \beta,$$

where $g(\alpha) = (h_2^{-1})'(h_2(\alpha))$. Recall that $\varphi = f \circ h_2^{-1}$ and that $f$ is increasing on $[0, 1/2]$, and hence

$$\varphi(h_2(\alpha) - \beta) \ge f(\alpha - g(\alpha) \cdot \beta),$$

and we can further upper bound (8) by

$$-\log(1-\alpha) - \frac{1}{1-\alpha} f(\alpha - g(\alpha) \cdot \beta). \qquad (9)$$

Denoting $\delta = \frac{g(\alpha) \cdot \beta}{1-\alpha}$ and recalling the definition of $f$, we further develop (9):

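$$
\begin{aligned}
-\log(1-\alpha) + \frac{1 - \alpha + g(\alpha) \cdot \beta}{1-\alpha} \log(1 - \alpha + g(\alpha) \cdot \beta)
&= -\log(1-\alpha) + (1+\delta) \log((1-\alpha)(1+\delta)) \\
&= \delta \log(1-\alpha) + (1+\delta) \log(1+\delta) \\
&\le (1+\delta)\,\delta \qquad (10) \\
&= \frac{1+\delta}{1-\alpha}\, g(\alpha)\, \beta,
\end{aligned}
$$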

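where in (10) we used the inequality $\log(1+\delta) \le \delta$ that holds for every $\delta > -1$, together with $\delta \log(1-\alpha) \le 0$. Since $\alpha = o(1)$ we can estimate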

$$g(\alpha) = (h_2^{-1})'(h_2(\alpha)) = \frac{1}{h_2'(\alpha)} = \frac{1}{\log(1-\alpha) - \log\alpha} = \frac{1+o(1)}{\log(N/i)}.$$

Since we assumed that , it also holds that , and we conclude that

$$I(W;\Gamma) \le (1+o(1)) \cdot \frac{I(V;\Gamma) + \log N}{N \log(N/i)}.$$

### III-C Application to Our Setup

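Finally, we can derive Theorem 1 from Corollary 6. Recall that $P$ and $Q$ designate the probability distributions corresponding to sampling uniformly without and with replacement, respectively, from $[N]$. Thus, $P_{X_i} = Q_{X_i}$ for all $i \in [q]$, and furthermore, $Q$ is a memoryless distribution. Thus, the conditions of Lemma 4 hold, and $D_{\mathrm{KL}}(P_A \,\|\, Q_A) \le \sum_{i=1}^{q} I(X_i;\Sigma_{i-1})$, where the mutual information is with respect to $P$. Now, recalling that $X_{i+1} \leftrightarrow (X_1,\dots,X_i) \leftrightarrow \Sigma_i$ forms a Markov chain in this order, and that under $P$ we have that $X_1,\dots,X_{i+1}$ is a random process of sampling $i+1$ elements of $[N]$ uniformly without replacement, and that $I(X_1,\dots,X_i;\Sigma_i) \le H(\Sigma_i) \le s_i \log 2$ by the constraints on the internal memory, we can apply Corollary 6 (using $\log(N/i) \ge \log(N/q)$ for $i \le q$) to obtain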

$$I(X_{i+1};\Sigma_i) \le (1+o(1)) \cdot \frac{s_i \log 2 + \log N}{N \log(N/q)},$$

and Theorem 1 follows by summing over all $i \in \{1,\dots,q-1\}$, noting that $I(X_1;\Sigma_0) = 0$ since $\Sigma_0$ is the deterministic empty string.

## IV Proof of Theorem 3

Informally, given such that for all we construct an -memory-bounded algorithm that stores a list of values that it saw, where every new value is added to the list if the state size allows it. More formally, with a loss of at most bits per each , we may assume that is of the form for an integer for all . We remind that we assume that for all and that , thus it holds that for all and it holds that . We also assume without loss of generality that (i.e., the final output of is a single bit). For we define the computation as follows:

• If , output or according to whether or , respectively.

• Else, if the first bit of is , output .

• Else, parse .

• If , output .

• Else, if , output .

• Else, output .

Note that for this algorithm and that , so it holds that

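$$
\begin{aligned}
D_{\mathrm{KL}}(P_A \,\|\, Q_A) &= P[A(X)=0] \log\!\left(\frac{P[A(X)=0]}{Q[A(X)=0]}\right) \\
&= -\log(Q[A(X)=0]) \\
&= -\log\!\left(\prod_{i=1}^{q-1}\left(1 - \frac{k_i}{N}\right)\right) \\
&= -\sum_{i=1}^{q-1} \log\!\left(1 - \frac{k_i}{N}\right) \\
&\ge \sum_{i=1}^{q-1} \frac{k_i}{N} \\
&\ge \frac{\sum_{i=1}^{q-1} s_i - q \cdot (\log_2 N + 1)}{N \log_2 N},
\end{aligned}
$$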

and this settles the proof of Theorem 3.
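The list-based construction can be illustrated with the following Python sketch. It is a simplified model under stated assumptions, not the exact algorithm of the proof: the state is a Python list of stored values rather than an encoded bit string, the budget at each step is given directly as a maximum number of stored values (the hypothetical `capacities` schedule), and output 1 means that a repeated value was detected. The helper `exact_kl` evaluates the closed form −Σ log(1 − k_i/N) from the derivation above, where k_i is the number of values stored when the next input arrives.

```python
import math


def list_distinguisher(xs, capacities):
    """Return 1 if an input equals a currently stored value, else 0.

    capacities[i] is the maximum number of values kept after reading xs[i]
    (a hypothetical schedule standing in for the bit budgets s_i).
    """
    stored = []
    for x, cap in zip(xs, capacities):
        if x in stored:
            return 1
        stored.append(x)
        del stored[cap:]  # keep the new value only if the budget allows it
    return 0


def list_sizes(capacities):
    """Number of stored values after each step when no collision is seen."""
    sizes, k = [], 0
    for cap in capacities:
        k = min(k + 1, cap)
        sizes.append(k)
    return sizes


def exact_kl(n, sizes):
    """-sum_i log(1 - k_i / n) over the list sizes k_1, ..., k_{q-1}."""
    return -sum(math.log(1 - k / n) for k in sizes)
```

For a stream of length q, the divergence of this sketch is `exact_kl(n, list_sizes(capacities)[:-1])`, since the size after the final step no longer affects the output. Under sampling without replacement the sketch never outputs 1, which is why only the probability of output 0 under sampling with replacement enters the formula.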

## V Proofs for the properties of f and φ

In this section we give proofs for the properties of $f$ and $\varphi$ that we used.

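We start by showing that $f$ is increasing over $[0, 1/2]$. Indeed $f'(x) = \log(1-x) + 1$, so $f'(x) > 0$ as long as $x < 1 - 1/e$, and in particular for every $x \le 1/2$. Now, we show $\varphi$ is increasing over $[0, \log 2]$. Recall that $\varphi = f \circ h_2^{-1}$, and the claim follows from the fact that $h_2^{-1}$ is increasing and $f$ is increasing over $[0, 1/2]$. Next, we show that $\varphi$ is convex by showing that its derivative is increasing. It holds that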

$$\varphi'(t) = \frac{f'(h_2^{-1}(t))}{(h_2)'(h_2^{-1}(t))}.$$

Thus, $\varphi'(t) = p(h_2^{-1}(t))$ where

$$p(x) = \frac{f'(x)}{(h_2)'(x)} = \frac{\log(1-x) + 1}{\log(1-x) - \log x}.$$

Computing the derivative, for $x \in (0, 1/2)$ we get

$$p'(x) = \frac{1 - h_2(x)}{x(1-x)(\log(1-x) - \log x)^2} > 0.$$

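So $p$ is increasing, thus $\varphi'$ is increasing and $\varphi$ is convex as claimed. Finally, we show that for every $x \in [0,1]$ it holds that $f(x) \ge \varphi(h_2(x))$. When $x \le 1/2$ it simply holds that $\varphi(h_2(x)) = f(x)$. When $x > 1/2$, it holds that $h_2^{-1}(h_2(x)) = 1-x$, thus we need to show that $f(x) \ge f(1-x)$. Let $d(x) = f(x) - f(1-x)$. Then, $d''(x) = \frac{1}{x} - \frac{1}{1-x} < 0$ (when $x > 1/2$), so $d$ is concave over $[1/2, 1]$. Together with the fact that $d(1/2) = d(1) = 0$ we get that $d(x) \ge 0$ when $x \in [1/2, 1]$.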

## References

• [1] O. Goldreich, Foundations of Cryptography – Volume 1: Basic Techniques.   Cambridge University Press, 2001.
• [2] J. Katz and Y. Lindell, Introduction to Modern Cryptography (2nd Edition).   CRC Press, 2014.
• [3] J. Jaeger and S. Tessaro, “Tight time-memory trade-offs for symmetric encryption,” in Advances in Cryptology – EUROCRYPT, 2019, pp. 467–497.
• [4] I. Dinur, “On the streaming indistinguishability of a random permutation and a random function,” To Appear in Advances in Cryptology – EUROCRYPT, 2020.
• [5] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar, “An information statistics approach to data stream and communication complexity,” J. Comput. Syst. Sci., vol. 68, no. 4, pp. 702–732, 2004.
• [6] B. Kalyanasundaram and G. Schnitger, “The probabilistic communication complexity of set intersection,” SIAM J. Discrete Math., vol. 5, no. 4, pp. 545–557, 1992.
• [7] A. A. Razborov, “On the distributional complexity of disjointness,” Theor. Comput. Sci., vol. 106, no. 2, pp. 385–390, 1992.
• [8] M. Göös and T. Watson, “Communication complexity of set-disjointness for all probabilities,” Theory of Computing, vol. 12, no. 1, pp. 1–23, 2016.
• [9] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in 37th Annual Allerton Conference on Communications, Control, and Computing, 1999, pp. 368–377.
• [10] F. MacWilliams and N. Sloane, The Theory of Error-Correcting Codes.   Elsevier Science, 1977.