Differential privacy (DP) (Dwork et al., 2006) has emerged over the last decade as a strong de facto standard for privacy-preserving computation in the context of statistical analysis. The success of DP is based, at least in part, on the availability of robust building blocks (e.g., the Laplace, exponential and Gaussian mechanisms) together with relatively simple rules for analyzing complex mechanisms built out of these blocks (e.g., composition and robustness to post-processing). The inherent tension between privacy and utility in practical applications has sparked a renewed interest in the development of further rules leading to tighter privacy bounds. A trend in this direction is to find ways to measure the privacy introduced by sources of randomness that are not accounted for by standard composition rules. Generally speaking, these are referred to as privacy amplification rules, with prominent examples being amplification by subsampling (Chaudhuri and Mishra, 2006; Kasiviswanathan et al., 2011; Li et al., 2012; Beimel et al., 2013, 2014; Bun et al., 2015; Balle et al., 2018; Wang et al., 2019), shuffling (Erlingsson et al., 2019; Cheu et al., 2019; Balle et al., 2019) and iteration (Feldman et al., 2018).
Motivated by these considerations, in this paper we initiate a systematic study of privacy amplification by stochastic post-processing. Specifically, given a DP mechanism $M$ producing (probabilistic) outputs in $\mathbb{X}$ and a Markov operator $K$ defining a stochastic transition between $\mathbb{X}$ and $\mathbb{Y}$, we are interested in measuring the privacy of the post-processed mechanism $K \circ M$ producing outputs in $\mathbb{Y}$. The standard post-processing property of DP states that $K \circ M$ is at least as private as $M$. Our goal is to understand under what conditions the post-processed mechanism $K \circ M$ is strictly more private than $M$. Roughly speaking, this amplification should be non-trivial when the operator $K$ “forgets” information about the distribution of its input $M(D)$. Our main insight is that, at least when $\mathbb{Y} = \mathbb{X}$, the forgetfulness of $K$ from the point of view of DP can be measured using tools similar to the ones developed to analyze the speed of convergence, i.e. mixing, of the Markov process associated with $K$.
In this setting, we provide three types of results, each associated with a standard method used in the study of convergence for Markov processes. In the first place, Section 3 provides DP amplification results for the case where the operator $K$ satisfies a uniform mixing condition. These include standard conditions used in the analysis of Markov chains on discrete spaces, including the well-known Dobrushin coefficient and Doeblin’s minorization condition (Levin and Peres, 2017). Although in principle uniform mixing conditions can also be defined in more general non-discrete spaces (Del Moral et al., 2003), most Markov operators of interest in $\mathbb{R}^d$ do not exhibit uniform mixing since the speed of convergence depends on how far apart the initial inputs are. Convergence analyses in this case rely on more sophisticated tools, including Lyapunov functions (Meyn and Tweedie, 2012), coupling methods (Lindvall, 2002) and functional inequalities (Bakry et al., 2013).
Following these ideas, Section 4 investigates the use of coupling methods to quantify privacy amplification by post-processing under Rényi DP (Mironov, 2017). These methods apply to operators given by, e.g., Gaussian and Laplace distributions, for which uniform mixing does not hold. Results in this section are intimately related to the privacy amplification by iteration phenomenon studied in (Feldman et al., 2018) and can be interpreted as extensions of their main results to more general settings. In particular, our analysis provides sharper bounds when iterating strict contractions and leads to an exponential improvement on the privacy amplification by iteration of Noisy SGD in the strongly convex case.
Our last set of results concerns the case where is replaced by a family of operators forming a Markov semigroup (Bakry et al., 2013). This is the natural setting for continuous-time Markov processes, and includes diffusion processes defined in terms of stochastic differential equations (Øksendal, 2003). In Section 5 we associate (a collection of) diffusion mechanisms to a diffusion semigroup. Interestingly, these mechanisms are, by construction, closed under post-processing in the sense that . We show the Gaussian mechanism falls into this family – since Gaussian noise is closed under addition – and also present a new mechanism based on the Ornstein-Uhlenbeck process which in many cases has better mean squared error than the Gaussian mechanism. Our main result on diffusion mechanisms provides a generic Rényi DP guarantee based on an intrinsic notion of sensitivity derived from the geometry induced by the semigroup. The proof relies on a heat flow argument reminiscent of the analysis of mixing in diffusion processes based on functional inequalities (Bakry et al., 2013).
We start by introducing notation and concepts that will be used throughout the paper. We write , and .
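Let $\mathbb{X} = (\mathbb{X}, \Sigma, \lambda)$ be a measurable space with sigma-algebra $\Sigma$ and base measure $\lambda$. We write $\mathcal{P}(\mathbb{X})$ to denote the set of probability distributions on $\mathbb{X}$. Given a probability distribution $\mu \in \mathcal{P}(\mathbb{X})$ and a measurable event $E \subseteq \mathbb{X}$ we write $\mu(E) = \mathbb{P}[X \in E]$ for a random variable $X \sim \mu$, denote its expectation under $f : \mathbb{X} \to \mathbb{R}^d$ by $\mathbb{E}[f(X)]$, and can get back its distribution as $\mu = \mathsf{Law}(X)$. Given two distributions $\mu, \nu$ (or, in general, arbitrary measures) we write $\mu \ll \nu$ to denote that $\mu$ is absolutely continuous with respect to $\nu$, in which case there exists a Radon-Nikodym derivative $\frac{d\mu}{d\nu}$. We shall reserve the notation $p_\mu = \frac{d\mu}{d\lambda}$ to denote the density of $\mu$ with respect to the base measure. We also write $\mathcal{C}(\mu, \nu)$ to denote the set of couplings between $\mu$ and $\nu$; i.e. $\pi \in \mathcal{C}(\mu, \nu)$ is a distribution on $\mathbb{X} \times \mathbb{X}$ with marginals $\mu$ and $\nu$. The support of a distribution is $\mathrm{supp}(\mu)$.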
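We will use $\mathcal{K}(\mathbb{X}, \mathbb{Y})$ to denote the set of Markov operators $K : \mathbb{X} \to \mathcal{P}(\mathbb{Y})$ defining a stochastic transition map between $\mathbb{X}$ and $\mathbb{Y}$ and satisfying that $x \mapsto K(x)(E)$ is measurable for every measurable $E \subseteq \mathbb{Y}$. Markov operators act on distributions $\mu \in \mathcal{P}(\mathbb{X})$ on the left through $(\mu K)(E) = \int K(x)(E) \, \mu(dx)$, and on functions $f : \mathbb{Y} \to \mathbb{R}$ on the right through $(K f)(x) = \int f(y) \, K(x)(dy)$, which can also be written as $(K f)(x) = \mathbb{E}[f(Y)]$ with $Y \sim K(x)$. The kernel of a Markov operator $K$ (with respect to $\lambda$) is the function $k(x, y) = \frac{dK(x)}{d\lambda}(y)$ associating with $x$ the density of $K(x)$ with respect to a fixed measure.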
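A popular way to measure dissimilarity between distributions is to use Csiszár divergences $\mathsf{D}_\phi(\mu \| \nu) = \int \phi\big(\frac{d\mu}{d\nu}\big) \, d\nu$, where $\phi : \mathbb{R}_+ \to \mathbb{R}$ is convex with $\phi(1) = 0$. Taking $\phi(u) = \frac{1}{2} |u - 1|$ yields the total variation distance $\mathsf{TV}(\mu, \nu)$, and the choice $\phi(u) = [u - e^\varepsilon]_+ = \max\{u - e^\varepsilon, 0\}$ with $\varepsilon \geq 0$ gives the hockey-stick divergence $\mathsf{D}_{e^\varepsilon}(\mu \| \nu)$, which satisfies
$$\mathsf{D}_{e^\varepsilon}(\mu \| \nu) = \int [p_\mu - e^\varepsilon p_\nu]_+ \, d\lambda = \sup_{E \subseteq \mathbb{X}} \big( \mu(E) - e^\varepsilon \nu(E) \big) .$$
It is easy to check that $\varepsilon \mapsto \mathsf{D}_{e^\varepsilon}(\mu \| \nu)$ is monotonically decreasing and $\mathsf{D}_1 = \mathsf{TV}$. All Csiszár divergences satisfy joint convexity and the data processing inequality $\mathsf{D}_\phi(\mu K \| \nu K) \leq \mathsf{D}_\phi(\mu \| \nu)$ for any Markov operator $K$. Rényi divergences, which do not belong to the family of Csiszár divergences, are another way to compare distributions. For $\alpha > 1$ the Rényi divergence of order $\alpha$ is defined as $\mathsf{R}_\alpha(\mu \| \nu) = \frac{1}{\alpha - 1} \log \int \big(\frac{d\mu}{d\nu}\big)^\alpha d\nu$, and also satisfies the data processing inequality. Finally, to measure similarity between $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$ we sometimes use the $\infty$-Wasserstein distance:
$$\mathsf{W}_\infty(\mu, \nu) = \inf_{\pi \in \mathcal{C}(\mu, \nu)} \inf \big\{ w \geq 0 : \mathbb{P}_{(X, X') \sim \pi}[\|X - X'\| \leq w] = 1 \big\} .$$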
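A mechanism $M : \mathbb{D}^n \to \mathcal{P}(\mathbb{X})$ is a randomized function that takes a dataset $D \in \mathbb{D}^n$ over some universe of records $\mathbb{D}$ and returns a (sample from) distribution $M(D) \in \mathcal{P}(\mathbb{X})$. We write $D \simeq D'$ to denote two databases differing in a single record. We say that $M$ satisfies $(\varepsilon, \delta)$-DP if $\sup_{D \simeq D'} \mathsf{D}_{e^\varepsilon}(M(D) \| M(D')) \leq \delta$ (Dwork et al., 2006); this divergence characterization of DP is due to (Barthe and Olmedo, 2013). Furthermore, we say that $M$ satisfies $(\alpha, \varepsilon)$-RDP if $\sup_{D \simeq D'} \mathsf{R}_\alpha(M(D) \| M(D')) \leq \varepsilon$ (Mironov, 2017).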
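As a concrete illustration of the divergence characterization of DP, the following minimal sketch evaluates the hockey-stick divergence between discrete output distributions and uses it to find the smallest $\delta$ for a given $\varepsilon$. Binary randomized response is used purely as an illustrative mechanism; it and the helper names (`hockey_stick`, `smallest_delta`) are assumptions made for this example and do not come from the text.

```python
import math

def hockey_stick(mu, nu, eps):
    """Hockey-stick divergence sum_y [mu(y) - e^eps * nu(y)]_+ for discrete distributions."""
    return sum(max(p - math.exp(eps) * q, 0.0) for p, q in zip(mu, nu))

def total_variation(mu, nu):
    """Total variation distance, i.e. the hockey-stick divergence at eps = 0."""
    return hockey_stick(mu, nu, 0.0)

def randomized_response(bit, eps):
    """Output distribution over {0, 1} of binary randomized response (illustrative only)."""
    keep = math.exp(eps) / (1.0 + math.exp(eps))
    return [keep, 1.0 - keep] if bit == 0 else [1.0 - keep, keep]

def smallest_delta(outputs, eps):
    """Smallest delta such that every ordered pair of output distributions is (eps, delta)-close."""
    return max(hockey_stick(mu, nu, eps) for mu in outputs for nu in outputs)

# The two inputs 0 and 1 play the role of neighbouring datasets.
outputs = [randomized_response(b, 1.0) for b in (0, 1)]
print("delta at eps = 1.0:", smallest_delta(outputs, 1.0))
print("delta at eps = 0.5:", smallest_delta(outputs, 0.5))
print("TV between outputs:", total_variation(*outputs))
```

The first value is zero up to floating-point error, reflecting that randomized response with parameter 1 is $(1, 0)$-DP, while asking for a smaller $\varepsilon$ forces a strictly positive $\delta$.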
3 Amplification From Uniform Mixing
We start our analysis of privacy amplification by stochastic post-processing by considering settings where the Markov operator satisfies one of the following uniform mixing conditions.
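Let $K \in \mathcal{K}(\mathbb{X}, \mathbb{Y})$ be a Markov operator, $\gamma \in [0, 1]$ and $\varepsilon \geq 0$. We say that $K$ is:
$\gamma$-Dobrushin if $\sup_{x, x'} \mathsf{TV}(K(x), K(x')) \leq \gamma$,
$(\gamma, \varepsilon)$-Dobrushin if $\sup_{x, x'} \mathsf{D}_{e^\varepsilon}(K(x) \| K(x')) \leq \gamma$,
$\gamma$-Doeblin if there exists a distribution $\omega \in \mathcal{P}(\mathbb{Y})$ such that $K(x) \geq (1 - \gamma) \omega$ for all $x \in \mathbb{X}$,
$\gamma$-ultra-mixing if for all $x, x' \in \mathbb{X}$ we have $K(x) \ll K(x')$ and $\frac{dK(x)}{dK(x')} \geq 1 - \gamma$.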
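For a finite state space, where a Markov operator is a row-stochastic matrix, the smallest constant for which each of these conditions holds can be computed directly. The sketch below assumes the standard finite-space forms of the conditions (total variation and hockey-stick divergence between rows, column-wise minima for Doeblin, entry-wise ratios for ultra-mixing); the function names and the example matrix are hypothetical.

```python
import math

def dobrushin(K, eps=0.0):
    """Largest hockey-stick divergence between two rows; eps = 0 gives the Dobrushin coefficient."""
    return max(
        sum(max(p - math.exp(eps) * q, 0.0) for p, q in zip(row_a, row_b))
        for row_a in K for row_b in K
    )

def doeblin(K):
    """Smallest gamma with K(x) >= (1 - gamma) * omega for some distribution omega.

    The best omega is proportional to the column-wise minima of K.
    """
    n_cols = len(K[0])
    return 1.0 - sum(min(row[y] for row in K) for y in range(n_cols))

def ultra_mixing(K):
    """Smallest gamma with K(x)(y) >= (1 - gamma) * K(x')(y) for all x, x', y.

    Assumes strictly positive entries so that all rows are mutually
    absolutely continuous.
    """
    if any(p <= 0.0 for row in K for p in row):
        raise ValueError("ultra-mixing sketch requires strictly positive entries")
    n_cols = len(K[0])
    worst_ratio = min(
        row_a[y] / row_b[y] for row_a in K for row_b in K for y in range(n_cols)
    )
    return 1.0 - worst_ratio

# Hypothetical 3-state operator used only for illustration.
K = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
print(dobrushin(K), dobrushin(K, eps=0.5), doeblin(K), ultra_mixing(K))
```

On this matrix the constants are ordered as hockey-stick Dobrushin, Dobrushin, Doeblin, ultra-mixing from smallest to largest, so a stronger condition holds only with a larger constant.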
Most of these conditions arise in the context of mixing analyses in Markov chains. In particular, the Dobrushin condition can be traced back to (Dobrushin, 1956), while Doeblin’s condition was introduced earlier (Doeblin, 1937) (see also (Nummelin, 2004)). Ultra-mixing is a strengthening of Doeblin’s condition used in (Del Moral et al., 2003). The $(\gamma, \varepsilon)$-Dobrushin condition, on the other hand, is new and is designed as a generalization of Dobrushin’s condition tailored for amplification under the hockey-stick divergence.
It is not hard to see that Dobrushin’s is the weakest among these conditions, and in fact we have the implications summarized in Figure 2 (see Lemma 3). This explains why the amplification bounds in the following result are increasingly strong, and in particular why the first two only provide amplification in $\delta$, while the last two also amplify the $\varepsilon$ parameter.
The implications in Figure 2 hold.
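That $\gamma$-Dobrushin implies $(\gamma, \varepsilon)$-Dobrushin follows directly from $\mathsf{D}_{e^\varepsilon}(\mu \| \nu) \leq \mathsf{D}_1(\mu \| \nu) = \mathsf{TV}(\mu, \nu)$.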
To see that -Doeblin implies -Dobrushin we observe that the kernel of a -Doeblin operator must satisfy for any . Thus, we can use the characterization of in terms of a minimum to get
Finally, to get the -Doeblin condition for an operator satisfying -ultra-mixing we recall from (Del Moral et al., 2003, Lemma 4.1) that for such an operator we have that is satisfied for any probability distribution and . Thus, taking to have full support we obtain Doeblin’s condition with . ∎
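Let $M$ be an $(\varepsilon, \delta)$-DP mechanism. For a given Markov operator $K \in \mathcal{K}(\mathbb{X}, \mathbb{Y})$, the post-processed mechanism $K \circ M$ satisfies:
$(\varepsilon, \delta')$-DP with $\delta' = \gamma \delta$ if $K$ is $\gamma$-Dobrushin,
$(\varepsilon, \delta')$-DP with $\delta' = \gamma \delta$ if $K$ is $(\gamma, \tilde{\varepsilon})$-Dobrushin with $\tilde{\varepsilon} = \log\big(1 + \frac{e^\varepsilon - 1}{\delta}\big)$ (we take the convention $\tilde{\varepsilon} = \infty$ whenever $\delta = 0$, in which case the $(\gamma, \infty)$-Dobrushin condition is obtained with respect to the divergence $\mathsf{D}_\infty(\mu \| \nu) = \mu(\mathrm{supp}(\mu) \setminus \mathrm{supp}(\nu))$),
$(\varepsilon', \delta')$-DP with $\varepsilon' = \log(1 + \gamma (e^\varepsilon - 1))$ and $\delta' = \gamma (1 - e^{\varepsilon' - \varepsilon} (1 - \delta))$ if $K$ is $\gamma$-Doeblin,
$(\varepsilon', \delta')$-DP with $\varepsilon' = \log(1 + \gamma (e^\varepsilon - 1))$ and $\delta' = \gamma \delta e^{\varepsilon' - \varepsilon}$ if $K$ is $\gamma$-ultra-mixing.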
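To make the four guarantees easier to compare, the sketch below evaluates the amplified parameters numerically. It assumes the bounds take the closed forms written in the comments (the Dobrushin-type conditions scale $\delta$ by $\gamma$, while the Doeblin and ultra-mixing conditions use the subsampling-style $\varepsilon' = \log(1 + \gamma(e^\varepsilon - 1))$); the function names are hypothetical.

```python
import math

def amplify_dobrushin(eps, delta, gamma):
    """Assumed form: (eps, gamma * delta)-DP; eps is not amplified."""
    return eps, gamma * delta

def dobrushin_eps_tilde(eps, delta):
    """Assumed order at which the hockey-stick Dobrushin condition is required."""
    if delta == 0.0:
        return math.inf  # convention for delta = 0
    return math.log(1.0 + (math.exp(eps) - 1.0) / delta)

def amplified_eps(eps, gamma):
    """Assumed subsampling-style amplification eps' = log(1 + gamma * (e^eps - 1))."""
    return math.log(1.0 + gamma * (math.exp(eps) - 1.0))

def amplify_doeblin(eps, delta, gamma):
    """Assumed form: delta' = gamma * (1 - e^(eps' - eps) * (1 - delta))."""
    eps_p = amplified_eps(eps, gamma)
    return eps_p, gamma * (1.0 - math.exp(eps_p - eps) * (1.0 - delta))

def amplify_ultra_mixing(eps, delta, gamma):
    """Assumed form: delta' = gamma * delta * e^(eps' - eps)."""
    eps_p = amplified_eps(eps, gamma)
    return eps_p, gamma * delta * math.exp(eps_p - eps)

eps, delta, gamma = 1.0, 1e-5, 0.5
print("Dobrushin:   ", amplify_dobrushin(eps, delta, gamma))
print("eps tilde:   ", dobrushin_eps_tilde(eps, delta))
print("Doeblin:     ", amplify_doeblin(eps, delta, gamma))
print("ultra-mixing:", amplify_ultra_mixing(eps, delta, gamma))
```

Under these assumed forms, the Doeblin bound returns a strictly positive $\delta'$ even when called with $\delta = 0$, whereas the ultra-mixing bound preserves $\delta' = 0$.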
A few remarks about this result are in order. First we note that (2) is stronger than (1) since the monotonicity of hockey-stick divergences implies . Also note how in the results above we always have , and in fact the form of is the same as obtained under amplification by subsampling when, e.g., a -fraction of the original dataset is kept. This is not a coincidence since the proofs of (3) and (4) leverage the overlapping mixtures technique used to analyze amplification by subsampling in (Balle et al., 2018). However, we note that for (3) we can have even with . In fact the Doeblin condition only leads to an amplification in if .
For convenience, we split the proof of Theorem 3 into four separate statements, each corresponding to one of the claims in the theorem.
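Recall that a Markov operator $K$ is $\gamma$-Dobrushin if $\sup_{x, x'} \mathsf{TV}(K(x), K(x')) \leq \gamma$.
Let $M$ be an $(\varepsilon, \delta)$-DP mechanism. If $K$ is a $\gamma$-Dobrushin Markov operator, then the composition $K \circ M$ is $(\varepsilon, \gamma \delta)$-DP.
This follows directly from the strong Markov contraction lemma established by Cohen et al. (1993) in the discrete case and by Del Moral et al. (2003) in the general case (see also (Raginsky, 2016)). In particular, this lemma states that for any divergence $\mathsf{D}$ in the sense of Csiszár we have $\mathsf{D}(\mu K \| \nu K) \leq \gamma \, \mathsf{D}(\mu \| \nu)$. Letting $\mu = M(D)$ and $\nu = M(D')$ for some $D \simeq D'$ and applying this inequality to $\mathsf{D}_{e^\varepsilon}$ yields the result. ∎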
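Next we prove amplification when $K$ is a $(\gamma, \varepsilon)$-Dobrushin operator. Recall that a Markov operator $K$ is $(\gamma, \varepsilon)$-Dobrushin if $\sup_{x, x'} \mathsf{D}_{e^\varepsilon}(K(x) \| K(x')) \leq \gamma$. We will require the following technical lemmas in the proof of Theorem 3.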
Let denote the fact . If is -Dobrushin, then we have
Note that the condition on can be written as . This shows that by hypothesis the condition already holds for the distributions with . Thus, all we need to do is prove that these distributions are extremal for among all distributions with . Let and define and . Working in the discrete setting for simplicity, we can write , with an equivalent expression for . Now we use the joint convexity of to write
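Let $\mu, \nu \in \mathcal{P}(\mathbb{X})$. Then we have $\mathsf{D}_{e^\varepsilon}(\mu \| \nu) = 1 - \int \min\{p_\mu, e^\varepsilon p_\nu\} \, d\lambda$.
Define $A = \{x : p_\mu(x) \leq e^\varepsilon p_\nu(x)\}$ to be the set of points where $\mu$ is dominated by $e^\varepsilon \nu$, and let $A^c$ denote its complement. Then we have the identities $\int \min\{p_\mu, e^\varepsilon p_\nu\} \, d\lambda = \mu(A) + e^\varepsilon \nu(A^c)$ and $\mathsf{D}_{e^\varepsilon}(\mu \| \nu) = \mu(A^c) - e^\varepsilon \nu(A^c)$.
Thus we obtain the desired result since $1 - \int \min\{p_\mu, e^\varepsilon p_\nu\} \, d\lambda = 1 - \mu(A) - e^\varepsilon \nu(A^c) = \mu(A^c) - e^\varepsilon \nu(A^c) = \mathsf{D}_{e^\varepsilon}(\mu \| \nu)$.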
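Let $M$ be an $(\varepsilon, \delta)$-DP mechanism and let $\tilde{\varepsilon} = \log\big(1 + \frac{e^\varepsilon - 1}{\delta}\big)$. If $K$ is a $(\gamma, \tilde{\varepsilon})$-Dobrushin Markov operator, then the composition $K \circ M$ is $(\varepsilon, \gamma \delta)$-DP.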
Fix and for some and let . We start by constructing overlapping mixture decompositions for and as follows. First, define the function and let be the probability distribution with density , where we used Lemma 3. Now note that by construction we have the inequalities
Assuming without loss of generality that , these inequalities imply that we can construct probability distributions and such that
Now we observe that the distributions and defined in this way have disjoint support. To see this we first use the identity to see that
Thus we have . A similar argument applied to shows that on the other hand , and thus .
Finally, we proceed to use the mixture decomposition of and and the condition to bound as follows. By using the mixture decompositions we get
where . Thus, applying the definition of , using the linearity of Markov operators, and the monotonicity we obtain the bound:
where the last inequality follows from Lemma 3. ∎
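Recall that a Markov operator $K$ is $\gamma$-Doeblin if there exists a distribution $\omega$ such that $K(x) \geq (1 - \gamma) \omega$ for all $x$. The proof of amplification for $\gamma$-Doeblin operators further leverages overlapping mixture decompositions like the one used in Theorem 3, but this time the mixture arises at the level of the kernel itself.
Let $M$ be an $(\varepsilon, \delta)$-DP mechanism. If $K$ is a $\gamma$-Doeblin Markov operator, then the composition $K \circ M$ is $(\varepsilon', \delta')$-DP with $\varepsilon' = \log(1 + \gamma (e^\varepsilon - 1))$ and $\delta' = \gamma (1 - e^{\varepsilon' - \varepsilon} (1 - \delta))$.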
Fix and for some . Let be a witness that is -Doeblin and let be the constant Markov operator given by for all . Doeblin’s condition implies that the following is again a Markov operator:
Thus, we can write as the mixture and then use the advanced joint convexity property of (Balle et al., 2018, Theorem 2) with to obtain the following:
where . Finally, using the immediate bounds and , we get
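The kernel-level mixture used in this proof can be checked numerically on a finite state space. The sketch below assumes the decomposition has the form $K = (1 - \gamma) K_\omega + \gamma \tilde{K}$, with $K_\omega$ the constant operator returning the witness $\omega$ and $\tilde{K}$ the residual operator; the choice of $\omega$ as the normalized column minima and all names are assumptions made for illustration.

```python
def doeblin_decomposition(K):
    """Split a row-stochastic matrix K as (1 - gamma) * omega + gamma * K_tilde.

    omega is taken proportional to the column-wise minima of K, which gives
    the smallest admissible gamma. Assumes 0 < gamma < 1.
    """
    n_cols = len(K[0])
    col_min = [min(row[y] for row in K) for y in range(n_cols)]
    mass = sum(col_min)            # equals 1 - gamma
    gamma = 1.0 - mass
    if not 0.0 < gamma < 1.0:
        raise ValueError("sketch assumes 0 < gamma < 1")
    omega = [c / mass for c in col_min]
    # Residual operator: what is left of each row after removing (1 - gamma) * omega.
    K_tilde = [[(row[y] - col_min[y]) / gamma for y in range(n_cols)] for row in K]
    return gamma, omega, K_tilde

# Hypothetical 3-state operator used only for illustration.
K = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
gamma, omega, K_tilde = doeblin_decomposition(K)
print("gamma:", gamma)
print("omega:", omega)
for row in K_tilde:
    print("K_tilde row:", row, "sum:", sum(row))
```

Each row of the residual is non-negative and sums to one, so it is again a Markov operator, which is the property the proof relies on.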
Our last amplification result applies to operators satisfying the ultra-mixing condition of Del Moral et al. (2003). We say that a Markov operator $K$ is $\gamma$-ultra-mixing if for all $x, x'$ we have $K(x) \ll K(x')$ and $\frac{dK(x)}{dK(x')} \geq 1 - \gamma$. The proof strategy is based on the ideas from the previous proof, although in this case the argument is slightly more technical as it involves a strengthening of the Doeblin condition implied by ultra-mixing that only holds under a specific support.
Let $M$ be an $(\varepsilon, \delta)$-DP mechanism. If $K$ is a $\gamma$-ultra-mixing Markov operator, then the composition $K \circ M$ is $(\varepsilon', \delta')$-DP with $\varepsilon' = \log(1 + \gamma (e^\varepsilon - 1))$ and $\delta' = \gamma \delta e^{\varepsilon' - \varepsilon}$.
Fix and for some . The proof follows a similar strategy as the one used in Theorem 3, but coupled with the following consequence of the ultra-mixing property: for any probability distribution and we have (Del Moral et al., 2003, Lemma 4.1). We use this property to construct a collection of mixture decompositions for as follows. Let and take and . By the ultra-mixing condition and the argument used in the proof of Theorem 3, we can show that
is a Markov operator from into . Here is the constant Markov operator . Furthermore, the expression for and the definition of imply that
Now note that the mixture decompositions and and the advanced joint convexity property of (Balle et al., 2018, Theorem 2) with yield
where . Using (1) we can expand the remaining divergence above as follows:
where we used the definition of and joint convexity. Since was arbitrary, we can now take the limit to obtain the bound . ∎
We conclude this section by noting that the conditions in Definition 1, despite being quite natural, might be too stringent for proving amplification for DP mechanisms on, say, $\mathbb{R}^d$. One way to see this is to interpret the operator $K$ as a mechanism and to note that the uniform mixing conditions on $K$ can be rephrased in terms of local DP (LDP) (Kasiviswanathan et al., 2011) properties (see Table 2) where the supremum is taken over any pair of inputs (instead of neighboring ones). (The blanket condition is a necessary condition for LDP introduced in (Balle et al., 2019) to analyze privacy amplification by shuffling.) This motivates the results in the next section, where we look for finer conditions to prove amplification by stochastic post-processing.
4 Amplification From Couplings
In this section we turn to coupling-based proofs of amplification by post-processing under the Rényi DP framework. Our first result is a measure-theoretic generalization of the shift-reduction lemma in (Feldman et al., 2018), which does not rely on the vector-space structure of the underlying space.
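Given a coupling $\pi \in \mathcal{C}(\mu, \nu)$ with $\mu, \nu \in \mathcal{P}(\mathbb{X})$, we construct a transport Markov operator $H_\pi$ with kernel $h_\pi(x, y) = \frac{p_\pi(x, y)}{p_\mu(x)}$ (here we use the convention $\frac{0}{0} = 0$), where $p_\pi = \frac{d\pi}{d(\lambda \otimes \lambda)}$ and $p_\mu = \frac{d\mu}{d\lambda}$. It is immediate to verify from the definition that $H_\pi$ is a Markov operator satisfying the transport property $\mu H_\pi = \nu$.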
Let , and . For any distribution and coupling we have
Let and be as in the statement, and let . Note that taking and to be the corresponding transport operators we have . Now, given a let denote the marginal of on the second coordinate. In particular, if denotes the joint distribution of and , then we have . Thus, by the data processing inequality we have
The final step is to expand the RHS of the derivation above as follows:
where the supremums are taken with respect to . ∎
Note that this result captures the data-processing inequality for Rényi divergences since taking yields . The next examples illustrate the use of this theorem to obtain amplification by operators corresponding to the addition of Gaussian and Laplace noise.
Example 1 (Tightness).
To show that (2) is tight we consider the simple scenario of adding Gaussian noise to the output of a Gaussian mechanism. In particular, suppose for some function with global -sensitivity and the Markov operator is given by . The post-processed mechanism is given by , which satisfies -RDP. We now show how this result also follows from Theorem 4. Given two datasets we write and with . We take for some to be determined later, and couple and through a translation , yielding a coupling with and a transport operator with kernel . Plugging these into (2) we get
Finally, taking with yields .
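The tightness claim in this example can be sanity-checked numerically. The sketch below assumes that, for Gaussian noise, the bound splits into two Rényi terms: one for the part of the mean shift left between the Gaussian mechanism outputs (variance $\sigma_1^2$) and one for the part absorbed by the post-processing noise (variance $\sigma_2^2$). The parametrization by a fraction `theta`, the names `sigma1`, `sigma2`, `sens`, and the numerical values are assumptions of this illustration.

```python
def split_bound(theta, alpha, sens, sigma1, sigma2):
    """Sum of two Gaussian Renyi divergences when a fraction theta of the shift
    is absorbed by the post-processing noise and the rest by the mechanism noise."""
    mechanism_term = alpha * ((1.0 - theta) * sens) ** 2 / (2.0 * sigma1 ** 2)
    operator_term = alpha * (theta * sens) ** 2 / (2.0 * sigma2 ** 2)
    return mechanism_term + operator_term

def direct_bound(alpha, sens, sigma1, sigma2):
    """Renyi DP of a Gaussian mechanism with total variance sigma1^2 + sigma2^2."""
    return alpha * sens ** 2 / (2.0 * (sigma1 ** 2 + sigma2 ** 2))

alpha, sens, sigma1, sigma2 = 2.0, 1.0, 1.5, 0.7

# Minimizer of the quadratic in theta, obtained by setting the derivative to zero.
theta_star = sigma2 ** 2 / (sigma1 ** 2 + sigma2 ** 2)

grid_best = min(split_bound(i / 1000.0, alpha, sens, sigma1, sigma2) for i in range(1001))
print("optimal theta:      ", theta_star)
print("bound at theta_star:", split_bound(theta_star, alpha, sens, sigma1, sigma2))
print("best on grid:       ", grid_best)
print("direct computation: ", direct_bound(alpha, sens, sigma1, sigma2))
```

Under these assumptions the optimized split matches the direct computation for the summed Gaussian noise, while the endpoints `theta = 0` and `theta = 1` recover the guarantees of using only one of the two noise sources.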
Example 2 (Iterated Laplace).
To illustrate the flexibility of this technique, we also apply it to get an amplification result for iterated Laplace noise, in which Laplace noise is added to the output of a Laplace mechanism. We begin by noting a negative result that there is no amplification in the -DP regime.
Let for some function with global -sensitivity and let the Markov operator be given by . The post-processed mechanism does not achieve -DP for any . Note that achieves -DP and achieves -DP.
This can be shown by directly analyzing the distribution arising from the sum of two independent Laplace variables. Let denote this distribution. In the following equations, we assume . Due to symmetry around the origin, densities at negative values can be found by looking instead at the corresponding positive location.
The integration on the middle term varies between the cases and . Finishing this derivation and replacing with to account for both positive and negative values, we get a complete expression for our density.
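The absence of amplification in the pure DP regime can also be observed numerically. The sketch below uses the standard closed form for the density of the sum of two independent zero-mean Laplace variables with scales `lam1` and `lam2` (these names, the shift `sens`, and the numerical values are assumptions of this illustration) and evaluates the privacy loss far in the tail.

```python
import math

def laplace_sum_density(z, lam1, lam2):
    """Density at z of X + Y with X ~ Lap(0, lam1) and Y ~ Lap(0, lam2) independent."""
    a = abs(z)
    if math.isclose(lam1, lam2):
        # Limit of the general formula when both scales coincide.
        return (1.0 + a / lam1) * math.exp(-a / lam1) / (4.0 * lam1)
    numerator = lam1 * math.exp(-a / lam1) - lam2 * math.exp(-a / lam2)
    return numerator / (2.0 * (lam1 ** 2 - lam2 ** 2))

def privacy_loss(z, sens, lam1, lam2):
    """Log-ratio of the output densities centred at 0 and at sens, evaluated at z."""
    return math.log(
        laplace_sum_density(z, lam1, lam2) / laplace_sum_density(z - sens, lam1, lam2)
    )

lam1, lam2, sens = 1.0, 2.0, 1.0
for z in (-1.0, -5.0, -20.0, -50.0):
    print(z, privacy_loss(z, sens, lam1, lam2))
print("sens / max(lam1, lam2):", sens / max(lam1, lam2))
```

In the tail the privacy loss approaches the sensitivity divided by the larger of the two scales, which is the pure DP level already achieved by adding only the wider of the two Laplace noises.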