Significant research challenges for statistical learning include efficiency, robustness to noise (stochasticity) and adversarial manipulation, and preserving training data privacy. In this paper we study techniques for meeting these challenges simultaneously. In particular, we examine the following problem.
Summary of setting. A Bayesian statistician wants to communicate results about some data to a third party, but without revealing the data itself. More specifically: (i) The statistician selects a model family and a prior. (ii) The third party is allowed to see the model family and the prior, and is computationally unbounded. (iii) The statistician observes data x and calculates the posterior. (iv) The third party performs repeated queries to the statistician. (v) The statistician responds by sampling from the posterior.
We show that if the model family or the prior is chosen appropriately, the resulting mechanism satisfies generalized differential privacy and indistinguishability properties. The main idea we pursue is that robustness and privacy are inherently linked through smoothness. Learning algorithms that are smooth mappings, in the sense that their output (e.g., a spam filter) does not vary significantly with small perturbations to their input (e.g., similar training corpora), enjoy robustness. Intuitively, under smoothness, training outliers have reduced influence, while it is also difficult for an adversary to leverage knowledge of the learning process to discover unknown information about the data. This suggests that robustness and privacy may be simultaneously achieved and perhaps are deeply linked. We show that under mild assumptions, this is indeed true for the posterior distribution.
The study of learning, security, robustness and privacy, and their relationships, is timely. Interest in adversarial learning is accelerating Joseph et al. (2013) while differential privacy has brought data privacy onto firm theoretical footing Dwork et al. (2006); McSherry & Talwar (2007); Duchi et al. (2013). In practice, security and privacy online are in tension with learning and are of growing economic and societal concern. Our work aims towards a unified understanding of learning in adversarial environments.
We generalise differential privacy to arbitrary dataset distances, outcome spaces, and distribution families;
Under certain regularity conditions on the prior distribution or likelihood family, we show that the posterior distribution is robust: small changes in the dataset result in small posterior changes;
We introduce a novel posterior sampling mechanism that is private. Unlike other common mechanisms, our approach sits squarely in the non-private (Bayesian) learning framework without modification;
We introduce the notion of dataset distinguishability, for which we provide finite-sample bounds for our mechanism;
We provide some classical examples of conjugate distributions where our assumptions hold.
Section 1.1 discusses related work. Section 2 specifies the problem setting and our main assumptions. Section 3 proves results on robustness of Bayesian learning with a number of examples. Section 4 bounds the ability of the adversary to discriminate datasets. Examples of distributions for which our assumptions hold are given in Section 5. We conclude the paper with Section 6. Proofs of the main theorems are given in the appendix, while those of non-essential lemmas are given in the supplement.
1.1 Related Work
In Bayesian statistical decision theory DeGroot (1970); Berger (1985); Bickel & Doksum (2001), learning is cast as a statistical inference problem and decision-theoretic criteria are used as a basis for assessing, selecting and designing procedures. In particular, for a given cost function, the Bayes-optimal procedure minimises the Bayes risk under a particular prior distribution.
In an adversarial setting, this is extended to a minimax risk, by assuming that the prior distribution is selected arbitrarily by nature. In the field of robust statistics, the minimax asymptotic bias of a procedure incurred within an -contamination neighbourhood is used as a robustness criterion giving rise to the notion of a procedure’s influence function and breakdown point to characterise robustness Huber (1981); Hampel et al. (1986). In a Bayesian context, robustness appears in several guises including minimax risk, robustness of the posterior within -contamination neighbourhoods, and robust priors Berger (1985). In this context Grünwald & Dawid (2004) demonstrated the link between robustness in terms of the minimax expected score of the likelihood function and the (generalized) maximum entropy principle, whereby nature is allowed to select a worst-case prior.
Differential privacy, first proposed by Dwork et al. (2006), has achieved prominence in the theoretical computer science, database, and more recently learning communities. Its success is largely due to the semantic guarantee of privacy it formalises. Differential privacy is normally defined with respect to a randomised mechanism for responding to queries. Informally, a mechanism preserves differential privacy if perturbing one training instance results in a small change to the probabilities of the mechanism's responses.
A popular approach for achieving differential privacy is the exponential mechanism McSherry & Talwar (2007) which generalises the Laplace mechanism of adding Laplace noise to released statistics Dwork et al. (2006). This releases a response with probability exponential in a score function measuring distance to the non-private response. An alternate approach, employed for privatising regularised ERM Chaudhuri et al. (2011), is to alter the inferential procedure itself, in that case by adding a random term to the primal objective. Unlike previous studies, our mechanisms do not require modification to the underlying learning framework.
In a different direction, Duchi et al. (2013) provided information-theoretic bounds for private learning, by modelling the protocol for interacting with an adversary as an arbitrary conditional distribution, rather than restricting it to specific mechanisms. These bounds can be seen as complementary to ours.
Little research in differential privacy has focused on the Bayesian paradigm. Williams & McSherry (2010) applied probabilistic inference to improve the utility of differentially private releases by computing posteriors in a noisy measurement model.
Smoothness of the learning map, achieved for Bayesian inference here by appropriate concentration of the prior, is related to algorithmic stability, which is used in statistical learning theory to establish error rates Bousquet & Elisseeff (2002). Rubinstein et al. (2012) used the -uniform stability of the SVM to calibrate the level of noise for using the Laplace mechanism to achieve differential privacy for the SVM. Hall et al. (2013) extended this technique to adding Gaussian process noise for differentially private release of infinite-dimensional functions lying in an RKHS.
Finally, Dwork & Lei (2009) made the first connection between (frequentist) robust statistics and differential privacy, developing mechanisms for the interquartile, median and -robust regression. While robust statistics are designed to operate near an ideal distribution, they can have prohibitively high global, worst-case sensitivity. In this case privacy was still achieved by performing a differentially-private test on local sensitivity before release Dwork & Smith (2009). Little further work has explored robustness and privacy, and no general connection is known.
2 Problem Setting
We consider the problem of a Bayesian statistician communicating statistical findings to an untrusted third party. The statistician wants to answer queries with useful statistical information, but without revealing private information about the original data (e.g., how many people suffer from a disease or vote for a particular party). In so doing, the statistician must also preserve the local privacy of users represented in the dataset. This requires finding a query response mechanism for communicating information that strikes a good balance between utility and privacy. In this paper, we study the inherent privacy and robustness properties of Bayesian inference and explore the question of whether the statistician can select a prior distribution so that a computationally unbounded adversary cannot obtain private information from queries.
We begin with our notation. Let be the set of all possible datasets. For example, if is a finite alphabet, then we might have , i.e., the set of all possible observation sequences over .
Central to notions of privacy and robustness is the concept of distance between datasets. Firstly, the effect of dataset perturbation on learning depends on the amount of noise as quantified by some distance. Secondly, the amount that an attacker can learn from queries can be quantified in terms of the distance of the attacker's guesses to the true dataset. To model these situations, we equip the set of datasets with a pseudo-metric (meaning that a distance of zero between two datasets does not necessarily imply that they are equal). Using pseudo-metrics, we considerably generalise previous work on differential privacy, which considers only the special case of Hamming distance.
This paper focuses on the Bayesian inference setting, where the statistician constructs a posterior distribution from a prior distribution and a training dataset. More precisely, we assume that the data have been drawn from some distribution on the set of datasets, belonging to a family of distributions indexed by a parameter set and defined on an appropriate σ-algebra on the set of datasets:
and where we write the corresponding densities as Radon-Nikodym derivatives with respect to some dominating measure when necessary. To perform inference in the Bayesian setting, the statistician selects a prior measure on the parameter set reflecting subjective beliefs about which parameter is more likely to be true, a priori; i.e., for any measurable set of parameters, the prior gives the statistician's belief that the true parameter lies in that set. In general, the posterior distribution after observing the data is:
where the normalising term is the corresponding marginal density given by:
While the choice of the prior is generally arbitrary, this paper shows that its careful selection can yield good privacy guarantees.
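As a reference point, the family, posterior, and marginal described above can be written in the standard form for a dominated family. The symbols $\Theta$, $P_\theta$, $p_\theta$, $\xi$, and $\phi$ are notation assumed here for illustration; only the structure (Bayes' rule with the marginal as normaliser) is taken from the text.

```latex
\begin{align*}
\mathcal{F}_\Theta &= \{ P_\theta : \theta \in \Theta \}, \\
\xi(B \mid x) &= \frac{\int_B p_\theta(x)\, d\xi(\theta)}{\phi(x)}
  \quad \text{for measurable } B \subseteq \Theta, \\
\phi(x) &= \int_\Theta p_\theta(x)\, d\xi(\theta).
\end{align*}
```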
We first recall the idea of differential privacy Dwork (2006). This states that on similar datasets, a randomised query response mechanism yields (pointwise) similar distributions. We adopt the view of mechanisms as conditional distributions under which differential privacy can be seen as a measure of smoothness. In our setting, conditional distributions conveniently correspond to posterior distributions. These can also be interpreted as the distribution of a mechanism that uses posterior sampling, to be introduced in Section 4.2.
Definition 1 (-differential privacy).
A conditional distribution on is -differentially private if, for all and for any
for all in the Hamming-1 neighbourhood of . That is, there is at most one such that .
As a first step, we generalise this definition to arbitrary dataset spaces that are not necessarily product spaces. To do so, we introduce the notion of differential privacy under a pseudo-metric on the space of all datasets.
Definition 2 (-differential privacy under .).
A conditional distribution on is -differentially private under a pseudo-metric if, for all and for any , then:
If and we use the Hamming distance , this definition is analogous to standard -differential privacy. In fact, when considering only -differential privacy or -privacy, it is an equivalent notion (making the definition wholly equivalent is possible, but results in an unnecessarily complex definition).
For -DP, let ; i.e., they only differ in one element. Then, from standard DP, we have and so obtain . By induction, this holds for any pair. Similarly, for -DP, by induction we obtain . ∎
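For orientation, the two definitions and the induction step can be sketched as follows. The first display is the standard $(\epsilon, \delta)$-differential privacy inequality; the second is one form of the pseudo-metric generalisation that is consistent with the induction argument above, and should be read as an assumption about notation ($P(\cdot \mid x)$ for the conditional distribution, $\rho$ for the pseudo-metric) rather than a verbatim restatement.

```latex
\begin{align*}
\text{(Definition 1)}\quad
  & P(B \mid x) \le e^{\epsilon} P(B \mid y) + \delta
  && \text{for Hamming-1 neighbours } x, y, \\
\text{(Definition 2)}\quad
  & P(B \mid x) \le e^{\epsilon \rho(x, y)} P(B \mid y) + \delta\, \rho(x, y)
  && \text{for all } x, y. \\
\intertext{Chaining $k$ single-element changes $x = z_0, z_1, \dots, z_k = y$ gives}
(\epsilon, 0)\text{-DP:}\quad
  & P(B \mid x) \le e^{\epsilon} P(B \mid z_1) \le \dots \le e^{k \epsilon} P(B \mid y), \\
(0, \delta)\text{-DP:}\quad
  & P(B \mid x) \le P(B \mid z_1) + \delta \le \dots \le P(B \mid y) + k \delta .
\end{align*}
```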
Definition 1 allows for privacy against a very strong attacker, who attempts to match the empirical distribution of the true dataset by querying the learned mechanism and comparing its responses to those given by distributions simulated using knowledge of the mechanism and knowledge of all but one datum, thereby narrowing the dataset down to a Hamming-1 ball. Indeed this requirement is sometimes too strong since it may come at the price of utility. Our Definition 2 allows for a much broader encoding of the attacker’s knowledge via the selected pseudo-metric.
2.2 Our Main Assumptions
In the sequel, we show that if the distribution family or prior is such that close datasets have similar probabilities, then its posterior distributions are close. In that case, it is difficult for a third party to use such a posterior to distinguish the true dataset from similar datasets.
To formalise these notions, we introduce two possible assumptions one could make on the smoothness of the family with respect to some metric on . The first assumption states that the likelihood is smooth for all parameterizations of the family:
Assumption 1 (Lipschitz continuity).
Let be a metric on . There exists such that, for any :
However, it may be difficult for this assumption to hold uniformly over . This can be seen by a counterexample for the Bernoulli family of distributions. Consequently, we relax it by only requiring that ’s prior probability is concentrated in the parts of the family for which the likelihood is smoothest:
Assumption 2 (Relaxed Lipschitz continuity).
Let be a metric on and let
be the set of parameters for which Lipschitz continuity holds with Lipschitz constant . Then there is some constant such that, for all :
By not requiring uniform smoothness, this weaker assumption is easier to meet but still yields useful guarantees. In fact, in Section 5, we demonstrate that this assumption is satisfied by several example distribution families.
To make our assumptions concrete, we now fix the distance function to be the absolute log-ratio,
which is a proper metric on . This particular choice of distance yields guarantees on differential privacy and indistinguishability.
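The absolute log-ratio can be illustrated with a short numerical sketch. The function name `abs_log_ratio` and the test values are hypothetical; the code only checks the metric axioms on a handful of positive reals and is not a proof.

```python
import itertools
import math


def abs_log_ratio(a: float, b: float) -> float:
    """Absolute log-ratio distance |ln(a / b)| between two positive reals."""
    if a <= 0 or b <= 0:
        raise ValueError("the absolute log-ratio is only defined for positive values")
    return abs(math.log(a) - math.log(b))


# Spot-check identity, symmetry and the triangle inequality on sample values.
values = [0.01, 0.2, 1.0, 3.5, 40.0]
for a, b, c in itertools.product(values, repeat=3):
    assert abs_log_ratio(a, a) == 0.0
    assert abs_log_ratio(a, b) == abs_log_ratio(b, a)
    assert abs_log_ratio(a, c) <= abs_log_ratio(a, b) + abs_log_ratio(b, c) + 1e-12
```

Applied to likelihood values of two datasets under the same parameter, this is the quantity that the smoothness assumptions bound in terms of the dataset distance.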
We next show that verifying our assumptions for a distribution of a single random variable lifts to a corresponding property for the product distribution on i.i.d. samples.
and constant (or ). Further, if and differ in at most items, the assumption holds with the same pseudo-metric but with constant (or ) instead.
3 Robustness of the Posterior Distribution
We now show that the above assumptions provide guarantees on the robustness of the posterior. That is, if the distance between two datasets is small, then so too is the distance between the two resulting posteriors, and . We prove this result for the case where we measure the distance between the posteriors in terms of the well-known KL-divergence:
The following theorem shows that any distribution family and prior satisfying one of our assumptions is robust, in the sense that the posterior does not change significantly with small changes to the dataset. It is notable that our mechanisms are simply tuned through the choice of prior.
When is the absolute log-ratio distance (2.7), is a prior distribution on and and are the respective posterior distributions for datasets , the following results hold:
Under a metric and satisfying Assumption 1,
Note that the second claim bounds the KL divergence in terms of ’s prior belief that is small, which is expressed via the constant . The larger is, the less prior mass is placed in large and so the more robust inference becomes. Of course, choosing to be too large may decrease efficiency.
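The qualitative effect described here can be illustrated numerically. The sketch below is an assumption-laden toy, not the theorem's bound: it uses a Bernoulli likelihood with a symmetric Beta(a, a) prior (our choice), two hypothetical datasets differing in one element, and a midpoint-rule approximation of the KL divergence between the resulting Beta posteriors.

```python
import math


def beta_log_pdf(t, a, b):
    """Log-density of Beta(a, b) at t in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t)


def kl_beta(a1, b1, a2, b2, grid=20000):
    """Midpoint-rule approximation of KL(Beta(a1, b1) || Beta(a2, b2))."""
    total = 0.0
    for i in range(grid):
        t = (i + 0.5) / grid
        lp = beta_log_pdf(t, a1, b1)
        lq = beta_log_pdf(t, a2, b2)
        total += math.exp(lp) * (lp - lq) / grid
    return total


def posterior_kl(prior_strength, x, y):
    """KL between the posteriors for Bernoulli datasets x and y under Beta(a, a)."""
    a = prior_strength
    post_x = (a + sum(x), a + len(x) - sum(x))
    post_y = (a + sum(y), a + len(y) - sum(y))
    return kl_beta(*post_x, *post_y)


x = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # differs from x in exactly one element

kl_weak = posterior_kl(1.0, x, y)     # diffuse prior
kl_strong = posterior_kl(50.0, x, y)  # concentrated prior
print(kl_weak, kl_strong)
```

In this toy setting the more concentrated prior yields the smaller divergence between the two posteriors, which matches the trade-off noted above: more robustness at the cost of slower learning from the data.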
4 Privacy Properties of the Posterior Distribution
We next examine the differential privacy of the posterior distribution. We show in Section 4.1 that this can be achieved under either of our assumptions. The result can also be interpreted as the differential privacy of a posterior sampling mechanism for responding to queries, which is described in Section 4.2. Finally, Section 4.3 introduces an alternative notion of privacy: dataset distinguishability. We prove a high-probability bound on the sample complexity of distinguishability under our assumptions.
4.1 Differential Privacy of Posterior Distributions
We consider our generalised notion of differential privacy for posterior distributions (Definition 2); and show that the type of privacy exhibited by the posterior depends on which assumption holds.
4.2 Posterior Sampling Query Model
Given that we have a full posterior distribution, we use it to define an algorithm achieving privacy. In this framework, we allow the adversary to submit a set of queries, each of which is a mapping from the parameter space to some arbitrary answer set. If we knew the true parameter, we would reply to any query with its value at that parameter. However, since the true parameter is unknown, we must select a method for conveying the required information. There are three main approaches that we are aware of. The first is to marginalise the parameter out. The second is to use the maximum a posteriori value of the parameter. The third, which we employ here, is to use sampling; i.e., to reply to each query using a different parameter sampled from the posterior.
This sample-based query model is presented in Algorithm 1. First, the algorithm calculates the posterior distribution. Then, for each received query, the algorithm draws a fresh sample from the posterior distribution and responds with the query evaluated at that sample.
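A minimal sketch of this query model is given below, assuming a Bernoulli likelihood with a conjugate Beta prior so that the posterior has a closed form. The class name `PosteriorSamplingMechanism`, its parameters, and the example queries are hypothetical and chosen only for illustration.

```python
import random


class PosteriorSamplingMechanism:
    """Answers each query q with q(theta), where theta is drawn afresh from the posterior."""

    def __init__(self, data, prior_a=1.0, prior_b=1.0, seed=None):
        successes = sum(data)
        # Conjugate update: Beta(a, b) prior with Bernoulli data gives a Beta posterior.
        self.post_a = prior_a + successes
        self.post_b = prior_b + len(data) - successes
        self._rng = random.Random(seed)

    def answer(self, query):
        # One independent posterior draw per query, as in the sampling approach.
        theta = self._rng.betavariate(self.post_a, self.post_b)
        return query(theta)


mechanism = PosteriorSamplingMechanism([1, 0, 1, 1, 0, 1], seed=0)
print(mechanism.answer(lambda theta: theta))        # a draw of the parameter itself
print(mechanism.answer(lambda theta: theta > 0.5))  # a Boolean query on the parameter
```

The data never leave the mechanism; only functions of posterior samples are released, so the privacy level is controlled through the prior parameters rather than through added noise.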
In this context, Theorem 2 can be interpreted as proving differential privacy for the posterior sampling mechanism for the case when the response set is the parameter set; i.e., and .
As a further illustration, we provide the example of querying conditional expectations.
Let each model in the family define a distribution on the product space , such that for any , . In addition, let (with appropriate algebras ) and write for point and its two components. A conditional expectation query would require an answer to the question:
where the parameter is unknown to the questioner. In this case, the answer set would be identical to , while would index the values in .
4.3 Distinguishability of Datasets
A limitation of the differential privacy framework is that it does not give us insight into the amount of effort required by an adversary to obtain private information. In fact, an adversary wishing to breach privacy needs to distinguish the true dataset from alternative datasets. Within the posterior sampling query model, the adversary has to decide which of two candidate posteriors is the statistician's. However, the adversary can only do so within some neighbourhood of the original data. In this section, we bound the adversary's error in determining the posterior in terms of the number of queries performed. This is analogous to the dataset-size bounds on queries in interactive models of differential privacy Dwork et al. (2006).
Let us consider an adversary querying to sample . This is the most powerful query possible under the model shown in Algorithm 1. Then, the adversary needs only to construct the empirical distribution to approximate the posterior up to some sample error. By bounds on the KL divergence between the empirical and actual distributions we can bound his power in terms of how many samples he needs in order to distinguish between and .
Due to the sampling model, we first require a finite sample bound on the quality of the empirical distribution. The adversary could attempt to distinguish different posteriors by forming the empirical distribution on any sub-algebra .
For any , let be a finite partition of the sample space , of size , generating the -algebra . Let be i.i.d. samples from a probability measure on , let be the restriction of on and let be the empirical measure on . Then, with probability at least :
Of course, the adversary could choose any arbitrary estimator to guess the true dataset. Appendix A describes how one might apply Le Cam’s method to obtain lower bounds on rates in this case. We defer a detailed discussion of this issue to future work.
We can combine this bound on the adversary’s estimation error with Theorem 1’s bound on the KL divergence between posteriors resulting from similar data to obtain a measure of how fine a distinction between datasets the adversary can make after a finite number of draws from the posterior:
Consequently, either smoother likelihoods (i.e., a smaller Lipschitz constant) or a larger concentration on smoother likelihoods (i.e., a larger concentration constant) increases the effort required by the adversary and reduces the sensitivity of the posterior. Note that, unlike the results obtained for differential privacy of the posterior sampling mechanism, these results have the same algebraic form under both assumptions.
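The adversary's task can be sketched as a small simulation. Everything concrete here is an assumption made for illustration: the two candidate Beta posteriors, the equal-width partition of the unit interval, and the nearest-histogram decision rule (one simple estimator among many, not the one analysed in the bounds).

```python
import random


def histogram(samples, cells):
    """Empirical measure of samples in [0, 1] on a partition into equal cells."""
    counts = [0] * cells
    for s in samples:
        counts[min(int(s * cells), cells - 1)] += 1
    return [c / len(samples) for c in counts]


def l1(p, q):
    """L1 distance between two discrete distributions on the same partition."""
    return sum(abs(a - b) for a, b in zip(p, q))


def guess_dataset(observed, candidates, cells=10, sim_draws=50_000, seed=1):
    """Pick the candidate posterior whose simulated histogram is closest to the observed one."""
    rng = random.Random(seed)
    empirical = histogram(observed, cells)
    best_name, best_dist = None, float("inf")
    for name, (a, b) in candidates.items():
        # The adversary knows the prior and model, so it can simulate each candidate.
        reference = histogram([rng.betavariate(a, b) for _ in range(sim_draws)], cells)
        dist = l1(empirical, reference)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name


# Hypothetical posteriors for two datasets that differ in one Bernoulli observation.
candidates = {"x": (6.0, 6.0), "y": (7.0, 5.0)}
rng = random.Random(0)
observed = [rng.betavariate(*candidates["x"]) for _ in range(5000)]
print(guess_dataset(observed, candidates))
```

With many posterior draws the histogram separates the two candidates easily; with only a handful of draws, or with a more concentrated prior that brings the posteriors closer together, the same rule becomes unreliable, which is the effect the finite-sample bound quantifies.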
5 Examples satisfying our assumptions
In what follows we study, for different choices of likelihood and corresponding conjugate prior, what constraints must be placed on the prior’s concentration to guarantee a desired level of privacy. These case studies closely follow the pattern in differential privacy research where the main theorems for a new mechanism are sufficient conditions on (e.g., Laplace) noise levels to be introduced to a response in order to guarantee a level of -differential privacy.
First consider exponential families, of the form
where is the base measure, is the distribution’s natural parameter corresponding to , is the distribution’s sufficient statistic, and is its log-partition function. For distributions in this family, under the absolute log-ratio distance, the family of parameters of Assumption 2 must satisfy, for all : . If the left-hand side has an amenable form, then we can quantify the set for which this requirement holds. Particularly, for distributions where is constant and is scalar (e.g., Bernoulli, exponential, and Laplace), this requirement simplifies to . One can then find the supremum of the left-hand side independent from , yielding a simple formula for the feasible for any . Here are some examples.
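The requirement can be made explicit with a short worked derivation. The symbols $h$, $\eta_\theta$, $T$, $A$, $L$, and $\rho$ below are assumed notation for the base measure, natural parameter, sufficient statistic, log-partition function, Lipschitz constant, and dataset distance respectively; the algebra itself follows directly from the exponential-family form.

```latex
\begin{align*}
p_\theta(x) &= h(x) \exp\!\big( \eta_\theta^\top T(x) - A(\eta_\theta) \big), \\
\left| \ln \frac{p_\theta(x)}{p_\theta(y)} \right|
  &= \left| \ln \frac{h(x)}{h(y)} + \eta_\theta^\top \big( T(x) - T(y) \big) \right|
  \le L \, \rho(x, y). \\
\intertext{When $h$ is constant and $T$ is scalar, the log-partition term and base measure cancel, leaving}
|\eta_\theta| \, \frac{|T(x) - T(y)|}{\rho(x, y)} &\le L
  \quad \text{for all } x \ne y,
\end{align*}
```

so the feasible parameters are those with $|\eta_\theta| \le L / \sup_{x \ne y} \big( |T(x) - T(y)| / \rho(x, y) \big)$, where the supremum does not depend on $\eta_\theta$.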
Lemma 3 (Exponential conjugate prior).
Lemma 4 (Laplace conjugate prior).
The Laplace distribution and Laplace conjugate prior , , , satisfies Assumption 2 with parameters and metric
Lemma 5 (Beta-Binomial conjugate prior).
Lemma 6 (Normal distribution).
Lemma 7 (Discrete Bayesian networks).
Consider a family of discrete Bayesian networks on variables. More specifically, each member , is a distribution on a finite space and we write for the probability of any outcome in . We also let be the distance between and . If is the smallest probability assigned to any one sub-event, then Assumption 1 is satisfied with .
6 Conclusion
We have provided a unifying framework for private and secure inference in a Bayesian setting. Under simple but general assumptions, we have shown that Bayesian inference is both robust and private in a certain sense. In particular, our results establish that generalised differential privacy can be achieved while using only existing constructs in Bayesian inference. Our results merely place concentration conditions on the prior. This allows us to use a general posterior sampling mechanism for responding to queries.
Due to its relative simplicity on top of non-private inference, our framework may thus serve as a fundamental building block for more sophisticated, general Bayesian inference. As an additional step towards this goal, we have demonstrated the application of our framework to deriving analytical expressions for well-known distribution families, and for discrete Bayesian networks. Finally, we bounded the amount of effort required of an attacker to breach privacy when observing samples from the posterior. This serves as a principled guide for how much access can be granted to querying the posterior, while still guaranteeing privacy.
We have not examined how privacy concerns relate to learning. While larger improves privacy, it also concentrates the prior so much that learning would be inhibited. Thus, should be chosen to optimise the trade-off between privacy and learning. However, we leave this issue for future work.
Appendix A The Le Cam Method
It is possible to apply standard minimax theory to obtain lower bounds on the rate of convergence of the adversary’s estimate to the true data. In order to do so, we can for example apply the method due to LeCam (1973), which places lower bounds on the expected distance between an estimator and the true parameter. In order to apply it in our case, we simply replace the parameter space with the dataset space.
Le Cam’s method assumes the existence of a family of probability measures indexed by some parameter, with the parameter space being equipped with a pseudo-metric. In our setting, we use Le Cam’s method in a slightly unorthodox, but very natural manner. Define the family of probability measures on to be:
the family of posterior measures in the parameter space, for a specific prior . Consequently, now plays the role of the parameter space, while is used as the metric. The original family plays no further role in this construction, other than a way to specify the posterior distributions from the prior.
Now let be an arbitrary estimator of the unknown data . As in Le Cam, we extend to subsets of so that
Now we can re-state the following well-known Lemma for our specific setting.
Lemma 8 (Le Cam’s method).
Let be an estimator of on taking values in the metric space . Suppose that there are well-separated subsets such that . Suppose also that are subsets of such that for . Then:
This lemma has an interesting interpretation in our case. The quantity
is the expected distance between the real data and the guessed data when is drawn from the posterior distribution.
Consequently, it is possible to apply this method almost directly to obtain results for specific families of posteriors. As shown by, e.g., Yu (1997), even in simple scenarios the lower bound on the minimax estimation rate is .
Appendix B Proofs of examples
Proof of Lemma 3.
We first compute the absolute log-ratio distance for any and according to the exponential likelihood function:
Thus, under Assumption 2, using , the set of feasible parameters for any is . Therefore the assumption requires the prior to adequately support this range, but because the CDF at of the exponential prior with parameter is simply given by , every such prior satisfies the assumption with . ∎
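The two facts used in this proof can be checked numerically. The sketch below assumes the exponential likelihood with rate parameter `rate`, the distance |x - y| on observations, and an exponential prior with parameter `c` on the rate; the function names are hypothetical.

```python
import math


def exp_log_pdf(x, rate):
    """Log-density of the exponential distribution with the given rate at x >= 0."""
    return math.log(rate) - rate * x


def lipschitz_ratio(x, y, rate):
    """Absolute log-ratio of the likelihoods divided by |x - y|."""
    return abs(exp_log_pdf(x, rate) - exp_log_pdf(y, rate)) / abs(x - y)


def prior_mass_below(L, c):
    """Mass that an exponential prior with parameter c places on rates <= L (its CDF at L)."""
    return 1.0 - math.exp(-c * L)


# The ratio equals the rate for every pair x != y, so rates up to L are exactly
# the parameters whose likelihood is L-Lipschitz in the absolute log-ratio.
rate = 2.5
grid = [0.1 * i for i in range(1, 50)]
for x in grid:
    for y in grid:
        if x != y:
            assert math.isclose(lipschitz_ratio(x, y, rate), rate, rel_tol=1e-6)

print(prior_mass_below(L=2.0, c=0.5))
```

A larger prior parameter `c` pushes more mass onto small rates, which is how the prior's concentration translates into the constant of Assumption 2.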
Proof of Lemma 4.
For any and , the absolute log-ratio distance for this distribution can be bounded as
where the inequality follows from the triangle inequality applied to . Thus, if we use , the set of feasible parameters for Assumption 2 is and . Again we can use an exponential prior with rate parameter for the inverse scale, , and any prior on to obtain the second part of Assumption 2. Every such prior satisfies the assumption with . These similarities are not surprising considering that if then . ∎
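For reference, the bound used in this proof can be written out under assumed notation, with $\mu$ the location and $s$ the scale of the Laplace distribution and $|x - y|$ as the distance on observations. The inequality is the reverse triangle inequality.

```latex
\begin{align*}
p_{\mu, s}(x) &= \frac{1}{2s} \exp\!\left( -\frac{|x - \mu|}{s} \right), \\
\left| \ln \frac{p_{\mu, s}(x)}{p_{\mu, s}(y)} \right|
  &= \frac{1}{s} \Big|\, |x - \mu| - |y - \mu| \,\Big|
  \le \frac{1}{s} \, |x - y| .
\end{align*}
```

Under these assumptions the likelihood is $L$-Lipschitz whenever the inverse scale satisfies $1/s \le L$, for any location $\mu$, which mirrors the role of the rate in the exponential case.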
Proof of Lemma 5.
Here, we consider data drawn from a binomial distribution with a beta prior on its proportion parameter, . Thus, the likelihood and prior functions are
where , and is the beta function. The resulting posterior is a beta-binomial distribution. Again we consider the application of Assumption 2 to this beta-binomial distribution. For this purpose, we must quantify the parameter sets for a given according to a distance function. The absolute log-ratio distance between the binomial likelihood function for any pair of arguments, and , is
where . By substituting this distance into the supremum of Eq. (2.5), we seek feasible values of for which the supremum is non-negative; here, we explore the case where . Without loss of generality, we assume , and thus require that
However, by the definition of , the ratio is in fact the slope of the chord from to on the function . Since the function is concave in , this slope achieves its maximum and minimum at its boundary values; i.e., it is maximised for and and minimised for and . Thus, the ratio attains a maximum value of and a minimum of for which the above supremum is simply . From Eq. (B.1), we therefore have, for all :
We want to bound . We know that: where is the complement of . We selected , so is composed of two symmetric intervals: and . In addition, the mass must concentrate at , as we have .
Due to symmetry, the mass outside of is twice that of the first interval. This is:
Since it holds that for all :
This is bounded above by simply applying the max bound for integrals.
If we use i.e. the desired upper limit we have:
Finally, we have that and hence, so that we have: . We want to upper bound this by . Solving for , we obtain
Proof of Lemma 6.
Proof of Lemma 7.
It is instructive to first examine the case where all variables are independent and we have a single observation. Then and
Consequently, if is the smallest probability assigned to any one sub-event, then .
In the general case, we have observations sequences and dependent variables. To take the network connectivity into account, let be such that and define: and . Using a similar argument to (B.2), it is easy to see that in this case . ∎
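The independent-variable case can be checked by brute force on a small example. The sketch assumes binary variables, Hamming distance between outcomes, and the constant ln(1/ε) with ε the smallest probability of any single-variable event; these specifics and the probability values are assumptions made for illustration.

```python
import itertools
import math

# Hypothetical marginals P(variable k = 1) for four independent binary variables.
probs = [0.2, 0.5, 0.7, 0.9]
eps = min(min(p, 1 - p) for p in probs)  # smallest probability of any sub-event


def log_prob(outcome):
    """Log-probability of a joint outcome under independence."""
    return sum(math.log(p if v == 1 else 1 - p) for p, v in zip(probs, outcome))


def hamming(u, v):
    """Number of variables on which two outcomes disagree."""
    return sum(a != b for a, b in zip(u, v))


bound = math.log(1 / eps)
outcomes = list(itertools.product([0, 1], repeat=len(probs)))
worst = max(
    abs(log_prob(u) - log_prob(v)) / hamming(u, v)
    for u in outcomes
    for v in outcomes
    if u != v
)
# Each differing variable changes the log-probability by at most ln(1 / eps).
assert worst <= bound + 1e-12
print(worst, bound)
```

Each variable on which two outcomes differ contributes a log-ratio of at most ln((1 - ε)/ε), which is below ln(1/ε), so the per-variable bound sums to the Lipschitz condition under Hamming distance.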
Appendix C Collected Proofs
Proof of Lemma 1.
For Assumption 1, the proof follows directly from the definition of the absolute log-ratio distance; namely,
This can be reduced from to if only items differ since if .
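The additive step behind this argument can be sketched as follows, with $p_\theta^n$ denoting the product density on $n$ i.i.d. observations, and $L$ and $\rho$ the single-observation Lipschitz constant and pseudo-metric (assumed notation):

```latex
\begin{align*}
\left| \ln \frac{p_\theta^n(x)}{p_\theta^n(y)} \right|
  = \left| \sum_{i=1}^n \ln \frac{p_\theta(x_i)}{p_\theta(y_i)} \right|
  \le \sum_{i=1}^n \left| \ln \frac{p_\theta(x_i)}{p_\theta(y_i)} \right|
  \le L \sum_{i=1}^n \rho(x_i, y_i).
\end{align*}
```

Every index with $x_i = y_i$ contributes zero to the sums, so when the datasets differ in at most $k$ items only $k$ terms remain.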
Proof of Theorem 1.
Let us now tackle claim (1.i). First, we can decompose the KL-divergence into two parts.
From Ass. 1, for all so:
Combining this with (C.1) we obtain
Claim (1.ii) is dealt with similarly. Once more, we can break down the distance in parts. Let . Then , as , while and from Ass 2. We can thus partition into disjoint sets corresponding to uniformly sized intervals of size indexed by . We bound the divergence on each partition and sum over .
via the geometric series. This holds for any size parameter and is convex for , . Thus, there is an optimal choice for that minimizes this bound. Differentiating w.r.t and setting the result to yields where is the unique non-zero solution to . The optimal bound is then
As the is the unique positive solution to , and we define . ∎
Proof of Theorem 2.
For part (2.i), we assumed that there is an such that , , thus implying . Further, in the proof of Theorem 1, we showed that for all . From Eq. 2.2, we can then combine these to bound the posterior of any as follows for all :