Probabilistic models trained via Bayesian inference are widely and successfully used in application domains where privacy is invaluable, from text analysis (Blei et al., 2003; Goldwater and Griffiths, 2007), to personalization (Salakhutdinov and Mnih, 2008), to medical informatics (Husmeier et al., 2006), to MOOCs (Piech et al., 2013). In these applications, data scientists must carefully balance the benefits and potential insights from data analysis against the privacy concerns of the individuals whose data are being studied (Daries et al., 2014).
Dwork et al. (2006) placed the notion of privacy-preserving data analysis on a solid foundation by introducing differential privacy (Dwork and Roth, 2013), an algorithmic formulation of privacy which is a gold standard for privacy-preserving data-driven algorithms. Differential privacy measures the privacy “cost” of an algorithm. When designing privacy-preserving methods, the goal is to achieve a good trade-off between privacy and utility, which ideally improves with the amount of available data.
As observed by Dimitrakakis et al. (2014) and Wang et al. (2015b), Bayesian posterior sampling behaves synergistically with differential privacy because it automatically provides a degree of differential privacy under certain conditions. However, there are substantial gaps between this elegant theory and the practical reality of Bayesian data analysis. Privacy-preserving posterior sampling is hampered by data inefficiency, as measured by asymptotic relative efficiency (ARE). In practice, it generally requires artificially selected constraints on the spaces of parameters as well as data points. Its privacy properties are also not typically guaranteed for approximate inference.
This paper identifies these gaps between theory and practice, and begins to mend them via an extremely simple alternative technique based on the workhorse of differential privacy, the Laplace mechanism (Dwork et al., 2006). Our approach is equivalent to a generalization of the algorithm for beta-Bernoulli systems recently and independently proposed by Zhang et al. (2016). We provide a theoretical analysis and empirical validation of the advantages of the proposed method. We extend both our method and the one posterior sample (OPS) method of Dimitrakakis et al. (2014) and Wang et al. (2015b) to the case of approximate inference with privacy-preserving MCMC. Finally, we demonstrate the practical applicability of this technique by showing how to use a privacy-preserving HMM to analyze sensitive military records from the Iraq and Afghanistan wars leaked by the Wikileaks organization. Our primary contributions are as follows:
- We analyze the privacy cost of posterior sampling for exponential family posteriors via OPS.
- We explore a simple Laplace mechanism alternative to OPS for exponential families.
- Under weak conditions we establish the consistency of the Laplace mechanism approach and its data efficiency advantages over OPS.
- We extend the OPS and Laplace mechanism methods to approximate inference via MCMC.
- We demonstrate the practical implications with a case study on sensitive military records.
We begin by discussing preliminaries on differential privacy and its application to Bayesian inference. Our novel contributions will begin in Section 3.1.
2.1 Differential Privacy
Differential privacy is a formal notion of the privacy of data-driven algorithms. For an algorithm to be differentially private, the probabilities of its outputs may not change much when one individual’s data point is modified, so that the output reveals little information about any one individual’s data. More precisely, a randomized algorithm $\mathcal{M}(X)$ is said to be $(\epsilon, \delta)$-differentially private if
$$\Pr(\mathcal{M}(X) \in \mathcal{S}) \le \exp(\epsilon) \Pr(\mathcal{M}(X') \in \mathcal{S}) + \delta$$
for all measurable subsets $\mathcal{S}$ of the range of $\mathcal{M}$ and for all datasets $X$, $X'$ differing by a single entry (Dwork and Roth, 2013). If $\delta = 0$, the algorithm is said to be $\epsilon$-differentially private.
2.1.1 The Laplace Mechanism
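One straightforward method for obtaining $\epsilon$-differential privacy, known as the Laplace mechanism (Dwork et al., 2006), adds Laplace noise to the revealed information, where the amount of noise depends on $\epsilon$ and on a quantifiable notion of the sensitivity to changes in the database. Specifically, the $L_1$ sensitivity $\triangle h$ for function $h$ is defined as
$$\triangle h = \max_{X, X'} \| h(X) - h(X') \|_1$$
for all datasets $X$, $X'$ differing in at most one element. The Laplace mechanism adds noise via
$$\mathcal{M}_L(X, h, \epsilon) = h(X) + (Y_1, Y_2, \ldots, Y_d), \quad Y_j \sim \text{Laplace}(\triangle h / \epsilon), \quad j \in \{1, \ldots, d\},$$
where $d$ is the dimensionality of the range of $h$. The mechanism $\mathcal{M}_L(X, h, \epsilon)$ is $\epsilon$-differentially private.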
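As a minimal sketch of the mechanism just described, the following Python code adds independent Laplace noise with scale $\triangle h / \epsilon$ to each coordinate of a query result. The function names and the example histogram are illustrative assumptions rather than part of the original presentation, and the caller is assumed to supply a correct sensitivity for the query.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) variate.

    The difference of two independent Exponential(1) variates follows a
    standard Laplace distribution, which is then rescaled.
    """
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def laplace_mechanism(values, sensitivity, epsilon, rng=None):
    """Return h(X) with Laplace(sensitivity / epsilon) noise added per coordinate.

    `values` is the query output h(X); `sensitivity` is its L1 sensitivity,
    which must be derived separately for the query at hand.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if sensitivity < 0:
        raise ValueError("sensitivity must be non-negative")
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]


# Hypothetical example: a two-bin histogram (successes, failures).
# Replacing one record moves one unit between bins, so the L1 sensitivity is 2.
noisy_counts = laplace_mechanism([70.0, 30.0], sensitivity=2.0, epsilon=1.0,
                                 rng=random.Random(0))
```

Smaller values of `epsilon` give a larger noise scale, which is the privacy/utility trade-off discussed above.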
2.1.2 The Exponential Mechanism
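The exponential mechanism (McSherry and Talwar, 2007) aims to output responses of high utility while maintaining privacy. Given a utility function $u(X, r)$ that maps database $X$ / output $r$ pairs to a real-valued score, the exponential mechanism $\mathcal{M}_E(X, u, \epsilon)$ produces random outputs via
$$\Pr(\mathcal{M}_E(X, u, \epsilon) = r) \propto \exp\left(\frac{\epsilon \, u(X, r)}{2 \triangle u}\right),$$
where the sensitivity of the utility function is
$$\triangle u = \max_{r, (X^{(1)}, X^{(2)})} \left| u(X^{(1)}, r) - u(X^{(2)}, r) \right|,$$
in which $(X^{(1)}, X^{(2)})$ are pairs of databases that differ in only one element.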
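The sampling rule above can be sketched for a finite set of candidate outputs as follows. This is an illustrative sketch only: the function names and the count-based utility are assumptions introduced here, and the sensitivity passed in is assumed to be a valid bound for the chosen utility.

```python
import math
import random


def exponential_mechanism_probs(data, candidates, utility, sensitivity, epsilon):
    """Output probabilities proportional to exp(epsilon * u(X, r) / (2 * sensitivity))."""
    if sensitivity <= 0:
        raise ValueError("sensitivity must be positive")
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    scores = [epsilon * utility(data, r) / (2.0 * sensitivity) for r in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def exponential_mechanism(data, candidates, utility, sensitivity, epsilon, rng=None):
    """Draw one candidate according to the exponential mechanism."""
    rng = rng or random.Random()
    probs = exponential_mechanism_probs(data, candidates, utility, sensitivity, epsilon)
    return rng.choices(candidates, weights=probs, k=1)[0]


def count_utility(data, r):
    """Hypothetical utility: the number of records equal to r.

    Changing one record changes any count by at most 1, so its sensitivity is 1.
    """
    return sum(1 for x in data if x == r)
```

With `epsilon = 0` every candidate is equally likely, and as `epsilon` grows the highest-utility candidate dominates.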
2.1.3 Composition Theorems
A key property of differential privacy is that it holds under composition, via an additive accumulation.
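Theorem. If $\mathcal{M}_1$ is $(\epsilon_1, \delta_1)$-differentially private, and $\mathcal{M}_2$ is $(\epsilon_2, \delta_2)$-differentially private, then $\mathcal{M}_{1,2}(X) = (\mathcal{M}_1(X), \mathcal{M}_2(X))$ is $(\epsilon_1 + \epsilon_2, \delta_1 + \delta_2)$-differentially private.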
This allows us to view the total $\epsilon$ and $\delta$ of our procedure as a privacy “budget” that we spend across the operations of our analysis. There also exists an “advanced composition” theorem which provides privacy guarantees in an adversarial adaptive scenario called $k$-fold composition, and also allows an analyst to trade an increased $\delta$ for a smaller $\epsilon$ in this scenario (Dwork et al., 2010). Differential privacy is also immune to data-independent post-processing.
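The budget view of basic additive composition can be sketched as a small accountant. The class name and interface below are hypothetical, and the sketch covers only the additive theorem, not advanced composition.

```python
class PrivacyBudget:
    """Track cumulative (epsilon, delta) spending under basic additive composition."""

    def __init__(self, epsilon, delta=0.0):
        self.epsilon = epsilon
        self.delta = delta
        self.spent_epsilon = 0.0
        self.spent_delta = 0.0

    def spend(self, epsilon, delta=0.0):
        """Charge the cost of one mechanism; refuse if the budget would be exceeded."""
        new_epsilon = self.spent_epsilon + epsilon
        new_delta = self.spent_delta + delta
        tolerance = 1e-12  # guards against floating-point round-off only
        if new_epsilon > self.epsilon + tolerance or new_delta > self.delta + tolerance:
            raise ValueError("privacy budget exceeded")
        self.spent_epsilon = new_epsilon
        self.spent_delta = new_delta

    def remaining(self):
        """Return the unspent (epsilon, delta)."""
        return (max(self.epsilon - self.spent_epsilon, 0.0),
                max(self.delta - self.spent_delta, 0.0))
```

For instance, two releases costing 0.4 and 0.6 exhaust a total budget of 1.0, after which any further call to `spend` raises an error.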
2.2 Privacy and Bayesian Inference
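Suppose we would like a differentially private draw of parameters and latent variables of interest $\theta$ from the posterior $\Pr(\theta | X)$, where $X = \{x_1, \ldots, x_N\}$ is the private dataset. We can accomplish this by interpreting posterior sampling as an instance of the exponential mechanism with utility function $u(X, \theta) = \log \Pr(\theta, X)$, i.e. the log joint probability of the chosen assignment $\theta$ and the dataset $X$ (Wang et al., 2015b). We then draw $\theta$ via
$$f(\theta; X, \epsilon) \propto \exp\left(\frac{\epsilon \log \Pr(\theta, X)}{2 \triangle \log \Pr(\theta, X)}\right) \tag{6}$$
where the sensitivity is
$$\triangle \log \Pr(\theta, X) = \max_{\theta, (X^{(1)}, X^{(2)})} \left| \log \Pr(\theta, X^{(1)}) - \log \Pr(\theta, X^{(2)}) \right|,$$
in which $X^{(1)}$ and $X^{(2)}$ differ in one element. If the data points are conditionally independent given $\theta$,
$$\log \Pr(\theta, X) = \log \Pr(\theta) + \sum_{i=1}^{N} \log \Pr(x_i | \theta),$$
where $\Pr(\theta)$ is the prior and $\Pr(x_i | \theta)$ is the likelihood term for data point $x_i$. Since the prior does not depend on the data, and each data point is associated with a single log-likelihood term $\log \Pr(x_i | \theta)$ in $\log \Pr(\theta, X)$, from the above two equations we have
$$\triangle \log \Pr(\theta, X) = \max_{x, x', \theta} \left| \log \Pr(x' | \theta) - \log \Pr(x | \theta) \right|. \tag{9}$$
This gives us the privacy cost of posterior sampling: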
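Theorem 2.2. If $\max_{x, x' \in \mathcal{X}, \theta \in \Theta} \left| \log \Pr(x' | \theta) - \log \Pr(x | \theta) \right| \le C$, releasing one sample from the posterior distribution $\Pr(\theta | X)$ with any prior is $2C$-differentially private.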
Wang et al. (2015b) derived this form of the result from first principles, while noting that the exponential mechanism can be used, as we do here. Although they do not explicitly state the theorem, they implicitly use it to show two noteworthy special cases, referred to as the One Posterior Sample (OPS) procedure. We state the first of these cases:
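Theorem. If $\max_{x \in \mathcal{X}, \theta \in \Theta} \left| \log \Pr(x | \theta) \right| \le B$, releasing one sample from the posterior distribution $\Pr(\theta | X)$ with any prior is $4B$-differentially private.
This follows directly from Theorem 2.2, since if $\left| \log \Pr(x | \theta) \right| \le B$ for all $x$ and $\theta$, then $\left| \log \Pr(x' | \theta) - \log \Pr(x | \theta) \right| \le 2B$, so the bound $C$ of Theorem 2.2 can be taken to be $2B$.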
Under the exponential mechanism, $\epsilon$ provides an adjustable knob trading between privacy and fidelity. When $\epsilon = 0$, the procedure samples from a uniform distribution, giving away no information about $X$. When $\epsilon = 2 \triangle \log \Pr(\theta, X)$, the procedure reduces to sampling from the posterior $\Pr(\theta | X) \propto \Pr(\theta, X)$. As $\epsilon$ approaches infinity the procedure becomes increasingly likely to sample the assignment with the highest posterior probability. Assuming that our goal is to sample rather than to find a mode, we would cap $\epsilon$ at $2 \triangle \log \Pr(\theta, X)$ in the above procedure in order to correctly sample from the true posterior. More generally, if our privacy budget is $\epsilon'$, and $\epsilon' \ge 2 q \triangle \log \Pr(\theta, X)$, for integer $q$, we can draw $q$ posterior samples within our budget.
As observed by Huang and Kannan (2012), the exponential mechanism can be understood via statistical mechanics. We can write it as a Boltzmann distribution (a.k.a. a Gibbs measure)
$$f(\theta; X, \epsilon) \propto \exp\left(-\frac{E(\theta)}{T}\right), \quad T = \frac{2 \triangle u}{\epsilon}, \quad E(\theta) = -u(X, \theta), \tag{10}$$
where $E(\theta)$ is the energy of state $\theta$ in a physical system, and $T$ is the temperature of the system (in units such that Boltzmann’s constant is one). Reducing $\epsilon$ corresponds to increasing the temperature, which can be understood as altering the distribution such that a Markov chain moves through the state space more rapidly.
3 Privacy for Exponential Families: Exponential vs Laplace
Table 1 (column headings): | Mechanism | Sensitivity | $S(X)$ is | Release | ARE | Pay Gibbs cost |
By analyzing the privacy cost of sampling from exponential family posteriors in the general case we can recover the privacy properties of many standard distributions. These results can be applied to full posterior sampling, when feasible, or to Gibbs sampling updates, as we discuss in Section 4. In this section we analyze the privacy cost of sampling from exponential family posterior distributions exactly (or at an appropriate temperature) via the exponential mechanism, following Dimitrakakis et al. (2014) and Wang et al. (2015b), and via a method based on the Laplace mechanism, which is a generalization of Zhang et al. (2016). The properties of the two methods are compared in Table 1.
3.1 The Exponential Mechanism
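Consider exponential family models with likelihood
$$\Pr(x | \theta) = h(x) \, g(\theta) \exp\left(\theta^\top S(x)\right),$$
where $S(x)$ is a vector of sufficient statistics for data point $x$, and $\theta$ is a vector of natural parameters. For $N$ i.i.d. data points, we have
$$\Pr(X | \theta) = \left(\prod_{i=1}^{N} h(x_i)\right) g(\theta)^{N} \exp\left(\theta^\top \sum_{i=1}^{N} S(x_i)\right).$$
Further suppose that we have a conjugate prior which is also an exponential family distribution,
$$\Pr(\theta | \chi, \alpha) = f(\chi, \alpha) \, g(\theta)^{\alpha} \exp\left(\alpha \theta^\top \chi\right),$$
where $\alpha$ is a scalar, the number of prior “pseudo-counts,” and $\chi$ is a parameter vector. The posterior is proportional to the prior times the likelihood,
$$\Pr(\theta | X, \chi, \alpha) \propto g(\theta)^{N + \alpha} \exp\left(\theta^\top \left(\sum_{i=1}^{N} S(x_i) + \alpha \chi\right)\right). \tag{11}$$
To compute the sensitivity of the posterior, we have, for datasets $X$ and $X'$ that differ only in the data points $x$ and $x'$,
$$\left| \log \Pr(\theta, X') - \log \Pr(\theta, X) \right| = \left| \theta^\top \left(S(x') - S(x)\right) + \log h(x') - \log h(x) \right|.$$
From Equation 9, we obtain
$$\triangle \log \Pr(\theta, X) = \sup_{x, x' \in \mathcal{X}, \theta \in \Theta} \left| \theta^\top \left(S(x') - S(x)\right) + \log h(x') - \log h(x) \right|. \tag{13}$$
A posterior sample at temperature $T$,
$$\Pr\nolimits_T(\theta | X, \chi, \alpha) \propto \left( g(\theta)^{N + \alpha} \exp\left(\theta^\top \left(\sum_{i=1}^{N} S(x_i) + \alpha \chi\right)\right) \right)^{1/T},$$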
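has privacy cost $\epsilon = \frac{2 \triangle \log \Pr(\theta, X)}{T}$, by the exponential mechanism. As an example, consider a beta-Bernoulli model,
$$\Pr(p | \alpha, \beta) = \frac{1}{B(\alpha, \beta)} p^{\alpha - 1} (1 - p)^{\beta - 1}, \quad \Pr(x | p) = p^{x} (1 - p)^{1 - x},$$
where $B(\alpha, \beta)$ is the beta function. Given $N$ binary-valued data points $X = \{x_1, \ldots, x_N\}$ from the Bernoulli distribution, the posterior is
$$\Pr(p | X, \alpha, \beta) \propto p^{n_+ + \alpha - 1} (1 - p)^{n_- + \beta - 1}, \quad n_+ = \sum_{i=1}^{N} x_i, \quad n_- = \sum_{i=1}^{N} (1 - x_i).$$
The sufficient statistics for each data point are $S(x) = [x, 1 - x]^\top$. The natural parameters for the likelihood are $\theta = [\log p, \log(1 - p)]^\top$, and $h(x) = 1$. The exponential mechanism sensitivity for a truncated version of this model, where $a_0 \le p \le 1 - a_0$, can be computed from Equation 13,
$$\triangle \log \Pr(\theta, X) = \sup_{x, x' \in \{0, 1\}, \, p \in [a_0, 1 - a_0]} \left| x \log p + (1 - x) \log(1 - p) - x' \log p - (1 - x') \log(1 - p) \right| = \log(1 - a_0) - \log a_0. \tag{15}$$
Note that if $a_0 = 0$, corresponding to a standard untruncated beta distribution, the sensitivity is unbounded. This makes intuitive sense because some datasets are impossible if $p = 0$ or $p = 1$, which violates differential privacy.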
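To make the truncated beta-Bernoulli example concrete, the following sketch computes the sensitivity $\log(1 - a_0) - \log a_0$, converts a privacy budget into a temperature $T = 2 \triangle / \epsilon$ (capped at $T = 1$, the true posterior), and draws one sample from the tempered, truncated posterior. The function names, the rejection step used to enforce the truncation, and the example values are assumptions made for illustration; positive prior parameters are assumed.

```python
import math
import random


def truncated_beta_sensitivity(a0):
    """Exponential mechanism sensitivity log(1 - a0) - log(a0) for p in [a0, 1 - a0]."""
    if not 0.0 < a0 < 0.5:
        raise ValueError("a0 must lie in (0, 0.5)")
    return math.log(1.0 - a0) - math.log(a0)


def ops_temperature(a0, epsilon):
    """Temperature T = 2 * sensitivity / epsilon, never below 1 (the true posterior)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return max(1.0, 2.0 * truncated_beta_sensitivity(a0) / epsilon)


def sample_tempered_truncated_beta(n_pos, n_neg, alpha, beta, a0, epsilon, rng=None):
    """Draw one sample from the truncated beta posterior raised to the power 1 / T."""
    rng = rng or random.Random()
    temperature = ops_temperature(a0, epsilon)
    # Raising p^(a - 1) * (1 - p)^(b - 1) to the power 1 / T yields another beta shape.
    shape_a = (n_pos + alpha - 1.0) / temperature + 1.0
    shape_b = (n_neg + beta - 1.0) / temperature + 1.0
    while True:  # rejection step: keep only draws inside the truncation interval
        p = rng.betavariate(shape_a, shape_b)
        if a0 <= p <= 1.0 - a0:
            return p
```

For example, with $a_0 = 0.2$ the sensitivity is $\log 4 \approx 1.39$, so a budget of $\epsilon = 1$ corresponds to a temperature of roughly 2.77, i.e. a noticeably flattened posterior.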
3.2 The Laplace Mechanism
One limitation of the exponential mechanism / OPS approach to private Bayesian inference is that the temperature $T$ of the approximate posterior is fixed for any $\epsilon$ that we are willing to pay, regardless of the number of data points $N$ (Equation 10). While the posterior becomes more accurate as $N$ increases, and the OPS approximation becomes more accurate by proxy, the OPS approximation remains a factor of $T$ flatter than the posterior at $N$ data points. This is not simply a limitation of the analysis. An adversary can choose data such that the dataset-specific privacy cost of posterior sampling approaches the worst case given by the exponential mechanism as $N$ increases, by causing the posterior to concentrate on the worst-case $\theta$ (see the supplement for an example).
Here, we provide a simple Laplace mechanism alternative for exponential family posteriors, which becomes increasingly faithful to the true posterior with $N$ data points, as $N$ increases, for any fixed privacy cost $\epsilon$, under general assumptions. The approach is based on the observation that for exponential family posteriors, as in Equation 11, the data interacts with the distribution only through the aggregate sufficient statistics, $S(X) = \sum_{i=1}^{N} S(x_i)$. If we release privatized versions of these statistics we can use them to perform any further operations that we’d like, including drawing samples, computing moments and quantiles, and so on. This can straightforwardly be accomplished via the Laplace mechanism:
$$\hat{S}(X) = \text{proj}\left(S(X) + (Y_1, Y_2, \ldots, Y_d)\right), \quad Y_j \sim \text{Laplace}(\triangle S(X) / \epsilon), \quad j \in \{1, \ldots, d\}, \tag{16}$$
where $\text{proj}(\cdot)$ is a projection onto the space of sufficient statistics, applied if the Laplace noise takes the statistics out of this region. For example, if the statistics are counts, the projection ensures that they are non-negative. The sensitivity of the aggregate statistics is
$$\triangle S(X) = \sup_{X, X'} \left\| \sum_{i=1}^{N} S(x_i) - \sum_{i=1}^{N} S(x'_i) \right\|_1 = \sup_{x, x' \in \mathcal{X}} \left\| S(x') - S(x) \right\|_1, \tag{17}$$
where $X$, $X'$ differ in at most one element. Note that perturbing the sufficient statistics is equivalent to perturbing the parameters, which was recently and independently proposed by Zhang et al. (2016) for beta-Bernoulli models such as Bernoulli naive Bayes.
A comparison of Equations 17 and 13 reveals that the L1 sensitivity and exponential mechanism sensitivities are closely related. The L1 sensitivity is generally easier to control as it does not involve $\theta$ or $h(x)$ but otherwise involves similar terms to the exponential mechanism sensitivity. For example, in the beta posterior case, where $S(x) = [x, 1 - x]^\top$ is a binary indicator vector, the L1 sensitivity is 2. This should be contrasted to the exponential mechanism sensitivity of Equation 15, which depends heavily on the truncation point, and is unbounded for a standard untruncated beta distribution. The L1 sensitivity is fixed regardless of the number of data points $N$, and so the amount of Laplace noise to add becomes smaller relative to the total $S(X)$ as $N$ increases.
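For the beta-Bernoulli case, the Laplace mechanism on sufficient statistics can be sketched as below: the success and failure counts are perturbed with Laplace noise of scale $2 / \epsilon$, projected to be non-negative, and then used as ordinary counts in the conjugate posterior. The function names and the clipping form of the projection are illustrative assumptions; positive prior parameters are assumed.

```python
import random


def privatize_bernoulli_counts(data, epsilon, rng=None):
    """Release noisy (successes, failures) counts for binary data via the Laplace mechanism."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    n_pos = sum(data)
    n_neg = len(data) - n_pos
    scale = 2.0 / epsilon  # the L1 sensitivity of S(x) = [x, 1 - x] is 2
    noisy = [n + scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
             for n in (n_pos, n_neg)]
    # Projection step: counts cannot be negative.
    return [max(value, 0.0) for value in noisy]


def sample_private_posterior(noisy_counts, alpha, beta, num_samples, rng=None):
    """Draw samples from Beta(alpha + n_pos, beta + n_neg) built from released counts.

    This is post-processing of already-released statistics, so any number of
    samples can be drawn without further privacy cost.
    """
    rng = rng or random.Random()
    shape_a = alpha + noisy_counts[0]
    shape_b = beta + noisy_counts[1]
    return [rng.betavariate(shape_a, shape_b) for _ in range(num_samples)]
```

Because the noise scale does not grow with the number of data points, its effect on the posterior shrinks as the counts grow, which is the behaviour described in the paragraph above.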
Figure 1 illustrates the differences in behavior between the two privacy-preserving Bayesian inference algorithms for a beta distribution posterior with Bernoulli observations. The OPS estimator requires that the distribution be truncated. The truncation controls the exponential mechanism sensitivity, which determines the temperature $T$ of the distribution, i.e. the extent to which the distribution is flattened, for a given $\epsilon$. In contrast, the Laplace mechanism achieves privacy by adding noise to the sufficient statistics, which in this case are the pseudo-counts of successes and failures for the posterior distribution. In Figure 2 we illustrate the fidelity benefits of posterior sampling based on the Laplace mechanism instead of the exponential mechanism as the amount of data increases. In this case the exponential mechanism performs better than the Laplace mechanism only when the number of data points is very small, and is quickly overtaken by the Laplace mechanism sampling procedure. As $N$ increases the accuracy of sampling from the Laplace mechanism’s approximate posterior converges to the performance of samples from the true posterior at the current number of observations $N$, while the exponential mechanism behaves similarly to the posterior with fewer than $N$ observations. We show this formally in the next subsection.
3.3 Theoretical Results
First, we show that the Laplace mechanism approximation of exponential family posteriors approaches the true posterior distribution evaluated at $N$ data points. Proofs are given in the supplementary material.
For a minimal exponential family given a conjugate prior, where the posterior takes the form , where denotes this posterior with a natural parameter vector , if there exists a such that these assumptions are met:
The data comes i.i.d. from a minimal exponential family distribution with natural parameter
is in the interior of
The function has all derivatives for in the interior of
is finite for
The prior is integrable and has support on a neighborhood of
then for any mechanism generating a perturbed posterior against a noiseless posterior where comes from a distribution that does not depend on the number of data observations and has finite covariance, this limit holds:
Corollary 2. The Laplace mechanism on an exponential family satisfies the noise distribution requirements of Lemma 11 when the sensitivity of the sufficient statistics is finite and either the exponential family is minimal or its parameters are identifiable.
These assumptions correspond to the data coming from a distribution where the Laplace regularity assumptions hold and the posterior satisfies the asymptotic normality given by the Bernstein-von Mises theorem. For example, in the beta-Bernoulli setting, these assumptions hold as long as the success parameter $p$ is in the open interval $(0, 1)$. For $p = 0$ or $p = 1$, the relevant parameter is not in the interior of the natural parameter space, and the result does not apply. In the setting of learning a normal distribution’s mean $\mu$ where the variance $\sigma^2$ is known, the assumptions of Lemma 11 always hold, as the natural parameter space is an open set. However, Corollary 2 does not apply in this setting because the sensitivity is infinite (unless bounds are placed on the data). Our efficiency result, in Theorem 3.1, follows from Lemma 11 and the Bernstein-von Mises theorem.
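Theorem 3.1. Under the assumptions of Lemma 11, the Laplace mechanism has an asymptotic posterior of $\mathcal{N}(\theta_0, 2 \mathcal{I}^{-1} / N)$ from which drawing a single sample has an asymptotic relative efficiency of 2 in estimating the true parameter $\theta_0$, where $\mathcal{I}$ is the Fisher information at $\theta_0$.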
Above, the asymptotic posterior refers to the normal distribution, whose variance depends on $N$, that the posterior distribution approaches as $N$ increases. This ARE result should be contrasted to that of the exponential mechanism (Wang et al., 2015b).
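Theorem. The exponential mechanism applied to the exponential family with temperature parameter $T \ge 1$ has an asymptotic posterior of $\mathcal{N}(\theta_0, (1 + T) \mathcal{I}^{-1} / N)$ and a single sample has an asymptotic relative efficiency of $(1 + T)$ in estimating $\theta_0$, where $\mathcal{I}$ is the Fisher information at $\theta_0$.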
Here, the ARE represents the ratio between the variance of the estimator and the optimal variance achieved by the posterior mean in the limit. Sampling from the posterior itself has an ARE of 2, due to the stochasticity of sampling, which the Laplace mechanism approach matches. These theoretical results provide an explanation for the difference in the behavior of these two methods as $N$ increases seen in Figure 2. The Laplace mechanism will eventually approach the true posterior and the impact of privacy on accuracy will diminish when the data size increases. However, for the exponential mechanism with $T > 1$, the ratio of variances between the sampled posterior and the true posterior given $N$ data points approaches $(1 + T) / 2$, making the sampled posterior more spread out than the true posterior even as $N$ grows large.
So far we have compared the ARE values for sampling, as an apples-to-apples comparison. In reality, the Laplace mechanism has a further advantage as it releases a full posterior with privatized parameters, while the exponential mechanism can only release a finite number of samples with a finite $\epsilon$, which we discuss in Remark 1.
Remark 1. Under the assumptions of Lemma 11, by using the full privatized posterior instead of just a sample from it, the Laplace mechanism can release the privatized posterior’s mean, which has an asymptotic relative efficiency of 1 in estimating $\theta_0$.
4 Private Gibbs Sampling
We now shift our discussion to the case of approximate Bayesian inference. While the analysis of Dimitrakakis et al. (2014) and Wang et al. (2015b) shows that posterior sampling is differentially private under certain conditions, exact sampling is not in general tractable. It does not directly follow that approximate sampling algorithms such as MCMC are also differentially private, or private at the same privacy level. Wang et al. (2015b) give two results towards understanding the privacy properties of approximate sampling algorithms. First, they show that if the approximate sampler is “close” to the true distribution in a certain sense, then the privacy cost will be close to that of a true posterior sample:
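Proposition. If procedure $\mathcal{A}$ which produces samples from distribution $P_X$ is $\epsilon$-differentially private, then any approximate sampling procedure $\mathcal{A}'$ that produces a sample from $P'_X$ such that $\| P_X - P'_X \|_1 \le \delta$ for any $X$ is $(\epsilon, (1 + e^{\epsilon}) \delta)$-differentially private.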
Unfortunately, it is not in general feasible to verify the convergence of an MCMC algorithm, and so this criterion is not generally verifiable in practice. In their second result, Wang et al. study the privacy properties of stochastic gradient MCMC algorithms, including stochastic gradient Langevin dynamics (SGLD) (Welling and Teh, 2011) and its extensions. SGLD is a stochastic gradient method with noise injected in the gradient updates which converges in distribution to the target posterior.
In this section we study the privacy cost of MCMC, allowing us to quantify the privacy of many real-world MCMC-based Bayesian analyses. We focus on the case of Gibbs sampling, under exponential mechanism and Laplace mechanism approaches. By reinterpreting Gibbs sampling as an instance of the exponential mechanism, we obtain the “privacy for free” cost of Gibbs sampling. Metropolis-Hastings and annealed importance sampling also have privacy guarantees, which we show in the supplementary materials.
4.1 Exponential Mechanism
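We consider the privacy cost of a Gibbs sampler, where data $X$ are behind the privacy wall, current sampled values of parameters and latent variables $\theta = [\theta_1, \ldots, \theta_D]$ are publicly known, and a Gibbs update is a randomized algorithm which queries our private data in order to randomly select a new value $\theta'_l$ for the current variable $\theta_l$. The transition kernel for a Gibbs update of $\theta_l$ is
$$\mathcal{T}^{(\text{Gibbs}, l)}(\theta, \theta') = \Pr(\theta'_l | \theta_{\neg l}, X),$$
where $\theta_{\neg l}$ refers to all entries of $\theta$ except $l$, which are held fixed, i.e. $\theta'_{\neg l} = \theta_{\neg l}$. This update can be understood via the exponential mechanism:
$$\mathcal{T}^{(\text{Gibbs}, l, \epsilon)}(\theta, \theta') \propto \Pr(\theta'_l, \theta_{\neg l}, X)^{\frac{\epsilon}{2 \triangle \log \Pr(\theta'_l, \theta_{\neg l}, X)}},$$
with utility function $u(X, \theta'_l; \theta_{\neg l}) = \log \Pr(\theta'_l, \theta_{\neg l}, X)$, over the space of possible assignments to $\theta_l$, holding $\theta_{\neg l}$ fixed. A Gibbs update is therefore $\epsilon$-differentially private, with $\epsilon = 2 \triangle \log \Pr(\theta'_l, \theta_{\neg l}, X)$. This update corresponds to Equation 6 except that the set of responses for the exponential mechanism is restricted to those where $\theta'_{\neg l} = \theta_{\neg l}$. Note that
$$\triangle \log \Pr(\theta'_l, \theta_{\neg l}, X) \le \triangle \log \Pr(\theta, X),$$
as the worst case is computed over a strictly smaller set of outcomes. In many cases each parameter and latent variable $\theta_l$ is associated with only the $l$th data point $x_l$, in which case the privacy cost of a Gibbs scan can be improved over simple additive composition. In this case a random sequence scan Gibbs pass, which updates all $N$ $\theta_l$’s exactly once, is $2 \triangle \log \Pr(\theta'_l, \theta_{\neg l}, X)$-differentially private by parallel composition (Song et al., 2013). Alternatively, a random scan Gibbs sampler, which updates a random subset of the $N$ $\theta_l$’s, obtains a reduced privacy cost from the privacy amplification benefit of subsampling data (Li et al., 2012).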
4.2 Laplace Mechanism
Suppose that the conditional posterior distribution for a Gibbs update is in the exponential family. Having privatized the sufficient statistics arising from the data for the likelihoods involved in each update, via Equation 16, and publicly released them with privacy cost $\epsilon$, we may now perform the update by drawing a sample from the approximate conditional posterior, i.e. Equation 11 but with $S(X)$ replaced by $\hat{S}(X)$. Since the privatized statistics can be made public, we can also subsequently draw from an approximate posterior based on $\hat{S}(X)$ with any other prior (selected based on public information only), without paying any further privacy cost. This is especially valuable in a Gibbs sampling context, where the “prior” for a Gibbs update often consists of factors from other variables and parameters to be sampled, which are updated during the course of the algorithm.
In particular, consider a Bayesian model where a Gibbs sampler interacts with data only via conditional posteriors and their corresponding likelihoods that are exponential family distributions. We can privatize the sufficient statistics of the likelihood just once at the beginning of the MCMC algorithm via the Laplace mechanism with privacy cost $\epsilon$, and then approximately sample from the posterior by running the entire MCMC algorithm based on these privatized statistics without paying any further privacy cost. This is typically much cheaper in the privacy budget than exponential mechanism MCMC, which pays a privacy cost for every Gibbs update, as we shall see in our case study in Section 5. The MCMC algorithm does not need to converge to obtain privacy guarantees, unlike the OPS method. This approach applies to a very broad class of models, including Bayesian parameter learning for fully-observed MRF and Bayesian network models. Of course, for this technique to be useful in practice, the aggregate sufficient statistics for each Gibbs update must be large relative to the Laplace noise. For latent variable models, this typically corresponds to a setting with many data points per latent variable, such as the HMM with multiple emissions per timestep which we study in the next section.
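The difference in privacy accounting between the two Gibbs approaches can be sketched with simple additive composition. The helper names below are hypothetical, and the numbers in the usage note (200 iterations, two update blocks per iteration, five groups of statistics) are taken from the case study setup purely as an illustration.

```python
def exponential_gibbs_cost(eps_per_update, updates_per_iteration, num_iterations):
    """Worst-case total cost when every Gibbs update block is charged separately.

    Uses basic additive composition over all updates in all iterations.
    """
    if min(eps_per_update, updates_per_iteration, num_iterations) < 0:
        raise ValueError("arguments must be non-negative")
    return eps_per_update * updates_per_iteration * num_iterations


def laplace_gibbs_cost(eps_per_statistic_group, num_statistic_groups):
    """Total cost when sufficient statistics are privatized once, before sampling.

    The number of Gibbs iterations does not appear: running the sampler on the
    released statistics is post-processing and incurs no further cost.
    """
    if min(eps_per_statistic_group, num_statistic_groups) < 0:
        raise ValueError("arguments must be non-negative")
    return eps_per_statistic_group * num_statistic_groups
```

With a per-update cost of 1.0, `exponential_gibbs_cost(1.0, 2, 200)` gives 400.0, whereas `laplace_gibbs_cost(1.0, 5)` gives 5.0 however long the chain is run.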
5 Case Study: Wikileaks Iraq & Afghanistan War Logs
A primary goal of this work is to establish the practical feasibility of privacy-preserving Bayesian data analysis using complex models on real-world datasets. In this section we investigate the performance of the methods studied in this paper for the analysis of sensitive military data. In July and October 2010, the Wikileaks organization disclosed collections of internal U.S. military field reports from the wars in Afghanistan and Iraq, respectively. Both disclosures contained data from January 2004 to December 2009, with 75,000 entries from the war in Afghanistan, and 390,000 entries from Iraq. Hillary Clinton, at that time the U.S. Secretary of State, criticized the disclosure, stating that it “puts the lives of United States and its partners’ service members and civilians at risk” (Fallon, Amy (2010). “Iraq war logs: disclosure condemned by Hillary Clinton and Nato.” The Guardian. Retrieved on 2/22/2016). These risks, and the motivations for the leak, could potentially have been mitigated by releasing a differentially private analysis of the data, which protects the contents of each individual log entry while revealing high-level trends. Note that since the data are publicly available, although our models were differentially private, other aspects of this manuscript such as the evaluation may reveal certain information, as in other works such as Wang et al. (2015a, b).
The disclosed war logs each correspond to an individual event, and contain textual reports, as well as fields such as coarse-grained types (friendly action, explosive hazard, …), fine-grained categories (mine found/cleared, show of force, …), and casualty counts (wounded/killed/detained) for the different factions (Friendly, HostNation (i.e. Iraqi and Afghani forces), Civilian, and Enemy, where the names are relative to the U.S. military’s perspective). We use the techniques discussed in this paper to privately infer a hidden Markov model on the log entries. The HMM was fit to the non-textual fields listed above, with one timestep per month, and one HMM chain per region code. A naive Bayes conditional independence assumption was used in the emission probabilities for simplicity and parameter-count parsimony. Each field was modeled via a discrete distribution per latent state, with casualty counts binarized (zero versus non-zero), and with wounded/killed/detained and Friendly/HostNation features combined, respectively, via disjunction of the binary values. This decreased the number of features to privatize, while slightly increasing the size of the counts per field to protect and simplifying the model for visualization purposes. After preprocessing to remove empty timesteps and near-empty region codes (see the supplementary material), the median number of log entries per region/timestep pair was 972 for Iraq, and 58 for Afghanistan. The number of log entries per timestep was highly skewed for Afghanistan, due to an increase in density over time.
The models were trained via Gibbs sampling, with the transition probabilities collapsed out, following Goldwater and Griffiths (2007). We did not collapse out the naive Bayes parameters in order to keep the conditional likelihood in the exponential family. The details of the model and inference algorithm are given in the supplementary material. We trained the models for 200 Gibbs iterations, with the first 100 used for burn-in. Both privatization methods have the same overall computational complexity as the non-private sampler. The Laplace mechanism’s computational overhead is paid once up-front, and did not greatly affect the runtime, while OPS roughly doubled the runtime. For visualization purposes we recovered parameter estimates via the posterior mean based on the latent variable assignments of the final iteration, and we reported the most frequent latent variable assignments over the non-burn-in iterations. We trained a 2-state model on the Iraq data, and a 3-state model for the Afghanistan data, using the Laplace approach with the total privacy budget $\epsilon$ divided equally among the 5 features.
Interestingly, when given 10 states, the privacy-preserving model only assigned substantial numbers of data points to these 2-3 states, while a non-private HMM happily fit a 10-state model to the data. The Laplace noise therefore appears to play the role of a regularizer, consistent with the noise being interpreted as a “random prior,” and along the lines of noise-based regularization techniques such as those of Srivastava et al. (2014) and van der Maaten et al. (2013), although of course it may correspond to more regularization than we would typically like. This phenomenon potentially merits further study, beyond the scope of this paper.
We visualized the output of the Laplace HMM for Iraq in Figures 3–5. State 1 shows the U.S. military performing well, with the most frequent outcomes for each feature being friendly action, cache found/cleared, and enemy casualties, while the U.S. military performed poorly in State 2 (explosive hazard, IED explosion, civilian casualties). State 2 was prevalent in most regions until the situation improved to State 1 after the troop surge strategy of 2007. This transition typically occurred after troops peaked in Sept.–Nov. 2007. The results for Afghanistan, in the supplementary, provide a critical lens on the US military’s performance, with enemy casualty rates (including detainments) lower than friendly/host casualties for all latent states, and lower than civilian casualties in 2 of 3 states.
We also evaluated the methods at prediction. A uniform random 10% of the timestep/region pairs were held out for 10 train/test splits, and we reported average test likelihoods over the splits. We estimated test log-likelihood for each split by averaging the test likelihood over the burned-in samples (Laplace mechanism), or using the final sample (OPS). All methods were given 10 latent states, and $\epsilon$ was varied between 0.1 and 10. We also considered a naive Bayes model, equivalent to a 1-state HMM. The Laplace mechanism was superior to OPS for the naive Bayes model, for which the statistics are corpus-wide counts, corresponding to a high-data regime in which our asymptotic analysis was applicable. OPS was competitive with the Laplace mechanism for the HMM on Afghanistan, where the amount of data was relatively low. For the Iraq dataset, where there was more data per timestep, the Laplace mechanism outperformed OPS, particularly in the high-privacy regime. For OPS, privacy at $\epsilon$ is only guaranteed if MCMC has converged. Otherwise, from Section 4.1, the worst case is an impractical $\epsilon^{(\text{Gibbs})} \le 400 \epsilon$ (200 iterations of latent variable and parameter updates with worst-case cost $\epsilon$ each). OPS only releases one sample, which harmed the coherency of the visualization for Afghanistan, as latent states of the final sample were noisy relative to an estimate based on all 100 post burn-in samples (Figure 7). Privatizing the Gibbs chain at a privacy cost of $\epsilon^{(\text{Gibbs})}$ would avoid this.
6 Conclusion
This paper studied the practical limitations of using posterior sampling to obtain privacy “for free.” We explored an alternative based on the Laplace mechanism, and analyzed it both theoretically and empirically. We illustrated the benefits of the Laplace mechanism for privacy-preserving Bayesian inference to analyze sensitive war records. The study of privacy-preserving Bayesian inference is only just beginning. We envision extensions of these techniques to other approximate inference algorithms, as well as their practical application to sensitive real-world data sets. Finally, we have argued that asymptotic efficiency is important in a privacy context, leading to an open question: how large is the class of private methods that are asymptotically efficient?
Acknowledgments
The work of K. Chaudhuri and J. Geumlek was supported in part by NSF under IIS 1253942, and the work of M. Welling was supported in part by Qualcomm, Google and Facebook. We also thank Mijung Park, Eric Nalisnick, and Babak Shahbaba for helpful discussions.
References
- Blei et al. (2003) Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.
- Brown (1986) Brown, L. D. (1986). Fundamentals of statistical exponential families with applications in statistical decision theory. Lecture Notes-Monograph Series, 9:i–279.
- Chen (2007) Chen, X. (2007). A new generalization of Chebyshev inequality for random vectors. arXiv preprint arXiv:0707.0805.
- Daries et al. (2014) Daries, J. P., Reich, J., Waldo, J., Young, E. M., Whittinghill, J., Ho, A. D., Seaton, D. T., and Chuang, I. (2014). Privacy, anonymity, and big data in the social sciences. Communications of the ACM, 57(9):56–63.
- Dimitrakakis et al. (2014) Dimitrakakis, C., Nelson, B., Mitrokotsa, A., and Rubinstein, B. I. (2014). Robust and private Bayesian inference. In Algorithmic Learning Theory (ALT), pages 291–305. Springer.
- Dwork et al. (2006) Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, pages 265–284. Springer.
- Dwork and Roth (2013) Dwork, C. and Roth, A. (2013). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407.
- Dwork et al. (2010) Dwork, C., Rothblum, G. N., and Vadhan, S. (2010). Boosting and differential privacy. In The 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 51–60.
- Fang et al. (2000) Fang, K.-T., Geng, Z., and Tian, G.-L. (2000). Statistical inference for truncated Dirichlet distribution and its application in misclassification. Biometrical Journal, 42(8):1053–1068.
- Goldwater and Griffiths (2007) Goldwater, S. and Griffiths, T. (2007). A fully Bayesian approach to unsupervised part-of-speech tagging. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL), pages 744–751.
- Huang and Kannan (2012) Huang, Z. and Kannan, S. (2012). The exponential mechanism for social welfare: Private, truthful, and nearly optimal. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 140–149. IEEE.
- Husmeier et al. (2006) Husmeier, D., Dybowski, R., and Roberts, S. (2006). Probabilistic modeling in bioinformatics and medical informatics. Springer Science & Business Media.
- Kass et al. (1990) Kass, R., Tierney, L., and Kadane, J. (1990). The validity of posterior expansions based on Laplace’s method. Bayesian and likelihood methods in statistics and econometrics, 7:473.
- Li et al. (2012) Li, N., Qardaji, W., and Su, D. (2012). On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security, pages 32–33. ACM.
- McSherry and Talwar (2007) McSherry, F. and Talwar, K. (2007). Mechanism design via differential privacy. In Foundations of Computer Science (FOCS), 2007 IEEE 48th Annual Symposium on, pages 94–103. IEEE.
- Neal (2001) Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2):125–139.
- Piech et al. (2013) Piech, C., Huang, J., Chen, Z., Do, C., Ng, A., and Koller, D. (2013). Tuned models of peer assessment in MOOCs. In Proceedings of the 6th International Conference on Educational Data Mining, pages 153–160.
- Salakhutdinov and Mnih (2008) Salakhutdinov, R. and Mnih, A. (2008). Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 880–887.
- Song et al. (2013) Song, S., Chaudhuri, K., and Sarwate, A. D. (2013). Stochastic gradient descent with differentially private updates. In Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE, pages 245–248. IEEE.
- Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
- van der Maaten et al. (2013) van der Maaten, L., Chen, M., Tyree, S., and Weinberger, K. Q. (2013). Learning with marginalized corrupted features. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 410–418.
- Wang et al. (2015a) Wang, Y., Wang, Y.-X., and Singh, A. (2015a). Differentially private subspace clustering. In Advances in Neural Information Processing Systems (NIPS), pages 1000–1008.
- Wang et al. (2015b) Wang, Y.-X., Fienberg, S. E., and Smola, A. (2015b). Privacy for free: Posterior sampling and stochastic gradient Monte Carlo. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 2493–2502.
- Welling and Teh (2011) Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 681–688.
- Zhang et al. (2016) Zhang, Z., Rubinstein, B., and Dimitrakakis, C. (2016). On the differential privacy of Bayesian inference. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI).
Appendix A Adversarial Data Experiment
In this appendix we describe an additional simulation experiment which supplements the analysis performed in the main manuscript.
Wang et al. (2015b)’s analysis finds that the privacy cost of posterior sampling does not directly improve with the number of data points , unless the analyst deliberately modifies the posterior by changing the temperature before sampling. In Figure 8 we report an experiment showing that this result is not just a limitation of the analysis: there do exist cases where the dataset-specific privacy cost of posterior sampling can approach the exponential mechanism worst case of as the number of observations increases.
In the experiment, we consider a beta distribution posterior, symmetrically truncated at , with Bernoulli observations. We simulate an adversary who greedily selects data points to add to a dataset to increase the dataset-specific privacy cost of posterior sampling. The dataset-specific “local” privacy parameter is computed via a grid search over the Bernoulli success parameter and Bernoulli outcomes , , for the case where the adversary adds a success, or a failure, and the adversary selects the success/failure outcome with the highest local . The adversary is able to make the dataset-specific approach the worst case by manipulating the partition function of the posterior. The exponential mechanism’s worst case for posterior sampling, , corresponds to a sum of two cost terms. We must pay a cost of due to the difference of log-likelihood terms, as we can always draw the worst-case (e.g., when is on the truncation boundary), plus another in the worst case due to the difference of log partition function terms, which the adversary can alter up to the worst case, as they do in Figure 8. This is described formally in the supplementary material of Wang et al. (2015b).
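The greedy adversary described above can be sketched in a few lines. This is only an illustrative sketch, not the code behind Figure 8: the truncation point `a0 = 0.1`, the uniform Beta(1, 1) prior, the grid resolution, and the definition of neighbouring datasets (one observation flipped) are all assumptions made here for concreteness. Under these assumptions, the worst case is twice the largest log-likelihood difference on the truncated interval, one term for the log-likelihoods and one for the log partition functions.

```python
import math

def log_unnorm(theta, s, f, a=1.0, b=1.0):
    """Unnormalized log posterior: Beta(a, b) prior, s successes, f failures."""
    return (a + s - 1.0) * math.log(theta) + (b + f - 1.0) * math.log(1.0 - theta)

def log_partition(grid, s, f):
    """Log normalizer of the truncated posterior (trapezoid rule, log-sum-exp)."""
    vals = [log_unnorm(t, s, f) for t in grid]
    m = max(vals)
    weights = [0.5] + [1.0] * (len(grid) - 2) + [0.5]
    h = grid[1] - grid[0]
    return m + math.log(h * sum(w * math.exp(v - m) for w, v in zip(weights, vals)))

def local_epsilon(grid, s, f):
    """Largest |log p(theta | X) - log p(theta | X')| over the grid and over
    neighbours X' that flip one observation of X (assumed neighbour relation)."""
    neighbours = []
    if s > 0:
        neighbours.append((s - 1, f + 1))
    if f > 0:
        neighbours.append((s + 1, f - 1))
    log_z = log_partition(grid, s, f)
    eps = 0.0
    for s2, f2 in neighbours:
        log_z2 = log_partition(grid, s2, f2)
        for t in grid:
            gap = (log_unnorm(t, s, f) - log_z) - (log_unnorm(t, s2, f2) - log_z2)
            eps = max(eps, abs(gap))
    return eps

def greedy_adversary(n_points, a0=0.1, grid_size=201):
    """Add one point at a time, choosing the outcome with the larger local epsilon."""
    grid = [a0 + (1.0 - 2.0 * a0) * i / (grid_size - 1) for i in range(grid_size)]
    s = f = 0
    history = []
    for _ in range(n_points):
        eps_success = local_epsilon(grid, s + 1, f)
        eps_failure = local_epsilon(grid, s, f + 1)
        if eps_success >= eps_failure:
            s += 1
            history.append(eps_success)
        else:
            f += 1
            history.append(eps_failure)
    return history

def worst_case(a0=0.1):
    """Twice the largest log-likelihood difference on [a0, 1 - a0]."""
    return 2.0 * math.log((1.0 - a0) / a0)
```

Calling `greedy_adversary(20)` returns the sequence of local privacy parameters as the adversary grows the dataset, which can be compared against `worst_case()`.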
Appendix B Proofs of Theoretical Results
Here we provide proofs for the results presented in Section 3.3.
B.1 Proof of Laplace Mechanism Asymptotic KL-Divergence
Our results hold specifically over the class of exponential families. A family of distributions parameterized by which has the form
is said to be an exponential family. Breaking down this structure into its parts, is a vector known as the natural parameters for the distribution and lies in some space . represents a vector of sufficient statistics that fully capture the information needed to determine how likely is under this distribution.
represents the log-normalizer, a term used to make this a probability distribution sum to one over all possibilities of. is a base measure for this family, independent of which distribution in the family is used.
As we are interested in learning , we are considering algorithms that generate a posterior distribution for . The exponential families always have a conjugate prior family which is itself an exponential family. When speaking of these prior and posterior distributions,
becomes the random variable and we introduce a new vector of natural parameters in a space to parameterize these distributions. To ease notation, we will express this conjugate prior exponential family as , which is simply a relabelling of the exponential family structure. The posterior from this conjugate prior is often written in an equivalent form
where the vector and the scalar together specify the vector of natural parameters for this distribution. From the interaction of and on the posterior, one can see that this prior acts as if observations with average sufficient statistics had already been observed. This parameterization with and has many nice intuitive properties, but our proofs center around the natural parameter vector for this prior.
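The pseudo-observation reading of the conjugate prior can be made concrete with a Bernoulli likelihood in natural form. The sketch below is an illustration under assumed notation rather than the paper's own: `chi` stands for the average sufficient statistic of the pseudo-observations, `alpha` for their count, and `A` for the Bernoulli log-normalizer; the unnormalized conjugate density is taken to be proportional to exp(alpha * (chi * theta - A(theta))), which is the standard form.

```python
import math

def A(theta):
    """Bernoulli log-normalizer in natural form: log(1 + e^theta)."""
    return math.log1p(math.exp(theta))

def log_likelihood(theta, xs):
    """Log-likelihood of binary data (base measure is 1 for Bernoulli)."""
    return sum(x * theta - A(theta) for x in xs)

def log_conjugate_unnorm(theta, chi, alpha):
    """Unnormalized log density of the conjugate family (assumed standard form)."""
    return alpha * (chi * theta - A(theta))

def posterior_params(chi, alpha, xs):
    """Conjugate update: counts add, and chi becomes a weighted average."""
    alpha_post = alpha + len(xs)
    chi_post = (alpha * chi + sum(xs)) / alpha_post
    return chi_post, alpha_post
```

With `chi = 0.5`, `alpha = 2` and data `[1, 1, 0, 1]`, the update gives `alpha_post = 6` and `chi_post = 4 / 6`: the prior behaves like two earlier observations whose average sufficient statistic was 0.5.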
These two forms for the posterior can be reconciled by letting and . These definitions of the natural parameters and sufficient statistics fully specify the exponential family the posterior resides in, with defined as the appropriate log-normalizer for this distribution (and is merely a constant). We note that the space of is not the full space , as the last component of is a function of the previous components. Plugging in these expressions for and we get the following form for the conjugate prior:
We begin by defining minimal exponential families, a special class of exponential families with nice properties. To be minimal, the sufficient statistics must be linearly independent. We will later relax the requirement that we consider only minimal exponential families.
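Definition 1. An exponential family of distributions generating a random variable is said to be minimal if its sufficient statistics are linearly independent, that is, if no nonzero linear combination of the components of the sufficient statistics is constant over all values of the random variable.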
Next we present a few simple algebraic results of minimal exponential families.
Lemma 4. For two distributions from the same minimal exponential family,
where are the natural parameters of and , and is the log-normalizer for the exponential family.
Lemma 5. A minimal exponential family distribution satisfies these equalities:
Lemma 6. For a minimal exponential family distribution, its log-normalizer is a strictly convex function over the natural parameters. This implies a bijection between and .
These are standard results that follow from algebraic manipulations as seen in Brown (1986), and we omit the proofs of these lemmas. Lemma 6 immediately leads to a useful corollary about minimal families and their conjugate prior families.
Corollary 7. For a minimal exponential family distribution, the conjugate prior family given in equation (22) is also minimal.
forms the sufficient statistics for the conjugate prior. Since is strictly convex, there can be no linear relationship between the components of and . Definition 1 applies.
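The standard identities behind Lemmas 4 and 5 are easy to check numerically on a small example. The sketch below uses the Bernoulli family in natural form as an assumed test case, and uses the textbook statements of the identities: the KL divergence between two members equals the Bregman divergence of the log-normalizer, and the gradient and Hessian of the log-normalizer give the mean and variance of the sufficient statistic. The function names are introduced here for illustration only.

```python
import math

def A(theta):
    """Bernoulli log-normalizer in natural form."""
    return math.log1p(math.exp(theta))

def probs(theta):
    """Probabilities of x = 0 and x = 1, computed as exp(theta * x - A(theta))."""
    return [math.exp(theta * x - A(theta)) for x in (0, 1)]

def moments(theta):
    """Mean and variance of the sufficient statistic S(x) = x by enumeration."""
    p0, p1 = probs(theta)
    mean = 0.0 * p0 + 1.0 * p1
    var = (0.0 * p0 + 1.0 * p1) - mean ** 2
    return mean, var

def kl_direct(theta_p, theta_q):
    """KL(p || q) computed from the definition, summing over x in {0, 1}."""
    p, q = probs(theta_p), probs(theta_q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def kl_bregman(theta_p, theta_q):
    """KL(p || q) as a Bregman divergence of the log-normalizer A."""
    mean_p, _ = moments(theta_p)
    return A(theta_q) - A(theta_p) - (theta_q - theta_p) * mean_p

def grad_A_fd(theta, h=1e-5):
    """Central finite-difference estimate of the first derivative of A."""
    return (A(theta + h) - A(theta - h)) / (2.0 * h)

def hess_A_fd(theta, h=1e-4):
    """Central finite-difference estimate of the second derivative of A."""
    return (A(theta + h) - 2.0 * A(theta) + A(theta - h)) / (h * h)
```

For any pair of natural parameters the two KL computations agree to floating-point precision, and the finite-difference derivatives of `A` match the enumerated mean and variance.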
Our next result looks at sufficient conditions for getting a KL divergence of 0 in the limit when adding a finite perturbation vector to the natural parameters. The limit is taken over , which will later be tied to the amount of data used in forming the posterior. As we now discuss posterior distributions also forming exponential families, our natural parameters will now be denoted by and the random variables are now .
Lemma 8. Let denote the distribution from an exponential family with natural parameter , and let be a constant vector of the same dimensionality as , and let be a sequence of natural parameters. If for every on the line segment connecting and we have the spectral norm for some constant , then
Proof: This follows from noticing that equation (23) in Lemma 4 becomes the first-order Taylor approximation of centered at . From Taylor’s theorem, there exists between and such that is equal to the error of this approximation.
From rearranging equation (23),
Using this substitution in (24) gives
Solving for then gives the desired result:
This provides the heart of our results: If is small for all connecting and , then we can conclude that is small with respect to . We wish to show that for arising from observing data points we have approaching 0 as grows. To achieve this, we will analyze a relationship between the norm of the natural parameter and the covariance of the distribution it parameterizes. This relationship shows that posteriors with plenty of observed data have low covariance over , which permits us to use Lemma 8 to bound the KL divergence of our perturbed posteriors. Before we reach this relationship, first we prove that our posteriors have a well-defined mode, as our later relationship will require this mode to be well-behaved.
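As a numerical illustration of the claim that a fixed perturbation matters less as more data are observed, the sketch below perturbs a Beta distribution by a constant vector and watches the KL divergence shrink. It is a toy example under assumptions made here: the Beta(a, b) family has natural parameters (a - 1, b - 1) with respect to the sufficient statistics (log t, log(1 - t)), so adding a constant vector `v` to the natural parameters is the same as adding it to (a, b); the balanced counts `a = b = 1 + n / 2` and the perturbation `v = (1, 0)` are hypothetical choices, and the KL direction is original to perturbed.

```python
import math

def log_beta_pdf(t, a, b):
    """Log density of Beta(a, b) at t, using lgamma for the normalizer."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1.0) * math.log(t) + (b - 1.0) * math.log(1.0 - t) - log_norm

def kl_beta(a, b, a2, b2, n_grid=20000):
    """KL(Beta(a, b) || Beta(a2, b2)) by midpoint-rule integration on (0, 1)."""
    total = 0.0
    for i in range(n_grid):
        t = (i + 0.5) / n_grid
        lp = log_beta_pdf(t, a, b)
        lq = log_beta_pdf(t, a2, b2)
        total += math.exp(lp) * (lp - lq)
    return total / n_grid

def perturbed_kl(n, v=(1.0, 0.0)):
    """KL between a posterior built from n balanced observations and the same
    posterior with its natural parameters shifted by the constant vector v."""
    a = b = 1.0 + n / 2.0
    return kl_beta(a, b, a + v[0], b + v[1])
```

Evaluating `perturbed_kl` at `n = 10`, `100`, and `1000` gives a decreasing sequence, in line with the role that shrinking posterior covariance plays in Lemma 8.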
Lemma 9. Let be a likelihood function for and let there be a conjugate prior , where both distributions are minimal exponential families. Let be the space of natural parameters , and be the space of . Furthermore, assume is the parameterization arising from the natural conjugate prior, such that . If the following conditions hold:
is in the interior of
is real, continuous, and differentiable
exists, that is, the distribution is normalizable
then the mode of this distribution is a well-defined function of , and is in the interior of .
Using our structure for the conjugate prior from (22), we can expand the expression .
We note that the first term is linear in , and that by minimality and Lemma 6, is strictly convex. This implies is strictly concave over . Thus any interior local maximum must also be the unique global maximum.
The gradient of with respect to is simple to compute.
This expression can be set to zero, and solving for shows it must satisfy
We remark by Lemma 5 that is equal to , and so this is the that generates a distribution with mean .
By the strict concavity, this is sufficient to prove is a unique local maximizer and thus the global maximum.
To see that must be in the interior of , we use the fact that is continuously differentiable. This means is a continuous function of . Since is in the interior of , we can construct an open neighborhood around . The preimage of an open set under a continuous function is also an open set, so this implies an open neighborhood exists around .
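The stationarity condition in this proof can be checked on a one-dimensional example. The sketch below assumes a Bernoulli likelihood in natural form and the standard conjugate density proportional to exp(alpha * (chi * theta - A(theta))), with `chi` and `alpha` as illustrative names for the pseudo-observation mean and count. Setting the gradient to zero gives A'(theta) = chi, so the mode should be logit(chi) regardless of `alpha`; a ternary search over the strictly concave log density confirms this.

```python
import math

def A(theta):
    """Bernoulli log-normalizer in natural form."""
    return math.log1p(math.exp(theta))

def log_density_unnorm(theta, chi, alpha):
    """Unnormalized log density of the assumed conjugate form."""
    return alpha * (chi * theta - A(theta))

def mode_by_ternary_search(chi, alpha, lo=-20.0, hi=20.0, iters=200):
    """Maximize a strictly concave function on [lo, hi] by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_density_unnorm(m1, chi, alpha) < log_density_unnorm(m2, chi, alpha):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

def mode_closed_form(chi):
    """Solution of A'(theta) = chi for the Bernoulli family: the logit of chi."""
    return math.log(chi / (1.0 - chi))
```

For `chi = 0.8` the search returns a value close to `log(4)` for both `alpha = 5` and `alpha = 50`, illustrating that the mode is the natural parameter whose mean sufficient statistic equals `chi`.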
Now that we know is well defined for in the interior of , we can state the relationship between high-magnitude posterior parameters and the covariance of the distribution over that they generate.
Let be a likelihood function for and let there be a conjugate prior , where both distributions are minimal exponential families. Let be the space of natural parameters , and be the space of . Furthermore, assume is the parameterization arising from the natural conjugate prior, such that .
If such that the conditions of Lemma 9 hold for , and we have these additional assumptions,
the cone lies entirely in
is differentiable to all orders
s.t. all partial derivatives up to order 7 of have magnitude bounded by
such that we have
then there exists such that for the following bound holds :
This result follows from the Laplace approximation method for . The inner details of this approximation are shown in Lemma 14. Here we show that our setting satisfies all the regularity assumptions for this approximation. First we define functions and .
With these definitions, we may now check that the assumptions of Lemma 14 hold. We copy these assumptions below, with a substitution of for and for . The full details of Lemma 14 can be found at the end of Section B.1.
, a function of .
is in the interior of for all .
is continuously differentiable over the neighborhood .
has derivatives of all orders for , and all partial derivatives up to order 7 are bounded by some constant on this neighborhood.
such that we have .
exists for , the integral is finite.
We now show these conditions hold one-by-one. Let denote an arbitrary element of .
is a well-defined function (Lemma 9).
is in the interior of (Lemma 9).
follows the inverse of . This vector mapping has a Jacobian which assumption 4 guarantees has non-zero determinant on this neighborhood. This satisfies the Inverse Function Theorem to show is continuously differentiable.
has derivatives of all orders, and these are suitably bounded, as is composed of a linear term and the differentiable function , whose derivatives we have bounded.
Assumption 4 from this lemma translates directly.
which exists by virtue of being in the space of valid natural parameters.
This completes all the requirements of Lemma 14, which guarantees the existence of and such that for any and any , if we let denote , we have:
We conclude by noting that is the covariance of the posterior with parameterization