We start by providing background information on the definitions of algorithmic privacy that we use, as well as the general formulation of the variational inference algorithm.
Differential privacy (DP) is a formal definition of the privacy properties of data analysis algorithms . A randomized algorithm $\mathcal{M}$ is said to be $(\epsilon, \delta)$-differentially private if $\Pr(\mathcal{M}(\mathcal{D}) \in S) \leq \exp(\epsilon) \Pr(\mathcal{M}(\mathcal{D}') \in S) + \delta$ for all measurable subsets $S$ of the range of $\mathcal{M}$ and for all datasets $\mathcal{D}, \mathcal{D}'$ differing by a single entry. If $\delta = 0$, the algorithm is said to be $\epsilon$-differentially private. Intuitively, the definition states that the output probabilities must not change very much when a single individual's data is modified, thereby limiting the amount of information that the algorithm reveals about any one individual.
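As a minimal illustration of the definition (our own toy example, not the mechanism used later in this paper), the classic Laplace mechanism releases a counting query under $\epsilon$-DP:

```python
import numpy as np

def private_count(data, predicate, epsilon, rng):
    """Release a counting query under epsilon-DP via the Laplace mechanism.

    Adding or removing one record changes a count by at most 1, so the
    L1 sensitivity is 1 and Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for x in data if predicate(x))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(0)
noisy = private_count([3, 7, 2, 9, 4], lambda x: x > 3, epsilon=1.0, rng=rng)
```

The smaller $\epsilon$ is, the larger the noise scale, and the less the released value reveals about any single record.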
Concentrated differential privacy (CDP)  is a recently proposed relaxation of DP which aims to make privacy-preserving iterative algorithms more practical than under DP while still providing strong privacy guarantees. The CDP framework treats the privacy loss of an outcome $o$,
$Z = \log \frac{\Pr(\mathcal{M}(\mathcal{D}) = o)}{\Pr(\mathcal{M}(\mathcal{D}') = o)},$
as a random variable. An algorithm is $(\mu, \tau)$-CDP if this privacy loss has mean $\mu$, and after subtracting $\mu$ the resulting random variable $Z - \mu$ is subgaussian with standard deviation $\tau$, i.e., $\mathbb{E}[\exp(\lambda (Z - \mu))] \leq \exp(\lambda^2 \tau^2 / 2)$ for all $\lambda \in \mathbb{R}$. While $\epsilon$-DP guarantees bounded privacy loss, and $(\epsilon, \delta)$-DP ensures bounded privacy loss with probability $1 - \delta$, $(\mu, \tau)$-CDP requires the privacy loss to be near $\mu$ with high probability.
The general VI algorithm.
Consider a generative model that produces a dataset $\mathcal{D} = \{x_n\}_{n=1}^N$ consisting of $N$ independent identically distributed items, where $x_n$ is the $n$th observation, generated using a set of latent variables $l = \{l_n\}_{n=1}^N$.
The generative model provides $p(x_n | l_n, m)$, where $m$ is the model parameters.
We also consider the prior distribution over the model parameters $p(m)$ and the prior distribution over the latent variables $p(l)$.
Here, we focus on conjugate-exponential (CE) models, in which the variational updates are tractable. A large class of models falls in the CE family, including linear dynamical systems and switching models; Gaussian mixtures; factor analysis and probabilistic PCA; hidden Markov models and factorial HMMs; discrete-variable belief networks; and latent Dirichlet allocation (LDA), which we will use in Sec 3. The CE family models satisfy two conditions: (1) the complete-data likelihood is in the exponential family: $p(x_n, l_n | m) = g(m) f(x_n, l_n) \exp(n(m)^\top s(x_n, l_n))$; and (2) the prior over $m$ is conjugate to the complete-data likelihood: $p(m | \tau, \nu) = h(\tau, \nu) g(m)^\tau \exp(n(m)^\top \nu)$, where the natural parameters and sufficient statistics of the complete-data likelihood are denoted by $n(m)$ and $s(x_n, l_n)$, respectively. The hyperparameters are denoted by $\tau$ (a scalar) and $\nu$ (a vector).
Variational inference for a CE family model iterates the following two steps in order to optimise the lower bound to the log marginal likelihood: (1) given the expected natural parameters $\bar{n}$, compute the approximate posterior over the latent variables, $q(l_n) \propto f(x_n, l_n) \exp(\bar{n}^\top s(x_n, l_n))$, and the expected sufficient statistics $\bar{s} = \sum_{n=1}^N \langle s(x_n, l_n) \rangle_{q(l_n)}$; and (2) given $\bar{s}$, compute the approximate posterior over the parameters, $q(m) \propto g(m)^{\tau + N} \exp(n(m)^\top (\nu + \bar{s}))$, and update the expected natural parameters $\bar{n} = \langle n(m) \rangle_{q(m)}$.
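The two-step iteration can be sketched generically as follows. The model-specific expectations are placeholder callables (the names are ours, not from the paper), and the toy usage degenerates to a Beta-Bernoulli model in which the latent variables are trivial:

```python
def variational_inference(data, expected_suff_stats, update_posterior,
                          expected_nat_params, init_nat_params, n_iters=10):
    """Generic two-step VI loop for a conjugate-exponential model.

    Step 1: given the expected natural parameters, form q(l_n) and return
            the expected sufficient statistics s_bar.
    Step 2: given s_bar, update q(m) and recompute <n(m)> under q(m).
    """
    nat_bar = init_nat_params
    q_m = None
    for _ in range(n_iters):
        s_bar = expected_suff_stats(data, nat_bar)   # step 1
        q_m = update_posterior(s_bar)                # step 2
        nat_bar = expected_nat_params(q_m)
    return q_m, nat_bar

# Toy usage: Beta(1, 1)-Bernoulli. The latent variables are trivial, so
# step 1 reduces to summing the data, and q(m) is Beta(1 + s, 1 + N - s).
data = [1, 0, 1, 1]
q_m, _ = variational_inference(
    data,
    expected_suff_stats=lambda d, nat: float(sum(d)),
    update_posterior=lambda s: (1.0 + s, 1.0 + len(data) - s),
    expected_nat_params=lambda q: q[0] / (q[0] + q[1]),
    init_nat_params=0.5,
)
```

Note that, as in the paper, the data enters the loop only through the expected sufficient statistics computed in step 1.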
2 Privacy Preserving VI algorithm for CE family
The only place where the algorithm looks at the data is when computing the expected sufficient statistics in the first step. The expected sufficient statistics then dictate the expected natural parameters in the second step. So, perturbing the sufficient statistics perturbs both posterior distributions $q(l)$ and $q(m)$. Perturbing sufficient statistics in exponential families is also used in . Existing work focuses on privatising posterior distributions in the context of posterior sampling [5, 6, 7, 8], while our work focuses on privatising approximate posterior distributions for optimisation-based approximate Bayesian inference.
Suppose there are two neighbouring datasets $\mathcal{D}$ and $\mathcal{D}'$, differing by only one datapoint. We also assume that the dataset is pre-processed such that the norm of any datapoint is less than 1. The maximum difference in the expected sufficient statistics given the two datasets, i.e., the L1 sensitivity of the expected sufficient statistics, is given by (assuming $\bar{s}$ is a vector of length $J$) $\Delta \bar{s} = \max_{|\mathcal{D} \setminus \mathcal{D}'| = 1} \sum_{j=1}^J |\bar{s}_j(\mathcal{D}) - \bar{s}_j(\mathcal{D}')|$. Under some models, like LDA below, the expected sufficient statistics have a limited sensitivity, in which case we add noise to each coordinate of the expected sufficient statistics to compensate for the maximum change.
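A sketch of this perturbation for a generic model, assuming first-moment sufficient statistics and the standard Gaussian-mechanism noise calibration (the function and variable names are ours):

```python
import numpy as np

def perturb_suff_stats(s_bar, sensitivity, epsilon, delta, rng):
    """Add Gaussian noise calibrated to the sensitivity of s_bar.

    Uses the standard Gaussian-mechanism calibration
    sigma = sensitivity * sqrt(2 * log(1.25 / delta)) / epsilon.
    """
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return s_bar + rng.normal(0.0, sigma, size=s_bar.shape)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# Pre-process so every datapoint has norm at most 1, which bounds the
# change a single datapoint can induce in the summed statistics.
X = X / np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
s_bar = X.sum(axis=0)
s_tilde = perturb_suff_stats(s_bar, sensitivity=2.0, epsilon=0.5,
                             delta=1e-4, rng=rng)
```

Downstream updates then use the noisy $\tilde{s}$ in place of $\bar{s}$, so both approximate posteriors inherit the privacy guarantee.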
3 Privacy preserving Latent Dirichlet Allocation (LDA)
The most successful topic modeling approach is based on LDA, where the generative process is as follows.
Draw topics $\beta_k \sim$ Dirichlet$(\eta)$, for $k \in \{1, \dots, K\}$, where $\eta$ is a scalar hyperparameter.
For each document $d \in \{1, \dots, D\}$:
  Draw topic proportions $\theta_d \sim$ Dirichlet$(\alpha)$, where $\alpha$ is a scalar hyperparameter.
  For each word $n \in \{1, \dots, N_d\}$:
    Draw topic assignment $z_{dn} \sim$ Discrete$(\theta_d)$.
    Draw word $w_{dn} \sim$ Discrete$(\beta_{z_{dn}})$.
where each observed word is represented by an indicator vector $w_{dn}$ ($n$th word in the $d$th document) of length $V$, where $V$ is the number of terms in a fixed vocabulary set. The topic assignment latent variable $z_{dn}$ is also an indicator vector of length $K$, where $K$ is the number of topics.
LDA falls into the CE family, where we think of $z$ and $\theta$ as the two types of latent variables, $l_d = \{z_d, \theta_d\}$, and $\beta$ as the model parameters: (1) the complete-data likelihood per document is in the exponential family: $p(w_d, z_d | \beta) \propto \exp\big(\sum_{k=1}^K \sum_{v=1}^V \big[\sum_{n} z_{dn}^k w_{dn}^v\big] \log \beta_{kv}\big)$, where the sufficient statistics are $\sum_{n} z_{dn}^k w_{dn}^v$ and the natural parameters are $\log \beta_{kv}$; (2) the prior over $\beta$ is conjugate: $\beta_k \sim$ Dirichlet$(\eta)$ for $k \in \{1, \dots, K\}$. For simplicity, we assume the hyperparameters $\alpha$ and $\eta$ are set manually.
In VI, we assume the posteriors are: (1) Discrete $q(z_{dn} | \phi_{dn})$ for $z_{dn}$, with variational parameters $\phi_{dnk}$ that capture the posterior probability of topic assignment; (2) Dirichlet $q(\theta_d | \gamma_d)$ for $\theta_d$; and (3) Dirichlet $q(\beta_k | \lambda_k)$ for $\beta_k$. In this case, the expected sufficient statistics are $\bar{s}_{kv} = \sum_{d} \sum_{n} \phi_{dnk} w_{dn}^v$.
To privatise the variational inference for LDA, we perturb the expected sufficient statistics. While each document has a different document length, we limit the maximum length of any document to $N$ by randomly selecting $N$ words from a document if the number of words in the document is larger than $N$.
We add Gaussian noise to each coordinate, then map it to 0 if the perturbed coordinate becomes negative:
$\tilde{s}_{kv} = \max\big(0, \ \bar{s}_{kv} + Y_{kv}\big), \quad Y_{kv} \sim \mathcal{N}(0, \sigma^2),$
where $\bar{s}_{kv}$ is the $(k,v)$th coordinate of the $K \times V$ expected sufficient statistics, and $\Delta \bar{s}$ is the sensitivity, given by $\Delta \bar{s} \leq 2N$,
since $\sum_{v=1}^V w_{dn}^v = 1$, $\sum_{k=1}^K \phi_{dnk} = 1$, each document contributes at most $N$ words, and neighbouring datasets differ in at most one document.
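A sketch of the LDA-specific perturbation, assuming the truncated statistics have L1 sensitivity $2N$ (twice the maximum document length) and the standard Gaussian-mechanism noise scale (function names are ours):

```python
import numpy as np

def truncate_document(words, n_max, rng):
    """Randomly keep n_max words from a document that is too long."""
    words = list(words)
    if len(words) <= n_max:
        return words
    return list(rng.choice(words, size=n_max, replace=False))

def privatize_lda_stats(s_bar, n_max, epsilon_iter, delta_iter, rng):
    """Perturb the K x V expected sufficient statistics for LDA.

    With documents truncated to at most n_max words, one document changes
    the statistics by at most 2 * n_max in L1 norm; negative coordinates
    of the perturbed statistics are clamped to zero.
    """
    sensitivity = 2.0 * n_max
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta_iter)) / epsilon_iter
    noisy = s_bar + rng.normal(0.0, sigma, size=s_bar.shape)
    return np.maximum(noisy, 0.0)

rng = np.random.default_rng(0)
doc = truncate_document(range(30), n_max=10, rng=rng)
s_tilde = privatize_lda_stats(np.ones((4, 6)), n_max=10,
                              epsilon_iter=0.05, delta_iter=1e-6, rng=rng)
```

Clamping negative coordinates keeps the perturbed statistics valid as Dirichlet pseudo-counts and does not weaken the privacy guarantee, since it is post-processing of already-private output.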
Private stochastic variational learning
In a large-scale data setting, it is impossible to handle the entire dataset at once. In such cases, stochastic learning using noisy gradients computed on mini-batches of data, a.k.a. stochastic gradient descent (SGD), provides a scalable inference method. While there are a few prior works on differentially private SGD (e.g., [10, 11]), privacy amplification due to subsampling combined with CDP composition (described below) has not been used in the context of variational inference or topic modeling before.
The privacy amplification theorem states the following.
(Theorem 1 in ) Any $(\epsilon, \delta)$-DP mechanism running on a uniformly sampled subset of data with a sampling ratio $\nu$ guarantees $(\epsilon', \delta')$-differential privacy, where $\epsilon' = \log(1 + \nu(e^\epsilon - 1))$ and $\delta' = \nu \delta$.
The privacy gain from subsampling allows us to use a much more relaxed per-iteration privacy budget $\epsilon$ and error tolerance $\delta$ to achieve a reasonable total level of $(\epsilon_{tot}, \delta_{tot})$-DP with a small sampling rate.
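The amplified budget can be computed directly from the theorem (a small helper of ours):

```python
import math

def amplified_epsilon(epsilon, nu):
    """Privacy after running an epsilon-DP mechanism on a uniformly
    sampled nu-fraction of the data: log(1 + nu * (exp(epsilon) - 1))."""
    return math.log1p(nu * math.expm1(epsilon))

# With a 1% sampling rate, a per-iteration budget of 1.0 amplifies to
# roughly 0.017, i.e. close to nu * epsilon for small epsilon.
```

For small $\epsilon$ the amplified budget is approximately $\nu \epsilon$, which is why a small sampling rate permits a much larger per-iteration budget.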
Furthermore, the zCDP composition allows a sharper analysis of the per-iteration privacy budget. We first convert DP to zCDP, then use the zCDP composition, and finally convert zCDP back to DP (for comparison purposes), for which we use the following lemmas and proposition.
(Proposition 1.4 in ) If a mechanism satisfies $\epsilon$-DP, then it satisfies $(\tfrac{1}{2}\epsilon^2)$-zCDP.
(Lemma 1.7 in ) If two mechanisms satisfy $\rho_1$-zCDP and $\rho_2$-zCDP, respectively, then their composition satisfies $(\rho_1 + \rho_2)$-zCDP.
(Proposition 1.3 in ) If a mechanism provides $\rho$-zCDP, then it is $(\rho + 2\sqrt{\rho \log(1/\delta)}, \delta)$-DP for any $\delta > 0$.
So, using the conversion and composition results above, we obtain $(\tfrac{1}{2} J \epsilon_{iter}^2)$-zCDP after $J$-fold composition of the Gaussian mechanism with per-iteration budget $\epsilon_{iter}$. Using the last proposition, we convert $(\tfrac{1}{2} J \epsilon_{iter}^2)$-zCDP to $(\epsilon_{tot}, \delta)$-DP, where $\epsilon_{tot} = \tfrac{1}{2} J \epsilon_{iter}^2 + 2\sqrt{\tfrac{1}{2} J \epsilon_{iter}^2 \log(1/\delta)}$.
These seemingly complicated steps can be summarised into two simple steps. First, given a total privacy budget $\epsilon_{tot}$ and total tolerance level $\delta_{tot}$, our algorithm calculates an intermediate per-iteration budget using the zCDP composition, which maps $(\epsilon_{tot}, \delta_{tot})$ to $(\epsilon', \delta')$: the total zCDP budget is $\rho = \big(\sqrt{\log(1/\delta_{tot}) + \epsilon_{tot}} - \sqrt{\log(1/\delta_{tot})}\big)^2$, and each of the $J$ iterations receives $\epsilon' = \sqrt{2\rho / J}$.
Second, our algorithm calculates the per-iteration privacy budget using the privacy amplification theorem, which maps $(\epsilon', \delta')$ to $(\epsilon_{iter}, \delta_{iter})$: $\epsilon_{iter} = \log\big(1 + (e^{\epsilon'} - 1)/\nu\big)$ and $\delta_{iter} = \delta'/\nu$.
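This two-step mapping can be sketched as follows. It assumes the standard zCDP conversions for the Gaussian mechanism and the subsampling amplification bound; the helper names are ours:

```python
import math

def total_zcdp(eps_tot, delta_tot):
    """Invert eps = rho + 2*sqrt(rho * log(1/delta)) for the zCDP budget rho."""
    ln1d = math.log(1.0 / delta_tot)
    return (math.sqrt(ln1d + eps_tot) - math.sqrt(ln1d)) ** 2

def per_iteration_budget(eps_tot, delta_tot, n_iter, nu):
    """Map a total (eps, delta)-DP budget to a per-iteration budget.

    Step 1 (zCDP composition): each of n_iter iterations may spend
    rho / n_iter zCDP, i.e. an un-amplified DP budget sqrt(2*rho/n_iter).
    Step 2 (amplification): invert eps' = log(1 + nu*(e^eps - 1)) so the
    mechanism run on a nu-subsample may spend the larger eps_iter.
    """
    rho_iter = total_zcdp(eps_tot, delta_tot) / n_iter
    eps_prime = math.sqrt(2.0 * rho_iter)
    return math.log1p(math.expm1(eps_prime) / nu)

eps_iter = per_iteration_budget(1.0, 1e-4, n_iter=100, nu=0.01)
```

The inversion in `total_zcdp` follows from solving the quadratic $\rho + 2\sqrt{\rho \log(1/\delta)} = \epsilon$ in $\sqrt{\rho}$.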
Algorithm 1 summarizes our private topic modeling algorithm.
4 Experiments using Wikipedia data
We randomly downloaded documents from Wikipedia. We then tested our VIPS algorithm on the Wikipedia dataset with four different values of total privacy budget, using a mini-batch size , until the algorithm sees up to documents. We assumed there are topics, and we used a vocabulary set of approximately terms.
We compare our method to two baseline methods. First, in linear (Lin) composition (Theorem 3.16 of ), privacy degrades linearly with the number of iterations $J$. This result follows from the Max Divergence of the privacy loss random variable being bounded by a total budget, and yields $(J\epsilon, J\delta)$-DP. We use eq (6) to map $(\epsilon_{tot}, \delta_{tot})$ to $(\epsilon_{iter}, \delta_{iter})$. Second, advanced (Adv) composition (Theorem 3.20 of ), which follows from the Max Divergence of the privacy loss random variable being bounded by a total budget including a slack variable $\delta'$, yields $(\epsilon\sqrt{2J\log(1/\delta')} + J\epsilon(e^\epsilon - 1), \ J\delta + \delta')$-DP. Similarly, we use eq (6) to map $(\epsilon_{tot}, \delta_{tot})$ to $(\epsilon_{iter}, \delta_{iter})$.
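To make the comparison concrete, the following sketch (our own helper functions, using the standard composition theorems named above) computes the total epsilon after $J$ iterations under each analysis:

```python
import math

def linear_total(eps, delta, J):
    """Linear composition: J steps of (eps, delta)-DP give (J*eps, J*delta)-DP."""
    return J * eps, J * delta

def advanced_total(eps, delta, J, delta_slack):
    """Advanced composition with slack delta_slack (Dwork & Roth, Thm 3.20)."""
    eps_tot = math.sqrt(2.0 * J * math.log(1.0 / delta_slack)) * eps \
        + J * eps * math.expm1(eps)
    return eps_tot, J * delta + delta_slack

def zcdp_total(eps, J, delta_out):
    """zCDP composition: each eps-DP step is (eps^2/2)-zCDP, so J steps give
    rho = J*eps^2/2, i.e. (rho + 2*sqrt(rho*log(1/delta_out)), delta_out)-DP."""
    rho = J * eps * eps / 2.0
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta_out))

# Example: J = 1000 iterations at eps = 0.01 per iteration.
lin, _ = linear_total(0.01, 1e-8, 1000)
adv, _ = advanced_total(0.01, 1e-8, 1000, delta_slack=1e-4)
z = zcdp_total(0.01, 1000, delta_out=1e-4)
```

For these illustrative parameters the zCDP total is the smallest and the linear total the largest, matching the ordering of the three analyses in our experiments.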
As an evaluation metric, we compute the upper bound to the perplexity on held-out documents (we used the metric written in the Python implementation by the authors of ),
$\text{perplexity}(\mathcal{D}^{test}) \leq \exp\big\{ - \big(\textstyle\sum_d \mathcal{L}(n_d)\big) \big/ \sum_{d,v} n_{dv} \big\},$
where $n_d$ is a vector of word counts for the $d$th document and $\mathcal{L}(n_d)$ is the per-document evidence lower bound. In the above, we use the variational parameters $\lambda$ that were calculated during training. We compute the posteriors over $z$ and $\theta$ by performing the first step in our algorithm using the test data and the perturbed sufficient statistics we obtained during training. The per-word perplexity is shown in Fig. 1. Due to privacy amplification, a smaller mini-batch size incurs less additive noise for the same privacy level, and the zCDP composition results in a better accuracy than the advanced composition.
In Table 1, we show the top words in terms of assigned probabilities under a chosen topic in each method. We show four topics as examples. Non-private LDA results in the most coherent words among all the methods. For the private LDA models with a total privacy budget fixed to $\epsilon_{tot} = 0.5$, as we move from zCDP to advanced to linear composition, the amount of noise added gets larger, and therefore more topics have less coherent words.
| Non-private | zCDP (eps=0.5) | Adv (eps=0.5) | Lin (eps=0.5) |
| --- | --- | --- | --- |
| topic 3: | topic 81: | topic 72: | topic 27: |
| topic 4: | topic 82: | topic 73: | topic 28: |
| topic 5: | topic 83: | topic 74: | topic 29: |
| topic 6: | topic 84: | topic 75: | topic 30: |
Using the same data, we then tested how the perplexity changes as we vary the mini-batch size. As shown in Fig. 1, smaller mini-batches benefit more from privacy amplification, and the zCDP composition again outperforms the advanced composition.
We have developed a practical privacy-preserving topic modeling algorithm which outputs accurate and privatized expected sufficient statistics and expected natural parameters. Our approach uses the zCDP composition analysis combined with the privacy amplification effect due to subsampling of data, which significantly decreases the amount of additive noise for the same expected privacy guarantee compared to the standard analysis.
-  Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9:211–407, August 2014.
-  C. Dwork and G. N. Rothblum. Concentrated Differential Privacy. ArXiv e-prints, March 2016.
-  M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Unit, University College London, 2003.
-  Mijung Park, Jimmy Foulds, Kamalika Chaudhuri, and Max Welling. Practical privacy for expectation maximization. CoRR, abs/1605.06995, 2016.
-  Christos Dimitrakakis, Blaine Nelson, Aikaterini Mitrokotsa, and Benjamin IP Rubinstein. Robust and private Bayesian inference. In Algorithmic Learning Theory (ALT), pages 291–305. Springer, 2014.
-  Zuhe Zhang, Benjamin Rubinstein, and Christos Dimitrakakis. On the differential privacy of Bayesian inference. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI), 2016.
-  James R. Foulds, Joseph Geumlek, Max Welling, and Kamalika Chaudhuri. On the theory and practice of privacy-preserving bayesian data analysis. CoRR, abs/1603.07294, 2016.
-  Gilles Barthe, Gian Pietro Farina, Marco Gaboardi, Emilio Jesús Gallego Arias, Andy Gordon, Justin Hsu, and Pierre-Yves Strub. Differentially private Bayesian programming. CoRR, abs/1605.00283, 2016.
-  Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. J. Mach. Learn. Res., 14(1):1303–1347, May 2013.
-  Xi Wu, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey F. Naughton. Differentially private stochastic gradient descent for in-rdbms analytics. CoRR, abs/1606.04722, 2016.
-  Y.-X. Wang, S. E. Fienberg, and A. Smola. Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo. ArXiv e-prints, February 2015.
-  Ninghui Li, Wahbeh Qardaji, and Dong Su. On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security, ASIACCS ’12, pages 32–33, New York, NY, USA, 2012. ACM.
-  Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. CoRR, abs/1605.02065, 2016.
-  Matthew Hoffman, Francis R. Bach, and David M. Blei. Online learning for latent dirichlet allocation. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 856–864. Curran Associates, Inc., 2010.