1 Background
We start by providing background information on the definitions of algorithmic privacy that we use, as well as the general formulation of the variational inference algorithm.
Differential privacy
Differential privacy (DP) is a formal definition of the privacy properties of data analysis algorithms [1]. A randomized algorithm M is said to be (ε, δ)-differentially private if Pr[M(D) ∈ S] ≤ exp(ε) Pr[M(D′) ∈ S] + δ for all measurable subsets S of the range of M and for all datasets D, D′ differing by a single entry. If δ = 0, the algorithm is said to be ε-differentially private. Intuitively, the definition states that the output probabilities must not change very much when a single individual's data is modified, thereby limiting the amount of information that the algorithm reveals about any one individual.
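For intuition (an illustrative aside, not part of the paper's algorithm), the Laplace mechanism is the textbook way to release a bounded-sensitivity statistic with ε-DP; a minimal sketch:

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release a scalar statistic with epsilon-DP by adding Laplace noise
    with scale sensitivity/epsilon."""
    scale = sensitivity / epsilon
    return value + rng.laplace(0.0, scale)

rng = np.random.default_rng(0)
# Counting query: one person changes the count by at most 1, so sensitivity = 1.
true_count = 120
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
```

Smaller ε means a larger noise scale, i.e. a stronger privacy guarantee at the cost of accuracy.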
Concentrated differential privacy (CDP) [2] is a recently proposed relaxation of DP which aims to make privacy-preserving iterative algorithms more practical than under DP while still providing strong privacy guarantees. The CDP framework treats the privacy loss of an outcome o,
L(o) = log( Pr[M(D) = o] / Pr[M(D′) = o] ),
as a random variable. An algorithm is (μ, τ)-CDP if this privacy loss has mean μ, and if, after subtracting μ, the resulting random variable L(o) − μ is subgaussian with standard deviation τ, i.e. E[exp(λ(L(o) − μ))] ≤ exp(λ²τ²/2) for all λ. While ε-DP guarantees bounded privacy loss, and (ε, δ)-DP ensures bounded privacy loss with probability 1 − δ, (μ, τ)-CDP requires the privacy loss to be near μ with high probability.
The general VI algorithm
Consider a generative model that produces a dataset D = {x_n}_{n=1}^N consisting of N independent identically distributed items, where x_n is the nth observation, generated using a set of latent variables l = {l_n}_{n=1}^N. The generative model provides p(x_n, l_n | m), where m denotes the model parameters. We also consider the prior distribution over the model parameters, p(m), and the prior distribution over the latent variables, p(l). Here, we focus on conjugate-exponential (CE) models^1, in which the variational updates are tractable. CE family models satisfy two conditions [3]: (1) the complete-data likelihood is in the exponential family, p(x_n, l_n | m) = g(m) f(x_n, l_n) exp(φ(m)^T s(x_n, l_n)); and (2) the prior over m is conjugate to the complete-data likelihood, p(m | τ, ν) = h(τ, ν) g(m)^τ exp(φ(m)^T ν), where the natural parameters and sufficient statistics of the complete-data likelihood are denoted by φ(m) and s(x_n, l_n), respectively. The hyperparameters are τ (a scalar) and ν (a vector).
^1 A large class of models falls in the CE family, including linear dynamical systems and switching models; Gaussian mixtures; factor analysis and probabilistic PCA; hidden Markov models and factorial HMMs; discrete-variable belief networks; and latent Dirichlet allocation (LDA), which we will use in Sec 3.
Variational inference for a CE family model iterates the following two steps in order to optimise the lower bound on the log marginal likelihood:
(1) q(l_n) ∝ f(x_n, l_n) exp( φ̄^T s(x_n, l_n) ), where φ̄ = E_{q(m)}[φ(m)],
(2) q(m) ∝ g(m)^{τ + N} exp( φ(m)^T ( ν + Σ_{n=1}^N s̄(x_n) ) ), where s̄(x_n) = E_{q(l_n)}[s(x_n, l_n)].
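To make the two-step iteration concrete, here is a minimal sketch of coordinate-ascent VI for a toy CE-family model: a two-component Gaussian mixture with known, unit-variance components and a Beta prior on the mixing weight. The specific model and all names are illustrative choices of this sketch, not the paper's setup; the hand-rolled digamma avoids a SciPy dependency.

```python
import numpy as np

def digamma(x):
    # Recurrence to push x above 6, then the standard asymptotic series.
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + np.log(x) - 0.5 / x - f * (1.0/12 - f * (1.0/120 - f / 252))

def vb_mixture(x, mu0=-2.0, mu1=2.0, a0=1.0, b0=1.0, n_iters=50):
    """Coordinate-ascent VI for a two-component Gaussian mixture with known
    component means, unit variances, and a Beta(a0, b0) prior on the mixing
    weight pi (a toy conjugate-exponential model)."""
    a, b = a0, b0
    for _ in range(n_iters):
        # Step (1): update q(l_n) using the *expected* natural parameters of q(pi).
        elog_pi = digamma(a) - digamma(a + b)
        elog_1mpi = digamma(b) - digamma(a + b)
        log_r1 = elog_pi - 0.5 * (x - mu1) ** 2
        log_r0 = elog_1mpi - 0.5 * (x - mu0) ** 2
        r = 1.0 / (1.0 + np.exp(log_r0 - log_r1))  # responsibilities q(l_n = 1)
        # Step (2): update q(pi) using the *expected* sufficient statistics.
        a = a0 + r.sum()
        b = b0 + (1.0 - r).sum()
    return a, b  # q(pi) = Beta(a, b)

rng = np.random.default_rng(0)
n = 2000
z = rng.random(n) < 0.7
data = np.where(z, rng.normal(2.0, 1.0, n), rng.normal(-2.0, 1.0, n))
a, b = vb_mixture(data)
# The posterior mean a/(a+b) should land near the true mixing weight 0.7.
```

Note that the data enters the parameter update only through the expected sufficient statistics `r.sum()`, which is exactly the quantity the paper later perturbs.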
2 Privacy Preserving VI algorithm for CE family
The only place where the algorithm looks at the data is when computing the expected sufficient statistics in the first step. The expected sufficient statistics then dictate the expected natural parameters in the second step. So, perturbing the sufficient statistics perturbs both posterior distributions q(l) and q(m). Perturbing sufficient statistics in exponential families has also been used in [4]. Existing work focuses on privatising posterior distributions in the context of posterior sampling [5, 6, 7, 8], while our work focuses on privatising approximate posterior distributions for optimisation-based approximate Bayesian inference.
Suppose there are two neighbouring datasets D and D′, differing by a single datapoint. We also assume that the dataset is preprocessed such that the norm of any datapoint is bounded. The maximum difference in the expected sufficient statistics given the two datasets, i.e. the L1 sensitivity of the expected sufficient statistics, is Δs̄ = max_{D, D′} ‖ s̄(D) − s̄(D′) ‖₁. Under some models, like LDA below, the expected sufficient statistics have limited sensitivity, in which case we add noise to each coordinate of the expected sufficient statistics to compensate for the maximum change.
3 Privacy preserving Latent Dirichlet Allocation (LDA)
The most successful topic models are based on LDA, whose generative process is as follows [9]:
- Draw topics β_k ~ Dirichlet(η), for k = 1, …, K, where η is a scalar hyperparameter.
- For each document d = 1, …, D:
  - Draw topic proportions θ_d ~ Dirichlet(α), where α is a scalar hyperparameter.
  - For each word n = 1, …, N_d:
    - Draw topic assignment z_dn ~ Discrete(θ_d).
    - Draw word w_dn ~ Discrete(β_{z_dn}),
where each observed word is represented by an indicator vector w_dn (the nth word in the dth document) of length V, where V is the number of terms in a fixed vocabulary set. The topic assignment latent variable z_dn is also an indicator vector, of length K, where K is the number of topics.
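The generative process above can be sketched directly (toy sizes; the variable names follow standard LDA notation and are assumptions of this sketch):

```python
import numpy as np

def sample_lda(n_docs, doc_len, K, V, eta=0.1, alpha=0.5, seed=0):
    """Sample documents from the LDA generative process."""
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(V, eta), size=K)   # K topics, each over V terms
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet(np.full(K, alpha))    # per-document topic proportions
        z = rng.choice(K, size=doc_len, p=theta)    # per-word topic assignments
        words = np.array([rng.choice(V, p=beta[k]) for k in z])
        docs.append(words)
    return beta, docs

beta, docs = sample_lda(n_docs=5, doc_len=20, K=3, V=50)
```

Here words are stored as integer term ids rather than explicit length-V indicator vectors; the two representations are equivalent.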
LDA falls into the CE family, where we think of z and θ as two types of latent variables, l : (z, θ), and of the topics as model parameters, m : β. (1) The complete-data likelihood per document is in the exponential family, with natural parameters given by log β and sufficient statistics s(w_d, z_d) = Σ_{n=1}^{N_d} z_dn w_dn^T; (2) the conjugate prior over each topic is β_k ~ Dirichlet(η), for k = 1, …, K. For simplicity, we assume the hyperparameters α and η are set manually. In VI, we assume the posteriors are: (1) Discrete for z_dn, with variational parameters φ_dn that capture the posterior probability of topic assignment; (2) Dirichlet for θ_d; and (3) Dirichlet for β_k. In this case, the expected sufficient statistics are s̄_kv = Σ_d Σ_n φ_dnk w_dn^{(v)}.
Sensitivity analysis
To privatise the variational inference for LDA, we perturb the expected sufficient statistics. While each document has a different document length N_d, we limit the maximum length of any document to n_max, by randomly selecting n_max words from a document whenever the document contains more than n_max words.
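A minimal sketch of this length-capping step, assuming documents are arrays of word tokens and `n_max` is the chosen cap (names are illustrative):

```python
import numpy as np

def truncate_doc(word_ids, n_max, rng):
    """Cap a document at n_max word tokens by sampling positions without
    replacement, bounding the per-document contribution to the expected
    sufficient statistics. Documents already short enough pass through."""
    word_ids = np.asarray(word_ids)
    if len(word_ids) <= n_max:
        return word_ids
    idx = rng.choice(len(word_ids), size=n_max, replace=False)
    return word_ids[np.sort(idx)]  # keep the original word order

rng = np.random.default_rng(0)
doc = list(range(500))              # a document with 500 word tokens
short = truncate_doc(doc, n_max=200, rng=rng)
```

Sampling positions (rather than distinct term ids) keeps repeated tokens handled correctly.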
We add Gaussian noise to each coordinate, then map the perturbed coordinate to 0 if it becomes negative:
(3) s̃_k = max( 0, s̄_k + Y_k ), Y_k ~ N(0, σ²), σ = Δs̄ √(2 log(1.25/δ_iter)) / ε_iter,
where s̄_k is the kth coordinate of the expected sufficient statistics, viewed as a vector of length KV, and Δs̄ is the sensitivity, given by
(4) Δs̄ = 2 n_max,
since Σ_k φ_dnk = 1, Σ_v w_dn^{(v)} = 1, N_d ≤ n_max for all d, and neighbouring datasets differ by a single document.
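A sketch of the perturbation step, assuming the standard Gaussian-mechanism calibration of σ from the sensitivity and a per-iteration budget (ε_iter, δ_iter):

```python
import numpy as np

def perturb_suff_stats(s_bar, sensitivity, eps_iter, delta_iter, rng):
    """Gaussian mechanism for the expected sufficient statistics: add
    N(0, sigma^2) noise to each coordinate, then clamp negatives to zero
    so the result remains a valid (non-negative) count-like statistic."""
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta_iter)) / eps_iter
    noisy = s_bar + rng.normal(0.0, sigma, size=s_bar.shape)
    return np.maximum(noisy, 0.0)

rng = np.random.default_rng(0)
s_bar = np.array([30.0, 0.5, 12.0])   # illustrative statistics
s_tilde = perturb_suff_stats(s_bar, sensitivity=2.0, eps_iter=1.0,
                             delta_iter=1e-6, rng=rng)
```

The clamping step matches eq (3); it biases small coordinates upward slightly but keeps the downstream Dirichlet updates well defined.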
Private stochastic variational learning
In a large-scale data setting, it is impractical to handle the entire dataset at once. In such cases, stochastic learning using noisy gradients computed on minibatches of data, a.k.a. stochastic gradient descent (SGD), provides a scalable inference method. While there is some prior work on differentially private SGD (e.g., [10, 11]), privacy amplification due to subsampling combined with the CDP composition (described below) has not been used in the context of variational inference or topic modeling before. The privacy amplification theorem states the following.
Theorem 1.
(Theorem 1 in [12]) Any ε-DP mechanism running on a uniformly sampled subset of data with a sampling ratio ν guarantees ε′-differential privacy, where ε′ = log(1 + ν(e^ε − 1)).
The privacy gain from subsampling allows us to use a much more relaxed per-iteration privacy budget ε_iter and error tolerance δ_iter, while still achieving a reasonable level of (ε, δ)-DP overall, provided the sampling rate ν is small.
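The amplification map can be written as a one-liner (a sketch; `nu` is the sampling ratio, and the δ scaling follows the usual subsampling argument):

```python
import math

def amplify(eps, delta, nu):
    """Privacy amplification by subsampling: running an (eps, delta)-DP
    mechanism on a uniformly subsampled nu-fraction of the data yields
    (eps', nu*delta)-DP with eps' = log(1 + nu*(exp(eps) - 1))."""
    return math.log1p(nu * math.expm1(eps)), nu * delta

# With a 1% sampling rate, a per-iteration budget of eps = 1 amplifies
# down to an effective budget of roughly nu * (e - 1) ~= 0.017.
eps_prime, delta_prime = amplify(1.0, 1e-6, 0.01)
```

For small ν, ε′ ≈ ν(e^ε − 1), which is why small minibatches tolerate much larger per-iteration budgets.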
Furthermore, the zCDP composition allows a sharper analysis of the per-iteration privacy budget. We first convert DP to zCDP, then use the zCDP composition, and finally convert zCDP back to DP (for comparison purposes), for which we use the following lemmas and proposition.
Lemma 1.
(Proposition 1.4 in [13]) If a mechanism M satisfies ε-DP, then M satisfies (½ε²)-zCDP.
Lemma 2.
(Lemma 1.7 in [13]) If two mechanisms satisfy ρ₁-zCDP and ρ₂-zCDP, respectively, then their composition satisfies (ρ₁ + ρ₂)-zCDP.
Proposition 1.
(Proposition 1.3 in [13]) If M provides ρ-zCDP, then M is (ρ + 2√(ρ log(1/δ)), δ)-DP for any δ > 0.
So, using Lemmas 1 and 2, J-fold composition of per-iteration ε′-DP steps (each implemented by the Gaussian mechanism) yields (Jε′²/2)-zCDP. Using Proposition 1, we convert the zCDP guarantee back to DP: the composition is (ε, δ)-DP, where ε = Jε′²/2 + ε′√(2J log(1/δ)).
These seemingly complicated steps can be summarised into two simple steps. First, given a total privacy budget ε_tot and total tolerance level δ_tot over J iterations, our algorithm calculates an intermediate privacy budget ε′ using the zCDP composition, which maps (ε_tot, δ_tot) to (ε′, δ_tot):
(5) ε′ = ( √( 2J log(1/δ_tot) + 2J ε_tot ) − √( 2J log(1/δ_tot) ) ) / J.
Second, our algorithm calculates the per-iteration privacy budget ε_iter using the privacy amplification theorem, which maps (ε′, δ_tot) to (ε_iter, δ_iter):
(6) ε_iter = log( 1 + (e^{ε′} − 1)/ν ).
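One plausible instantiation of the two-step calibration, under the simplifying assumption that each of the J iterations is accounted as an ε′-DP step (names and the exact bookkeeping are illustrative):

```python
import math

def per_iteration_budget(eps_tot, delta_tot, J, nu):
    """Two-step budget calibration.

    Step 1 (zCDP composition): solve
        eps_tot = J*eps'^2/2 + eps'*sqrt(2*J*log(1/delta_tot))
    for the intermediate per-iteration budget eps' (a quadratic in eps').
    Step 2 (amplification): invert eps' = log(1 + nu*(exp(eps_iter) - 1))
    to get the budget eps_iter each subsampled iteration may spend.
    """
    c = 2.0 * J * math.log(1.0 / delta_tot)
    eps_prime = (math.sqrt(c + 2.0 * J * eps_tot) - math.sqrt(c)) / J
    eps_iter = math.log1p(math.expm1(eps_prime) / nu)
    return eps_prime, eps_iter

eps_prime, eps_iter = per_iteration_budget(eps_tot=0.5, delta_tot=1e-4,
                                           J=100, nu=0.01)
```

Because of amplification, ε_iter can be much larger than ε′, so far less noise is added per iteration than a naive split of ε_tot would allow.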
Algorithm 1 summarizes our private topic modeling algorithm.
4 Experiments using Wikipedia data
We randomly downloaded documents from Wikipedia. We then tested our VIPS algorithm on the Wikipedia dataset with four different values of total privacy budget, using a minibatch size , until the algorithm sees up to documents. We assumed there are topics, and we used a vocabulary set of approximately terms.
We compare our method to two baseline methods. First, in linear (Lin) composition (Theorem 3.16 of [1]), privacy degrades linearly with the number of iterations J. This result follows from the max divergence of the privacy loss random variable being bounded by a total budget; hence, the linear composition yields (Jε_iter, Jδ_iter)-DP. We use eq (6) to map (ε_tot/J, δ_tot/J) to (ε_iter, δ_iter). Second, advanced (Adv) composition (Theorem 3.20 of [1]), which follows from the max divergence of the privacy loss random variable being bounded by a total budget including a slack variable δ′, yields (ε√(2J log(1/δ′)) + Jε(e^ε − 1), Jδ + δ′)-DP. Similarly, we use eq (6) to map the resulting intermediate budget to (ε_iter, δ_iter).
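For a rough comparison of the three accounting schemes, the sketch below inverts each composition bound to obtain the per-iteration budget permitted by a fixed total budget (the advanced-composition inversion uses the small-ε approximation e^ε − 1 ≈ ε; all numbers are illustrative):

```python
import math

def eps_iter_linear(eps_tot, J):
    """Linear composition: J iterations of eps-DP give (J*eps)-DP."""
    return eps_tot / J

def eps_iter_advanced(eps_tot, delta_slack, J):
    """Advanced composition with exp(eps)-1 ~= eps:
    eps_tot ~= eps*sqrt(2*J*log(1/delta_slack)) + J*eps^2, solved for eps."""
    c = 2.0 * J * math.log(1.0 / delta_slack)
    return (math.sqrt(c + 4.0 * J * eps_tot) - math.sqrt(c)) / (2.0 * J)

def eps_iter_zcdp(eps_tot, delta_tot, J):
    """zCDP composition: eps_tot = J*eps^2/2 + eps*sqrt(2*J*log(1/delta_tot)),
    solved for eps."""
    c = 2.0 * J * math.log(1.0 / delta_tot)
    return (math.sqrt(c + 2.0 * J * eps_tot) - math.sqrt(c)) / J

J, eps_tot, delta = 100, 0.5, 1e-4
budgets = (eps_iter_linear(eps_tot, J),
           eps_iter_advanced(eps_tot, delta, J),
           eps_iter_zcdp(eps_tot, delta, J))
# A larger per-iteration budget means less noise added per iteration,
# and the ordering is linear < advanced < zCDP.
```

This ordering mirrors the experimental ranking of the three methods: zCDP permits the least noise per iteration for the same total budget.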
As an evaluation metric, we compute the upper bound on the perplexity on held-out documents^2, where n_d is the vector of word counts for the dth held-out document. In the above, we use the λ that was calculated during training. We compute the posteriors over z and θ by performing the first step in our algorithm using the test data and the perturbed sufficient statistics we obtained during training. The per-word perplexity is shown in Fig. 1. Due to privacy amplification, less noise needs to be added when the minibatch size is small. The zCDP composition results in better accuracy than the advanced composition.
^2 We used the metric written in the python implementation by the authors of [14].
In Table 1, we show the top words in terms of assigned probabilities under a chosen topic in each method. We show four topics as examples. Non-private LDA results in the most coherent words among all the methods. For the private LDA models with a total privacy budget fixed to ε_tot = 0.5, as we move from zCDP, to advanced, and to linear composition, the amount of added noise gets larger, and therefore more topics have less coherent words.
Nonprivate  zCDP (eps=0.5)  Adv (eps=0.5)  Lin (eps=0.5)  

topic 3:  topic 81:  topic 72:  topic 27:  
david  0.0667  born  0.0882  fragment  0.0002  horn  0.0002  
king  0.0318  american  0.0766  gentleness  0.0001  shone  0.0001  
god  0.0304  name  0.0246  soit  0.0001  age  0.0001  
son  0.0197  actor  0.0196  render  0.0001  tradition  0.0001  
israel  0.0186  english  0.0179  nonproprietary  0.0001  protecting  0.0001  
bible  0.0156  charles  0.0165  westminster  0.0001  fils  0.0001  
hebrew  0.0123  british  0.0138  proceedings  0.0001  trip  0.0001  
story  0.0102  richard  0.0130  clare  0.0001  article  0.0001  
book  0.0095  german  0.0119  stronger  0.0001  interests  0.0001  
adam  0.0092  character  0.0115  hesitate  0.0001  incidents  0.0001  
topic 4:  topic 82:  topic 73:  topic 28:  
university  0.1811  wat  0.0002  mount  0.0034  american  0.0228  
press  0.0546  armed  0.0001  display  0.0011  born  0.0154  
oxford  0.0413  log  0.0001  animal  0.0011  john  0.0107  
italy  0.0372  fierce  0.0001  equipment  0.0011  name  0.0094  
jacques  0.0359  infantry  0.0001  cynthia  0.0009  english  0.0062  
cambridge  0.0349  sehen  0.0001  position  0.0008  actor  0.0061  
barbara  0.0280  selbst  0.0001  systems  0.0008  united  0.0058  
research  0.0227  clearly  0.0001  support  0.0008  british  0.0051  
murray  0.0184  bull  0.0001  software  0.0008  character  0.0051  
scientific  0.0182  recall  0.0001  heavy  0.0008  people  0.0048  
topic 5:  topic 83:  topic 74:  topic 29:  
association  0.0896  david  0.0410  david  0.0119  shelter  0.0001  
security  0.0781  jonathan  0.0199  king  0.0091  rome  0.0001  
money  0.0584  king  0.0188  god  0.0072  thick  0.0001  
joint  0.0361  samuel  0.0186  church  0.0061  vous  0.0001  
masters  0.0303  israel  0.0112  samuel  0.0054  leg  0.0001  
banks  0.0299  saul  0.0075  son  0.0051  considering  0.0001  
seal  0.0241  son  0.0068  israel  0.0039  king  0.0001  
gilbert  0.0235  dan  0.0067  name  0.0038  object  0.0001  
trade  0.0168  god  0.0053  century  0.0038  prayed  0.0001  
heads  0.0166  story  0.0048  first  0.0036  pilot  0.0001  
topic 6:  topic 84:  topic 75:  topic 30:  
law  0.0997  simon  0.0101  recognise  0.0001  despair  0.0001  
court  0.0777  cat  0.0008  comparison  0.0001  ray  0.0001  
police  0.0442  maison  0.0005  violates  0.0001  successfully  0.0001  
legal  0.0396  breach  0.0005  offices  0.0001  respectable  0.0001  
justice  0.0292  says  0.0005  value  0.0001  acute  0.0001  
courts  0.0229  dirty  0.0005  neighbor  0.0001  accompany  0.0001  
welcome  0.0204  rifle  0.0004  cetait  0.0001  assuming  0.0001  
civil  0.0178  door  0.0004  composed  0.0001  florence  0.0001  
signal  0.0170  property  0.0004  interests  0.0001  ambition  0.0001  
pan  0.0163  genus  0.0004  argue  0.0001  unreasonable  0.0001 
Using the same data, we then tested how the perplexity changes as we vary the minibatch size. As shown in Fig. 1, smaller minibatches benefit more from privacy amplification, requiring less added noise, and the zCDP composition again results in better accuracy than the advanced composition.
5 Conclusion
We have developed a practical privacy-preserving topic modeling algorithm which outputs accurate, privatised expected sufficient statistics and expected natural parameters. Our approach combines the zCDP composition analysis with the privacy amplification effect due to subsampling of data, which significantly decreases the amount of additive noise required for the same expected privacy guarantee compared to the standard analysis.
References
 [1] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9:211–407, August 2014.
 [2] C. Dwork and G. N. Rothblum. Concentrated Differential Privacy. ArXiv e-prints, March 2016.
 [3] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Unit, University College London, 2003.
 [4] Mijung Park, Jimmy Foulds, Kamalika Chaudhuri, and Max Welling. Practical privacy for expectation maximization. CoRR, abs/1605.06995, 2016.
 [5] Christos Dimitrakakis, Blaine Nelson, Aikaterini Mitrokotsa, and Benjamin IP Rubinstein. Robust and private Bayesian inference. In Algorithmic Learning Theory (ALT), pages 291–305. Springer, 2014.
 [6] Zuhe Zhang, Benjamin Rubinstein, and Christos Dimitrakakis. On the differential privacy of Bayesian inference. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI), 2016.
 [7] James R. Foulds, Joseph Geumlek, Max Welling, and Kamalika Chaudhuri. On the theory and practice of privacy-preserving Bayesian data analysis. CoRR, abs/1603.07294, 2016.
 [8] Gilles Barthe, Gian Pietro Farina, Marco Gaboardi, Emilio Jesús Gallego Arias, Andy Gordon, Justin Hsu, and Pierre-Yves Strub. Differentially private Bayesian programming. CoRR, abs/1605.00283, 2016.
 [9] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. J. Mach. Learn. Res., 14(1):1303–1347, May 2013.
 [10] Xi Wu, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey F. Naughton. Differentially private stochastic gradient descent for in-RDBMS analytics. CoRR, abs/1606.04722, 2016.
 [11] Y.-X. Wang, S. E. Fienberg, and A. Smola. Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo. ArXiv e-prints, February 2015.
 [12] Ninghui Li, Wahbeh Qardaji, and Dong Su. On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security, ASIACCS '12, pages 32–33, New York, NY, USA, 2012. ACM.
 [13] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. CoRR, abs/1605.02065, 2016.
 [14] Matthew Hoffman, Francis R. Bach, and David M. Blei. Online learning for latent Dirichlet allocation. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 856–864. Curran Associates, Inc., 2010.