Variational Bayes In Private Settings (VIPS)

11/01/2016 ∙ by Mijung Park, et al. ∙ University of Amsterdam ∙ University of Maryland, Baltimore County ∙ University of California, San Diego

We provide a general framework for privacy-preserving variational Bayes (VB) for a large class of probabilistic models, called the conjugate exponential (CE) family. Our primary observation is that when models are in the CE family, we can privatise the variational posterior distributions simply by perturbing the expected sufficient statistics of the complete-data likelihood. For widely used non-CE models with binomial likelihoods, we exploit the Pólya-Gamma data augmentation scheme to bring such models into the CE family, such that inferences in the modified model resemble the private variational Bayes algorithm as closely as possible. The iterative nature of variational Bayes presents a further challenge since iterations increase the amount of noise needed. We overcome this by combining: (1) a relaxed notion of differential privacy, called concentrated differential privacy, which provides a tight bound on the privacy cost of multiple VB iterations and thus significantly decreases the amount of additive noise; and (2) the privacy amplification effect of subsampling mini-batches from large-scale data in stochastic learning. We empirically demonstrate the effectiveness of our method in CE and non-CE models including latent Dirichlet allocation, Bayesian logistic regression, and sigmoid belief networks, evaluated on real-world datasets.




1 Introduction

Bayesian inference, which reasons over the uncertainty in model parameters and latent variables given data and prior knowledge, has found widespread use in data science application domains in which privacy is essential, including text analysis (Blei et al., 2003), medical informatics (Husmeier et al., 2006), and MOOCs (Piech et al., 2013). In these applications, the goals of the analysis must be carefully balanced against the privacy concerns of the individuals whose data are being studied (Daries et al., 2014). The recently proposed Differential Privacy (DP) formalism provides a means for analyzing and controlling this trade-off, by quantifying the privacy “cost” of data-driven algorithms (Dwork et al., 2006b). In this work, we address the challenge of performing Bayesian inference in private settings, by developing an extension of the widely used Variational Bayes (VB) algorithm that preserves differential privacy. We provide extensive experiments across a variety of probabilistic models which demonstrate that our algorithm is a practical, broadly applicable, and statistically efficient method for privacy-preserving Bayesian inference.

The algorithm that we build upon, variational Bayes, is an optimisation-based approach to Bayesian inference which has origins in the closely related Expectation Maximisation (EM) algorithm (Dempster et al., 1977), although there are important differences between these methods. Variational Bayes outputs an approximation to the full Bayesian posterior distribution, while expectation maximisation performs maximum likelihood and MAP estimation, and hence outputs a point estimate of the parameters. Thus, VB performs fully Bayesian inference, while EM does not. It will nevertheless be convenient to frame our discussion on VB in the context of EM, following Beal (2003). The EM algorithm seeks to learn the parameters of models with latent variables. Since directly optimising the log likelihood of observations under such models is intractable, EM introduces a lower bound on the log likelihood by rewriting it in terms of auxiliary probability distributions over the latent variables, and using a Jensen’s inequality argument. EM proceeds by iteratively alternating between improving the bound via updating the auxiliary distributions (the E-step), and optimising the lower bound with respect to the parameters (the M-step).

Alternatively, EM can instead be understood as an instance of the variational method, in which both the E- and M-steps are viewed as maximising the same joint objective function: a reinterpretation of the bound as a variational lower bound which is related to a quantity known as the variational free energy in statistical physics, and to the KL-divergence (Neal and Hinton, 1998). This interpretation opens the door to extensions in which simplifying assumptions are made on the optimization problem, thereby trading improved computational tractability against tightness of the bound. Such simplifications, for instance assuming that the auxiliary distributions are fully factorized, make feasible a fully Bayesian extension of EM, called Variational Bayesian EM (VBEM) (Beal, 2003). While VBEM has a somewhat similar algorithmic structure to EM, it aims to compute a fundamentally different object: an approximation to the entire posterior distribution, instead of a point estimate of the parameters. VBEM thereby provides an optimisation-based alternative to Markov Chain Monte Carlo (MCMC) simulation methods for Bayesian inference, and as such, frequently has faster convergence properties than corresponding MCMC methods. We collectively refer to VBEM and an intermediary method between it and EM, called Variational EM (VEM), as variational Bayes. (Variational EM uses simplifying assumptions on the auxiliary distribution over latent variables, as in VBEM, but still aims to find a point estimate of the parameters, as in EM.) See Sec. 2.5 for more details on variational Bayes, and Appendix D for more details on EM.

While the variational Bayes algorithm proves its usefulness by successfully solving many statistical problems, the standard form of the algorithm unfortunately cannot guarantee privacy for each individual in the dataset. Differential Privacy (DP) (Dwork et al., 2006b) is a formalism which can be used to establish such guarantees. An algorithm is said to preserve differential privacy if its output is sufficiently noisy or random to obscure the participation of any single individual in the data. By showing that an algorithm satisfies the differential privacy criterion, we are guaranteed that an adversary cannot draw new conclusions about an individual from the output of the algorithm, by virtue of his/her participation in the dataset. However, the injection of noise into an algorithm, in order to satisfy the DP definition, generally results in a loss of statistical efficiency, as measured by accuracy per sample in the dataset. To design practical differentially private machine learning algorithms, the central challenge is to design a noise injection mechanism such that there is a good trade-off between privacy and statistical efficiency.

Iterative algorithms such as variational Bayes pose a further challenge when developing a differentially private algorithm: each iteration corresponds to a query to the database which must be privatised, and the number of iterations required to guarantee accurate posterior estimates causes high cumulative privacy loss. To compensate for this loss, one needs to add a large amount of noise to the quantity of interest. We overcome these challenges in the context of variational Bayes by using the following key ideas:

  • Perturbation of the expected sufficient statistics: Our first observation is that when models are in the Conjugate Exponential (CE) family, we can privatise variational posterior distributions simply by perturbing the expected sufficient statistics of the complete-data likelihood. This allows us to make effective use of the per iteration privacy budget in each step of the algorithm.

  • Refined composition analysis: In order to use the privacy budget more effectively across many iterations, we calculate the cumulative privacy cost by using an improved composition analysis, the Moments Accountant (MA) method of Abadi et al. (2016). This method maintains a bound on the log of the moment generating function of the privacy loss incurred by applying multiple mechanisms to the dataset, i.e. one mechanism for each iteration of a machine learning algorithm. This allows for a higher per-iteration budget than standard methods.

  • Privacy amplification effect from subsampling of large-scale data: Processing the entire dataset in each iteration is extremely expensive in the big data setting, and is not possible in the case of streaming data, for which the size of the dataset is assumed to be infinite. Stochastic learning algorithms provide a scalable alternative by performing parameter updates based on subsampled mini-batches of data, and this has proved to be highly advantageous for large-scale applications of variational Bayes (Hoffman et al., 2013). This subsampling procedure, in fact, has a further benefit of amplifying privacy. Our results confirm that subsampling works synergistically with moments accountant composition to make effective use of an overall privacy budget. While there are several prior works on differentially private algorithms in stochastic learning (e.g. Bassily et al. (2014); Song et al. (2013); Wang et al. (2015, 2016); Wu et al. (2016)), the use of privacy amplification due to subsampling, together with MA composition, has not been used in the context of variational Bayes before (Abadi et al. (2016) used this approach for privacy-preserving deep learning).

  • Data augmentation for the non-CE family models: For widely used non-CE models with binomial likelihoods such as logistic regression, we exploit the Pólya-Gamma data augmentation scheme (Polson et al., 2013) to bring such models into the CE family, such that inferences in the modified model resemble our private variational Bayes algorithm as closely as possible. Unlike recent work which involves perturbing and clipping gradients for privacy (Jälkö et al., 2017), our method uses an improved composition method, and also maintains the closed-form updates for the variational posteriors and the posterior over hyper-parameters, and results in an algorithm which is more faithful to the standard CE variational Bayes method. Several papers have used the Pólya-Gamma data augmentation method in order to perform Bayesian inference, either exactly via Gibbs sampling (Polson et al., 2013), or approximately via variational Bayes (Gan et al., 2015). However, this augmentation technique has not previously been used in the context of differential privacy.

Taken together, these ideas result in an algorithm for privacy-preserving variational Bayesian inference that is both practical and very broadly applicable. Our private VB algorithm makes effective use of the privacy budget, both per iteration and across multiple iterations, and with further improvements in the stochastic variational inference setting. Our algorithm is also extremely general, and applies to the broad class of CE family models, as well as non-CE models with binomial likelihoods. We present extensive empirical results demonstrating that our algorithm can preserve differential privacy while maintaining a high degree of statistical efficiency, leading to practical private Bayesian inference on a range of probabilistic models.

We organise the remainder of this paper as follows. First, we review relevant background information on differential privacy, privacy-preserving Bayesian inference, and variational Bayes in Sec. 2. We then introduce our novel general framework of private variational Bayes in Sec. 3 and illustrate how to apply that general framework to the latent Dirichlet allocation model in Sec. 4. In Sec. 5, we introduce the Pólya-Gamma data augmentation scheme for non-CE family models, and we then illustrate how to apply our private variational Bayes algorithm to Bayesian logistic regression (Sec. 6) and sigmoid belief networks (Sec. 7). Lastly, we summarise our paper and provide future directions in Sec. 8.

2 Background

We begin with some background information on differential privacy, general techniques for designing differentially private algorithms and the composition techniques that we use. We then provide related work on privacy-preserving Bayesian inference, as well as the general formulation of the variational inference algorithm.

2.1 Differential privacy

Differential Privacy (DP) is a formal definition of the privacy properties of data analysis algorithms (Dwork and Roth, 2014; Dwork et al., 2006b).

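Definition 1 (Differential Privacy) A randomized algorithm $\mathcal{M}$ is said to be $(\epsilon, \delta)$-differentially private if

$$\Pr(\mathcal{M}(\mathcal{D}) \in \mathcal{S}) \le \exp(\epsilon) \, \Pr(\mathcal{M}(\mathcal{D}') \in \mathcal{S}) + \delta \qquad (1)$$

for all measurable subsets $\mathcal{S}$ of the range of $\mathcal{M}$ and for all datasets $\mathcal{D}$, $\mathcal{D}'$ differing by a single entry (either by excluding that entry or replacing it with a new entry).

Here, an entry usually corresponds to a single individual’s private value. If $\delta = 0$, the algorithm is said to be $\epsilon$-differentially private, and if $\delta > 0$, it is said to be approximately differentially private.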

Intuitively, the definition states that the probability of any event does not change very much when a single individual’s data is modified, thereby limiting the amount of information that the algorithm reveals about any one individual. We observe that $\mathcal{M}$ is a randomized algorithm, and randomization is achieved by either adding external noise, or by subsampling. In this paper, we use the “include/exclude” version of Def. 1, in which differing by a single entry refers to the inclusion or exclusion of that entry in the dataset. (We use this version of differential privacy in order to make use of Abadi et al. (2016)’s analysis of the “privacy amplification” effect of subsampling in the specific case of MA using the Gaussian mechanism. Privacy amplification is also possible with the “replace-one” definition of DP, cf. Wang et al. (2016).)

2.2 Designing differentially private algorithms

There are several standard approaches for designing differentially-private algorithms – see Dwork and Roth (2014) and Sarwate and Chaudhuri (2013) for surveys. The classical approach is output perturbation by Dwork et al. (2006b), where the idea is to add noise to the output of a function computed on sensitive data. The most common form of output perturbation is the global sensitivity method by Dwork et al. (2006b), where the idea is to calibrate the noise added to the global sensitivity of the function.

2.2.1 The Global Sensitivity Mechanism

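The global sensitivity of a function $F$ of a dataset $\mathcal{D}$ is defined as the maximum amount (over all datasets $\mathcal{D}$) by which $F$ changes when the private value of a single individual in $\mathcal{D}$ changes. Specifically,

$$\Delta F = \max_{\mathcal{D}, \mathcal{D}'} \| F(\mathcal{D}) - F(\mathcal{D}') \|,$$

where $\mathcal{D}$ is allowed to vary over the entire data domain, $\mathcal{D}'$ differs from $\mathcal{D}$ by a single entry, and $\| \cdot \|$ can correspond to either the $L_1$ norm or the $L_2$ norm, depending on the noise mechanism used.

In this paper, we consider a specific form of the global sensitivity method, called the Gaussian mechanism (Dwork et al., 2006a), where Gaussian noise calibrated to the global sensitivity in the $L_2$ norm is added. Specifically, for a function $F$ with global sensitivity $\Delta F$, we output:

$$F(\mathcal{D}) + Z, \quad \text{where } Z \sim \mathcal{N}(0, \sigma^2), \quad \sigma^2 \ge 2 \log(1.25/\delta) \, (\Delta F)^2 / \epsilon^2,$$

and where $\Delta F$ is computed using the $L_2$ norm, and is referred to as the L2 sensitivity of the function $F$. The privacy properties of this method are illustrated in (Dwork and Roth, 2014; Bun and Steinke, 2016; Dwork and Rothblum, 2016).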
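As a minimal sketch of the Gaussian mechanism described above, the following Python snippet calibrates the noise with the classical bound $\sigma \ge \Delta F \sqrt{2 \log(1.25/\delta)} / \epsilon$, which is only valid for $0 < \epsilon < 1$. The function names `gaussian_sigma` and `gaussian_mechanism`, the toy data, and the choice of a bounded sum as the query are assumptions made for illustration, not part of the paper.

```python
import math
import numpy as np


def gaussian_sigma(l2_sensitivity, epsilon, delta):
    """Noise std for the classical Gaussian mechanism (assumes 0 < epsilon < 1)."""
    if not (0.0 < epsilon < 1.0):
        raise ValueError("this calibration assumes 0 < epsilon < 1")
    if not (0.0 < delta < 1.0):
        raise ValueError("delta must lie in (0, 1)")
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def gaussian_mechanism(value, l2_sensitivity, epsilon, delta, rng):
    """Return value + Z with Z ~ N(0, sigma^2 I), sigma calibrated as above."""
    value = np.asarray(value, dtype=float)
    sigma = gaussian_sigma(l2_sensitivity, epsilon, delta)
    return value + rng.normal(0.0, sigma, size=value.shape)


# Hypothetical query: the sum of values that each lie in [0, 1].
# Including or excluding one entry changes the sum by at most 1,
# so the L2 sensitivity of this query is 1.
rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=1000)
noisy_sum = gaussian_mechanism(
    data.sum(), l2_sensitivity=1.0, epsilon=0.5, delta=1e-5, rng=rng
)
print(noisy_sum)
```

The sensitivity must be derived analytically for the query at hand; the code only trusts the value it is given.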

2.2.2 Other Mechanisms

A variation of the global sensitivity method is the smoothed sensitivity method (Nissim et al., 2007), where the standard deviation of the added noise depends on the dataset itself, and is less when the dataset is well-behaved. Computing the smoothed sensitivity in a tractable manner is a major challenge, and efficient computational procedures are known only for a small number of relatively simple tasks.

A second, different approach is the exponential mechanism (McSherry and Talwar, 2007), a generic procedure for privately solving an optimisation problem where the objective depends on sensitive data. The exponential mechanism outputs a sample drawn from a density concentrated around the (non-private) optimal value; this method too is computationally inefficient for large domains. A third approach is objective perturbation (Chaudhuri et al., 2011), where an optimisation problem is perturbed by adding a (randomly drawn) term and its solution is output; while this method applies easily to convex optimisation problems such as those that arise in logistic regression and SVM, it is unclear how to apply it to more complex optimisation problems that arise in Bayesian inference.

A final approach for designing differentially private algorithms is sample-and-aggregate (Nissim et al., 2007), where the goal is to boost the amount of privacy offered by running private algorithms on subsamples, and then aggregating the result. In this paper, we use a combination of output perturbation along with sample-and-aggregate.

2.3 Composition

An important property of differential privacy which makes it conducive to real applications is composition, which means that the privacy guarantees decay gracefully as the same private dataset is used in multiple releases. This property allows us to easily design private versions of iterative algorithms by making each iteration private, and then accounting for the privacy loss incurred by a fixed number of iterations.

The first composition result was established by Dwork et al. (2006b) and Dwork et al. (2006a), who showed that differential privacy composes linearly; if we use differentially private algorithms $\mathcal{M}_1, \ldots, \mathcal{M}_k$ with privacy parameters $(\epsilon_1, \delta_1), \ldots, (\epsilon_k, \delta_k)$ then the resulting process is $(\sum_{i=1}^{k} \epsilon_i, \sum_{i=1}^{k} \delta_i)$-differentially private. This result was improved by Dwork et al. (2010) to provide a better rate for $(\epsilon, \delta)$-differential privacy. Kairouz et al. (2017) improve this even further, and provide a characterization of optimal composition for any differentially private algorithm. Bun and Steinke (2016) use these ideas to provide simpler composition results for differentially private mechanisms that obey certain properties.
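To make the gap between linear composition and the improved rate concrete, the sketch below compares the two for $k$ repetitions of the same $(\epsilon, \delta)$-DP mechanism. The second function uses the advanced composition bound as stated in Dwork and Roth (2014), $\epsilon' = \epsilon \sqrt{2k \log(1/\delta')} + k \epsilon (e^{\epsilon} - 1)$ with total failure probability $k\delta + \delta'$. The function names and the example parameters are assumptions for illustration.

```python
import math


def basic_composition(epsilon, delta, k):
    """Linear composition: k runs of an (epsilon, delta)-DP mechanism."""
    return k * epsilon, k * delta


def advanced_composition(epsilon, delta, k, delta_slack):
    """Advanced composition bound; delta_slack is the extra failure probability."""
    eps_total = (
        epsilon * math.sqrt(2.0 * k * math.log(1.0 / delta_slack))
        + k * epsilon * (math.exp(epsilon) - 1.0)
    )
    return eps_total, k * delta + delta_slack


# Hypothetical setting: 1000 iterations, each (0.01, 1e-8)-DP.
basic = basic_composition(0.01, 1e-8, 1000)
advanced = advanced_composition(0.01, 1e-8, 1000, delta_slack=1e-6)
print("basic:", basic)
print("advanced:", advanced)
```

For small per-iteration $\epsilon$ and many iterations the advanced bound grows roughly with $\sqrt{k}$ rather than $k$, which is the regime relevant to iterative algorithms.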

2.3.1 The Moments Accountant Method

In this paper, we use the Moments Accountant (MA) composition method due to Abadi et al. (2016) for accounting for privacy loss incurred by successive iterations of an iterative mechanism. We choose this method as it is tighter than Dwork et al. (2010), and applies more generally than zCDP composition (Bun and Steinke, 2016). Moreover, unlike the result in Kairouz et al. (2017), this method is tailored to specific algorithms, and has relatively simple forms, which makes calculations easy.

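The moments accountant method is based on the concept of a privacy loss random variable, which allows us to consider the entire spectrum of likelihood ratios induced by a privacy mechanism $\mathcal{M}$. Specifically, the privacy loss random variable corresponding to a mechanism $\mathcal{M}$, datasets $\mathcal{D}$ and $\mathcal{D}'$, and an auxiliary parameter $w$ is a random variable defined as follows:

$$L_{\mathcal{M}}(\mathcal{D}, \mathcal{D}', w) = \log \frac{\Pr(\mathcal{M}(\mathcal{D}, w) = o)}{\Pr(\mathcal{M}(\mathcal{D}', w) = o)}, \quad o \sim \mathcal{M}(\mathcal{D}, w),$$

where $o$ lies in the range of $\mathcal{M}$. Observe that if $\mathcal{M}$ is $(\epsilon, 0)$-differentially private, then the absolute value of $L_{\mathcal{M}}(\mathcal{D}, \mathcal{D}', w)$ is at most $\epsilon$ with probability 1.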

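The moments accountant method exploits properties of this privacy loss random variable to account for the privacy loss incurred by applying mechanisms $\mathcal{M}_1, \ldots, \mathcal{M}_T$ successively to a dataset $\mathcal{D}$; this is done by bounding properties of the log of the moment generating function of the privacy loss random variable. Specifically, the log moment function $\alpha_{\mathcal{M}}(\lambda)$ of a mechanism $\mathcal{M}$ is defined as:

$$\alpha_{\mathcal{M}}(\lambda) = \sup_{\mathcal{D}, \mathcal{D}', w} \log \mathbb{E}\left[ \exp\left( \lambda L_{\mathcal{M}}(\mathcal{D}, \mathcal{D}', w) \right) \right], \qquad (2)$$

where $\mathcal{D}$ and $\mathcal{D}'$ are datasets that differ in the private value of a single person. (In this paper, we will interchangeably denote expectations as $\mathbb{E}[\cdot]$ or $\langle \cdot \rangle$.) Abadi et al. (2016) show that if $\mathcal{M}$ is the combination of mechanisms $(\mathcal{M}_1, \ldots, \mathcal{M}_T)$, then its log moment generating function $\alpha_{\mathcal{M}}$ has the property that:

$$\alpha_{\mathcal{M}}(\lambda) \le \sum_{t=1}^{T} \alpha_{\mathcal{M}_t}(\lambda). \qquad (3)$$

Additionally, given a log moment function $\alpha_{\mathcal{M}}$, the corresponding mechanism $\mathcal{M}$ satisfies a range of privacy parameters $(\epsilon, \delta)$ connected by the following equation:

$$\delta = \min_{\lambda} \exp\left( \alpha_{\mathcal{M}}(\lambda) - \lambda \epsilon \right). \qquad (4)$$

These properties immediately suggest a procedure for tracking privacy loss incurred by a combination of mechanisms $\mathcal{M}_1, \ldots, \mathcal{M}_T$ on a dataset. For each mechanism $\mathcal{M}_t$, first compute the log moment function $\alpha_{\mathcal{M}_t}$; for simple mechanisms such as the Gaussian mechanism this can be done by simple algebra. Next, compute $\alpha_{\mathcal{M}}$ for the combination $\mathcal{M}$ from (3), and finally, recover the privacy parameters of $\mathcal{M}$ using (4) by either finding the best $\epsilon$ for a target $\delta$ or the best $\delta$ for a target $\epsilon$. In some special cases such as composition of Gaussian mechanisms, the log moment functions can be calculated in closed form; the more common case is when closed forms are not available, and then a grid search may be performed over $\lambda$. We observe that any $(\epsilon, \delta)$ obtained as a solution to (4) via grid search are still valid privacy parameters, although they may be suboptimal if the grid is too coarse.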
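The tracking procedure above can be sketched for the simplest case: repeated Gaussian mechanisms with L2 sensitivity 1 and noise standard deviation `sigma`, without subsampling. In that case the privacy loss is itself Gaussian and the log moment function has the closed form $\lambda(\lambda+1)/(2\sigma^2)$. The function names, the integer grid over $\lambda$, and the example parameters are assumptions for illustration; the subsampled Gaussian mechanism analysed by Abadi et al. (2016) needs a numerical bound that is not shown here.

```python
import math


def gaussian_log_moment(lam, sigma):
    """Log moment of the Gaussian mechanism (sensitivity 1, noise std sigma)."""
    return lam * (lam + 1.0) / (2.0 * sigma ** 2)


def composed_log_moment(lam, sigma, steps):
    """Upper bound on the log moment of `steps` Gaussian mechanisms (sum rule)."""
    return steps * gaussian_log_moment(lam, sigma)


def epsilon_for_delta(sigma, steps, delta, max_lambda=64):
    """Best epsilon on an integer grid of lambda, for a target delta."""
    return min(
        (composed_log_moment(lam, sigma, steps) + math.log(1.0 / delta)) / lam
        for lam in range(1, max_lambda + 1)
    )


def delta_for_epsilon(sigma, steps, epsilon, max_lambda=64):
    """Best delta on the same grid, for a target epsilon (log domain for safety)."""
    log_delta = min(
        composed_log_moment(lam, sigma, steps) - lam * epsilon
        for lam in range(1, max_lambda + 1)
    )
    return math.exp(min(0.0, log_delta))


# Hypothetical run: 100 iterations with noise std 4 and target delta 1e-5.
eps = epsilon_for_delta(sigma=4.0, steps=100, delta=1e-5)
print(eps, delta_for_epsilon(sigma=4.0, steps=100, epsilon=eps))
```

Any grid yields valid parameters; a finer or non-integer grid over `lam` can only tighten the reported values.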

2.4 Privacy-preserving Bayesian inference

Privacy-preserving Bayesian inference is a new research area which is currently receiving a lot of attention. Dimitrakakis et al. (2014) showed that Bayesian posterior sampling is automatically differentially private, under a mild sensitivity condition on the log likelihood. This result was independently discovered by Wang et al. (2015), who also showed that the Stochastic Gradient Langevin Dynamics (SGLD) Bayesian inference algorithm (Welling and Teh, 2011) automatically satisfies approximate differential privacy, due to the use of Gaussian noise in the updates.

As an alternative to obtaining privacy “for free” from posterior sampling, Foulds et al. (2016) studied a Laplace mechanism approach for exponential family models, and proved that it is asymptotically efficient, unlike the former approach. Independently of this work, Zhang et al. (2016) proposed an equivalent Laplace mechanism method for private Bayesian inference in the special case of beta-Bernoulli systems, including graphical models constructed based on these systems. Foulds et al. (2016) further analyzed privacy-preserving MCMC algorithms, via both exponential and Laplace mechanism approaches.

In terms of approximate Bayesian inference, Jälkö et al. (2017) recently considered privacy-preserving variational Bayes via perturbing and clipping the gradients of the variational lower bound. However, this work focuses its experiments on logistic regression, a model that does not have latent variables. Given that most latent variable models consist of at least as many latent variables as the number of datapoints, the long vector of gradients (the concatenation of the gradients with respect to the latent variables; and with respect to the model parameters) in such cases is expected to typically require excessive amounts of additive noise. Furthermore, our approach, using the data augmentation scheme (see Sec. 5) and moment perturbation, yields closed-form posterior updates (posterior distributions both for latent variables and model parameters) that are closer to the spirit of the original variational Bayes method, for both CE and non-CE models, as well as an improved composition analysis using moments accountant.

In recent work, Barthe et al. (2016) designed a probabilistic programming language for designing privacy-preserving Bayesian machine learning algorithms, with privacy achieved via input or output perturbation, using standard mechanisms.

Lastly, although it is not a fully Bayesian method, it is worth noting the differentially private expectation maximisation algorithm developed by Park et al. (2017), which also involves perturbing the expected sufficient statistics for the complete-data likelihood. The major difference between our and their work is that EM is not (fully) Bayesian, i.e., EM outputs the point estimates of the model parameters; while VB outputs the posterior distributions (or those quantities that are necessary to do Bayesian predictions, e.g., expected natural parameters and expected sufficient statistics). Furthermore, Park et al. (2017) deals with only CE family models for obtaining the closed-form MAP estimates of the parameters; while our approach encompasses both CE and non-CE family models. Lastly, Park et al. (2017) demonstrated their method on small- to medium-sized datasets, which do not require stochastic learning; while our method takes into account the scenario of stochastic learning which is essential in the era of big data (Hoffman et al., 2013).

2.5 Variational Bayes

Variational inference is the class of techniques which solve inference problems in probabilistic models using variational methods (Jordan et al., 1999; Wainwright and Jordan, 2008). The general idea of variational methods is to cast a quantity of interest as an optimisation problem. By relaxing the problem in some way, we can replace the original intractable problem with one that we can solve efficiently. (Note that the “variational” terminology comes from the calculus of variations, which is concerned with finding optima of functionals. This pertains to probabilistic inference, since distributions are functions, and we aim to optimise over the space of possible distributions. However, it is typically not necessary to use the calculus of variations when deriving variational inference algorithms in practice.)

The application of variational inference to finding a Bayesian posterior distribution is called Variational Bayes (VB). Our discussion is focused on the Variational Bayesian EM (VBEM) algorithm variant. The goal of the algorithm is to compute an approximation to the posterior distribution over latent variables and model parameters for models where exact posterior inference is intractable. This should be contrasted to VEM and EM, which aim to compute a point estimate of the parameters. VEM and EM can both be understood as special cases, in which the set of distributions is constrained such that the approximate posterior is a Dirac delta function (Beal, 2003). We therefore include them within the definition of VB. See Blei et al. (2017) for a recent review on variational Bayes, Beal (2003) for more detailed derivations, and see Appendix D for more information on the relationship between EM and VBEM.

High-level derivation of VB

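Consider a generative model that produces a dataset $\mathcal{D} = \{\mathcal{D}_n\}_{n=1}^{N}$ consisting of $N$ independent identically distributed (i.i.d.) items ($\mathcal{D}_n$ is the $n$th input/output pair $\{\mathbf{x}_n, \mathbf{y}_n\}$ for supervised learning, and $\mathcal{D}_n$ is the $n$th vector output $\mathbf{y}_n$ for unsupervised learning), generated using a set of latent variables $\mathbf{l} = \{\mathbf{l}_n\}_{n=1}^{N}$. The generative model provides $p(\mathcal{D}_n \mid \mathbf{l}_n, \mathbf{m})$, where $\mathbf{m}$ are the model parameters. We also consider the prior distribution over the model parameters $p(\mathbf{m})$ and the prior distribution over the latent variables $p(\mathbf{l})$.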

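Variational Bayes recasts the task of approximating the posterior $p(\mathbf{l}, \mathbf{m} \mid \mathcal{D})$ as an optimisation problem: making the approximating distribution $q(\mathbf{l}, \mathbf{m})$, which is called the variational distribution, as similar as possible to the posterior, by minimising some distance (or divergence) between them. The terminology VB is often assumed to refer to the standard case, in which the divergence to minimise is the KL-divergence from $q(\mathbf{l}, \mathbf{m})$ to $p(\mathbf{l}, \mathbf{m} \mid \mathcal{D})$,

$$D_{KL}\left( q(\mathbf{l}, \mathbf{m}) \,\|\, p(\mathbf{l}, \mathbf{m} \mid \mathcal{D}) \right) = \mathbb{E}_{q}\left[ \log \frac{q(\mathbf{l}, \mathbf{m})}{p(\mathbf{l}, \mathbf{m} \mid \mathcal{D})} \right] = \mathbb{E}_{q}[\log q(\mathbf{l}, \mathbf{m})] - \mathbb{E}_{q}[\log p(\mathbf{l}, \mathbf{m}, \mathcal{D})] + \log p(\mathcal{D}). \qquad (5)$$

The arg min of Eq. 5 with respect to $q(\mathbf{l}, \mathbf{m})$ does not depend on the constant $\log p(\mathcal{D})$. Minimising it is therefore equivalent to maximizing

$$\mathcal{L}(q) = \mathbb{E}_{q}[\log p(\mathbf{l}, \mathbf{m}, \mathcal{D})] - \mathbb{E}_{q}[\log q(\mathbf{l}, \mathbf{m})] = \mathbb{E}_{q}[\log p(\mathbf{l}, \mathbf{m}, \mathcal{D})] + H(q), \qquad (6)$$

where $H(q)$ is the entropy (or differential entropy) of $q(\mathbf{l}, \mathbf{m})$. The entropy of $q(\mathbf{l}, \mathbf{m})$ rewards simplicity, while $\mathbb{E}_{q}[\log p(\mathbf{l}, \mathbf{m}, \mathcal{D})]$, the expected value of the complete data log-likelihood under the variational distribution, rewards accurately fitting to the data. Since the KL-divergence is 0 if and only if the two distributions are the same, maximizing $\mathcal{L}(q)$ will result in $q(\mathbf{l}, \mathbf{m}) = p(\mathbf{l}, \mathbf{m} \mid \mathcal{D})$ when the optimisation problem is unconstrained. In practice, however, we restrict $q(\mathbf{l}, \mathbf{m})$ to a tractable subset of possible distributions. A common choice for the tractable subset is the set of fully factorized distributions $q(\mathbf{l}, \mathbf{m}) = q(\mathbf{m}) \prod_{n=1}^{N} q(\mathbf{l}_n)$, in which case the method is referred to as mean-field variational Bayes.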

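We can alternatively derive $\mathcal{L}(q)$ as a lower bound on the log of the marginal probability of the data $p(\mathcal{D})$ (the evidence),

$$\log p(\mathcal{D}) = \log \int d\mathbf{l} \, d\mathbf{m} \; p(\mathbf{l}, \mathbf{m}, \mathcal{D}) = \log \mathbb{E}_{q}\left[ \frac{p(\mathbf{l}, \mathbf{m}, \mathcal{D})}{q(\mathbf{l}, \mathbf{m})} \right] \ge \mathbb{E}_{q}\left[ \log p(\mathbf{l}, \mathbf{m}, \mathcal{D}) - \log q(\mathbf{l}, \mathbf{m}) \right] = \mathcal{L}(q), \qquad (7)$$

where we have made use of Jensen’s inequality. Due to Eq. 7, $\mathcal{L}(q)$ is sometimes referred to as a variational lower bound, and in particular, the Evidence Lower Bound (ELBO), since it is a lower bound on the log of the model evidence $p(\mathcal{D})$. The VBEM algorithm maximises $\mathcal{L}(q)$ via coordinate ascent over parameters encoding $q(\mathbf{l}, \mathbf{m})$, called the variational parameters. The E-step optimises the variational parameters pertaining to the latent variables $\mathbf{l}$, and the M-step optimises the variational parameters pertaining to the model parameters $\mathbf{m}$. Under certain assumptions, these updates have a certain form, as we will describe below. These updates are iterated until convergence.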
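The relationship between the evidence, the ELBO and the KL-divergence can be checked numerically on a toy problem. The sketch below assumes a single discrete unknown with three states (standing in for the pair of latent variables and parameters) and made-up joint probabilities for one fixed observed dataset. It verifies that the log evidence equals the ELBO plus the KL-divergence, so the ELBO is a lower bound that is tight at the exact posterior.

```python
import numpy as np

# Hypothetical joint probabilities p(z = k, D) for the observed D, k = 0, 1, 2.
joint = np.array([0.10, 0.25, 0.05])
evidence = joint.sum()            # p(D)
posterior = joint / evidence      # p(z | D)


def elbo(q):
    """E_q[log p(z, D)] - E_q[log q(z)] for a discrete distribution q."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))


def kl(q, p):
    """KL divergence from q to p for discrete distributions with full support."""
    return float(np.sum(q * (np.log(q) - np.log(p))))


q = np.array([0.2, 0.5, 0.3])     # an arbitrary variational distribution
print(np.log(evidence), elbo(q), kl(q, posterior))
```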

VB for CE models

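VB simplifies to a two-step procedure when the model falls into the Conjugate-Exponential (CE) class of models, which satisfy two conditions (Beal, 2003):

(1) The complete-data likelihood is in the exponential family:

$$p(\mathcal{D}_n, \mathbf{l}_n \mid \mathbf{m}) = g(\mathbf{m}) \, f(\mathcal{D}_n, \mathbf{l}_n) \exp\left( \mathbf{n}(\mathbf{m})^{\top} \mathbf{s}(\mathcal{D}_n, \mathbf{l}_n) \right), \qquad (8)$$

(2) The prior over $\mathbf{m}$ is conjugate to the complete-data likelihood:

$$p(\mathbf{m} \mid \tau, \boldsymbol{\nu}) = h(\tau, \boldsymbol{\nu}) \, g(\mathbf{m})^{\tau} \exp\left( \boldsymbol{\nu}^{\top} \mathbf{n}(\mathbf{m}) \right), \qquad (9)$$

where natural parameters and sufficient statistics of the complete-data likelihood are denoted by $\mathbf{n}(\mathbf{m})$ and $\mathbf{s}(\mathcal{D}_n, \mathbf{l}_n)$, respectively. The hyperparameters are denoted by $\tau$ (a scalar) and $\boldsymbol{\nu}$ (a vector).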

A large class of models fall in the CE family. Examples include linear dynamical systems and switching models; Gaussian mixtures; factor analysis and probabilistic PCA; Hidden Markov Models (HMM) and factorial HMMs; and discrete-variable belief networks. The models that are widely used but not in the CE family include: Markov Random Fields (MRFs) and Boltzmann machines; logistic regression; sigmoid belief networks; and Independent Component Analysis (ICA). We illustrate how best to bring such models into the CE family in a later section.

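The VB algorithm for a CE family model optimises the lower bound on the model log marginal likelihood given by Eq. 7 (the ELBO),

$$\mathcal{L}(q(\mathbf{l}) q(\mathbf{m})) = \int d\mathbf{m} \, d\mathbf{l} \; q(\mathbf{l}) q(\mathbf{m}) \log \frac{p(\mathbf{l}, \mathcal{D}, \mathbf{m})}{q(\mathbf{l}) q(\mathbf{m})}, \qquad (10)$$

where we assume that the joint approximate posterior distribution over the latent variables and model parameters $q(\mathbf{l}, \mathbf{m})$ is factorised via the mean-field assumption as

$$q(\mathbf{l}, \mathbf{m}) = q(\mathbf{l}) q(\mathbf{m}) = q(\mathbf{m}) \prod_{n=1}^{N} q(\mathbf{l}_n), \qquad (11)$$

and that each of the variational distributions also has the form of an exponential family distribution. Computing the derivatives of the variational lower bound in Eq. 10 with respect to each of these variational distributions and setting them to zero yields the following two-step procedure.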
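The two steps can be written out in the standard form given for CE models by Beal (2003). The notation below is an assumption chosen to match Algorithm 1: $\bar{\mathbf{n}}$ for the expected natural parameters, $\mathbf{s}(\mathcal{D}_n, \mathbf{l}_n)$ for the sufficient statistics, $f$ and $g$ for the base measure and normaliser of the complete-data likelihood, and $\tilde{\tau}$, $\tilde{\boldsymbol{\nu}}$ for the updated hyperparameters.

```latex
% E-step: update each q(l_n) given the expected natural parameters
q(\mathbf{l}_n) \propto f(\mathcal{D}_n, \mathbf{l}_n)
    \exp\left( \bar{\mathbf{n}}^{\top} \mathbf{s}(\mathcal{D}_n, \mathbf{l}_n) \right),
\qquad
\bar{\mathbf{n}} = \langle \mathbf{n}(\mathbf{m}) \rangle_{q(\mathbf{m})}

% M-step: update q(m) given the expected sufficient statistics
q(\mathbf{m}) \propto g(\mathbf{m})^{\tilde{\tau}}
    \exp\left( \tilde{\boldsymbol{\nu}}^{\top} \mathbf{n}(\mathbf{m}) \right),
\qquad
\tilde{\tau} = \tau + N,
\qquad
\tilde{\boldsymbol{\nu}} = \boldsymbol{\nu}
    + \sum_{n=1}^{N} \langle \mathbf{s}(\mathcal{D}_n, \mathbf{l}_n) \rangle_{q(\mathbf{l}_n)}
```

The data enter the M-step only through the summed expected sufficient statistics.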

Stochastic VB for CE models

The VB update introduced above is inefficient for large data sets because we must optimise the variational posterior over the latent variables corresponding to each data point before re-estimating the variational posterior over the parameters. For more efficient learning, we adopt stochastic variational inference, which uses stochastic optimisation to fit the variational distribution over the parameters. We repeatedly subsample the data to form noisy estimates of the natural gradient of the variational lower bound, and we follow these estimates with a decreasing step-size $\rho_t$, as in Hoffman et al. (2013). (When optimising over a probability distribution, the Euclidean distance between two parameter vectors is often a poor measure of the dissimilarity of the distributions. The natural gradient of a function accounts for the information geometry of its parameter space, using a Riemannian metric to adjust the direction of the traditional gradient, which results in a faster convergence than the traditional gradient (Hoffman et al., 2013).) The stochastic variational Bayes algorithm is summarised in Algorithm 1.

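Require:  Data $\mathcal{D}$. Define $\rho_t = (\tau_0 + t)^{-\kappa}$ and mini-batch size $S$.
Ensure:  Expected natural parameters $\bar{\mathbf{n}}$ and expected sufficient statistics $\bar{\mathbf{s}}$.
  for $t = 1, \ldots, J$ do
     (1) E-step: Given the expected natural parameters $\bar{\mathbf{n}}$, compute $q(\mathbf{l}_n)$ for $n = 1, \ldots, S$. Output the expected sufficient statistics $\bar{\mathbf{s}} = \frac{1}{S} \sum_{n=1}^{S} \langle \mathbf{s}(\mathcal{D}_n, \mathbf{l}_n) \rangle_{q(\mathbf{l}_n)}$.
     (2) M-step: Given $\bar{\mathbf{s}}$, compute $q(\mathbf{m})$ by $\tilde{\boldsymbol{\nu}}^{(t)} = \boldsymbol{\nu} + N \bar{\mathbf{s}}$. Set $\tilde{\boldsymbol{\nu}}^{(t)} \leftarrow (1 - \rho_t) \tilde{\boldsymbol{\nu}}^{(t-1)} + \rho_t \tilde{\boldsymbol{\nu}}^{(t)}$. Output the expected natural parameters $\bar{\mathbf{n}} = \langle \mathbf{n}(\mathbf{m}) \rangle_{q(\mathbf{m})}$.
  end for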
Algorithm 1 (Stochastic) Variational Bayes for CE family distributions
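The stochastic M-step of Algorithm 1 can be sketched on a deliberately simple model. The example assumes a Bernoulli likelihood with a conjugate prior and no latent variables, so the E-step reduces to averaging the observed sufficient statistic over a mini-batch. The step-size schedule $(\tau_0 + t)^{-\kappa}$ follows Hoffman et al. (2013), while the function names, default constants and synthetic data are assumptions for illustration.

```python
import numpy as np


def step_size(t, tau0=1.0, kappa=0.7):
    """Decreasing step size rho_t = (tau0 + t)^(-kappa), with kappa in (0.5, 1]."""
    return (tau0 + t) ** (-kappa)


def stochastic_m_steps(x, nu_prior, batch_size, num_iters, rng):
    """Run the stochastic update of the posterior hyperparameter nu_tilde."""
    n = len(x)
    nu_tilde = nu_prior
    for t in range(1, num_iters + 1):
        batch = rng.choice(x, size=batch_size, replace=False)
        s_bar = batch.mean()                 # mini-batch expected sufficient statistic
        nu_hat = nu_prior + n * s_bar        # full-data estimate from the mini-batch
        rho = step_size(t)
        nu_tilde = (1.0 - rho) * nu_tilde + rho * nu_hat
    return nu_tilde


rng = np.random.default_rng(0)
x = (rng.random(10_000) < 0.3).astype(float)   # synthetic Bernoulli(0.3) data
nu_tilde = stochastic_m_steps(x, nu_prior=1.0, batch_size=100, num_iters=2000, rng=rng)
print(nu_tilde, 1.0 + x.sum())                 # stochastic vs. full-batch value
```

With latent variables, the line computing `s_bar` would be replaced by the model-specific E-step on the mini-batch.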

3 Variational Bayes In Private Settings (VIPS) for the CE family

To create an extension of variational Bayes which preserves differential privacy, we need to inject noise into the algorithm. The design choices for the noise injection procedure must be carefully made, as they can strongly affect the statistical efficiency of the algorithm, in terms of its accuracy versus the number of samples in the dataset. We start by introducing our problem setup.

3.1 Problem setup

A naive way to privatise the VB algorithm is by perturbing both $q(\mathbf{l})$ and $q(\mathbf{m})$. Unfortunately, this is impractical, due to the excessive amounts of additive noise (recall: we have as many latent variables as the number of datapoints). We propose to perturb the expected sufficient statistics only. What follows next explains why this makes sense.

While the VB algorithm is being run, the places where the algorithm needs to look at the data are (1) when computing the variational posterior over the latent variables ; and (2) when computing the expected sufficient statistics given in the E-step. In our proposed approach, we compute behind the privacy wall (see below), and compute the expected sufficient statistics using , as shown in Fig. 1. Before outputting the expected sufficient statistics, we perturb each coordinate of the expected sufficient statistics to compensate for the maximum difference in caused by both and . The perturbed expected sufficient statistics then dictate the expected natural parameters in the M-step. Hence we do not need an additional step to add noise to .

The reason we neither perturb nor output for training data is that we do not need itself most of the time. For instance, when computing the predictive probability for test datapoints , we need to perform the E-step to obtain the variational posterior for the test data , which is a function of the test data and the expected natural parameters , given as


where the dependence on the training data is implicit in the approximate posteriors through ; and the expected natural parameters . Hence, outputting the perturbed sufficient statistics and the expected natural parameters suffices for protecting the privacy of individuals in the training data. Furthermore, the M-step can be performed based on the (privatised) output of the E-step, without querying the data again, so we do not need to add any further noise to the M-step to ensure privacy, because differential privacy is immune to data-independent post-processing. To sum up, we provide our problem setup as below.

  1. Privacy wall: We assume that the sensitive training dataset is only accessible through a sanitising interface which we call a privacy wall. The training data stay behind the privacy wall, and adversaries have access to the outputs of our algorithm only, i.e., no direct access to the training data, although they may have further prior knowledge on some of the individuals that are included in the training data.

  2. Training phase: Our differentially private VB algorithm releases the perturbed expected natural parameters and perturbed expected sufficient statistics in every iteration. Every release of a perturbed parameter based on the training data triggers an update in the log moment functions (see Sec 3.2 on how these are updated). At the end of the training phase, the final privacy parameters are calculated based on (4).

  3. Test phase: Test data are public (or belong to users), i.e., outside the privacy wall. Bayesian prediction on the test data is possible using the released expected natural parameters and expected sufficient statistics (given as Eq. 14). Note that we do not consider protecting the privacy of the individuals in the test data.

Figure 1: Schematic of VIPS. Given the initial expected natural parameters , we compute the variational posterior over the latent variables . Since is a function of not only the expected natural parameters but also the data , we compute behind the privacy wall. Using , we then compute the expected sufficient statistics. Note that we neither perturb nor output itself. Instead, before outputting the expected sufficient statistics, we add noise to each coordinate in order to compensate for the maximum difference in caused by both and . In the M-step, we compute the variational posterior over the parameters using the perturbed expected sufficient statistics . Using , we compute the expected natural parameters , which is already perturbed since it is a function of . We continue performing these two steps until convergence.

3.2 How to compute the log moment functions?

Recall that to use the moments accountant method, we need to compute the log moment functions for each individual iteration . An iteration of VIPS randomly subsamples a fraction of the dataset and uses it to compute a gradient which is then perturbed via the Gaussian mechanism with variance . How the log moment function is computed depends on the sensitivity of the gradient as well as the underlying mechanism; we discuss each of these aspects next.

Sensitivity analysis

Suppose there are two neighbouring datasets and , where differs from in a single entry (i.e., by replacing one entry of ).

We denote the vector of expected sufficient statistics by where each expected sufficient statistic is given by


When computing the sensitivity of the expected sufficient statistics, we assume, without loss of generality, the last entry removed from maximises the difference in the expected sufficient statistics run on the two datasets and . Under this assumption and the assumption on the likelihood, given the current expected natural parameters , all and for evaluated on the dataset are the same as and evaluated on the dataset . Hence, the sensitivity is given by


where the last line holds because the average (over the terms) of the expected sufficient statistics is less than or equal to the maximum expected sufficient statistic (recall: we assumed that the last entry maximises the difference in the expected sufficient statistics), i.e.,


As in many existing works (e.g., Chaudhuri et al., 2011; Kifer et al., 2012, among many others), we also assume that the dataset is pre-processed such that the norm of any is less than . Furthermore, we choose such that its support is bounded. Under these conditions, each coordinate of the expected sufficient statistics has limited sensitivity. We will add noise to each coordinate of the expected sufficient statistics to compensate for this bounded maximum change in the E-step.
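The norm-bounding pre-processing step mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `clip_to_norm` and the default bound `max_norm=1.0` are hypothetical, the L2 norm is assumed, and rows are rescaled so that their norm is at most the bound.

```python
import numpy as np


def clip_to_norm(X, max_norm=1.0):
    """Rescale each row of X so that its L2 norm is at most max_norm.

    Rows that already satisfy the bound are left unchanged.
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return X * scale
```

Bounding every data point in this way is what makes the per-coordinate change in the expected sufficient statistics finite when one entry of the dataset is replaced.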

Log Moment Function of the Subsampled Gaussian Mechanism

Observe that iteration of our algorithm subsamples a fraction of the dataset, computes a gradient based on this subsample, and perturbs it using the Gaussian mechanism with variance . To simplify the privacy calculations, we follow Abadi et al. (2016) and assume that each example in the dataset is included in a minibatch according to an independent coin flip with probability . This differs slightly from the standard approach for stochastic gradient methods, in which a fixed minibatch size is typically used in each iteration; for ease of implementation, we use minibatches of fixed size in our experiments.
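The two subsampling schemes contrasted above can be sketched side by side. The function names `poisson_subsample` and `fixed_size_subsample` and the argument names `n_data`, `q`, and `batch_size` are illustrative assumptions; `q` stands for the inclusion probability used in the privacy analysis.

```python
import numpy as np


def poisson_subsample(n_data, q, rng):
    """Include each index independently with probability q (analysis setting)."""
    return np.flatnonzero(rng.random(n_data) < q)


def fixed_size_subsample(n_data, batch_size, rng):
    """Draw exactly batch_size distinct indices (implementation setting)."""
    return rng.choice(n_data, size=batch_size, replace=False)
```

With independent coin flips the mini-batch size is random with mean `q * n_data`, whereas the fixed-size scheme always returns the same number of indices.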

From Proposition 1.6 in Bun and Steinke (2016) along with simple algebra, the log moment function of the Gaussian Mechanism applied to a query with L2-sensitivity is . To compute the log moment function for the subsampled Gaussian Mechanism, we follow Abadi et al. (2016). Let and be the densities and , and let be the mixture density; then, the log moment function at is where and . and can be numerically calculated for any , and we maintain the log moments over a grid of values.

Finally, note that our algorithms are run for a prespecified number of iterations, and with a prespecified ; this ensures that the moments accountant analysis is correct, and we do not need a data-dependent adaptive analysis such as in Rogers et al. (2016).
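The numerical calculation of the log moments can be sketched as below. This is a hedged illustration rather than the authors' implementation. It assumes the construction of Abadi et al. (2016): with the noise standard deviation `sigma` expressed relative to the sensitivity, the two densities are N(0, sigma^2) and N(1, sigma^2), the mixture puts weight `q` on the second one, and the log moment is the log of the larger of the two expectations. The closed form `lam * (lam + 1) / (2 * sigma**2)` for the plain Gaussian mechanism, the integration grid, and all function names are assumptions made for this sketch.

```python
import numpy as np


def log_moment_gaussian(lam, sigma):
    """Log moment of the Gaussian mechanism (sigma is relative to sensitivity)."""
    return lam * (lam + 1) / (2.0 * sigma ** 2)


def _log_normal_pdf(z, mean, sigma):
    return -0.5 * ((z - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))


def log_moment_subsampled_gaussian(lam, q, sigma, n_grid=200001):
    """Numerically integrate the two expectations on a wide, fine grid."""
    z = np.linspace(-40.0 * sigma, 40.0 * sigma + 1.0, n_grid)
    dz = z[1] - z[0]
    log_mu0 = _log_normal_pdf(z, 0.0, sigma)
    log_mu1 = _log_normal_pdf(z, 1.0, sigma)
    with np.errstate(divide="ignore"):  # q = 0 or q = 1 gives log(0) = -inf
        log_mu = np.logaddexp(np.log1p(-q) + log_mu0, np.log(q) + log_mu1)
    # E1 = E_{z ~ mu0}[(mu0 / mu)^lam],  E2 = E_{z ~ mu}[(mu / mu0)^lam]
    e1 = np.sum(np.exp(log_mu0 + lam * (log_mu0 - log_mu))) * dz
    e2 = np.sum(np.exp(log_mu + lam * (log_mu - log_mu0))) * dz
    return float(np.log(max(e1, e2)))


def epsilon_from_log_moments(total_log_moments, lams, delta):
    """Smallest epsilon over the grid: (sum of log moments + log(1/delta)) / lam."""
    total = np.asarray(total_log_moments, dtype=float)
    lams = np.asarray(lams, dtype=float)
    return float(np.min((total + np.log(1.0 / delta)) / lams))
```

A typical use is to sum `log_moment_subsampled_gaussian` over all iterations for each value on a small grid of `lam`, then pass the totals to `epsilon_from_log_moments` together with the target `delta`. When `q = 1` the subsampled value coincides with the plain Gaussian value, which gives a quick sanity check.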

Algorithm 2 summarises our VIPS algorithm, which performs differentially private stochastic variational Bayes for CE family models.

Input:  Data . Define , noise variance , mini-batch size , and maximum iterations .
Output:  Perturbed expected natural parameters and expected sufficient statistics .
  Compute the L2-sensitivity of the expected sufficient statistics.
  for  do
     (1) E-step: Given the expected natural parameters , compute for . Perturb each coordinate of by adding noise, and output . Update the log moment functions.
     (2) M-step: Given , compute by . Set . Output the expected natural parameters .
  end for
Algorithm 2 Private VIPS for CE family distributions

4 VIPS for latent Dirichlet allocation

Here, we illustrate how to use the general framework of VIPS for the CE family, using latent Dirichlet allocation (LDA) as an example.

4.1 Model specifics in LDA

The most widely used topic model is Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Its generative process is given by

  • Draw topics Dirichlet , for , where is a scalar hyperparameter.

  • For each document

    • Draw topic proportions Dirichlet , where is a scalar hyperparameter.

    • For each word

      • Draw topic assignments Discrete

      • Draw word Discrete

where each observed word is represented by an indicator vector (th word in the th document) of length , and where is the number of terms in a fixed vocabulary set. The topic assignment latent variable is also an indicator vector of length , where is the number of topics.
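The generative process listed above can be simulated with a few lines of NumPy. This is a sketch under assumptions: the hyperparameter names `eta` (for the topics) and `alpha` (for the topic proportions) are chosen here for illustration, every document is given the same length, and words are stored as integer vocabulary indices rather than indicator vectors.

```python
import numpy as np


def sample_lda_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, eta, rng):
    """Draw topics, then for each document draw proportions, assignments, words."""
    # One distribution over the vocabulary per topic (n_topics x vocab_size).
    topics = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)
    docs = []
    for _ in range(n_docs):
        # Topic proportions for this document.
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # Topic assignment for each word position.
        z = rng.choice(n_topics, size=doc_len, p=theta)
        # Each word is drawn from the topic it was assigned to.
        words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
        docs.append(words)
    return topics, docs
```

Converting a word index `w` to the indicator-vector representation used in the text amounts to `np.eye(vocab_size)[w]`.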

The LDA model falls in the CE family, viewing and as two types of latent variables: , and as model parameters . The conditions for CE are satisfied: (1) the complete-data likelihood per document is in the exponential family:



; and (2) we have a conjugate prior over



for . For simplicity, we assume hyperparameters and are set manually.

Under the LDA model, we assume the variational posteriors are given by

  • Discrete : , with variational parameters for capturing the posterior topic assignment,

  • Dirichlet : ,

where these two distributions are computed in the E-step behind the privacy wall. The expected sufficient statistics are due to Eq. 20. Then, in the M-step, we compute the posterior

  • Dirichlet : .

4.2 VIPS for LDA

Following the general framework of VIPS, we perturb the expected sufficient statistics for differentially private LDA. While each document has a different document length , we limit the maximum length of any document to by randomly selecting words in a document if the number of words in the document exceeds . We add Gaussian noise to each coordinate, and map any perturbed coordinate that becomes negative to 0:


where is the th coordinate of a vector of length : , and is the sensitivity given by


since , , , and . The resulting algorithm is summarised in Algorithm 3.

Input:  Data . Define (documents), (vocabulary), (number of topics).    Define , mini-batch size , hyperparameters , and
Output:  Privatised expected natural parameters and sufficient statistics .
  Compute the sensitivity of the expected sufficient statistics given in Eq. 4.2.
  for  do
     (1) E-step: Given expected natural parameters
     for  do
        Compute parameterised by .
        Compute parameterised by .
     end for
     Output the perturbed expected sufficient statistics , where is Gaussian noise given in Eq. 21.
     Update the log-moment functions
     (2) M-step: Given perturbed expected sufficient statistics ,
     Compute parameterised by .
     Set .
     Output expected natural parameters .
  end for
Algorithm 3 VIPS for LDA
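The perturbation used in the E-step of Algorithm 3 (add Gaussian noise to every coordinate, then map negative values to 0) can be sketched as follows. The function name `perturb_sufficient_stats` is hypothetical, and the noise scale `sigma * sensitivity` is an assumption made for this sketch, since the exact noise expression is not reproduced in the text above.

```python
import numpy as np


def perturb_sufficient_stats(s_bar, sensitivity, sigma, rng):
    """Add Gaussian noise to each coordinate and clamp negatives to zero.

    Assumption: the noise standard deviation is sigma * sensitivity.
    """
    s_bar = np.asarray(s_bar, dtype=float)
    noise = rng.normal(loc=0.0, scale=sigma * sensitivity, size=s_bar.shape)
    return np.maximum(s_bar + noise, 0.0)
```

Clamping to zero depends only on the already-noised values, so it is data-independent post-processing and does not change the privacy guarantee.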

Figure 2: Per-word-perplexity as a function of the number of documents seen (up to ). Data: documents randomly selected from Wikipedia. We obtained the total privacy loss, by varying values of , given a mini-batch size and the number of iterations . Here, we fixed the total tolerance to . The non-private LDA (black trace) achieves the lowest perplexity. For the same level of total privacy loss, the moments accountant composition (red trace) performs better than the strong composition (blue trace). Regardless of which composition method is used, smaller yields a worse perplexity, due to a higher noise variance.

4.3 Experiments using Wikipedia data

We downloaded a random sample of documents from Wikipedia. We then tested our VIPS algorithm on the Wikipedia dataset with three different values of total privacy budget by varying values of , using different mini-batch sizes , until the algorithm had seen up to documents. We assumed there are topics, and we used a vocabulary set of approximately terms.

As a baseline for our moments accountant approach, we used the strong composition (Theorem 3.20 of Dwork and Roth (2014)), which follows from bounding the max divergence of the privacy loss random variable by a total budget that includes a slack variable , and which yields -DP.
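For reference, the strong composition bound of Theorem 3.20 in Dwork and Roth (2014) can be evaluated directly. The sketch below assumes every iteration is (eps_step, delta_step)-DP; the function name `strong_composition` and the argument names, including `delta_slack` for the slack variable, are illustrative.

```python
import math


def strong_composition(eps_step, delta_step, n_steps, delta_slack):
    """Total (epsilon, delta) after n_steps adaptive compositions.

    epsilon_total = sqrt(2 k ln(1/delta_slack)) * eps + k * eps * (exp(eps) - 1)
    delta_total   = k * delta_step + delta_slack
    """
    eps_total = (
        math.sqrt(2.0 * n_steps * math.log(1.0 / delta_slack)) * eps_step
        + n_steps * eps_step * (math.exp(eps_step) - 1.0)
    )
    delta_total = n_steps * delta_step + delta_slack
    return eps_total, delta_total
```

To meet a fixed total budget, one would search for the largest per-step `eps_step` whose composed value stays below the target, and then set the per-iteration noise level accordingly.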

As our evaluation metric, we compute an upper bound on the perplexity on held-out documents. Perplexity is an information-theoretic measure of the predictive performance of probabilistic models which is commonly used in the context of language modeling (Jelinek et al., 1977). The perplexity of a probabilistic model $\hat{p}(x)$ on a test set of $N$ data points $x_i$ (e.g. words in a corpus) is defined as

$$ b^{-\frac{1}{N} \sum_{i=1}^{N} \log_b \hat{p}(x_i)}, \tag{23} $$

where $b$ is generally either $2$ or $e$, corresponding to a measurement based on either bits or nats, respectively. We can interpret $-\frac{1}{N} \sum_{i=1}^{N} \log_b \hat{p}(x_i)$ as the cross-entropy between the model and the empirical distribution. This is the expected number of bits (nats) needed to encode a data point from the empirical data distribution (i.e. a word in our case), under an optimal code based on the model. Perplexity is $b$ to the power of the cross entropy, which converts the number of bits (nats) in the encoding to the number of possible values in an encoding of that length (supposing the cross entropy were integer valued). Thus, perplexity measures the effective vocabulary size when using the model to encode the data, which is understood as a reflection of how confused (i.e. “perplexed”) the model is. In our case, $\hat{p}(x)$ is the posterior predictive distribution under our variational approximation to the posterior, which is intractable to compute. Following Hoffman et al. (2010), we approximate perplexity based on the learned variational distribution, measured in nats, by plugging the ELBO into Equation 23, which results in an upper bound:

where is a vector of word counts for the th document, . In the above, we use the that was calculated during training. We compute the posteriors over and by performing the first step in our algorithm using the test data and the perturbed sufficient statistics we obtain during training. We used the Python implementation by the authors of (Hoffman et al., 2010) in our experiments. The per-word-perplexity is shown in Fig. 2.
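The basic perplexity computation described above, the base raised to the power of the cross-entropy, can be sketched as follows. This illustrates the definition only, not the ELBO-based upper bound used in the experiments; the function name `perplexity` and its arguments are assumptions, and the inputs are taken to be natural-log probabilities of the test words under the model.

```python
import numpy as np


def perplexity(log_probs, base=np.e):
    """Perplexity from per-word natural-log probabilities.

    The cross-entropy is converted to the requested base (2 for bits,
    e for nats) and the base is then raised to that power.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    cross_entropy = -np.mean(log_probs) / np.log(base)
    return float(base ** cross_entropy)
```

A model that assigns uniform probability over a vocabulary of 50 words has perplexity 50 regardless of the base, which matches the reading of perplexity as an effective vocabulary size.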

In Table 1, we show the top 10 words in terms of assigned probabilities under a chosen topic in each method. We show 3 topics as examples. Non-private LDA results in the most coherent words among all the methods. For the private LDA models with a total privacy budget (), as we move from moments accountant to strong composition, the amount of noise added gets larger, and therefore the topics become less coherent.

| Non-private | | Moments Acc | | Strong Comp | |
|---|---|---|---|---|---|
| topic 74: | | topic 74: | | topic 74: | |
| character | 0.1034 | united | 0.0452 | united | 0.0244 |
| james | 0.0835 | states | 0.0247 | song | 0.0204 |
| thomas | 0.0713 | county | 0.0203 | name | 0.0139 |
| tom | 0.0623 | name | 0.0171 | new | 0.0131 |
| series | 0.0615 | new | 0.0165 | states | 0.0130 |
| name | 0.0305 | refer | 0.0125 | refer | 0.0127 |
| main | 0.0191 | american | 0.0123 | series | 0.0124 |
| city | 0.0161 | james | 0.0122 | county | 0.0119 |
| gordon | 0.0131 | thomas | 0.0120 | american | 0.0102 |
| people | 0.0122 | south | 0.0110 | george | 0.0094 |
| topic 75: | | topic 75: | | topic 75: | |
| state | 0.0208 | state | 0.0246 | priest | 0.0002 |
| history | 0.0189 | government | 0.0123 | knees | 0.0002 |
| government | 0.0147 | states | 0.0121 | sympathies | 0.0002 |
| states | 0.0111 | law | 0.0100 | egypt | 0.0002 |
| law | 0.0085 | international | 0.0076 | email | 0.0002 |
| political | 0.0078 | country | 0.0071 | supplies | 0.0002 |
| national | 0.0075 | united | 0.0065 | pilot | 0.0002 |
| university | 0.0073 | population | 0.0064 | serene | 0.0002 |
| population | 0.0072 | political | 0.0063 | rocky | 0.0002 |
| century | 0.0071 | people | 0.0062 | sex | 0.0002 |
| topic 76: | | topic 76: | | topic 76: | |
| roman | 0.1119 | title | 0.0274 | knight | 0.0041 |
| chinese | 0.0511 | order | 0.0123 | sir | 0.0015 |
| empire | 0.0384 | rank | 0.0095 | knights | 0.0015 |
| china | 0.0372 | officer | 0.0067 | grand | 0.0008 |
| emperor | 0.0350 | knight | 0.0066 | sire | 0.0006 |
| ancient | 0.0301 | titles | 0.0065 | commander | 0.0006 |
| rome | 0.0241 | sir | 0.0062 | honours | 0.0005 |
| julian | 0.0220 | grand | 0.0061 | maam | 0.0004 |
| imperial | 0.0195 | royal | 0.0059 | addressed | 0.0003 |
| han | 0.0175 | chief | 0.0054 | accompanying | 0.0003 |

Table 1: Posterior topics from private () and non-private LDA

Using the same data, we then tested how perplexity changes with the mini-batch size. As shown in Fig. 3, a smaller mini-batch size is more beneficial, since it effectively decreases the amount of noise that needs to be added. The moments accountant composition yields a lower perplexity than the strong composition.

Figure 3: Per-word-perplexity with different mini-batch sizes . Top: mini-batch size and total number of iterations . Bottom: mini-batch size and total number of iterations . In the private LDA, a smaller mini-batch size achieves lower perplexity. Regardless of the mini-batch size, the moments accountant composition (red trace) achieves a lower perplexity than the strong composition (blue trace).

5 VIPS for non-CE family

Under non-CE family models, the complete-data likelihood typically has the following form:


which includes some function that cannot be split into two functions, where one is a function of only and the other is a function of only