1 Introduction
Data on all aspects of our daily lives, such as behavioural, health and financial data, are increasingly collected, stored and analyzed by corporations and government agencies, and there is a dire need for developing machine learning tools that can analyze these data while still guaranteeing the privacy of individuals. Much progress has been made recently in developing privacypreserving methods
[1, 2] and differential privacy, in particular, is emerging as the dominant notion of algorithmic privacy [1].In this paper, we derive differentially private variants of the expectation maximization (EM) algorithm which has been widely used to solve statistical problems in many areas of science including bioinformatics [3], neuroscience [4]
, and computer vision
[5]. Expectation maximization iteratively estimates the parameters of models with unobserved variables. We present a very general privacypreserving EM algorithm which can be used for any model with a completedata likelihood in the exponential family. We then apply our algorithm to the mixture of Gaussians (MoG) density estimation model and the factor analysis (FA) model. Having access to a private density estimator is particularly valuable because it provides a means to anonymize the data in a principled way, by simply sampling a dataset from the model and replacing the original data with this sampled data.Since differentially private machine learning algorithms usually achieve privacy by adding noise to perturb the output of the algorithm or its intermediate stages, the main challenge in developing privacypreserving algorithms is in controlling the associated loss in statistical efficiency or utility per sample. This problem is particularly exacerbated for iterative algorithms such as EM. For example, recent work on the
means algorithm, a variant of EM for mixture of Gaussians, requires adding noise to the parameters where the noise standard deviation is on the order of the input dimension
times the number of iterations [6], which may necessitate early termination. To avoid this, more recent work proposes to apply a standard means clustering algorithm to a privatized synopsis of the data [7]. Their synopsis generation method consists of putting rectangular bounding boxes in the data space and counting how many data points are in each box. However, this method applies mainly to the clustering task and for lowdimensional data.Instead, we propose to resolve the privacyutility dilemma using two key innovations: a private EM formulation based on moment perturbation for sensible use of the privacy budget per iteration, and recently proposed composition methods to improve the privacy cost across many iterations. Our moment perturbation approach is applicable for any model in which the completedata likelihood is in the exponential family. In such cases, the EM parameters are functions of moments of latent and observed variables, which we perturb for privacy. Moment perturbation for differentially private estimators is not a new concept (see [8, 9]). However, unlike [9], we do not require subsampling of the data.
Furthermore, our algorithm calculates the cumulative privacy cost using two refined composition methods, the moments accountant and zCDP. The moments accountant [10] bounds the moments of the privacy loss random variable. Inspired by CDP [11], zCDP [12] formulates the moments of the privacy loss random variable in terms of the Rényi divergence between the output distributions obtained by running an algorithm on two datasets that differ in the value of a single individual. In both cases, the moments bound yields a tighter tail bound, and consequently, for a given total privacy budget, allows for a higher periteration budget than standard methods. Our experimental results show that by combining our moment perturbation formulation of privacypreserving EM with refined composition methods, we obtain a practical and effective algorithm for privately estimating the parameters of latent variable models.
We start by reviewing differential privacy, the moments accountant, and EM in Sec. 2. In Sec. 3, we introduce our general DPEM framework. We then derive the DPEM algorithm for mixture of Gaussians in Sec. 4. In Sec. 5, we construct the MA and zCDP formulation for EM under MoGs. In Sec. 6, we provide the DPEM algorithm for factor analysis, and we illustrate the effectiveness of our algorithms in Sec. 7.
2 Background
In this section, we provide background information on the definitions of algorithmic privacy that we use, the MA and zCDP formulations which provide a refined privacy analysis, as well as the general EM algorithm.
Differential privacy.
Differential privacy (DP) is a formal definition of the privacy properties of data analysis algorithms [1]. Given an algorithm and datasets , differing by a single entry, the privacy loss random variable of an outcome is
(1) 
is DP if and only if
. Intuitively, the definition states that the output probabilities must not change very much when a single individual’s data is modified, thereby limiting the amount of information that the algorithm reveals about any one individual. An approximate version is (
)DP, defined to hold if and only if , with probability at least .Concentrated differential privacy.
Concentrated differential privacy (CDP) is a recently proposed relaxation of differential privacy which aims to make privacypreserving iterative algorithms more practical than for DP while still providing strong privacy guarantees. There are two variants of CDP. First, in mCDP [11], subtracted by its mean is subgaussian with standard deviation : . Second, in zCDP [12], that arises from a connection between the moment generating function of and the Rényi divergence between the distributions of and that of , we require: , where the Rényi divergence is denoted by . Observe that in this case is also subgaussian but zeromean. In zCDP, composition is straightfoward since the Rényi divergence between two product distributions is simply the sum of the Rényi divergences of the marginals.
We will use zCDP rather than mCDP, since many DP and approximate DP mechanisms can be characterised in terms of zCDP, but not in terms of mCDP without a large loss in privacy parameters. This correspondence will allow us to use zCDP as a tool for analyzing composition under the DP privacy definition, for a fair comparison between CDP and DP analyses.^{1}^{1}1See Sec. 4 in [12] for a detailed explanation.
Moments accountant.
The moments accountant calculates a privacy budget by bounding the moments of , where the th moment is defined as the log of the moment generating function evaluated at [10]:
(2) 
The worst case over all the neighbouring databases is defined as ^{2}^{2}2The form of is determined by the mechanism.
Using Markov’s inequality, for any , the th moment is converted to the ()DP guarantee by^{3}^{3}3See Appendix A in [10] for the proof.
(3) 
The th moment in Eq (2) composes linearly, which yields the composability theorem (Theorem 2.1 in [10]). An immediate result from the composibility theorem is that the sum of each upper bound on is an upper bound on the total th moment after compositions,
(4) 
The general EM algorithm.
Given i.i.d. observations , with each observation , and hidden variables
, computing the maximum likelihood estimator of a vector of model parameters
is analytically intractable, due to the integral or summation inside the logarithm,(5) 
Instead, one can lowerbound using the posterior distribution over latent variables [13],
(6) 
where the lower bound is often called free energy [14], , where is the entropy of . EM alternates between: (1) the Estep: optimizing wrt distribution over unobserved variables holding parameters fixed
(7) 
and (2) the Mstep: maximizing wrt parameters holding the latent distribution fixed
(8) 
where since does not directly depend on .
To understand what EM does, one can rewrite the free energy in terms of the loglikelihood and the KL divergence terms, During the Estep, we set , which makes the second term zero and the free energy equals the likelihood. Then, in the Mstep, we get the maximum likelihood estimate (MLE). For the maximum a posteriori (MAP) estimate, we add the log prior for the parameters to the right hand side of Eq (8).
3 The general DPEM algorithm
The EM algorithm is frequently used for models whose joint distribution over observed and unobserved variables remains in the exponential family:
, while the marginal does not. In this case, the free energy can be rewritten as(9) 
where is some constant wrt , and . In the Estep, we compute the expected sufficient statistics under , i.e., . Then, in the Mstep, we compute partial derivatives wrt each parameter,
Although it is not straightforward to derive a closedform expression for each parameter update due to the dependence on other parameters in , it is easy to see that each parameter update depends on each expected sufficient statistics, i.e., moments, denoted by . So, to output privatized parameters, all we need is to perturb the moments to compensate any single data point’s change. The sensitivity of the expected sufficient statistics is given by
(10)  
where the last line is due to the triangle inequality. The expectation over can be rewritten as an inner product, and using Hölder’s inequality: where and is maximum over all . As in many existing works (e.g., [15, 16] among many others), we also assume that datasets are preprocessed such that the norm of any is less than , meaning that any stays within a unit ball. Furthermore, we assume that has a bounded support of denoted by . Under these assumptions, the sensitivity is given by Using this sensitivity, we add noise to each moment and the perturbed moments are mapped by a modelspecific deterministic function to the vector of privatized parameters, given as where are perturbed moments. Using this general framework, we derive the differentially private EM algorithm for mixture of Gaussians and factor analysis in the following.
4 DPEM for mixture of Gaussians
4.1 EM for Mixture of Gaussians
We consider the mixture of Gaussians (MoG) model as a first example to derive the DPEM algorithm. For Gaussians and data points , the loglikelihood under MoG is given by where . We denote the parameters by .
Introducing a binary vector of length for each data point, , to represent the membership to which Gaussian each datapoint belongs, e.g., and , the distribution over each is given by and the distribution over all unobserved variables is given by The joint distribution over observed and unobserved variables, which is in the exponential family, is given by In the Estep, we compute the responsibilities as given the parameters from the previous iteration
(11) 
and in the Mstep, we update the parameters by
(12)  
where .
For the maximum a posteriori estimate, we impose the Dirichlet prior on
and NormalinverseWishart prior on , where the MAP estimates areIn this paper we set hyperparameters to conventional values, e.g.
, rather than optimizing them, cf. [17].Before moving to the next section, we would like to motivate why it is important to construct a privacy preserving algorithm for MoG. In Fig. 1, we show that if one runs the EM algorithm for the given dataset, an individual’s information can be easily revealed by just looking at the EM parameters, while the noisedup parameters obtained by the method, which will be described next, protect private information effectively.^{4}^{4}4We first preprocessed the data by scaling down the magnitude with the maximum L2 norm of the data points, and then added noise to each parameter following the derivations in Sec. 4. For visualisation, we map the results back to the original latitude/longtitude space.
4.2 DPEM for MoG
Under MoG, we plug in the responsibilities given in Eq (4.1) to the parameter update expressions given in Eq (12). We then perturb each of these by taking into account one datapoint’s worstcase difference between two neighboring datasets. We use to denote a privacy budget allocated per iteration.
DP or DP mixing coefficients.
For two neighbouring datasets with a single data point difference, the maximum difference in occurs when the data point is assigned to the th Gaussian with and the altered data point is assigned to another, e.g., the th Gaussian, with . Hence, we get the following sensitivity:
(13) 
since and . We add noise to compensate the maximum difference^{5}^{5}5To ensure , we set , if , and , if . Then, we renormalize after the projection to ensure .
(14) 
where or with . For , we do not need any additional sensitivity analysis, since the MAP estimate is a deterministic mapping of the MLE.
DP or DP mean parameters.
Using the noisedup obtained from the noisedup mixing coefficients, i.e., , the maximum difference in mean parameters due to one datapoint’s difference is
(15) 
where and the L1 term is bounded by Eq (10). The term is from the fact that each input vector is L2norm bounded by 1.^{6}^{6}6 . We add noise to the MLE via^{7}^{7}7The MAP estimate only differs from the MLE in the denominator. Hence, we simply replace with in Eq (4.2) in the MAP estimation case.
(16) 
where or with , where .
DP covariance parameters.
For covariance perturbation, we follow the Analyze Gauss (AG) algorithm [18], which provides DP. We first draw Gaussian random variables
(17) 
where and the sensitivity of the covariance matrix ^{8}^{8}8 The MAP estimate only differs from the MLE in the denominator. We replace with in Eq (4.2) in the MAP estimation case. in Frobenius norm is given by
(18) 
where , and . Using , we construct a upper triangular matrix (including diagonal), then copy the upper part to the lower part so that the resulting matrix becomes symmetric. Then, we add this noisy matrix to the covariance matrix
(19) 
The perturbed covariance might not be positive definite. In such case, we project the negative eigenvalues to some value near zero to maintain positive definiteness of the covariance matrix.
Combinations of the perturbations.
Among all the possible combinations of these parameter perturbation mechanisms, we focus on two scenarios. Scenario 1 (which we call LLG) uses the DP Laplace mechanism for perturbing mixing coefficients (once) and mean parameters (K times) and the DP Gaussian mechanism for perturbing the covariance parameters (K times). Since there are Gaussians, for iterations, there will be compositions of DP mechanism and compositions of DP mechanisms in total in this scenario. Scenario 2 (which we call GGG) uses the DP Gaussian mechanism for perturbing all the parameters. For iterations, there will be compositions of DP mechanism in total in this scenario.
5 Compositions for DPEM for MoGs
Before describing our method, we first describe the two baseline methods. First, in Linear (Lin) composition (Theorem 3.16 [1]), privacy degrades linearly with the number of iterations. This result is from the Max Divergence of the privacy loss random variable being bounded by a total budget. Hence, the linear composition yields (, )DP under scenario and (, )DP under scenario . Second, Advanced (Adv) composition (Theorem 3.20 [1]), resulting from the Max Divergence of the privacy loss random variable being bounded by a total budget including a slack variable , yields DP under scenario and DP under scenario .
Our method calculates the periteration budget using the two composition methods below.
zCDP composition (zCDP).
zCDP composition yields DP, where
under scenario and
under scenario , for sensitivity and , , and , where .
These results are obtained by using the following results in [12]: Proposition 1.4. If satisfies DP, then satisfies zCDP; Proposition 1.6. Gaussian mechanism satisfies zCDP, where is a sensitivity; Lemma 1.7. If two mechanisms satisfy zCDP and zCDP, respectively, then their composition satisfies zCDP; and Proposition 1.3. If provides zCDP, then is DP for any .
Moments Accountant composition (MA).
For using MA, as a first step, we identify the form of privacy loss random variable and its th moment in each mechanism we use. For DP Laplace mechanism outputting and , at has the following form:
Following the definition in Eq (2), the th moment is given by
(20) 
For DP Gaussian mechanism with noise magnitude and , at is The th moment is then
(21) 
Note that multidimensional Laplace/Gaussian mechanisms also have the same form of the th moment as the scalar version. See the Supplementary material for the derivation.
For achieving ()DP, the tail bound is given by under scenario ; and under scenario . Under each case, we calculate satisfying the tail bound with the fixed budget . Algorithm 1 summarizes our method.
6 DPEM for Factor Analysis
Under FA, the conditional distributions over observed variables are assumed to be Gaussian, , and the prior over latent variables is also assumed to be Gaussian: .
In this case, the completedata likelihood is proportional to , where is a vectorized version of the concatenated matrix , , and where the sufficient statistics are also a vectorized version of a concatenated matrix
Due to conjugacy the posterior over is also Gaussian, where the first and second moments are given by and . The expected sufficient statistics become a function of the data second moment matrix, denoted by ,
For privacypreserving EM, we perturb by Analyze Gauss [18], resulting in a perturbed matrix , which we use when updating the parameters by
until convergence, at no extra privacy cost. Therefore, unlike MoGs, FA only requires perturbing the data second moment matrix once for privacy preservation. The EM iterations are then postprocessing steps which are free from cumulative differential privacy loss.
7 Experiments
We used four realworld datasets to test our algorithm. In all datasets, we preprocessed the data such that the input vectors had maximum norm 1.
Stroke dataset was used in [19] for predicting the occurrence of a stroke within a year after an atrial fibrillation diagnosis. We used principal components () of raw features (conditions and medicines) recorded from patients, by assuming that the private database was given in this form. We divided the extracted dataset into 10 different pairs of training () and test sets (), and reported the average test loglikelihood per datapoint across the 10 independent trials in Fig. 2, setting .
Overall, the GGG scenario yielded higher test loglikelihoods than the LLG scenario, so we focused on this method in our experiments. We found that using the zCDP and MA compositions resulted in more accurate estimates, while also requiring less privacy budget, compared to other compositions. zCDP performed better than MA with a small privacy budget , but they both performed similarly well with a larger budget. The difference with small may be due to only searching over integer values of for MA, which we do for computational reasons, following [10].
Life Science dataset is from the UCI repository [20]. The dataset contains 26,733 records, consisting of 10 principal components from chemistry and biology experiments (). Following other approaches (e.g., [21]), we set . We divided the dataset into 10 different pairs of training () and test sets (), and reported the average test loglikelihood per data point across the 10 independent trials in Fig. 3. In this experiment, we focused on scenario . Using the zCDP and MA compositions once again resulted in more accurate estimates while requiring less privacy budget than linear and advanced compositions.
Gowalla dataset contains the social network’s users’ checkin locations in terms of longitude and latitude (). The total number of data points is 1,256,384, which we divided into crossvalidation sets. We then performed means clustering and compared our method to a differentially private means clustering algorithm, DPLloyd [6]. The standard Lloyd algorithm for means clustering first partitions the data into clusters, with each point assigned to be in the same cluster as the nearest centroid, and then updates each centroid to be the center of the data points in the cluster. As summarized in [7], the DPLloyd adds noise to the updated centroids. Specifically, the Laplace noise is added to the number of data points assigned to each cluster as well as to the sum of each coordinate of the data points assigned to each cluster. Hence, the sensitivity becomes . In the original DPLloyd algorithm, due to the conventional composition theorem for DP, their noise distribution follows Lap for iterations. We also tested the DPLloyd algorithm with zCDP compositions, which resulted in better performance in terms of normalized intracluster variance (NICV) across the test sets. Our algorithm for means clustering also perturbs the centroids by adding the Laplace noise with zCDP composition, where the sensitivity of the mean locations is given in Eq (4.2). We set and for both algorithms. As shown in Fig. 4, our method achieves smaller NICV than DPLloyd, even with a very small value of .
Olivetti Faces dataset is used to illustrate our private factor analysis method^{9}^{9}9We obtained the dataset from http://scikitlearn.org/, but the dataset is originally from Laboratories Cambridge.. The dataset consists of ten different images for each of 40 distinct subjects (), where each image is 64 by 64, resulting in features (. Each pixel is a floating point value on the interval . Each image was treated as a datapoint, rather than each subject, though this could readily be done via group privacy [1]. We set the latent dimension to . We tested nonprivate EM, DPEM with and (fixing ), and showed each column of the estimated loading matrix in Fig. 5. With (bottom) the components were noisy, but with (middle) the FA components’ faces were nearly as recognizable as for the nonprivate FA algorithm (top), thereby accurately recovering a set of typical faces in the dataset.
8 Conclusion
We have developed a practical algorithm that outputs accurate and privatized EM parameters based on moment perturbation under the MA and zCDP composition analyses, which effectively decrease the amount of additive noise for the same expected privacy guarantee compared to the standard analysis. We illustrated the effectiveness of our algorithm on four datasets. Based on our results, we recommend the use of zCDP composition analysis for EM, since it performed better than MA in some regimes and is easier to compute. Furthermore, we found that the GGG combination performed better than LLG under these composition methods in the context of EM, which perhaps makes sense since the zCDP and MA compositions are tailored to the Gaussian mechanism.
The private EM algorithms for the mixture of Gaussians and factor analysis models we discussed in this paper are clearly only two examples of a much broader class of models to which our private EM framework applies. Our positive empirical results with EM strongly suggest that these ideas are likely to be beneficial for privatizing many other iterative machine learning algorithms. In future work, we plan to apply this general framework to other inference methods. This fits our broader vision that practical privacy preserving machine learning algorithms will have an increasingly relevant role to play in our field.
References
 [1] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9:211–407, August 2014.
 [2] Anand D Sarwate and Kamalika Chaudhuri. Signal processing and machine learning with differential privacy: Algorithms and challenges for continuous data. IEEE signal processing magazine, 30(5):86–94, 2013.
 [3] Timothy L. Bailey and Charles Elkan. Fitting a mixture model by expectation maximization to discover motifs in bipolymers. Technical report, Department of Computer Science and Engineering, University of California, San Diego, 1994.
 [4] Yongyue Zhang, Michael Brady, and Stephen Smith. Segmentation of brain MR images through a hidden Markov random field model and the expectationmaximization algorithm. Medical Imaging, IEEE Transactions on, 20(1):45–57, 2001.
 [5] Chad Carson, Serge Belongie, Hayit Greenspan, and Jitendra Malik. Blobworld: Image segmentation using expectationmaximization and its application to image querying. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(8):1026–1038, 2002.
 [6] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: The SuLQ framework. In Proceedings of the Twentyfourth ACM SIGMODSIGACTSIGART Symposium on Principles of Database Systems, PODS ’05, pages 128–138, New York, NY, USA, 2005. ACM.
 [7] Dong Su, Jianneng Cao, Ninghui Li, Elisa Bertino, and Hongxia Jin. Differentially private kmeans clustering. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, CODASPY ’16, pages 26–37, New York, NY, USA, 2016. ACM.

[8]
J. R. Foulds, J. Geumlek, M. Welling, and K. Chaudhuri.
On the theory and practice of privacypreserving Bayesian data
analysis.
In
Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI)
, 2016.  [9] Adam D. Smith. Efficient, differentially private point estimators. CoRR, abs/0809.4794, 2008.
 [10] M. Abadi, A. Chu, I. Goodfellow, H. Brendan McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. ArXiv eprints, July 2016.
 [11] C. Dwork and G. N. Rothblum. Concentrated differential privacy. ArXiv eprints, March 2016.
 [12] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. CoRR, abs/1605.02065, 2016.
 [13] Radford M. Neal and Geoffrey E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Kluwer Academic Publishers, 1998.
 [14] R. P. Feynman. Statistical Mechanics: A Set of Lectures. Perseus, 1972.
 [15] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. J. Mach. Learn. Res., 12:1069–1109, July 2011.
 [16] Daniel Kifer, Adam Smith, Abhradeep Thakurta, Shie Mannor, Nathan Srebro, and Robert C. Williamson. Private convex empirical risk minimization and highdimensional regression. In In COLT, pages 94–103, 2012.
 [17] C. M. Bishop. Pattern recognition and machine learning. Springer New York:, 2006.

[18]
Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang.
Analyze Gauss: optimal bounds for privacypreserving principal component analysis.
InSymposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31  June 03, 2014
, pages 11–20, 2014. 
[19]
Benjamin Letham, Cynthia Rudin, Tyler H. McCormick, and David Madigan.
Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model.
Department of Statistics Technical Report tr608, University of Washington, 2014.  [20] M. Lichman. UCI machine learning repository, 2013.
 [21] Prashanth Mohan, Abhradeep Thakurta, Elaine Shi, Dawn Song, and David E. Culler. GUPT: privacy preserving data analysis made easy. In K. Selçuk Candan, Yi Chen, Richard T. Snodgrass, Luis Gravano, and Ariel Fuxman, editors, SIGMOD Conference, pages 349–360. ACM, 2012.
Comments
There are no comments yet.