Many signal processing and machine learning algorithms involve analyzing private or sensitive data. The outcomes of such algorithms may leak information about individuals present in the dataset. A strong and cryptographically motivated framework for protection against such information leaks is differential privacy. Differential privacy measures privacy risk in terms of the probability of identifying individual data points in a dataset from the results of computations (algorithms) performed on that data.
In several modern applications the data is distributed over different locations or sites, with each site holding a smaller number of samples. For example, consider neuroimaging analyses for mental health disorders, in which there are many individual research groups, each with a modest number of subjects. Learning meaningful population properties or efficient feature representations from high-dimensional functional magnetic resonance imaging (fMRI) data requires a large sample size. Pooling the data at a central location may enable efficient feature learning, but privacy concerns and high communication overhead often prevent sharing the underlying data. It is therefore desirable to have efficient distributed algorithms that provide utility close to the centralized case while also preserving privacy.
This paper focuses on the singular value decomposition (SVD), or principal component analysis (PCA), and orthogonal tensor decompositions. Despite some limitations, PCA/SVD is one of the most widely used preprocessing steps in machine learning: it projects data onto a lower-dimensional subspace spanned by the singular vectors of the second-moment matrix of the data. Tensor decomposition is a powerful tool for inference algorithms because it can be used to infer complex dependencies (higher-order moments) beyond second-moment methods such as PCA. This is particularly useful in latent variable models such as mixtures of Gaussians and topic models.
Related Works. For a complete introduction to the history of tensor decompositions, see the comprehensive survey of Kolda and Bader (see also Appendix B). The CANDECOMP/PARAFAC (CP) decomposition [5, 6] and the Tucker decomposition are generalizations of the matrix SVD to multi-way arrays. While finding the decomposition of arbitrary tensors is computationally intractable, specially structured tensors appear in some latent variable models. Such tensors can be decomposed efficiently [3, 4] using a variety of approaches, such as generalizations of the power iteration. Exploiting such structure in higher-order moments to estimate the parameters of latent variable models has been studied extensively using the so-called orthogonal tensor decomposition (OTD) [3, 9, 10, 11]. To our knowledge, these decompositions have not been studied in the setting of distributed data.
Several distributed PCA algorithms [12, 13, 14, 15, 16, 17] have been proposed. Liang et al. proposed a distributed PCA scheme in which each site must send both the left and right singular vectors, along with the corresponding singular values, to the aggregator. Feldman et al. proposed an improvement upon this, in which each site sends a matrix to the aggregator. Balcan et al. proposed a further improved version using fast sparse subspace embedding and randomized SVD.
This paper proposes new privacy-preserving algorithms for distributed PCA and OTD, building upon our earlier work on distributed differentially private eigenvector calculations and centralized differentially private OTD. It improves on our preliminary work on distributed private PCA [22, 17] in terms of efficiency and fault tolerance. Wang and Anandkumar recently proposed an algorithm for differentially private tensor decomposition using a noisy version of the tensor power iteration [3, 8]. Their algorithm adds noise at each step of the iteration, and the noise variance grows with the predetermined number of iterations. They also make the restrictive assumption that the input to their algorithm is orthogonally decomposable. Our centralized OTD algorithms avoid these assumptions and achieve better empirical performance (although without theoretical guarantees). To our knowledge, this paper proposes the first differentially private orthogonal tensor decomposition algorithm for distributed settings.
Our Contribution. In this paper, we propose two new (ε, δ)-differentially private algorithms: one for distributed differentially private principal component analysis and one for distributed differentially private orthogonal tensor decomposition. The algorithms are inspired by the recently proposed correlation assisted private estimation (CAPE) protocol and by input perturbation methods for differentially private PCA [25, 26]. The CAPE protocol improves upon conventional approaches, which suffer from excessive noise, at the expense of requiring a trusted “helper” node that can generate correlated noise samples for privacy. We extend the CAPE framework to handle site-dependent sample sizes and privacy requirements. In the distributed PCA algorithm, the sites send noisy second-moment matrix estimates to a central aggregator, whereas in the distributed OTD algorithm the sites use a distributed protocol to compute a projection subspace that enables efficient private OTD. This paper is about algorithms with provable privacy guarantees and experimental validation. While asymptotic sample complexity guarantees are of theoretical interest, proving performance bounds for distributed subspace estimation is quite challenging. To validate our approach, we show that our new methods outperform previously proposed approaches, even under strong privacy constraints. For weaker privacy requirements they can sometimes achieve the same performance as the pooled-data scenario.
2 Problems Using Distributed Private Data
Notation. We denote tensors with calligraphic script (e.g., 𝒳), vectors with bold lower-case letters (e.g., x), and matrices with bold upper-case letters (e.g., X). Scalars are denoted with regular letters (e.g., M). Indices are denoted with lower-case letters and typically run from 1 to their upper-case versions (e.g., m = 1, 2, …, M). We sometimes denote the set {1, 2, …, M} as [M]. The m-th column of a matrix X is denoted x_m. ‖·‖₂ denotes the Euclidean (ℓ₂) norm of a vector and the spectral norm of a matrix; ‖·‖_F denotes the Frobenius norm and tr(·) denotes the trace operation.
Distributed Data Model. We assume that the data is distributed over S sites, where each site s holds a data matrix X_s. The data samples in the local sites are assumed to be disjoint. There is a central node that acts as an aggregator (see Figure 1). We denote by N = Σ_s N_s the total number of samples over all sites. The data matrix X_s at site s contains the D-dimensional features of N_s individuals. Without loss of generality, we assume that and . If we had all the data at the aggregator (the pooled-data scenario), then the pooled data matrix would be X = [X_1 … X_S]. Our goal is to approximate the performance of the pooled-data scenario using distributed differentially private algorithms.
Matrix and Tensor Factorizations. We first formulate the problem of distributed PCA. For simplicity, we assume that the observed samples are mean-centered. The sample second-moment matrix at site s is A_s = (1/N_s) X_s X_s^T. In the pooled-data scenario, the positive semi-definite second-moment matrix is A = (1/N) X X^T. According to the Schmidt approximation theorem, the rank-K matrix A_K that minimizes the difference ‖A − A_K‖_F can be found by taking the SVD of A as A = V Λ V^T, where, without loss of generality, we assume Λ is a diagonal matrix with entries λ_1 ≥ λ_2 ≥ … ≥ λ_D ≥ 0, and V is a matrix of eigenvectors corresponding to the eigenvalues. The top-K PCA subspace of A is the matrix V_K of the first K eigenvectors. Given V_K and the eigenvalue matrix Λ, we can form the approximation A_K = V_K Λ_K V_K^T, where Λ_K contains the K largest eigenvalues in Λ. For a matrix U with orthonormal columns, the quality of U in approximating V_K can be measured by the captured energy of A in U, q(U) = tr(U^T A U). The U that maximizes q(U) is the subspace V_K. We are interested in approximating V_K in a distributed setting while guaranteeing differential privacy.
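The captured-energy criterion is easy to check numerically. The following sketch (illustrative only; the names `captured_energy`, `A`, and `V_K` are ours, the data are synthetic) verifies that the top-K eigenvector subspace captures at least as much energy as a random orthonormal subspace:

```python
import numpy as np

def captured_energy(A, U):
    """Quality of subspace U in approximating the top-K subspace of A:
    q(U) = tr(U^T A U)."""
    return np.trace(U.T @ A @ U)

rng = np.random.default_rng(0)
D, N, K = 20, 500, 3

# Mean-centered synthetic data and its sample second-moment matrix.
X = rng.standard_normal((D, N))
X -= X.mean(axis=1, keepdims=True)
A = (X @ X.T) / N

# Top-K PCA subspace: eigenvectors of the K largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(A)       # ascending eigenvalue order
V_K = eigvecs[:, -K:]

# A random orthonormal D x K subspace for comparison.
Q, _ = np.linalg.qr(rng.standard_normal((D, K)))

assert captured_energy(A, V_K) >= captured_energy(A, Q)
```

At the maximizer, q(V_K) equals the sum of the K largest eigenvalues of A, which gives a convenient numerical check.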
Next, we describe the problem of orthogonal tensor decomposition (OTD). As mentioned before, the decomposition of arbitrary tensors is usually computationally intractable. However, some specially structured tensors that appear in several latent variable models can be efficiently decomposed using a variety of approaches, such as generalizations of the power iteration. We review some basic definitions related to tensor decomposition in Appendix B. We first formulate the problem of orthogonal decomposition of symmetric tensors and then turn to distributed OTD. Due to page limitations, two examples of OTD from Anandkumar et al., namely the single topic model (STM) and the mixture of Gaussians (MOG), are presented in Appendix D.
Let 𝒳 be an M-way, D-dimensional symmetric tensor. Given real-valued vectors v_k, Comon et al. showed that there exists a decomposition of the form 𝒳 = Σ_k λ_k v_k ⊗ v_k ⊗ … ⊗ v_k, where ⊗ denotes the outer product. Without loss of generality, we can assume that ‖v_k‖₂ = 1. If we can find a matrix V = [v_1 … v_K] with orthogonal columns, then we say that 𝒳 has an orthogonal symmetric tensor decomposition. Such tensors are generated in several applications involving latent variable models. Recall that if M is a symmetric rank-K matrix, then the SVD of M is given by M = Σ_k σ_k u_k u_k^T, where σ_k is the k-th singular value and u_k is the k-th column of the orthogonal matrix U. As mentioned before, the orthogonal decomposition of a 3rd-order symmetric tensor 𝒳 is a collection of orthonormal vectors v_k together with corresponding positive scalars λ_k such that 𝒳 = Σ_k λ_k v_k ⊗ v_k ⊗ v_k. Now, in a setting where the data samples are distributed over different sites, we may have local approximations of 𝒳. We intend to use these local approximations from all sites to find better and more accurate estimates of the pairs (v_k, λ_k), while preserving privacy.
Differential Privacy. An algorithm 𝒜(𝔻) taking values in a set 𝕋 provides (ε, δ)-differential privacy if Pr[𝒜(𝔻) ∈ 𝕊] ≤ exp(ε) Pr[𝒜(𝔻′) ∈ 𝕊] + δ for all measurable 𝕊 ⊆ 𝕋 and all data sets 𝔻 and 𝔻′ differing in a single entry (neighboring datasets). This definition essentially states that the probability of the output of an algorithm does not change significantly if the corresponding database input is changed by just one entry. Here, ε and δ are privacy parameters, where lower ε and δ ensure more privacy. The parameter δ can be interpreted as the probability that the algorithm fails. For more details, see recent surveys or the monograph of Dwork and Roth.
To illustrate, consider estimating the mean f(x) = (1/N) Σ_n x_n of N scalars x = [x_1, …, x_N] with each x_n ∈ [0, 1]. A neighboring data vector x′ differs from x in a single element. The sensitivity of the function f(x) is then 1/N. Therefore, for an (ε, δ)-differentially private estimate of the average, we can follow the Gaussian mechanism to release f̂(x) = f(x) + e, where e ∼ 𝒩(0, τ²) and τ = (1/(Nε)) √(2 log(1.25/δ)).
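This Gaussian-mechanism release can be sketched in a few lines (a minimal illustration, assuming samples in [0, 1] so that the sensitivity of the mean is 1/N; the function name is ours):

```python
import numpy as np

def gaussian_mech_mean(x, epsilon, delta, rng):
    """(epsilon, delta)-DP release of the mean of samples in [0, 1].
    Changing one sample moves the mean by at most 1/N, so the
    sensitivity of f(x) = mean(x) is 1/N."""
    N = len(x)
    sensitivity = 1.0 / N
    tau = (sensitivity / epsilon) * np.sqrt(2.0 * np.log(1.25 / delta))
    return np.mean(x) + rng.normal(0.0, tau)

rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=10_000)
print(gaussian_mech_mean(x, epsilon=1.0, delta=1e-5, rng=rng))
```

With N = 10,000, ε = 1, and δ = 1e-5, the noise scale τ is on the order of 5e-4, so the private mean is very close to the true mean.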
Distributed Privacy-preserving Computation. In our distributed setting, we assume that the sites are “honest but curious.” That is, the aggregator is not trusted, and the sites can collude to get hold of some site's data or function output. Existing approaches to distributed differentially private algorithms can introduce a significant amount of noise to guarantee privacy. Returning to the example of mean estimation, suppose now there are S sites and each site s holds a disjoint dataset of N_s samples. A central aggregator wishes to estimate and publish the mean of all the samples. The sites can send estimates to the aggregator, but may collude to learn the data of other sites based on the aggregator output. Without privacy, the sites can send their local averages a_s to the aggregator, and the average computed by the aggregator, (1/S) Σ_s a_s, is exactly equal to the average we would get if all the data samples were available at the aggregator node. For preserving privacy, a standard differentially private approach is for each site to send â_s = a_s + e_s, where e_s ∼ 𝒩(0, τ_s²) and τ_s = (1/(N_s ε)) √(2 log(1.25/δ)). The aggregator computes â_ag = (1/S) Σ_s â_s = (1/S) Σ_s a_s + (1/S) Σ_s e_s: note that this estimate is still noisy due to the privacy constraint. The variance of the estimator is (1/S²) Σ_s τ_s². However, if we had all the data samples at the central aggregator, then we could compute the differentially private average as â_pool = a + e_pool, where e_pool ∼ 𝒩(0, τ_pool²) and τ_pool = (1/(Nε)) √(2 log(1.25/δ)). If we assume that each site has an equal number of samples, then N_s = N/S and we have τ_s = S τ_pool. We observe the variance ratio (1/S²) Σ_s τ_s² / τ_pool² = S, showing that the conventional differentially private distributed averaging scheme is always worse than the differentially private pooled-data case.
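The variance penalty of the conventional scheme can be verified with a short calculation (an illustrative sketch assuming equal site sizes N_s = N/S and the Gaussian-mechanism noise scale discussed above):

```python
import numpy as np

def gauss_sigma(sensitivity, epsilon, delta):
    """Gaussian-mechanism noise standard deviation."""
    return (sensitivity / epsilon) * np.sqrt(2.0 * np.log(1.25 / delta))

S, N = 10, 10_000
Ns = N // S
eps, delta = 1.0, 1e-5

# Each site adds noise calibrated to its local sensitivity 1/Ns;
# the aggregator averages S independent noisy estimates.
tau_site = gauss_sigma(1.0 / Ns, eps, delta)
var_distributed = S * tau_site**2 / S**2    # Var[(1/S) * sum of S noises]

# Pooled data: one noise draw calibrated to sensitivity 1/N.
tau_pool = gauss_sigma(1.0 / N, eps, delta)
var_pooled = tau_pool**2

print(var_distributed / var_pooled)  # ≈ 10.0, i.e., a factor of S
```

The ratio equals S regardless of ε and δ, since the privacy-dependent constant cancels.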
3 Correlated Noise Scheme
The recently proposed correlation assisted private estimation (CAPE) scheme exploits the network structure and uses a correlated noise design to achieve the same performance as the pooled-data case (i.e., the same noise variance as the pooled-data estimator) in the decentralized setting. We assume there is a trusted noise generator in addition to the central aggregator (see Figure 1). The local sites and the central aggregator can also generate noise. The noise generator and the aggregator can send noise to the sites through secure (encrypted) channels. The noise addition procedure is carefully designed to ensure the privacy of the algorithm output from each site and to achieve the noise level of the pooled-data scenario in the final output from the central aggregator. Considering the same distributed averaging problem as in Section 2, the noise generator and the central aggregator respectively send e_s and f_s to each site s. Site s generates its own noise g_s and releases/sends â_s = a_s + e_s + f_s + g_s. The noise generator generates the e_s such that Σ_s e_s = 0. As shown previously, these noise terms are distributed according to e_s ∼ 𝒩(0, τ_e²), f_s ∼ 𝒩(0, τ_f²), and g_s ∼ 𝒩(0, τ_g²), where the variances are chosen so that the total noise e_s + f_s + g_s at each site has the variance τ_s² required for local privacy, while τ_g² = τ_s²/S.
The aggregator computes â_ag = (1/S) Σ_s (â_s − f_s) = (1/S) Σ_s a_s + (1/S) Σ_s g_s, where we used Σ_s e_s = 0 and the fact that the aggregator knows the f_s, so it can subtract all of them from the â_s. The variance of the estimator is S · (τ_g²/S²) = τ_s²/S², which is the same as if all the data were present at the aggregator. This claim is formalized in Lemma 1. We show the complete algorithm in Algorithm 3 (Appendix A.1). Privacy follows from previous work, and if and the number of trusted sites (the sites that would not collude with any adversary) , the aggregator does not need to generate the f_s.
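The cancellation argument can be simulated directly. The sketch below is illustrative only: the zero-sum noise e_s is obtained by centering i.i.d. draws, and the variance choice τ_g² = τ_s²/S follows the discussion above. It checks that, after the aggregator subtracts its own noise f_s, only the small local noise g_s survives:

```python
import numpy as np

rng = np.random.default_rng(1)
S = 5
tau_s = 1.0                      # per-site privacy noise level
tau_g = tau_s / np.sqrt(S)       # local noise variance: tau_s^2 / S

a = rng.uniform(0, 1, size=S)    # local (non-private) site averages

# Noise generator: e_s sums to zero across sites (centered i.i.d. draws).
e = rng.normal(0, tau_s, size=S)
e -= e.mean()

# Aggregator noise f_s (known to the aggregator) and local noise g_s.
f = rng.normal(0, tau_s, size=S)
g = rng.normal(0, tau_g, size=S)

a_hat = a + e + f + g            # what each site releases
a_agg = np.mean(a_hat - f)       # aggregator subtracts its own f_s

# Since sum(e) = 0, the result equals mean(a) + mean(g): only the
# small local noise remains, as in the pooled-data scenario.
assert np.isclose(a_agg, np.mean(a) + np.mean(g))
```

Each site's release still carries the full noise e_s + f_s + g_s, so local privacy is unaffected; the correlation only helps at the aggregator.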
Proposition 1 (Performance gain). Consider the gain function with . Then: (i) the minimum is , achieved when ; (ii) the maximum is , which occurs when . The proof is a consequence of Schur convexity and is given in . ∎
3.1 Extension of CAPE to Unequal Privacy Requirements
We now propose a generalization of the CAPE scheme that applies to scenarios where different sites have different privacy requirements and/or sample sizes. Additionally, sites may have different “quality notions”; i.e., while combining the site outputs, the aggregator can decide to assign different weights to different sites (possibly according to the quality of the output from each site). Let us assume that site s requires an (ε_s, δ_s)-differential privacy guarantee for its output. According to the Gaussian mechanism, the noise to be added to the (non-private) output of site s should have standard deviation τ_s = (1/(N_s ε_s)) √(2 log(1.25/δ_s)). As before, site s outputs â_s = a_s + e_s + f_s + g_s. Here, g_s is generated locally, e_s is generated at the random noise generator, and f_s is generated at the central aggregator. The corresponding noise variances must satisfy τ_e,s² + τ_f,s² + τ_g,s² = τ_s².
As mentioned before, the aggregator can decide to compute a weighted average, with weights selected according to some quality measure of each site's data or output (e.g., if the aggregator knows that a particular site suffers from noisier observations than the other sites, it can give the output from that site less weight when combining the site results). Let us denote the weights by μ_s, such that μ_s ≥ 0 and Σ_s μ_s = 1. Note that our proposed generalized CAPE reduces to the existing CAPE scheme for μ_s = 1/S. The aggregator computes the weighted estimate â_ag = Σ_s μ_s (â_s − f_s).
In accordance with our goal of achieving the same level of noise as the pooled-data scenario, we need Σ_s μ_s² τ_g,s² = τ_pool². Additionally, for the correlated noise to cancel at the aggregator, we need Σ_s μ_s e_s = 0. With these constraints, we can formulate a feasibility problem to solve for the unknown noise variances (τ_e,s², τ_f,s², τ_g,s²) as
for all s, where τ_pool², the weights μ_s, and the variances τ_s² are known to the aggregator. For this problem, multiple solutions are possible. We present one solution here that solves the problem with equality. For the -th site:
For other sites :
The derivation of this solution is shown in Appendix A.2.
4 Improved Distributed Differentially-private Principal Component Analysis
In this section, we propose an improved distributed differentially private PCA algorithm that takes advantage of the CAPE protocol. Recall that in our distributed PCA problem, we are interested in approximating the top-K subspace V_K in a distributed setting while guaranteeing differential privacy. One naïve (non-private) approach would be to send the data matrices X_s from the sites to the aggregator. When D and/or N_s are large, this entails a huge communication overhead. In many scenarios the local data are also private or sensitive. As the aggregator is not trusted, sending the data to the aggregator can result in a significant privacy violation. Our goals are therefore to reduce the communication cost, ensure differential privacy, and provide a close approximation to the true PCA subspace V_K. We previously proposed a differentially private distributed PCA scheme, but the performance of that scheme is limited by the large variance of the additive noise at the local sites due to their smaller sample sizes. We alleviate this problem using the correlated noise scheme. The improved distributed differentially private PCA algorithm we propose here achieves the same utility as the pooled-data scenario.
We consider the same network structure as in Section 3: there is a random noise generator that can generate and send noise to the sites through an encrypted/secure channel. The aggregator can also generate noise and send it to the sites over encrypted/secure channels. Recall that in the pooled-data scenario, we have the data matrix X and the sample second-moment matrix A = (1/N) X X^T. We refer to the top-K PCA subspace of this sample second-moment matrix as the true (or optimal) subspace V_K. At each site s, we compute the local sample second-moment matrix as A_s = (1/N_s) X_s X_s^T. The sensitivity of the function f(X_s) = A_s is 1/N_s. In order to approximate A_s while satisfying differential privacy, we can employ the AG (Analyze Gauss) algorithm to compute Â_s = A_s + E_s, where the symmetric matrix E_s is generated with entries i.i.d. 𝒩(0, τ_s²) and τ_s = (1/(N_s ε)) √(2 log(1.25/δ)). Note that in the pooled-data scenario, the sensitivity of the function is 1/N. Therefore, the required additive noise standard deviation should satisfy τ_pool = (1/(N ε)) √(2 log(1.25/δ)) = τ_s/S, assuming an equal number of samples at the sites. As we want the same utility as the pooled-data scenario, we compute the following at each site s:
Here, the noise generator generates the matrix E_s with entries drawn i.i.d. 𝒩(0, τ_e²) such that Σ_s E_s = 0, where the variance τ_e² is set according to (2). Additionally, the aggregator generates the matrix F_s with entries drawn i.i.d. 𝒩(0, τ_f²), where τ_f² is also set according to (2). Finally, each site generates its own symmetric matrix G_s, whose entries are drawn i.i.d. 𝒩(0, τ_g²) with τ_g² set according to (2). Note that these variance assignments can be readily modified to fit the unequal privacy/sample-size scenario (Section 3.1). However, for simplicity, we consider the equal sample-size scenario. Now, the sites send their Â_s to the aggregator, and the aggregator computes
where we used the relation Σ_s E_s = 0. The detailed calculation is shown in Appendix C.1. We note that at the aggregator, we end up with an estimator whose noise variance is exactly the same as that of the pooled-data scenario. Next, we perform SVD on the aggregated matrix and release the top-K eigenvector matrix, which is the differentially private approximation to the true subspace V_K. To achieve the same utility level as the pooled-data case, we chose to send the full D × D matrix Â_s from the sites to the aggregator instead of a partial square root of it. This increases the communication cost, with the overhead depending on the intermediate dimension of the partial square root. This can be thought of as the cost of the gain in performance.
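The aggregation step can be prototyped as follows (an illustrative sketch, not the exact algorithm: the noise scales are arbitrary, the zero-sum correlated noise is constructed by centering, and the subtraction of F_s is written out to mirror the protocol):

```python
import numpy as np

rng = np.random.default_rng(2)
D, S, Ns, K = 10, 4, 200, 3

def sym(M):
    """Symmetrize a Gaussian matrix."""
    return (M + M.T) / 2.0

# Local data and second-moment matrices A_s = X_s X_s^T / N_s.
Xs = [rng.standard_normal((D, Ns)) for _ in range(S)]
As = [X @ X.T / Ns for X in Xs]

tau = 0.05                                    # illustrative noise scale
E = [sym(rng.normal(0, tau, (D, D))) for _ in range(S)]
E = [Es - sum(E) / S for Es in E]             # enforce sum_s E_s = 0
F = [sym(rng.normal(0, tau, (D, D))) for _ in range(S)]
G = [sym(rng.normal(0, tau / S, (D, D))) for _ in range(S)]

# Each site releases A_s + E_s + F_s + G_s; the aggregator subtracts
# its own F_s, and the zero-sum E_s cancel in the average.
A_hat = sum(As[s] + E[s] + F[s] + G[s] - F[s] for s in range(S)) / S

# Top-K subspace of the aggregated noisy second-moment matrix.
eigvals, eigvecs = np.linalg.eigh(A_hat)
V_K = eigvecs[:, -K:]
assert np.allclose(V_K.T @ V_K, np.eye(K), atol=1e-8)
```

Only the small local noise G_s survives in A_hat, mirroring the scalar averaging example of Section 3.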
Theorem 1 (Privacy of the proposed algorithm). Algorithm 1 computes an (ε, δ)-differentially private approximation to the optimal subspace V_K.
Therefore, the computation of Â_s at each site s is at least (ε, δ)-differentially private. As differential privacy is invariant under post-processing, we can combine the noisy second-moment matrices at the aggregator while subtracting F_s for each site s. By the correlated noise generation at the random noise generator, the noise Σ_s E_s cancels out. We perform the SVD on the aggregated matrix and release the top-K eigenvector matrix. The released subspace is thus an (ε, δ)-differentially private approximation to the true subspace V_K. ∎
Performance Gain with Correlated Noise. Our previously proposed distributed differentially private PCA algorithm essentially employs conventional averaging (each site sends its full noisy second-moment matrix to the aggregator). Therefore, the gain in performance of the proposed algorithm over that scheme is the same as shown in Proposition 1.
Theoretical Performance Guarantee. Due to the application of the correlated noise protocol, we achieve the same level of noise at the aggregator in the distributed setting as we would have in the pooled-data scenario. In essence, the proposed algorithm can achieve the same performance as the AG algorithm modified to account for all the samples across all the sites. Here, we present three guarantees: for the captured energy, for the closeness to the true subspace, and for the low-rank approximation. The guarantees are adapted from Dwork et al. and modified to fit our setup and notation. Let the differentially private subspace output from Algorithm 1 and the true subspace be denoted by V̂_K and V_K, respectively. We denote the singular values of X with σ_1 ≥ σ_2 ≥ … ≥ σ_D and the un-normalized second-moment matrix with X X^T. Let the true and the differentially private rank-K approximations to this matrix be denoted accordingly. If we assume a sufficiently large gap between the K-th and (K+1)-th eigenvalues, then the following holds:
The detailed proofs can be found in Dwork et al. .
Communication Cost. We quantify the total communication cost associated with the proposed algorithm. Recall that it is a one-shot algorithm. The random noise generator and the aggregator each send one D × D matrix to each site. Each site uses these to compute the noisy estimate Â_s of its local second-moment matrix and sends that back to the aggregator. Therefore, the total communication cost is proportional to S D², i.e., O(S D²). This is expected, as we are computing the global second-moment matrix in a distributed setting before computing the PCA subspace.
5 Distributed Differentially-private Orthogonal Tensor Decomposition
In this section, we propose an algorithm for distributed differentially-private OTD. The proposed algorithm takes advantage of the correlated noise design scheme (Algorithm 3) . To our knowledge, this is the first work on distributed differentially-private OTD. Due to page limits, the definition of the differentially-private OTD and the description of two recently proposed differentially-private OTD algorithms  are presented in Appendix E.
We start by recalling that the orthogonal decomposition of a 3rd-order symmetric tensor 𝒳 is a collection of orthonormal vectors v_k together with corresponding positive scalars λ_k such that 𝒳 = Σ_k λ_k v_k ⊗ v_k ⊗ v_k. A unit vector u is an eigenvector of 𝒳 with corresponding eigenvalue λ if 𝒳(I, u, u) = λ u, where I is the identity matrix. To see this, one can observe that 𝒳(I, u, u) = Σ_k λ_k (v_k^T u)² v_k. By the orthogonality of the v_k, it is clear that 𝒳(I, v_k, v_k) = λ_k v_k. Now, the orthogonal tensor decomposition proposed in [3] is based on the mapping u ↦ 𝒳(I, u, u) / ‖𝒳(I, u, u)‖₂, which can be considered the tensor analogue of the well-known matrix power method. Of course, not all tensors are orthogonally decomposable. As the tensor power method requires the eigenvectors to be orthonormal, we need to perform whitening: that is, projecting the tensor onto a subspace such that its eigenvectors become orthogonal to each other.
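The tensor power method on an orthogonally decomposable tensor can be sketched as follows (illustrative only; the synthetic tensor is built directly from orthonormal v_k and positive λ_k, so no whitening is needed in this toy example):

```python
import numpy as np

rng = np.random.default_rng(3)
D, K = 8, 3

# Build an orthogonally decomposable 3-way tensor
# T = sum_k lam_k * v_k (outer) v_k (outer) v_k with orthonormal v_k.
V, _ = np.linalg.qr(rng.standard_normal((D, K)))
lam = np.array([3.0, 2.0, 1.0])
T = sum(lam[k] * np.einsum('i,j,l->ijl', V[:, k], V[:, k], V[:, k])
        for k in range(K))

def tensor_apply(T, u):
    """T(I, u, u): contract the last two modes of T with u."""
    return np.einsum('ijl,j,l->i', T, u, u)

# Tensor power iteration: u <- T(I, u, u) / ||T(I, u, u)||_2.
u = rng.standard_normal(D)
u /= np.linalg.norm(u)
for _ in range(100):
    u = tensor_apply(T, u)
    u /= np.linalg.norm(u)

eigval = u @ tensor_apply(T, u)   # T(u, u, u) at convergence

# The iteration converges to one of the v_k, and T(u, u, u) recovers
# the corresponding eigenvalue lam_k (up to numerical error).
overlaps = np.abs(V.T @ u)
assert np.isclose(overlaps.max(), 1.0, atol=1e-6)
assert np.isclose(eigval, lam[overlaps.argmax()], atol=1e-6)
```

In a full decomposition, the recovered component would be deflated (subtracted from T) and the iteration restarted to extract the remaining pairs.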
We note that the proposed algorithm applies to both the STM and MOG problems. However, as the correlated noise scheme only works with Gaussian noise, the proposed algorithm employs the AG algorithm at its core. In line with our setup in Section 3, we assume that there is a random noise generator that can generate and send noise to the sites through an encrypted/secure channel. The untrusted aggregator can also generate noise and send it to the sites over encrypted/secure channels. At site s, the sample second-order moment matrix and the third-order moment tensor are the quantities to be estimated privately. The noise standard deviation required for computing the differentially private approximation to the second-order moment matrix follows the Gaussian mechanism, where the sensitivity is inversely proportional to the sample size N_s. The detailed derivation of this sensitivity for both STM and MOG is shown in Appendix E. Additionally, at site s, the noise standard deviation required for computing the differentially private approximation to the third-order moment tensor is defined in the same way, and Appendix E contains the detailed algebra for calculating its sensitivity for STM and MOG. We note that, as for the second-order moment, this sensitivity depends only on the sample size N_s. In the pooled-data scenario, the corresponding noise standard deviations are given by the same expressions with N_s replaced by the total sample size N, assuming an equal number of samples at the sites. We need to compute the whitening matrix W and the projected tensor in a distributed way while satisfying differential privacy. Although we could employ our previous differentially private distributed PCA algorithm [17] to compute W, to achieve the same level of accuracy as the pooled-data scenario, we instead compute the following matrix at each site s:
where E_s is generated at the noise generator satisfying Σ_s E_s = 0, with entries drawn i.i.d. 𝒩(0, τ_e²) and the variance τ_e² set according to (2). Additionally, F_s is generated at the aggregator with entries drawn i.i.d. 𝒩(0, τ_f²), where τ_f² is also set according to (2). Finally, G_s is a symmetric matrix generated at site s whose entries are drawn i.i.d. 𝒩(0, τ_g²), with τ_g² set according to (2). At the aggregator, we compute
where we used the relation Σ_s E_s = 0. Note that the variance of the additive noise in the aggregate is exactly the same as in the pooled-data scenario, as described in Section 3. At the aggregator, we can then compute the SVD of the aggregated second-moment matrix, form the whitening matrix W from its top-K eigenpairs, and send W to the sites.
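The whitening step can be checked numerically. The sketch below assumes the standard construction W = U_K Λ_K^(−1/2) from the top-K eigenpairs of the second-moment matrix (a common choice in the OTD literature, used here as an illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
D, K = 10, 3

# A rank-K positive semi-definite second-moment matrix M2 = B B^T.
B = rng.standard_normal((D, K))
M2 = B @ B.T

# Whitening: W = U_K diag(lambda_K)^(-1/2) from the top-K eigenpairs.
eigvals, eigvecs = np.linalg.eigh(M2)
U_K, lam_K = eigvecs[:, -K:], eigvals[-K:]
W = U_K / np.sqrt(lam_K)          # scale each column by 1/sqrt(lambda)

# W whitens M2: W^T M2 W = I_K, so the projected tensor components
# become orthonormal.
assert np.allclose(W.T @ M2 @ W, np.eye(K), atol=1e-8)
```

Projecting the third-order moment tensor with W then yields a K × K × K tensor whose components are orthonormal, as required by the tensor power method.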
Next, we focus on computing the third-order moment tensor in the distributed setting. For this purpose, we could follow the same steps as for computing the second-order moment matrix. However, this is a tensor, and for large enough D, doing so would entail a very large communication overhead. We alleviate this in the following way: each site s receives W and F_s from the aggregator and E_s from the noise generator. Here, the entries of F_s are drawn i.i.d. 𝒩(0, τ_f²). Additionally, the entries of E_s are drawn i.i.d. 𝒩(0, τ_e²) and Σ_s E_s = 0 is satisfied. We set the two variance terms according to (2). Finally, each site generates its own G_s in the following way: site s draws a vector with entries i.i.d.