Distributed Differentially-Private Algorithms for Matrix and Tensor Factorization

04/26/2018 ∙ by Hafiz Imtiaz, et al. ∙ Rutgers University

In many signal processing and machine learning applications, datasets containing private information are held at different locations, requiring the development of distributed privacy-preserving algorithms. Tensor and matrix factorizations are key components of many processing pipelines. In the distributed setting, differentially private algorithms suffer because they introduce noise to guarantee privacy. This paper designs new and improved distributed and differentially private algorithms for two popular matrix and tensor factorization methods: principal component analysis (PCA) and orthogonal tensor decomposition (OTD). The new algorithms employ a correlated noise design scheme to alleviate the effects of noise and can achieve the same noise level as the centralized scenario. Experiments on synthetic and real data illustrate the regimes in which the correlated noise allows performance matching with the centralized setting, outperforming previous methods and demonstrating that meaningful utility is possible while guaranteeing differential privacy.


1 Introduction

Many signal processing and machine learning algorithms involve analyzing private or sensitive data. The outcomes of such algorithms may potentially leak information about individuals present in the dataset. A strong and cryptographically-motivated framework for protection against such information leaks is differential privacy [1]. Differential privacy measures privacy risk in terms of the probability of identifying individual data points in a dataset from the results of computations (algorithms) performed on that data.

In several modern applications the data is distributed over different locations or sites, with each site holding a smaller number of samples. For example, consider neuro-imaging analyses for mental health disorders, in which there are many individual research groups, each with a modest number of subjects. Learning meaningful population properties or efficient feature representations from high-dimensional functional magnetic resonance imaging (fMRI) data requires a large sample size. Pooling the data at a central location may enable efficient feature learning, but privacy concerns and high communication overhead often prevent sharing the underlying data. Therefore, it is desirable to have efficient distributed algorithms that provide utility close to that of the centralized case while also preserving privacy [2].

This paper focuses on the Singular Value Decomposition (SVD), or Principal Component Analysis (PCA), and orthogonal tensor decompositions. Despite some limitations, PCA/SVD is one of the most widely-used preprocessing stages in machine learning pipelines: it projects data onto a lower-dimensional subspace spanned by the singular vectors of the second-moment matrix of the data. Tensor decomposition is a powerful tool for inference algorithms because it can be used to infer complex dependencies (higher-order moments) beyond second-moment methods such as PCA. This is particularly useful in latent variable models [3] such as mixtures of Gaussians and topic modeling.

Related Works. For a complete introduction to the history of tensor decompositions, see the comprehensive survey of Kolda and Bader [4] (see also Appendix B). The CANDECOMP/PARAFAC (CP) decomposition [5, 6] and the Tucker decomposition [7] are generalizations of the matrix SVD to multi-way arrays. While finding the decomposition of arbitrary tensors is computationally intractable, specially structured tensors appear in some latent variable models. Such tensors can be decomposed efficiently [3, 4] using a variety of approaches, such as generalizations of the power iteration [8]. Exploiting such structures in higher-order moments to estimate the parameters of latent variable models has been studied extensively using the so-called orthogonal tensor decomposition (OTD) [3, 9, 10, 11]. To our knowledge, these decompositions have not been studied in the setting of distributed data.

Several distributed PCA algorithms [12, 13, 14, 15, 16, 17] have been proposed. Liang et al. [12] proposed a distributed PCA scheme where it is necessary to send both the left and right singular vectors along with corresponding singular values from each site to the aggregator. Feldman et al. [18] proposed an improvement upon this, where each site sends a matrix to the aggregator. Balcan et al. [13] proposed a further improved version using fast sparse subspace embedding [19] and randomized SVD [20].

This paper proposes new privacy-preserving algorithms for distributed PCA and OTD, building upon our earlier work on distributed differentially private eigenvector calculations [17] and centralized differentially private OTD [21]. It improves on our preliminary work on distributed private PCA [22, 17] in terms of efficiency and fault-tolerance. Wang and Anandkumar [23] recently proposed an algorithm for differentially private tensor decomposition using a noisy version of the tensor power iteration [3, 8]. Their algorithm adds noise at each step of the iteration, and the noise variance grows with the predetermined number of iterations. They also make the restrictive assumption that the input to their algorithm is orthogonally decomposable. Our centralized OTD algorithms [21] avoid these assumptions and achieve better empirical performance (although without theoretical guarantees). To our knowledge, this paper proposes the first differentially private orthogonal tensor decomposition algorithm for distributed settings.

Our Contribution. In this paper, we propose two new (ε, δ)-differentially private algorithms for distributed principal component analysis and distributed orthogonal tensor decomposition, respectively. The algorithms are inspired by the recently proposed correlation assisted private estimation (CAPE) protocol [24] and input perturbation methods for differentially-private PCA [25, 26]. The CAPE protocol improves upon conventional approaches, which suffer from excessive noise, at the expense of requiring a trusted “helper” node that can generate correlated noise samples for privacy. We extend the CAPE framework to handle site-dependent sample sizes and privacy requirements. In the distributed PCA algorithm, the sites send noisy second-moment matrix estimates to a central aggregator, whereas in the distributed OTD algorithm the sites use a distributed protocol to compute a projection subspace that enables efficient private OTD. This paper is about algorithms with provable privacy guarantees and experimental validation. While asymptotic sample complexity guarantees are of theoretical interest, proving performance bounds for distributed subspace estimation is quite challenging. To validate our approach, we show that our new methods outperform previously proposed approaches, even under strong privacy constraints. For weaker privacy requirements they can sometimes achieve the same performance as a pooled-data scenario.

2 Problems Using Distributed Private Data


Figure 1: The structure of the network: left – conventional, right – with a trusted noise generator providing correlated noise

Notation. We denote tensors with calligraphic scripts (e.g., 𝒳), vectors with bold lower-case letters (e.g., x), and matrices with bold upper-case letters (e.g., X). Scalars are denoted with regular letters (e.g., M). Indices are denoted with lower-case letters and typically run from 1 to their upper-case versions (e.g., m = 1, …, M). We sometimes denote the set {1, …, N} as [N]. The n-th column of the matrix X is denoted x_n. ‖·‖₂ denotes the Euclidean (or L₂) norm of a vector and the spectral norm of a matrix; ‖·‖_F denotes the Frobenius norm and tr(·) denotes the trace operation.

Distributed Data Model. We assume that the data is distributed over S sites, where each site s ∈ [S] holds a data matrix X_s ∈ R^{D×N_s}. The data samples in the local sites are assumed to be disjoint. There is a central node that acts as an aggregator (see Figure 1). We denote by N = Σ_{s=1}^S N_s the total number of samples over all sites. The data matrix X_s at site s contains the D-dimensional features of N_s individuals. Without loss of generality, we assume the samples are bounded, i.e., ‖x_n‖₂ ≤ 1 for all n. If we had all the data at the aggregator (the pooled-data scenario), the data matrix would be X = [X₁ … X_S] ∈ R^{D×N}. Our goal is to approximate the performance of the pooled-data scenario using distributed differentially private algorithms.

Matrix and Tensor Factorizations. We first formulate the problem of distributed PCA. For simplicity, we assume that the observed samples are mean-centered. The sample second-moment matrix at site s is A_s = (1/N_s) X_s X_sᵀ. In the pooled-data scenario, the positive semi-definite second-moment matrix is A = (1/N) X Xᵀ. According to the Schmidt approximation theorem [27], the rank-K matrix A_K that minimizes the difference ‖A − A_K‖_F can be found by taking the SVD of A = V Λ Vᵀ, where, without loss of generality, we assume Λ is a diagonal matrix with entries λ₁ ≥ λ₂ ≥ … ≥ λ_D ≥ 0 and V is a matrix of eigenvectors corresponding to the eigenvalues. The top-K PCA subspace of A is the matrix V_K = [v₁ … v_K]. Given V_K and the eigenvalue matrix Λ, we can form an approximation A_K = V_K Λ_K V_Kᵀ to A, where Λ_K contains the K largest eigenvalues in Λ. For a D × K matrix U with orthonormal columns, the quality of U in approximating V_K can be measured by the captured energy of A as q(U) = tr(Uᵀ A U). The U that maximizes q(U) is the subspace V_K. We are interested in approximating V_K in a distributed setting while guaranteeing differential privacy.
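The captured-energy characterization above can be checked numerically; the following is a small sketch with names of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(5)
D, N, K = 10, 500, 2
X = rng.normal(size=(D, N))
A = X @ X.T / N                       # sample second-moment matrix

eigvals, eigvecs = np.linalg.eigh(A)  # eigh returns ascending order
V_K = eigvecs[:, ::-1][:, :K]         # top-K PCA subspace

def captured_energy(U, A):
    # q(U) = tr(U^T A U): energy of A captured by the subspace spanned by U
    return np.trace(U.T @ A @ U)

# The top-K eigenvector matrix maximizes the captured energy over all
# D x K matrices with orthonormal columns
U_rand = np.linalg.qr(rng.normal(size=(D, K)))[0]
print(captured_energy(V_K, A) >= captured_energy(U_rand, A))   # -> True
```

Note that the maximum value of q(U) equals the sum of the K largest eigenvalues of A.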

Next, we describe the problem of orthogonal tensor decomposition (OTD). As mentioned before, the decomposition of arbitrary tensors is usually computationally intractable. However, some specially structured tensors that appear in several latent variable models can be efficiently decomposed [3] using a variety of approaches, such as generalizations of the power iteration [8]. We review some basic definitions related to tensor decomposition [4] in Appendix B. We start by formulating the problem of orthogonal decomposition of symmetric tensors and then continue to distributed OTD. Due to page limitations, two examples of OTD from Anandkumar et al. [3], namely the single topic model (STM) and the mixture of Gaussians (MOG), are presented in Appendix D.

Let 𝒳 be an M-way, D-dimensional symmetric tensor. Given real-valued vectors a_k ∈ R^D, Comon et al. [28] showed that there exists a decomposition of the form 𝒳 = Σ_{k=1}^K λ_k a_k ⊗ a_k ⊗ ⋯ ⊗ a_k, where ⊗ denotes the outer product. Without loss of generality, we can assume that ‖a_k‖₂ = 1. If the matrix A = [a₁ … a_K] has orthogonal columns, then we say that 𝒳 has an orthogonal symmetric tensor decomposition [11]. Such tensors are generated in several applications involving latent variable models. Recall that if M is a symmetric rank-K matrix, then the SVD of M is given by M = Σ_{k=1}^K λ_k v_k v_kᵀ = Σ_{k=1}^K λ_k v_k ⊗ v_k, where v_k is the k-th column of the orthogonal matrix V. As mentioned before, the orthogonal decomposition of a 3-rd order symmetric tensor 𝒳 ∈ R^{D×D×D} is a collection of orthonormal vectors {v_k} together with corresponding positive scalars {λ_k} such that 𝒳 = Σ_{k=1}^K λ_k v_k ⊗ v_k ⊗ v_k. Now, in a setting where the data samples are distributed over different sites, we may have local approximates 𝒳_s. We intend to use these local approximates from all sites to find better and more accurate estimates of the {v_k} and {λ_k}, while preserving privacy.

Differential Privacy. An algorithm 𝒜(𝔻) taking values in a set 𝕋 provides (ε, δ)-differential privacy if

Pr[𝒜(𝔻) ∈ 𝕊] ≤ exp(ε) Pr[𝒜(𝔻′) ∈ 𝕊] + δ, (1)

for all measurable 𝕊 ⊆ 𝕋 and all datasets 𝔻 and 𝔻′ differing in a single entry (neighboring datasets). This definition essentially states that the probability of the output of an algorithm is not changed significantly if the corresponding database input is changed by just one entry. Here, ε and δ are privacy parameters, where low ε and δ ensure more privacy. Note that the parameter δ can be interpreted as the probability that the algorithm fails. For more details, see recent surveys [29] or the monograph of Dwork and Roth [30].

To illustrate, consider estimating the mean f(x) = (1/N) Σ_{n=1}^N x_n of N scalars x = [x₁, …, x_N]ᵀ with each x_n ∈ [0, 1]. A neighboring data vector x′ differs from x in a single element. The sensitivity [1] of the function f(x) is 1/N. Therefore, for an (ε, δ)-differentially private estimate of the average, we can follow the Gaussian mechanism [1] to release f̂(x) = f(x) + e, where e ~ 𝒩(0, τ²) and τ = (1/(Nε)) √(2 log(1.25/δ)).
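This Gaussian-mechanism release can be sketched as follows (the function name and parameter values are ours, chosen for illustration):

```python
import numpy as np

def gaussian_mech_mean(x, epsilon, delta, rng):
    """Release a differentially private mean of samples in [0, 1].

    The sensitivity of the mean of N samples bounded in [0, 1] is 1/N,
    so the Gaussian mechanism adds noise with standard deviation
    tau = (1/(N*epsilon)) * sqrt(2 * ln(1.25/delta)).
    """
    n = len(x)
    tau = (1.0 / (n * epsilon)) * np.sqrt(2.0 * np.log(1.25 / delta))
    return np.mean(x) + rng.normal(0.0, tau)

rng = np.random.default_rng(0)
x = rng.uniform(size=10_000)          # N = 10,000 samples in [0, 1]
private_mean = gaussian_mech_mean(x, epsilon=1.0, delta=1e-5, rng=rng)
print(abs(private_mean - np.mean(x)))  # error is on the order of tau (~5e-4 here)
```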

Distributed Privacy-preserving Computation. In our distributed setting, we assume that the sites are “honest but curious.” That is, the aggregator is not trusted, and the sites can collude to get hold of some site’s data or function output. Existing approaches to distributed differentially private algorithms can introduce a significant amount of noise to guarantee privacy. Returning to the example of mean estimation, suppose now there are S sites and each site s holds a disjoint dataset x_s of N_s samples. A central aggregator wishes to estimate and publish the mean of all the samples. The sites can send estimates to the aggregator, but may collude to learn the data of other sites based on the aggregator output. Without privacy, the sites can send a_s = f(x_s) to the aggregator, and the average computed by the aggregator, (1/S) Σ_s a_s, is exactly equal to the average we would get if all the data samples were available at the aggregator node (assuming equal sample sizes). For preserving privacy, a standard differentially private approach is for each site to send â_s = f(x_s) + e_s, where e_s ~ 𝒩(0, τ_s²) and τ_s = (1/(N_s ε)) √(2 log(1.25/δ)). The aggregator computes a_conv = (1/S) Σ_s â_s. We observe that this estimate is still noisy due to the privacy constraint. The variance of the estimator is S · (τ_s²/S²) = τ_s²/S ≜ τ_conv². However, if we had all the data samples at the central aggregator, then we could compute the differentially-private average as â_pool = f(x) + e_pool, where e_pool ~ 𝒩(0, τ_pool²) and τ_pool = (1/(Nε)) √(2 log(1.25/δ)). If we assume that each site has an equal number of samples, then N_s = N/S and we have τ_s = S τ_pool. We observe the ratio τ_conv²/τ_pool² = S, showing that the conventional differentially-private distributed averaging scheme is always S times worse than the differentially-private pooled-data case.
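The variance comparison above can be checked numerically; this sketch (helper names ours) computes the ratio of the conventional distributed variance to the pooled variance:

```python
import numpy as np

def dp_noise_std(n, epsilon, delta):
    # Gaussian-mechanism noise std for a mean of n samples in [0, 1]
    return (1.0 / (n * epsilon)) * np.sqrt(2.0 * np.log(1.25 / delta))

S, N, eps, delta = 10, 10_000, 1.0, 1e-5
tau_site = dp_noise_std(N // S, eps, delta)   # each site holds N/S samples
tau_pool = dp_noise_std(N, eps, delta)        # pooled: all N samples

# Aggregator averages S independently perturbed site estimates:
# Var = S * tau_site^2 / S^2 = tau_site^2 / S, vs. tau_pool^2 when pooled.
var_conv = tau_site**2 / S
var_pool = tau_pool**2
print(var_conv / var_pool)   # ratio = S: the conventional scheme is S times worse
```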

3 Correlated Noise Scheme

The recently proposed Correlation Assisted Private Estimation (CAPE) [24] scheme exploits the network structure and uses a correlated noise design to achieve the same performance as the pooled-data case (i.e., noise variance τ_pool²) in the decentralized setting. We assume there is a trusted noise generator in addition to the central aggregator (see Figure 1). The local sites and the central aggregator can also generate noise. The noise generator and the aggregator can send noise to the sites through secure (encrypted) channels. The noise addition procedure is carefully designed to ensure the privacy of the algorithm output from each site and to achieve the noise level of the pooled-data scenario in the final output from the central aggregator. Considering the same distributed averaging problem as in Section 2, the noise generator and the central aggregator respectively send e_s and f_s to each site s. Site s generates its own noise g_s and releases/sends â_s = a_s + e_s + f_s + g_s. The noise generator generates the e_s such that Σ_{s=1}^S e_s = 0. As shown in [24], these noise terms are distributed according to e_s ~ 𝒩(0, τ_e²), f_s ~ 𝒩(0, τ_f²), and g_s ~ 𝒩(0, τ_g²), where

τ_e² = (1 − 2/S) τ_s², τ_f² = τ_g² = τ_s²/S. (2)

The aggregator computes a_cape = (1/S) Σ_s (â_s − f_s) = (1/S) Σ_s a_s + (1/S) Σ_s g_s, where we used Σ_s e_s = 0 and the fact that the aggregator knows the f_s, so it can subtract all of those from the â_s. The variance of the estimator is S · (τ_g²/S²) = τ_s²/S² = τ_pool², which is the same as if all the data were present at the aggregator. This claim is formalized in Lemma 1. We show the complete algorithm in Algorithm 3 (Appendix A.1). Privacy follows from previous work [24]; when the number of sites and the number of trusted sites (the sites that would not collude with any adversary) are sufficiently large, the aggregator does not need to generate the f_s.
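A minimal simulation of the correlated-noise idea follows. The variance split used here (local noise τ_g² = τ_s²/S, with the rest of the per-site budget carried by the zero-sum correlated noise) is an assumption of this sketch; see [24] for the precise assignments:

```python
import numpy as np

rng = np.random.default_rng(1)
S, N, eps, delta = 10, 10_000, 1.0, 1e-5
# Per-site noise std for N/S samples each (Gaussian mechanism)
tau_s = (S / (N * eps)) * np.sqrt(2.0 * np.log(1.25 / delta))
tau_g = tau_s / np.sqrt(S)            # local-noise std: tau_g^2 = tau_s^2 / S

e = rng.normal(0.0, tau_s, size=S)
e -= e.mean()                         # trusted generator enforces sum_s e_s = 0

data = [rng.uniform(size=N // S) for _ in range(S)]
true_mean = np.mean(np.concatenate(data))

# Each site releases its local mean plus correlated and local noise
outputs = [np.mean(d) + e[s] + rng.normal(0.0, tau_g) for s, d in enumerate(data)]
cape_est = np.mean(outputs)           # e_s cancel; residual variance tau_g^2/S = tau_pool^2
print(abs(cape_est - true_mean))      # error is on the order of tau_pool, not tau_s
```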

Proposition 1.

(Performance gain [24]) Consider the gain function G(n) = τ_conv²/τ_cape², where n = (N₁, …, N_S) is the vector of site sample sizes with Σ_{s=1}^S N_s = N. Then:

  • the minimum of G(n) is S, and is achieved when N₁ = … = N_S = N/S;

  • the maximum of G(n) occurs when the sample sizes are maximally unequal, i.e., when almost all samples are concentrated at a single site.

Proof.

The proof is a consequence of Schur convexity and is given in [24]. ∎

3.1 Extension of CAPE to Unequal Privacy Requirements

We now propose a generalization of the CAPE scheme that applies to scenarios where different sites have different privacy requirements and/or sample sizes. Additionally, sites may have different “quality notions,” i.e., while combining the site outputs at the aggregator, the aggregator can decide to assign different weights to different sites (possibly according to the quality of the output from a site). Let us assume that site s requires an (ε_s, δ_s)-differential privacy guarantee for its output. According to the Gaussian mechanism [1], the noise to be added to the (non-private) output of site s should have standard deviation τ_s = (1/(N_s ε_s)) √(2 log(1.25/δ_s)). As before, site s outputs â_s = a_s + e_s + f_s + g_s. Here, g_s ~ 𝒩(0, τ_{g,s}²) is generated locally, e_s ~ 𝒩(0, τ_{e,s}²) is generated by the random noise generator, and f_s ~ 𝒩(0, τ_{f,s}²) is generated by the central aggregator. To satisfy the local privacy requirement, we need the noise variances to satisfy τ_{e,s}² + τ_{f,s}² + τ_{g,s}² = τ_s².

As mentioned before, the aggregator can decide to compute a weighted average with weights selected according to some quality measure of each site’s data/output (e.g., if the aggregator knows that a particular site suffers from noisier observations than the other sites, it can give the output from that site less weight when combining the site results). Let us denote the weights by μ_s, such that Σ_{s=1}^S μ_s = 1 and μ_s ≥ 0. Note that our proposed generalized scheme reduces to the existing CAPE [24] for μ_s = 1/S. The aggregator computes a_gen = Σ_{s=1}^S μ_s (â_s − f_s).

In accordance with our goal of achieving the same level of noise as the pooled-data scenario, we need Σ_{s=1}^S μ_s² τ_{g,s}² = τ_pool². Additionally, we need Σ_{s=1}^S μ_s e_s = 0, so that the correlated noise cancels at the aggregator. With these constraints, we can formulate a feasibility problem to solve for the unknown noise variances {τ_{e,s}², τ_{f,s}², τ_{g,s}²}:

find τ_{e,s}², τ_{f,s}², τ_{g,s}²

subject to τ_{e,s}² + τ_{f,s}² + τ_{g,s}² = τ_s² and Σ_{s=1}^S μ_s² τ_{g,s}² = τ_pool²

for all s ∈ [S], where the μ_s, τ_s, and τ_pool are known to the aggregator. For this problem, multiple solutions are possible. We present one solution that satisfies the constraints with equality, assigning the variances for one designated site and for the remaining sites accordingly. The derivation of this solution is shown in Appendix A.2.
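The aggregator-side constraint can be sanity-checked numerically. The allocation below is an illustrative choice of ours, not the solution from Appendix A.2:

```python
import numpy as np

def check_alloc(mu, tau_site, tau_pool):
    """Check one candidate (assumed, not the paper's) local-noise allocation.

    Setting tau_g2[s] = tau_pool^2 / (S * mu[s]^2) makes the aggregator's
    residual variance sum_s mu_s^2 * tau_g,s^2 equal tau_pool^2 exactly;
    feasibility additionally requires tau_g,s^2 <= tau_s^2 at every site.
    """
    S = len(mu)
    tau_g2 = tau_pool**2 / (S * mu**2)
    residual = np.sum(mu**2 * tau_g2)          # variance seen at the aggregator
    feasible = np.all(tau_g2 <= tau_site**2)   # local budget not exceeded
    return residual, feasible

c = np.sqrt(2.0 * np.log(1.25 / 1e-5))        # Gaussian-mechanism constant, eps = 1
N_s = np.array([2000, 3000, 5000])            # unequal site sample sizes
mu = N_s / N_s.sum()                          # weight sites by sample size
tau_site = c / N_s                            # per-site noise std
tau_pool = c / N_s.sum()                      # pooled-data noise std
residual, feasible = check_alloc(mu, tau_site, tau_pool)
print(residual - tau_pool**2, feasible)
```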

4 Improved Distributed Differentially-private Principal Component Analysis

1: Data matrix X_s for s = 1, …, S; privacy parameters ε, δ; reduced dimension K
2: At random noise generator: generate E_s, as described in the text; send to sites
3: At aggregator: generate F_s, as described in the text; send to sites
4: for s = 1, …, S do at the local sites
5:      Compute A_s = (1/N_s) X_s X_sᵀ
6:      Generate symmetric matrix G_s, as described in the text
7:      Compute Â_s = A_s + E_s + F_s + G_s; send to aggregator
8: end for
9: Compute Â = (1/S) Σ_{s=1}^S (Â_s − F_s) at the aggregator
10: Perform SVD: Â = V̂ Λ̂ V̂ᵀ
11: Release / send to sites: V̂_K
12: return V̂_K
Algorithm 1 Improved Distributed Differentially-private PCA

In this section, we propose an improved distributed differentially-private PCA algorithm that takes advantage of the CAPE protocol. Recall that in our distributed PCA problem, we are interested in approximating the top-K subspace V_K in a distributed setting while guaranteeing differential privacy. One naïve (non-private) approach would be to send the data matrices from the sites to the aggregator. When D and/or N_s are large, this entails a huge communication overhead. In many scenarios the local data are also private or sensitive. As the aggregator is not trusted, sending the data to the aggregator can result in a significant privacy violation. Our goals are therefore to reduce the communication cost, ensure differential privacy, and provide a close approximation to the true PCA subspace V_K. We previously proposed a differentially-private distributed PCA scheme [17], but its performance is limited by the larger variance of the additive noise at the local sites due to the smaller sample sizes. We intend to alleviate this problem using the correlated noise scheme [24]. The improved distributed differentially-private PCA algorithm we propose here achieves the same utility as the pooled-data scenario.

We consider the same network structure as in Section 3: there is a random noise generator that can generate and send noise to the sites through an encrypted/secure channel. The aggregator can also generate noise and send it to the sites over encrypted/secure channels. Recall that in the pooled-data scenario, we have the data matrix X and the sample second-moment matrix A = (1/N) X Xᵀ. We refer to the top-K PCA subspace of this sample second-moment matrix as the true (or optimal) subspace V_K. At each site, we compute the sample second-moment matrix as A_s = (1/N_s) X_s X_sᵀ. The sensitivity [1] of the function f(X_s) = A_s is 1/N_s [26]. In order to approximate A_s satisfying (ε, δ) differential privacy, we can employ the AG algorithm [26] to compute Â_s = A_s + G_s, where the symmetric matrix G_s is generated with entries i.i.d. 𝒩(0, τ_s²) and τ_s = (1/(N_s ε)) √(2 log(1.25/δ)). Note that in the pooled-data scenario, the sensitivity of the function f(X) = A is 1/N. Therefore, the required additive noise standard deviation should satisfy τ_pool = τ_s/S, assuming an equal number of samples at the sites. As we want the same utility as the pooled-data scenario, we compute the following at each site s:

Â_s = A_s + E_s + F_s + G_s.

Here, the noise generator generates the matrix E_s with entries drawn i.i.d. 𝒩(0, τ_e²) and Σ_{s=1}^S E_s = 0. We set the variance according to (2) as τ_e² = (1 − 2/S) τ_s². Additionally, the aggregator generates the matrix F_s with entries drawn i.i.d. 𝒩(0, τ_f²). The variance is set according to (2) as τ_f² = τ_s²/S. Finally, the sites generate their own symmetric matrices G_s, whose entries are drawn i.i.d. 𝒩(0, τ_g²) with τ_g² = τ_s²/S according to (2). Note that these variance assignments can be readily modified to fit the unequal privacy/sample-size scenario (Section 3.1). However, for simplicity, we consider the equal-sample-size scenario. Now, the sites send their Â_s to the aggregator, and the aggregator computes

Â = (1/S) Σ_{s=1}^S (Â_s − F_s) = (1/S) Σ_{s=1}^S A_s + (1/S) Σ_{s=1}^S G_s,

where we used the relation Σ_{s=1}^S E_s = 0. The detailed calculation is shown in Appendix C.1. We note that at the aggregator, we end up with an estimator whose noise variance is exactly the same as that of the pooled-data scenario. Next, we perform SVD on Â and release the top-K eigenvector matrix V̂_K, which is the differentially private approximation to the true subspace V_K. To achieve the same utility level as the pooled-data case, we chose to send the full matrix Â_s from the sites to the aggregator instead of a partial square root of it [17]. This increases the communication cost by O(D² − DR), where R is the intermediate dimension of the partial square root. This can be thought of as the cost of the gain in performance.
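A compact end-to-end sketch of this scheme follows (synthetic data of our own; the aggregator-generated noise F_s is omitted for brevity, with the remaining per-site privacy budget assigned to the correlated noise):

```python
import numpy as np

rng = np.random.default_rng(2)
D, S, Ns, K = 20, 5, 2000, 3
eps, delta = 2.0, 1e-5
tau_s = (1.0 / (Ns * eps)) * np.sqrt(2.0 * np.log(1.25 / delta))

# Synthetic data with a dominant K-dimensional subspace, split across S sites
basis = np.linalg.qr(rng.normal(size=(D, K)))[0]
sites = [basis @ rng.normal(size=(K, Ns)) + 0.1 * rng.normal(size=(D, Ns)) for _ in range(S)]
sites = [X / np.maximum(1.0, np.linalg.norm(X, axis=0)) for X in sites]  # enforce ||x|| <= 1

def sym_noise(tau):
    # Symmetric Gaussian noise matrix (upper triangle i.i.d., mirrored)
    G = np.triu(rng.normal(0.0, tau, size=(D, D)))
    return G + np.triu(G, 1).T

# Correlated zero-sum noise E_s plus local noise G_s (variance split assumed here)
E = [sym_noise(tau_s * np.sqrt(1.0 - 1.0 / S)) for _ in range(S)]
E = [Es - sum(E) / S for Es in E]                  # enforce sum_s E_s = 0
A_hat = [X @ X.T / Ns + E[s] + sym_noise(tau_s / np.sqrt(S)) for s, X in enumerate(sites)]

A = sum(A_hat) / S                                 # aggregator: the E_s cancel
V = np.linalg.eigh(A)[1][:, ::-1][:, :K]           # top-K eigenvectors
# Subspace closeness to the noiseless pooled PCA subspace
A_pool = sum(X @ X.T / Ns for X in sites) / S
V_pool = np.linalg.eigh(A_pool)[1][:, ::-1][:, :K]
err = np.linalg.norm(V @ V.T - V_pool @ V_pool.T, 2)
print(err)   # small: residual noise variance matches the pooled-data level
```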

Theorem 1 (Privacy of Algorithm 1).

Algorithm 1 computes an (ε, δ) differentially private approximation to the optimal subspace V_K.

Proof.

The proof of Theorem 1 follows from the Gaussian mechanism [1], the AG algorithm [26], the bound on the sensitivity of A_s, and the fact that the data samples in each site are disjoint. We start by observing that the total noise variance at each site satisfies

τ_e² + τ_f² + τ_g² = (1 − 2/S) τ_s² + τ_s²/S + τ_s²/S = τ_s².

Therefore, the computation of Â_s at each site is at least (ε, δ) differentially-private. As differential privacy is invariant under post-processing, we can combine the noisy second-moment matrices at the aggregator while subtracting F_s for each site. By the correlated noise generation at the random noise generator, the noise Σ_s E_s cancels out. We perform the SVD on Â and release V̂_K. The released subspace V̂_K is thus an (ε, δ) differentially private approximation to the true subspace V_K. ∎

Performance Gain with Correlated Noise. The distributed differentially-private PCA algorithm of [17] essentially employs conventional averaging (when each site sends the full Â_s to the aggregator). Therefore, the gain in performance of the proposed algorithm over the one in [17] is the same as shown in Proposition 1.

Theoretical Performance Guarantee. Due to the application of the correlated noise protocol, we achieve the same level of noise at the aggregator in the distributed setting as we would have in the pooled-data scenario. In essence, the proposed algorithm can achieve the same performance as the AG algorithm [26] modified to account for all the samples across all the sites. Here, we present three guarantees: for the captured energy, the closeness to the true subspace, and the low-rank approximation. The guarantees are adopted from Dwork et al. [26] and modified to fit our setup and notation. Let the differentially-private subspace output from Algorithm 1 and the true subspace be denoted by V̂_K and V_K, respectively. We denote the singular values of X with σ₁ ≥ … ≥ σ_D, the un-normalized second-moment matrix with B = X Xᵀ, and the effective noise matrix at the aggregator with E. Let B_K and B̂_K be the true and the differentially-private rank-K approximations to B, respectively. If we assume that the gap σ_K² − σ_{K+1}² is sufficiently large compared to ‖E‖₂, then the following hold:

  • captured energy: tr(V̂_Kᵀ B V̂_K) ≥ tr(V_Kᵀ B V_K) − O(K ‖E‖₂);

  • closeness to the true subspace: ‖V̂_K V̂_Kᵀ − V_K V_Kᵀ‖₂ = O(‖E‖₂ / (σ_K² − σ_{K+1}²));

  • low-rank approximation: ‖B − B̂_K‖₂ ≤ ‖B − B_K‖₂ + O(K ‖E‖₂).

The detailed proofs can be found in Dwork et al. [26].

Communication Cost. We quantify the total communication cost associated with the proposed algorithm, which is a one-shot algorithm. The random noise generator and the aggregator each send one D × D matrix to every site. Each site uses these to compute the noisy estimate Â_s of the local second-moment matrix and sends that back to the aggregator. Therefore, the total communication cost is proportional to 3SD², i.e., O(SD²). This is expected, as we are computing the global second-moment matrix in a distributed setting before computing the PCA subspace.

5 Distributed Differentially-private Orthogonal Tensor Decomposition

1: Sample second-order moment matrices M_s and third-order moment tensors 𝒯_s for s = 1, …, S; privacy parameters ε₁, δ₁, ε₂, δ₂; reduced dimension K
2: At random noise generator: generate E_s and ℰ_s, as described in the text; send to sites
3: At aggregator: generate F_s and ℱ_s, as described in the text; send to sites
4: for s = 1, …, S do at the local sites
5:      Generate G_s, as described in the text
6:      Compute M̂_s = M_s + E_s + F_s + G_s; send to aggregator
7: end for
8: Compute M̂ = (1/S) Σ_{s=1}^S (M̂_s − F_s) and then the SVD of M̂ as M̂ = V̂ Λ̂ V̂ᵀ at the aggregator
9: Compute and send to sites: the whitening matrix Ŵ = V̂_K Λ̂_K^{−1/2}
10: for s = 1, …, S do at the local sites
11:      Generate symmetric 𝒢_s from the entries of a locally drawn Gaussian vector, as described in the text
12:      Compute 𝒯̃_s = 𝒯_s(Ŵ, Ŵ, Ŵ) and 𝒯̂_s = 𝒯̃_s + ℰ_s + ℱ_s + 𝒢_s; send to aggregator
13: end for
14: Compute 𝒯̂ = (1/S) Σ_{s=1}^S (𝒯̂_s − ℱ_s) at the aggregator
15: return The differentially private orthogonally decomposable tensor 𝒯̂, projection subspace Ŵ
Algorithm 2 Distributed Differentially-private OTD

In this section, we propose an algorithm for distributed differentially-private OTD. The proposed algorithm takes advantage of the correlated noise design scheme (Algorithm 3, Appendix A.1) [24]. To our knowledge, this is the first work on distributed differentially-private OTD. Due to page limits, the definition of differentially-private OTD and the description of two recently proposed differentially-private OTD algorithms [21] are presented in Appendix E.

We start by recalling that the orthogonal decomposition of a 3-rd order symmetric tensor 𝒳 ∈ R^{D×D×D} is a collection of orthonormal vectors {v_k} together with corresponding positive scalars {λ_k} such that 𝒳 = Σ_{k=1}^K λ_k v_k ⊗ v_k ⊗ v_k. A unit vector u ∈ R^D is an eigenvector of 𝒳 with corresponding eigenvalue λ if 𝒳(I, u, u) = λu, where I is the D × D identity matrix [3]. To see this, one can observe

𝒳(I, u, u) = Σ_{k=1}^K λ_k (uᵀv_k)² v_k.

By the orthogonality of the v_k, it is clear that 𝒳(I, v_k, v_k) = λ_k v_k. Now, the orthogonal tensor decomposition proposed in [3] is based on the mapping

u ↦ 𝒳(I, u, u) / ‖𝒳(I, u, u)‖₂, (3)

which can be considered the tensor equivalent of the well-known matrix power method. Of course, not all tensors are orthogonally decomposable. As the tensor power method requires the eigenvectors to be orthonormal, we need to perform whitening, that is, projecting the tensor onto a subspace such that the eigenvectors become orthogonal to each other.
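The tensor power iteration in (3) can be sketched as follows on a synthetically generated orthogonally decomposable tensor (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
D, K = 8, 3

# Build an orthogonally decomposable tensor X = sum_k lam_k v_k (x) v_k (x) v_k
V = np.linalg.qr(rng.normal(size=(D, K)))[0]     # orthonormal columns v_k
lam = np.array([3.0, 2.0, 1.0])
T = np.einsum('k,ik,jk,lk->ijl', lam, V, V, V)

def power_iteration(T, n_iter=100, rng=rng):
    """Tensor power method: u <- T(I, u, u) / ||T(I, u, u)||, as in the mapping (3)."""
    u = rng.normal(size=T.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        u = np.einsum('ijl,j,l->i', T, u, u)     # T(I, u, u)
        u /= np.linalg.norm(u)
    lam_hat = np.einsum('ijl,i,j,l->', T, u, u, u)   # eigenvalue = T(u, u, u)
    return lam_hat, u

lam_hat, u = power_iteration(T)
print(lam_hat)   # converges to one of the eigenpairs (lam_k, v_k)
```

With deflation (subtracting lam_hat · u ⊗ u ⊗ u and iterating), all K eigenpairs can be recovered in turn.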

We note that the proposed algorithm applies to both the STM and MOG problems. However, as the correlated noise scheme only works with Gaussian noise, the proposed distributed OTD employs the Gaussian-noise-based OTD algorithm of [21] at its core. In line with our setup in Section 3, we assume that there is a random noise generator that can generate and send noise to the sites through an encrypted/secure channel. The un-trusted aggregator can also generate noise and send it to the sites over encrypted/secure channels. At site s, the sample second-order moment matrix and the third-order moment tensor are denoted by M_s and 𝒯_s, respectively. The noise standard deviation required for computing the differentially-private approximation to M_s is given by

τ_{1,s} = (Δ_{1,s}/ε₁) √(2 log(1.25/δ₁)), (4)

where the sensitivity Δ_{1,s} is inversely proportional to the sample size N_s. The detailed derivation of the sensitivity of M_s for both STM and MOG is shown in Appendix E. Additionally, at site s, the noise standard deviation required for computing the differentially-private approximation to 𝒯_s is given by

τ_{2,s} = (Δ_{2,s}/ε₂) √(2 log(1.25/δ₂)). (5)

Appendix E contains the detailed algebra for calculating the sensitivity Δ_{2,s} of 𝒯_s for STM and MOG. We note that, as in the case of M_s, this sensitivity depends only on the sample size N_s. Now, in the pooled-data scenario, the corresponding noise standard deviations would be τ₁ = τ_{1,s}/S and τ₂ = τ_{2,s}/S, assuming an equal number of samples at the sites. We need to compute the whitening matrix Ŵ and the tensor 𝒯̂ in a distributed way while satisfying differential privacy. Although we could employ our previous differentially-private distributed PCA algorithm [17] to compute Ŵ, to achieve the same level of accuracy as the pooled-data scenario we compute the following matrix at site s:

M̂_s = M_s + E_s + F_s + G_s,

where E_s is generated at the noise generator satisfying Σ_{s=1}^S E_s = 0, with entries drawn i.i.d. 𝒩(0, τ_e²). Here, we set the noise variance according to (2): τ_e² = (1 − 2/S) τ_{1,s}². Additionally, F_s is generated at the aggregator with entries drawn i.i.d. 𝒩(0, τ_f²). We set the noise variance according to (2): τ_f² = τ_{1,s}²/S. Finally, G_s is a symmetric matrix generated at site s with entries drawn i.i.d. 𝒩(0, τ_g²) and τ_g² = τ_{1,s}²/S according to (2). At the aggregator, we compute

M̂ = (1/S) Σ_{s=1}^S (M̂_s − F_s) = (1/S) Σ_{s=1}^S M_s + (1/S) Σ_{s=1}^S G_s,

where we used the relation Σ_{s=1}^S E_s = 0. Note that the variance of the additive noise in M̂ is exactly the same as in the pooled-data scenario, as described in Section 3. At the aggregator, we can then compute the SVD of M̂ as M̂ = V̂ Λ̂ V̂ᵀ. We compute the whitening matrix Ŵ = V̂_K Λ̂_K^{−1/2} and send it to the sites.
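The whitening step can be illustrated numerically (synthetic M of our own, with W = V_K Λ_K^{−1/2} as above):

```python
import numpy as np

rng = np.random.default_rng(4)
D, K = 10, 3

# A rank-K second-order moment matrix M = V diag(lam) V^T
Vfull = np.linalg.qr(rng.normal(size=(D, K)))[0]
lam = np.array([2.0, 1.0, 0.5])
M = Vfull @ np.diag(lam) @ Vfull.T

# Whitening matrix W = V_K Lambda_K^{-1/2}: projects to K dims and equalizes scale
eigvals, eigvecs = np.linalg.eigh(M)
V_K = eigvecs[:, ::-1][:, :K]          # top-K eigenvectors
L_K = eigvals[::-1][:K]                # top-K eigenvalues
W = V_K / np.sqrt(L_K)                 # column k divided by sqrt(lam_k)

# After whitening, the second-order moment becomes the K x K identity;
# the same W whitens the third-order moment via T(W, W, W), e.g.
# np.einsum('ijl,ia,jb,lc->abc', T, W, W, W).
print(np.round(W.T @ M @ W, 6))
```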

Next, we focus on computing 𝒯̂ in the distributed setting. For this purpose, we could follow the same steps as for computing M̂. However, 𝒯_s is a tensor, and for large enough D this would entail a very large communication overhead. We alleviate this in the following way: each site receives Ŵ and ℱ_s from the aggregator and ℰ_s from the noise generator. Here, the entries of ℱ_s are drawn i.i.d. 𝒩(0, τ_f²). Additionally, the entries of ℰ_s are drawn i.i.d. 𝒩(0, τ_e²), and Σ_{s=1}^S ℰ_s = 0 is satisfied. We set the two variance terms according to (2): τ_e² = (1 − 2/S) τ_{2,s}² and τ_f² = τ_{2,s}²/S. Finally, each site generates its own 𝒢_s in the following way: site s draws a vector with entries i.i.d.