Principal Component Analysis (PCA) is a fundamental technique in machine learning. Nowadays many high-dimensional large datasets are acquired in a distributed manner, which precludes the use of centralized PCA due to the high communication cost and privacy risk. Thus, many distributed PCA algorithms have been proposed, most of which, however, focus on linear cases. To efficiently extract non-linear features, this brief proposes a communication-efficient distributed kernel PCA algorithm, where linear and RBF kernels are applied. The key is to estimate the global empirical kernel matrix from the eigenvectors of local kernel matrices. The approximation error of the estimators is theoretically analyzed for both linear and RBF kernels. The result suggests that when eigenvalues decay fast, which is common for RBF kernels, the proposed algorithm gives high-quality results with low communication cost. Simulation experiments verify our theoretical analysis, and experiments on the GSE2187 dataset show the effectiveness of the proposed algorithm.
Principal Component Analysis (PCA) is a fundamental technique in the machine learning community. Research on PCA and its variants, including sparse PCA, robust PCA, and kernel PCA, has been active for decades, with wide applications in data de-noising, low-rank subspace factorization [6], etc. Depending on the data setting, different algorithms have been designed for PCA, for example, centralized algorithms for small datasets and stochastic algorithms for large datasets.
Nowadays, massive datasets are acquired in a distributed manner, bringing new challenges to traditional data analysis. When the data scale is large, transmitting all data to a single machine requires high communication cost and large memory, which is quite inefficient. Moreover, in many scenarios, such as medical, biomedical, and financial tasks, data privacy and security are critical, which makes it impossible to gather the global data. For these reasons, distributed learning that can learn locally and synthesize information globally becomes very important, and many effective algorithms already exist [8, 9]. According to the structure of the data, the distribution can generally be categorized into two regimes, namely horizontally and vertically partitioned data [10, 11]. The two regimes are shown in Fig. 1: when data are partitioned horizontally, each local machine contains a subset of samples with complete features, whereas in the vertical regime, each machine contains all samples but only a subset of features.
For PCA problems, most current research focuses on the horizontal regime [11, 12, 13, 14, 15]. The key property is the consistency between the sum of local covariance matrices and the global covariance matrix, which benefits both algorithm design and theoretical analysis. For example, the power and inverse power methods can be extended to the distributed setting [14, 15], where the global empirical covariance matrix is implicitly calculated from distributed matrix-vector products. Besides, noise can be added during transmission in a multi-communication algorithm to protect data privacy, which however brings a guaranteed loss of accuracy. For both efficient communication and privacy protection, [15, 11] propose one-shot aggregation algorithms. In one of them, eigenvectors of the global covariance matrix are estimated by averaging the local empirical risk minimizers with sign correction. The other focuses on estimating the eigenspace of the global covariance matrix by averaging the local eigenspaces.
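The consistency property of the horizontal regime can be checked numerically. The sketch below is only an illustration, assuming synthetic data stored with samples as rows and an unnormalized empirical covariance; the names `X`, `row_blocks`, and `n_machines` are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_machines = 120, 6, 4

# Synthetic data: rows are samples, columns are features (assumption).
X = rng.standard_normal((n_samples, n_features))

# Horizontal partition: every machine holds a subset of samples (rows)
# with the complete feature set.
row_blocks = np.array_split(X, n_machines, axis=0)

# Each machine computes its own unnormalized covariance X_j^T X_j.
local_covs = [block.T @ block for block in row_blocks]

# Summing the local matrices recovers the global one up to rounding.
global_cov = X.T @ X
cov_gap = np.max(np.abs(sum(local_covs) - global_cov))
print(f"max |sum of local covariances - global covariance| = {cov_gap:.2e}")
```

No such identity holds for the covariance matrix when the columns (features) are split instead, which is why the vertical regime needs a different treatment.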
However, although the vertical regime is also common in practice, e.g., in wireless sensor networks [16, 17] and ranking or evaluation systems [18, 19], applicable distributed PCA algorithms are few. In fact, in this setting, the data dimension is usually high and PCA is in high demand. Most distributed PCA methods in the vertical regime are rooted in separability, such that the global projection matrix can be calculated locally, but only through an iterative procedure. For example, in [20, 21], the power method and Oja's method are combined with the average consensus algorithm. In the latest work in the vertical regime, distributed PCA is solved by coordinate descent methods combined with the alternating direction method of multipliers (ADMM). Generally, the above methods can solve distributed PCA in the vertical regime, but they require multiple communication rounds, which leaves room for improvement in both efficiency and privacy.
In this brief, inspired by the fact that the kernel trick can transfer the optimization variables from primal weights corresponding to features to dual variables corresponding to samples, we establish a distributed PCA in the vertical regime. Generally speaking, we solve the eigenproblem of the kernel matrix rather than that of the covariance matrix. Since the center kernel matrix can be reformulated as a linear/non-linear combination of local kernel matrices, a one-shot fusing strategy can be used, which achieves high efficiency. From the view of duality, a data covariance matrix in the horizontal regime corresponds to a kernel matrix in the vertical regime, from which it follows that the developed method shares similar properties with primal PCA in the horizontal regime. Besides, since the kernel trick is used, the proposed method can be readily extended to nonlinear PCA, e.g., by applying the RBF kernel.
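The primal/dual correspondence mentioned above can be illustrated for the linear kernel. The following is a minimal sketch on synthetic data, assuming samples are stored as rows; all variable names are illustrative and not from the paper. It checks that a leading eigenvector of the kernel matrix, mapped back through the data, is a unit eigenvector of the unnormalized covariance matrix with the same eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((20, 50))   # rows: samples, columns: features
X -= X.mean(axis=0)                 # zero-mean features

K = X @ X.T                         # 20 x 20 linear kernel (dual) matrix
C = X.T @ X                         # 50 x 50 unnormalized covariance (primal)

# eigh returns eigenvalues in ascending order, so the last pair is the top one.
lam_K, alpha = np.linalg.eigh(K)
lam_top, alpha_top = lam_K[-1], alpha[:, -1]

# Primal direction recovered from the dual eigenvector.
w = X.T @ alpha_top / np.sqrt(lam_top)

residual = np.linalg.norm(C @ w - lam_top * w)
print(f"||w|| = {np.linalg.norm(w):.6f}, residual = {residual:.2e}")
```

With fewer samples than features, as in this toy case, the dual eigenproblem is also the smaller of the two.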
Notice that existing research on distributed KPCA is for the horizontal regime, and most of it requires multiple communication rounds. There, the main aim of introducing the kernel trick is to extend distributed PCA from the linear to the non-linear case. However, as we pointed out, it actually breaks the good properties of the horizontal regime and thus an iterative procedure is required. For example, one approach solves kernel PCA based on the EM algorithm. Another combines subspace embedding and adaptive sampling to generate a representative subset of the original data and then performs local KPCA on it, which needs multiple communication rounds to determine the subset.
To fill this gap, we propose a communication-efficient distributed algorithm for KPCA in the vertical regime. Specifically, the first eigenvectors of the local kernel matrices and their corresponding eigenvalues are calculated and sent to a fusion center, where they are aggregated to reproduce local estimators. Both linear and RBF kernels (for which the global RBF kernel matrix is the Hadamard (element-wise) product of local RBF kernel matrices) are applicable. For linear kernels, the estimator of the global kernel matrix is then computed by adding up the local estimators. For RBF kernels, the estimator of the global kernel matrix is the Hadamard product of the local estimators. Hence, the proposed algorithm needs only one privacy-preserving communication round. Theoretical discussion will show that the approximation error is related to the -th eigenvalue of local matrices. Thus, when eigenvalues decay fast, which is common for RBF kernels, the proposed algorithm can give high-quality results.
The rest of this brief is organized as follows. We briefly review the kernel trick for PCA and model the problem in Section 2. Section 3 gives the algorithm in detail. The approximation analysis is presented in Section 4. In Section 5, numerical experiments are used to verify the theorem and evaluate the proposed methodology. A short conclusion is given in Section 6 to end this brief.
Throughout this brief, we use regular letters for scalars, capital letters in bold for matrices and lowercase letters in bold for vectors. For matrix , represents the Frobenius norm. We use to denote the -th eigenvalue of the symmetric matrix . In this brief, we consider solving the KPCA problem in a distributed setting, where the data are partitioned in the vertical regime and stored across local machines. In addition, without loss of generality, we set the first machine to be the fusion center. For , machine acquires a zero-mean data vector , which is independently and identically distributed at time . is the feature dimension of the data and we have . Let denote the center empirical data collected by all machines, which are not stored together but are given here for convenience. For the center empirical kernel matrix and its approximation , and are used to denote their -th eigenvalues for convenience. The kernel matrix in the -th local machine is denoted as with the corresponding eigenvectors .
Before introducing our distributed algorithm, we first briefly review the KPCA problem, of which the basic idea is to map the original data space into a feature space by an implicit nonlinear mapping . The dot product in feature space can be computed by a kernel function, i.e.,
The goal of KPCA is to diagonalize the covariance matrix in the feature space by solving the following optimization problem,
The solution is the eigenvectors of , i.e.,
Since can be rewritten as , (3) becomes
In other words, is the eigenvector of the kernel matrix , which means we can solve the eigenproblem on instead of on . Such a kernel trick sidesteps the problem of computing the unknown and, moreover, makes the distributed computation for vertically partitioned data more convenient:
is not separable and the approximation by local features is not accurate, i.e., .
itself (linear kernel) or its main calculation part (RBF kernel) is separable, e.g., a linear kernel .
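The separability of the kernel matrix under a vertical partition can be verified on synthetic data. The sketch below is illustrative only: it assumes samples are rows and that the RBF kernel has the form exp(-||x - y||^2 / (2 sigma^2)), and the helper names `linear_kernel` and `rbf_kernel` are hypothetical.

```python
import numpy as np

def linear_kernel(A):
    """Gram matrix of the linear kernel; rows of A are samples."""
    return A @ A.T

def rbf_kernel(A, sigma):
    """Gram matrix of exp(-||a_i - a_k||^2 / (2 sigma^2)) (assumed form)."""
    sq_norms = np.sum(A * A, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (A @ A.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(1)
n_samples, n_features, n_machines, sigma = 50, 12, 3, 2.0
X = rng.standard_normal((n_samples, n_features))

# Vertical partition: every machine holds all samples but only some features.
col_blocks = np.array_split(X, n_machines, axis=1)

# Linear kernel: the global Gram matrix is the sum of the local Gram matrices.
K_lin_sum = sum(linear_kernel(B) for B in col_blocks)
lin_gap = np.max(np.abs(K_lin_sum - linear_kernel(X)))

# RBF kernel: squared distances add over feature blocks, so the global Gram
# matrix is the Hadamard (element-wise) product of the local ones.
K_rbf_prod = np.ones((n_samples, n_samples))
for B in col_blocks:
    K_rbf_prod *= rbf_kernel(B, sigma)
rbf_gap = np.max(np.abs(K_rbf_prod - rbf_kernel(X, sigma)))

print(f"linear gap = {lin_gap:.2e}, RBF gap = {rbf_gap:.2e}")
```

Both gaps are at the level of floating-point rounding, which is what allows the fusion center to work with local kernel matrices only.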
In this brief, we propose a communication-efficient privacy-preserving distributed algorithm for KPCA with linear and RBF kernels. The algorithm can produce a good estimate of the global optimum in one communication round with privacy protection. The algorithmic details are introduced in this section, and the approximation error will be analyzed in Section 4.
Our basic idea is to use the eigenvectors of local kernel matrices to represent the center empirical kernel matrix . Specifically, we first calculate the top eigenvectors of with the corresponding eigenvalues in each local machine and then send them to the fusion center, where we aggregate these eigenvectors by a function that depends on the kernel used. The calculation of the estimator can be represented as below,
For linear kernels, it holds that
Thus, the estimator is calculated as follows.
For RBF kernels, we decompose the function as follows.
where is the kernel width. Using to denote the Hadamard (element-wise) product operator, we rewrite (6) as below,
Once the eigenvectors of each local kernel matrix are obtained, we can approximate the whole kernel matrix as follows,
Finally, we compute the first eigenvectors of , denoted as , and the projection matrix . Notice that for this calculation, is unknown but can be calculated in a distributed system.
For linear kernels, we have
Thus, the local machine calculates and a center machine adds up the results.
For RBF kernels, we have
where can be calculated in a distributed manner.
The overall algorithm is summarized in Alg. 1.
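A compact sketch of the one-shot procedure described in this section is given below. It is an illustration under stated assumptions rather than the authors' implementation: samples are rows, the RBF kernel is taken as exp(-||x - y||^2 / (2 sigma^2)), and the function names (`top_eigpairs`, `local_step`, `fusion_step`) and parameters (`d`, `d_out`, `sigma`) are hypothetical. The final projection step is omitted.

```python
import numpy as np

def top_eigpairs(K, d):
    """Leading d eigenvalues and eigenvectors of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(K)           # ascending order
    order = np.argsort(vals)[::-1][:d]       # indices of the d largest
    return vals[order], vecs[:, order]

def local_step(X_j, d, kernel, sigma=1.0):
    """Machine j: build the local kernel matrix and keep d eigenpairs."""
    gram = X_j @ X_j.T
    if kernel == "linear":
        K_j = gram
    else:
        sq = np.sum(X_j * X_j, axis=1)
        dist = np.maximum(sq[:, None] + sq[None, :] - 2.0 * gram, 0.0)
        K_j = np.exp(-dist / (2.0 * sigma**2))
    return top_eigpairs(K_j, d)

def fusion_step(messages, d_out, kernel):
    """Fusion center: rebuild the local estimators and combine them."""
    # V diag(lam) V^T for every received (eigenvalues, eigenvectors) pair.
    estimators = [(V * lam) @ V.T for lam, V in messages]
    if kernel == "linear":
        K_hat = sum(estimators)                         # sum of estimators
    else:
        K_hat = np.prod(np.stack(estimators), axis=0)   # Hadamard product
    return K_hat, top_eigpairs(K_hat, d_out)

rng = np.random.default_rng(2)
n, m, J = 40, 30, 3
# Low-rank-plus-noise data so that the local eigenvalues decay quickly.
X = rng.standard_normal((n, 4)) @ rng.standard_normal((4, m))
X += 0.01 * rng.standard_normal((n, m))
blocks = np.array_split(X, J, axis=1)        # vertical partition

messages = [local_step(B, d=6, kernel="linear") for B in blocks]
K_hat, (lam_hat, alpha_hat) = fusion_step(messages, d_out=2, kernel="linear")

K_true = X @ X.T
rel_err = np.linalg.norm(K_hat - K_true) / np.linalg.norm(K_true)
print(f"relative Frobenius error of the estimate: {rel_err:.2e}")
```

In this toy run the local eigenvalues decay quickly, so a handful of eigenpairs per machine already gives a small relative error; with slowly decaying spectra more eigenpairs would be needed.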
Intuitively, Algorithm 1 requires only one round of communication and seems quite efficient. To give a rigorous analysis, we restrict our discussion to the uniformly distributed situation, i.e., the dimension of the features in each local machine is and there is no statistical difference between nodes. The discussion of the more general case is similar but involves additional terms.
Note that Alg. 1 has only one communication round, where local machines send eigenvectors with their corresponding eigenvalues to the fusion center. Thus, the communication cost of Alg. 1 is . For centralized algorithms, where all data are sent to the fusion center, the communication cost is . Since we want high communication efficiency, tends to be much smaller than .
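To make the comparison concrete, the following sketch counts transmitted numbers under a simple assumed model: every local machine sends a fixed number of eigenvectors, each as long as the number of samples, plus one eigenvalue per eigenvector, while a centralized method ships every raw feature value. The counting convention, the function names, and the example sizes are assumptions for illustration and are not taken from the paper.

```python
def one_shot_cost(n_samples, n_machines, n_eig):
    """Numbers sent when each machine ships n_eig eigenvectors of length
    n_samples plus n_eig eigenvalues (assumed counting convention)."""
    return n_machines * n_eig * (n_samples + 1)

def centralized_cost(n_samples, n_features_total):
    """Numbers sent when all raw feature values go to one machine."""
    return n_samples * n_features_total

# Hypothetical sizes: 500 samples, 8000 features over 20 machines,
# 10 eigenpairs kept per machine.
one_shot = one_shot_cost(n_samples=500, n_machines=20, n_eig=10)
central = centralized_cost(n_samples=500, n_features_total=8000)
print(f"one-shot: {one_shot} numbers, centralized: {central} numbers")
```

Under this model the one-shot cost does not depend on the feature dimension at all, so the saving grows with the number of features held by each machine.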
The computation process consists of three main parts:
the computation cost of calculating kernel matrix in local is .
the computation cost of solving the eigenproblem is (for general SVD algorithm).
the computation cost of estimating the global kernel matrix in the fusion center is .
Thus, the total computation cost of Alg. 1 is . Compared with centralized algorithms, which need additional communication and fusion, Alg. 1 sacrifices computation efficiency for communication efficiency.
Let us further discuss the computation cost. When is relatively small, is ignored and the computation cost becomes . For given data, if
then the computation cost is , the same as centralized algorithms. Notice that the required condition is not strict. For example, when , , , then , which is a large range, will meet the above requirement and the computation cost is .
We present the approximation analysis for Alg. 1 here in both linear and RBF cases. Specifically, we study the distance between the eigenspaces spanned by , the eigenvectors of the global kernel matrix , and the estimator calculated by Alg. 1. distance is well-defined and is widely used for measuring the distance between two linear spaces [24, 11]. Let
be the singular values of and define as follows.
The eigengap then can be given as follows
Before giving the main result, we need the following lemmas.
Lemma 1 (Davis–Kahan’s theorem). Let and be two symmetric real matrices, whose leading eigenvectors are , respectively. Let denote the -th eigenvalue of , and let be defined in (12). Then it holds that
Lemma 2. Let be a kernel matrix derived from a kernel function , and let be its approximation computed by Alg. 1. If is a linear kernel, then it holds that
If is an RBF kernel, then it holds that
Theorem 1. Let be the first eigenvectors of the global kernel matrix derived from a kernel function , and let be its approximation computed by Alg. 1. If is a linear kernel, then they satisfy
If is an RBF kernel, then they satisfy
The performance of Alg. 1 is evaluated in this section from three aspects. First, simulation experiments are conducted to show the relationship between the communication cost and the number of local machines. Second, we compare the proposed algorithm with DPCA, which is the state-of-the-art distributed PCA algorithm in the vertical case but can only deal with linear PCA. Third, classification experiments on a real dataset are conducted to show the effectiveness of Alg. 1.
Simulation and real data are used. The simulation data are generated as follows. (i) Generate the covariance matrix and two orthonormal matrices and . (ii) Calculate the total data by . Details of will be described later.
Real data from a dataset on the response of rats to drugs and toxicants are used. This dataset is publicly available at the NIH GEO, under accession number GSE2187. It was collected on cRNA microarray chips with 8565 probes (features), corresponding to four categories: fibrates (107 samples), statins (93 samples), azoles (156 samples) and toxicants (181 samples). Features are removed if more than of the samples have their values missing. The remaining missing values are filled with mean values.
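The missing-value handling described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: missing entries are assumed to be encoded as NaN, the mean is assumed to be the per-feature mean, and `max_missing_frac` is a hypothetical parameter standing in for the removal threshold, whose value is not given here.

```python
import numpy as np

def clean_missing(X, max_missing_frac):
    """Drop features with too many missing values, then mean-impute the rest.

    X: array with samples as rows and NaN marking missing entries.
    Returns the cleaned array and a boolean mask of the kept features.
    """
    missing = np.isnan(X)
    keep = missing.mean(axis=0) <= max_missing_frac
    X_kept = X[:, keep].copy()
    col_means = np.nanmean(X_kept, axis=0)       # means over observed values
    rows, cols = np.where(np.isnan(X_kept))
    X_kept[rows, cols] = col_means[cols]         # fill with the feature mean
    return X_kept, keep

X_demo = np.array([[1.0, np.nan, 3.0],
                   [2.0, np.nan, np.nan],
                   [3.0, 5.0, 9.0],
                   [4.0, np.nan, 6.0]])
X_clean, kept = clean_missing(X_demo, max_missing_frac=0.5)
print(kept)
print(X_clean)
```

In the toy input the middle feature is missing in three of four samples and is dropped, while the single gap in the last feature is filled with the mean of its observed values.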
For both simulation and real data, results calculated by performing the SVD algorithm on the whole underlying kernel matrix are regarded as the ground truth. All the simulations are done with Matlab R2016b on a Core i5-7300HQ 2.50 GHz machine with 8 GB RAM. The code of Alg. 1, together with the experiments, is available at https://github.com/hefansjtu/DKPCA.
We change the number of features in local machines and the number of samples to see how the error rate and running time change. distance is used to measure the estimation error here, which is computed by (11).
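A subspace distance of this kind can be computed from the singular values of the product of the two bases. The sketch below assumes the metric is the Frobenius norm of sin(Theta), the form commonly paired with the Davis–Kahan theorem; whether this matches the exact definition in (11) is an assumption, and the function name is hypothetical.

```python
import numpy as np

def sin_theta_distance(U, U_hat):
    """Frobenius norm of sin(Theta) between the column spaces of U and U_hat.

    Both inputs must have orthonormal columns; the singular values of
    U^T U_hat are the cosines of the principal angles.
    """
    cosines = np.linalg.svd(U.T @ U_hat, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)         # guard against rounding
    return np.sqrt(np.sum(1.0 - cosines**2))

E = np.eye(4)
same = sin_theta_distance(E[:, :2], E[:, [1, 0]])   # same subspace, reordered
orth = sin_theta_distance(E[:, :2], E[:, 2:])       # orthogonal subspaces
print(same, orth)
```

The distance is zero when the two bases span the same subspace, regardless of the ordering or sign of the basis vectors, and is largest when the subspaces are orthogonal.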
Simulation data are used to evaluate Alg. 1 in linear cases. We set the rank of to , where the first diagonal elements are and others are zero. The error and running time of Alg. 1 are reported. In Fig. 2 (a), we fix and change . One can see that as the number of local machines increases, the estimation error stays similar except in the extreme cases, i.e., or . When the number of local features changes from to , the computation time changes little. Recall that when , Alg. 1 is the same as centralized algorithms. This means that in practical use, where we recommend setting the number of local features around the sample size, the computation cost is similar to that of centralized algorithms. In Fig. 2 (b), we fix but change the sample size, showing that the error rate decays as the sample size increases. It also shows that the growth trend of the computation time is similar to .
For the non-linear case, we use Dataset GSE2187 (, ). We set and the width of the RBF kernel . The result is reported in Fig. 3, which indicates that the trends of both the computation cost and the error rate in (a) are the same as those of linear kernels, while in (b), the error rate shows little trend as the sample size increases.
| Toxicant vs Fibrate | Alg. 1 | 0.3675±0.0383 | 0.0323±0.0218 | 0.0164±0.0128 | 0.0150±0.0118 | 0.0148±0.0111 | 0.0145±0.0110 | 0.0148±0.0115 | 0.0148±0.0115 |
| Toxicant vs Azole | Alg. 1 | 0.4790±0.0410 | 0.3966±0.0305 | 0.2778±0.0361 | 0.1907±0.0376 | 0.1416±0.0331 | 0.1099±0.0283 | 0.1019±0.0271 | 0.0969±0.0259 |
| Toxicant vs Others | Alg. 1 | 0.3341±0.0148 | 0.3326±0.0202 | 0.2477±0.0209 | 0.2004±0.0231 | 0.1499±0.0197 | 0.1428±0.0169 | 0.1409±0.0164 | 0.1410±0.0172 |
In this subsection, we compare the proposed Alg. 1 with DPCA, a state-of-the-art distributed PCA in the vertical regime, which however can only deal with linear problems. Hence, we only compare the performance of Alg. 1 (DKPCA) with DPCA in linear cases. Simulation data are used here, where and and the rank of is . The first diagonal elements of are , where we change to control the eigengap and see how the accuracy changes with different data.
DPCA solves PCA in a decentralized setting and the number of neighbors of local machines influences the result significantly. We set this number to 3, 5, and 10, and denote the corresponding results as DPCA3, DPCA5, and DPCA10 in Fig. 4, respectively. In DPCA, a coordinate descent method is used with an inner ADMM loop. Following the experiment setting in , we set the parameter of ADMM as and the maximum number of the inner ADMM iterations as . The following error metric is used to measure the accuracy of principal subspace estimation,
where is the ground truth and is its estimation.
Fig. 4 shows the log of the mean error of DKPCA and DPCA with respect to the iteration number. DKPCA is a one-shot method, thus its error is independent of the iteration number. The eigengap in Fig. 4 (a) is bigger than that in (b). Hence, DKPCA performs better in (a), which coincides with our theoretical analysis in Section 4. Though the accuracy of DPCA changes little with respect to the eigengap, its convergence speed is significantly affected. Note that in each outer iteration of DPCA, the inner ADMM iterates times with communication rounds. Hence, from this point of view, DKPCA is more economical in communication and computation cost.
The aim of PCA is to keep useful information during data projection, and thus its performance can be observed in a downstream learning task on the projected data. In this subsection, we first map data into a low-dimensional feature space by Alg. 1 (DKPCA), the centralized kernel algorithm (KPCA), or the centralized linear algorithm (PCA). Then we feed them to a linear support vector machine (L-SVM). Dataset GSE2187 provides classification tasks: toxicants vs fibrates (), toxicants vs azoles (), and toxicants vs others (). We randomly choose data as the training set and use the rest for testing. The RBF kernel width .
The average classification error and its standard deviation over 50 trials are reported in Table I. Since KPCA and PCA operate on the full data, it can be expected that the classification performance based on KPCA is better than that of the proposed DKPCA, and likewise that PCA is better than all the existing distributed algorithms for linear cases. From Table I, one can observe that the performance of DKPCA approaches that of KPCA and is generally better than that of PCA, showing the benefit of extending distributed PCA from linear to nonlinear for the vertical regime.
This brief introduces a communication-efficient privacy-preserving algorithm for distributed kernel PCA in the vertical regime, which estimates the underlying kernel matrix from the eigenvectors of local kernel matrices. The theoretical analysis of the approximation error shows that the proposed algorithm gives high-quality results with low communication cost if eigenvalues decay fast. As a one-shot method, the proposed algorithm sacrifices computation efficiency for communication efficiency. However, both theoretical and experimental results show that when the number of local machines falls in a suitable range, the computation cost is similar to that of centralized algorithms. Experiments on the real dataset GSE2187 also verify the effectiveness of the proposed method in practice.
Y. Zhang, J. Duchi, and M. Wainwright, “Divide and conquer kernel ridge regression,” in Conference on Learning Theory, 2013, pp. 592–617.
O. Ghorbel, M. W. Jmal, M. Abid, and H. Snoussi, “Distributed and efficient one-class outliers detection classifier in wireless sensors networks,” in International Conference on Wired/Wireless Internet Communication. Springer, 2015, pp. 259–273.
R. Rosipal and M. Girolami, “An expectation-maximization approach to nonlinear component analysis,” Neural Computation, vol. 13, no. 3, pp. 505–510, 2001.
If are two positive semidefinite matrices, then so is .
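Assuming the statement above refers to the Hadamard (element-wise) product of the two matrices, i.e. the classical Schur product theorem, it can be checked numerically. The sketch below uses random positive semidefinite matrices; it is a sanity check on one example, not a proof, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8

# Random positive semidefinite matrices of the form G G^T.
G1 = rng.standard_normal((n, 3))
G2 = rng.standard_normal((n, 5))
A, B = G1 @ G1.T, G2 @ G2.T

hadamard = A * B                                 # element-wise product
min_eig = np.linalg.eigvalsh(hadamard).min()
print(f"smallest eigenvalue of the Hadamard product: {min_eig:.3e}")
```

The smallest eigenvalue is non-negative up to rounding, consistent with the element-wise product of RBF kernel matrices remaining a valid kernel matrix.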
Let be two positive semidefinite matrices. Then any eigenvalue of satisfies
Now we are at the stage of proving Lemma 2.
Recall that in the fusion center, local kernel matrices are represented by their first eigenvectors. Therefore, we first estimate the local approximation error. In machine , denotes the first eigenvectors of with the corresponding eigenvalues , which is a diagonal matrix with diagonal elements equal to . Then the local approximation error satisfies
When the linear kernel is used, from (5), we have
Then, we focus on the RBF kernel. For convenience, we use to denote . Recalling the definition of , we know that both and are positive semidefinite and .
We define matrices for as follows
because and .