Online Principal Component Analysis in High Dimension: Which Algorithm to Choose?

11/11/2015, by Hervé Cardot, et al.

In the current context of data explosion, online techniques that do not require storing all data in memory are indispensable to routinely perform tasks like principal component analysis (PCA). Recursive algorithms that update the PCA with each new observation have been studied in various fields of research and found wide applications in industrial monitoring, computer vision, astronomy, and latent semantic indexing, among others. This work provides guidance for selecting an online PCA algorithm in practice. We present the main approaches to online PCA, namely, perturbation techniques, incremental methods, and stochastic optimization, and compare their statistical accuracy, computation time, and memory requirements using artificial and real data. Extensions to missing data and to functional data are discussed. All studied algorithms are available in the R package onlinePCA on CRAN.


1 Introduction

Principal Component Analysis (PCA) is a popular method for reducing the dimension of data while preserving most of their variation (see Jolliffe (2002) for a general presentation). In a few words, PCA consists in extracting the main modes of variation of the data around their mean via the computation of new synthetic variables named principal components. This technique has found applications in fields as diverse as data mining, industrial process modeling (Tang et al., 2012), face recognition (Zhao et al., 2006), latent semantic indexing (Zha and Simon, 1999), sentiment analysis (Iodice D'Enza and Markos, 2015), astronomy (Budavári et al., 2009), and more. Traditional PCA, also called batch PCA or offline PCA, is typically implemented via the eigenvalue decomposition (EVD) of the sample covariance matrix or the singular value decomposition (SVD) of the centered data. EVD and SVD can be standardly computed in O(min(n²d, nd²)) floating point operations (flops), where n is the sample size and d the number of variables (Golub and Van Loan, 2013, Chap. 8). If the number q of required principal components is much smaller than n and d (this is usually the case), approximate PCA decompositions can be obtained quickly with little loss of accuracy. For example, a rank-q truncated SVD can be computed in O(ndq) flops by combining rank-revealing QR factorization and R-SVD. Other effective approaches to truncated SVD or EVD include power methods, Krylov subspace methods (Lehoucq et al., 1998; Golub and Van Loan, 2013), and random projection methods (Halko et al., 2011). With respect to memory usage, batch PCA algorithms have at least O(nd) space complexity as they necessitate holding all data in computer memory. EVD requires an additional O(d²) memory to store the covariance matrix (or O(n²) memory to store an n×n inner-product matrix when n < d, see Section 2). Because of its time and space complexity, batch PCA is essentially infeasible with: (i) massive datasets for which n and d are, say, in the thousands or millions, and (ii) datasets that change rapidly and may need to be processed on the fly (e.g., streaming data, databases). Given the exponential growth of such data in modern applications, fast and accurate PCA algorithms are in high demand.

Over the years a large number of solutions has emerged from fields as diverse as signal processing, statistics, numerical analysis, and machine learning. These approaches, called recursive, incremental, or online PCA in the literature, consist in updating the current PCA each time new data are observed, without recomputing it from scratch. This updating idea does not only apply to time-varying datasets but also to datasets that are too large to be processed as a whole and must be analyzed one subset at a time. We give here a brief description of the main approaches to online PCA. Some of the most representative techniques will be examined in detail in the next sections. One of the first historical approaches is the numerical resolution of so-called secular equations (Golub, 1973; Gu and Eisenstat, 1994; Li et al., 2000). By exploiting the interlacing property of the eigenvalues of a covariance matrix under rank-one modifications, this approach reduces the PCA update to finding the roots of a rational function. Interestingly, this recursive method is exact, that is, it produces the same results as batch PCA. Perturbation methods formulate the PCA update as the EVD of a diagonal matrix plus a rank-one matrix and obtain closed-form solutions using large-sample approximations (Hegde et al., 2006). Incremental SVD is a highly effective method for producing an approximate, reduced-rank PCA. It allows for block updates and typically involves the SVD or EVD of a small matrix of dimension about q + b, where q is the number of computed components and b is the block size. Important contributions in this area include Zha and Simon (1999), Brand (2002), and Baker et al. (2012). Stochastic optimization methods enable very fast PCA updates and bear interesting connections with neural networks. Prominent examples are the stochastic gradient algorithms of Krasulina (1970), Oja and Karhunen (1985), and Oja (1992), and the generalized Hebbian algorithm of Sanger (1989). While many results can be found on the consistency of these methods, their finite-sample behavior does not lend itself well to scrutiny. In addition, their numerical performances hinge on tuning parameters that are often selected by trial and error. Related techniques that are less sensitive to the choice of tuning parameters include Weng et al. (2003) and Mitliagkas et al. (2013). Approaches like moving window PCA (Wang et al., 2005) and fixed point algorithms (Rao and Principe, 2000) have also garnered interest. In recent years, randomized algorithms for online PCA have been developed in computer science and machine learning (Warmuth and Kuzmin, 2008; Boutsidis et al., 2015). An important aspect of this line of research is to establish bounds on finite-sample performance.

In contrast to the wide availability of online PCA methods, very few studies provide guidance on selecting a method in practice. To our knowledge, the only articles on this topic are Chatterjee (2005), Arora et al. (2012), and Rato et al. (2015). This gap in the literature is a significant problem because each application has its own set of data, analytic goals, computational resources, and time constraints, and no single online PCA algorithm can perform satisfactorily in all situations, let alone optimally. The present paper considerably expands on previous work in terms of scope of study and practical recommendations. We investigate a wide range of popular online PCA algorithms and compare their computational cost (space and time complexity, computation time, memory usage) and statistical accuracy in various setups: simulations and real data analyses, low- and high-dimensional data. We discuss practical issues such as the selection of tuning parameters, the potential loss of orthogonality among principal components, and the choice of vector versus block updates. Particular attention is given to the analysis of functional data and to the imputation of missing data. All algorithms under study are implemented in the R package onlinePCA available at http://cran.r-project.org/package=onlinePCA.

The remainder of the article is organized as follows: Section 2 provides a reminder on batch PCA. Section 3 gives recursive formulae for the sample mean and covariance. It also presents approximate and exact perturbation methods for online PCA. Incremental PCA is discussed in Section 4 and stochastic approximation methods are detailed in Section 5. Extensions to nonstationary data, missing values, and functional data are given in Section 6. In Section 7, the computational and statistical performances of the previous online PCA methods are compared in simulations and in a face recognition application. Concluding remarks are offered in Section 8. R code for the numerical study of Section 7 is available online as Supplementary Materials.

2 Batch PCA

Given data vectors x_1, ..., x_n in R^d, with d possibly large, batch PCA seeks to produce a faithful representation of the data in a low-dimensional subspace of R^d. The dimension q of the subspace must be small enough to effectively reduce the data dimension, but large enough to retain most of the data variation. The quality of the representation is measured by the squared distance between the (centered) vectors and their projections in the subspace. Denoting the sample mean by x̄_n = (1/n) Σ_{i=1}^n x_i, the goal of batch PCA is thus to find a projection matrix P of rank q that minimizes the loss function

R_n(P) = (1/n) Σ_{i=1}^n || (x_i − x̄_n) − P(x_i − x̄_n) ||².    (1)

Consider the sample covariance matrix

Γ_n = (1/n) Σ_{i=1}^n (x_i − x̄_n)(x_i − x̄_n)^T.    (2)

Let u_{1,n}, ..., u_{d,n} be orthonormal eigenvectors of Γ_n with associated eigenvalues λ_{1,n} ≥ ... ≥ λ_{d,n}. The minimum of R_n among rank-q projection matrices is attained when P is the orthogonal projector P_q = Σ_{i=1}^q u_{i,n} u_{i,n}^T, in which case R_n(P_q) = Σ_{i>q} λ_{i,n}. In other words, batch PCA reduces to finding the first q eigenvectors of the covariance Γ_n or, equivalently, the first q singular vectors of the centered data. These eigenvectors are called principal components and from here onwards, we use the two expressions interchangeably.

The computation of Γ_n takes O(nd²) flops and its full EVD requires O(d³) flops. When d is large and q is much smaller than d, iterative methods such as the power method and implicitly restarted methods should be preferred to standard EVD as they are much faster and approximate the first q eigenvectors of Γ_n with high accuracy (e.g., Lehoucq et al., 1998; Golub and Van Loan, 2013). Despite this speedup, the cost of computing Γ_n does not scale with the data. Note that if d > n, batch PCA can be performed on the n×n matrix of inner products of the centered data, leading to the same result with reduced time complexity O(n²d + n³) and space complexity O(n²). The duality between the PCA of the d×d covariance and that of the n×n inner-product matrix comes from the well-known result that XX^T and X^TX have the same nonzero eigenvalues and that their sets of eigenvectors can be deduced from one another by left-multiplication by X^T or X; see Holmes (2008) for more details. In terms of space complexity, batch PCA necessitates having both the data and the covariance in random access memory, which incurs a storage cost of O(nd + d²).
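To fix ideas, here is a minimal base-R sketch of batch PCA computed both ways; function and variable names are ours, not part of the onlinePCA package. The eigenvalues from the two routes coincide and the eigenvectors agree up to sign.

batch_pca <- function(X, q) {
  Xc <- scale(X, center = TRUE, scale = FALSE)       # center the n x d data matrix
  n  <- nrow(Xc)
  ev <- eigen(crossprod(Xc) / n, symmetric = TRUE)   # EVD of the d x d covariance
  sv <- svd(Xc, nu = 0, nv = q)                      # SVD of the centered data
  list(values_evd  = ev$values[1:q],
       values_svd  = sv$d[1:q]^2 / n,                # singular values^2 / n
       vectors_evd = ev$vectors[, 1:q, drop = FALSE],
       vectors_svd = sv$v)
}

set.seed(1)
X   <- matrix(rnorm(200 * 10), 200, 10)
res <- batch_pca(X, q = 3)
max(abs(res$values_evd - res$values_svd))   # essentially zero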

When data arrive sequentially, performing a batch PCA for each new observation may not be feasible for several reasons: first, as n and/or d becomes large, the runtime of the algorithm becomes excessive and forbids processing data on the fly; second, batch PCA requires storing all data, which is not always possible. Under these circumstances, much faster recursive techniques are of great interest. The price to pay for computational efficiency is that the obtained solution is often only an approximation to the true eigenelements.

3 Perturbation methods

The sample mean vector can be computed recursively as

x̄_n = x̄_{n−1} + (1/n)(x_n − x̄_{n−1}),    (3)

and similarly, the sample covariance matrix satisfies

Γ_n = ((n−1)/n) Γ_{n−1} + ((n−1)/n²) (x_n − x̄_{n−1})(x_n − x̄_{n−1})^T.    (4)

Note that for each new observation, the sample covariance matrix is updated through a rank-one modification. In view of (4), Γ_n can be expressed as a perturbation of Γ_{n−1}:

Γ_n = ((n−1)/n) ( Γ_{n−1} + (1/n)(x_n − x̄_{n−1})(x_n − x̄_{n−1})^T ).    (5)

Hence, if the eigendecomposition of Γ_{n−1} is known, a natural idea to compute the eigenelements of Γ_n is to apply perturbation techniques to (5), provided that n is sufficiently large and the rank-one term in (5) is small compared to Γ_{n−1}. Viewing x_1, ..., x_n as independent copies of a random vector with finite variance, the latter condition is satisfied with probability tending to one as n grows. Note that the trace of Γ_n, which represents the total variance of the data, can also be computed recursively thanks to (4).
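The recursions (3)-(4) are straightforward to code. The base-R sketch below (function names are ours; the onlinePCA package offers similar updating routines) checks the recursive covariance against its batch counterpart.

update_mean <- function(xbar, x, n) {
  xbar + (x - xbar) / n                       # (3)
}
update_cov <- function(Gamma, xbar_old, x, n) {
  phi <- x - xbar_old                         # centered on the previous mean
  ((n - 1) / n) * Gamma + ((n - 1) / n^2) * tcrossprod(phi)    # (4)
}

set.seed(2)
X <- matrix(rnorm(50 * 4), 50, 4)
xbar <- X[1, ]; Gamma <- matrix(0, 4, 4)
for (n in 2:nrow(X)) {
  Gamma <- update_cov(Gamma, xbar, X[n, ], n)
  xbar  <- update_mean(xbar, X[n, ], n)
}
max(abs(Gamma - cov(X) * (nrow(X) - 1) / nrow(X)))   # ~ 1e-15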

Large-sample approximations

We first recall a classical result in perturbation theory of linear operators (see Kato (1976) and Sibson (1979) for a statistical point of view). Denote by A^+ the Moore-Penrose pseudoinverse of a matrix A.

Lemma 3.1.

Let A be a symmetric d×d matrix with eigenelements (λ_i, u_i), i = 1, ..., d. Consider the first-order perturbation Ã = A + εB, with ε small and B a symmetric matrix. Assuming that all eigenvalues of A are distinct, the eigenelements (λ̃_i, ũ_i) of Ã satisfy

λ̃_i = λ_i + ε u_i^T B u_i + O(ε²),
ũ_i = u_i + ε (λ_i I_d − A)^+ B u_i + O(ε²).

We now apply Lemma 3.1 to (5) with A = Γ_{n−1}, εB = (1/n) φ_n φ_n^T, and φ_n = x_n − x̄_{n−1}. Define y_{i,n} = u_{i,n−1}^T φ_n for i = 1, ..., d. Assuming that the eigenelements of Γ_{n−1} satisfy λ_{1,n−1} > ... > λ_{d,n−1}, we have

λ_{i,n} ≈ ((n−1)/n) ( λ_{i,n−1} + y_{i,n}²/n ),    (6)
u_{i,n} ≈ u_{i,n−1} + (y_{i,n}/n) Σ_{j≠i} ( y_{j,n} / (λ_{i,n−1} − λ_{j,n−1}) ) u_{j,n−1}.    (7)

Similar approximations can be found in Li et al. (2000) and Hegde et al. (2006). Note that the assumption that Γ_{n−1} has distinct eigenvalues is not restrictive: indeed, the eigenanalysis of Γ_n can always be reduced to this case by deflation (see e.g., Bunch et al. (1978)).
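As a reading aid, the R sketch below transcribes the approximate updates (6)-(7) for all d eigenelements; it assumes distinct eigenvalue estimates, renormalizes the eigenvectors to limit numerical drift, and is not meant as a reference implementation.

perturbation_update <- function(values, vectors, xbar_old, x, n) {
  phi <- x - xbar_old
  y   <- drop(crossprod(vectors, phi))                       # y_i = u_i' phi
  new_values  <- ((n - 1) / n) * (values + y^2 / n)          # (6)
  new_vectors <- vectors
  for (i in seq_along(values)) {
    gap <- values[i] - values[-i]                  # assumes distinct eigenvalues
    new_vectors[, i] <- vectors[, i] +
      (y[i] / n) * vectors[, -i, drop = FALSE] %*% (y[-i] / gap)   # (7)
  }
  new_vectors <- sweep(new_vectors, 2, sqrt(colSums(new_vectors^2)), "/")
  list(values = new_values, vectors = new_vectors)
}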

Secular equations

Gu and Eisenstat (1994) propose an exact and stable technique for the case of a rank-one perturbation (see also the seminal work of Golub (1973)). For simplicity, suppose that A = diag(α_1, ..., α_d) is a diagonal matrix with distinct entries α_1 > ... > α_d. The eigenvalues of the perturbed matrix A + ρ z z^T, where z = (z_1, ..., z_d)^T is taken to be a unit vector without loss of generality, can be computed exactly as the roots of the secular equation

1 + ρ Σ_{j=1}^d z_j² / (α_j − λ) = 0.    (8)

The eigenvectors of A + ρ z z^T can then be computed exactly by applying (7) to the exact eigenvalues of the perturbed matrix and the eigenvectors of A.
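A simple way to see the method at work is to solve (8) by numerical root finding on the intervals given by the interlacing property, as in the sketch below (which assumes distinct diagonal entries, nonzero components of z, and ρ > 0); the stable implementation of Gu and Eisenstat (1994) is considerably more careful.

secular_eigenvalues <- function(alpha, z, rho) {
  alpha <- sort(alpha, decreasing = TRUE)
  f <- function(lambda) 1 + rho * sum(z^2 / (alpha - lambda))    # secular function (8)
  d <- length(alpha)
  lower <- alpha                                     # interlacing: root i lies in
  upper <- c(alpha[1] + rho * sum(z^2), alpha[-d])   # (alpha_i, alpha_{i-1})
  eps <- 1e-10 * max(abs(alpha), 1)
  vapply(seq_len(d), function(i)
    uniroot(f, lower = lower[i] + eps, upper = upper[i] - eps, tol = 1e-12)$root,
    numeric(1))
}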

A major drawback of perturbation techniques for online PCA is that they require computing all d eigenelements of the covariance matrix. Accordingly, for each new observation vector, at least O(d²) flops are needed to update the PCA and O(d²) memory is required to store the results. This computational burden and storage requirement are prohibitive for large d.

4 Reduced rank incremental PCA

Arora et al. (2012) suggest an incremental PCA (IPCA) approach based on the incremental SVD of Brand (2002). In comparison to perturbation methods, a decisive advantage of IPCA is that it does not require computing all eigenelements if one is only interested in the q largest eigenvalues. This massively speeds up computations when q is much smaller than d. A limitation of the IPCA algorithm of Arora et al. (2012) is that it only performs updates with respect to a single data vector. A similar procedure allowing for block updates can be found in Levy and Lindenbaum (2000).

Let U_{n−1} D_{n−1} U_{n−1}^T be a rank-q approximation to Γ_{n−1}, where the q×q diagonal matrix D_{n−1} approximates the first q eigenvalues of Γ_{n−1} and the d×q matrix U_{n−1} approximates the corresponding eigenvectors. The columns of U_{n−1} are orthonormal so that U_{n−1}^T U_{n−1} = I_q. When a new observation x_n becomes available, the approximation can be updated as follows. The centered vector φ_n = x_n − x̄_{n−1} is decomposed as φ_n = U_{n−1} a_n + h_n, where a_n = U_{n−1}^T φ_n contains the coordinates of φ_n in the q-dimensional space spanned by U_{n−1} and h_n is the projection of φ_n onto the orthogonal space of U_{n−1}. In view of (4) the covariance Γ_n is approximated by ((n−1)/n) U_{n−1} D_{n−1} U_{n−1}^T + ((n−1)/n²) φ_n φ_n^T (note that Γ_n is not computed in the algorithm). Writing ĥ_n = h_n / ||h_n||, the previous expression rewrites as

[U_{n−1}, ĥ_n] Q_n [U_{n−1}, ĥ_n]^T    (9)

with

Q_n = ((n−1)/n) [ D_{n−1}  0 ; 0  0 ] + ((n−1)/n²) [ a_n a_n^T  ||h_n|| a_n ; ||h_n|| a_n^T  ||h_n||² ].    (10)

It then suffices to perform the EVD of the matrix Q_n, which is of dimension (q+1) × (q+1). Writing Q_n = V_n Λ_n V_n^T with V_n orthogonal and Λ_n diagonal, the EVD of the approximation (9) simply expresses as U_n D_n U_n^T, where D_n = Λ_n and

U_n = [U_{n−1}, ĥ_n] V_n.    (11)

To keep the approximation of Γ_n at rank q, the row and column of D_n containing the smallest eigenvalue are deleted and the associated eigenvector is deleted from U_n.
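The whole update fits in a few lines of R. The sketch below mirrors the description above; names are ours and the onlinePCA package provides its own implementation.

ipca_update <- function(values, vectors, xbar_old, x, n) {
  phi <- x - xbar_old
  a   <- drop(crossprod(vectors, phi))            # coordinates in span(U)
  h   <- phi - vectors %*% a                      # orthogonal residual
  nh  <- sqrt(sum(h^2))
  q   <- length(values)
  Q <- ((n - 1) / n) * diag(c(values, 0)) +
       ((n - 1) / n^2) * tcrossprod(c(a, nh))     # (q+1) x (q+1) matrix, eq. (10)
  ed <- eigen(Q, symmetric = TRUE)
  U  <- cbind(vectors, h / max(nh, .Machine$double.eps)) %*% ed$vectors   # (11)
  list(values = ed$values[1:q],                   # drop the smallest eigenvalue
       vectors = U[, 1:q, drop = FALSE])
}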

5 Stochastic approximation

5.1 Stochastic gradient optimization

Stochastic gradient approaches adopt a rather different point of view based on the population version of the optimization problem (1). Stochastic gradient algorithms for online PCA have been proposed by Sanger (1989), Krasulina (1970), Oja and Karhunen (1985) and Oja (1992). Parametric convergence rates for the first eigenvalue and eigenvector have been obtained recently by Balsubramani et al. (2013). These algorithms are very fast and differ mostly in how they (approximately) orthonormalize eigenvectors after each iteration.

Let X be a random vector taking values in R^d with mean μ = E(X) and covariance matrix Γ = E((X − μ)(X − μ)^T). Consider the minimization of the compression loss

R(P) = E || (X − μ) − P(X − μ) ||²,    (12)

where P is a projection matrix onto some q-dimensional subspace of R^d. Note that (12) is the probabilistic version of the empirical loss (1).

Let U be a d×q matrix whose columns form an orthonormal basis of the projection subspace, so that P = U U^T. The minimization of (12) is conveniently reformulated as the maximization of

Ψ(U) = E( (X − μ)^T U U^T (X − μ) ) = tr( U^T Γ U ),    (13)

whose gradient with respect to U is

∇Ψ(U) = 2 Γ U.    (14)

Assuming for now that Γ is known and ignoring the orthonormality constraints on U, a gradient ascent algorithm would have updates of the form U_n = U_{n−1} + γ_n Γ U_{n−1}, with γ_n a step size. Since Γ is in fact unknown, it is replaced by a random approximation whose expectation is proportional to Γ. Accordingly, for each new observation x_n, the d×q matrix U_{n−1} of orthonormal vectors is updated as follows:

Ũ_n = U_{n−1} + γ_n φ_n φ_n^T U_{n−1},    (15)
U_n = orthonormalized version of Ũ_n,    (16)

where φ_n = x_n − x̄_{n−1} is the centered new observation and the step sizes γ_n > 0 satisfy the usual conditions of Robbins-Monro algorithms: Σ_n γ_n = ∞ and Σ_n γ_n² < ∞ (e.g., Duflo, 1997). Together, these conditions ensure the almost sure convergence of the algorithm to the global maximizer of (13), namely, the eigenvectors associated to the q largest eigenvalues of Γ.

The update (15) of the projection space has computational complexity O(dq), which is much less than the O(d²) cost of perturbation techniques when q is much smaller than d. The orthonormalization (16) can be realized for example with the Gram-Schmidt procedure. In this case, Oja (1992) describes the combination (15)-(16) as the Stochastic Gradient Ascent (SGA) algorithm. Although the Gram-Schmidt procedure requires O(dq²) elementary operations, the space generated by the estimated eigenvectors remains the same even if orthonormalization is not performed at each step. As a result, if orthonormalization is performed every q steps or so, the overall complexity of the SGA algorithm remains O(dq) per iteration. Alternatively, computational speed can be increased at the expense of numerical accuracy by using a first-order approximation of the Gram-Schmidt orthonormalization (i.e., neglecting the terms of order γ_n²). This enables the approximate implementation

u_{i,n} = u_{i,n−1} + γ_n y_{i,n} ( φ_n − y_{i,n} u_{i,n−1} − 2 Σ_{j<i} y_{j,n} u_{j,n−1} ),  i = 1, ..., q,    (17)

where y_{i,n} = u_{i,n−1}^T φ_n. This fast implementation of the SGA algorithm can be interpreted as a neural network (Oja, 1992).
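In code, one SGA step with exact orthonormalization can be sketched as follows; the QR factorization plays the role of the Gram-Schmidt step (column signs may differ), and all names are ours rather than those of the onlinePCA package.

# One SGA step (15)-(16); 'vectors' is the current d x q matrix U, 'gamma' the
# learning rate, and 'xbar' the current mean estimate. Sketch only.
sga_update <- function(vectors, xbar, x, gamma) {
  phi <- x - xbar
  V   <- vectors + gamma * phi %*% crossprod(phi, vectors)     # (15)
  qr.Q(qr(V))[, seq_len(ncol(vectors)), drop = FALSE]          # (16) via QR
}

# Typical use in a stream, with learning rate gamma_n = c / n^alpha:
#   U <- sga_update(U, xbar, x_new, gamma = c0 / n^alpha)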

The SGA algorithm also allows consistent recursive estimation of the eigenvalues of Γ (see Oja and Karhunen, 1985, and compare with (6)):

λ_{i,n} = λ_{i,n−1} + γ_n ( y_{i,n}² − λ_{i,n−1} ).    (18)

Subspace Network Learning (SNL) is another stochastic gradient algorithm, in which the orthonormalization (16) consists in multiplying Ũ_n by (Ũ_n^T Ũ_n)^{−1/2} (Oja, 1983, 1992). Using first-order approximations, a fast approximate implementation of SNL is

U_n = U_{n−1} + γ_n ( I_d − U_{n−1} U_{n−1}^T ) φ_n φ_n^T U_{n−1}.    (19)

The SNL algorithm is faster than SGA but unlike SGA, it only converges to the eigenvectors of Γ up to a rotation. In other words, SNL recovers the eigenspace generated by the first q eigenvectors but not the eigenvectors themselves.

Sanger (1989) proposes a neural network approach called the Generalized Hebbian Algorithm (GHA):

u_{i,n} = u_{i,n−1} + γ_n y_{i,n} ( φ_n − Σ_{j≤i} y_{j,n} u_{j,n−1} ),  i = 1, ..., q.    (20)

The almost sure convergence of the estimator u_{i,n} to the corresponding eigenvector of Γ for i = 1, ..., q is established in the same paper. By construction, the vectors are mutually orthogonal. In practice however, loss of orthogonality may occur due to roundoff errors. As noted in Oja (1992), GHA is very similar to the fast implementation (17) of SGA, the only difference being that there is no coefficient 2 in the sum. Strictly speaking, however, GHA is not a stochastic gradient algorithm.
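A compact R sketch of the GHA update (20): the responses y are computed with the current vectors before any of them is modified, as in Sanger's rule, and no explicit orthonormalization is performed.

gha_update <- function(vectors, xbar, x, gamma) {
  phi <- x - xbar
  y   <- drop(crossprod(vectors, phi))      # y_i = u_i' phi
  old <- vectors
  for (i in seq_along(y)) {
    resid <- phi - old[, seq_len(i), drop = FALSE] %*% y[seq_len(i)]   # deflation term
    vectors[, i] <- old[, i] + gamma * y[i] * resid                    # (20)
  }
  vectors
}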

5.2 Choosing the learning rate

The choice of the learning rate sequence (γ_n) in the previous stochastic algorithms has great practical importance, yet is rarely discussed in the literature. A usual choice is γ_n = c/n for some well-chosen constant c > 0. However, if c is too small, the algorithm may get stuck far from the optimum, whereas if c is too large, it may display large oscillations. There are no universally good values of c because a sensible choice depends on the magnitude of the data vectors, the distance between the starting point of the algorithm and the global solution, the gaps between successive eigenvalues of Γ, and so on. In practice, c is often selected by trial and error (e.g., Oja, 1983).

A more prudent strategy consists in using learning rates of the form γ_n = c/n^α with 1/2 < α < 1. In this way, γ_n tends to zero less rapidly, so that the algorithm is allowed to oscillate and has less chance of getting stuck at a wrong position. This is particularly important if the starting point of the algorithm is far from the solution.

Data-driven methods have been proposed in the literature to select the learning rate. For example, a non-zero-approaching adaptive learning rate is studied in Lv et al. (2006); it depends on a constant that must be chosen in a prescribed interval. In our simulations however, this technique produced poor performances, not reported here.

5.3 Candid covariance-free incremental PCA

Weng et al. (2003) propose a method called candid covariance-free incremental PCA (CCIPCA) that resembles SGA and GHA. This method however does not aim to optimize an objective function or train a neural network. Following Weng et al. (2003), we first present the algorithm in the case where the mean of the data is zero and a single component is estimated, and then consider the general case. Let u be an eigenvector of Γ with unit norm and let λ be the associated eigenvalue. Assume that estimates v_1, ..., v_{n−1} of λu have been constructed in previous steps. The idea of CCIPCA is to substitute x_i x_i^T for Γ and v_{i−1}/||v_{i−1}|| for u in the eigenequation λu = Γu, for i = 1, ..., n, and to average the results:

v_n = (1/n) Σ_{i=1}^n x_i x_i^T v_{i−1} / ||v_{i−1}||.    (21)

The normalized eigenvector and eigenvalue are estimated by v_n/||v_n|| and ||v_n||, respectively. A proof of the almost sure convergence of v_n to λu can be found in Zhang and Weng (2001).

As can be seen in (21), CCIPCA produces a sequence of stochastic approximations to the eigenvectors of Γ and then averages them. This is a major difference compared to the previous stochastic approximation algorithms that directly target the population covariance matrix. Because it is based on averaging, CCIPCA does not require specifying tuning parameters. This is a major advantage over SGA and GHA.

From a computational standpoint, (21) is conveniently written in recursive form as

v_n = ((n−1)/n) v_{n−1} + (1/n) x_n x_n^T v_{n−1} / ||v_{n−1}||.    (22)

The algorithm can be initialized by v_1 = x_1. In the general case where the mean of the data is unknown, it should be estimated via (3) and x_n should be centered on the current mean estimate in (22); the initialization is adapted accordingly.

When estimating more than one eigenvector, say q > 1, the same deflation method as in GHA is applied to enforce orthogonality of the estimates: to compute the estimate of the i-th eigenvector, the input vector x_n in (22) is replaced by its residual after removing its projections onto the first i − 1 current (normalized) eigenvector estimates. This saves much computation time compared to Gram-Schmidt orthonormalization but may cause loss of orthogonality due to roundoff errors.
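The following R sketch combines the recursion (22) with the deflation just described for q components; x is the (already centered, if the mean is unknown) observation, the columns of V store the unnormalized estimates, and the amnesic parameter l anticipates (23) below (l = 0 gives the stationary case). Names are ours.

ccipca_update <- function(V, x, n, l = 0) {
  u <- x
  for (i in seq_len(ncol(V))) {
    vhat <- V[, i] / sqrt(sum(V[, i]^2))
    V[, i] <- ((n - 1 - l) / n) * V[, i] +
              ((1 + l) / n) * drop(crossprod(u, vhat)) * u    # (22) / (23)
    vhat <- V[, i] / sqrt(sum(V[, i]^2))
    u <- u - drop(crossprod(u, vhat)) * vhat                  # deflate the input
  }
  V     # estimated eigenvectors: normalized columns; eigenvalues: column norms
}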

To handle data generated by nonstationary processes, a parameter l ≥ 0 called amnesic factor can be introduced in (22):

v_n = ((n − 1 − l)/n) v_{n−1} + ((1 + l)/n) x_n x_n^T v_{n−1} / ||v_{n−1}||.    (23)

This parameter controls the weight given to earlier observations. According to Weng et al. (2003), l should typically range between 2 and 4, with larger values of l giving more weight to recent observations. For l = 0, (23) reduces to the stationary case (22).

6 Extensions

6.1 Nonstationary processes

The perturbation methods of Section 3 and the IPCA algorithm of Section 4 have been presented under the implicit assumption of a stationary data-generating process. However, these methods can easily handle nonstationary processes by generalizing the sample mean and sample covariance as follows:

x̄_n = (1 − α) x̄_{n−1} + α x_n,    (24)
Γ_n = (1 − α) Γ_{n−1} + α (1 − α) (x_n − x̄_{n−1})(x_n − x̄_{n−1})^T,    (25)

where α ∈ (0, 1) is a "forgetting factor" that determines the weight of a new observation in the mean and covariance updates. In the stationary case, equations (3)-(4) are recovered by setting α = 1/n. More generally, larger values of α give more weight to new observations.

The stochastic algorithms of Section 5.1 naturally accommodate nonstationary processes through the learning rate γ_n.
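A sketch of the forgetting-factor recursions (24)-(25), written so that they reduce to (3)-(4) when α = 1/n; function names are ours.

update_mean_ff <- function(xbar, x, alpha) {
  (1 - alpha) * xbar + alpha * x                                    # (24)
}
update_cov_ff <- function(Gamma, xbar_old, x, alpha) {
  phi <- x - xbar_old
  (1 - alpha) * Gamma + alpha * (1 - alpha) * tcrossprod(phi)       # (25)
}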

6.2 Missing data

Standard imputation methods (mean, regression, hot-deck, maximum likelihood, multiple imputation, etc.) can be used to handle missing data in the context of online PCA. See e.g., Josse et al. (2011) for a description of missing data imputation in offline PCA.

Hereafter we describe the approach of Brand (2002), which imputes missing values in the observation vector x_n by empirical best linear unbiased prediction (EBLUP). The key idea is to consider x_n as a realization of a multivariate normal distribution whose mean vector and covariance matrix are approximated by the current sample mean and the current PCA, that is, N(x̄_{n−1}, U_{n−1} D_{n−1} U_{n−1}^T).

Partition x_n into two subvectors x_n^o (observed values) and x_n^m (missing values) of respective sizes d − m and m, where m is the number of missing values. Similarly, partition x̄_{n−1} into two subvectors x̄^o and x̄^m whose entries correspond to x_n^o and x_n^m. Also partition U_{n−1} into two submatrices U^o and U^m of respective dimensions (d − m) × q and m × q whose rows correspond to x_n^o and x_n^m. Let D^{1/2} be the diagonal matrix containing the square roots of the diagonal elements of D_{n−1}. By the properties of conditional expectations for multivariate normal distributions, the empirical best linear unbiased predictor of x_n^m is

x̂_n^m = x̄^m + U^m D^{1/2} (U^o D^{1/2})^+ (x_n^o − x̄^o).    (26)
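In R, the EBLUP imputation (26) can be sketched as follows, with the Moore-Penrose pseudoinverse taken from the MASS package; the function name and interface are ours.

impute_eblup <- function(x, xbar, values, vectors) {
  miss <- is.na(x)
  if (!any(miss)) return(x)
  W  <- sweep(vectors, 2, sqrt(values), "*")     # U D^{1/2}, a d x q matrix
  Wo <- W[!miss, , drop = FALSE]                 # rows of observed entries
  Wm <- W[miss,  , drop = FALSE]                 # rows of missing entries
  # (26): x^m = xbar^m + U^m D^{1/2} (U^o D^{1/2})^+ (x^o - xbar^o)
  x[miss] <- xbar[miss] + Wm %*% MASS::ginv(Wo) %*% (x[!miss] - xbar[!miss])
  x
}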

6.3 Functional data

In many applications, the data are functions of a continuous argument (e.g., time, space, or frequency) observed on a dense grid of d points t_1, ..., t_d. The corresponding observation vectors x_i = (x_i(t_1), ..., x_i(t_d))^T are often high-dimensional. Rather than carrying out PCA directly on the x_i, it is advantageous to consider the functional version of this problem (FPCA). FPCA consists in finding the eigenvalues and eigenfunctions of the linear operator associated to the empirical covariance function

ĉ_n(s, t) = (1/n) Σ_{i=1}^n (x_i(s) − x̄_n(s)) (x_i(t) − x̄_n(t)),

where x̄_n(t) = (1/n) Σ_{i=1}^n x_i(t) is the empirical mean function. By accounting for the structure of the data (e.g., time ordering) and the smoothness of the eigenfunctions, FPCA can both reduce data dimension, hence computation time, and increase statistical accuracy.

An efficient way to implement FPCA is to first approximate the functions in a low-dimensional space of smooth functions:

x_i(t) ≈ Σ_{j=1}^p β_{ij} e_j(t),  i = 1, ..., n,    (27)

where e_1, ..., e_p are smooth basis functions (e.g., B-splines) and p is much smaller than d. The FPCA of the approximations in (27) now reduces to the PCA of the basis coefficients β_i = (β_{i1}, ..., β_{ip})^T in the metric M = (⟨e_j, e_k⟩)_{1≤j,k≤p}, which is the Gram matrix of the basis functions. More precisely, writing Γ_β for the sample covariance matrix of the β_i, it suffices to diagonalize the matrix Γ_β M

with eigenvectors b_r satisfying the orthonormality constraints

b_r^T M b_s = 1 if r = s, and 0 otherwise.    (28)

Further details can be found in Ramsay and Silverman (2005).

In turn, the eigenvectors b_r can be found through the eigenanalysis of the symmetric matrix S = M^{1/2} Γ_β M^{1/2}. Indeed, if c_r is a (unit norm) eigenvector of S associated to the eigenvalue λ_r, then b_r = M^{−1/2} c_r is an eigenvector of Γ_β M for the same eigenvalue and the b_r satisfy (28). It is possible to obtain a reduced-rank FPCA by computing only q < p eigenvectors, but this is not as crucial as for standard PCA because the data dimension has already been reduced from d to p.

To extend FPCA to the online setup, the following steps are required:

  1. Given a new observation x_n = (x_n(t_1), ..., x_n(t_d))^T, compute the coefficient vector β_n = H x_n, where H is a suitable p×d matrix. Typically, H = (E^T E + κ P)^{−1} E^T, with E = (e_j(t_k)) the d×p matrix of basis function values, P a penalty matrix, and κ ≥ 0 a smoothing parameter.

  2. Update the sample mean β̄_n of the coefficient vectors via (3) and compute y_n = M^{1/2}(β_n − β̄_n).

  3. Apply an online PCA algorithm to update the eigenelements with respect to y_n. Note that the PCA takes place in R^p and not R^d.

  4. Compute the coefficient vectors b_r = M^{−1/2} c_r of the eigenfunctions. If needed, compute the discretized eigenfunctions E b_r.

Steps 1 and 2 can be gathered for computational efficiency, that is, the matrix product M^{1/2} H can be computed once and for all and be directly applied to new data vectors. This combined step requires O(dp) flops. The time complexity of step 3 depends on the online PCA algorithm used; it is for example O(pq²) with IPCA and O(pq) with GHA. If p is small, one can also explicitly compute and update the p×p covariance matrix of the y_n with (4) and perform its batch PCA for each new observation; the cost per iteration of this approach remains comparable. Step 4 produces the coefficient vectors b_r in O(p²q) flops and requires an additional O(dpq) flops to compute the discretized eigenfunctions. Using for example IPCA in step 3, the total cost per iteration of FPCA is O(dp + pq²), which makes it very competitive with standard (online) PCA. In addition, if d is large and the eigenfunctions of the covariance operator are smooth, FPCA can greatly improve the accuracy of estimates thanks to the regularized projection (27) onto smooth basis functions. This fact is confirmed in the simulation study.
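The pipeline can be sketched in R as follows, using B-splines from the splines package, a simple quadrature approximation of the Gram matrix (equispaced grid on [0, 1] assumed), and a user-supplied penalty matrix; update_mean and ipca_update refer to the sketches given earlier, and all other names are ours.

library(splines)
make_fpca_maps <- function(grid, p, kappa = 0, Pen = diag(p)) {
  E  <- bs(grid, df = p, intercept = TRUE)          # d x p matrix of basis values
  M  <- crossprod(E) / length(grid)                 # Gram matrix (quadrature)
  H  <- solve(crossprod(E) + kappa * Pen, t(E))     # step 1: p x d coefficient map
  eM <- eigen(M, symmetric = TRUE)
  Mhalf    <- eM$vectors %*% (sqrt(eM$values) * t(eM$vectors))     # M^{1/2}
  Mhalfinv <- eM$vectors %*% (t(eM$vectors) / sqrt(eM$values))     # M^{-1/2}
  list(E = E, H = H, Mhalf = Mhalf, Mhalfinv = Mhalfinv)
}

# For each new curve x (a vector of length d):
#   beta    <- drop(maps$H %*% x)                          # step 1
#   betabar <- update_mean(betabar, beta, n)               # step 2, eq. (3)
#   y       <- drop(maps$Mhalf %*% (beta - betabar))
#   pca     <- ipca_update(pca$values, pca$vectors, 0, y, n)   # step 3, PCA in R^p
#   b       <- maps$Mhalfinv %*% pca$vectors               # step 4: coefficients b_r
#   eigfuns <- maps$E %*% b                                # discretized eigenfunctions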

A standard choice for the penalty is P = ( ∫ e_j''(t) e_k''(t) dt )_{1≤j,k≤p}, which penalizes curvature in the basis function approximation (27). The parameter κ can be selected manually using pilot data. Alternatively, an effective automated selection procedure is to randomly split pilot data in two subsets and select the value of κ for which the FPCA of one subset (i.e., the associated projection matrix) minimizes the loss function (1) on the other subset.

7 Comparison of online PCA algorithms

7.1 Time and space complexity

Method | Required memory | Computation time
Batch PCA (EVD) | O(nd + d²) | O(nd² + d³)
Batch PCA (truncated SVD) | O(nd) | O(ndq)
SGA (ortho.) | O(dq) | O(dq²)
SGA (nn) | O(dq) | O(dq)
GHA | O(dq) | O(dq)
CCIPCA | O(dq) | O(dq)
Perturbation approximation | O(d²) | O(d²) at least
Incremental PCA | O(dq) | O(dq²)
Table 1: Computational cost and memory usage of online PCA per iteration (n: sample size, d: data dimension, q: number of computed components)

Table 1 compares the time and space complexity of the batch and online PCA algorithms under study. The usual batch PCA (EVD) does not scale with the data as it requires O(nd² + d³) time. In comparison, truncated SVD has a computational cost that grows linearly with the data and hence can be used with fairly large datasets. When n is small, batch PCA (EVD or SVD) can provide reasonable starting points to online algorithms. If the dimension d is large, perturbation methods are very slow and require a large amount of memory. At the opposite end of the spectrum, the stochastic algorithms SGA and SNL (with neural network implementation - "nn" in the table), GHA, and CCIPCA provide very fast PCA updates (O(dq) flops) with minimal memory requirement (O(dq), the space needed to store the eigenvectors and eigenvalues). If q is relatively small compared to n and d, SGA and SNL (with exact orthonormalization - "ortho." in the table) and IPCA offer efficient PCA updates (O(dq²) time complexity), albeit slightly slower than the previously mentioned stochastic algorithms.

7.2 Simulation study

7.2.1 Setup

A simulation study was conducted to compare the numerical performances of the online PCA algorithms. The data-generating model used for the simulation was a Gaussian random vector X in R^d with zero mean and covariance matrix Γ = (min(i, j)/d)_{1≤i,j≤d}. This random vector can be interpreted as a standard Brownian motion observed at d equidistant time points in [0, 1]. For d large enough, the eigenvalues of the scaled covariance decrease rapidly to zero (roughly like 1/j² for the j-th eigenvalue), so that most of the variability of X is concentrated in a low-dimensional subspace of R^d (e.g., Ash and Gardner, 1975). In each simulation, n independent realizations of X were generated for several values of n and of the dimension d. The online PCA algorithms were initialized by the batch PCA of an initial block of observations and then run on the remaining observations. The number q of estimated eigenvectors was also varied.

To evaluate the statistical accuracy of the algorithms, we considered the relative error in the estimation of the eigenspace associated to the q largest eigenvalues of Γ. Let P_q be the orthogonal projector on this eigenspace. Given a d×q matrix U of estimated eigenvectors such that U^T U = I_q, we consider the orthogonal projector P̂_q = U U^T and measure the eigenspace estimation error by

L(P_q, P̂_q) = || P̂_q − P_q ||_F² / (2q),    (29)

where ||·||_F denotes the Frobenius norm. In unreported simulations we also used the cosine between the top eigenvector of Γ and its estimate as a performance criterion and obtained qualitatively similar results to those presented here.
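For reference, the criterion can be computed in R as follows; the normalization by 2q, which keeps the criterion between 0 and 1, reflects our reading of (29).

eigenspace_error <- function(U_true, U_hat) {
  P    <- tcrossprod(U_true)      # orthogonal projector on the true eigenspace
  Phat <- tcrossprod(U_hat)       # projector on the estimated eigenspace
  sum((Phat - P)^2) / (2 * ncol(U_true))
}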

7.2.2 Computation time

Computation times (in milliseconds per iteration) evaluated on a personal computer (1.3 GHz Intel Core i5 with 8 GB of RAM) are presented in Table 2. The results are globally coherent with Table 1. When the dimension d of the data is small, all considered methods have comparable computation times, with the exception of the secular equation approach, which is at least ten times slower than the others. As d increases, the perturbation techniques, which compute all d eigenelements, get slower and slower compared to the other algorithms. For example, when only the first few eigenvectors of a high-dimensional covariance matrix are of interest, the CCIPCA and GHA algorithms are more than 500 times faster than the perturbation approaches. The effect of the orthonormalization step on the computation time becomes much larger for high dimension d and a relatively large number of computed eigenvectors q. For the largest values of d and q considered, the GHA and CCIPCA algorithms, which perform approximate orthonormalization, are about seven times faster than IPCA and SGA, which perform exact Gram-Schmidt orthonormalization.

     
Table 2: Computation time (in milliseconds per iteration) of the online PCA algorithms SGA (ortho. and nn), SNL (ortho. and nn), GHA, CCIPCA, IPCA, and the approximate and secular-equation perturbation methods

7.2.3 Statistical accuracy

Table 4 shows the eigenspace estimation error (averaged over 100 to 500 replications) of the online PCA algorithms for the first q eigenvectors of Γ and different values of n and d. To increase statistical accuracy, each algorithm actually computed a larger number of eigenvectors; only the first q were kept for estimation in the end. It is in general advisable to compute more eigenvectors than required to maintain good accuracy for all target eigenvectors.

The approximate perturbation approach produces estimation errors that are much greater at the end (that is, after all n observations have been processed) than at the beginning (initialization by batch PCA of an initial block of observations). Therefore, this approach should not be used in practice. In contrast, the exact perturbation algorithm based on secular equations performs as well as batch PCA.

The convergence of stochastic algorithms largely depends on the sequence of learning rates γ_n. Following the literature, we considered learning rates of the form γ_n = c/n^α, with c > 0 and two values of the exponent α. Larger values of α can be expected to produce better convergence rates but also increase the risk of getting stuck close to the starting point of the algorithm. This can lead to poor results if the starting point is far from the true eigenvectors of Γ. In our setup we obtained the constants c by minimizing the eigenspace estimation error over a grid (see Table 3). Interestingly, smaller values of c are selected when the dimension d increases and when α decreases. The sample size n does not seem to strongly impact the optimal value of c. With our calibrated choice of the constant c in the learning rates, the SGA and GHA algorithms display virtually identical performances. In addition, there is no great difference in estimation error between the two values of α. Given its high speed of computation, CCIPCA performs surprisingly well in all situations. IPCA is even more accurate and, although it is an approximate technique, it performs nearly as well as exact methods in this simulation study.

10 1 0.1 1 0.1 0.01
10 1 0.1 1 0.1 0.01
Table 3: Best constant c for the learning rate γ_n = c/n^α of the SGA and GHA algorithms. The best constants are identical for the two methods

Figures 1 and 2 present the eigenspace estimation error (averaged over 100 replications) as a function of the sample size n for the most effective online algorithms under study: IPCA, SGA, and CCIPCA. Although the data dimension differs between Figure 1 and Figure 2, the two figures are similar, meaning that the effect of the dimension on the evolution of the accuracy is not crucial. IPCA produces reliable estimates and always outperforms the SGA and CCIPCA algorithms. The SGA algorithm with the faster-decaying learning rate produces a stronger initial decrease in the estimation error than the slower-decaying one. In the long run, however, the error decreases faster with the slower rate.

   
Table 4: Eigenspace estimation error for the first q eigenvectors of Γ. Methods compared: batch PCA, SGA and GHA (each for two learning rates), CCIPCA, IPCA, approximate perturbation, and secular equations
Figure 1: Eigenspace estimation error for the first q eigenvectors of Γ as a function of the sample size n (first data dimension)
Figure 2: Eigenspace estimation error for the first q eigenvectors of Γ as a function of the sample size n (second data dimension)

The random vector X considered in our simulation framework can be seen as a discretized standard Brownian motion. Given the smoothness of the eigenfunctions associated to the Brownian motion (sine functions) and the high dimension of the data, the functional PCA approach of Section 6.3 seems an excellent candidate for estimating the eigenelements of the covariance matrix Γ. To implement this approach, we used a basis of cubic B-splines controlled by 28 equispaced knots. The penalty matrix P was as in Section 6.3, namely, a roughness penalty on the second derivative of the approximating function, with a fixed smoothing parameter κ. The estimation error is presented in Table 5. Thanks to the dimension reduction, all online PCA algorithms can be rapidly computed even when d is very large. Comparing Tables 4 and 5, it is clear that FPCA improves the statistical accuracy. Again, the performances of the stochastic approximation approaches strongly depend on the dimension and the learning rate γ_n. As before, IPCA offers a good compromise between computation time and accuracy.

   
Table 5: Eigenspace estimation error for the first q eigenvectors of Γ when the discretized trajectories are approximated by spline functions with 28 equispaced knots. Methods compared: batch PCA, SGA and GHA (each for two learning rates), CCIPCA, IPCA, approximate perturbation, and secular equations

7.2.4 Missing data

The ability of the IPCA algorithm to handle missing data was evaluated in a high-dimensional context using the EBLUP imputation method of Section 6.2. Missing values were created by removing entries through simple random sampling without replacement, with different sampling fractions. As shown in Figure 3, the incremental algorithm performs well even when half of the data are missing. We note that imputation based on EBLUP is well adapted to this simulation study due to the rather strong correlation between variables. It is also worth noting that the computation time of the imputation is of the same order as the computation time of the IPCA algorithm itself.

Figure 3: Eigenspace estimation error of the IPCA algorithm for the first q eigenvectors of Γ with different levels of missingness in the data

7.3 A face recognition example

To assess the performance of online PCA with real data, we have selected the Database of Faces of the AT&T Laboratories Cambridge (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html). This database consists of 400 face images of dimension 92 × 112 pixels in 256 gray levels. For each of 40 subjects, 10 different images featuring various facial expressions (open/closed eyes, smiling/non-smiling) and facial details (glasses/no glasses) are available.

The database was randomly split into a training set and a test set by stratified sampling. Specifically, for each subject, one image was randomly selected for testing and the nine others were included in the training set. The IPCA, SGA, and CCIPCA algorithms were applied to the (vectorized) training images using two different numbers of principal components, as in Levy and Lindenbaum (2000). No image centering was used (uncentered PCA). IPCA and CCIPCA were initialized using only the first image whereas the SGA algorithm was initialized with the batch PCA of an initial block of training images. The resulting principal components were used for two tasks: compression and classification of the test images. We also computed the batch PCA of the data as a benchmark for the online algorithms.

For the compression task, we measured the performance of the algorithms using the uncentered, normalized version of the loss function (1):

Σ_i || x_i − U U^T x_i ||² / Σ_i || x_i ||²,

where U is the matrix of computed principal components and the sums run over the images of the training or test set.

For the purpose of classifying the test images, we performed a linear discriminant analysis (LDA) of the scores of the training images on the principal components, using the subjects as classes. This technique is a variant of the well-known Fisherface method of Belhumeur et al. (1997); see also Zhao et al. (2006) for related work.

The random split of the data was repeated 100 times for each task and we report here the average results. Table 6 reports the performance of the algorithms with respect to data compression. As can be expected, the compression obtained with the larger number of principal components is far superior to the one using fewer components. Due to the large size of the training set (90% of all data), there is little difference in compression error between the training and test sets. IPCA and CCIPCA produce nearly optimal results that are almost identical to batch PCA (see also Figures 4 and 5). The SGA algorithm shows worse performance on average but also more variability. The effectiveness of SGA further degrades if a random initialization is used. Interestingly, some principal components found by SGA strongly differ from those of the other methods, as shown in Figure 4 where the SGA components are displayed in the last row. Accordingly, SGA may compress images quite differently from the other methods. In Figure 5 for example, some images compressed by SGA (first and third images in the last row) have sharp focus and are in fact very similar to other images of the same subjects used in the training phase. In contrast, the corresponding images compressed by batch PCA, IPCA, and CCIPCA are blurrier, yet often closer to the original image.

Fewer components: Training, Test / More components: Training, Test
Batch 0.0323 0.0363 0.0224 0.0286
IPCA 0.0327 0.0367 0.0229 0.0290
SGA 0.0523 0.0551 0.0388 0.0431
CCIPCA 0.0335 0.0373 0.0257 0.0312
Table 6: Compression loss of the batch PCA, IPCA, CCIPCA, and SGA algorithms with the AT&T Database of Faces
Figure 4: Top five principal components (eigenfaces) for the AT&T Database of Faces. Rows from top to bottom: batch PCA, IPCA, CCIPCA, SGA
Figure 5: Sample of compressed images from the AT&T Database of Faces. Rows from top to bottom: original image, batch PCA, IPCA, CCIPCA, SGA

Table 7 displays the classification accuracy of the LDA based on the component scores of the different online PCA algorithms. Overall, all algorithms have high accuracy. IPCA and CCIPCA yield the best performances, followed closely by batch PCA. It is not surprising that online algorithms can surpass batch PCA in classification since the latter technique is only optimal for data compression. SGA produces slightly lower, yet still high, classification accuracy.

Fewer components: Training, Test / More components: Training, Test
Batch 0.9897 0.9580 0.9986 0.9880
IPCA 0.9915 0.9635 0.9995 0.9875
CCIPCA 0.9920 0.9655 0.9988 0.9837
SGA 0.9788 0.9340 0.9963 0.9710
Table 7: Classification rates for batch PCA, IPCA, CCIPCA, and SGA coupled with Linear Discriminant Analysis on the AT&T Database of Faces

Table 8 examines the computation time and memory usage of the PCA algorithms. As can be expected, batch PCA requires much more memory (at least one order of magnitude) than the online algorithms. Also, the size of the data is large enough that batch PCA becomes slower than the online algorithms CCIPCA and IPCA. The fact that the SGA algorithm runs slower than all other algorithms is not surprising given that its exact implementation used here requires a Gram-Schmidt orthonormalization in high dimension at each iteration.

Method Time Memory
Batch PCA 7.13 924.4
IPCA 7.09 73.3
CCIPCA 3.93 74.6
SGA 9.45 67.8
Table 8: Computation time (s) and memory usage (MB) for the PCA of the AT&T Database of Faces (n = 360 training images of dimension d = 10304).

8 Concluding remarks

PCA is a popular and powerful tool for the analysis of high-dimensional data with multiple applications to data mining, data compression, feature extraction, pattern detection, process monitoring, fault detection, and computer vision. We have presented several online algorithms that can efficiently perform and update the PCA of time-varying data (e.g., databases, streaming data) and massive datasets. We have compared the computational and statistical performances of these algorithms using artificial and real data. The R package onlinePCA available at http://cran.r-project.org/package=onlinePCA implements all the techniques discussed in this paper and others.

Of all algorithms under study, the stochastic methods SGA, SNL, and GHA provide the highest computation speed. They are however very sensitive to the choice of the learning rate (or step size) and converge more slowly than IPCA and CCIPCA. For strongly misspecified learning rates, they may even fail to converge. Furthermore, simulations not presented here suggest that a different step size should be used for each estimated eigenvector. In theory this guarantees the almost sure convergence of the estimators towards the corresponding eigenspaces (Monnez, 2006) but, as far as we know, there exists no automatic procedure for choosing these learning rates in practice. In relation to the recent result of Balsubramani et al. (2013), averaging techniques (see Polyak and Juditsky (1992)) could be expected to yield efficient estimators of the first eigenvector. Simulation studies not presented here do not confirm this intuition: averaging significantly improves the initial stochastic gradient estimators, but the estimation error remains much larger than with IPCA and CCIPCA.

The IPCA and CCIPCA algorithms offer a very good compromise between statistical accuracy and computational speed. They also have the advantage of not depending heavily on tuning parameters (only an optional forgetting or amnesic factor). Approximate perturbation methods can yield highly inaccurate estimates and we do not recommend them in practice. The method of secular equations, although slower than the other algorithms, has the advantage of being exact. It is a very good option when accuracy matters more than speed and the dimension of the data remains moderate.