# An Implicit Form of Krasulina's k-PCA Update without the Orthonormality Constraint

We shed new light on the two commonly used updates for the online k-PCA problem, namely, Krasulina's and Oja's updates. We show that Krasulina's update corresponds to a projected gradient descent step on the Stiefel manifold of the orthonormal k-frames, while Oja's update amounts to a gradient descent step using the unprojected gradient. Following these observations, we derive a more implicit form of Krasulina's k-PCA update, i.e. a version that uses the information of the future gradient as much as possible. Most interestingly, our implicit Krasulina update avoids the costly QR-decomposition step by bypassing the orthonormality constraint. We show that the new update in fact corresponds to an online EM step applied to a probabilistic k-PCA model. The probabilistic view of the updates allows us to combine multiple models in a distributed setting. We show experimentally that the implicit Krasulina update yields superior convergence while being significantly faster. We also give strong evidence that the new update can benefit from parallelism and is more stable w.r.t. tuning of the learning rate.


## Introduction

Principal Component Analysis (PCA) [Pearson1901] is one of the most widely used techniques in data analysis [Hastie, Tibshirani, and Friedman2009], dimensionality reduction [Van Der Maaten, Postma, and Van den Herik2009], and machine learning [Jolliffe2011]. The problem amounts to finding projections of the $d$-dimensional data along $k$ orthogonal directions such that the expected reconstruction error is minimized. Formally, let $y \in \mathbb{R}^d$ be a zero-mean random variable (we focus on the centered $k$-PCA problem since handling the mean value of a random variable is rather simple). In the vanilla $k$-PCA problem we seek a projection matrix $P$ (projection matrices are symmetric, $P^\top = P$, and idempotent, $P^2 = P$) of rank $k$ such that the compression loss of $y$,

$$\ell_{\text{comp}}(P) = \mathbb{E}\big[\operatorname{tr}\big((I_d - P)\, y y^\top\big)\big], \tag{1}$$

is minimized. Here $I_d$ denotes the $d \times d$ identity matrix.

The optimum projection matrix can be decomposed as $P = C C^\top$ (this decomposition is not unique; however, there exists a unique decomposition in terms of the orthonormal eigenvectors of $\mathbb{E}[y y^\top]$), where $C \in \mathbb{R}^{d \times k}$ is a matrix that spans the space of top-$k$ eigenvectors of the data covariance matrix $\mathbb{E}[y y^\top]$. For a given $C$, we denote by $x = C^\top y$ the projection of $y$ onto the column space of $C$. The reconstruction of $y$ from the projection can be calculated as $\hat{y} = C x = C C^\top y$. In practice, (1) is approximated using a finite number of samples, resulting in the following objective:

$$\hat{\ell}_{\text{comp}}(P) = \sfrac{1}{N}\operatorname{tr}\Big((I_d - P)\sum_n y_n y_n^\top\Big) = \sfrac{1}{N}\operatorname{tr}\big((I_d - P)\, Y Y^\top\big). \tag{2}$$

Here, $Y \in \mathbb{R}^{d \times N}$ denotes the matrix of observations, i.e. the $n$-th column is $y_n$. We also define $X = C^\top Y$ as the matrix of the projected samples. Note that the objective (2) is linear and thus convex in $P$, but non-convex in $C$. Nevertheless, the solution can be efficiently calculated using the SVD of the empirical data covariance matrix $\sfrac{1}{N}\, Y Y^\top$.

Although the original PCA problem concerns the full-batch setting, in many cases the dataset might be too large to be processed by a batch solver. In such scenarios, online PCA solvers that process a mini-batch (or a single observation) at a time are more desirable. In the online setting, two elegant solutions proposed by Krasulina [Krasulina1969] and Oja [Oja1982] for finding the top principal direction have received considerable attention and have been studied extensively throughout the years [Chen, Hua, and Yan1998, Balsubramani, Dasgupta, and Freund2013, Jain et al.2016, Allen-Zhu and Li2017]. In the following, we review the generalizations of these two algorithms to the $k$-PCA problem.

We show that both the Krasulina and Oja updates correspond to stochastic gradient descent steps on the compression loss (and its reduced form) using the gradient of the loss at the old parameter. Using the terminology introduced in [Kivinen, Warmuth, and Hassibi2006], we call updates that are based on the old gradient explicit updates. In this paper, we focus on deriving a more implicit form of Krasulina’s updates. Implicit here means that the updates are based on the future gradients at the updated parameters. As an example, consider the following regularized loss minimization over the parameter $\theta$:

$$\theta_{t+1} = \operatorname*{argmin}_{\theta}\Big(\sfrac{1}{2\eta}\,\|\theta - \theta_t\|^2 + \mathrm{loss}(\theta)\Big), \tag{3}$$

with learning rate $\eta > 0$. Setting the derivative w.r.t. $\theta$ to zero yields the following gradient descent update

$$\theta_{t+1} = \theta_t - \eta\,\nabla\mathrm{loss}(\theta_{t+1}). \tag{4}$$

The above update is referred to as an implicit gradient descent update since it uses the gradient of the loss at the future parameter $\theta_{t+1}$. In cases where solving the update using the gradient at the future estimate is infeasible, the update is usually approximated by the gradient at the current estimate

$$\theta_{t+1} \approx \theta_t - \eta\,\nabla\mathrm{loss}(\theta_t), \tag{5}$$

which is referred to as the explicit gradient descent update. In many settings, implicit updates are more stable and have better convergence properties compared to their explicit counterparts [Hassibi, Sayed, and Kailath1996, Kivinen, Warmuth, and Hassibi2006].
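As a toy illustration (our own, not from the paper), consider the scalar quadratic $\mathrm{loss}(\theta) = \sfrac{1}{2}\theta^2$, for which the implicit update (4) can be solved in closed form; the implicit step contracts for any positive learning rate, while the explicit step diverges once $\eta > 2$:

```python
# Explicit GD (5):  theta <- theta - eta * theta            = (1 - eta) * theta
# Implicit GD (4):  theta <- theta - eta * theta_new  =>  theta_new = theta / (1 + eta)
eta = 2.5
theta_exp = theta_imp = 1.0
for _ in range(20):
    theta_exp = (1.0 - eta) * theta_exp   # explicit: diverges since |1 - eta| > 1
    theta_imp = theta_imp / (1.0 + eta)   # implicit: contracts for any eta > 0

print(abs(theta_exp) > 1.0)   # True
print(abs(theta_imp) < 1e-3)  # True
```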

In summary:

• We first formulate the Krasulina and Oja updates as online (un)projected gradient descent steps on the Stiefel manifold.

• Using this observation, we derive a more implicit form of Krasulina’s update for the $k$-PCA problem that avoids the orthonormality constraint and is strikingly simple.

• We show that the new implicit Krasulina update actually amounts to an online EM step on a probabilistic $k$-PCA model. This allows combining multiple $k$-PCA models in a distributed setting using the recent framework of [Amid and Warmuth2019].

• With an extensive set of experiments, we show that the implicit Krasulina update yields better convergence while being more stable w.r.t. the choice of initial learning rate. Furthermore, by avoiding the orthonormalization step and maintaining matrix pseudo-inverses instead, we achieve a much faster update. Further speedup can be achieved by running our algorithm in parallel and combining the results.

### Related Work

Many efficient solvers have been developed for the vanilla PCA problem (2) throughout the years. For a reasonable data size, one can find the exact solution by applying a truncated SVD solver on the empirical data covariance matrix. Randomized SVD solvers [Halko, Martinsson, and Tropp2011] and power methods [Golub and Van Loan2012] are common alternatives when the size of the dataset is larger. Online PCA solvers are desirable when the data comes in as a stream or when a full pass over the data may not be feasible due to the large size of the dataset. In general, the online algorithms iteratively perform the updates using a mini-batch of observations (commonly a single observation) at every round. Among the online algorithms, Oja’s update [Oja1982] and its variants are the most well-studied [Jain et al.2016, Allen-Zhu and Li2017]. Notably, Shamir [Shamir2015] showed exponential convergence of a variance-reduced variant of Oja’s algorithm. However, the algorithm requires multiple passes over the data. On the other hand, a thorough convergence analysis of Krasulina’s algorithm and its extension to the $k$-PCA problem is still lacking. A partial analysis was done recently in [Tang2019], where exponential convergence was shown when the data covariance matrix has low rank. However, this assumption might be too restrictive in real-world scenarios. A lower expected rate of $O(\sfrac{1}{t})$ still holds for Krasulina’s $k$-PCA update for high-rank data [Balsubramani, Dasgupta, and Freund2013]. Other formulations for online PCA exploit the linearity of the objective (1) in $P$ by essentially maintaining a mixture of solutions for $P$ of rank $k$ as a capped density matrix [Warmuth and Kuzmin2008, Arora, Cotter, and Srebro2013]. This approach leads to algorithms with optimal regret bounds [Nie, Kotłowski, and Warmuth2016], but these algorithms are fundamentally less efficient than the incremental counterparts that aim to optimize $C$. However, inspired by this approach of optimizing for $P$, an incremental heuristic method has been developed [Arora et al.2012] that gives reasonable experimental convergence, but again suffers from high computational cost for the updates.

## Quadratic Program on the Stiefel Manifold

Given the zero-mean random variable $y \in \mathbb{R}^d$, consider the following optimization problem for the centered $k$-PCA problem:

$$C^* = \operatorname*{argmin}_{C \in \mathrm{St}(d,k)} \ell_{\text{comp}}(C), \quad\text{for}\quad \ell_{\text{comp}}(C) \coloneqq \sfrac{1}{2}\,\mathbb{E}\big[\|C C^\top y - y\|^2\big], \tag{6}$$

where $\mathrm{St}(d,k) = \{C \in \mathbb{R}^{d \times k} \,:\, C^\top C = I_k\}$ with $k \le d$ is the compact Stiefel manifold of orthonormal $d \times k$ matrices. We view $\mathrm{St}(d,k)$ as an embedded submanifold of $\mathbb{R}^{d \times k}$. The objective (6) is identical to the expected compression loss (1). Indeed when $C \in \mathrm{St}(d,k)$, then $P = C C^\top$ is a projection matrix. Also for $C^\top C = I_k$, we have $\|C C^\top y - y\|^2 = \operatorname{tr}\big((I_d - C C^\top)\, y y^\top\big)$. Thus, we can rewrite (6) as

$$C^* = \operatorname*{argmin}_{C \in \mathrm{St}(d,k)} \ell_{\text{var}}(C), \quad\text{for}\quad \ell_{\text{var}}(C) \coloneqq \sfrac{1}{2}\big(\operatorname{tr}(\mathbb{E}[y y^\top]) - \operatorname{tr}(C^\top \mathbb{E}[y y^\top]\, C)\big). \tag{7}$$

Thus minimizing the compression loss (6) is equivalent to maximizing the variance of the projection $x = C^\top y$. Note that although the values of the objectives (6) and (7) are identical when $C \in \mathrm{St}(d,k)$, they might yield different updates for the gradient-based methods, as we shall see in the following.
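As a quick numerical sanity check (our own illustration, not from the paper), the empirical versions of the two objectives agree for an orthonormal $C$:

```python
import numpy as np

# Check that the compression loss (6) and the variance form (7) coincide
# on the Stiefel manifold, evaluated on a finite sample.
rng = np.random.default_rng(5)
d, k, N = 6, 2, 1000
Y = rng.standard_normal((d, N))                   # zero-mean samples, as columns
C, _ = np.linalg.qr(rng.standard_normal((d, k)))  # random point on St(d, k)

l_comp = 0.5 * np.mean(np.sum((C @ (C.T @ Y) - Y) ** 2, axis=0))
S = Y @ Y.T / N                                   # empirical E[y y^T]
l_var = 0.5 * (np.trace(S) - np.trace(C.T @ S @ C))
print(np.allclose(l_comp, l_var))  # True
```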

A stochastic optimization procedure for solving (6) can be motivated similarly to (3), where an inertia term is added to the loss to keep the updated parameters close to the current estimates, that is,

$$C_{\text{new}} = \operatorname*{argmin}_{\tilde{C} \in \mathrm{St}(d,k)} \sfrac{1}{2}\Big(\sfrac{1}{\eta}\,\big\|\tilde{C} - C\big\|_F^2 + \mathbb{E}\big[\|\tilde{C} x - y\|^2\big]\Big),$$

where $x = C^\top y$ is the projection onto the column space of $C$ and $\|\cdot\|_F^2$ denotes the squared Frobenius norm. A procedure for solving the optimization problem is based on iteratively applying a gradient descent step using the projected gradient of the objective onto the tangent space $\mathcal{T}_C$ of $\mathrm{St}(d,k)$ at $C$, followed by a retraction step [Liu, So, and Wu2016]. The tangent space at $C$ is characterized by $\mathcal{T}_C = \{Z \in \mathbb{R}^{d \times k} \,:\, C^\top Z + Z^\top C = 0\}$, and the projected gradient of $\ell_{\text{comp}}$, denoted by $\bar{\nabla}\ell_{\text{comp}}$, can be obtained by projecting the Euclidean gradient onto $\mathcal{T}_C$:

$$\nabla\ell_{\text{comp}}(C) = \mathbb{E}\big[(C C^\top y - y)\, y^\top C\big], \qquad \bar{\nabla}\ell_{\text{comp}}(C) = (I_d - C C^\top)\,\nabla\ell_{\text{comp}}(C) = C C^\top \mathbb{E}[y y^\top]\, C - \mathbb{E}[y y^\top]\, C = \nabla\ell_{\text{comp}}(C).$$

Notice that $\bar{\nabla}\ell_{\text{comp}}(C) = \nabla\ell_{\text{comp}}(C)$ for $C \in \mathrm{St}(d,k)$. In a stochastic approximation setting, the gradient at each iteration is approximated by a given batch of observations $\{y_n\}_{n=1}^N$:

$$\bar{\nabla}\hat{\ell}_{\text{comp}}(C) = \sfrac{1}{N}\sum_n \big(C C^\top y_n y_n^\top C - y_n y_n^\top C\big) = \sfrac{1}{N}\,(C X - Y)\, X^\top, \tag{8}$$

where $Y \in \mathbb{R}^{d \times N}$ denotes the matrix of observations and $X = C^\top Y$. Thus, the stochastic approximation update becomes

$$\tilde{C} = C - \eta\,\bar{\nabla}\hat{\ell}_{\text{comp}}(C) = C - \sfrac{\eta}{N}\,(C X - Y)\, X^\top \quad\text{and}\quad C_{\text{new}} = \mathrm{QR}(\tilde{C}). \tag{9}$$

The intermediate parameter $\tilde{C} \notin \mathrm{St}(d,k)$ in general, but the second QR step ensures that $C_{\text{new}} \in \mathrm{St}(d,k)$. This update (9) is identical to the extension of Krasulina’s update to the $k$-PCA problem, which was proposed recently in [Tang2019].
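A minimal NumPy sketch of one step of update (9) (our own illustration; the function name is ours):

```python
import numpy as np

def krasulina_step(C, Y, eta):
    """One explicit Krasulina k-PCA step (9): projected gradient + QR retraction.

    C: (d, k) orthonormal; Y: (d, N) mini-batch of centered observations."""
    N = Y.shape[1]
    X = C.T @ Y                                    # projected samples
    C_tilde = C - (eta / N) * (C @ X - Y) @ X.T    # gradient step
    Q, _ = np.linalg.qr(C_tilde)                   # retraction back onto St(d, k)
    return Q

rng = np.random.default_rng(0)
d, k, N = 20, 3, 64
C, _ = np.linalg.qr(rng.standard_normal((d, k)))
Y = rng.standard_normal((d, N))
C = krasulina_step(C, Y, eta=0.1)
print(np.allclose(C.T @ C, np.eye(k)))  # True: QR restores orthonormality
```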

Alternatively, the Euclidean gradient of (7) becomes

$$\nabla\ell_{\text{var}}(C) = -\,\mathbb{E}[y y^\top]\, C.$$

Projecting this gradient onto the tangent space yields
$$\bar{\nabla}\ell_{\text{var}}(C) = (I_d - C C^\top)\,\nabla\ell_{\text{var}}(C) = C C^\top \mathbb{E}[y y^\top]\, C - \mathbb{E}[y y^\top]\, C = \bar{\nabla}\ell_{\text{comp}}(C).$$
Thus the update using the projected gradient again yields Krasulina’s update for $k$-PCA (9). Interestingly, the update using the Euclidean (unprojected) gradient gives the well-known Oja update for $k$-PCA [Oja1982]:

$$\tilde{C} = C - \eta\,\nabla\hat{\ell}_{\text{var}}(C) = C + \sfrac{\eta}{N}\, Y Y^\top C, \quad\text{and}\quad C_{\text{new}} = \mathrm{QR}(\tilde{C}).$$

Thus, Oja’s update is a stochastic approximation update that aims to maximize the variance of the projection, but ignores the structure of the tangent space of the Stiefel manifold.
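The Oja step differs only in using the unprojected gradient; a sketch under the same conventions (our own code):

```python
import numpy as np

def oja_step(C, Y, eta):
    """One Oja k-PCA step: unprojected (Euclidean) gradient + QR retraction.

    C: (d, k) orthonormal; Y: (d, N) mini-batch of centered observations."""
    N = Y.shape[1]
    C_tilde = C + (eta / N) * (Y @ (Y.T @ C))  # step along -grad of the variance loss
    Q, _ = np.linalg.qr(C_tilde)
    return Q

rng = np.random.default_rng(0)
d, k, N = 20, 3, 64
C0, _ = np.linalg.qr(rng.standard_normal((d, k)))
Y = rng.standard_normal((d, N))
C1 = oja_step(C0, Y, eta=0.1)
print(np.allclose(C1.T @ C1, np.eye(k)))  # True
```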

Note that both the Krasulina and Oja updates are explicit (similar to the gradient descent update (5)) in that they use the gradient at the old parameter rather than the new parameter [Kivinen, Warmuth, and Hassibi2006]. Unfortunately, using the new parameter for the updates as in (4) does not yield a tractable solution. However, we will show that a more implicit form of Krasulina’s update, i.e. the one that uses the gradient at the new parameter, can be achieved when the orthonormality constraint is abandoned.

## New Update without the Orthonormality Constraint

Instead of maintaining an orthonormal matrix, we let $C \in \mathbb{R}^{d \times k}$ be any matrix of rank $k$ and use the fact that the projection of an observation $y$ onto the column space of $C$ corresponds to $C x$ with $x = C^\dagger y = (C^\top C)^{-1} C^\top y$:

$$C_{\text{new}} = \operatorname*{argmin}_{\tilde{C}}\Big(\sfrac{1}{2\eta}\,\big\|\tilde{C} - C\big\|_F^2 + \sfrac{1}{2}\,\mathbb{E}\big[\|\tilde{C} x - y\|^2\big]\Big), \tag{10}$$

where $x = C^\dagger y$ is the projection using the old matrix $C$. Setting the derivatives w.r.t. $\tilde{C}$ to zero, we obtain

$$C_{\text{new}} = C - \eta\,\mathbb{E}\big[(C_{\text{new}}\, x - y)\, x^\top\big] = \big(\mathbb{E}[y x^\top] + \sfrac{1}{\eta}\, C\big)\big(\mathbb{E}[x x^\top] + \sfrac{1}{\eta}\, I_k\big)^{-1}.$$

For a batch of $N$ points $\{y_n\}_{n=1}^N$, this new update, which we call the implicit Krasulina update, becomes

$$C_{\text{new}} = \big(\sfrac{1}{N}\, Y X^\top + \sfrac{1}{\eta}\, C\big)\big(\sfrac{1}{N}\, X X^\top + \sfrac{1}{\eta}\, I_k\big)^{-1}. \tag{11}$$

Note that an explicit variant can also be obtained by simply approximating $C_{\text{new}}$ in the derivative equation by the old matrix $C$:
$$C_{\text{new}} = C - \eta\,\mathbb{E}\big[(C x - y)\, x^\top\big].$$
This is update (9) without enforcing the orthonormality constraint using the QR decomposition and in which the projection is calculated using $C^\dagger$ instead of $C^\top$.

A few additional remarks are in order. First, it may seem plausible to replace the projected gradient term in (9) with a more recent projected gradient to achieve an implicit update similar to (11). However, note that $\tilde{C} \notin \mathrm{St}(d,k)$ in general and thus, this would not correspond to a valid projected gradient update. Avoiding the orthonormality constraint assures that the future gradient is indeed a valid descent direction. Additionally, note that (11) is only partially implicit since we use the old $C$ for finding the projection $x = C^\dagger y$. Unfortunately, the fully implicit update, which would use $x = C_{\text{new}}^\dagger\, y$, does not yield a closed-form solution.
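A sketch of the batch update (11) in NumPy (our own code; the per-batch loss check reflects the EM-based improvement guarantee discussed in the probabilistic section):

```python
import numpy as np

def implicit_krasulina_step(C, Y, eta):
    """Implicit Krasulina update (11). C: (d, k) arbitrary rank-k; Y: (d, N)."""
    k, N = C.shape[1], Y.shape[1]
    X = np.linalg.pinv(C) @ Y                     # projection with the OLD matrix
    return (Y @ X.T / N + C / eta) @ np.linalg.inv(X @ X.T / N + np.eye(k) / eta)

def comp_loss(C, Y):
    """Average compression loss with P = C C^+ (no orthonormality needed)."""
    P = C @ np.linalg.pinv(C)
    return np.sum((Y - P @ Y) ** 2) / Y.shape[1]

rng = np.random.default_rng(0)
d, k, N = 15, 3, 100
C = rng.standard_normal((d, k))
Y = rng.standard_normal((d, N))
C_new = implicit_krasulina_step(C, Y, eta=0.5)
print(comp_loss(C_new, Y) <= comp_loss(C, Y) + 1e-9)  # True: loss never increases on the batch
```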

The implicit update (11) has a simple form in the stochastic setting, when a single observation $y_t$ is received at round $t$. Let $x_t = C^\dagger y_t$ denote the projection. Applying the Sherman–Morrison formula [Golub and Van Loan2012] for the inverse of rank-one matrix updates, we can write (11) as

$$C_{\text{new}} = C - \frac{\eta}{1 + \eta\,\|x_t\|^2}\,(C x_t - y_t)\, x_t^\top, \tag{12}$$

which we call the stochastic implicit Krasulina update. Note that the learning rate in (12) is essentially inversely proportional to the squared norm of the individual projection $x_t$. The same fractional form of the learning rate appears in the implicit update for online stochastic gradient descent for linear regression [Kivinen and Warmuth1997, Kivinen, Warmuth, and Hassibi2006], which coincides with the differently motivated “normalized LMS” algorithm of [Hassibi, Sayed, and Kailath1996]. As noted in [Kivinen, Warmuth, and Hassibi2006], implicit updates have slower initial convergence (due to the smaller learning rate) and a smaller final error rate compared to the explicit updates. Also, as a result of the adaptivity of the learning rate, the implicit Krasulina update becomes less sensitive to the initial choice of the learning rate. We will show this in the experimental section.
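The Sherman–Morrison step can be checked numerically; for a single observation the batch form (11) collapses exactly to (12) (our own verification code):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, eta = 10, 3, 0.5
C = rng.standard_normal((d, k))
y = rng.standard_normal((d, 1))
x = np.linalg.pinv(C) @ y                    # x_t = C^+ y_t

# Batch form (11) with N = 1
C_batch = (y @ x.T + C / eta) @ np.linalg.inv(x @ x.T + np.eye(k) / eta)

# Stochastic form (12) via Sherman-Morrison
gain = eta / (1.0 + eta * float(x.T @ x))
C_stoch = C - gain * (C @ x - y) @ x.T

print(np.allclose(C_batch, C_stoch))  # True
```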

## Efficient Implementation

We provide some insight on the efficient implementation of the implicit Krasulina update. While our update avoids the costly QR step, it may impose extra computational complexity when calculating the projection. Namely, the computational complexity of the stochastic update (12) is dominated by the calculation of the matrix pseudo-inverse $C^\dagger = (C^\top C)^{-1} C^\top$. In order to implement the stochastic updates efficiently, we can exploit the fact that each iteration involves a rank-$1$ update on the matrix $C$. One approach to reduce the computational complexity would be to maintain the inverse matrix $(C^\top C)^{-1}$ and update it accordingly at every iteration. Note that a rank-$1$ update on $C$ corresponds to a rank-$2$ update on $C^\top C$. Thus, the inverse can be carried out efficiently using the Woodbury matrix identity [Woodbury1950]. The one-time complexity of calculating the initial inverse is $O(d k^2)$, but can be significantly reduced by e.g. using a rectangular diagonal matrix as the initial solution. Thus, the overall complexity of the algorithm for the stochastic update (12) becomes $O(d k)$ per iteration.
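The bookkeeping above can be sketched as follows (our own code, assuming generic data so the intermediate matrices stay invertible): a rank-$1$ change of $C$ induces a rank-$2$ change of $C^\top C$, handled by two Sherman–Morrison steps on the maintained inverse.

```python
import numpy as np

def sherman_morrison(Ainv, a, b):
    """Inverse of (A + a b^T), given Ainv and column vectors a, b."""
    return Ainv - (Ainv @ a) @ (b.T @ Ainv) / (1.0 + float(b.T @ Ainv @ a))

rng = np.random.default_rng(2)
d, k = 12, 4
C = rng.standard_normal((d, k))
G = np.linalg.inv(C.T @ C)           # maintained inverse of C^T C

u = rng.standard_normal((d, 1))      # rank-1 change: C <- C + u x^T
x = rng.standard_normal((k, 1))
C_new = C + u @ x.T

# C^T C changes by x p^T + q x^T with p = C^T u and q = p + (u^T u) x
p = C.T @ u
q = p + float(u.T @ u) * x
G = sherman_morrison(G, x, p)
G = sherman_morrison(G, q, x)

print(np.allclose(G, np.linalg.inv(C_new.T @ C_new)))  # True
```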

A computationally more efficient approach is to directly keep track of the matrix $C^\dagger$ and apply the rank-$1$ update for the Moore–Penrose inverse proposed in [Meyer1973]. The complexity of each rank-$1$ update amounts to $O(d k)$. Our implicit Krasulina update is summarized in Algorithm 1, where the operator RankOnePinvUpdate is described in [Petersen and Pedersen2008].

## Probabilistic k-PCA and Online EM

We now provide an alternative motivation for our new implicit Krasulina update (11) as an instantiation of the recent online Expectation Maximization (EM) algorithm [Amid and Warmuth2019] for a certain probabilistic $k$-PCA model. This probabilistic $k$-PCA model was introduced in [Roweis1998, Tipping and Bishop1999] as a linear-Gaussian model:

$$y = C x + \nu, \quad\text{where}\quad x \sim \mathcal{N}(0_k, I_k) \;\text{ and }\; \nu \sim \mathcal{N}(0_d, Q), \tag{13}$$

where $x \in \mathbb{R}^k$ is the unknown hidden state, $\nu$ denotes the observation noise, and $C \in \mathbb{R}^{d \times k}$. Also, $\mathcal{N}(\mu, \Sigma)$ denotes a Gaussian distribution having mean $\mu$ and covariance $\Sigma$. Usually, the noise covariance is assumed to be isotropic, i.e. $Q = \epsilon\, I_d$. Note that since all random variables are Gaussian, the posterior distribution of the hidden state also becomes Gaussian, that is,

$$x \,|\, y \sim \mathcal{N}(\beta\, y,\; I_k - \beta\, C), \quad\text{where}\quad \beta = C^\top (C C^\top + Q)^{-1}.$$

An interesting case happens in the limit where the covariance of the noise becomes infinitesimally small. Namely, in the limit $\epsilon \to 0^+$, the likelihood of a point $y$ is solely determined by the squared Euclidean distance between $y$ and its reconstruction $C x$. The posterior of the hidden state collapses into a single point,

$$x \,|\, y \sim \mathcal{N}\big((C^\top C)^{-1} C^\top y,\; 0\big) = \delta\big(x - (C^\top C)^{-1} C^\top y\big),$$

where $\delta$ is the Dirac measure. Moreover, the maximum likelihood estimator for $C$ is achieved when $C$ spans the space of top-$k$ eigenvectors of the data covariance matrix [Tipping and Bishop1999]. Thus, the linear-Gaussian model reduces to the vanilla $k$-PCA problem.

Using the probabilistic $k$-PCA formulation allows solving the vanilla $k$-PCA problem iteratively in the zero-noise limit via the application of the EM algorithm [Dempster, Laird, and Rubin1977]. Let $\Theta$ denote the set of parameters of the probabilistic latent variable model with joint probability density $P_\Theta(x, y)$. The EM upper-bound can be written as

$$U_{\Theta^{(t)}}\big(\tilde{\Theta} \,\big|\, Y^{(t)}\big) = -\sfrac{1}{N^{(t)}}\sum_n \int_x P_{\Theta^{(t)}}(x \,|\, y_n)\,\log P_{\tilde{\Theta}}(x, y_n)\, dx, \tag{14}$$

where $\Theta^{(t)}$ denotes the current set of parameters and $Y^{(t)}$ the batch of $N^{(t)}$ observations at round $t$. We now consider the case when $\epsilon$ becomes infinitesimally small. Note that in the limit $\epsilon \to 0^+$, we have $\beta \to C^\dagger = (C^\top C)^{-1} C^\top$ and $I_k - \beta\, C \to 0$. Iteratively forming the EM upper-bound and minimizing it yields the following procedure (assuming that $C$ is rank-$k$) [Roweis1998]:

$$X = C^\dagger Y = (C^\top C)^{-1} C^\top Y \quad\text{(E-step)}, \qquad C_{\text{new}} = Y X^\dagger = Y X^\top (X X^\top)^{-1} \quad\text{(M-step)}, \tag{15}$$

where the matrices $X$ and $Y$ are defined as before. Because of the tightness of the EM upper-bound, every step of the EM algorithm is guaranteed to either improve the compression loss or leave it unchanged [Amid and Warmuth2019]. The final projection matrix is obtained as $P = C C^\dagger$.
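A minimal sketch of the zero-noise EM iterations (15) (our own code); the span of $C$ converges to the top-$k$ principal subspace of the sample:

```python
import numpy as np

def em_pca(Y, k, iters=200, seed=0):
    """Alternate the E-step X = C^+ Y and the M-step C = Y X^+ of (15)."""
    d = Y.shape[0]
    C = np.random.default_rng(seed).standard_normal((d, k))
    for _ in range(iters):
        X = np.linalg.pinv(C) @ Y    # E-step
        C = Y @ np.linalg.pinv(X)    # M-step
    return C

rng = np.random.default_rng(3)
Y = rng.standard_normal((8, 500)) * np.arange(8, 0, -1)[:, None]  # decaying variances
C = em_pca(Y, k=2)
P_em = C @ np.linalg.pinv(C)                         # final projection P = C C^+
U = np.linalg.svd(Y, full_matrices=False)[0][:, :2]  # top-2 left singular vectors
print(np.allclose(P_em, U @ U.T, atol=1e-6))  # True: same subspace as the SVD
```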

We now motivate our new implicit Krasulina update as an online version of the update (15). This can be achieved by an application of a recent online EM algorithm developed in [Amid and Warmuth2019]. In the online setting, the learner receives a mini-batch of observations (usually a single observation) at a time and performs parameter updates by minimizing the negative log-likelihood of the given examples. In order to make the learning stable, an inertia (aka regularizer) term is added to the loss to keep the updates close to the old parameters. Thus, the learner minimizes the combined inertia plus loss of the current iteration. Let $Y^{(t)}$ be the given batch of observations at round $t$. The online EM algorithm introduced in [Amid and Warmuth2019] is motivated in the same manner by minimizing the following loss at iteration $t$:

$$\Theta^{(t+1)} = \operatorname*{argmin}_{\tilde{\Theta}}\Big(\sfrac{1}{\eta^{(t)}}\,\Delta_{\mathrm{RE}}\big(\Theta^{(t)}, \tilde{\Theta}\big) + U_{\Theta^{(t)}}\big(\tilde{\Theta} \,\big|\, Y^{(t)}\big)\Big),$$

where $\Delta_{\mathrm{RE}}$ is the relative entropy divergence between the joints and $\eta^{(t)} > 0$ is the learning rate. In the following, we consider one iteration of the online EM algorithm and drop the superscript $(t)$ to avoid clutter (we also write $\eta'$ for the learning rate, for reasons which will become clear in the following).

We now consider the online EM algorithm when applied to the probabilistic $k$-PCA model (13). Note that $\Theta = \{C, \epsilon\}$ and $\tilde{\Theta} = \{\tilde{C}, \tilde{\epsilon}\}$. We can write

$$\Delta_{\mathrm{RE}}\big(\Theta, \tilde{\Theta}\big) = \frac{d}{2}\Big(\frac{\epsilon}{\tilde{\epsilon}} - \log\frac{\epsilon}{\tilde{\epsilon}} - 1\Big) + \frac{1}{2\,\tilde{\epsilon}}\operatorname{tr}\big((C - \tilde{C})(C - \tilde{C})^\top\big). \tag{16}$$

We now consider the case where $\tilde{\epsilon} = \epsilon$ is fixed. Combining (16) and (14), the update for the parameter $C$ becomes

$$C_{\text{new}} = \Big(\frac{1}{N}\, Y X^\top + \frac{1}{\epsilon\,\eta'}\, C\Big)\Big(I_k - \beta\, C + \frac{1}{N}\, X X^\top + \frac{1}{\epsilon\,\eta'}\, I_k\Big)^{-1} \quad\text{(M-step)}.$$

In order to recover the updates for the online $k$-PCA, we again need to consider the case when $\epsilon$ becomes infinitesimally small. Choosing $\eta'$ such that $\sfrac{1}{(\epsilon\,\eta')} = \sfrac{1}{\eta}$ yields (11). Note that the E-step of the online EM algorithm becomes identical to (15). Also, the limit case $\eta' \to 0^+$ keeps the parameters unchanged, i.e. $C_{\text{new}} = C$, and the case $\eta' \to \infty$ recovers the batch EM updates (15).

## Distributed Setting

The alternative view of the implicit Krasulina update (11) as an EM step allows combining multiple models in a distributed setting via the online EM framework introduced in [Amid and Warmuth2019]. More specifically, given a set of $M$ hidden variable models parameterized by $\{\Theta^{(i)}\}_{i \in [M]}$, the optimal combined model corresponds to the minimizer of the following objective

$$\Theta^* = \operatorname*{argmin}_{\Theta}\sum_{i \in [M]} \alpha_i\, \Delta_{\mathrm{RE}}\big(\Theta^{(i)}, \Theta\big),$$

where $\alpha_i \ge 0$ is the weight associated with model $i$. The weight of the model can be assigned based on the performance on a validation set or based on the number of observations processed by the model so far. For synchronous updates, we simply have $\alpha_i = \sfrac{1}{M}$. In a distributed online PCA setting, let $C^{(i)}$ denote the matrix learned by model $i$. By fixing $\epsilon$ for all the models and letting $\epsilon \to 0^+$, we have

$$C^* = \frac{\sum_{i \in [M]} \alpha_i\, C^{(i)}}{\sum_{i \in [M]} \alpha_i}. \tag{17}$$

The probabilistic view of our implicit Krasulina updates allows training multiple $k$-PCA models in parallel and combining them efficiently via simple averaging. Note that for the previous approaches, the matrices $C^{(i)}$ are orthonormal and, to the best of our knowledge, there is no systematic way of combining rank-$k$ orthonormal matrices. One trick would be to also average these matrices. However, the average of a set of orthonormal matrices does not necessarily yield an orthonormal matrix, and a QR step is required after combining. As we will show experimentally, the heuristic of simply averaging the orthonormal matrices produced by the Krasulina and Oja updates and then orthonormalizing the average yields poor empirical results. On the other hand, our new implicit Krasulina update produces arbitrary matrices, and averaging those matrices (as advised by the online EM framework) results in excellent performance.
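In code, the combination step (17) is a plain weighted average of the per-machine matrices (our own sketch), with no QR step needed:

```python
import numpy as np

M, d, k = 4, 10, 3
rng = np.random.default_rng(4)
Cs = [rng.standard_normal((d, k)) for _ in range(M)]  # per-machine solutions C^(i)

alphas = np.ones(M) / M                 # synchronous updates: alpha_i = 1/M
C_star = sum(a * Ci for a, Ci in zip(alphas, Cs)) / alphas.sum()

print(np.allclose(C_star, np.mean(Cs, axis=0)))  # True for uniform weights
```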

## Experiments

In this section, we perform experiments on real-world datasets using our proposed implicit form of Krasulina's update (12) and contrast the results with Oja's, Krasulina's, and the incremental algorithm [Arora et al.2012]. We perform experiments on the MNIST dataset of handwritten digit images with dimension $d = 784$ and the CIFAR-10 dataset of real-world images of dimension $d = 3072$, each having 60,000 images in total. We apply all the updates in a stochastic manner by sweeping over the data once. We consider a decaying learning rate schedule $\eta_t$ that depends on the iteration number $t$, a decay constant, and the initial learning rate $\eta_0$. For each dataset, we randomly select 10% of the data as a validation set to select the optimal initial learning rate and the value of the decay constant for each algorithm, tuned separately for Krasulina's and Oja's methods and for our method. Note that the incremental algorithm does not require a learning rate. We report the performance on the full dataset. We repeat each experiment 10 times with a different random initialization and report the average. For comparison, we also calculate the result of the batch PCA solution using an SVD of the empirical data covariance matrix for each experiment.

### Centered PCA

We show the results of the algorithms on the centered datasets in Figure 1 as a function of runtime. We plot the percentage of excess loss (i.e. normalized regret) w.r.t. the best achievable loss by an offline algorithm (i.e. full-batch PCA). Each dot shows the progress of the algorithm in 1000-iteration intervals. As can be seen, our implicit Krasulina update achieves the best convergence among all the algorithms. Additionally, our algorithm is considerably faster in most cases. In particular, the advantage of our update becomes more evident as the values of $d$ and $k$ increase. This can be explained by the low complexity of the matrix updates for our algorithm versus the costly QR update or the eigendecomposition operation for the remaining algorithms.

### Sensitivity to Initial Learning Rate

We demonstrate the sensitivity of each algorithm to the choice of initial learning rate by applying each algorithm using the optimal learning rate (obtained based on the performance on a validation set) as well as running the same algorithm with perturbed values of the optimal learning rate. We show the results in Table 1. In the table, we show the final loss of each algorithm on the full dataset. Note that the incremental algorithm is unaffected since no learning rate is used for this method. Among the other methods, our implicit Krasulina algorithm provides excellent convergence even with the non-optimal learning rates and has the lowest sensitivity. The performance of Oja’s and Krasulina’s algorithms immediately deteriorates as the value of the learning rate is altered.

### Distributed Setting

Finally, we evaluate the results of different algorithms in a distributed setting. We randomly split the data across $M$ machines and perform synchronous updates by combining the results every 1000 iterations. In our experiments all sub-problems have the same size and we use $\alpha_i = \sfrac{1}{M}$. We propagate back the value of the combined matrix to each machine. For our proposed algorithm, we apply the online EM framework [Amid and Warmuth2019], which corresponds to averaging the learned matrices of all machines as in (17). For all the remaining methods, since there exists no clear procedure for combining orthonormal matrices, we naïvely apply the same procedure. However, the average of a set of orthonormal matrices does not necessarily correspond to an orthonormal matrix. Thus, we apply the QR step on the combined matrix before calculating the loss and propagating back the combined value. We show the results in Figure 2. In the figure, we plot the percentage of the excess loss w.r.t. the loss of the optimal algorithm, i.e. batch PCA. The solid line indicates the performance of the combined model. We also calculate the loss of each individual model (of each machine) on the full dataset over time and plot the average loss value across the machines with a dashed line. This verifies whether the combined model performs better than each individual model on average.

As can be seen from the figure, our implicit Krasulina algorithm consistently provides excellent performance and converges to the optimal solution. Among the remaining methods, Krasulina’s algorithm provides better convergence behaviour, but converges to an inferior solution. The Oja algorithm as well as the incremental algorithm fail to provide comparable results. The final loss of the combined model for our algorithm is also very close to the final loss of running our algorithm on a single machine on the full dataset, as shown in Table 2.

## Conclusion

The advantage of using future gradients has now appeared in a large variety of contexts (e.g. [Nesterov1983, Cheng et al.2007, Kulis and Bartlett2010]). Here we develop a partially implicit version of Krasulina’s update for the centered $k$-PCA problem that is dramatically better than the standard explicit update. A second key component is to use a latent variable interpretation of the problem and then apply the online EM framework for combining models in a distributed setting [Amid and Warmuth2019]. Combining sub-models is equally important for the $k$-PCA problem, and finding further practical applications of this second component and blending it with other methods is a promising future direction.

## References

• [Allen-Zhu and Li2017] Allen-Zhu, Z., and Li, Y. 2017. First efficient convergence for streaming k-pca: a global, gap-free, and near-optimal rate. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), 487–492. IEEE.
• [Amid and Warmuth2019] Amid, E., and Warmuth, M. K. 2019. Divergence-based motivation for online EM and combining hidden variable models. arXiv preprint arXiv:1902.04107 https://arxiv.org/pdf/1902.04107.pdf.
• [Arora et al.2012] Arora, R.; Cotter, A.; Livescu, K.; and Srebro, N. 2012. Stochastic optimization for PCA and PLS. In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 861–868. IEEE.
• [Arora, Cotter, and Srebro2013] Arora, R.; Cotter, A.; and Srebro, N. 2013. Stochastic optimization of PCA with capped MSG. In Advances in Neural Information Processing Systems, 1815–1823.
• [Balsubramani, Dasgupta, and Freund2013] Balsubramani, A.; Dasgupta, S.; and Freund, Y. 2013. The fast convergence of incremental PCA. In Advances in Neural Information Processing Systems, 3174–3182.
• [Chen, Hua, and Yan1998] Chen, T.; Hua, Y.; and Yan, W.-Y. 1998. Global convergence of Oja’s subspace algorithm for principal component extraction. IEEE Transactions on Neural Networks 9(1):58–67.
• [Cheng et al.2007] Cheng, L.; Schuurmans, D.; Wang, S.; Caelli, T.; and Vishwanathan, S. 2007. Implicit online learning with kernels. In Schölkopf, B.; Platt, J. C.; and Hoffman, T., eds., Advances in Neural Information Processing Systems 19. MIT Press. 249–256.
• [Dempster, Laird, and Rubin1977] Dempster, A. P.; Laird, N. M.; and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (methodological) 1–38.
• [Golub and Van Loan2012] Golub, G. H., and Van Loan, C. F. 2012. Matrix computations, volume 3. JHU press.
• [Halko, Martinsson, and Tropp2011] Halko, N.; Martinsson, P.-G.; and Tropp, J. A. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review 53(2):217–288.
• [Hassibi, Sayed, and Kailath1996] Hassibi, B.; Sayed, A. H.; and Kailath, T. 1996. H∞ optimality of the LMS algorithm. IEEE Transactions on Signal Processing 44(2):267–280.
• [Hastie, Tibshirani, and Friedman2009] Hastie, T.; Tibshirani, R.; and Friedman, J. 2009. The Elements of Statistical Learning – Data Mining, Inference, and Prediction. Springer Series in Statistics.
• [Jain et al.2016] Jain, P.; Jin, C.; Kakade, S. M.; Netrapalli, P.; and Sidford, A. 2016. Streaming PCA: Matching matrix Bernstein and near-optimal finite sample guarantees for Oja’s algorithm. In Conference on learning theory, 1147–1164.
• [Jolliffe2011] Jolliffe, I. 2011. Principal component analysis. Springer.
• [Kivinen and Warmuth1997] Kivinen, J., and Warmuth, M. K. 1997. Exponentiated gradient versus gradient descent for linear predictors. Inf. Comput. 132(1):1–63.
• [Kivinen, Warmuth, and Hassibi2006] Kivinen, J.; Warmuth, M. K.; and Hassibi, B. 2006. The p-norm generalization of the LMS algorithm for adaptive filtering. IEEE Transactions on Signal Processing 54(5):1782–1793.
• [Krasulina1969] Krasulina, T. 1969. The method of stochastic approximation for the determination of the least eigenvalue of a symmetrical matrix. USSR Computational Mathematics and Mathematical Physics 9(6):189–195.
• [Kulis and Bartlett2010] Kulis, B., and Bartlett, P. L. 2010. Implicit online learning. In ICML.
• [Liu, So, and Wu2016] Liu, H.; So, A. M.-C.; and Wu, W. 2016. Quadratic optimization with orthogonality constraint: explicit Łojasiewicz exponent and linear convergence of retraction-based line-search and stochastic variance-reduced gradient methods. In ICML, 1158–1167.
• [Meyer1973] Meyer, Jr, C. D. 1973. Generalized inversion of modified matrices. SIAM Journal on Applied Mathematics 24(3):315–323.
• [Nesterov1983] Nesterov, Y. E. 1983. A method for solving the convex programming problem with convergence rate of $O(\sfrac{1}{k^2})$. Dokl. Akad. Nauk SSSR 269:543–547.
• [Nie, Kotłowski, and Warmuth2016] Nie, J.; Kotłowski, W.; and Warmuth, M. K. 2016. Online PCA with optimal regret. The Journal of Machine Learning Research 17(1):6022–6070.
• [Oja1982] Oja, E. 1982. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology 15(3):267–273.
• [Pearson1901] Pearson, K. 1901. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2(11):559–572.
• [Petersen and Pedersen2008] Petersen, K. B., and Pedersen, M. S. 2008. The matrix cookbook, volume 7(15). Technical University of Denmark. Version 20081110.
• [Roweis1998] Roweis, S. T. 1998. EM algorithms for PCA and SPCA. In Advances in neural information processing systems, 626–632.
• [Shamir2015] Shamir, O. 2015. A stochastic PCA and SVD algorithm with an exponential convergence rate. In International Conference on Machine Learning, 144–152.
• [Tang2019] Tang, C. 2019. Exponentially convergent stochastic k-PCA without variance reduction. arXiv preprint arXiv:1904.01750 https://arxiv.org/abs/1904.01750.
• [Tipping and Bishop1999] Tipping, M. E., and Bishop, C. M. 1999. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61(3):611–622.
• [Van Der Maaten, Postma, and Van den Herik2009] Van Der Maaten, L.; Postma, E.; and Van den Herik, J. 2009. Dimensionality reduction: A comparative review. Journal of Machine Learning Research.
• [Warmuth and Kuzmin2008] Warmuth, M. K., and Kuzmin, D. 2008. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research 9(Oct):2287–2320.
• [Woodbury1950] Woodbury, M. A. 1950. Inverting Modified Matrices. Princeton, NJ: Department of Statistics, Princeton University.