Diffusion Approximations for Online Principal Component Estimation and Global Convergence

08/29/2018 ∙ by Chris Junchi Li, et al. ∙ Princeton University 4

In this paper, we propose to adopt the diffusion approximation tools to study the dynamics of Oja's iteration which is an online stochastic gradient descent method for the principal component analysis. Oja's iteration maintains a running estimate of the true principal component from streaming data and enjoys less temporal and spatial complexities. We show that the Oja's iteration for the top eigenvector generates a continuous-state discrete-time Markov chain over the unit sphere. We characterize the Oja's iteration in three phases using diffusion approximation and weak convergence tools. Our three-phase analysis further provides a finite-sample error bound for the running estimate, which matches the minimax information lower bound for principal component analysis under the additional assumption of bounded samples.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 2

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In the procedure of Principal Component Analysis (PCA) we aim at learning the principal leading eigenvector of the covariance matrix of a

-dimensional random vector

from its independent and identically distributed realizations . Let

, and let the eigenvalues of

be , then the PCA problem can be formulated as minimizing the expectation of a nonconvex function:

(1)

where denotes the Euclidean norm. Since the eigengap is nonzero, the solution to (1) is unique, denoted by . The classical method of finding the estimator of the first leading eigenvector can be formulated as the solution to the empirical covariance problem as

In words, denotes the empirical covariance matrix for the first samples. The estimator produced via this process provides a statistical optimal solution . Precisely, [43] shows that the angle between any estimator that is a function of the first samples and has the following minimax lower bound

(2)

where is some positive constant. Here the infimum of is taken over all principal eigenvector estimators, and is the collection of all -dimensional subgaussian distributions with mean zero and eigengap satisfying Classical PCA method has time complexity and space complexity . The drawback of this method is that, when the data samples are high-dimensional, computing and storage of a large empirical covariance matrix can be costly.

Figure 1: Left: an objective function for the top-1 PCA, where we use both the radius and heatmap to represent the function value at each point of the unit sphere. Right: A quiver plot on the unit sphere denoting the directions of negative gradient of the PCA objective.

In this paper we concentrate on the streaming or online method for PCA that processes online data and estimates the principal component sequentially without explicitly computing and storing the empirical covariance matrix . Over thirty years ago, Oja [30] proposed an online PCA iteration that can be regarded as a projected stochastic gradient descent method as

(3)

Here is some positive learning rule or stepsize, and is defined as for each nonzero vector , namely, projects any vector onto the unit sphere . Oja’s iteration enjoys a less expensive time complexity and space complexity and thereby has been used as an alternative method for PCA when both the dimension and number of samples are large.

In this paper, we adopt the diffusion approximation method to characterize the stochastic algorithm using Markov processes and its differential equation approximations. The diffusion process approximation is a fundamental and powerful analytic tool for analyzing complicated stochastic process. By leveraging the tool of weak convergence, we are able to conduct a heuristic finite-sample analysis of the Oja’s iteration and obtain a convergence rate which, by carefully choosing the stepsize

, matches the PCA minimax information lower bound. Our analysis involves the weak convergence theory for Markov processes [11]

, which is believed to have a potential for a broader class of stochastic algorithms for nonconvex optimization, such as tensor decomposition, phase retrieval, matrix completion, neural network, etc.

Our Contributions

We provide a Markov chain characterization of the stochastic process

generated by the Oja’s iteration with constant stepsize. We show that upon appropriate scalings, the iterates as a Markov process weakly converges to the solution of an ordinary differential equation system, which is a multi-dimensional analogue to the logistic equations. Also locally around the neighborhood of a stationary point, upon a different scaling the process weakly converges to the multidimensional Ornstein-Uhlenbeck processes. Moreover, we identify from differential equation approximations that the global convergence dynamics of the Oja’s iteration has three distinct phases:

  1. [label=()]

  2. The initial phase corresponds to escaping from unstable stationary points;

  3. The second phase corresponds to fast deterministic crossing period;

  4. The third phase corresponds to stable oscillation around the true principal component.

Lastly, this is the first work that analyze the global rate of convergence analysis of Oja’s iteration, i.e., the convergence rate does not have any initialization requirements.

Figure 2: A simulation plot of Oja’s method, marked with the three phases.

Related Literatures

This paper is a natural companion to paper by the authors’ recent work [23] that gives explicit rate analysis using a discrete-time martingale-based approach. In this paper, we provide a much simpler and more insightful heuristic analysis based on diffusion approximation method under the additional assumption of bounded samples.

The idea of stochastic approximation for PCA problem can be traced back to Krasulina [19] published almost fifty years ago. His work proposed an algorithm that is regarded as the stochastic gradient descent method for the Rayleigh quotient. In contrast, Oja’s iteration can be regarded as a projected stochastic gradient descent method. The method of using differential equation tools for PCA appeared in the first papers [31, 19] to prove convergence result to the principal component, among which, [31] also analyze the subspace learning for PCA. See also [16, Chap. 1] for a gradient flow dynamical system perspective of Oja’s iteration.

The convergence rate analysis of the online PCA iteration has been very few until the recent big data tsunami, when the need to handle massive amounts of data emerges. Recent works by [6, 10, 34, 17] study the convergence of online PCA from different perspectives, and obtain some useful rate results. Our analysis using the tools of diffusion approximations suggests a rate that is sharper than all existing results, and our global convergence rate result poses no requirement for initialization.

More Literatures

Our work is related to a very recent line of work [13, 38, 39, 40, 41, 21, 3, 33] on the global dynamics of nonconvex optimization with statistical structures. These works carefully characterize the global geometry of the objective functions, and in special, around the unstable stationary points including saddle points and local maximizers. To solve the optimization problem various algorithms were used, including (stochastic) gradient method with random initialization or noise injection as well as variants of Newton’s method. The unstable stationary points can hence be avoided, enabling the global convergence to desirable local minimizers.

Our diffusion process-based characterization of SGD is also related to another line of work [8, 10, 37, 24, 26]. Among them, [10] uses techniques based on martingales in discrete time to quantify the global convergence of SGD on matrix decomposition problems. In comparison, our techniques are based on Stroock and Varadhan’s weak convergence of Markov chains to diffusion processes, which yield the continuous-time dynamics of SGD. The rest of these results mostly focus on analyzing continuous-time dynamics of gradient descent or SGD on convex optimization problems. In comparison, we are the first to characterize the global dynamics for nonconvex statistical optimization. In particular, the first and second phases of our characterization, especially the unstable Ornstein-Uhlenbeck process, are unique to nonconvex problems. Also, it is worth noting that, using the arguments of [26], we can show that the diffusion process-based characterization admits a variational Bayesian interpretation of nonconvex statistical optimization. However, we do not pursue this direction in this paper.

In the mathematical programming and statistics communities, the computational and statistical aspects of PCA are often studied separately. From the statistical perspective, recent developments have focused on estimating principal components for very high-dimensional data. When the data dimension is much larger than the sample size, i.e.,

, classical method using decomposition of the empirical convariance matrix produces inconsistent estimates [18, 29]. Sparsity-based methods have been studied, such as the truncated power method studied by [45] and [44]. Other sparsity regularization methods for high dimensional PCA has been studied in [18, 43, 42, 46, 9, 2, 25, 7], etc. Note that in this paper we do not consider the high-dimensional regime and sparsity regularization.

From the computational perspective, power iterations or the Lanczos method are well studied. These iterative methods require performing multiple products between vectors and empirical covariance matrices. Such operation usually involves multiple passes over the data, whose complexity may scale with the eigengap and dimensions [20, 28]. Recently, randomized algorithms have been developed to reduce the computation complexity [36, 35, 12]. A critical trend today is to combine the computational and statistical aspects and to develop algorithmic estimator that admits fast computation as well as good estimation properties. Related literatures include [4, 5, 27, 14, 10].

Organization

§2 introduces the settings and distributional assumptions. §3 briefly discusses the Oja’s iteration from the Markov processes perspective and characterizes that it globally admits ordinary differential equation approximation upon appropriate scaling, and also stochastic differential equation approximation locally in the neighborhood of each stationary point. §4 utilizes the weak convergence results and provides a three-phase argument for the global convergence rate analysis, which is near-optimal for the Oja’s iteration. Concluding remarks are provided in §5.

2 Settings

In this section, we present the basic settings for the Oja’s iteration. The algorithm maintains a running estimate of the true principal component , and updates it while receiving streaming samples from exterior data source. We summarize our distributional assumptions.

The random vectors are independent and identically distributed and have the following properties:

  1. [label=()]

  2. and ;

  3. ;

  4. There is a constant such that .

For the easiness of presentation, we transform the iterates and define the rescaled samples, as follows. First we let the eigendecomposition of the covariance matrix be

where is a diagonal matrix with diagonal entries , and

is an orthogonal matrix consisting of column eigenvectors of

. Clearly the first column of is equal to the principal component . Note that the diagonal decomposition might not be unique, in which case we work with an arbitrary one. Second, let

(4)

One can easily verify that

The principal component of the rescaled random variable

, which we denote by , is equal to , where is the canonical basis of . By applying the orthonormal transformation to the stochastic process , we obtain an iterative process in the rescaled space:

(5)

Moreover, the angle processes associated with and are equivalent, i.e.,

(6)

Therefore it would be sufficient to study the rescaled iteration in (5) and the transformed iteration throughout the rest of this paper.

3 A Theory of Diffusion Approximation for PCA

In this section we show that the stochastic iterates generated by the Oja’s iteration can be approximated by the solution of an ODE system upon appropriate scaling, as long as is small. To work on the approximation we first observe that the iteration , generated by (5) forms a discrete-time, time-homogeneous Markov process that takes values on . Furthermore, holds strong Markov property.

3.1 Global ODE Approximation

To state our results on differential equation approximations, let us define a new process, which is obtained by rescaling the time index according to the stepsize

(7)

We add the superscript in the notation to emphasize the dependence of the process on . We will show that converges weakly to a deterministic function , as .

Furthermore, we can identify the limit as the closed-form solution to an ODE system. Under Assumption 2 and using an infinitesimal generator analysis we have

It follows that, as

, the infinitesimal conditional variance tends to 0:

and the infinitesimal mean is

Using the classical weak convergence to diffusion argument [11, Corollary 4.2 in §7.4], we obtain the following result.

If converges weakly to some constant vector as then the Markov process converges weakly to the solution to the following ordinary differential equation system

(8)

with initial values .

We can straightforwardly check for sanity that the solution vector lies on the unit sphere , i.e., for all . Written in coordinates , the ODE is expressed for

One can straightforwardly verify that the solution to (8) has

(9)

where is the normalization function

To understand the limit function given by (9), we note that in the special case where

and

(10)

This is the formula of the logistic curve. Hence analogously, in (9) is namely the generalized logistic curves.

3.2 Local Approximation by Diffusion Processes

The weak convergence to ODE theorem introduced in §3.1 characterizes the global dynamics of the Oja’s iteration. Such approximation explains many behaviors, but neglected the presence of noise that plays a role in the algorithm. In this section we aim at understanding the Oja’s iteration via stochastic differential equations (SDE). We refer the readers to [32] for more on basic concepts of SDE.

In this section, we instead show that under some scaling, the process admits an approximation of multidimensional Ornstein-Uhlenbeck process within a neighborhood of each of the unstable stationary points, both stable and unstable. Afterwards, we develop some weak convergence results to give a rough estimate on the rate of convergence of the Oja’s iteration. For purposes of illustration and brevity, we restrict ourselves to the case of starting point being the stationary point for some , and denote an arbitrary vector to be a -dimensional vector that keeps all but the th coordinate of . Using theory from [11] we conclude the following theorem.

Let be arbitrary. If converges weakly to some as , then the Markov process

converges weakly to the solution of the multidimensional stochastic differential equation

(11)

with initial values . Here is a standard -dimensional Brownian motion. 111 The reason we have a -dimensional Ornstein-Uhlenbeck process is because the objective function of PCA is defined on a -dimensional manifold and has independent variables.

The solution to (11) can be solved explicitly. We let for a matrix the matrix exponentiation as Also, let for the positive semidefinite diagonal matrix . The solution to (11) is hence

which is known as the multidimensional Ornstein-Uhlenbeck process, whose behavior depends on the matrix and is discussed in details in §4.

Before concluding this section, we emphasize that the weak convergence to diffusions results in §3.1 and §3.2 should be distinguished from the convergence of the Oja’s iteration. From a random process theoretical perspective, the former one treats the weak convergence of finite dimensional distributions of a sequence of rescaled processes as tends to 0, while the latter one charaterizes the long-time behavior of a single realization of iterates generated by algorithm for a fixed .

4 Global Three-Phase Analysis of Oja’s Iteration

Previously §3.1 and §3.2 develop the tools of weak convergence to diffusion under global and local scalings. In this section, we apply these tools to analyze the dynamics of online PCA iteration in three phases in sequel. For purposes of illustration and brevity, we restrict ourselves to the case of starting point that is near a saddle point . Let denotes , a.s., and when both and hold.

4.1 Phase I: Noise Initialization

In consideration of global convergence, we analyze the initial phase where the iteration starts at a point on or around and eventually escapes an -neighborhood of the set

When thinking the sphere as the globe with being the north and south poles, corresponds to the equator of the globe. Therefore, all unstable stationary points (including saddle points and local maximizers) lie on the equator .

4.2 Phase II: Deterministic Crossing

In Phase II, the iteration escapes from the neighborhood of equator and converges to a basin of attraction of the local minimizer . From strong Markov property of the Oja’s iteration introduced in the beginning of §3, one can forget the iteration steps in Phase I and analyze the iteration from the final iterate of Phase I. Suppose we have an initial point that satisfies , where is a fixed constant in , Theorem 3.1 concludes that the iteration moves in a deterministic pattern and quickly evolves into a small neighborhood of the principal component such that .

4.3 Phase III: Convergence to Principal Component

In Phase III, the iteration quickly converges to and fluctuates around the true principal component . We start our iteration from a neighborhood around the principal component, where has . Letting in (11) and taking the limit , we have the limit Rescaling the Markov process along with some calculations gives as , in very rough sense,

(12)

The above display implies that there will be some nondiminishing fluctuations, variance being proportional to the constant stepsize , as time goes to infinity or at stationarity. Therefore in terms of angle, at stationarity the Markov process concentrates within a -radius neighborhood of zero.

4.4 Crossing Time Estimate

We turn to estimate the running time, namely the crossing time, which is the number of iterates required for the iteration to cross the corresponding regions in different phases. We will use the relation to bridge the discrete-time algorithm and its continuous-time approximation.

Phase I. For illustrative purposes we only consider the special case where is close to the th coordinate vector, which is a saddle point that has a negative Hessian eigenvalue. In this situation, the SDE (11) in terms of the first coordinate of reduces to

(13)

with initial value . Solution to (13) is known as unstable Ornstein-Uhlenbeck process [1] and can be expressed explicitly in closed-form, as

Rescaling the time back to the discrete-time iteration, we let and obtain

(14)

In (14), the term is approximately distributed as

where stands for a standard normal variable. We have

(15)

In order to have in (15), we have as the crossing time is approximately

(16)

Therefore we have whenever the smallest eigenvalue is bounded away from 0, then asymptotically This suggests that the noise helps the iteration to move away from rapidly.

Phase II. We turn to estimate the crossing time in Phase II. (9) together with simple calculation ensures the existence of a constant , that depends only on such that . Furthermore has the following bounds:

(17)

Translating back to the timescale of the iteration, it takes asymptotically

iterates to achieve . Theorem 3.1 indicates that when is positively small, the iterates needed for the first coordinate squared to cross from to is . This is substantiated by simulation results [4] suggesting that the Oja’s iteration moves fast from the warm initialization.

Phase III. To estimate the crossing time or the number of iterates needed in Phase III, we restart our counter and have from the approximation in Theorem 3.2 and (11) that

In terms of the iterations , note the relationship The end of Phase II implies that , and hence by setting

we conclude that as

(18)

4.5 Finite-Sample Rate Bound

In this subsection we establish the global finite-sample convergence rate using the crossing time estimates in the previous subsection. Starting from where is arbitrary, the global convergence time as such that, by choosing as a small fixed constant,

with the following estimation on global convergence rate as in (12)

Given a fixed number of samples , by choosing as

(19)

we have . Plugging in as in (19) we have, by the angle-preserving property of coordinate transformation (6), that

(20)

The finite sample bound in (20) is sharper than any existing results and matches the information lower bound. Moreover, (20) implies that the rate in terms of sine-squared angle is which matches the minimax information lower bound (up to a factor), see for example, Theorem 3.1 of [43]. Limited by space, details about the rate comparison is provided in the supplementary material.

5 Concluding Remarks

We make several concluding remarks on the global convergence rate estimations, as follows.

Crossing Time Comparison. From the crossing time estimates in (16), (17), (18) we conclude

  1. [label=()]

  2. As we have . This implies that the algorithm demonstrates the cutoff phenomenon which frequently occur in discrete-time Markov processes [22]. In words, the Phase II where the objective value in Rayleigh quotient drops from to is an asymptotically a phase of short time, compared to Phases I and III, so the convergence curve occurs instead of an exponentially decaying curve.

  3. As we have . This suggests that for the high- case that Phase I of escaping from the equator consumes roughly the same iterations as in Phase III.

To summarize from above, the cold initialization iteration roughly takes twice the number of steps than the warm initialization version which is consistent with the simulation discussions in [31].

Subspace Learning. In this work we primarily concentrates on the problem of finding the top-1 eigenvector. It is believed that the problem of finding top- eigenvectors, a.k.a. the subspace PCA problem, can be analyzed using our approximation methods. This will involve a careful characterization of subspace angles and is hence more complex. We leave this for future investigations.

References

Appendix A Proofs of Auxiliary Results

This section provides the proofs of auxilary Propositions. For brevity, we use the following notations throughout Section A of the Appendix: (i) The ’s with subscripts denotes some positive numerical constants; (ii) The , , ’s (without subscripts) are positive numerical constants whose values may change between lines; (iii) The and ; (iv) For generic function the .

Also in this section, we let be the filtration of the algorithm iterates, i.e. the -field generated by the stochastic iterates by .

a.1 Analysis of Algorithm

To analyze the algorithm from the view of a Markov chain, we need to understand the increments on each coordinate at each step. Under Assumption 2, for each and we have for all the following:

  1. [label=()]

  2. There exists a random variable with almost surely, such that the increment on coordinate at iterate can be represented as

    (21)
  3. The increment has the following bound

    (22)
  4. There exists a deterministic function with

    such that for all ,

    (23)

To prove Proposition A.1 we first come to show

For each

Proof.

Since

(24)

Taylor expansion suggests for

which is an alternating series for , whereas the absolute terms approach to 0 monotonically

Hence the error bound gives

(25)

Noting we have for all

The above display is strictly less than 1 when , and hence (25) applies. Combined with (24) we have

Noticing , triangle inequality suggests

completing the proof.

Proof of Proposition a.1.

Setting . Then

where

(26)

Note the term

is absolutely bounded by , and taking expectation gives

To this stage, we have verified

(27)

(21) as long as Eqs. (22) and (23) in Proposition A.1 can be concluded if

(28)

since this implies for we have . To conclude (28), note that and hence

Lemma A.1 implies