Online ICA: Understanding Global Dynamics of Nonconvex Optimization via Diffusion Processes

08/29/2018 ∙ by Chris Junchi Li, et al. ∙ Princeton University 2

Solving statistical learning problems often involves nonconvex optimization. Despite the empirical success of nonconvex statistical optimization methods, their global dynamics, especially convergence to the desirable local minima, remain less well understood in theory. In this paper, we propose a new analytic paradigm based on diffusion processes to characterize the global dynamics of nonconvex statistical optimization. As a concrete example, we study stochastic gradient descent (SGD) for the tensor decomposition formulation of independent component analysis. In particular, we cast different phases of SGD into diffusion processes, i.e., solutions to stochastic differential equations. Initialized from an unstable equilibrium, the global dynamics of SGD transit over three consecutive phases: (i) an unstable Ornstein-Uhlenbeck process slowly departing from the initialization, (ii) the solution to an ordinary differential equation, which quickly evolves towards the desirable local minimum, and (iii) a stable Ornstein-Uhlenbeck process oscillating around the desirable local minimum. Our proof techniques are based upon Stroock and Varadhan's weak convergence of Markov chains to diffusion processes, which are of independent interest.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

For solving a broad range of large-scale statistical learning problems, e.g., deep learning, nonconvex optimization methods often exhibit favorable computational and statistical efficiency empirically. However, there is still a lack of theoretical understanding of the global dynamics of these nonconvex optimization methods. In specific, it remains largely unexplored why simple optimization algorithms, e.g., stochastic gradient descent (SGD), often exhibit fast convergence towards local minima with desirable statistical accuracy. In this paper, we aim to develop a new analytic framework to theoretically understand this phenomenon.

The dynamics of nonconvex statistical optimization are of central interest to a recent line of work. Specifically, by exploring the local convexity within the basins of attraction, [26, 35, 1, 5, 52, 53, 36, 22, 25, 11, 7, 56, 39, 20, 47, 31, 46, 6, 57, 12, 50, 13, 8, 54, 58, 10, 24, 51, 55, 48, 21, 49] establish local fast rates of convergence towards the desirable local minima for a variety statistical problems. Most of these characterizations of local dynamics are based on two decoupled ingredients from statistics and optimization: (i) the local (approximately) convex geometry of the objective functions, which is induced by the underlying statistical models, and (ii) adaptation of classical optimization analysis [34, 19] by incorporating the perturbations induced by nonconvex geometry as well as random noise. To achieve global convergence guarantees, they rely on various problem-specific approaches to obtain initializations that provably fall into the basins of attraction. Meanwhile, for some learning problems, such as phase retrieval and tensor decomposition for latent variable models, it is empirically observed that good initializations within the basins of attraction are not essential to the desirable convergence. However, it remains highly challenging to characterize the global dynamics, especially within the highly nonconvex regions outside the local basins of attraction.

In this paper, we address this problem with a new analytic framework based on diffusion processes. In particular, we focus on the concrete example of SGD applied on the tensor decomposition formulation of independent component analysis (ICA). Instead of adapting classical optimization analysis accordingly to local nonconvex geometry, we cast SGD in different phases as diffusion processes, i.e., solutions to stochastic differential equations (SDE), by analyzing the weak convergence from discrete Markov chains to their continuous-time limits [40, 17]. The SDE automatically incorporates the geometry and randomness induced by the statistical model, which allows us to establish the exact dynamics of SGD. In contrast, classical optimization analysis only yields upper bounds on the optimization error, which are unlikely to be tight in the presence of highly nonconvex geometry, especially around the stationary points that have negative curvatures along certain directions. In particular, we identify three consecutive phases of the global dynamics of SGD, which is illustrated in Figure 1.

  1. [label=()]

  2. We consider the most challenging initialization at a stationary point with negative curvatures, which can be cast as an unstable equilibrium of the SDE. Within the first phase, the dynamics of SGD are characterized by an unstable Ornstein-Uhlenbeck process [37, 2], which departs from the initialization at a relatively slow rate and enters the second phase.

  3. Within the second phase, the dynamics of SGD are characterized by the exact solution to an ordinary differential equation. This solution evolves towards the desirable local minimum at a relatively fast rate until it approaches a small basin around the local minimum.

  4. Within the third phase, the dynamics of SGD are captured by a stable Ornstein-Uhlenbeck process [37, 2], which oscillates within a small basin around the local minimum.


Figure 1: Left: an illustration of the objective function for the tensor decomposition formulation of ICA. Note that here we use the spherical coordinate system and add a global offset of to the objective function for better illustration. Right: An illustration of the three phases of diffusion processes.

More related work. Our results are connected with a very recent line of work [18, 42, 43, 44, 45, 27, 29, 3, 38] on the global dynamics of nonconvex statistical optimization. In detail, they characterize the global geometry of nonconvex objective functions, especially around their saddle points or local maxima. Based on the geometry, they prove that specific optimization algorithms, e.g., SGD with artificial noise injection, gradient descent with random initialization, and second-order methods, avoid the saddle points or local maxima, and globally converge to the desirable local minima. Among these results, our results are most related to [18], which considers SGD with noise injection on ICA. Compared with this line of work, our analysis takes a completely different approach based on diffusion processes, which is also related to another line of work [14, 15, 41, 30, 33, 32].

Without characterizing the global geometry, we establish the global exact dynamics of SGD, which illustrate that, even starting from the most challenging stationary point, it may be unnecessary to use additional techniques such as noise injection, random initialization, and second-order information to ensure the desirable convergence. In other words, the unstable Ornstein-Uhlenbeck process within the first phase itself is powerful enough to escape from stationary points with negative curvatures. This phenomenon is not captured by the previous upper bound-based analysis, since previous upper bounds are relatively coarse-grained compared with the exact dynamics, which naturally give a sharp characterization simultaneously from upper and lower bounds. Furthermore, in Section 5 we will show that our sharp diffusion process-based characterization provides understanding on different phases of dynamics of our online/SGD algorithm for ICA.

A recent work [29]

analyzes an online principal component analysis algorithm based on the intuition gained from diffusion approximation. In this paper, we consider a different statistical problem with a rigorous characterization of the diffusion approximations in three separate phases.

Our contribution. In summary, we propose a new analytic paradigm based on diffusion processes for characterizing the global dynamics of nonconvex statistical optimization. For SGD on ICA, we identify the aforementioned three phases for the first time. Our analysis is based on Stroock and Varadhan’s weak convergence of Markov chains to diffusion processes, which are of independent interest.

2 Background

In this section we formally introduce a special model of independent component analysis (ICA) and the associated SGD algorithm. Let be the data sample identically distributed as . We make assumptions for the distribution of as follows. Let be the

-norm of a vector.

Assumption 1.

There is an orthonormal matrix such that , where is a random vector that has independent entries satisfying the following conditions:

  1. [label=(), topsep=0pt, leftmargin=*]

  2. The distribution of each is symmetric about 0;

  3. There is a constant such that ;

  4. The are independent with identical moments for , denoted by ;

  5. The , , .

Assumption 1(iii) above is a generalization of i.i.d. tensor components. Let

whose columns form an orthonormal basis. Our goal is to estimate the orthonormal basis

from online data . We first establish a preliminary lemma.

Lemma 1.

Let be the 4th-order tensor whose -entry is . Under Assumption 1, we have

(2.1)

Lemma 1 implies that finding ’s can be cast into the solution to the following population optimization problem

(2.2)

It is straightforward to conclude that all stable equilibria of (2.2) are whose number linearly grows with . Meanwhile, by analyzing the Hessian matrices the set of unstable equilibria of (2.2) includes (but not limited to) all , whose number grows exponentially as increases [18, 44].

Now we introduce the SGD algorithm for solving (2.2) with finite samples. Let be the unit sphere in , and denote for the projection operator onto . With appropriate initialization, the SGD for tensor method iteratively updates the estimator via the following Eq. (2.3):

(2.3)

The SGD algorithms that performs stochastic approximation using single online data sample in each update has the advantage of less temporal and spatial complexity, especially when is high [29, 18]. An essential issue of this nonconvex optimization problem is how the algorithm escape from unstable equilibria. [18] provides a method of adding artificial noises to the samples, where the noise variables are uniformly sampled from . In our work, we demonstrate that under some reasonable distributional assumptions, the online data provide sufficient noise for the algorithm to escape from the unstable equilibria.

By symmetry, our algorithm in Eq. (2.3) converges to a uniformly random tensor component from components. In order to solve the problem completely, one can repeatedly run the algorithm using different set of online samples until all tensor components are found. In the case where is high, the well-known coupon collector problem [16] implies that it takes runs of SGD algorithm to obtain all tensor components.

Remark.

From Eq. (2.2) we see the tensor structure in Eq. (2.1) is unidentifiable in the case of , see more discussion in [4, 18]. Therefore in Assumption 1 we rule out the value and call the value the tensor gap. The reader will see later that, analogous to eigengap in SGD algorithm for principal component analysis (PCA) [29], tensor gap plays a vital role in the time complexity in the algorithm analysis.

3 Markov Processes and Differential Equation Approximation

To work on the approximation we first conclude the following proposition.

Proposition 1.

The iteration generated by Eq. (2.3) forms a discrete-time, time-homogeneous Markov process that takes values on . Furthermore, holds strong Markov property.

For convenience of analysis we use the transformed iteration in the rest of this paper. The update equation in Eq. (2.3) is equivalently written as

(3.1)

Here has the same sign with . It is obvious from Proposition 1 that the (strong) Markov property applies to , and one can analyze the iterates generated by Eq. (3.1) from a perspective of Markov processes.

Our next step is to conclude that as the stepsize , the iterates generated by Eq. (2.3), under the time scaling that speeds up the algorithm by a factor , can be globally approximated by the solution to the following ODE system. To characterize such approximation we use theory of weak convergence to diffusions [40, 17]

via computing the infinitesimal mean and variance for SGD for the tensor method. We remind the readers of the definition of weak convergence

in stochastic processes: for any the following convergence in distribution occurs as

To highlight the dependence on we add it in the superscipts of iterates . Recall that is the integer part of the real number .

Theorem 1.

If for each , as converges weakly to some constant scalar then the Markov process converges weakly to the solution of the ODE system

(3.2)

with initial values .

To understand the complex ODE system in Eq. (3.2) we first investigate into the case of . Consider a change of variable

we have by chain rule in calculus and

the following derivation:

(3.3)

Eq. (3.3) is an autonomous, first-order ODE for . Although this equation is complex, a closed-form solution is available:

and , where the choices of and depend on the initial value. The above solution allows us to conclude that if the initial vector (resp. ), then it approaches to 1 (resp. 0) as . This intuition can be generalized to the case of higher that the ODE system in Eq. (3.2) converges to the coordinate direction if is strictly maximal among in the initial vector. To estimate the time of traverse we establish the following Proposition 2.

Proposition 2.

Fix and the initial value that satisfies for all , then there is a constant (called traverse time) that depends only on such that Furthermore has the following upper bound: let solution to the following auxillary ODE

(3.4)

with . Let be the time that . Then

(3.5)

Proposition 2 concludes that, by admitting a gap of between the largest and second largest , the estimate on traverse time can be given, which is tight enough for our purposes in Section 5.

Remark.

In an earlier paper [29] which focuses on the SGD algorithm for PCA, when the stepsize is small, the algorithm iteration is approximated by the solution to ODE system after appropriate time rescaling. The approximate ODE system for SGD for PCA is

(3.6)

The analysis there also involves computation of infinitesimal mean and variance for each coordinate as the stepsize and theory of convergence to diffusions [40, 17]. A closed-form solution to Eq. (3.6) is obtained in [29], called the generalized logistic curves. In contrast, to our best knowledge a closed-form solution to Eq. (3.2) is generally not available.

4 Local Approximation via Stochastic Differential Equation

The ODE approximation in Section 3 is very informative: it characterizes globally the trajectory of our algorithm for ICA or tensor method in Eq. (2.3) with approximation errors. However it fails to characterize the behavior near equilibria where the gradients in our ODE system are close to zero. For instance, if the SGD algorithm starts from , on a microscopic magnitude of the noises generated by online samples help escaping from a neighborhood of .

Our main goal in this section is to demonstrate that under appropriate spatial and temporal scalings, the algorithm iteration converges locally to the solution to certain stochastic differential equations (SDE). We provide the SDE approximations in two scenarios, separately near an arbitrary tensor component (Subsection 4.1) which indicates that our SGD for tensor method converges to a local minimum at a desirable rate, and a special local maximum (Subsection 4.2) which implies that the stochastic nature of our SGD algorithm for tensor method helps escaping from unstable equilibria. Note that in the algorithm iterates, the escaping from stationary points occurs first, followed by the ODE and then by the phase of convergence to local minimum. We discuss this further in Section 5.

4.1 Neighborhood of Local Minimizers

To analyze the behavior of SGD for tensor method we first consider the case where the iterates enter a neighborhood of one local minimizer, i.e. the tensor component. Since the tensor decomposition in Eq. (2.2) is full-rank and symmetric, we consider without loss of generality the neighborhood near the first tensor component. The following Theorem 2 indicates that under appropriate spatial and temporal scalings, the process admits an approximation by Ornstein-Uhlenbeck process. Such approximation is characterized rigorously using weak convergence theory of Markov processes [40, 17]. The readers are referred to [37] for fundamental topics on SDE.

Theorem 2.

If for each , converges weakly to as then the stochastic process converges weakly to the solution of the stochastic differential equation

(4.1)

with initial values . Here is a standard one-dimensional Brownian motion.

We identify the solution to Eq. (4.1) as an Ornstein-Uhlenbeck process which can be expressed in terms of a Itô integral, with

(4.2)

Itô isometry along with mean-zero property of Itô integral gives

which, by taking the limit , approaches . From the above analysis we conclude that the Ornstein-Uhlenbeck process has the mean-reverting property that its mean decays exponentially towards 0 with persistent fluctuations at equilibrium.

4.2 Escape from Unstable Equilibria

In this subsection we consider SGD for tensor method that starts from a sufficiently small neighborhood of a special unstable equilibrium. We show that after appropriate rescalings of both time and space, the SGD for tensor iteration can be approximated by the solution to a second SDE. Analyzing the approximate SDE suggests that our SGD algorithm iterations can get rid of the unstable equilibria (including local maxima and stationary points with negative curvatures) whereas the traditional gradient descent (GD) method gets stuck. In other words, under weak distributional assumptions the stochastic gradient plays a vital role that helps the escape. As a illustrative example, we consider the special stationary points . Consider a submanifold where

In words, consists of all where the maximum of is not unique. In the case of , it is illustrated by Figure 1 that is the frame of a 3-dimenisional box, and hence we call the frame. Let

(4.3)

The reason we study is that these functions of form a local coordinate map around and further characterize the distance between and on a spatial scale of . We define the positive constant as

(4.4)

We have our second SDE approximation result as follows.

Theorem 3.

Let be defined as in Eq. (4.3), and let be as in Eq. (4.4). If for each distinct , converges weakly to as then the stochastic process converges weakly to the solution of the stochastic differential equation

(4.5)

with initial values . Here is a standard one-dimensional Brownian motion.

We can solve Eq. (4.5) and obtain an unstable Ornstein-Uhlenbeck process as

(4.6)

Let be defined as

(4.7)

We conclude that the following holds.

  1. [label=(), topsep=0pt, leftmargin=*]

  2. is a normal variable with mean and variance ;

  3. When is large has the following approximation

    (4.8)

To verify (i) above we have the Itô integral in Eq. (4.6)

and by using Itô isometry

The analysis above on the unstable Ornstein-Uhlenbeck process indicates that the process has the momentum nature that when

is large, it can be regarded as at a normally distributed location centered at 0 and grows exponentially. In Section

5 we will see how the result in Theorem 3 provides explanation on the escape from unstable equilibria.

5 Phase Analysis

In this section, we utilize the weak convergence results in Sections 3 and 4 to understand the dynamics of online ICA in different phases. For purposes of illustration and brevity, we restrict ourselves to the case of starting point , a local maxima that has negative curvatures in every direction. In below we denote by as when the limit of ratio .

Phase I (Escape from unstable equilibria).

Assume we start from , then for all . We have from Eqs. (4.6) and (4.7) that

(5.1)

Suppose is the index that maximizes and maximizes . Then by Eq. (5.1) we know is positive. By setting

we have from the construction in the proof of Theorem 3 that as

Phase II (Deterministic traverse).

By (strong) Markov property we can restart the counter of iteration, we have the max and second max

Proposition 2 implies that it takes time

for the ODE to traverse from for . Converting to the timescale of the SGD, the second phase has the following relations as

Phase III (Convergence to stable equilibria).

Again restart our counter. We have from the approximation in Theorem 3 and Eq. (4.2) that

In terms of the iterations , note the relationship The end of ODE phase implies that , and hence

By setting

we conclude that as

6 Summary and discussions

In this paper, we take online ICA as a first step towards understanding the global dynamics of stochastic gradient descent. For general nonconvex optimization problems such as training deep networks, phase-retrieval, dictionary learning and PCA, we expect similar multiple-phase phenomenon. It is believed that the flavor of asymptotic analysis above can help identify a class of stochastic algorithms for nonconvex optimization with statistical structure.

Our continuous-time analysis also reflects the dynamics of the algorithm in discrete time. This is substantiated by Theorems 1, 2 and 3 which rigorously characterize the convergence of iterates to ODE or SDE by shifting to different temporal and spatial scales. In detail, our results imply when :

  • Phase I takes iteration number ;

  • Phase II takes iteration number ;

  • Phase III takes iteration number .

After the three phases, the iteration reaches a point that is distant on average to one local minimizer. As we have . This implies that the algorithm demonstrates the cutoff phenomenon which frequently occur in discrete-time Markov processes [28, Chap. 18]. In words, the Phase II where the objective value in Eq. (2.2) drops from to is a short-time phase compared to Phases I and III, so the convergence curve illustrated in the right figure in Figure 1 instead of an exponentially decaying curve. As we have , which suggests that Phase I of escaping from unstable equlibria dominates Phase III by a factor of .

References

Appendix A Detailed Proofs in Sections 2 and 3

a.1 Proof of Lemma 1

Proof.

We only need to show

(A.1)

Note due to the following well-known expansion [9]

where the summations above iterate through all monomial terms. Plugging in and taking expectations, we conclude that under Assumption 1

(A.2)

Note that from the constraint of our optimization problem Eq. (2.2), we have

(A.3)

Combining both Eqs. (A.2) and (A.3) we conclude Eq. (A.1) and hence the lemma.

a.2 Proof of Proposition 1

Proof.

Let be the -field filtration generated by the iteration , viewed as a stochastic process. From the recursion equation in Eq. (2.3) we have a Markov transition kernel such that for each Borel set

Therefore it is a time-homogeneous Markov chain. The strong Markov property holds directly from Markov property, see [16] as a reference. This proves Proposition 1.

a.3 Proof of Theorem 1

We first use the standard one-step analysis and conclude the following proposition, whose proof is deferred to Subsection C.1.

Proposition 3.

For brevity let and , separately. Under Assumption 1, when

(A.4)

for each and we have the following:

  1. [label=(), topsep=0pt, leftmargin=*]

  2. There exists a random variable

    that depends solely on with almost surely, such that the increment can be represented as

    (A.5)
  3. The increment of on coordinate has the following bound

    (A.6)
  4. There exists a deterministic function with , such that the conditional expectation of the increment is

    (A.7)

In Proposition 3, (i) characterizes the relationship between the increment on and the online sample, and (ii) bounds such increment. From (iii) we can compute the infinitesimal mean and variance for SGD for tensor method and conclude that as the stepsize , the iterates generated by Eq. (2.3), under the time scaling that speeds up the algorithm by a factor , can be globally approximated by the solution to the following ODE system in Eq. (3.2) as

To characterize such approximation we use theory of weak convergence to diffusions [40, 17]. We remind the readers of the definition of weak convergence in stochastic processes: for any the following convergence in distribution occurs as

To highlight the dependence on we add it in the superscipts of iterates .

Proof of Theorem 1.

Let . Proposition 3 implies for coordinate satisfies

where . Eq. (A.7) implies that if the infinitesimal mean is [17]

Using Eq. (A.6) we have the infinitesimal variance

which tends to 0 as . Let be the solution to ODE system Eq. (3.2) with initial values . Applying standard infinitesimal generator argument [17, Corollary 4.2 in Sec. 7.4] one can conclude that as , the Markov process converges weakly to .

a.4 Proof of Proposition 2

For simplicity we denote in the proofs that the initial value . Also, throughout this subsection we assume without loss of generality that is maximal among , , and furthermore

(A.8)
Lemma 2.

For that satisfies Eq. (A.8), then we have for all

(A.9)
Proof.

We compare the coordinate between two distinct coordinates and have by calculus that for all