Dropping Convexity for More Efficient and Scalable Online Multiview Learning

02/27/2017 · by Zhehui Chen, et al.

Multiview representation learning is very popular for latent factor analysis. It naturally arises in many data analysis, machine learning, and information retrieval applications that model dependent structures among multiple data sources. For computational convenience, existing approaches usually formulate multiview representation learning as convex optimization problems, where global optima can be obtained by certain algorithms in polynomial time. However, ample evidence has corroborated that heuristic nonconvex approaches also have good empirical computational performance and converge to the global optima, although there is a lack of theoretical justification. Such a gap between theory and practice motivates us to study a nonconvex formulation for multiview representation learning, which can be efficiently solved by a simple stochastic gradient descent (SGD) algorithm. We first illustrate the geometry of the nonconvex formulation; then, by characterizing the dynamics of the approximate limiting process, we establish global rates of convergence to the global optima. Numerical experiments are provided to support our theory.


1 Introduction

Multiview data have become increasingly available in many popular real-world data analysis and machine learning problems. These data are collected from diverse domains or different feature extractors, which share latent factors. For instance, the pixels and captions of images can be considered as two-view data, since they are two different features describing the same contents. More motivating examples involving two or more data sets simultaneously can be found in computer vision, natural language processing, and acoustic recognition (Hardoon et al., 2004; Socher and Fei-Fei, 2010; Kidron et al., 2005; Chaudhuri et al., 2009; Arora and Livescu, 2012; Bharadwaj et al., 2012; Vinokourov et al., 2002; Dhillon et al., 2011). Although these data are usually unlabeled, there exist underlying associations and dependencies between the views, which allow us to learn useful representations in an unsupervised manner. We are interested in finding a representation that reveals intrinsic low-dimensional structures and decomposes underlying confounding factors. One ubiquitous approach for multiview representation learning is partial least squares (PLS). Specifically, given a data set containing $n$ samples of two sets of random variables (views), $X \in \mathbb{R}^d$ and $Y \in \mathbb{R}^d$, PLS aims to find an $r$-dimensional subspace ($r \ll d$) that preserves most of the covariance between the two views. Existing literature has shown that such a subspace is spanned by the leading $r$ components of the singular value decomposition (SVD) of $\Sigma_{XY} = \mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[XY^\top\big]$, where we sample $(X, Y)$ from some unknown distribution $\mathcal{D}$ (Arora et al., 2012). Throughout the rest of the paper, if not clearly specified, we denote $\Sigma = \Sigma_{XY}$ for notational simplicity.

A straightforward approach for PLS is "Sample Average Approximation" (SAA; Abdi (2003); Ando and Zhang (2005)), where we run an offline (batch) SVD algorithm on the empirical covariance matrix after seeing sufficiently many data samples. However, in the "big data" regime, this approach requires an infeasible amount of storage and computation time. Therefore, it is much more practical to consider the multiview learning problem in a "data laden" setting, where we draw independent samples $(X_k, Y_k) \sim \mathcal{D}$, one at a time. This further enables us to formulate PLS as a stochastic (online) optimization problem. Here we only consider the rank-1 case ($r = 1$) for simplicity, and solve

(1.1)   $\max_{u, v}\ u^\top\, \mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[X Y^\top\big]\, v \quad \text{subject to} \quad u^\top u = 1,\ v^\top v = 1.$

We will explain more details on the rank-$r$ case in Section 7.
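For concreteness, the following is a minimal NumPy sketch of the SAA baseline described above; the function name is ours, not the paper's. It makes the bottleneck explicit: SAA materializes the $d \times d$ empirical covariance and runs a batch SVD on it.

    import numpy as np

    def saa_pls_rank1(X, Y):
        """SAA baseline (sketch): batch SVD of the empirical covariance.

        X, Y: (n, d) arrays of paired samples drawn from D. Returns the
        leading left/right singular vector pair of the empirical covariance
        (1/n) sum_k X_k Y_k^T, i.e., the rank-1 PLS solution with the
        expectation in (1.1) replaced by a sample average.
        """
        n = X.shape[0]
        sigma_hat = X.T @ Y / n                 # O(d^2) storage, O(n d^2) time
        U, s, Vt = np.linalg.svd(sigma_hat)     # O(d^3) time
        return U[:, 0], Vt[0, :]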

Several nonconvex stochastic approximation (SA) algorithms have been proposed in Arora et al. (2012). These algorithms work well in practice, but lack theoretical justification, since the nonconvex landscape of (1.1) makes the analysis very challenging. To overcome this obstacle, Arora et al. (2016) propose a convex relaxation of (1.1). Specifically, by the reparametrization $M = uv^\top$ (recall that we are interested in the rank-1 PLS), they rewrite (1.1) as[1]

(1.2)   $\max_{M \in \mathbb{R}^{d\times d}}\ \langle \Sigma, M \rangle \quad \text{subject to} \quad \|M\|_2 \le 1,\ \|M\|_* \le 1,$

where $\Sigma = \mathbb{E}_{(X,Y)\sim\mathcal{D}}[XY^\top]$, and $\|M\|_2$ and $\|M\|_*$ are the spectral (i.e., the largest singular value of $M$) and nuclear (i.e., the sum of all singular values of $M$) norms of $M$, respectively. By examining the KKT conditions of (1.2), one can verify that $M^* = \overline{u}_1 \overline{v}_1^\top$ is the optimal solution, where $\overline{u}_1, \overline{v}_1$ are the leading left and right singular vectors of $\Sigma$, i.e., a pair of global optimal solutions to (1.1) for $r = 1$. Accordingly, they propose a projected stochastic gradient-type algorithm to solve (1.2), which is often referred to as the Matrix Stochastic Gradient (MSG) algorithm. In particular, at the $k$-th iteration, MSG takes

$M_{k+1} = \Pi\big(M_k + \eta\, X_k Y_k^\top\big),$

where $(X_k, Y_k)$ is independently sampled from $\mathcal{D}$, $\eta$ is the step size, and $\Pi$ is the projection operator onto the feasible set of (1.2). They further prove that, given a pre-specified accuracy $\epsilon$, MSG requires $T = \mathcal{O}\big(\epsilon^{-2}\log(1/\epsilon)\big)$ iterations such that $\langle \Sigma, M^*\rangle - \langle \Sigma, M_T\rangle \le \epsilon$ with high probability.[2]

[1] For the rank-$r$ case, we replace $\|M\|_* \le 1$ with $\|M\|_* \le r$.

[2] They establish a non-asymptotic rate of convergence for MSG.

Despite the attractive theoretical guarantee, MSG does not present superior performance to other heuristic nonconvex stochastic optimization algorithms for solving (1.1). Another drawback of MSG is the complicated projection step at each iteration. Although Arora et al. (2016) further propose an algorithm to compute the projection with a computational cost depending cubically on the rank of the iterates (in the worst case, $\mathcal{O}(d^3)$), such a sophisticated implementation significantly decreases the practicability of MSG. Furthermore, MSG is also unfavorable in a memory-restricted scenario, since storing the update $M_k$ requires $\mathcal{O}(d^2)$ real number storage. In contrast, the heuristic algorithms analyzed in this paper require only $\mathcal{O}(d)$ real number storage (or $\mathcal{O}(rd)$ in the rank-$r$ case). Although there is a lack of theoretical justification, abundant empirical evidence has corroborated that heuristic nonconvex approaches not only converge to the global optima in practice, but also enjoy better empirical computational performance than the convex approaches (Zhao et al., 2015; Candes et al., 2015; Ge et al., 2015; Cai et al., 2016).
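To make the comparison concrete, here is a hedged NumPy sketch of one MSG iteration. The exact projection of Arora et al. (2016) is rank-adaptive and more efficient; the version below simply recomputes a full SVD and projects the singular values onto $\{s \ge 0, \sum_i s_i \le 1\}$ (which realizes the feasible set of (1.2), since $\|M\|_* \le 1$ already implies $\|M\|_2 \le 1$), purely for illustration.

    import numpy as np

    def project_singular_values(s, r=1.0):
        """Euclidean projection of nonnegative singular values onto
        {s >= 0, sum(s) <= r}; a simplified stand-in for the exact
        feasible-set projection of Arora et al. (2016)."""
        if s.sum() <= r:
            return s
        u = np.sort(s)[::-1]                         # standard simplex projection
        css = np.cumsum(u) - r
        rho = np.nonzero(u * np.arange(1, len(u) + 1) > css)[0][-1]
        return np.maximum(s - css[rho] / (rho + 1.0), 0.0)

    def msg_step(M, x, y, eta):
        """One MSG iteration: stochastic gradient ascent on <Sigma, M>,
        then projection. Note the O(d^2) storage and O(d^3) per-step cost."""
        M = M + eta * np.outer(x, y)
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return U @ np.diag(project_singular_values(s)) @ Vt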

We aim to bridge the gap between theory and practice for solving multiview representation learning problems by nonconvex approaches. Specifically, we first illustrate the nonconvex optimization landscape of (1.1). Then we analyze the convergence properties of a simple stochastic optimization algorithm for solving (1.1) based on diffusion processes. Our analysis takes advantage of the Markov properties of the stochastic optimization algorithm updates and provides a diffusion approximation of the algorithm (Ethier and Kurtz, 2009; Li et al., 2016b). By leveraging the weak convergence from discrete Markov chains to their continuous-time limits, we show that our algorithm asymptotically converges to the global optima, with dynamics characterized by the solution of a stochastic differential equation (SDE). Such an SDE-type analysis automatically incorporates the geometry of the objective and the randomness of the algorithm, and eventually demonstrates three phases of convergence.

  1. Starting from an unstable equilibrium with negative curvature, the dynamics of the limiting process can be described by an Ornstein-Uhlenbeck process, which further characterizes the dynamics of the algorithm in this phase.

  2. When the algorithm is sufficiently distant from the initial unstable equilibrium, the dynamics can be characterized by a deterministic ordinary differential equation (ODE). The trajectory in this phase evolves directly toward the desired global maximum until it reaches a small basin around the global maximum.

  3. In this phase, the trajectory can also be described by an Ornstein-Uhlenbeck process oscillating around the global maximum. The process has a drift term that gradually dies out and eventually becomes a nearly unbiased random walk centered at the maximum.

These characterizations of the three phases eventually allow us to establish an asymptotic convergence guarantee. In particular, we show that the diffusion approximation of the nonconvex stochastic gradient algorithm yields an $\epsilon$-optimal solution in $\mathcal{O}\big(\epsilon^{-1}\log(1/\epsilon)\big)$ iterations with high probability, which is a significant improvement over convex MSG by a factor of $\epsilon^{-1}$. Our theoretical analysis reveals the power of nonconvex optimization in PLS and helps us better understand the nonconvex stochastic gradient algorithm: the simple heuristic algorithm drops convexity, yet achieves much better efficiency.

Notations: Given a vector $v = (v^{(1)}, \ldots, v^{(d)})^\top \in \mathbb{R}^d$, we define the vector norms $\|v\|_2^2 = \sum_j (v^{(j)})^2$, $\|v\|_1 = \sum_j |v^{(j)}|$, and $\|v\|_\infty = \max_j |v^{(j)}|$. Given a matrix $A \in \mathbb{R}^{d\times d}$, we use $A_j$ to denote the $j$-th column of $A$, and define the matrix norms $\|A\|_F^2 = \sum_j \|A_j\|_2^2$ and $\|A\|_2$ as the largest singular value of $A$.

2 Stochastic Nonconvex Optimization

Recall that we solve (1.1),

(2.1)   $\max_{u, v}\ u^\top \Sigma v \quad \text{subject to} \quad u^\top u = 1,\ v^\top v = 1, \qquad \text{where } \Sigma = \mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[XY^\top\big]$

and $(X, Y)$ follows some unknown distribution $\mathcal{D}$. Note that due to the symmetric structure of (2.1), $(-u^*, -v^*)$ is the other pair of global optima. Our analysis holds for both optima. Throughout the rest of the paper, if not clearly specified, we consider $(u^*, v^*)$ as the global optimum for simplicity.

We apply the stochastic approximation (SA) of the generalized Hebbian algorithm (GHA) to solve (2.1). GHA, which is also referred to as Sanger's rule (Sanger, 1989), is essentially a primal-dual algorithm. Specifically, we consider the Lagrangian function of (2.1):

(2.2)   $\mathcal{L}(u, v, \mu, \nu) = u^\top \Sigma v - \frac{\mu}{2}\big(u^\top u - 1\big) - \frac{\nu}{2}\big(v^\top v - 1\big),$

where $\mu$ and $\nu$ are the Lagrangian multipliers. We then check the KKT optimality conditions,

(2.3)   $\nabla_u \mathcal{L} = \Sigma v - \mu u = 0 \quad \text{and} \quad \nabla_v \mathcal{L} = \Sigma^\top u - \nu v = 0,$

which further imply

$u^\top \Sigma v = \mu\, u^\top u = \mu \quad \text{and} \quad u^\top \Sigma v = \nu\, v^\top v = \nu.$

Solving the above equations, we obtain the optimal Lagrangian multipliers as

(2.4)   $\mu = \nu = u^\top \Sigma v.$

GHA is inspired by (2.3) and (2.4). At the $k$-th iteration, GHA estimates the multipliers by their sample counterparts and performs the primal updates

(2.5)   $\widehat{\mu}_k = \widehat{\nu}_k = u_k^\top X_k Y_k^\top v_k,$
(2.6)   $u_{k+1} = u_k + \eta\big(X_k Y_k^\top v_k - \widehat{\mu}_k\, u_k\big) \quad \text{and} \quad v_{k+1} = v_k + \eta\big(Y_k X_k^\top u_k - \widehat{\nu}_k\, v_k\big),$

where $\eta > 0$ is the step size. Combining (2.5) and (2.6), we obtain a dual-free update as follows:

(2.7)   $u_{k+1} = u_k + \eta\big(X_k Y_k^\top v_k - (u_k^\top X_k Y_k^\top v_k)\, u_k\big) \quad \text{and} \quad v_{k+1} = v_k + \eta\big(Y_k X_k^\top u_k - (u_k^\top X_k Y_k^\top v_k)\, v_k\big).$

Different from the projected SGD algorithm, which is a primal algorithm proposed in Chen et al. (2017), stochastic GHA does not need a projection at each iteration.
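For reference, the following is a minimal NumPy sketch of the stochastic GHA update (2.7); the function name and the `sampler(rng)` interface, which returns one independent pair $(X_k, Y_k) \sim \mathcal{D}$, are our illustrative conventions.

    import numpy as np

    def stochastic_gha(sampler, d, eta, n_iters, seed=0):
        """Stochastic GHA (Sanger's rule) for rank-1 PLS, following (2.7).

        Requires only O(d) storage and no projection step: the term
        (u^T X Y^T v) u (resp. v) acts as a self-normalizer.
        """
        rng = np.random.default_rng(seed)
        u = rng.standard_normal(d); u /= np.linalg.norm(u)
        v = rng.standard_normal(d); v /= np.linalg.norm(v)
        for _ in range(n_iters):
            x, y = sampler(rng)
            c = (u @ x) * (y @ v)                      # u^T X Y^T v
            u_new = u + eta * ((y @ v) * x - c * u)    # X Y^T v - c u
            v_new = v + eta * ((x @ u) * y - c * v)    # Y X^T u - c v
            u, v = u_new, v_new
        return u, v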

3 Optimization Landscape

We illustrate the nonconvex optimization landscape of (1.1), which helps us understand the intuition behind the algorithmic convergence. Based on the KKT conditions (2.3), we define the stationary points of (2.2) as follows. Given (1.1) and (2.2), we define:

  1. A quadruplet $(u, v, \mu, \nu)$ is called a stationary point of (2.2) if it satisfies (2.3).

  2. A pair $(u, v)$ is called a stable stationary point of (1.1) if $(u, v, \mu, \nu)$ is a stationary point of (2.2) and the Hessian $\nabla^2_{(u,v)}\mathcal{L}(u, v)$ is negative semi-definite.

  3. A pair $(u, v)$ is called an unstable stationary point of (1.1) if $(u, v, \mu, \nu)$ is a stationary point of (2.2) and the Hessian $\nabla^2_{(u,v)}\mathcal{L}(u, v)$ has a positive eigenvalue.

Our definition is similar to that of Absil et al. (2009); theirs is stated for optimization over manifolds, while ours is for the Lagrangian formulation. We consider the Lagrangian version because our algorithm cannot guarantee that the iterates stay on the unit spheres. We then obtain all stationary points by solving (2.3). For notational simplicity, we denote $\Sigma = \mathbb{E}[XY^\top]$. Before we proceed with our analysis, we introduce the following assumption.

Assumption 3. Suppose $\Sigma \neq 0$ and $\mathrm{rank}(\Sigma) = r$. We have $\sigma_1 > \sigma_2 \ge \cdots \ge \sigma_r > 0 = \sigma_{r+1} = \cdots = \sigma_d$, where $\sigma_i$ is the $i$-th singular value of $\Sigma$.

We impose such an eigengap assumption ($\sigma_1 > \sigma_2$) to ensure the identifiability of the leading pair of singular vectors; under it, the leading pair of singular vectors is uniquely determined up to a simultaneous sign change. Let $\overline{U}$ and $\overline{V}$ be any pair of left and right singular matrices of $\Sigma$.[3] Let $\overline{u}_i$ and $\overline{v}_i$ denote the $i$-th columns of $\overline{U}$ and $\overline{V}$, respectively. The next proposition reveals the connection between stationary points and singular vectors.

Proposition. Suppose Assumption 3 holds. A quadruplet $(u, v, \mu, \nu)$ with $\mu = \nu = u^\top \Sigma v$ is a stationary point of (2.2) if either of the following conditions holds:

  1. $(u, v)$ is a pair of left and right singular vectors of $\Sigma$ associated with the same nonzero singular value;

  2. $u$ and $v$ belong to the null spaces of $\Sigma^\top$ and $\Sigma$, respectively: $\Sigma^\top u = 0$ and $\Sigma v = 0$.

[3] Since the singular values are not necessarily distinct, some pairs of singular vectors are not unique; e.g., when $\sigma_i = \sigma_{i+1}$, the corresponding singular vectors are determined only up to rotation. Note that our analysis works for all possible combinations of $\overline{U}$ and $\overline{V}$. See more details in Golub and Van Loan (2012).

The proof of this proposition is presented in Appendix A.1. We then determine the types of the stationary points obtained above. The next proposition characterizes the maximum eigenvalue of the Hessian $\nabla^2_{(u,v)}\mathcal{L}$ at these stationary points of (2.2).

Proposition. Suppose Assumption 3 holds. All pairs of singular vectors associated with the leading singular value $\sigma_1$ are global optima of (1.1) — i.e., saddle points of the Lagrangian (2.2) over the primal and dual variables — and they are stable stationary points. All other stationary points of (2.2) are unstable, with

$\lambda_{\max}\Big(\nabla^2_{(u,v)}\mathcal{L}(u, v)\Big) \ge \sigma_1 - \sigma_2 > 0.$

The proof of this proposition is presented in Appendix A.2. It essentially characterizes the geometry of (1.1) at all stationary points. Specifically, except for the global optima, the remaining stationary points are the so-called strict saddle points over the underlying manifold, as proposed in Ge et al. (2015). The unstableness of these strict saddle points allows the stochastic gradient algorithm to escape from them, as will be shown in the next sections.
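As a quick numerical sanity check of this landscape characterization — under our reconstructed Lagrangian (2.2), whose Hessian in $(u, v)$ at a stationary pair $(\overline{u}_i, \overline{v}_i)$ has $-\sigma_i I$ on the diagonal blocks and $\Sigma, \Sigma^\top$ off the diagonal — the sketch below verifies that $\lambda_{\max} = 0$ at the leading pair and $\lambda_{\max} = \sigma_1 - \sigma_i > 0$ elsewhere.

    import numpy as np

    rng = np.random.default_rng(1)
    d = 5
    Sigma = rng.standard_normal((d, d))
    U, s, Vt = np.linalg.svd(Sigma)

    for i in range(d):
        mu = s[i]                                  # mu = nu = u^T Sigma v = sigma_i
        H = np.block([[-mu * np.eye(d), Sigma],
                      [Sigma.T, -mu * np.eye(d)]])
        lam_max = np.linalg.eigvalsh(H).max()
        # i = 0: lam_max ~ 0 (stable, global optimum);
        # i > 0: lam_max ~ s[0] - s[i] > 0 (unstable strict saddle)
        print(i, lam_max, s[0] - s[i])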

4 Global Convergence by ODE

Before we proceed with our analysis, we first impose some mild assumptions on the problem.

Assumption. The data samples $\{(X_k, Y_k)\}_{k=1}^\infty$ are identically and independently distributed as $(X, Y) \sim \mathcal{D}$, satisfying the following conditions:

  1. For any $k$, $\mathbb{E}\,\|X_k\|_2^4 \le C_d$ and $\mathbb{E}\,\|Y_k\|_2^4 \le C_d$, where $C_d$ is a constant that may depend on the dimension $d$;[4]

  2. $\sigma_1 > \sigma_2 \ge \cdots \ge \sigma_d > 0$, where the $\sigma_i$'s are the singular values of $\Sigma = \mathbb{E}[XY^\top]$.

[4] We only need moments of sufficiently high order of $\|X\|_2$ and $\|Y\|_2$ to be bounded, while the preliminary results in Chen et al. (2017) require both $X$ and $Y$ to be bounded random variables.

Here we assume that $X$ and $Y$ have the same dimension (i.e., $m = d$) and that $\Sigma$ is full rank, for convenience of analysis. The extension to $m \neq d$ in a rank-deficient setting is straightforward, but more involved (see more details in Section 5.4). Moreover, for a multiview learning problem, it is also natural to impose the following additional assumptions.

Assumption 4. Given the observed random variables $X$ and $Y$, there exist two orthogonal matrices $O_X, O_Y \in \mathbb{R}^{d\times d}$ such that $X = O_X \overline{X}$ and $Y = O_Y \overline{Y}$, where $\overline{X}$ and $\overline{Y}$ are latent variables satisfying:

  1. $\overline{X}_i$ and $\overline{Y}_j$ are uncorrelated if $i \neq j$, so that $O_X$ and $O_Y$ are the left and right singular matrices of $\Sigma$, respectively;

  2. The relevant moments of the latent variables are bounded by constants: in particular, $\mathbb{E}\big[\overline{X}_i \overline{Y}_i\big] = \sigma_i$, and fourth-order cross moments such as $\mathbb{E}\big[\overline{X}_i^2\, \overline{Y}_j^2\big]$ are bounded for all $i, j$.
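To make Assumption 4 concrete in experiments, here is one possible synthetic-data construction (our own illustrative choice, not the paper's): take $\overline{X} = Z$ standard Gaussian and $\overline{Y}_i = \sigma_i Z_i + \sqrt{1 - \sigma_i^2}\, W_i$ with independent Gaussian noise $W$, which requires $0 < \sigma_i \le 1$ and yields $\mathbb{E}[\overline{X}\,\overline{Y}^\top] = \mathrm{diag}(\sigma_1, \ldots, \sigma_d)$.

    import numpy as np

    def make_latent_sampler(d, singular_values, seed=0):
        """Synthetic data satisfying Assumption 4: X = O_X Xbar, Y = O_Y Ybar,
        so that Sigma = E[X Y^T] = O_X diag(sigma) O_Y^T.
        Requires 0 < sigma_i <= 1 (illustrative construction)."""
        rng = np.random.default_rng(seed)
        O_X, _ = np.linalg.qr(rng.standard_normal((d, d)))
        O_Y, _ = np.linalg.qr(rng.standard_normal((d, d)))
        sig = np.asarray(singular_values, dtype=float)

        def sampler(rng):
            z = rng.standard_normal(d)                       # shared latent factor
            w = rng.standard_normal(d)                       # independent noise
            xbar = z
            ybar = sig * z + np.sqrt(1.0 - sig**2) * w       # E[xbar_i ybar_i] = sig_i
            return O_X @ xbar, O_Y @ ybar

        return sampler, O_X, O_Y

The returned `sampler` matches the `sampler(rng)` interface used by the GHA sketch in Section 2.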

The next proposition characterizes the Markov property of our algorithm.

Proposition. The sequence $\{(u_k, v_k)\}_{k=0}^\infty$ generated by (2.7) forms a discrete-time Markov process.

With this Markov property, we can construct a continuous-time process from the iterates and derive an ordinary differential equation (ODE) to analyze the algorithmic convergence. Specifically, as the fixed step size $\eta \to 0^+$, we first show that the sequence generated by (2.7) essentially stays on the unit spheres, so that the constraints of (2.1) are asymptotically satisfied.

Proposition. If the initial points are on the unit spheres, i.e., $\|u_0\|_2 = \|v_0\|_2 = 1$, then as $\eta \to 0^+$, we have $\big|\|u_k\|_2^2 - 1\big| \to 0$ and $\big|\|v_k\|_2^2 - 1\big| \to 0$ in probability, uniformly over any fixed time horizon $k \le T/\eta$.

The proof of this proposition is presented in Appendix B.1. By this proposition, we further show that the continuous-time interpolations $u^\eta(t) = u_{\lfloor t/\eta \rfloor}$ and $v^\eta(t) = v_{\lfloor t/\eta \rfloor}$ weakly converge to the solution of the following ODE system (see more details in Ethier and Kurtz (2009)),

(4.1)   $\frac{du}{dt} = \Sigma v - \big(u^\top \Sigma v\big)\, u \quad \text{and} \quad \frac{dv}{dt} = \Sigma^\top u - \big(u^\top \Sigma v\big)\, v,$

where $u(0) = u_0$ and $v(0) = v_0$. To highlight that the sequence generated by (2.7) depends on $\eta$, we redefine it as $\{(u_k^\eta, v_k^\eta)\}_{k=0}^\infty$.

Theorem. As $\eta \to 0^+$, the processes $u^\eta(t)$ and $v^\eta(t)$ weakly converge to the solution of the ODE system in (4.1) with the same initial points on the spheres, i.e., $u(0) = u_0$ and $v(0) = v_0$ with $\|u_0\|_2 = \|v_0\|_2 = 1$.

The proof of this theorem is presented in Appendix B.2. Under Assumption 4, the above ODE system admits a closed-form solution. Specifically, we solve for $u$ and $v$ simultaneously, since they are coupled together in (4.1). To simplify (4.1), we define $w = (u^\top, v^\top)^\top$ and $Q = \begin{pmatrix} 0 & \Sigma \\ \Sigma^\top & 0 \end{pmatrix}$. We then rewrite (4.1) as

(4.2)   $\frac{dw}{dt} = Qw - \frac{w^\top Q w}{2}\, w,$

where $w(0) = (u_0^\top, v_0^\top)^\top$. By Assumption 4, $O_X$ and $O_Y$ are the left and right singular matrices of $\Sigma$, respectively, i.e., $\Sigma = O_X \Lambda O_Y^\top$, where $\Lambda = \mathrm{diag}(\sigma_1, \ldots, \sigma_d)$ is diagonal. For notational simplicity, we define an orthogonal matrix $P$ such that $P^\top Q P = D$ is diagonal. One can verify that $Q = PDP^\top$, where

(4.3)   $P = \frac{1}{\sqrt{2}}\begin{pmatrix} O_X & O_X \\ O_Y & -O_Y \end{pmatrix} \quad \text{and} \quad D = \begin{pmatrix} \Lambda & 0 \\ 0 & -\Lambda \end{pmatrix}.$

Left-multiplying both sides of (4.2) by $P^\top$ and letting $h(t) = P^\top w(t)$, we obtain

(4.4)   $\frac{dh}{dt} = Dh - \frac{h^\top D h}{2}\, h,$

which is a coordinate-separable ODE system. Accordingly, we define the $\gamma_j$'s as the diagonal entries of $D$:

(4.5)   $\gamma_j = \sigma_j \ \text{ for } 1 \le j \le d \quad \text{and} \quad \gamma_j = -\sigma_{j-d} \ \text{ for } d+1 \le j \le 2d.$

Thus, we can obtain a closed-form solution to (4.4) based on the following theorem.

Theorem. Given (4.4), we write the ODE in each component $h^j$,

(4.6)   $\frac{d h^j(t)}{dt} = \Big(\gamma_j - \frac{1}{2}\sum_{i=1}^{2d} \gamma_i \big(h^i(t)\big)^2\Big)\, h^j(t),$

where $\gamma_j$ is defined in (4.5). This ODE system has a closed-form solution as follows:

(4.7)   $h^j(t) = \frac{h^j(0)\, e^{\gamma_j t}}{C(t)}$

for $j = 1, \ldots, 2d$, where

$C(t) = \Big(\frac{1}{2}\sum_{i=1}^{2d} \big(h^i(0)\big)^2\, e^{2\gamma_i t}\Big)^{1/2}$

is a normalization function such that $\|h(t)\|_2^2 = 2$.

The proof of this theorem is presented in Appendix B.3. Without loss of generality, we assume $h^1(0) > 0$. As can be seen, $h(t) \to \sqrt{2}\, e_1$ as $t \to \infty$, where $e_1$ is the first canonical basis vector of $\mathbb{R}^{2d}$; that is, $(u(t), v(t))$ converges to the pair of leading singular vectors. We have thus characterized the asymptotic global convergence of the algorithm, up to the approximation error between the discrete iterates and the limiting ODE. The solution to the ODE system in (4.7), however, does not fully reveal the algorithmic behavior (more precisely, the rate of convergence) near the equilibria of the ODE system. This further motivates us to exploit the following SDE-based approach for a more precise characterization.
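As a numerical check of the theorem above (a sketch under our reconstructed (4.5)-(4.7); the constants are illustrative), the following compares a forward-Euler integration of (4.6) with the closed-form solution (4.7) and shows the mass of $h$ concentrating on the first coordinate.

    import numpy as np

    d = 4
    sigma = np.array([1.0, 0.7, 0.4, 0.2])
    gamma = np.concatenate([sigma, -sigma])          # (4.5)

    rng = np.random.default_rng(0)
    h0 = rng.standard_normal(2 * d)
    h0 *= np.sqrt(2.0) / np.linalg.norm(h0)          # ||h(0)||_2^2 = 2

    h, dt, T = h0.copy(), 1e-4, 10.0
    for _ in range(int(T / dt)):                     # forward Euler on (4.6)
        h = h + dt * (gamma * h - 0.5 * (gamma @ h**2) * h)

    g = h0 * np.exp(gamma * T)
    closed = np.sqrt(2.0) * g / np.linalg.norm(g)    # (4.7)
    print(np.max(np.abs(h - closed)))                # small discretization error
    print(h**2 / 2.0)                                # mass concentrates on j = 1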

5 Local Dynamics by SDE

We characterize three stages of the solution trajectories: [a] a neighborhood around the unstable equilibria — the minimizers and saddle points of (2.1); [b] a neighborhood around the stable equilibria — the maximizers of (2.1); and [c] the deterministic traverse between equilibria. Specifically, for stages [a] and [b], we rescale the influence of the noise to characterize the local algorithmic behavior. Moreover, we provide the approximate time spent in each phase, which implies the number of iterations each phase needs until convergence.

5.1 Phase I: Escaping from Unstable Equilibria

Suppose that the algorithm starts to iterate around an unstable equilibrium (e.g., a saddle point). Different from our previous analysis, we rescale the two aforementioned processes by a factor of $\eta^{-1/2}$. This eventually allows us to capture the uncertainty of Phase I by stochastic differential equations. Roughly speaking, the ODE approximation is essentially a variant of the law of large numbers for Markov processes, while the SDE approximation serves as a variant of the central limit theorem accordingly.

Recall that $P$ is an orthonormal matrix diagonalizing $Q$, and $h(t) = P^\top w(t)$ is defined in (4.4). Let $h_k^{i,\eta}$ denote the $i$-th coordinate of $h_k^\eta = P^\top \big((u_k^\eta)^\top, (v_k^\eta)^\top\big)^\top$, and let $z_k^{i,\eta} = \eta^{-1/2}\, h_k^{i,\eta}$ denote its rescaled version. The following theorem characterizes the asymptotic dynamics of the algorithm around an unstable equilibrium, say the one associated with the $j$-th coordinate, $h = \sqrt{2}\, e_j$ with $j \neq 1$.

Theorem 5.1. Condition on the event that the iterates remain in a small neighborhood of the unstable equilibrium up to the current time. Then as $\eta \to 0^+$, for all $i \neq j$, the rescaled process $z^{i,\eta}(t) = z^{i,\eta}_{\lfloor t/\eta \rfloor}$ weakly converges to a diffusion process $z^i(t)$ satisfying the following SDE:

(5.1)   $dz^i(t) = \big(\gamma_i - \gamma_j\big)\, z^i(t)\, dt + \beta_{ij}\, dB(t),$

where $B(t)$ is a standard Brownian motion, and $\beta_{ij} > 0$ is a constant determined by the moments of the latent variables in Assumption 4 (its explicit form, which involves fourth-order cross moments such as $\mathbb{E}[\overline{X}_i^2 \overline{Y}_j^2]$, is given in Appendix C.1).

The conditioning is only a technical assumption. It does not cause any issue, since once $|z^i|$ is large — or equivalently, once $|h^i|$ is no longer as small as order $\eta^{1/2}$ — our algorithm has escaped from the saddle point, which is out of Phase I.

The proof of Theorem 5.1 is provided in Appendix C.1. Note that the marginal density of (5.1) solves a Fokker-Planck equation, and the SDE admits a closed-form solution as follows,

(5.2)   $z^i(t) = z^i(0)\, e^{(\gamma_i - \gamma_j)t} + \beta_{ij}\int_0^t e^{(\gamma_i - \gamma_j)(t - s)}\, dB(s).$

Such a solution is well known as an Ornstein-Uhlenbeck process (Øksendal, 2003), and it implies that the distribution of $z^{i,\eta}_k$ can be well approximated by the normal distribution of $z^i(k\eta)$,

$z^i(t) \sim N\bigg(z^i(0)\, e^{(\gamma_i - \gamma_j)t},\ \frac{\beta_{ij}^2\big(e^{2(\gamma_i - \gamma_j)t} - 1\big)}{2(\gamma_i - \gamma_j)}\bigg),$

for a sufficiently small step size (with the variance read as $\beta_{ij}^2\, t$ in the limit $\gamma_i = \gamma_j$). This continuous-time approximation further has the following implications:

  1. For $\gamma_i > \gamma_j$ (e.g., $i = 1$), $z^i(t)$ is a random variable with mean $z^i(0)\, e^{(\gamma_i - \gamma_j)t}$ and variance smaller than $\frac{\beta_{ij}^2}{2(\gamma_i - \gamma_j)}\, e^{2(\gamma_i - \gamma_j)t}$. The larger $t$ is, the closer its variance gets to this upper bound, while the mean amplifies $z^i(0)$ by a factor exponentially increasing in $t$. This tremendous amplification forces $z^i(t)$ to quickly get away from $0$ as $t$ increases.

  2. For $\gamma_i < \gamma_j$, we have $\mathbb{E}[z^i(t)] = z^i(0)\, e^{-(\gamma_j - \gamma_i)t}$ and $\mathrm{Var}[z^i(t)] = \frac{\beta_{ij}^2}{2(\gamma_j - \gamma_i)}\big(1 - e^{-2(\gamma_j - \gamma_i)t}\big)$. As shown in case 1, $t$ does not need to be large for the dominant coordinate to get away from $0$, so here we only consider relatively small $t$. Since the initial value $z^i(0)$ is small, the drift is very small and $z^i$ tends to stay at $0$; as $t$ increases, the exponential decay makes the drift quickly become negligible. Moreover, the variance is bounded by $\frac{\beta_{ij}^2}{2(\gamma_j - \gamma_i)}$ and increases far more slowly than the variance in case 1. Thus, roughly speaking, $z^i(t)$ oscillates near $0$.

  3. For $\gamma_i = \gamma_j$, we have $\mathbb{E}[z^i(t)] = z^i(0)$ and $\mathrm{Var}[z^i(t)] = \beta_{ij}^2\, t$. This implies that $z^i(t)$ also tends to oscillate around $0$ within the time scale of Phase I, as $t$ increases.

Overall, case 1 is dominant, and it is the major driving force for the process to escape from the unstable equilibrium. More precisely, let us consider one special case of Phase I: we start around the saddle point associated with the second largest singular value, i.e., $h(0) = \sqrt{2}\, e_2$ (so $j = 2$). We then asymptotically calculate the time required to escape through the following proposition.

Proposition 5.1. Given a pre-specified $\nu > 0$ and a sufficiently small $\eta > 0$, there exists some small $\delta > 0$, depending on a generic constant, such that the following result holds: we need a time

$T_1 \asymp \frac{1}{2(\sigma_1 - \sigma_2)}\, \log\frac{\delta^2}{\eta}$

such that $\big(h^{1,\eta}(T_1)\big)^2 \ge \delta^2$ with probability at least $1 - \nu$, where the constants hidden in $\asymp$ depend on $\beta_{12}$ and on $\nu$ through $\Phi$, the CDF of the standard normal distribution.

The proof of Proposition 5.1 is provided in Appendix C.2. Proposition 5.1 suggests that, asymptotically, SGD can escape from unstable equilibria within a short time. This further implies that the algorithm needs at most

$N_1 \asymp \frac{\eta^{-1}}{2(\sigma_1 - \sigma_2)}\, \log\frac{\delta^2}{\eta}$

iterations in Phase I. After escaping from the saddle, SGD gets into the next phase, which is an almost deterministic traverse between equilibria.
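A small Euler-Maruyama simulation of the Phase-I dynamics (5.1) illustrates this escape behavior; the drift, noise level, and escape threshold below are illustrative stand-ins for $\sigma_1 - \sigma_2$, $\beta_{12}$, and $\eta^{-1/2}\delta$.

    import numpy as np

    # dZ(t) = (sigma1 - sigma2) Z(t) dt + beta dB(t), Z(0) = 0:
    # the exponentially amplified noise drives the escape from the saddle.
    rng = np.random.default_rng(0)
    a, beta = 0.3, 1.0                 # eigengap and noise level (illustrative)
    dt, T, threshold = 1e-3, 30.0, 50.0
    times = []
    for _ in range(100):               # escape times across independent runs
        z, k = 0.0, 0
        while k < int(T / dt) and abs(z) < threshold:
            z += a * z * dt + beta * np.sqrt(dt) * rng.standard_normal()
            k += 1
        times.append(k * dt)
    print(np.median(times))            # ~ a^{-1} log(threshold), cf. Proposition 5.1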

5.2 Phase II: Traverse between Equilibria

When the algorithm is close to neither the saddle points nor the optima, its performance is asymptotically nearly deterministic. Specifically, the gradient dominates the noise, and the algorithm behaves like an almost deterministic traverse between stationary points, which can be viewed as a discretization of the ODE in (4.6) with a discretization error of order $\mathcal{O}(\eta)$ (Griffiths and Higham, 2010). Thus, we use the ODE approximation to study the algorithm before it enters the neighborhood of the optimum. The next proposition characterizes the asymptotic dynamics of the algorithm in this phase.

Proposition 5.2. After restarting the time counter, given a sufficiently small $\eta$ and the $\delta$ defined in Proposition 5.1, we need a time

$T_2 \asymp \frac{1}{\sigma_1 - \sigma_2}\, \log\frac{1}{\delta^2}$

such that $\big(h^1(T_2)\big)^2 \ge 2 - \delta^2$.

The proof of Proposition 5.2 is provided in Appendix C.3. Combining Propositions 5.1 and 5.2, we know that the algorithm reaches a neighborhood of the stable equilibrium after time $T_1 + T_2$ with high probability, and then enters Phase III. This implies that the algorithm needs at most

$N_2 \asymp \frac{\eta^{-1}}{\sigma_1 - \sigma_2}\, \log\frac{1}{\delta^2}$

iterations in Phase II.

5.3 Phase III: Convergence to Stable Equilibria

Again, we restart the counter of time. The trajectory and analysis of Phase III are similar to Phase I, since we still characterize the convergence using an Ornstein-Uhlenbeck process. The following theorem characterizes the asymptotic dynamics of the algorithm around the stable equilibrium.

Theorem 5.3. Suppose the algorithm is initialized around the maximizer corresponding to the first column of $P$, i.e., $h(0) \approx \sqrt{2}\, e_1$. Then as $\eta \to 0^+$, for all $i \neq 1$, the rescaled process $z^{i,\eta}(t)$ weakly converges to a diffusion process satisfying the following SDE:

(5.3)   $dz^i(t) = -\big(\gamma_1 - \gamma_i\big)\, z^i(t)\, dt + \beta_{i1}\, dB(t),$

where $B(t)$ is a standard Brownian motion, and $\beta_{i1}$ is a constant of the same form as $\beta_{ij}$ in Theorem 5.1 with $j = 1$.

The proof of Theorem 5.3 is provided in Appendix C.4. Similar to (5.1), the closed-form solution to (5.3) for $i \neq 1$ is as follows:

(5.4)   $z^i(t) = z^i(0)\, e^{-(\gamma_1 - \gamma_i)t} + \beta_{i1}\int_0^t e^{-(\gamma_1 - \gamma_i)(t - s)}\, dB(s).$

By the properties of the O-U process, we characterize the expectation and variance of $z^i(t)$ for $i \neq 1$:

$\mathbb{E}\big[z^i(t)\big] = z^i(0)\, e^{-(\gamma_1 - \gamma_i)t} \quad \text{and} \quad \mathrm{Var}\big[z^i(t)\big] = \frac{\beta_{i1}^2}{2(\gamma_1 - \gamma_i)}\Big(1 - e^{-2(\gamma_1 - \gamma_i)t}\Big).$

Recall that the distribution of $z^{i,\eta}_k$ can be well approximated by the normal distribution of $z^i(k\eta)$ for a sufficiently small step size. This further implies that after sufficiently many iterations, the algorithm drives every coordinate $h^i$ with $i \neq 1$ toward $0$. Meanwhile, it behaves like a biased random walk towards the optimum when it iterates within a small neighborhood of the optimum. Unlike in Phase I, the variance gradually converges to the constant $\frac{\beta_{i1}^2}{2(\gamma_1 - \gamma_i)}$. Moreover, different from the ODE in Phase II, the SDE in Phase III reveals how small a step size $\eta$ we need for a target accuracy.

Based on Theorem 5.3, we further establish an upper bound on the time of Phase III in the following proposition.

Proposition 5.3. Given a sufficiently small $\epsilon > 0$, a sufficiently small $\eta$, the $\delta$ defined in Proposition 5.1, and after restarting the time counter, we need a time

$T_3 \asymp \frac{1}{2(\sigma_1 - \sigma_2)}\, \log\frac{\delta^2}{\epsilon}$

such that $\sum_{i \neq 1}\big(h^{i,\eta}(T_3)\big)^2 \le \epsilon$ with high probability.

The proof of Proposition 5.3 is provided in Appendix C.5. This implies that the algorithm needs at most

$N_3 \asymp \frac{\eta^{-1}}{2(\sigma_1 - \sigma_2)}\, \log\frac{\delta^2}{\epsilon}$

iterations to achieve an $\epsilon$-optimal solution in the third phase. Combining Propositions 5.1, 5.2, and 5.3, we know that after time $T_1 + T_2 + T_3$ the algorithm asymptotically achieves an $\epsilon$-optimal solution with high probability. This further leads to a more refined result in the following corollary.

Corollary 5.3. Given a sufficiently small $\epsilon > 0$, we define $\beta^2 = \max_{i \neq 1} \beta_{i1}^2$ and choose

$\eta \asymp \frac{\epsilon\,(\sigma_1 - \sigma_2)}{d\, \beta^2}.$

Then we need a total time

$T \asymp \frac{1}{\sigma_1 - \sigma_2}\, \log\frac{1}{\eta}$

such that we have $\sum_{i \neq 1}\big(h^{i,\eta}(T)\big)^2 \le \epsilon$ with high probability.

The proof of Corollary 5.3 is provided in Appendix C.6. Corollary 5.3 shows that, after time $T$, the algorithm asymptotically achieves an $\epsilon$-optimal solution. This further implies that after

$N = \frac{T}{\eta} \asymp \frac{1}{\epsilon}\, \log\frac{1}{\epsilon}$

iterations (up to constants depending on $d$, $\beta$, and $\sigma_1 - \sigma_2$), the algorithm achieves an $\epsilon$-optimal solution with high probability.
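The three phases can also be observed empirically. The following rough end-to-end sketch reuses the two earlier sketches (`make_latent_sampler` from Section 4 and the GHA update (2.7); all constants are illustrative), initializes at the saddle $(\overline{u}_2, \overline{v}_2)$, and tracks $(h^1_k)^2/2$, which equals $1$ at the optimum.

    import numpy as np

    d, eta, n_iters = 10, 5e-4, 200_000
    sig = np.linspace(0.9, 0.1, d)
    sampler, O_X, O_Y = make_latent_sampler(d, sig, seed=1)

    rng = np.random.default_rng(2)
    u, v = O_X[:, 1].copy(), O_Y[:, 1].copy()   # start at the saddle (u_2, v_2)
    trace = []
    for _ in range(n_iters):
        x, y = sampler(rng)
        c = (u @ x) * (y @ v)
        u, v = u + eta * ((y @ v) * x - c * u), v + eta * ((x @ u) * y - c * v)
        trace.append((O_X[:, 0] @ u + O_Y[:, 0] @ v) ** 2 / 4.0)   # (h^1)^2 / 2
    # Expected shape of `trace`: flat near 0 (Phase I), a sharp rise (Phase II),
    # then oscillation near 1 (Phase III).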

5.4 Extension to $m \neq d$

Our analysis can be further extended to the case where $X$ and $Y$ have different dimensions, i.e., $m \neq d$. Specifically, we consider an alternative way to construct the matrix $P$ defined in (4.3). We follow the same notation as in Assumption 4, and use $O_X \in \mathbb{R}^{m\times m}$ and $O_Y \in \mathbb{R}^{d\times d}$ to denote the orthogonal matrices relating the observed data to the latent variables. The dimensions of $\overline{X}$ and $\overline{Y}$, however, are now different, i.e., $\overline{X} \in \mathbb{R}^m$ and $\overline{Y} \in \mathbb{R}^d$. Without loss of generality, we assume $m > d$ and partition $O_X = \big(O_X^{(1)}, O_X^{(2)}\big)$, where $O_X^{(1)} \in \mathbb{R}^{m\times d}$ and $O_X^{(2)} \in \mathbb{R}^{m\times(m-d)}$. Then we have the singular value decomposition

(5.5)   $\Sigma = O_X \begin{pmatrix} \Lambda \\ 0 \end{pmatrix} O_Y^\top = O_X^{(1)} \Lambda\, O_Y^\top,$

where $\Lambda \in \mathbb{R}^{d\times d}$ is diagonal. Thus, $O_X^{(1)}$ and $O_Y$ contain the left and right singular vectors of $\Sigma$. We now design the orthogonal transform matrix $P \in \mathbb{R}^{(m+d)\times(m+d)}$ as

(5.6)   $P = \begin{pmatrix} \frac{1}{\sqrt{2}}\, O_X^{(1)} & O_X^{(2)} & \frac{1}{\sqrt{2}}\, O_X^{(1)} \\ \frac{1}{\sqrt{2}}\, O_Y & 0 & -\frac{1}{\sqrt{2}}\, O_Y \end{pmatrix}.$

One can check that $P^\top P = I$ and

(5.7)   $P^\top Q P = \begin{pmatrix} \Lambda & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & -\Lambda \end{pmatrix}, \quad \text{where } Q = \begin{pmatrix} 0 & \Sigma \\ \Sigma^\top & 0 \end{pmatrix}.$

Then our previous analysis using the ODE and SDE still holds.

Note that for $m = d$, any column vector of $P$ in (4.3) corresponds to a stationary solution. Here the square matrix $P$ in (5.6) contains $m + d$ column vectors, but only the first $d$ and last $d$ column vectors correspond to stationary solutions. This is because the remaining $m - d$ column vectors are not even feasible: their $v$-component is zero, violating the constraint $\|v\|_2 = 1$. Thus, given a feasible initial point, the algorithm will not be trapped in the subspace spanned by these remaining column vectors.
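A quick numerical check of the construction (5.6)-(5.7), as we have reconstructed it here:

    import numpy as np

    rng = np.random.default_rng(0)
    m, d = 6, 4
    Sigma = rng.standard_normal((m, d))
    U, s, Vt = np.linalg.svd(Sigma)                  # U: m x m, Vt: d x d
    U1, U0, V = U[:, :d], U[:, d:], Vt.T

    P = np.zeros((m + d, m + d))
    P[:m, :d] = U1 / np.sqrt(2); P[m:, :d] = V / np.sqrt(2)     # eigenvalue  sigma_i
    P[:m, d:m] = U0                                             # eigenvalue  0
    P[:m, m:] = U1 / np.sqrt(2); P[m:, m:] = -V / np.sqrt(2)    # eigenvalue -sigma_i

    Q = np.block([[np.zeros((m, m)), Sigma], [Sigma.T, np.zeros((d, d))]])
    D = P.T @ Q @ P
    print(np.allclose(P.T @ P, np.eye(m + d)))       # P is orthogonal
    print(np.round(np.diag(D), 6))                   # (s, 0, ..., 0, -s)
    print(np.allclose(D, np.diag(np.diag(D))))       # off-diagonals vanish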

5.5 Extension to Missing Values

Our methodology and theory can tolerate missing values. For simplicity, we assume that each entry of $X$ and $Y$ is missing independently with probability $p$ in each iteration, where $p < 1$. We then set all missing entries to $0$. We denote the resulting zero-imputed vectors by $\widetilde{X}$ and $\widetilde{Y}$. One can verify that

$\frac{1}{(1 - p)^2}\, \widetilde{X}\widetilde{Y}^\top$

is an unbiased estimator of $\Sigma = \mathbb{E}\big[XY^\top\big]$. Note that the factor $(1 - p)^{-2}$ can be further absorbed into the step size $\eta$, denoted by $\widetilde{\eta} = (1 - p)^{-2}\eta$, since the update (2.7) is linear in the rank-one matrix $X_k Y_k^\top$. Then (2.7) becomes:

$u_{k+1} = u_k + \widetilde{\eta}\big(\widetilde{X}_k\widetilde{Y}_k^\top v_k - (u_k^\top \widetilde{X}_k\widetilde{Y}_k^\top v_k)\, u_k\big) \quad \text{and} \quad v_{k+1} = v_k + \widetilde{\eta}\big(\widetilde{Y}_k\widetilde{X}_k^\top u_k - (u_k^\top \widetilde{X}_k\widetilde{Y}_k^\top v_k)\, v_k\big).$
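A minimal sketch of this missing-value variant (the masking scheme and function name are ours):

    import numpy as np

    def gha_missing_step(u, v, x, y, eta, p, rng):
        """One stochastic GHA step with entries of x and y missing
        independently with probability p; missing entries are zero-imputed
        and (1 - p)^{-2} is absorbed into the step size, as described above."""
        x_t = x * (rng.random(x.shape) >= p)       # zero-imputed observation
        y_t = y * (rng.random(y.shape) >= p)
        eta_t = eta / (1.0 - p) ** 2               # rescaled step size
        c = (u @ x_t) * (y_t @ v)
        u_new = u + eta_t * ((y_t @ v) * x_t - c * u)
        v_new = v + eta_t * ((x_t @ u) * y_t - c * v)
        return u_new, v_new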