Complete Dictionary Recovery over the Sphere II: Recovery by Riemannian Trust-region Method

11/15/2015, by Ju Sun, et al.

We consider the problem of recovering a complete (i.e., square and invertible) matrix A_0 from Y ∈ R^{n × p} with Y = A_0 X_0, provided X_0 is sufficiently sparse. This recovery problem is central to the theoretical understanding of dictionary learning, which seeks a sparse representation for a collection of input signals and finds numerous applications in modern signal processing and machine learning. We give the first efficient algorithm that provably recovers A_0 when X_0 has O(n) nonzeros per column, under a suitable probability model for X_0. Our algorithmic pipeline centers around solving a certain nonconvex optimization problem with a spherical constraint, and hence is naturally phrased in the language of manifold optimization. In a companion paper (arXiv:1511.03607), we have shown that with high probability our nonconvex formulation has no "spurious" local minimizers and that around any saddle point the objective function has a negative directional curvature. In this paper, we take advantage of this particular geometric structure and describe a Riemannian trust-region algorithm that provably converges to a local minimizer from arbitrary initializations. Such minimizers give excellent approximations to the rows of X_0. The rows are then recovered by linear programming rounding and deflation.

I Introduction

Recently, there has been a surge of research studying nonconvex formulations and provable algorithms for a number of central problems in signal processing and machine learning, including, e.g., low-rank matrix completion/recovery [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25], phase retrieval [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36], tensor recovery [37, 38, 39, 40, 41], mixed regression [42, 43], structured element pursuit [44, 41], blind deconvolution [45, 46, 47, 48, 49], noisy phase synchronization and community detection [50, 51, 52], deep learning [53, 54], and numerical linear algebra and optimization [55, 56]. These research efforts have been fruitful, producing more practical and scalable algorithms, and sometimes significantly better performance guarantees, than known convex methods.

In a companion paper [3], we set out to understand the surprising effectiveness of nonconvex heuristics on the dictionary learning (DL) problem. In particular, we have focused on the complete dictionary recovery (DR) setting: given Y = A_0 X_0, with A_0 complete (i.e., square and invertible), and X_0 obeying an i.i.d. Bernoulli-Gaussian (BG) model with rate θ (i.e., [X_0]_{ij} = Ω_{ij} V_{ij}, with Ω_{ij} ~ Ber(θ) and V_{ij} ~ N(0, 1)), recover A_0 and X_0. In this setting, row(Y) = row(X_0), where row(·) denotes the row space. To first recover rows of X_0, we have tried to find the sparsest vectors in row(Y), and proposed solving the nonconvex formulation

(I.1)   min_q f(q; Ŷ) ≐ (1/p) ∑_{k=1}^{p} h_μ(q^⊤ ŷ_k),   subject to   ||q||_2 = 1,

where Ŷ is a proxy of Y (i.e., Y after appropriate processing), ŷ_k is the k-th column of Ŷ, and h_μ(·) is a (convex) smooth approximation to the absolute-value function. The spherical constraint renders the problem nonconvex.
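For concreteness, here is a minimal numerical sketch of (I.1), assuming the log-cosh surrogate h_μ(z) = μ log cosh(z/μ) adopted in the companion paper [3]; the function names and the smoothing value μ = 1e-2 are illustrative only.

import numpy as np

def h_mu(z, mu=1e-2):
    """Smooth surrogate for |z|: h_mu(z) = mu * log cosh(z / mu), computed stably."""
    return mu * (np.logaddexp(z / mu, -z / mu) - np.log(2.0))

def f(q, Yhat, mu=1e-2):
    """Objective of (I.1): average surrogate sparsity of the correlations q^T y_k."""
    return np.mean(h_mu(q @ Yhat, mu))

The spherical constraint ||q||_2 = 1 is handled separately by the manifold machinery of Section II.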

Despite the apparent nonconvexity, our prior analysis in [3] showed that all local minimizers of (I.1) are qualitatively equally good, because each of them produces a close approximation to a certain row of X_0 (Corollary II.4 in [3]). So the central issue is how to escape from saddle points. Fortunately, our previous results (Theorem II.3 in [3]) imply that all saddle points under consideration are ridable, i.e., the associated Hessians have both strictly positive and strictly negative eigenvalues (see the recapitulation in Section II-B). In particular, eigenvectors associated with the negative eigenvalues are directions of negative curvature, which intuitively serve as directions of local descent.

Second-order methods can naturally exploit the curvature information to escape from ridable saddle points. To gain some intuition, consider an unconstrained optimization problem

min_x f(x).

The second-order Taylor expansion of f at a saddle point x̄ (where ∇f(x̄) = 0) is

f̂(δ) = f(x̄) + (1/2) δ^⊤ ∇²f(x̄) δ.

When δ is chosen to align with an eigenvector of a negative eigenvalue λ < 0 of ∇²f(x̄), it holds that

f̂(δ) = f(x̄) + (1/2) λ ||δ||² < f(x̄).

Thus, minimizing f̂ returns a direction that tends to decrease the objective f, provided the local approximation of f̂ to f is reasonably accurate. Based on this intuition, we derive a (second-order) Riemannian trust-region algorithm that exploits the second-order information to escape from saddle points and provably returns a local minimizer of (I.1) from arbitrary initializations. We provide rigorous guarantees for recovering a local minimizer in Section II.
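As a tiny numeric illustration (our own toy example, not from the paper), consider the saddle of f(x, y) = x² - y² at the origin: stepping along the eigenvector of the negative Hessian eigenvalue decreases both the quadratic model and the true objective.

import numpy as np

f_toy = lambda v: v[0]**2 - v[1]**2      # ridable saddle at the origin
H = np.diag([2.0, -2.0])                 # Hessian there; the gradient vanishes

lam, V = np.linalg.eigh(H)               # eigenvalues in ascending order
delta = 0.1 * V[:, 0]                    # eigenvector of the negative eigenvalue lam[0] = -2
model_decrease = 0.5 * delta @ H @ delta
print(model_decrease, f_toy(delta))      # both about -0.01: the model predicts descent and f agrees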

Obtaining a local minimizer only helps approximate one row of X_0. To recover that row, we derive a simple linear programming rounding procedure that provably works. To recover all rows of X_0, one repeats the above process on data produced by a carefully designed deflation procedure. The whole algorithmic pipeline and the related recovery guarantees are provided in Section III. In particular, we show that when p is reasonably large, with high probability (w.h.p.), our pipeline efficiently recovers A_0 and X_0, even when each column of X_0 contains O(n) nonzeros.
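A minimal sketch of such a rounding step is below, assuming the standard formulation min_q ||q^⊤ Ŷ||_1 subject to ⟨r, q⟩ = 1, where r is the approximate minimizer returned by the trust-region algorithm; the cvxpy-based implementation and its names are illustrative, and the exact procedure with its guarantees is given in Section III.

import numpy as np
import cvxpy as cp

def lp_round(r, Yhat):
    """Sharpen an approximate minimizer r into an exactly sparse direction by solving
    min_q ||q^T Yhat||_1  s.t.  <r, q> = 1  (a linear program after the usual
    reformulation of the l1 norm)."""
    q = cp.Variable(Yhat.shape[0])
    cp.Problem(cp.Minimize(cp.norm1(Yhat.T @ q)), [r @ q == 1]).solve()
    return q.value / np.linalg.norm(q.value)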

I-A Prior Arts and Connections

In Section II.E of the companion paper [3], we provide detailed comparisons of our results with prior theoretical results on DR; we conclude that ours is the first algorithmic framework that guarantees efficient recovery of complete dictionaries when the coefficients have up to a constant fraction of nonzeros. We also draw methodological connections to work on understanding nonconvex heuristics, and to other nonconvex problems with similar geometric structures. Here we focus on drawing detailed connections to the optimization literature.

The trust-region method (TRM) has a rich history dating back to the 1940s; see the monograph [57] for an account of its history and developments. The main motivation for the early developments was to address limitations of the classic Newton's method (see, e.g., Section 3 of [58]). These limitations include the technical subtleties involved in establishing local and global convergence results. Moreover, when the Hessian is singular or indefinite, the movement direction is either not well defined, or does not improve the objective function. [59, 60, 61, 62, 63, 64] initiated the line of work that addresses these limitations. In particular, [58, 65] proposed using the local second-order Taylor approximation as the model function in the trust-region framework for unconstrained optimization. They showed that under mild conditions, the trust-region iterate sequence has a limit point that is critical and has positive semidefinite Hessian; see also Sections 6.5-6.6 of [57]. Upon inspecting the relevant proofs, it seems not hard to strengthen these results to sequence convergence to local minimizers, under a ridable-saddle condition such as ours, for unconstrained optimization.

Research activities to port theories and algorithms of optimization from Euclidean spaces to Riemannian manifolds are best summarized by three monographs: [66, 67, 68]. [69] developed Newton and conjugate-gradient methods for the Stiefel manifolds, of which the sphere is a special case; [68] presents a complete set of first- and second-order Riemannian algorithms and convergence analyses; see also the excellent associated optimization software toolbox [70]. Among these, the trust-region method was first ported to the Riemannian setting in [71], with emphasis on efficient implementations that only approximately solve the trust-region subproblem according to the Cauchy point scheme. The Cauchy point definition adopted there was the usual one based on the gradient, which is not strong enough to ensure that the algorithm escapes from ridable saddle points, even if the true Hessian is used in the local approximation. In comparison, in this work we assume that the trust-region subproblem is exactly solved, so that ridable saddles (the only possible saddles for our problem) are properly escaped. By this, we obtain the strong guarantee that the iterate sequence converges to a local minimizer, in contrast to the weak global convergence (gradient sequence converging to zero) or local convergence (sequence converging to a local minimizer within a small radius) established in [71]. To the best of our knowledge, our convergence result is the first of its kind for a specific problem on the sphere. After our initial submission, [72] established the worst-case iteration complexity of Riemannian TRM to converge to second-order critical points (i.e., critical points with positive semidefinite Hessians), echoing the results in the Euclidean case [73]. Their results hold under mild Lipschitz-type assumptions and allow inexact subproblem solvers, and hence are very practical and general. However, on our particular problem, their result is considerably more pessimistic than the convergence result we obtain from a specialized analysis.

Solving the trust-region subproblem exactly is expensive. In practice, a reasonable approximate solution with controlled quality is often adequate to guarantee convergence. In this regard, the truncated conjugate gradient (tCG) solver with a good initial search direction is commonly employed (see, e.g., Section 7.5 in [57]). To ensure that ridable saddle points are properly escaped, the eigenpoint idea (see, e.g., Section 6.6 of [57]) is particularly relevant; see also Algorithm 3 and Lemma 10 in [72].

The benign function landscape we characterized in the first paper allows any reasonable iterative method that is capable of escaping from ridable saddles to find a local minimizer, with possibly different performance guarantees. The trust-region method we focus on here and the curvilinear search method [61] are second-order methods that guarantee global optimization from arbitrary initializations. Typical first-order methods such as vanilla gradient descent can only guarantee convergence to a critical point. Nonetheless, for our particular function, the noisy/stochastic gradient method is guaranteed to find a local minimizer from an arbitrary initialization with high probability [74].

I-B Notations and Reproducible Research

We use bold capital and small letters such as X and x to denote matrices and vectors, respectively. Small letters are reserved for scalars. Several specific mathematical objects we will frequently work with: O(n) for the orthogonal group of order n, S^{n-1} for the unit sphere in R^n, B^n for the unit ball in R^n, and [m] ≐ {1, …, m} for positive integers m. We use ⊤ for matrix transposition, causing no confusion as we will work entirely over the real field. We use a superscript to index rows of a matrix, such as x^i for the i-th row of the matrix X, and a subscript to index columns, such as x_j. All vectors default to column vectors, so the i-th row of X as a row vector will be written as (x^i)^⊤. For norms, ||·|| is the usual ℓ² norm for a vector and the operator norm (i.e., ℓ² → ℓ²) for a matrix; all other norms will be indexed by a subscript, for example the Frobenius norm ||·||_F for matrices and the element-wise max-norm ||·||_∞. We use x ~ L to mean that the random variable x is distributed according to the law L. Let N(0, I) denote the standard Gaussian law. Then x ~ N(0, I_n) means that x is a standard Gaussian vector. Similarly, we use X ~_{i.i.d.} L to mean that the elements of X are independently and identically distributed according to the law L. So the fact x ~ N(0, I_n) is equivalent to x ~_{i.i.d.} N(0, 1). One particular distribution of interest for this paper is the Bernoulli-Gaussian with rate θ: z = b g, with b ~ Ber(θ) and g ~ N(0, 1). We also write this compactly as z ~ BG(θ). We frequently use indexed c and C for numerical constants when stating and proving technical results. The scopes of such constants are local unless otherwise noted. We use standard notations for most other cases, with exceptions clarified locally.
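The sketch below samples an i.i.d. BG(θ) matrix in exactly this sense (the function name and the dimensions are illustrative).

import numpy as np

def bernoulli_gaussian(n, p, theta, seed=0):
    """Sample X with X_ij = Omega_ij * V_ij, where Omega_ij ~ Ber(theta) and V_ij ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    return (rng.random((n, p)) < theta) * rng.standard_normal((n, p))

X0 = bernoulli_gaussian(n=50, p=5000, theta=0.3)   # roughly a theta fraction of nonzeros per column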

The code to reproduce all the figures and experimental results is available online:

https://github.com/sunju/dl_focm .

II Finding One Local Minimizer via the Riemannian Trust-Region Method

We are interested in seeking a local minimizer of (I.1). The presence of saddle points has motivated us to develop a second-order Riemannian trust-region algorithm over the sphere; the existence of descent directions at nonoptimal points drives the trust-region iterate sequence towards one of the minimizers asymptotically. We will prove that under our modeling assumptions, this algorithm with an arbitrary initialization efficiently produces an accurate approximation to one of the minimizers. (By "accurate" we mean one can achieve an arbitrary numerical accuracy ε with a reasonable amount of time; here the running time of the algorithm is on the order of log log(1/ε) in the target accuracy ε, and polynomial in the other problem parameters.) Throughout the exposition, basic knowledge of Riemannian geometry is assumed. We will try to keep the technical requirements minimal; the reader can consult the excellent monograph [68] for the relevant background and details.

II-A Some Basic Facts about the Sphere S^{n-1} and f

For any point q ∈ S^{n-1}, the tangent space T_q S^{n-1} and the orthoprojector P_{T_q S^{n-1}} onto T_q S^{n-1} are given by

T_q S^{n-1} = {δ ∈ R^n : q^⊤ δ = 0},   P_{T_q S^{n-1}} = I - q q^⊤ = U U^⊤,

where U ∈ R^{n × (n-1)} is an arbitrary orthonormal basis for T_q S^{n-1} (note that the orthoprojector is independent of the basis we choose). Consider any δ ∈ T_q S^{n-1}. The map

γ(t) ≐ q cos(t ||δ||) + (δ/||δ||) sin(t ||δ||)

defines a smooth curve on the sphere that satisfies γ(0) = q and γ'(0) = δ. Geometrically, γ(t) is a segment of the great circle that passes through q and has δ as its tangent vector at q. The exponential map for δ ∈ T_q S^{n-1} is defined as

exp_q(δ) ≐ γ(1) = q cos ||δ|| + (δ/||δ||) sin ||δ||.

It is a canonical way of pulling the tangent vector δ back to the sphere.
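A minimal numerical sketch of these two objects on S^{n-1} (variable names are ours):

import numpy as np

def tangent_project(q, v):
    """Apply the orthoprojector onto T_q S^{n-1}, i.e., (I - q q^T) v."""
    return v - q * (q @ v)

def exp_map(q, delta):
    """Exponential map on the sphere: follow the great circle from q along delta."""
    t = np.linalg.norm(delta)
    if t < 1e-12:
        return q
    return np.cos(t) * q + np.sin(t) * (delta / t)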

Fig. 1: Illustrations of the tangent space T_q S^{n-1} and the exponential map exp_q(·) defined on the sphere S^{n-1}.

In this paper we are interested in the restriction of f to the unit sphere S^{n-1}. For the sake of performing optimization, we need local approximations of f. Instead of directly approximating the function in R^n, we form quadratic approximations of f in the tangent spaces of S^{n-1}. We consider the smooth function f ∘ exp_q : T_q S^{n-1} → R, where ∘ is the usual function composition operator. An application of vector space Taylor's theorem gives

f(exp_q(δ)) ≈ f(q) + ⟨∇f(q), δ⟩ + (1/2) δ^⊤ (∇²f(q) - ⟨∇f(q), q⟩ I) δ

when ||δ|| is small. Thus, we form a quadratic approximation as

(II.1)   f̂(δ; q) ≐ f(q) + ⟨∇f(q), δ⟩ + (1/2) δ^⊤ (∇²f(q) - ⟨∇f(q), q⟩ I) δ,   δ ∈ T_q S^{n-1}.

Here ∇f(q) and ∇²f(q) denote the usual (Euclidean) gradient and Hessian of f w.r.t. q in R^n. For our specific f defined in (I.1), it is easy to check that

(II.2)   ∇f(q; Ŷ) = (1/p) ∑_{k=1}^{p} h'_μ(q^⊤ ŷ_k) ŷ_k,
(II.3)   ∇²f(q; Ŷ) = (1/p) ∑_{k=1}^{p} h''_μ(q^⊤ ŷ_k) ŷ_k ŷ_k^⊤.

The quadratic approximation also naturally gives rise to the Riemannian gradient and Riemannian Hessian defined on T_q S^{n-1} as

(II.4)   grad f(q) ≐ P_{T_q S^{n-1}} ∇f(q),
(II.5)   Hess f(q) ≐ P_{T_q S^{n-1}} (∇²f(q) - ⟨∇f(q), q⟩ I) P_{T_q S^{n-1}}.

Thus, the above quadratic approximation can be rewritten compactly as

f̂(δ; q) = f(q) + ⟨grad f(q), δ⟩ + (1/2) ⟨δ, Hess f(q) δ⟩.
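The sketch below evaluates these quantities numerically, assuming the log-cosh surrogate from the earlier sketch (so h'_μ(z) = tanh(z/μ) and h''_μ(z) = (1/μ)(1 - tanh²(z/μ))); the projections follow (II.4)-(II.5).

import numpy as np

def euclidean_grad_hess(q, Yhat, mu=1e-2):
    """Euclidean gradient and Hessian (II.2)-(II.3), specialized to the log-cosh surrogate."""
    p = Yhat.shape[1]
    t = np.tanh((Yhat.T @ q) / mu)
    grad = Yhat @ t / p
    hess = (Yhat * ((1.0 - t**2) / mu)) @ Yhat.T / p
    return grad, hess

def riemannian_grad_hess(q, Yhat, mu=1e-2):
    """Riemannian gradient and Hessian (II.4)-(II.5) via projection onto T_q S^{n-1}."""
    n = q.size
    grad, hess = euclidean_grad_hess(q, Yhat, mu)
    P = np.eye(n) - np.outer(q, q)
    return P @ grad, P @ (hess - (grad @ q) * np.eye(n)) @ P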

The first-order necessary condition for unconstrained minimization of f̂(δ; q) over T_q S^{n-1} is

(II.6)   grad f(q) + Hess f(q) δ = 0.

If Hess f(q) is positive semidefinite and has "full rank" (hence "nondegenerate"; note that the matrix Hess f(q) has rank at most n-1, as the nonzero vector q obviously lies in its null space; when Hess f(q) has rank n-1, it has no null direction in the tangent space, and thus in this case it acts on the tangent space like a full-rank matrix), the unique solution is

δ_⋆ = -U (U^⊤ Hess f(q) U)^{-1} U^⊤ grad f(q),

which is also invariant to the choice of basis U. Given a tangent vector δ ∈ T_q S^{n-1}, let γ(t) ≐ exp_q(tδ) denote a geodesic curve on S^{n-1}. Following the notation of [68], let

T_{γ(0) → γ(τ)} : T_{γ(0)} S^{n-1} → T_{γ(τ)} S^{n-1}

denote the parallel translation operator, which translates a tangent vector at γ(0) = q to a tangent vector at γ(τ) in a "parallel" manner. In the sequel, we identify T_{γ(0) → γ(τ)} with the following matrix, whose restriction to T_q S^{n-1} is the parallel translation operator (the detailed derivation can be found in Chapter 8.1 of [68]):

(II.7)   T_{γ(0) → γ(τ)} = I + (cos(τ ||δ||) - 1) (δ δ^⊤)/||δ||² - sin(τ ||δ||) q δ^⊤/||δ||.

Similarly, following the notation of [68], we denote the inverse of this matrix by T_{γ(τ) → γ(0)}; its restriction to T_{γ(τ)} S^{n-1} is the inverse of the parallel translation operator.
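A small sketch of applying the transport matrix in (II.7) (at τ = 1) to a tangent vector, following the great-circle formula of Chapter 8.1 of [68]:

import numpy as np

def parallel_transport(q, delta, xi):
    """Translate xi in T_q S^{n-1} along the geodesic t -> exp_q(t * delta) to t = 1.
    Components orthogonal to delta are unchanged; the component along delta rotates
    with the geodesic in the plane spanned by q and delta."""
    t = np.linalg.norm(delta)
    if t < 1e-12:
        return xi
    d = delta / t
    a = d @ xi
    return xi + (np.cos(t) - 1.0) * a * d - np.sin(t) * a * q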

II-B The Geometric Results from [3]

Fig. 2: Why is dictionary learning over S^{n-1} tractable? Assume the target dictionary A_0 = I. Left: the large-sample objective function E[f(q; X_0)]. The only local minimizers are the standard basis vectors e_i and their negatives. Right: a visualization of the function as a height above the equatorial section e_3^⊥, i.e., span{e_1, e_2} ∩ B³. The derived function is obtained by assigning values of points on the upper hemisphere to their corresponding projections on the equatorial section. The minimizers for the derived function correspond to the minimizers of f. Around the minimizer in the section, the function exhibits a small region of strong convexity, a region of large gradient, and finally a region in which the direction away from the minimizer is a direction of negative curvature.

We reproduce the main geometric theorems from [3] here for the sake of completeness. To characterize the function landscape of f(q; X_0) over S^{n-1}, we mostly work with the function

(II.8)   g(w) ≐ f(q(w); X_0),

induced by the reparametrization

(II.9)   q(w) = (w, √(1 - ||w||²)),   w ∈ B^{n-1}.

Geometrically, this corresponds to projecting the function above the equatorial section e_n^⊥ onto e_n^⊥ (see Fig. 2 (right) for an illustration). In particular, we focus our attention on a smaller subset of the ball:

(II.10)   Γ ≐ {w : ||w|| < √((4n-1)/(4n))},

because Γ contains all points q(w) with q_n(w) > 1/(2√n). We can similarly characterize other parts of f on S^{n-1} using projections onto other equatorial sections.
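A short numerical sketch of the reparametrization (II.9) and the derived function (II.8), reusing f from the sketch after (I.1) (names are ours):

import numpy as np

def q_of_w(w):
    """Lift w in the unit ball of R^{n-1} to the upper hemisphere of S^{n-1}, as in (II.9)."""
    return np.append(w, np.sqrt(max(1.0 - w @ w, 0.0)))

def g(w, Yhat, mu=1e-2):
    """The derived function (II.8): the objective f viewed over the equatorial section."""
    return f(q_of_w(w), Yhat, mu)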

Theorem II.1 (High-dimensional landscape - orthogonal dictionary).

Suppose A_0 = I and hence Y = X_0. There exist positive constants such that for any admissible θ and μ, whenever

(II.11)

the following hold simultaneously with probability at least :

(II.12)
(II.13)
(II.14)

and the function g has exactly one local minimizer over the open set Γ, which satisfies

(II.15)

Here all the constants involved are positive.

Recall that the reason we only need to characterize the geometry for the case A_0 = I is that for any other orthogonal A_0, the function landscape is simply a rotated version of that for A_0 = I.

Theorem II.2 (High-dimensional landscape - complete dictionary).

Suppose A_0 is complete with condition number κ(A_0). There exist positive constants (the first being the same constant as in Theorem II.1) such that for any admissible θ and μ, when

(II.16)

and , , the following hold simultaneously with probability at least :

(II.17)
(II.18)
(II.19)

and the function g has exactly one local minimizer over the open set Γ, which satisfies

(II.20)

Here both constants are positive.

From the above theorems, it is clear that at any saddle point in the w space, the Hessian of g has at least one negative eigenvalue. Now the question is whether all saddle points of f on S^{n-1} are "ridable", because, as alluded to in the previous discussion, we need to perform the actual optimization in the q space. Instead of presenting a rigorous technical statement and detailed proof, we include here just an informal argument; our actual proof of algorithmic convergence runs back and forth between the w and q spaces, and this informality will not affect the arguments there.

It is very easy to verify the following fact (see the proof of Lemma II.13 in Section VI-G):

Thus, grad f(q(w)) vanishes if and only if ∇g(w) vanishes, implying that the Riemannian gradient will never be zero in the spherical region corresponding to the strong-gradient portion of Γ. Moreover, it is shown in Lemma II.11 below that the Riemannian Hessian is positive definite in the spherical region corresponding to the strongly convex portion, so there is no saddle point in this region either. Over the region of the sphere corresponding to Γ, potential saddle points thus lie only in the part corresponding to the negative-curvature portion. Theorem II.1 and Theorem II.2 imply that around each point in this region, a cross section of the function is locally strictly concave. Intuitively, under the mapping w ↦ q(w) the same happens in the q space, i.e., the Riemannian Hessian has a strictly negative eigenvalue.

II-C The Riemannian Trust-Region Algorithm over the Sphere

For a function f in Euclidean space, the typical TRM starts from some initialization q^(0), and produces a sequence of iterates q^(1), q^(2), …, by repeatedly minimizing a quadratic approximation to the objective function f over a ball centered around the current iterate.

For our f defined over S^{n-1}, given the previous iterate q^(r-1), the TRM produces the next movement by generating a solution δ^(r) to

(II.21)   min_{δ ∈ T_{q^(r-1)} S^{n-1}, ||δ|| ≤ Δ} f̂(δ; q^(r-1)),

where f̂ is the local quadratic approximation defined in (II.1) and Δ is the trust-region size. The solution is then pulled back to S^{n-1} from T_{q^(r-1)} S^{n-1}. If we choose the exponential map to pull back the movement (the exponential map is only one of many possibilities; for general manifolds other retraction schemes may be more practical; see the exposition on retractions in Chapter 4 of [68]), the next iterate then reads

(II.22)   q^(r) = exp_{q^(r-1)}(δ^(r)).

To solve the subproblem (II.21) numerically, we can take any matrix U ∈ R^{n × (n-1)} whose columns form an orthonormal basis for T_{q^(r-1)} S^{n-1}, and produce a solution ξ^(r) to

(II.23)   min_{ξ ∈ R^{n-1}, ||ξ|| ≤ Δ} f̂(U ξ; q^(r-1)).

The solution to (II.21) can then be recovered as δ^(r) = U ξ^(r).
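The sketch below assembles one such step numerically, reusing riemannian_grad_hess and exp_map from the earlier sketches; solve_trs stands for any exact solver of the trust-region subproblem (one possible implementation is the SDP sketch given after (II.25) below).

import numpy as np

def rtr_step(q, Yhat, Delta, mu=1e-2):
    """One Riemannian trust-region step (II.21)-(II.22), solved in coordinates as in (II.23)."""
    n = q.size
    rgrad, rhess = riemannian_grad_hess(q, Yhat, mu)
    # Orthonormal basis U for T_q S^{n-1}: top n-1 left singular vectors of I - q q^T.
    U = np.linalg.svd(np.eye(n) - np.outer(q, q))[0][:, :n - 1]
    g_coord = U.T @ rgrad                  # gradient in coordinates
    B_coord = U.T @ rhess @ U              # Hessian in coordinates
    xi = solve_trs(B_coord, g_coord, Delta)
    return exp_map(q, U @ xi)              # pull the step back to the sphere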

The problem (II.23) is an instance of the classic trust-region subproblem, i.e., minimizing a quadratic function subject to a single quadratic constraint. Albeit potentially nonconvex, this notable subproblem can be solved in polynomial time by several numerical methods [65, 57, 75, 76, 77, 78]. An approximate solution of the subproblem suffices to guarantee convergence in theory, and lessens the storage and computational burden in practice. We will deploy the approximate version in simulations. For simplicity, however, our subsequent analysis assumes the subproblem is solved exactly. We next briefly describe how one can deploy the semidefinite programming (SDP) approach [75, 76, 77, 78] to solve the subproblem exactly. This choice is due to the well-known effectiveness and robustness of the SDP approach on this problem. We introduce

(II.24)   Θ ≐ [ (1/2) A , (1/2) b ; (1/2) b^⊤ , 0 ] ∈ R^{n × n},   Z ≐ [ξ; 1][ξ; 1]^⊤,

where A ≐ U^⊤ Hess f(q^(r-1)) U and b ≐ U^⊤ grad f(q^(r-1)). The resulting SDP to solve is

(II.25)   min_{Z ⪰ 0} ⟨Θ, Z⟩,   subject to   tr(Z_{[1:n-1],[1:n-1]}) ≤ Δ²,   Z_{nn} = 1,

where Z_{[1:n-1],[1:n-1]} denotes the upper-left (n-1) × (n-1) block of Z. Once the problem (II.25) is solved to its optimum Z_⋆, one can provably recover the minimizer ξ_⋆ of (II.23) by computing the SVD of Z_⋆ and extracting, as a subvector, the first n-1 coordinates of the principal eigenvector (see Appendix B of [79]).
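A minimal cvxpy sketch of this lifted SDP follows; the bookkeeping of the cost matrix is our own arrangement (it reproduces the objective of (II.23) on rank-one Z) and may differ in minor details from (II.24)-(II.25).

import numpy as np
import cvxpy as cp

def solve_trs(B, g, Delta):
    """Solve min_xi g^T xi + 0.5 xi^T B xi s.t. ||xi|| <= Delta via the SDP lifting."""
    m = g.shape[0]
    Theta = np.block([[0.5 * B, 0.5 * g[:, None]],
                      [0.5 * g[None, :], np.zeros((1, 1))]])
    Z = cp.Variable((m + 1, m + 1), PSD=True)
    constraints = [cp.trace(Z[:m, :m]) <= Delta**2, Z[m, m] == 1]
    cp.Problem(cp.Minimize(cp.trace(Theta @ Z)), constraints).solve()
    # The relaxation is tight for this problem; recover xi from the principal eigenvector.
    w, V = np.linalg.eigh(Z.value)
    z = V[:, -1] * np.sqrt(max(w[-1], 0.0))
    return z[:m] / z[m] if abs(z[m]) > 1e-8 else z[:m]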

II-D Main Convergence Results

Using general convergence results on Riemannian TRM (see, e.g., Chapter 7 of [68]), it is not difficult to prove that the gradient sequence produced by TRM converges to zero (i.e., global convergence), or that the iterate sequence converges (at a quadratic rate) to a local minimizer if the initialization is already close to a local minimizer (i.e., local convergence). In this section, we show that under our probabilistic assumptions, these results can be substantially strengthened. In particular, the algorithm is guaranteed to produce an accurate approximation to a local minimizer of the objective function, in a number of iterations that is polynomial in the problem size, from arbitrary initializations. The arguments in the companion paper [3] showed that w.h.p. every local minimizer of (I.1) produces a close approximation to a row of X_0. Taken together, this implies that the algorithm efficiently produces a close approximation to one row of X_0.

Throughout the analysis, we assume the trust-region subproblem is exactly solved and the step-size parameter Δ is fixed. Our next two theorems summarize the convergence results for orthogonal and complete dictionaries, respectively.

Theorem II.3 (TRM convergence - orthogonal dictionary).

Suppose the dictionary A_0 is orthogonal. There exists a positive constant such that for all admissible θ and μ, whenever

with high probability the Riemannian trust-region algorithm with input data matrix Y, any initialization q^(0) on the sphere, and a step size Δ satisfying

(II.26)

returns a solution q̄ ∈ S^{n-1} that is ε-near to one of the local minimizers q_⋆ (i.e., ||q̄ - q_⋆|| ≤ ε) in at most

(II.27)

iterations. Here the first constant is as defined in Theorem II.1, and the remaining constants are all positive.

Theorem II.4 (TRM convergence - complete dictionary).

Suppose the dictionary A_0 is complete with condition number κ(A_0). There exists a positive constant such that for all admissible parameters, whenever

with high probability the Riemannian trust-region algorithm with input data matrix Ŷ, where Ŷ ≐ ((pθ)^{-1} Y Y^⊤)^{-1/2} Y, any initialization q^(0) on the sphere, and a step size Δ satisfying

(II.28)

returns a solution q̄ ∈ S^{n-1} that is ε-near to one of the local minimizers q_⋆ (i.e., ||q̄ - q_⋆|| ≤ ε) in at most

(II.29)

iterations. Here the first constant is as in Theorem II.1, and the remaining constants are all positive.

Our convergence result shows that for any target accuracy ε, the algorithm terminates within polynomially many steps. Specifically, the first summand in (II.27) or (II.29) is the number of steps the iterate sequence takes to enter the strongly convex region and get "reasonably" close to a local minimizer. All subsequent trust-region subproblems are then unconstrained (proved below), i.e., the norm constraint is inactive at the optimal point, and hence the steps behave like Newton steps. The second summand reflects the typical quadratic local convergence of the Newton steps.

Our estimate of the number of steps is pessimistic: the running time is a relatively high-degree polynomial in the problem parameters. We will discuss practical implementation details that help speed up the algorithm in Section IV. Our goal in stating the above results is not to provide a tight analysis, but to prove that the Riemannian TRM algorithm finds a local minimizer in polynomial time. For nonconvex problems, this is not entirely trivial: results of [80] show that in general it is NP-hard to find a local minimizer of a nonconvex function.

II-E Sketch of Proof for Orthogonal Dictionaries

The reason that our algorithm is successful derives from the geometry formalized in Theorem II.1. Basically, the sphere can be divided into three regions. Near each local minimizer, the function is strongly convex, and the algorithm behaves like a standard (Euclidean) TRM algorithm applied to a strongly convex function – in particular, it exhibits a quadratic asymptotic rate of convergence. Away from local minimizers, the function always exhibits either a strong gradient, or a direction of negative curvature (i.e., the Hessian has a strictly negative eigenvalue). The Riemannian TRM algorithm is capable of exploiting these quantities to reduce the objective value by at least a fixed amount in each iteration. The total number of iterations spent away from the vicinity of the local minimizers can be bounded by comparing this amount to the initial objective value. Our proofs follow exactly this line and make the various quantities precise.

Note that for any orthogonal A_0, f(q; A_0 X_0) = f(A_0^⊤ q; X_0). In words, this is the established fact that the function landscape of f(·; A_0 X_0) is a rotated version of that of f(·; X_0). Thus, any local minimizer q_⋆ of f(·; X_0) is rotated to A_0 q_⋆, a local minimizer of f(·; A_0 X_0). Also, if our algorithm generates the iterate sequence q^(0), q^(1), q^(2), … for f(·; X_0) upon initialization q^(0), it generates the iterate sequence A_0 q^(0), A_0 q^(1), A_0 q^(2), … for f(·; A_0 X_0). So w.l.o.g. it is adequate to prove the convergence results for the case A_0 = I. Hence, in this section (Section II-E), we write f(q) to mean f(q; X_0).

We partition the sphere into three regions, corresponding to the strongly convex, nonzero-gradient, and negative-curvature regions of Theorem II.1, respectively. The strongly convex region consists of a union of spherical caps of a certain radius, each centered around a signed standard basis vector ±e_i. The nonzero-gradient region consists of the set difference of a larger union of spherical caps, centered around the same signed basis vectors, and the strongly convex region. Finally, the negative-curvature region covers the rest of the sphere. We say a trust-region step is a strongly convex (respectively, nonzero-gradient or negative-curvature) step if the current iterate lies in the corresponding region. Since we use the geometric structures derived in Theorem II.1 and Corollary II.2 in [3], the conditions

(II.30)

are always in force.

At step r of the algorithm, suppose δ_⋆ is the minimizer of the trust-region subproblem (II.21). We call the step "constrained" if ||δ_⋆|| = Δ (the minimizer lies on the boundary, and hence the constraint is active), and call it "unconstrained" if ||δ_⋆|| < Δ (the minimizer lies in the relative interior, and hence the constraint is not in force). Thus, in the unconstrained case the optimality condition is (II.6).

The next lemma provides some estimates that are useful in various contexts.

Lemma II.5.

We have the following estimates:

Proof.

See Section VI-A. ∎

Our next lemma says that if the trust-region step size is small enough, one Riemannian trust-region step reduces the objective value by a certain amount whenever there is a descent direction.

Lemma II.6.

Suppose that the trust-region size Δ is suitably bounded, and there exists a tangent vector δ with ||δ|| ≤ Δ, such that

for some positive scalar. Then the trust-region subproblem produces a point with

where the remaining quantities are as defined in Lemma II.5.

Proof.

See Section VI-B. ∎

To show a decrease in objective value over the nonzero-gradient and negative-curvature regions, it is now enough to exhibit a descent direction at each point of these regions. The next two lemmas help us almost accomplish this goal. For convenience, we again choose to state the results for the "canonical" section in the vicinity of e_n and the projection map (II.9), with the idea that similar statements hold for the other symmetric sections.

Lemma II.7.

Suppose that the trust-region size , for some scalar , and that is -Lipschitz on an open ball centered at . Then there exists a tangent vector with , such that

Proof.

See Section VI-C. ∎

Lemma II.8.

Suppose that the trust-region size , for some , and that is Lipschitz on the open ball centered at . Then there exists a tangent vector with , such that

Proof.

See Section VI-D. ∎

One can take the parameters as prescribed in Theorem II.1, invoke the Lipschitz results in Proposition B.4 and Proposition B.3 (noting that the required bound holds w.h.p. by Lemma B.6), repeat the argument for the other symmetric regions, and conclude that w.h.p. the objective value decreases by at least a constant amount at each such step. The next proposition summarizes these results.

Proposition II.9.

Assume (II.30). In the nonzero-gradient and negative-curvature regions, each trust-region step reduces the objective value by at least

(II.31)

respectively, provided that