On the Convergence of Projected-Gradient Methods with Low-Rank Projections for Smooth Convex Minimization over Trace-Norm Balls and Related Problems

02/05/2019 ∙ by Dan Garber, et al. ∙ Technion

Smooth convex minimization over the unit trace-norm ball is an important optimization problem in machine learning, signal processing, statistics and other fields, that underlies many tasks in which one wishes to recover a low-rank matrix given certain measurements. While first-order methods for convex optimization enjoy optimal convergence rates, they require, in the worst case, computing a full-rank SVD on each iteration in order to compute the projection onto the trace-norm ball. These full-rank SVD computations, however, prohibit the application of such methods to large problems. A simple and natural heuristic to reduce the computational cost is to approximate the projection using only a low-rank SVD. This raises the question of whether, and under what conditions, this simple heuristic can indeed result in provable convergence to the optimal solution. In this paper we show that any optimal solution is the center of a Euclidean ball inside which the projected-gradient mapping admits rank that is at most the multiplicity of the largest singular value of the gradient vector. Moreover, the radius of the ball scales with the spectral gap of this gradient vector. We show how this readily implies the local convergence (i.e., from a "warm-start" initialization) of standard first-order methods, using only low-rank SVD computations. We also quantify the effect of "over-parameterization", i.e., using SVD computations with higher rank, on the radius of this ball, showing it can increase dramatically with moderately larger rank. We extend our results also to the settings of optimization with trace-norm regularization and optimization over bounded-trace positive semidefinite matrices. Our theoretical investigation is supported by concrete empirical evidence that demonstrates the correct convergence of first-order methods with low-rank projections on real-world datasets.


1 Introduction

The main subject of investigation in this paper is the following optimization problem:

$$\min_{X \in \mathbb{R}^{m\times n}:\ \|X\|_* \le 1} \; f(X), \tag{1}$$

where $f$ is convex and $\beta$-smooth (i.e., its gradient is $\beta$-Lipschitz), and $\|\cdot\|_*$ denotes the trace-norm, i.e., the sum of singular values (a.k.a. the nuclear norm).

Problem (1) has received much attention in recent years and has many applications in machine learning, signal processing, statistics and engineering, such as the celebrated matrix completion problem [10, 28, 20], affine rank minimization problems [29, 21], robust PCA [9], and more.

Many standard first-order methods, such as projected-gradient descent [26], Nesterov's accelerated gradient method [26] and FISTA [4], when applied to Problem (1), require computing on each iteration the projected-gradient mapping w.r.t. the trace-norm ball, given by

$$\mathcal{P}(X) = \Pi_{\|\cdot\|_*\le 1}\left[X - \eta\nabla f(X)\right], \tag{2}$$

for some current iterate $X$ and step-size $\eta$, where $\Pi_{\|\cdot\|_*\le 1}[\cdot]$ denotes the Euclidean projection onto the unit trace-norm ball.

It is well known that computing the projection step in (2) amounts to computing the singular value decomposition of the matrix $X - \eta\nabla f(X)$ and projecting the vector of its singular values onto the unit simplex (keeping the left and right singular vectors unchanged). Unfortunately, in the worst case a full-rank SVD computation is required, which amounts to $O(mn\min\{m,n\})$ runtime per iteration for $X \in \mathbb{R}^{m\times n}$. This naturally prohibits the use of such methods for large-scale problems in which both $m$ and $n$ are large.
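
To make this concrete, here is a minimal NumPy sketch of the exact projection just described (a full SVD followed by projecting the vector of singular values); the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def project_to_simplex(v, radius=1.0):
    # Project a nonnegative vector v onto {x >= 0 : sum(x) <= radius}
    # using the standard sort-and-threshold procedure.
    if v.sum() <= radius:
        return v
    u = np.sort(v)[::-1]
    cumsum = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (cumsum - radius))[0][-1]
    theta = (cumsum[rho] - radius) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def project_trace_norm_ball(Y, radius=1.0):
    # Exact Euclidean projection of Y onto {X : ||X||_* <= radius}:
    # full SVD, project the singular values, keep the singular vectors.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * project_to_simplex(s, radius)) @ Vt
```

The full SVD in the last function is exactly the expensive step the paper seeks to avoid.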

Since (2) requires, in general, expensive full-rank SVD computations, a very natural and simple heuristic to reduce the computational complexity is to replace the expensive projection operation with an approximate "projection", which (accurately) projects the matrix $(X - \eta\nabla f(X))_r$, the best rank-$r$ approximation of $X - \eta\nabla f(X)$. That is, we consider replacing $\mathcal{P}(X)$ with the operation

$$\widehat{\mathcal{P}}_r(X) = \Pi_{\|\cdot\|_*\le 1}\left[(X - \eta\nabla f(X))_r\right],$$

where $(A)_r$ corresponds to the rank-$r$ truncated SVD of $A$ (i.e., we keep only the top $r$ components of the SVD).

Using state-of-the-art Krylov subspace methods, such as the Power Iteration algorithm or the Lanczos algorithm (see for instance the classical text [16] and also the recent works [24, 3]), the rank-$r$ truncated SVD can be computed with runtime roughly proportional to $r\cdot mn$, a very significant speedup when $r \ll \min\{m,n\}$. Moreover, in many problems the gradient matrix is sparse (e.g., in the well-studied matrix completion problem), in which case further significant accelerations apply. The drawback, of course, is that when using the approximate procedure $\widehat{\mathcal{P}}_r$, the highly desired convergence guarantees of first-order methods need no longer hold.
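
As an illustration of this heuristic, the following sketch implements the approximate projection with SciPy's Lanczos-based truncated SVD; it reuses `project_to_simplex` from the previous sketch, and the names are ours.

```python
import numpy as np
from scipy.sparse.linalg import svds

def approx_project_trace_norm_ball(Y, r, radius=1.0):
    # Approximate "projection": compute only the top-r SVD components of Y
    # (Lanczos-based), then project those r singular values as before.
    U, s, Vt = svds(Y, k=r)              # svds returns ascending singular values
    order = np.argsort(s)[::-1]
    U, s, Vt = U[:, order], s[order], Vt[order, :]
    return (U * project_to_simplex(s, radius)) @ Vt
```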

A motivation for the plausible effectiveness of this heuristic is that Problem (1) is often used as a convex relaxation for non-convex rank-constrained optimization problems, which are often assumed to admit a low-rank global minimizer (as is the case in all of the examples provided above). Given this low-rank structure of the optimal solution, one may wonder whether storing and manipulating high-rank matrices when optimizing (1) is indeed mandatory, or whether, alternatively, at some stage during the run of the algorithm the iterates all become low-rank.

It is thus natural to ask: under which conditions is it possible to replace the projection $\mathcal{P}(\cdot)$ with the approximation $\widehat{\mathcal{P}}_r(\cdot)$, while keeping the original convergence guarantees of first-order methods?

Or, put differently, we ask for which $X$ (and a suitable rank parameter $r$) does

$$\widehat{\mathcal{P}}_r(X) = \mathcal{P}(X) \tag{3}$$

hold?

Our main result in this paper is the formulation and proof of the following proposition, presented at this point only informally.

Proposition 1.

For any optimal solution $X^*$ to Problem (1), if the truncated-SVD rank parameter $r$ is at least the multiplicity of the largest singular value of the gradient $\nabla f(X^*)$, then there exists a Euclidean ball centered at $X^*$ inside which (3) holds. Moreover, the radius of the ball scales with the spectral gap $\sigma_r(\nabla f(X^*)) - \sigma_{r+1}(\nabla f(X^*))$.

As we show, Proposition 1 readily implies that standard gradient methods such as the Projected Gradient Method, Nesterov's Accelerated Gradient Method, and FISTA, when initialized in the proximity of an optimal solution, converge with their original convergence guarantees, i.e., they produce the exact same sequences of iterates, when the exact Euclidean projection is replaced with the truncated-SVD-based projection $\widehat{\mathcal{P}}_r$.
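
For concreteness, here is a minimal sketch of projected gradient descent with the rank-$r$ approximate projection from the previous sketch. The arguments `grad_f`, `X0`, `eta` and `r` are hypothetical names we introduce here, and the equivalence with exact projected gradient descent only applies when the iterates remain inside the ball described by Proposition 1.

```python
def low_rank_projected_gradient(grad_f, X0, eta, r, n_iters=100, radius=1.0):
    # Projected gradient descent in which the exact projection is replaced
    # by the truncated-SVD "projection" approx_project_trace_norm_ball.
    X = X0
    for _ in range(n_iters):
        X = approx_project_trace_norm_ball(X - eta * grad_f(X), r, radius)
    return X
```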

Some complexity implications of our results for first-order methods for Problem (1) are summarized in Table 1, together with a comparison to other first-order methods.

The connection between the rank parameter $r$ in the approximate projection and the multiplicity of the largest singular value of the gradient at an optimal solution may seem unintuitive at first. In particular, one might expect that $r$ should be comparable directly with the rank of the optimal solution itself. However, as we show, the two are indeed tightly related. In particular, the radius of the ball around an optimal solution $X^*$ in which (3) holds is strongly related to spectral gaps in the gradient $\nabla f(X^*)$. This further implies "over-parameterization" results, in which we show how the radius of the ball in which (3) applies increases with the rank parameter $r$: it can increase quite dramatically with only a moderate increase in $r$. We also present two complementary results showing that an optimal solution whose rank is strictly smaller than this multiplicity implies that the optimization problem (1) is ill-posed in a certain sense, and that, in general, a result in the spirit of Proposition 1 may not hold when $r$ is smaller than this multiplicity.

Algorithm Conv. Rate SVD size Sol. Rank
   $\beta$-smooth and convex   
Proj. Grad. global
Acc. Grad. global
Frank-Wolfe [20] global
Proj. Grad. (this paper) local
Acc. Grad. (this paper) local
   $\beta$-smooth and $\alpha$-strongly convex   
Proj. Grad. global
Acc. Grad. global
ROR-FW [13] global
BlockFW [2] global
Proj. Grad. (this paper) local
Acc. Grad. (this paper) local
Table 1: Comparison of first-order methods for solving Problem (1). The 2nd column (from the left) states the type of convergence (either global, from an arbitrary initialization, or local, from a "warm-start"), the 3rd column states the number of iterations needed to reach an ε-approximate solution, the 4th column states an upper bound on the rank of the SVD required on each iteration, and the last column states an upper bound on the rank of the iterates produced by the method.

1.1 Organization of this paper

The rest of this paper is organized as follows. In the remainder of this section we discuss related work. In Section 2 we present our main result: we formalize and prove Proposition 1 in the context of Problem (1). In this section we also present several complementing results that further strengthen our claims. In Section 3 we demonstrate how the results of Section 2 readily imply the local convergence of standard projection-based first-order methods for Problem (1), using only low-rank SVD computations to compute the Euclidean projection. In Sections 4 and 5 we formalize and prove versions of Proposition 1 for smooth convex optimization with trace-norm regularization, and smooth convex optimization over the set of unit-trace positive semidefinite matrices, respectively. Finally, in Section 6 we present supporting empirical evidence.

1.2 Related work

The subject of efficient algorithms for low-rank matrix optimization problems has enjoyed significant interest in recent years. Below we survey some notable results, both for the convex problem (1) as well as other related convex models, and also for related non-convex optimization problems.

Convex methods:

Besides projection-based methods, other highly popular methods for Problem (1) are conditional gradient methods (a.k.a. the Frank-Wolfe algorithm) [12, 19, 20, 18]. These algorithms require only a rank-one SVD computation on each iteration, hence each iteration is very efficient; however, their convergence rates, which are typically of the form $O(1/t)$ for smooth problems (even when the objective is also strongly convex), are in general inferior to those of projection-based methods such as Nesterov's accelerated method [26] and FISTA [4]. Recently, several works have developed variants of the basic method with faster rates, though these hold only under the additional assumption that the objective is also strongly convex [13, 2, 14]. Additionally, these new variants require storing in memory potentially high-rank matrices, which may limit their applicability to large problems. In [31] the authors present a novel conditional gradient method which enjoys a low memory footprint for certain instances of (1), such as the well-known matrix completion problem; however, there is no improvement in the convergence rate beyond that of the standard method.
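
As a point of comparison, the following is a minimal sketch of a conditional-gradient (Frank-Wolfe) iteration over the trace-norm ball, whose linear-minimization step needs only a rank-one SVD of the gradient. The arguments `grad_f` and `X0` are illustrative names, not from the cited works.

```python
import numpy as np
from scipy.sparse.linalg import svds

def frank_wolfe_trace_norm_ball(grad_f, X0, n_iters=1000, radius=1.0):
    # Each iteration needs only the top singular-vector pair of the gradient,
    # which yields the extreme point of the ball minimizing the linearization.
    X = X0
    for t in range(n_iters):
        u, _, vt = svds(-grad_f(X), k=1)           # rank-one SVD of -gradient
        S = radius * np.outer(u[:, 0], vt[0, :])   # extreme point of the ball
        gamma = 2.0 / (t + 2.0)                    # standard step-size schedule
        X = (1.0 - gamma) * X + gamma * S
    return X
```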

Besides first-order conditional gradient-type methods, in [23] the authors present a second-order trust-region algorithm for the trace norm-regularized variant of (1).

Nonconvex methods:

Problem (1) is often considered as a convex relaxation of the non-convex problem of minimizing $f$ under an explicit rank constraint. Two popular approaches to solving this non-convex problem are i) applying projected gradient descent to the rank-constrained formulation, in which case the projection is onto the set of low-rank matrices, and ii) incorporating the rank constraint into the objective by considering the factorized objective $f(UV^\top)$, where $U$ and $V$ are $m\times r$ and $n\times r$ matrices, respectively, with $r$ an upper bound on the rank, but otherwise unconstrained. Obtaining global convergence guarantees for these non-convex optimization problems is a research direction of significant interest in recent years; however, efficient algorithms are usually obtained only under specific statistical assumptions on the data, which we do not make in the current work; see for instance [21, 22, 11, 6, 15] and references therein.
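
To illustrate the factorized approach (ii), here is a sketch of a single gradient step on the objective $f(UV^\top)$; `grad_f` is assumed to return the gradient of $f$ at the full $m\times n$ matrix, and the names are ours.

```python
import numpy as np

def factorized_gradient_step(grad_f, U, V, eta):
    # One gradient step on g(U, V) = f(U V^T); by the chain rule,
    #   dg/dU = grad_f(U V^T) V   and   dg/dV = grad_f(U V^T)^T U.
    G = grad_f(U @ V.T)
    return U - eta * (G @ V), V - eta * (G.T @ U)
```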

In the works [5, 27] the authors consider first-order methods for factorized formulations of problems related to (1), which are not based on statistical assumptions. In these works the authors establish the convergence of specific algorithms from a good initialization point to the global low-rank optimum with convergence rates similar to that of the standard projected gradient descent method.

2 Optimization over the Unit Trace-Norm Ball

We begin by introducing some notation. For a positive integer $n$ we let $[n]$ denote the set $\{1,\dots,n\}$. We let $\langle\cdot,\cdot\rangle$ denote the standard inner product for matrices, i.e., $\langle A, B\rangle = \mathrm{Tr}(A^\top B)$. For a real matrix $A$, we let $\sigma_i(A)$ denote its $i$th largest singular value (including multiplicities), and we let $\mathrm{mult}_i(A)$ denote the multiplicity of the $i$th largest singular value. Similarly, for a real symmetric matrix $A$, we let $\lambda_i(A)$ denote its $i$th largest (signed) eigenvalue, and we let $\mathrm{mult}^{\lambda}_i(A)$ denote the multiplicity of the $i$th largest eigenvalue. We denote by $\mathcal{X}^*$ the set of optimal solutions to Problem (1), and by $f^*$ the corresponding optimal value.

For any $X \in \mathbb{R}^{m\times n}$, step-size $\eta > 0$ and radius $\tau > 0$, we denote the projected-gradient mapping w.r.t. the trace-norm ball of radius $\tau$ by

$$\mathcal{P}_{\tau}(X) = \Pi_{\|\cdot\|_*\le\tau}\left[X - \eta\nabla f(X)\right].$$

When $\tau = 1$, i.e., when we consider the unit trace-norm ball, we omit the subscript and simply write $\mathcal{P}(X)$.

Given an optimal solution $X^*$, a step-size $\eta > 0$, and an integer $r$ in the range $\{\mathrm{mult}_1(\nabla f(X^*)),\dots,\min\{m,n\}\}$, we let $R_r(X^*,\eta)$ denote the radius of the largest Euclidean ball centered at $X^*$ such that for all $X$ in the ball it holds that $\mathrm{rank}(\mathcal{P}(X)) \le r$. Equivalently, $R_r(X^*,\eta)$ is the solution to the optimization problem

$$R_r(X^*,\eta) = \sup\left\{R \ge 0 \;:\; \forall X \in \mathcal{B}(X^*,R),\ \ \mathrm{rank}\left(\mathcal{P}(X)\right) \le r\right\},$$

where $\mathcal{B}(X^*,R)$ denotes the Euclidean ball of radius $R$ centered at $X^*$.

Similarly, for any $\tau > 0$ and $r$ in the same range, we define $R_{r,\tau}(X^*,\eta)$ analogously, with respect to the mapping $\mathcal{P}_{\tau}$.

Towards formalizing and proving Proposition 1, deriving lower bounds on the radius $R_r(X^*,\eta)$ will be our main interest in this section.

Since our objective is to study the properties of the projected-gradient mapping over the trace-norm ball, we begin with the following well-known lemma, which connects the SVD of the point to be projected with the resulting projection.

Lemma 1 (projection onto the trace-norm ball).

Fix a parameter $\tau > 0$. Let $A \in \mathbb{R}^{m\times n}$ and consider its singular-value decomposition $A = \sum_{i=1}^{\min\{m,n\}} \sigma_i u_i v_i^\top$. If $\sum_i \sigma_i > \tau$, then the projection of $A$ onto the trace-norm ball of radius $\tau$ is given by

$$\Pi_{\|\cdot\|_*\le\tau}[A] = \sum_{i=1}^{\min\{m,n\}} \max\{\sigma_i - \lambda, 0\}\, u_i v_i^\top, \tag{4}$$

where $\lambda > 0$ is the unique solution to the equation $\sum_i \max\{\sigma_i - \lambda, 0\} = \tau$.

Moreover, if there exists an integer $k$ such that $\sum_{i=1}^{k}(\sigma_i - \sigma_{k+1}) \ge \tau$, then $\mathrm{rank}\left(\Pi_{\|\cdot\|_*\le\tau}[A]\right) \le k$.

Proof.

The first part of the lemma is a well-known fact. The second part follows from the simple observation that if $\sum_{i=1}^{k}(\sigma_i - \sigma_{k+1}) \ge \tau$ for some $k$, then $\lambda$, as defined in the lemma, must satisfy $\lambda \ge \sigma_{k+1}$, in which case Eq. (4) sets all components of the SVD of $A$ beyond the top $k$ to zero, and hence the projection is of rank at most $k$. ∎
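
To make Lemma 1 concrete, consider a small numerical instance (the numbers are purely illustrative): for singular values $\sigma = (0.9,\, 0.5,\, 0.2)$ and radius $\tau = 1$,

$$\sum_i \max\{\sigma_i - \lambda, 0\} = \tau \;\Longrightarrow\; \lambda = 0.2, \qquad \text{projected singular values } (0.7,\ 0.3,\ 0),$$

so the projection has rank 2, consistent with the second part of the lemma (here $\sum_{i=1}^{2}(\sigma_i - \sigma_3) = 1.0 \ge \tau$).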

The following lemma, which connects the singular value decomposition of an optimal solution with that of its corresponding gradient, will play an important technical role in our analysis. The proof of the lemma follows essentially from simple optimality conditions.

Lemma 2.

Let $X^*$ be any optimal solution and write its singular value decomposition as $X^* = \sum_{i=1}^{\mathrm{rank}(X^*)} \sigma_i u_i v_i^\top$. Then, the negative gradient $-\nabla f(X^*)$ admits a singular-value decomposition such that the set of pairs of vectors $\{(u_i, v_i)\}_{i=1}^{\mathrm{rank}(X^*)}$ is a set of top singular-vector pairs of $-\nabla f(X^*)$, corresponding to the largest singular value $\sigma_1(\nabla f(X^*))$.

Proof.

First, note that if then the claim holds trivially. Thus, henceforth we consider the case .

It suffices to show that for all it holds that .

Assume by contradiction that for some it holds that . Let denote a singular vector pair corresponding to the top singular value . Observe that for all , the point is a feasible solution to Problem (1), i.e., . Moreover, it holds that

which clearly contradicts the optimality of . ∎

Corollary 1.

For any optimal solution $X^* \in \mathcal{X}^*$ it holds that $\mathrm{rank}(X^*) \le \mathrm{mult}_1(\nabla f(X^*))$. Moreover, if $\nabla f$ is non-zero over the unit trace-norm ball, it holds that

$$\max_{X^*\in\mathcal{X}^*} \mathrm{rank}(X^*) \;\le\; \min_{X^*\in\mathcal{X}^*} \mathrm{mult}_1\left(\nabla f(X^*)\right). \tag{5}$$
Proof.

Lemma 2 directly implies that for all it holds that .

For the second part of the lemma, suppose there exist such that .

Since , it follows from simple optimality conditions that , which together with Lemma 2 implies that . Moreover, since and , it follows that .

Thus, using again the convexity of we have that

and hence we arrive at a contradiction. ∎

One may wonder whether the reversed inequality to (5) also holds (i.e., whether (5) in fact holds with equality). The following simple example shows that in general the inequality can be strict. Consider the following example.

for some .

Clearly, using Lemma 1, the problem admits a unique rank-one optimal solution, namely the diagonal matrix whose only non-zero entry is the first entry along the main diagonal, which equals 1. However, one can easily observe that the multiplicity of the largest singular value of the gradient at this solution is strictly larger than 1, meaning that the inequality (5) is strict.

While the above example demonstrates that in general the inequality (5) can be strict, and that, as a result, Proposition 1 may not imply significant computational benefits for Problem (1), the following lemma shows that such cases always imply that the optimization problem (1) is ill-posed in the following sense: increasing the radius of the trace-norm ball by an arbitrarily small amount will cause the projected-gradient mapping to map such an originally low-rank solution to a higher-rank matrix, implying a certain instability of low-rank optimal solutions.

Lemma 3 (gap necessary for stability of rank of optimal solutions).

Suppose there exists an optimal solution $X^* \in \mathcal{X}^*$ of rank $r^*$ such that $r^* < \mathrm{mult}_1(\nabla f(X^*))$, and suppose that $\nabla f(X^*) \ne 0$. Then, for any step-size $\eta > 0$ and any $\delta > 0$ small enough, the projected-gradient mapping at $X^*$ w.r.t. the trace-norm ball of radius $1+\delta$ satisfies $\mathrm{rank}\left(\mathcal{P}_{1+\delta}(X^*)\right) > r^*$.

Proof.

Fix some and denote . Using Lemma 2 we have that the singular values of are given by

where are the singular values of . Since , which implies that , it holds that . Let be such that . Then, by Lemma 1, we have that the projected-gradient mapping w.r.t. the trace-norm ball of radius satisfies:

where satisfies: . Observe that for , we have that

Thus, it must hold that . However, then it follows that for all , and thus, . ∎

The following lemma demonstrates why setting the rank of the truncated-SVD projection to be at least $\mathrm{mult}_1(\nabla f(X^*))$ is necessary. The lemma shows that, in general, a result similar in spirit to Proposition 1 may not hold with an SVD rank parameter $r$ satisfying $r < \mathrm{mult}_1(\nabla f(X^*))$.

Lemma 4.

Fix a positive integer and . Then, for any small enough and for any , there exists a convex and -smooth function such that

  1. admits a rank- minimizer over the unit trace-norm ball for which it holds that and the spectral gap is ,

  2. there exists a matrix such that , , , and for any it holds that .

Proof.

Consider the following function .

where denotes the indicator for the th diagonal entry. Note that is indeed -smooth.

We set values , . It is not hard to verify that the rank- matrix

is a minimizer of over the unit trace-norm ball. In particular, it holds that

Hence, we have .

Consider now the matrix given by

Note that is rank- as well. Clearly, it holds that .

Also,

Thus, for any step-size we have

Note that has positive singular values, which we denote (in non-increasing order) by . In particular, for any it holds that for .

Note that . Thus, by Lemma 1, the singular values of are given by , where satisfies

For to hold, it must hold that . We consider now two cases.

In the first case we have , i.e., , . Then, for to hold, it must hold that

and hence we arrive at a contradiction.

In the second case we have , i.e., , . As in the first case, in order for to hold, it must hold that

and thus, in this case also we arrive at a contradiction.

We thus conclude that . ∎

We now present and prove our main technical theorem, which lower-bounds the radius of the ball around an optimal solution $X^*$ in which the projected-gradient mapping has rank at most $r$, hence proving Proposition 1.

Theorem 1.

Assume is non-zero over the unit trace-norm ball and fix some . Let denote the multiplicity of , and let denote the singular values of (including multiplicities). Then, for any it holds that

(6)

More generally, for any and , it holds that

(7)

Moreover, for any and , it holds that

(8)
Proof.

Throughout the proof we assume without loss of generality that . Fix a step-size .

Denote and let denote the singular values of . Let us also denote by the singular values of , and . From Lemma 2 we can deduce that

(9)

For any integer let us define

Since , it follows that , we have that

(10)

where (a) follows from (2).

Now, given some , denote and let denote the singular values of . It holds that

(11)

where (a) follows from Ky Fan's inequality for the singular values, and (b) follows from the $\beta$-smoothness of $f$.

Also, similarly, using Weyl’s inequality, it holds that

(12)

Combining Eq. (10), (11), (12), we have that

(13)

Thus, it follows that if satisfies:

we have that , which implies via Lemma 1 that . This proves (6), (7).

Alternatively, for any , using the more general version of Weyl’s inequality, we can replace Eq. (12) with

(14)

Thus, similarly to Eq. (13), but with Eq. (12) replaced by Eq. (14), we obtain