The main subject of investigation in this paper is the following optimization problem:
$$\min_{X\in\mathbb{R}^{m\times n}:\ \Vert X\Vert_*\le 1} f(X), \qquad (1)$$
where $f$ is convex and $\beta$-smooth (i.e., its gradient is Lipschitz with constant $\beta$), and $\Vert\cdot\Vert_*$ denotes the trace-norm, i.e., the sum of singular values (a.k.a. the nuclear norm).
Problem (1) has received much attention in recent years and has many applications in machine learning, signal processing, statistics and engineering, such as the celebrated matrix completion problem [10, 28, 20], affine rank minimization problems [29, 21], robust PCA, and more.
Many standard first-order methods, such as projected gradient descent, Nesterov's accelerated gradient method and FISTA, when applied to Problem (1), require on each iteration to compute the projected-gradient mapping w.r.t. the trace-norm ball, given by
$$\Pi\left[X - \eta\nabla f(X)\right], \qquad (2)$$
for some current iterate $X$ and step-size $\eta > 0$, where $\Pi[\cdot]$ denotes the Euclidean projection onto the unit trace-norm ball.
It is well known that computing the projection step in (2) amounts to computing the singular value decomposition of the matrix $X - \eta\nabla f(X)$ and projecting the vector of singular values onto the unit simplex (keeping the left and right singular vectors unchanged). Unfortunately, in the worst case a full-rank SVD computation is required, which amounts to $O(m^2 n)$ runtime per iteration, assuming $m \le n$. This naturally prohibits the use of such methods for large-scale problems in which both $m$ and $n$ are large.
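To make the projection step concrete, the following minimal numpy sketch computes the Euclidean projection onto the trace-norm ball exactly as described above: a full SVD followed by a projection of the singular values onto the simplex. The helper names and the toy objective are ours, for illustration only.

```python
import numpy as np

def project_simplex(v, tau=1.0):
    """Euclidean projection of vector v onto {w : w >= 0, sum(w) = tau}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    # largest index rho (0-based) with u_rho - (css_rho - tau)/(rho+1) > 0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - tau))[0][-1]
    lam = (css[rho] - tau) / (rho + 1.0)
    return np.maximum(v - lam, 0.0)

def project_trace_ball(Y, tau=1.0):
    """Euclidean projection of Y onto the trace-norm ball {X : ||X||_* <= tau}."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    if s.sum() <= tau:
        return Y  # already inside the ball
    return (U * project_simplex(s, tau)) @ Vt

# one projected-gradient step for the toy objective f(X) = 0.5*||X - M||_F^2
M = np.diag([2.0, 1.0, 1.0])
X = np.zeros((3, 3))
eta = 1.0
X_next = project_trace_ball(X - eta * (X - M))  # = Pi[M] = diag(1, 0, 0)
```

Note that even though the input `M` has three non-zero singular values, the projection is rank-one: the simplex projection zeroes out the bottom singular values.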
Since (2) requires in general expensive full-rank SVD computations, a very natural and simple heuristic to reduce the computational complexity is to replace the expensive projection operation with an approximate "projection", which (accurately) projects the best rank-$r$ approximation of $X - \eta\nabla f(X)$. That is, we consider replacing $\Pi\left[X - \eta\nabla f(X)\right]$ with the operation
$$\hat{\Pi}_r\left[X - \eta\nabla f(X)\right] := \Pi\left[\mathrm{SVD}_r\left(X - \eta\nabla f(X)\right)\right],$$
where $\mathrm{SVD}_r(\cdot)$ corresponds to the rank-$r$ truncated SVD (i.e., we keep only the top $r$ components of the SVD).
Using state-of-the-art Krylov subspace methods, such as the Power Iterations algorithm or the Lanczos algorithm (see for instance classical texts on numerical linear algebra and also the recent works [24, 3]), $\mathrm{SVD}_r(\cdot)$ can be computed with runtime proportional to only $O(mnr)$, a very significant speedup when $r \ll \min\{m,n\}$. Moreover, in many problems the gradient matrix $\nabla f(X)$ is sparse (e.g., in the well-studied matrix completion problem), in which case further significant accelerations apply. The drawback, of course, is that when using the approximate procedure $\hat{\Pi}_r$, the highly desired convergence guarantees of first-order methods need no longer hold.
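A sketch of the approximate operation $\hat{\Pi}_r$ (helper names ours): truncate to the top $r$ SVD components, then project exactly. Here the truncation reuses a full SVD for clarity; at scale the top-$r$ factors would come from a Krylov method such as `scipy.sparse.linalg.svds`.

```python
import numpy as np

def project_simplex(v, tau=1.0):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = tau}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - tau))[0][-1]
    return np.maximum(v - (css[rho] - tau) / (rho + 1.0), 0.0)

def truncated_projection(Y, r, tau=1.0):
    """Approximate projection: exactly project the best rank-r approximation
    of Y onto the trace-norm ball of radius tau."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)  # a Krylov method at scale
    U, s, Vt = U[:, :r], s[:r], Vt[:r, :]
    w = project_simplex(s, tau) if s.sum() > tau else s
    return (U * w) @ Vt

# when the exact projection happens to have rank <= r, the two coincide:
Y = np.diag([2.0, 1.0, 0.2])
P_hat = truncated_projection(Y, r=1)   # here equal to Pi[Y] = diag(1, 0, 0)
```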
A motivation for the plausible effectiveness of this heuristic is that Problem (1) is often used as a convex relaxation for non-convex rank-constrained optimization problems, which are often assumed to admit a low-rank global minimizer (as is the case in all examples provided above). Given this low-rank structure of the optimal solution, one may wonder whether storing and manipulating high-rank matrices when optimizing (1) is indeed mandatory, or whether, at some stage during the run of the algorithm, the iterates all become low-rank.
It is thus natural to ask: under which conditions is it possible to replace the projection $\Pi$ with the approximation $\hat{\Pi}_r$, while keeping the original convergence guarantees of first-order methods? Or, put differently, we ask for which $X$ (and a suitable rank parameter $r$) does
$$\hat{\Pi}_r\left[X - \eta\nabla f(X)\right] = \Pi\left[X - \eta\nabla f(X)\right]. \qquad (3)$$
Our main result in this paper is the formulation and proof of the following proposition, presented at this point only informally.
As we show, Proposition 1 readily implies that standard gradient methods such as the Projected Gradient Method, Nesterov's Accelerated Gradient Method, and FISTA, when initialized in the proximity of an optimal solution, converge with their original convergence guarantees, i.e., they produce the exact same sequences of iterates when the exact Euclidean projection $\Pi$ is replaced with the truncated-SVD-based projection $\hat{\Pi}_r$.
The connection between $r$, the rank parameter in the approximate projection $\hat{\Pi}_r$, and the rank of the optimal solution may seem unintuitive at first. In particular, one might expect that $r$ should be comparable directly with $\mathrm{rank}(X^*)$. However, as we show, they are indeed tightly related. In particular, the radius of the ball around an optimal solution $X^*$ in which (3) holds is strongly related to spectral gaps in the gradient matrix $\nabla f(X^*)$. This further implies "over-parameterization" results, in which we show how the radius of the ball in which (3) applies increases with the rank parameter $r$, showing it can increase quite dramatically with only a moderate increase in $r$. We also bring two complementary results showing that a strict gap between the rank of an optimal solution and the multiplicity of the largest singular value of its gradient implies that the optimization problem (1) is ill-posed in a sense, and that in general, a result in the spirit of Proposition 1 may not hold when $r$ is smaller than that multiplicity.
| Algorithm | Conv. | Rate | SVD size | Sol. Rank |
|---|---|---|---|---|
| $\beta$-smooth and convex | | | | |
| Proj. Grad. (this paper) | local | $O(1/t)$ | | |
| Acc. Grad. (this paper) | local | $O(1/t^2)$ | | |
| $\beta$-smooth and $\alpha$-strongly convex | | | | |
| Proj. Grad. (this paper) | local | linear | | |
| Acc. Grad. (this paper) | local | linear | | |
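The local-equivalence phenomenon summarized above can be sketched numerically. In the toy instance below (our choice of objective and parameters), the optimal solution is rank-one and the top singular value of the gradient at the optimum has multiplicity one, so near the optimum the rank-1 truncated projection reproduces the exact projected-gradient iterates.

```python
import numpy as np

def project_simplex(v, tau=1.0):
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - tau))[0][-1]
    return np.maximum(v - (css[rho] - tau) / (rho + 1.0), 0.0)

def projection(Y, r=None):
    """Exact projection onto the unit trace-norm ball; if r is given,
    project the best rank-r approximation of Y instead."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    if r is not None:
        U, s, Vt = U[:, :r], s[:r], Vt[:r, :]
    if s.sum() > 1.0:
        s = project_simplex(s, 1.0)
    return (U * s) @ Vt

# toy instance: f(X) = 0.5*||X - M||_F^2 (beta = 1)
M = np.diag([3.0, 0.5, 0.2])
grad = lambda X: X - M
eta = 0.5

X_exact = np.diag([0.9, 0.05, 0.0])   # initialization near the optimum
X_trunc = X_exact.copy()
for _ in range(10):
    X_exact = projection(X_exact - eta * grad(X_exact))
    X_trunc = projection(X_trunc - eta * grad(X_trunc), r=1)
# near the optimum the two sequences of iterates coincide exactly
```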
1.1 Organization of this paper
The rest of this paper is organized as follows. In the remainder of this section we discuss related work. In Section 2 we present our main result: we formalize and prove Proposition 1 in the context of Problem (1). In this section we also present several complementary results that further strengthen our claims. In Section 3 we demonstrate how the results of Section 2 readily imply the local convergence of standard projection-based first-order methods for Problem (1), using only low-rank SVD to compute the Euclidean projection. In Sections 4 and 5 we formalize and prove versions of Proposition 1 for smooth convex optimization with trace-norm regularization, and smooth convex optimization over the set of unit-trace positive semidefinite matrices, respectively. Finally, in Section 6 we present supporting empirical evidence.
1.2 Related work
The subject of efficient algorithms for low-rank matrix optimization problems has enjoyed significant interest in recent years. Below we survey some notable results both for the convex problem (1), as well as other related convex models, and also for related non-convex optimization problems.
Besides projection-based methods, other highly popular methods for Problem (1) are conditional gradient methods (a.k.a. the Frank-Wolfe algorithm) [12, 19, 20, 18]. These algorithms require only a rank-one SVD computation on each iteration, hence each iteration is very efficient; however, their convergence rates, which are typically of the form $O(1/t)$ for smooth problems (even when the objective is also strongly convex), are in general inferior to those of projection-based methods such as Nesterov's accelerated method and FISTA. Recently, several works have developed variants of the basic method with faster rates, though these hold only under the additional assumption that the objective is also strongly convex [13, 2, 14]. Additionally, these new variants require storing in memory potentially high-rank matrices, which may limit their applicability to large problems. A recently proposed conditional gradient method enjoys a low-memory footprint for certain instances of (1), such as the well-known matrix completion problem; however, it offers no improvement in convergence rate beyond that of the standard method.
Problem (1) is often considered as a convex relaxation of the non-convex problem of minimizing $f$ under an explicit rank constraint. Two popular approaches to solving this non-convex problem are i) applying projected gradient descent to the rank-constrained formulation, in which case the projection is onto the set of low-rank matrices, and ii) incorporating the rank constraint into the objective by considering the factorized objective $f(UV^\top)$, where $U$ and $V$ are $m\times k$ and $n\times k$ matrices, respectively, $k$ being an upper bound on the rank, but otherwise unconstrained. Obtaining global convergence guarantees for these non-convex optimization problems has been a research direction of significant interest in recent years; however, efficient algorithms are usually obtained only under specific statistical assumptions on the data, which we do not make in this current work; see for instance [21, 22, 11, 6, 15] and references therein.
In the works [5, 27] the authors consider first-order methods for factorized formulations of problems related to (1), which are not based on statistical assumptions. In these works the authors establish the convergence of specific algorithms from a good initialization point to the global low-rank optimum, with convergence rates similar to those of the standard projected gradient descent method.
2 Optimization over the Unit Trace-Norm Ball
We begin by introducing some notation. For a positive integer $n$, we let $[n]$ denote the set $\{1, 2, \dots, n\}$. We let $\langle\cdot,\cdot\rangle$ denote the standard inner product for matrices, i.e., $\langle A, B\rangle = \mathrm{Tr}(A^\top B)$. For a real matrix $A$, we let $\sigma_i(A)$ denote its $i$th largest singular value (including multiplicities), and we let $\mathrm{mult}_i(A)$ denote the multiplicity of the $i$th largest singular value. Similarly, for a real symmetric matrix $A$, we let $\lambda_i(A)$ denote its $i$th largest (signed) eigenvalue, and we let $\mathrm{mult}_i(A)$ denote the multiplicity of the $i$th largest eigenvalue. We denote by $\mathcal{X}^*$ the set of optimal solutions to Problem (1), and by $f^*$ the corresponding optimal value.
For any $X\in\mathbb{R}^{m\times n}$, step-size $\eta > 0$ and radius $\tau > 0$, we denote the projected gradient mapping w.r.t. the trace-norm ball of radius $\tau$:
$$\Pi_\tau\left[X - \eta\nabla f(X)\right],$$
where $\Pi_\tau[\cdot]$ denotes the Euclidean projection onto the trace-norm ball of radius $\tau$. When $\tau = 1$, i.e., we consider the unit trace-norm ball, we will omit the subscript and simply write $\Pi[\cdot]$.
Given an optimal solution $X^*\in\mathcal{X}^*$, a step-size $\eta > 0$, and an integer $r$ in the range $\mathrm{mult}_1(\nabla f(X^*)) \le r < \min\{m,n\}$, we let $R_r(X^*, \eta)$ denote the radius of the largest Euclidean ball centered at $X^*$, such that for all $X$ in the ball it holds that $\mathrm{rank}\left(\Pi\left[X - \eta\nabla f(X)\right]\right) \le r$. Or equivalently, $R_r(X^*, \eta)$ is the solution to the optimization problem
$$R_r(X^*, \eta) = \sup\left\{R \ge 0 \;:\; \forall X\in B(X^*, R):\ \mathrm{rank}\left(\Pi\left[X - \eta\nabla f(X)\right]\right) \le r\right\},$$
where $B(X^*, R)$ denotes the Euclidean ball of radius $R$ centered at $X^*$.
Similarly, for any $\eta > 0$ and $r$ in the same range, we also define
$$R_r(\eta) = \inf_{X^*\in\mathcal{X}^*} R_r(X^*, \eta).$$
Towards formalizing and proving Proposition 1, deriving lower bounds on the radius $R_r(X^*, \eta)$ will be our main interest in this section.
Since our objective is to study the properties of the projected-gradient mapping over the trace-norm ball, we begin with the following well-known lemma, which connects the SVD of the point to project with the resulting projection.
Lemma 1 (projection onto the trace-norm ball).
Fix a parameter $\tau > 0$. Let $X\in\mathbb{R}^{m\times n}$ and consider its singular-value decomposition $X = \sum_{i=1}^{\min\{m,n\}}\sigma_i u_i v_i^\top$. If $\Vert X\Vert_* > \tau$, then the projection of $X$ onto the trace-norm ball of radius $\tau$ is given by
$$\Pi_\tau[X] = \sum_{i=1}^{\min\{m,n\}}\max\{0,\ \sigma_i - \lambda\}\, u_i v_i^\top, \qquad (4)$$
where $\lambda > 0$ is the unique solution to the equation $\sum_{i=1}^{\min\{m,n\}}\max\{0,\ \sigma_i - \lambda\} = \tau$.
Moreover, if there exists $r$ such that $\sum_{i=1}^{r}\left(\sigma_i - \sigma_{r+1}\right) \ge \tau$, then $\mathrm{rank}\left(\Pi_\tau[X]\right) \le r$.
The first part of the lemma is a well-known fact. The second part of the lemma comes from the simple observation that if $\sum_{i=1}^{r}(\sigma_i - \sigma_{r+1}) \ge \tau$ for some $r$, then $\lambda$, as defined in the lemma, must satisfy $\lambda \ge \sigma_{r+1}$, in which case Eq. (4) sets all bottom components of the SVD of $X$ to zero, and hence the projection is of rank at most $r$. ∎
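The rank condition in the second part of the lemma is easy to check numerically. The following sketch (helper names ours) finds, on a random instance, the smallest $r$ for which the condition holds and verifies that the projection indeed has rank at most $r$.

```python
import numpy as np

def project_simplex(v, tau=1.0):
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - tau))[0][-1]
    return np.maximum(v - (css[rho] - tau) / (rho + 1.0), 0.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
tau = 1.0

# smallest r with sum_{i<=r}(sigma_i - sigma_{r+1}) >= tau (the lemma's condition)
cands = [k for k in range(1, len(s)) if (s[:k] - s[k]).sum() >= tau]
r = cands[0] if cands else len(s)

P = (U * project_simplex(s, tau)) @ Vt   # projection (||A||_* > tau here)
rank_P = int(np.sum(np.linalg.svd(P, compute_uv=False) > 1e-9))
# the lemma guarantees rank_P <= r
```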
The following lemma, which connects the singular value decomposition of an optimal solution with that of its corresponding gradient matrix, will play an important technical role in our analysis. The proof of the lemma follows essentially from simple optimality conditions.

Lemma 2.
Let $X^*\in\mathcal{X}^*$ be any optimal solution and write its singular value decomposition as $X^* = \sum_{i=1}^{r^*}\sigma_i(X^*)\, u_i v_i^\top$, where $r^* = \mathrm{rank}(X^*)$. Then, the gradient matrix $\nabla f(X^*)$ admits a singular-value decomposition such that the set of pairs $\{(-u_i, v_i)\}_{i=1}^{r^*}$ is a set of top singular-vector pairs of $\nabla f(X^*)$ which corresponds to the largest singular value $\sigma_1(\nabla f(X^*))$.
First, note that if $\nabla f(X^*) = 0$ then the claim holds trivially. Thus, henceforth we consider the case $\nabla f(X^*) \neq 0$.
It suffices to show that for all $i\in[r^*]$ it holds that $u_i^\top \nabla f(X^*)\, v_i = -\sigma_1(\nabla f(X^*))$.
Assume by contradiction that for some $i\in[r^*]$ it holds that $u_i^\top \nabla f(X^*)\, v_i > -\sigma_1(\nabla f(X^*))$. Let $(u, v)$ denote a singular vector pair of $-\nabla f(X^*)$ corresponding to the top singular value $\sigma_1(\nabla f(X^*))$. Observe that for all $t\in[0, \sigma_i(X^*)]$, the point $X(t) = X^* + t\left(u v^\top - u_i v_i^\top\right)$ is a feasible solution to Problem (1), i.e., $\Vert X(t)\Vert_* \le 1$. Moreover, it holds that
$$\langle \nabla f(X^*),\ X(t) - X^*\rangle = t\left(u^\top \nabla f(X^*)\, v - u_i^\top\nabla f(X^*)\, v_i\right) = t\left(-\sigma_1(\nabla f(X^*)) - u_i^\top\nabla f(X^*)\, v_i\right) < 0,$$
which clearly contradicts the optimality of $X^*$. ∎
For any $X^*\in\mathcal{X}^*$ it holds that
$$\mathrm{rank}(X^*) \le \mathrm{mult}_1\left(\nabla f(X^*)\right). \qquad (5)$$
Moreover, if $\nabla f$ is non-zero over the unit trace-norm ball, it holds that $\Vert X^*\Vert_* = 1$.
Lemma 2 directly implies that for all $X^*\in\mathcal{X}^*$ it holds that $\mathrm{rank}(X^*) \le \mathrm{mult}_1(\nabla f(X^*))$.
For the second part of the lemma, suppose by contradiction that there exists $X^*\in\mathcal{X}^*$ such that $\Vert X^*\Vert_* < 1$.
Since $X^*$ lies in the interior of the unit trace-norm ball, it follows from simple optimality conditions that $\nabla f(X^*) = 0$.
Thus, using again the convexity of $f$, we have that $\nabla f$ vanishes at a point of the unit trace-norm ball,
and hence we arrive at a contradiction. ∎
One may wonder whether the reversed inequality to (5) holds (i.e., whether the inequality holds with equality). The following simple example shows that in general the inequality can be strict. Consider the following example:
$$\min_{X\in\mathbb{R}^{n\times n}:\ \Vert X\Vert_*\le 1}\ f(X) := \frac{1}{2}\left\Vert X - D\right\Vert_F^2, \qquad D = \mathrm{diag}(2, 1, \dots, 1),$$
for some $n \ge 2$.
Clearly, using Lemma 1, the problem admits a unique optimal rank-one solution $X^* = E_{11}$, where $E_{11}$ denotes the diagonal matrix with only the first entry along the main diagonal non-zero and equal to 1. However, one can easily observe that $\nabla f(X^*) = X^* - D = -I_n$, meaning $\mathrm{mult}_1(\nabla f(X^*)) = n > 1 = \mathrm{rank}(X^*)$.
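An example of this kind can be verified numerically; the sketch below uses a quadratic objective $f(X) = \frac12\Vert X - D\Vert_F^2$ with $D = \mathrm{diag}(2,1,\dots,1)$ and $n = 4$ (our instantiation), whose minimizer over the unit ball is the projection of $D$.

```python
import numpy as np

def project_simplex(v, tau=1.0):
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - tau))[0][-1]
    return np.maximum(v - (css[rho] - tau) / (rho + 1.0), 0.0)

def project_trace_ball(Y, tau=1.0):
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return Y if s.sum() <= tau else (U * project_simplex(s, tau)) @ Vt

n = 4
D = np.diag([2.0] + [1.0] * (n - 1))
X_star = project_trace_ball(D)        # minimizer of 0.5*||X - D||_F^2 over the ball
G = X_star - D                        # gradient at the optimum
s = np.linalg.svd(G, compute_uv=False)

rank_opt = int(np.sum(np.linalg.svd(X_star, compute_uv=False) > 1e-9))
mult_1 = int(np.sum(np.isclose(s, s[0])))
# rank(X*) = 1 while mult_1(grad f(X*)) = n, so the inequality is strict
```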
While the above example demonstrates that in general it is possible that $\mathrm{rank}(X^*) \ll \mathrm{mult}_1(\nabla f(X^*))$ and that, as a result, Proposition 1 may not imply significant computational benefits for Problem (1), the following lemma shows that such cases always imply that the optimization problem (1) is ill-posed in the following sense: increasing the radius of the trace-norm ball by an arbitrarily small amount will cause the projected gradient mapping to map such an originally low-rank solution to a higher-rank matrix, implying certain instability of low-rank optimal solutions.
Lemma 3 (gap necessary for stability of rank of optimal solutions).
Suppose there exists $X^*\in\mathcal{X}^*$ of rank $r^*$ such that $\Vert X^*\Vert_* = 1$, and suppose that $\mathrm{mult}_1(\nabla f(X^*)) > r^*$. Then, for any step-size $\eta > 0$ and for any $\delta > 0$ small enough, it holds that the projected-gradient mapping at $X^*$ w.r.t. the trace-norm ball of radius $1+\delta$ satisfies
$$\mathrm{rank}\left(\Pi_{1+\delta}\left[X^* - \eta\nabla f(X^*)\right]\right) > r^*.$$
Fix some $\delta > 0$ and denote $Y^* = X^* - \eta\nabla f(X^*)$. Using Lemma 2 we have that the singular values of $Y^*$ are given by
$$\sigma_i(Y^*) = \begin{cases}\sigma_i(X^*) + \eta\,\sigma_1(\nabla f(X^*)) & i \le r^*,\\ \eta\,\sigma_i(\nabla f(X^*)) & i > r^*,\end{cases}$$
where $\sigma_1(X^*) \ge \dots \ge \sigma_{r^*}(X^*)$ are the singular values of $X^*$. Since $\mathrm{mult}_1(\nabla f(X^*)) > r^*$, which implies that $\sigma_{r^*+1}(\nabla f(X^*)) = \sigma_1(\nabla f(X^*))$, it holds that $\sigma_{r^*+1}(Y^*) = \eta\,\sigma_1(\nabla f(X^*))$. Let $\lambda$ be such that $\sum_i \max\{0,\ \sigma_i(Y^*) - \lambda\} = 1+\delta$. Then, by Lemma 1, we have that the projected-gradient mapping w.r.t. the trace-norm ball of radius $1+\delta$ satisfies:
$$\Pi_{1+\delta}\left[Y^*\right] = \sum_i \max\{0,\ \sigma_i(Y^*) - \lambda\}\, u_i(Y^*)\, v_i(Y^*)^\top,$$
where $\lambda$ satisfies the equation above. Observe that for $\lambda = \eta\,\sigma_1(\nabla f(X^*))$, we have that
$$\sum_i \max\{0,\ \sigma_i(Y^*) - \lambda\} = \sum_{i=1}^{r^*}\sigma_i(X^*) = \Vert X^*\Vert_* = 1 < 1+\delta.$$
Thus, it must hold that $\lambda < \eta\,\sigma_1(\nabla f(X^*))$. However, then it follows that $\sigma_i(Y^*) - \lambda > 0$ for all $i \le r^*+1$, and thus, $\mathrm{rank}\left(\Pi_{1+\delta}[Y^*]\right) \ge r^*+1 > r^*$. ∎
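The instability described by the lemma can be sketched numerically on the earlier strict-inequality example (our instantiation: $f(X) = \frac12\Vert X - D\Vert_F^2$ with $D = \mathrm{diag}(2,1,\dots,1)$, $n = 4$, $\eta = 0.5$): enlarging the ball radius from $1$ to $1.01$ makes the projected-gradient mapping at the rank-one optimum jump to full rank.

```python
import numpy as np

def project_simplex(v, tau=1.0):
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - tau))[0][-1]
    return np.maximum(v - (css[rho] - tau) / (rho + 1.0), 0.0)

def project(Y, tau):
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return Y if s.sum() <= tau else (U * project_simplex(s, tau)) @ Vt

def rank_of(A):
    return int(np.sum(np.linalg.svd(A, compute_uv=False) > 1e-9))

n, eta = 4, 0.5
D = np.diag([2.0] + [1.0] * (n - 1))
X_star = np.diag([1.0] + [0.0] * (n - 1))   # optimum of 0.5*||X - D||_F^2
Y = X_star - eta * (X_star - D)             # gradient step at the optimum

r_unit = rank_of(project(Y, 1.0))    # rank 1: the optimum is reproduced
r_bigger = rank_of(project(Y, 1.01)) # rank n: the low-rank solution is unstable
```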
The following lemma demonstrates why setting the rank of the truncated-SVD projection to be at least $\mathrm{mult}_1(\nabla f(X^*))$ is necessary. The lemma shows that in general, a result similar in spirit to Proposition 1 may not hold with SVD rank parameter $r$ satisfying $r < \mathrm{mult}_1(\nabla f(X^*))$.
Fix a positive integer $r^*$ and $\beta > 0$. Then, for any $\alpha > 0$ small enough and for any $\eta\in(0, 1/\beta)$, there exists a convex and $\beta$-smooth function $f$ such that
$f$ admits a rank-$r^*$ minimizer $X^*$ over the unit trace-norm ball for which it holds that $\mathrm{mult}_1(\nabla f(X^*)) = r^*+1$ and the spectral gap $\sigma_1(\nabla f(X^*)) - \sigma_{r^*+2}(\nabla f(X^*))$ is $\alpha$,
there exists a matrix $X$ such that $\mathrm{rank}(X) = r^*$, $\Vert X\Vert_* \le 1$, $\Vert X - X^*\Vert_F = O(\alpha)$, and for any rank parameter $r \le r^*$ it holds that $\hat{\Pi}_r\left[X - \eta\nabla f(X)\right] \neq \Pi\left[X - \eta\nabla f(X)\right]$.
Consider the following function:
$$f(X) = \frac{\beta}{2}\left\Vert X - \sum_{i=1}^{r^*+1}c_i E_{ii}\right\Vert_F^2,$$
where $E_{ii}$ denotes the indicator matrix for the $i$th diagonal entry. Note that $f$ is indeed $\beta$-smooth.
We set the values $c_i = \alpha/\beta + 1/r^*$ for all $i\in[r^*]$, and $c_{r^*+1} = \alpha/\beta$. It is not hard to verify that the rank-$r^*$ matrix
$$X^* = \frac{1}{r^*}\sum_{i=1}^{r^*}E_{ii}$$
is a minimizer of $f$ over the unit trace-norm ball. In particular, it holds that
$$\nabla f(X^*) = \beta\left(X^* - \sum_{i=1}^{r^*+1}c_i E_{ii}\right) = -\alpha\sum_{i=1}^{r^*+1}E_{ii}.$$
Hence, we have $\mathrm{mult}_1(\nabla f(X^*)) = r^*+1$ and $\sigma_1(\nabla f(X^*)) - \sigma_{r^*+2}(\nabla f(X^*)) = \alpha - 0 = \alpha$.
Consider now the matrix $X$ given by
$$X = \frac{1-\eta\alpha}{r^*}\sum_{i=1}^{r^*}E_{ii}.$$
Note that $X$ is rank-$r^*$ as well. Clearly, it holds that $\Vert X - X^*\Vert_F = \eta\alpha/\sqrt{r^*} = O(\alpha)$.
Thus, for any step-size $\eta\in(0, 1/\beta)$ we have
$$Y := X - \eta\nabla f(X) = (1-\eta\beta)X + \eta\beta\sum_{i=1}^{r^*+1}c_i E_{ii}.$$
Note that $Y$ has $r^*+1$ positive singular values, which we denote (in non-increasing order) by $\gamma_1 \ge \dots \ge \gamma_{r^*+1}$. In particular, it holds that
$$\gamma_i = (1-\eta\beta)\frac{1-\eta\alpha}{r^*} + \eta\beta c_i = \frac{1-\eta\alpha}{r^*} + \frac{\eta^2\alpha\beta}{r^*} + \eta\alpha \quad \text{for } i\in[r^*], \qquad \gamma_{r^*+1} = \eta\beta c_{r^*+1} = \eta\alpha.$$
Note that $\Vert Y\Vert_* = \sum_i\gamma_i = 1 + r^*\eta\alpha + \eta^2\alpha\beta > 1$. Thus, by Lemma 1, the singular values of $\Pi[Y]$ are given by $\max\{0,\ \gamma_i - \lambda\}$, where $\lambda$ satisfies
$$\sum_{i=1}^{r^*+1}\max\{0,\ \gamma_i - \lambda\} = 1.$$
For $\mathrm{rank}\left(\Pi[Y]\right) \le r^*$ to hold, it must hold that $\lambda \ge \gamma_{r^*+1} = \eta\alpha$. We consider now two cases.
In the first case we have $\lambda < \gamma_1$, i.e., $\max\{0,\ \gamma_i - \lambda\} = \gamma_i - \lambda$ for all $i\in[r^*]$. Then, for $\sum_i\max\{0,\ \gamma_i - \lambda\} = 1$ to hold, it must hold that
$$1 = r^*(\gamma_1 - \lambda) \le r^*(\gamma_1 - \eta\alpha) = (1-\eta\alpha) + \eta^2\alpha\beta = 1 - \eta\alpha(1-\eta\beta) < 1,$$
and hence we arrive at a contradiction.
In the second case we have $\lambda \ge \gamma_1$, i.e., $\max\{0,\ \gamma_i - \lambda\} = 0$ for all $i\in[r^*+1]$. As in the first case, in order for $\sum_i\max\{0,\ \gamma_i - \lambda\} = 1$ to hold, it must hold that $1 = 0$, and thus, in this case also we arrive at a contradiction.
We thus conclude that $\mathrm{rank}\left(\Pi[Y]\right) > r^*$, and hence $\hat{\Pi}_r[Y] \neq \Pi[Y]$ for any $r \le r^*$. ∎
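The two-case argument above can be checked numerically. The sketch below instantiates a construction of this kind with $r^* = 1$, $\beta = 1$, $\alpha = 0.1$, $\eta = 0.5$ (the specific quadratic and parameter values are our instantiation): the exact projection of the gradient step has rank $2$, so the rank-1 truncated projection must disagree with it.

```python
import numpy as np

def project_simplex(v, tau=1.0):
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - tau))[0][-1]
    return np.maximum(v - (css[rho] - tau) / (rho + 1.0), 0.0)

def projection(Y, r=None, tau=1.0):
    """Exact projection onto the trace-norm ball; rank-r truncated if r given."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    if r is not None:
        U, s, Vt = U[:, :r], s[:r], Vt[:r, :]
    if s.sum() > tau:
        s = project_simplex(s, tau)
    return (U * s) @ Vt

beta, alpha, eta, rstar = 1.0, 0.1, 0.5, 1
c = alpha / beta + 1.0 / rstar
M = np.diag([c, alpha / beta])            # f(X) = (beta/2)*||X - M||_F^2
grad = lambda X: beta * (X - M)

X_star = np.diag([1.0, 0.0])              # rank-1 minimizer over the unit ball
eps = eta * alpha
X = np.diag([1.0 - eps, 0.0])             # nearby rank-1 feasible point
Y = X - eta * grad(X)

exact = projection(Y)                     # has rank 2
trunc = projection(Y, r=rstar)            # rank-1 truncated projection differs
```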
We now present and prove our main technical theorem, which lower bounds $R_r(X^*, \eta)$, the radius of the ball around an optimal solution $X^*$ in which the projected gradient mapping has rank at most $r$, hence proving Proposition 1.
Assume $\nabla f$ is non-zero over the unit trace-norm ball and fix some $X^*\in\mathcal{X}^*$. Let $r_1$ denote the multiplicity of $\sigma_1(\nabla f(X^*))$, and let $\sigma_1 \ge \sigma_2 \ge \dots$ denote the singular values of $\nabla f(X^*)$ (including multiplicities). Then, for any $\eta > 0$ it holds that
$$R_{r_1}(X^*, \eta) \ge \frac{\eta\, r_1\left(\sigma_1 - \sigma_{r_1+1}\right)}{\left(r_1 + \sqrt{r_1}\right)(1+\eta\beta)}.$$
More generally, for any $\eta > 0$ and $r \ge r_1$, it holds that
$$R_{r}(X^*, \eta) \ge \frac{\eta\sum_{i=1}^{r}\left(\sigma_i - \sigma_{r+1}\right)}{\left(r + \sqrt{r}\right)(1+\eta\beta)}.$$
Moreover, for any $\eta > 0$ and $r \ge r_1$, it holds that
$$R_r(\eta) \ge \inf_{X^*\in\mathcal{X}^*}\frac{\eta\sum_{i=1}^{r}\left(\sigma_i(\nabla f(X^*)) - \sigma_{r+1}(\nabla f(X^*))\right)}{\left(r + \sqrt{r}\right)(1+\eta\beta)}.$$
Throughout the proof we assume without loss of generality that $m \le n$. Fix a step-size $\eta > 0$.
Denote $Y^* = X^* - \eta\nabla f(X^*)$ and let $\sigma_1(Y^*) \ge \dots \ge \sigma_m(Y^*)$ denote the singular values of $Y^*$. Let us also denote by $\sigma_1(X^*) \ge \dots$ the singular values of $X^*$, and $r^* = \mathrm{rank}(X^*)$. From Lemma 2 we can deduce that
$$\sigma_i(Y^*) = \begin{cases}\sigma_i(X^*) + \eta\sigma_1 & i \le r^*,\\ \eta\sigma_i & r^* < i \le m.\end{cases} \qquad (\ast)$$
For any integer $r \ge r_1$ let us define
$$\Delta_r := \sum_{i=1}^{r}\left(\sigma_i - \sigma_{r+1}\right).$$
Since $\nabla f$ is non-zero over the unit trace-norm ball, it follows that $\Vert X^*\Vert_* = \sum_{i=1}^{r^*}\sigma_i(X^*) = 1$; together with $r^* \le r_1 \le r$ we have that
$$\sum_{i=1}^{r}\left(\sigma_i(Y^*) - \sigma_{r+1}(Y^*)\right) \overset{(a)}{=} 1 + \eta\sum_{i=1}^{r}\left(\sigma_i - \sigma_{r+1}\right) = 1 + \eta\Delta_r,$$
where (a) follows from $(\ast)$.
Now, given some $X\in B(X^*, R)$, denote $Y = X - \eta\nabla f(X)$ and let $\sigma_1(Y) \ge \dots \ge \sigma_m(Y)$ denote the singular values of $Y$. It holds that
$$\sum_{i=1}^{r}\sigma_i(Y) \overset{(a)}{\ge} \sum_{i=1}^{r}\sigma_i(Y^*) - \sqrt{r}\,\Vert Y - Y^*\Vert_F \overset{(b)}{\ge} \sum_{i=1}^{r}\sigma_i(Y^*) - \sqrt{r}\,(1+\eta\beta)\Vert X - X^*\Vert_F,$$
where (a) follows from Ky Fan's inequality for the singular values, and (b) follows from the $\beta$-smoothness of $f$.
Also, similarly, using Weyl's inequality, it holds that
$$\sigma_{r+1}(Y) \le \sigma_{r+1}(Y^*) + \Vert Y - Y^*\Vert_2 \le \sigma_{r+1}(Y^*) + (1+\eta\beta)\Vert X - X^*\Vert_F.$$
Thus, it follows that if $R$ satisfies
$$R \le \frac{\eta\Delta_r}{\left(r+\sqrt{r}\right)(1+\eta\beta)},$$
then for any $X\in B(X^*, R)$ we have $\sum_{i=1}^{r}\left(\sigma_i(Y) - \sigma_{r+1}(Y)\right) \ge 1 + \eta\Delta_r - \left(r+\sqrt{r}\right)(1+\eta\beta)R \ge 1$, and hence, by Lemma 1, $\mathrm{rank}\left(\Pi[Y]\right) \le r$.
Alternatively, for any $r \ge r_1$, using the more general version of Weyl's inequality, we can replace the last bound with