1 Introduction
The main subject of investigation in this paper is the following optimization problem:
(1) $\min\big\{ f(X) \,:\, X \in \mathbb{R}^{m \times n},\ \|X\|_* \le 1 \big\},$
where $f$ is convex and $\beta$-smooth (i.e., its gradient is $\beta$-Lipschitz), and $\|\cdot\|_*$ denotes the trace norm, i.e., the sum of singular values (also known as the nuclear norm).
Problem (1) has received much attention in recent years and has many applications in machine learning, signal processing, statistics and engineering, such as the celebrated matrix completion problem [10, 28, 20], affine rank minimization problems [29, 21], robust PCA [9], and more.
Many standard first-order methods, such as projected gradient descent [26], Nesterov's accelerated gradient method [26] and FISTA [4], when applied to Problem (1), require on each iteration the computation of the projected-gradient mapping w.r.t. the trace-norm ball, given by

(2) $P[X_t - \eta\nabla f(X_t)],$

for some current iterate $X_t$ and step-size $\eta > 0$, where $P[\cdot]$ denotes the Euclidean projection onto the unit trace-norm ball.
It is well known that computing the projection step in (2) amounts to computing the singular-value decomposition of the matrix $X_t - \eta\nabla f(X_t)$ and projecting the vector of singular values onto the unit simplex (keeping the left and right singular vectors unchanged). Unfortunately, in the worst case a full-rank SVD computation is required, which amounts to $O(m^2 n)$ runtime per iteration, assuming $m \le n$. This naturally prohibits the use of such methods for large-scale problems in which both $m$ and $n$ are large.
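To make this computation concrete, the following is a minimal NumPy sketch (ours, not from the paper) of the exact projection: an SVD followed by a Euclidean projection of the singular values onto the simplex.

```python
import numpy as np

def project_simplex(v, tau=1.0):
    """Euclidean projection of a nonnegative vector v onto
    {w : w >= 0, sum(w) <= tau}: if sum(v) <= tau, v is unchanged; otherwise
    shrink by the threshold lam solving sum(max(v - lam, 0)) = tau."""
    if v.sum() <= tau:
        return v.copy()
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - tau
    ks = np.arange(1, len(u) + 1)
    rho = np.max(ks[u - css / ks > 0])
    lam = css[rho - 1] / rho
    return np.maximum(v - lam, 0.0)

def project_trace_norm_ball(Y, tau=1.0):
    """Euclidean projection of Y onto {X : ||X||_* <= tau}: compute the SVD,
    project the singular values onto the simplex, keep the singular vectors."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * project_simplex(s, tau)) @ Vt
```

The cost is dominated by the full SVD, which is exactly the bottleneck discussed above.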
Since (2) requires, in general, expensive full-rank SVD computations, a very natural and simple heuristic for reducing the computational complexity is to replace the expensive projection operation with an approximate "projection" which (accurately) projects the best rank-$r$ approximation of $X_t - \eta\nabla f(X_t)$. That is, we consider replacing $P[X_t - \eta\nabla f(X_t)]$ with the operation

$P\big[(X_t - \eta\nabla f(X_t))_r\big],$

where $Y_r$ corresponds to the rank-$r$ truncated SVD of $Y$ (i.e., we consider only the top $r$ components of the SVD).
Using state-of-the-art Krylov subspace methods, such as the power-iteration algorithm or the Lanczos algorithm (see for instance the classical text [16] and also the recent works [24, 3]), $(X_t - \eta\nabla f(X_t))_r$ can be computed in runtime that scales only linearly with the rank parameter $r$, a very significant speedup when $r \ll \min\{m, n\}$. Moreover, in many problems the gradient matrix is sparse (e.g., in the well-studied matrix completion problem), in which case further significant accelerations apply. The drawback, of course, is that when using the approximated procedure, the highly desired convergence guarantees of first-order methods need no longer hold.
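As an illustration of the heuristic (our sketch, with NumPy's dense SVD standing in for a Krylov-type rank-$r$ routine such as `scipy.sparse.linalg.svds`): when the top of the spectrum of $X_t - \eta\nabla f(X_t)$ is well separated from the rest, the rank-$r$ "projection" coincides exactly with the full projection.

```python
import numpy as np

def project_nuclear(Y, tau=1.0, rank=None):
    """Project Y onto the trace-norm ball of radius tau. If rank=r is given,
    only the top-r part of the SVD is used (the approximate projection);
    in practice that rank-r SVD would come from a Krylov method."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    if rank is not None:                 # truncate: keep top-r components only
        U, s, Vt = U[:, :rank], s[:rank], Vt[:rank]
    if s.sum() <= tau:
        return (U * s) @ Vt
    # threshold lam with sum(max(s - lam, 0)) = tau
    u = np.sort(s)[::-1]
    css = np.cumsum(u) - tau
    ks = np.arange(1, len(u) + 1)
    rho = np.max(ks[u - css / ks > 0])
    lam = css[rho - 1] / rho
    return (U * np.maximum(s - lam, 0.0)) @ Vt

# Spectrum with a large gap after the 2nd singular value: the exact projection
# is rank-2, so the rank-2 approximate projection agrees with it exactly.
rng = np.random.default_rng(1)
Q1, _ = np.linalg.qr(rng.standard_normal((6, 6)))
Q2, _ = np.linalg.qr(rng.standard_normal((6, 6)))
Y = Q1 @ np.diag([5.0, 4.5, 0.2, 0.1, 0.05, 0.01]) @ Q2.T
exact = project_nuclear(Y)
approx = project_nuclear(Y, rank=2)
```

When the gap condition fails, `exact` and `approx` can differ, which is precisely the failure mode the paper studies.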
A motivation for the plausible effectiveness of this heuristic is that Problem (1) is often used as a convex relaxation of nonconvex rank-constrained optimization problems, which are often assumed to admit a low-rank global minimizer (as is the case in all examples provided above). Given this low-rank structure of the optimal solution, one may wonder whether storing and manipulating high-rank matrices when optimizing (1) is indeed mandatory, or whether, alternatively, at some stage during the run of the algorithm the iterates all become low-rank.
It is thus natural to ask: under which conditions is it possible to replace the projection $P[X - \eta\nabla f(X)]$ with the approximation $P[(X - \eta\nabla f(X))_r]$, while keeping the original convergence guarantees of first-order methods?
Or, put differently, we ask: for which $X$ (and a suitable rank parameter $r$) does

(3) $P[X - \eta\nabla f(X)] = P\big[(X - \eta\nabla f(X))_r\big]$

hold?
Our main result in this paper is the formulation and proof of the following proposition, presented at this point only informally.
Proposition 1 (informal).
Let $X^*$ be an optimal solution to Problem (1) such that $\nabla f(X^*) \ne 0$, and let $r \ge \mathrm{mult}_1(\nabla f(X^*))$, where $\mathrm{mult}_1(\nabla f(X^*))$ denotes the multiplicity of the largest singular value of $\nabla f(X^*)$. Then, for any step-size $\eta > 0$ there exists a radius $R > 0$ such that (3) holds for every $X$ with $\|X - X^*\|_F \le R$.
As we show, Proposition 1 readily implies that standard gradient methods such as the projected gradient method, Nesterov's accelerated gradient method, and FISTA, when initialized in the proximity of an optimal solution, converge with their original convergence guarantees, i.e., they produce the exact same sequences of iterates, when the exact Euclidean projection is replaced with the truncated-SVD-based projection.
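To illustrate the mechanics of such a scheme (a toy sketch under our own assumptions, not the paper's algorithmic contribution): projected gradient descent on the $1$-smooth objective $f(X) = \tfrac{1}{2}\|X - M\|_F^2$ with a rank-2 target $M$, where every iteration computes only a rank-2 SVD-based projection.

```python
import numpy as np

def truncated_projection(Y, r, tau=1.0):
    """P[Y_r]: project the best rank-r approximation of Y onto the trace-norm
    ball of radius tau; only the top-r part of the SVD is used."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    U, s, Vt = U[:, :r], s[:r], Vt[:r]
    if s.sum() > tau:
        css = np.cumsum(s) - tau              # s is already sorted descending
        ks = np.arange(1, r + 1)
        rho = np.max(ks[s - css / ks > 0])
        s = np.maximum(s - css[rho - 1] / rho, 0.0)
    return (U * s) @ Vt

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 10))  # rank-2 target
X = np.zeros((8, 10))
eta, r = 1.0, 2                # step-size 1/beta for the 1-smooth objective
for _ in range(20):
    grad = X - M               # gradient of f(X) = 0.5*||X - M||_F^2
    X = truncated_projection(X - eta * grad, r)
```

For this quadratic with $\eta = 1/\beta$ the iteration reaches its fixed point immediately; the point is only that the projection inside the loop never needs more than a rank-2 SVD, which is where the per-iteration savings come from.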
Some complexity implications of our results for first-order methods for Problem (1) are summarized in Table 1, together with a comparison to other first-order methods.
The connection between $r$, the rank parameter in the approximated projection, and the parameter $r^* := \mathrm{rank}(X^*)$ may seem unintuitive at first. In particular, one might expect that $r$ should be comparable directly with $r^*$. However, as we show, they are indeed tightly related. In particular, the radius of the ball around an optimal solution $X^*$ in which (3) holds is strongly related to spectral gaps in the gradient $\nabla f(X^*)$. This further implies "over-parameterization" results, in which we show how the radius of the ball in which (3) applies increases with the rank parameter $r$; in fact, it can increase quite dramatically with only a moderate increase in $r$. We also bring two complementary results, showing that $\mathrm{rank}(X^*) < \mathrm{mult}_1(\nabla f(X^*))$ implies that the optimization problem (1) is ill-posed in a certain sense, and that, in general, a result in the spirit of Proposition 1 may not hold when $r < \mathrm{mult}_1(\nabla f(X^*))$.
Table 1: Comparison of first-order methods for Problem (1) ($t$ denotes the iteration counter).

Algorithm                 | Conv.  | Rate     | SVD size  | Sol. rank
f smooth and convex:
Proj. Grad.               | global | O(1/t)   | full-rank | –
Acc. Grad.                | global | O(1/t^2) | full-rank | –
Frank–Wolfe [20]          | global | O(1/t)   | rank-one  | –
Proj. Grad. (this paper)  | local  | O(1/t)   | low-rank  | –
Acc. Grad. (this paper)   | local  | O(1/t^2) | low-rank  | –
f smooth and strongly convex:
Proj. Grad.               | global | linear   | full-rank | –
Acc. Grad.                | global | linear   | full-rank | –
ROR-FW [13]               | global | linear   | rank-one  | –
Block-FW [2]              | global | linear   | low-rank  | –
Proj. Grad. (this paper)  | local  | linear   | low-rank  | –
Acc. Grad. (this paper)   | local  | linear   | low-rank  | –
1.1 Organization of this paper
The rest of this paper is organized as follows. In the remainder of this section we discuss related work. In Section 2 we present our main result: we formalize and prove Proposition 1 in the context of Problem (1). In this section we also present several complementary results that further strengthen our claims. In Section 3 we demonstrate how the results of Section 2 readily imply the local convergence of standard projection-based first-order methods for Problem (1), using only low-rank SVD computations for the Euclidean projection. In Sections 4 and 5 we formalize and prove versions of Proposition 1 for smooth convex optimization with trace-norm regularization, and smooth convex optimization over the set of unit-trace positive semidefinite matrices, respectively. Finally, in Section 6 we present supporting empirical evidence.
1.2 Related work
The subject of efficient algorithms for low-rank matrix optimization problems has enjoyed significant interest in recent years. Below we survey some notable results, both for the convex problem (1) as well as other related convex models, and also for related nonconvex optimization problems.
Convex methods:
Besides projection-based methods, other highly popular methods for Problem (1) are conditional gradient methods (aka the Frank–Wolfe algorithm) [12, 19, 20, 18]. These algorithms require only a rank-one SVD computation on each iteration, and hence each iteration is very efficient; however, their convergence rates, which are typically of the form $O(1/t)$ for smooth problems (even when the objective is also strongly convex), are in general inferior to those of projection-based methods such as Nesterov's accelerated method [26] and FISTA [4]. Recently, several works have developed variants of the basic method with faster rates, though these hold only under the additional assumption that the objective is also strongly convex [13, 2, 14]. Additionally, these new variants require storing in memory potentially high-rank matrices, which may limit their applicability to large problems. In [31] the authors present a novel conditional gradient method which enjoys a low memory footprint for certain instances of (1), such as the well-known matrix completion problem; however, there is no improvement in the convergence rate beyond that of the standard method.
Nonconvex methods:
Problem (1) is often considered as a convex relaxation of the nonconvex problem of minimizing $f$ under an explicit rank constraint. Two popular approaches to solving this nonconvex problem are i) applying projected gradient descent to the rank-constrained formulation, in which case the projection is onto the set of low-rank matrices, and ii) incorporating the rank constraint into the objective by considering the factorized objective $f(UV^\top)$, where $U$ and $V$ are $m \times k$ and $n \times k$ matrices, respectively, with $k$ an upper bound on the rank, but otherwise unconstrained. Obtaining global convergence guarantees for these nonconvex optimization problems has been a research direction of significant interest in recent years; however, efficient algorithms are usually obtained only under specific statistical assumptions on the data, which we do not make in the current work; see for instance [21, 22, 11, 6, 15] and references therein.
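Approach (ii) can be sketched as follows (a minimal illustration with our own toy objective $f(X) = \tfrac{1}{2}\|X - M\|_F^2$; the works cited here analyze far more general settings):

```python
import numpy as np

# Factorized formulation: parameterize X = U @ V.T (U: m x k, V: n x k) and run
# gradient descent on g(U, V) = f(U V^T); the constraint rank(X) <= k is
# enforced by the parameterization itself, with U, V otherwise unconstrained.
rng = np.random.default_rng(0)
m, n, k = 20, 30, 3
M = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # rank-k target
U = 0.1 * rng.standard_normal((m, k))   # small random init near the saddle at 0
V = 0.1 * rng.standard_normal((n, k))
eta = 0.002                             # small step-size for stability

for _ in range(8000):
    G = U @ V.T - M                     # gradient of f at X = U V^T
    # chain rule: dg/dU = G @ V, dg/dV = G.T @ U (RHS uses the old U, V)
    U, V = U - eta * G @ V, V - eta * G.T @ U

rel_err = np.linalg.norm(U @ V.T - M) / np.linalg.norm(M)
```

The objective $g(U, V)$ is nonconvex in $(U, V)$ even though $f$ is convex, which is exactly why such formulations require the careful analyses cited above.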
In the works [5, 27] the authors consider first-order methods for factorized formulations of problems related to (1) which are not based on statistical assumptions. In these works, the authors establish the convergence of specific algorithms from a good initialization point to the global low-rank optimum, with convergence rates similar to those of the standard projected gradient descent method.
2 Optimization over the Unit TraceNorm Ball
We begin with introducing some notation. For a positive integer $n$ we let $[n]$ denote the set $\{1, \dots, n\}$. We let $\langle \cdot, \cdot \rangle$ denote the standard inner product for matrices, i.e., $\langle A, B \rangle = \mathrm{Tr}(A^\top B)$. For a real matrix $A$, we let $\sigma_i(A)$ denote its $i$th largest singular value (including multiplicities), and we let $\mathrm{mult}_i(A)$ denote the multiplicity of the $i$th largest singular value. Similarly, for a real symmetric matrix $A$, we let $\lambda_i(A)$ denote its $i$th largest (signed) eigenvalue, and we let $\mathrm{mult}_i(A)$ denote the multiplicity of the $i$th largest eigenvalue. We denote by $\mathcal{X}^*$ the set of optimal solutions to Problem (1), and by $f^*$ the corresponding optimal value. For any $X \in \mathbb{R}^{m \times n}$, step-size $\eta > 0$ and radius $\tau > 0$ we denote the projected-gradient mapping w.r.t. the trace-norm ball of radius $\tau$:

$P_\tau[X - \eta\nabla f(X)], \qquad P_\tau[Y] := \mathop{\arg\min}_{Z:\, \|Z\|_* \le \tau} \|Z - Y\|_F.$

When $\tau = 1$, i.e., we consider the unit trace-norm ball, we will omit the subscript and simply write $P[X - \eta\nabla f(X)]$.
Given an optimal solution $X^* \in \mathcal{X}^*$, a step-size $\eta > 0$, and an integer $r$ in the range $\{1, \dots, \min\{m, n\}\}$, we let $R(X^*, \eta, r)$ denote the radius of the largest Euclidean ball centered at $X^*$ such that for all $X$ in the ball it holds that $\mathrm{rank}(P[X - \eta\nabla f(X)]) \le r$. Or equivalently, $R(X^*, \eta, r)$ is the solution to the optimization problem

$\max\big\{ R \ge 0 \,:\, \forall X \in \mathcal{B}_R(X^*):\ \mathrm{rank}(P[X - \eta\nabla f(X)]) \le r \big\},$

where $\mathcal{B}_R(X^*)$ denotes the Euclidean ball of radius $R$ centered at $X^*$.
Similarly, for any $X^* \in \mathcal{X}^*$ and $r$ in the range $\{1, \dots, \min\{m, n\}\}$, we also define $R(X^*, r) := \inf_{\eta > 0} R(X^*, \eta, r)$.
Towards formalizing and proving Proposition 1, deriving lower bounds on the radius $R(X^*, \eta, r)$ will be our main interest in this section.
Since our objective is to study the properties of the projected-gradient mapping over the trace-norm ball, we begin with the following well-known lemma, which connects the SVD of the point to project with the resulting projection.
Lemma 1 (projection onto the trace-norm ball).
Fix a parameter $\tau > 0$. Let $Y \in \mathbb{R}^{m \times n}$ and consider its singular-value decomposition $Y = \sum_{i=1}^{\min\{m,n\}} \sigma_i u_i v_i^\top$, where $\sigma_1 \ge \sigma_2 \ge \dots$. If $\|Y\|_* = \sum_i \sigma_i > \tau$, then the projection of $Y$ onto the trace-norm ball of radius $\tau$ is given by

(4) $P_\tau[Y] = \sum_{i=1}^{\min\{m,n\}} \max\{\sigma_i - \lambda, 0\}\, u_i v_i^\top,$

where $\lambda > 0$ is the unique solution to the equation $\sum_{i=1}^{\min\{m,n\}} \max\{\sigma_i - \lambda, 0\} = \tau$.
Moreover, if there exists $r$ such that $\sum_{i=1}^{r}(\sigma_i - \sigma_{r+1}) \ge \tau$, then $\mathrm{rank}(P_\tau[Y]) \le r$.
Proof.
The first part of the lemma is a well-known fact. The second part of the lemma comes from the simple observation that if $\sum_{i=1}^{r}(\sigma_i - \sigma_{r+1}) \ge \tau$ for some $r$, then $\lambda$, as defined in the lemma, must satisfy $\lambda \ge \sigma_{r+1}$ (otherwise $\sum_i \max\{\sigma_i - \lambda, 0\} > \sum_{i=1}^{r}(\sigma_i - \sigma_{r+1}) \ge \tau$), in which case Eq. (4) sets all but the top $r$ components of the SVD of $Y$ to zero, and hence the projection is of rank at most $r$. ∎
The following lemma, which connects the singular-value decomposition of an optimal solution with that of its corresponding gradient, will play an important technical role in our analysis. The proof of the lemma follows essentially from simple optimality conditions.
Lemma 2.
Let $X^* \in \mathcal{X}^*$ be any optimal solution and write its singular-value decomposition as $X^* = \sum_{i=1}^{r^*} \sigma_i u_i v_i^\top$, where $r^* = \mathrm{rank}(X^*)$. Then, the gradient $\nabla f(X^*)$ admits a singular-value decomposition such that the set of pairs of vectors $\{(-u_i, v_i)\}_{i \in [r^*]}$ is a set of top singular-vector pairs of $\nabla f(X^*)$ which corresponds to the largest singular value $\sigma_1(\nabla f(X^*))$.
Proof.
First, note that if $\nabla f(X^*) = 0$ then the claim holds trivially. Thus, henceforth we consider the case $\nabla f(X^*) \ne 0$.
It suffices to show that for all $i \in [r^*]$ it holds that $u_i^\top \nabla f(X^*) v_i = -\sigma_1(\nabla f(X^*))$.
Assume by contradiction that for some $i \in [r^*]$ it holds that $u_i^\top \nabla f(X^*) v_i > -\sigma_1(\nabla f(X^*))$. Let $(w, z)$ denote a singular-vector pair corresponding to the top singular value $\sigma_1(\nabla f(X^*))$. Observe that for all $\gamma \in [0, \sigma_i]$, the point $X(\gamma) := X^* - \gamma u_i v_i^\top - \gamma w z^\top$ is a feasible solution to Problem (1), i.e., $\|X(\gamma)\|_* \le 1$. Moreover, for any $\gamma \in (0, \sigma_i]$ it holds that

$\langle \nabla f(X^*), X(\gamma) - X^* \rangle = \gamma\big( -u_i^\top \nabla f(X^*) v_i - \sigma_1(\nabla f(X^*)) \big) < 0,$

which clearly contradicts the first-order optimality of $X^*$. ∎
Corollary 1.
For any $X^* \in \mathcal{X}^*$ it holds that $\mathrm{rank}(X^*) \le \mathrm{mult}_1(\nabla f(X^*))$. Moreover, if $\nabla f$ is nonzero over the unit trace-norm ball, it holds that

(5) $\mathrm{rank}(Y^*) \le \mathrm{mult}_1(\nabla f(X^*))$ for all $X^*, Y^* \in \mathcal{X}^*$.
Proof.
Lemma 2 directly implies that for all $i \in [\mathrm{rank}(X^*)]$ it holds that $u_i^\top \nabla f(X^*) v_i = -\sigma_1(\nabla f(X^*))$, and thus $\mathrm{mult}_1(\nabla f(X^*)) \ge \mathrm{rank}(X^*)$.
For the second part of the corollary, assume by contradiction that there exist $X^*, Y^* \in \mathcal{X}^*$ such that $\mathrm{rank}(Y^*) > \mathrm{mult}_1(\nabla f(X^*))$.
Since $Y^*$ is feasible, it follows from simple optimality conditions that $\langle \nabla f(X^*), Y^* - X^* \rangle \ge 0$, which together with Lemma 2 implies that $\langle \nabla f(X^*), Y^* \rangle \ge \langle \nabla f(X^*), X^* \rangle = -\sigma_1(\nabla f(X^*))$. Moreover, since $\|Y^*\|_* \le 1$ and $\mathrm{rank}(Y^*) > \mathrm{mult}_1(\nabla f(X^*))$, the matrix $Y^*$ cannot be supported entirely on the top singular subspaces of $\nabla f(X^*)$, and it follows that $\langle \nabla f(X^*), Y^* - X^* \rangle > 0$.
Thus, using again the convexity of $f$ we have that

$f(Y^*) \ge f(X^*) + \langle \nabla f(X^*), Y^* - X^* \rangle > f(X^*),$

and hence we arrive at a contradiction. ∎
One may wonder if the reversed inequality to (5) holds (i.e., whether the inequality holds with equality). The following simple example shows that in general the inequality can be strict. Consider the problem

$\min\Big\{ f(X) := \tfrac{1}{2}\|X - B\|_F^2 \,:\, X \in \mathbb{R}^{n \times n},\ \|X\|_* \le 1 \Big\}, \qquad B = \mathrm{diag}(3, 2, 2, \dots, 2),$

for some $n \ge 2$.
Clearly, using Lemma 1, the problem admits a unique optimal rank-one solution $X^* = \mathbf{e}_1\mathbf{e}_1^\top$, where $\mathbf{e}_1\mathbf{e}_1^\top$ denotes the diagonal matrix in which only the first entry along the main diagonal is nonzero and equal to 1. However, $\nabla f(X^*) = X^* - B = -2I_n$, and so one can easily observe that $\mathrm{mult}_1(\nabla f(X^*)) = n$, meaning $\mathrm{rank}(X^*) = 1 < \mathrm{mult}_1(\nabla f(X^*))$.
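The claimed properties of such an instance can be checked numerically; here is a quick NumPy verification for a concrete choice (ours): $f(X) = \tfrac{1}{2}\|X - B\|_F^2$ with $B = \mathrm{diag}(3, 2, \dots, 2)$, whose unique minimizer over the unit trace-norm ball is rank-one while the top singular value of the gradient there has full multiplicity.

```python
import numpy as np

n = 5
B = np.diag([3.0] + [2.0] * (n - 1))

# Minimizer of 0.5*||X - B||_F^2 over the unit trace-norm ball: the projection
# of B, i.e., shrink the singular values (3, 2, ..., 2) by the threshold lam
# solving sum(max(s - lam, 0)) = 1, which gives lam = 2 and X* = e_1 e_1^T.
s = np.diag(B).copy()
lam = 2.0
assert np.isclose(np.maximum(s - lam, 0.0).sum(), 1.0)
X_star = np.diag(np.maximum(s - lam, 0.0))

G = X_star - B                              # gradient of f at X*
sv = np.linalg.svd(G, compute_uv=False)     # singular values of the gradient
```

Here `X_star` has rank one while all $n$ singular values of the gradient are equal, so inequality (5) is strict for this instance.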
While the above example demonstrates that in general it is possible that $\mathrm{rank}(X^*) < \mathrm{mult}_1(\nabla f(X^*))$, and that, as a result, Proposition 1 may not imply significant computational benefits for Problem (1), the following lemma shows that such cases always imply that the optimization problem (1) is ill-posed in the following sense: increasing the radius of the trace-norm ball by an arbitrarily small amount will cause the projected-gradient mapping to map such an original low-rank solution to a higher-rank matrix, implying a certain instability of low-rank optimal solutions.
Lemma 3 (gap necessary for stability of rank of optimal solutions).
Suppose there exists $X^* \in \mathcal{X}^*$ of rank $r^*$ such that $\nabla f(X^*) \ne 0$, and suppose that $\mathrm{mult}_1(\nabla f(X^*)) > r^*$. Then, for any step-size $\eta > 0$ and for any $\varepsilon > 0$ small enough, it holds that the projected-gradient mapping at $X^*$ w.r.t. the trace-norm ball of radius $1 + \varepsilon$ satisfies

$\mathrm{rank}\big( P_{1+\varepsilon}[X^* - \eta\nabla f(X^*)] \big) > r^*.$
Proof.
Fix some $\eta > 0$ and denote $Y := X^* - \eta\nabla f(X^*)$. Using Lemma 2 we have that the singular values of $Y$ are given by

$\lambda_i = \sigma_i(X^*) + \eta\sigma_1(\nabla f(X^*))$ for $i \in [r^*]$, and $\lambda_i = \eta\sigma_i(\nabla f(X^*))$ for $i > r^*$,

where $\sigma_1(X^*) \ge \sigma_2(X^*) \ge \dots$ are the singular values of $X^*$. Since $\mathrm{mult}_1(\nabla f(X^*)) > r^*$, which implies that $\sigma_{r^*+1}(\nabla f(X^*)) = \sigma_1(\nabla f(X^*))$, it holds that $\lambda_{r^*+1} = \eta\sigma_1(\nabla f(X^*))$. Let $\varepsilon > 0$ be such that $\varepsilon < \eta\sigma_1(\nabla f(X^*))$, so that $\|Y\|_* \ge 1 + \eta\sigma_1(\nabla f(X^*)) > 1 + \varepsilon$ (recall that $\|X^*\|_* = 1$ since $\nabla f(X^*) \ne 0$). Then, by Lemma 1, we have that the projected-gradient mapping w.r.t. the trace-norm ball of radius $1 + \varepsilon$ satisfies

$P_{1+\varepsilon}[Y] = \sum_i \max\{\lambda_i - \lambda, 0\}\, u_i v_i^\top,$

where $\lambda$ satisfies $\sum_i \max\{\lambda_i - \lambda, 0\} = 1 + \varepsilon$. Observe that for $\lambda \ge \eta\sigma_1(\nabla f(X^*))$, we have that

$\sum_i \max\{\lambda_i - \lambda, 0\} \le \sum_{i=1}^{r^*}\big(\lambda_i - \eta\sigma_1(\nabla f(X^*))\big) = \sum_{i=1}^{r^*}\sigma_i(X^*) = 1 < 1 + \varepsilon.$

Thus, it must hold that $\lambda < \eta\sigma_1(\nabla f(X^*)) = \lambda_{r^*+1}$. However, then it follows that $\lambda_i - \lambda > 0$ for all $i \in [r^*+1]$, and thus $\mathrm{rank}(P_{1+\varepsilon}[Y]) \ge r^* + 1$. ∎
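Lemma 3's instability phenomenon can be observed numerically. A small sketch (our own concrete instance): for $f(X) = \tfrac{1}{2}\|X - B\|_F^2$ with $B = \mathrm{diag}(3, 2, \dots, 2)$, the minimizer $X^* = \mathbf{e}_1\mathbf{e}_1^\top$ has $\mathrm{mult}_1(\nabla f(X^*)) = n > 1 = \mathrm{rank}(X^*)$, and enlarging the ball radius from $1$ to $1 + \varepsilon$ makes the projected-gradient step at $X^*$ jump to full rank.

```python
import numpy as np

def project_nuclear(Y, tau):
    """Euclidean projection onto the trace-norm ball of radius tau."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    if s.sum() > tau:
        u = np.sort(s)[::-1]
        css = np.cumsum(u) - tau
        ks = np.arange(1, len(u) + 1)
        rho = np.max(ks[u - css / ks > 0])
        s = np.maximum(s - css[rho - 1] / rho, 0.0)
    return (U * s) @ Vt

# f(X) = 0.5*||X - B||_F^2 with B = diag(3, 2, ..., 2): the minimizer over the
# unit ball is X* = e_1 e_1^T (rank 1), while grad f(X*) = X* - B = -2*I has a
# top singular value of multiplicity n > rank(X*).
n, eta, eps = 5, 1.0, 0.04
B = np.diag([3.0] + [2.0] * (n - 1))
X_star = np.diag([1.0] + [0.0] * (n - 1))
G = X_star - B                                     # gradient at X*

P1 = project_nuclear(X_star - eta * G, 1.0)        # radius 1: X* is fixed
P2 = project_nuclear(X_star - eta * G, 1.0 + eps)  # radius 1+eps: full rank
```

An arbitrarily small increase of the radius thus destroys the low rank of the mapped point, which is the instability the lemma formalizes.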
The following lemma demonstrates why setting the rank parameter of the truncated-SVD projection to be at least $\mathrm{mult}_1(\nabla f(X^*))$ is necessary. The lemma shows that, in general, a result similar in spirit to Proposition 1 may not hold with an SVD rank parameter $r$ satisfying $r < \mathrm{mult}_1(\nabla f(X^*))$.
Lemma 4.
Fix a positive integer $r^*$ and a gap parameter $\delta > 0$. Then, for any $\varepsilon > 0$ small enough, there exists a convex and smooth function $f$ such that

– $f$ admits a rank-$r^*$ minimizer $X^*$ over the unit trace-norm ball for which it holds that $\nabla f(X^*) \ne 0$, $\mathrm{mult}_1(\nabla f(X^*)) = r^* + 1$, and the spectral gap $\sigma_{r^*+1}(\nabla f(X^*)) - \sigma_{r^*+2}(\nabla f(X^*))$ is $\delta$,

– there exists a matrix $X$ such that $\mathrm{rank}(X) = r^*$, $\|X\|_* \le 1$, $\|X - X^*\|_F \le \varepsilon$, and for any step-size $\eta > 0$ it holds that $\mathrm{rank}(P[X - \eta\nabla f(X)]) > r^*$.
Proof.
Consider the following linear (hence convex) function $f : \mathbb{R}^{(r^*+2) \times (r^*+2)} \to \mathbb{R}$:

$f(X) = -c_1 \sum_{i=1}^{r^*+1} \langle E_{ii}, X \rangle \;-\; c_2 \langle E_{(r^*+2)(r^*+2)}, X \rangle,$

where $E_{ii}$ denotes the indicator matrix for the $i$th diagonal entry. Note that $f$ is indeed smooth.
We set the values $c_1 := 1 + \delta$ and $c_2 := 1$, so that $c_1 > c_2 > 0$ and $c_1 - c_2 = \delta$. It is not hard to verify that the rank-$r^*$ matrix

$X^* = \frac{1}{r^*}\sum_{i=1}^{r^*} E_{ii}$

is a minimizer of $f$ over the unit trace-norm ball. In particular, it holds that

$\nabla f(X^*) = -\mathrm{diag}(c_1, \dots, c_1, c_2).$

Hence, we have $\mathrm{mult}_1(\nabla f(X^*)) = r^* + 1$ and $\sigma_{r^*+1}(\nabla f(X^*)) - \sigma_{r^*+2}(\nabla f(X^*)) = c_1 - c_2 = \delta$.
Consider now the matrix $X$ given by

$X = \frac{1 - \varepsilon}{r^*}\sum_{i=1}^{r^*} E_{ii}.$

Note that $X$ is of rank $r^*$ as well. Clearly, it holds that $\|X\|_* = 1 - \varepsilon \le 1$ and $\|X - X^*\|_F = \varepsilon/\sqrt{r^*} \le \varepsilon$.
Also,

$\nabla f(X) = \nabla f(X^*) = -\mathrm{diag}(c_1, \dots, c_1, c_2).$

Thus, for any step-size $\eta > 0$ we have

$X - \eta\nabla f(X) = \mathrm{diag}\Big( \tfrac{1-\varepsilon}{r^*} + \eta c_1, \;\dots,\; \tfrac{1-\varepsilon}{r^*} + \eta c_1, \;\eta c_1, \;\eta c_2 \Big).$

Note that $X - \eta\nabla f(X)$ has $r^* + 2$ positive singular values, which we denote (in nonincreasing order) by $\gamma_1 \ge \dots \ge \gamma_{r^*+2}$. In particular, for any $\eta > 0$ it holds that $\gamma_i = \tfrac{1-\varepsilon}{r^*} + \eta c_1$ for $i \in [r^*]$, $\gamma_{r^*+1} = \eta c_1$, and $\gamma_{r^*+2} = \eta c_2$.
We consider now two cases.
In the first case we have $\|X - \eta\nabla f(X)\|_* \le 1$; then $P[X - \eta\nabla f(X)] = X - \eta\nabla f(X)$, which is of rank $r^* + 2 > r^*$.
In the second case we have $\|X - \eta\nabla f(X)\|_* > 1$. Then, by Lemma 1, the singular values of $P[X - \eta\nabla f(X)]$ are given by $\max\{\gamma_i - \lambda, 0\}$, where $\lambda$ satisfies $\sum_{i=1}^{r^*+2}\max\{\gamma_i - \lambda, 0\} = 1$. For $\mathrm{rank}(P[X - \eta\nabla f(X)]) \le r^*$ to hold, it must hold that $\lambda \ge \gamma_{r^*+1} = \eta c_1$. However, in that case,

$\sum_{i=1}^{r^*+2}\max\{\gamma_i - \lambda, 0\} \le \sum_{i=1}^{r^*}(\gamma_i - \eta c_1) = 1 - \varepsilon < 1,$

and hence we arrive at a contradiction.
We thus conclude that in both cases $\mathrm{rank}(P[X - \eta\nabla f(X)]) \ge r^* + 1 > r^*$. ∎
We now present and prove our main technical theorem, which lower-bounds the radius $R(X^*, \eta, r)$ of the ball around an optimal solution $X^*$ in which the projected-gradient mapping has rank at most $r$, hence proving Proposition 1.
Theorem 1.
Assume $\nabla f$ is nonzero over the unit trace-norm ball and fix some $X^* \in \mathcal{X}^*$. Let $r_1$ denote the multiplicity of $\sigma_1(\nabla f(X^*))$, and let $\sigma_1 \ge \sigma_2 \ge \dots$ denote the singular values of $\nabla f(X^*)$ (including multiplicities). Then, for any $\eta > 0$ it holds that

(6) $R(X^*, \eta, r_1) \;\ge\; \dfrac{\sqrt{r_1}\,\eta(\sigma_1 - \sigma_{r_1+1})}{(\sqrt{r_1}+1)(1+\eta\beta)}.$

More generally, for any $r \in \{\mathrm{rank}(X^*), \dots, \min\{m,n\}-1\}$ and $\eta > 0$, it holds that

(7) $R(X^*, \eta, r) \;\ge\; \dfrac{\eta\sum_{i=1}^{r}(\sigma_i - \sigma_{r+1})}{(r+\sqrt{r})(1+\eta\beta)}.$

Moreover, for any such $r$ and any integer $l \in \{0, 1, \dots, r - \mathrm{rank}(X^*)\}$, it holds that

(8) $R(X^*, \eta, r) \;\ge\; \dfrac{\eta\sum_{i=1}^{r}(\sigma_i - \sigma_{r+1-l})}{\big(\sqrt{r} + r/\sqrt{l+1}\big)(1+\eta\beta)}.$
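Before turning to the proof, here is a quick numerical sanity check (our toy instance) of the flavor of these bounds: for $f(X) = \tfrac{1}{2}\|X - B\|_F^2$ (so $\beta = 1$) with $B = \mathrm{diag}(3, 2, 2, 1, 0.5)$, the minimizer is $X^* = \mathrm{diag}(1, 0, 0, 0, 0)$, the gradient at $X^*$ has top singular value $2$ with multiplicity $r_1 = 3$ and gap $\sigma_1 - \sigma_{r_1+1} = 1$; within a radius of order $\eta(\sigma_1 - \sigma_{r_1+1})/(1+\eta\beta)$ around $X^*$, the exact projected-gradient step never exceeds rank $r_1$.

```python
import numpy as np

def project_nuclear(Y, tau=1.0):
    """Euclidean projection onto the unit trace-norm ball."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    if s.sum() > tau:
        u = np.sort(s)[::-1]
        css = np.cumsum(u) - tau
        ks = np.arange(1, len(u) + 1)
        rho = np.max(ks[u - css / ks > 0])
        s = np.maximum(s - css[rho - 1] / rho, 0.0)
    return (U * s) @ Vt

# Toy instance: f(X) = 0.5*||X - B||_F^2, so beta = 1 and grad f(X) = X - B.
B = np.diag([3.0, 2.0, 2.0, 1.0, 0.5])
X_star = np.diag([1.0, 0.0, 0.0, 0.0, 0.0])
eta, beta, r1, gap = 0.1, 1.0, 3, 1.0
# radius of the form sqrt(r1)*eta*gap / ((sqrt(r1)+1)*(1+eta*beta))
R = np.sqrt(r1) * eta * gap / ((np.sqrt(r1) + 1) * (1 + eta * beta))

rng = np.random.default_rng(0)
ranks = []
for _ in range(200):
    D = rng.standard_normal(B.shape)
    X = X_star + R * D / np.linalg.norm(D)     # random point on the R-sphere
    P = project_nuclear(X - eta * (X - B))     # exact projected-gradient step
    ranks.append(np.linalg.matrix_rank(P, tol=1e-9))
```

Every sampled step stays at rank at most $r_1 = 3$, so a rank-3 truncated SVD would have sufficed throughout this neighborhood.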
Proof.
Throughout the proof we denote $r^* := \mathrm{rank}(X^*)$ and note that, since $\nabla f$ is nonzero over the unit trace-norm ball, $\|X^*\|_* = 1$. Fix a step-size $\eta > 0$.
Denote $g := \nabla f(X^*)$ and recall that $\sigma_1 \ge \sigma_2 \ge \dots$ denote the singular values of $g$. Let us also denote by $\lambda_1 \ge \lambda_2 \ge \dots$ the singular values of $X^* - \eta g$, and write the SVD of $X^*$ as $X^* = \sum_{i=1}^{r^*}\sigma_i(X^*)u_iv_i^\top$. From Lemma 2 we can deduce that

(9) $\lambda_i = \sigma_i(X^*) + \eta\sigma_1$ for $i \in [r^*]$, and $\lambda_i = \eta\sigma_i$ for $i > r^*$.
For any integer $r \in \{r^*, \dots, \min\{m,n\}-1\}$ let us define

$\Delta_r := \sum_{i=1}^{r}(\lambda_i - \lambda_{r+1}).$

Since $r^* \le \mathrm{mult}_1(\nabla f(X^*))$ (Corollary 1), it follows that $\sigma_i = \sigma_1$ for all $i \in [r^*]$, and thus, using (9) and $\sum_{i=1}^{r^*}\sigma_i(X^*) = \|X^*\|_* = 1$, we have that

(10) $\Delta_r = \sum_{i=1}^{r^*}\sigma_i(X^*) + \eta\sum_{i=1}^{r}(\sigma_i - \sigma_{r+1}) = 1 + \eta\sum_{i=1}^{r}(\sigma_i - \sigma_{r+1}).$
Now, given some $X$ with $\|X - X^*\|_F \le R$, for a radius $R \ge 0$ to be determined, denote $E := (X - \eta\nabla f(X)) - (X^* - \eta\nabla f(X^*))$ and let $\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \dots$ denote the singular values of $X - \eta\nabla f(X)$. It holds that

(11) $\sum_{i=1}^{r}\hat{\lambda}_i \;\overset{(a)}{\ge}\; \sum_{i=1}^{r}\lambda_i - \sum_{i=1}^{r}\sigma_i(E) \;\ge\; \sum_{i=1}^{r}\lambda_i - \sqrt{r}\,\|E\|_F \;\overset{(b)}{\ge}\; \sum_{i=1}^{r}\lambda_i - \sqrt{r}\,(1+\eta\beta)R,$

where (a) follows from Ky Fan's inequality for the singular values, and (b) follows from the $\beta$-smoothness of $f$, which gives $\|E\|_F \le \|X - X^*\|_F + \eta\|\nabla f(X) - \nabla f(X^*)\|_F \le (1+\eta\beta)R$.
Also, similarly, using Weyl's inequality, it holds that

(12) $\hat{\lambda}_{r+1} \le \lambda_{r+1} + \|E\|_2 \le \eta\sigma_{r+1} + (1+\eta\beta)R.$
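The two perturbation inequalities invoked here, Ky Fan's inequality for sums of singular values and Weyl's inequality for individual ones, can be sanity-checked numerically (a quick script in our notation, independent of the paper):

```python
import numpy as np

def perturbation_bounds_hold(A, E, tol=1e-10):
    """Check Weyl: |sigma_i(A+E) - sigma_i(A)| <= sigma_1(E), and
    Ky Fan: |sum_{i<=r} sigma_i(A+E) - sigma_i(A)| <= sum_{i<=r} sigma_i(E)
            <= sqrt(r) * ||E||_F, for every r."""
    sA = np.linalg.svd(A, compute_uv=False)
    sAE = np.linalg.svd(A + E, compute_uv=False)
    sE = np.linalg.svd(E, compute_uv=False)
    if not np.all(np.abs(sAE - sA) <= sE[0] + tol):           # Weyl
        return False
    for r in range(1, len(sA) + 1):
        lhs = abs((sAE[:r] - sA[:r]).sum())
        if lhs > sE[:r].sum() + tol:                          # Ky Fan
            return False
        if lhs > np.sqrt(r) * np.linalg.norm(E) + tol:        # Cauchy-Schwarz
            return False
    return True

rng = np.random.default_rng(0)
ok = all(perturbation_bounds_hold(rng.standard_normal((6, 9)),
                                  0.1 * rng.standard_normal((6, 9)))
         for _ in range(100))
```

In the proof these two facts are combined with the smoothness bound $\|E\|_F \le (1+\eta\beta)\|X - X^*\|_F$.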
Thus, it follows that if $R$ satisfies

(13) $(r + \sqrt{r})(1+\eta\beta)R \le \eta\sum_{i=1}^{r}(\sigma_i - \sigma_{r+1}),$

we have that $\sum_{i=1}^{r}(\hat{\lambda}_i - \hat{\lambda}_{r+1}) \ge \Delta_r - (r+\sqrt{r})(1+\eta\beta)R \ge 1$, which implies via Lemma 1 that $\mathrm{rank}(P[X - \eta\nabla f(X)]) \le r$. This proves (6), (7).
Alternatively, for any $l \in \{0, 1, \dots, r - r^*\}$, using the more general version of Weyl's inequality, we can replace Eq. (12) with

(14) $\hat{\lambda}_{r+1} \le \lambda_{r+1-l} + \sigma_{l+1}(E) \le \eta\sigma_{r+1-l} + \dfrac{(1+\eta\beta)R}{\sqrt{l+1}}.$