Optimization problems involving constraints on the rank of matrices are pervasive in applications. In Control Theory, such problems arise in the context of low-order controller design [9, 19], minimal realization theory, and model reduction. In Machine Learning, problems in inference with partial information, multi-task learning, and manifold learning have been formulated as rank minimization problems. Rank minimization also plays a key role in the study of embeddings of discrete metric spaces in Euclidean space.
In certain instances with special structure, rank minimization problems can be solved via the singular value decomposition or can be reduced to the solution of a linear system [19, 20]. In general, however, minimizing the rank of a matrix subject to convex constraints is NP-hard. The best exact algorithms for this problem involve quantifier elimination, and such solution methods require at least exponential time in the dimensions of the matrix variables.
A popular heuristic for solving rank minimization problems in the controls community is the “trace heuristic,” where one minimizes the trace of a positive semidefinite decision variable instead of the rank (see, e.g., [4, 19]). A generalization of this heuristic to non-symmetric matrices, introduced by Fazel, minimizes the nuclear norm, or the sum of the singular values of the matrix, over the constraint set. When the matrix variable is symmetric and positive semidefinite, this heuristic is equivalent to the trace heuristic, as the trace of a positive semidefinite matrix is equal to the sum of its singular values. The nuclear norm is a convex function and can be optimized efficiently via semidefinite programming. Both the trace heuristic and the nuclear norm generalization have been observed to produce very low-rank solutions in practice, but, until very recently, conditions under which the heuristic succeeded were only available in cases that could also be solved by elementary linear algebra.
The first non-trivial sufficient conditions that guaranteed the success of the nuclear norm heuristic were provided in earlier work. Focusing on the special case where one seeks the lowest rank matrix in an affine subspace, the authors of that work provide a “restricted isometry” condition on the linear map defining the affine subspace which guarantees that the minimum nuclear norm solution is the minimum rank solution. Moreover, they provide several ensembles of affine constraints where this sufficient condition holds with overwhelming probability. Their work builds on seminal developments in “compressed sensing” that determined conditions for when minimizing the $\ell_1$ norm of a vector over an affine space returns the sparsest vector in that space (see, e.g., [6, 5, 3]). There is a strong parallelism between the sparse approximation and rank minimization settings. The rank of a diagonal matrix is equal to the number of non-zeros on the diagonal. Similarly, the sum of the singular values of a diagonal matrix is equal to the $\ell_1$ norm of the diagonal. Exploiting these parallels, the authors were able to extend much of the analysis developed for the $\ell_1$ heuristic to provide guarantees for the nuclear norm heuristic.
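The diagonal-matrix parallel described above can be checked numerically. The following is a minimal sketch using NumPy (the vector `d` and all variable names are illustrative choices, not part of the original text): for a diagonal matrix, the rank equals the number of non-zero diagonal entries and the sum of singular values equals the $\ell_1$ norm of the diagonal.

```python
import numpy as np

# Illustrative diagonal (hypothetical data): three non-zero entries.
d = np.array([3.0, 0.0, -1.5, 0.0, 0.25])
D = np.diag(d)

# Rank of a diagonal matrix = number of non-zeros on the diagonal.
rank_D = int(np.linalg.matrix_rank(D))
nonzeros = int(np.count_nonzero(d))

# Sum of singular values of a diagonal matrix = l1 norm of the diagonal.
nuclear_D = float(np.linalg.svd(D, compute_uv=False).sum())
l1_d = float(np.abs(d).sum())

print(rank_D, nonzeros, nuclear_D, l1_d)
```

The singular values of a diagonal matrix are the absolute values of its diagonal entries, which is why the sign of `-1.5` does not matter here.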
In this paper, we present a necessary and sufficient condition for the solution of the nuclear norm heuristic to coincide with the minimum rank solution in an affine space. The condition characterizes a particular property of the null-space of the linear map which defines the affine space. We show that when the linear map defining the constraint set is generated by sampling its entries independently from a Gaussian distribution, the null-space characterization holds with overwhelming probability provided the dimensions of the equality constraints are of appropriate size. We provide numerical experiments demonstrating that even when matrix dimensions are small, the nuclear norm heuristic does indeed always recover the minimum rank solution when the number of constraints is sufficiently large. Empirically, we observe that our probabilistic bounds accurately predict when the heuristic succeeds.
1.1 Main Results
Let $X$ be an $n_1 \times n_2$ matrix decision variable. Without loss of generality, we will assume throughout that $n_1 \le n_2$. Let $\mathcal{A}: \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^m$ be a linear map, and let $b \in \mathbb{R}^m$. The main optimization problem under study is
$$\text{minimize } \operatorname{rank}(X) \quad \text{subject to } \mathcal{A}(X) = b. \tag{1.1}$$
This problem is known to be NP-hard and is also hard to approximate. As mentioned above, a popular heuristic for this problem replaces the rank function with the sum of the singular values of the decision variable. Let $\sigma_i(X)$ denote the $i$-th largest singular value of $X$ (equal to the square root of the $i$-th largest eigenvalue of $XX^*$). Recall that the rank of $X$ is equal to the number of nonzero singular values. In the case when the singular values are all equal to one, the sum of the singular values is equal to the rank. When the singular values are less than or equal to one, the sum of the singular values is a convex function that is no greater than the rank. This sum of the singular values is a unitarily invariant matrix norm, called the nuclear norm, and is denoted
$$\|X\|_* := \sum_{i=1}^{n_1} \sigma_i(X).$$
This norm is alternatively known by several other names including the Schatten 1-norm, the Ky Fan norm, and the trace class norm.
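As a small illustration of the definitions above, the sketch below computes the nuclear norm as the sum of singular values and compares it with the rank for a matrix whose singular values are at most one. The helper name `nuclear_norm`, the dimensions, and the chosen singular values are assumptions made for this example only.

```python
import numpy as np

def nuclear_norm(X):
    """Sum of the singular values of X."""
    return float(np.linalg.svd(X, compute_uv=False).sum())

rng = np.random.default_rng(0)
# Orthonormal factors, so the non-zero singular values of X are exactly `sigma`.
U, _ = np.linalg.qr(rng.standard_normal((5, 3)))
V, _ = np.linalg.qr(rng.standard_normal((7, 3)))
sigma = np.array([1.0, 0.6, 0.3])          # all <= 1
X = U @ np.diag(sigma) @ V.T               # 5 x 7 matrix of rank 3

rank_X = int(np.linalg.matrix_rank(X, tol=1e-10))
norm_X = nuclear_norm(X)

# sigma_i(X) is the square root of the i-th largest eigenvalue of X X^T.
eigs = np.linalg.eigvalsh(X @ X.T)[::-1]   # eigenvalues in descending order
sigma_from_eigs = np.sqrt(np.clip(eigs[:3], 0.0, None))

print(rank_X, norm_X, sigma_from_eigs)
```

Because every singular value is at most one, the nuclear norm cannot exceed the rank, which is the sense in which it acts as a convex surrogate for the rank on the unit operator-norm ball.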
As described in the introduction, our main concern is when the optimal solution of (1.1) coincides with the optimal solution of
$$\text{minimize } \|X\|_* \quad \text{subject to } \mathcal{A}(X) = b. \tag{1.2}$$
This optimization problem is convex and can be efficiently solved via a variety of methods, including semidefinite programming.
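Whenever $m < n_1 n_2$, the null space of $\mathcal{A}$, that is, the set of $Y$ such that $\mathcal{A}(Y) = 0$, is nontrivial. Note that a feasible $X$ is an optimal solution for (1.2) if and only if for every $Y$ in the null-space of $\mathcal{A}$
$$\|X + Y\|_* \ge \|X\|_*. \tag{1.3}$$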
The following theorem generalizes this null-space criterion to a critical property that guarantees when the nuclear norm heuristic finds the minimum rank solution of $\mathcal{A}(X) = b$ for all values of the vector $b$. Our main result is the following.
Theorem 1.1
Let $X_0$ be the optimal solution of (1.1) and assume that $X_0$ has rank $r < n_1/2$. Then
1. If for every $Y$ in the null space of $\mathcal{A}$ and for every decomposition
$$Y = Y_1 + Y_2,$$
where $Y_1$ has rank $r$ and $Y_2$ has rank greater than $r$, it holds that
$$\|Y_1\|_* < \|Y_2\|_*,$$
then $X_0$ is the unique minimizer of (1.2).
2. Conversely, if the condition of part 1 does not hold, then there exists a vector $b \in \mathbb{R}^m$ such that the minimum rank solution of $\mathcal{A}(X) = b$ has rank at most $r$ and is not equal to the minimum nuclear norm solution.
This result is of interest for multiple reasons. First, as shown in prior work, a variety of rank minimization problems, including those with inequality and semidefinite cone constraints, can be reformulated in the form of (1.1). Secondly, we now present a family of random equality constraints under which the nuclear norm heuristic succeeds with overwhelming probability. We prove both of the following two theorems by showing that $\mathcal{A}$ obeys the null-space criteria of Equation (1.3) and Theorem 1.1, respectively, with overwhelming probability.
Note that for a linear map $\mathcal{A}: \mathbb{R}^{n \times n} \to \mathbb{R}^m$, we can always find an $m \times n^2$ matrix $A$ such that
$$\mathcal{A}(X) = A \operatorname{vec}(X). \tag{1.4}$$
In the case where $A$ has entries sampled independently from a zero-mean, unit-variance Gaussian distribution, the null space characterization of Theorem 1.1 holds with overwhelming probability provided $m$ is large enough. For simplicity of notation in the theorem statements, we consider the case of square matrices. These results can then be translated to rectangular matrices by padding with rows/columns of zeros to make the matrix square. We define the random ensemble of $d_1 \times d_2$ matrices $\mathfrak{G}(d_1, d_2)$ to be the Gaussian ensemble, with each entry sampled i.i.d. from a Gaussian distribution with zero mean and variance one. We also denote $\mathfrak{G}(d, d)$ by $\mathfrak{G}(d)$.
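To make the matrix representation of the linear map concrete, here is a minimal NumPy sketch. It assumes that vec stacks columns (as defined in Section 1.2) and uses the hypothetical helper names `vec` and `apply_map`; the sizes `n` and `m` are arbitrary. It also illustrates the zero-padding remark: padding a rectangular matrix with zero rows leaves its nuclear norm unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 10
A = rng.standard_normal((m, n * n))       # one draw from the Gaussian ensemble, m x n^2

def vec(X):
    """Stack the columns of X into a single vector."""
    return X.reshape(-1, order="F")

def apply_map(A, X):
    """Evaluate the linear map X -> A vec(X)."""
    return A @ vec(X)

X = rng.standard_normal((n, n))
Z = rng.standard_normal((n, n))

# Linearity of the map.
lhs = apply_map(A, 2.0 * X - 3.0 * Z)
rhs = 2.0 * apply_map(A, X) - 3.0 * apply_map(A, Z)

# Row i of A, reshaped column-wise, is a matrix A_i with <A_i, X> = (A vec X)_i.
A_0 = A[0].reshape(n, n, order="F")
first_measurement = float(np.sum(A_0 * X))

# Zero padding a rectangular matrix to a square one preserves singular values.
X_rect = rng.standard_normal((3, 5))
X_pad = np.zeros((5, 5))
X_pad[:3, :] = X_rect
nuc_rect = np.linalg.svd(X_rect, compute_uv=False).sum()
nuc_pad = np.linalg.svd(X_pad, compute_uv=False).sum()
```

Each constraint is therefore a single Frobenius inner product of $X$ with a Gaussian matrix, which is the view used later when the null space is described as a span of Gaussian matrices.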
The first result characterizes when a particular low-rank matrix can be recovered from a random linear system via nuclear norm minimization.
Theorem 1.2 (Weak Bound)
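Let $X_0$ be an $n \times n$ matrix of rank $r = \beta n$. Let $\mathcal{A}: \mathbb{R}^{n \times n} \to \mathbb{R}^{\mu n^2}$ denote the random linear transformation
$$\mathcal{A}(X) = A \operatorname{vec}(X),$$
where $A$ is sampled from $\mathfrak{G}(\mu n^2, n^2)$. Then whenever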
there exists a numerical constant such that with probability exceeding ,
In particular, if $\beta$ and $\mu$ satisfy (1.5), then nuclear norm minimization will recover $X_0$ from a random set of $\mu n^2$ constraints drawn from the Gaussian ensemble almost surely as $n \to \infty$.
The second theorem characterizes when the nuclear norm heuristic succeeds at recovering all low rank matrices.
Theorem 1.3 (Strong Bound)
Let $\mathcal{A}$ be defined as in Theorem 1.2. Define the two functions
Then there exists a numerical constant such that with probability exceeding , for all matrices of rank
In particular, if $\beta$ and $\mu$ satisfy (1.6), then nuclear norm minimization will recover all rank $r$ matrices from a random set of $\mu n^2$ constraints drawn from the Gaussian ensemble almost surely as $n \to \infty$.
Figure 1 plots the bounds from Theorems 1.2 and 1.3. We call (1.5) the Weak Bound because it is a condition that depends on the optimal solution of (1.1). On the other hand, we call (1.6) the Strong Bound as it guarantees the nuclear norm heuristic succeeds no matter what the optimal solution is. The Weak Bound is the only bound that can be tested experimentally, and, in Section 4, we will show that it corresponds well to experimental data. Moreover, the Weak Bound provides guaranteed recovery over a far larger region of parameter space. Nonetheless, the mere existence of a Strong Bound is surprising in and of itself and results in a much better bound than what was available from previous results.
1.2 Notation and Preliminaries
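For a rectangular matrix $X \in \mathbb{R}^{n_1 \times n_2}$, $X^*$ denotes the transpose of $X$. $\operatorname{vec}(X)$ denotes the vector in $\mathbb{R}^{n_1 n_2}$ with the columns of $X$ stacked on top of one another.
For vectors $v \in \mathbb{R}^d$, the only norm we will ever consider is the Euclidean norm
$$\|v\|_{\ell_2} = \left( \sum_{i=1}^{d} v_i^2 \right)^{1/2}.$$
On the other hand, we will consider a variety of matrix norms. For matrices $X$ and $Y$ of the same dimensions, we define the inner product in $\mathbb{R}^{n_1 \times n_2}$ as $\langle X, Y \rangle := \operatorname{trace}(X^* Y) = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} X_{ij} Y_{ij}$. The norm associated with this inner product is called the Frobenius (or Hilbert-Schmidt) norm $\|\cdot\|_F$. The Frobenius norm is also equal to the Euclidean, or $\ell_2$, norm of the vector of singular values, i.e.,
$$\|X\|_F := \left( \sum_{i=1}^{n_1} \sigma_i(X)^2 \right)^{1/2} = \sqrt{\langle X, X \rangle} = \left( \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} X_{ij}^2 \right)^{1/2}.$$
The operator norm (or induced 2-norm) of a matrix is equal to its largest singular value (i.e., the $\ell_\infty$ norm of the singular values):
$$\|X\| := \sigma_1(X).$$
The nuclear norm of a matrix is equal to the sum of its singular values, i.e.,
$$\|X\|_* := \sum_{i=1}^{n_1} \sigma_i(X).$$
These three norms are related by the following inequalities, which hold for any matrix $X$ of rank at most $r$:
$$\|X\| \le \|X\|_F \le \|X\|_* \le \sqrt{r}\, \|X\|_F \le r\, \|X\|.$$
To any norm $\|\cdot\|_p$, we may associate a dual norm $\|\cdot\|_d$ via the following variational definition:
$$\|X\|_d := \sup \{ \langle X, Y \rangle : \|Y\|_p \le 1 \}.$$
One can readily check that the dual norm of the Frobenius norm is the Frobenius norm. Less trivially, one can show that the dual norm of the operator norm is the nuclear norm. We will leverage the duality between the operator and nuclear norms several times in our analysis.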
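The norm relations and the operator/nuclear duality above can be spot-checked numerically. The sketch below is an illustration only (the dimensions, seed, and variable names are arbitrary assumptions): it verifies the chain of inequalities for a random rank-$r$ matrix and exhibits a matrix of unit operator norm that attains the nuclear norm in the dual pairing.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, r = 6, 8, 3
X = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))   # rank r

U, s, Vt = np.linalg.svd(X, full_matrices=False)
op = s[0]                         # operator norm: largest singular value
fro = np.sqrt((s ** 2).sum())     # Frobenius norm: l2 norm of singular values
nuc = s.sum()                     # nuclear norm: sum of singular values

# ||X|| <= ||X||_F <= ||X||_* <= sqrt(r) ||X||_F <= r ||X||
chain = [op, fro, nuc, np.sqrt(r) * fro, r * op]

# Duality: Y = U_r V_r^T has operator norm 1 and <Y, X> = ||X||_*.
Y = U[:, :r] @ Vt[:r, :]
attained = float(np.sum(Y * X))

# Any other matrix of unit operator norm gives an inner product no larger.
Z = rng.standard_normal((n1, n2))
Z /= np.linalg.norm(Z, 2)
other = float(np.sum(Z * X))
```

The maximizer `Y` is built from the leading singular vectors of `X`, which is the standard way to certify the value of the nuclear norm through its dual description.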
2 Necessary and Sufficient Conditions
We first prove our necessary and sufficient condition for success of the nuclear norm heuristic. We will need the following two technical lemmas. The first is an easily verified fact.
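Lemma 2.1
Suppose $X$ and $Y$ are $n_1 \times n_2$ matrices such that $X^* Y = 0$ and $X Y^* = 0$. Then $\|X + Y\|_* = \|X\|_* + \|Y\|_*$.
Indeed, if $X^* Y = 0$ and $X Y^* = 0$, we can find a coordinate system in which
$$X = \begin{bmatrix} X_{11} & 0 \\ 0 & 0 \end{bmatrix} \quad \text{and} \quad Y = \begin{bmatrix} 0 & 0 \\ 0 & Y_{22} \end{bmatrix},$$
from which the lemma trivially follows. The next lemma allows us to exploit Lemma 2.1 in our proof.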
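Lemma 2.2
Let $X$ be an $n_1 \times n_2$ matrix with rank $r < n_1/2$ and $Y$ be an arbitrary $n_1 \times n_2$ matrix. Let $P_X^c$ and $P_X^r$ be the matrices that project onto the column and row spaces of $X$ respectively. Then if $P_X^c Y P_X^r$ has full rank, $Y$ can be decomposed as
$$Y = Y_1 + Y_2,$$
where $Y_1$ has rank $r$, and
$$\|X + Y_2\|_* = \|X\|_* + \|Y_2\|_*.$$
Proof Without loss of generality, we can write $X$ as
$$X = \begin{bmatrix} X_{11} & 0 \\ 0 & 0 \end{bmatrix},$$
where $X_{11}$ is $r \times r$ and full rank. Accordingly, $Y$ becomes
$$Y = \begin{bmatrix} Y_{11} & Y_{12} \\ Y_{21} & Y_{22} \end{bmatrix},$$
where $Y_{11}$ is full rank since $P_X^c Y P_X^r$ is. The decomposition is now clearly
$$Y = \begin{bmatrix} Y_{11} & Y_{12} \\ Y_{21} & Y_{21} Y_{11}^{-1} Y_{12} \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 0 & Y_{22} - Y_{21} Y_{11}^{-1} Y_{12} \end{bmatrix}.$$
That $Y_1$ has rank $r$ follows from the fact that the rank of a block matrix is equal to the rank of a diagonal block plus the rank of its Schur complement (see, e.g., [14, §2.2]). That $\|X + Y_2\|_* = \|X\|_* + \|Y_2\|_*$ follows from Lemma 2.1.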
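The Schur-complement decomposition used in the proof above can be reproduced numerically. This is a minimal sketch in the coordinates where $X$ is supported on its leading $r \times r$ block; the block names follow the usual $2 \times 2$ partition of $Y$ and, like the sizes and seed, are assumptions made for the example.

```python
import numpy as np

def nuc(M):
    """Nuclear norm: sum of singular values."""
    return float(np.linalg.svd(M, compute_uv=False).sum())

rng = np.random.default_rng(3)
n, r = 6, 2

# X is supported on its leading r x r block (coordinates chosen as in the proof).
X = np.zeros((n, n))
X[:r, :r] = rng.standard_normal((r, r))

Y = rng.standard_normal((n, n))
Y11, Y12, Y21 = Y[:r, :r], Y[:r, r:], Y[r:, :r]

# Y1 keeps the first block row and column; its (2,2) block is Y21 Y11^{-1} Y12.
Y1 = np.block([[Y11, Y12],
               [Y21, Y21 @ np.linalg.solve(Y11, Y12)]])
Y2 = Y - Y1          # only the Schur complement in the (2,2) block survives

rank_Y1 = int(np.linalg.matrix_rank(Y1, tol=1e-8))
additive_gap = abs(nuc(X + Y2) - (nuc(X) + nuc(Y2)))
```

Here `Y2` vanishes outside its trailing block, so `X` and `Y2` have orthogonal row and column spaces and Lemma 2.1 applies directly.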
We can now provide a proof of Theorem 1.1.
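Proof We begin by proving the converse. Assume the condition of part 1 is violated, i.e., there exists some $Y$ such that $\mathcal{A}(Y) = 0$, $Y = Y_1 + Y_2$, $\operatorname{rank}(Y_2) > \operatorname{rank}(Y_1) = r$, yet $\|Y_1\|_* > \|Y_2\|_*$. Now take $X_0 = Y_1$ and $b = \mathcal{A}(X_0)$. Clearly, $\mathcal{A}(-Y_2) = b$ (since $Y$ is in the null space), and so we have found a matrix of higher rank, but lower nuclear norm.
For the other direction, assume the condition of part 1 holds, and let $X_*$ denote a minimizer of (1.2). Now use Lemma 2.2 with $X = X_0$ and $Y = X_* - X_0$. That is, let $P^c$ and $P^r$ be the matrices that project onto the column and row spaces of $X_0$ respectively, and assume that $P^c (X_* - X_0) P^r$ has full rank. Write $X_* - X_0 = Y_1 + Y_2$, where $Y_1$ has rank $r$ and $\|X_0 + Y_2\|_* = \|X_0\|_* + \|Y_2\|_*$. Assume further that $Y_2$ has rank larger than $r$ (recall $r < n_1/2$). We will consider the case where $P^c (X_* - X_0) P^r$ does not have full rank and/or $Y_2$ has rank less than or equal to $r$ in the appendix. We now have:
$$\|X_*\|_* = \|X_0 + Y_1 + Y_2\|_* \ge \|X_0 + Y_2\|_* - \|Y_1\|_* = \|X_0\|_* + \|Y_2\|_* - \|Y_1\|_*.$$
But $\mathcal{A}(Y_1 + Y_2) = 0$, so $\|Y_2\|_* - \|Y_1\|_*$ is non-negative and therefore $\|X_*\|_* \ge \|X_0\|_*$. Since $X_*$ is the minimum nuclear norm solution, this implies that $X_0 = X_*$.
For the interested reader, the argument for the case where $P^c (X_* - X_0) P^r$ does not have full rank or $Y_2$ has rank less than or equal to $r$ can be found in the appendix.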
3 Proofs of the Probabilistic Bounds
We now turn to the proofs of the probabilistic bounds (1.5) and (1.6). We first provide a sufficient condition which implies the necessary and sufficient null-space conditions. Then, noting that the null space of $\mathcal{A}$ is spanned by Gaussian vectors, we use bounds from probability on Banach spaces to show that the sufficient conditions are met. This will require the introduction of two useful auxiliary functions whose actions on Gaussian processes are explored in Section 3.4.
3.1 Sufficient Condition for Null-space Characterizations
The following theorem gives us a new condition that implies our necessary and sufficient condition.
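Theorem 3.1
Let $\mathcal{A}$ be a linear map of $n \times n$ matrices into $\mathbb{R}^m$. Suppose that for every $Y$ in the null-space of $\mathcal{A}$ and for any projection operators $P$ and $Q$ onto $r$-dimensional subspaces,
$$\|(I - P)\, Y\, (I - Q)\|_* \ge \|P\, Y\, Q\|_*. \tag{3.1}$$
Then for every matrix $Z$ with row and column spaces equal to the range of $Q$ and $P$ respectively,
$$\|Z + Y\|_* \ge \|Z\|_*$$
for all $Y$ in the null-space of $\mathcal{A}$. In particular, if (3.1) holds for every pair of projection operators $P$ and $Q$, then for every $Y$ in the null space of $\mathcal{A}$ and for every decomposition $Y = Y_1 + Y_2$ where $Y_1$ has rank $r$ and $Y_2$ has rank greater than $r$, it holds that
$$\|Y_1\|_* \le \|Y_2\|_*.$$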
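We will need the following lemma.
Lemma 3.2
For any block partitioned matrix
$$X = \begin{bmatrix} A & B \\ C & D \end{bmatrix}$$
we have $\|X\|_* \ge \|A\|_* + \|D\|_*$.
Proof This lemma follows from the dual description of the nuclear norm:
$$\|X\|_* = \sup \left\{ \left\langle \begin{bmatrix} Z_{11} & Z_{12} \\ Z_{21} & Z_{22} \end{bmatrix}, \begin{bmatrix} A & B \\ C & D \end{bmatrix} \right\rangle : \left\| \begin{bmatrix} Z_{11} & Z_{12} \\ Z_{21} & Z_{22} \end{bmatrix} \right\| = 1 \right\}.$$
Restricting the supremum to block-diagonal matrices, that is, to $Z_{12} = 0$ and $Z_{21} = 0$, can only decrease its value and yields $\|A\|_* + \|D\|_*$.
Theorem 3.1 now trivially follows.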
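Proof [of Theorem 3.1] Without loss of generality, we may choose coordinates such that $P$ and $Q$ both project onto the space spanned by the first $r$ standard basis vectors. Then we may partition $Y$ as
$$Y = \begin{bmatrix} Y_{11} & Y_{12} \\ Y_{21} & Y_{22} \end{bmatrix}$$
and write, using Lemma 3.2 and the triangle inequality,
$$\|Z + Y\|_* - \|Z\|_* = \left\| \begin{bmatrix} Y_{11} + Z_{11} & Y_{12} \\ Y_{21} & Y_{22} \end{bmatrix} \right\|_* - \|Z_{11}\|_* \ge \|Y_{11} + Z_{11}\|_* + \|Y_{22}\|_* - \|Z_{11}\|_* \ge \|Y_{22}\|_* - \|Y_{11}\|_*,$$
which is non-negative by assumption. Note that if the theorem holds for all projection operators $P$ and $Q$ whose range has dimension $r$, then $\|Z + Y\|_* \ge \|Z\|_*$ for all matrices $Z$ of rank $r$, and hence the second part of the theorem follows.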
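The block inequality of Lemma 3.2 used in the proof above, namely that the nuclear norm of a block-partitioned matrix is at least the sum of the nuclear norms of its two diagonal blocks, is easy to spot-check. The following NumPy sketch is only a sanity check on random instances (the partition sizes, seed, and number of trials are arbitrary assumptions), not a substitute for the duality argument.

```python
import numpy as np

def nuc(M):
    """Nuclear norm: sum of singular values."""
    return float(np.linalg.svd(M, compute_uv=False).sum())

rng = np.random.default_rng(5)
n, r = 7, 3

gaps = []
for _ in range(100):
    X = rng.standard_normal((n, n))
    A = X[:r, :r]          # leading diagonal block (r x r)
    D = X[r:, r:]          # trailing diagonal block ((n-r) x (n-r))
    gaps.append(nuc(X) - nuc(A) - nuc(D))

# Lemma 3.2 predicts every gap is non-negative (up to rounding error).
min_gap = min(gaps)
print(min_gap)
```

The same check works for rectangular diagonal blocks, since the duality argument only requires the block-diagonal dual matrix to have operator norm at most one.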
3.2 Proof of the Weak Bound
Lemma 3.3
Let $A$ be sampled from $\mathfrak{G}(\mu n^2, n^2)$. Then the null space of $A$ is identically distributed to the span of $n^2(1 - \mu)$ matrices $G_i$, where each $G_i$ is sampled i.i.d. from $\mathfrak{G}(n)$.
This is nothing more than a statement that the null-space of $A$ is a random subspace. However, when we parameterize elements in this subspace as linear combinations of Gaussian vectors, we can leverage Comparison Theorems for Gaussian processes to yield our bounds.
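To illustrate the parameterization of null-space elements as linear combinations of basis matrices, the sketch below computes a basis for the null space of one Gaussian draw and forms an arbitrary element of it. It checks only the dimension count and membership in the null space, not the distributional claim; the sizes, seed, and names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 4, 10
A = rng.standard_normal((m, n * n))        # Gaussian map acting on vec(X)

# Rows m, m+1, ... of Vt from the full SVD span the null space of A
# (A has full row rank with probability one).
_, _, Vt = np.linalg.svd(A)
basis = Vt[m:]                             # shape (n^2 - m, n^2)

# Reshape each basis vector into an n x n matrix (column-stacking convention).
N = [row.reshape(n, n, order="F") for row in basis]

# An arbitrary null-space element is a linear combination of the basis matrices.
v = rng.standard_normal(len(N))
Y = sum(v_i * N_i for v_i, N_i in zip(v, N))

residual = float(np.linalg.norm(A @ Y.reshape(-1, order="F")))
print(len(N), residual)
```

With $m = \mu n^2$ constraints the basis has $n^2(1 - \mu)$ elements, matching the count in the statement above.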
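Let $M = n^2(1 - \mu)$ and let $G_1, \ldots, G_M$ be i.i.d. samples from $\mathfrak{G}(n)$. Let $X_0$ be a matrix of rank $\beta n$. Let $P$ and $Q$ denote the projections onto the column and row spaces of $X_0$ respectively. By Theorem 3.1 and Lemma 3.3, we need to show that for all $v \in \mathbb{R}^M$,
$$\left\| (I - P) \left( \sum_{i=1}^{M} v_i G_i \right) (I - Q) \right\|_* \ge \left\| P \left( \sum_{i=1}^{M} v_i G_i \right) Q \right\|_*. \tag{3.4}$$
That is, $\sum_{i=1}^{M} v_i G_i$ is an arbitrary element of the null space of $\mathcal{A}$, and this equation restates the sufficient condition provided by Theorem 3.1. Now it is clear by homogeneity that we can restrict our attention to those $v$ with norm $1$. The following crucial lemma characterizes when the expected value of this difference is nonnegative.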
Let and and suppose and are projection operators onto -dimensional subspaces of . For let be sampled from . Then
We will prove this Lemma and a similar inequality required for the proof of the Strong Bound in Section 3.4 below. But we first show how, using this Lemma and a concentration of measure argument, we can prove Theorem 1.2.
First note, that if we plug in and divide the right hand side by , the right hand side of (3.5) is non-negative if (1.5) holds. To bound the probability that(3.4) is non-negative, we employ a powerful concentration inequality for the Gaussian distribution bounding deviations of smoothly varying functions from their expected value.
To quantify what we mean by smoothly varying, recall that a function $f: \mathbb{R}^D \to \mathbb{R}$ is Lipschitz with respect to the Euclidean norm if there exists a constant $L$ such that $|f(x) - f(y)| \le L \|x - y\|_{\ell_2}$ for all $x$ and $y$. The smallest such constant $L$ is called the Lipschitz constant of the map $f$. If $f$ is Lipschitz, it cannot vary too rapidly. In particular, note that if $f$ is differentiable and Lipschitz, then $L$ is a bound on the norm of the gradient of $f$. The following theorem states that the deviations of a Lipschitz function applied to a Gaussian random variable have Gaussian tails.
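Theorem 3.5
Let $x \in \mathbb{R}^D$ be a normally distributed random vector with zero mean and identity covariance, and let $f: \mathbb{R}^D \to \mathbb{R}$ be a function with Lipschitz constant $L$. Then
$$\mathbb{P}\left[ |f(x) - \mathbb{E}[f(x)]| \ge t \right] \le 2 \exp\left( -\frac{t^2}{2L^2} \right).$$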
See the literature for a proof of this theorem with slightly weaker constants and several references for more complicated proofs that give rise to this concentration inequality. The following Lemma bounds the Lipschitz constant of interest.
For , let and . Define the function
Then the Lipschitz constant of is at most .
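The concentration phenomenon invoked above can be illustrated with a simple Monte Carlo experiment. The sketch below is not the function from the preceding lemma. It uses the plain nuclear norm of an $n \times n$ Gaussian matrix, which is Lipschitz with constant $\sqrt{n}$ with respect to the Frobenius norm because $|\|X\|_* - \|Y\|_*| \le \|X - Y\|_* \le \sqrt{n}\,\|X - Y\|_F$. The tail bound is assumed to take the standard form $2\exp(-t^2/(2L^2))$, and the sample mean stands in for the expectation.

```python
import numpy as np

rng = np.random.default_rng(7)
n, trials = 8, 5000
L = np.sqrt(n)       # Lipschitz constant of the nuclear norm w.r.t. the Frobenius norm

# Nuclear norm of each of `trials` independent n x n Gaussian matrices.
G = rng.standard_normal((trials, n, n))
f = np.linalg.svd(G, compute_uv=False).sum(axis=1)

# Empirical Lipschitz ratios on disjoint pairs of samples.
diff_f = np.abs(f[0::2] - f[1::2])
diff_G = np.linalg.norm(G[0::2] - G[1::2], axis=(1, 2))    # Frobenius distances
max_ratio = float((diff_f / diff_G).max())

# Empirical tail probabilities versus the assumed Gaussian tail bound.
mean_f = f.mean()
ts = L * np.array([0.5, 1.0, 1.5, 2.0])
empirical = np.array([(np.abs(f - mean_f) >= t).mean() for t in ts])
bound = 2.0 * np.exp(-ts ** 2 / (2.0 * L ** 2))
```

In such experiments the empirical tails typically sit far below the bound, which reflects that the worst-case Lipschitz constant is a conservative measure of how much the function actually fluctuates.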
3.3 Proof of the Strong Bound
The proof of the Strong Bound is similar to that of the Weak Bound except we prove that (3.4) holds for all operators $P$ and $Q$ that project onto $r$-dimensional subspaces. Our proof will require an $\epsilon$-net for the projection operators, that is, a set of points such that any projection operator is within $\epsilon$ of some element in the set. We will show that if a slightly stronger bound than (3.4) holds on the $\epsilon$-net, then (3.4) holds for all choices of row and column spaces.
Let us first examine how (3.4) changes when we perturb $P$ and $Q$. Let $P$, $Q$, $P'$, and $Q'$ all be projection operators onto $r$-dimensional subspaces. Let $W$ be some matrix and observe that
Here, the first and second lines follow from the triangle inequality, the third line follows because $\|AB\|_* \le \|A\|\, \|B\|_*$ for any matrices $A$ and $B$ of compatible dimensions, and the fourth line follows because $P$, $P'$, $Q$, and $Q'$ are all projection operators. Rearranging this inequality gives
As we have just discussed, if we can prove that with overwhelming probability
for all $P$ and $Q$ in an $\epsilon$-net for the projection operators onto $r$-dimensional subspaces, we will have proved the Strong Bound.
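The kind of perturbation estimate that underlies this $\epsilon$-net argument can be checked numerically. The sketch below tests one bound that follows from the same ingredients (the triangle inequality, $\|AB\|_* \le \|A\|\,\|B\|_*$, and the fact that projections have operator norm at most one): $\big|\, \|PWQ\|_* - \|P'WQ'\|_* \big| \le (\|P - P'\| + \|Q - Q'\|)\, \|W\|_*$. This particular arrangement is an assumption made for illustration and may differ from the displayed chain in the text; all names and sizes are hypothetical.

```python
import numpy as np

def nuc(M):
    """Nuclear norm: sum of singular values."""
    return float(np.linalg.svd(M, compute_uv=False).sum())

def projector(basis):
    """Orthogonal projector onto the column span of `basis`."""
    Qf, _ = np.linalg.qr(basis)
    return Qf @ Qf.T

rng = np.random.default_rng(8)
n, r = 7, 2
Bp = rng.standard_normal((n, r))
Bq = rng.standard_normal((n, r))

P, Q = projector(Bp), projector(Bq)
# Nearby projectors, obtained by slightly perturbing the spanning vectors.
P2 = projector(Bp + 0.05 * rng.standard_normal((n, r)))
Q2 = projector(Bq + 0.05 * rng.standard_normal((n, r)))

W = rng.standard_normal((n, n))

lhs = abs(nuc(P @ W @ Q) - nuc(P2 @ W @ Q2))
rhs = (np.linalg.norm(P - P2, 2) + np.linalg.norm(Q - Q2, 2)) * nuc(W)
print(lhs, rhs)
```

A bound of this type is what allows an inequality verified on a finite net of projectors to be transferred, with a small loss, to every pair of projectors.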
To proceed, we need to know the size of an $\epsilon$-net. The following bound on such a net is due to Szarek.
Theorem 3.7 (Szarek)
Consider the space of all projection operators on $\mathbb{R}^n$ projecting onto $r$-dimensional subspaces endowed with the metric
$$d(P, P') = \|P - P'\|.$$
Then there exists an $\epsilon$-net in this metric space with cardinality at most .
With this in hand, we now calculate the probability that for a given $P$ and $Q$ in the $\epsilon$-net,
For let be sampled from . Then
Moreover, we prove the following in the appendix.
For , let and define the function
Then the Lipschitz constant of is at most .
Now, let be an -net for the set of projection operators discussed above. Again by the union bound, we have that
Finding the parameters , , and that make the terms multiplying negative completes the proof of the Strong Bound.
Both of the following Comparison Theorems provide sufficient conditions for when the expected supremum or infimum of one Gaussian process is greater than that of another. Elementary proofs of both of these Theorems and several other Comparison Theorems can be found in the literature.
Theorem 3.10 (Slepian’s Lemma )
Let $X$ and $Y$ be Gaussian random vectors in $\mathbb{R}^N$ such that
The following two lemmas follow from applications of these Comparison Theorems. We prove them in more generality than necessary for the current work because both Lemmas are interesting in their own right. Let be any norm on matrices and let be its associated dual norm (See Section 1.2). Let us define the quantity as
and note that by this definition, we have
motivating the notation.
This first Lemma is now a straightforward consequence of Slepian’s Lemma
Let and let be a Gaussian random vector in . Let be sampled i.i.d. from . Then
We follow the strategy used to prove Theorem 3.20 in . Let be sampled i.i.d. from and be a Gaussian random vector and let be a zero-mean, unit-variance Gaussian random variable. For and define
Now observe that for any unit vectors in , and any matrices , with dual norm