1 Introduction
Let $\Theta \in \mathbb{R}^{d_1 \times d_2}$ be a matrix of interest and $d = d_1 + d_2$. Suppose we observe vectors $(\omega_i, y_i)$,
$$y_i = \Theta_{\omega_i} + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$
where $\omega_i \in [d_1] \times [d_2]$ are sampled positions and $\varepsilon_i$ are random errors. We are interested in estimating $\Theta$ when the sample size $n$ is a small fraction of $d_1 d_2$. A well-known application of matrix completion is the Netflix problem, where $y_i$ is the rating of movie $a_i$ by user $b_i$ for $\omega_i = (a_i, b_i)$ [1]. In such applications, the proportion of observed entries is typically very small, so the estimation or recovery of $\Theta$ is impossible without a structural assumption on $\Theta$. In this paper, we assume that $\Theta$ is of low rank.
A focus of recent studies of matrix completion has been on a simpler formulation, also known as exact recovery, where the observations are assumed to be uncorrupted, i.e. $\varepsilon_i = 0$. A direct approach is to minimize the rank of $M$ subject to the constraints $M_{\omega_i} = y_i$, $i = 1, \ldots, n$. An iterative algorithm was proposed in [5] to project a trimmed SVD of the incomplete data matrix to the space of matrices of a fixed rank. The nuclear norm was proposed as a surrogate for the rank, leading to the following convex minimization problem in a linear space [2]:
$$\min_M \big\{ \|M\|_N : M_{\omega_i} = y_i,\ i = 1, \ldots, n \big\}.$$
We denote the nuclear norm by $\|\cdot\|_N$ here and throughout this paper. This procedure, analyzed in [2, 3, 4, 11] among others, is parallel to the replacement of the $\ell_0$ penalty by the $\ell_1$ penalty in solving the sparse recovery problem in a linear space.
In this paper, we focus on the problem of matrix completion with noisy observations (1) and treat exact recovery as a special case. Since the exact constraint is no longer appropriate in the presence of noise, a penalized squared error is considered instead. By reformulating the problem in Lagrange form, [8] proposed the spectrum Lasso
$$\widehat\Theta = \arg\min_M \Big\{ \frac{1}{2} \sum_{i=1}^n \big( y_i - M_{\omega_i} \big)^2 + \lambda_1 \|M\|_N \Big\}, \qquad (2)$$
along with an iterative convex minimization algorithm. However, (2) is difficult to analyze when the sample fraction $n/(d_1 d_2)$ is small, due to the ill-posedness of the quadratic term $\sum_{i=1}^n (y_i - M_{\omega_i})^2$. This has led to two alternatives in [7] and [9]. While [9] proposed to minimize (2) under an additional spikiness constraint on $M$, [7] modified (2) by replacing the quadratic term with its expectation. Both [7, 9] provided nearly optimal error bounds when the noise level is of no smaller order than the norm of the target matrix $\Theta$, but not of smaller order, and in particular not for exact recovery. In a different approach, [6] proposed a nonconvex recursive algorithm and provided error bounds proportional to the noise level. However, the procedure requires knowledge of the rank of the unknown $\Theta$, and the error bound is optimal only when $d_1$ and $d_2$ are of the same order.
Our goal is to develop an algorithm for matrix completion that can be computed as easily as the spectrum Lasso (2) and enjoys a nearly optimal error bound proportional to the noise level, so as to continuously cover both the noisy and noiseless cases. We propose to use an elastic penalty, a linear combination of the nuclear and Frobenius norms, which leads to the estimator
$$\widehat\Theta = \arg\min_M \Big\{ \frac{1}{2} \sum_{i=1}^n \big( y_i - M_{\omega_i} \big)^2 + \lambda_1 \|M\|_N + \frac{\lambda_2}{2} \|M\|_F^2 \Big\}, \qquad (3)$$
where $\|\cdot\|_N$ and $\|\cdot\|_F$ are the nuclear and Frobenius norms, respectively. We call (3) the spectrum elastic net (Enet), since it is parallel to the Enet in linear regression, the least squares estimator with a sum of the $\ell_1$ and $\ell_2$ penalties, introduced in [15]. Here the nuclear penalty provides sparsity in the spectrum, while the Frobenius penalty regularizes the inversion of the quadratic term. Meanwhile, since the Frobenius penalty roughly shrinks the estimator by a factor of $n/(n + \lambda_2 d_1 d_2)$, we correct this bias by a calibration step,
$$\widehat\Theta^{\mathrm{cal}} = \big( 1 + \lambda_2 d_1 d_2 / n \big)\, \widehat\Theta. \qquad (4)$$
We call this estimator calibrated spectrum Enet.
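Writing $\lambda_2$ for the Frobenius penalty level, the bias correction can be sketched as follows. This is a hedged sketch: we assume the Frobenius penalty shrinks the Enet solution by roughly $n/(n + \lambda_2 d_1 d_2)$, so that the calibration multiplies by the reciprocal factor; the exact constant should be taken from (4), and `calibrate` is an illustrative name, not the paper's.

```python
import numpy as np

# Hedged sketch of the calibration step: assuming the Frobenius penalty
# shrinks the Enet solution by roughly n / (n + lam2 * d1 * d2), the
# calibration multiplies by the reciprocal factor.
def calibrate(theta_hat: np.ndarray, lam2: float, n: int) -> np.ndarray:
    d1, d2 = theta_hat.shape
    return (1.0 + lam2 * d1 * d2 / n) * theta_hat
```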
Motivated by [8], we develop an EM algorithm to solve (3) for matrix completion. The algorithm iteratively replaces the missing entries with those obtained from a scaled soft-thresholding singular value decomposition (SVD) until the resulting matrix converges. This EM algorithm is guaranteed to converge to the solution of (3).
Under proper coherence conditions, we prove that for suitable penalty levels $\lambda_1$ and $\lambda_2$, the calibrated spectrum Enet (4) achieves a desired error bound in the Frobenius norm. Our error bound is of nearly optimal order and proportional to the noise level. This provides a sharper result than those of [7, 9] when the noise level is of smaller order than the norm of $\Theta$, and than that of [6] when the aspect ratio $d_2/d_1$ is large. Our simulation results support the use of the calibrated spectrum Enet; they illustrate that (4) performs comparably to (2) and outperforms the modified method of [7].
Our analysis of the calibrated spectrum Enet uses an inequality similar to a dual certificate bound in [3]. The bound in [3] requires a larger sample size. We use the method of moments to remove a logarithmic factor in the first component of their sample size requirement. This leads to a sample size requirement with an extra logarithmic factor in comparison to the ideal one. Since this extra factor does not appear in our error bound, its appearance in the sample size requirement seems to be a technicality.

The rest of the paper is organized as follows. In Section 2, we describe an iterative algorithm for the computation of the spectrum Enet and study its convergence. In Section 3, we derive error bounds for the calibrated spectrum Enet. Some simulation results are presented in Section 4. Section 5 provides the proof of our main result.
We use the following notation throughout this paper. For matrices $M$, $\|M\|_N$ is the nuclear norm (the sum of all singular values of $M$), $\|M\|_S$ is the spectrum norm (the largest singular value), $\|M\|_F$ is the Frobenius norm (the $\ell_2$ norm of vectorized $M$), and $\|M\|_\infty = \max_{jk} |M_{jk}|$. Linear mappings from $\mathbb{R}^{d_1 \times d_2}$ to $\mathbb{R}^{d_1 \times d_2}$ are denoted by calligraphic letters. For a linear mapping $\mathcal{L}$, the operator norm is $\|\mathcal{L}\| = \sup_{\|M\|_F = 1} \|\mathcal{L} M\|_F$. We equip $\mathbb{R}^{d_1 \times d_2}$ with the inner product $\langle M_1, M_2 \rangle = \mathrm{trace}(M_1^\top M_2)$, so that $\langle M, M \rangle = \|M\|_F^2$. For projections $\mathcal{P}$, $\mathcal{P}^\perp = \mathcal{I} - \mathcal{P}$, with $\mathcal{I}$ being the identity. We denote by $E_{jk}$ the unit matrix with 1 at $(j,k)$, and by $\mathcal{P}_\Omega$ the projection to $\Omega$: $\mathcal{P}_\Omega M = \sum_{(j,k) \in \Omega} M_{jk} E_{jk}$.
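These norms can all be read off the singular values; a quick numerical check of the standard identities among them (variable names are ours):

```python
import numpy as np

# The three matrix norms used throughout, computed from the singular
# values of an arbitrary test matrix M.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 4))
s = np.linalg.svd(M, compute_uv=False)

nuclear = s.sum()                    # ||M||_N: sum of singular values
spectrum = s.max()                   # ||M||_S: largest singular value
frobenius = np.sqrt((s ** 2).sum())  # ||M||_F: l2 norm of vec(M)

# ||M||_F agrees with the entrywise l2 norm and with <M, M>^(1/2),
# and the norms are ordered: ||M||_S <= ||M||_F <= ||M||_N.
assert np.isclose(frobenius, np.linalg.norm(M))
assert np.isclose(np.trace(M.T @ M), frobenius ** 2)
assert spectrum <= frobenius <= nuclear
```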
2 An algorithm for spectrum elastic regularization
We first present a lemma for the M-step of our iterative algorithm.
Lemma 1
Suppose the matrix $Z \in \mathbb{R}^{d_1 \times d_2}$ has rank $r$. The solution to the optimization problem
$$\min_M \Big\{ \frac{1}{2} \|Z - M\|_F^2 + \lambda_1 \|M\|_N + \frac{\lambda_2}{2} \|M\|_F^2 \Big\}$$
is given by $\widehat M = U \,\mathrm{diag}(\hat s_1, \ldots, \hat s_r)\, V^\top$ with $\hat s_j = (\sigma_j - \lambda_1)_+ / (1 + \lambda_2)$, where $Z = U \,\mathrm{diag}(\sigma_1, \ldots, \sigma_r)\, V^\top$ is the SVD of $Z$ and $x_+ = \max(x, 0)$.
The minimization problem in Lemma 1 is solved by a scaled soft-thresholding SVD. This is parallel to Lemma 1 in [8] and is justified by Remark 1 there. We use Lemma 1 to solve the M-step of the EM algorithm for the spectrum Enet (3).
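The scaled soft-thresholding SVD of Lemma 1 can be sketched in a few lines, with `lam1` and `lam2` denoting the nuclear and Frobenius penalty levels; the function name is ours:

```python
import numpy as np

# Scaled soft-thresholding SVD of Lemma 1: for
#   min_M (1/2)||Z - M||_F^2 + lam1 ||M||_N + (lam2/2)||M||_F^2,
# keep Z's singular vectors, soft-threshold the singular values at lam1,
# and scale the result by 1/(1 + lam2).
def scaled_soft_svd(Z: np.ndarray, lam1: float, lam2: float) -> np.ndarray:
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * (np.maximum(s - lam1, 0.0) / (1.0 + lam2))) @ Vt
```

With `lam2 = 0` this reduces to the soft-thresholded SVD of [8].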
We still need an E-step to impute a complete matrix given the observed data. Since the sampling points $\omega_i$ are allowed to have ties, we need the following notation. Let $m_{jk} = \#\{i \le n : \omega_i = (j,k)\}$ be the multiplicity of observations at $(j,k)$ and $m_* = \max_{jk} m_{jk}$ be the maximum multiplicity. Suppose that the complete data are composed of $m$ observations at each $(j,k)$ for a certain integer $m$. Let $\bar y^{\mathrm{com}}_{jk}$ be the sample mean of the complete data at $(j,k)$ and $Y^{\mathrm{com}}$ be the matrix with components $\bar y^{\mathrm{com}}_{jk}$. If the complete data were available, (3) would be equivalent to
$$\min_M \Big\{ \frac{m}{2} \|Y^{\mathrm{com}} - M\|_F^2 + \lambda_1 \|M\|_N + \frac{\lambda_2}{2} \|M\|_F^2 \Big\}.$$
Let $\bar y_{jk}$ be the sample mean of the $m_{jk}$ observations at $(j,k)$. In the white noise model, the conditional expectation of $\bar y^{\mathrm{com}}_{jk}$ given the observed data is $\{ m_{jk} \bar y_{jk} + (m - m_{jk}) \Theta_{jk} \}/m$ for $m \ge m_{jk}$. This leads to a generalized E-step:
$$\bar y^{\mathrm{com}}_{jk} = \big\{ m_{jk} \bar y_{jk} + (m - m_{jk}) \tilde\Theta_{jk} \big\}/m, \qquad (5)$$
where $\tilde\Theta$ is the estimate of $\Theta$ from the previous iteration. This is a genuine E-step when $m \ge m_*$, but (5) also allows a smaller $m$ to reduce the proportion of missing data.
We now present the EM algorithm for the computation of the spectrum Enet in (3).
Algorithm 1
Initialize with $\tilde\Theta = 0$ and choose an integer $m \ge 1$. Repeat the following steps:

E-step: Compute $Y^{\mathrm{com}}$ in (5) with the current $\tilde\Theta$ and assign $Z = Y^{\mathrm{com}}$,

M-step: Compute $\tilde\Theta^{\mathrm{new}} = \arg\min_M \big\{ \frac{m}{2} \|Z - M\|_F^2 + \lambda_1 \|M\|_N + \frac{\lambda_2}{2} \|M\|_F^2 \big\}$ by Lemma 1 (applied with penalty levels $\lambda_1/m$ and $\lambda_2/m$),

until the relative change $\|\tilde\Theta^{\mathrm{new}} - \tilde\Theta\|_F^2 / \max(1, \|\tilde\Theta\|_F^2)$ falls below a tolerance $\epsilon$. Then, return $\widehat\Theta = \tilde\Theta^{\mathrm{new}}$.
The following theorem states the convergence of Algorithm 1.
Theorem 1
As the number of iterations grows, $\tilde\Theta$ converges to a limit as a function of the data and the penalty levels $(\lambda_1, \lambda_2)$, and the limit is the solution of (3) for any valid choice of $m$.
Theorem 1 is a variation of a parallel result in [8] and follows from the same proof there. As [8] pointed out, a main advantage of Algorithm 1 is the speed of each iteration. When the maximum multiplicity $m_*$ is small, we simply run Algorithm 1 with $m = m_*$; otherwise, we may first run the EM algorithm with a smaller $m$ and use the output as the initialization for a second run of the EM algorithm with $m = m_*$.
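The iteration can be sketched in the simple case without ties, where the E-step reduces to filling in the unobserved entries from the current iterate. This is a hedged sketch: function and parameter names (`lam1` for the nuclear penalty, `lam2` for the Frobenius penalty) and the stopping rule are illustrative, not the paper's exact pseudocode.

```python
import numpy as np

# Sketch of Algorithm 1 without ties: the E-step imputes unobserved
# entries from the previous iterate; the M-step is the scaled
# soft-thresholding SVD of Lemma 1.
def spectrum_enet_em(Y, mask, lam1, lam2, tol=1e-6, max_iter=500):
    """Y: matrix holding observed entries (zeros elsewhere); mask: boolean."""
    theta = np.zeros_like(Y)
    for _ in range(max_iter):
        # E-step: complete the data matrix using the current estimate.
        Z = np.where(mask, Y, theta)
        # M-step: scaled soft-thresholding SVD (Lemma 1).
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        theta_new = (U * (np.maximum(s - lam1, 0.0) / (1.0 + lam2))) @ Vt
        if np.linalg.norm(theta_new - theta) <= tol * max(1.0, np.linalg.norm(theta)):
            return theta_new
        theta = theta_new
    return theta
```

Each iteration costs one SVD of a dense matrix, which is the main advantage noted above.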
3 Analysis of estimation accuracy
In this section, we derive error bounds for the calibrated spectrum Enet. We need the following notation. Let $r$ be the rank of $\Theta$, $\Theta = U \,\mathrm{diag}(\sigma_1, \ldots, \sigma_r)\, V^\top$ be the SVD of $\Theta$, and $\sigma_1 \ge \cdots \ge \sigma_r > 0$ be the nonzero singular values of $\Theta$. Let $\mathcal{T}$ be the tangent space with respect to $\Theta$, the space of all matrices of the form $U A^\top + B V^\top$. The orthogonal projection to $\mathcal{T}$ is given by
$$\mathcal{P}_{\mathcal{T}} M = U U^\top M + M V V^\top - U U^\top M V V^\top. \qquad (6)$$
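Assuming the tangent space consists of matrices $U A^\top + B V^\top$ with orthonormal $U$ and $V$, the projection (6) is easy to transcribe and check numerically; `proj_tangent` is our name for it:

```python
import numpy as np

# Tangent-space projection (6):
#   P_T(M) = U U^T M + M V V^T - U U^T M V V^T,
# where the columns of U and V are the left/right singular vectors.
def proj_tangent(M: np.ndarray, U: np.ndarray, V: np.ndarray) -> np.ndarray:
    PU, PV = U @ U.T, V @ V.T
    return PU @ M + M @ PV - PU @ M @ PV
```

One can verify that this map is idempotent and fixes every matrix of the form $U A^\top + B V^\top$, as a projection onto the tangent space should.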
Theorem 2
The proof of Theorem 2 is provided in Section 5. When the sampling points are random entries in $[d_1] \times [d_2]$, (8) and the first inequality of (7) are expected to hold under proper conditions. Since the rank of the projected matrix is no greater than $2r$, (9) is essentially a rank condition. Our analysis allows $\lambda_2$ to lie in a certain range, which is large under proper conditions. Still, the choice of $\lambda_2$ is constrained by (7) and (8), since the calibration in (4) is linear in $\lambda_2$. When $\lambda_2$ diverges to infinity, the calibrated spectrum Enet (4) becomes the modified spectrum Lasso of [7].
Theorem 2 provides sufficient conditions on the target matrix and the noise for achieving a certain level of estimation error. Intuitively, these conditions on the target matrix must imply a certain level of coherence (or flatness) of the unknown matrix, since it is impossible to distinguish the unknown from zero when the observations are completely outside its support. In [2, 3, 4, 11], coherence conditions are imposed on
$$\mu_0 = \max\Big\{ \frac{d_1}{r} \max_{1 \le j \le d_1} \|U^\top e_j\|^2,\ \frac{d_2}{r} \max_{1 \le k \le d_2} \|V^\top e_k\|^2 \Big\}, \qquad \mu_1 = \sqrt{\frac{d_1 d_2}{r}}\, \|U V^\top\|_\infty, \qquad (11)$$
where $U$ and $V$ are the matrices of left and right singular vectors of $\Theta$. [9] considered a more general notion of spikiness of a matrix $M$, defined as the ratio between the $\ell_\infty$ and dimension-normalized $\ell_2$ norms,
$$\alpha_{\mathrm{sp}}(M) = \sqrt{d_1 d_2}\, \frac{\|M\|_\infty}{\|M\|_F}. \qquad (12)$$
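The spikiness ratio (12) transcribes directly, under the convention that a perfectly flat matrix has spikiness 1 and a matrix concentrated on a single entry has spikiness $\sqrt{d_1 d_2}$:

```python
import numpy as np

# Spikiness ratio of [9]: the l_inf norm over the dimension-normalized
# Frobenius norm. A flat (all-equal) matrix has spikiness 1; a matrix
# supported on one entry has spikiness sqrt(d1 * d2).
def spikiness(M: np.ndarray) -> float:
    d1, d2 = M.shape
    return np.sqrt(d1 * d2) * np.abs(M).max() / np.linalg.norm(M)
```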
Suppose in the rest of the section that the sampling points $\omega_i$ are iid points uniformly distributed in $[d_1] \times [d_2]$ and the errors $\varepsilon_i$ are iid $N(0, \sigma^2)$ variables independent of the sampling points. The following theorem asserts that under certain coherence conditions on the matrices $U$, $V$, and $\Theta$, all conditions of Theorem 2 hold with large probability when the sample size is of the order given in (14) below.

Theorem 3
Let . Consider and satisfying
(13) 
Then, there exists a constant such that
(14) 
implies
with probability at least , where and are the coherence constants in (11), is the spikiness of and .
We require knowledge of the noise level $\sigma$ to determine the penalty level, which is usually treated as a tuning parameter in practice. The Frobenius norm $\|\Theta\|_F$ in (13) can be replaced by an estimate of the same magnitude in Theorem 3; we use such an estimate in our simulation experiment, and the Chebyshev inequality provides the required accuracy under mild moment conditions.
A key element in our analysis is to find a probabilistic bound for the second inequality of (8), or equivalently an upper bound for
(15) 
This guarantees the existence of a primal dual certificate for the spectrum Enet penalty [14]. A similar inequality was proved in [3], where the sample size requirement involves a certain coherence factor. We remove a log factor in the first bound, resulting in the sample size requirement in (14), which is optimal in a certain regime. For exact recovery in the noiseless case, a smaller sample size is sufficient if a golfing scheme is used to construct an approximate dual certificate [4, 11]. We use the following lemma to bound (15).
Lemma 2
Let where are iid points uniformly distributed in . Let and . Let be a deterministic matrix. Then, there exists a numerical constant such that, for all and ,
(16) 
We use a graphical approach different from those in [3] to bound the moments in the proof of Lemma 2. The rest of the proof of Theorem 3 can be outlined as follows. Assume that all coherence factors are bounded. By (16) and an even simpler bound for the remainder term, (15) holds for a suitable sample size. Finally, we use matrix exponential inequalities [10, 12] to verify the other conditions of Theorem 2. We omit the technical details of the proofs of Lemma 2 and Theorem 3. We would like to point out that if the exponent in (16) can be improved, e.g. in view of [3], the rest of the proof of Theorem 3 remains intact with a proper adjustment of the constants in (13).
Compared with [7] and [9], the main advantage of Theorem 3 is the proportionality of its error bound to the noise level. In [7], the quadratic term in (2) is replaced by its expectation, and the resulting minimizer is proved to satisfy
(17) 
with large probability, where the leading constant is numerical. This error bound achieves the same squared error rate as Theorem 3 when the noise level is of no smaller order than the norm of the target matrix, but not when it is of smaller order. In particular, (17) does not imply exact recovery in the noiseless case. In Theorem 3, the error bound converges to zero as the noise level diminishes, implying exact recovery in the noiseless case. In [9], a constrained spectrum Lasso was proposed that minimizes (2) subject to a spikiness constraint on $M$. Under suitable conditions on the spikiness and rank of $\Theta$, [9] proved
(18) 
with large probability. A scale change of the above error bound yields a comparable quantity. The right-hand side of (18) is then of no smaller order than that of (17). We should point out that (17) and (18) only require a smaller sample size. In addition, [9] allows more practical weighted sampling models.
Compared with [6], the main advantage of Theorem 3 is that its sample size requirement does not depend on the aspect ratio $d_2/d_1$, where $d_1 \le d_2$ is assumed without loss of generality by symmetry. The error bound in [6] implies
(19) 
for a sample size requirement involving the root aspect ratio $\sqrt{d_2/d_1}$, where the constants depend on the same set of coherence factors as in (14) and $\sigma_1 \ge \cdots \ge \sigma_r$ are the singular values of $\Theta$. Therefore, Theorem 3 effectively replaces the root aspect ratio in the sample size requirement of (19) with a log factor, and removes a coherence factor from the right-hand side of (19). We note that this involves a coherence factor larger than the one in the sample size requirement of Theorem 3. The root aspect ratio can be removed from the sample size requirement for (19) if $\Theta$ can be divided into square blocks uniformly satisfying the coherence conditions.
4 Simulation study
This experiment has the same setting as Section 9 of [8]. We describe the simulation settings in our notation as follows. The target matrix is $\Theta = U V^\top$, where $U \in \mathbb{R}^{d_1 \times r}$ and $V \in \mathbb{R}^{d_2 \times r}$ are random matrices with independent standard normal entries. The sampling points have no ties, and $\Omega = \{\omega_1, \ldots, \omega_n\}$ is a uniformly distributed random subset of $[d_1] \times [d_2]$ with fixed size $n$. The errors $\varepsilon_i$ are iid $N(0, \sigma^2)$ variables. Thus, the observed matrix is $\mathcal{P}_\Omega(\Theta + \mathcal{E})$, with $\mathcal{P}_\Omega$ being the projection to $\Omega$ and $\mathcal{E}$ the noise matrix. The signal-to-noise ratio (SNR) is defined as in [8].
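A minimal sketch of this data-generating process; the dimensions, noise level, and observation mechanism are illustrative (we draw the observed set entrywise rather than as a fixed-size subset), and `make_instance` is our name:

```python
import numpy as np

# Sketch of the simulation setting: a rank-r target U V^T with iid
# standard normal factors, an (approximately) uniformly sampled observed
# set, and iid Gaussian noise on the observed entries.
def make_instance(d1=100, d2=100, r=10, frac_observed=0.2, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((d1, r))
    V = rng.standard_normal((d2, r))
    theta = U @ V.T                                 # rank-r target
    mask = rng.random((d1, d2)) < frac_observed     # observed positions
    noise = sigma * rng.standard_normal((d1, d2))
    Y = np.where(mask, theta + noise, 0.0)          # noisy observed entries
    return theta, mask, Y
```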
We compare the calibrated spectrum Enet (4) with the spectrum Lasso (2) and its modification of [7]. For all methods, we compute a series of estimators at 100 different penalty levels, where the smallest penalty level corresponds to a full-rank solution and the largest corresponds to a zero solution. For the calibrated spectrum Enet, we always choose the Frobenius penalty level using an estimator of the noise level. We plot the training and test errors as functions of the estimated ranks, where the training and test errors are defined as
In Figure 1, we report the estimation performance of the three methods. The rank of $\Theta$ is 10, but the data are regenerated in each replication. Different noise levels and proportions of observed entries are considered. All results are averaged over 50 replications. In this experiment, the calibrated spectrum Enet and the spectrum Lasso have very close test and training errors, and both significantly outperform the modified Lasso. Figure 1 also illustrates that in most cases, the calibrated spectrum Enet and the spectrum Lasso achieve the optimal test error when the estimated rank is around the true rank.
We note that the constrained spectrum Lasso estimator would have the same performance as the spectrum Lasso when the constraint is set at a sufficiently high level. However, the analytic properties of the spectrum Lasso are unclear without the constraint or modification.
5 Proof of Theorem 2
The proof of Theorem 2 requires the following proposition, which controls the approximation error of the Taylor expansion of the nuclear norm with subdifferentiation. The result, closely related to those in [13], is used to control the variation of the tangent space of the spectrum Enet estimator. We omit its proof.
Proposition 1
Let be the SVD and be another matrix. Then,
Case 1: . By algebra , so that
(22) 
The last inequality above follows from the first inequalities in (7), (8) and (9). It remains to bound . Let . We write the spectrum Enet estimator (3) as
It follows that for a certain member in the subdifferential of at ,
Let . Since , we have
(23)  
(24)  
(25)  
(26) 
The second inequality in (23) is due to and . The last equality in (23) follows from the definition of , since it gives . By the definitions of , and , . Since and , we find
Thus, by the second inequalities of (8) and (9),