 # Sparse Trace Norm Regularization

We study the problem of estimating multiple predictive functions from a dictionary of basis functions in the nonparametric regression setting. Our estimation scheme assumes that each predictive function can be estimated in the form of a linear combination of the basis functions. By assuming that the coefficient matrix admits a sparse low-rank structure, we formulate the function estimation problem as a convex program regularized by the trace norm and the ℓ_1-norm simultaneously. We propose to solve the convex program using the accelerated gradient (AG) method and the alternating direction method of multipliers (ADMM) respectively; we also develop efficient algorithms to solve the key components in both AG and ADMM. In addition, we conduct theoretical analysis on the proposed function estimation scheme: we derive a key property of the optimal solution to the convex program; based on an assumption on the basis functions, we establish a performance bound of the proposed function estimation scheme (via the composite regularization). Simulation studies demonstrate the effectiveness and efficiency of the proposed algorithms.


## 1 Introduction

We study the problem of estimating multiple predictive functions from noisy observations. Such a problem has received broad attention in many areas of statistics and machine learning [6, 15, 16, 18]. This line of work can be roughly divided into two categories: parametric estimation and non-parametric estimation; a common and important theme for both categories is the appropriate assumption on the structure of the model parameters (parametric setting) or of the coefficients over the dictionary (nonparametric setting).

There has been an enormous amount of literature on effective function estimation based on different sparsity constraints, including the estimation of sparse linear regression via the ℓ1-norm penalty [3, 6, 27, 32], and the estimation of linear regression functions using the group lasso estimator [15, 16]. More recently, trace norm regularization has become a popular tool for approximating a set of linear models and the associated low-rank matrices in the high-dimensional setting [18, 24]; the trace norm is the tightest convex surrogate for the (non-convex) rank function under certain conditions, encouraging sparsity in the singular values of the matrix of interest. One limitation of trace norm regularization is that the resulting model is dense in general. However, in many real-world applications, the underlying structure of multiple predictive functions may be sparse as well as low-rank; the sparsity leads to explicitly interpretable prediction models, and the low rank implies essential subspace structure information. Similarly, the ℓ1-norm is the tightest convex surrogate for the non-convex cardinality function, encouraging sparsity in the entries of the matrix. This motivates us to explore the combination of the trace norm and the ℓ1-norm as a composite regularization (called sparse trace norm regularization) to induce the desirable sparse low-rank structure.

Trace norm regularization (minimization) has been investigated extensively in recent years. Efficient algorithms have been developed for solving convex programs with trace norm regularization [29, 12]; sufficient conditions for exact recovery from trace norm minimization have been established; the consistency of trace norm minimization has been studied; and trace norm minimization has been applied to matrix completion and collaborative filtering [25, 23]. Similarly, ℓ1-norm regularization has been well studied in the literature, from efficient algorithms for convex optimization [11, 13, 29] to theoretical guarantees on performance [9, 32] and model selection consistency.

In this paper, we focus on estimating multiple predictive functions simultaneously from a finite dictionary of basis functions in the nonparametric regression setting. Our function estimation scheme assumes that each predictive function can be approximated using a linear combination of those basis functions. By assuming that the coefficient matrix of the basis functions admits a sparse low-rank structure, we formulate the function estimation problem as a convex formulation, in which the combination of the trace norm and the ℓ1-norm is employed as a composite regularization to induce a sparse low-rank structure in the coefficient matrix. This simultaneous sparse and low-rank structure is different from the incoherent sparse and low-rank structures studied in [8, 10]. We propose to solve the function estimation problem using the accelerated gradient method and the alternating direction method of multipliers; we also develop efficient algorithms to solve the key components involved in both methods. We conduct theoretical analysis on the proposed convex formulation: we first present some basic properties of the optimal solution to the convex formulation (Lemma 4.1); we then present an assumption associated with the geometric nature of the basis functions over the prescribed observations; based on such an assumption, we derive a performance bound of the composite regularization for function estimation (Theorem 4.1). We conduct simulations on benchmark data to demonstrate the effectiveness and efficiency of the proposed algorithms.

Notation. Denote $\mathbb{N}_d = \{1, \cdots, d\}$. For any matrix $\Theta$, denote its trace norm by $\|\Theta\|_*$, i.e., the sum of its singular values; denote its operator norm by $\|\Theta\|_2$, i.e., its largest singular value; denote its $\ell_1$-norm by $\|\Theta\|_1$, i.e., the sum of the absolute values of all its entries.

## 2 Problem Formulation

Let $\{(x_i, y_i)\}_{i=1}^{n}$ be a set of prescribed sample pairs (fixed design) associated with $k$ unknown functions $\{f_j\}_{j=1}^{k}$ as

$$ y_{ij} = f_j(x_i) + w_{ij}, \quad i \in \mathbb{N}_n,\ j \in \mathbb{N}_k, \qquad (1) $$

where $f_j$ is an unknown regression function, $y_{ij}$ denotes the $j$-th entry of the response vector $y_i \in \mathbb{R}^k$, and $w_{ij}$ is a stochastic noise variable. Let $Y = (y_{ij}) \in \mathbb{R}^{n \times k}$ and $W = (w_{ij}) \in \mathbb{R}^{n \times k}$. Denoting

$$ F = \left(f_j(x_i)\right)_{i,j} \in \mathbb{R}^{n \times k}, \quad i \in \mathbb{N}_n,\ j \in \mathbb{N}_k, \qquad (2) $$

we can rewrite Eq. (1) in a compact form as $Y = F + W$. Let $\{g_i\}_{i=1}^{h}$ be a set of $h$ pre-specified basis functions, and let $\Theta = (\theta_{ij}) \in \mathbb{R}^{h \times k}$ be the coefficient matrix. We define

$$ \hat{g}_j(x) = \sum_{i=1}^{h} \theta_{ij}\, g_i(x), \quad j \in \mathbb{N}_k, \qquad (3) $$

where $\theta_{ij}$ denotes the $i$-th entry of the coefficient vector $\theta_j$ (the $j$-th column of $\Theta$). Note that in practice the basis functions can be estimators obtained from different methods, or from different values of the tuning parameters of the same method.

We consider the problem of estimating the unknown functions $\{f_j\}$ using the composite functions $\{\hat{g}_j\}$ defined in Eq. (3), respectively. Denote

$$ G_X = \left(g_j(x_i)\right)_{i,j} \in \mathbb{R}^{n \times h}, \quad i \in \mathbb{N}_n,\ j \in \mathbb{N}_h, \qquad (4) $$

and define the empirical error as

$$ \hat{S}(\Theta) = \frac{1}{N}\|G_X \Theta - Y\|_F^2, \qquad (5) $$

where $N = nk$. Our goal is to estimate the model parameter $\Theta$ of a sparse low-rank structure from the given sample pairs $\{(x_i, y_i)\}_{i=1}^{n}$. Such a structure induces the sparsity and the low rank simultaneously in a single matrix of interest.

Given that the functions $\{\hat{g}_j\}$ are coupled via $\Theta$ in some coherent sparse and low-rank structure, we propose to estimate $\Theta$ as

$$ \hat{\Theta} = \arg\min_{\Theta} \left( \hat{S}(\Theta) + \alpha\|\Theta\|_* + \beta\|\Theta\|_1 \right), \qquad (6) $$

where $\alpha$ and $\beta$ are regularization parameters (estimated via cross-validation), and the linear combination of $\|\Theta\|_*$ and $\|\Theta\|_1$ is used to induce the sparse low-rank structure in $\hat{\Theta}$. The optimization problem in Eq. (6) is non-smooth and convex and hence admits a globally optimal solution; it can be solved using many sophisticated optimization techniques [28, 12]; in Section 3, we propose to apply the accelerated gradient method and the alternating direction method of multipliers to solve the optimization problem in Eq. (6).
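The objective in Eq. (6) can be evaluated directly with numpy; the following is an illustrative sketch (the function name `objective` and argument names are ours, not the authors' code), assuming the least squares empirical error of Eq. (5):

```python
import numpy as np

def objective(G_X, Y, Theta, alpha, beta):
    """Objective of Eq. (6): empirical error plus composite regularization."""
    N = Y.size  # N = n * k, the total number of observations
    loss = np.linalg.norm(G_X @ Theta - Y, "fro") ** 2 / N  # S_hat(Theta), Eq. (5)
    trace_norm = np.linalg.norm(Theta, "nuc")               # sum of singular values
    l1_norm = np.abs(Theta).sum()                           # sum of absolute entries
    return loss + alpha * trace_norm + beta * l1_norm
```

Minimizing this objective over `Theta` is exactly the convex program of Eq. (6); the two penalties jointly shrink singular values (low rank) and individual entries (sparsity).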

## 3 Optimization Algorithms

In this section, we apply the accelerated gradient (AG) algorithm [2, 19, 20] and the alternating direction method of multipliers (ADMM), respectively, to solve the (non-smooth and convex) optimization problem in Eq. (6). We also develop efficient algorithms to solve the key components involved in both AG and ADMM.

### 3.1 Accelerated Gradient Method

The AG algorithm has attracted extensive attention in the machine learning community due to its optimal convergence rate among all first-order techniques and its ability to deal with large-scale data. The general scheme of AG for solving Eq. (6) can be described as follows: at the $k$-th iteration, the intermediate (feasible) solution $\Theta_k$ is obtained via

$$ \Theta_k = \arg\min_{\Theta}\ \frac{\gamma_k}{2}\left\|\Theta - \left(\Phi_k - \frac{1}{\gamma_k}\nabla \hat{S}(\Phi_k)\right)\right\|_F^2 + \alpha\|\Theta\|_* + \beta\|\Theta\|_1, \qquad (7) $$

where $\Phi_k$ denotes a searching point constructed from the intermediate solutions of previous iterations, $\nabla \hat{S}(\Phi_k)$ denotes the derivative of the loss function in Eq. (5) at $\Phi_k$, and $\gamma_k$ specifies the step size, which can be determined by iterative increment until the condition

$$ \hat{S}(\Theta_k) \le \hat{S}(\Phi_k) + \langle \nabla \hat{S}(\Phi_k), \Theta_k - \Phi_k \rangle + \frac{\gamma_k}{2}\|\Theta_k - \Phi_k\|_F^2 $$

is satisfied. The operation in Eq. (7) is commonly referred to as the proximal operator, and its efficient computation is critical for the practical convergence of AG-type algorithms. Next we present an efficient alternating optimization procedure to solve Eq. (7) for a given $\Phi_k$.
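The step-size search described above can be sketched as follows. Here `loss`, `grad`, and `prox` are abstract callables supplied by the user (the `prox` of Eq. (7) is left as a placeholder), and all names are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def backtracking_step(loss, grad, prox, Phi, gamma=1.0, eta=2.0, max_iter=50):
    """Increase gamma until the quadratic upper bound
    loss(Theta) <= loss(Phi) + <grad(Phi), Theta - Phi> + (gamma/2)||Theta - Phi||_F^2
    holds; prox(Point, gamma) stands for the proximal operator in Eq. (7)."""
    g = grad(Phi)
    for _ in range(max_iter):
        Theta = prox(Phi - g / gamma, gamma)  # candidate intermediate solution
        D = Theta - Phi
        if loss(Theta) <= loss(Phi) + np.sum(g * D) + 0.5 * gamma * np.sum(D * D):
            break
        gamma *= eta  # bound violated: increase the step parameter and retry
    return Theta, gamma
```

With the least squares loss, the accepted `gamma` guarantees sufficient decrease, which is what the AG convergence analysis requires.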

#### 3.1.1 Dual Formulation

The problem in Eq. (7) is not easy to solve directly; next we show that it can be efficiently solved in its dual form. By reformulating $\|\Theta\|_*$ and $\|\Theta\|_1$ into their equivalent dual forms, we convert Eq. (7) into the max-min formulation

$$ \max_{L, S} \min_{\Theta}\ \|\Theta - \hat{\Phi}\|_F^2 + \hat{\alpha}\langle L, \Theta \rangle + \hat{\beta}\langle S, \Theta \rangle, \quad \text{subject to } \|L\|_2 \le 1,\ \|S\|_\infty \le 1, \qquad (8) $$

where $\hat{\Phi} = \Phi_k - \nabla \hat{S}(\Phi_k)/\gamma_k$, $\hat{\alpha} = 2\alpha/\gamma_k$, and $\hat{\beta} = 2\beta/\gamma_k$. It can be verified that the Slater condition is satisfied in Eq. (8) and strong duality holds. Moreover, the optimal $\Theta$ can be expressed as a function of $L$ and $S$, given by

$$ \Theta = \hat{\Phi} - \frac{1}{2}\left(\hat{\alpha} L + \hat{\beta} S\right). \qquad (9) $$

By substituting Eq. (9) into Eq. (8), we obtain the dual form of Eq. (7) as

$$ \min_{L, S}\ \|\hat{\alpha} L + \hat{\beta} S - 2\hat{\Phi}\|_F^2, \quad \text{subject to } \|L\|_2 \le 1,\ \|S\|_\infty \le 1. \qquad (10) $$

#### 3.1.2 Alternating Optimization

The optimization problem in Eq. (10) is smooth and convex and has two optimization variables. For this type of problem, the coordinate descent (CD) method is routinely used to compute a globally optimal solution. To solve Eq. (10), the CD method alternately optimizes one of the two variables with the other variable fixed. Our analysis below shows that the variables $L$ and $S$ in Eq. (10) can be optimized efficiently. Note that although the convergence rate of the CD method is not known, it converges very fast in practice (within a small number of iterations in our experiments).

Optimization of L. For a given $S$, the variable $L$ can be optimized by solving the following problem:

$$ \min_{L}\ \|L - \hat{L}\|_F^2, \quad \text{subject to } \|L\|_2 \le 1, \qquad (11) $$

where $\hat{L} = (2\hat{\Phi} - \hat{\beta} S)/\hat{\alpha}$. The optimization over $L$ above can be interpreted as computing the projection of a given matrix onto the unit spectral-norm ball. Our analysis shows that the optimal solution to Eq. (11) can be expressed in an analytic form, as summarized in the following theorem.

###### Theorem 3.1.

For an arbitrary $\hat{L}$ in Eq. (11), denote its SVD by $\hat{L} = U \Sigma V^T$, where $U$ and $V$ consist of the left and right singular vectors, $\Sigma = \mathrm{diag}(\sigma_1, \cdots, \sigma_r)$, and $r = \mathrm{rank}(\hat{L})$. Let $\hat{\sigma}_i^* = \min(\sigma_i, 1)$, $i = 1, \cdots, r$. Then the optimal solution to Eq. (11) is given by

$$ L^* = U \hat{\Sigma} V^T, \quad \hat{\Sigma} = \mathrm{diag}(\hat{\sigma}_1^*, \cdots, \hat{\sigma}_r^*). \qquad (12) $$
###### Proof.

Assume the existence of a set of left and right singular vector pairs shared by the optimal $L^*$ to Eq. (11) and the given $\hat{L}$ for their non-zero singular values. Under such an assumption, it can be verified that the singular values $\{\hat{\sigma}_i\}$ of $L^*$ can be obtained via

$$ \min_{\{\hat{\sigma}_i\}}\ \sum_{i=1}^{r} (\hat{\sigma}_i - \sigma_i)^2, \quad \text{subject to } 0 \le \hat{\sigma}_i \le 1,\ i = 1, \cdots, r, $$

to which the optimal solution is given by $\hat{\sigma}_i^* = \min(\sigma_i, 1)$; hence the expression of $L^*$ coincides with Eq. (12). Therefore, all that remains is to show that our assumption (on the left and right singular vector pairs of $L^*$ and $\hat{L}$) holds.

Denote the Lagrangian associated with the problem in Eq. (11) by $\mathcal{L}(L, \lambda) = \|L - \hat{L}\|_F^2 + \lambda(\|L\|_2 - 1)$, where $\lambda \ge 0$ denotes the dual variable. Since $L = 0$ is strictly feasible in Eq. (11), namely, $\|0\|_2 < 1$, strong duality holds for Eq. (11). Let $\lambda^*$ be the optimal dual variable to Eq. (11); therefore $L^* = \arg\min_L \mathcal{L}(L, \lambda^*)$. It is well known that $L^*$ minimizes $\mathcal{L}(L, \lambda^*)$ if and only if $0$ is a subgradient of $\mathcal{L}(\cdot, \lambda^*)$ at $L^*$, i.e.,

$$ 0 \in 2(L^* - \hat{L}) + \lambda^* \partial\|L^*\|_2. \qquad (13) $$

For any matrix $Z$, the subdifferential of $\|Z\|_2$ is the convex hull of the outer products of its top singular vector pairs, where $\mathrm{conv}(\cdot)$ denotes the convex hull of a set. Specifically, any element of $\partial\|Z\|_2$ has the form

$$ \sum_i \alpha_i u_{z_i} v_{z_i}^T, \quad \alpha_i \ge 0, \quad \sum_i \alpha_i = 1, $$

where $u_{z_i}$ and $v_{z_i}$ are any left and right singular vectors of $Z$ corresponding to its largest singular value (the top singular values may share a common value). From Eq. (13) and the definition of $\partial\|L^*\|_2$, there exist $\{\hat{\alpha}_i\}$ such that $\hat{\alpha}_i \ge 0$, $\sum_i \hat{\alpha}_i = 1$, and

$$ \hat{L} = L^* + \frac{\lambda^*}{2} \sum_i \hat{\alpha}_i u_{l_i} v_{l_i}^T, \qquad (14) $$

where $u_{l_i}$ and $v_{l_i}$ are any left and right singular vectors of $L^*$ corresponding to its largest singular value. Since $\lambda^* \ge 0$, Eq. (14) verifies the existence of a set of left and right singular vector pairs shared by $\hat{L}$ and $L^*$. This completes the proof. ∎
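The projection of Theorem 3.1 is a one-liner in numpy; the sketch below (illustrative names, our code rather than the authors') clips the singular values at one:

```python
import numpy as np

def project_spectral_ball(L_hat):
    """Optimal solution of Eq. (11): clip each singular value of L_hat at 1."""
    U, s, Vt = np.linalg.svd(L_hat, full_matrices=False)
    # Scale the columns of U by the clipped singular values, then recompose
    return (U * np.minimum(s, 1.0)) @ Vt
```

A matrix already inside the unit spectral-norm ball is returned unchanged, which matches the interpretation of Eq. (11) as a Euclidean projection.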

Optimization of S. For a given $L$, the variable $S$ can be optimized by solving the following problem:

$$ \min_{S}\ \|S - \hat{S}\|_F^2, \quad \text{subject to } \|S\|_\infty \le 1, \qquad (15) $$

where $\hat{S} = (2\hat{\Phi} - \hat{\alpha} L)/\hat{\beta}$. Similarly, the optimization over $S$ can be interpreted as computing the projection of a given matrix onto the unit infinity-norm ball. It also admits an analytic solution, as summarized in the following lemma.

###### Lemma 3.1.

For any matrix $\hat{S}$, the optimal solution to Eq. (15) is given by

$$ S^* = \mathrm{sgn}(\hat{S}) \circ \min(|\hat{S}|, \mathbf{1}), \qquad (16) $$

where $\circ$ denotes the component-wise multiplication operator, and $\mathbf{1}$ denotes the matrix of all-one entries of appropriate size.
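Putting the two projections together gives a minimal sketch of the alternating optimization of Section 3.1.2 for the dual problem in Eq. (10), with the primal solution recovered via Eq. (9). The fixed iteration count and function names are illustrative choices of ours:

```python
import numpy as np

def project_spectral(L_hat):
    # Theorem 3.1: clip the singular values at 1
    U, s, Vt = np.linalg.svd(L_hat, full_matrices=False)
    return (U * np.minimum(s, 1.0)) @ Vt

def project_inf(S_hat):
    # Lemma 3.1: clip each entry to [-1, 1]
    return np.clip(S_hat, -1.0, 1.0)

def solve_dual(Phi_hat, alpha, beta, n_iter=50):
    """Alternately minimize ||alpha*L + beta*S - 2*Phi_hat||_F^2 over the two
    unit balls of Eq. (10), then recover Theta via Eq. (9)."""
    L = np.zeros_like(Phi_hat)
    S = np.zeros_like(Phi_hat)
    for _ in range(n_iter):
        L = project_spectral((2.0 * Phi_hat - beta * S) / alpha)  # S fixed
        S = project_inf((2.0 * Phi_hat - alpha * L) / beta)       # L fixed
    Theta = Phi_hat - 0.5 * (alpha * L + beta * S)                # Eq. (9)
    return Theta, L, S
```

Each inner step is exactly one of the two analytic projections above, so each sweep of the loop is cheap: one SVD plus one entrywise clip.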

### 3.2 Alternating Direction Method of Multipliers

The ADMM algorithm is suitable for non-smooth (convex) optimization problems, as it blends the decomposability of dual ascent with the superior convergence of the method of multipliers. We present two implementations of the ADMM algorithm for solving Eq. (6). Due to the space constraint, we move the detailed discussion of the two ADMM implementations to the supplemental material.

## 4 Theoretical Analysis

In this section, we present a performance bound for the function estimation scheme in Eq. (3). Such a performance bound measures how well the estimation scheme can approximate the regression functions $F$ in Eq. (2) via the sparse low-rank coefficient matrix $\hat{\Theta}$.

### 4.1 Basic Properties of the Optimal Solution

We first present some basic properties of the optimal solution $\hat{\Theta}$ defined in Eq. (6); these properties are important building blocks of the theoretical analysis that follows.

###### Lemma 4.1.

Consider the optimization problem in Eq. (6) for $h \ge 1$ and $k \ge 1$. Given $n$ sample pairs $\{(x_i, y_i)\}_{i=1}^{n}$ satisfying Eq. (1), let $F$ and $G_X$ be defined in Eq. (2) and Eq. (4), respectively; let $\sigma_X(l)$ be the largest singular value of $G_X$. Assume that $W$ has independent and identically distributed (i.i.d.) entries $w_{ij} \sim \mathcal{N}(0, \sigma_w^2)$. Take

$$ \alpha + \beta = \frac{2\sigma_X(l)\sigma_w\sqrt{n}}{N}\left(1 + \sqrt{\frac{k}{n}} + t\right), \qquad (17) $$

where $N = nk$ and $t > 0$. Then with probability of at least $1 - \exp(-nt^2/2)$, for the minimizer $\hat{\Theta}$ in Eq. (6) and any $\Theta \in \mathbb{R}^{h \times k}$, we have

$$ \frac{1}{N}\|G_X\hat{\Theta} - F\|_F^2 \le \frac{1}{N}\|G_X\Theta - F\|_F^2 + 2\alpha\|S_0(\hat{\Theta} - \Theta)\|_* + 2\beta\|(\hat{\Theta} - \Theta)_{J(\Theta)}\|_1, \qquad (18) $$

where $S_0(\cdot)$ is an operator defined in Lemma 1 of the supplemental material.

###### Proof.

From the definition of $\hat{\Theta}$ in Eq. (6), we have $\hat{S}(\hat{\Theta}) + \alpha\|\hat{\Theta}\|_* + \beta\|\hat{\Theta}\|_1 \le \hat{S}(\Theta) + \alpha\|\Theta\|_* + \beta\|\Theta\|_1$. By substituting $Y = F + W$ and Eq. (5) into the previous inequality, we have

$$ \frac{1}{N}\|G_X\hat{\Theta} - F\|_F^2 \le \frac{1}{N}\|G_X\Theta - F\|_F^2 + \frac{2}{N}\langle W, G_X(\hat{\Theta} - \Theta)\rangle + \alpha\left(\|\Theta\|_* - \|\hat{\Theta}\|_*\right) + \beta\left(\|\Theta\|_1 - \|\hat{\Theta}\|_1\right). $$

Define the random event

$$ \mathcal{A} = \left\{ \frac{1}{N}\|G_X^T W\|_2 \le \frac{\alpha + \beta}{2} \right\}. \qquad (19) $$

Taking $\alpha + \beta$ as the value in Eq. (17), it follows from Lemma 3 of the supplemental material that $\mathcal{A}$ holds with probability of at least $1 - \exp(-nt^2/2)$. Under $\mathcal{A}$, we have

$$ \langle W, G_X(\hat{\Theta} - \Theta)\rangle = \frac{\alpha + \beta}{\alpha + \beta}\langle W, G_X(\hat{\Theta} - \Theta)\rangle \le \frac{\alpha}{\alpha + \beta}\|G_X^T W\|_2 \|\hat{\Theta} - \Theta\|_* + \frac{\beta}{\alpha + \beta}\|G_X^T W\|_\infty \|\hat{\Theta} - \Theta\|_1 \le \frac{N}{2}\left(\alpha\|\hat{\Theta} - \Theta\|_* + \beta\|\hat{\Theta} - \Theta\|_1\right), $$

where the second inequality follows from $\|G_X^T W\|_\infty \le \|G_X^T W\|_2$. Therefore, under $\mathcal{A}$, we have

$$ \frac{1}{N}\|G_X\hat{\Theta} - F\|_F^2 \le \frac{1}{N}\|G_X\Theta - F\|_F^2 + \alpha\left(\|\hat{\Theta} - \Theta\|_* + \|\Theta\|_* - \|\hat{\Theta}\|_*\right) + \beta\left(\|\hat{\Theta} - \Theta\|_1 + \|\Theta\|_1 - \|\hat{\Theta}\|_1\right). $$

From Corollary 1 and Lemma 2 of the supplemental material, we complete the proof. ∎

### 4.2 Main Assumption

We introduce a key assumption on the dictionary of basis functions $\{g_i\}_{i=1}^{h}$. Based on such an assumption, we derive a performance bound for the sparse trace norm regularization formulation in Eq. (6).

###### Assumption 4.1.

For a matrix pair $\Theta$ and $\Delta$ of size $h \times k$, let $S_0(\Delta) = S_0(\Theta, \Delta)$ and let $J(\Theta)$ denote the location set of the nonzero entries of $\Theta$ (see the supplemental material). We assume that there exist constants $\kappa_1(s) > 0$ and $\kappa_2(q) > 0$ such that

$$ \kappa_1(s) \triangleq \min_{\Delta \in \mathcal{R}(s,q)} \frac{\|G_X\Delta\|_F}{\sqrt{N}\,\|S_0(\Delta)\|_*} > 0, \quad \kappa_2(q) \triangleq \min_{\Delta \in \mathcal{R}(s,q)} \frac{\|G_X\Delta\|_F}{\sqrt{N}\,\|\Delta_{J(\Theta)}\|_1} > 0, \qquad (20) $$

where the restricted set $\mathcal{R}(s, q)$ is defined as

$$ \mathcal{R}(s, q) = \left\{ \Delta \in \mathbb{R}^{h \times k},\ \Theta \in \mathbb{R}^{h \times k} \ \middle|\ \Delta \ne 0,\ \mathrm{rank}(S_0(\Delta)) \le s,\ |J(\Theta)| \le q \right\}, $$

and $|J(\Theta)|$ denotes the number of nonzero entries in the matrix $\Theta$.

Our assumption on $\kappa_1(s)$ in Eq. (20) is closely related to, but less restrictive than, the restricted strong convexity (RSC) condition: its denominator is only a part of the one in RSC and is in a different matrix norm as well. Our assumption on $\kappa_2(q)$ is similar to the restricted eigenvalue (RE) condition except that its denominator is in a different matrix norm; our assumption can also be implied by sufficient conditions similar to existing ones.

### 4.3 Performance Bound

We derive a performance bound for the sparse trace norm structure obtained by solving Eq. (6). This bound measures how well the optimal solution $\hat{\Theta}$ can be used to approximate $F$ by evaluating the averaged estimation error $\frac{1}{N}\|G_X\hat{\Theta} - F\|_F^2$.

###### Theorem 4.1.

Consider the optimization problem in Eq. (6) for $h \ge 1$ and $k \ge 1$. Given $n$ sample pairs $\{(x_i, y_i)\}_{i=1}^{n}$ satisfying Eq. (1), let $F$ and $G_X$ be defined in Eqs. (2) and (4), respectively; let $\sigma_X(l)$ be the largest singular value of $G_X$. Assume that $W$ has i.i.d. entries $w_{ij} \sim \mathcal{N}(0, \sigma_w^2)$. Take $\alpha + \beta$ as the value in Eq. (17). Then with probability of at least $1 - \exp(-nt^2/2)$, for the minimizer $\hat{\Theta}$ in Eq. (6), we have

$$ \frac{1}{N}\|G_X\hat{\Theta} - F\|_F^2 \le (1 + \epsilon)\inf_{\Theta}\left\{ \frac{1}{N}\|G_X\Theta - F\|_F^2 \right\} + \mathcal{E}(\epsilon)\left( \frac{\alpha^2}{\kappa_1^2(2r)} + \frac{\beta^2}{\kappa_2^2(c)} \right), \qquad (21) $$

where the infimum is taken over all $\Theta \in \mathbb{R}^{h \times k}$ with $\mathrm{rank}(\Theta) \le r$ and $|J(\Theta)| \le c$, and $\mathcal{E}(\epsilon)$ is a constant depending only on $\epsilon$.

###### Proof.

Denote $\Delta = \hat{\Theta} - \Theta$ in Eq. (18). We have

$$ \frac{1}{N}\|G_X\hat{\Theta} - F\|_F^2 \le \frac{1}{N}\|G_X\Theta - F\|_F^2 + 2\alpha\|S_0(\Delta)\|_* + 2\beta\|\Delta_{J(\Theta)}\|_1. \qquad (22) $$

Given $\mathrm{rank}(S_0(\Delta)) \le 2r$ (from Lemma 1 of the supplemental material) and $|J(\Theta)| \le c$, we derive upper bounds on the components $2\alpha\|S_0(\Delta)\|_*$ and $2\beta\|\Delta_{J(\Theta)}\|_1$ over the restricted set based on Assumption 4.1, respectively. It follows that

$$ 2\alpha\|S_0(\Delta)\|_* \le \frac{2\alpha}{\kappa_1(2r)\sqrt{N}}\|G_X(\hat{\Theta} - \Theta)\|_F \le \frac{2\alpha}{\kappa_1(2r)\sqrt{N}}\left( \|G_X\hat{\Theta} - F\|_F + \|G_X\Theta - F\|_F \right) \le \frac{\alpha^2\tau}{\kappa_1^2(2r)} + \frac{1}{N\tau}\|G_X\hat{\Theta} - F\|_F^2 + \frac{\alpha^2\tau}{\kappa_1^2(2r)} + \frac{1}{N\tau}\|G_X\Theta - F\|_F^2, \qquad (23) $$

where the last inequality above follows from $2ab \le \tau a^2 + b^2/\tau$ for $\tau > 0$. Similarly, we have

$$ 2\beta\|\Delta_{J(\Theta)}\|_1 \le \frac{\beta^2\tau}{\kappa_2^2(c)} + \frac{1}{N\tau}\|G_X\hat{\Theta} - F\|_F^2 + \frac{\beta^2\tau}{\kappa_2^2(c)} + \frac{1}{N\tau}\|G_X\Theta - F\|_F^2. \qquad (24) $$

Substituting Eqs. (23) and (24) into Eq. (22), we have

$$ \frac{1}{N}\|G_X\hat{\Theta} - F\|_F^2 \le \frac{\tau + 2}{(\tau - 2)N}\|G_X\Theta - F\|_F^2 + \frac{2\tau^2}{\tau - 2}\left( \frac{\alpha^2}{\kappa_1^2(2r)} + \frac{\beta^2}{\kappa_2^2(c)} \right). $$

Setting $\epsilon = 4/(\tau - 2)$ and $\mathcal{E}(\epsilon) = 2\tau^2/(\tau - 2)$ in the inequality above, we complete the proof. ∎

By choosing specific values for $\alpha$ and $\beta$, we can refine the performance bound described in Eq. (21). From Eq. (17), we have

$$ \min_{\alpha, \beta:\ \alpha + \beta = \gamma}\left( \frac{\alpha^2}{\kappa_1^2(2r)} + \frac{\beta^2}{\kappa_2^2(c)} \right) = \frac{\gamma^2}{\kappa_1^2(2r) + \kappa_2^2(c)}, \quad \gamma = \frac{2\sigma_X(l)\sigma_w\sqrt{n}}{N}\left(1 + \sqrt{\frac{k}{n}} + t\right), \qquad (25) $$

where the minimum in the first equation is achieved by setting $\alpha$ and $\beta$ proportional to $\kappa_1^2(2r)$ and $\kappa_2^2(c)$, i.e., $\alpha = \gamma\kappa_1^2(2r)/(\kappa_1^2(2r) + \kappa_2^2(c))$ and $\beta = \gamma\kappa_2^2(c)/(\kappa_1^2(2r) + \kappa_2^2(c))$. Thus the performance bound in Eq. (21) can be refined as

$$ \frac{1}{N}\|G_X\hat{\Theta} - F\|_F^2 \le (1 + \epsilon)\inf_{\Theta}\left\{ \frac{1}{N}\|G_X\Theta - F\|_F^2 \right\} + \frac{4\mathcal{E}(\epsilon)\sigma_X^2(l)\sigma_w^2\, n}{N^2\left(\kappa_1^2(2r) + \kappa_2^2(c)\right)}\left(1 + \sqrt{\frac{k}{n}} + t\right)^2. $$

Note that the refined performance bound above depends only on the sum $\alpha + \beta$ rather than on the individual values of $\alpha$ and $\beta$, and it is tighter than the one described in Eq. (21).

## 5 Experiments

In this section, we evaluate the effectiveness of the sparse trace norm regularization formulation in Eq. (6) on benchmark data sets; we also conduct numerical studies on the convergence of AG and of the two ADMM implementations, ADMM1 and ADMM2 (see details in Section E of the supplemental material), for solving Eq. (6), as well as on the convergence of the alternating optimization algorithm for solving Eq. (10). Note that we use the least squares loss in the following experiments.

Performance Evaluation. We apply the sparse trace norm regularization formulation (S.TraceNorm) to multi-label classification problems, in comparison with the trace norm regularization formulation (TraceNorm) and the ℓ1-norm regularization formulation (OneNorm). AUC, Macro F1, and Micro F1 are used as the classification performance measures. Four benchmark data sets, including Business, Arts, and Health from the Yahoo webpage data sets and Scene from the LIBSVM multi-label data sets, are employed in this experiment. The reported experimental results are averaged over random splits of the data sets into training and test sets. We use the AG method to solve the S.TraceNorm formulation, and stop the iterative procedure of AG if the change of the objective values in two successive iterations is smaller than a prescribed tolerance or the iteration number exceeds a prescribed maximum. The regularization parameters α and β are determined via double cross-validation over a prescribed candidate set.

We present the averaged performance of the competing algorithms in Table 1. The main observations are summarized as follows: (1) S.TraceNorm achieves the best performance on all benchmark data sets (except on the Business data) in this experiment; this result demonstrates the effectiveness of the induced sparse low-rank structure for multi-label classification tasks; (2) TraceNorm outperforms OneNorm on all benchmark data sets; this result demonstrates the effectiveness of modeling a shared low-rank structure for high-dimensional text and image data analysis.

Numerical Study. We study the practical convergence of AG and ADMM by solving Eq. (6) on the Scene data. In our experiments, we observe that ADMM1 is much slower than ADMM2, and we thus focus only on ADMM2. For other parameter settings of AG and ADMM2, we observe similar trends.

In the first experiment, we compare AG and ADMM2 in terms of practical convergence. We stop ADMM2 when the change of the objective values in two successive iterations is smaller than a prescribed tolerance; the objective value attained by ADMM2 is then used as the stopping criterion for AG, that is, we stop AG once its attained objective value is equal to or smaller than the objective value attained by ADMM2. The convergence curves of ADMM2 and AG are presented in the left plot of Figure 1. Clearly, we can observe that AG converges much faster than ADMM2. In the second experiment, we study the convergence of AG. We stop AG when the change of the objective values in two successive iterations is smaller than a prescribed tolerance. The convergence curve is presented in the middle plot of Figure 1. We observe that AG converges very fast, and its convergence speed is consistent with the theoretical convergence analysis.

Figure 1: Convergence comparison of AG and ADMM2 for solving Eq. (6) (left plot); convergence plot of AG for solving Eq. (6) (middle plot); and the alternating optimization algorithm for solving the dual formulation of the proximal operator in Eq. (10) (right plot).

We also conduct a numerical study on the alternating optimization algorithm (in Section 3.1.2) for solving the dual formulation of the proximal operator in Eq. (10). Similarly, the alternating optimization algorithm is stopped when the change of the objective values in two successive iterations is smaller than a prescribed tolerance. For illustration, we randomly generate the matrix $\hat{\Phi}$ in Eq. (10); we then apply the alternating optimization algorithm to solve Eq. (10) and plot its convergence curve in the right plot of Figure 1. Our experimental results show that the alternating optimization algorithm generally converges within a small number of iterations, demonstrating the practical efficiency of this algorithm.

## 6 Conclusion

We study the problem of estimating multiple predictive functions simultaneously in the nonparametric regression setting. In our estimation scheme, each predictive function is estimated using a linear combination of a dictionary of pre-specified basis functions. By assuming that the coefficient matrix admits a sparse low-rank structure, we formulate the function estimation problem as a convex program with trace norm and ℓ1-norm regularization. We propose to employ the AG and ADMM algorithms to solve the function estimation problem, and also develop efficient algorithms for the key components involved in AG and ADMM. We derive a key property of the optimal solution to the convex program; moreover, based on an assumption associated with the basis functions, we establish a performance bound of the proposed function estimation scheme using the composite regularization. Our simulation studies demonstrate the effectiveness and the efficiency of the proposed formulation. In the future, we plan to derive a formal sparse oracle inequality for the convex problem in Eq. (6); we also plan to apply the proposed function estimation formulation to other real-world applications.

## References

•  F. Bach. Consistency of trace norm minimization. Journal of Machine Learning Research, 9:1019–1048, 2008.
•  A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal of Imaging Science, 2:183–202, 2009.
•  P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of lasso and dantzig selector. Annals of Statistics, 37:1705–1732, 2009.
•  S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 2010.
•  S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
•  F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Aggregation and sparsity via penalized least squares. In COLT, 2006.
•  E. J. Candès and B. Recht. Exact matrix completion via convex optimization. CoRR, abs/0805.4471, 2008.
•  E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 2011.
•  E. J. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51:4203–4215, 2005.
•  V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Sparse and low-rank matrix decompositions. In SYSID, 2009.
•  B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–451, 2004.
•  M. Fazel, H. Hindi, and S. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, 2001.
•  J. Friedman, T. Hastie, H. Hofling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Statistics, 1:302–332, 2007.
•  L. Grippoa and M. Sciandrone. On the convergence of the block nonlinear gauss-seidel method under convex constraints. Operation Research Letters, 26:127–136, 2000.
•  J. Huang, T. Zhang, and D. N. Metaxas. Learning with structured sparsity. In ICML, 2009.
•  K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer. Taking advantage of sparsity in multi-task learning. In COLT, 2008.
•  J.-J. Moreau. Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France, 93:273–299, 1965.
•  S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. In ICML, 2010.
•  Y. Nesterov. Introductory lectures on convex programming. 1998. Lecture Notes.
•  Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper, 2007.
•  Y. C. Pati and T. Kailath. Phase-shifting masks for microlithography: automated design and mask requirements. Journal of the Optical Society of America A, 11(9):2438–2452, 1994.
•  B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. SIAM Review, (3):471–501, 2010.
•  J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005.
•  A. Rohde and A. B. Tsybakov. Estimation of high-dimensional low rank matrices. Preprint available at 0912.5338v2, 2010.
•  N. Srebro, J. D. M. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In NIPS, 2004.
•  S. J. Szarek. Condition numbers of random matrices. Journal of Complexity, 7(2):131–149, 1991.
•  R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
•  K. C. Toh, M. J. Todd, and R.H. Tutuncu. SDPT3: a MATLAB software package for semidefinite programming. Optimization Methods and Software, 11:545–581, 1999.
•  K. C. Toh and S. W. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacific Journal of Optimization, 2009.
•  N. Ueda and K. Saito. Single-shot detection of multiple categories of text using parametric mixture models. In KDD, 2002.
•  G. A. Watson. Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications, (170):33–45, 1992.
•  T. Zhang. Some sharp performance bounds for least squares regression with regularization. Annals of Statistics, 37:2109–2144, 2009.
•  P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.

## A. Operators S0 and S1

We define two operators, namely $S_0$ and $S_1$, on an arbitrary matrix pair (of the same size), as summarized in the following lemma.

###### Lemma 1.

Given any $\Theta$ and $\Delta$ of size $h \times k$, let $r = \mathrm{rank}(\Theta)$ and denote the SVD of $\Theta$ as

$$ \Theta = U \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix} V^T, $$

where $U \in \mathbb{R}^{h \times h}$ and $V \in \mathbb{R}^{k \times k}$ are orthogonal, and $\Sigma \in \mathbb{R}^{r \times r}$ is diagonal, consisting of the non-zero singular values of $\Theta$ on its main diagonal. Let

$$ \hat{\Delta} = U^T \Delta V = \begin{bmatrix} \hat{\Delta}_{11} & \hat{\Delta}_{12} \\ \hat{\Delta}_{21} & \hat{\Delta}_{22} \end{bmatrix}, $$

where $\hat{\Delta}_{11} \in \mathbb{R}^{r \times r}$, $\hat{\Delta}_{12} \in \mathbb{R}^{r \times (k - r)}$, $\hat{\Delta}_{21} \in \mathbb{R}^{(h - r) \times r}$, and $\hat{\Delta}_{22} \in \mathbb{R}^{(h - r) \times (k - r)}$. Define $S_0(\Theta, \Delta)$ and $S_1(\Theta, \Delta)$ as

$$ S_0(\Theta, \Delta) = U \begin{bmatrix} \hat{\Delta}_{11} & \hat{\Delta}_{12} \\ \hat{\Delta}_{21} & 0 \end{bmatrix} V^T, \quad S_1(\Theta, \Delta) = U \begin{bmatrix} 0 & 0 \\ 0 & \hat{\Delta}_{22} \end{bmatrix} V^T. $$

Then the following conditions hold: $\Theta^T S_1(\Theta, \Delta) = 0$, $\Theta\, S_1(\Theta, \Delta)^T = 0$, and $\mathrm{rank}(S_0(\Theta, \Delta)) \le 2\,\mathrm{rank}(\Theta)$.

The result presented in Lemma 1 implies a condition under which the trace norm on a matrix pair is additive. From Lemma 1 we can easily verify that

$$ \|\Theta + S_1(\Theta, \Delta)\|_* = \|\Theta\|_* + \|S_1(\Theta, \Delta)\|_*, \qquad (26) $$

for arbitrary $\Theta$ and $\Delta$ of the same size. To avoid cluttered notation, we denote $S_0(\Theta, \Delta)$ by $S_0(\Delta)$, and $S_1(\Theta, \Delta)$ by $S_1(\Delta)$, throughout this paper, as the appropriate $\Theta$ can be easily determined from the context.
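The split defined in Lemma 1 can be sketched numerically as follows; the function name and the rank tolerance are illustrative choices of ours, not the authors' code:

```python
import numpy as np

def split_S0_S1(Theta, Delta, tol=1e-10):
    """Return (S0(Theta, Delta), S1(Theta, Delta)) as defined in Lemma 1."""
    U, s, Vt = np.linalg.svd(Theta)       # full SVD: U is h x h, Vt is k x k
    r = int(np.sum(s > tol))              # numerical rank of Theta
    D = U.T @ Delta @ Vt.T                # Delta expressed in Theta's singular bases
    D1 = np.zeros_like(D)
    D1[r:, r:] = D[r:, r:]                # the hat-Delta_22 block
    # S0 keeps the blocks touching the top-r subspaces; S1 keeps the rest
    return U @ (D - D1) @ Vt, U @ D1 @ Vt
```

The conditions of Lemma 1 (the two orthogonality identities and the rank bound on $S_0$) can be checked directly on the returned pair.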

## B. Bound on Trace Norm

As a consequence of Lemma 1, we derive a bound on the trace norm of the matrices of interest as summarized below.

###### Corollary 1.

Given an arbitrary matrix pair $\hat{\Theta}$ and $\Theta$ of the same size, let $\Delta = \hat{\Theta} - \Theta$. Then

$$ \|\hat{\Theta} - \Theta\|_* + \|\Theta\|_* - \|\hat{\Theta}\|_* \le 2\|S_0(\Delta)\|_*. $$
###### Proof.

From Lemma 1 we have $\Delta = S_0(\Delta) + S_1(\Delta)$ for the matrix pair $\Theta$ and $\Delta$. Moreover,

$$ \|\hat{\Theta}\|_* = \|\Theta + S_0(\Delta) + S_1(\Delta)\|_* \ge \|\Theta + S_1(\Delta)\|_* - \|S_0(\Delta)\|_* = \|\Theta\|_* + \|S_1(\Delta)\|_* - \|S_0(\Delta)\|_*, \qquad (27) $$

where the inequality follows from the triangle inequality and the last equality follows from Eq. (26). Using the result in Eq. (27), we have

$$ \|\hat{\Theta} - \Theta\|_* + \|\Theta\|_* - \|\hat{\Theta}\|_* \le \|\Delta\|_* + \|\Theta\|_* - \|\Theta\|_* - \|S_1(\Delta)\|_* + \|S_0(\Delta)\|_* \le 2\|S_0(\Delta)\|_*. $$

We complete the proof of this corollary. ∎

## C. Bound on ℓ1-norm

Analogous to the bound on the trace norm in Corollary 1, we also derive a bound on the ℓ1-norm of the matrices of interest in the following lemma. For arbitrary matrices $\hat{\Theta}$ and $\Theta$, we denote by $J(\Theta)$ the coordinate set (the location set of nonzero entries) of $\Theta$, and by $J(\Theta)^\perp$ the associated complement (the location set of zero entries); we denote by $\hat{\Theta}_{J(\Theta)}$ the matrix with the same entries as $\hat{\Theta}$ on the set $J(\Theta)$ and zero entries on the set $J(\Theta)^\perp$. We now present a result associated with $J(\Theta)$ and $J(\Theta)^\perp$ in the following lemma. Note that a similar result for the vector case is known.

###### Lemma 2.

Given a matrix pair $\hat{\Theta}$ and $\Theta$ of the same size, the inequality below always holds:

$$ \|\hat{\Theta} - \Theta\|_1 + \|\Theta\|_1 - \|\hat{\Theta}\|_1 \le 2\|\hat{\Theta}_{J(\Theta)} - \Theta_{J(\Theta)}\|_1. \qquad (28) $$

###### Proof.

It can be verified that the inequality

$$ \|\Theta_{J(\Theta)}\|_1 - \|\hat{\Theta}_{J(\Theta)}\|_1 \le \|(\hat{\Theta} - \Theta)_{J(\Theta)}\|_1 $$

and the equalities

$$ \Theta_{J(\Theta)^\perp} = 0, \quad \|(\hat{\Theta} - \Theta)_{J(\Theta)^\perp}\|_1 - \|\hat{\Theta}_{J(\Theta)^\perp}\|_1 = 0 $$

hold. Therefore we can derive

$$ \|\hat{\Theta} - \Theta\|_1 + \|\Theta\|_1 - \|\hat{\Theta}\|_1 = \|(\hat{\Theta} - \Theta)_{J(\Theta)}\|_1 + \|(\hat{\Theta} - \Theta)_{J(\Theta)^\perp}\|_1 + \|\Theta_{J(\Theta)}\|_1 + \|\Theta_{J(\Theta)^\perp}\|_1 - \|\hat{\Theta}_{J(\Theta)}\|_1 - \|\hat{\Theta}_{J(\Theta)^\perp}\|_1 \le 2\|(\hat{\Theta} - \Theta)_{J(\Theta)}\|_1. $$

This completes the proof of this lemma. ∎

## D. Concentration Inequality

###### Lemma 3.

Let $\sigma_X(l)$ be the maximum singular value of the matrix $G_X$; let $W \in \mathbb{R}^{n \times k}$ be the matrix of i.i.d. entries $w_{ij} \sim \mathcal{N}(0, \sigma_w^2)$. Let

$$ \lambda = \frac{2\sigma_X(l)\sigma_w\sqrt{n}}{N}\left(1 + \sqrt{\frac{k}{n}} + t\right). $$

Then

$$ \Pr\left( \|W^T G_X\|_2 / N \le \lambda / 2 \right) \ge 1 - \exp(-nt^2/2). $$

###### Proof.

It is known that a Gaussian matrix $\hat{W} \in \mathbb{R}^{n \times k}$ with i.i.d. entries $\hat{w}_{ij} \sim \mathcal{N}(0, 1/n)$ satisfies

$$ \Pr\left( \|\hat{W}\|_2 > 1 + \sqrt{k/n} + t \right) \le \exp(-nt^2/2), \qquad (29) $$

for any $t > 0$. From the definition of the largest singular value, there exists a unit vector $u \in \mathbb{R}^h$, i.e., $\|u\|_2 = 1$, such that $\|W^T G_X\|_2 = \|W^T G_X u\|_2$. Since $\|W^T G_X u\|_2 \le \|W\|_2 \|G_X u\|_2 \le \sigma_X(l)\|W\|_2$, we have

$$ \Pr\left( \|W^T G_X\|_2 / N > \lambda / 2 \right) \le \Pr\left( \sigma_X(l)\|W\|_2 / N > \lambda / 2 \right). $$

Applying the result in Eq. (29) to the inequality above, with $W = \sigma_w\sqrt{n}\,\hat{W}$, we complete the proof of this lemma. ∎

## E. Implementations of the Alternating Direction Method of Multipliers for Solving Eq. (6)

We employ two variants of the alternating direction method of multipliers (ADMM) to solve Eq. (6). The key difference lies in the use of different numbers of auxiliary variables to separate the smooth components from the non-smooth components of the objective function in Eq. (6).

### E.1 The First Implementation: ADMM1

By adding an auxiliary variable, we reformulate Eq. (6) as

 min<