We study the problem of estimating multiple predictive functions from noisy observations. Such a problem has received broad attention in many areas of statistics and machine learning[6, 15, 16, 18]. This line of work can be roughly divided into two categories: parametric estimation and non-parametric estimation; a common and important theme for both categories is the appropriate assumption of the structure in the model parameters (parametric setting) or the coefficients of the dictionary (nonparametric setting).
There has been an enormous amount of literature on effective function estimation based on different sparsity constraints, including the estimation of the sparse linear regression via-norm penalty [3, 6, 27, 32], and the estimation of the linear regression functions using group lasso estimator [15, 16]. More recently, trace norm regularization has become a popular tool for approximating a set of linear models and the associated low-rank matrices in the high-dimensional setting [18, 24]; the trace norm is the tightest convex surrogate 
for the (non-convex) rank function under certain conditions, encouraging the sparsity in the singular values of the matrix of interest. One limitation of the use of trace norm regularization is that the resulting model is dense in general. However, in many real-world applications, the underlying structure of multiple predictive functions may be sparse as well as low-rank; the sparsity leads to explicitly interpretable prediction models and the low-rank implies essential subspace structure information. Similarly, the -norm is the tightest convex surrogate for the non-convex cardinality function , encouraging the sparsity in the entries of the matrix. This motivates us to explore the use of the combination of the trace norm and the -norm as a composite regularization (called sparse trace norm regularization) to induce the desirable sparse low-rank structure.
Trace norm regularization (minimization) has been investigated extensively in recent years. Efficient algorithms have been developed for solving convex programs with trace norm regularization [29, 12]; sufficient conditions for exact recovery from trace norm minimization have been established in ; consistency of trace norm minimization has been studied in ; trace norm minimization has been applied for matrix completion  and collaborative filtering [25, 23]. Similarly, -norm regularization has been well studied in the literature, just to mention a few, from the efficient algorithms for convex optimization [11, 13, 29], theoretical guarantee of the performance [9, 32], and model selection consistency .
In this paper, we focus on estimating multiple predictive functions simultaneously from a finite dictionary of basis functions in the nonparametric regression setting. Our function estimation scheme assumes that each predictive function can be approximated using a linear combination of those basis functions. By assuming that the coefficient matrix of the basis functions admits a sparse low-rank structure, we formulate the function estimation problem as a convex formulation, in which the combination of the trace norm and the -norm is employed as a composite regularization to induce a sparse low-rank structure in the coefficient matrix. The simultaneous sparse and low-rank structure is different from the incoherent sparse and low-rank structures studied in [8, 10]. We propose to solve the function estimation problem using the accelerated gradient method and the alternating direction method of multipliers; we also develop efficient algorithms to solve the key components involved in both methods. We conduct theoretical analysis on the proposed convex formulation: we first present some basic properties of the optimal solution to the convex formulation (Lemma 4.1); we then present an assumption associated with the geometric nature of the basis functions over the prescribed observations; based on such an assumption, we derive a performance bound for the combined regularization for function estimation (Theorem 4.1). We conduct simulations on benchmark data to demonstrate the effectiveness and efficiency of the proposed algorithms. Notation Denote . For any matrix , denote its trace norm by , i.e., the sum of the singular values; denote its operator norm by , i.e., the largest singular value; denote its -norm by , i.e., the sum of absolute value of all entries.
2 Problem Formulation
Let be a set of prescribed sample pairs (fixed design) associated with unknown functions as
where is an unknown regression function, denotes the
-th entry of the response vector, and is a stochastic noise variable. Let , , and . Denoting
we can rewrite Eq. (1) in a compact form as . Let be a set of pre-specified basis functions as , and let be the coefficient matrix. We define
where denotes the -th entry in the vector . Note that in practice the basis functions can be estimators from different methods, or different values of the tuning parameters of the same method.
We consider the problem of estimating the unknown functions using the composite functions defined in Eq. (3), respectively. Denote
and define the empirical error as
where . Our goal is to estimate the model parameter of a sparse low-rank structure from the given sample pairs . Such a structure induces the sparsity and the low rank simultaneously in a single matrix of interest.
Given that the functions are coupled via in some coherent sparse and low-rank structure, we propose to estimate as
where and are regularization parameters (estimated via cross-validation), and the linear combination of and is used to induce the sparse low-rank structure in . The optimization problem in Eq. (6) is non-smooth convex and hence admits a globally optimal solution; it can be solved using many sophisticated optimization techniques [28, 12]; in Section 3, we propose to apply the accelerated gradient method  and the alternating direction method of multipliers  to solve the optimization problem in Eq. (6).
3 Optimization Algorithms
In this section, we consider to apply the accelerated gradient (AG) algorithm [2, 19, 20] and the alternating direction method of multipliers (ADMM) , respectively, to solve the (non-smooth and convex) optimization problem in Eq. (6). We also develop efficient algorithms to solve the key components involved in both AG and ADMM.
3.1 Accelerated Gradient Algorithm
The AG algorithm has attracted extensive attention in the machine learning community due to its optimal convergence rate among all first order techniques and its ability of dealing with large scale data. The general scheme in AG for solving Eq. (6) can be described as below: at the -th iteration, the intermediate (feasible) solution can be obtained via
where denotes a searching point constructed on the intermediate solutions from previous iterations,
denotes the derivative of the loss function in Eq. (5) at , and specifies the step size which can be determined by iterative increment until the condition
is satisfied. The operation in Eq. (7) is commonly referred to as proximal operator , and its efficient computation is critical for the practical convergence of the AG-type algorithm. Next we present an efficient alternating optimization procedure to solve Eq. (7) with a given .
3.1.1 Dual Formulation
The problem in Eq. (7) is not easy to solve directly; next we show that this problem can be efficiently solved in its dual form. By reformulating and into the equivalent dual forms, we convert Eq. (7) into a max-min formulation as
3.1.2 Alternating Optimization
The optimization problem in Eq. (10) is smooth convex and it has two optimization variables. For such type of problems, coordinate descent (CD) method is routinely used to compute its globally optimal solution . To solve Eq. (10), the CD method alternatively optimizes one of the two variables with the other variable fixed. Our analysis below shows that the variables and in Eq. (10) can be optimized efficiently. Note that the convergence rate of the CD method is not known, however, it converges very fast in practice (less than iterations in our experiments).
Optimization of L For a given , the variable can be optimized via solving the following problem:
where . The optimization on above can be interpreted as computing an optimal projection of a given matrix over a unit spectral norm ball. Our analysis shows that the optimal solution to Eq. (11) can be expressed in an analytic form as summarized in the following theorem.
Assume the existence of a set of left and right singular vector pairs shared by the optimal to Eq. (11) and the given for their non-zero singular values. Under such an assumption, it can be verified that the singular values of can be obtained via
to which the optimal solution is given by ; hence the expression of coincides with Eq. (12). Therefore, all that remains is to show that our assumption (on the left and right singular vector pairs of and ) holds.
Denote the Lagrangian associated with the problem in Eq. (11) as , where denotes the dual variable. Since is strictly feasible in Eq. (11), namely, , strong duality holds for Eq. (11). Let be the optimal dual variable to Eq. (11). Therefore we have . It is well known that minimizes if and only if is a subgradient of at , i.e.,
For any matrix , the subdifferential of is given by  , where denotes the convex hull of the set . Specifically, any element of has the form
where and are any left and right singular vectors of corresponding to its largest singular value (the top singular values may share a common value). From Eq. (13) and the definition of , there exist such that , and
where and correspond to any left and right singular vectors of corresponding to its largest singular value. Since , Eq. (14) verifies the existence of a set of left and right singular vector pairs shared by and . This completes the proof. ∎
Optimization of S For a given , the variable can be optimized via solving the following problem:
where . Similarly, the optimization on can be interpreted as computing a projection of a given matrix over an infinity norm ball. It also admits an analytic solution as summarized in the following theorem.
For any matrix , the optimal solution to Eq. (15) is given by
where denotes the component-wise multiplication operator, and denotes the matrix with entries of appropriate size.
3.2 Alternating Direction Method of Multipliers
The ADMM algorithm  is suitable for dealing with non-smooth (convex) optimizations problems, as it blends the decomposability of dual ascent with the superior convergence of the method of multipliers. We present two implementations of the ADMM algorithm for solving Eq. (6). Due to the space constraint, we move the detailed discussion of two ADMM implementations to the supplemental material.
4 Theoretical Analysis
In this section, we present a performance bound for the function estimation scheme in Eq. (3). Such a performance bound measures how well the estimation scheme can approximate the regression functions in Eq. (2) via the sparse low-rank coefficient .
4.1 Basic Properties of the Optimal Solution
We first present some basic properties of the optimal solution defined in Eq. (6); these properties are important building blocks of our following theoretical analysis.
Consider the optimization problem in Eq. (6) for and . Given sample pairs as and . Let and be defined in Eq. (2) and Eq. (4), respectively; let be the largest singular values of . Assume that has independent and identically distributed (i.i.d.) entries as . Take
where is an operator defined in Lemma 1 of the supplemental material.
Define the random event
where the second inequality follows from . Therefore, under , we have
4.2 Main Assumption
We introduce a key assumption on the dictionary of basis functions . Based on such an assumption, we derive a performance bound for the sparse trace norm regularization formulation in Eq. (6).
For a matrix pair and of size , let and . We assume that there exist constants and such that
where the restricted set is defined as
and denotes the number of nonzero entries in the matrix .
Our assumption on in Eq. (20) is closely related to but less restrictive than the RSC condition used in ; its denominator is only a part of the one in RSC and in a different matrix norm as well. Our assumption on is similar to the RE condition used in  except that its denominator is in a different matrix norm; our assumption can also be implied by sufficient conditions similar to the ones in .
4.3 Performance Bound
We derive a performance bound for the sparse trace norm structure obtained by solving Eq. (6). This bound measures how well the optimal can be used to approximate by evaluating the averaged estimation error, i.e., .
Consider the optimization problem in Eq. (6) for and . Given n sample pairs as and , let and be defined in Eqs. (2) and (4), respectively; let be the largest singular value of . Assume that has i.i.d. entries as . Take as the value in Eq. (17). Then with probability of at least , for the minimizer in Eq. (6), we have
where is taken over all with and , and is a constant depending only on .
Denote in Eq. (18). We have
where the last inequality above follows from for . Similarly, we have
Setting and in the inequality above, we complete the proof. ∎
where the equality of the first equation is achieved by setting and proportional to and , i.e., and . Thus the performance bound in Eq. (21) can be refined as
Note that the performance bound above is independent of the value of and , and it is tighter than the one described in Eq. (21).
In this section, we evaluate the effectiveness of the sparse trace norm regularization formulation in Eq. (6) on benchmark data sets; we also conduct numerical studies on the convergence of AG and two ADMM implementations including ADMM and ADMM (see details in Section E of the supplemental material) for solving Eq. (6) and the convergence of the alternating optimization algorithm for solve Eq. (10). Note that we use the least square loss for the following experiments.
Performance Evaluation We apply the sparse trace norm regularization formulation (S.TraceNorm) on multi-label classification problems, in comparison with the trace norm regularization formulation (TraceNorm) and the -norm regularization formulation (OneNorm). AUC, Macro F, and Micro F are used as the classification performance measures. Four benchmark data sets, including Business, Arts, and Health from Yahoo webpage data sets  and Scene from LIBSVM multi-label data sets222http://www.csie.ntu.edu.tw/~cjlin, are employed in this experiment. The reported experimental results are averaged over random repetitions of the data sets into training and test sets of the ratio . We use the AG method to solve the S.TraceNorm formulation, and stop the iterative procedure of AG if the change of the objective values in two successive iterations is smaller than or the iteration numbers larger than . The regularization parameters and are determined via double cross-validation from the set .
|(n, d, m)||()||()||()||()|
We present the averaged performance of the competing algorithms in Table 1. The main observations are summarized as follows: (1) S.TraceNorm achieves the best performance on all benchmark data sets (except on Business data) in this experiment; this result demonstrates the effectiveness of the induced sparse low-rank structure for multi-label classification tasks; (2) TraceNorm outperforms OneNorm on all benchmark data sets; this result demonstrates the effectiveness of modeling a shared low-rank structure for high-dimensional text and image data analysis.
Numerical Study We study the practical convergence of AG and ADMM by solving Eq. (6) on Scene data. In our experiments, we observe that ADMM is much slower than ADMM and we thus only focus on ADMM. Note that in AG, we set ; in ADMM, we set , , . For other parameter settings, we observe similar trends.
In the first experiment, we compare AG and ADMM in term of the practical convergence. We stop ADMM when the change of the objective values in two successive iterations smaller than ; the attained objective value in ADMM is used as the stopping criterion for AG, that is, we stop AG if the attained objective value in AG is equal to or smaller than that objective value attained in ADMM. The convergence curves of ADMM and AG are presented in the left plot of Figure 1. Clearly, we can observe that AG converges much faster than ADMM. In the second experiment, we study the convergence of AG. We stop AG when the change of the objective values in two successive iterations smaller than . The convergence curves is presented in the middle plot of Figure 1. We observe that AG converges very fast, and its convergence speed is consistent with the theoretical convergence analysis in .
We also conduct numerical study on the alternating optimization algorithm (in Section 3.1.2) for solving the dual formulation of the proximal operator in Eq. (10). Similarly, the alternating optimization algorithm is stopped when the change of the objective values in two successive iterations smaller than . For illustration, in Eq. (10) we randomly generate the matrix of size by from ; we then apply the alternating optimization algorithm to solve Eq. (10) and plot its convergence curve in the right plot of Figure 1. Our experimental results show that the alternating optimization algorithm generally converges within iterations and our results demonstrate the practical efficiency of this algorithm.
We study the problem of estimating multiple predictive functions simultaneously in the nonparametric regression setting. In our estimation scheme, each predictive function is estimated using a linear combination of a dictionary of pre-specified basis functions. By assuming that the coefficient matrix admits a sparse low-rank structure, we formulate the function estimation problem as a convex program with the trace norm and the -norm regularization. We propose to employ AG and ADMM algorithms to solve the function estimation problem and also develop efficient algorithms for the key components involved in AG and ADMM. We derive a key property of the optimal solution to the convex program; moreover, based on an assumption associated with the basis functions, we establish a performance bound of the proposed function estimation scheme using the composite regularization. Our simulation studies demonstrate the effectiveness and the efficiency of the proposed formulation. In the future, we plan to derive a formal sparse oracle inequality for the convex problem in Eq. (6) as in ; we also plan to apply the proposed function estimation formulation to other real world applications.
-  F. Bach. Consistency of trace norm minimization. Journal of Machine Learning Research, 9:1019–1048, 2008.
-  A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal of Imaging Science, 2:183–202, 2009.
-  P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of lasso and dantzig selector. Annals of Statistics, 37:1705–1732, 2009.
-  S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 2010.
-  S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
-  F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Aggregation and sparsity via penalized least squares. In COLT, 2006.
-  E. J. Candès and B. Recht. Exact matrix completion via convex optimization. CoRR, abs/0805.4471, 2008.
E.J. Candès, X. Li, Y. Ma, and J. Wright.
Robust principal component analysis?Journal of ACM, 2011.
E.J. Candès and T Tao.
Decoding by linear programming.IEEE Transactions on Information Theory, 51:4203–4215, 2005.
-  V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Sparse and low-rank matrix decompositions. In SYSID, 2009.
-  B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–451, 2004.
M. Fazel, H. Hindi, and S. Boyd.
A rank minimization heuristic with application to minimum order system approximation.In ACL, 2001.
-  J. Friedman, T. Hastie, H. Hofling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Statistics, 1:302–332, 2007.
-  L. Grippoa and M. Sciandrone. On the convergence of the block nonlinear gauss-seidel method under convex constraints. Operation Research Letters, 26:127–136, 2000.
-  J. Huang, T. Zhang, and D. N. Metaxas. Learning with structured sparsity. In ICML, 2009.
-  K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer. Taking advantage of sparsity in multi-task learning. In COLT, 2008.
-  J.-J. Moreau. Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France, 93:273–299, 1965.
-  S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. In ICML, 2010.
-  Y. Nesterov. Introductory lectures on convex programming. 1998. Lecture Notes.
-  Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper, 2007.
-  Y. C. Pati and T. Kailath. Phase-shifting masks for microlithography: automated design and mask requirements. Journal of the Optical Society of America A, 11(9):2438–2452, 1994.
-  B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. SIAM Review, (3):471–501, 2010.
-  J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005.
-  A. Rohde and A. B. Tsybakov. Estimation of high-dimensional low rank matrices. Preprint available at 0912.5338v2, 2010.
-  N. Srebro, J. D. M. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In NIPS, 2004.
-  S. J. Szarek. Condition numbers of random matrices. Journal of Complexity, 7(2):131–149, 1991.
-  R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
-  K. C. Toh, M. J. Todd, and R.H. Tutuncu. SDPT3: a MATLAB software package for semidefinite programming. Optimization Methods and Software, 11:545–581, 1999.
-  K. C. Toh and S. W. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacific Journal of Optimization, 2009.
-  N. Ueda and K. Saito. Single-shot detection of multiple categories of text using parametric mixture models. In KDD, 2002.
-  G. A. Watson. Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications, (170):33–45, 1992.
-  T. Zhang. Some sharp performance bounds for least squares regression with regularization. Annals of Statistics, 37:2109–2144, 2009.
-  P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
A. Operators and
We define two operators, namely and , on an arbitrary matrix pair (of the same size) based on Lemma in , as summarized in the following lemma.
Given any and of size , let and denote the SVD of as
where and are orthogonal, and is diagonal consisting of the non-zero singular values on its main diagonal. Let
where , , , and . Define and as
Then the following conditions hold: , , .
for arbitrary and of the same size. To avoid clutter notation, we denote by , and by throughout this paper, as the appropriate can be easily determined from the context.
B. Bound on Trace Norm
As a consequence of Lemma 1, we derive a bound on the trace norm of the matrices of interest as summarized below.
Given an arbitrary matrix pair and , let . Then
C. Bound on -norm
Analogous to the bound on the trace norm in Corollary 1, we also derive a bound on the -norm of the matrices of interest in the following lemma. For arbitrary matrices and , we denote by the coordinate set (the location set of nonzero entries) of , and by the associated complement (the location set of zero entries); we denote by the matrix of the same entries as on the set and of zero entries on the set . We now present a result associated with and in the following lemma. Note that a similar result for the vector case is presented in .
Given a matrix pair and of the same size, the inequality below always holds
It can be verified that the inequality
and the equalities
hold. Therefore we can derive
This completes the proof of this lemma. ∎
D. Concentration Inequality
Let be the maximum singular value of the matrix ; let be the matrix of i.i.d entries as . Let Then
It is known  that a Gaussian matrix with and satisfies
where is a universal constant. From the definition of the largest singular value, there exist a vector of length , i.e., , such that . Since , we have
Applying the result in Eq. (29) into the inequality above, we complete the proof of this lemma. ∎
E. Implementations of the Alternating Direction Method of Multipliers for Solving Eq. (6)
We employ two variants of the Alternating Direction Method of Multipliers (ADMM) to solve the Eq. (6). The key difference lies in the use of different numbers of auxiliary variables to separate the smooth components from the non-smooth components of the objective function in Eq. (6).
E.1 The First Implementation: ADMM
By adding an auxiliary variable , we reformulate Eq. (6) as