1 Introduction
The concept of parsimony is central in many scientific domains. In the context of statistics, signal processing or machine learning, it takes the form of variable or feature selection problems, and is commonly used in two situations: first, to make the model or the prediction more interpretable or cheaper to use, i.e., even if the underlying problem does not admit sparse solutions, one looks for the best sparse approximation. Second, sparsity can also be used given prior knowledge that the model should be sparse. Many methods have been designed to learn sparse models, namely methods based on combinatorial optimization [1, 2], Bayesian inference [3] or convex optimization [4, 5].

In this paper, we focus on regularization by sparsity-inducing norms. The simplest example of such norms is the $\ell_1$-norm, leading to the Lasso when used within a least-squares framework. In recent years, a large body of work has shown that the Lasso performs optimally in high-dimensional low-correlation settings, both in terms of prediction [6], estimation of parameters and estimation of supports [7, 8]. However, most data exhibit strong correlations, with various correlation structures, such as clusters (i.e., close to block-diagonal covariance matrices) or sparse graphs, such as, for example, problems involving sequences (in which case the covariance matrix is close to a Toeplitz matrix [9]). In these situations, the Lasso is known to have stability problems: although its predictive performance is not disastrous, the selected predictors may vary a lot (typically, given two correlated variables, the Lasso will only select one of the two, at random).
Several remedies have been proposed for this instability. First, the elastic net [10] adds a strongly convex penalty term (the squared $\ell_2$-norm) that stabilizes selection (typically, given two correlated variables, the elastic net will select both). However, it is blind to the exact correlation structure, and while strong convexity is needed for some variables, it is not needed for others. Another solution is to consider the group Lasso, which divides the predictors into groups and penalizes the sum of the $\ell_2$-norms of these groups [11]. This is known to accommodate strong correlations within groups [12]; however, it requires knowing the groups in advance, which is not always possible. A third line of research has focused on sampling-based techniques [13, 14, 15].
An ideal regularizer should thus be adapted to the design (like the group Lasso), but without requiring human intervention (like the elastic net); it should add strong convexity only where needed, and not modify variables that already behave correctly. In this paper, we propose a new norm towards this end.
More precisely we make the following contributions:
- We propose in Section 2 a new norm based on the trace norm (a.k.a. nuclear norm) that interpolates between the $\ell_1$-norm and the $\ell_2$-norm depending on correlations.
- We show in Section 2.2 that there is a unique minimum when penalizing with this norm.
- We provide optimization algorithms based on reweighted least-squares in Section 3.
- We study the second-order expansion around independence and relate it to existing work on including correlations in Section 4.
- We perform synthetic experiments in Section 5, where we show that the trace Lasso outperforms existing norms in strong-correlation regimes.
Notations.
Let $M \in \mathbb{R}^{n \times p}$. The columns of $M$ are denoted using superscripts, i.e., $M^{(i)}$ denotes the $i$-th column, while the rows are denoted using subscripts, i.e., $M_{(i)}$ denotes the $i$-th row. For $M \in \mathbb{R}^{p \times p}$, $\mathrm{diag}(M) \in \mathbb{R}^p$ is the diagonal of the matrix $M$, while for $u \in \mathbb{R}^p$, $\mathrm{Diag}(u) \in \mathbb{R}^{p \times p}$ is the diagonal matrix whose diagonal elements are the $u_i$. Let $S$ be a subset of $\{1, \dots, p\}$; then $u_S$ is the vector $u$ restricted to the support $S$, with $0$ outside the support $S$. We denote by $\mathcal{S}_p$ the set of symmetric matrices of size $p$. We will use various matrix norms; here are the notations we use:
- $\|M\|_2$ is the operator norm, i.e., the maximum singular value of the matrix $M$,
- $\|M\|_F$ is the Frobenius norm, i.e., the $\ell_2$-norm of the singular values, which is also equal to $\sqrt{\mathrm{tr}(M^\top M)}$,
- $\|M\|_{2,1}$ is the sum of the $\ell_2$-norms of the columns of $M$: $\|M\|_{2,1} = \sum_{i=1}^p \|M^{(i)}\|_2$.
2 Definition and properties of the trace Lasso
We consider the problem of predicting $y \in \mathbb{R}$, given a vector $x \in \mathbb{R}^p$, assuming a linear model
$$y = w^\top x + \varepsilon,$$
where $\varepsilon$ is (Gaussian) noise with mean $0$ and variance $\sigma^2$. Given a training set $X = (x_1, \dots, x_n)^\top \in \mathbb{R}^{n \times p}$ and $y \in \mathbb{R}^n$, a widely used method to estimate the parameter vector $w$ is penalized empirical risk minimization:
$$\hat w \in \operatorname*{arg\,min}_{w \in \mathbb{R}^p} \ \sum_{i=1}^n \ell\big(y_i, w^\top x_i\big) + \lambda\, f(w), \qquad (1)$$
where $\ell$ is a loss function used to measure the error we make by predicting $w^\top x_i$ instead of $y_i$, while $f$ is a regularization term used to penalize complex models. This second term helps avoid overfitting, especially in the case where we have many more parameters than observations, i.e., $p \gg n$.

2.1 Related work
We will now present some classical penalty functions for linear models which are widely used in the machine learning and statistics communities. The first one, known as Tikhonov regularization [16] or ridge regression [17], is the squared $\ell_2$-norm. When used with the square loss, estimating the parameter vector is done by solving a linear system. One of the main drawbacks of this penalty function is the fact that it does not perform variable selection and thus does not behave well in sparse high-dimensional settings.

Hence, it is natural to penalize linear models by the number of variables used by the model. Unfortunately, this criterion, sometimes denoted by $\|w\|_0$ ($\ell_0$-penalty), is not convex and solving the problem in Eq. (1) is generally NP-hard [18]. Thus, a convex relaxation for this problem was introduced, replacing the size of the selected subset by the $\ell_1$-norm of $w$. This estimator is known as the Lasso [4] in the statistics community and basis pursuit [5] in signal processing. It was later shown that under some assumptions, the two problems are in fact equivalent (see for example [19] and references therein).
When two predictors are highly correlated, the Lasso has a very unstable behavior: it may only select the variable that is the most correlated with the residual. On the other hand, Tikhonov regularization tends to shrink coefficients of correlated variables together, leading to a very stable behavior. In order to get the best of both worlds, stability and variable selection, Zou and Hastie introduced the elastic net [10], which is the sum of the $\ell_1$-norm and the squared $\ell_2$-norm. Unfortunately, this estimator needs two regularization parameters and is not adaptive to the precise correlation structure of the data. Some authors also proposed to use pairwise correlations between predictors to interpolate more adaptively between the $\ell_1$-norm and the squared $\ell_2$-norm, by introducing the pairwise elastic net [20] (see comparisons with our approach in Section 5).
Finally, when one has more knowledge about the data, for example clusters of variables that should be selected together, one can use the group Lasso [11]. Given a partition $(S_k)$ of the set of variables, it is defined as the sum of the $\ell_2$-norms of the restricted vectors $w_{S_k}$:
$$\|w\|_{\mathrm{GL}} = \sum_{k} \|w_{S_k}\|_2.$$
The effect of this penalty function is to introduce sparsity at the group level: variables in a group are selected together. One of the main drawbacks of this method, which is sometimes also one of its strengths, is the fact that one needs to know the partition of the variables in advance, and so one needs to have good knowledge of the data.
2.2 The ridge, the Lasso and the trace Lasso
In this section, we show that Tikhonov regularization and the Lasso penalty can be viewed as norms of the matrix $X\,\mathrm{Diag}(w)$. We then introduce a new norm involving this matrix.
The solution of empirical risk minimization penalized by the $\ell_1$-norm or the $\ell_2$-norm is not equivariant under rescaling of the predictors $X^{(i)}$, so it is common to normalize the predictors. When normalizing the predictors $X^{(i)}$ and penalizing by Tikhonov regularization or by the Lasso, one is implicitly using a regularization term that depends on the data or design matrix $X$. In fact, there is an equivalence between normalizing the predictors and not normalizing them, provided the following reweighted $\ell_2$- and $\ell_1$-norms are used instead of the Tikhonov regularization and the Lasso:
$$\Big(\sum_{i=1}^p \|X^{(i)}\|_2^2\, w_i^2\Big)^{1/2} \quad \text{and} \quad \sum_{i=1}^p \|X^{(i)}\|_2\, |w_i|. \qquad (2)$$
These two norms can be expressed using the matrix $X\,\mathrm{Diag}(w)$:
$$\Big(\sum_{i=1}^p \|X^{(i)}\|_2^2\, w_i^2\Big)^{1/2} = \|X\,\mathrm{Diag}(w)\|_F \quad \text{and} \quad \sum_{i=1}^p \|X^{(i)}\|_2\, |w_i| = \|X\,\mathrm{Diag}(w)\|_{2,1},$$
and a natural question arises: are there other relevant choices of functions or matrix norms? A classical measure of the complexity of a model is the number of predictors used by this model, which is equal to the size of the support of $w$. This penalty being non-convex, one uses its convex relaxation, the $\ell_1$-norm, leading to the Lasso.
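For concreteness, the two identities above can be checked numerically. The following minimal NumPy sketch (an illustration only, with random data) compares the reweighted norms of Eq. (2) with the Frobenius and $\ell_{2,1}$ norms of $X\,\mathrm{Diag}(w)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 5
X = rng.standard_normal((n, p))
w = rng.standard_normal(p)

M = X @ np.diag(w)                                   # the matrix X Diag(w)
col_norms = np.linalg.norm(X, axis=0)                # ||X^(i)||_2

reweighted_l2 = np.sqrt(np.sum(col_norms**2 * w**2))
reweighted_l1 = np.sum(col_norms * np.abs(w))

assert np.isclose(reweighted_l2, np.linalg.norm(M, 'fro'))
assert np.isclose(reweighted_l1, np.linalg.norm(M, axis=0).sum())  # l2,1 norm
```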
Here, we propose a different measure of complexity which can be shown to be more adapted to model selection settings [21]: the dimension of the subspace spanned by the selected predictors. This is equal to the rank of the matrix of selected predictors, or also to the rank of the matrix $X\,\mathrm{Diag}(w)$. As for the size of the support, this function is non-convex, and we propose to replace it by a convex surrogate, the trace norm (the sum of the singular values), leading to the following penalty that we call the “trace Lasso”:
$$\Omega(w) = \|X\,\mathrm{Diag}(w)\|_*.$$
The trace Lasso has some interesting properties: if all the predictors are orthogonal, then it is equal to the $\ell_1$-norm. Indeed, we have the decomposition
$$X\,\mathrm{Diag}(w) = \sum_{i=1}^p |w_i|\,\|X^{(i)}\|_2\; \frac{\mathrm{sign}(w_i)\, X^{(i)}}{\|X^{(i)}\|_2}\; e_i^\top,$$
where the $e_i$ are the vectors of the canonical basis. Since the predictors are orthogonal and the $e_i$ are orthogonal too, this gives a singular value decomposition of $X\,\mathrm{Diag}(w)$ and we get
$$\|X\,\mathrm{Diag}(w)\|_* = \sum_{i=1}^p |w_i|\,\|X^{(i)}\|_2 = \|w\|_1 \quad \text{for normalized predictors.}$$
On the other hand, if all the predictors are equal to $X^{(1)}$, then $X\,\mathrm{Diag}(w) = X^{(1)} w^\top$ and we get $\|X\,\mathrm{Diag}(w)\|_* = \|X^{(1)}\|_2\,\|w\|_2 = \|w\|_2$, which is equivalent to the Tikhonov regularization. Thus when two predictors are strongly correlated, our norm will behave like the Tikhonov regularization, while for almost uncorrelated predictors, it will behave like the Lasso.
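These two limiting cases can be verified numerically. The following minimal NumPy sketch (an illustration only) computes $\|X\,\mathrm{Diag}(w)\|_*$ for an orthonormal design and for a design whose columns are all identical:

```python
import numpy as np

def trace_lasso(X, w):
    # trace norm (sum of singular values) of X Diag(w)
    return np.linalg.norm(X @ np.diag(w), ord='nuc')

rng = np.random.default_rng(0)
p = 4
w = rng.standard_normal(p)

# Orthogonal predictors with unit norm: Omega(w) = ||w||_1.
X_orth = np.linalg.qr(rng.standard_normal((10, p)))[0]     # orthonormal columns
assert np.isclose(trace_lasso(X_orth, w), np.abs(w).sum())

# Identical predictors: Omega(w) = ||w||_2.
x = rng.standard_normal(10)
x /= np.linalg.norm(x)
X_same = np.tile(x[:, None], (1, p))                       # all columns equal
assert np.isclose(trace_lasso(X_same, w), np.linalg.norm(w))
```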
Always having a unique minimum is an important property for a statistical estimator, as it is a first step towards stability. The trace Lasso, by adding strong convexity exactly in the direction of highly correlated covariates, always has a unique minimum, and is much more stable than the Lasso.
Proposition 1.
If the loss function is strongly convex with respect to its second argument, then the solution of the empirical risk minimization penalized by the trace Lasso, i.e., Eq. (1), is unique.
The technical proof of this proposition is given in appendix B, and consists of showing that in the flat directions of the loss function, the trace Lasso is strongly convex.
2.3 A new family of penalty functions
In this section, we introduce a new family of penalties, inspired by the trace Lasso, allowing us to write the $\ell_1$-norm, the $\ell_2$-norm and the newly introduced trace Lasso as special cases. In fact, we note that $\|w\|_1 = \|\mathbf{I}\,\mathrm{Diag}(w)\|_*$ and $\|w\|_2 = \|\mathbf{1}^\top \mathrm{Diag}(w)\|_*$, where $\mathbf{1} \in \mathbb{R}^p$ is the vector of all ones. In other words, we can express the $\ell_1$- and $\ell_2$-norms of $w$ using the trace norm of a given matrix times the matrix $\mathrm{Diag}(w)$. A natural question to ask is: what happens when using a matrix other than the identity or the row vector $\mathbf{1}^\top$, and what are good choices of such matrices? Therefore, we introduce the following family of penalty functions:
Definition 1.
Let $P \in \mathbb{R}^{k \times p}$, all of its columns having unit norm. We introduce the norm $\Omega^P$ as
$$\Omega^P(w) = \|P\,\mathrm{Diag}(w)\|_*.$$
Proof.
The positive homogeneity and the triangle inequality are direct consequences of the linearity of $w \mapsto P\,\mathrm{Diag}(w)$ and of the fact that $\|\cdot\|_*$ is a norm. Since none of the columns of $P$ is equal to zero, we have
$$P\,\mathrm{Diag}(w) = 0 \ \Rightarrow\ w = 0,$$
and so $\Omega^P$ separates points and is a norm. ∎
As stated before, the $\ell_1$- and $\ell_2$-norms are special cases of the family of norms we just introduced. Another important penalty that can be expressed as a special case is the group Lasso with non-overlapping groups. Given a partition $(S_k)$ of the set $\{1, \dots, p\}$, the group Lasso is defined by
$$\|w\|_{\mathrm{GL}} = \sum_{k} \|w_{S_k}\|_2.$$
We define the matrix $P^{\mathrm{GL}}$ column by column: for $i \in S_k$, the $i$-th column of $P^{\mathrm{GL}}$ is $\mathbf{1}_{S_k}/\sqrt{|S_k|}$, where $\mathbf{1}_{S_k} \in \mathbb{R}^p$ is the indicator vector of $S_k$ (so that all columns have unit norm). Then,
$$P^{\mathrm{GL}}\,\mathrm{Diag}(w) = \sum_{k} \frac{\mathbf{1}_{S_k}}{\sqrt{|S_k|}}\; w_{S_k}^\top. \qquad (3)$$
Using the fact that $(S_k)$ is a partition of $\{1, \dots, p\}$, the vectors $\mathbf{1}_{S_k}$ are orthogonal and so are the vectors $w_{S_k}$. Hence, after normalizing the vectors, Eq. (3) gives a singular value decomposition of $P^{\mathrm{GL}}\,\mathrm{Diag}(w)$, and so the group Lasso penalty can be expressed as a special case of our family of norms:
$$\Omega^{P^{\mathrm{GL}}}(w) = \|P^{\mathrm{GL}}\,\mathrm{Diag}(w)\|_* = \sum_{k} \|w_{S_k}\|_2 = \|w\|_{\mathrm{GL}}.$$
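This construction can be illustrated numerically. In the following minimal sketch (an illustration only; the three-group partition is hypothetical), the matrix `P` built as above reproduces the group Lasso penalty through $\|P\,\mathrm{Diag}(w)\|_*$:

```python
import numpy as np

def omega(P, w):
    # Omega^P(w) = trace norm of P Diag(w)
    return np.linalg.norm(P @ np.diag(w), ord='nuc')

groups = [[0, 1, 2], [3, 4], [5]]             # a partition of {0, ..., 5}
p = 6
P = np.zeros((p, p))
for S in groups:
    P[np.ix_(S, S)] = 1.0 / np.sqrt(len(S))   # column i = 1_S / sqrt(|S|) for i in S

rng = np.random.default_rng(0)
w = rng.standard_normal(p)
group_lasso = sum(np.linalg.norm(w[S]) for S in groups)
assert np.isclose(omega(P, w), group_lasso)
```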
In the following proposition, we show that our norm only depends on the value of $P^\top P$. This is an important property for the trace Lasso, where $P = X$, since it underlies the fact that this penalty only depends on the correlation matrix $X^\top X$ of the covariates.
Proposition 2.
Let $P \in \mathbb{R}^{k \times p}$, all of its columns having unit norm. We have
$$\Omega^P(w) = \|P\,\mathrm{Diag}(w)\|_* = \big\|(P^\top P)^{1/2}\,\mathrm{Diag}(w)\big\|_*,$$
and so $\Omega^P$ only depends on $P^\top P$.
[Figure 1: unit balls of the norm $\Omega^P$.]
We plot the unit ball of our norm for different matrices $P$ (see Figure 1).
We can lower bound and upper bound our norms by the $\ell_2$-norm and the $\ell_1$-norm, respectively. This shows that, like the elastic net, our norms interpolate between the $\ell_1$-norm and the $\ell_2$-norm. But the main difference between the elastic net and our norms is the fact that our norms are adaptive and require a single regularization parameter to tune. In particular, for the trace Lasso, when two covariates are strongly correlated, it will be close to the $\ell_2$-norm, while when two covariates are almost uncorrelated, it will behave like the $\ell_1$-norm. This is a behavior close to that of the pairwise elastic net [20].
Proposition 3.
Let $P \in \mathbb{R}^{k \times p}$, all of its columns having unit norm. We have
$$\|w\|_2 \ \le\ \Omega^P(w) \ \le\ \|w\|_1.$$
2.4 Dual norm
The dual norm is an important quantity for both optimization and theoretical analysis of the estimator. Unfortunately, we are not able in general to obtain a closed form expression of the dual norm for the family of norms we just introduced. However we can obtain a bound, which is exact for some special cases:
Proposition 4.
The dual norm, defined by $\Omega^{P*}(u) = \max_{\Omega^P(w) \le 1} u^\top w$, can be bounded by
$$\Omega^{P*}(u) \ \le\ \|P\,\mathrm{Diag}(u)\|_2.$$
Proof.
Using the fact that the columns of $P$ have unit norm, i.e., $\mathrm{diag}(P^\top P) = \mathbf{1}$, we have
$$u^\top w = \mathrm{tr}\big(\mathrm{Diag}(u)\, P^\top P\, \mathrm{Diag}(w)\big) = \mathrm{tr}\big((P\,\mathrm{Diag}(u))^\top\, P\,\mathrm{Diag}(w)\big) \ \le\ \|P\,\mathrm{Diag}(u)\|_2\, \|P\,\mathrm{Diag}(w)\|_*,$$
where the inequality comes from the fact that the operator norm is the dual norm of the trace norm. The definition of the dual norm then gives the result. ∎
As a corollary, we can bound the dual norm by a constant times the $\ell_\infty$-norm:
$$\Omega^{P*}(u) \ \le\ \|P\,\mathrm{Diag}(u)\|_2 \ \le\ \|P\|_2\, \|\mathrm{Diag}(u)\|_2 \ =\ \|P\|_2\, \|u\|_\infty.$$
Using Proposition 3, we also have the inequality $\|u\|_\infty \le \Omega^{P*}(u)$.
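The bound of Proposition 4 can be probed numerically: for any $w$, $u^\top w \le \|P\,\mathrm{Diag}(u)\|_2\, \|P\,\mathrm{Diag}(w)\|_*$, so the ratio $u^\top w / \Omega^P(w)$ should never exceed the bound. The following minimal sketch (an illustration with random data) checks this:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 5
P = rng.standard_normal((n, p))
P /= np.linalg.norm(P, axis=0)                # unit-norm columns, as required

u = rng.standard_normal(p)
bound = np.linalg.norm(P @ np.diag(u), 2)     # operator norm of P Diag(u)

for _ in range(1000):
    w = rng.standard_normal(p)
    ratio = (u @ w) / np.linalg.norm(P @ np.diag(w), ord='nuc')
    assert ratio <= bound + 1e-10
```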
3 Optimization algorithm
In this section, we introduce an algorithm to estimate the parameter vector $w$ when the loss function is the square loss, $\ell(y_i, w^\top x_i) = \frac{1}{2}(y_i - w^\top x_i)^2$, and the penalty is the trace Lasso. It is straightforward to extend this algorithm to the family of norms indexed by $P$. The problem we consider is
$$\min_{w \in \mathbb{R}^p}\ \frac{1}{2}\|y - Xw\|_2^2 + \lambda\, \|X\,\mathrm{Diag}(w)\|_*.$$
We could optimize this cost function by subgradient descent, but this is quite inefficient: computing the subgradient of the trace Lasso is expensive and the rate of convergence of subgradient descent is quite slow. Instead, we consider an iteratively reweighted least-squares method. First, we need to introduce a well-known variational formulation for the trace norm [22]:
Proposition 5.
Let $M \in \mathbb{R}^{n \times p}$. The trace norm of $M$ is equal to
$$\|M\|_* = \frac{1}{2}\,\inf_{S \succ 0}\ \mathrm{tr}\big(M^\top S^{-1} M\big) + \mathrm{tr}(S),$$
and the infimum is attained for $S = \big(M M^\top\big)^{1/2}$.
Using this proposition, we can reformulate the previous optimization problem as
$$\min_{w,\ S \succ 0}\ \frac{1}{2}\|y - Xw\|_2^2 + \frac{\lambda}{2}\,\mathrm{tr}\big(\mathrm{Diag}(w)\, X^\top S^{-1} X\, \mathrm{Diag}(w)\big) + \frac{\lambda}{2}\,\mathrm{tr}(S).$$
This problem is jointly convex in $(w, S)$ [23]. In order to optimize this objective function by alternately minimizing over $w$ and $S$, we need to add a term $\frac{\lambda \mu}{2}\,\mathrm{tr}\big(S^{-1}\big)$. Otherwise, the infimum over $S$ could be attained at a non-invertible $S$, leading to a non-convergent algorithm. The infimum over $S$ is then attained for $S = \big(X\,\mathrm{Diag}(w)^2 X^\top + \mu\, I\big)^{1/2}$.
Optimizing over $w$ is a least-squares problem penalized by a reweighted $\ell_2$-norm equal to $\frac{\lambda}{2}\, w^\top D\, w$, where $D = \mathrm{Diag}\big(\mathrm{diag}(X^\top S^{-1} X)\big)$. It is equivalent to solving the linear system
$$\big(X^\top X + \lambda D\big)\, w = X^\top y.$$
This can be done efficiently by using a conjugate gradient method. Since the cost of multiplying $(X^\top X + \lambda D)$ by a vector is $O(np)$, solving the system has a complexity of $O(knp)$, where $k$ is the number of iterations needed to converge. Using warm restarts, $k$ can be much smaller than $p$, since the linear system we are solving does not change much from one iteration to the next. Below we summarize the algorithm:
Iterate the following steps until convergence:
- Compute the eigenvalue decomposition $U\,\mathrm{Diag}(s)\,U^\top$ of $X\,\mathrm{Diag}(w)^2\, X^\top$.
- Set $D = \mathrm{Diag}\big(\mathrm{diag}(X^\top S^{-1} X)\big)$, where $S^{-1} = U\,\mathrm{Diag}\big((s_i + \mu)^{-1/2}\big)\, U^\top$.
- Set $w$ by solving the system $\big(X^\top X + \lambda D\big)\, w = X^\top y$.

For the sequence $(\mu_t)$, we use a decreasing sequence converging to ten times the machine precision.
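A compact implementation of this scheme is sketched below (NumPy, for illustration only; it follows the derivation above, using the decreasing smoothing sequence $\mu$ and the reweighted linear system, and solves that system directly rather than by conjugate gradient for simplicity):

```python
import numpy as np

def trace_lasso_irls(X, y, lam, n_iter=100, mu_final=None):
    """Iteratively reweighted least-squares sketch for
    min_w 1/2 ||y - X w||^2 + lam * ||X Diag(w)||_*."""
    n, p = X.shape
    if mu_final is None:
        mu_final = 10 * np.finfo(float).eps           # ten times the machine precision
    mus = np.logspace(0, np.log10(mu_final), n_iter)  # decreasing smoothing sequence
    w = np.zeros(p)
    XtX, Xty = X.T @ X, X.T @ y
    for mu in mus:
        # S = (X Diag(w)^2 X^T + mu I)^{1/2}, via an eigendecomposition
        s, U = np.linalg.eigh(X @ np.diag(w ** 2) @ X.T)
        S_inv = U @ np.diag(1.0 / np.sqrt(np.maximum(s, 0.0) + mu)) @ U.T
        # reweighted penalty w^T D w with D = Diag(diag(X^T S^{-1} X))
        D = np.diag(np.einsum('ij,jk,ki->i', X.T, S_inv, X))
        # solve (X^T X + lam D) w = X^T y (conjugate gradient would be used at scale)
        w = np.linalg.solve(XtX + lam * D, Xty)
    return w
```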
3.1 Choice of $\lambda$
We now give a method to choose the range of the regularization path. In fact, we know that the vector $0$ is a solution if and only if $\Omega^*(X^\top y) \le \lambda$ [24]. Thus, we need to start the path at $\lambda = \Omega^*(X^\top y)$, corresponding to the empty solution $w = 0$, and then decrease $\lambda$. Using the inequalities on the dual norm we obtained in the previous section, we get
$$\|X^\top y\|_\infty \ \le\ \Omega^*(X^\top y) \ \le\ \|X\,\mathrm{Diag}(X^\top y)\|_2.$$
Therefore, starting the path at $\lambda = \|X\,\mathrm{Diag}(X^\top y)\|_2$ is a good choice.
4 Approximation around the Lasso
In this section, we compute the second-order approximation of our norm around the special case corresponding to the Lasso. We recall that when $P^\top P = I$, our norm is equal to the $\ell_1$-norm. We add a small perturbation $\Delta \in \mathcal{S}_p$ to the identity matrix and, using Prop. 6 of Appendix A, we obtain a second-order approximation of the norm, which can be rewritten (with a slight abuse of notation for coefficients equal to zero) as the $\ell_1$-norm plus a second-order coupling term. This second-order term is quite interesting: it shows that when two covariates are correlated, the effect of the trace Lasso is to shrink the corresponding coefficients toward each other. Another interesting remark is the fact that this term is very similar to pairwise elastic net penalties, which couple the coefficients of pairs of covariates with weights that depend on their correlation [20].
5 Experiments
In this section, we perform synthetic experiments to illustrate the behavior of the trace Lasso and other classical penalties when there are highly correlated covariates in the design matrix. For all experiments, we have $p$ covariates and $n$ observations. The support of $w$ is equal to $\{1, \dots, s\}$, where $s$ is the size of the support, and for $i$ in the support of $w$, the coefficient $w_i$ is independently drawn from a uniform distribution. The observations $x_i$ are drawn from a multivariate Gaussian with mean $0$ and covariance matrix $\Sigma$. For the first experiment, $\Sigma$ is set to the identity; for the second experiment, $\Sigma$ is block diagonal, with blocks corresponding to clusters of eight variables; finally, for the third experiment, we set $\Sigma_{ij} = \rho^{|i-j|}$, corresponding to a Toeplitz design. For each method, we choose the value of $\lambda$ giving the best estimation error, which is reported.

Overall, all methods behave similarly in the noiseless and the noisy settings, hence we only report results for the noisy setting. In all three graphs of Figure 2, we observe behaviors that are typical of the Lasso, ridge and elastic net: the Lasso performs very well on sparse models, but its performance is rather poor for denser models, almost as poor as that of ridge regression. The elastic net offers the best of both worlds since its two parameters allow it to interpolate adaptively between the Lasso and the ridge. In experiment 1, since the variables are uncorrelated, there is no reason to couple their selection. This suggests that the Lasso should be the most appropriate convex regularization. The trace Lasso approaches the Lasso as $n$ goes to infinity, but the weak coupling induced by empirical correlations is sufficient to slightly decrease its performance compared to that of the Lasso. By contrast, in experiments 2 and 3, the trace Lasso outperforms other methods (including the pairwise elastic net) since variables that should be selected together are indeed correlated. As for the pairwise elastic net, since it takes into account the correlations between variables, it is not surprising that in experiments 2 and 3 it performs better than methods that do not. We do not have a compelling explanation for its superior performance in experiment 1.
[Figure 2: estimation error of the different methods in the three experimental settings.]
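A possible implementation of such a synthetic setup is sketched below. The dimensions, noise level, correlation strength and regularization parameter are illustrative assumptions only (the exact values used in the experiments are not reproduced here), and `trace_lasso_irls` refers to the solver sketched in Section 3:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s, sigma, rho = 64, 128, 16, 0.1, 0.9      # assumed values, for illustration

def make_sigma(kind):
    if kind == 'identity':                        # experiment 1: uncorrelated design
        return np.eye(p)
    if kind == 'block':                           # experiment 2: clusters of 8 variables
        block = (1 - rho) * np.eye(8) + rho * np.ones((8, 8))
        return np.kron(np.eye(p // 8), block)
    if kind == 'toeplitz':                        # experiment 3: Toeplitz design
        idx = np.arange(p)
        return rho ** np.abs(np.subtract.outer(idx, idx))

w_true = np.zeros(p)
w_true[:s] = rng.uniform(-1, 1, size=s)           # sparse ground truth

for kind in ['identity', 'block', 'toeplitz']:
    L = np.linalg.cholesky(make_sigma(kind))
    X = rng.standard_normal((n, p)) @ L.T         # rows ~ N(0, Sigma)
    y = X @ w_true + sigma * rng.standard_normal(n)
    w_hat = trace_lasso_irls(X, y, lam=0.1)       # solver from the Section 3 sketch
    print(kind, np.linalg.norm(w_hat - w_true))   # estimation error
```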
6 Conclusion
We introduce a new penalty function, the trace Lasso, which takes advantage of the correlation between covariates to add strong convexity exactly in the directions where it is needed, unlike the elastic net for example, which blindly adds a squared $\ell_2$-norm term in all directions. We show on synthetic data that this adaptive behavior leads to better estimation performance. In the future, we want to show that if a dedicated norm using prior knowledge, such as the group Lasso, can be used, the trace Lasso will behave similarly and its performance will not degrade too much, providing theoretical guarantees for such adaptivity. Finally, we will seek applications of this estimator in inverse problems such as deblurring, where the design matrix exhibits strong correlation structure.
Acknowledgements
This paper was partially supported by the European Research Council (SIERRA Project ERC-239993).
References
- [1] S.G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. Signal Processing, IEEE Transactions on, 41(12):3397–3415, 1993.
- [2] T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. Advances in Neural Information Processing Systems, 22, 2008.
- [3] M.W. Seeger. Bayesian inference and optimal design for the sparse linear model. The Journal of Machine Learning Research, 9:759–813, 2008.
- [4] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.
- [5] S.S. Chen, D.L. Donoho, and M.A. Saunders. Atomic decomposition by basis pursuit. SIAM journal on scientific computing, 20(1):33–61, 1999.
- [6] P.J. Bickel, Y. Ritov, and A.B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
- [7] P. Zhao and B. Yu. On model selection consistency of Lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.
- [8] M.J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso). Information Theory, IEEE Transactions on, 55(5):2183–2202, 2009.
- [9] G.H. Golub and C.F. Van Loan. Matrix computations. Johns Hopkins Univ Pr, 1996.
- [10] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
- [11] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
- [12] F.R. Bach. Consistency of the group Lasso and multiple kernel learning. The Journal of Machine Learning Research, 9:1179–1225, 2008.
- [13] F.R. Bach. Bolasso: model consistent Lasso estimation through the bootstrap. In Proceedings of the 25th international conference on Machine learning, pages 33–40. ACM, 2008.
- [14] H. Liu, K. Roeder, and L. Wasserman. Stability approach to regularization selection (stars) for high dimensional graphical models. Advances in Neural Information Processing Systems, 23, 2010.
- [15] N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.
- [16] A. Tikhonov. Solution of incorrectly formulated problems and the regularization method. In Soviet Math. Dokl., volume 5, page 1035, 1963.
- [17] A.E. Hoerl and R.W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
- [18] G. Davis, S. Mallat, and M. Avellaneda. Adaptive greedy approximations. Constructive approximation, 13(1):57–98, 1997.
- [19] E.J. Candes and T. Tao. Decoding by linear programming. Information Theory, IEEE Transactions on, 51(12):4203–4215, 2005.
- [20] A. Lorbert, D. Eis, V. Kostina, D. M. Blei, and P. J. Ramadge. Exploiting covariate similarity in sparse regression via the pairwise elastic net. JMLR - Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 9:477–484, 2010.
- [21] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning. 2001.
- [22] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. Advances in neural information processing systems, 19:41, 2007.
- [23] S.P. Boyd and L. Vandenberghe. Convex optimization. Cambridge Univ Pr, 2004.
- [24] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, 2011.
- [25] F.R. Bach. Consistency of trace norm minimization. The Journal of Machine Learning Research, 9:1019–1048, 2008.
Appendix A Perturbation of the trace norm
We follow the technique used in [25] to obtain an approximation of the trace norm.
A.1 Jordan-Wielandt matrices
Let $M \in \mathbb{R}^{n \times p}$ of rank $r$. We denote by $s_1, \dots, s_r$ the strictly positive singular values of $M$, and by $u_i$ and $v_i$ the associated left and right singular vectors. We introduce the Jordan-Wielandt matrix
$$\bar M = \begin{pmatrix} 0 & M \\ M^\top & 0 \end{pmatrix}.$$
The singular values of $M$ and the eigenvalues of $\bar M$ are related: $\bar M$ has eigenvalues $s_i$ and $-s_i$, associated to the eigenvectors
$$\frac{1}{\sqrt{2}}\begin{pmatrix} u_i \\ v_i \end{pmatrix} \quad \text{and} \quad \frac{1}{\sqrt{2}}\begin{pmatrix} u_i \\ -v_i \end{pmatrix}.$$
The remaining eigenvalues of $\bar M$ are equal to $0$ and are associated to eigenvectors of the form
$$\begin{pmatrix} u \\ v \end{pmatrix}, \quad \text{where } M^\top u = 0 \text{ and } M v = 0.$$
A.2 Cauchy residue formula
Let $\mathcal{C}$ be a closed curve that does not go through the eigenvalues of $\bar M$. We define
$$\Pi_{\mathcal{C}}(\bar M) = \frac{1}{2 i \pi} \oint_{\mathcal{C}} \big(\lambda I - \bar M\big)^{-1}\, d\lambda.$$
Then $\Pi_{\mathcal{C}}(\bar M)$ is the orthogonal projection onto the eigensubspaces of $\bar M$ associated to the eigenvalues lying inside $\mathcal{C}$.
A.3 Perturbation analysis
Let be a perturbation matrix such that , and let be a closed curve around the largest eigenvalues of and . We can study the perturbation of the strictly positive singular values of by computing the trace of . Using the fact that , we have
We note and the first two terms of the right hand side of this equation. We have
and
If , the integral is zero. Otherwise, we have
where
Therefore, if and are both inside or outside the interior of , the integral is equal to zero. So
For and , we have
and for and , we have
So
Now, let be the circle of center and radius . We can study the perturbation of the singular values of equal to zero by computing the trace norm of . We have
Then, if we note the first integral and the second one , we get
If both and are outside , then the integral is equal to zero. If one of them is inside, say , then and the integral is equal to
Then this integral is nonzero if and only if is also inside . Thus
where are the eigenvectors associated to the eigenvalue . We have
The integral is not equal to zero if and only if exactly one eigenvalue, say , is outside . The integral is then equal to . Thus
where . Finally, putting everything together, we get
Proposition 6.
Let $M = U\,\mathrm{Diag}(s)\,V^\top$ be the singular value decomposition of $M$, with $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{p \times r}$. Let $\Delta \in \mathbb{R}^{n \times p}$. We have
where
Appendix B Proof of Proposition 1
In this section, we prove that if the loss function is strongly convex with respect to its second argument, then the solution of the penalized empirical risk minimization is unique.
Let If is in the nullspace of , then and the minimum is unique. From now on, we suppose that the minima are not in the nullspace of .
Let and . By convexity of the objective function, all the , for are also optimal solutions, and so, we can choose an optimal solution such that for all in the support of . Because the loss function is strongly convex outside the nullspace of , is in the nullspace of .
Let be the SVD of . We have the following development around :
We note the support of . Using the fact that the support of is included in , we have , where for and otherwise. Then:
For small , is also a minimum, and therefore, we have:
(4)
(5)
This could be summarized as
(6) |
This means that the eigenspaces of
are stable under the matrix . Therefore,