The linear model
is one of the most fundamental models in statistics. It describes the relationship between a vector of
independent observations of a response variable and a matrixof observations of features. The unknown parameters of this model are the vector of coefficients , which expresses how relates to , and the error variance , which captures the noise level or extent to which cannot be predicted from : The vector consists of independently and identically distributed zero-mean Gaussian errors with variance . When , estimating is a challenging, well-studied problem. Perhaps the most common method in this setting is the lasso (Tibshirani, 1996), which assumes that is sparse and solves the following convex optimization problem:
Over the past decade, an extensive literature has emerged studying the properties of from both computational (see, e.g., Hastie et al., 2015) and theoretical (see, e.g., Bühlmann & Van De Geer, 2011) perspectives.
Compared to the vast amount of work on estimating , relatively little attention has been paid to the problem of estimating . Nonetheless, reliable estimation of is important for quantifying the uncertainty in estimating . A series of recent advances in post-selection inference (Lockhart et al., 2014; Javanmard & Montanari, 2014; Lee et al., 2016; Tibshirani et al., 2016; Taylor & Tibshirani, 2017, etc.) may very well be the determining factor for the widespread adoption of the lasso in fields where
-values and confidence intervals are required. Point estimates without accompanying inferential statements are distrusted and disregarded in these areas. Estimatingreliably in finite sample is crucial for the adoption of this work to these fields.
If were known, then the optimal estimator for would of course be . Thus, a naive estimator for based on an estimator of would be
Chatterjee & Jafarov (2015) study the theoretical properties of (3) when is the lasso estimator, , with in (2) selected using a cross-validation procedure. However, a simple calculation in the classical setting shows that such an estimator is biased downward: a least-squares oracle with knowledge of the true support
scales this to give an unbiased estimator:
where is a sub-matrix of with columns indexed by and is its pseudoinverse. Many papers in this area discuss the difficulty of estimating and warn of the perils of underestimating it: if is underestimated then one gets anti-conservative confidence intervals, which are highly undesirable (Tibshirani et al., 2015).
Reid et al. (2013) carry out an extensive review and simulation study of several estimators of (Fan et al., 2012; Sun & Zhang, 2012; Dicker, 2014), and they devote special attention to studying the estimator
where is as in (2), with selected using a cross-validation procedure, and is the number of nonzero elements in . They show that (5) has promising performance in a wide range of simulation settings and provide an asymptotic theoretical understanding of the estimator in the special case where
is an orthogonal matrix.
While intuition from (4) suggests that (5) is a quite reasonable estimator when can be well recovered, it also points to the question of how well the estimator will perform when is not well recovered by the lasso. The conditions required for the lasso to recover are much stricter than the conditions needed for it to do well in prediction (see, e.g., Van de Geer & Bühlmann, 2009). The scale factor used in means that this approach depends not just on the predicted values of the lasso, , but on the magnitude of the set of nonzero elements in . Indeed, we find that in situations where recovering is known to be challenging, tends to yield less favorable empirical performance. The theoretical development in Reid et al. (2013) sidesteps this complication by working in an asymptotic regime in which behaves like the naive estimator (3). To understand the finite-sample performance of
would require considering the behavior of the random variable. Clearly, when , even small fluctuations in can lead to large fluctuations in . Finally, from a practical standpoint, computing is a numerically sensitive operation in that it requires the choice of a threshold size for calling a value numerically zero (and the assurance that one has solved the problem to sufficient precision).
Based on these observations, we propose in this paper a completely different approach to estimating . The basic premise of our framework is that when both and are unknown, it is convenient to formulate the penalized likelihood problem in terms of
the natural parameters of the Gaussian multiparameter exponential family with unknown mean and variance. Whereas the negative Gaussian likelihood is not jointly convex in the parameterization (in fact, it is nonconvex in ), in the natural parameterization the negative likelihood is jointly convex in .
We penalize this negative-likelihood with an -norm on the natural parameter and call this new estimator the natural lasso. We show in the next section that the resulting error variance estimator can in fact be very simply expressed as the minimizing value of the regular lasso problem (2):
Observing that the first term is , we directly see that the natural lasso counters the naive method’s downward bias through an additive correction; this is in contrast to ’s reliance on a (sometimes unstable) multiplicative correction. Computing (7) is clearly no harder than solving a lasso and, unlike , does not require determining a threshold for deciding which coefficient estimates are numerically zero. Furthermore, we establish finite sample bounds on the mean squared error that hold without making any assumptions on the design matrix . Our theoretical analysis suggests a second approach that is also based on the natural parameterization. The theory that we develop for this method, which we call the organic lasso, relies on weaker assumptions. We find that both methods have competitive empirical performance relative to and show particular strength in settings in which support recovery is known to be challenging.
2 The natural lasso estimator of error variance
2.1 Method formulation
The negative log-likelihood function in (1) is (up to a constant)
When is known, the dependence can be ignored, leading to the standard least-squares criterion; however, when is unknown, performing a full maximization of the -penalized negative log-likelihood amounts to solving a nonconvex optimization problem:
The nonconvexity of the Gaussian negative log-likelihood in its variance (or, more generally, covariance matrix) is a well-known difficulty (Bien & Tibshirani, 2011). In this context, working instead with the inverse covariance matrix is common (Yuan & Lin, 2007; Banerjee et al., 2008; Friedman et al., 2008). We take an analogous approach here, considering the natural parameterization (6) of the Gaussian multiparameter exponential family with unknown variance,
Observing that attaining sparsity in is equivalent to attaining sparsity in , we propose the natural lasso as the solution to the following -penalized maximum likelihood problem:
This problem is jointly convex in . While this is a general property of exponential families (due to the convexity of the cumulant generating function), we can see it in this special case because of the convexity of and the convexity of the “quadratic-over-linear” function (Boyd & Vandenberghe, 2004). Given a solution to (8), we can reverse (6) to get estimators for and :
One might think that solving the natural lasso (8) would involve a specialized algorithm. The following proposition shows, remarkably, that this is not the case.
The proof of this proposition and all theoretical results that follow can be found in the appendices. Thus, to get the natural lasso estimates of , one simply solves the standard lasso (2) and returns the solution and the minimal value.
Before proceeding with a statistical analysis of the natural lasso estimator, we point out a similarity between our method and that of Städler et al. (2010), who consider a different convexifying reparameterization of the Gaussian log-likelihood, using and . They put an -norm penalty on (which has the same sparsity pattern as ) and solve
Sun & Zhang (2010)
give an asymptotic analysis of the solution to (10) under a compatibility condition. A modification of this problem (Antoniadis, 2010) gives the scaled lasso (Sun & Zhang, 2012), which is known to be equivalent to the square-root lasso (Belloni et al., 2011):
2.2 Mean squared error bound
An attractive property of the natural lasso estimator is the relative ease with which one can prove bounds about its performance. Since is the optimal value of the lasso problem, the objective value at any vector provides an upper bound on . Likewise, any dual feasible vector provides a lower bound on . These considerations are used to prove the following lemma, which shows that for a suitably chosen , the natural lasso variance estimator gets close to the oracle estimator of .
If , then .
The result above is “deterministic” in that it does not rely on any statistical assumptions or arguments. The next result adds such considerations to give a mean squared error bound for the natural lasso.
Suppose that each column of the matrix has been scaled so that for all , and assume that . Then, for any constant , the natural lasso estimate (7) with satisfies the following relative mean squared error bound:
This follows from Jensen’s inequality. ∎
Notably, the mean squared error bound in Theorem 3 does not put any assumption on , , or . In this sense, the result is analogous to a “slow rate” bound (Rigollet & Tsybakov, 2011; Dalalyan et al., 2017), which appears in the lasso prediction consistency context. By comparison, Sun & Zhang (2012) establish a bound for (11) under compatibility conditions, which can be thought of as the corresponding “fast rate” bound. In particular, a bound further implied from (12), i.e., makes the comparison between these two rates clearer. While Sun & Zhang (2012) enjoys a faster rate as , this was obtained under additional assumptions on and and thus only applies for weakly correlated designs. On the other hand, (12) shows that in finite sample settings, large values of will make the estimation easier, and conceivably the slow rate bound could in fact be smaller than the fast rate bound.
It has been argued that “slow rate” is a poor name, as it can be favorable in settings where has highly correlated columns and thus the assumptions for the fast rate do not hold (Hebiri & Lederer, 2013). Indeed, for error variance estimation, we are unaware of a similar rate to (12) for the square-root/scaled lasso estimate of that holds without fast-rate-like assumptions. Lederer et al. (2016) provide a slow rate bound on the prediction error of (11
) that implies an even slower (high-probability) bound for error variance estimation:
Bayati et al. (2013) propose an estimator of based on estimating the mean squared error of the lasso. They show that their estimator of is asymptotically consistent with fixed as . In contrast, we provide finite sample results and these include the case. Also, the consistency result in Bayati et al. (2013) is based on the assumption of independent Gaussian features (and in extending this to the case of correlated Gaussian features, the authors invoke a conjecture). In comparison, (12) is essentially free of assumptions on the design matrix.
3 The organic lasso estimate of error variance
3.1 Method formulation
In practice, the value of the regularization parameter in (7) may be chosen via cross-validation; however, Theorem 3 has a regrettable theoretical shortcoming: it requires using a value of that itself depends on , the very quantity that we are trying to estimate! This is a well-known theoretical limitation of the lasso and related methods that motivated the square-root/scaled lasso. In this section, we propose a second new method, which retains the natural lasso’s parameterization, but remedies the natural lasso’s theoretical shortcoming by using a modified penalty. We define the organic lasso to be the solution to the following convex optimization problem:
We observe that the penalty is jointly convex in since it can be expressed as where is convex and is a jointly convex function that is strictly increasing in for (Boyd & Vandenberghe, 2004).
Given a solution to the above problem, we can reverse (6) to give the organic lasso estimators of :
In direct analogy to the natural lasso, the following proposition shows that we can find and without actually solving (14).
The organic lasso estimators correspond to the solution and minimal value of an -penalized least-squares problem:
Thus, to compute the organic lasso estimator, one simply solves a penalized least squares problem, where the penalty is the square of the norm. This can be thought of as the exclusive lasso with a single group (Zhou et al., 2010; Campbell et al., 2017). We show in the next section that solving this problem is no harder than solving a standard lasso problem.
Coordinate descent is easy to implement and has steadily maintained its place as a start-of-the-art approach for solving lasso-related problems (Friedman et al., 2007). For coordinate descent to work, one typically verifies separability in the non-smooth part of the objective function (Tseng, 2001). However, the penalty in (15) is not separable in the coordinates of . Lorbert et al. (2010) propose a coordinate descent algorithm to solve the Pairwise Elastic Net (PEN) problem, a generalization of (15), and a proof of the convergence of the algorithm was given in Lorbert (2012). In Algorithm 1, we give a coordinate descent algorithm specific to solving (15). The R package natural (Yu, 2017) provides C implementation of Algorithm 1.
Each coordinate update is , where is the soft-threshold operator. Empirically Algorithm 1 is found to be essentially as fast as solving a lasso problem. Theorem C.3.9 in Lorbert (2012) implies that, for any initial estimate , Algorithm 1 converges to a stationary point of the objective function of (15). The penalty, although not separable, is well enough behaved that any point that is minimum in every coordinate of the objective function in (15) is indeed a global minimum.
3.3 Theoretical results
A first indication that the organic lasso may succeed where the natural lasso falls short is in terms of scale equivariance. As the design is usually standardized to be unitless, scale equivariance in this context refers to the effect of scaling .
The organic lasso is scale equivariant, i.e., for any ,
Scale equivariance is a property associated with the ability to prove results in which the tuning parameter does not depend on . For example, the square-root/scaled lasso (11) is scale equivariant while the lasso (and thus the natural lasso) is not. In particular, , and for some .
In Section 2.2, we saw how expressing an estimator as the optimal value of a convex optimization problem allows us to take full advantage of convex duality in order to derive bounds on the estimator. We therefore start our analysis of (16) by characterizing its dual problem.
The dual problem of (16) is
Similar arguments as in Section 2.2 give a bound expressing ’s closeness to the oracle estimator of .
If , then
Comparing this lemma to the corresponding lemma in Section 2.2, we see that the condition on depends only on a quantity that is independent of .
Indeed, this leads to a mean squared error bound with the desired property of not depending on .
Suppose that each column of the matrix has been scaled so that for all , and . Then, for any constant , the organic lasso estimate (16) with satisfies the following relative mean squared error bound:
Although not central to our main purpose, the organic lasso estimate (15) of is interesting in its own right. The following theorem gives a slow rate bound in prediction error.
For any , the solution to (15) with has the following bound on the prediction error with probability greater than :
Following the same arguments as in (13), a naive estimate yields a similar rate (in and ) as that of the square-root/scaled lasso, which is much slower than (17). This observation again illustrates the advantage of the natural and organic lasso formulations of error variance estimation.
In Appendix H, we provide mappings between the path of the natural lasso, , and the path of the organic lasso .
4 Simulation studies
4.1 Simulation settings
Reid et al. (2013) carry out an extensive simulation study to compare many error variance estimators. We have matched their simulation settings fairly closely, so that the performance comparison with various other methods mentioned in Reid et al. (2013) can be inferred. Specifically, all simulations are run with and . Each row of the design is generated from a multivariate , with for and . To generate , we randomly select the indices of (out of ) nonzero elements where , and each of the nonzero elements has a value that is randomly drawn from a Laplace distribution with rate 1. The error variance is generated using for . Finally, is generated following (1).
Each model is indexed by a triplet , where captures the correlation among features, determines the sparsity of , and characterizes the signal-to-noise ratio. We vary and . We compute a Monte Carlo estimate (based on 1000 replicates) of both the mean squared error and as the measure of performance. The methods in comparison include: (a) the naive estimator (3) with in (2
); (b) the degrees of freedom adjusted estimatorin (5) (Reid et al., 2013); (c) the square-root/scaled lasso (Belloni et al., 2011; Sun & Zhang, 2013); (d) the natural lasso (7), and (e) the organic lasso (16). As a benchmark, we also include the oracle .
4.2 Methods with regularization parameter selected by cross-validation
We carry out two sets of simulations. In the first set, we compare the performance of the aforementioned methods with regularization parameter selected in a data-adaptive way. In particular, five-fold cross-validation is used to select the tuning parameter for each method.
Due to space constraints, we present a subset of the results in Fig 1 (with additional results presented in the Appendix I). The result for the square-root/scaled lasso is averaged over 100 repetitions due to the large computational time. For all other methods, the results are averaged over 1000 repetitions. Overall, the natural lasso does well in adjusting the downward bias of the naive estimator, while other methods tend to produce under-estimates. In each panel, we fix signal-to-noise ratio () and correlations among features (), and vary model sparsity (). All estimates get worse with growing , except for the natural lasso, which improves as the true gets denser. In particular, both the natural lasso and the organic lasso gain performance advantage over other methods when the underlying models do not satisfy conditions for the support recovery of the lasso solution. From left to right, Fig 1 illustrates the effect of increasing . As observed in Reid et al. (2013), high correlations can be helpful: All curves approach the oracle as increases. Finally, we find that the organic lasso is uniformly better or equivalent to .
Paired -tests and Wilcoxon signed-rank tests show that the differences in mean squared errors of different methods are significant at the level for almost all points shown in Fig 1.
4.3 Methods with fixed choice of regularization parameter
Although solving (16) is fast enough for one to use cross-validation with the organic lasso, Theorem 9 implies that is a theoretically sound choice of regularization parameter. We also conjecture that a sharper rate may be obtainable at , where . With high probability, . Thus, we also show the performance of the organic lasso with tuning parameter values equal to , and , which is a Monte Carlo estimate of .
We compare the organic lasso at these three fixed values of tuning parameter to the square-root/scaled lasso estimator (11) of error variance, which is another method whose theoretical choice of does not depend on . Sun & Zhang (2012) find that works very well for (11), which we denote by scaled(1), and Sun & Zhang (2013) propose a refined choice of , which is proved to attain a sharper rate, denoted by scaled(2).
Fig 2 shows similar patterns as Fig 1. Specifically, large value of helps all methods, while performance generally degrades for denser . Although not shown here, all methods struggle as increases. The organic lasso with performs poorly, while the organic lasso with and do quite well, generally outperforming the square-root/scaled lasso based methods.
5 Error estimation for Million Song dataset
We apply our error variance estimators to the Million Song dataset.111The whole data set can be obtained at https://labrosa.ee.columbia.edu/millionsong/. We consider a subset of the whole data, which is available at https://archive.ics.uci.edu/ml/datasets/yearpredictionmsd. The data consist of information about 463715 songs, and the primary goal is to model the release year of a song using of its timbre features. The dataset has a very large sample size so that we can reliably estimate the ground truth of the target of estimation on a very large set of held out data. In particular, we randomly select half of the songs for this purpose and use to form our ground truth, where is the least-squares estimate of . In practice, model (1) may rarely hold, which alters the interpretation of error variance estimation. Suppose the response vector has mean and covariance matrix . Then can be thought of as an estimate of the population quantity
In the special case where and , as in (1), then reduces to the linear model noise variance .
From the remaining data that was not previously used to yield , we randomly form training datasets of size and compare the performance of various error variance estimators. We vary in to gauge the performance of these methods in situations in which and . For each , we repeat the data selection and error variance estimation on disjoint training sets, and report estimates of the mean squared error in Table 1 and estimates of in Table 3 in Appendix I.
|naive||17.02 (0.68)||8.48 (0.41)||5.28 (0.26)||3.80 (0.17)||3.03 (0.13)||2.43 (0.10)|
|10.74 (0.45)||5.92 (0.29)||3.57 (0.17)||2.57 (0.11)||2.23 (0.10)||1.75 (0.08)|
|natural||8.82 (0.38)||5.23 (0.27)||3.47 (0.16)||2.61 (0.12)||2.39 (0.11)||2.01 (0.09)|
|organic||8.08 (0.32)||4.23 (0.20)||2.59 (0.12)||2.00 (0.08)||1.72 (0.08)||1.54 (0.07)|
|scaled(1)||7.43 (0.37)||4.92 (0.25)||3.84 (0.17)||3.08 (0.13)||2.94 (0.12)||2.75 (0.11)|
|scaled(2)||7.11 (0.28)||3.36 (0.15)||2.23 (0.10)||2.57 (0.83)||1.61 (0.07)||1.46 (0.07)|
|organic()||5.87 (0.24)||3.17 (0.14)||1.93 (0.09)||1.40 (0.06)||1.20 (0.05)||1.02 (0.05)|
|organic()||5.72 (0.24)||3.15 (0.14)||1.99 (0.09)||1.45 (0.07)||1.28 (0.05)||1.12 (0.05)|
Mean and standard errors (over 1000 replications) of the squared error of various methods. Each entry is multiplied by 100 to convey information more compactly.
All methods produce a substantial performance improvement over the naive estimate for a wide range of values of . The natural and organic lassos with cross validation perform either better or comparably to and are in some (but not all) cases outperformed by scaled(2). The natural lasso shows some upward bias (which as we noted before is less problematic than downward bias) when gets large. The organic lasso with the fixed choices or perform extremely well for all .
JB was supported by an NSF CAREER grant, DMS-1653017.
Appendix A Proof of Lemma 2
From (2), it follows that
By introducing the dual variable ,
By assumption, is dual feasible, which means that
where in the last step we applied Hölder’s inequality.
We prove in this section that both the natural lasso and the organic lasso estimates of error variance can be simply expressed as the minimizing values of certain convex optimization problems. To do so, we exploit the first order optimality condition of each convex program.
Lemma 11 (Optimality condition of the natural lasso).
For any , is a solution to (8) if and only if
We show that the organic lasso estimate of is the minimal value of the -penalized least squares problem. As the natural lasso, we start with studying the following optimality condition:
Lemma 12 (Optimality condition of the organic lasso).
For any , is a solution to (14) if and only if
So following the same parameterization (6), we have that and , and
Appendix C Proof of Lemma 7: the dual problem of the -penalized least squares
The primal problem of the -penalized least squares (15) can be written as an equality constrained minimization problem:
The Lagrange dual function is
The minimization of is
where the minimum is attained at
The minimization problem of can be written as
Observe that the maximum is the Fenchel conjugate function of , evaluated at . By Boyd & Vandenberghe (2004, Example 3.27, pp. 92-93),
Appendix D Proof of Lemma 8
A direct upper bound is
To get a lower bound of , note that the dual problem in Lemma 7 and the strong duality imply that
where the last inequality holds for
We present in this section the proof of Theorem 9. The proof of Theorem 3 follows the same set of arguments. First we use the following lemma to characterize the event that is true, so that we can use Lemma 8 to prove a high probability bound.
Lemma 13 (Corollary 4.3, Giraud (2014)).
Assume that each column of the design matrix satisfies for all , and . Then for any ,
Lemma 13 implies that a good choice of the value of would be , which does not depend on any parameter of the underlying model. The following corollary shows that with this value of , the organic lasso estimate of is close to the oracle estimator with high probability.
Assume that each column of the design matrix satisfies for all , and . Then for any , the organic lasso with
has the following bound
with probability greater than .
In general, a high probability bound does not necessarily imply an expectation bound. However, when the probability bound holds with an exponential tail, it implies an expectation bound with essentially the same rate.
Assume that each column of the design matrix satisfies