Scaled Sparse Linear Regression

Scaled sparse linear regression jointly estimates the regression coefficients and noise level in a linear model. It chooses an equilibrium with a sparse regression method by iteratively estimating the noise level via the mean residual square and scaling the penalty in proportion to the estimated noise level. The iterative algorithm costs little beyond the computation of a path or grid of the sparse regression estimator for penalty levels above a proper threshold. For the scaled lasso, the algorithm is a gradient descent in a convex minimization of a penalized joint loss function for the regression coefficients and noise level. Under mild regularity conditions, we prove that the scaled lasso simultaneously yields an estimator for the noise level and an estimated coefficient vector satisfying certain oracle inequalities for prediction, the estimation of the noise level and the regression coefficients. These inequalities provide sufficient conditions for the consistency and asymptotic normality of the noise level estimator, including certain cases where the number of variables is of greater order than the sample size. Parallel results are provided for the least squares estimation after model selection by the scaled lasso. Numerical results demonstrate the superior performance of the proposed methods over an earlier proposal of joint convex minimization.

Authors

• 4 publications
• 30 publications
05/27/2017

Generalized Concomitant Multi-Task Lasso for sparse multimodal regression

In high dimension, it is customary to consider Lasso-type estimators to ...
02/13/2012

Sparse Matrix Inversion with Scaled Lasso

We propose a new method of learning a sparse nonnegative-definite target...
03/18/2021

Robust-to-outliers square-root LASSO, simultaneous inference with a MOM approach

We consider the least-squares regression problem with unknown noise vari...
04/09/2020

Sparse recovery of noisy data using the Lasso method

We present a detailed analysis of the unconstrained ℓ_1-method Lasso met...
11/03/2010

The Lasso under Heteroscedasticity

The performance of the Lasso is well understood under the assumptions of...
11/12/2021

Distributed Sparse Regression via Penalization

We study sparse linear regression over a network of agents, modeled as a...
11/18/2020

A Discussion on Practical Considerations with Sparse Regression Methodologies

Sparse linear regression is a vast field and there are many different al...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

This paper concerns the simultaneous estimation of the regression coefficients and noise level in a high-dimensional linear model. High-dimensional data analysis is a topic of great current interest due to the growth of applications where the number of unknowns far exceeds the number of data points. Among statistical models arising from such applications, linear regression is one of the best understood. Penalization, convex minimization and thresholding methods have been proposed, tested with real and simulated data, and proved to control errors in prediction, estimation and variable selection under various sets of regularity conditions. These methods typically require an appropriate penalty or threshold level. A larger penalty level may lead to a simple model with large bias, while a smaller penalty level may lead to a complex noisy model due to overfitting. Scale-invariance considerations and existing theory suggest that the penalty level should be proportional to the noise level of the regression model. In the absence of knowledge of the latter level, cross-validation is commonly used to determine the former. However, cross-validation is computationally costly and theoretically poorly understood, especially for the purpose of variable selection and the estimation of regression coefficients. The penalty level selected by cross-validation is called the prediction-oracle in

Meinshausen & Bühlmann (2006), who gave an example to show that the prediction-oracle solution does not lead to consistent model selection for the lasso.

Estimation of the noise level in high-dimensional regression is interesting in its own right. Examples include quality control in manufacturing and risk management in finance.

Our study is motivated by Städler et al. (2010) and the comments on that paper by Antoniadis (2010) and Sun & Zhang (2010). Städler et al. (2010) proposed to estimate the regression coefficients and noise level by maximizing their joint log-likelihood with an penalty on the regression coefficients. Their method has a unique solution due to the joint concavity of the log-likelihood under a certain transformation of the unknown parameters. Sun & Zhang (2010) proved that this penalized joint maximum likelihood estimator may result in a positive bias for the estimation of the noise level and compared it with two alternatives. The first is a one-step bias correction of the penalized joint maximum likelihood estimator. The second is an iterative algorithm that alternates between estimating the noise level via the mean residual square and scaling the penalty level in a predetermined proportion to the estimated noise level in the lasso or minimax concave penalized selection paths. In a simulation experiment Sun & Zhang (2010) demonstrated the superiority of the iterative algorithm, compared with the penalized joint maximum likelihood estimator and its bias correction. However, no theoretical results were given for the iterative algorithm. Antoniadis (2010) commented on the same problem from a different perspective by raising the possibility of adding an penalty to Huber’s concomitant joint loss function. See, for example, section 7.7 of Huber & Ronchetti (2009). Interestingly, the minimizer of this penalized joint convex loss is identical to the equilibrium of the iterative algorithm for the lasso path. Thus, the convergence of the iterative algorithm is guaranteed by the convexity.

In this paper, we study Sun & Zhang (2010)’s iterative algorithm for the joint estimation of regression coefficients and the noise level. For the lasso, this is equivalent to jointly minimizing Huber’s concomitant loss function with the penalty, as Antoniadis (2010) pointed out. For simplicity, we call the equilibrium of this algorithm the scaled version of the penalized regression method, for example the scaled lasso or scaled minimax concave penalized selection, depending on the choice of penalty function. Under mild regularity conditions, we prove oracle inequalities for prediction and the joint estimation of the noise level and regression coefficients for the scaled lasso, that imply the consistency and asymptotic normality of the scaled lasso estimator for the noise level. In addition, we prove parallel oracle inequalities for the least squares estimation of the regression coefficients and noise level after model selection by the scaled lasso. We report numerical results on the performance of scaled lasso and other scaled penalized methods, along with that of the corresponding least squares estimator after model selection. These theoretical and numerical results support the use of the proposed method for high-dimensional regression.

We use the following notation throughout the paper. For a vector , denotes the norm with the usual extensions and . For design matrices and subsets of , denotes column vectors of and denotes the matrix composed of columns with indices in . Moreover, .

2 An iterative algorithm

Suppose we observe a design matrix and a response vector . For penalty functions , consider penalized loss functions of the form

 Lλ(β)=|y−Xβ|222n+λ2p∑j=1ρ(|βj|/λ) (1)

where is a vector of regression coefficients. Let the penalty be standardized to , where . A vector is a critical point of the penalized loss (1) if and only if

 {x′j(y−Xˆβ)/n=λ\rm sgn(ˆβj)˙ρ(|ˆβj|/λ),ˆβj≠0,x′j(y−Xˆβ)/n∈λ[−1,1],ˆβj=0. (2)

If the penalized loss (1) is convex in , then (2) is the Karush–Kuhn–Tucker condition for its minimization.

Given a penalty function , one still has to choose a penalty level to arrive at a solution of (2). Such a choice may depend on the purpose of estimation, since variable selection may require a larger than does prediction. However, scale-invariance considerations and theoretical results suggest using a penalty level proportional to the noise level . This motivates a scaled penalized least squares estimator as a numerical equilibrium in the following iterative algorithm:

 ˆσ←|y−Xˆβold|2/{(1−a)n}1/2,λ←ˆσλ0,ˆβ←ˆβnew, Lλ(ˆβnew)≤Lλ(ˆβold), (3)

where is a prefixed penalty level, not depending on , estimates the noise level, and

provides an option for a degrees-of-freedom adjustment with

. For and , (3) initialized with the least squares estimator is non-iterative and gives . For large data sets, one may use a few passes of a gradient descent algorithm to compute from . For , this algorithm was considered in Sun & Zhang (2010). In Sun & Zhang (2010) and the numerical experiments reported in Section 4, is a solution of (2) for the given . We describe this implementation in the following two paragraphs.

The first step of our implementation is the computation of a solution path of (2) beginning from for . For quadratic spline penalties with knots, Zhang (2010) developed an algorithm to compute a linear spline path of solutions of (2) to cover the entire range of . This extends the least angle regression solution or lasso path (Osborne et al., 2000a, b; Efron et al., 2004) from and includes the minimax concave penalty for and the smoothly clipped absolute deviation penalty (Fan & Li, 2001) for . An R package named plus is available for computing the solution paths for these penalties.

The second step of our implementation is the iteration (3) along the solution path computed in the first step. That is to use the already computed

 ˆβnew=ˆβ(λ) (4)

in (3). For the scaled lasso, we use in (3) and in (1) and (2). For the scaled minimax concave penalized selection, we use and the minimax concave penalty , where regularizes the maximum concavity of the penalty. When , it becomes the scaled lasso. The algorithm (3) can be easily implemented once a solution path is computed.

Consider the penalty. As discussed in the introduction, (3) and (4) form an alternating minimization algorithm for the penalized joint loss function

 Lλ0(β,σ)=|y−Xβ|222nσ+(1−a)σ2+λ0|β|1. (5)

Antoniadis (2010) suggested this jointly convex loss function as a way of extending Huber’s robust regression method to high dimensions. For and with fixed , , so that in (4) minimizes over . For fixed , in (3) minimizes over . During the revision of this paper, we learned that She & Owen (2011)

have considered penalizing Huber’s concomitant loss function for outlier detection in linear regression. We summarize some properties of the algorithm (

3) with (4) in the following proposition.

Proposition 1

Let be a solution path of (2) with . The penalized loss function (5) is jointly convex in and the algorithm (3) with (4) converges to

 (ˆβ,ˆσ)=argminβ,σLλ0(β,σ). (6)

The resulting estimators and are scale equivariant in in the sense that and . Moreover,

 ∂∂σLλ0{ˆβ(σλ0),σ}=1−a2−|y−Xˆβ(σλ0)|222nσ2. (7)

Since (5) is not strictly convex, the joint estimator may not be unique for some data . However, since (5) is strictly convex in , is always unique in (6) and the uniqueness of follows from that of the lasso estimator at ; is unique when the second part of (2) is strict in the sense of not hitting when , which holds almost everywhere in for . See, for example, Zhang (2010).

Let . For , (7) implies that

 ˆσ=ˆσ(ˆλ), ˆλ=min{λ:ˆσ2(λ)≤nλ2/(2logp)}. (8)

While the present paper continues our earlier work (Sun & Zhang, 2010) by providing further theoretical and numerical justifications for (3) and (4), the estimator has appeared in different forms. In addition to (5) and (6) of Antoniadis (2010), (8) appeared in Zhang (2010). While this paper was in revision, a reviewer called our attention to Belloni et al. (2011), who focused on studying in an equivalent form as square-root lasso. We note that (3) and (4) allow concave penalties and degrees of freedom adjustments as in Zhang (2010).

3 Theoretical results

3.1 Analysis of scaled lasso

Let be a vector of true regression coefficients. An expert with oracular knowledge of would estimate the noise level by the oracle estimator

 σ∗=|y−Xβ∗|2/n1/2. (9)

Under the Gaussian assumption, this is the maximum likelihood estimator for when is known and follows the distribution. Due to the scale equivariance of in Proposition 1, it is natural to use as an estimation target with or without the Gaussian assumption. We derive upper and lower bounds for and use them to prove the consistency and asymptotic normality of . We derive oracle inequalities for the prediction performance and the estimation of under the loss. Throughout the sequel,

is the probability measure under which

. We assume that whenever is invoked. The asymptotic theory here concerns and allows all parameters and variables to depend on , including .

We first provide the consistency for the estimation of via an oracle inequality for the prediction error of the scaled lasso. In our first theorem, the relative error for the estimation of is bounded by a quantity related to a prediction error bound in (10) below. For , , , and , define and

 η(λ,ξ,w,T)=|Xβ∗−Xw|22/n+(1+δw,T)2λ|wTc|1+4ξ2λ2|T|(ξ+1)2κ2(ξ,T) (10)

where , the compatibility factor (van de Geer & Bühlmann, 2009), is defined as

 κ(ξ,T)=min{|T|1/2|Xu|2n1/2|uT|1:u∈C(ξ,T), u≠0} (11)

with the cone . Since the prediction error bound is valid for all and , is related to its minimum over all and at the oracle scale :

 τ0=η1/2∗(σ∗λ0,ξ)/σ∗,η∗(λ,ξ)=infw,T η(λ,ξ,w,T). (12)
Theorem 1

Let be the scaled lasso estimator in (6), , the oracle noise level in (9), and . When ,

 max(1−ˆσσ∗,1−σ∗ˆσ)≤τ0,|Xˆβ−Xβ∗|2n1/2σ∗≤1σ∗η1/2∗(σ∗λ01−τ0,ξ)≤τ01−τ0. (13)

In particular, if with and , then

 prβ∗,σ(|ˆσ/σ−1|>ϵ)→0 (14)

for all .

Theorem 1 extends to the scaled lasso a unification of prediction oracle inequalities for a fixed penalty. With , (13) gives , or

 max{(σ∗τ0)2,|Xˆβ−Xβ∗|22/n}≤minw{|Xw−Xβ∗|22/n+4˜Cλp∑j=1min(λ,|wj|)} (15)

for a , if the minimum in (15) is attained at a with , where . This asserts that for an arbitrary, possibly non-sparse , the prediction error of the scaled lasso is no greater than that of the best linear predictor with a sparse for an additional capped- cost of the order . A consequence of this prediction error bound for the scaled lasso is the consistency of the corresponding estimator of the noise level in (14). Due to the scale equivariance in Proposition 1, Theorem 1 and the results in the rest of the section are all scale free.

For fixed penalty , the upper bound has been previously established for different and , with possibly different constant factors. Examples include (Greenshtein & Ritov, 2004; Greenshtein, 2006), with (van de Geer & Bühlmann, 2009), and (Koltchinskii et al., 2011). In (10), the coefficient for is 1 as in Koltchinskii et al. (2011).

Now we provide sharper convergence rates and the asymptotic normality for the scaled lasso estimation of the noise level . This sharper rate , essentially taking the square of the order in (13), is based on the following error bound for the estimation of ,

 μ(λ,ξ)=(ξ+1)minTinf0<ν<1max[|β∗Tc|1ν,λ|T|/{2(1−ν)}κ2{(ξ+ν)/(1−ν),T}]. (16)

This error bound has the interpretation

 |ˆβ−β∗|1≤μ(λ,ξ)≤˜Cp∑j=1min(λ,|β∗j|), (17)

if with . This allows to have many small elements, as in Zhang & Huang (2008), Zhang (2009) and Ye & Zhang (2010). The bound improves upon its earlier version in van de Geer & Bühlmann (2009) by a constant factor .

Theorem 2

Let and be as in Theorem 1. Set . (i) The following inequalities hold when ,

 max(1−ˆσ/σ∗,1−σ∗/ˆσ)≤τ2∗,|ˆβ−β∗|1≤μ(σ∗λ0,ξ)/(1−τ2∗). (18)

(ii) Let . For all and ,

 prβ∗,σ{z∗≤(1−τ2∗)λ0(ξ−1)/(ξ+1)}≥1−{1+o(1)}ϵ/{πlog(p/ϵ)}1/2.

If with and , then

 n1/2(ˆσ/σ−1)→N(0,1/2) (19)

in distribution under .

Since with , the rate in (18) is essentially the square of that in (13), in view of (12). It follows that the scaled lasso provides a faster convergence rate than does the penalized maximum likelihood estimator for the estimation of the noise level (Städler et al., 2010; Sun & Zhang, 2010). In particular, (18) implies that

 max(1−ˆσ/σ∗,1−σ∗/ˆσ)≤(ξ+1)λ20|Sβ∗|/{2κ2(ξ,Sβ∗)}≲|β∗|0(logp)/n (20)

with , when can be treated as a constant. The bounds in (20) and its general version (18) lead to the asymptotic normality (19) under proper assumptions. Thus, statistical inference about is justified with the scaled lasso in certain large--smaller- cases, for example, when under the compatibility condition (van de Geer & Bühlmann, 2009).

For a fixed penalty level, oracle inequalities for the error of the lasso have been established in Bunea et al. (2007), van de Geer (2008) and van de Geer & Bühlmann (2009) for , Zhang & Huang (2008) and Bickel et al. (2009) for , Meinshausen & Yu (2009) for , and Zhang (2009) and Ye & Zhang (2010) for . The bounds on in (18) and (20) allow automatic extensions of these existing oracle inequalities from the lasso with fixed penalty to the scaled lasso. We illustrate this by extending the oracle inequalities of Ye & Zhang (2010) for the lasso and Candes & Tao (2007) for the Dantzig selector in the following corollary. Ye & Zhang (2010) used the following sign-restricted cone invertibility factor to separate conditions on the error and design in the derivation of error bounds for the lasso:

 Fq(ξ,S)=inf{|S|1/q|X′Xu|∞n|u|q:u∈C−(ξ,S)}, (21)

where . The quantity (21

) can be viewed as a generalized restricted eigenvalue comparing the

loss and the dual norm of the penalty with respect to the inner product for the least squares fit. This gives a direct connection to the Karush–Kuhn–Tucker condition (2). Compared with the restricted eigenvalue (Bickel et al., 2009) and the compatibility factor (11), a main advantage of (21) is to allow all . In addition, (21) yields sharper oracle inequalities (Ye & Zhang, 2010). For with , define

 δ±a=maxA,u{±(|XAu/n1/2|2−1)}, θa,b=maxA,B,u∣∣X′AXBu/n∣∣2. (22)

The quantities in (22) are used in the uniform uncertainty principle (Candes & Tao, 2007) and the sparse Riesz condition (Zhang & Huang, 2008). We note that is the minimum eigenvalue of among , is the corresponding maximum eigenvalue, and is the maximum operator norm of size off-diagonal sub-blocks of the Gram matrix .

Corollary 1

Suppose . Then, Theorem 2 holds with replaced by , and for ,

 |ˆβ−β∗|q≤k1/q(σ∗z∗+ˆσλ0)Fq(ξ,S)≤2σ∗ξλ0k1/q(1−τ2∗)(ξ+1)Fq(ξ,S) (23)

for all , where . In particular, for and ,

 |ˆβ−β∗|2≤(8k)1/2λ0σ∗/(1−τ2∗)(√2+1)F2(√2,S)≤4k1/2λ0σ∗/(1−τ2∗)(√2+1)(1−δ−1.5k−θ2k,1.5k)+. (24)

The proofs of Theorems 1 and 2 are based on a basic inequality

 |Xˆβ(λ)−Xβ∗|22/n+|Xˆβ(λ)−Xw|22/n (25) ≤|Xw−Xβ∗|22/n+2λ{|w|1−|ˆβ(λ)|1}+2σ∗z∗|w−ˆβ(λ)|1 (26)

as a consequence of the Karush–Kuhn–Tucker conditions (2). The version of (25) with is well-known (van de Geer & Bühlmann, 2009) and controls for sparse . When , (25) controls the excess for sparse by the same argument. The general is taken in Theorem 1, while is taken in Theorem 2. In both cases, (25) provides the cone condition in (11) and (21). This is used to derive upper and lower bounds for (7), the derivative of the profile loss function with respect to , within a neighborhood of . The bounds for the minimizer then follow from the joint convexity of the penalized loss (5).

3.2 Estimation after model selection

We have proved that without requiring the knowledge of , the scaled lasso enjoys prediction and estimation properties comparable to the best known theoretical results for known , and the scaled lasso estimate of enjoys consistency and asymptotic normality properties under proper conditions. However, the lasso estimator may have substantial bias (Fan & Peng, 2004; Zhang, 2010), and its bias is significant in our own simulation experiments. Although the smoothly clipped absolute deviation and minimax concave penalized selectors were introduced to remove the bias of the lasso (Fan & Li, 2001; Zhang, 2010), a theoretical study of their scaled version (3) is beyond the scope of this paper. In this subsection, we present theoretical results for another bias removing method: least squares estimation after model selection.

Given an estimator of the coefficient vector , the least squares estimator of and the corresponding estimator of the noise level in the model selected by are

 (27)

where . Alternatively, we may use to estimate the noise level. However, since the effect of this degrees of freedom adjustment is of smaller order than our error bound, we will focus on the simpler (27).

In addition to the compatibility factor in (11), we use sparse eigenvalues to study the least squares estimation after the scaled lasso selection. Let be the smallest eigenvalue of a matrix and the largest . For models , define

 κ−(m,T)=minB⊃T,|B∖T|≤mλmin(X′BXB/n), κ+(m,T)=minB∩T∅,|B|≤mλmin(X′BXB/n),

as the sparse lower eigenvalue of the Gram matrix for models containing and the sparse upper eigenvalue for models disjoint with . Let and . The following theorem provides prediction and estimation error bounds for (27) after the scaled lasso selection, along with an upper bound for the false positive , a key element in our study.

Theorem 3

Let be the scaled lasso estimator in (6) and the least squares estimator (27) in model . Let and be as in Theorem 2 and be an integer satisfying . If , then

 |ˆS∖S|

with and , and

 κ−(m−1,S)|¯¯¯β−β∗|22≤|X¯¯¯β−Xβ∗|22/n≤{σ∗m−1,S+2√η∗(ˆλ,ξ)}2. (29)

Moreover, in addition to the probability bound for in Theorem 2 (ii), for all integers ,

 prβ∗,σ[σ∗m,S(√n)/σ≥√(m+|S|)+√{(2m)log(ep/m)+2log(1/ϵ)}]≤ϵ/m√(2π). (30)

For Gaussian design matrices, the sparse eigenvalues and can be treated as constants when is small and the eigenvalues of the expected Gram matrix are uniformly bounded away from zero and infinity (Zhang & Huang, 2008). Since and , they can be treated as constants in the same sense in Theorem 3. Thus, for sufficiently small , we may take an of the same order as . In this case, the difference between and the scaled lasso estimator is of no greater order than the difference between and the estimation target . Consequently,

 ∣∣¯¯¯σ/σ−1∣∣+∣∣ˆβ−β∗∣∣22+∣∣X¯¯¯β−Xβ∗∣∣22/n=OP(1)|S|(logp)/n.

As we have mentioned earlier, the key element in our analysis of (27) is the bound in (28). Since this is a weaker assertion than variable consistency , the conditions of Theorem 3 on the design matrix is of a weaker form than the irrepresentability condition for variable selection consistency (Meinshausen & Bühlmann, 2006; Zhao & Yu, 2006). In Zhang & Huang (2008) and Zhang (2010), upper bounds for the false positive were obtained under a sparse Riesz condition on and .