Model Selection and estimation of Multi Screen Penalty

by   Yuehan Yang, et al.

We propose a multi-step method, called Multi Screen Penalty (MSP), to estimate high-dimensional sparse linear models. MSP uses a series of small and adaptive penalty to iteratively estimate the regression coefficients. This structure is shown to greatly improve the model selection and estimation accuracy, i.e., it precisely selects the true model when the irrepresentable condition fails; under mild regularity conditions, MSP estimator can achieve the rate √(q n /n) for the upper bound of l_2-norm error. At each step, we restrict the selection of MSP only on the reduced parameter space obtained from the last step; this makes its computational complexity is roughly comparable to Lasso. This algorithm is found to be stable and reaches to high accuracy over a range of small tuning parameters, hence deletes the cross-validation segment. Numerical comparisons show that the method works effectively both in model selection and estimation and nearly uniformly outperform others. We apply MSP and other methods to financial data. MSP is successful in assets selection and produces more stable and lower rates of fitted/predicted errors.



page 1

page 2

page 3

page 4


Improving Lasso for model selection and prediction

It is known that the Thresholded Lasso (TL), SCAD or MCP correct intrins...

Rademacher upper bounds for cross-validation errors with an application to the lasso

We establish a general upper bound for K-fold cross-validation (K-CV) er...

Model selection consistency from the perspective of generalization ability and VC theory with an application to Lasso

Model selection is difficult to analyse yet theoretically and empiricall...

Ultra high dimensional generalized additive model: Unified Theory and Methods

Generalized additive model is a powerful statistical learning and predic...

Pathway Lasso: Estimate and Select Sparse Mediation Pathways with High Dimensional Mediators

In many scientific studies, it becomes increasingly important to delinea...

A generalized EMS algorithm for model selection with incomplete data

Recently, a so-called E-MS algorithm was developed for model selection i...

Estimating the Penalty Level of ℓ_1-minimization via Two Gaussian Approximation Methods

In this paper, we aim to give a theoretical approximation for the penalt...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Model selection is at the center of high-dimensional statistical learning. During the last few years, a great deal of attention has been focused on the penalized

-regularization methods. For example, Lasso (Tibshirani, 1996) is a successful idea that can simultaneously perform model selection and parameter estimation. It generates sparse models by regularizing with

penalty and has been shown to be effective in high-dimensional data. Several researchers work on the improvement of Lasso and its applications, offering some theoretical guarantees, and with widespread application; see, i.e.

Bickel et al. (2009); Efron et al. (2004, 2007); Meinshausen and Bühlmann (2006); Yuan and Lin (2007); Wu et al. (2014); Zhao and Yu (2006); Zou (2006); Zou and Hastie (2005).

Since convex regularization methods like Lasso gives biased estimation, the concave penalization such as the SCAD (Fan and Li, 2001) and the MCP (Zhang, 2010a) are proposed to correct the bias. Multi-step method has been widely studied in the literature too. For example, (Zhang, 2010b, 2013) proposed the Capped regularization, where the solution is given by a multi-step convex relaxation scheme, and it is shown to obtain the correct feature set after a certain number of iterations. (Zou and Li, 2008)

proposed a new unified algorithm based on the local linear approximation (LLA) for maximizing the penalized likelihood. But they only presented a one-step low-dimensional asymptotic analysis.

(Fan et al., 2014) provided a unified theory to show how to obtain the oracle solution via the LLA. The theoretical properties of LLA highly rely on the initial parameters. (Bühlmann and Meier, 2008) proposed a method called multi-step Adaptive lasso (MSA-lasso) on the simulated model and real data set. MSA-lasso re-estimates the coefficients each iteration and updates the adaptive weights until they convergence, which may leads to high execution time to get the solution. Also, there are so many two-step methods, like Adaptive lasso (Zou, 2006), OLS post lasso (Belloni and Chernozhukov, 2013), etc.

When dealing with the complex data, above methods still have drawbacks on model selection. For example, if an irrelevant variable is highly correlated with the variables in the true model, aforementioned methods may not be able to distinguish it from the true variables. And yet, non-convex functions introduce numerical challenges in fitting models and are less computationally efficient than convex optimization problems.

In this paper, we develop a new multi-step method, MSP, short for Multi Screen Penalty. We show the asymptotic theory of this method and full comparison with other multi(one)-step methods, making the following contributions:

  • We use a series of small and adaptive penalty to iteratively estimate the regression coefficients. The tuning parameter is small and remains unchanged during iterations, to minimize the bias. When dealing with the complex data, this structure does not choose the wrong model at first and iteratively distinguish the nonzeros from zeros by themselves.

  • We delete the “useless” variables and shrink the active set in every step, which significantly reduce the execution time. We will show in simulations that its computational complexity is roughly comparable to Lasso. Compare to others, multi-step methods always do the estimation with the whole data set in every step, leading unnecessary computational cost.

  • We show that the MSP can reach the significantly small /-norm error, and establish the theoretical properties of MSP to successfully recover the true underlying sparse model when the irrepresentable condition is relaxed. In the simulations, we establish the scenarios that irrepresentable condition fails. MSP can select the right model and estimate the coefficients precisely while aforementioned methods pick the wrong model.

  • We will show in simulation that the MSP produces much the same fitting model and estimation error during a range of small tuning parameters, hence does not need to choose the tuning parameter by cross-validation. The solution of this method is sparse enough and very stable.

Throughout the paper, we do not use the irrepresentable condition (Zhao and Yu, 2006)

to bound the covariance between irrelevant variables and relevant variables. Irrepresentable condition is an important condition which is always required in methods obtaining their model selection consistency, i.e. Lasso, Adaptive lasso, OLS post lasso, etc. Compare with that, we use the Restricted Eigenvalue (RE) condition

(Bickel et al., 2009; van de Geer and Bühlmann, 2009; Negahban et al., 2012) to replace the irrepresentable condition, and it is also a regular requirement for the non-convex methods to pick the right model, i.e. MCP, LLA, etc. Furthermore, we allow irrelevant variables highly correlated with relevant variables. For example, we build examples in simulations that the irrelevant variable is generated by noise term and other variables, while many of them are relevant. The covariance between the irrelevant variable and relevant variables are strong, hence the irrepresentable condition fails but the RE condition holds. We show that our method is effective on such complex data, which is rather important because even many non-convex methods claimed that they do not require the irrepresentable condition in theory, still may fail to select the right model in finite samples. Our method makes a great improvement to deal with such data.

This paper is organized as follows. Section 2 presents the method. Section 3 shows its theoretical properties. The simulations in Section 4 and application in Section 5 analyse the performance of the proposed method and compare with several existing methods. We conclude in Section 6. Technique details are provided in the Supplemental Material.

2 Method

In this section, we define Multi Screen Penalty (MSP) estimation and propose a simple multi-step algorithm for computing the MSP. We consider a linear regression problem:

where is an

response vector,

is the matrix and is the vector of regression coefficients. The number of parameters usually greatly exceeds the observations (). We consider the -sparse model, which means has at most nonzero elements. is an error vector whose components are independently distributed from , where is a positive constant. The data and coefficients are allowed to change as grows; meanwhile, and are allowed to grow with . For notational simplicity, we do not index them with .

Recall that the Lasso estimator is defined in Tibshirani (1996) as a squared error loss with the -penalty. Lasso shrinks a certain set of coefficients to zero and others towards zero compared to the least squares. These two effects, model selection and shrinkage estimation, are related by a tuning parameter only (Meinshausen, 2007), hence it leads a well-known estimation bias. On the other hand, Zhao and Yu (2006); Meinshausen and Bühlmann (2006); Tropp (2006) proved that the Lasso is model selection consistent, however, the irrepresentable condition is quite restrictive. Hence, two reasonable motivations are: 1) if we can have the model selection consistent even if the irrepresentable condition fails; and 2) due to the lack of the “bias term”, if we can use a small penalty to hold large coefficients steady.

The MSP estimator is hence proposed, which minimizes the tuning parameter’s influence and provides accurate model selection by coefficients themselves. We have the following Algorithm

Multi Screen Penalty (MSP)

  • Obtain a lasso solution :

    and let to be the nonzero index set of , i.e. .

  • Repeat the following step until convergence:


    where the active set is updated in every step, i.e. and

When the algorithm converge, converges to and we denote to be the solution. For showing the situation that many methods cannot deal the complex data, we put an example and show eight method’s (in)consistency in model selection: MSP, Lasso (Tibshirani, 1996), LLA (Zou and Li, 2008; Fan et al., 2014), MCP (Zhang, 2010a), SCAD Fan and Li (2001), Adaptive lasso (Zou, 2006), OLS post lasso (Belloni and Chernozhukov, 2013) and Capped (Zhang, 2010b, 2013). Set and has 4 nonzero entries. There exists a variable which is irrelevant but highly correlated with relevant variables, hence the irrepresentable condition fails in this example. As shown in Figure 1, every other method picks up this irrelevant variable first and never shrinks it back to zero. MSP has the similar performance when is large, but when decreases, MSP obtains the stable, accurate estimation and selects the right model during the second half of the solution path. More details of this data example and other comparisons can be found in simulations.

(a) MSP
(b) Capped
(c) LLA
(d) Adaptive lasso
(e) Lasso
(f) SCAD
(g) MCP
(h) OLS post lasso
Figure 1: An example to illustrate eight methods’ (in) consistency in model selection. The solid lines stand for the relevant variables; the dashed line stands for the variable which is irrelevant but highly correlated with relevant variables; the dotted lines stand for the irrelevant variables except the specific one.

3 Theorem

To study the properties, we first define some notation. Without loss of generality, we assume the columns of are standardized: , . Set , ; let , and . To state our theorems, we need the following assumptions.

  1. (C.1) For the nonzero coefficients, assume that

    where is a positive constant.

  2. (C.2) Assume satisfies the Restricted Eigenvalue (RE) condition: with a positive constant that

    for all where .

(C.1) requires a small gap between and . It allows when but at a rate that can be distinguished. (C.1) has appeared frequently in the literature for proving model selection consistency, i.e. Zhao and Yu (2006); Zhang (2010a). (C.2) is usually used to bound the -error between coefficients and estimates (Bickel et al., 2009; Meinshausen and Yu, 2009), and also the least restrictive condition of all same type conditions, i.e. restricted isometry property (Candes and Tao, 2007) and partial Riesz condition (Zhang and Huang, 2008)

. It is proved that (C.2) holds with high probability for quite general classes of Gaussian matrices for which the predictors may be highly dependent, and hence irrepresentable condition or restricted isometry condition can be violated with high probability

(Raskutti et al., 2010).

Considering following dimensions, set and where and . As a preparatory result, the following proposition shows that the first step of MSP selects the active set containing the true set with high probability.

Proposition 1.

Considering the first step of the MSP, and related set , suppose (C.1) and (C.2) hold. Set . We have


Since the first step estimator of MSP is the Lasso solution, Proposition 1 can be seen as proving the property for the Lasso estimator. We obtain this result by using the bound between and . Error bounds of the Lasso are known from past work (Bickel et al., 2009; Meinshausen and Yu, 2009; Negahban et al., 2012). Our proof uses this bound to illuminate the containment relationship between and . It supports the backward deletion structure of MSP that we can deletes the variables which do not belong to since they are irrelevant with high probability.

During Proposition 1, we assume , which is the same as that of Lasso satisfying the error bounds with high probability, while Lasso’s model selection consistency requires larger tuning parameter, i.e. where . Hence when is large, the error bound and selecting the true model of Lasso cannot hold at one time. We solve this problem by iterations.

According to the property of the Lasso’s solution, the first step of MSP shrinks a huge number of coefficients to zero, i.e. . After the first step, other steps of MSP estimate the nonzero set under low-dimensional settings, and the active set continue decreases until convergence. Now we show the convergence rate of error bound and the sign consistency of MSP in the following.

Theorem 1.

Suppose (C.1) and (C.2) hold. Set . For , with probability at least , the estimate satisfies the bounds,


where and , are defined in (C.1) - (C.2). And we have:

REMARK 1. Note that and have different orders under the assumption that . If we consider a general high dimensional setting, i.e. , by setting , we have the same result in Theorem 1. The proof is omitted since it is trivial. For this reason, we use the same and for the method in the simulations.

REMARK 2. The error bound of MSP in (3) is influenced by the adaptive penalty. We allow to converge to 0 in (C.1), i.e., the lower bound of is , where . It makes has to be considered towards infinity. As a consequence, dominates the denominator of the error bound of MSP instead of . When close to , this bound is close to the rate .

REMARK 3. For denoting the lasso solution, we have . Compared to lasso, the error bound in (3) is much smaller than of lasso when is large.

If we don’t allow to go to zero, instead, replace (C.1) by following condition:

  1. (C.1)* For the nonzero coefficients, let and assume .

(C.1)* gives a lower bound between and where is allowed to be any positive constant. This condition has appeared frequently in the literature, i.e. Huang et al. (2008). Replace (C.1) by (C.1)* and consider following dimensions: and where and . Corollary 1 shows that the -error and -error of MSP estimator reach the rate and , respectively.

Corollary 1.

Suppose (C.1)* and (C.2) hold. Set . With probability , satisfies the bounds:


And we have:

REMARK 4. Through theoretical results, the Gaussian assumption can be relaxed by a subgaussian assumption. That is, assume that there exists constants , so for ,

4 Simulation studies

In this section, we give simulation examples to illustrate the established results: 1) the first part illustrates the MSP’s consistency in model selection; 2) the second part shows the performances of the proposed estimator comparing with several existing methods, and analyses the stability of MSP; 3) the third part compares the execution time in seconds of each method.

Several methods are implemented. We choose three one-step methods, Lasso (Tibshirani, 1996), SCAD (Fan and Peng, 2004), MCP (Zhang, 2010a), and four multi(two)-step methods: Adaptive lasso (Zou, 2006), OLS post lasso (short for Plasso) (Belloni and Chernozhukov, 2013), Capped (Zhang, 2010b, 2013), LLA (Zou and Li, 2008; Fan et al., 2014). Among, MCP, SCAD and LLA results are obtained by ncvreg packages (Breheny and Huang, 2011). The others and MSP can be obtained by Glmnet packages (Friedman et al., 2010) and Lars packages (Efron et al., 2004). In order to simplify the various values of tuning parameters of each method, we have following setting of the parameters: (1) We use the default parameters for SCAD and MCP in ncvreg package (3.7 for MCP and 3 for SCAD); (2) We use the same tuning parameters for Adaptive lasso to solve the initial estimates and final estimation, and also use the same tuning parameters for Capped-; (3) For MSP, we use the same parameters during the whole iterations.

We simulate data from the linear regression model. Generate the response by

where and

are generated from multivariate normal distribution for sample size. We choose 4 regression coefficients nonzero,

, and shrink others to zero. The predictor vector is generated from . We consider the data sets that the irrelevant variable is highly correlated with relevant variables. Two scenarios are considered. The first predictor of two scenarios are both generated to be correlated with other seven predictors by


is i.i.d random variable with the same setting as

. Then the covariance matrices of the second to the end predictors, denoted by , are considered by two Toeplitz matrices: Scenario 1: , Scenario 2: , where and where .

In both scenarios, RE condition (C.2) holds and the irrepresentable condition fails. We show the irrepresentable condition as following: There exists a positive constant , with

We have and . It is easy to get

Mention that if we choose the first 4 coefficients nonzero, generate the last predictor to be correlated with first seven predictors. The irrepresentable condition holds in Scenario 2, hence we do not pick this setting. As an example, we show eight methods’ solution paths with under Scenario 1 in the Figure 1 of Section 2.

We compare both estimation performances and selection performances of eight methods in Section 4.2. The -norm errors () and -norm errors () are measured. We present the estimated number of nonzero components (NZ). Further, the estimated models are evaluated by the false positive rate (FPR) and true positive rate (TPR). Both measures are respectively defined as

4.1 Model selection

To illustrate that MSP can improve model selection. Three dimensions while observations are tested here. In this part, we first compare the performance of the MSP with that of Capped-, LLA and MCP under Scenario 1. The results are shown in Figure 2. When dealing with the high dimensional settings, throughout the plots on the middle and right of Figure 2, we can see that Capped-, LLA and MCP pick up the first variable first and never shrinks it back to zero. When dealing with the low dimensional setting, throughout the plots on the left of Figure 2, we can see that Capped-, LLA and MCP may shrink the first variable to zero when is small, but in the meantime, other irrelevant variables are picked. As a comparison, MSP chooses the right model during the second half of the solution path in low- and high- dimensional settings.

Figure 3 shows the results’, for MSP, Lasso, Plasso and SCAD, (in)consistency in model selection under Scenario 2. Lasso, Plasso and SCAD failed to pick the right variables while MSP always selects the right model when is small.

Still, we can find that the MSP solutions are very stable during the second half of the solution paths in all the figures. It means we can obtain the right model by easily choose a small without cross-validation. It is a key feature when we built statistical modeling for empirical data. We will show more analysis in the Section 4.2

(a) MSP
(b) MSP
(c) MSP
(d) Capped
(e) Capped
(f) Capped
(g) LLA
(h) LLA
(i) LLA
(j) MCP
(k) MCP
(l) MCP
Figure 2: Results for three simulation models with Scenario 1. The dashed line stands for the first variable; the dotted lines stand for the irrelevant variables except the first variable; the solid lines stand for the relevant variables.
(a) MSP
(b) MSP
(c) MSP
(d) Lasso
(e) Lasso
(f) Lasso
(g) Plasso
(h) Plasso
(i) Plasso
(j) SCAD
(k) SCAD
(l) SCAD
Figure 3: Results for three simulation models with Scenario 2. The dashed line stands for the first variable; the dotted lines stand for the irrelevant variables except the first variable; the solid lines stand for the relevant variables.

4.2 Performance measures

We first show the several numerical performances of the proposed estimator comparing with others in Table 1 and Table 2. Two dimension settings are considered:

. The four nonzero coefficients are valued from uniform distribution on

. We choose the penalty levels by 10-fold cross-validation for methods. Table 1 and Table 2 illustrate two scenarios respectively. The average of each measure is presented base on simulations. It is seen that MSP uniformly outperform others in estimation accuracy and model selection.

Method l2 l1 NZ FPR TPR
(n,p) = (100,50)
MSP 0.1970 (0.32) 0.4981 (0.38) 4.66 (4.44) 0.0165 (0.10) 0.9750 (0.08)
Lasso 0.4832 (0.27) 1.4025 (0.28) 40.48 (11.27) 0.7943 (0.24) 0.9850 (0.06)
SCAD 0.5370 (0.27) 1.4980 (0.29) 40.38 (12.36) 0.7928 (0.26) 0.9775 (0.07)
MCP 0.5311 (0.27) 1.4865 (0.31) 38.42 (13.88) 0.7502 (0.30) 0.9775 (0.07)
Alasso 0.3112 (0.33) 0.8282 (0.49) 16.47 (14.08) 0.2728 (0.30) 0.9800 (0.07)
Plasso 0.4614 (0.27) 1.2787 (0.33) 27.85 (15.89) 0.5204 (0.34) 0.9775 (0.07)
Capped 0.4371 (0.26) 1.3411 (0.26) 39.68 (10.10) 0.7770 (0.22) 0.9850 (0.06)
LLA 0.5610 (0.27) 1.5426 (0.22) 40.82 (11.65) 0.8028 (0.25) 0.9725 (0.08)
(n,p) = (100,200)
MSP 0.5206 (0.48) 0.8797 (0.53) 5.33 (3.67) 0.0246 (0.06) 0.8875 (0.13)
Lasso 0.8857 (0.24) 1.7885 (0.28) 59.26 (29.44) 0.2836 (0.15) 0.9200 (0.12)
SCAD 1.0892 (0.05) 1.5400 (0.14) 22.34 (8.87) 0.0986 (0.04) 0.7525 (0.03)
MCP 1.0651 (0.14) 1.5138 (0.14) 13.69 (9.07) 0.0542 (0.05) 0.7650 (0.06)
Alasso 0.7867 (0.37) 1.3868 (0.50) 26.15 (28.91) 0.1154 (0.15) 0.8850 (0.13)
Plasso 0.8272 (0.26) 1.6679 (0.40) 37.27 (33.59) 0.1716 (0.17) 0.9100 (0.12)
Capped 0.7870 (0.29) 1.8008 (0.28) 66.10 (31.45) 0.3183 (0.16) 0.9275 (0.11)
LLA 1.0895 (0.05) 1.5288 (0.08) 21.79 (7.16) 0.0959 (0.04) 0.7500 (0.00)
Table 1: Performances of eight methods with Scenario 1.
Method l2 l1 NZ FPR TPR
(n,p) = (100,50)
MSP 0.1044 (0.14) 0.3881 (0.14) 3.98 (0.14) 0 (0) 0.995 (0.04)
Lasso 0.5097 (0.24) 1.4752 (0.31) 39.73 (9.59) 0.7770 (0.21) 0.9975 (0.03)
SCAD 0.3892 (0.30) 1.1535 (0.63) 25.99 (18.35) 0.4783 (0.40) 0.9975 (0.03)
MCP 0.5418 (0.27) 1.5129 (0.46) 34.12 (15.22) 0.6554 (0.33) 0.9925 (0.04)
Alasso 0.2192 (0.25) 0.6498 (0.40) 10.42 (10.19) 0.1400 (0.22) 0.9950 (0.04)
Plasso 0.4513 (0.26) 1.2377 (0.43) 23.04 (14.03) 0.4141 (0.30) 0.9975 (0.03)
Capped 0.3114 (0.14) 1.0953 (0.22) 30.43 (7.51) 0.5746 (0.16) 1.0000 (0.00)
LLA 0.5322 (0.21) 1.4071 (0.47) 30.56 (17.97) 0.5776 (0.39) 0.9975 (0.03)
(n,p) = (100,200)
MSP 0.2014 (0.33) 0.4951 (0.33) 4.38 (3.37) 0.0024 (0.02) 0.975 (0.08)
Lasso 0.8995 (0.33) 1.9233 (0.34) 69.36 (25.31) 0.3338 (0.13) 0.9825 (0.06)
SCAD 1.0232 (0.62) 1.2761 (0.58) 5.94 (8.41) 0.0133 (0.04) 0.8325 (0.12)
MCP 1.2001 (0.43) 1.4721 (0.39) 7.03 (13.94) 0.0197 (0.07) 0.7925 (0.09)
Alasso 0.4314 (0.42) 0.9286 (0.49) 15.55 (16.86) 0.0595 (0.09) 0.9725 (0.08)
Plasso 0.7437 (0.37) 1.6730 (0.56) 41.56 (30.30) 0.1920 (0.15) 0.9825 (0.06)
Capped 0.4168 (0.21) 1.4682 (0.22) 63.56 (14.74) 0.3039 (0.08) 0.9975 (0.03)
LLA 1.1500 (0.47) 1.3734 (0.40) 4.14 (0.59) 0.0043 (0.00) 0.8225 (0.11)
Table 2: Performances of eight methods with Scenario 2.

We discuss the performances of methods when increases:
. Four methods, MSP, Lasso, OLS post Lasso and Adaptive lasso, are compared under two dimensions: with Scenario 2. As shown in Figure 4, MSP shows the lowest error.

(a) The L2 norm
(b) The L2 norm
Figure 4: Results of -norm errors for two models with Scenario 2. The horizontal axis is a range of .

We show in Figure 5 that MSP doesn’t need cross-validation segment for . MSP’s estimation performance during a range of are presented with the same models as Table 2. We set to be the first 9 elements of . The average value of when Lasso chooses the fourth variable is set to be the last element. Since MSP have the similar performances in both Scenarios, Scenario 1 is omit here. As shown in Figure 5, MSP produces much the same estimation during the range of the first 9 elements of .

(a) The L2 norm
(b) The L2 norm
Figure 5: Results of -norm errors for two models with Scenario 2. The horizontal axis is a range of .

4.3 Computational Cost

In this part, we compare the computational cost of different methods. Following settings are tested: and . Every single test including different that from 0 (not equal to 0) to the value shrinking every coefficients into zero. We using Glmnet package (Friedman et al., 2010) and ncvreg package(Breheny and Huang, 2011) to solve different estimators. See Table 3, MSP uniformly performs other multi(two)-step methods.

(n,p) Lasso MCP SCAD MSP Alasso LLA Capped MSA
(, ) 0.0 0.1 0.1 0.8 0.4 3.3 1.8 2.1
(, ) 0.0 0.1 0.1 0.8 6.5 12.0 32.8 20.1
(, ) 0.2 0.4 0.6 0.8 739.2 1206.2 2299.9 1239.0
(, ) 0.6 13.1 10.2 26.9 65.1 238.2 130.0 198.3
(, ) 1.9 5.2 10.6 45.0 6502.4 9936.5 27474.1 10697.0
(, ) 15.8 27.0 83.0 67.0 NA NA NA NA
Table 3: Average execution time in seconds for eight models.

5 Empirical analysis: index tracking

We now focus on the application of statistical methods in financial modeling. A brief introduction of index tracking is first provided.

Index tracking is one of the most popular topics in the financial field. It aims to replicate the movements of an index and is the core of the index fund. It attempts to match the performance of indices as closely as possible with the smallest subsets. Portfolio allocation involves hundreds of stocks but the sample sizes are often only less than one hundred (daily day over several months period). Thus, the statistical model built for index tracking is a typical high dimensional model. Due to the transaction cost, one suitable and successful approach that can provide sparse solutions is necessary.

We analyse this problem by the data set consisting of the prices of stocks in SP500. Data comes from Wind Information Co., Ltd (Wind Info) and covers the dates January 2016 to December 2017. It is divided by the time window: five months’ data is used for modeling (train set = ) and one month’s data is used for forecasting (test set = ). and represent the prices of the th constituent stock and the index, respectively. The relationship between and is a linear regression model:


where is the weight of the th chosen stock which is both sparse and unknown. is the error term. The estimation of must be produced by applying statistical techniques.

The measure for index tracking, called (annual) tracking error, is a measure of the deviation of the return of replication from the target index:


where is the mean of , and . is the daily return.

Four methods (including MSP, LLA, Capped and Adaptive lasso) are implemented and measured with tracking error since they have good performances in the previous simulations. The respective performances of the fitted and predicted results are tested based on their application in tracking indices. According to the notation, the tracking index topic can be seen as a high-dimensional problem in which or when . We do not use cross-validation since the practical demand of the number of selected stocks in the stock market always exists. In this paper, the number of selected stocks is fixed at , which is rather small. The tracking errors for the validation subsets during the two years are summarized. As clearly shown in Figure 6, MSP always produces lower and more stable rates of fitted/predicted errors than others. Both fitted and predicted tracking error of MSP are nearly between 0.01 and 0.02. Results show that MSP is successful in the selection of assets.

Figure 6: The predicted annual tracking errors and related fitted errors obtained by different methods.

6 Discussion

In this paper, we have proposed a multi-step method, MSP (Multi Screen Penalty), to estimate high-dimensional sparse linear models. MSP uses a series of small and adaptive penalty to iteratively estimate the regression coefficients, and re-estimates the model only on the reduced parameter space obtained from the last step. We have shown MSP enjoys desirable asymptotic properties. In particular, MSP estimator can achieve the rate for the upper bound of -norm error. We also have presented simulations and applications assessing the finite sample performance of MSP comparing with several existing methods. The results were encouraging. We observed that MSP always selected the right model, estimated coefficients precisely and nearly uniformly outperformed others when dealing with the complex data sets. Execution time was significantly lower than other multi(two)-step methods. It also saved the time for deleting the cross-validation segment for the tuning parameter.


  • Belloni and Chernozhukov [2013] A. Belloni and V. Chernozhukov. Least squares after model selection in high-dimensional sparse models. Bernoulli, 19:521–547, 2013.
  • Bickel et al. [2009] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
  • Breheny and Huang [2011] P. Breheny and J. Huang.

    Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection.

    The Annals of Applied Statistics, 5:232–253, 2011.
  • Bühlmann and Meier [2008] P. Bühlmann and L. Meier. Discussion of one-step sparse estimates in nonconcave penalized likelihood models? The Annals of Statistics, 36:1534–1541, 2008.
  • Candes and Tao [2007] E. Candes and T. Tao. The dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.
  • Efron et al. [2004] B. Efron, T. Hastie, L. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–451, 2004.
  • Efron et al. [2007] B. Efron, T. Hastie, and R. Tibshirani. Discussion: The dantzig selector: statistical estimation whenpis much larger than n. Journal of the Royal Statistical Society: Series B, 35:2358–2364, 2007.
  • Fan and Li [2001] J. Q. Fan and R. Z. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
  • Fan and Peng [2004] J. Q. Fan and H. Peng. Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32(3):928–961, 2004.
  • Fan et al. [2014] J. Q. Fan, L. Z. Xue, and H. Zou. Strong oracle optimality of folded concave penalized estimation. The Annals of Statistics, 42(3):819–849, 2014.
  • Friedman et al. [2010] J. Friedman, T. Hastie, and R. J. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
  • Huang et al. [2008] J. Huang, S. G. Ma, and C. H. Zhang. Adaptive lasso for sparse high-dimensional regression models. Statistica Sinica, 18(4):1603–1618, 2008.
  • Meinshausen [2007] N. Meinshausen. Relaxed lasso. Computational Statistics Data Analysis, 52(1):374–393, 2007.
  • Meinshausen and Bühlmann [2006] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
  • Meinshausen and Yu [2009] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1):246–270, 2009.
  • Negahban et al. [2012] S. Negahban, P. D. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of -estimators with decomposable regularizers. Statistical Science, 27(4):1348–1356, 2012.
  • Raskutti et al. [2010] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated gaussian designs.

    Journal of Machine Learning Research

    , 11:2241–2259, 2010.
  • Tibshirani [1996] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B., 58(1):267–288, 1996.
  • Tropp [2006] J. A. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051, 2006.
  • van de Geer and Bühlmann [2009] S. van de Geer and P. Bühlmann. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.
  • Wu et al. [2014] L. Wu, Y. H. Yang, and H. Z. Liu. Nonnegative-lasso and application in index tracking. Computational Statistics Data Analysis, 70:116–126, 2014.
  • Yuan and Lin [2007] M. Yuan and Y. Lin. Model selection and estimation in the gaussian graphical model. Biometrika, 94(1):19–35, 2007.
  • Zhang [2010a] C. H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010a.
  • Zhang and Huang [2008] C. H. Zhang and J. Huang. The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics, 36(4):1567–1594, 2008.
  • Zhang [2010b] T. Zhang. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, 11:1081–1107, 2010b.
  • Zhang [2013] T. Zhang. Multi-stage convex relaxation for feature selection. Bernoulli, 19(5):2277–2293, 2013.
  • Zhao and Yu [2006] P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
  • Zou [2006] H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
  • Zou and Hastie [2005] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2):301–320, 2005.
  • Zou and Li [2008] H. Zou and R. Z. Li. One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4):1509–1533, 2008.